# Introduction to Machine Learning

## Learning from data

In more established approaches to modelling (e.g. physically based models), our knowledge of the system being modelled is encoded as a set of rules. The rules that describe the system comprise a set of parameters that have certain values. Input data is mathematically combined with the parameters to generate predictions. These rules can be based on theory, empirically derived relationships, physical rules or expert judgement. The task of designing a model includes choosing how many parameters are needed to represent the system, what values they should have and how they should be combined with the input data. 

There are many advantages to this approach to modelling. These include having models that are based on a known understanding of how variables are related (e.g. the principles of physics) and being able to explain how a model has arrived at its prediction. 

However, there are also many challenges with this approach to developing models. We may not have a complete understanding of the system we are trying to model. The system might be very complex; for example, it is a non-trivial task requiring lots of expertise to write out the rules and equations that describe the global climate system. The resulting physically based global climate models are incredibly computationally intensive to run. Finally, some problems don't lend themselves to being expressed as a series of rules and equations; for example, an object detection model that draws boxes around features such as fishing vessels in aerial images or animals in photographs. 

In these cases, a machine learning approach to developing models might be a suitable alternative. In traditional modelling, input data and a model comprising rules that describe the system are known in advance. The predictions are unknown. In machine learning modelling, the input data and target values are known in advance. The model's rules are unknown. To develop a machine learning model, a computer is shown examples of input data and target values and it learns rules that describe the relationships between the input data and these targets. As the computer sees more examples of input data and targets, it updates the model's parameter values to better reflect the relationship between input data and outputs. Once the model is trained, input data can be fed into the model to generate output predictions. 
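
As a toy illustration of this contrast, the sketch below uses a temperature conversion, which is not part of the datasets used in this notebook. In the first half the rule is written down in advance. In the second half only inputs and targets are provided and the parameters are estimated from them with `np.polyfit`. The function name `celsius_to_fahrenheit` and the example values are assumptions made for illustration.

```python
import numpy as np

# Traditional modelling: the rule and its parameters are known in advance.
def celsius_to_fahrenheit(celsius):
    return 1.8 * celsius + 32.0

celsius = np.array([-10.0, 0.0, 15.0, 30.0, 40.0])
fahrenheit = celsius_to_fahrenheit(celsius)

# Machine learning: only inputs and targets are known.
# A degree-1 polynomial fit estimates the slope and intercept from the examples.
slope, intercept = np.polyfit(celsius, fahrenheit, deg=1)

print(f"learned slope: {slope:.2f}, learned intercept: {intercept:.2f}")
```

The learned parameters should recover the known rule closely because the example data contain no noise; real datasets rarely behave this neatly.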

To develop a machine learning model, the following are required:

* **Data**
* **Model**
* **Loss function**
* **Optimisation algorithm**
* **Evaluation strategy**
* **Prediction pipeline**

Let's step through each of these concepts, and demonstrate how to develop a machine learning model using Python's <a href="https://scikit-learn.org/stable/" target="_blank">`scikit-learn` package</a>. We will start with a detailed walk-through of building a machine learning model to predict plant species richness for different locations in South America. Then, you will adapt this workflow to develop a model that predicts the land cover at locations in Fiji using spectral reflectance data measured by a sensor on a satellite. 

### Datasets

**South American species richness:** This is the dataset presented in <a href="https://arxiv.org/html/2404.06978v1" target="_blank">Meyer et al. (2024)</a> that includes points across South America representing vegetation surveys where species richness counts were recorded from the <a href="https://onlinelibrary.wiley.com/doi/10.1111/geb.13346" target="_blank">sPlotOpen database</a>. Species richness counts are the target values. Predictor variables included with this dataset are <a href="https://developers.google.com/earth-engine/datasets/catalog/WORLDCLIM_V1_BIO#bands" target="_blank">WorldClim bioclimatic variables</a> and elevation. The bioclimatic variables have the following definitions:

* BIO1: Annual Mean Temperature
* BIO2: Mean Diurnal Range (Mean of monthly (max temp - min temp))
* BIO3: Isothermality (BIO2/BIO7) (×100)
* BIO4: Temperature Seasonality (standard deviation ×100)
* BIO5: Max Temperature of Warmest Month
* BIO6: Min Temperature of Coldest Month
* BIO7: Temperature Annual Range (BIO5-BIO6)
* BIO8: Mean Temperature of Wettest Quarter
* BIO9: Mean Temperature of Driest Quarter
* BIO10: Mean Temperature of Warmest Quarter
* BIO11: Mean Temperature of Coldest Quarter
* BIO12: Annual Precipitation
* BIO13: Precipitation of Wettest Month
* BIO14: Precipitation of Driest Month
* BIO15: Precipitation Seasonality (Coefficient of Variation)
* BIO16: Precipitation of Wettest Quarter
* BIO17: Precipitation of Driest Quarter
* BIO18: Precipitation of Warmest Quarter
* BIO19: Precipitation of Coldest Quarter

**Fiji Ba land use and land cover:** This is a dataset of points collected across the Ba region of Fiji's main island, Viti Levu. Each point has a land use and land cover class label (an integer value corresponding to a class), which is the target value, and a series of predictor variables representing annual median spectral reflectance values, monthly NDVI values, and topographic data. The spectral reflectance values were derived from cloud free <a href="https://developers.google.com/earth-engine/datasets/catalog/COPERNICUS_S2_SR_HARMONIZED" target="_blank">Sentinel-2 satellite images</a>. This dataset is a subset of data available on the <a href="https://pacificdata.org/data/dataset/fiji-land-use-land-cover-labels" target="_blank">Pacific Data Hub</a>. 

## Setup

### Load data

In [None]:
import os
import subprocess

if "data-geoml" not in os.listdir(os.getcwd()):
    subprocess.run('wget "https://github.com/envt-5566/geo-ml/raw/main/data/data-geoml.zip"', shell=True, capture_output=True, text=True)
    subprocess.run('unzip "data-geoml.zip"', shell=True, capture_output=True, text=True)
    if "data-geoml.zip" not in os.listdir(os.getcwd()):
        print("Has a directory called data-geoml been downloaded and placed in your working directory? If not, try re-executing this code chunk")
    else:
        print("Data download OK")

DATA_PATH = os.path.join(os.getcwd())

### Load packages

In [None]:
if 'google.colab' in str(get_ipython()):
    !pip install mapclassify
    !pip install contextily
    !pip install pysal

import os
import numpy as np
import geopandas as gpd
from math import log

# pre-processing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# models
from sklearn.neural_network import MLPRegressor
from sklearn.neural_network import MLPClassifier

# metrics
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import accuracy_score

## Data

Machine learning model development is all about learning from data. Training a machine learning model involves a computer learning relationships and patterns that map input data to target values. Therefore, to train a machine learning model we need input data and corresponding target values. The input data is often described as predictor variables or features. The target values are often termed labels.

In machine learning model development, datasets are often split into a training, validation and test set. 

### Data splits

* **Training data**: This dataset is used to train the model and learn rules that map inputs to targets. 
* **Validation data**: This is a dataset that we can use to evaluate the model during development, get feedback on the training process and experiment with tweaks to the model design. 
* **Test data**: This is a dataset that we hold out from the model development process. This dataset is used to generate a series of predictions which are compared to known target values. This comparison is an evaluation of how well the model would perform given data it has not seen in training. 
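
The three splits described above can be sketched by calling `train_test_split` twice. This is a minimal example on randomly generated data; the `_demo` variable names and the 60/20/20 proportions are assumptions chosen for illustration, not values used elsewhere in this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# synthetic data: 100 examples, 3 predictor variables
rng = np.random.default_rng(4)
X_demo = rng.normal(size=(100, 3))
y_demo = rng.normal(size=100)

# first, hold out 20% of all examples as the test split
X_rest, X_test_demo, y_rest, y_test_demo = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=4
)

# then split the remaining 80% into training (60% of total) and validation (20% of total)
X_train_demo, X_val_demo, y_train_demo, y_val_demo = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=4
)

print(len(X_train_demo), len(X_val_demo), len(X_test_demo))
```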

### Overfitting and generalisation

Machine learning models can be incredibly flexible. This means that during training they might learn to fit the training data exactly rather than learning patterns and rules that let the model generalise across a wider variety of unseen data. The case of a model fitting the training data but not generating good predictions on unseen data is called overfitting. Comparing the training error (i.e. the error of the model's predictions using the training split) to the validation error (i.e. the error of the model's predictions using the validation split) can indicate if the model is overfitting. The common signal of overfitting is the training error rate reducing while the validation error rate remains constant (or increases). 
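
To see this signal in practice, here is a small sketch that swaps the neural network for polynomial models of increasing flexibility, because they are quick to fit with NumPy alone. The synthetic sine-wave data, the noise level and the polynomial degrees are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)

def make_data(n):
    # noisy samples from a smooth underlying relationship
    x = rng.uniform(-1, 1, size=n)
    y = np.sin(3 * x) + rng.normal(scale=0.3, size=n)
    return x, y

x_tr, y_tr = make_data(15)     # small training split
x_val, y_val = make_data(200)  # larger validation split

def mse(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)

train_errors = {}
val_errors = {}
for degree in (1, 3, 10):
    # a higher degree gives a more flexible model
    coefs = np.polyfit(x_tr, y_tr, deg=degree)
    train_errors[degree] = mse(y_tr, np.polyval(coefs, x_tr))
    val_errors[degree] = mse(y_val, np.polyval(coefs, x_val))
    print(f"degree {degree}: train MSE = {train_errors[degree]:.3f}, "
          f"validation MSE = {val_errors[degree]:.3f}")
```

The training error can only go down as the degree increases. Whether the validation error follows it down or starts to rise is what indicates overfitting.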

### Data pre-processing

Often, before data can be used for model training it needs to be pre-processed. This task is sometimes called feature engineering. Pre-processing can include data cleaning to remove noisy samples, merging datasets with different structures or computing new variables. 

A pre-processing step that is necessary for many machine learning tasks is standardising or normalising the input data. For example, it is common practice to standardise data for training neural network models; this involves subtracting the mean of each variable from every data point and dividing by the variable's standard deviation.
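
The standardisation arithmetic can be sketched with NumPy and checked against `StandardScaler`. The small array `X_demo` is made up for illustration. Note that `StandardScaler` divides by the population standard deviation, which matches NumPy's default `ddof=0`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# two made-up predictor variables on very different scales
X_demo = np.array([
    [10.0, 200.0],
    [12.0, 180.0],
    [14.0, 260.0],
    [16.0, 240.0],
])

# subtract each column's mean and divide by each column's standard deviation
means = X_demo.mean(axis=0)
stds = X_demo.std(axis=0)  # population standard deviation (ddof=0)
X_manual = (X_demo - means) / stds

# the same operation using scikit-learn
X_sklearn = StandardScaler().fit_transform(X_demo)

print(np.allclose(X_manual, X_sklearn))
```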

#### Data import

Let's read in a spatial dataset of points across South America where there is a plant species richness count (target values) and predictor variables (bioclimatic variables and elevation). This data is read into a <a href="https://geopandas.org/en/stable/getting_started/introduction.html#Concepts" target="_blank">`GeoDataFrame` object</a>, a container for storing spatial vector datasets in Python programs. `GeoDataFrame`s have a tabular structure: each example is represented by a row and each variable is represented by a column, with a special column storing a geometry for locational information. 

In [None]:
gdf = gpd.read_file(os.path.join(DATA_PATH, "plant_species_south_america.gpkg"))

Call the `head()` method on the `GeoDataFrame` object, `gdf`, to print out the first few rows. This gives us a feel for the data. 

In [None]:
gdf.head()

We can also call the `explore()` method on `gdf` to render the spatial data on a web map. We're setting the `column` argument to `Species_richness`; this will colour each point on the map according to its species richness value. 

In [None]:
gdf.explore(column="Species_richness")

#### Data pre-processing

Now that we have read in our data, we need to create training and test splits. The `scikit-learn` package comes with a <a href="https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html" target="_blank">`train_test_split` function</a>. This function requires us to pass in an array of input data (`X`), a vector of output values of the same length as the number of rows in the input data (`y`), the proportion of the data we want to hold out for testing and a random state. Setting the random state is useful for reproducibility; each time we execute this code we will get the same random training and test splits. 

In [None]:
# drop columns not needed for model development
gdf_tmp = gdf.drop(columns=["PlotObservationID", "GIVD_ID", "Country", "Biome", "geometry"])
X = gdf_tmp.drop(columns=["Species_richness"])
y = gdf_tmp.loc[:, "Species_richness"]

# set aside 30% of the data as a test split
# set the random state for reproducibility
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=4) 

Finally, we need to standardise input data before model training. This involves computing the mean and standard deviation for each predictor variable in the training dataset. Then, we subtract the mean from each example and divide by the standard deviation. Crucially, it is important to use the mean and standard deviation computed on the training dataset when standardising the test dataset or new unseen data. Standardising the data helps the model learn and update its parameter values and adjusts for different predictor variables having different scales and ranges of values.

`scikit-learn` has a <a href="https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html" target="_blank">`StandardScaler()` class</a> with a `fit()` method to compute the means and standard deviation of variables in the training set. It also has a `transform()` method to standardise data. 

In [None]:
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

## Model

In machine learning, a model is software that can read in data, generate predictions, and store parameters and functions for combining input data and parameters. The choice of what input data is passed into the model, the number of model parameters and the operations of the functions combining parameters and data reflect how the model represents the system in question. 

A simple model is a linear regression model:

$\hat{y} = b + w_{1}x_{1} + w_{2}x_{2}$

This model takes in two input variables, $x_{1}$ and $x_{2}$, which are multiplied by two parameters called weights, $w_{1}$ and $w_{2}$, followed by an addition with a third parameter called the bias, $b$, to predict an output $\hat{y}$. The task of training a linear regression model involves estimating the values that $b$, $w_{1}$ and $w_{2}$ should take; this is a simple machine learning task. This model assumes there is a direct linear transformation of input data to predictions. 
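
The prediction step of this linear model can be written in a couple of lines of NumPy. The bias, weights and input values below are made-up numbers for illustration; in practice they would be estimated during training.

```python
import numpy as np

# made-up parameter values
b = 0.5                     # bias
w = np.array([2.0, -1.0])   # weights w1 and w2

# three examples, each with two input variables x1 and x2
X_demo = np.array([
    [1.0, 3.0],
    [0.0, 2.0],
    [4.0, 1.0],
])

# y_hat = b + w1*x1 + w2*x2, computed for every row at once
y_hat = b + X_demo @ w
print(y_hat)
```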

A benefit of many machine learning models is that they are more flexible and can learn a range of non-linear relationships between input predictors and target values. Here, we will work with a machine learning model called a <a href="https://d2l.ai/chapter_multilayer-perceptrons/mlp.html" target="_blank">multi-layer perceptron</a>, a neural network model.

You can represent a linear regression model as a simple neural network of one layer.  

<img src="https://github.com/envt-5566/geo-ml/blob/main/images/linear-model.png?raw=true"></img>

The input variables are multiplied by a weight (equivalent to $w_{n}$) and added to a bias (equivalent to $b$) to create an output.

<img src="https://github.com/envt-5566/geo-ml/blob/main/images/linear-model-detailed.png?raw=true"></img>

A multi-layer perceptron extends this in two ways. First, hidden layers of neurons are added. These hidden layers allow you to transform the input data into new representations before computing the output. These representations of your input data might be more representative of your target values. Second, the output from a neuron is passed through a nonlinear activation function. This lets you learn nonlinear mappings between input data and target values. 

<img src="https://github.com/envt-5566/geo-ml/blob/main/images/mutli-layer-model.png?raw=true"></img>

A more detailed view of how data would flow through and be transformed by a unit in the hidden layer:

<img src="https://github.com/envt-5566/geo-ml/blob/main/images/neuron-nonlinear-activation.png?raw=true"></img>
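
The flow of data through a hidden layer can also be sketched in NumPy. This is a minimal forward pass for a single example through one hidden layer of three units, using a ReLU activation (the default activation for `scikit-learn`'s multi-layer perceptron models). All weights and biases are made-up values for illustration, not values learned from data.

```python
import numpy as np

def relu(z):
    # nonlinear activation: negative values are set to zero
    return np.maximum(0.0, z)

x = np.array([1.0, 2.0])  # one example with two input variables

# hidden layer: weights have shape (2 inputs, 3 hidden units)
W_hidden = np.array([
    [0.5, -1.0, 0.0],
    [0.25, 0.5, -1.0],
])
b_hidden = np.array([0.0, 0.5, 1.0])

# output layer: one weight per hidden unit plus a bias
w_out = np.array([1.0, -2.0, 0.5])
b_out = 0.1

# weighted sum followed by the activation gives the hidden representation
hidden = relu(x @ W_hidden + b_hidden)

# the output is a linear combination of the hidden representation
y_hat = hidden @ w_out + b_out

print(hidden, y_hat)
```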

To create a multi-layer perceptron model we use the <a href="https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPRegressor.html#sklearn.neural_network.MLPRegressor" target="_blank">`MLPRegressor` class</a> from `scikit-learn`. We can set the number and size of the hidden layers in the model by passing a tuple into the `hidden_layer_sizes` argument. For example, passing in `(20, 20,)` would create two hidden layers each with 20 units. We can also set the `random_state` argument to make our model development reproducible (the weights in the model are randomly initialised, so if we don't set a random state we're not guaranteed to train the same model repeatedly even if using the same training data). 

```
regr = MLPRegressor(hidden_layer_sizes=(50,), random_state=4)
```

When the outcome is numeric (e.g. as in the case of a species richness count here) it is termed a regression machine learning task. This is why we are using `MLPRegressor`. If the outcome is categorical (e.g. a land cover class or a true / false value) it would be a classification task (and we would use <a href="https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html" target="_blank">`MLPClassifier`</a> to build our model). 

## Loss functions

Loss functions measure the difference between a predicted output and a ground truth output. Model training involves finding parameter values that minimise the loss function.

### Regression

As the `MLPRegressor` model class is used for regression tasks, it's trained to minimise the mean squared error loss. 

$$
\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2
$$

Where $n$ is the number of examples in the training dataset, $y_{i}$ is the known target value and $\hat{y}_{i}$ is the value predicted by the model. 
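
As a quick check of the formula, the sketch below computes the MSE by hand for four made-up target and prediction values and compares it with `scikit-learn`'s `mean_squared_error()` function. The `_demo` arrays are illustrative assumptions.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

y_true_demo = np.array([3.0, 5.0, 2.0, 8.0])  # known target values
y_pred_demo = np.array([2.0, 5.0, 4.0, 7.0])  # made-up model predictions

# average of the squared differences between targets and predictions
mse_manual = np.mean((y_true_demo - y_pred_demo) ** 2)
mse_sklearn = mean_squared_error(y_true_demo, y_pred_demo)

print(mse_manual, mse_sklearn)
```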

### Classification

For classification tasks (i.e. the model is predicting a categorical outcome and not a numeric value), `scikit-learn`'s `MLPClassifier` class uses the cross-entropy loss function (also called the log-loss function). Outcomes in binary classification tasks can have one of two values and outcomes in multi-class classification tasks can have one of many values. 

Cross-entropy is a measure of the difference between two probability distributions. For classification tasks, we can represent the known target values as a vector where each element corresponds to a class and the actual class has a probability of one. For example, `targets = [0, 0, 1, 0]` for a four-class classification task, where each element in this vector corresponds to a category such as cat, dog, mouse, and bird. The model output will be a vector of probability scores representing predictions of the class given the input data (e.g. `predictions = [0.2, 0.1, 0.6, 0.1]`). Both of these vectors are probability distributions. Cross-entropy can be used to measure the difference between these two probability distributions, with a smaller cross-entropy score indicating that the predictions align more closely with the ground truth target class. During model training, we seek to adjust the model's weights to minimise the cross-entropy between the target and predicted distributions.

As cross-entropy may be less familiar than the mean squared error loss for regression tasks, let's step through computing the cross-entropy using some fake data:

In [None]:
targets = np.array([0, 0, 1, 0])
predictions = np.array([0.2, 0.1, 0.6, 0.1])

The formula for cross-entropy is:

$$
H(p, q) = - \sum_{c} p_{c} \log(q_{c})
$$

Where $p$ is the actual class probability, $q$ is the predicted class probability, and $c$ indexes the possible classes.

The probability values in $q$ have to take on a value between 0 and 1. The logarithm of a low probability is a more negative number than the logarithm of a higher probability. For example, $\log(0.1) \approx -2.3$ and $\log(0.9) \approx -0.1$. When the model returns a low probability prediction for the target class (e.g. 0.1), multiplying its logarithm by 1 (from $p_{c}$, the actual target probability for that class) gives a more negative number than when the model returns a high probability for the target class. This negative number gets multiplied by negative one to make it positive; thus, we get a larger positive value, and a higher loss, when the predicted probability for the target class is lower.

In [None]:
def cross_entropy(targets, preds):
    H = 0
    for i in range(0, len(targets)):
        H = H + targets[i]*log(preds[i])
    H = -1 * H

    return H

In [None]:
cross_entropy(targets, predictions)

#### Activity!

**Change the values in the `predictions` array and compute the cross-entropy. This should give you an intuition as to how the cross-entropy changes as your predictions get better or worse.**

When training a machine learning model, we have more than one example. Therefore, we average the cross-entropy loss across all examples in the training dataset. You can read more about the cross-entropy loss <a href="https://machinelearningmastery.com/cross-entropy-for-machine-learning/" target="_blank">here</a>.
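
A sketch of this averaging step for three made-up examples is shown below. The integer labels are converted to one-hot target vectors, the cross-entropy is computed per example, and the mean is compared with `scikit-learn`'s `log_loss()` function. The `_demo` arrays are assumptions for illustration.

```python
import numpy as np
from sklearn.metrics import log_loss

# true class index for each of three examples (four possible classes: 0 to 3)
labels_demo = np.array([2, 0, 3])

# predicted class probabilities, one row per example
probs_demo = np.array([
    [0.2, 0.1, 0.6, 0.1],
    [0.7, 0.1, 0.1, 0.1],
    [0.1, 0.2, 0.3, 0.4],
])

# one-hot encode the labels: a 1 in the column of the true class, 0 elsewhere
one_hot_demo = np.eye(4)[labels_demo]

# cross-entropy per example, then the average across examples
per_example_ce = -np.sum(one_hot_demo * np.log(probs_demo), axis=1)
mean_ce = per_example_ce.mean()

# scikit-learn computes the same average directly from integer labels
sklearn_ce = log_loss(labels_demo, probs_demo, labels=[0, 1, 2, 3])

print(per_example_ce, mean_ce, sklearn_ce)
```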

## Optimisation algorithms

Optimisation algorithms adjust the model's parameter values to minimise the loss function. That is, these algorithms seek to find a set of parameter values that minimise the difference between the model's predicted outputs and true outputs for a given set of input training data. 

Initially our model starts with random weights. Then, we feed our input data through the model to generate a prediction. This prediction is compared to a known output value and the loss, the difference between the prediction and the true value, is computed. The model's weights are then adjusted slightly to reduce the loss. The optimisation algorithm that adjusts the model's weights is called stochastic gradient descent. This process is repeated many times until the model converges or a stopping criterion is reached. Below, we set up our model to use stochastic gradient descent as the optimisation algorithm by passing `"sgd"` to the `solver` argument and set the maximum number of iterations for updating model weights to 500. 

```
regr = MLPRegressor(hidden_layer_sizes=(50,), random_state=4, solver="sgd", max_iter=500)
```
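
To make the update loop described above more concrete, here is a minimal sketch of mini-batch stochastic gradient descent for a one-variable linear model with a mean squared error loss. It is not how `scikit-learn` implements its solver. The synthetic data, learning rate, batch size and number of epochs are assumptions chosen for illustration, and the parameters start at zero rather than random values to keep the example short.

```python
import numpy as np

rng = np.random.default_rng(4)

# synthetic, noise-free data from a known line: y = 2x + 1
x = rng.uniform(-1, 1, size=100)
y = 2.0 * x + 1.0

w, b = 0.0, 0.0        # initial parameter values
learning_rate = 0.1    # size of each adjustment
batch_size = 10

for epoch in range(200):
    # visit the examples in a new random order each epoch
    order = rng.permutation(len(x))
    for start in range(0, len(x), batch_size):
        idx = order[start:start + batch_size]
        # difference between predictions and targets for this batch
        error = (w * x[idx] + b) - y[idx]
        # gradients of the mean squared error with respect to w and b
        grad_w = 2 * np.mean(error * x[idx])
        grad_b = 2 * np.mean(error)
        # move each parameter a small step against its gradient
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b

print(f"w = {w:.3f}, b = {b:.3f}")
```

Each pass through the inner loop uses only a small random batch of examples to estimate the gradient, which is what makes the procedure stochastic.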

## Training

At this point, we are ready to train the model. The `MLPRegressor` class has a `fit()` method. We pass in our scaled predictor variables and target values to the `fit()` method and the model will learn weight values that minimise the loss.

Now, let's execute the model creation and training code we've slowly been building up:

In [None]:
regr = MLPRegressor(hidden_layer_sizes=(50,), random_state=4, solver="sgd", max_iter=500).fit(X_train_scaled, y_train)

## Evaluation strategy

The final step of the machine learning model development process is to evaluate the model. This evaluation should give us an indication of how well the model has learnt to relate input data to target values and the performance of the model with new unseen examples. This information helps us judge how suitable the model is for a particular task or application. 

The *training error* is the difference between the model's prediction and known values for the training examples. This gives us a biased estimate of the model's performance as during training the model has been optimised to map input data to outcome values in the training dataset. 

What is more relevant to us is the model's *generalisation error*, which is an indication of how well the model would perform given an infinite amount of data that is independent of the training dataset but drawn from the same underlying population. We estimate the *generalisation error* using the test split, a sample of data withheld from the model during training. But note that this is an **estimate** of the generalisation error: it is the test set error, and you should be aware of the limits of your test set. 

To estimate the model's error we compute a range of performance metrics. For classification tasks, accuracy is a common metric. Accuracy is the percentage of examples in the test set the model correctly labelled. 
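
Accuracy is simple to compute by hand, as this short sketch with made-up integer class labels shows. Note that `scikit-learn`'s `accuracy_score()` function returns a fraction between 0 and 1 rather than a percentage. The `_demo` arrays are assumptions for illustration.

```python
import numpy as np
from sklearn.metrics import accuracy_score

y_true_demo = np.array([1, 2, 2, 5, 8, 8])  # known class labels
y_pred_demo = np.array([1, 2, 3, 5, 8, 6])  # made-up predicted labels

# fraction of examples where the predicted label matches the true label
acc_manual = np.mean(y_true_demo == y_pred_demo)
acc_sklearn = accuracy_score(y_true_demo, y_pred_demo)

print(acc_manual, acc_sklearn)
```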

For regression tasks, mean squared error (MSE), mean absolute error (MAE), the root mean squared error (RMSE), and $R^2$ are common metrics. 

The MSE measures the average of squared distances between the model predicted and true outcome values in the test set. As it measures the squared distance it is more sensitive to cases when the model error is large. 

$$
\text{MSE} = \frac{1}{N} \sum_{i=1}^N (y_i - \hat{y}_i)^2 
$$

The RMSE is the square root of the MSE; it transforms the error metric into the units of the target variable. 

The MAE is similar but it measures the average of absolute distances between the model predicted and true outcome values in the test set; it is less sensitive to cases when the model error is large.

$$
\text{MAE} = \frac{1}{N} \sum_{i=1}^N |y_i - \hat{y}_i|
$$

The $R^2$ (also known as the coefficient of determination) measures the amount of variability in the true target values that is explained by the model predictions. The closer the $R^2$ value is to one, the more variability in the test set's true values is captured by the model's predictions. The $R^2$ is a measure of how well the model's predictions fit the true outcomes. 

$$
R^2 = 1 - \frac{\sum_{i=1}^N (y_i - \hat{y}_i)^2}{\sum_{i=1}^N (y_i - \bar{y})^2} 
$$

$\bar{y}$ is the mean of the true outcomes in the test set and $N$ is the number of examples in the test set. 
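
The RMSE, MAE and $R^2$ can each be computed by hand in a line of NumPy. The sketch below does this for four made-up true and predicted values and checks the MAE and $R^2$ against the corresponding `scikit-learn` functions. The `_demo` arrays are assumptions for illustration.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, r2_score

y_true_demo = np.array([3.0, 5.0, 2.0, 8.0])  # true outcome values
y_pred_demo = np.array([2.0, 5.0, 4.0, 7.0])  # made-up model predictions

residuals = y_true_demo - y_pred_demo

# RMSE: square root of the mean squared residual
rmse_manual = np.sqrt(np.mean(residuals ** 2))

# MAE: mean of the absolute residuals
mae_manual = np.mean(np.abs(residuals))

# R2: one minus the ratio of the residual sum of squares to the total sum of squares
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true_demo - y_true_demo.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(rmse_manual, mae_manual, r2_manual)
print(mean_absolute_error(y_true_demo, y_pred_demo), r2_score(y_true_demo, y_pred_demo))
```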

We can estimate the model's MSE using the <a href="https://scikit-learn.org/1.5/modules/generated/sklearn.metrics.mean_squared_error.html" target="_blank">`mean_squared_error()` function</a> from `scikit-learn`. First, we need to generate some predictions for the test set. 

In [None]:
y_test_preds = regr.predict(X_test_scaled)

Next, we can compute the MSE. The `mean_squared_error()` function takes in an array of true values as its first argument and an array of predictions as its second argument.

In [None]:
test_mse = mean_squared_error(y_test, y_test_preds)
print(f"The MSE on the test split is: {test_mse}")

It is a similar process to compute the $R^2$:

In [None]:
test_r2 = r2_score(y_test, y_test_preds)
print(f"The R2 on the test split is: {test_r2}")

#### Activity!

<details>
    <summary><strong>Can you use the <code>mean_absolute_error()</code> function to compute the MAE? The docs for the MAE function in <code>scikit-learn</code> are <a href="https://scikit-learn.org/1.5/modules/generated/sklearn.metrics.mean_absolute_error.html" target="_blank">here</a>.</strong></summary>

```
test_mae = mean_absolute_error(y_test, y_test_preds)
print(f"The MAE on the test split is: {test_mae}")
```
</details>

In [None]:
## ADD CODE HERE

#### Activity!

<details>
    <summary><strong>Can you use the <code>mean_squared_error()</code> function to compute the training error? Do you expect the training error to be higher or lower than the test error?</strong></summary>

```
y_train_preds = regr.predict(X_train_scaled)
train_mse = mean_squared_error(y_train, y_train_preds)
print(f"The MSE on the test split is: {test_mse}")
print(f"The MSE on the train split is: {train_mse}")
```

As the model has seen the training data during training, we would expect the MSE to be lower on the training set. 
</details>

In [None]:
## ADD CODE HERE

## Prediction pipeline

Now that we have trained and tested our model, we can use it to generate predictions. Let's predict the species richness for all of our data points and compare them to observed values. 

Let's create an array of predictor variables for all our data points:

In [None]:
# drop columns not needed for prediction
gdf_preds = gdf.drop(columns=["PlotObservationID", "GIVD_ID", "Country", "Biome", "geometry"])
X_preds = gdf_preds.drop(columns=["Species_richness"])

Then, we need to scale the input data:

In [None]:
X_preds_scaled = scaler.transform(X_preds)

Finally, we can generate predictions of species richness for `X_preds_scaled` and append the predictions as a column in our `GeoDataFrame` `gdf`:

In [None]:
# generate predictions
y_preds = regr.predict(X_preds_scaled)

# append as column back to GeoDataFrame
gdf.loc[:, "Species_richness_preds"] = y_preds

# compute difference between predicted and observed species richness
gdf.loc[:, "Preds_diffs"] = gdf.loc[:, "Species_richness_preds"] - gdf.loc[:, "Species_richness"]

We can plot our predictions on a web map:

In [None]:
gdf.explore(column="Species_richness_preds")

We can also plot the difference between predicted and observed species richness:

In [None]:
gdf.explore(column="Preds_diffs", cmap="RdYlBu")

## Self guided activity! Land cover classification

We have worked through an example of how to train and test a machine learning model for a regression task, that is, building a model to predict a continuous numeric outcome. However, often we want to build a model to predict a categorical outcome. Land cover classification is an example of a classification task where we want to assign a land cover label (e.g. grass, forest, urban, agriculture) to a location on the Earth's surface. 

Your task here is to adapt the workflow above to build a model that will classify a location's land cover in the Ba region of Fiji based on input predictors derived from satellite images. The input predictors are generated from <a href="https://developers.google.com/earth-engine/datasets/catalog/COPERNICUS_S2_SR_HARMONIZED" target="_blank">Sentinel-2 satellite images</a> which record the spectral reflectance from the Earth's surface with a 10 m spatial resolution and for wavebands spanning the visible, near infrared and shortwave infrared wavelengths. The input predictors also include topographic variables and spectral indices that highlight vegetative, water-related and built-up features. The task is to train a model that relates these input predictors to a land cover label. 

Let's start by reading in the data and exploring it:

In [None]:
gdf_lc = gpd.read_file(os.path.join(DATA_PATH, "fiji_ba_lulc_s2.geojson"))

The land cover class label for a point is stored in the `class` column. This column stores integer type values. Each integer corresponds to a land cover class:

1. water
2. mangrove
3. bare earth
4. urban
5. agriculture
6. grassland
7. shrubs
8. trees

Let's reclassify the integer values to text labels and visualise this on a map:

In [None]:
label_map = {
    1: "water",
    2: "mangrove",
    3: "bare earth",
    4: "urban",
    5: "agriculture",
    6: "grassland",
    7: "shrubs",
    8: "trees",
}

gdf_lc.loc[:, "class_label"] = gdf_lc["class"].replace(label_map)

# alphabetical order
cmap = [
    "darkorange",
    "brown",
    "darkseagreen",
    "aquamarine",
    "olivedrab",
    "darkgreen",
    "grey",
    "blue",
]

gdf_lc.explore(column="class_label", categorical=True, cmap=cmap)

Let's quickly inspect the first few rows of the `GeoDataFrame`. The columns with headings starting with `B*` are annual average spectral reflectance values for all cloud free observations over a year; the `B*` labelling corresponds to different wavebands (e.g. `B8` is the near infrared band). The columns `gcvi`, `ndbi`, `ndwi`, and `ndvi` are spectral indices that are sensitive to chlorophyll content, built-up surfaces, water and moisture, and vegetation condition, respectively. 

Integer type labels are used to represent the land cover classes as the model still needs to represent categorical type data in a numeric form. 

In [None]:
gdf_lc.head()

Let's prepare the data for model training by selecting the predictor variables and the outcome labels.

In [None]:
# drop columns not needed for model development
gdf_tmp = gdf_lc.drop(columns=["year", "geometry", "class_label"])
X = gdf_tmp.drop(columns=["class"])
y = gdf_tmp.loc[:, "class"]

#### Activity!

<details>
    <summary><strong>Can you create train and test splits of the predictors and outcome labels to prepare the data for land cover classification model development? Can you use the <code>StandardScaler</code> object to standardise the predictor variables?</strong></summary>

```
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=4)
lc_scaler = StandardScaler().fit(X_train)
X_train_scaled = lc_scaler.transform(X_train)
X_test_scaled = lc_scaler.transform(X_test)
```
</details>

In [None]:
## ADD CODE HERE

#### Activity!

<details>
    <summary><strong>Can you use the dataset that you have just prepared to train a classification model? Here, you will need to use the <code>MLPClassifier</code> from <code>scikit-learn</code>. Use the <code>scikit-learn</code> documentation to see how to adapt the regression workflow implemented above to work with the <code>MLPClassifier</code> model.</strong></summary>

```
cls = MLPClassifier(hidden_layer_sizes=(10,), random_state=4, solver="sgd", max_iter=500).fit(X_train_scaled, y_train)
```
</details>

In [None]:
## ADD CODE HERE

#### Activity!

<details>
    <summary><strong>Can you estimate the accuracy of the trained classifier using the test split? Look at the <code>scikit-learn</code> documentation for computing the accuracy metric. Why is the accuracy a suitable metric for classification tasks?</strong></summary>

```
y_test_preds = cls.predict(X_test_scaled)
test_acc = accuracy_score(y_test, y_test_preds)
print(f"The accuracy on the test split is: {test_acc}")
```

The accuracy is more suitable for classification tasks as the outcome is categorical. The accuracy tells us the proportion of times our model assigned the correct class label to a data point in the test split. Categorical values do not have a scale that lets us measure how close or far they are from a true value (i.e. they are not aligned as numeric values on a ratio scale). Therefore, it does not make sense to use metrics like mean absolute error or mean squared error for categorical data. 
</details>

In [None]:
## ADD CODE HERE