# Machine learning 1

This lab will provide an introduction to key machine learning concepts and also demonstrate how you can use <a href="https://scikit-learn.org/stable/" target="_blank">Scikit-learn</a> to implement machine learning workflows in Python. 

The focus of this lab will be on supervised machine learning. This lab will develop a machine learning workflow that classifies the crop type of fields in India using spectral reflectance values recorded by the Sentinel-2 satellite. It is based on the AgriFieldNet Competition Dataset <a href="https://mlhub.earth/data/ref_agrifieldnet_competition_v1" target="_blank">(Radiant Earth Foundation and IDinsight, 2022)</a> which has been published to encourage people to develop machine learning models that classify a field's crop type from satellite images. 

The dataset includes spectral reflectance values, crop type labels, field id, and geometry for the field's location from cropping landscapes in the Indian States of Odisha, Uttar Pradesh, Bihar, and Rajasthan. The field boundaries and crop type labels were captured by data collectors from IDinsight's Data on Demand team and the satellite image preparation was undertaken by the Radiant Earth Foundation.

## Setup

### Run the labs

You can run the labs locally on your machine or you can use cloud environments provided by Google Colab. **If you're working with Google Colab be aware that your sessions are temporary and you'll need to take care to save, backup, and download your work.**

<a href="https://colab.research.google.com/github/geog3300-agri3003/coursebook/blob/main/docs/notebooks/week-5_1.ipynb" target="_blank">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

### Download data

If you need to download the data for this lab, run the following code snippet. 

In [None]:
import os
import subprocess

if "data_lab-5" not in os.listdir(os.getcwd()):
    subprocess.run('wget "https://github.com/geog3300-agri3003/lab-data/raw/main/data_lab-5.zip"', shell=True, capture_output=True, text=True)
    subprocess.run('unzip "data_lab-5.zip"', shell=True, capture_output=True, text=True)
    if "data_lab-5" not in os.listdir(os.getcwd()):
        print("Has a directory called data_lab-5 been downloaded and placed in your working directory? If not, try re-executing this code chunk")
    else:
        print("Data download OK")

### Working in Colab

If you're working in Google Colab, you'll need to install the required packages that don't come with the Colab environment.

In [None]:
if 'google.colab' in str(get_ipython()):
    !pip install rioxarray
    !pip install mapclassify
    !pip install rasterio

## Machine learning

Machine learning is the process of learning from data to make predictions. **Supervised** machine learning models are trained to predict an outcome based on input data (predictors or features). The model is trained to minimise the error in predictions using a training set where both the outcome labels and input data are known. If the outcome is categorical (e.g. land cover type, cloud / no-cloud) then it is a **classification** machine learning task and if the outcome is numeric (e.g. crop yield, temperature) then it is a **regression** machine learning task. 

There are also **unsupervised** machine learning tasks where there are no known outcomes prior to model training. Unsupervised machine learning models typically cluster datasets with similar data points assigned to the same cluster or group. 

Please watch this introduction to machine learning video from Climate Change AI:

In [None]:
%%HTML
<iframe width="560" height="315" src="https://www.youtube.com/embed/mc9QG2R-rf4" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" allowfullscreen></iframe>

### Scikit-learn

<a href="https://scikit-learn.org/stable/" target="_blank">Scikit-learn</a> is an open-source machine learning package for Python. It provides a range of tools for preprocessing datasets for machine learning, training machine learning models, evaluating model performance, and using a trained model to make predictions. A range of supervised and unsupervised machine learning algorithms can be used with Scikit-learn. 

### Import modules

In [None]:
import pandas as pd
import geopandas as gpd
import numpy as np
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.io as pio
import os

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
from sklearn.model_selection import GroupShuffleSplit
from sklearn.inspection import permutation_importance
from sklearn import tree

# setup renderer
if 'google.colab' in str(get_ipython()):
    pio.renderers.default = "colab"
else:
    pio.renderers.default = "jupyterlab"

rng = np.random.RandomState(0)

## Load data

In [None]:
data_path = os.path.join(os.getcwd(), 'data_lab-5', 'agrifieldnet_processed_adm4.geojson')
gdf = gpd.read_file(data_path)

## Data pre-processing

Often, after sourcing data, the first task in a machine learning workflow is data preprocessing - transforming the raw data into a format ready for model training or making predictions. These tasks are often referred to as feature engineering - the process of engineering or creating features or predictor variables. 

Let's inspect the data. We can see it is a `GeoDataFrame` with columns corresponding to the `field_id`, `labels` (crop type identifier), spectral reflectance in several wavebands (`B*`) and `ndvi`, the `village` where the field is located in India, and a `geometry` column containing a `POINT` object for the field centroid. 

In [None]:
print(f"The data is of type: {type(gdf)}")
gdf.head()

Based on the dataset's documentation the below is the mapping between numeric values and crop types in the labels dataset. 

* 1 - Wheat
* 2 - Mustard
* 3 - Lentil
* 4 - No crop/Fallow
* 5 - Green pea
* 6 - Sugarcane
* 8 - Garlic
* 9 - Maize
* 13 - Gram
* 14 - Coriander
* 15 - Potato
* 16 - Bersem
* 36 - Rice

Let's explore how many examples we have of different crop types. We can see that our dataset is dominated by wheat, mustard, and no crop / fallow labels.

In [None]:
# make labels categorical for bar plot
class_mappings = {
    "1": "Wheat",
    "2": "Mustard",
    "3": "Lentil",
    "4": "Fallow",
    "5": "Green pea",
    "6": "Sugarcane",
    "8": "Garlic",
    "9": "Maize",
    "13": "Gram",
    "14": "Coriander",
    "15": "Potato",
    "16": "Bersem",
    "36": "Rice"
}

gdf["labels_cat"] = gdf["labels"].astype("str")
gdf.replace({"labels_cat": class_mappings}, inplace=True)

gdf.groupby("labels_cat").count().loc[:, "field_id"]

We can also explore the spatial distribution of the data. Hover over the points on the map with your cursor. 

In [None]:
gdf.explore("labels_cat", tiles="CartoDB dark_matter", cmap="tab20", categorical=True)

There are some final preprocessing steps required before we are ready to train a model to classify a field's crop type. 

Scikit-learn models expect the input data and outcomes to be `array-like`. Generally, this is in the form of NumPy `ndarray` objects. 

We want the input data (features or predictors) to be in a separate object to the outcomes (labels). Therefore, we'll subset the `GeoDataFrame` object and store just the predictor variables in an `array-like` object `X` and the outcomes in an object `y`. 

Numeric Pandas `Series` or `DataFrame` objects are `array-like` and so we can directly subset columns from the `GeoDataFrame` to create input and output objects.

`X` generally has the shape `(n_samples, n_features)` where each sample is aligned along the rows dimension (or 0-axis in a rank 2 NumPy `ndarray`) and the features (or predictors) are aligned along the columns dimension (or 1-axis in a rank 2 NumPy `ndarray`).
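
To make the `(n_samples, n_features)` convention concrete, here is a minimal sketch using a small made-up array. The names `X_demo` and `y_demo` and their values are hypothetical and are not part of the lab dataset.

```python
import numpy as np

# hypothetical toy data: 4 samples (rows) and 3 features (columns)
X_demo = np.array([
    [0.12, 0.30, 0.45],
    [0.10, 0.28, 0.50],
    [0.20, 0.35, 0.40],
    [0.18, 0.33, 0.42],
])
# one outcome label per sample
y_demo = np.array([1, 1, 2, 4])

# axis 0 indexes samples, axis 1 indexes features
n_samples, n_features = X_demo.shape
print(f"n_samples: {n_samples}, n_features: {n_features}")

# the number of rows in X should match the number of labels in y
assert X_demo.shape[0] == y_demo.shape[0]
```

The same check (`X.shape[0] == y.shape[0]`) is a quick sanity test you can run on the lab's `X` and `y` objects once they have been created.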

In [None]:
X = gdf.drop(["field_id", "labels", "labels_cat", "index_right", "village", "geometry"], axis=1)
y = gdf.loc[:, "labels"]

In [None]:
X.head()

For classification tasks, the values in `y` should be class labels (here, integer codes) and for regression tasks the values in `y` should be floating point. As crop type is a categorical variable, the values in `y` should be of integer data type.

In [None]:
y.head()

### Train-test splits

For supervised machine learning tasks we need to create training and test datasets. 

The model is trained using the training set which consists of matched features and outcomes. 

The model is then evaluated using the test set. A prediction is made using features in the test set and the prediction is compared with known outcomes for those features. This provides an indication of the model's performance. **It is important that the test set is independent from the training set - an important part of machine learning model development is preventing information from the test set leaking into the training set**. 

Scikit-learn provides a useful `train_test_split()` function which expects `X` and `y` `array-like` objects as inputs and will return 4 `array-like` objects (`X_train`, `X_test`, `y_train`, `y_test`).

We can provide further arguments to `train_test_split()`:

* `test_size` determines the proportion of the input data that is allocated to the test set
* `random_state` is a seed that ensures the same random split of the data occurs each time the code is executed. This is important for reproducibility of results. 
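
The effect of these two arguments can be sketched on a small made-up dataset. This is an illustration under assumptions: `X_demo` and `y_demo` are hypothetical, and an integer seed is used here. As far as we understand Scikit-learn's `random_state` handling, a shared `RandomState` object (such as `rng` in this lab) advances its state on each call, so it repeats the same split only when the notebook is re-run from the top.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# hypothetical toy data: 20 samples with 2 features each
X_demo = np.arange(40).reshape(20, 2)
y_demo = np.arange(20) % 2

# two calls with the same integer seed
split_a = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)
split_b = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)

# compare the four returned arrays pairwise
same_split = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print(f"identical splits: {same_split}")

# test_size=0.3 allocates 30% of the 20 samples to the test set
X_train_demo, X_test_demo, y_train_demo, y_test_demo = split_a
print(X_train_demo.shape, X_test_demo.shape)
```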

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=rng, test_size=0.3)

In [None]:
print(f"the size of the training features object is {X_train.shape}") 
print(f"the size of the test features object is {X_test.shape}")
print(f"the size of the training outcomes object is {y_train.shape}")
print(f"the size of the test outcomes object is {y_test.shape}")

## Model training

Scikit-learn provides a range of machine learning algorithms that can be trained for different tasks (e.g. classification, regression, text, images, clustering etc.). 

In Scikit-learn terminology each of these algorithms is called an <a href="https://scikit-learn.org/stable/glossary.html#term-estimators" target="_blank">`estimator`</a> - the docs have a useful <a href="https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html" target="_blank">interactive guide</a> to help you select the right `estimator` for your machine learning task.

Each `estimator` object has a `fit()` method. The fit method expects the training data as arguments (`X_train` and `y_train`) and when called learns rules that minimise the error in predicting the outcome labels in `y_train`. This is the *learning* part of the machine learning workflow. 

Here, we will demonstrate how to train a tree-based machine learning model: a random forests classifier. 

First, we create an `estimator` object for the model. Then, we use the `estimator`'s `fit()` method to train the model. 

### Random forests classifiers

Random forests are ensemble, tree-based models: each model consists of an ensemble of decision tree classifiers.

Please read through this <a href="https://developers.google.com/machine-learning/decision-forests/decision-trees" target="_blank">Google Machine Learning Guide on decision trees and random forests</a>.

<details>
    <summary><b>Detailed notes on tree-based models</b></summary>
    
Decision tree classifiers are trained to learn rules that classify outcome labels based on input features by recursively splitting the feature space. The algorithm starts by finding the value of a feature that splits the dataset into two groups which minimise the "impurity" of outcome labels in each group. Then, that process is repeated by splitting each of the two groups, again to minimise the "impurity" of outcome labels. This process repeats until a stopping criterion is reached. The Gini index is the default metric to measure class impurity in each internal node of the tree. 
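
As a rough illustration of the "impurity" idea, the sketch below computes the Gini impurity of a group of labels and the size-weighted impurity of a candidate split. The function names `gini_impurity` and `split_impurity` are hypothetical helpers written for this example; they are not Scikit-learn functions.

```python
from collections import Counter

def gini_impurity(labels):
    """Gini impurity of a group of class labels: 1 - sum(p_k ** 2)."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((count / n) ** 2 for count in counts.values())

def split_impurity(left, right):
    """Size-weighted mean of the Gini impurity of the two child groups."""
    n = len(left) + len(right)
    return (len(left) / n) * gini_impurity(left) + (len(right) / n) * gini_impurity(right)

# a group with a single class is pure
pure = gini_impurity(["wheat"] * 4)
# an even mix of two classes is maximally impure for two classes
mixed = gini_impurity(["wheat", "wheat", "mustard", "mustard"])
# a split that separates the two classes perfectly
good_split = split_impurity(["wheat", "wheat"], ["mustard", "mustard"])
print(pure, mixed, good_split)
```

A tree-growing algorithm would evaluate many candidate splits like this and keep the one with the lowest weighted impurity.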

The class label associated with each of the terminal nodes of the tree is based on the most commonly occurring class. 

Individual decision tree classifiers are relatively quick to train, can learn non-linear and interactive relationships between input features and outcome labels, and are easy to visualise and interpret. 

However, there are limits to decision tree classifiers. They are often not the most accurate classifiers. They also have high variance; if you train a decision tree classifier on two different samples it will likely learn different relationships and generate different predictions. Large decision trees can also overfit the training data; they can learn to fully represent the structure of the training set but will not generalise well to new and unseen data.
    
Random forests models mitigate the limitations of a single decision tree classifier by:

<b>bagging:</b> training a number (ensemble) of decision trees based on bootstrap samples of the training dataset. The average prediction from many decision tree models reduces the variance in predictions.

<b>sampling features at each split:</b> when training each of the decision trees in the ensemble, a random selection of features is searched for each split within the tree. This prevents a small number of features from dominating the model, enables the model to learn using all the input features, and reduces overfitting. If there are <em>p</em> features, then often <em>m</em> = &radic;<em>p</em> features are considered at each split.
    
<b>majority vote:</b> for classification tasks, the final predicted value from a random forest model is the most common prediction of the outcome label across all trees in the ensemble.
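
A minimal sketch of the majority vote idea for a single sample is shown below. The helper `majority_vote` and the example votes are hypothetical. Note that, according to its documentation, Scikit-learn's `RandomForestClassifier` combines trees by averaging their predicted class probabilities rather than by a hard vote, so treat this as an illustration of the concept and not of the library's internals.

```python
from collections import Counter

def majority_vote(tree_predictions):
    """Return the most common label among the per-tree predictions for one sample."""
    return Counter(tree_predictions).most_common(1)[0][0]

# hypothetical predictions from five trees for a single field
# (1 = wheat, 2 = mustard, 4 = fallow)
votes = [1, 2, 1, 4, 1]
print(majority_vote(votes))
```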
      
</details>
<p></p>

Let's create a random forest model `estimator` object using the `RandomForestClassifier` class. We'll set the `n_estimators` parameter to 20 here; this means the random forest will consist of an ensemble of 20 decision tree classifiers. The `random_state` parameter ensures we learn the same model each time we train it on the same data; this is important for reproducible results.

In [None]:
# create and train a random forests model
rf = RandomForestClassifier(n_estimators=20, random_state=rng)
rf.fit(X_train, y_train.astype(int))

## Model evaluation

After training a model, we need to evaluate it to assess its performance. This evaluation allows us to compare different models and get an indication of how well the model will perform when it is used on new data. 

After training, the model object (`rf` in this case) stores rules that map input data to output (predicted) labels. In the case of a random forests model, these rules are stored and expressed as a collection of decision trees. 

It is important that the test data used to evaluate a model is independent of the training data; this is to ensure an unbiased estimate of model performance. 

There are a range of model evaluation metrics for classification tasks:

* **accuracy**: the proportion of correctly classified examples. 
* **recall**: the ratio of true positives to the sum of true positives and false negatives - the ability of the classifier to capture all positive cases. $recall = \frac{tp}{tp+fn}$.
* **precision**: the ratio of true positives to the sum of true positives and false positives - the classifier's ability not to label something as positive when it is not. $precision = \frac{tp}{tp+fp}$.
* **f1-score**: the f1-score combines the recall and precision scores and takes on the range 0 to 1. $F1 = 2\cdot\frac{precision\cdot{recall}}{precision+recall}$
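
To connect the formulas above to numbers, here is a small worked example on made-up binary labels. `y_true_demo` and `y_pred_demo` are hypothetical; the hand-computed values are checked against Scikit-learn's metric functions.

```python
import math
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# hypothetical binary labels: 1 = positive class, 0 = negative class
y_true_demo = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred_demo = [1, 1, 0, 0, 1, 0, 0, 0]

pairs = list(zip(y_true_demo, y_pred_demo))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives

accuracy = (tp + tn) / len(pairs)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

# the hand-computed values should agree with Scikit-learn
assert math.isclose(accuracy, accuracy_score(y_true_demo, y_pred_demo))
assert math.isclose(precision, precision_score(y_true_demo, y_pred_demo))
assert math.isclose(recall, recall_score(y_true_demo, y_pred_demo))
assert math.isclose(f1, f1_score(y_true_demo, y_pred_demo))

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

For a multi-class problem such as crop type, these metrics are computed per class by treating that class as "positive" and all other classes as "negative".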

Scikit-learn provides a `classification_report()` function which can be used to generate performance metrics for each class and the model as a whole. 

The `classification_report()` expects known outcome labels and predicted outcome labels. To generate the predicted labels, we can use the `predict()` method of the `estimator` object and pass in input test data. 

In [None]:
y_pred = rf.predict(X_test)

print(classification_report(y_test, y_pred))

From the classification report, we can ascertain the following:

* the model's overall accuracy is 0.68 - 68% of the examples in the test set were classified correctly.
* the model's performance is better for class labels 1 (wheat), 4 (fallow), and 9 (maize). 
* the performance metric scores for the other classes are lower. 
* the model's performance is best for classes with the most observations in the training dataset. 
* we're getting a warning indicating to us that the precision and f1-score are being set to zero in labels with no predicted samples.
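
The warning arises when a class is never predicted, so its precision has a zero denominator. One option, sketched below on made-up labels, is to state the fallback value explicitly with the `zero_division` argument of `classification_report()`. The toy labels are hypothetical, and whether 0 is the right fallback for your analysis is a judgement call.

```python
from sklearn.metrics import classification_report

# hypothetical labels: classes 2 and 3 are never predicted
y_true_demo = [1, 1, 2, 3]
y_pred_demo = [1, 1, 1, 1]

# zero_division=0 sets undefined metrics to 0 without raising a warning;
# output_dict=True returns a nested dictionary instead of a formatted string
report = classification_report(
    y_true_demo, y_pred_demo, zero_division=0, output_dict=True
)

print(f"class 2 precision: {report['2']['precision']}")
print(f"class 1 recall: {report['1']['recall']}")
print(f"overall accuracy: {report['accuracy']}")
```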

We can also plot a confusion matrix to see if there are patterns of confusion between classes.

In [None]:
labels = ["Wheat", 
          "Mustard", 
          "Lentil", 
          "Fallow", 
          "Green Pea",
          "Sugarcane",
          "Garlic",
          "Maize",
          "Gram",
          "Coriander",
          "Potato",
          "Bersem",
          "Rice"]
cm = confusion_matrix(y_test, y_pred)
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=labels)
disp.plot(text_kw={"fontsize":10}, xticks_rotation="vertical")
plt.show()

From the confusion matrix, we can see:
    
* there is confusion between the mustard and wheat classes.
* a large number of minority class samples are misclassified as wheat or mustard (majority classes).
* there were no successful classifications of coriander, garlic, or bersem. 
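
One detail to be aware of when plotting confusion matrices: by default, `confusion_matrix()` only includes classes that appear in the true or predicted labels. If a rare class is missing from both, the matrix will have fewer rows than your list of display labels. The sketch below, using hypothetical toy labels, shows how the `labels` argument fixes the set and order of classes.

```python
from sklearn.metrics import confusion_matrix

# hypothetical class codes; class 3 does not occur in the toy labels
all_classes = [1, 2, 3]
y_true_demo = [1, 1, 2, 2]
y_pred_demo = [1, 2, 2, 2]

# default: only classes present in the data are included
cm_default = confusion_matrix(y_true_demo, y_pred_demo)

# explicit labels: one row and column per class, in the order given
cm_fixed = confusion_matrix(y_true_demo, y_pred_demo, labels=all_classes)

print(cm_default.shape, cm_fixed.shape)
print(cm_fixed)
```

Rows correspond to the true class and columns to the predicted class.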

#### Recap quiz

**Earlier, we discussed that random forests models should be more accurate than a single decision tree classifier. Can you train a decision tree classifier to test if this is the case?**

**The documentation for the decision tree classifier is <a href="https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html" target="_blank">here</a>.**

**Use `X_train`, `y_train`, `X_test`, and `y_test` for this task**

In [None]:
## add code here ##

<details>
    <summary><b>answer</b></summary>
    
```python
# import the tree module from scikit-learn
from sklearn import tree

# create a decision tree classifier object
clf = tree.DecisionTreeClassifier(random_state=0)

# train the model
clf.fit(X_train, y_train)

# test model
y_pred_tree = clf.predict(X_test)
print(classification_report(y_test, y_pred_tree))
```
</details>

### Visualising predictions

We have a `DataFrame` `X` which stores all the features before we split the data into training and test splits. We can use `X` to generate a predicted crop type label for each of our data points. 

In [None]:
all_preds = rf.predict(X)

We can then append the predicted crop type labels to our initial `GeoDataFrame` as a new column and visualise these predictions on a map. Hover over points on the map with your cursor to see the actual (`labels_cat`) and predicted (`predicted`) crop types for a field. 

In [None]:
gdf["predicted"] = all_preds.astype("str")
gdf.replace({"predicted": class_mappings}, inplace=True)
basemap = "https://server.arcgisonline.com/ArcGIS/rest/services/World_Imagery/MapServer/tile/{z}/{y}/{x}"
attribution = "Tiles &copy; Esri &mdash; Source: Esri, i-cubed, USDA, USGS, AEX, GeoEye, Getmapping, Aerogrid, IGN, IGP, UPR-EGP, and the GIS User Community"
gdf.explore(column="predicted", cmap="tab20", categorical=True, tiles=basemap, attr=attribution, tooltip=["labels_cat", "predicted"])

## Class imbalance

Our training and test datasets are clearly imbalanced across the outcome class labels. The majority of samples are of wheat (1), mustard (2), or fallow (4) classes. 

We can see the imbalance in our dataset using a bar plot.

In [None]:
px.histogram(gdf, x="labels_cat")

In [None]:
print("the number of samples by class in our overall dataset (pre-split) are:")
gdf.groupby('labels_cat').count().loc[:, "field_id"]

#### Recap quiz

<details>
    <summary><b>How could imbalanced data affect model performance?</b></summary>
<ul>
<li>The model will not see enough examples of minority classes to learn rules to discriminate them from the input data</li>
<li>The model will learn it can achieve good overall accuracy by just predicting majority classes</li>
</ul>
</details>

<p></p>

<details>
    <summary><b>What could we do to fix the class imbalance problem?</b></summary>
<ul>
<li>Undersample the majority classes</li>
<li>Oversample the minority classes</li>
<li>Get more data</li>
<li>Pool the minority classes to reduce the total number of classes</li>
</ul>
</details>
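
As one possible illustration of oversampling, the sketch below uses `sklearn.utils.resample` to draw minority class rows with replacement until they match the majority class count. The `train_demo` table and its values are made up. Resampling should be applied to the training set only, so that duplicated rows do not end up in the test set.

```python
import pandas as pd
from sklearn.utils import resample

# hypothetical training data: 6 wheat (1) samples and 2 maize (9) samples
train_demo = pd.DataFrame({
    "ndvi": [0.61, 0.58, 0.65, 0.60, 0.63, 0.59, 0.31, 0.28],
    "labels": [1, 1, 1, 1, 1, 1, 9, 9],
})

majority = train_demo[train_demo["labels"] == 1]
minority = train_demo[train_demo["labels"] == 9]

# sample minority rows with replacement to match the majority count
minority_upsampled = resample(
    minority, replace=True, n_samples=len(majority), random_state=0
)

balanced = pd.concat([majority, minority_upsampled])
print(balanced["labels"].value_counts())
```

Oversampling repeats existing examples and does not add new information, so it is worth comparing against the other options listed above.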

## Data leakage

<a href="https://scikit-learn.org/stable/common_pitfalls.html#data-leakage" target="_blank">Data leakage</a> occurs when information in the test set leaks into the training dataset. This means the test set is not truly independent and does not provide an unbiased assessment of the model's performance on new data. 

Spatial correlation occurs when observations close to each other are more similar or dissimilar than observations further away. This is encapsulated by Tobler's first law of Geography: "Everything is related to everything else. But near things are more related than distant things." 

Geospatial data is often spatially correlated. This means that data points close to each other are not statistically independent. A random training and test split of spatially correlated data can result in the test dataset not being independent of the training dataset. This is because some of the data in the test set is correlated with data in the training set. Spatial correlation is causing data leakage and the evaluation of model performance using this test set will be biased.

#### Recap quiz

<details>
    <summary><b>How could you generate training and test splits which are not spatially correlated?</b></summary>
First, you could explore the spatial correlation in your dataset using techniques such as Moran's I and Local Moran's I statistics. 
    
If you believe that your samples are not likely to be correlated across administrative boundaries such as villages, counties, states etc. you could randomly split your data at the administrative boundary-level as opposed to the sample-level. That is, instead of taking a random hold-out sample of data points for the test set you would take a random sample of administrative units as the test set and all data points inside those units would be your test set. 
    
An alternative strategy if there are no useful administrative boundaries could be to spatially cluster your samples using their coordinates so proximal data points are allocated to the same cluster and randomly hold-out some clusters as the test set. 
</details>
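
The spatial clustering strategy could be sketched as follows. This is an illustration on randomly generated coordinates, not the lab data: the variable names, the choice of `KMeans`, and the number of clusters are all assumptions. Clustering raw longitude and latitude treats degrees as planar distances, so projected coordinates may be a better choice for real data.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.model_selection import GroupShuffleSplit

# hypothetical data: 200 points with random coordinates, features, and labels
rng_demo = np.random.RandomState(0)
coords = rng_demo.uniform(low=0.0, high=1.0, size=(200, 2))
X_demo = rng_demo.normal(size=(200, 4))
y_demo = rng_demo.randint(0, 3, size=200)

# group nearby points into 10 spatial clusters
clusters = KMeans(n_clusters=10, n_init=10, random_state=0).fit_predict(coords)

# hold out whole clusters for the test set
gss_demo = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(gss_demo.split(X_demo, y_demo, groups=clusters))

# no cluster should appear in both the training and test sets
shared = set(clusters[train_idx]) & set(clusters[test_idx])
print(f"clusters shared between train and test: {shared}")
```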

<p></p>

Let's use the `village` column in our dataset `gdf` as a group to guide generation of training and test sets. We'll ensure that no samples from the same village are in both the training and test sets.

<details>
    <summary><b>What is our assumption when using villages as the grouping variable?</b></summary>
We are assuming that data points in neighbouring villages are not spatially correlated, and, therefore, there is no data leakage from the test set to the training set. Is this a safe assumption? Do you think villages next to each other will have different agricultural contexts? 
</details>

In [None]:
X_sp = gdf.drop(["field_id", "labels", "labels_cat", "predicted", "index_right", "village", "geometry"], axis=1)
y_sp = gdf.loc[:, "labels"]
groups = gdf.loc[:, "village"]

Scikit-learn has a `GroupShuffleSplit` class with a `split()` method that can be used to generate splits of the dataset. 

First, we need to create an instance of the `GroupShuffleSplit` object specifying the number of different splits of the dataset that we want to create using the `n_splits` argument. We also use the `train_size` argument to define how much of the data should be allocated to the training and test sets. 

Here, we only want to create one split of our dataset at the `village` level so we set `n_splits=1`. 

Then, we call the `split()` method of `gss`, our `GroupShuffleSplit` object, passing in the features (`X_sp`), outcome labels (`y_sp`), and the groups (`groups`). This returns to us a `train_index` and `test_index` specifying the index locations of samples allocated to the training and test set. Passing in `groups` ensures that no samples from the same group (`village`) are in both the training and test sets. 

We then use the index locations in `train_index` and `test_index` to subset `X_sp` and `y_sp` for model training and testing.

In [None]:
gss = GroupShuffleSplit(n_splits=1, train_size=.8, random_state=0)

for i, (train_index, test_index) in enumerate(gss.split(X_sp, y_sp, groups)):
    print(f"processing split {i}")
    X_train_sp = X_sp.iloc[train_index, :]
    X_test_sp = X_sp.iloc[test_index, :]
    y_train_sp = y_sp.iloc[train_index]
    y_test_sp = y_sp.iloc[test_index]
    print(f"the size of the training features object is {X_train_sp.shape}") 
    print(f"the size of the test features object is {X_test_sp.shape}")
    print(f"the size of the training outcomes object is {y_train_sp.shape}")
    print(f"the size of the test outcomes object is {y_test_sp.shape}")
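
When reading the shapes printed above, keep in mind that, per the Scikit-learn documentation, `train_size` in `GroupShuffleSplit` is the proportion of groups allocated to the training set, not the proportion of samples. If villages differ in size, the share of samples in the training set can differ from 0.8. The sketch below uses hypothetical villages of unequal size to show this, and also checks that no village is in both sets.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# hypothetical villages: one large (A) and four small (B to E)
groups_demo = np.array(["A"] * 60 + ["B"] * 10 + ["C"] * 10 + ["D"] * 10 + ["E"] * 10)
X_demo = np.zeros((len(groups_demo), 1))

# 80% of the 5 villages (4 villages) go to the training set
gss_demo = GroupShuffleSplit(n_splits=1, train_size=0.8, random_state=0)
train_idx, test_idx = next(gss_demo.split(X_demo, groups=groups_demo))

train_groups = set(groups_demo[train_idx])
test_groups = set(groups_demo[test_idx])

# the fraction of samples in the training set depends on which village is held out
train_fraction = len(train_idx) / len(groups_demo)
print(sorted(train_groups), sorted(test_groups), round(train_fraction, 2))
```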

Now we're ready to train the model and test its performance.

In [None]:
# create and train a random forests model
rf_sp = RandomForestClassifier(n_estimators=20, random_state=rng)
rf_sp.fit(X_train_sp, y_train_sp.astype(int))

In [None]:
# predict with the model trained on the village-level split (rf_sp, not rf)
y_pred_sp = rf_sp.predict(X_test_sp)

print(classification_report(y_test_sp, y_pred_sp))