# Filter Training Data


## Background

It is not uncommon for existing training data to have been collected during a different time period from the study period. As a result, the dataset may not reflect the actual ground cover because of changes over time. FAO adopted a training data filtering method for any given reference year that is within a time span (e.g. 5 years) of an existing baseline, and tested the method in the production of land cover mapping for Lesotho. The method assumes that the majority of reference labels remain valid from one year to the previous/next. Under this assumption, the reference labels that have changed are a minority and should be detectable with outlier detection methods such as K-Means clustering. More details on the method and how it performs for Lesotho can be found in the published paper ([De Simone et al. 2022](https://www.mdpi.com/2072-4292/14/14/3294)).

## Description

This notebook will implement FAO's automatic filtering of a training dataset for a target year using points from a geojson or shapefile and a reference classification map of a previous year. The steps include:
1. Load extracted training features
2. Generate stratified random samples for each class on the reference land cover map using `random_sampling` and extract their features using `collect_training_data`
3. Train K-Means models using the extracted features of the random samples
4. Apply clustering on training features and remove minor clusters
5. Export the filtered training features to disk for use in subsequent scripts

To run this analysis, run all the cells in the notebook, starting with the "Load packages" cell.

### Load packages


In [None]:
%matplotlib inline
import os
import datacube
import warnings
import numpy as np
import geopandas as gpd
import pandas as pd
import matplotlib.pyplot as plt
import xarray as xr
import rioxarray
from odc.io.cgroups import get_cpu_quota
from odc.algo import xr_geomedian
from deafrica_tools.datahandling import load_ard
from deafrica_tools.classification import collect_training_data
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn import metrics
from rasterio.enums import Resampling
from random_sampling import random_sampling # adapted from function by Chad Burton: https://gist.github.com/cbur24/04760d645aa123a3b1817b07786e7d9f



## Analysis parameters
* `training_features_path`: The path to the file containing training features we extracted through the previous module `0_Extract_Training_Features.ipynb`.
* `reference_map_path`: The path to the reference classification map, which will be used as a stratification layer to extract random samples for each class. In this example, we are using the existing national land cover map. **Note that the reference map pixel values should contain the class values existing in the training data.**
* `class_attr`: The name of the column in your shapefile/geojson attribute table that contains the class labels. **The class labels must be integers.**
* `output_crs`: Output spatial reference system.

In [2]:
training_features_path = 'Results/Mozambique_training_features.geojson'
reference_map_path='Data/rwanda_landcover_2015_scheme_ii_classes_merged.tif'
class_attr = 'LC_Class_I' # class label in integer format
output_crs='epsg:32735' # WGS84/UTM Zone 35S

## Load input data

We now load the training features file using `geopandas`. The resulting geodataframe should contain the column `class_attr` identifying class labels, and the bi-monthly geomedians of the nine spectral bands and NDVI that we extracted in the previous module. It also contains the coordinate and geometry columns.

In [3]:
training_features= gpd.read_file(training_features_path) # Load training features
training_features.head() # Display first five rows

Unnamed: 0,LC_Class_I,blue_0,blue_1,blue_2,blue_3,blue_4,blue_5,green_0,green_1,green_2,...,swir_2_5,NDVI_0,NDVI_1,NDVI_2,NDVI_3,NDVI_4,NDVI_5,x_coord,y_coord,geometry
0,1.0,757.348816,959.575073,893.934631,285.005371,2297.998779,421.5,830.237305,1061.837524,948.557495,...,587.5,0.678701,0.50286,0.561644,0.77848,0.377108,0.571549,752845.0,9711775.0,POINT (752845.000 9711775.000)
1,1.0,251.000031,277.0,811.000244,532.578247,383.189667,731.0,427.000031,468.0,1098.000122,...,944.0,0.745875,0.665511,0.336706,0.575585,0.649825,0.451096,764255.0,9744915.0,POINT (764255.000 9744915.000)
2,1.0,256.0,769.0,1057.999878,343.681213,401.50415,699.0,397.0,892.5,1142.0,...,631.0,0.792663,0.693928,0.560061,0.702727,0.781781,0.691824,756945.0,9736125.0,POINT (756945.000 9736125.000)
3,1.0,97.000122,543.313477,1147.638916,206.62558,1494.971313,1878.000488,225.000107,649.93988,1173.843506,...,3522.999023,0.830317,0.687891,0.396716,0.803262,0.39837,0.140103,756135.0,9713445.0,POINT (756135.000 9713445.000)
4,1.0,366.000946,534.0,769.930664,1113.999878,419.0,285.0,556.000549,661.5,819.422485,...,746.0,0.682099,0.412163,0.586193,0.40166,0.719376,0.75072,738045.0,9728215.0,POINT (738045.000 9728215.000)


Using the `class_attr` column we can get the class values, which we will use later to process by class:

In [4]:
lc_classes=training_features[class_attr].unique() # get class labels
print('land cover classes:\n',lc_classes)

land cover classes:
 [ 1.  5.  7.  9. 10. 11. 12. 13.]


The training data filtering method also requires a reference land cover map as a stratification layer to generate random training samples, which will be used to train the K-Means models. We now load the reference map:

In [5]:
# load reference classification map
reference_map = xr.open_dataset(reference_map_path,engine="rasterio").astype(np.uint8)
reference_map=reference_map.to_array().squeeze()
print('Reference land cover classification raster:\n',reference_map)

Reference land cover classification raster:
 <xarray.DataArray (y: 20992, x: 23234)>
array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=uint8)
Coordinates:
    band         int64 1
  * x            (x) float64 7.043e+05 7.044e+05 ... 9.367e+05 9.367e+05
  * y            (y) float64 9.887e+06 9.887e+06 ... 9.678e+06 9.678e+06
    spatial_ref  int64 ...
    variable     <U9 'band_data'


## Generate random samples
In many cases the training data may not contain enough samples of some classes to train the K-Means models reliably. Therefore, we generate randomly distributed samples for each class from the reference classification map using the `random_sampling` function. This function takes a few parameters:  
* `n`: total number of points to sample
* `da`: a classified map as a 2-dimensional xarray.DataArray
* `sampling`: the sampling strategy, e.g. 'stratified_random' where each class has a number of points proportional to its relative area, or 'equal_stratified_random' where each class has the same number of points.
* `out_fname`: the file path to which the function exports the sampling points as a shapefile/geojson. You can set this to `None` if you don't need to output the file.
* `class_attr`: The name of the column in the output dataframe that contains the integer class values on the classified map.
* `drop_value`: Pixel value on the classification map to be excluded from sampling.  

The output of the function is a geopandas dataframe of randomly distributed points containing a column `class_attr` identifying class values. Here we also re-assign pixel values that are absent from the training data to the `drop_value` so that these pixels will not be sampled. In this example we exclude 255 (no data values) and 3 (Sparse Forest). For a quick demonstration let's sample 100 pixels for each class, which corresponds to 800 pixels in total. To fit in memory we sample over only a subset of the map. In your own project, however, you need to sample across your whole study area to make sure the samples are representative of the classes.

In [6]:
# da=reference_map.where((reference_map!=0)&(reference_map!=3)&(reference_map!=255),np.nan)
da=reference_map.where((reference_map!=3)&(reference_map!=255),0)
gpd_random_samples=random_sampling(da[10000:15000,10000:15000],n=800,sampling='equal_stratified_random',
                                   out_fname=None,class_attr=class_attr,drop_value=0)

Class 1: sampling at 100 coordinates
Class 5: sampling at 100 coordinates
Class 7: sampling at 100 coordinates
Class 9: sampling at 100 coordinates
Class 10: sampling at 100 coordinates
Class 11: sampling at 100 coordinates
Class 12: sampling at 100 coordinates
Class 13: sampling at 100 coordinates


Note that the cell above is only a quick demonstration with 100 samples per class. For the full example, ~1000 samples were generated for each class across Rwanda, i.e. 8000 random samples in total, and the points are stored in the file 'Results/Rwanda_random_samples.geojson'.

## Extract features
With the random sample points available, we now need to extract features to train the K-Means models. Because the clustering will be applied to the training features extracted in the previous module `0_Extract_Training_Features.ipynb`, the random samples must have the same features. We can therefore re-use the query and feature layer function from the previous notebook to extract them, i.e. the bi-monthly geomedians of the nine spectral bands and NDVI. As feature extraction was demonstrated in the previous module, we skip it here and instead load a prepared file of extracted features for the random samples:

In [8]:
rand_samples_features_path='Results/Rwanda_random_samples_features.geojson'
rand_samples_features=gpd.read_file(rand_samples_features_path)
rand_samples_features.head()

Unnamed: 0,LC_Class_I,blue_0,blue_1,blue_2,blue_3,blue_4,blue_5,green_0,green_1,green_2,...,swir_2_5,NDVI_0,NDVI_1,NDVI_2,NDVI_3,NDVI_4,NDVI_5,x_coord,y_coord,geometry
0,1.0,371.5,1201.554321,297.875305,419.874207,912.152344,863.89209,486.5,1250.170654,402.086823,...,906.334961,0.712283,0.250755,0.743313,0.705406,0.592411,0.624546,754395.0,9737915.0,POINT (754395.000 9737915.000)
1,1.0,202.5,372.5,539.539001,537.098083,603.509155,734.3573,343.0,518.5,615.346924,...,628.867432,0.777585,0.774561,0.7415,0.709321,0.717131,0.531703,759065.0,9737935.0,POINT (759065.000 9737935.000)
2,1.0,599.999939,1756.129395,1354.105591,1395.648804,1114.809814,596.0,997.999878,1942.233887,1472.304321,...,1795.0,0.63089,0.451221,0.446173,0.431146,0.448051,0.648904,743685.0,9713015.0,POINT (743685.000 9713015.000)
3,1.0,345.00058,1920.160278,0.0,254.0,690.327026,336.203766,591.000488,2008.07251,0.0,...,867.987549,0.82458,0.336523,0.0,0.851555,0.739029,0.840424,796485.0,9845935.0,POINT (796485.000 9845935.000)
4,1.0,0.0,760.321777,462.334198,535.310181,426.999878,712.5,0.0,848.700256,553.475342,...,871.5,0.0,0.70905,0.69605,0.678284,0.807006,0.72011,746785.0,9725295.0,POINT (746785.000 9725295.000)


## K-Means clustering
Now that we have the features of the random samples and training points, we can use them to train and apply the K-Means models. A K-Means model requires a pre-defined number of clusters, which is unknown in many cases. One way to identify the optimal number of clusters is to use the Calinski-Harabasz Index. The index is the ratio of the between-cluster dispersion to the within-cluster dispersion, summed over all clusters; it is higher when clusters are dense and well separated. More information can be found [here](https://scikit-learn.org/stable/modules/clustering.html#calinski-harabasz-index). In this example we calculate the index for clusterings with a varied number of clusters (e.g. 3 to 20) and retain the clustering with the highest index.  
> Note: You can also use other indices to assess the clustering and choose the optimal number of clusters, see information on other indices [here](https://scikit-learn.org/stable/modules/clustering.html#clustering-performance-evaluation). Depending on the distribution of your features, different indices may lead to different optimal cluster numbers. 
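
As an illustration of what the index measures, the sketch below computes the Calinski-Harabasz Index by hand on synthetic data and compares it with the value returned by scikit-learn. The function name `calinski_harabasz_manual` and the synthetic points are assumptions for demonstration only.

```python
import numpy as np
from sklearn import metrics

def calinski_harabasz_manual(X, labels):
    """Between-cluster dispersion / (k - 1) divided by within-cluster dispersion / (n - k)."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    n = X.shape[0]
    clusters = np.unique(labels)
    k = clusters.size
    overall_mean = X.mean(axis=0)
    between = 0.0
    within = 0.0
    for c in clusters:
        members = X[labels == c]
        centre = members.mean(axis=0)
        # squared distance of the cluster centre to the overall mean, weighted by cluster size
        between += members.shape[0] * np.sum((centre - overall_mean) ** 2)
        # squared distances of the members to their own cluster centre
        within += np.sum((members - centre) ** 2)
    return (between / (k - 1)) / (within / (n - k))

# two synthetic groups of points with known labels
rng = np.random.default_rng(0)
X_toy = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(6, 1, (50, 2))])
labels_toy = np.repeat([0, 1], 50)

manual = calinski_harabasz_manual(X_toy, labels_toy)
reference = metrics.calinski_harabasz_score(X_toy, labels_toy)
print('manual:', manual, ' scikit-learn:', reference)
```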

Here we wrap the procedure for identifying the optimal clustering in a function. Its inputs are the features and the minimum and maximum number of clusters, and its outputs are the optimal number of clusters, the trained K-Means model and the corresponding clustering labels:

In [12]:
def find_clusters_KMeans(data,min_cluster=3,max_cluster=20):
    # fit K-Means for every cluster number from min_cluster to max_cluster
    # and keep the model with the highest Calinski-Harabasz score
    highest_score=-np.inf # any valid score will exceed this
    n_cluster_optimal=min_cluster
    kmeans_model_optimal=None # initialise optimal model parameters
    labels_optimal=None
    for n_cluster in range(min_cluster,max_cluster+1): # +1 so that max_cluster itself is tested
        kmeans_model = KMeans(n_clusters=n_cluster, random_state=1).fit(data)
        labels=kmeans_model.predict(data)
        score=metrics.calinski_harabasz_score(data, labels)
        print('Calinski-Harabasz score for ',n_cluster,' clusters is: ',score)
        if score>highest_score:
            highest_score=score
            n_cluster_optimal=n_cluster
            kmeans_model_optimal=kmeans_model
            labels_optimal=labels
    print('Best number of clusters: %s'%(n_cluster_optimal))
    return n_cluster_optimal,kmeans_model_optimal,labels_optimal

Using the above function, we now cluster the training features for the first class as an example. We first subset the random sample and training sample features for this class:

In [13]:
# get class label
i=lc_classes[0]
# subset random sample features for this class
rand_features_single_class=rand_samples_features[rand_samples_features[class_attr]==i].reset_index(drop=True)
# subset original training points for this class
td_single_class=training_features[training_features[class_attr]==i].reset_index(drop=True)
print('Number of training points for the class: ',len(td_single_class))

Number of training points for the class:  1108


We then apply the `find_clusters_KMeans` function to the random sample features to find the optimal clustering. Note that the K-Means model is sensitive to feature scales, so we need to standardise all features before applying the model. Here we use the scikit-learn `StandardScaler` to implement the feature standardisation. Remember to drop the class label, coordinate and geometry columns from the features before clustering.

In [14]:
# initialise standard scaler
scaler = StandardScaler()
# fit random samples
scaler.fit(rand_features_single_class.iloc[:,1:-3])
# transform random samples
rand_features_single_class=scaler.transform(rand_features_single_class.iloc[:,1:-3])
# find optimal clustering
n_cluster_optimal,kmeans_model_optimal,labels_optimal=find_clusters_KMeans(rand_features_single_class,min_cluster=3,max_cluster=20)

Calinski-Harabasz score for  3  clusters is:  135.9566343144762
Calinski-Harabasz score for  4  clusters is:  120.70866589368988
Calinski-Harabasz score for  5  clusters is:  113.52164353413254
Calinski-Harabasz score for  6  clusters is:  113.67973715000495
Calinski-Harabasz score for  7  clusters is:  115.17709017248218
Calinski-Harabasz score for  8  clusters is:  108.15997516906388
Calinski-Harabasz score for  9  clusters is:  103.42409146029637
Calinski-Harabasz score for  10  clusters is:  101.75018074337639
Calinski-Harabasz score for  11  clusters is:  96.71845350932122
Calinski-Harabasz score for  12  clusters is:  92.40628432420684
Calinski-Harabasz score for  13  clusters is:  89.06192019251192
Calinski-Harabasz score for  14  clusters is:  85.10924117909353
Calinski-Harabasz score for  15  clusters is:  82.90965170849793
Calinski-Harabasz score for  16  clusters is:  80.39174306615378
Calinski-Harabasz score for  17  clusters is:  77.94288889191549
Calinski-Harabasz score f

After identifying the optimal clustering, we can apply the optimal K-Means model to our training features. Remember to apply feature standardisation before implementing the clustering. Here we assign the clustering labels to a new column `cluster`:

In [16]:
# normalisation before clustering
model_input=scaler.transform(td_single_class.iloc[:,1:-3])
# predict clustering labels
labels_kmeans = kmeans_model_optimal.predict(model_input)
# append clustering results to pixel coordinates
td_single_class['cluster']=labels_kmeans
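
The positional slice `.iloc[:,1:-3]` used above relies on the class label being the first column and the coordinates and geometry being the last three. Once extra columns such as `cluster` are appended, the same slice no longer selects only the features. A more defensive alternative, sketched here under the assumption that the non-feature columns are named as in the table shown earlier, is to select the feature columns by name. The helper `feature_columns` and the toy dataframe are illustrative and not part of the original notebook.

```python
import pandas as pd

def feature_columns(df, class_attr, non_feature_cols=("x_coord", "y_coord", "geometry")):
    """List the columns that should be passed to the scaler and K-Means."""
    excluded = {class_attr, *non_feature_cols}
    return [c for c in df.columns if c not in excluded]

# toy dataframe with the same column layout as the training features (made-up values)
toy = pd.DataFrame({
    "LC_Class_I": [1.0, 1.0],
    "blue_0": [757.3, 251.0],
    "NDVI_0": [0.68, 0.75],
    "x_coord": [752845.0, 764255.0],
    "y_coord": [9711775.0, 9744915.0],
    "geometry": ["POINT A", "POINT B"],
})
cols = feature_columns(toy, "LC_Class_I")
print(cols)
```

If helper columns have already been added, they can be excluded in the same way, e.g. by passing `non_feature_cols=("x_coord", "y_coord", "geometry", "cluster", "cluster_frequency")`.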

## Filtering training features

We now filter the training features/points based on the cluster size. Here we assume that clusters containing less than 5% of the overall sample size are likely to consist of misclassified or changed samples.    
>Note: Depending on your own training data the K-Means method may not work well, so it is recommended that you get to know your training points and test how well the method works, e.g. check whether it successfully filters out the points you believe were misclassified while keeping good training samples. You should also try adjusting the cluster size threshold if it doesn't effectively remove false samples.

There are also other options for removing outliers that you can test, e.g. see [here](https://scikit-learn.org/stable/modules/outlier_detection.html) for outlier detection with scikit-learn.

In [18]:
frequency_threshold=0.05 # threshold of cluster frequency
cluster_frequency=td_single_class['cluster'].map(td_single_class['cluster'].value_counts(normalize=True)) # calculate cluster frequencies for the training samples
td_single_class['cluster_frequency']=cluster_frequency # append as a column
td_single_class_filtered=td_single_class[td_single_class['cluster_frequency']>=frequency_threshold] # filter by cluster frequency
print('Number of training data after filtering: ',len(td_single_class_filtered))

Number of training data after filtering:  1108
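
To see how the frequency threshold behaves, the following toy sketch applies the same `value_counts`/`map` logic to a made-up set of 40 cluster labels, in which one cluster holds a single point (2.5% of the samples). The data are hypothetical and chosen only to show a point being removed.

```python
import pandas as pd

# 40 made-up cluster labels: cluster 0 has 25 points, cluster 1 has 14, cluster 2 has 1
toy = pd.DataFrame({"cluster": [0] * 25 + [1] * 14 + [2] * 1})
frequency_threshold = 0.05

# share of all points that falls in each point's cluster
toy["cluster_frequency"] = toy["cluster"].map(toy["cluster"].value_counts(normalize=True))
# keep only points whose cluster holds at least 5% of the samples
toy_filtered = toy[toy["cluster_frequency"] >= frequency_threshold]

print('Points before filtering:', len(toy))
print('Points after filtering:', len(toy_filtered))
```

Because the comparison is `>=`, a cluster holding exactly 5% of the points would be kept.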


You can compare the number of training points before and after the filtering to check whether, and how many, points were filtered out. To apply the above clustering and filtering to all classes, let's put the steps together and iterate through the classes. Here we append the filtered features for all classes to a single dataframe `training_features_filtered`:

In [21]:
training_features_filtered=None # filtered training data for all classes
scaler = StandardScaler() # initialise standard scaler
frequency_threshold=0.05 # threshold of cluster frequency
for i in lc_classes: # filtering training data for each class
    #i=1 # test for first class
    print('Processing class ',i)
    # subset random sample features for this class
    rand_features_single_class=rand_samples_features[rand_samples_features[class_attr]==i].reset_index(drop=True)
    # subset original training points for this class
    td_single_class=training_features[training_features[class_attr]==i].reset_index(drop=True)
    print('Number of training points for the class: ',len(td_single_class))
    # fit random samples
    scaler.fit(rand_features_single_class.iloc[:,1:-3])
    # transform random samples
    rand_features_single_class=scaler.transform(rand_features_single_class.iloc[:,1:-3])
    # find optimal clustering
    n_cluster_optimal,kmeans_model_optimal,labels_optimal=find_clusters_KMeans(rand_features_single_class,min_cluster=3,max_cluster=20)

    # normalisation before clustering
    model_input=scaler.transform(td_single_class.iloc[:,1:-3])
    # predict clustering labels
    labels_kmeans = kmeans_model_optimal.predict(model_input)
    # append clustering results to pixel coordinates
    td_single_class['cluster']=labels_kmeans
    # append frequency of each cluster
    cluster_frequency=td_single_class['cluster'].map(td_single_class['cluster'].value_counts(normalize=True))
    td_single_class['cluster_frequency']=cluster_frequency
    # filter by cluster frequency
    td_single_class_filtered=td_single_class[td_single_class['cluster_frequency']>=frequency_threshold]
    print('Number of training data after filtering: ',len(td_single_class_filtered))
    
    # append the filtered training points of this class to final filtered training data
    if training_features_filtered is None:
        training_features_filtered=td_single_class_filtered
    else:
        training_features_filtered=pd.concat([training_features_filtered, td_single_class_filtered])

Processing class  1.0
Number of training points for the class:  1108
Calinski-Harabasz score for  3  clusters is:  135.9566343144762
Calinski-Harabasz score for  4  clusters is:  120.70866589368988
Calinski-Harabasz score for  5  clusters is:  113.52164353413254
Calinski-Harabasz score for  6  clusters is:  113.67973715000495
Calinski-Harabasz score for  7  clusters is:  115.17709017248218
Calinski-Harabasz score for  8  clusters is:  108.15997516906388
Calinski-Harabasz score for  9  clusters is:  103.42409146029637
Calinski-Harabasz score for  10  clusters is:  101.75018074337639
Calinski-Harabasz score for  11  clusters is:  96.71845350932122
Calinski-Harabasz score for  12  clusters is:  92.40628432420684
Calinski-Harabasz score for  13  clusters is:  89.06192019251192
Calinski-Harabasz score for  14  clusters is:  85.10924117909353
Calinski-Harabasz score for  15  clusters is:  82.90965170849793
Calinski-Harabasz score for  16  clusters is:  80.39174306615378
Calinski-Harabasz scor
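
Rather than reading the counts from the printed log, the number of points per class before and after filtering can be summarised in one table. The sketch below shows one possible way to do this with pandas; the helper `summarise_filtering` and the toy dataframes are assumptions for illustration.

```python
import pandas as pd

def summarise_filtering(before, after, class_attr):
    """Count points per class before and after filtering."""
    counts = pd.DataFrame({
        "before": before[class_attr].value_counts(),
        "after": after[class_attr].value_counts(),
    }).fillna(0).astype(int)  # a class removed entirely gets a count of 0
    counts["removed"] = counts["before"] - counts["after"]
    return counts.sort_index()

# toy data standing in for the unfiltered and filtered training features
toy_before = pd.DataFrame({"LC_Class_I": [1, 1, 1, 5, 5, 7]})
toy_after = pd.DataFrame({"LC_Class_I": [1, 1, 5, 5]})
toy_summary = summarise_filtering(toy_before, toy_after, "LC_Class_I")
print(toy_summary)
```

In the notebook this could be called as `summarise_filtering(training_features, training_features_filtered, class_attr)`.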

## Export filtered training features
Once we've filtered the training features, we can write the filtered data to disk, which will allow us to import the data in the next step(s) of the workflow.

In [23]:
# export the filtered training data as geojson file
output_file = "Results/Rwanda_training_features_filtered.geojson"
training_features_filtered.to_file(output_file, driver="GeoJSON")
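
The filtering loop adds the helper columns `cluster` and `cluster_frequency`, which are also written to the exported file. If the subsequent steps expect only the original feature columns (an assumption, as this depends on how the next notebook reads the file), the helper columns could be dropped before export. A minimal sketch on a toy dataframe, with the hypothetical helper `drop_helper_columns`:

```python
import pandas as pd

def drop_helper_columns(df, helper_cols=("cluster", "cluster_frequency")):
    """Remove the columns added during filtering; columns that are absent are ignored."""
    return df.drop(columns=list(helper_cols), errors="ignore")

# toy dataframe with made-up values
toy = pd.DataFrame({
    "LC_Class_I": [1, 1],
    "blue_0": [757.3, 251.0],
    "cluster": [0, 1],
    "cluster_frequency": [0.6, 0.4],
})
toy_clean = drop_helper_columns(toy)
print(list(toy_clean.columns))
```

The same `drop` call works on a geodataframe, so `drop_helper_columns(training_features_filtered)` could be applied before calling `to_file`.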