# Filter Training Data


## Background

Existing training data are often collected over a different time period than the study period, so a dataset may not reflect the actual ground cover because of changes over time.

The Food and Agriculture Organization (FAO) adopted a training data filtering method that can be applied to any reference year within a given time span (e.g. 5 years) of an existing baseline, and tested it in the production of land cover maps for Lesotho. The method assumes that the majority of reference labels remain valid from one year to the previous or next. Under this assumption, the reference labels that have changed are a minority and should be detectable with outlier detection methods such as K-Means clustering. More details on the method and how it performs for Lesotho can be found in the published paper ([De Simone et al 2022](https://www.mdpi.com/2072-4292/14/14/3294)).

## Description

This notebook will implement FAO's automatic filtering of a training dataset for a target year using points from a geojson or shapefile and a reference classification map of a previous year. The steps include:
1. Load extracted training features
2. Generate stratified random samples for each class on the reference land cover map using `random_sampling` and extract their features using `collect_training_data`
3. Train K-Means models using the extracted features of the random samples
4. Apply clustering on training features and remove minor clusters
5. Export the filtered training features to disk for use in subsequent scripts
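
As a rough illustration of steps 3 and 4, the sketch below runs the idea on synthetic two-feature data rather than real spectral features. Everything in it is an assumption made for illustration: the blob centres, the sample counts, the fixed choice of three clusters (the notebook selects the number of clusters with the Calinski-Harabasz index instead) and the 5% threshold.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

def blob(center, n):
    """Draw n two-feature samples around a centre (synthetic stand-in for spectral features)."""
    return rng.normal(loc=center, scale=0.5, size=(n, 2))

# Hypothetical reference samples for one class: three spectral "modes" of equal size
reference = np.vstack([blob((0, 0), 100), blob((10, 0), 100), blob((0, 10), 100)])

# Hypothetical training points: most fall in the first two modes, 2% in the third
training = np.vstack([blob((0, 0), 480), blob((10, 0), 500), blob((0, 10), 20)])

# Fit the scaler and K-Means on the reference samples only
scaler = StandardScaler().fit(reference)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=1).fit(scaler.transform(reference))

# Assign training points to clusters and measure each cluster's share
labels = kmeans.predict(scaler.transform(training))
share = np.bincount(labels, minlength=3) / len(labels)

# Keep only points whose cluster holds at least 5% of the training points
keep = share[labels] >= 0.05
training_filtered = training[keep]
print(len(training), "->", len(training_filtered))
```

Note that a training point is only removed when it falls into a cluster that is rare among the training points, so the reference samples need to cover the spectral variation of the class for the minority cluster to exist at all.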

To run this analysis, run all the cells in the notebook, starting with the "Load packages" cell.

### Load packages


In [2]:
%matplotlib inline
import warnings
import numpy as np
import geopandas as gpd
import pandas as pd
import xarray as xr
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn import metrics
from random_sampling import random_sampling # adapted from function by Chad Burton: https://gist.github.com/cbur24/04760d645aa123a3b1817b07786e7d9f

## Analysis parameters
* `training_features_path`: The path to the file containing training features we extracted through the previous module `0_Extract_Training_Features.ipynb`.
* `reference_map_path`: The path to the reference classification map, which will be used as a stratification layer to extract random samples for each class. In this example, we are using the existing national land cover map. **Note that the reference map pixel values should contain the class values existing in the training data.**
* `class_attr`: The name of the column in your shapefile/geojson attribute table that contains the class labels. **The class labels must be integers.**
* `output_crs`: Output spatial reference system.

In [3]:
training_features_path = 'Results/Mozambique_training_features.geojson'
reference_map_path='Data/moz_lulc2016_28082019_final_remapped_clipped_set_nodata_40m.tif'
class_attr = 'LC_Class_I' # class label in integer format
output_crs='epsg:32736' # WGS84/UTM Zone 36S

## Load input data

We now load the training features file using `geopandas`. The resulting dataframe should contain the column `class_attr` identifying class labels, and the bi-monthly geomedians of the nine spectral bands and NDVI that we extracted in the previous module. It also contains the coordinate and geometry columns.

In [24]:
training_features= gpd.read_file(training_features_path) # Load training features
training_features.head() # Display first five rows

Unnamed: 0,LC_Class_I,blue_0,blue_1,blue_2,blue_3,blue_4,blue_5,green_0,green_1,green_2,...,swir_2_5,NDVI_0,NDVI_1,NDVI_2,NDVI_3,NDVI_4,NDVI_5,x_coord,y_coord,geometry
0,61.0,1236.0,1139.126953,1048.479736,1063.751343,1165.129028,1292.336792,1942.0,1709.584717,1601.37915,...,4012.578125,0.129363,0.137814,0.144414,0.130792,0.112696,0.090813,692275.0,8605115.0,POINT (692275.000 8605115.000)
1,61.0,1300.0,1173.932007,1038.968506,1065.882446,1190.070557,1337.755859,2044.0,1772.987793,1557.223022,...,4012.829346,0.118633,0.120128,0.146507,0.129145,0.104794,0.085257,692275.0,8605105.0,POINT (692275.000 8605105.000)
2,61.0,1194.0,1129.869385,1024.202515,1071.311401,1102.611938,1209.021973,1852.0,1587.096436,1554.671509,...,3729.845459,0.155747,0.156303,0.156434,0.130459,0.124384,0.104642,692285.0,8605095.0,POINT (692285.000 8605095.000)
3,72.0,1556.0,321.0,280.938965,311.567566,523.496643,538.120789,1910.0,460.0,530.231445,...,1912.091309,0.379334,0.533181,0.703165,0.623306,0.457564,0.485105,690795.0,8647845.0,POINT (690795.000 8647845.000)
4,72.0,1360.0,473.0,265.296082,327.585632,533.296875,515.738525,1636.0,426.0,491.437073,...,1832.404541,0.448956,0.658263,0.708841,0.633997,0.445149,0.493979,690795.0,8647835.0,POINT (690795.000 8647835.000)


Using the `class_attr` column we can get the class values, which we will use later to process by class:

In [25]:
lc_classes=training_features[class_attr].unique() # get class labels
print('land cover classes:\n',lc_classes)

land cover classes:
 [61. 72. 41. 75. 44. 74. 31. 12. 71. 21. 70. 51. 11.]
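
The labels above print as floats (`61.`, `72.`, ...), while the parameter notes ask for integer class labels. A minimal sketch of how one might confirm that the values are integer-valued before casting them is shown below; the helper `to_integer_labels` is hypothetical and not part of the notebook.

```python
import numpy as np

def to_integer_labels(values):
    """Return labels as integers, raising if any value has a fractional part."""
    values = np.asarray(values, dtype=float)
    if not np.all(np.isfinite(values)):
        raise ValueError("class labels contain NaN or infinite values")
    if not np.all(values == np.round(values)):
        raise ValueError("class labels are not integer-valued")
    return values.astype(int)

# A few of the labels printed above: stored as floats but integer-valued
labels = to_integer_labels([61.0, 72.0, 41.0, 75.0])
print(labels)
```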


The training data filtering method also requires a reference land cover map as a stratification layer to generate random training samples, which will be used to train the K-Means models. We now load the reference map:

In [6]:
# load reference classification map
reference_map = xr.open_dataset(reference_map_path,engine="rasterio").astype(np.uint8)
reference_map=reference_map.to_array().squeeze()
print('Reference land cover classification raster:\n',reference_map)

Reference land cover classification raster:
 <xarray.DataArray (y: 45111, x: 28654)>
array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=uint8)
Coordinates:
    band         int64 1
  * x            (x) float64 2.008e+05 2.008e+05 ... 1.347e+06 1.347e+06
  * y            (y) float64 8.832e+06 8.832e+06 ... 7.028e+06 7.028e+06
    spatial_ref  int64 ...
    variable     <U9 'band_data'
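
The parameter notes state that the reference map should contain the class values present in the training data. One possible way to check this is sketched below on a toy array; the helper name `classes_missing_from_map` and the treatment of 0 and 255 as no-data values are assumptions for illustration. On a raster as large as the one above, `np.unique` may be memory-intensive, so checking a subset or processing in chunks could be preferable.

```python
import numpy as np

def classes_missing_from_map(training_classes, class_map, nodata=(0, 255)):
    """Return training classes that never occur as pixel values in the map."""
    map_values = np.setdiff1d(np.unique(class_map), nodata)
    return np.setdiff1d(np.asarray(training_classes).astype(int), map_values)

# Toy 3x3 map standing in for the reference raster
toy_map = np.array([[0, 11, 11],
                    [12, 12, 61],
                    [255, 61, 72]], dtype=np.uint8)
missing = classes_missing_from_map([61.0, 72.0, 41.0], toy_map)
print("classes absent from the map:", missing)
```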


## Generate random samples
In many cases the training data may not contain enough samples for some classes to train the K-Means models reliably. Therefore, we generate randomly distributed samples for each class from the reference classification map using the `random_sampling` function. This function takes the following parameters:  
* `n`: total number of points to sample
* `da`: a classified map as a 2-dimensional xarray.DataArray
* `sampling`: the sampling strategy, e.g. 'stratified_random' where each class has a number of points proportional to its relative area, or 'equal_stratified_random' where each class has the same number of points.
* `out_fname`: a filepath name for the function to export a shapefile/geojson of the sampling points into a file. You can set this to `None` if you don't need to output the file.
* `class_attr`: This is the column name of output dataframe that contains the integer class values on the classified map.
* `drop_value`: Pixel value on the classification map to be excluded from sampling.  

The output of the function is a geopandas dataframe of randomly distributed points containing a column `class_attr` identifying class values. Here we also reassign pixel values that are absent from the training data to the `drop_value` so that these pixels will not be sampled. In this example, 255 (no data) is reassigned to 0, which is then passed as the `drop_value`. For a quick demonstration we sample 1300 points in total. To fit in memory we sample over only a subset (5000 by 5000 pixels) of the map. In your own project, however, you should sample across the whole study area to make sure the samples are representative of the classes.

In [7]:
# da=reference_map.where((reference_map!=0)&(reference_map!=3)&(reference_map!=255),np.nan)
da=reference_map.where(reference_map!=255,0)
gpd_random_samples=random_sampling(da[10000:15000,10000:15000],n=1300,sampling='equal_stratified_random',
                                   out_fname=None,class_attr=class_attr,drop_value=0)

Class 11: sampling at 130 coordinates
Class 12: sampling at 130 coordinates
Class 31: sampling at 130 coordinates
Class 41: sampling at 130 coordinates
Class 44: sampling at 130 coordinates
Class 51: sampling at 130 coordinates
Class 61: sampling at 130 coordinates
Class 72: sampling at 130 coordinates
Class 74: sampling at 130 coordinates
Class 75: sampling at 130 coordinates


In this example we have generated 130 samples for each of the 10 classes present in the subset, i.e. a total of 1300 random samples. Because `out_fname` is set to `None`, the points are not written to disk.
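
To make the difference between the two sampling strategies described above more concrete, the sketch below computes how many points each class would receive under each one. It is not the implementation of `random_sampling`; the helper `allocate_samples`, its rounding and its cap at the number of available pixels are assumptions for illustration.

```python
import numpy as np

def allocate_samples(class_map, n, sampling, drop_value=0):
    """Return {class value: number of points} for a given sampling strategy."""
    values, counts = np.unique(class_map[class_map != drop_value], return_counts=True)
    if sampling == "equal_stratified_random":
        # every class receives the same share of the n points
        per_class = np.full(len(values), n // len(values))
    elif sampling == "stratified_random":
        # points are proportional to the relative area of each class
        per_class = np.round(n * counts / counts.sum()).astype(int)
    else:
        raise ValueError(f"unknown sampling strategy: {sampling}")
    # a class cannot receive more points than it has pixels
    per_class = np.minimum(per_class, counts)
    return dict(zip(values.tolist(), per_class.tolist()))

# Toy map: class 11 covers 70 pixels, class 12 covers 20, class 31 covers 10
toy_map = np.repeat([11, 12, 31], [70, 20, 10]).reshape(10, 10)
equal = allocate_samples(toy_map, 30, "equal_stratified_random")
proportional = allocate_samples(toy_map, 30, "stratified_random")
print(equal)
print(proportional)
```

Equal allocation gives rare classes as many samples as common ones, which is why it suits the per-class clustering used in this notebook.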

## Extract features
With the random sample points available, we now need to extract their features to train the K-Means models. Because the clustering will be applied to the training features extracted in the previous module `0_Extract_Training_Features.ipynb`, the random samples need the same features, i.e. the bi-monthly geomedians of the nine spectral bands and NDVI, and the query and feature layer function from that notebook can be reused to extract them. As feature extraction was demonstrated in the previous module, we skip it here and load a prepared file of extracted features for the random samples:

In [40]:
rand_samples_features_path='Results/stratified_random_samples_signatures_using_lulc2016.geojson'
rand_samples_features=gpd.read_file(rand_samples_features_path)
rand_samples_features.head()

Unnamed: 0,LC_Class_I,blue_0,blue_1,blue_2,blue_3,blue_4,blue_5,green_0,green_1,green_2,...,swir_2_3,swir_2_4,swir_2_5,NDVI_0,NDVI_1,NDVI_2,NDVI_3,NDVI_4,NDVI_5,geometry
0,51.0,411.907104,516.0,697.457886,707.468506,876.202271,920.230652,793.171814,740.0,1007.683899,...,2359.454834,2515.819336,2442.764648,0.657608,0.431102,0.208764,0.157779,0.130523,0.091194,POINT (650905.000 8390145.000)
1,51.0,577.0,685.722961,534.333618,632.182129,802.583313,770.282715,921.0,1065.882202,829.994812,...,2593.210693,2902.564697,2952.701904,0.358475,0.479083,0.316647,0.19699,0.164418,0.151651,POINT (652195.000 8358135.000)
2,51.0,592.908264,608.015015,736.447327,787.725098,939.028564,796.394592,1056.39917,1004.525391,1198.688965,...,2577.558105,2981.744141,2647.535645,0.657485,0.487768,0.321184,0.267627,0.235933,0.322459,POINT (641905.000 8372025.000)
3,51.0,528.001343,865.206909,551.653687,720.208862,920.541626,1021.595337,874.000916,986.782837,819.995483,...,2417.149902,2792.781494,2931.050781,0.586413,0.509176,0.368637,0.223229,0.185781,0.185765,POINT (661975.000 8389575.000)
4,51.0,492.87561,570.037659,412.39267,574.462219,731.276001,580.618896,820.530212,769.601318,718.130981,...,2213.340332,2793.541504,2990.641357,0.803368,0.599269,0.521683,0.286149,0.224348,0.259709,POINT (641005.000 8380965.000)


## K-Means clustering
Now that we have the features of the random samples and training points, we can use them to train and apply the K-Means models. K-Means requires a pre-defined number of clusters, which is unknown in many cases. One way to identify the optimal number of clusters is the Calinski-Harabasz index. The index is the ratio of the between-cluster dispersion to the within-cluster dispersion, summed over all clusters; it is higher when clusters are dense and well separated. More information about the index can be found [here](https://scikit-learn.org/stable/modules/clustering.html#calinski-harabasz-index). In this example we calculate the index for clusterings with a varying number of clusters (e.g. 3 to 19) and retain the clustering with the highest index.  
> Note: You can also use other indices to assess the clustering and choose the optimal number of clusters, see information on other indices [here](https://scikit-learn.org/stable/modules/clustering.html#clustering-performance-evaluation). Depending on the distribution of your features, different indices may lead to different optimal cluster numbers. 

Here we put the procedures in identifying the optimal clustering into a function where the inputs are the input features, minimum and maximum number of clusters, and the outputs are the optimal number of clusters, trained K-Means model and corresponding clustering labels:

In [41]:
def find_clusters_KMeans(data,min_cluster=3,max_cluster=20):
    """Fit K-Means for each cluster number in range(min_cluster, max_cluster)
    (max_cluster itself is excluded) and keep the clustering with the
    highest Calinski-Harabasz score.

    Returns the optimal number of clusters, the fitted model and its labels.
    """
    highest_score=-np.inf # any real score is higher than this
    n_cluster_optimal=min_cluster
    kmeans_model_optimal=None # initialise optimal model parameters
    labels_optimal=None
    for n_cluster in range(min_cluster,max_cluster):
        kmeans_model = KMeans(n_clusters=n_cluster, random_state=1).fit(data)
        labels=kmeans_model.predict(data)
        score=metrics.calinski_harabasz_score(data, labels)
        print('Calinski-Harabasz score for ',n_cluster,' clusters is: ',score)
        if score>highest_score: # keep the best clustering so far
            highest_score=score
            n_cluster_optimal=n_cluster
            kmeans_model_optimal=kmeans_model
            labels_optimal=labels
    print('Best number of clusters: %s'%(n_cluster_optimal))
    return n_cluster_optimal,kmeans_model_optimal,labels_optimal
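
To make the Calinski-Harabasz index more concrete, the sketch below computes it from its definition, i.e. between-cluster dispersion divided by `k - 1` over within-cluster dispersion divided by `n - k`, and compares the result with `metrics.calinski_harabasz_score`. The synthetic data and the helper name `calinski_harabasz_manual` are assumptions for illustration.

```python
import numpy as np
from sklearn import metrics

def calinski_harabasz_manual(X, labels):
    """Compute the Calinski-Harabasz index from its definition."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    n_samples = X.shape[0]
    clusters = np.unique(labels)
    k = len(clusters)
    overall_mean = X.mean(axis=0)
    between, within = 0.0, 0.0
    for c in clusters:
        members = X[labels == c]
        centroid = members.mean(axis=0)
        # between-cluster dispersion: cluster size times squared centroid offset
        between += len(members) * np.sum((centroid - overall_mean) ** 2)
        # within-cluster dispersion: squared distances to the cluster centroid
        within += np.sum((members - centroid) ** 2)
    return (between / (k - 1)) / (within / (n_samples - k))

# Two synthetic, well-separated groups of 50 samples with 3 features each
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 3)), rng.normal(5, 1, (50, 3))])
labels = np.repeat([0, 1], 50)
manual = calinski_harabasz_manual(X, labels)
reference = metrics.calinski_harabasz_score(X, labels)
print(manual, reference)
```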

Using the above function, we now cluster the training features for the first class as an example. We first subset the random sample features and training features for this class:

In [42]:
# get class label
i=lc_classes[0]
# subset random sample features for this class
rand_features_single_class=rand_samples_features[rand_samples_features[class_attr]==i].reset_index(drop=True)
# subset original training points for this class
td_single_class=training_features[training_features[class_attr]==i].reset_index(drop=True)
print('Number of training points for the class: ',len(td_single_class))

Number of training points for the class:  5170


We then apply the `find_clusters_KMeans` function to the random sample features to find the optimal clustering. K-Means is sensitive to feature scales, so all features must be standardised before applying the model. Here we use the scikit-learn `StandardScaler` for this. Remember to drop the class label, coordinate and geometry columns from the features used for clustering.

In [43]:
# initialise standard scaler
scaler = StandardScaler()
# fit random samples
scaler.fit(rand_features_single_class.iloc[:,1:-1])
# transform random samples
rand_features_single_class=scaler.transform(rand_features_single_class.iloc[:,1:-1])
# find optimal clustering
n_cluster_optimal,kmeans_model_optimal,labels_optimal=find_clusters_KMeans(rand_features_single_class,min_cluster=3,max_cluster=20)

Calinski-Harabasz score for  3  clusters is:  394.3823057283652
Calinski-Harabasz score for  4  clusters is:  371.9323704020279
Calinski-Harabasz score for  5  clusters is:  339.7397807862018
Calinski-Harabasz score for  6  clusters is:  302.84279441348315
Calinski-Harabasz score for  7  clusters is:  277.03937453672455
Calinski-Harabasz score for  8  clusters is:  258.30860348479297
Calinski-Harabasz score for  9  clusters is:  238.42932537098477
Calinski-Harabasz score for  10  clusters is:  226.67365273953874
Calinski-Harabasz score for  11  clusters is:  214.43594280139683
Calinski-Harabasz score for  12  clusters is:  205.14060918943235
Calinski-Harabasz score for  13  clusters is:  194.3908295739775
Calinski-Harabasz score for  14  clusters is:  185.7143128599663
Calinski-Harabasz score for  15  clusters is:  180.60024245728974
Calinski-Harabasz score for  16  clusters is:  171.8730040917805
Calinski-Harabasz score for  17  clusters is:  167.83801606201345
Calinski-Harabasz score
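
The standardisation step can be illustrated with plain NumPy. The sketch below learns the per-feature mean and standard deviation from reference samples and applies them to separate training samples, which mirrors fitting the scaler on the random samples and transforming the training features with it. The feature values and helper names are hypothetical.

```python
import numpy as np

def fit_standardiser(reference):
    """Return per-feature mean and standard deviation of the reference samples."""
    mean = reference.mean(axis=0)
    std = reference.std(axis=0)
    std[std == 0] = 1.0  # avoid division by zero for constant features
    return mean, std

def standardise(features, mean, std):
    """Scale features with statistics learned from the reference samples."""
    return (features - mean) / std

# Two hypothetical features on very different scales: a reflectance band and NDVI
reference = np.array([[1200.0, 0.10], [1300.0, 0.30], [1100.0, 0.50], [1400.0, 0.70]])
training = np.array([[1250.0, 0.40], [1250.0, 0.90]])

mean, std = fit_standardiser(reference)
reference_scaled = standardise(reference, mean, std)
training_scaled = standardise(training, mean, std)
print(reference_scaled.mean(axis=0), reference_scaled.std(axis=0))
print(training_scaled)
```

Without this step, the band values in the thousands would dominate the Euclidean distances used by K-Means and NDVI would contribute almost nothing.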

After identifying the optimal clustering, we can apply the optimal K-Means model to our training features. Remember to apply feature standardisation before implementing the clustering. Here we assign the clustering labels to a new column `cluster`:

In [44]:
# standardise training features with the scaler fitted on the random samples
model_input=scaler.transform(td_single_class.iloc[:,1:-3])
# predict clustering labels
labels_kmeans = kmeans_model_optimal.predict(model_input)
# store the cluster label of each training point in a new column
td_single_class['cluster']=labels_kmeans

## Filtering training features

We now filter the training features/points based on cluster size. Here we assume that samples in clusters containing less than 5% of the overall sample size are likely to be misclassified or changed.    
>Note: Depending on your own training data the K-Means method may not work well, so it is recommended that you get to know your training points and test how the method performs, e.g. check whether it filters out the points you believe were misclassified while keeping good training samples. You should also try adjusting the cluster size threshold if it doesn't effectively remove false samples.

There are also other options for removing outliers that can be tested, e.g. see [here](https://scikit-learn.org/stable/modules/outlier_detection.html) for outlier detection with scikit-learn.
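
As one example of such an alternative, the sketch below flags outliers with scikit-learn's `IsolationForest` on synthetic features. This is not the FAO method used in this notebook; the data, the `contamination` value and the variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
# Hypothetical standardised features: 200 typical samples plus 10 far-away ones
typical = rng.normal(0.0, 1.0, size=(200, 4))
unusual = rng.normal(8.0, 1.0, size=(10, 4))
features = np.vstack([typical, unusual])

# contamination is the assumed share of outliers; it plays a role similar to
# the cluster-frequency threshold and needs the same kind of tuning
detector = IsolationForest(contamination=0.05, random_state=0)
flags = detector.fit_predict(features)  # 1 = inlier, -1 = outlier

features_filtered = features[flags == 1]
print("kept", len(features_filtered), "of", len(features))
```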

In [45]:
frequency_threshold=0.05 # threshold of cluster frequency
cluster_frequency=td_single_class['cluster'].map(td_single_class['cluster'].value_counts(normalize=True)) # calculate cluster frequencies for the training samples
td_single_class['cluster_frequency']=cluster_frequency # append as a column
td_single_class_filtered=td_single_class[td_single_class['cluster_frequency']>=frequency_threshold] # filter by cluster frequency
print('Number of training data after filtering: ',len(td_single_class_filtered))

Number of training data after filtering:  5170


You can compare the number of training points before and after the filtering to check whether, and how many, points were filtered out. To apply the clustering and filtering above to all classes, let's put the steps together and iterate through the classes. Here we append the filtered features for all classes into a single dataframe `training_features_filtered`:

In [46]:
training_features_filtered=None # filtered training data for all classes
scaler = StandardScaler() # initialise standard scaler
frequency_threshold=0.05 # threshold of cluster frequency
for i in lc_classes: # filtering training data for each class
    print('Processing class ',i)
    # subset random sample features for this class
    rand_features_single_class=rand_samples_features[rand_samples_features[class_attr]==i].reset_index(drop=True)
    # subset original training points for this class
    td_single_class=training_features[training_features[class_attr]==i].reset_index(drop=True)
    print('Number of training points for the class: ',len(td_single_class))
    # fit random samples
    scaler.fit(rand_features_single_class.iloc[:,1:-1])
    # transform random samples
    rand_features_single_class=scaler.transform(rand_features_single_class.iloc[:,1:-1])
    # find optimal clustering
    n_cluster_optimal,kmeans_model_optimal,labels_optimal=find_clusters_KMeans(rand_features_single_class,min_cluster=3,max_cluster=20)

    # standardise training features with the scaler fitted on the random samples
    model_input=scaler.transform(td_single_class.iloc[:,1:-3])
    # predict clustering labels
    labels_kmeans = kmeans_model_optimal.predict(model_input)
    # store the cluster label of each training point in a new column
    td_single_class['cluster']=labels_kmeans
    # append frequency of each cluster
    cluster_frequency=td_single_class['cluster'].map(td_single_class['cluster'].value_counts(normalize=True))
    td_single_class['cluster_frequency']=cluster_frequency
    # filter by cluster frequency
    td_single_class_filtered=td_single_class[td_single_class['cluster_frequency']>=frequency_threshold]
    print('Number of training data after filtering: ',len(td_single_class_filtered))
    
    # append the filtered training points of this class to final filtered training data
    if training_features_filtered is None:
        training_features_filtered=td_single_class_filtered
    else:
        training_features_filtered=pd.concat([training_features_filtered, td_single_class_filtered])

Processing class  61.0
Number of training points for the class:  5170
Calinski-Harabasz score for  3  clusters is:  394.3823057283652
Calinski-Harabasz score for  4  clusters is:  371.9323704020279
Calinski-Harabasz score for  5  clusters is:  339.7397807862018
Calinski-Harabasz score for  6  clusters is:  302.84279441348315
Calinski-Harabasz score for  7  clusters is:  277.03937453672455
Calinski-Harabasz score for  8  clusters is:  258.30860348479297
Calinski-Harabasz score for  9  clusters is:  238.42932537098477
Calinski-Harabasz score for  10  clusters is:  226.67365273953874
Calinski-Harabasz score for  11  clusters is:  214.43594280139683
Calinski-Harabasz score for  12  clusters is:  205.14060918943235
Calinski-Harabasz score for  13  clusters is:  194.3908295739775
Calinski-Harabasz score for  14  clusters is:  185.7143128599663
Calinski-Harabasz score for  15  clusters is:  180.60024245728974
Calinski-Harabasz score for  16  clusters is:  171.8730040917805
Calinski-Harabasz sc
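
To compare the number of training points before and after filtering for every class at once, a small summary table can help. The sketch below uses toy dataframes as stand-ins for `training_features` and `training_features_filtered`; the helper `filtering_summary` is hypothetical and not part of the notebook.

```python
import pandas as pd

def filtering_summary(before, after, class_attr):
    """Tabulate per-class point counts before and after filtering."""
    summary = pd.DataFrame({
        "before": before[class_attr].value_counts(),
        "after": after[class_attr].value_counts(),
    }).fillna(0).astype(int)  # classes removed entirely get a count of 0
    summary["removed"] = summary["before"] - summary["after"]
    summary["removed_pct"] = (100 * summary["removed"] / summary["before"]).round(1)
    return summary.sort_index()

# Toy stand-ins for training_features and training_features_filtered
before = pd.DataFrame({"LC_Class_I": [61] * 10 + [72] * 5})
after = pd.DataFrame({"LC_Class_I": [61] * 9 + [72] * 5})
summary = filtering_summary(before, after, "LC_Class_I")
print(summary)
```

Classes with an unexpectedly high `removed_pct` are good candidates for a visual check or a different threshold.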

## Export filtered training features
Once we've filtered the training signatures, we can write the filtered data to disk, which will allow us to import the data in the next step(s) of the workflow.

In [47]:
# export the filtered training data as geojson file
output_file = "Results/Mozambique_training_features_filtered.geojson"
training_features_filtered.to_file(output_file, driver="GeoJSON")