# Filter Training Data


## Background

Existing training data are often collected over a different time period from the study period, so the labels may not reflect the actual ground cover because of changes over time.

The Food and Agriculture Organization (FAO) adopted a training data filtering method for any given reference year that is within a time span (e.g. 5 years) of an existing baseline, and tested the method in the production of land cover mapping for Lesotho. The method assumes that the majority of reference labels remain valid from one year to the previous/next. Under this assumption, the reference labels that have changed are a minority and should be detectable with outlier detection methods such as K-Means clustering. More details on the method and how it works for Lesotho can be found in the published paper ([De Simone et al. 2022](https://www.mdpi.com/2072-4292/14/14/3294)).

## Description

This notebook will implement FAO's automatic filtering of a training dataset for a target year using points from a geojson or shapefile and a reference classification map of a previous year. The steps include:
1. Load extracted training features
2. Generate stratified random samples for each class on the reference land cover map using `random_sampling` and extract their features using `collect_training_data`
3. Train K-Means models using the extracted features of the random samples
4. Apply clustering on training features and remove minor clusters
5. Export the filtered training features to disk for use in subsequent scripts

To run this analysis, run all the cells in the notebook, starting with the "Load packages" cell.

### Load packages


In [1]:
%matplotlib inline
import warnings
import numpy as np
import geopandas as gpd
import pandas as pd
import xarray as xr
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn import metrics
from random_sampling import random_sampling # adapted from function by Chad Burton: https://gist.github.com/cbur24/04760d645aa123a3b1817b07786e7d9f



## Analysis parameters
* `training_features_path`: The path to the file containing training features we extracted through the previous module `0_Extract_Training_Features.ipynb`.
* `reference_map_path`: The path to the reference classification map, which will be used as a stratification layer to extract random samples for each class. In this example, we are using the existing map [Rwanda Land Cover 2015 Scheme II](http://geoportal.rcmrd.org/layers/servir%3Arwanda_landcover_2015_scheme_ii). **Note that the reference map pixel values should contain the class values existing in the training data.**
* `class_attr`: The name of the column in your shapefile/geojson attribute table that contains the class labels. **The class labels must be integers.**
* `output_crs`: Output spatial reference system.

In [2]:
training_features_path = 'Results/Rwanda_training_features_2021_from_2015_scheme_ii.geojson'
reference_map_path='Data/rwanda_landcover_2015_scheme_ii.tif'
class_attr = 'LC_Class_I' # class label in integer format
output_crs='epsg:32735' # WGS84/UTM Zone 35S

## Load input data

We now load the training features GeoJSON file using `geopandas`. The resulting dataframe should contain the column `class_attr` identifying class labels, and the bi-monthly geomedians of the nine spectral bands and NDVI that we extracted in the previous module. It also contains the coordinate and geometry columns.

In [3]:
training_features= gpd.read_file(training_features_path) # Load training features
training_features.head() # Display first five rows

Unnamed: 0,LC_Class_I,blue_0,blue_1,blue_2,blue_3,blue_4,blue_5,green_0,green_1,green_2,...,swir_2_5,NDVI_0,NDVI_1,NDVI_2,NDVI_3,NDVI_4,NDVI_5,x_coord,y_coord,geometry
0,12.0,746.544312,604.5,531.854614,579.425171,779.834961,735.999939,1008.881714,870.0,825.942627,...,954.000122,0.470825,0.540201,0.455023,0.473951,0.516605,0.368369,842405.0,9771525.0,POINT (842405.000 9771525.000)
1,12.0,359.753418,609.440247,528.122253,427.805145,684.916138,511.0,494.839539,726.885864,684.505676,...,732.0,0.0944,0.297828,0.284677,0.209838,0.157299,0.229703,718565.0,9735295.0,POINT (718565.000 9735295.000)
2,12.0,842.906006,3859.5,808.999939,1100.436279,921.108398,1047.573853,1650.38855,3733.0,1483.999878,...,178.300995,-0.110824,-0.150946,-0.427848,-0.273258,-0.25777,-0.124184,793185.0,9779405.0,POINT (793185.000 9779405.000)
3,12.0,603.348511,2966.0,693.0,831.999695,661.981812,666.0,927.474609,4120.0,989.0,...,932.999817,0.25977,0.001298,0.263473,0.162238,0.508861,0.347551,796505.0,9807125.0,POINT (796505.000 9807125.000)
4,12.0,1164.430542,752.180908,397.646027,455.440857,861.179382,606.5,1218.076782,966.766418,613.811584,...,1118.0,0.518148,0.542375,0.745179,0.720775,0.577472,0.580907,847425.0,9772665.0,POINT (847425.000 9772665.000)


Using the `class_attr` column we can get the class values, which we will use later to process the data class by class. For these extracted training points, the class names corresponding to the class values are: 1: Forest, 5: Grassland, 7: Shrubland, 9: Perennial Cropland, 10: Annual Cropland, 11: Wetland, 12: Water Body, 13: Urban Settlement, 14: Bare Land.
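
A lookup table can keep the class names next to the class values when printing per-class results. The following is a minimal sketch; `class_names` and `class_name` are illustrative names that the notebook itself does not define. The cast to `int` is there because the class values are stored as floats (e.g. `12.0`) in the loaded table.

```python
# Class values and names as listed above; `class_names` is an illustrative
# name that the notebook itself does not define.
class_names = {
    1: 'Forest', 5: 'Grassland', 7: 'Shrubland', 9: 'Perennial Cropland',
    10: 'Annual Cropland', 11: 'Wetland', 12: 'Water Body',
    13: 'Urban Settlement', 14: 'Bare Land',
}

def class_name(value):
    """Return the class name for a class value stored as int or float."""
    return class_names.get(int(value), 'Unknown')

print(class_name(12.0))
```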

In [4]:
lc_classes=training_features[class_attr].unique() # get class labels
print('land cover classes:\n',lc_classes)

land cover classes:
 [12.  1.  5.  7.  9. 10. 11. 13. 14.]


The training data filtering method also requires a reference land cover map as a stratification layer to generate random training samples, which will be used to train the K-Means models, so we now load the reference map:

In [5]:
# load reference classification map
reference_map = xr.open_dataset(reference_map_path,engine="rasterio").astype(np.uint8)
reference_map=reference_map.to_array().squeeze()
print('Reference land cover classification raster:\n',reference_map)

Reference land cover classification raster:
 <xarray.DataArray (y: 6992, x: 7697)>
array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=uint8)
Coordinates:
    band         int64 1
  * x            (x) float64 28.84 28.84 28.84 28.84 ... 30.92 30.92 30.92 30.92
  * y            (y) float64 -1.018 -1.018 -1.018 -1.018 ... -2.91 -2.91 -2.91
    spatial_ref  int64 ...
    variable     <U9 'band_data'
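
The analysis parameters require the reference map to contain the class values found in the training data. A quick check along those lines might look like the sketch below, which uses small toy arrays in place of the notebook's `reference_map` and `lc_classes`; the names `toy_reference_map`, `toy_lc_classes` and `missing` are assumptions made for illustration.

```python
import numpy as np

# Toy stand-ins for the notebook's `reference_map` and `lc_classes`
toy_reference_map = np.array([[0, 1, 1, 5],
                              [7, 7, 9, 0],
                              [12, 12, 5, 1]], dtype=np.uint8)
toy_lc_classes = np.array([12., 1., 5., 7., 9., 10.])

# Unique pixel values present on the map
map_values = np.unique(toy_reference_map)
# Training classes that never occur on the map cannot be sampled from it
missing = [int(c) for c in toy_lc_classes if int(c) not in map_values]
print('Classes missing from the reference map:', missing)
```

With the real data, the same check could presumably be run on `reference_map.values` and `lc_classes`.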


## Generate random samples
In many cases the training data may not contain enough samples for some classes to train the K-Means models reliably. Therefore, we generate randomly distributed samples for each class from the reference classification map using the `random_sampling` function. This function takes a few parameters:  
* `n`: total number of points to sample
* `sampling`: the sampling strategy, e.g. 'stratified_random' where each class has a number of points proportional to its relative area, 'equal_stratified_random' where each class has the same number of points, or 'manual' which allows you to define the number of samples for each class.
* `out_fname`: a file path to which the function exports the sampling points as a shapefile/geojson. You can set this to `None` if you don't need to output the file.
* `class_attr`: the name of the column in the output dataframe that contains the integer class values from the classified map.

The output of the function is a geopandas dataframe of randomly distributed points containing a column `class_attr` identifying class values. As the Rwanda Land Cover 2015 Scheme II map contains more classes than the training data, we manually define the number of samples for each class to make sure we only sample the classes we want. For a quick demonstration, let's sample 100 pixels for a few classes present in a small sample area of the map. In your own project, however, you need to sample all classes across your study area so that the samples are representative of the classes, which may take some time and require a large amount of memory.
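
To make the difference between the sampling strategies concrete, the sketch below allocates a total of `n` points to classes either in proportion to area or in equal shares. `allocate_samples` is a hypothetical helper written for illustration only; it is not the code inside `random_sampling`, and the pixel counts are made up.

```python
def allocate_samples(pixel_counts, n, sampling):
    """Illustrative allocation of n sample points to classes.

    pixel_counts: dict mapping class value -> number of pixels on the map.
    Hypothetical helper, not the implementation used by `random_sampling`.
    """
    if sampling == 'equal_stratified_random':
        # every class receives the same share of n
        return {c: n // len(pixel_counts) for c in pixel_counts}
    if sampling == 'stratified_random':
        # each class receives a share proportional to its area
        total = sum(pixel_counts.values())
        return {c: round(n * count / total) for c, count in pixel_counts.items()}
    raise ValueError('unknown sampling strategy: %s' % sampling)

toy_counts = {7: 6000, 9: 3000, 12: 1000}  # made-up pixel counts per class
print(allocate_samples(toy_counts, 900, 'stratified_random'))
print(allocate_samples(toy_counts, 900, 'equal_stratified_random'))
```

Under proportional allocation a rare class receives few points, which is one reason to prefer equal or manual allocation when the samples are used to train a separate model per class.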

In [6]:
# only sample a few of the classes over a small example area
gpd_random_samples=random_sampling(reference_map[2500:3000,5500:6000],n=900,sampling='manual',
                                   manual_class_ratios={'5':100,'7':100,'9':100,'10':100,'11':100,'12':100,'13':100},
                                   out_fname=None,class_attr=class_attr,drop_value=0)

Requested more sample points than population of pixels for class 5, skipping
Class 7: sampled at 100 coordinates
Class 9: sampled at 100 coordinates
Class 10: sampled at 100 coordinates
Class 11: sampled at 100 coordinates
Class 12: sampled at 100 coordinates
Class 13: sampled at 100 coordinates


For this workshop we have generated for you ~1000 samples for each class across Rwanda, i.e. a total of 9000 random samples. The points are stored in the file 'Results/Rwanda_random_samples_all.geojson'.

## Extract features
With the random sample points available, we now need to extract features to train the K-Means models. Because the clustering will be applied to the training features extracted in the previous module `0_Extract_Training_Features.ipynb`, the random samples need the same features, i.e. the bi-monthly geomedians of the nine spectral bands and NDVI, and we can re-use the query and feature layer function from that notebook to extract them. As feature extraction was demonstrated in the previous module, we skip it here and use a prepared file of extracted features for the random samples:

In [7]:
rand_samples_features_path='Results/Rwanda_random_samples_features_all.geojson'
rand_samples_features=gpd.read_file(rand_samples_features_path)
rand_samples_features.head()

Unnamed: 0,LC_Class_I,blue_0,blue_1,blue_2,blue_3,blue_4,blue_5,green_0,green_1,green_2,...,swir_2_5,NDVI_0,NDVI_1,NDVI_2,NDVI_3,NDVI_4,NDVI_5,x_coord,y_coord,geometry
0,14.0,633.110474,1163.5,605.942688,704.407593,1000.478943,475.0,989.714661,1472.0,917.137451,...,1464.0,0.495518,0.497482,0.601012,0.491048,0.353007,0.647586,798045.0,9745385.0,POINT (798045.000 9745385.000)
1,14.0,1223.553223,1019.0,780.588013,758.294128,1097.813843,1052.5,1588.207031,1538.0,1200.09668,...,1753.0,0.397725,0.428166,0.545747,0.44836,0.378705,0.437038,793505.0,9750515.0,POINT (793505.000 9750515.000)
2,14.0,1185.094238,694.001404,925.010498,1078.967896,1150.344116,904.697632,1084.674072,1080.000732,1238.90271,...,1993.535034,0.343229,0.484296,0.302771,0.229523,0.26669,0.354119,812585.0,9726895.0,POINT (812585.000 9726895.000)
3,14.0,595.367004,744.999939,683.0,771.706726,779.988953,1163.5,749.723572,1007.999817,846.0,...,2057.0,0.498805,0.516304,0.532819,0.533361,0.487032,0.432646,776715.0,9731055.0,POINT (776715.000 9731055.000)
4,14.0,740.280273,511.609985,1122.128906,782.549683,647.0,527.0,882.693054,839.46698,1332.434326,...,1177.0,0.527217,0.526064,0.509427,0.283039,0.367705,0.664596,770775.0,9724155.0,POINT (770775.000 9724155.000)


## K-Means clustering
Now that we have the features of the random samples and the training points, we can use them to train and apply a K-Means model for each class. K-Means requires a pre-defined number of clusters, which is unknown in many cases. One way to identify the optimal number of clusters is the Calinski-Harabasz Index. The index is the ratio of the between-cluster dispersion to the within-cluster dispersion, summed over all clusters, and it is higher when clusters are dense and well separated. More information about the index can be found [here](https://scikit-learn.org/stable/modules/clustering.html#calinski-harabasz-index). In this example we calculate the index for clusterings with a varying number of clusters (e.g. 2 to 30) and retain the clustering with the highest index.  
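
To show what the index measures, the sketch below computes it by hand with NumPy on a toy dataset. It follows the usual definition, in which the between-cluster and within-cluster dispersions are divided by `k - 1` and `n - k` respectively; the function name `calinski_harabasz` and the toy data are illustrative, and the notebook itself relies on `metrics.calinski_harabasz_score` from scikit-learn.

```python
import numpy as np

def calinski_harabasz(data, labels):
    """Ratio of between-cluster to within-cluster dispersion, each divided
    by its degrees of freedom (k - 1 and n - k)."""
    n_samples = data.shape[0]
    clusters = np.unique(labels)
    k = len(clusters)
    overall_mean = data.mean(axis=0)
    between, within = 0.0, 0.0
    for c in clusters:
        members = data[labels == c]
        centre = members.mean(axis=0)
        # dispersion of the cluster centre around the overall mean
        between += len(members) * np.sum((centre - overall_mean) ** 2)
        # dispersion of the members around their own centre
        within += np.sum((members - centre) ** 2)
    return (between / (k - 1)) / (within / (n_samples - k))

# Two tight, well-separated toy groups
data = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                 [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
good_labels = np.array([0, 0, 0, 1, 1, 1])  # matches the two groups
poor_labels = np.array([0, 1, 0, 1, 0, 1])  # mixes the two groups
print(calinski_harabasz(data, good_labels))
print(calinski_harabasz(data, poor_labels))
```

The labelling that matches the two groups scores far higher than the one that mixes them, which is the behaviour the cluster-number search relies on.
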
> Note: You can also use other indices to assess the clustering and choose the optimal number of clusters, see information on other indices [here](https://scikit-learn.org/stable/modules/clustering.html#clustering-performance-evaluation). Depending on the distribution of your features, different indices may lead to different optimal cluster numbers. The range of cluster numbers you test is also likely to affect which clustering is selected as optimal.

Here we wrap the procedure for identifying the optimal clustering in a function. Its inputs are the features and the minimum and maximum number of clusters; its outputs are the optimal number of clusters, the trained K-Means model and the corresponding clustering labels:

In [8]:
def find_clusters_KMeans(data,min_cluster=2,max_cluster=10):
    """Fit K-Means for every cluster number from min_cluster to max_cluster
    and return the number of clusters, fitted model and labels with the
    highest Calinski-Harabasz score."""
    highest_score=-np.inf # any valid score will be higher
    n_cluster_optimal=min_cluster
    kmeans_model_optimal=None # initialise optimal model parameters
    labels_optimal=None
    for n_cluster in range(min_cluster,max_cluster+1):
        kmeans_model = KMeans(n_clusters=n_cluster, random_state=1).fit(data)
        labels=kmeans_model.predict(data)
        score=metrics.calinski_harabasz_score(data, labels)
        print('Calinski-Harabasz score for ',n_cluster,' clusters is: ',score)
        if score>highest_score: # keep the best clustering found so far
            highest_score=score
            n_cluster_optimal=n_cluster
            kmeans_model_optimal=kmeans_model
            labels_optimal=labels
    if min_cluster != max_cluster: # only report a best number when a range was searched
        print('Best number of clusters: %s'%(n_cluster_optimal))
    return n_cluster_optimal,kmeans_model_optimal,labels_optimal

Using the above function, we now cluster the training features for the first class as an example. We first subset the random sample features and the training features for this class:

In [9]:
# get class label
i=lc_classes[0]
# subset random sample features for this class
rand_features_single_class=rand_samples_features[rand_samples_features[class_attr]==i].reset_index(drop=True)
# subset original training points for this class
td_single_class=training_features[training_features[class_attr]==i].reset_index(drop=True)
print('Number of training points for the class: ',len(td_single_class))

Number of training points for the class:  1000


We then apply the `find_clusters_KMeans` function to the random sample features to find the optimal clustering. K-Means is sensitive to feature scales, so we need to standardise all features before applying the model. Here we use scikit-learn's `StandardScaler` for the standardisation. Remember to drop the coordinate and geometry columns from the features before clustering.
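
Selecting the feature columns by position, as `.iloc[:,1:-3]` does, depends on the column order of the file. A name-based selection is one possible alternative. The sketch below uses a toy table and assumes that the class label is the first column and that the last three columns are `x_coord`, `y_coord` and `geometry`; the names `toy`, `non_feature_cols` and `feature_cols` are illustrative.

```python
import pandas as pd

# Toy table with an assumed column layout similar to the training features
toy = pd.DataFrame({
    'LC_Class_I': [12.0, 12.0],
    'blue_0': [746.5, 359.8],
    'NDVI_5': [0.37, 0.23],
    'x_coord': [842405.0, 718565.0],
    'y_coord': [9771525.0, 9735295.0],
    'geometry': ['POINT (842405 9771525)', 'POINT (718565 9735295)'],
})

class_attr = 'LC_Class_I'
non_feature_cols = [class_attr, 'x_coord', 'y_coord', 'geometry']
# keep every column that is not a label, coordinate or geometry
feature_cols = [c for c in toy.columns if c not in non_feature_cols]
print(feature_cols)
# for this layout the position-based selection picks the same columns
print(list(toy.iloc[:, 1:-3].columns))
```

Passing `toy[feature_cols]` to the scaler would then be unaffected by extra or reordered columns.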

In [None]:
# initialise standard scaler
scaler = StandardScaler()
# fit random samples
scaler.fit(rand_features_single_class.iloc[:,1:-3])
# transform random samples
rand_features_single_class=scaler.transform(rand_features_single_class.iloc[:,1:-3])
# find optimal clustering
n_cluster_optimal,kmeans_model_optimal,labels_optimal=find_clusters_KMeans(rand_features_single_class,min_cluster=2,max_cluster=30)

Calinski-Harabasz score for  2  clusters is:  698.2104051259678
Calinski-Harabasz score for  3  clusters is:  561.6686985763257
Calinski-Harabasz score for  4  clusters is:  461.54796701023037
Calinski-Harabasz score for  5  clusters is:  401.0501212495297
Calinski-Harabasz score for  6  clusters is:  363.2355331567897
Calinski-Harabasz score for  7  clusters is:  328.64855541584444
Calinski-Harabasz score for  8  clusters is:  309.48400225669167
Calinski-Harabasz score for  9  clusters is:  291.9934111225683
Calinski-Harabasz score for  10  clusters is:  285.6206540543185
Calinski-Harabasz score for  11  clusters is:  271.6957968754068
Calinski-Harabasz score for  12  clusters is:  258.32793718607684
Calinski-Harabasz score for  13  clusters is:  253.74123662517127
Calinski-Harabasz score for  14  clusters is:  244.19589899008167
Calinski-Harabasz score for  15  clusters is:  236.89813821886023
Calinski-Harabasz score for  16  clusters is:  229.60494250577455
Calinski-Harabasz score f

Additionally, we can set the same value for `min_cluster` and `max_cluster` (e.g. `min_cluster=2, max_cluster=2`). This will return the clustering score for that fixed number of clusters, which in this example is 2. 
> Note: This might not be the optimal clustering, as the score is only calculated for a single number of clusters.

In [None]:
# find clustering score for single cluster
n_cluster_optimal,kmeans_model_optimal,labels_optimal=find_clusters_KMeans(rand_features_single_class,min_cluster=2,max_cluster=2)

After identifying the optimal clustering, we can apply the optimal K-Means model to our training features. Remember to apply feature standardisation before implementing the clustering. Here we assign the clustering labels to a new column `cluster`:

In [None]:
# normalisation before clustering
model_input=scaler.transform(td_single_class.iloc[:,1:-3])
# predict clustering labels
labels_kmeans = kmeans_model_optimal.predict(model_input)
# append clustering results to pixel coordinates
td_single_class['cluster']=labels_kmeans

## Filtering training features

We now filter the training features/points based on cluster size. Here we assume that clusters containing less than 10% of the samples of a class are likely to consist of misclassified or changed samples.    
>Note: Depending on your own training data, the K-Means method may not work well, so it is recommended that you get to know your training points and test how the method performs on them, e.g. check whether it filters out the points you believe were misclassified while keeping good training samples. You should also try adjusting the cluster size threshold if it doesn't effectively remove false samples.

There are also other options for removing outliers that you can test, e.g. see [here](https://scikit-learn.org/stable/modules/outlier_detection.html) for outlier detection with scikit-learn.
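
As one example of such an alternative, the sketch below uses scikit-learn's `LocalOutlierFactor` on toy data. This is a different technique from the K-Means filtering used in this notebook and is shown only as an illustration; the toy features and the choice of `n_neighbors=20` are assumptions, not tested settings for land cover data.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
# 50 toy feature vectors forming one dense group, plus one distant sample
features = np.vstack([rng.normal(0.0, 1.0, size=(50, 2)), [[10.0, 10.0]]])

# fit_predict returns 1 for inliers and -1 for outliers
lof = LocalOutlierFactor(n_neighbors=20)
flags = lof.fit_predict(features)
print('Distant sample flagged as outlier:', flags[-1] == -1)

# keep only the samples flagged as inliers
filtered = features[flags == 1]
print(len(features), '->', len(filtered))
```

As with K-Means, the features would normally be standardised first, and the result should be checked against points you know to be good or bad.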

In [None]:
frequency_threshold=0.1 # threshold of cluster frequency
cluster_frequency=td_single_class['cluster'].map(td_single_class['cluster'].value_counts(normalize=True)) # calculate cluster frequencies for the training samples
td_single_class['cluster_frequency']=cluster_frequency # append as a column
td_single_class_filtered=td_single_class[td_single_class['cluster_frequency']>=frequency_threshold] # filter by cluster frequency
print('Number of training data after filtering: ',len(td_single_class_filtered))
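
The frequency calculation can be followed on a small toy example. In the sketch below, 20 made-up samples fall into three clusters, and the cluster that holds only one sample (5%) is removed by the 10% threshold; the dataframe `toy` and the names `shares` and `kept` are illustrative.

```python
import pandas as pd

# 20 toy samples: cluster 0 has 12, cluster 1 has 7, cluster 2 has 1
toy = pd.DataFrame({'cluster': [0] * 12 + [1] * 7 + [2] * 1})

# share of samples that falls in each cluster
shares = toy['cluster'].value_counts(normalize=True)
# look up the share of the cluster each sample belongs to
toy['cluster_frequency'] = toy['cluster'].map(shares)
# keep samples whose cluster holds at least 10% of all samples
kept = toy[toy['cluster_frequency'] >= 0.1]

print(shares.to_dict())
print(len(toy), '->', len(kept))
```

Because the comparison is `>=`, a cluster that holds exactly 10% of the samples is kept.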

You can compare the number of training points before and after the filtering to check whether, and how many, points were filtered out. To apply the above clustering and filtering to all classes, let's put the steps together and iterate through the classes. Here we append the filtered features for all classes to a single dataframe `training_features_filtered`:

In [None]:
training_features_filtered=None # filtered training data for all classes
scaler = StandardScaler() # initialise standard scaler
for i in lc_classes: # filtering training data for each class
    #i=1 # test for first class
    print('Processing class ',i)
    # subset random sample features for this class
    rand_features_single_class=rand_samples_features[rand_samples_features[class_attr]==i].reset_index(drop=True)
    # subset original training points for this class
    td_single_class=training_features[training_features[class_attr]==i].reset_index(drop=True)
    print('Number of training points for the class: ',len(td_single_class))
    # fit random samples
    scaler.fit(rand_features_single_class.iloc[:,1:-3])
    # transform random samples
    rand_features_single_class=scaler.transform(rand_features_single_class.iloc[:,1:-3])
    # find optimal clustering
    n_cluster_optimal,kmeans_model_optimal,labels_optimal=find_clusters_KMeans(rand_features_single_class,min_cluster=2,max_cluster=10)

    # normalisation before clustering
    model_input=scaler.transform(td_single_class.iloc[:,1:-3])
    # predict clustering labels
    labels_kmeans = kmeans_model_optimal.predict(model_input)
    # append clustering results to pixel coordinates
    td_single_class['cluster']=labels_kmeans
    # append frequency of each cluster
    cluster_frequency=td_single_class['cluster'].map(td_single_class['cluster'].value_counts(normalize=True))
    td_single_class['cluster_frequency']=cluster_frequency
    # filter by cluster frequency
    td_single_class_filtered=td_single_class[td_single_class['cluster_frequency']>=frequency_threshold]
    print('Number of training data after filtering: ',len(td_single_class_filtered))
    
    # append the filtered training points of this class to final filtered training data
    if training_features_filtered is None:
        training_features_filtered=td_single_class_filtered
    else:
        training_features_filtered=pd.concat([training_features_filtered, td_single_class_filtered])
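
To compare the number of points per class before and after filtering in one table, something like the following sketch could be used. It builds two small toy dataframes in place of `training_features` and `training_features_filtered`; the names `before`, `after` and `summary` are illustrative.

```python
import pandas as pd

class_attr = 'LC_Class_I'
# Toy stand-ins for `training_features` and `training_features_filtered`
before = pd.DataFrame({class_attr: [1.0] * 5 + [12.0] * 4})
after = pd.DataFrame({class_attr: [1.0] * 4 + [12.0] * 4})

# count points per class and align the two counts on the class value
summary = pd.DataFrame({
    'before': before[class_attr].value_counts(),
    'after': after[class_attr].value_counts(),
}).fillna(0).astype(int)  # a class with no points left would otherwise be NaN
summary['removed'] = summary['before'] - summary['after']
print(summary)
```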

## Export filtered training features
Once we've filtered the training features, we can write the filtered data to disk, which will allow us to import the data in the next step(s) of the workflow.

In [None]:
# export the filtered training data as geojson file
output_file = "Results/Rwanda_training_features_2021_filtered.geojson"
training_features_filtered.to_file(output_file, driver="GeoJSON")