# Filter Training Data


## Background

Existing training data are often collected in a different time period from the study period, so the dataset may not reflect the real ground cover because of temporal changes. FAO adopted a training data filtering method for any reference year that lies within a given time span (e.g. 5 years) of an existing baseline, and tested the method in the production of land cover maps for Lesotho. The method assumes that the majority of reference labels remain valid from one year to the previous/next. Under this assumption, the reference labels that have changed are the minority and should be detectable with outlier detection methods such as K-Means clustering. More details on the method and how it performs for Lesotho can be found in the published paper ([De Simone et al 2022](https://www.mdpi.com/2072-4292/14/14/3294)).

## Description

This notebook implements FAO's automatic filtering of a training dataset for a target year, using points from a geojson or shapefile and a reference classification map of a previous year. The steps are:
1. Load extracted training features
2. Generate stratified random samples for each class and extract their features using `random_sampling` and `collect_training_data`
3. Train K-Means models using the features of the random samples
4. Apply clustering to the training features and remove minor clusters, i.e. clusters smaller than 5% of the overall sample size
5. Export the filtered training data to disk for use in subsequent scripts
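
Steps 3 and 4 can be illustrated with a minimal sketch on synthetic data. The two-dimensional features, the cluster centres and the fixed number of clusters below are assumptions made for illustration only; the notebook itself uses the extracted spectral features and selects the number of clusters automatically.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

def blob(center, n):
    """Draw n synthetic 2-D feature vectors around a centre."""
    return rng.normal(loc=center, scale=0.5, size=(n, 2))

# Synthetic random samples for one class: two dominant modes and one rare mode (3%)
reference = np.vstack([blob((0, 0), 600), blob((5, 5), 370), blob((20, 20), 30)])
# Synthetic training points for the same class; 10 of them resemble the rare mode
training = np.vstack([blob((0, 0), 50), blob((5, 5), 40), blob((20, 20), 10)])

# Step 3: standardise, then fit K-Means on the random samples
scaler = StandardScaler().fit(reference)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=1).fit(scaler.transform(reference))

# Step 4: cluster frequencies are computed from the random samples ...
cluster_frequency = pd.Series(kmeans.labels_).value_counts(normalize=True)

# ... and looked up for the cluster each training point falls into
training_df = pd.DataFrame(training, columns=["feature_0", "feature_1"])
training_df["cluster"] = kmeans.predict(scaler.transform(training))
training_df["cluster_frequency"] = training_df["cluster"].map(cluster_frequency)

# Keep only training points that belong to clusters above the 5% threshold
frequency_threshold = 0.05
filtered = training_df[training_df["cluster_frequency"] >= frequency_threshold]
print(len(training_df), "->", len(filtered))
```

Training points that fall into a cluster holding less than 5% of the random samples are treated as likely label changes and dropped.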

To run this analysis, run all the cells in the notebook, starting with the "Load packages" cell.

### Load packages


In [None]:
%matplotlib inline
import os
import datacube
import warnings
import numpy as np
import geopandas as gpd
import pandas as pd
import matplotlib.pyplot as plt
import xarray as xr
import rioxarray
from odc.io.cgroups import get_cpu_quota
from odc.algo import xr_geomedian
from deafrica_tools.datahandling import load_ard
from deafrica_tools.bandindices import calculate_indices
from deafrica_tools.classification import collect_training_data
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn import metrics
from rasterio.enums import Resampling
from random_sampling import random_sampling # adapted from function by Chad Burton: https://gist.github.com/cbur24/04760d645aa123a3b1817b07786e7d9f

## Analysis parameters
* `training_features_path`: The path to the training features file extracted in the previous module, `0_Extract_Training_Features.ipynb`.
* `reference_map_path`: The path to the reference classification map, which will be used as a stratification layer to extract random samples for each class. In this example, we use the existing national land cover map for 2016. 
* `class_attr`: The name of the column in your shapefile attribute table that contains the class labels. **The class labels must be integers.**
* `output_crs`: Output spatial reference system.
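
If you are unsure whether your class labels are stored as integers, a quick check along the following lines may help. This is only a sketch: the small table and its text-encoded labels are hypothetical, and the column name simply reuses the `class_attr` value from this example.

```python
import pandas as pd
from pandas.api.types import is_integer_dtype

class_attr = 'LC_Class_I'
# Hypothetical attribute table in which the labels were stored as text
table = pd.DataFrame({class_attr: ['1', '2', '2', '5']})

if not is_integer_dtype(table[class_attr]):
    # errors='raise' stops on labels that cannot be read as numbers
    table[class_attr] = pd.to_numeric(table[class_attr], errors='raise').astype('int64')

print(table[class_attr].dtype)
```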

In [None]:
training_features_path = 'Data/training_features.txt'
reference_map_path='Data/reference_classification_map.tif'
class_attr = 'LC_Class_I' # class label in integer format
output_crs='epsg:32735' # WGS84/UTM Zone 35S

## Load input data

We now load the training features .txt file using `pandas`. The dataframe should contain the column `class_attr` identifying the class labels, plus the bi-monthly geomedians of the nine spectral bands and NDVI that we extracted in the previous step.

In [None]:
# Load training features
training_features= pd.read_csv(training_features_path)
# Show first five rows
training_features.head()

Using the `class_attr` column we can get the class values, which we will use later to process the data class by class.

In [None]:
lc_classes=training_features[class_attr].unique() # get class labels
print('land cover classes:\n',lc_classes)

We also retrieve the features from the column names:

In [None]:
measurements=list(training_features)[1:]

The training data filtering method also requires a reference land cover map as a stratification layer, from which random samples are generated to train the K-Means models. We now load the reference map.

In [None]:
# load reference classification map
rf_2017_raster = xr.open_dataset(reference_map_path,engine="rasterio").astype(np.uint8).squeeze("band", drop=True)
# # reproject the raster
# rf_2017_raster= rf_2017_raster.rio.reproject(resolution=10, dst_crs=crs,resampling=Resampling.nearest)
rf_2017_raster=rf_2017_raster.band_data
print('Reference land cover classification raster:\n',rf_2017_raster) # note: 255 is nodata

## Generating random samples
We would like to cluster the training features, but for some classes there might not be enough training samples to make the K-Means clustering statistically reliable. We therefore generate random point samples for each class from the reference classification map using the `random_sampling` function. This function takes a few parameters:  
* `n`: total number of points to sample
* `da`: a classified map as a 2-dimensional xarray.DataArray
* `sampling`: the sampling strategy, e.g. 'stratified_random', where each class has a number of points proportional to its relative area, or 'equal_stratified_random' where each class has the same number of points.
* `out_fname`: a filepath name for the function to export a shapefile/geojson of the sampling points to file. You can set this to `None` if you don't need to output the file.  

The `random_sampling` function adds a `class` attribute to the output points, which identifies the class value of each sample. In this example we generate ~1000 samples per class to train the K-Means models; with eight classes, this means setting n=8000 and using the equal stratified random sampling strategy. We also write the samples to a geojson file. To exclude pixel values other than the valid land cover classes (e.g. 0, 255) from the sampling, we first set them to NaN.
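
To make the difference between the two strategies concrete, here is a minimal NumPy sketch of how points could be allocated and drawn. It is not the `random_sampling` implementation: the helper names `allocate` and `sample_pixels`, the synthetic class map and the class proportions are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical 100 x 100 class map: class 1 is most common, class 3 is rare; 255 is nodata
class_map = rng.choice([1, 2, 3], size=(100, 100), p=[0.7, 0.2, 0.1]).astype(np.uint8)
class_map[:5, :] = 255

def allocate(class_map, n, sampling, nodata=(0, 255)):
    """Return {class: number of points} for a total of roughly n points."""
    valid = class_map[~np.isin(class_map, nodata)]
    classes, counts = np.unique(valid, return_counts=True)
    if sampling == "equal_stratified_random":
        per_class = np.full(len(classes), n // len(classes))
    elif sampling == "stratified_random":
        per_class = np.round(n * counts / counts.sum()).astype(int)
    else:
        raise ValueError(f"unknown sampling strategy: {sampling}")
    return dict(zip(classes.tolist(), per_class.tolist()))

def sample_pixels(class_map, allocation):
    """Draw random (row, col) positions without replacement for each class."""
    samples = {}
    for cls, n_cls in allocation.items():
        rows, cols = np.nonzero(class_map == cls)
        picked = rng.choice(len(rows), size=min(n_cls, len(rows)), replace=False)
        samples[cls] = np.column_stack([rows[picked], cols[picked]])
    return samples

equal = allocate(class_map, 300, "equal_stratified_random")
proportional = allocate(class_map, 300, "stratified_random")
points = sample_pixels(class_map, equal)
print("equal:", equal)
print("proportional:", proportional)
```

Equal allocation gives rare classes as many samples as common ones, which is what we want when each class gets its own K-Means model.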

In [None]:
n=8000
random_samples_path='Results/random_samples.geojson'
da=rf_2017_raster.where((rf_2017_raster!=0)&(rf_2017_raster!=255),np.nan)
gpd_random_samples=random_sampling(da,n,sampling='equal_stratified_random',out_fname=random_samples_path)

## Extract features for the samples

With the random sample points available, we can then extract features to train the K-Means models. As we will apply clustering on the training features we extracted in the previous module, we should extract exactly the same features, i.e. bi-monthly geomedian of the nine spectral bands and NDVI as we did in the previous module. Simply re-use the same query and feature layer function in the previous module:

In [None]:
#set up our inputs to collect_training_data
zonal_stats = None
# Set up the inputs for the ODC query
time = ('2021')
# using all spectral bands with 10~20 m spatial resolution
measurements = ['blue','green','red','red_edge_1','red_edge_2', 'red_edge_3','nir_1','swir_1','swir_2']
resolution = (-10,10)

# define query
query = {
    'time': time,
    'measurements': measurements,
    'output_crs': output_crs,
    'resolution': resolution
}

# define a function to generate the feature layers
def feature_layers(query): 
    # connect to the datacube so we can access DE Africa data
    dc = datacube.Datacube(app='feature_layers')
    
    # load Sentinel-2 analysis ready data
    ds = load_ard(dc=dc,
                  products=['s2_l2a'],
                  group_by='solar_day',
                  verbose=False,
                  **query)
    
    # calculate NDVI
    ds = calculate_indices(ds,
                           index=['NDVI'],
                           drop=False,
                           satellite_mission='s2')
    
    # interpolate nodata using mean of previous and next observation
#     ds=ds.interpolate_na(dim='time',method='linear',use_coordinate=False,fill_value='extrapolate')
#     ds=ds.interpolate_na(dim='time',method='linear',use_coordinate=False)

    # calculate bi-monthly geomedian
    ds=ds.resample(time='2MS').map(xr_geomedian)
    
    # stack multi-temporal measurements and rename them
    n_time=ds.sizes['time']
    list_measurements=list(ds.keys())
    list_stack_measures=[]
    for j in range(len(list_measurements)):
        for k in range(n_time):
            variable_name=list_measurements[j]+'_'+str(k)
            measure_single=ds[list_measurements[j]].isel(time=k).rename(variable_name)
            list_stack_measures.append(measure_single)
    ds_stacked=xr.merge(list_stack_measures,compat='override')
    return ds_stacked
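
As a rough illustration of what the function above produces, the sketch below uses `pandas` to show how a `'2MS'` resampling splits one year into two-month bins, and how the naming loop turns ten measurements into stacked feature names. The acquisition dates are hypothetical; the real number of bins depends on the observations returned by the query.

```python
import pandas as pd

# Hypothetical acquisition dates: one observation every 5 days during 2021
dates = pd.date_range("2021-01-03", "2021-12-29", freq="5D")
observations = pd.Series(1, index=dates)

# '2MS' groups observations into two-month bins labelled by the first day of the bin
bins = observations.resample("2MS").count()
print(bins.index.strftime("%Y-%m-%d").tolist())

# Mirror the naming loop in feature_layers: <measurement>_<time index>
measurements = ['blue', 'green', 'red', 'red_edge_1', 'red_edge_2', 'red_edge_3',
                'nir_1', 'swir_1', 'swir_2', 'NDVI']
feature_names = [f"{name}_{k}" for name in measurements for k in range(len(bins))]
print(len(feature_names), feature_names[:3], feature_names[-1])
```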

Now we can extract the training features for the random sample points:

In [None]:
# detect the number of CPUs
ncpus=round(get_cpu_quota())
print('ncpus = '+str(ncpus))

# collect training data
column_names, model_input = collect_training_data(
    gdf=gpd_random_samples[0:10], # replace with gdf=gpd_random_samples to extract features for all the random samples
    dc_query=query,
    ncpus=ncpus,
    field='class',
    zonal_stats=None,
    feature_func=feature_layers,
    return_coords=False)

## K-Means clustering and filtering
Now that we have the features of the random samples and the training points, we can use them to train and apply the K-Means models. A K-Means model requires a pre-defined number of clusters, which is unknown in our case and varies with the distribution of the samples. One way to identify the optimal number of clusters is the Calinski-Harabasz score. The score is the ratio of the between-cluster dispersion to the within-cluster dispersion, summed over all clusters, and is higher when clusters are dense and well separated. More information about the score can be found [here](https://scikit-learn.org/stable/modules/clustering.html#calinski-harabasz-index). Here we calculate the score for a range of cluster numbers and retain the model with the highest score.

Note that the K-Means model is sensitive to feature scales, so we need to standardise all features before applying the model.
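
The selection procedure can be tried on synthetic data first. In the sketch below, the three groups, their centres and the deliberately mismatched feature scales are assumptions for illustration; the candidate range of 2 to 8 clusters is also arbitrary and differs from the range used later in this notebook.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import calinski_harabasz_score
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)

# Synthetic features: three well-separated groups, with the second feature
# on a scale 1000 times larger than the first (hypothetical values)
centres = np.array([[0.0, 0.0], [20.0, 20.0], [-20.0, 20.0]])
features = np.vstack([rng.normal(c, 0.5, size=(200, 2)) for c in centres])
features[:, 1] *= 1000

# Standardise so that both features contribute equally to the distances
scaled = StandardScaler().fit_transform(features)

# Fit one model per candidate number of clusters and keep the best-scoring one
best_k, best_score, best_model = None, -np.inf, None
for k in range(2, 9):
    model = KMeans(n_clusters=k, n_init=10, random_state=1).fit(scaled)
    score = calinski_harabasz_score(scaled, model.labels_)
    print(f"k={k}: Calinski-Harabasz score = {score:.1f}")
    if score > best_score:
        best_k, best_score, best_model = k, score, model

print("Best number of clusters:", best_k)
```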

In [None]:
scaler = StandardScaler() # standard scaler for input data standardisation

In [None]:
ncpus=round(get_cpu_quota())
print('ncpus = '+str(ncpus))
crs=output_crs # alias: the cells below refer to the output CRS as `crs`
tiles_shp='Data/Mozambique_tiles_biggest1.shp'

# get bounding boxes of tiles
tiles=gpd.read_file(tiles_shp).to_crs(crs)
tile_bboxes=tiles.bounds
print('tile boundaries for Mozambique: \n',tile_bboxes)

In [None]:


frequency_threshold=0.05 # threshold of cluster frequency
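td2021_filtered=None # filtered training data
n_samples=1000 # assumption: number of random samples per class (~1000 as described above); adjust as needed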
# filtering training data for each class
# for i in lc_classes[8:]:
for i in lc_classes:
    #i=1 # test for first class
    print('Processing class ',i)
    gpd_samples=None
    n_total=np.sum(rf_2017_raster.to_numpy()==i)
    # generate randomly sampled data to fit and optimise a kmeans clusterer
    for n in range(len(tile_bboxes)):
        print('stratified random sampling from tile ',n)
        da_mask=rf_2017_raster.rio.clip([tiles.iloc[n].geometry],crs=crs,drop=True)
        da_mask=da_mask.rio.reproject(dst_crs=crs,resampling=Resampling.nearest)
        n_samples_tile=n_samples*np.sum(da_mask.to_numpy()==i)/n_total
        gpd_samples_tile=random_sampling(da_mask,n_samples_tile,sampling='manual',
                                         manual_class_ratios={str(i):n_samples_tile},out_fname=None)
        if gpd_samples is None:
            gpd_samples=gpd_samples_tile
        else:
            gpd_samples=pd.concat([gpd_samples,gpd_samples_tile])
    # get data array
#     da_mask=da_mask.where(da_mask==i,np.nan) # replace other class values as nan so they won't be sampled (comment due to large memory required)
#     gpd_samples=random_sampling(da_mask,n_samples,sampling='stratified_random',manual_class_ratios=None,out_fname=None)
#     gpd_samples=random_sampling(da_mask,n_samples,sampling='manual',manual_class_ratios={str(i):n_samples},out_fname=None)
    gpd_samples=gpd_samples.reset_index(drop=True).drop(columns=['spatial_ref','class']) # drop this attribute derived from random_sampling function
    gpd_samples[class_attr]=i # add attribute field so that we can use collect_training_data function
    if gpd_samples.crs is None:
        gpd_samples=gpd_samples.set_crs(crs)
    print('randomly sampled points for class ',i,'\n',gpd_samples)
    # extract data for the random samples
    column_names, sampled_data = collect_training_data(gdf=gpd_samples,
                                                          dc_query=query,
                                                          ncpus=ncpus,
#                                                           ncpus=1,
                                                          field=class_attr, 
                                                          zonal_stats=zonal_stats,
                                                          feature_func=feature_layers,
                                                          return_coords=False)
    # standardise features
    scaler=scaler.fit(sampled_data[:,1:])
    sampled_data=scaler.transform(sampled_data[:,1:])
#     sampled_data[:,-6:]=sampled_data[:,-6:]*10000
#     sampled_data=sampled_data[:,1:]
    # fit kmeans model using the sample training data
    # first find optimal number of clusters based on Calinski-Harabasz index
    highest_score=-999
    n_cluster_optimal=5
    kmeans_model_optimal=None # initialise optimal model parameters
    labels_optimal=None
    for n_cluster in range(5,26):
        kmeans_model = KMeans(n_clusters=n_cluster, random_state=1).fit(sampled_data)
        labels=kmeans_model.predict(sampled_data)
        score=metrics.calinski_harabasz_score(sampled_data, labels)
#         score=metrics.davies_bouldin_score(sampled_data, labels)
        print('Calinski-Harabasz score for ',n_cluster,' clusters is: ',score)
#         print('Davies-Bouldin score for ',n_cluster,' clusters is: ',score)
        if (highest_score==-999)or(highest_score<score):
#         if (highest_score==-999)or(highest_score>score):
            highest_score=score
            n_cluster_optimal=n_cluster
            kmeans_model_optimal=kmeans_model
            labels_optimal=labels
    print('Best number of clusters for class %s: %s'%(i,n_cluster_optimal))
    
    # subset original training points for this class
    td_single_class=training_features[training_features[class_attr]==i].reset_index(drop=True)
    print('Number of training data collected: ',len(td_single_class))
    column_names, model_input = collect_training_data(gdf=td_single_class,
                                                      dc_query=query,
                                                      ncpus=ncpus,
                                                      field=class_attr,
                                                      zonal_stats=zonal_stats,
                                                      feature_func=feature_layers,
                                                      return_coords=True)
    print('Number of training data after removing Nans and Infs: ',model_input.shape[0])
    # first convert the training data to pandas
    td_single_class_filtered=pd.DataFrame(data=model_input,columns=column_names)
    # then to geopandas dataframe
    td_single_class_filtered=gpd.GeoDataFrame(td_single_class_filtered, 
                                    geometry=gpd.points_from_xy(model_input[:,-2], model_input[:,-1],
                                                                crs=crs))
    # normalisation before clustering
    model_input=scaler.transform(model_input[:,1:-2])
#     model_input=model_input[:,1:-2]
#     model_input[:,-6:]=model_input[:,-6:]*10000
    # predict clustering labels
    labels_kmeans = kmeans_model_optimal.predict(model_input)
    # append clustering results to pixel coordinates
    td_single_class_filtered['cluster']=labels_kmeans
    # append frequency of each cluster
    labels_optimal=pd.DataFrame(data=labels_optimal,columns=['cluster']) # calculate cluster frequencies of the random samples
    cluster_frequency=td_single_class_filtered['cluster'].map(labels_optimal['cluster'].value_counts(normalize=True))
    td_single_class_filtered['cluster_frequency']=cluster_frequency
#     print('filtered training data: \n',td_single_class_filtered[td_single_class_filtered['cluster_frequency']<frequency_threshold])
    # filter by cluster frequency
    td_single_class_filtered=td_single_class_filtered[td_single_class_filtered['cluster_frequency']>=frequency_threshold]
    print('Number of training data after filtering: ',len(td_single_class_filtered))
    # export filtered training data for this class as shapefile (will encounter 10-character limit for attributes)
#     td_single_class_filtered.to_file('Results/landcover_td2021_filtered_DEAfrica_new_class_'+str(i)+'.shp')
    # export filtered training data for this class as geojson file
    td_single_class_filtered.to_file('Results/landcover_td2021_filtered_class_'+str(i)+'.geojson', driver="GeoJSON")
    # append the filtered training points of this class to final filtered training data
    if td2021_filtered is None:
        td2021_filtered=td_single_class_filtered
    else:
        td2021_filtered=pd.concat([td2021_filtered, td_single_class_filtered])

## Export filtered training features
Once we've filtered the training signatures, we can write the filtered data to disk. The full filtered training features file is provided as 'Results/landcover_td2021_filtered.txt', which will allow us to import the data in the next step(s) of the workflow.

In [None]:
# save training data for all classes
print('filtered training data for 2021:\n',td2021_filtered)
td2021_filtered.to_file('Results/landcover_td2021_filtered.geojson', driver="GeoJSON")

# export the filtered training data as txt file
output_file = "Results/landcover_td2021_filtered.txt"
td2021_filtered.to_csv(output_file, header=True, index=None, sep=' ')
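
Because the file is written with a space as separator, it needs to be read back with `sep=' '` in later steps. The following round trip is a small sketch using an in-memory buffer in place of the file on disk; the table contents are hypothetical.

```python
import io
import pandas as pd

# Hypothetical filtered table with a class column and two feature columns
filtered = pd.DataFrame({"LC_Class_I": [1, 1, 2],
                         "blue_0": [0.031, 0.042, 0.055],
                         "NDVI_0": [0.61, 0.58, 0.12]})

buffer = io.StringIO()  # stands in for the .txt file on disk
filtered.to_csv(buffer, header=True, index=None, sep=' ')
buffer.seek(0)

# Read the space-separated text back into a dataframe
reloaded = pd.read_csv(buffer, sep=' ')
print(reloaded.dtypes)
```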