# Generate Training Points

## Background

**Training data** is the most important part of any supervised machine learning workflow. The quality of the training data has a greater impact on the classification than the algorithm used. Large and accurate training data sets are preferable: increasing the training sample size results in increased classification accuracy ([Maxwell et al. 2018](https://www.tandfonline.com/doi/full/10.1080/01431161.2018.1433343)). A review of training data methods in the context of Earth Observation is available [here](https://www.mdpi.com/2072-4292/12/6/1034).

There are many platforms for gathering land cover training labels, and the best one depends on your application. GIS platforms are well suited to collecting training data because they are flexible and mature; [Geo-Wiki](https://www.geo-wiki.org/) and [Collect Earth Online](https://collect.earth/home) are two open-source websites that may also be useful, depending on the reference data strategy employed. Alternatively, there are many pre-existing training datasets on the web, e.g. [Radiant Earth](https://www.radiant.earth/) manages a growing number of reference datasets for use by anyone. Once the locations of land cover labels are available, we can extract features at these locations from satellite imagery as input for machine learning.

## Description

As timely training data is not always available, in this notebook we demonstrate how to generate a set of randomly distributed training points for a district (Kicukiro) in Rwanda from an existing classification map.

The workflow includes the following steps:

1. Preview the district boundaries of Rwanda on a basemap
2. Select a district as the area of analysis
3. Merge classes on the classification map to keep only those you want
4. Generate randomly distributed training points and export for future use

***

To run this analysis, run all the cells in the notebook, starting with the "Load packages" cell.

### Load packages


In [None]:
%matplotlib inline
import warnings
import numpy as np
import geopandas as gpd
import pandas as pd
import xarray as xr
import rioxarray
from matplotlib import pyplot as plt
from matplotlib.colors import ListedColormap,BoundaryNorm
from matplotlib.patches import Patch
from random_sampling import random_sampling # adapted from function by Chad Burton: https://gist.github.com/cbur24/04760d645aa123a3b1817b07786e7d9f
from datacube.utils.cog import write_cog

## Analysis parameters

* `input_polygons_path`: The path to the shapefile containing polygons of Rwanda administrative boundaries.
* `input_map_path`: The classification map used to extract the training points. Here we use the [Rwanda Land Cover 2015 Scheme II map](http://geoportal.rcmrd.org/layers/servir%3Arwanda_landcover_2015_scheme_ii).
* `district_name_attribute`: The name of the column in your shapefile attribute table that identifies level 2 names, i.e. district names.
>**Note**: If you change your file to a different shapefile, remember to update this variable to identify your area of interest.
* `output_crs`: Output spatial reference system.

In [None]:
input_polygons_path = 'Data/ADM2.shp'
input_map_path='Data/rwanda_landcover_2015_scheme_ii.tif'
district_name_attribute = 'ADM2_NAME'
output_crs='epsg:32735' # WGS84/UTM Zone 35S

## Load and display input data
Let's load the administrative boundary polygons and preview the data:

In [None]:
polygons=gpd.read_file(input_polygons_path)
polygons.head()

We can display the polygons on an interactive basemap. When you hover over a polygon, you will be able to see its attributes:

In [None]:
polygons.explore(
    tiles = "https://mt1.google.com/vt/lyrs=y&x={x}&y={y}&z={z}", 
    attr ='Imagery @2022 Landsat/Copernicus, Map data @2022 Google',
    popup=True,
    cmap='viridis',
    style_kwds=dict(color= 'red', fillOpacity= 0, weight= 3),
    )

Let's load the classification map and display it:

In [None]:
classification_map=xr.open_dataset(input_map_path,engine="rasterio").astype(np.uint8)
classification_map=classification_map.to_array().squeeze()

According to the metadata of the classification map, it contains 15 classes (including Nodata), each represented by a pixel value: Nodata (0), Dense Forest (1), Moderate Forest (2), Sparse Forest (3), Woodland (4), Closed Grassland (5), Open Grassland (6), Closed Shrubland (7), Open Shrubland (8), Perennial Cropland (9), Annual Cropland (10), Wetland (11), Water Body (12), Urban Settlement (13) and Other Land (14). Here we define a dictionary that maps each class name to its pixel value for display:

>**Note**: If you change the classification map, you will need to understand what class each pixel value represents.

In [None]:
dict_map={'Nodata':0,'Dense Forest':1,'Moderate Forest':2,'Sparse Forest':3,'Woodland':4,
          'Closed Grassland':5,'Open Grassland':6,'Closed Shrubland':7,'Open Shrubland':8,
          'Perennial Cropland':9,'Annual Cropland':10,'Wetland':11,'Water Body':12,'Urban Settlement':13,'Other Land':14}
# display colour for each class value
colours = {0:'white',1:'darkgreen',2:'limegreen',3:'lime',4:'lightgreen',5:'olive',6:'yellow',7:'goldenrod',
           8:'darkorange',9:'magenta',10:'pink',11:'cyan',12:'blue',13:'gray',14:'black'}

fig, axes = plt.subplots(1,1)

# Plot classification map
unique_values=np.unique(classification_map)
cmap=ListedColormap([colours[k] for k in unique_values])
norm = BoundaryNorm(list(unique_values)+[np.max(unique_values)+1], cmap.N)
classification_map.plot.imshow(ax=axes, 
                   cmap=cmap,
                   norm=norm,
                   add_labels=True, 
                   add_colorbar=False,
                   interpolation='none')
# add colour legend
patches_list=[Patch(facecolor=colour) for colour in colours.values()]
axes.legend(patches_list, list(dict_map.keys()),loc='upper center', ncol =4, bbox_to_anchor=(0.5, -0.1))

## Select district for analysis
Now we select a district of interest for analysis. Here we select the *Kicukiro* district and will use this district as demonstration for the rest of the workflow.  
>**Note**: If you change your district of interest, processing may take more time and memory, depending on the size of the district.

In [None]:
district_name='Kicukiro'
polygon=polygons.loc[polygons[district_name_attribute]==district_name]
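A boolean selection like the one above returns an empty table, without raising an error, when the district name is misspelled. The sketch below shows one way to guard against that. The helper `select_district` is hypothetical and not part of this notebook, and the toy table uses a plain `pandas` DataFrame with illustrative district names in place of the real shapefile; the same `.loc` selection applies to a GeoDataFrame.

```python
import pandas as pd


def select_district(table, attribute, name):
    """Hypothetical helper: return the rows whose `attribute` equals `name`.

    Raises a ValueError listing the available names if nothing matches.
    """
    selected = table.loc[table[attribute] == name]
    if selected.empty:
        available = sorted(table[attribute].unique())
        raise ValueError(
            f"{name!r} not found in column {attribute!r}; available: {available}"
        )
    return selected


# Toy attribute table standing in for the administrative boundaries (illustrative values)
toy = pd.DataFrame({"ADM2_NAME": ["Gasabo", "Kicukiro", "Nyarugenge"]})

selected = select_district(toy, "ADM2_NAME", "Kicukiro")
print(selected)
```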

With the district selected, we clip the classification map to the region for later analysis:

In [None]:
map_clipped=classification_map.rio.clip(geometries=polygon.geometry.values, crs=polygon.crs)

## Class merging

We would like the training samples to represent pixels that are as pure as possible, but the classification map contains more classes than we need. We therefore merge closely related classes and drop those that are likely to be mixtures of pure classes, e.g. Sparse Forest. Here we merge:  
* Dense Forest (1) and Moderate Forest (2) as Forest (1);  
* Closed Grassland (5) and Open Grassland (6) as Grassland (5);  
* Closed Shrubland (7) and Open Shrubland (8) as Shrubland (7).  

The Other Land class is treated as Bare Land based on our observations of the map. The classes Sparse Forest (3) and Woodland (4) are dropped from the map.

In [None]:
map_clipped=map_clipped.where(map_clipped!=dict_map['Moderate Forest'],dict_map['Dense Forest']) # merge moderate forest (2) and dense forest (1)
map_clipped=map_clipped.where(map_clipped!=dict_map['Open Grassland'],dict_map['Closed Grassland']) # merge open grassland (6) and closed grassland (5)
map_clipped=map_clipped.where(map_clipped!=dict_map['Open Shrubland'],dict_map['Closed Shrubland']) # merge open shrubland (8) and closed shrubland (7)
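To make the merging step easier to follow, here is a minimal sketch on a toy array (the values are illustrative, not taken from the Rwanda map). `DataArray.where(cond, other)` keeps the original value wherever `cond` is true and writes `other` everywhere else, so the condition `!= 2` with `other=1` recodes class 2 as class 1 and leaves every other pixel unchanged.

```python
import numpy as np
import xarray as xr

# Toy 2 x 3 classified map using the same pixel-value scheme as the notebook
toy = xr.DataArray(
    np.array([[1, 2, 3], [5, 6, 13]], dtype=np.uint8),
    dims=("y", "x"),
)

# Keep pixels that are NOT class 2; replace the rest (class 2) with class 1
merged = toy.where(toy != 2, 1)
# Keep pixels that are NOT class 6; replace the rest (class 6) with class 5
merged = merged.where(merged != 6, 5)

print(merged.values)
```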

>**Note**: If you change your class merging strategy, some of the variables related to class names and values used in the subsequent notebooks need to be updated as well.

## Generate random training samples
We generate some randomly distributed samples for each class from the clipped classification map using the `random_sampling` function. This function takes in a few parameters:  
* `da`: a classified map in the format of 2-dimensional xarray.DataArray
* `n`: total number of points to sample
* `min_sample_n`: Minimum number of samples to generate per class if proportional number is smaller than this
* `sampling`: the sampling strategy, e.g. 'stratified_random' where each class has a number of points proportional to its relative area, 'equal_stratified_random' where each class has the same number of points, or 'manual' which allows you to define number of samples for each class.
* `out_fname`: a filepath name for the function to export a shapefile/geojson of the sampling points into a file. You can set this to `None` if you don't need to output the file.
* `class_attr`: the name of the column in the output dataframe that contains the integer class values from the classified map.
* `drop_value`: pixel value on the classification map to be excluded from sampling.  

The output of the function is a geopandas dataframe of randomly distributed points containing a column `class_attr` identifying class values. 

Here we extract around 500 training points in total and export them to a GeoJSON file for use in the rest of the workflow. We use the proportional stratified sampling method by setting `sampling='stratified_random'`, and set the minimum number of samples per class to 20 so that minor classes are not left without samples.
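The sketch below illustrates the idea of proportional stratified sampling with a per-class minimum on a toy array. It is an illustration under stated assumptions, not the implementation of `random_sampling`: the helpers `allocate_samples` and `sample_pixels` are hypothetical, and the rounding and capping rules are our own choices.

```python
import numpy as np


def allocate_samples(class_map, n, min_sample_n, drop_value=0):
    """Hypothetical helper: split n samples across classes in proportion to area."""
    # Count pixels per class, ignoring the value excluded from sampling
    values, counts = np.unique(class_map[class_map != drop_value], return_counts=True)
    proportional = np.round(n * counts / counts.sum()).astype(int)
    # Raise small classes to the minimum, but never request more pixels than exist
    per_class = np.minimum(np.maximum(proportional, min_sample_n), counts)
    return dict(zip(values.tolist(), per_class.tolist()))


def sample_pixels(class_map, allocation, seed=0):
    """Hypothetical helper: draw (row, col) positions for each class without replacement."""
    rng = np.random.default_rng(seed)
    samples = {}
    for value, k in allocation.items():
        rows, cols = np.nonzero(class_map == value)
        picked = rng.choice(rows.size, size=k, replace=False)
        samples[value] = list(zip(rows[picked].tolist(), cols[picked].tolist()))
    return samples


# Toy 20 x 20 map: class 10 dominates, class 13 is a minor class, 0 is Nodata
toy_map = np.full((20, 20), 10, dtype=np.uint8)
toy_map[:2, :] = 0    # 40 Nodata pixels, excluded from sampling
toy_map[2:4, :] = 13  # 40 pixels of the minor class

allocation = allocate_samples(toy_map, n=50, min_sample_n=10)
samples = sample_pixels(toy_map, allocation)
print(allocation)
```

In this sketch the minor class is raised to the minimum, so the total number of points can end up slightly above `n`. That may be one reason the total is described as "around" 500 rather than exactly 500.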

As mentioned earlier, we don't want the dropped classes Sparse Forest (3) and Woodland (4) to be included in the samples. We therefore assign them to Nodata (0) and set `drop_value` to 0 before calling the function:

In [None]:
map_clipped=map_clipped.where((map_clipped!=dict_map['Sparse Forest'])&(map_clipped!=dict_map['Woodland']),0)

We now call the function, with an output file name that incorporates the district name:

In [None]:
class_attr='LC_Class_I'
out_fname='Results/Training_samples_'+district_name+'.geojson'
gpd_random_samples=random_sampling(da=map_clipped,n=500,sampling='stratified_random',
                                   min_sample_n=20,out_fname=out_fname,class_attr=class_attr,drop_value=0)

>**Note**: The output training data file can also be in other formats (e.g. shapefile) that can be read by `geopandas`, but if you change it, remember to update wherever it was used in subsequent notebooks.

Finally, we also export the clipped and class-merged map to disk for future use:

In [None]:
outname='Results/rwanda_landcover_2015_scheme_ii_clipped.tif'
write_cog(map_clipped, outname, overwrite=True)