# Extracting training data from the ODC

* **Products used:** 
[gm_s2_annual](https://explorer.digitalearth.africa/gm_s2_annual)


## Background

**Training data** is the most important part of any supervised machine learning workflow. The quality of the training data has a greater impact on the classification than the algorithm used. Large and accurate training datasets are preferable: increasing the training sample size results in increased classification accuracy ([Maxwell et al. 2018](https://www.tandfonline.com/doi/full/10.1080/01431161.2018.1433343)). A review of training data methods in the context of Earth Observation is available [here](https://www.mdpi.com/2072-4292/12/6/1034).

When creating training labels, be sure to capture the **spectral variability** of the class, and to use imagery from the time period you want to classify (rather than relying on basemap composites). Another common problem with training data is **class imbalance**. This occurs when one of your classes is relatively rare and therefore makes up a smaller proportion of the training set. When imbalanced data is used, the final classification will commonly under-predict the less abundant classes relative to their true proportion.
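As a rough way to check for class imbalance before training, the sketch below counts how often each label occurs and derives an inverse-frequency weight per class. The function name `class_balance` and the example labels are hypothetical, and the weighting is only one common heuristic, not something this notebook applies.

```python
from collections import Counter


def class_balance(labels):
    """Return each class's share of the samples and an inverse-frequency weight."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    summary = {}
    for cls, count in sorted(counts.items()):
        summary[cls] = {
            "proportion": count / n_samples,
            # Inverse-frequency heuristic: rarer classes receive larger weights
            "weight": n_samples / (n_classes * count),
        }
    return summary


# Hypothetical labels: class 9 is four times as common as class 4
example_labels = [9, 9, 9, 9, 4]
balance = class_balance(example_labels)
print(balance)
```

A class whose proportion is far below the others is a candidate for collecting more labels or for weighting during model fitting.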

There are many platforms for gathering training labels, and the best one to use depends on your application. GIS platforms are great for collecting training data as they are highly flexible and mature platforms; [Geo-Wiki](https://www.geo-wiki.org/) and [Collect Earth Online](https://collect.earth/home) are two open-source websites that may also be useful depending on the reference data strategy employed. Alternatively, there are many pre-existing training datasets on the web that may be useful, e.g. [Radiant Earth](https://www.radiant.earth/) manages a growing number of reference datasets for use by anyone.


## Description
This notebook will extract training data (feature layers) from the `open-data-cube` using geometries within a GeoJSON file. The default example will use the vegetated wetland/non-vegetated wetland labels within the `'data/training_data/kenya_uganda_yearly_polygon_training_sites_{year}.geojson'` file.

To do this, we rely on a custom `deafrica-sandbox-notebooks` function called `collect_training_data`, contained within the [deafrica_tools.classification](../../Tools/deafrica_tools/classification.py) script. The principal goal of this notebook is to familiarise users with this function so they can extract the appropriate data for their use case. The default example also highlights extracting a set of useful feature layers for generating a vegetated wetland mask for Kenya and Uganda.


1. Preview the polygons in our training data by plotting them on a basemap
2. Define a feature layer function to pass to `collect_training_data`
3. Extract training data from the datacube using `collect_training_data`
4. Export the training data to disk for use in subsequent scripts


***

## Getting started

To run this analysis, run all the cells in the notebook, starting with the "Load packages" cell. 

### Load packages


In [1]:
%matplotlib inline

import os
import datacube
import numpy as np
import xarray as xr
import subprocess as sp
import geopandas as gpd
from odc.io.cgroups import get_cpu_quota
from datacube.utils.geometry import assign_crs

from deafrica_tools.plotting import map_shapefile
from deafrica_tools.bandindices import calculate_indices
from deafrica_tools.classification import collect_training_data

import warnings
warnings.filterwarnings("ignore")



## Analysis parameters

* `year`: The year for which to get training data.
* `path`: The path to the input vector file from which we will extract training data. A default geojson is provided.
* `field`: The name of the column in your vector file's attribute table that contains the class labels. **The class labels must be integers**.
* `training_data`: The path to export the collected training data to.
* `product`: Which Annual GeoMAD product to use:
    * Landsat 8 & 9 Annual GeoMAD `gm_ls8_ls9_annual` is available from 2021 onwards
    * Sentinel-2 Annual GeoMAD `gm_s2_annual` is available from 2017 – 2020
    * Landsat 8 Annual GeoMAD `gm_ls8_annual` is available from 2013 – 2020
    * Landsat 5 & 7 Annual GeoMAD `gm_ls5_ls7_annual` is available from 1984 – 2012
* `measurements`: The bands to load for the specified product.
* `satellite_mission`: Used when calculating spectral indices from bands. `ls` for Landsat data and `s2` for Sentinel-2 data.
* `resolution`: The spatial resolution, in metres, to resample the satellite data to. `(-30, 30)` for Landsat data and `(-20, 20)` for Sentinel-2 data.
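To make the product availability above easier to check against a chosen `year`, here is a minimal sketch that looks up which Annual GeoMAD products cover a given year. The dictionary `GEOMAD_AVAILABILITY` and the helper `products_for_year` are hypothetical names introduced for illustration; the year ranges are copied from the list above, with the open-ended Landsat 8 & 9 range represented by `None`.

```python
# Year ranges as listed in this notebook; None marks an open-ended range.
GEOMAD_AVAILABILITY = {
    "gm_ls5_ls7_annual": (1984, 2012),
    "gm_ls8_annual": (2013, 2020),
    "gm_s2_annual": (2017, 2020),
    "gm_ls8_ls9_annual": (2021, None),
}


def products_for_year(year):
    """Return the names of the products whose listed range includes `year`."""
    year = int(year)  # the notebook stores the year as a string
    return [
        name
        for name, (start, end) in GEOMAD_AVAILABILITY.items()
        if start <= year and (end is None or year <= end)
    ]


print(products_for_year("2017"))
```

When more than one product covers the year, remember that `measurements`, `satellite_mission` and `resolution` need to match the product you pick.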

In [2]:
year = "2017"
path = f"data/training_data/kenya_uganda_yearly_polygon_training_sites_{year}.geojson"
field = "class"

# Create the output directory to store the results.
output_dir = "results"
os.makedirs(output_dir, exist_ok=True)
# Set the name and location of the output file.
training_data = f"{output_dir}/kenya_uganda_test_training_data_{year}.txt"

# Specify the product, measurements and resolution. 
product = "gm_s2_annual"
measurements =  ['blue','green','red','nir','swir_1','swir_2','red_edge_1', 'red_edge_2', 'red_edge_3', 'BCMAD', 'EMAD', 'SMAD'] 
satellite_mission = "s2"
resolution = (-20,20)

# Landsat alternatives: uncomment one product along with the settings below.
# product = "gm_ls5_ls7_annual"
# product = "gm_ls8_annual"
# product = "gm_ls8_ls9_annual"
# measurements = ['blue','green','red','nir','swir_1','swir_2', 'BCMAD', 'EMAD', 'SMAD']
# satellite_mission = "ls"
# resolution = (-30,30)

### Find the number of CPUs

In [3]:
ncpus=round(get_cpu_quota())
print('ncpus = '+str(ncpus))

ncpus = 15


## Preview input data

We can load and preview our input data shapefile using `geopandas`. The shapefile should contain a column with class labels (e.g. 'class'). These labels will be used to train our model. 

> Remember, the class labels **must** be represented by `integers`.
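One way to check the integer requirement before extracting data is sketched below. The helper `non_integer_labels` is a hypothetical name and is not part of `deafrica_tools`; it simply reports any values that are not integers.

```python
import numbers


def non_integer_labels(labels):
    """Return the labels that are not integers (booleans are rejected too)."""
    bad = []
    for value in labels:
        is_bool = isinstance(value, bool)
        is_integral = isinstance(value, numbers.Integral)
        if is_bool or not is_integral:
            bad.append(value)
    return bad


# Made-up labels: two valid classes followed by three invalid entries
print(non_integer_labels([9, 4, "wetland", 4.5, True]))
```

With the variables in this notebook, it could be called as `non_integer_labels(input_data[field].tolist())`; an empty list means every label passed the check.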


In [4]:
# Load input data shapefile
input_data = gpd.read_file(path)

# Show the first five rows
input_data.head()

Unnamed: 0,class,geometry
0,9,"POLYGON ((39.14620 -0.37515, 39.14620 -0.37215..."
1,4,"POLYGON ((35.74762 4.36950, 35.74762 4.37250, ..."
2,9,"POLYGON ((40.40394 -0.52649, 40.40394 -0.52349..."
3,9,"POLYGON ((40.43746 0.53355, 40.43746 0.53655, ..."
4,9,"POLYGON ((38.97078 -0.49255, 38.97078 -0.48955..."


In [5]:
# Plot training data in an interactive map
map_shapefile(input_data, attribute=field)


## Extracting training data

The function `collect_training_data` takes our geojson containing class labels and extracts training data (features) from the datacube over the locations specified by the input geometries. The function will also pre-process our training data by stacking the arrays into a useful format and removing any `NaN` or `inf` values.

The below variables can be set within the `collect_training_data` function:

* `zonal_stats`: An optional string giving the names of zonal statistics to calculate across each polygon (if the geometries in the vector file are polygons and not points). Default is None (all pixel values are returned). Supported values are 'mean', 'median', 'max', and 'min'. 
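To illustrate what a zonal statistic does conceptually, the sketch below collapses the per-pixel feature values of one polygon into a single row. This only illustrates the idea and is not how `collect_training_data` is implemented internally; the names `REDUCERS` and `zonal_statistic` and the pixel values are made up.

```python
import statistics

# The four statistics named above, mapped to plain Python reducers
REDUCERS = {
    "mean": statistics.mean,
    "median": statistics.median,
    "max": max,
    "min": min,
}


def zonal_statistic(pixel_rows, stat):
    """Collapse per-pixel feature rows into one row by applying `stat` per column."""
    reducer = REDUCERS[stat]
    columns = zip(*pixel_rows)  # transpose: one tuple of values per feature
    return [reducer(column) for column in columns]


# Hypothetical polygon covering three pixels, each with two feature values
pixels = [[0.2, 10.0], [0.4, 20.0], [0.9, 60.0]]
polygon_mean = zonal_statistic(pixels, "mean")
polygon_median = zonal_statistic(pixels, "median")
print(polygon_mean, polygon_median)
```

With `zonal_stats=None`, every pixel would instead remain a separate training sample, so large polygons contribute many more rows than small ones.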

In addition to the `zonal_stats` parameter, we need to set up a query dictionary for the Open Data Cube containing options such as `measurements` (the bands to load from the satellite), `resolution` (the cell size), and `output_crs` (the output projection). This `query` dictionary is passed into `collect_training_data` through the `dc_query` parameter, i.e. `collect_training_data(dc_query=query, ...)`. It is also the only argument of the **feature layer function**, which we will define and describe in a moment.

> Note: `collect_training_data` also has a number of additional parameters for handling ODC I/O read failures, where polygons that return an excessive number of null values can be resubmitted to the multiprocessing queue.  Check out the [docs](https://github.com/digitalearthafrica/deafrica-sandbox-notebooks/blob/83116e80ebb4f8744e3de74e7a713aadd0a7577a/Tools/deafrica_tools/classification.py#L565) to learn more.

In [6]:
# Set up our inputs to collect_training_data
zonal_stats = 'mean'

# Set up the inputs for the ODC query
time = year

output_crs = 'epsg:6933'

Generate a datacube query object from the parameters above:

In [7]:
query = {
    'product': product,
    'time': time,
    'measurements': measurements,
    'resolution': resolution,
    'output_crs': output_crs
}

## Defining feature layers

To create the desired feature layers, we pass instructions to `collect_training_data` through the `feature_func` parameter.

* `feature_func`: A function for generating feature layers that is applied to the data within the bounds of the input geometry. The `feature_func` must accept a `dc_query` dictionary, and return a single `xarray.Dataset` or `xarray.DataArray` containing 2D coordinates (i.e. x and y, with no time dimension). For example:

          def feature_function(query):
              dc = datacube.Datacube(app='feature_layers')
              ds = dc.load(**query)
              ds = ds.mean('time')
              return ds

Below, we will define a more complicated feature layer function than the brief example shown above. We will calculate some band indices on the Sentinel-2 [GeoMAD](https://github.com/digitalearthafrica/deafrica-sandbox-notebooks/blob/master/Datasets/GeoMAD.ipynb) and append a slope dataset.



In [8]:
from datacube.testutils.io import rio_slurp_xarray

def feature_layers(query):
    #connect to the datacube
    dc = datacube.Datacube(app='feature_layers')
    
    #load Annual geomedian
    ds = dc.load(**query)
    
    #calculate some band indices
    da = calculate_indices(ds,
                           index=['NDVI', 'LAI', 'MNDWI'],
                           drop=False,
                           satellite_mission=satellite_mission)
    
    #add slope dataset
    url_slope = "https://deafrica-input-datasets.s3.af-south-1.amazonaws.com/srtm_dem/srtm_africa_slope.tif"
    slope = rio_slurp_xarray(url_slope, gbox=ds.geobox)
    slope = slope.to_dataset(name='slope')
    
    #merge results into single dataset 
    result = xr.merge([da, slope],compat='override')

    return result.squeeze()

Now let's run the `collect_training_data` function.

> **Note**: With supervised classification, it's common to have many, many labelled geometries in the training data. `collect_training_data` can parallelize across the geometries to speed up the extraction of training data. Setting `ncpus>1` will automatically trigger the parallelization. However, it's best to set `ncpus=1` to begin with to assist with debugging before triggering the parallelization. You can also limit the number of polygons to run when checking code. For example, passing in `gdf=input_data[0:5]` will only run the code over the first 5 polygons.

In [9]:
column_names, model_input = collect_training_data(
                                    gdf=input_data,
                                    dc_query=query,
                                    ncpus=ncpus,
                                    field=field,
                                    zonal_stats=zonal_stats,
                                    feature_func=feature_layers
                                    )

Taking zonal statistic: mean
Collecting training data in parallel mode


  0%|          | 0/4000 [00:00<?, ?it/s]

Percentage of possible fails after run 1 = 0.0 %
Removed 0 rows wth NaNs &/or Infs
Output shape:  (4000, 17)


The function returns two objects. The first is a list (`column_names`) containing the names of the feature layers we've computed:

In [10]:
print(column_names)

['class', 'blue', 'green', 'red', 'nir', 'swir_1', 'swir_2', 'red_edge_1', 'red_edge_2', 'red_edge_3', 'BCMAD', 'EMAD', 'SMAD', 'NDVI', 'LAI', 'MNDWI', 'slope']


The second object returned by the function is a `numpy.array` (`model_input`) containing the data from our labelled geometries. The first column of the array holds the class integer, and the remaining columns hold the values of each feature layer we computed:

In [11]:
print(np.array_str(model_input, precision=2, suppress_small=True))

[[   9.    993.85 1434.34 ...    0.21   -0.55    2.24]
 [   9.   1166.82 1697.31 ...    0.26   -0.47    3.47]
 [   4.    636.53  951.93 ...    0.19   -0.33   15.  ]
 ...
 [   4.    433.71  754.96 ...    1.51   -0.49    5.03]
 [   4.    494.36  787.51 ...    1.29   -0.53    7.57]
 [   9.    352.88  589.98 ...    1.51   -0.49   32.83]]


## Export training data

Once we've collected all the training data we require, we can write the data to disk. This will allow us to import the data in the next step(s) of the workflow.


In [12]:
#grab all columns
model_col_indices = [column_names.index(var_name) for var_name in column_names]
#Export files to disk
np.savetxt(training_data, model_input[:, model_col_indices], header=" ".join(column_names), fmt="%4f")
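The exported text file can be read back in a later step. The sketch below is one possible way to do that, assuming the file was written with `np.savetxt` and a space-separated header as above. The helper `load_training_data` and the demo values are hypothetical; the demo writes a small temporary file rather than touching the real output.

```python
import os
import tempfile

import numpy as np


def load_training_data(path):
    """Read the column names from the header line and the values as a 2D array."""
    with open(path) as f:
        # np.savetxt prefixes the header line with "# " by default
        names = f.readline().lstrip("#").split()
    data = np.loadtxt(path, ndmin=2)  # comment lines starting with "#" are skipped
    return names, data


# Round trip on made-up values, written in the same way as the cell above
demo_names = ["class", "blue", "NDVI"]
demo_values = np.array([[9.0, 993.85, 0.21], [4.0, 636.53, 0.19]])
demo_path = os.path.join(tempfile.mkdtemp(), "demo_training_data.txt")
np.savetxt(demo_path, demo_values, header=" ".join(demo_names), fmt="%4f")

names, data = load_training_data(demo_path)
labels = data[:, 0].astype(int)  # first column holds the class integer
features = data[:, 1:]           # remaining columns hold the feature layers
print(names, labels, features.shape)
```

Splitting the array into `labels` and `features` this way mirrors the layout described earlier, where the class is stored in the first column.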

## Recommended next steps

To continue working through the notebooks in this `Scalable Machine Learning on the ODC` workflow, go to the next notebook `2_Inspect_training_data.ipynb`.

1. **Extracting_training_data (this notebook)**
2. [Inspect_training_data](2_Inspect_training_data.ipynb)
3. [Evaluate_optimize_fit_classifier](3_Evaluate_optimize_fit_classifier.ipynb)
4. [Classify_satellite_data](4_Classify_satellite_data.ipynb)
5. [Object-based_filtering](5_Object-based_filtering.ipynb)


***

## Additional information

**License:** The code in this notebook is licensed under the [Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0). 
Digital Earth Africa data is licensed under the [Creative Commons by Attribution 4.0](https://creativecommons.org/licenses/by/4.0/) license.

**Contact:** If you need assistance, please post a question on the [Open Data Cube Slack channel](http://slack.opendatacube.org/) or on the [GIS Stack Exchange](https://gis.stackexchange.com/questions/ask?tags=open-data-cube) using the `open-data-cube` tag (you can view previously asked questions [here](https://gis.stackexchange.com/questions/tagged/open-data-cube)).
If you would like to report an issue with this notebook, you can file one on [GitHub](https://github.com/digitalearthafrica/deafrica-sandbox-notebooks).

**Compatible datacube version:** 

In [13]:
print(datacube.__version__)

1.8.8


**Last Tested:**

In [14]:
from datetime import datetime
datetime.today().strftime('%Y-%m-%d')

'2022-11-08'