# Extracting training data from the ODC <img align="right" src="../Supplementary_data/DE_Africa_Logo_Stacked_RGB_small.jpg">

* **Products used:** 
[s2_l2a](https://explorer.digitalearth.africa/s2_l2a)


## Description
This notebook extracts training data (feature layers) from the `open-data-cube` using geometries within a shapefile (or geojson). To do this, we rely on a custom `deafrica-sandbox-notebooks` function called `collect_training_data`, contained within the [deafrica_classificationtools](../Scripts/deafrica_classificationtools.py) script. The goal of this notebook is to familiarise users with this function so they can extract the appropriate data for their use case.

1. Preview the polygons in our training data by plotting them on a basemap
2. Extract training data from the datacube using a custom-defined feature layer function that we pass to `collect_training_data`
3. Export the training data to disk for use in subsequent scripts

***

## Getting started

To run this analysis, run all the cells in the notebook, starting with the "Load packages" cell. 

### Load packages


In [None]:
# !pip install https://packages.dea.ga.gov.au/hdstats/hdstats-0.1.6.tar.gz
# !pip install dask-ml
# !pip install multiprocessing-logging 
# !pip install backoff-utils

In [2]:
%matplotlib inline

import sys
import os
import warnings
import datacube
import numpy as np
import xarray as xr
import subprocess as sp
import geopandas as gpd
from datacube.utils.geometry import assign_crs

sys.path.append('../../Scripts')
from deafrica_plotting import map_shapefile
from feature_layer_functions import xr_geomedian_tmad
from deafrica_bandindices import calculate_indices
from deafrica_classificationtools import collect_training_data 

from feature_layer_functions import gm_mads_three_seasons #, gm_mads_two_seasons

warnings.filterwarnings("ignore")
%load_ext autoreload
%autoreload 2

from datacube.utils.rio import configure_s3_access
configure_s3_access(aws_unsigned=True, cloud_defaults=True)



## Analysis parameters

* `path`: The path to the input shapefile from which we will extract training data.
* `field`: The name of the column in your shapefile attribute table that contains the class labels. **The class labels must be integers**
* `ncpus`: Set this value to > 1 to parallelize the collection of training data, e.g. `ncpus=8`. To find the number of CPUs automatically, run the cell two below this one.

> **Note**: With supervised classification, it's common to have many labelled geometries in the training data. `collect_training_data` can parallelize across the geometries to speed up the extraction of training data. Setting `ncpus>1` will automatically trigger the parallelization. However, it's best to set `ncpus=1` to begin with, to assist with debugging, before triggering the parallelization.


In [3]:
path = './data/Eastern_training_data_20201104.geojson' 
field = 'Class'

### Optional: Automatically find the number of cpus

In [4]:
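try:
    # Parse the CPU count from the tail of the matching environment variable
    ncpus = int(float(sp.getoutput('env | grep CPU')[-4:]))
except ValueError:
    # If the four-character slice is not numeric, retry with a three-character slice
    ncpus = int(float(sp.getoutput('env | grep CPU')[-3:]))

print('ncpus = ' + str(ncpus))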

ncpus = 31
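
The cell above depends on how the environment variables are named in the Sandbox. As a more portable alternative, here is a minimal sketch that uses only the Python standard library. The helper name `detect_ncpus` is hypothetical, and `os.sched_getaffinity` is only available on some platforms (for example Linux), so the code falls back to `os.cpu_count()`.

```python
import os

def detect_ncpus():
    """Return the number of CPUs usable by this process (at least 1)."""
    try:
        # Honours the CPU affinity of the current process where supported
        return len(os.sched_getaffinity(0))
    except AttributeError:
        # Fallback: total CPUs on the machine; cpu_count() can return None
        return os.cpu_count() or 1

ncpus = detect_ncpus()
print('ncpus = ' + str(ncpus))
```

Neither call is guaranteed to reflect container CPU quotas, so the value may still need to be lowered by hand.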


## Preview input data

We can load and preview our input data shapefile using `geopandas`. The shapefile should contain a column with class labels (e.g. 'class'). These labels will be used to train our model. 

> Remember, the class labels **must** be represented by `integers`.
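
If the labels in a shapefile are stored as text, they need to be converted to integers first. The sketch below shows one way to do this with `pandas` (a `geopandas` GeoDataFrame supports the same column operations). The column name `Class_name` and the label values are hypothetical and only serve as an illustration.

```python
import pandas as pd
from pandas.api.types import is_integer_dtype

# Toy attribute table standing in for the shapefile; the text labels are hypothetical
df = pd.DataFrame({'Class_name': ['crop', 'non-crop', 'crop', 'crop']})

# Map each distinct text label to an integer code (sorted so the mapping is reproducible)
codes, labels = pd.factorize(df['Class_name'], sort=True)
df['Class'] = codes.astype('int64')

# Keep the lookup so predictions can be translated back to class names later
label_lookup = dict(enumerate(labels))

# Check the label column before passing the data to collect_training_data
assert is_integer_dtype(df['Class']), "class labels must be integers"
print(label_lookup)
```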


In [5]:
# Load input data shapefile
input_data = gpd.read_file(path)

# Display the first five rows
input_data.head()

|   | Class | geometry |
|---|-------|----------|
| 0 | 1 | POLYGON ((32.49666 -3.30737, 32.49693 -3.30716... |
| 1 | 1 | POLYGON ((32.49314 -3.30836, 32.49382 -3.30847... |
| 2 | 1 | POLYGON ((32.49962 -3.31316, 32.50028 -3.31338... |
| 3 | 1 | POLYGON ((32.51721 -3.10441, 32.51716 -3.10465... |
| 4 | 1 | POLYGON ((32.38058 -2.69827, 32.38091 -2.69820... |


In [6]:
# Plot training data in an interactive map
# map_shapefile(input_data, attribute=field)

Now we can pass a custom feature layer function (here, `gm_mads_three_seasons`) to `collect_training_data`. For each of the geometries in our shapefile we will extract temporal statistics and elevation data.

Because we are now interested in calculating a range of temporal statistics, we will redefine our initial parameters to include a time series of Sentinel-2 data (whereas above we loaded data from a geomedian composite). Remember, passing in a `custom_func` to `collect_training_data` means many of the other feature layer parameters are ignored.
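
To make the idea of a custom feature layer function more concrete, here is a minimal sketch. It assumes the function receives an `xarray.Dataset` with a `time` dimension and returns a Dataset with only spatial dimensions. The function `median_and_std` is hypothetical and much simpler than `gm_mads_three_seasons`, whose source is not shown in this notebook.

```python
import numpy as np
import xarray as xr

def median_and_std(ds):
    """Hypothetical feature layer function: collapse the time dimension
    into a per-band temporal median and standard deviation."""
    med = ds.median(dim='time')
    std = ds.std(dim='time')
    # Suffix the variable names so the two sets of features stay distinguishable
    med = med.rename({name: name + '_median' for name in med.data_vars})
    std = std.rename({name: name + '_std' for name in std.data_vars})
    return xr.merge([med, std])

# Toy stand-in for a Sentinel-2 time series: 4 time steps on a 2 x 3 pixel grid
rng = np.random.default_rng(0)
ds = xr.Dataset(
    {band: (('time', 'y', 'x'), rng.random((4, 2, 3))) for band in ['red', 'nir']},
    coords={'time': np.arange(4), 'y': [0, 1], 'x': [0, 1, 2]},
)

features = median_and_std(ds)
print(list(features.data_vars))
```

Each output variable becomes one feature column once the zonal statistic has been taken over a polygon.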

In [7]:
# Set up our inputs to collect_training_data
products = ['s2_l2a']
time = ('2019-01', '2019-12')
zonal_stats = 'median'
return_coords = True

# Set up the inputs for the ODC query
measurements = ['red', 'blue', 'green', 'nir', 'swir_1', 'swir_2',
                'red_edge_1', 'red_edge_2', 'red_edge_3']
resolution = (-20, 20)
output_crs = 'epsg:6933'

In [8]:
#generate a new datacube query object
query = {
    'time': time,
    'measurements': measurements,
    'resolution': resolution,
    'output_crs': output_crs,
    'group_by' : 'solar_day',
}

In [9]:
column_names, model_input = collect_training_data(
                                    gdf=input_data,
                                    products=products,
                                    dc_query=query,
                                    ncpus=25,
                                    return_coords=return_coords,
                                    field=field,
                                    zonal_stats=zonal_stats,
                                    custom_func=gm_mads_three_seasons,
                                    clean=True,
                                    fail_threshold=0.01,
                                    max_retries=4
                                    )

Reducing data using user supplied custom function
Taking zonal statistic: median
Collecting training data in parallel mode


 30%|██▉       | 855/2880 [27:56<47:00,  1.39s/it]  Error opening source dataset: s3://sentinel-cogs/sentinel-s2-l2a-cogs/2019/S2B_36NWG_20191106_0_L2A/B03.tif
 53%|█████▎    | 1540/2880 [49:35<1:04:31,  2.89s/it]Error opening source dataset: https://sentinel-cogs.s3.us-west-2.amazonaws.com/sentinel-s2-l2a-cogs/36/N/XF/2019/5/S2B_36NXF_20190507_1_L2A/B05.tif
100%|█████████▉| 2878/2880 [1:37:55<00:04,  2.04s/it]  


Percentage of possible fails after run 1 = 18.99 %
Recollecting samples that failed


100%|██████████| 547/547 [13:55<00:00,  1.53s/it]


Percentage of possible fails after run 2 = 7.88 %
Recollecting samples that failed


  0%|          | 0/227 [00:00<?, ?it/s]Error opening source dataset: s3://sentinel-cogs/sentinel-s2-l2a-cogs/2019/S2A_35MRV_20190508_0_L2A/B02.tif
Error opening source dataset: s3://sentinel-cogs/sentinel-s2-l2a-cogs/2019/S2A_37PCL_20190112_0_L2A/B02.tif
Error opening source dataset: s3://sentinel-cogs/sentinel-s2-l2a-cogs/2019/S2A_37PDL_20190509_0_L2A/B03.tif
 99%|█████████▊| 224/227 [06:56<00:05,  1.86s/it]


Percentage of possible fails after run 3 = 3.06 %
Recollecting samples that failed


100%|██████████| 88/88 [03:27<00:00,  2.35s/it] 


Percentage of possible fails after run 4 = 1.63 %
Recollecting samples that failed


100%|██████████| 47/47 [02:07<00:00,  2.72s/it]

Removed 43 rows wth NaNs &/or Infs
Output shape:  (2828, 53)
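
The log above shows failed samples being recollected over several runs, which is what the `fail_threshold` and `max_retries` arguments control. The sketch below illustrates this general retry pattern only. It is not the implementation in `deafrica_classificationtools`; `collect_one` and `collect_with_retries` are hypothetical names, and the failures are simulated at random.

```python
import random

def collect_one(sample_id, rng, p_fail=0.2):
    """Hypothetical stand-in for extracting one sample; returns None on failure."""
    return None if rng.random() < p_fail else {'id': sample_id}

def collect_with_retries(sample_ids, fail_threshold=0.01, max_retries=4, seed=0):
    rng = random.Random(seed)
    sample_ids = list(sample_ids)
    results = {}
    pending = list(sample_ids)
    total = len(sample_ids)
    # One initial run plus up to max_retries further attempts
    for run in range(1, max_retries + 2):
        for sid in pending:
            out = collect_one(sid, rng)
            if out is not None:
                results[sid] = out
        # Samples that still have no result after this run
        pending = [sid for sid in sample_ids if sid not in results]
        fail_fraction = len(pending) / total
        print(f'Fraction of fails after run {run} = {fail_fraction:.2%}')
        if fail_fraction <= fail_threshold:
            break
    return results, pending

results, still_failed = collect_with_retries(range(200))
```

Only the samples that failed are retried, which is why each successive progress bar in the log covers fewer items.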





In [10]:
print(column_names)
print('')
print(np.array_str(model_input, precision=2, suppress_small=True))

['Class', 'red_S1', 'blue_S1', 'green_S1', 'nir_S1', 'swir_1_S1', 'swir_2_S1', 'red_edge_1_S1', 'red_edge_2_S1', 'red_edge_3_S1', 'edev_S1', 'sdev_S1', 'bcdev_S1', 'NDVI_S1', 'LAI_S1', 'MNDWI_S1', 'rain_S1', 'red_S2', 'blue_S2', 'green_S2', 'nir_S2', 'swir_1_S2', 'swir_2_S2', 'red_edge_1_S2', 'red_edge_2_S2', 'red_edge_3_S2', 'edev_S2', 'sdev_S2', 'bcdev_S2', 'NDVI_S2', 'LAI_S2', 'MNDWI_S2', 'rain_S2', 'red_S3', 'blue_S3', 'green_S3', 'nir_S3', 'swir_1_S3', 'swir_2_S3', 'red_edge_1_S3', 'red_edge_2_S3', 'red_edge_3_S3', 'edev_S3', 'sdev_S3', 'bcdev_S3', 'NDVI_S3', 'LAI_S3', 'MNDWI_S3', 'rain_S3', 'slope', 'x_coord', 'y_coord']

[[      1.         0.08       0.05 ...       2.95 3135520.   -421710.  ]
 [      1.         0.08       0.06 ...       3.17 3135170.   -421860.  ]
 [      1.         0.12       0.07 ...       3.   3137450.   -395860.  ]
 ...
 [      1.         0.12       0.06 ...       0.71 3308330.     71410.  ]
 [      1.         0.1        0.06 ...       3.71 3309170.     7646

## Separate coordinates

By setting `return_coords=True` in the `collect_training_data` function, our training data now has two extra columns called `x_coord` and `y_coord`. We need to separate these from our training dataset as they will not be used to train the machine learning model. Instead, these variables will be used to help conduct Spatial K-fold Cross Validation (SKCV) and spatially aware test-train splits in the notebook `3_Train_fit_evaluate_classifier`. For more information on why this is important, see this [article](https://www.tandfonline.com/doi/abs/10.1080/13658816.2017.1346255?journalCode=tgis20).

In [11]:
coord_variables = ['x_coord', 'y_coord']
model_col_indices = [column_names.index(var_name) for var_name in coord_variables]

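# Make sure the output directory exists before writing to it
os.makedirs("results/training_data", exist_ok=True)
np.savetxt("results/training_data/training_data_coordinates.txt", model_input[:, model_col_indices])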
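
To give a sense of how saved coordinates can support spatially aware splitting, the sketch below bins points into square blocks so that nearby samples share a group label. This is an illustrative assumption, not necessarily the method used in `3_Train_fit_evaluate_classifier`. The function `spatial_block_ids`, the `block_size` value, and the coordinates are made up for the example.

```python
import numpy as np

def spatial_block_ids(coords, block_size):
    """Assign each (x, y) point to a square block and return one integer id per point."""
    # Integer column/row index of the block each point falls in
    cells = np.floor(coords / block_size).astype(np.int64)
    # Points sharing the same (column, row) pair receive the same id
    _, ids = np.unique(cells, axis=0, return_inverse=True)
    return ids.ravel()

# Toy coordinates in metres; the first two points are a few hundred metres apart
coords = np.array([
    [3135520.0, -421710.0],
    [3135170.0, -421860.0],
    [3137450.0, -395860.0],
    [3308330.0,   71410.0],
])

groups = spatial_block_ids(coords, block_size=10_000)  # 10 km blocks
print(groups)
```

Group labels like these can be passed as the `groups` argument to a grouped splitter such as scikit-learn's `GroupKFold`, so that samples from the same block never appear in both the training and the test fold.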

## Export training data

Once we've collected all the training data we require, we can write the data to disk. This will allow us to import the data in the next step(s) of the workflow.


In [12]:
#set the name and location of the output file
output_file = "results/training_data/gm_mads_three_seasons_training_data.txt"

In [13]:
#grab all columns except the x-y coords
model_col_indices = [column_names.index(var_name) for var_name in column_names[0:-2]]
#Export files to disk
np.savetxt(output_file, model_input[:, model_col_indices], header=" ".join(column_names[0:-2]), fmt="%4f")
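
As a quick check that the exported file can be read by the next step of the workflow, the sketch below writes a small array in the same way and loads it back. The array values, the shortened column list, and the temporary file location are made up for the example.

```python
import os
import tempfile
import numpy as np

# Toy stand-ins for column_names and model_input (values are made up)
column_names = ['Class', 'red_S1', 'slope']
model_input = np.array([[1.0, 0.08, 2.95],
                        [0.0, 0.12, 0.71]])

with tempfile.TemporaryDirectory() as tmp:
    output_file = os.path.join(tmp, 'training_data.txt')
    np.savetxt(output_file, model_input, header=' '.join(column_names), fmt='%4f')

    # np.savetxt writes the header as a comment line starting with '# '
    with open(output_file) as f:
        loaded_names = f.readline().lstrip('# ').split()

    # np.loadtxt skips comment lines, so only the numeric rows are returned
    loaded = np.loadtxt(output_file)

print(loaded_names)
print(loaded.shape)
```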