# Training feature extraction

## Background

Based on our initial exploration and understanding of the information that can be provided by the Earth observation (EO) data, we want to use both spectral and temporal EO measurements to classify crop types. We can also use information about the landscape, such as slope, as a predictor for crop type.

In a supervised machine learning approach, we will first build a labelled training dataset, combining the crop labels and their associated EO-derived information that can be extracted from the DE Africa platform. The training dataset is transformed into a set of features that can be understood by the machine learning algorithm.


## Description

Different machine learning models and implementations require different types of training data. We use [`scikit-learn`](https://scikit-learn.org/stable/), a powerful Python library with a comprehensive set of machine learning algorithms and tools.

In this notebook, we demonstrate how to extract data from the DE Africa platform, combine them with crop labels, and transform them into a set of features that will be used to train a machine learning model.


## Getting started

To run this analysis, run all the cells in the notebook, starting with the "Load packages" cell.

### Load packages

In [1]:
import os
import geopandas as gpd
import numpy as np
import pandas as pd
import json
import pickle
from deafrica_tools.classification import collect_training_data
from odc.io.cgroups import get_cpu_quota
from sklearn.preprocessing import LabelEncoder

from feature_collection import feature_layers



In [2]:
# Create data and results directories if they don't exist
input_folder="Results/Labels"
output_folder="Results/Model"
output_crs="EPSG:6933"
os.makedirs(input_folder, exist_ok=True)
os.makedirs(output_folder, exist_ok=True)

## Load training crop labels

We start with the cleaned and merged crop labels, which include four crop classes: Maize, Sesame, Soy, and Others.

In [3]:
# Point to crop type training data
path= os.path.join(input_folder,"Cash_crop_type_subset_single_crops_merged.shp")

# Load input data and project
single_crops_subset = gpd.read_file(path).to_crs(output_crs)
single_crops_subset.head()

Unnamed: 0,year,Crop_type,geometry
0,2021,Maize,"POLYGON ((3280425.252 -2409501.264, 3280415.15..."
1,2021,Maize,"POLYGON ((3280689.817 -2409502.893, 3280704.76..."
2,2021,Sesame,"POLYGON ((3529816.669 -2109463.643, 3529807.92..."
3,2021,Maize,"POLYGON ((3280542.444 -2408144.561, 3280512.19..."
4,2021,Maize,"POLYGON ((3301561.584 -2393086.954, 3301569.64..."


### Map crop types to numeric classes for prediction

The crop type labels must be converted to integers before the machine learning algorithm can use them.
We save the mapping as a JSON file so that predictions can later be transformed back into crop type labels.

In [4]:
# Select field to label
field = "Crop_type"

# Fit label encoder to match classes to numeric labels
le = LabelEncoder()
le.fit(single_crops_subset[field])

# Get a list of the crop types
classes = list(le.classes_)

# Assign numeric label for each class
single_crops_subset["label"] = le.transform(single_crops_subset[field])

# Create a dictionary mapping classes to numeric labels
class_dictionary = {crop_class: int(le.transform([crop_class])[0]) for crop_class in classes}
print("Class Dictionary:")
print(class_dictionary)

# Export class dictionary
with open(os.path.join(output_folder,"class_labels.json"), 'w', encoding='utf-8') as f:
    json.dump(class_dictionary, f, ensure_ascii=False, indent=4)

Class Dictionary:
{'Maize': 0, 'Others': 1, 'Sesame': 2, 'Soy': 3}
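
To turn numeric predictions back into crop names later, the saved mapping can be reloaded and inverted. The following is a minimal sketch, assuming the JSON file has the structure printed above. The `predictions` list is made up for illustration, and a temporary folder stands in for `Results/Model`.

```python
import json
import os
import tempfile

# Mapping as printed above, written to a temporary folder so the sketch is self-contained
class_dictionary = {"Maize": 0, "Others": 1, "Sesame": 2, "Soy": 3}
json_path = os.path.join(tempfile.mkdtemp(), "class_labels.json")
with open(json_path, "w", encoding="utf-8") as f:
    json.dump(class_dictionary, f, ensure_ascii=False, indent=4)

# Reload the mapping and invert it (numeric label -> crop type)
with open(json_path, "r", encoding="utf-8") as f:
    loaded = json.load(f)
label_to_class = {label: crop_class for crop_class, label in loaded.items()}

# Hypothetical numeric predictions, for illustration only
predictions = [0, 3, 2, 1, 0]
predicted_crops = [label_to_class[p] for p in predictions]
print(predicted_crops)
```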


## Prepare query for feature extraction

Our machine learning approach uses measurements from statistical summary products as input features to ensure data are not affected by cloud cover in individual images. Temporal coverage of these products should be defined based on collection time of the labeled data and the crop year to map.

Specifically, we define time windows below for three input datasets:

* `semiannual_geomad_times`: for Semi-annual [GeoMAD product](https://docs.digitalearthafrica.org/en/latest/data_specs/GeoMAD_specs.html)
* `monthly_ndvi_time_range`: for time range of [Monthly NDVI](https://docs.digitalearthafrica.org/en/latest/data_specs/NDVI_Anomaly_specs.html)
* `ls_fc_cover_times`: for time ranges of [Fractional Cover](https://docs.digitalearthafrica.org/en/latest/data_specs/Fractional_Cover_specs.html)

Each of these is configured as a dictionary, where the key indicates the feature name and the value is a date or date range for the data query (in "YYYY-MM-DD" format).

The crop labels we use were collected in April 2022, so we use time periods that overlap with the 2021/2022 crop season.

The example list of parameters below retrieves:

* Two Digital Earth Africa Semi-annual [GeoMAD](https://docs.digitalearthafrica.org/en/latest/data_specs/GeoMAD_specs.html), for the second half of 2021 and the first half of 2022.
* Digital Earth Africa [Monthly NDVI](https://docs.digitalearthafrica.org/en/latest/data_specs/NDVI_Anomaly_specs.html), from Oct 2021 to Sep 2022.
* Four quarterly medians of [Fractional Cover](https://docs.digitalearthafrica.org/en/latest/data_specs/Fractional_Cover_specs.html), from Oct 2021 to Sep 2022.

```
semiannual_geomad_times = {
    "semiannual_2021_07": "2021-07-01",
    "semiannual_2022_01": "2022-01-01",
}

monthly_ndvi_time_range=("2021-10","2022-09")

ls_fc_cover_times = {
    "Q4_2021": slice("2021-10-01", "2021-12-31"),
    "Q1_2022": slice("2022-01-01", "2022-03-31"),
    "Q2_2022": slice("2022-04-01", "2022-06-30"),
    "Q3_2022": slice("2022-07-01", "2022-09-30"),
}
```

In addition to the time series measurements, we also use slope derived from the Digital Elevation Model (DEM) as an input feature.

In [5]:
semiannual_geomad_times = {
    "semiannual_2021_07": "2021-07-01",
    "semiannual_2022_01": "2022-01-01",
}

monthly_ndvi_time_range=("2021-10","2022-09")

ls_fc_cover_times = {
    "Q4_2021": slice("2021-10-01", "2021-12-31"),
    "Q1_2022": slice("2022-01-01", "2022-03-31"),
    "Q2_2022": slice("2022-04-01", "2022-06-30"),
    "Q3_2022": slice("2022-07-01", "2022-09-30"),
}
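
When mapping a different crop year, the quarterly windows can be generated rather than typed by hand. The sketch below is one possible approach using `pandas.period_range`. The variable name `generated_fc_times` is hypothetical, and the key format simply mirrors the `"Q4_2021"` style used above.

```python
import pandas as pd

# Quarterly periods covering Oct 2021 to Sep 2022
quarters = pd.period_range(start="2021Q4", end="2022Q3", freq="Q")

# Build a dictionary in the same shape as ls_fc_cover_times:
# feature name -> slice(start date, end date) in "YYYY-MM-DD" format
generated_fc_times = {
    f"Q{p.quarter}_{p.year}": slice(
        p.start_time.strftime("%Y-%m-%d"), p.end_time.strftime("%Y-%m-%d")
    )
    for p in quarters
}

for name, window in generated_fc_times.items():
    print(name, window.start, window.stop)
```

Changing the `start` and `end` quarters shifts all four windows at once, which keeps the feature names and date ranges consistent with each other.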

We also need to set the spatial requirements and combine all parameters into a query dictionary. This query dictionary is saved and will be used later.

In [6]:
resolution = (-10, 10)
query = {
    #"annual_geomad_times": annual_geomad_times,
    "semiannual_geomad_times": semiannual_geomad_times,
    "monthly_ndvi_time_range":monthly_ndvi_time_range,
    "ls_fc_cover_times": ls_fc_cover_times,
    
    "resolution": resolution,
    "output_crs": output_crs,
}

# Export query to pickle file for future re-use
output_query=os.path.join(output_folder,'query.pickle')
with open(output_query, 'wb') as f:
    pickle.dump(query, f)
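
Reloading the saved query later follows the same pattern in reverse. The sketch below uses a reduced stand-in dictionary (`example_query`, a hypothetical name) and a temporary folder rather than the real `Results/Model/query.pickle`.

```python
import os
import pickle
import tempfile

# A reduced stand-in for the query above, saved to a temporary folder
example_query = {
    "monthly_ndvi_time_range": ("2021-10", "2022-09"),
    "ls_fc_cover_times": {"Q4_2021": slice("2021-10-01", "2021-12-31")},
    "resolution": (-10, 10),
    "output_crs": "EPSG:6933",
}
query_path = os.path.join(tempfile.mkdtemp(), "query.pickle")
with open(query_path, "wb") as f:
    pickle.dump(example_query, f)

# Reload the query, as a later notebook might do
with open(query_path, "rb") as f:
    reloaded_query = pickle.load(f)

print(reloaded_query["resolution"], reloaded_query["ls_fc_cover_times"]["Q4_2021"])
```

Pickle preserves Python objects such as tuples and `slice` values, which a JSON file would not store directly. Only unpickle files that you created or otherwise trust.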

## Collect training data

We use the `collect_training_data()` function to extract data from the DE Africa platform.
By default, the method below will run in parallel mode, which decreases the amount of time to run feature extraction for each geometry. 

We also pass in the `feature_layers()` function defined in `feature_collection.py`, which is applied to each geometry. If `feature_layers()` is not defined correctly, the extraction may fail with an error.

### Debugging
When testing, it is suggested you set `parallel = False` below to switch back to serial mode. 

You can also set `gdf = single_crops_subset.iloc[0:5, :].copy()` in the function call to only run on the first five geometries.

> This step may take a few hours to run over a thousand polygons, depending on the number of features to extract and how much processing is required to obtain the measurements.

In [7]:
# Set parallel mode on or off (set to False if testing a new feature extraction function).
parallel = True

if parallel:
    ncpus = round(get_cpu_quota())
else:
    ncpus = 1
    
print("ncpus = " + str(ncpus))

ncpus = 4


For testing this workflow, we extract data over a random subsample of the labeled polygons. To do this we set `subsample = True`. In this instance features will be extracted over 10 randomly selected polygons. 

Collecting training data over the entire area would require you to set `subsample = False`.

> **Note:** In the following notebooks, we use the pre-loaded training data that has been collected over the entire test area. 

In [8]:
subsample = True
if subsample: 
    n_subset = 10
    n_sample = len(single_crops_subset)
    if n_sample > n_subset:
        # Sample without replacement so no polygon is selected twice
        subset = np.random.choice(
            single_crops_subset.index.values, n_subset, replace=False)
        single_crops_subset = single_crops_subset.loc[subset]
        single_crops_subset.reset_index(inplace=True)
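
If the same subsample needs to be drawn on every run, one alternative is `DataFrame.sample` with a fixed `random_state`, which a `GeoDataFrame` inherits from pandas. This is a sketch on a plain stand-in table (`labels`, with geometry omitted), not the notebook's own method. The seed value `42` is arbitrary.

```python
import pandas as pd

# Stand-in for the labelled polygons (geometry omitted for brevity)
labels = pd.DataFrame({"Crop_type": ["Maize", "Sesame", "Soy", "Others"] * 10})

n_subset = 10
if len(labels) > n_subset:
    # sample() draws without replacement by default; random_state fixes the draw
    labels_subset = labels.sample(n=n_subset, random_state=42).reset_index(drop=True)
else:
    labels_subset = labels.copy()

print(labels_subset.shape)
```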

In [9]:
%%time

# Collect the training data
column_names, model_input = collect_training_data(
    gdf=single_crops_subset,
    dc_query=query,
    ncpus=ncpus,
    field="label",
    zonal_stats=None,
    feature_func=feature_layers,
)

Collecting training data in parallel mode


  0%|          | 0/10 [00:00<?, ?it/s]

Percentage of possible fails after run 1 = 0.0 %
Removed 0 rows wth NaNs &/or Infs
Output shape:  (686, 66)
CPU times: user 92.2 ms, sys: 42 ms, total: 134 ms
Wall time: 4min 2s


In [10]:
# Print the list of features collected
print("# of training features collected:", len(column_names)-1)
print("List of training features collected:", column_names[1:])

# of training features collected: 65
List of training features collected: ['blue_s2_semiannual_2021_07', 'green_s2_semiannual_2021_07', 'red_s2_semiannual_2021_07', 'nir_1_s2_semiannual_2021_07', 'nir_2_s2_semiannual_2021_07', 'swir_1_s2_semiannual_2021_07', 'swir_2_s2_semiannual_2021_07', 'red_edge_1_s2_semiannual_2021_07', 'red_edge_2_s2_semiannual_2021_07', 'red_edge_3_s2_semiannual_2021_07', 'smad_s2_semiannual_2021_07', 'emad_s2_semiannual_2021_07', 'bcmad_s2_semiannual_2021_07', 'NDVI_s2_semiannual_2021_07', 'LAI_s2_semiannual_2021_07', 'SAVI_s2_semiannual_2021_07', 'MSAVI_s2_semiannual_2021_07', 'MNDWI_s2_semiannual_2021_07', 'blue_s2_semiannual_2022_01', 'green_s2_semiannual_2022_01', 'red_s2_semiannual_2022_01', 'nir_1_s2_semiannual_2022_01', 'nir_2_s2_semiannual_2022_01', 'swir_1_s2_semiannual_2022_01', 'swir_2_s2_semiannual_2022_01', 'red_edge_1_s2_semiannual_2022_01', 'red_edge_2_s2_semiannual_2022_01', 'red_edge_3_s2_semiannual_2022_01', 'smad_s2_semiannual_2022_01', 'emad
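
Because the feature list is taken from `column_names[1:]`, the first column of `model_input` is assumed here to hold the numeric class label. The sketch below uses a small synthetic array with hypothetical column names to show how labels and features could be separated and how samples per class could be counted.

```python
import numpy as np

# Synthetic stand-in for the outputs of collect_training_data()
column_names = ["label", "feature_a", "feature_b"]  # hypothetical names
model_input = np.array([
    [0.0, 0.12, 0.56],
    [2.0, 0.34, 0.78],
    [0.0, 0.90, 0.11],
])

# First column: numeric class label; remaining columns: features
y = model_input[:, 0].astype(int)
X = model_input[:, 1:]

# Count samples per class
classes, counts = np.unique(y, return_counts=True)
class_counts = dict(zip(classes.tolist(), counts.tolist()))
print(class_counts)
print(X.shape)
```

Checking the class counts at this stage helps reveal whether a small random subsample has missed one of the crop classes entirely.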

## Export training data

Finally, we export the training data to a CSV file.

In [11]:
# set the name and location of the output file
if subsample:
    output_file = os.path.join(
        output_folder, 'single_crops_merged_training_features_2022_subsample.csv')
else:
    output_file = os.path.join(
        output_folder, 'single_crops_merged_training_features_2022_all.csv')

# convert to a dataframe and save as a csv file
model_input_df = pd.DataFrame(model_input, columns=column_names)
model_input_df.to_csv(output_file, index=False)
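
A quick way to confirm the export is to read the file back and check its shape and completeness. The sketch below writes a small synthetic table to a temporary folder. The file name and column names are hypothetical stand-ins for the real output.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Synthetic stand-in for model_input and column_names
column_names = ["label", "feature_a", "feature_b"]  # hypothetical names
model_input = np.array([[0.0, 0.12, 0.56], [3.0, 0.34, 0.78]])

# Write the table the same way as above, but to a temporary folder
output_file = os.path.join(tempfile.mkdtemp(), "training_features_example.csv")
pd.DataFrame(model_input, columns=column_names).to_csv(output_file, index=False)

# Read it back and run basic checks
reloaded = pd.read_csv(output_file)
print(reloaded.shape)
print("Missing values:", int(reloaded.isna().sum().sum()))
```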