In [None]:
from eoxhub import check_compatibility
check_compatibility("user-2022.10-14", dependencies=[])

# Crop-classification using Sentinel-2 time-series

This notebook implements a deep-learning crop-classification algorithm for Sentinel-2 time-series. The input time-series are derived from the signals computed by the Eurocrops BYOA. The method implemented here is described in more detail in the [crop-classification marker blog-post](https://medium.com/sentinel-hub/area-monitoring-crop-type-marker-1e70f672bf44).

For more examples on how to create markers for monitoring agricultural activity using Sentinel-2 signals, consult [this blog series](https://medium.com/sentinel-hub/area-monitoring-concept-effc2c262583).

This notebook will use a sample of pre-downloaded signals, and can be run on a CPU-based instance or laptop.

**Table of Contents**:

 0. [Constants](#constants)
 1. [Retrieve signals and labels](#retrieve-signals)
 2. [AI-ready dataset](#dataset)
 3. [Model training](#model-training)
 4. [Model evaluation](#model-evaluation)

In [None]:
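import os
import sys

# assumption: the local `model` package imported below lives in this folder,
# so the folder is added to the import path (the bare expression had no effect)
sys.path.append(os.environ["EDC_PATH"] + "/notebooks/contributions/eurocrops-model")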

In [None]:
import json
import logging
import subprocess
import sys
import zipfile

import geopandas as gpd
import numpy as np
import pandas as pd
from model.lstm import LSTM
from model.polygon import PolyDataset
from model.transforms import get_sample_n_timestamps
from model.utils import test, train
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from torch.optim import Adam
from torch.utils.data import DataLoader

from sentinelhub import parse_time

## 0. Constants <a name=constants></a>

This section initialises paths, constants and utility functions used in the notebook.

In [None]:
# path to local folders, change as desired
INPUT_FOLDER = "./input"
RESULTS_FOLDER = "./output"

# name of files and folders of downloaded signals, do not change
DATAFILE = "ml-example-data.zip"
DOWNLOAD_URL = "https://sinergise0-my.sharepoint.com/:u:/g/personal/nejc_vesel_sinergise_com/ETMx7NG-JHpBntNMJfnsCOMBVuEegDjYq8WtTmJYl8tZ-A?e=Ck5opE&download=1"
EUROCROPS_GPKG = "input_geometries.gpkg"
SIGNALS_FOLDER = "ml-example-signals"


# utility function to read json payload into a dataframe
def stats_to_df(stats_data):
    """Transform Statistical API response into a pandas.DataFrame"""
    df_data = []

    for single_data in stats_data["data"]:
        df_entry = {}
        is_valid_entry = True

        df_entry["interval_from"] = parse_time(single_data["interval"]["from"]).date()
        df_entry["interval_to"] = parse_time(single_data["interval"]["to"]).date()

        for output_name, output_data in single_data["outputs"].items():
            for band_name, band_values in output_data["bands"].items():
                band_stats = band_values["stats"]
                if band_stats["sampleCount"] == band_stats["noDataCount"]:
                    is_valid_entry = False
                    break

                for stat_name, value in band_stats.items():
                    col_name = f"{output_name}_{band_name}_{stat_name}"
                    if stat_name == "percentiles":
                        for perc, perc_val in value.items():
                            perc_col_name = f"{col_name}_{perc}"
                            df_entry[perc_col_name] = perc_val
                    else:
                        df_entry[col_name] = value

        if is_valid_entry:
            df_data.append(df_entry)

    return pd.DataFrame(df_data)


# utility function to log training progress
def define_logger(logger_name) -> logging.Logger:
    """Create an INFO-level logger that writes to stdout"""
    logger = logging.getLogger(logger_name)
    logger.setLevel(logging.INFO)
    formatter = logging.Formatter("[%(asctime)s] {%(filename)s:%(lineno)d} %(levelname)s - %(message)s")

    # attach the handler only once, so re-running the cell does not duplicate log lines
    if not logger.handlers:
        stdout_handler = logging.StreamHandler(sys.stdout)
        stdout_handler.setFormatter(formatter)
        logger.addHandler(stdout_handler)
    return logger

## 1. Retrieve signals and labels <a name=retrieve-signals></a>

In this section, the signals are read from JSON files, and the labels used for model training are extracted from the Eurocrops dataset. Sample files are provided for this notebook and are downloaded locally in the following cells.

In [None]:
# create the input folder if it does not exist yet
os.makedirs(INPUT_FOLDER, exist_ok=True)

In [None]:
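# pass the arguments as a list so that the URL and paths are never split on spaces;
# check=True raises an error if the download fails
subprocess.run(["wget", DOWNLOAD_URL, "-O", os.path.join(INPUT_FOLDER, DATAFILE)], check=True)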

In [None]:
# unzip files
with zipfile.ZipFile(os.path.join(INPUT_FOLDER, DATAFILE), "r") as zip_ref:
    zip_ref.extractall(INPUT_FOLDER)

In [None]:
!ls {INPUT_FOLDER}

Read signals from JSON files as returned by Statistical API and create a dataframe.

In [None]:
dfs = []
signals_dir = os.path.join(INPUT_FOLDER, SIGNALS_FOLDER)
for result_json in os.listdir(signals_dir):
    with open(os.path.join(signals_dir, result_json)) as f:
        result = json.load(f)

    # one dataframe per field, tagged with the field identifier
    result_df = stats_to_df(result["response"])
    result_df["identifier"] = int(result["identifier"])
    dfs.append(result_df)

signals = pd.concat(dfs)
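
To make the flattening concrete, the sketch below applies the same column-naming rule as `stats_to_df` (`{output}_{band}_{stat}`) to a hand-written payload. The payload values and the helper `flatten_interval` are illustrative assumptions, not data or code from this notebook; only the nesting of keys mirrors what `stats_to_df` reads.

```python
# Illustrative payload: one interval, one output ("bands"), one band ("B0").
# The numbers are made up; only the key layout follows what stats_to_df expects.
sample_interval = {
    "interval": {"from": "2020-04-01T00:00:00Z", "to": "2020-04-02T00:00:00Z"},
    "outputs": {
        "bands": {
            "bands": {
                "B0": {"stats": {"mean": 0.12, "sampleCount": 100, "noDataCount": 4}}
            }
        }
    },
}


def flatten_interval(interval):
    """Hypothetical helper: flatten one interval into {column_name: value}."""
    row = {}
    for output_name, output_data in interval["outputs"].items():
        for band_name, band_values in output_data["bands"].items():
            for stat_name, value in band_values["stats"].items():
                row[f"{output_name}_{band_name}_{stat_name}"] = value
    return row


flat_row = flatten_interval(sample_interval)
print(sorted(flat_row))
```

Column names such as `bands_B0_mean` are what the feature selection later in the notebook filters on (prefix `bands_`, suffix `_mean`).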

Check the size of the dataset and inspect a single observation.

In [None]:
len(signals)

In [None]:
signals.iloc[0]

Read the labels from the provided GeoPackage file. These labels can be retrieved directly from GeoDB.

In [None]:
eurocrops_gdf = gpd.read_file(os.path.join(INPUT_FOLDER, EUROCROPS_GPKG))

Merge signals and crop labels into a single dataframe, by looking at the identifier of the field of interest (FOI).

In [None]:
eurocrops_signals = pd.merge(eurocrops_gdf, signals, on="identifier")

In [None]:
eurocrops_signals.iloc[0]

## 2. AI-ready dataset <a name=dataset></a>

This section adds derived columns (cloud probability, NDVI and day-of-year) to the raw band statistics and prepares the inputs used by the model.

In [None]:
# add column for cloud probability
eurocrops_signals["CLP"] = eurocrops_signals["clp_B0_mean"] / 255
# compute NDVI
eurocrops_signals["NDVI"] = (eurocrops_signals["bands_B7_mean"] - eurocrops_signals["bands_B3_mean"]) / (
    eurocrops_signals["bands_B7_mean"] + eurocrops_signals["bands_B3_mean"]
)
# compute day-of-year from timestamp
eurocrops_signals["DOY"] = eurocrops_signals.interval_from.apply(lambda x: x.timetuple().tm_yday)
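
As a quick sanity check of the two derived quantities, the sketch below computes NDVI and day-of-year for single made-up values in plain Python. It assumes, as the formula above implies, that `bands_B7_mean` and `bands_B3_mean` hold the near-infrared and red bands; the function name `ndvi` and the input numbers are illustrative only.

```python
from datetime import date


def ndvi(nir, red):
    """Normalised difference of near-infrared and red reflectance."""
    return (nir - red) / (nir + red)


# made-up reflectances for a densely vegetated field
example_ndvi = ndvi(nir=0.45, red=0.05)

# day-of-year as computed in the cell above: 1 for 1 January, 61 for 1 March in a leap year
example_doy = date(2020, 3, 1).timetuple().tm_yday

print(round(example_ndvi, 2), example_doy)
```

NDVI lies between -1 and 1 for non-negative reflectances, with high values typically indicating dense green vegetation.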

In [None]:
# get name of columns to be used as features, i.e. mean values of raw bands
feature_cols = [x for x in eurocrops_signals.columns if x.startswith("bands_") and x.endswith("_mean")]
# name of utility features
doy_feature = "DOY"
crop_type_feature = "ec_hcat_c"
crop_name_feature = "ec_hcat_n"
label_feature = "label"
identifier_feature = "identifier"

Map each Eurocrops crop-type code found in the data to an integer class label, and keep a lookup from crop-type code to crop name.

In [None]:
crop_id_to_label_mapping = {val: idx for idx, val in enumerate(eurocrops_signals[crop_type_feature].unique())}

In [None]:
crop_id_to_name_mapping = {
    crop_id: crop_name
    for crop_id, crop_name in eurocrops_signals[[crop_type_feature, crop_name_feature]].drop_duplicates().values
}

In [None]:
eurocrops_signals[label_feature] = eurocrops_signals[crop_type_feature].map(crop_id_to_label_mapping)
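
The mapping above can be illustrated without pandas. In this minimal sketch the crop codes are placeholders (not real Eurocrops codes), and `dict.fromkeys` stands in for `Series.unique()`, since both keep values in order of first appearance.

```python
# placeholder crop codes, one entry per observation (not real Eurocrops codes)
codes = ["code_a", "code_b", "code_a", "code_c"]

# unique codes in order of first appearance
unique_codes = list(dict.fromkeys(codes))

# consecutive integer labels starting at 0, as expected by a classifier
code_to_label = {code: idx for idx, code in enumerate(unique_codes)}
labels = [code_to_label[code] for code in codes]

print(code_to_label, labels)
```

Because the labels depend on the order in which codes appear, the mapping should be stored together with any trained model so that predictions can be translated back to crop types.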

Split the signals into a training and a validation set, splitting by field identifier so that all observations of a field end up in the same set. These datasets are for demonstration only: in practice, a larger dataset and more robust validation strategies would be needed to estimate the model's performance reliably.

In [None]:
train_ids, val_ids = train_test_split(
    eurocrops_signals[identifier_feature].unique(), train_size=0.6, test_size=0.4, random_state=42
)

In [None]:
train_df = eurocrops_signals[eurocrops_signals[identifier_feature].isin(train_ids)]
val_df = eurocrops_signals[eurocrops_signals[identifier_feature].isin(val_ids)]
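
The reason for splitting on identifiers rather than on rows can be shown with a small stand-alone sketch. It uses the standard library instead of `train_test_split`, so the shuffling is not the same as scikit-learn's; the field ids and observation tuples are made up.

```python
import random

# made-up data: 10 fields with 3 observations each, stored as (field_id, time_index)
observations = [(field_id, t) for field_id in range(10) for t in range(3)]

# shuffle the unique field ids, then cut at 60 %
field_ids = sorted({field_id for field_id, _ in observations})
rng = random.Random(42)
rng.shuffle(field_ids)
n_train = round(0.6 * len(field_ids))
train_field_ids = set(field_ids[:n_train])
val_field_ids = set(field_ids[n_train:])

# every observation follows its field into one of the two sets
train_obs = [obs for obs in observations if obs[0] in train_field_ids]
val_obs = [obs for obs in observations if obs[0] in val_field_ids]

print(len(train_obs), len(val_obs))
```

Splitting rows directly would let observations of the same field appear in both sets, which tends to make validation scores look better than they are.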

Create the training and validation datasets to be used for model training and validation.

In [None]:
train_poly_dataset = PolyDataset(
    train_df,
    feature_cols=feature_cols,
    label_col=label_feature,
    poly_id_col=identifier_feature,
    doys_col=doy_feature,
    online_transform=get_sample_n_timestamps(40),
)

val_poly_dataset = PolyDataset(
    val_df,
    feature_cols=feature_cols,
    label_col=label_feature,
    poly_id_col=identifier_feature,
    doys_col=doy_feature,
    online_transform=get_sample_n_timestamps(40),
)
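
Batching with a `DataLoader` generally needs every sample to have the same sequence length, which is presumably why a fixed number of timestamps (40) is drawn per field. The implementation of `get_sample_n_timestamps` is not shown in this notebook; the sketch below is only an assumption about how such a transform could work, with a hypothetical function name and made-up data.

```python
import random


def sample_n_timestamps(series, n, rng):
    """Hypothetical transform: return exactly n time steps in temporal order.

    Samples without replacement when enough observations exist,
    and with replacement (repeating some steps) otherwise.
    """
    positions = range(len(series))
    if len(series) >= n:
        chosen = rng.sample(positions, n)
    else:
        chosen = rng.choices(positions, k=n)
    return [series[i] for i in sorted(chosen)]


rng = random.Random(0)
long_series = [(doy, 0.1 * doy) for doy in range(1, 61)]  # 60 made-up (doy, value) pairs
short_series = long_series[:25]                           # fewer than 40 observations

fixed_long = sample_n_timestamps(long_series, 40, rng)
fixed_short = sample_n_timestamps(short_series, 40, rng)
print(len(fixed_long), len(fixed_short))
```

Under this assumption, every call yields a different random subset, which would also act as a light form of data augmentation during training.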

## 3. Model training <a name=model-training></a>

In this section, an LSTM model is trained on the signals to estimate the crop type. The LSTM hyperparameters may need tuning for different use cases.

In [None]:
BATCH_SIZE = 16
N_WORKERS = 4
SHUFFLE = True

In [None]:
train_loader = DataLoader(dataset=train_poly_dataset, batch_size=BATCH_SIZE, num_workers=N_WORKERS, shuffle=SHUFFLE)

val_loader = DataLoader(dataset=val_poly_dataset, batch_size=BATCH_SIZE, num_workers=N_WORKERS, shuffle=False)

Initialise the model.

In [None]:
lstm = LSTM(
    input_dim=len(feature_cols),
    n_classes=eurocrops_signals[label_feature].nunique(),
    hidden_dims=128,
    num_rnn_layers=3,
    dropout=0.2,
    bidirectional=True,
    use_batchnorm=False,
    use_layernorm=True,
)

Initialise the optimiser.

In [None]:
optimizer = Adam(
    filter(lambda x: x.requires_grad, lstm.parameters()),
    betas=(0.9, 0.98),
    eps=1e-09,
    lr=0.001,
)

Initialise the logger.

In [None]:
logger = define_logger("Training")

Train the model.

In [None]:
lstm = train(lstm, optimizer, train_loader, val_loader, 30, verbose=False, logger=logger)

## 4. Model evaluation <a name=model-evaluation></a>

This section evaluates the performance of the trained model on the validation dataset. Performance is displayed as a confusion matrix, which compares the estimated crop types with the reference crop types.

In [None]:
import numpy as np

In [None]:
def plot_confusion_matrix(cm,
                          target_names,
                          title='Confusion matrix',
                          cmap=None,
                          normalize=True):
    """
    given a sklearn confusion matrix (cm), make a nice plot

    Arguments
    ---------
    cm:           confusion matrix from sklearn.metrics.confusion_matrix

    target_names: given classification classes such as [0, 1, 2]
                  the class names, for example: ['high', 'medium', 'low']

    title:        the text to display at the top of the matrix

    cmap:         the gradient of the values displayed from matplotlib.pyplot.cm
                  see http://matplotlib.org/examples/color/colormaps_reference.html
                  plt.get_cmap('jet') or plt.cm.Blues

    normalize:    If False, plot the raw numbers
                  If True, plot the proportions

    Usage
    -----
    plot_confusion_matrix(cm           = cm,                  # confusion matrix created by
                                                              # sklearn.metrics.confusion_matrix
                          normalize    = True,                # show proportions
                          target_names = y_labels_vals,       # list of names of the classes
                          title        = best_estimator_name) # title of graph

    Citation
    ---------
    http://scikit-learn.org/stable/auto_examples/model_selection/plot_confusion_matrix.html

    """
    import matplotlib.pyplot as plt
    import numpy as np
    import itertools

    accuracy = np.trace(cm) / np.sum(cm).astype('float')
    misclass = 1 - accuracy

    if cmap is None:
        cmap = plt.get_cmap('Blues')

    # normalise before drawing, so that the cell colours match the printed values
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]

    plt.figure(figsize=(20, 20))
    plt.imshow(cm, interpolation='nearest', cmap=cmap)

    if target_names is not None:
        tick_marks = np.arange(len(target_names))
        plt.xticks(tick_marks, target_names, rotation=90)
        plt.yticks(tick_marks, target_names)

    thresh = np.nanmax(cm) / 1.5 if normalize else np.nanmax(cm) / 2
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        if normalize:
            plt.text(j, i, "{:0.2f}".format(cm[i, j]),
                     horizontalalignment="center",
                     color="white" if cm[i, j] > thresh else "black")
        else:
            plt.text(j, i, "{:,}".format(cm[i, j]),
                     horizontalalignment="center",
                     color="white" if cm[i, j] > thresh else "black")


    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label\naccuracy={:0.2f}; misclass={:0.2f}'.format(accuracy, misclass))
    plt.show()


In [None]:
predictions, targets, polygon_ids, logprobabilities = test(lstm, val_loader)

In [None]:
cm = confusion_matrix(targets, predictions, labels=list(crop_id_to_label_mapping.values()))

In [None]:
plot_confusion_matrix(cm, target_names=[crop_id_to_name_mapping[x] for x in crop_id_to_label_mapping.keys()])
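
To help read the plot, the sketch below reproduces the two quantities it reports, overall accuracy and row-normalised proportions, on a small made-up confusion matrix in plain Python. The counts are illustrative and unrelated to the model trained above.

```python
# made-up confusion matrix: rows are true classes, columns are predicted classes
cm_counts = [
    [8, 2, 0],
    [1, 6, 3],
    [0, 0, 5],
]

# overall accuracy: correctly classified samples (the diagonal) over all samples
total = sum(sum(row) for row in cm_counts)
correct = sum(cm_counts[i][i] for i in range(len(cm_counts)))
overall_accuracy = correct / total

# row normalisation: each row sums to 1, so the diagonal holds the per-class recall
row_normalised = [[value / sum(row) for value in row] for row in cm_counts]

print(overall_accuracy)
print([round(row[i], 2) for i, row in enumerate(row_normalised)])
```

With imbalanced classes, the overall accuracy can be dominated by the most frequent crop types, so the per-row proportions are usually the more informative view.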