![](./resources/System_v1_training_header.png)

This notebook demonstrates how to train a custom temporary crop extent model on your own reference data and how to apply the resulting model to generate a custom temporary crop extent map.

# Content

- [Before you start](#before-you-start)
- [1. Define a region of interest](#1.-Define-a-region-of-interest)
- [2. Check existing in situ reference data](#2.-Check-existing-in-situ-reference-data)
- [3. Prepare own reference data](#3.-Prepare-own-reference-data)
- [4. Extract required model inputs](#4.-Extract-required-model-inputs)
- [5. Train custom classification model](#5.-Train-custom-classification-model)
- [6. Deploy custom model](#6.-Deploy-custom-model)
- [7. Generate a map](#7.-Generate-a-map)

# Before you start

In order to run this notebook, you need to create an account on:

- The Copernicus Data Space Ecosystem (CDSE)
--> by completing the form [HERE](https://identity.dataspace.copernicus.eu/auth/realms/CDSE/login-actions/registration?client_id=cdse-public&tab_id=eRKGqDvoYI0)

- VITO's Terrascope platform
--> by completing the form [HERE](https://sso.terrascope.be/auth/realms/terrascope/login-actions/registration?client_id=drupal-terrascope&tab_id=irBzckp2aDo)

In [2]:
%load_ext autoreload
%autoreload 2

In [1]:
# TODO: Fix access to utils script avoiding this import

import sys

sys.path.append(
    "/home/jeroendegerickx/git/worldcereal/worldcereal-classification/notebooks"
)

# 1. Define a region of interest

Running the code snippet below displays an interactive map.
Click the Rectangle button on the left-hand side of the map to start drawing your region of interest.
When finished, execute the second cell to store the coordinates of your region of interest.

In [2]:
from worldcereal.utils.map import get_ui_map

m, dc = get_ui_map()
m

Map(center=[51.1872, 5.1154], controls=(ZoomControl(options=['position', 'zoom_in_text', 'zoom_in_title', 'zoo…

In [3]:
# retrieve bounding box from drawn rectangle
from worldcereal.utils.map import get_bbox_from_draw

spatial_extent, bbox, poly = get_bbox_from_draw(dc)

# TODO: relax processing area limit but raise a warning containing an estimate of processing credits consumption for large areas?

Your area of interest: (110.139282, -7.383168, 110.147519, -7.373634)
Area of processing extent: 0.96 km²
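
As a rough cross-check of the reported processing area, the extent of a small longitude/latitude box can be approximated on a sphere. This is a minimal sketch, not the method used by `get_bbox_from_draw` (its implementation is not shown in this notebook); the helper name `approx_bbox_area_km2` and the mean Earth radius constant are assumptions made for illustration.

```python
import math

# Assumption: mean Earth radius in km, good enough for a rough estimate
EARTH_RADIUS_KM = 6371.0088


def approx_bbox_area_km2(west, south, east, north):
    """Approximate the area of a small lon/lat box (degrees) in km².

    Hypothetical helper: treats the box as a flat rectangle whose width
    is scaled by the cosine of the mean latitude.
    """
    mean_lat = math.radians((south + north) / 2.0)
    width_km = math.radians(east - west) * EARTH_RADIUS_KM * math.cos(mean_lat)
    height_km = math.radians(north - south) * EARTH_RADIUS_KM
    return width_km * height_km


# Bounding box printed by the cell above
area_km2 = approx_bbox_area_km2(110.139282, -7.383168, 110.147519, -7.373634)
print(f"Approximate area: {area_km2:.2f} km²")
```

For a box of this size the approximation should land close to the value reported above; for large areas or high latitudes a projected coordinate system would be the safer choice.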


# 2. Check existing in situ reference data

We query the database of existing training data, which is stored as a Parquet file on a CloudFerro S3 bucket...

In [5]:
from utils import query_worldcereal_samples

public_df = query_worldcereal_samples(poly, buffer=50000, filter_cropland=False)
public_df.head()

Applying a buffer of 50.0 km to the selected area ...
Querying WorldCereal global database ...


RuntimeError: Query interrupted
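
The log above mentions a 50 km buffer around the selected area. How `query_worldcereal_samples` applies that buffer is not shown here, so the following is only a sketch of one simple way to grow a longitude/latitude box by a distance in metres. The helper `buffer_bbox` and the kilometres-per-degree constant are assumptions; a real implementation would more likely buffer the polygon in a projected coordinate system.

```python
import math

# Assumption: length of one degree of latitude on a sphere of radius 6371 km
KM_PER_DEGREE = 111.195


def buffer_bbox(west, south, east, north, buffer_m):
    """Grow a lon/lat box (degrees) by roughly buffer_m metres on each side.

    Hypothetical helper for illustration; accuracy degrades for large
    buffers and near the poles.
    """
    buffer_km = buffer_m / 1000.0
    d_lat = buffer_km / KM_PER_DEGREE
    mean_lat = math.radians((south + north) / 2.0)
    d_lon = buffer_km / (KM_PER_DEGREE * math.cos(mean_lat))
    return (west - d_lon, south - d_lat, east + d_lon, north + d_lat)


# Same box as drawn in section 1, buffered by 50 km
buffered = buffer_bbox(110.139282, -7.383168, 110.147519, -7.373634, 50000)
print(buffered)
```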

# 3. Prepare own reference data

In this section we prepare additional training data for our custom model from a user-provided dataset. Required ingredients are:
- actual samples from the private dataset, extracted from the WorldCereal Reference Data Module (RDM)
- for all these samples, the EO and auxiliary data needed to train the custom model. These data will be extracted through an OpenEO extraction workflow.

First we check which public reference datasets are available in the RDM near our region of interest:

In [4]:
from utils import rdm_collection_request

col_ids = rdm_collection_request(poly)

https://ewoc-rdm-api.iiasa.ac.at/collections/search?Bbox=107.89349378970118&Bbox=-9.604179528638747&Bbox=112.3933072102988&Bbox=-5.141370619541715
The following collections intersect with your AOI:

Collection 1: 2023idnvitocampaignpoly110 of type Polygon containing 335 samples

Collection 2: 2023idnvitomanualpoint100 of type Point containing 1290 samples
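
The request URL printed above passes the bounding box as four repeated `Bbox` query parameters. How `rdm_collection_request` builds that URL internally is not shown, so the snippet below is only a sketch of how such a query string could be assembled with the standard library. The function name `collections_search_url` is hypothetical; the endpoint and parameter name are copied from the printed URL.

```python
from urllib.parse import urlencode

# Endpoint taken from the URL printed in the cell output above
RDM_ENDPOINT = "https://ewoc-rdm-api.iiasa.ac.at"


def collections_search_url(bbox):
    """Build a collections search URL with one `Bbox` parameter per coordinate.

    Hypothetical helper; bbox is (west, south, east, north) in degrees.
    """
    # A list of (key, value) pairs lets the same key repeat in the query string
    query = urlencode([("Bbox", value) for value in bbox])
    return f"{RDM_ENDPOINT}/collections/search?{query}"


search_url = collections_search_url(
    (107.89349378970118, -9.604179528638747, 112.3933072102988, -5.141370619541715)
)
print(search_url)
```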


We have uploaded a private dataset (2022idnvitopoint100) to the WorldCereal Reference Data Module (RDM).
In order to retrieve our private dataset from the RDM, we need to log in with our Terrascope account.

In the following cell, we:
- log in to Terrascope to be able to discover this dataset
- verify which collections (including private ones) intersect with our region of interest


In [5]:
from utils import terrascope_login, rdm_collection_request, rdm_features_request
from pathlib import Path

token = terrascope_login()
col_ids = rdm_collection_request(poly, token)


samples = rdm_features_request(
    poly, col_ids=["2022idnvitopoint100"], headers=token, max_items=1000
)

# TODO: the following action should actually be done automatically in the RDM? --> whenever a user uploads a private dataset, all samples should be marked as "to be extracted"?
# We mark all samples as to be extracted
samples["extract"] = [1] * len(samples)

# TODO: make output path customizable
features_file = "/vitodata/worldcereal/test/demo_idn/features.parquet"
Path(features_file).parent.mkdir(parents=True, exist_ok=True)
samples.to_parquet(features_file)

https://ewoc-rdm-api.iiasa.ac.at/collections/search?Bbox=107.79602078970119&Bbox=-9.515884550205376&Bbox=112.30407721029881&Bbox=-5.056290330117706
The following collections intersect with your AOI:

Collection 1: 2023idnvitocampaignpoly110 of type Polygon containing 335 samples

Collection 2: 2022idnvitopoint100 of type Point containing 1291 samples

Collection 3: 2023idnvitomanualpoint100 of type Point containing 1290 samples
https://ewoc-rdm-api.iiasa.ac.at/collections/2022idnvitopoint100/items?Bbox=107.79602078970119&Bbox=-9.515884550205376&Bbox=112.30407721029881&Bbox=-5.056290330117706&MaxResultCount=1000
Got a total of 372 reference points


Now we download, for all collections of interest, all samples within 250 km of our area of interest and save them to a .parquet file:

In [None]:
# Enter here for which collections you would like to extract samples:
collections_to_extract = ["2022idnvitopoint100"]

# The following function requests all samples from the RDM
# TODO: make sure we can request more than 1000 samples --> add pagination (question was raised to Santosh)
samples = rdm_features_request(
    poly, col_ids=collections_to_extract, headers=token, max_items=1000
)

# TODO: the following action should actually be done automatically in the RDM? --> whenever a user uploads a private dataset, all samples should be marked as "to be extracted"?
# We mark all samples as to be extracted
samples["extract"] = [1] * len(samples)

# TODO: make output path customizable
features_file = "/vitodata/worldcereal/test/demo_idn/features.parquet"
Path(features_file).parent.mkdir(parents=True, exist_ok=True)
samples.to_parquet(features_file)
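
The TODO above notes that requests are currently capped at 1000 samples and that pagination is still to be added. Whether and how the RDM API supports paging is not established in this notebook, so the following is a generic sketch only: `fetch_page` is a hypothetical callable standing in for a single request, and the skip/limit arguments are assumptions rather than documented RDM parameters.

```python
from typing import Callable, Dict, Iterator, List


def iter_pages(
    fetch_page: Callable[[int, int], List[Dict]], page_size: int = 1000
) -> Iterator[Dict]:
    """Yield items page by page until a short or empty page is returned."""
    skip = 0
    while True:
        page = fetch_page(skip, page_size)
        yield from page
        if len(page) < page_size:
            break  # last page reached
        skip += len(page)


# Stand-in for a real request: serves 1291 fake items from memory
fake_items = [{"sample_id": i} for i in range(1291)]


def fake_fetch(skip: int, limit: int) -> List[Dict]:
    return fake_items[skip : skip + limit]


all_items = list(iter_pages(fake_fetch, page_size=1000))
print(f"Fetched {len(all_items)} items")
```

Injecting the request function keeps the loop testable without network access; the real request would replace `fake_fetch` once the paging parameters of the API are known.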

Now we start point extractions for all these acquired samples through OpenEO:

In [5]:
# TODO: import point extractions from scripts > extractions > point_extractions.py
# (we need to make sure we have a similar function in there which can be imported here)
# OR move this functionality to src
# and remove the point_extractions functionality from utils

# TODO: issue warning whenever user is about to launch A LOT OF extractions!

from utils import point_extractions

# TODO: make output path customizable
output_path = Path("/vitodata/worldcereal/test/demo_idn/extractions")
point_extractions(features_file, output_path)

TypeError: point_extractions() missing 1 required positional argument: 'output_path'

Point extractions have been split automatically into multiple jobs. Here, we fetch all extractions and merge them into a single dataframe.

In [7]:
import pandas as pd
from utils import fetch_point_extractions

dfs = fetch_point_extractions(output_path)

# TODO: first, each individual pandas dataframe needs to be processed into a format that can be used for training
# i.e. one row should represent a single sample
# because if you would merge these dataframes together before processing them, the "feature_index" will be wrong as it is set for
# each individual dataframe

# NOTE: process_df is not defined or imported in this notebook yet (see TODO above)
processed_dfs = [process_df(df) for df in dfs]
private_df = pd.concat(processed_dfs, axis=0)

Index(['date', 'feature_index', 'S2-L2A-B02', 'S2-L2A-B03', 'S2-L2A-B04',
       'S2-L2A-B05', 'S2-L2A-B06', 'S2-L2A-B07', 'S2-L2A-B08', 'S2-L2A-B11',
       'S2-L2A-B12', 'S1-SIGMA0-VH', 'S1-SIGMA0-VV', 'COP-DEM',
       'AGERA5-PRECIP', 'AGERA5-TMEAN', 'geometry', 'irrigation_status',
       'extract', 'sample_id', 'quality_score_lc', 'tile', 'valid_time',
       'quality_score_ct', 'ewoc_code', 'h3_l3_cell'],
      dtype='object')
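
The TODO in the cell above asks for each extraction dataframe to be reshaped to one row per sample before merging, because `feature_index` restarts in every job. The sketch below illustrates that idea on toy data. It is not the project's `process_df`; the function name `process_df_sketch`, the `-ts{n}` column naming, the reduced band list, and all values are assumptions made for illustration.

```python
import pandas as pd

# Assumption: a small subset of the band columns listed in the output above
BAND_COLUMNS = ["S2-L2A-B02", "S1-SIGMA0-VV"]


def process_df_sketch(df, band_columns=BAND_COLUMNS):
    """Hypothetical reshape: long time series -> one row per sample."""
    df = df.sort_values(["feature_index", "date"]).copy()
    # Position of each observation within its own sample's time series
    df["ts"] = df.groupby("feature_index").cumcount()
    wide = df.pivot(index="sample_id", columns="ts", values=band_columns)
    # Flatten (band, timestep) column pairs into names like "S2-L2A-B02-ts0"
    wide.columns = [f"{band}-ts{ts}" for band, ts in wide.columns]
    # Re-attach a per-sample attribute (toy label column)
    attrs = df.groupby("sample_id")[["ewoc_code"]].first()
    return attrs.join(wide).reset_index()


# Two toy jobs that reuse the same feature_index values (0 and 1)
job_a = pd.DataFrame(
    {
        "date": ["2022-01-01", "2022-02-01", "2022-01-01", "2022-02-01"],
        "feature_index": [0, 0, 1, 1],
        "sample_id": ["a0", "a0", "a1", "a1"],
        "ewoc_code": [1, 1, 2, 2],  # dummy labels
        "S2-L2A-B02": [100, 110, 200, 210],
        "S1-SIGMA0-VV": [-10.0, -11.0, -12.0, -13.0],
    }
)
job_b = job_a.assign(sample_id=["b0", "b0", "b1", "b1"])

private_df_sketch = pd.concat(
    [process_df_sketch(job_a), process_df_sketch(job_b)], ignore_index=True
)
print(private_df_sketch)
```

Because each job is reshaped on its own and keyed by `sample_id`, the repeated `feature_index` values no longer collide after concatenation.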

Now we merge the public data with our private data...

In [None]:
merged_df = pd.concat([public_df, private_df], axis=0, ignore_index=True)
print(f"Total number of samples: {len(merged_df)}")

# 4. Extract required model inputs

Here we prepare Presto features for each sample by using a model pretrained on WorldCereal data.

In [None]:
from utils import get_inputs_outputs

# Use the merged public + private samples prepared in section 3
encodings, targets = get_inputs_outputs(merged_df, task_type="cropland")

# 5. Train custom classification model

We train a CatBoost model, which is later uploaded to Artifactory.

In [None]:
from utils import train_classifier

custom_model, report = train_classifier(encodings, targets)

In [None]:
# Print the classification report
print(report)

# 6. Deploy custom model

Once trained, we have to upload our model to the cloud so it can be used for inference.

In [None]:
from utils import deploy_model

model_url = deploy_model(custom_model, pattern="demo_cropland")

# 7. Generate a map

Using our custom model, we generate a map for our region of interest...

In [None]:
from worldcereal.job import WorldCerealProduct, generate_map, CropLandParameters
from openeo_gfmap import TemporalContext

# Set temporal range to generate product
temporal_extent = TemporalContext(
    start_date="2021-11-01",
    end_date="2022-10-31",
)

# Initializes default parameters
parameters = CropLandParameters()

# Change the URL to the classification model
parameters.classifier_parameters.classifier_url = model_url

# Launch the job
job_results = generate_map(
    spatial_extent,
    temporal_extent,
    output_path="./cropland_map.tif",
    product_type=WorldCerealProduct.CROPLAND,
    cropland_parameters=parameters,  # assumed keyword for CropLandParameters; check your worldcereal version
    out_format="GTiff",
)
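
The temporal extent in the cell above runs from 1 November to 31 October, i.e. twelve full months. Assuming you want a window of the same length for another season, the start date can be derived from the chosen end date with the standard library. The helper `twelve_month_window` is hypothetical and not part of `worldcereal` or `openeo_gfmap`.

```python
from datetime import date, timedelta


def twelve_month_window(end_date: date):
    """Return ISO start/end strings for the twelve months ending on end_date."""
    day_after = end_date + timedelta(days=1)
    try:
        start = day_after.replace(year=day_after.year - 1)
    except ValueError:
        # day_after is 29 February and the previous year is not a leap year
        start = day_after.replace(year=day_after.year - 1, day=28)
    return start.isoformat(), end_date.isoformat()


window_start, window_end = twelve_month_window(date(2022, 10, 31))
print(window_start, window_end)
```

The two strings can then be passed as `start_date` and `end_date` when building the `TemporalContext`.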