![](./resources/Custom_croptype_map.png)

### Content

- [Introduction](###-Introduction)
- [How to run this notebook?](###-How-to-run-this-notebook?)
- [Before you start](###-Before-you-start)
- [1. Gather and prepare your training data](###-1.-Gather-and-prepare-your-training-data)
- [2. Prepare training features](###-2.-Prepare-training-features)
- [3. Train custom classification model](###-3.-Train-custom-classification-model)
- [4. Deploy your custom model](###-4.-Deploy-your-custom-model)
- [5. Generate your custom crop type map](###-5.-Generate-your-custom-crop-type-map)

### Introduction

This notebook guides you through the process of training a custom crop type classification model for an area in Tanzania, based on publicly available training data sourced from the WorldCereal Reference Data Module.

After model training, we deploy your custom model to the cloud, where openEO can access it to generate a crop type map.

To repeat this exercise for another area and/or season, use the [worldcereal_custom_croptype.ipynb](https://github.com/WorldCereal/worldcereal-classification/blob/main/notebooks/worldcereal_custom_croptype.ipynb) notebook.

### How to run this notebook?

#### Option 1: Run on Terrascope

You can use a preconfigured environment on [**Terrascope**](https://terrascope.be/en) to run the workflows in a Jupyter notebook environment. Just register as a new user on Terrascope or use one of the supported EGI eduGAIN login methods to get started.

Once you have a Terrascope account, you can run this notebook by clicking the button shown below.

<div class="alert alert-block alert-warning">When you click the button, you will be prompted with "Server Options".<br>
Make sure to select the "Worldcereal" image here.<br>
If you selected "Terrascope" by accident, go to File > Hub Control Panel > Stop my server, and click the link below once again.</div>


<a href="https://notebooks.terrascope.be/hub/user-redirect/git-pull?repo=https%3A%2F%2Fgithub.com%2FWorldCereal%2Fworldcereal-classification&urlpath=lab%2Ftree%2Fworldcereal-classification%2Fnotebooks%2FLPS_tutorial_custom_croptype.ipynb&branch=main"><img src="https://img.shields.io/badge/Generate%20custom%20crop%20type%20map-Terrascope-brightgreen" alt="Generate custom crop type map" valign="middle"></a>


#### Option 2: Install Locally

If you prefer to install the package locally, you can create the WorldCereal environment using **Conda** or **pip**.

First clone the repository:
```bash
git clone https://github.com/WorldCereal/worldcereal-classification.git
cd worldcereal-classification
```
Next, install the package locally:
- for Conda: `conda env create -f environment.yml`
- for pip: `pip install ".[train,notebooks]"` (the quotes keep shells such as zsh from interpreting the square brackets)


#### Option 3: Run on CDSE notebooks or Google Colab

Working in another Jupyter cloud environment?

You can install the worldcereal package and its dependencies using:
```bash
pip install git+https://github.com/WorldCereal/presto-worldcereal.git@croptype
pip install --no-deps --quiet "git+https://github.com/worldcereal/worldcereal-classification.git@main"
pip install --upgrade --quiet "worldcereal[train,notebooks]"
```

Then upload the notebook together with all files in the `notebook_utils` subdirectory, and you are ready to go.

### Before you start

To run WorldCereal crop mapping jobs from this notebook, you need an account on the [Copernicus Data Space Ecosystem](https://dataspace.copernicus.eu/).<br>
Registration is free of charge and grants you a number of free openEO processing credits, which you can use to run this demo.

### 1. Gather and prepare your training data

Public reference data, along with pre-processed time series of all required inputs for model training, are available in a dedicated S3 bucket.<br>
Here we query this bucket to retrieve all data surrounding our area of interest.

**Step 1: Define your area of interest (AOI)**

For the purpose of this demo, we use a preconfigured bounding box, located in the Manyara province in Tanzania.

In [None]:
from openeo_gfmap import BoundingBoxExtent
from shapely.geometry import box

processing_extent = BoundingBoxExtent(west=36.44183147892941, south=-5.575316737398433, east=36.60335694430027, north=-5.499907154141718, epsg=4326)
processing_extent_utm = BoundingBoxExtent(west=216548.57047209592, south=9383125.482200256, east=234490.35950809612, north=9391543.763833705, epsg=32737)
polygon = box(
    processing_extent.west,
    processing_extent.south,
    processing_extent.east,
    processing_extent.north)
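
As a quick sanity check on the bounding box, the following sketch derives the AOI dimensions from the UTM coordinates used above. It relies only on the numbers in `processing_extent_utm`, copied here as plain floats so that the snippet runs on its own; the variable names are illustrative.

```python
# UTM zone 37S (EPSG:32737) coordinates of the AOI, copied from the cell above
west, south = 216548.57047209592, 9383125.482200256
east, north = 234490.35950809612, 9391543.763833705

# UTM coordinates are expressed in meters, so differences give the extent directly
width_km = (east - west) / 1000
height_km = (north - south) / 1000
area_km2 = width_km * height_km

print(f"AOI size: {width_km:.1f} km x {height_km:.1f} km ({area_km2:.0f} km2)")
```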

**Step 2: Get all available reference data**

We apply a spatial buffer of 600 km around our area of interest to ensure enough training data is found.
You can freely expand this search perimeter by changing the value of the `buffer` parameter.

In the background, we explicitly filter on temporary crops.<br>
Note that this implies mapping of permanent crops is currently not supported.

In [None]:
from pathlib import Path
from notebook_utils.extractions import query_extractions

# Specify a buffer distance to expand your search perimeter
buffer = 600000  # meters

# Query our public database of training data
extractions = query_extractions(polygon, buffer)
extractions.head()
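
To get a feel for how far a 600 km buffer reaches, the sketch below expands the AOI's UTM bounding box by the buffer distance on every side. This is a simplified illustration only: how `query_extractions` applies the buffer internally is not shown in this notebook, and `expand_bbox` is a hypothetical helper.

```python
def expand_bbox(west, south, east, north, buffer_m):
    """Grow a projected bounding box (coordinates in meters) by buffer_m on every side."""
    return (west - buffer_m, south - buffer_m, east + buffer_m, north + buffer_m)

# AOI in UTM zone 37S (EPSG:32737), copied from Step 1
aoi = (216548.57047209592, 9383125.482200256, 234490.35950809612, 9391543.763833705)
buffer = 600000  # meters, same value as in the cell above

w, s, e, n = expand_bbox(*aoi, buffer)

# This far from the zone's central meridian, UTM distances are only approximate
print(f"Search perimeter: {(e - w) / 1000:.0f} km x {(n - s) / 1000:.0f} km")
```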

**Step 3: Perform a quick quality check**

In this optional step, we provide you with some tools to quickly assess the quality of the datasets.

Upon executing this cell, you will be prompted to enter a dataset name (ref_id) for inspection.

The time series visualization in particular can help you define your season of interest.

In [None]:
from notebook_utils.extractions import get_band_statistics, visualize_timeseries

dataset_name = input('Enter the dataset name: ')
subset_data = extractions.loc[extractions['ref_id'] == dataset_name]

# Check band statistics
band_stats = get_band_statistics(subset_data)

# Visualize timeseries for a few samples
visualize_timeseries(subset_data)
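
If the name entered above does not match any `ref_id`, `subset_data` will simply be empty. The sketch below shows one way to catch a typo early and suggest close matches. `pick_ref_id` is a hypothetical helper and the dataset names are made up for illustration.

```python
import difflib

def pick_ref_id(name, available):
    """Return name if it is a known ref_id, otherwise raise with suggestions."""
    if name in available:
        return name
    suggestions = difflib.get_close_matches(name, available, n=3)
    hint = f" Did you mean: {', '.join(suggestions)}?" if suggestions else ""
    raise ValueError(f"Unknown dataset '{name}'.{hint}")

# Made-up ref_ids for illustration; in the notebook, use
# sorted(extractions['ref_id'].unique()) instead
available = ["2021_TZA_EXAMPLE-A_POINT_110", "2021_KEN_EXAMPLE-B_POLY_111"]

print(pick_ref_id("2021_TZA_EXAMPLE-A_POINT_110", available))
```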

**Step 4: Select your season of interest**

To gain a better understanding of crop seasonality in your area of interest, you can consult the WorldCereal crop calendars (by executing the next cell), or check out the [USDA crop calendars](https://ipad.fas.usda.gov/ogamaps/cropcalendar.aspx).

In [None]:
from notebook_utils.seasons import retrieve_worldcereal_seasons

seasons = retrieve_worldcereal_seasons(processing_extent_utm)

Let's say we want to produce our map for 2021, more specifically for the season running from October 2020 to April 2021.<br>
WorldCereal models always need a full year of data, so we select a 12-month period with the peak of season (February 2021) close to its center:

In [None]:
from openeo_gfmap import TemporalContext

start_date = '2020-08-01'
end_date = '2021-07-31'

processing_period = TemporalContext(start_date, end_date)
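
The choice of `start_date` and `end_date` above can also be derived programmatically. The sketch below builds a 12-month window that starts six months before the peak-of-season month. `year_centered_on` is a hypothetical helper, not part of the WorldCereal package, and the six-month offset is an assumption that happens to reproduce the dates used in this demo.

```python
from datetime import date, timedelta

def year_centered_on(peak_year, peak_month):
    """Return (start, end) of a 12-month window starting 6 months before the peak month."""
    # Count months from year 0 so that month arithmetic wraps across years
    start_index = peak_year * 12 + (peak_month - 1) - 6
    start = date(start_index // 12, start_index % 12 + 1, 1)
    end_index = start_index + 12
    # The window ends on the day before the first day of the 13th month
    end = date(end_index // 12, end_index % 12 + 1, 1) - timedelta(days=1)
    return start, end

start, end = year_centered_on(2021, 2)  # peak of season: February 2021
print(start.isoformat(), end.isoformat())
```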

**Step 5: Filter your training data based on timing of observation**

In this step, our training data is first converted into a format which can be used by our training feature computation and model training routines.

Then, we filter out any sample for which the observation date (attribute `valid_time`) does not match the selected season of interest.

In [None]:
from worldcereal.utils.refdata import process_extractions_df

# Process the merged data
training_df = process_extractions_df(extractions, processing_period)

# Report on the contents of the data
print(f'Samples originating from {training_df["ref_id"].nunique()} unique reference datasets.')
print('Distribution of samples across years:')
print(training_df.year.value_counts())
ncroptypes = training_df['ewoc_code'].nunique()
print(f'Number of crop types remaining: {ncroptypes}')
if ncroptypes <= 1:
    raise ValueError("Not enough crop types found in the remaining data to train a model, cannot continue.")
training_df.head()
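
Conceptually, the temporal filter keeps only samples observed within the selected period. The sketch below illustrates this with plain Python on made-up samples. It assumes a simple "inside the window" rule; the actual criterion applied by `process_extractions_df` may be stricter.

```python
from datetime import date

start, end = date(2020, 8, 1), date(2021, 7, 31)

# Made-up samples; the crop codes are placeholders, not real ewoc codes
samples = [
    {"sample_id": "a", "valid_time": date(2021, 2, 15), "ewoc_code": 1},
    {"sample_id": "b", "valid_time": date(2019, 3, 1), "ewoc_code": 1},
    {"sample_id": "c", "valid_time": date(2020, 11, 30), "ewoc_code": 2},
]

# Keep samples whose observation date falls inside the processing period
kept = [s for s in samples if start <= s["valid_time"] <= end]

# A classifier needs at least two distinct crop types to learn from
n_croptypes = len({s["ewoc_code"] for s in kept})
print(f"{len(kept)} samples kept, {n_croptypes} crop types remaining")
```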

**Step 6: Select your crops of interest**

The following widget will display all available crop types in your training dataframe.

Tick the checkbox for each crop type you wish to explicitly include in your model.<br>
In case you wish to group multiple crops together, just tick the parent node in the hierarchy.

Crops you do not select are merged into a single `other_temporary_crops` class.

After selecting all your crop types of interest, hit the "Apply" button.

<div class="alert alert-block alert-info">
<b>Minimum number of samples:</b><br>
In order to train a model, we recommend a minimum of 30-50 samples to be available for each unique crop type.<br>
Think carefully about which crops to include!
</div>


In [None]:
from notebook_utils.croptypepicker import CropTypePicker

croptypepicker = CropTypePicker(sample_df=training_df, count_threshold=0, expand=True)

In the next cell, we apply your selection to your training dataframe.<br>
The new dataframe will contain a `downstream_class` attribute, denoting the final label.<br>
Let's first check which classes ended up in the "other_temporary_crops" class:

In [None]:
from notebook_utils.croptypepicker import apply_croptypepicker_to_df
from worldcereal.utils.legend import translate_ewoc_codes

training_df = apply_croptypepicker_to_df(training_df, croptypepicker)
other_classes = list(training_df.loc[training_df['downstream_class'] == 'other_temporary_crops']['ewoc_code'].unique())
other_crops = translate_ewoc_codes(other_classes)
other_crops

Based on this list, you might consider dropping some classes.<br>
This can be done by providing the "ewoc_codes" in the following cell:

In [None]:
# drop classes
to_drop = [1111010100]   # dropping sugarcane
if len(to_drop) > 0:
    training_df = training_df.loc[~training_df['ewoc_code'].isin(to_drop)]
training_df['downstream_class'].value_counts()

Finally, you could opt to combine some classes using the code snippet below as an example:

In [None]:
combine_classes = {
    'vegetables_root_crops': ['vegetables_fruits', 'root_tuber_crops'],
}
for new_class, old_classes in combine_classes.items():
    training_df.loc[training_df['downstream_class'].isin(old_classes), 'downstream_class'] = new_class

# Report on the contents of the data
training_df['downstream_class'].value_counts()
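
Before saving, it can be worth checking the class counts against the recommended minimum of 30-50 samples mentioned in Step 6. The sketch below flags classes that fall short, using made-up labels; the threshold of 30 is simply the lower end of that recommendation.

```python
from collections import Counter

MIN_SAMPLES = 30  # lower end of the recommended 30-50 range

# Made-up labels; in the notebook this would be training_df['downstream_class']
labels = ["maize"] * 120 + ["sorghum"] * 45 + ["beans"] * 12

counts = Counter(labels)
too_small = {name: n for name, n in counts.items() if n < MIN_SAMPLES}

for name, n in too_small.items():
    print(f"Class '{name}' has only {n} samples; consider merging or dropping it.")
```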

**Step 7: Save your final training dataframe for future reference**

Upon executing the next cell, you will be prompted to provide a unique name for your dataframe.

In [None]:
from pathlib import Path
from notebook_utils.classifier import get_input

df_name = get_input("name dataframe")

training_dir = Path('./training_data')
training_dir.mkdir(exist_ok=True)

outfile = training_dir / f'{df_name}.csv'

if outfile.exists():
    raise ValueError(f"File {outfile} already exists. Please delete it or choose a different name.")

training_df.to_csv(outfile)

print(f"Dataframe saved to {outfile}")
training_df.head()

### 2. Prepare training features

Using a deep learning framework (Presto), we derive classification features for each sample in the dataframe resulting from your query. Presto was pre-trained on millions of unlabeled samples around the world and fine-tuned on global labeled land cover and crop type data from the WorldCereal reference database. The resulting *embeddings* (`presto_ft_0` to `presto_ft_127`) and the target labels (`downstream_class`) are returned as a training dataframe, which we use for downstream model training.

In [None]:
from notebook_utils.classifier import prepare_training_dataframe

training_dataframe = prepare_training_dataframe(training_df)
training_dataframe.head()
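
Assuming the returned dataframe exposes the embeddings under the column names given above, a small check like the following can confirm that all 128 features and the label column are present. `missing_columns` is a hypothetical helper, and the example column list is constructed here for illustration.

```python
N_FEATURES = 128  # presto_ft_0 ... presto_ft_127
feature_cols = [f"presto_ft_{i}" for i in range(N_FEATURES)]
label_col = "downstream_class"

def missing_columns(columns):
    """List expected feature/label columns that are absent from `columns`."""
    present = set(columns)
    return [c for c in feature_cols + [label_col] if c not in present]

# Illustrative column list; in the notebook, pass training_dataframe.columns
example_columns = feature_cols + [label_col]
print(missing_columns(example_columns))
```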

### 3. Train custom classification model

We train a CatBoost model for the selected crop types.

By default, we apply **class balancing** to ensure minority classes are not discarded. However, depending on the class distribution this may lead to undesired results. There is no golden rule here. If your main goal is to make sure the most dominant classes in your training data are very precisely identified in your map, you can opt to NOT apply class balancing by setting: `balance_classes=False`. 

Before training, the available training data is automatically split into a calibration part and a validation part. The validation report and (optionally) the confusion matrix give you a first idea of your model's performance.

In [None]:
from notebook_utils.classifier import train_classifier

custom_model, report, confusion_matrix = train_classifier(
    training_dataframe, balance_classes=True, show_confusion_matrix=True,
)
print(report)
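
To illustrate what class balancing does, the sketch below computes inverse-frequency class weights, the same formula scikit-learn documents for `class_weight='balanced'`. Whether `train_classifier` uses exactly this scheme is an assumption; the example only shows why minority classes gain influence when balancing is enabled.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * n) for cls, n in counts.items()}

# Made-up, imbalanced label set: 8 maize samples versus 2 sorghum samples
labels = ["maize"] * 8 + ["sorghum"] * 2
weights = balanced_class_weights(labels)
print(weights)
```

With these weights, each sorghum sample counts four times as much as a maize sample, so both classes contribute equally to the training loss overall.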

### 4. Deploy your custom model

Once trained, we have to upload our model to the cloud so it can be used by openEO for inference.

Upon executing the next cell, you will be prompted to provide a clear and short name for your custom model.

Note that these models are only kept in cloud storage for a limited amount of time. Make sure to download your model (using the link provided) if you wish to store it for a longer period of time!

In [None]:
from worldcereal.utils.upload import deploy_model
from openeo_gfmap.backend import cdse_connection
from notebook_utils.classifier import get_input

modelname = get_input("model")
model_url = deploy_model(cdse_connection(), custom_model, pattern=modelname)
print(f"Your model can be downloaded from: {model_url}")

### 5. Generate your custom crop type map

Using our custom model, we generate a map for our region and season of interest.

The next cell takes care of splitting your area of interest into small tiles (their size is set through the `tile_resolution` parameter) and generating a map for each tile.<br>

You will be able to track progress through the automated reporting.<br>

Results will be automatically saved to a folder containing your model name:<br> `runs/CROPTYPE_custom_{your_modelname}_{timestamp}`<br>

The first time you run this, you will be asked to authenticate with your CDSE account by clicking the link provided below the cell.<br>

<div class="alert alert-block alert-warning">
<b>What to do in case of interruption?</b><br> 
If processing was interrupted, manually set <code>output_dir</code> to the directory you used previously. Processing will then continue where it stopped.
</div>
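
If you no longer remember the exact directory name, a sketch like the following can look up the most recent run for a given model name. `latest_run_dir` is a hypothetical helper, and it assumes the `runs/CROPTYPE_custom_{your_modelname}_{timestamp}` naming convention described above.

```python
from pathlib import Path

def latest_run_dir(runs_dir, modelname):
    """Return the most recent run directory for modelname, or None if there is none."""
    candidates = sorted(
        p for p in Path(runs_dir).glob(f"CROPTYPE_custom_{modelname}_*") if p.is_dir()
    )
    # Timestamps formatted as %Y%m%d-%H%M%S sort chronologically as plain strings
    return candidates[-1] if candidates else None

previous = latest_run_dir("./runs", "my_model")  # "my_model" is a placeholder name
if previous is not None:
    print(f"Resume by setting output_dir = Path('{previous}')")
else:
    print("No earlier run found for this model name.")
```

Note that the pattern also matches model names that merely start with the same text (for example `my_model_v2`), so check the result before reusing it.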

In [None]:
import pandas as pd
from pathlib import Path
from worldcereal.job import PostprocessParameters
from worldcereal.job import WorldCerealProductType, CropTypeParameters
from notebook_utils.production import start_production_process, monitor_production_process

# The output directory is named after the model
timestamp = pd.Timestamp.now().strftime("%Y%m%d-%H%M%S")
output_dir = Path('./runs') / f'CROPTYPE_custom_{modelname}_{timestamp}'
print(f"Output directory: {output_dir}")

#-----------------------------------------------------------------------
### OPTIONAL PARAMETERS
#-----------------------------------------------------------------------
# Choose whether you want to store the cropland mask as separate output file
save_mask = True

# Choose whether or not you want to spatially clean the classification results
postprocess_result = True

# Choose the postprocessing method you want to use ["smooth_probabilities", "majority_vote"]
# ("smooth_probabilities" will do limited spatial cleaning,
# while "majority_vote" will do more aggressive spatial cleaning, depending on the value of kernel_size)
postprocess_method = "majority_vote"

# Additional parameter for the majority vote method
# (the higher the value, the more aggressive the spatial cleaning,
# should be an odd number, not larger than 25, default = 5)
kernel_size = 5

# Do you want to save the intermediate results? (before applying the postprocessing)
save_intermediate = True

# Do you want to save all class probabilities in the final product? (default is False)
keep_class_probs = True

postprocess_parameters = PostprocessParameters(
    enable=postprocess_result,
    method=postprocess_method,
    kernel_size=kernel_size,
    save_intermediate=save_intermediate,
    keep_class_probs=keep_class_probs,
)
#-----------------------------------------------------------------------

# Initializes default parameters
parameters = CropTypeParameters()

# Change the URL to your custom classification model
parameters.classifier_parameters.classifier_url = model_url
parameters.save_mask = save_mask

# Define tile resolution in km
tile_resolution = 20

job_options = {
    "image-name": "registry.prod.warsaw.openeo.dataspace.copernicus.eu/prod/openeo-geotrellis-kube-python311:20250619-34"
}

args = (processing_extent, processing_period, output_dir)
kwargs = dict(
    tile_resolution=tile_resolution,
    product_type=WorldCerealProductType.CROPTYPE,
    croptype_parameters=parameters,
    postprocess_parameters=postprocess_parameters,
    job_options=job_options,
)

proc, queue, stop_event = start_production_process(args, kwargs)
status_df = monitor_production_process(proc, queue, stop_event)
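
To estimate how many tiles a given `tile_resolution` produces, the sketch below splits the AOI's UTM bounding box into a regular grid. This is a simplified stand-in: the tiling performed by the production routine may follow a different scheme (for example, alignment to a fixed grid), and `tile_bbox` is a hypothetical helper.

```python
import math

def tile_bbox(west, south, east, north, tile_size_m):
    """Split a projected bounding box into a grid of tiles of at most tile_size_m per side."""
    n_cols = math.ceil((east - west) / tile_size_m)
    n_rows = math.ceil((north - south) / tile_size_m)
    tiles = []
    for row in range(n_rows):
        for col in range(n_cols):
            x0 = west + col * tile_size_m
            y0 = south + row * tile_size_m
            # Clip the last column/row to the bounding box edge
            tiles.append((x0, y0, min(x0 + tile_size_m, east), min(y0 + tile_size_m, north)))
    return tiles

# AOI in UTM zone 37S (EPSG:32737), copied from Step 1
aoi_utm = (216548.57047209592, 9383125.482200256, 234490.35950809612, 9391543.763833705)

n_tiles_20km = len(tile_bbox(*aoi_utm, 20_000))
n_tiles_5km = len(tile_bbox(*aoi_utm, 5_000))
print(f"20 km tiles: {n_tiles_20km}, 5 km tiles: {n_tiles_5km}")
```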

Once production across your tiles is finalized, you can use the cell below to merge the different tiles together into one map.<br>

By default, four products are generated:
- `cropland-raw` --> cropland mask produced using the global WorldCereal cropland model
- `cropland` --> cropland mask, after post-processing
- `croptype-raw` --> your custom crop type product
- `croptype` --> your custom crop type product after post-processing

For each of these products, you will get a raster file containing at least two bands:
1. The label of the winning class
2. The probability of the winning class [50 - 100]
3. and beyond (optional, depending on settings): Class probabilities of each class
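
To give an intuition for the `majority_vote` post-processing method, the sketch below applies a plain majority filter to a small grid of class labels. It is a minimal illustration in pure Python, not the WorldCereal implementation; the checks on `kernel_size` mirror the constraints stated in the production cell (odd, not larger than 25).

```python
from collections import Counter

def majority_filter(grid, kernel_size=5):
    """Replace each cell by the most frequent label in its kernel_size x kernel_size window."""
    if kernel_size % 2 == 0 or kernel_size > 25:
        raise ValueError("kernel_size should be an odd number, not larger than 25")
    half = kernel_size // 2
    rows, cols = len(grid), len(grid[0])
    out = [row[:] for row in grid]
    for r in range(rows):
        for c in range(cols):
            # Collect the labels in the window, clipped at the grid edges
            window = [
                grid[rr][cc]
                for rr in range(max(0, r - half), min(rows, r + half + 1))
                for cc in range(max(0, c - half), min(cols, c + half + 1))
            ]
            out[r][c] = Counter(window).most_common(1)[0][0]
    return out

# A field of class 1 with a single isolated pixel of class 2
labels = [
    [1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1],
    [1, 1, 2, 1, 1],
    [1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1],
]
cleaned = majority_filter(labels, kernel_size=3)
print(cleaned[2])
```

The isolated pixel is absorbed by its neighbours. A larger `kernel_size` removes larger patches, which is why higher values clean more aggressively.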

In [None]:
from notebook_utils.production import merge_maps

merged_path = merge_maps(output_dir, product='croptype')
print(f"Results merged to {merged_path}")

Finally, use the next cell to quickly visualize your crop type product in this notebook.

If you want to compare all products, we recommend using QGIS.

In [None]:
from notebook_utils.visualization import visualize_product
from worldcereal.utils.models import load_model_lut

lut = load_model_lut(model_url)
visualize_product(merged_path, product='croptype', lut=lut, write=True)


Congratulations, you have reached the end of this demo!