![](./resources/System_v1_training_header.png)

**Table of contents**<a id='toc0_'></a>    
- [Before you start](#toc1_)    
- [Define a region of interest](#toc2_)    
- [Extract public in situ reference data](#toc3_)    
- [Select desired crops for prediction](#toc4_)    
- [Extract required model inputs](#toc5_)    
- [Train custom classification model](#toc6_)    
- [Deploy custom model](#toc7_)    
- [Generate a map](#toc8_)    

<!-- vscode-jupyter-toc-config
	numbering=false
	anchor=true
	flat=false
	minLevel=1
	maxLevel=6
	/vscode-jupyter-toc-config -->
<!-- THIS CELL WILL BE REPLACED ON TOC UPDATE. DO NOT WRITE YOUR TEXT IN THIS CELL -->

# <a id='toc1_'></a>[Before you start](#toc0_)

To run this notebook, you need an account on the Copernicus Data Space Ecosystem (CDSE). You can create one by completing [this registration form](https://identity.dataspace.copernicus.eu/auth/realms/CDSE/login-actions/registration?client_id=cdse-public&tab_id=eRKGqDvoYI0).

# <a id='toc2_'></a>[Define a region of interest](#toc0_)

Running the code cell below displays an interactive map.
Click the Rectangle button on the left-hand side of the map to start drawing your region of interest. The area is currently limited to a maximum of 100 km², which is shown while you draw the rectangle.

When finished, execute the second cell to store the coordinates of your region of interest. 

In [15]:
from worldcereal.utils.map import get_ui_map

m, dc = get_ui_map()
m

Map(center=[51.1872, 5.1154], controls=(ZoomControl(options=['position', 'zoom_in_text', 'zoom_in_title', 'zoo…

In [16]:
# retrieve bounding box from drawn rectangle
from worldcereal.utils.map import get_bbox_from_draw

spatial_extent, bbox, poly = get_bbox_from_draw(dc)

Your area of interest: (4.930115, 50.61636, 5.015259, 50.661649)
Area of processing extent: 31.98 km²
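
As a quick sanity check against the 100 km² limit, the area of a bounding box can be estimated from its corner coordinates. The sketch below is not the method used by `get_bbox_from_draw`; it assumes the printed tuple is ordered (west, south, east, north) and treats the Earth as a sphere, so the result will differ slightly from the value reported above. `bbox_area_km2` is a hypothetical helper introduced here for illustration.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius (spherical approximation)
MAX_AREA_KM2 = 100.0         # size limit mentioned in this section


def bbox_area_km2(west, south, east, north):
    """Approximate area of a lon/lat bounding box on a sphere, in km²."""
    d_lon = math.radians(east - west)
    band = math.sin(math.radians(north)) - math.sin(math.radians(south))
    return EARTH_RADIUS_KM ** 2 * d_lon * band


# Coordinates printed by the cell above, assumed (west, south, east, north)
area = bbox_area_km2(4.930115, 50.61636, 5.015259, 50.661649)
print(f"Approximate area: {area:.1f} km² (limit: {MAX_AREA_KM2:.0f} km²)")
```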


# <a id='toc3_'></a>[Extract public in situ reference data](#toc0_)

Here we query existing reference data that have already been processed by WorldCereal and are ready to use.
By default, we keep only samples that carry a crop type label and that intersect a buffer (250 km by default) around the bounding box.

In [17]:
from utils import query_worldcereal_samples

public_df = query_worldcereal_samples(poly)

Applying a buffer of 250.0 km to the selected area ...
Querying WorldCereal global database ...
Processing selected samples ...
Extracted and processed 39192 samples from global database.
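
To give a feeling for how much a 250 km buffer enlarges the search area, the sketch below expands a lon/lat bounding box by a fixed distance on every side. It is only a rough illustration under stated assumptions: `buffer_bbox` is a hypothetical helper, one degree of latitude is taken as about 111.32 km, and `query_worldcereal_samples` may well buffer the polygon differently (for example in a projected coordinate system).

```python
import math

KM_PER_DEGREE_LAT = 111.32  # rough length of one degree of latitude


def buffer_bbox(west, south, east, north, buffer_km=250.0):
    """Expand a lon/lat bounding box by roughly `buffer_km` on every side."""
    d_lat = buffer_km / KM_PER_DEGREE_LAT
    # One degree of longitude shrinks with the cosine of the latitude
    mid_lat = math.radians((south + north) / 2)
    d_lon = buffer_km / (KM_PER_DEGREE_LAT * math.cos(mid_lat))
    return (
        west - d_lon,
        max(south - d_lat, -90.0),
        east + d_lon,
        min(north + d_lat, 90.0),
    )


# Bounding box of the region of interest, assumed (west, south, east, north)
buffered = buffer_bbox(4.930115, 50.61636, 5.015259, 50.661649)
print(tuple(round(value, 3) for value in buffered))
```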


# <a id='toc4_'></a>[Select desired crops for prediction](#toc0_)

Crops with ticked checkboxes will be included in the prediction. All crops that are not selected will be grouped under the `other` category. The model will be trained in a multi-class setting, not a hierarchical one, so every sample receives exactly one of these labels. Keep this in mind when choosing your crop types.

In [18]:
from utils import pick_croptypes
from IPython.display import display

checkbox, checkbox_widgets = pick_croptypes(public_df, samples_threshold=100)
display(checkbox)

VBox(children=(Checkbox(value=False, description='maize (20276 samples)'), Checkbox(value=False, description='…

Based on your selection, a custom target label is now generated for each sample. Verify that only the crops you selected appear in the `custom_class` column; all other crops fall under `other`.

In [19]:
from utils import get_custom_labels

public_df = get_custom_labels(public_df, checkbox_widgets)
public_df["custom_class"].value_counts()

custom_class
maize                 20276
unspecified_wheat      5664
potatoes               4830
other                  4813
beet                   1908
unspecified_barley     1701
Name: count, dtype: int64
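
The relabelling step can be pictured as a simple mapping: selected crops keep their name, everything else becomes `other`. The sketch below shows this on a toy list of labels. It is not the implementation of `get_custom_labels`; `to_custom_class` and the sample labels are hypothetical and only serve as illustration.

```python
from collections import Counter


def to_custom_class(label, selected):
    """Keep a label if it was selected, otherwise map it to 'other'."""
    return label if label in selected else "other"


# Hypothetical toy labels; the real ones come from the WorldCereal database
labels = ["maize", "maize", "rye", "potatoes", "oats", "beet", "rye"]
selected = {"maize", "potatoes", "beet"}

custom = [to_custom_class(label, selected) for label in labels]
print(Counter(custom))
```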

# <a id='toc5_'></a>[Extract required model inputs](#toc0_)

Here we compute Presto embeddings for each sample, using a Presto model pretrained on WorldCereal data. The resulting `encodings` (model inputs) and `targets` (custom class labels) will be used for model training.

In [20]:
from utils import get_inputs_outputs

encodings, targets = get_inputs_outputs(public_df)

Computing Presto embeddings ...
Done.


# <a id='toc6_'></a>[Train custom classification model](#toc0_)
We train a CatBoost model for the selected crop types. Class weights are determined automatically to balance the individual classes.

In [21]:
from utils import train_classifier

custom_model, report = train_classifier(encodings, targets)

Split train/test ...
Computing class weights ...
Class weights: {'beet': 3.424968789013733, 'maize': 0.32215411353014395, 'other': 1.3571781933313545, 'potatoes': 1.352361234348812, 'unspecified_barley': 3.8390708088441086, 'unspecified_wheat': 1.1531736023539303}
Training CatBoost classifier ...
0:	learn: 1.7407270	test: 1.7437118	best: 1.7437118 (0)	total: 101ms	remaining: 13m 24s
25:	learn: 1.1963429	test: 1.2453393	best: 1.2453393 (25)	total: 2.57s	remaining: 13m 7s
50:	learn: 1.0100359	test: 1.0921785	best: 1.0921785 (50)	total: 5.03s	remaining: 13m 3s
75:	learn: 0.9027329	test: 1.0121642	best: 1.0121642 (75)	total: 7.59s	remaining: 13m 11s
100:	learn: 0.8274552	test: 0.9621098	best: 0.9621098 (100)	total: 10s	remaining: 13m 4s
125:	learn: 0.7687655	test: 0.9278064	best: 0.9278064 (125)	total: 12.3s	remaining: 12m 48s
150:	learn: 0.7207899	test: 0.9012466	best: 0.9012466 (150)	total: 14.7s	remaining: 12m 45s
175:	learn: 0.6801320	test: 0.8809879	best: 0.8809879 (175)	total: 17.2s	
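
The class weights printed above are consistent with the common "balanced" heuristic, `n_samples / (n_classes * class_count)`, applied to the training split. The sketch below reproduces them from numbers shown in this notebook: the total samples per class from the `value_counts()` output, minus the test support per class from the classification report. Whether `train_classifier` computes the weights exactly this way is an assumption.

```python
# Total samples per class (from the `value_counts()` output)
total = {
    "beet": 1908, "maize": 20276, "other": 4813,
    "potatoes": 4830, "unspecified_barley": 1701, "unspecified_wheat": 5664,
}
# Test samples per class (the `support` column of the classification report)
test = {
    "beet": 573, "maize": 6083, "other": 1444,
    "potatoes": 1449, "unspecified_barley": 510, "unspecified_wheat": 1699,
}

# Training count per class = total count minus test support
train = {name: total[name] - test[name] for name in total}
n_samples = sum(train.values())
n_classes = len(train)

# Balanced heuristic: rare classes get a large weight, frequent ones a small one
weights = {name: n_samples / (n_classes * count) for name, count in train.items()}

for name, weight in weights.items():
    print(f"{name:>20}: {weight:.4f}")
```

With these weights, a misclassified `unspecified_barley` sample counts roughly twelve times as much as a misclassified `maize` sample during training.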

In [22]:
# Print the classification report
print(report)

                    precision    recall  f1-score   support

              beet       0.65      0.69      0.67       573
             maize       0.93      0.89      0.91      6083
             other       0.62      0.68      0.65      1444
          potatoes       0.74      0.75      0.74      1449
unspecified_barley       0.58      0.58      0.58       510
 unspecified_wheat       0.79      0.82      0.80      1699

          accuracy                           0.81     11758
         macro avg       0.72      0.73      0.73     11758
      weighted avg       0.82      0.81      0.82     11758
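
The two averages at the bottom of the report answer different questions: the macro average treats every class equally, while the weighted average weights each class by its support and is therefore dominated by `maize`. The sketch below recomputes both F1 averages from the rounded per-class values in the report, so small rounding differences with the printed values are expected.

```python
# (precision, recall, f1, support) per class, copied from the report above
report_rows = {
    "beet": (0.65, 0.69, 0.67, 573),
    "maize": (0.93, 0.89, 0.91, 6083),
    "other": (0.62, 0.68, 0.65, 1444),
    "potatoes": (0.74, 0.75, 0.74, 1449),
    "unspecified_barley": (0.58, 0.58, 0.58, 510),
    "unspecified_wheat": (0.79, 0.82, 0.80, 1699),
}

n_total = sum(support for *_, support in report_rows.values())

# Macro average: plain mean over classes
macro_f1 = sum(f1 for _, _, f1, _ in report_rows.values()) / len(report_rows)

# Weighted average: mean over classes, weighted by the number of test samples
weighted_f1 = sum(f1 * support for _, _, f1, support in report_rows.values()) / n_total

print(f"samples: {n_total}, macro F1: {macro_f1:.3f}, weighted F1: {weighted_f1:.3f}")
```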



# <a id='toc7_'></a>[Deploy custom model](#toc0_)

Once trained, the model has to be uploaded to the cloud so it can be used for inference. Executing the cell below prompts you for a `token`, which a WorldCereal admin has to provide.


In [23]:
from utils import deploy_model

model_url = deploy_model(custom_model, pattern="demo_large")

Uploading model to `demo_large_20240709155441_custommodel.onnx`


  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 42.3M    0     0  100 42.3M      0  11.5M  0:00:03  0:00:03 --:--:-- 11.5M

Deployed to: https://artifactory.vgt.vito.be/artifactory/worldcereal_models/demo_large_20240709155441_custommodel.onnx


100 42.3M    0   812  100 42.3M    207  10.8M  0:00:03  0:00:03 --:--:-- 10.8M
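
Judging from the output above, the uploaded file name appears to combine the `pattern` argument with an upload timestamp. The sketch below shows one way such a name could be built. It is inferred from the printed file name only: `build_model_name` is a hypothetical helper, and the timestamp format `%Y%m%d%H%M%S` is an assumption, not a documented part of `deploy_model`.

```python
from datetime import datetime


def build_model_name(pattern, when=None):
    """Build a timestamped ONNX file name (hypothetical naming scheme)."""
    when = when or datetime.now()
    # Assumed format: year, month, day, hour, minute, second without separators
    stamp = when.strftime("%Y%m%d%H%M%S")
    return f"{pattern}_{stamp}_custommodel.onnx"


# Reproduces the file name shown in the output above
name = build_model_name("demo_large", datetime(2024, 7, 9, 15, 54, 41))
print(name)
```

Including a timestamp keeps repeated uploads with the same `pattern` from overwriting each other.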


# <a id='toc8_'></a>[Generate a map](#toc0_)

Using our custom model, we generate a map for our region of interest and download the result.

You can also manually download the resulting GeoTIFF by clicking on the link that will be displayed.

In [34]:
from worldcereal.job import WorldCerealProduct, generate_map, CropTypeParameters
from openeo_gfmap import TemporalContext

# Set temporal range to generate product
temporal_extent = TemporalContext(
    start_date="2021-11-01",
    end_date="2022-10-31",
)

# Initialize default parameters
parameters = CropTypeParameters()

# Point the classifier to the URL of our custom model
parameters.classifier_parameters.classifier_url = model_url

# Launch the job
job_results = generate_map(
    spatial_extent,
    temporal_extent,
    output_path="./cropmap.tif",
    product_type=WorldCerealProduct.CROPTYPE,
    croptype_parameters=parameters,
    out_format="GTiff",
    tile_size=128,
)

INFO:openeo.rest.connection:Found OIDC providers: ['CDSE']
INFO:openeo.rest.connection:No OIDC provider given, but only one available: 'CDSE'. Using that one.
INFO:openeo.rest.connection:Using default client_id 'sh-b1c3a958-52d4-40fe-a333-153595d1c71e' from OIDC provider 'CDSE' info.
INFO:openeo.rest.connection:Found refresh token: trying refresh token based authentication.
INFO:openeo.rest.auth.oidc:Doing 'refresh_token' token request 'https://identity.dataspace.copernicus.eu/auth/realms/CDSE/protocol/openid-connect/token' with post data fields ['grant_type', 'client_id', 'refresh_token'] (client_id 'sh-b1c3a958-52d4-40fe-a333-153595d1c71e')
INFO:openeo.rest.connection:Obtained tokens: ['access_token', 'id_token', 'refresh_token']
INFO:openeo.rest.auth.config:Storing refresh token for issuer 'https://identity.dataspace.copernicus.eu/auth/realms/CDSE' (client 'sh-b1c3a958-52d4-40fe-a333-153595d1c71e')


Authenticated using refresh token.
Selected orbit direction: DESCENDING from max accumulated area overlap between bounds and products.




0:00:00 Job 'j-240709923b394dfaa57d04dfa7aacee3': send 'start'
0:00:16 Job 'j-240709923b394dfaa57d04dfa7aacee3': created (progress 0%)
0:00:26 Job 'j-240709923b394dfaa57d04dfa7aacee3': created (progress 0%)
0:00:32 Job 'j-240709923b394dfaa57d04dfa7aacee3': created (progress 0%)
0:00:47 Job 'j-240709923b394dfaa57d04dfa7aacee3': created (progress 0%)
0:00:57 Job 'j-240709923b394dfaa57d04dfa7aacee3': created (progress 0%)
0:01:12 Job 'j-240709923b394dfaa57d04dfa7aacee3': running (progress N/A)
0:01:28 Job 'j-240709923b394dfaa57d04dfa7aacee3': running (progress N/A)
0:01:47 Job 'j-240709923b394dfaa57d04dfa7aacee3': running (progress N/A)
0:02:11 Job 'j-240709923b394dfaa57d04dfa7aacee3': running (progress N/A)
0:02:41 Job 'j-240709923b394dfaa57d04dfa7aacee3': running (progress N/A)
0:03:19 Job 'j-240709923b394dfaa57d04dfa7aacee3': running (progress N/A)
0:04:06 Job 'j-240709923b394dfaa57d04dfa7aacee3': running (progress N/A)
0:05:04 Job 'j-240709923b394dfaa57d04dfa7aacee3': running (progres

INFO:openeo.rest.job:Downloading Job result asset 'openEO_2020-01-01Z.tif' from https://openeo.creo.vito.be/openeo/jobs/j-240709923b394dfaa57d04dfa7aacee3/results/assets/NGZkOWRiOTYtZDYyMC00NDU0LTliZTYtMTRhN2Q4ZTkyMzU3/ba0b5b0b6f0c2e76f004c79495de9329/openEO_2020-01-01Z.tif?expires=1721139745 to cropmap.tif


The following information helps to interpret the raster:
- Band 1 contains the class integers. Execute the cell below to check which integer belongs to which crop type.
- Band 2 contains the probability associated with the prediction.

In [35]:
LUT = {class_int: class_name for class_int, class_name in enumerate(custom_model.get_params()['class_names'])}
print('Raster value - Class name')
for key, value in LUT.items():
    print(f"{key} -> {value}")

Raster value - Class name
0 -> beet
1 -> maize
2 -> other
3 -> potatoes
4 -> unspecified_barley
5 -> unspecified_wheat
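
Both bands can be combined to summarise the map, for example by counting pixels per crop type while ignoring low-confidence predictions. The sketch below does this on a tiny hand-made raster instead of the real `cropmap.tif`. The pixel values are hypothetical, and the probability scale (0 to 100) and the threshold of 60 are assumptions; check the value range of band 2 in your own file before applying a threshold.

```python
from collections import Counter

# Lookup table printed above: raster value -> class name
LUT = {
    0: "beet", 1: "maize", 2: "other",
    3: "potatoes", 4: "unspecified_barley", 5: "unspecified_wheat",
}

# Hypothetical 2x3 toy raster standing in for the two GeoTIFF bands
band1 = [[1, 1, 3], [5, 2, 1]]          # class integers
band2 = [[90, 55, 80], [70, 40, 95]]    # probabilities, assumed 0-100 scale

MIN_PROBABILITY = 60  # assumed confidence threshold

# Count pixels per class, keeping only sufficiently confident predictions
counts = Counter(
    LUT[cls]
    for cls_row, prob_row in zip(band1, band2)
    for cls, prob in zip(cls_row, prob_row)
    if prob >= MIN_PROBABILITY
)
print(counts)
```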
