![](./resources/System_v2_custom_partI.png)

### Content

- [Introduction](#Introduction)
- [How to run this notebook?](#How-to-run-this-notebook?)
- [Before you start](#Before-you-start)
- [1. Get your private reference data](#1.-Get-your-private-reference-data)
- [2. Prepare your reference data](#2.-Prepare-your-reference-data)
- [3. EO data extractions](#3.-EO-data-extractions)
- [4. Inspect results](#4.-Inspect-results)

### Introduction

The following demo illustrates how to train and deploy your custom crop type model, based on a combination of your own private reference data and publicly available reference data, using the WorldCereal system.<br>

The demo has been split into two parts:

- **Part 1** (this part) deals with preparing your private data so it can be used for training a crop type model. This involves uploading your data to the [WorldCereal Reference Data Module](https://rdm.esa-worldcereal.org/) and extracting the relevant EO data time series for all your samples to be used for model training. Extractions are done through OpenEO, from the [Copernicus Data Space Ecosystem](https://dataspace.copernicus.eu/) cloud backend. After extracting the EO data, we perform a quality check on the extracted data.

- **Part 2** (see separate notebook) proceeds with selecting the right training data for your application, trains the crop type model, deploys it in the cloud and produces a map based on your custom model.

### How to run this notebook?

#### Option 1: Run on Terrascope

You can use a preconfigured environment on [**Terrascope**](https://terrascope.be/en) to run the workflows in a Jupyter notebook environment. Just register as a new user on Terrascope or use one of the supported EGI eduGAIN login methods to get started.

Once you have a Terrascope account, you can run this notebook by clicking the button shown below.

<div class="alert alert-block alert-warning">When you click the button, you will be prompted with "Server Options". Make sure to select the "Worldcereal" image here. Did you choose "Terrascope" by accident? Then go to File > Hub Control Panel > Stop my server, and click the link below once again.</div>

<a href="https://notebooks.terrascope.be/hub/user-redirect/git-pull?repo=https%3A%2F%2Fgithub.com%2FWorldCereal%2Fworldcereal-classification&urlpath=lab%2Ftree%2Fworldcereal-classification%2Fnotebooks%2Fworldcereal_v2_demo_custom_model_training_PART1.ipynb&branch=main"><img src="https://img.shields.io/badge/Run%20System%20v2%20demo%20Part1-Terrascope-brightgreen" alt="Run System V2 demo Part 1" valign="middle"></a>


#### Option 2: Install Locally

If you prefer to install the package locally, you can create the WorldCereal environment using **Conda** or **pip**.

First clone the repository:
```bash
git clone https://github.com/WorldCereal/worldcereal-classification.git
cd worldcereal-classification
```
Next, install the package locally:
- for Conda: `conda env create -f environment.yml`
- for pip: `pip install ".[train,notebooks]"` (the quotes keep shells such as zsh from interpreting the square brackets)

### Before you start

To be able to use all functionality in this notebook, you will need to register for:
- a free [Terrascope](https://terrascope.be/en) account
- a free [Copernicus Data Space Ecosystem (CDSE)](https://dataspace.copernicus.eu/) account

In addition, you need to **upload your private reference dataset to the WorldCereal Reference Data Module**, using the highly automated upload workflow in our [user interface](https://rdm.esa-worldcereal.org/). Please consult the following resources to find out how to prepare and upload your dataset:
- [Demo video on how to upload your dataset](https://www.youtube.com/watch?v=458soD-Gsv8)
- [WorldCereal documentation portal](https://worldcereal.github.io/worldcereal-documentation/rdm/overview.html)
- [Free online course on reference data in WorldCereal](https://esa-worldcereal.org/en/resources/free-massive-open-online-courses-mooc)

### 1. Get your private reference data

Here we query the [WorldCereal Reference Data Module (RDM)](https://rdm.esa-worldcereal.org/) through the dedicated API to retrieve your private reference data.

To learn more about how to interact with the WorldCereal RDM, consult our [dedicated notebook on RDM interaction](https://github.com/WorldCereal/worldcereal-classification/blob/main/notebooks/worldcereal_RDM_demo.ipynb).

In [None]:
# We first initiate an interaction session with the RDM:
from worldcereal.rdm_api import RdmInteraction
rdm = RdmInteraction()

# Get a list of your private collections
collections = rdm.get_collections(include_public=False, include_private=True)

# Extract the collection IDs
ids = [col.id for col in collections]
print(f'Number of collections found: {len(ids)}')
print(ids)

# In case you want to look at the metadata of a specific collection, you can use the print_metadata function:
collections[0].print_metadata()

Based on the list of available collections and their metadata, select your collection of interest. <br>

In the next cell, we download the samples contained in your collection. When you uploaded your dataset to the RDM, it was automatically subsampled, taking into account the geographical distribution and crop type labels of the observations.<br>

You have the option to:
- download ALL observations (use `subset = False`)
- download only the subsample of your dataset (use `subset = True`)

When executing the next cell, you will be prompted to enter the collection ID of your collection of interest. Afterwards, your dataset is downloaded to a `download` folder, located in the folder of this notebook.

In [None]:
# Strip accidental leading/trailing whitespace from the entered ID
collection_id = input('Please enter the desired collection ID: ').strip()

dwnld_folder = './download'

subset = True
parquet_file = rdm.download_collection_geoparquet(collection_id, dwnld_folder, subset=subset)

### 2. Prepare your reference data

Before initiating extractions of EO time series, we run some final preparations/checks on your dataset:

- Ensure all required attributes are included in the data

- If you start from a polygon dataset, we convert the polygons to points (centroids).<br>
If a centroid does not fall inside its original polygon, we discard the sample.

- We report the total number of samples in your dataset and their geographical spread. If high extraction costs are expected, we issue a warning.
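
The centroid rule in the list above can be illustrated with a small, dependency-free sketch. This is not the code behind `prepare_samples_dataframe`, whose implementation is not shown here. The helper names `polygon_centroid`, `contains` and `centroid_or_none` are hypothetical, and polygons are assumed to be simple rings without holes, given as lists of `(x, y)` tuples.

```python
def polygon_centroid(ring):
    """Area-weighted centroid of a simple polygon given as [(x, y), ...]."""
    area2 = 0.0  # twice the signed area (shoelace formula)
    cx = cy = 0.0
    n = len(ring)
    for i in range(n):
        x0, y0 = ring[i]
        x1, y1 = ring[(i + 1) % n]
        cross = x0 * y1 - x1 * y0
        area2 += cross
        cx += (x0 + x1) * cross
        cy += (y0 + y1) * cross
    return cx / (3.0 * area2), cy / (3.0 * area2)


def contains(ring, point):
    """Ray-casting point-in-polygon test (points on the boundary are not handled specially)."""
    px, py = point
    inside = False
    n = len(ring)
    for i in range(n):
        x0, y0 = ring[i]
        x1, y1 = ring[(i + 1) % n]
        # Does this edge cross the horizontal line through the point?
        if (y0 > py) != (y1 > py):
            x_cross = x0 + (py - y0) * (x1 - x0) / (y1 - y0)
            if px < x_cross:
                inside = not inside
    return inside


def centroid_or_none(ring):
    """Return the centroid if it lies inside the polygon, otherwise None (sample discarded)."""
    c = polygon_centroid(ring)
    return c if contains(ring, c) else None


square = [(0, 0), (4, 0), (4, 4), (0, 4)]
# U-shaped field: the centroid falls in the notch, outside the polygon
u_shape = [(0, 0), (6, 0), (6, 6), (4, 6), (4, 2), (2, 2), (2, 6), (0, 6)]

print(centroid_or_none(square))   # centroid kept
print(centroid_or_none(u_shape))  # None: sample would be discarded
```

Concave parcels such as the U-shaped example are the typical case in which a sample is dropped at this step.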

In [None]:
from notebook_utils.extractions import prepare_samples_dataframe

samples_df = prepare_samples_dataframe(parquet_file, collection_id)
samples_df.head()

### 3. EO data extractions

Now that our GeoDataFrame with reference data is ready, we extract for each reference sample the required EO time series from CDSE using OpenEO.

The **start and end date** of the time series are set automatically for each sample: the series starts 9 months before `valid_time` and ends 9 months after it.
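
As a rough illustration of this window, the sketch below shifts a date by a number of calendar months using only the standard library. The function names `shift_months` and `extraction_window` are hypothetical, and the actual extraction code may align the window differently (for example to month boundaries); treat this as an approximation of the rule, not as the WorldCereal implementation.

```python
import calendar
from datetime import date


def shift_months(d, months):
    """Shift a date by whole calendar months, clamping the day to the month length."""
    month_index = d.year * 12 + (d.month - 1) + months
    year, month0 = divmod(month_index, 12)
    month = month0 + 1
    day = min(d.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)


def extraction_window(valid_time, buffer_months=9):
    """Return (start, end) of the window centred on valid_time."""
    return shift_months(valid_time, -buffer_months), shift_months(valid_time, buffer_months)


start, end = extraction_window(date(2021, 6, 15))
print(start, end)  # 2020-09-15 2022-03-15
```

A sample with `valid_time` in mid-June 2021 would therefore need EO data from September 2020 through March 2022.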

The following **monthly** time series are extracted for the indicated time range:
- Sentinel-2 L2A data (all bands)
- Sentinel-1 SIGMA0, VH and VV
- Average air temperature and precipitation sum derived from AgERA5
- Slope and elevation from Copernicus DEM

Note that pre-processing of the time series (e.g. cloud masking, temporal compositing) happens on the fly during the extractions.
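
To give an idea of what cloud masking followed by monthly compositing means, here is a minimal sketch on a toy time series. It assumes a median composite and a per-observation cloud flag; both are illustrative assumptions, since the exact masking and compositing methods used during the extractions are not described here. The function name `monthly_composite` is hypothetical.

```python
from collections import defaultdict
from datetime import date
from statistics import median


def monthly_composite(observations):
    """Median of cloud-free values per (year, month).

    observations: iterable of (date, value, is_cloudy) tuples.
    Months without any cloud-free observation are absent from the result.
    """
    by_month = defaultdict(list)
    for day, value, is_cloudy in observations:
        if not is_cloudy:  # cloud masking: drop flagged observations
            by_month[(day.year, day.month)].append(value)
    return {month: median(values) for month, values in sorted(by_month.items())}


obs = [
    (date(2021, 5, 2), 300, False),
    (date(2021, 5, 12), 900, True),   # cloudy, ignored
    (date(2021, 5, 22), 400, False),
    (date(2021, 6, 1), 500, False),
]
print(monthly_composite(obs))
```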

The following cell splits your dataset into several smaller processing jobs and launches these jobs automatically. <br>
Depending on the size and spatial spread of your dataset, this step might take a while.

On average, one such processing job consumes about 30 CDSE credits, but this can rise to 300 credits depending on local data density.
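
For budgeting, these figures translate into a simple back-of-the-envelope estimate. The helper `estimate_credits` below is hypothetical; it only multiplies the number of jobs by the average and maximum per-job figures quoted above, so actual consumption may differ.

```python
def estimate_credits(n_jobs, avg_per_job=30, max_per_job=300):
    """Return (typical, upper) credit estimates for a given number of jobs."""
    if n_jobs < 0:
        raise ValueError("n_jobs must be non-negative")
    return n_jobs * avg_per_job, n_jobs * max_per_job


typical, upper = estimate_credits(12)
print(f"12 jobs: about {typical} credits on average, up to about {upper}")
```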

<div class="alert alert-block alert-warning"> 
If the extraction process gets interrupted during execution, simply re-run the cell and the extractions will resume where they stopped.<br>
If you explicitly want to retry any failed processing jobs, set the `restart_failed` parameter to `True`.

Starting another set of extractions in the same output folder is not possible.
</div>




In [None]:
from pathlib import Path
from worldcereal.extract.common import run_extractions
from worldcereal.stac.constants import ExtractionCollection

# Define the output folder for the extractions
extractions_folder = Path('./extractions')
outfolder_col = extractions_folder / collection_id
outfolder_col.mkdir(parents=True, exist_ok=True)

# Save the samples dataframe to a file
samples_df_path = outfolder_col / 'samples_gdf.gpkg'
samples_df.to_file(samples_df_path, driver='GPKG')

run_extractions(
    ExtractionCollection.POINT_WORLDCEREAL,
    outfolder_col,
    samples_df_path,
    extract_value=0,
    restart_failed=False,  # set to True to retry previously failed jobs
    )

### 4. Inspect results

Once the extractions have been completed, we first inspect the job tracking dataframe to find out how many jobs were successfully completed.
We also check the success rate of extractions on a sample basis and inspect the cost of the extractions.


In [None]:
from worldcereal.extract.common import check_job_status, get_succeeded_job_details

status_histogram = check_job_status(outfolder_col)
succeeded_jobs = get_succeeded_job_details(outfolder_col)

print('************')
print('Number of samples successfully extracted:')
print(succeeded_jobs['n_samples'].sum())

print('************')
print('Details of succeeded jobs:')
print(succeeded_jobs)
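
To make the notion of a job success rate explicit, the following sketch computes it from a plain list of job statuses. The status labels `"finished"` and `"error"` are assumptions based on common openEO job states, and `summarize_statuses` is a hypothetical helper that is independent of the `check_job_status` function used above.

```python
from collections import Counter


def summarize_statuses(statuses, success_label="finished"):
    """Return (counts per status, fraction of jobs with the success label)."""
    counts = Counter(statuses)
    total = sum(counts.values())
    success_rate = counts[success_label] / total if total else 0.0
    return counts, success_rate


counts, rate = summarize_statuses(["finished", "finished", "error", "finished"])
print(dict(counts))
print(f"Job success rate: {rate:.0%}")
```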


Finally, we do a more in-depth check of the extracted data by:

- manually inspecting a subset of the extracted data (`load_point_extractions` function)

- printing statistics for each individual band that was extracted (`get_band_statistics` function)

- visualizing time series for randomly selected samples (`visualize_timeseries` function)

In [None]:
from notebook_utils.extractions import load_point_extractions

gdf = load_point_extractions(outfolder_col, subset=True)
gdf.head()

# keep in mind when inspecting the results: nodata value = 65535
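
The nodata value matters as soon as you compute your own statistics: leaving 65535 in the series inflates any average. The sketch below shows the effect on a toy list of values; `valid_mean` is a hypothetical helper, and it assumes the band of interest uses 65535 as its nodata value, as noted above.

```python
from statistics import mean

NODATA = 65535  # nodata value mentioned above


def valid_mean(values, nodata=NODATA):
    """Mean of all values that are not nodata; None if nothing valid remains."""
    valid = [v for v in values if v != nodata]
    return mean(valid) if valid else None


series = [1200, NODATA, 1350, 1500, NODATA]
print("Mean ignoring nodata:", valid_mean(series))
print("Naive mean including nodata:", mean(series))
```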

In [None]:
from notebook_utils.extractions import get_band_statistics

stats_df = get_band_statistics(outfolder_col)
print('************')
print('Band statistics:')
print(stats_df)

Note that by default, the following cell will visualize the NDVI time series for 5 randomly chosen samples.<br>

You can change this behaviour by:
- specifying the number of samples to visualize (`n_samples` parameter)
- specifying the band to visualize (`band` parameter)
- specifying a list of sample IDs to visualize (`sample_ids` parameter)

Example:
`visualize_timeseries(outfolder_col, band="S1-SIGMA0-VV", n_samples=2)`
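
For reference, NDVI is the normalized difference between near-infrared and red reflectance, which for Sentinel-2 are usually bands B08 and B04. The sketch below applies that formula to single values. How `visualize_timeseries` derives NDVI internally is not shown here, so the function `ndvi` and its nodata handling are illustrative assumptions.

```python
def ndvi(nir, red, nodata=65535):
    """NDVI = (NIR - red) / (NIR + red); None if an input is nodata or the sum is zero."""
    if nir == nodata or red == nodata:
        return None
    total = nir + red
    if total == 0:
        return None
    return (nir - red) / total


print(ndvi(3000, 1000))   # 0.5
print(ndvi(65535, 1000))  # None
```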

In [None]:
from notebook_utils.extractions import visualize_timeseries

visualize_timeseries(outfolder_col)

End of Part 1 of this exercise.
In the next part, we will use the extracted data to train our custom crop type model.