# UN Handbook on Remote Sensing for Agricultural Statistics
## Chapter 6: WorldCereal – A Global Effort for Crop Mapping
<hr style="border:0; border-top:1px solid #ccc; margin:12px 0 20px;">

<div style="margin-left:120px; display:flex; align-items:center;">
  <img src="../resources/worldcereal_logo.jpg" width="280" alt="WorldCereal" style="vertical-align:middle; margin-right:60px;">
  <img src="../resources/ESA_logo.jpg" height="120" alt="ESA" style="vertical-align:middle;">
</div>
<br>
<br>
<div style="display:flex; justify-content:left; margin-bottom:30px;">
  <div style="font-size:17px; line-height:1.3; text-align:left;">
    <strong>Authors:</strong><br>
    Jeroen Degerickx, Christina Butsko, Kristof Van Tricht<br>
    VITO Remote Sensing, Boeretang 200, 2400 Mol, BELGIUM<br>
    <br>
    <img src="../resources/Vito_RemoteSensing.png" height="60" alt="VITO Remote Sensing" style="margin-left:50px;">
    <hr style="border:0; border-top:1px solid #ddd; margin:12px 0;">
    
  </div>
</div>


# Data preparation notebook

The [WorldCereal crop mapping demonstration notebook](https://github.com/WorldCereal/worldcereal-classification/tree/main/notebooks/UN_handbook/WorldCereal_crop_mapping_demo.ipynb) relies entirely on free and open data. This notebook shows how those data were acquired.

### Content

- [How to run this notebook?](#How-to-run-this-notebook?)
- [Before you start](#Before-you-start)
- [1. Gather the training data](#1.-Gather-the-training-data)
- [2. Gather patches of pre-processed inputs](#2.-Gather-patches-of-pre-processed-inputs)

### How to run this notebook?

#### Option 1: Run on Terrascope

You can use a preconfigured environment on [**Terrascope**](https://terrascope.be/en) to run the workflows in a Jupyter notebook environment.
Just register as a new user on Terrascope or use one of the supported EGI eduGAIN login methods to get started.

Once you have a Terrascope account, you can run this notebook by clicking the button shown below.

<div class="alert alert-block alert-warning">When you click the button, you will be prompted with "Server Options".<br>
Make sure to select the "Worldcereal" image here. Did you choose "Terrascope" by accident?<br>
Then go to File > Hub Control Panel > Stop my server, and click the link below once again.</div>


<a href="https://notebooks.terrascope.be/hub/user-redirect/git-pull?repo=https%3A%2F%2Fgithub.com%2FWorldCereal%2Fworldcereal-classification&urlpath=lab%2Ftree%2Fworldcereal-classification%2Fnotebooks%2UN_handbook%20_data_preparation.ipynb&branch=main"><img src="https://img.shields.io/badge/Run%20on-Terrascope-brightgreen" alt="Run on Terrascope" valign="middle"></a>


<div class="alert alert-block alert-warning">
<b>WARNING:</b> <br>
Every time you click the above link, the latest version of the notebook will be fetched, potentially leading to conflicts with changes you have made yourself.<br>
To avoid such conflicts, we recommend that you make a copy of the notebook and edit only the copy.
</div>


#### Option 2: Install Locally

If you prefer to install the package locally, you can create the WorldCereal environment using **Conda** or **pip**.

First clone the repository:
```bash
git clone https://github.com/WorldCereal/worldcereal-classification.git
cd worldcereal-classification
```
Next, install the package locally:
- for Conda: `conda env create -f environment.yml`
- for pip: `pip install ".[train,notebooks]"` (the quotes keep shells such as zsh from interpreting the square brackets)

### Before you start

In order to run this notebook, you need to create an account on the [Copernicus Data Space Ecosystem](https://dataspace.copernicus.eu/).<br>
This is free of charge and will grant you a number of free openEO processing credits to continue this demo.
<br>
<br>
Make sure all utilities can be accessed:

In [None]:
# Add the parent directory to sys.path so the notebook utilities can be imported
import sys
sys.path.append('..')
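
A quick way to confirm that the utilities are reachable before running the rest of the notebook is sketched below. It uses only the standard library; the helper name `module_available` is introduced here for illustration, and `notebook_utils` is the module imported in the cells that follow.

```python
import importlib.util


def module_available(name: str) -> bool:
    """Return True if a top-level module can be found on the current sys.path."""
    return importlib.util.find_spec(name) is not None


# Report whether the notebook utilities can be imported from here.
# This only looks the module up; it does not import or execute it.
if module_available("notebook_utils"):
    print("notebook_utils found")
else:
    print("notebook_utils not found: check the sys.path entry added above")
```

If the module is not found, the most likely cause is that the notebook is being run from a different working directory than the one assumed by the relative path `'..'`.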

### 1. Gather the training data

The [WorldCereal Reference Data Module](https://rdm.esa-worldcereal.org/) hosts millions of public, harmonized land cover and crop type observations scattered around the globe that can be used for training customized crop models.<br>

These observations, together with their associated input time series (Sentinel-2, Sentinel-1, meteo, DEM), are stored in a public S3 bucket.

Here, we extract a subset of samples for:
- several datasets in France, spanning multiple years
- one dataset in Latvia

In [None]:
from pathlib import Path
import pandas as pd
from notebook_utils.extractions import sample_extractions

TRAIN_DATASETS = [
    '2018_FRA_LPIS_POLY_110',
    '2019_FRA_LPIS_POLY_110',
    '2020_FRA_LPIS_POLY_110',
    '2022_FRA_LPIS_POLY_110'
]

TRAIN2_DATASETS = ['2021_LVA_LPIS_POLY_110']

TEST_DATASETS = [
    '2021_FRA_LPIS_POLY_110',
    '2021_LVA_LPIS_POLY_110',
]

CROP_TYPES = ['wheat', 'spring_wheat', 'barley', 'spring_barley', 'rye', 'spring_rye', 'maize', 'vegetables_fruits', 'dry_pulses_legumes', 'sunflower', 'rapeseed_rape', 'potatoes', 'beet', 'grass_fodder_crops', 'fibre_crops']
CEREALS = ['wheat', 'spring_wheat', 'barley', 'spring_barley', 'rye', 'spring_rye']

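datadir = Path('./data')
# Create the data directory and the "extractions" subdirectory,
# so that the parquet files written below have a place to go
(datadir / 'extractions').mkdir(parents=True, exist_ok=True)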

class_mappings_csv = Path('./resources/crop_type_class_mappings.csv')
class_mappings = pd.read_csv(class_mappings_csv, sep=";", header=0)
classes = class_mappings['finetune_class'].dropna().unique().tolist()

# Sample training data from France for cereals
extractions_train_pt1 = sample_extractions(
    ref_ids=TRAIN_DATASETS,
    crop_types=CEREALS,
    sample_size=300,
    class_mappings_csv=class_mappings_csv
)
# Sample training data from France for non-cereals
extractions_train_pt2 = sample_extractions(
    ref_ids=TRAIN_DATASETS,
    crop_types=[c for c in CROP_TYPES if c not in CEREALS],
    sample_size=700,
    class_mappings_csv=class_mappings_csv
)
# Combine the two
extractions_train = pd.concat([extractions_train_pt1, extractions_train_pt2], ignore_index=True)

# Save the extractions to a parquet file
extractions_train.to_parquet(datadir / "extractions" / "extractions_train.parquet", index=False)

# Sample additional training data from Latvia
extractions_train_lva = sample_extractions(
    ref_ids=TRAIN2_DATASETS,
    crop_types=CROP_TYPES,
    sample_size=100,
    class_mappings_csv=class_mappings_csv,
    random_state=42
)
# Save the extractions to a parquet file
extractions_train_lva.to_parquet(datadir / "extractions" / "extractions_train_lva.parquet", index=False)

# Sample test extractions
extractions_test = sample_extractions(
    ref_ids=TEST_DATASETS,
    crop_types=CROP_TYPES,
    sample_size=100,
    class_mappings_csv=class_mappings_csv,
    random_state=12
)
# Save the extractions to a parquet file
extractions_test.to_parquet(datadir / "extractions" / "extractions_test.parquet", index=False)
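
Before moving on, it can be useful to check how many samples each dataset contributes per class. The sketch below shows one way to do that with pandas on a small toy table. The column names `ref_id` and `finetune_class` are assumptions made for illustration; the actual columns returned by `sample_extractions` may differ, so adjust the arguments to match the real table.

```python
import pandas as pd

# Toy stand-in for an extractions table (values and column names are hypothetical)
toy = pd.DataFrame(
    {
        "ref_id": ["2021_FRA_LPIS_POLY_110"] * 3 + ["2021_LVA_LPIS_POLY_110"] * 2,
        "finetune_class": ["wheat", "maize", "wheat", "wheat", "rye"],
    }
)


def samples_per_class(df, dataset_col="ref_id", class_col="finetune_class"):
    """Count samples per dataset (rows) and class (columns); absent pairs become 0."""
    return df.groupby([dataset_col, class_col]).size().unstack(fill_value=0)


counts = samples_per_class(toy)
print(counts)
```

The same function could be applied to a table read back with `pd.read_parquet`, provided the column names are replaced by the ones actually present in the file.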

### 2. Gather patches of pre-processed inputs

Once a crop type model has been trained, we want to apply it quickly to small patches in both France and Latvia to get a sense of what the resulting map would look like.

Here, we extract the required pre-processed satellite input data for two patches by accessing the [Copernicus Data Space Ecosystem](https://dataspace.copernicus.eu/) via openEO.

In [None]:
from pathlib import Path
from notebook_utils.preprocessed_inputs import collect_worldcereal_inputs_patches

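# Select the test site; its spatial extent is read from the matching GeoPackage.
# Swap the two lines below to process France instead of Latvia.
# site = 'france'
site = 'latvia'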
patches_file = Path(f'./resources/test_site_{site}.gpkg')

# Define start and end dates
start_date = '2021-01-01'
end_date = '2021-12-31'

# Define output directory
# (datadir was defined in the training data cell above)
outdir = datadir / 'preprocessed_inputs' / site

# Run data collection
collect_worldcereal_inputs_patches(patches_file, outdir, start_date, end_date)
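
The cell above handles one site per run. As a minimal sketch, the inputs for both sites can be prepared in one place using the same path conventions. The helper `site_config` is hypothetical and introduced only for illustration. It builds paths and does not call openEO, so it consumes no processing credits.

```python
from pathlib import Path


def site_config(site, datadir=Path("./data"), resources=Path("./resources")):
    """Return the patches file and output directory for one test site."""
    return {
        "patches_file": resources / f"test_site_{site}.gpkg",
        "outdir": datadir / "preprocessed_inputs" / site,
    }


# Build the configuration for both test sites used in this notebook
configs = {site: site_config(site) for site in ("france", "latvia")}

for site, cfg in configs.items():
    print(site, cfg["patches_file"], cfg["outdir"])
```

Each entry could then be passed to `collect_worldcereal_inputs_patches` together with the start and end dates, one site after the other.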

All done!