![](./resources/MOOC_refdata_RDM_exploration.png)

### Introduction

This notebook demonstrates the different possibilities to explore and download our harmonized reference data, hosted in the [WorldCereal Reference Data Module (RDM)](https://rdm.esa-worldcereal.org/).<br>
Here, we use the dedicated [RDM API](https://ewoc-rdm-api.iiasa.ac.at/swagger/index.html) to interact with the data through Python code.<br>
For quick exploration of the reference data, you can also use our [user interface](https://rdm.esa-worldcereal.org/map).

This notebook only covers the processes of inspecting and downloading existing reference data within the WorldCereal RDM.<br>
Data harmonization and upload should be done through the dedicated user interface, which can be accessed by clicking the "Contribute" button, [here](https://rdm.esa-worldcereal.org/).

For more background information on our vision and approach regarding reference data, visit [the reference data section on our project website](https://esa-worldcereal.org/en/reference-data).

For more technical background information, visit [our documentation portal](https://worldcereal.github.io/worldcereal-documentation/rdm/overview.html).

To engage with us on the topic of reference data, reach out on our [user forum](https://forum.esa-worldcereal.org/c/ref-data/6).

### How to run this notebook?

#### Option 1: Run on Terrascope

You can use a preconfigured environment on [**Terrascope**](https://terrascope.be/en) to run the workflows in a Jupyter notebook environment. Just register as a new user on Terrascope or use one of the supported EGI eduGAIN login methods to get started.

Once you have a Terrascope account, you can run this notebook by clicking the button shown below.

<div class="alert alert-block alert-warning">When you click the button, you will be prompted with "Server Options". Make sure to select the "Worldcereal" image here. Did you choose "Terrascope" by accident? Then go to File > Hub Control Panel > Stop my server, and click the link below once again.</div>

<a href="https://notebooks.terrascope.be/hub/user-redirect/git-pull?repo=https%3A%2F%2Fgithub.com%2FWorldCereal%2Fworldcereal-classification&urlpath=lab%2Ftree%2Fworldcereal-classification%2Fnotebooks%2Fworldcereal_RDM_demo.ipynb&branch=main"><img src="https://img.shields.io/badge/run%20RDM%20demo-Terrascope-brightgreen" alt="Run RDM demo" valign="middle"></a>


#### Option 2: Install Locally

If you prefer to install the package locally, you can create the WorldCereal environment using **Conda** or **pip**.

First clone the repository:
```bash
git clone https://github.com/WorldCereal/worldcereal-classification.git
cd worldcereal-classification
```
Next, install the package locally:
- for Conda: `conda env create -f environment.yml`
- for pip: `pip install .[train,notebooks]`

### Content
  
- [Before you start](#Before-you-start)
- [1. Browse and explore collections](#1.-Browse-and-explore-collections)
- [2. Download individual collections](#2.-Download-individual-collections)
- [3. Filter collections](#3.-Filter-collections)
- [4. Get crop counts across datasets](#4.-Get-crop-counts-across-datasets)
- [5. Get individual samples across multiple collections](#5.-Get-individual-samples-across-multiple-collections)

### Before you start

Reference data in the WorldCereal RDM are organized in collections.<br>
A collection contains observations derived from a single source and for a single year.<br>

Each collection is characterized by a **data privacy level**, controlling who can access the data:<br>
- *Public datasets* can be accessed by anyone and have been explicitly curated by the WorldCereal consortium;
- *Private datasets* can only be accessed by the user who uploaded the data;
- *Restricted datasets* are private datasets which were approved by the uploader to be used for training the global WorldCereal classification models, but which are not shared publicly.

Anyone is able to access, explore and download public datasets, without the need for any registration or user account.

In order to upload and access private datasets, you need to sign up for a [Terrascope account](https://terrascope.be/en).<br>
This is completely free of charge!

### 1. Browse and explore collections

In this section, we demonstrate how to retrieve a list of available collections and how to get a bit more information about an individual collection.<br>
For now we will focus on public collections only.

In [1]:
# We first initiate an interaction session with the RDM:
from worldcereal.rdm_api import RdmInteraction
rdm = RdmInteraction()

# Get a list of available collections
# (by default, the following method only returns public collections)
collections = rdm.get_collections()

# Extract the collection IDs
ids = [col.id for col in collections]
print(f'Number of collections found: {len(ids)}')
ids

Number of collections found: 137


['2021_deu_eurocropsls_poly_110',
 '2018_mli_nhicropharvest_poly_110',
 '2019_mli_nhicropharvest_poly_110',
 '2021_mex_cimmyt1_poly_111',
 '2022_aus_clumcom_poly_110',
 '2020_mex_cimmyt1_poly_111',
 '2023_mex_cimmyt1_poly_111',
 '2017_tza_osfafsis_point_110',
 '2018_af_oneacrefundmel_point_110',
 '2017_bel_lpisflanders_poly_110',
 '2023_npl_cimmyt_poly_110',
 '2018_bel_lpisflanders_poly_110',
 '2021_glo_ewocval_poly_111',
 '2022_glo_ewocval_poly_111',
 '2022_mex_cimmyt1_poly_111',
 '2020_aus_clumcom_poly_110',
 '2022_af_dewatrain1_poly_100',
 '2023idnvitomanualpoint100',
 '2018_irl_lpis_poly_110',
 '2019_aus_clumcom_poly_110',
 '2021_aus_clumcom_poly_110',
 '2019_irl_lpis_poly_110',
 '2019_mex_cimmyt1_poly_111',
 '2022_idn_vitomanualpoints_point_100',
 '2020_irl_lpis_poly_110',
 '2018_glo_glance_point_100',
 '2019_glo_glance_point_100',
 '2019_esp_sigpaccatalunya_poly_111',
 '2022_aut_lpis_poly_110',
 '2018_aus_clumcom_poly_110',
 '2023_aus_clumcom_poly_110',
 '2020_glo_glance_point_10

Collection IDs are constructed according to a fixed naming convention:<br>
(year) _ (country/region) _ (identifier) _ (point or poly) _ (information content)<br>
The information content is represented by a numeric code:
- 100: only land cover information
- 110: land cover and crop type information
- 111: land cover, crop type and irrigation information

Find out more on [this page](https://worldcereal.github.io/worldcereal-documentation/rdm/refdata.html#dataset-naming-convention).
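
The naming convention can be unpacked with plain string handling. The sketch below is illustrative only: `parse_collection_id` and `CollectionId` are hypothetical helpers, not part of the `worldcereal` package. IDs in the listing above that do not follow the five-part pattern are simply reported as `None`.

```python
from typing import NamedTuple, Optional

# Meaning of the numeric suffix, as listed above.
CONTENT_CODES = {
    "100": "land cover",
    "110": "land cover + crop type",
    "111": "land cover + crop type + irrigation",
}


class CollectionId(NamedTuple):
    year: int
    region: str
    identifier: str
    geometry: str  # "point" or "poly"
    content: str


def parse_collection_id(collection_id: str) -> Optional[CollectionId]:
    """Split an ID into its five parts; return None if it does not fit the convention."""
    parts = collection_id.lower().split("_")
    if len(parts) != 5:
        return None
    year, region, identifier, geometry, code = parts
    if not year.isdigit() or geometry not in ("point", "poly") or code not in CONTENT_CODES:
        return None
    return CollectionId(int(year), region, identifier, geometry, CONTENT_CODES[code])


print(parse_collection_id("2021_deu_eurocropsls_poly_110"))
print(parse_collection_id("2023idnvitomanualpoint100"))  # does not match the pattern
```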

Each individual collection is accompanied by a standard set of metadata.
Let's see what basic information is available:

In [2]:
# We get the first collection
col = collections[0]

# We can now access the collection's metadata
col.print_metadata()


#######################
Collection Metadata:
ID: 2021_deu_eurocropsls_poly_110
Title: EUROCROPS Germany Lower Saxony 2021
Number of samples: 432889
Data type: Polygon
Access type: Public
Observation method: Unknown
Confidence score for land cover: 97
Confidence score for crop type: 97
Confidence score for irrigation label: 0
List of available crop types: [1101010011, 1101020001, 2001020000, 1101010021, 1201050000, 1110000000, 1101060001, 1101060002, 1101030001, 1100000000, 1107000010, 1107000031, 1101050001, 1401000000, 1201000000, 1110000300, 1106000031, 1101020002, 1207000000, 1111010000, 1115000000, 1101040001, 1105010020, 1101010041, 1101040002, 1103020000, 1103020040, 1105000040, 1101030002, 1105010010, 1103040030, 4000000000, 1101010012, 1201000010, 1108020020, 1103000000, 1103040010, 1103060000, 1103080000, 1103090070, 1103080060, 1103080050, 1106000032, 1107000032, 1110000440, 1105000000, 1101110012, 1101000000, 1103120010, 1200000000, 1111020010, 1111020030, 1101010042, 110311

Most of these metadata items speak for themselves. For now, let's look at three of them:<br>

- The **list of available crop types** mentions which crop types are present within the collection, but does not tell you how many samples there are per type. Further down in this notebook you will see how to extract actual sample counts per crop type.<br>

Crop types are indicated using numeric codes, as defined by our [hierarchical land cover/crop type legend](https://artifactory.vgt.vito.be/artifactory/auxdata-public/worldcereal//legend/WorldCereal_LC_CT_legend_latest.pdf).<br>
To ease interpretation of these codes, the following function translates the information into a more human-readable format:

In [3]:
from worldcereal.utils.legend import translate_ewoc_codes

crop_types = translate_ewoc_codes(col.ewoc_codes)
crop_types

| ewoc_code | label_full | level_1 | level_2 | level_3 | level_4 | level_5 | definition |
|---|---|---|---|---|---|---|---|
| 1000000000 | cropland_unspecified | cropland_unspecified | | | | | Unknown cropped land, either annual or perennial |
| 1101010000 | unspecified_wheat | temporary_crops | cereals | wheat | unspecified_wheat | | Triticum genus |
| 1101060000 | maize | temporary_crops | cereals | maize | maize | | Zea mays |
| 1101070000 | unspecified_sorghum | temporary_crops | cereals | sorghum | unspecified_sorghum | | |
| 1101080000 | rice | temporary_crops | cereals | rice | rice | | |
| 1101120000 | unspecified_millet | temporary_crops | cereals | millet | unspecified_millet | | |
| 1105000000 | dry_pulses_legumes | temporary_crops | dry_pulses_legumes | | | | |
| 1105010010 | beans | temporary_crops | dry_pulses_legumes | beans_peas | beans | | |
| 1107000020 | sweet_potatoes | temporary_crops | root_tuber_crops | | sweet_potatoes | | |


- The **temporal extent** shows the range of observation dates present in the dataset and gives you a rough indication of whether the dataset is useful for a specific growing season of interest.<br>

In [4]:
print(f'Start date: {col.temporal_extent[0]}, End date: {col.temporal_extent[1]}')

Start date: 2017-01-01T00:00:00, End date: 2017-08-03T00:00:00


- The **spatial extent** provides you with a bounding box in which all observations are contained and gives you an idea about the location of the samples.

To get a better idea of where the dataset is located, we can visualize the spatial extent on a map:

In [4]:
from worldcereal.rdm_api.rdm_collection import visualize_spatial_extents
visualize_spatial_extents([col])

Map(center=[52.59550409787735, 8.781974489454234], controls=(ZoomControl(options=['position', 'zoom_in_text', …

### 2. Download individual collections

Let's explore which assets are available for each collection.

#### 2.1 The samples (individual observations)

Samples for individual collections can be downloaded as GeoParquet files.
Each sample at minimum holds information on:
- the observed land cover/crop type (attribute: *ewoc_code*)
- the location (point or polygon, captured in the *geometry* attribute)
- the date for which the observation is valid, i.e. the designated crop was present on the field (attribute: *valid_time*)

Large public collections containing many observations have been automatically subsampled into a representative subset, taking into account crop type variability and spatial distribution of the data. As a user, you have the ability to either download this subset, or the full dataset (by specifying the `subset` parameter in the function below).

Let's create a new *download* folder where this notebook is located and download the samples of our example collection:

In [6]:
dwnld_folder = './download'
# the following function will automatically create the download folder in case it does not exist
parquet_file = rdm.download_collection_geoparquet(col.id, dwnld_folder, subset=True)

2025-01-28 13:40:45.511 | INFO     | worldcereal.rdm_api.rdm_interaction:download_collection_geoparquet:686 - Samples for collection 2017_tza_osfafsis_point_110 downloaded to download/sample_2017_TZA_OSF-AFSIS_POINT_110.parquet


The resulting file can be visualized in QGIS or, for a quick view, using the function below:

In [7]:
from worldcereal.utils.map import visualize_rdm_geoparquet

visualize_rdm_geoparquet(parquet_file)

Map(center=[-7.049473005453702, 34.468186726851854], controls=(ZoomControl(options=['position', 'zoom_in_text'…

#### 2.2 Full metadata

We have previously seen a basic set of metadata for our reference data collection.<br>
During the harmonization of public datasets, the WorldCereal moderators collect an extensive set of metadata to fully document both the original dataset and the harmonization steps that were undertaken.

Most of this metadata is bundled in a metadata Excel file, hosted on the RDM portal. It can either be retrieved as a dictionary or downloaded as an xlsx file:

In [8]:
# Download the metadata file to the download folder
metadata_file = rdm.download_collection_metadata(col.id, dwnld_folder)

# Retrieve the same metadata as a dictionary
metadata = rdm.get_collection_metadata(col.id)
metadata

2025-01-28 12:37:50.161 | INFO     | worldcereal.rdm_api.rdm_interaction:download_collection_metadata:634 - Metadata for collection 2017_tza_osfafsis_point_110 downloaded to ./download


{'CuratedDataSet:ReferenceCuratedDataSet:NameCuratedDataSet:': '2017_tza_osfafsis_point_110',
 'CuratedDataSet::TitleCuratedDataSet:': 'TanSIS, 2017',
 'OriginalDataSet:Provider:Code:': 'TanSIS',
 'OriginalDataSet:Provider:DescriptionCuratedDataSet:': 'Tanzania Soil Information Service (TanSIS), a project  under the auspices of the Division of Research and Development (DRD) of the Ministry of Agriculture of Tanzania.',
 'OriginalDataSet:Provider:URL:': 'https://osf.io/4ngau',
 'OriginalDataSet:Provider:Contact:': 'Markus Walsh',
 'OriginalDataSet:NameDataSet::': 'TanSIS data set, 2017',
 'OriginalDataSet:DOI::': '10.17605/OSF.IO/4NGAU',
 'OriginalDataSet:License:TypeOfLicense:': 'CC_BY',
 'OriginalDataSet:License:ReferenceToLicense:': 'CC-By Attribution 4.0 International',
 'OriginalDataSet:License:RequiredCitation:': 'Walsh, Markus, Joel Meliyo, Bruce Scott, Barbara Walsh, and Bob Macmillan. 2021. “Tanzania Soil Information Service (TanSIS).” OSF. September 6. doi:10.17605/OSF.IO/4NGA

Explore the contents of the metadata.<br>
Note that in addition to a list of available crop types, the extended metadata version also includes crop statistics (i.e. the number of samples per crop type):

In [9]:
crop_counts = rdm.get_collection_stats(col.id)
crop_counts

| Code | Count | Label |
|---|---|---|
| 1101060000 | 91 | maize |
| 1000000000 | 84 | cropland_unspecified |
| 1101070000 | 5 | unspecified_sorghum |
| 1105010010 | 3 | beans |
| 1107000020 | 3 | sweet_potatoes |
| 1101120000 | 2 | unspecified_millet |
| 1101010000 | 2 | unspecified_wheat |
| 1101080000 | 1 | rice |
| 1105000000 | 1 | dry_pulses_legumes |


#### 2.3 Harmonization information

Next to the metadata, each public dataset is also accompanied by a harmonization document (PDF). It describes all steps undertaken by the WorldCereal moderator during data curation and harmonization, including the translation of the original land cover/crop type information to the WorldCereal legend and the computation of the dataset confidence score.<br>
Below, we show how to get this document for an individual dataset:

In [None]:
# Note: this only works for public datasets!
harmonization_file = rdm.download_collection_harmonization_info(col.id, dwnld_folder)

2025-01-27 17:24:54.280 | INFO     | worldcereal.rdm_api.rdm_interaction:download_collection_harmonization_info:661 - Harmonization PDF for collection 2017_tza_osfafsis_point_110 downloaded to download/2017_TZA_OSF-AFSIS_POINT_110.pdf


### 3. Filter collections

Now that you have a basic understanding of how reference data is organized in collections, and of how to access individual collections and their metadata, let's focus on finding relevant reference data for your use case.

We have implemented various filtering methods allowing you to identify collections that contain the data you are looking for.

At the moment you can filter based on:
- crop type
- spatial extent
- temporal extent
- access type (public vs private)

In the sections below we demonstrate each of these cases.

#### 3.1 Filter based on crop types

To filter collections on crop type, you need to provide a list of numerical crop type codes.<br>
All collections containing at least one of the mentioned crop types will be returned.<br>
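
The "at least one" rule is a plain set-overlap test. The sketch below mimics it on a toy catalogue; `contains_any` and the catalogue contents are invented for illustration and say nothing about how the RDM evaluates the filter internally.

```python
def contains_any(available_codes, requested_codes):
    """True if the collection holds at least one of the requested crop types."""
    return not set(requested_codes).isdisjoint(available_codes)


# Toy catalogue: collection name -> crop type codes present (hypothetical values)
catalogue = {
    "collection_a": [1101060000, 1101080000],  # maize, rice
    "collection_b": [1107000010],              # potatoes
    "collection_c": [1000000000],              # cropland_unspecified
}

requested = [1107000010, 1105000030]  # potatoes, lentils
matches = [name for name, codes in catalogue.items() if contains_any(codes, requested)]
print(matches)  # ['collection_b']
```

A collection is returned as soon as one requested code is present, so a longer code list can only widen the result.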

You can either manually enter a list of crop types, or use the application below to select crop types of interest:

In [2]:
from notebook_utils.croptypepicker import CropTypePicker, DEMO_CROPS

croptypepicker = CropTypePicker()
croptypepicker.display()

VBox(children=(HTML(value='<h2>Select your crop types of interest:</h2>'), VBox(children=(Checkbox(value=False…

The croptypepicker application returns a dataframe containing the ewoc_codes you selected. In case you chose to aggregate multiple classes into a single class, this information is shown in the "*new_label*" column. 

In [21]:
croptypes = croptypepicker.croptypes
croptypes

| | new_label | original_label |
|---|---|---|
| 1107000000 | root_tuber_crops | root_tuber_crops |
| 1107000010 | root_tuber_crops | potatoes |
| 1107000040 | root_tuber_crops | cassava |
| 1105000030 | lentils | lentils |


In [22]:
# Get a list of public collections containing these crop types
ewoc_codes = list(croptypes.index.values)

collections = rdm.get_collections(ewoc_codes=ewoc_codes)
ids = [col.id for col in collections]
print(f'Number of collections found: {len(ids)}')
ids

Number of collections found: 59


['2019_can_aafccropinventory_point_110',
 '2021_swe_eurocrops_poly_110',
 '2021_svk_eurocrops_poly_110',
 '2021_est_eurocrops_poly_110',
 '2018_nld_lpis_poly_110',
 '2021_lva_lpis_poly_110',
 '2020_fin_lpis_poly_110',
 '2020_nld_lpis_poly_110',
 '2019_aut_lpis_poly_110',
 '2018_fra_lpis_poly_110',
 '2020_aut_lpis_poly_110',
 '2019_fra_lpis_poly_110',
 '2020_bel_lpisflanders_poly_110',
 '2017_aut_lpis_poly_110',
 '2020_can_aafccropinventory_point_110',
 '2018_aut_lpis_poly_110',
 '2017_lbn_faowapor1_poly_111',
 '2017_lbn_faowapor2_poly_111',
 '2017_mdg_jecamcirad_poly_111',
 '2018_can_aafccropinventory_point_110',
 '2018_eth_faowapor1_poly_111',
 '2018_eu_lucas_point_110',
 '2018_mdg_jecamcirad_poly_111',
 '2018_tza_osfafsis_point_110',
 '2019_egy_faowapor2_poly_111',
 '2019_ken_radiantearth01_poly_111',
 '2019_mdg_jecamcirad_poly_111',
 '2019_tza_osfafsis_point_110',
 '2019_usausda2019cdls_point_110',
 '2020_rwa_faowapor1_point_111',
 '2020_rwa_faowapor2_point_111',
 '2020_rwa_faowapor

#### 3.2 Filter based on location

To retrieve a list of collections intersecting a certain area of interest, you need to provide a spatial bounding box to the `get_collections` function.<br>
Either enter a bounding box manually, or use the application below to draw one:

In [None]:
from worldcereal.utils.map import ui_map

map = ui_map()

VBox(children=(Map(center=[51.1872, 5.1154], controls=(ZoomControl(options=['position', 'zoom_in_text', 'zoom_…

In [24]:
bbox = map.get_extent(projection="latlon")

collections = rdm.get_collections(bbox=bbox)

ids = [col.id for col in collections]
print(f'Number of collections found: {len(ids)}')
ids

2025-01-27 17:26:56.387 | INFO     | worldcereal.utils.map:get_processing_extent:170 - Your processing extent: (-100.005946, 49.060932, -97.104723, 49.633507)


Number of collections found: 6


['2019_can_aafccropinventory_point_110',
 '2020_can_aafccropinventory_point_110',
 '2018_can_aafccropinventory_point_110',
 '2019_usausda2019cdls_point_110',
 '2017_can_aafccropinventory_point_110',
 '2021_can_aafccropinventory_point_110']

To manually define a bounding box, enter the desired coordinates below (lat/lon):

In [25]:
from openeo_gfmap import BoundingBoxExtent

north = 34.79638823
east = -0.34539808
south = 34.45619011
west = -0.91010781

bbox = BoundingBoxExtent(north=north, east=east, south=south, west=west)
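
When typing coordinates by hand, a quick sanity check can catch mix-ups before you send a request. The helper below is a hypothetical sketch (`check_bbox` is not part of `worldcereal` or `openeo_gfmap`) and assumes a box in lat/lon degrees that does not cross the antimeridian.

```python
def check_bbox(north, east, south, west):
    """Return a list of problems found in a lat/lon bounding box (empty if none)."""
    problems = []
    if not (-90 <= south <= 90 and -90 <= north <= 90):
        problems.append("latitudes must lie within [-90, 90]")
    if not (-180 <= west <= 180 and -180 <= east <= 180):
        problems.append("longitudes must lie within [-180, 180]")
    if south >= north:
        problems.append("south must be smaller than north")
    if west >= east:
        problems.append("west must be smaller than east")
    return problems


# The coordinates entered in the cell above pass all checks
print(check_bbox(north=34.79638823, east=-0.34539808, south=34.45619011, west=-0.91010781))  # []
```

Such a check cannot detect latitude and longitude values that were swapped but still fall within valid ranges, so it remains worth plotting the box on a map.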

#### 3.3 Filter based on temporal extent

Simply define a start and end date for your period of interest:

In [26]:
from openeo_gfmap import TemporalContext

# We use the year 2020 as an example:
start_date = "2020-01-01"
end_date = "2020-12-31"

temporal_extent = TemporalContext(start_date=start_date, end_date=end_date)

# Access public collections for the specific temporal range
collections = rdm.get_collections(temporal_extent=temporal_extent)
ids = [col.id for col in collections]
print(f'Number of collections found: {len(ids)}')
ids



Number of collections found: 21


['2020_svn_lpis_poly_110',
 '2020_esp_eurocropsnavarre_poly_110',
 '2020_fin_lpis_poly_110',
 '2020_nld_lpis_poly_110',
 '2020_aut_lpis_poly_110',
 '2020_bel_lpisflanders_poly_110',
 '2020_bra_inpelemaug_poly_110',
 '2020_bra_inpelemfeb_poly_110',
 '2020_bra_inpelemmar_poly_110',
 '2020_can_aafccropinventory_point_110',
 '2020_eth_nhicropharvest_poly_100',
 '2020_glo_nhicropharvest_point_100',
 '2020_rwa_faowapor1_point_111',
 '2020_rwa_faowapor2_point_111',
 '2020_rwa_faowaporakagera_point_111',
 '2020_sdn_faowapor1_poly_110',
 '2020_sdn_faowapor2_poly_111',
 '2020_zwe_nhicropharvest_point_110',
 '2021_rwa_faowaporyan_poly_111',
 '2021_rwa_faowapormuvu_poly_111',
 '2020_fra_lpis_poly_110']
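
The list above contains two collections whose ID starts with 2021. This would be consistent with the filter selecting every collection whose temporal extent overlaps the requested period, rather than matching on the year in the ID. That behaviour is an assumption, not something this notebook confirms. Under that assumption, the test looks like this (`periods_overlap` and the example extent are hypothetical):

```python
from datetime import date


def periods_overlap(start_a, end_a, start_b, end_b):
    """True if two closed date ranges share at least one day."""
    return start_a <= end_b and start_b <= end_a


period = (date(2020, 1, 1), date(2020, 12, 31))

# Hypothetical collection extent running from late 2020 into 2021
extent = (date(2020, 10, 1), date(2021, 3, 31))
print(periods_overlap(*extent, *period))  # True
```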

#### 3.4 Filter based on data privacy

To change the default behaviour of `get_collections` and get (only) private collections, you can set the `include_public` and `include_private` parameters, as demonstrated below.

NOTE that requesting private collections will trigger a login prompt if you are not already logged in.<br>
Simply click the designated link to login to your Terrascope account.

Reminder: Uploading your own collections to the RDM can be accomplished by hitting the "Contribute" button on [this page](https://rdm.esa-worldcereal.org/), where you will be guided through the upload procedure.

In [4]:
private_collections = rdm.get_collections(include_public=False, include_private=True)
private_ids = [col.id for col in private_collections]
print(f'Number of private collections found: {len(private_ids)}')
private_ids

2025-01-27 14:05:41.596 | INFO     | worldcereal.rdm_api.rdm_interaction:get_collections:146 - To access private collections, you need to authenticate.


2025-01-27 14:05:54.213 | INFO     | worldcereal.rdm_api.rdm_interaction:get_collections:194 - No collections found in the RDM for your search criteria.


Number of private collections found: 0


[]

#### 3.5 Your turn!

Use a combination of the filters as presented above to look for collections in Kenya containing samples for maize.
Additionally filter on the year 2021.

How many public maize samples can be found for this country in the RDM?

### 4. Get crop counts across datasets

So far we have gone through the process of first identifying your collections of interest, and then checking the number of samples in these datasets one by one by downloading each dataset's metadata.<br>

This way of working is quite cumbersome if you want crop counts over a large area of interest, as many different collections may be involved.<br>
Below, we demonstrate how you can request crop statistics for one or multiple crop types over one or multiple collections in one go.

In [3]:
# Get total count of soybean and spring wheat samples across two collections in Canada:
collection_ids = ['2021_can_aafccropinventory_point_110', '2018_can_aafccropinventory_point_110']
crop_codes = [1106000020, 1101010002] # soybean + spring wheat

counts = rdm.get_crop_counts(ref_ids=collection_ids, ewoc_codes=crop_codes)
counts

| ewoc_code | Label | 2018_can_aafccropinventory_point_110 | 2021_can_aafccropinventory_point_110 |
|---|---|---|---|
| 1101010002 | unspecified_spring_wheat | 799 | 345 |
| 1106000020 | soy_soybeans | 15180 | 6180 |


Note that if you do not specify a list of crop types, the statistics for all crop types present in the collections of interest are returned.

In [2]:
ref_ids = ['2018_sen_jecamcirad_poly_111', '2019_mli_nhicropharvest_poly_110']

counts = rdm.get_crop_counts(ref_ids)
counts

| ewoc_code | Label | 2018_sen_jecamcirad_poly_111 | 2019_mli_nhicropharvest_poly_110 |
|---|---|---|---|
| 1000000000 | cropland_unspecified | 261 | 0 |
| 1101060000 | maize | 20 | 20 |
| 1101070000 | unspecified_sorghum | 28 | 25 |
| 1101080000 | rice | 0 | 7 |
| 1101120000 | unspecified_millet | 285 | 22 |
| 1103020050 | watermelon | 2 | 0 |
| 1105010050 | cow_peas | 21 | 0 |
| 1106000050 | groundnuts | 143 | 0 |
| 1106000100 | sesame | 1 | 0 |
| 2000000000 | non_cropland_herbaceous | 22 | 0 |

### 5. Get individual samples across multiple collections

Of course we kept the best for last: we implemented a function that allows you to download individual samples across collections, matching your custom search criteria.

There are different ways to filter the samples you wish to download. You can freely combine these different filters in a single request. You will recognize most of these filters from the `get_collections` functionality:

- `ref_ids`: a list of collection IDs
- `ewoc_codes`: a list of crop types
- `bbox`: a spatial bounding box
- `temporal_extent`: a temporal range as defined by a start and end date
- `include_public`: whether or not to include public collections
- `include_private`: whether or not to include private collections
- `subset`: if True, the function only downloads the subsample, i.e. the samples for which the "extract" attribute is 1 or higher. If False (default), all samples matching your search criteria are downloaded.
- `min_quality_lc`: only download samples for which the land cover quality score is higher than this number
- `min_quality_ct`: only download samples for which the crop type quality score is higher than this number
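
The `subset` and quality filters act on attributes of each individual sample. The sketch below applies them to a few made-up records to show how they combine. `keep_sample` is a hypothetical helper, and reading "higher than" as a strict comparison is an assumption; the actual function may treat the threshold as inclusive.

```python
def keep_sample(sample, subset=False, min_quality_lc=0, min_quality_ct=0):
    """Apply the subset and quality filters to one sample record (a plain dict)."""
    # With subset=True, only samples flagged with extract >= 1 are kept
    if subset and sample["extract"] < 1:
        return False
    # "Higher than" is taken literally (strict comparison); this is an assumption
    return (
        sample["quality_score_lc"] > min_quality_lc
        and sample["quality_score_ct"] > min_quality_ct
    )


# Made-up records using the attribute names of the downloaded samples
samples = [
    {"sample_id": "a", "extract": 2, "quality_score_lc": 96, "quality_score_ct": 96},
    {"sample_id": "b", "extract": 0, "quality_score_lc": 96, "quality_score_ct": 96},
    {"sample_id": "c", "extract": 1, "quality_score_lc": 90, "quality_score_ct": 60},
]

kept = [s["sample_id"] for s in samples if keep_sample(s, subset=True, min_quality_ct=75)]
print(kept)  # ['a']
```

Sample `b` is dropped by the subset filter and sample `c` by the crop type quality threshold; all filters must pass for a sample to be kept.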

Below we provide one example of this functionality, but feel free to play around and request samples for your specific area/period/crop types of interest!

For instance, you can make use of the `ui_map` function to draw your own bounding box...

In [6]:
# supply a bounding box (this one is located on Java, Indonesia)
bbox = BoundingBoxExtent(north=-6.577303, east=111.950684, south=-8.05923, west=107.116699)

# do not explicitly filter on collections
ref_ids = None

# Do not filter on date
temporal_extent = None

# Do not filter on crop type
ewoc_codes = None

# Include public collections only
include_public = True
include_private = False

# Limit sample download to a subset in case many samples available
subset = True

# Minimum quality score for samples
min_quality_ct = 75
min_quality_lc = 0

gdf = rdm.get_samples(
    ref_ids=ref_ids,
    subset=subset,
    bbox=bbox,
    temporal_extent=temporal_extent,
    ewoc_codes=ewoc_codes,
    include_public=include_public,
    include_private=include_private,
    min_quality_ct=min_quality_ct,
    min_quality_lc=min_quality_lc,
)

print(f"Total number of samples downloaded: {len(gdf)}")
gdf.head()

2025-01-28 12:18:34.399 | INFO     | worldcereal.rdm_api.rdm_interaction:download_samples:476 - Querying 2 collections...


Total number of samples downloaded: 517


| | sample_id | ewoc_code | valid_time | quality_score_lc | quality_score_ct | extract | h3_l3_cell | ref_id | geometry |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2023_IDN_vito-campaign_POLY_110_0 | 1201020040 | 2023-09-27 | 96 | 96 | 2 | 838c16fffffffff | 2023_idn_vitocampaign_poly_110 | POLYGON ((108.15588 -6.82013, 108.15623 -6.820... |
| 1 | 2023_IDN_vito-campaign_POLY_110_1 | 1201020040 | 2023-09-27 | 96 | 96 | 2 | 838c16fffffffff | 2023_idn_vitocampaign_poly_110 | POLYGON ((108.15645 -6.81920, 108.15651 -6.820... |
| 2 | 2023_IDN_vito-campaign_POLY_110_2 | 1201020040 | 2023-09-27 | 96 | 96 | 2 | 838c16fffffffff | 2023_idn_vitocampaign_poly_110 | POLYGON ((108.15450 -6.82075, 108.15443 -6.821... |
| 3 | 2023_IDN_vito-campaign_POLY_110_3 | 1201020040 | 2023-09-27 | 96 | 96 | 2 | 838c16fffffffff | 2023_idn_vitocampaign_poly_110 | POLYGON ((108.15660 -6.82173, 108.15676 -6.822... |
| 4 | 2023_IDN_vito-campaign_POLY_110_4 | 1201020040 | 2023-09-27 | 96 | 96 | 2 | 838c16fffffffff | 2023_idn_vitocampaign_poly_110 | POLYGON ((108.15668 -6.82271, 108.15653 -6.823... |


The dataframe you get as a result includes information on the origin of each individual sample in the "*ref_id*" attribute:

In [31]:
gdf['ref_id'].unique()

array(['2023_idn_vitocampaign_poly_110', '2023_idn_vitomanualpoints_100'],
      dtype=object)

You can now save this dataframe as a GeoParquet file and visualize it in QGIS or using the same visualization function as before.

Note that in some cases the resulting dataframe will contain both polygons and points, depending on which collections matched your search criteria.

In [33]:
from pathlib import Path

# save as geoparquet file and visualize
parquet_file = str(Path(dwnld_folder) / 'samples.geoparquet')
gdf.to_parquet(parquet_file)

visualize_rdm_geoparquet(parquet_file)


Map(center=[-7.165193221046393, 109.3288611817593], controls=(ZoomControl(options=['position', 'zoom_in_text',…

Congratulations, you have reached the end of this exercise!

You have acquired the necessary skills to request the reference data needed to train your own crop type classification algorithms!<br>
In Version 2 of the [WorldCereal processing system](https://github.com/WorldCereal/worldcereal-classification), we will demonstrate how to proceed to actually train your algorithms and produce a crop type map for your region of interest.

If you did not find the data you were looking for, please consider contributing data to the platform, either as a private, restricted or fully public collection. Read more about our data collection and sharing efforts [here](https://esa-worldcereal.org/en/reference-data).