In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import os

os.environ['EOTDL_API_URL'] = 'https://api.eotdl.com/'
# os.environ['EOTDL_API_URL'] = 'http://localhost:8000/'

In this use case we show how to perform feature engineering with openEO within EOTDL.

https://github.com/earthpulse/eotdl/issues/190


1. stage the EuroCrops dataset with EOTDL.
2. filter the EuroCrops Dataset to create a subset of parcels, e.g., 8 crop classes, each with 1000 examples, for one country
3. run feature engineering with openEO, creating temporal metrics from a S1 and S2 time series (temporally optimised for crops classe of interest). Store feature engineering process graph with the training datsets in EOTDL
4. Use EOTDL functionality to train a model (for this the features need to be retrieved..). Store the model along with the openEO process graph in EOTDL.
5. Use the model to run inference (from within EOTDL?) in an openEO platform such as CDSE or openEO platform. Make use of the feature engineering process graph stored along with the EOTDL model.

## 1 Stage EuroCrops from EOTDL

Dataset can be found at https://www.eotdl.com/datasets/EuroCrops/. The dataset contains a zip file, which in turn contains zip files for each country with the shapefiles (16 total).

> Uncomment the following cells to stage the dataset.

In [4]:
# !eotdl datasets get EuroCrops -v 1 -f -a
# !unzip -o ~/.cache/eotdl/datasets/EuroCrops/EuroCrops.zip -d data/

Staging assets:   0%|                                     | 0/1 [00:00<?, ?it/s]^C
Staging assets:   0%|                                     | 0/1 [00:14<?, ?it/s]


In [5]:
# from glob import glob

# zips = glob('data/*.zip')

# zips

In [6]:
# # unzip shapefiles

# import zipfile

# for zip_file in zips:
# 	with zipfile.ZipFile(zip_file, 'r') as zip_ref:
# 		zip_ref.extractall('data/')


In [7]:
# cleanup

# !rm -rf data/*.zip

List of all the shapefiles in the dataset.

In [3]:
from glob import glob

shapefiles = glob('data/**/*.shp', recursive=True)

shapefiles

[]

In [5]:
import geopandas as gpd

path = shapefiles[0]

gdf = gpd.read_file(path)

gdf.head()


Unnamed: 0,taotlusaas,pollu_id,pindala_ha,taotletud_,taotletu_1,niitmise_t,niitmise_1,viimase_mu,taotletu_2,taotleja_n,taotleja_r,EC_trans_n,EC_hcat_n,EC_hcat_c,geometry
0,2021,19994165,0.25,Karjatamine väljaspool põllumaj. maad,Karjatamine väljaspool põllumaj. maad,,,2021/05/02 14:37:52.000,,FIE,,Rough grazings,pasture_meadow_grassland_grass,3302000000,"POLYGON ((26.50243 59.31839, 26.50244 59.31843..."
1,2021,19990783,1.7,rohttaimed,Püsirohumaa,Niidetud,28.06.2021-04.07.2021,2021/05/02 06:59:17.000,Kliimat ja keskkonda säästvate põllumajandusta...,ERAISIK,,grasses,pasture_meadow_grassland_grass,3302000000,"POLYGON ((24.54648 58.86884, 24.54674 58.86879..."
2,2021,19990784,0.49,rohttaimed,Püsirohumaa,Ei kuulu jälgimisele,,2021/05/02 06:59:17.000,Kliimat ja keskkonda säästvate põllumajandusta...,ERAISIK,,grasses,pasture_meadow_grassland_grass,3302000000,"POLYGON ((24.54597 58.86827, 24.54668 58.86816..."
3,2021,19996106,0.54,talinisu allakülvita,Põllukultuurid,,,2021/05/02 20:58:12.000,Kliimat ja keskkonda säästvate põllumajandusta...,ERAISIK,,Winter wheat,winter_common_soft_wheat,3301010101,"POLYGON ((27.42837 58.11975, 27.42839 58.11972..."
4,2021,19990620,2.48,"punane ristik (vähemalt 80% ristikut, kuni 20%...",Põllukultuurid,Niidetud,06.07.2021-11.07.2021,2021/07/05 07:26:35.000,Kliimat ja keskkonda säästvate põllumajandusta...,TAMSAMÄE OÜ,11350602.0,Red clover (at least 80% clover up to 20% gras...,clover,3301090303,"POLYGON ((26.66816 57.82049, 26.66815 57.8205,..."


In [6]:
# columns
gdf.columns

Index(['taotlusaas', 'pollu_id', 'pindala_ha', 'taotletud_', 'taotletu_1',
       'niitmise_t', 'niitmise_1', 'viimase_mu', 'taotletu_2', 'taotleja_n',
       'taotleja_r', 'EC_trans_n', 'EC_hcat_n', 'EC_hcat_c', 'geometry'],
      dtype='object')

## 2. Filter EuroCropsDataset

Filter the EuroCropsDataset to create a subset of parcels, e.g., 8 crop classes, each with 1000 examples, for one country

In [7]:
# random country

import numpy as np

ix = np.random.randint(0, len(shapefiles))
country = shapefiles[ix]

country

'data/filtered_gdf.shp'

In [8]:
gdf = gpd.read_file(path)

In [9]:
crop_classes = gdf['EC_hcat_n'].unique()

crop_classes

array(['pasture_meadow_grassland_grass', 'winter_common_soft_wheat',
       'clover', 'peas', 'winter_barley', 'winter_rapeseed_rape',
       'spring_barley', 'fresh_vegetables', 'fallow_land_not_crop',
       'orchards_fruits', 'potatoes', 'oats', 'spring_common_soft_wheat',
       'not_known_and_other', 'buckwheat',
       'legumes_dried_pulses_protein_crops', 'raspberry_raspberries',
       'legumes_harvested_green', 'mangelwurzel_fodder_beet', 'melilot',
       'mustard', 'lolium_ryegrass', 'alfalfa_lucerne', 'strawberries',
       'apples', 'blueberry', 'unspecified_cereals', 'beans', 'rye',
       'rhubarb', 'spring_rapeseed_rape', 'winter_triticale',
       'spring_triticale', 'nurseries_nursery', 'coriander',
       'hippophae_sea_buckthorns_seaberry', 'blackcurrant_cassis',
       'willows_osiers', 'beetroot_beets', 'grain_maize_corn_popcorn',
       'pumpkin_squash_gourd', 'cucumber_pickle', 'aronia_chokeberries',
       'aromatic_medicinal_culinary_plants_spices_herbs', 'red

In [10]:
# number of samples per class

num_samples_per_class = {class_: len(gdf[gdf['EC_hcat_n'] == class_]) for class_ in crop_classes}

num_samples_per_class = dict(sorted(num_samples_per_class.items(), key=lambda x: x[1], reverse=True))

num_samples_per_class

{'pasture_meadow_grassland_grass': 84107,
 'winter_common_soft_wheat': 13726,
 'legumes_harvested_green': 13152,
 'spring_barley': 10737,
 'oats': 6911,
 'clover': 6877,
 'winter_rapeseed_rape': 6012,
 'spring_common_soft_wheat': 5525,
 'peas': 4210,
 'potatoes': 3438,
 'winter_barley': 2245,
 'fallow_land_not_crop': 1922,
 'beans': 1719,
 'fresh_vegetables': 1600,
 'rye': 1427,
 'alfalfa_lucerne': 1241,
 'spring_rapeseed_rape': 1236,
 'buckwheat': 1140,
 'grain_maize_corn_popcorn': 852,
 'strawberries': 846,
 'orchards_fruits': 832,
 'legumes_dried_pulses_protein_crops': 831,
 'winter_triticale': 533,
 'finola': 522,
 'melilot': 446,
 'apples': 383,
 'hippophae_sea_buckthorns_seaberry': 322,
 'raspberry_raspberries': 299,
 'blackcurrant_cassis': 275,
 'not_known_and_other': 234,
 'mustard': 209,
 'spring_triticale': 195,
 'unspecified_cereals': 127,
 'aromatic_medicinal_culinary_plants_spices_herbs': 118,
 'blueberry': 94,
 'garlic': 93,
 'carrots_daucus': 84,
 'lolium_ryegrass': 83,


In [11]:
# import matplotlib.pyplot as plt

# plt.figure(figsize=(5, 25))
# plt.barh(list(num_samples_per_class.keys()), list(num_samples_per_class.values()))
# plt.tight_layout()
# plt.show()

In [12]:
# filter 1000 examples per class

# Each job runs separately, so we need to limit the number of classes and samples per class
# samples = 1000
# num_classes = 8

samples = 10
num_classes = 2

# keep classes with at least 1000 samples
classes = [class_ for class_, count in num_samples_per_class.items() if count >= samples]

# random 8 classes
classes = np.random.choice(classes, num_classes, replace=False)

classes


array(['white_cabbage', 'marian_thistles'], dtype='<U47')

In [13]:
filtered_gdf = gdf[gdf['EC_hcat_n'].isin(classes)]

filtered_gdf = filtered_gdf.groupby('EC_hcat_n').sample(n=samples, random_state=42)

filtered_gdf.head()

Unnamed: 0,taotlusaas,pollu_id,pindala_ha,taotletud_,taotletu_1,niitmise_t,niitmise_1,viimase_mu,taotletu_2,taotleja_n,taotleja_r,EC_trans_n,EC_hcat_n,EC_hcat_c,geometry
67729,2021,21145929,0.59,"maarjaohakas, harilik",Põllukultuurid,,,2021/05/19 20:58:54.000,Kliimat ja keskkonda säästvate põllumajandusta...,LAKENIIDU TALU OÜ,14696264.0,Milk thistle,marian_thistles,3301061300,"POLYGON ((27.193 57.69425, 27.19309 57.69427, ..."
52851,2021,20734858,4.95,"maarjaohakas, harilik",Põllukultuurid,,,2021/05/19 10:35:57.000,Kliimat ja keskkonda säästvate põllumajandusta...,UNESTE TALL OÜ,12908746.0,Milk thistle,marian_thistles,3301061300,"POLYGON ((23.68208 58.91499, 23.68199 58.91497..."
45016,2021,20650905,10.01,"maarjaohakas, harilik",Põllukultuurid,,,2021/06/15 14:30:53.000,Kliimat ja keskkonda säästvate põllumajandusta...,TIIGIKALDA OÜ,11489116.0,Milk thistle,marian_thistles,3301061300,"POLYGON ((28.1353 59.39166, 28.13618 59.39145,..."
67881,2021,21018823,1.0,"maarjaohakas, harilik",Põllukultuurid,,,2021/05/19 12:16:58.000,Kliimat ja keskkonda säästvate põllumajandusta...,FIE,,Milk thistle,marian_thistles,3301061300,"POLYGON ((25.81268 58.30422, 25.81274 58.30421..."
31054,2021,20408606,2.28,"maarjaohakas, harilik",Põllukultuurid,,,2021/05/14 10:06:16.000,Kliimat ja keskkonda säästvate põllumajandusta...,BALTIC BARLEY OÜ,14391487.0,Milk thistle,marian_thistles,3301061300,"POLYGON ((25.79092 58.19135, 25.79096 58.19134..."


In [14]:
assert len(filtered_gdf) == num_classes * samples

# save to disk
filtered_gdf.to_file('data/filtered_gdf.shp')


## 3. Feature Engineering with openEO

## 3.1 Feature Enginering pipeline


The first thing that we need is a feature engineering pipeline.

In [15]:
!cat s1_weekly_statistics.json

{
    "process_graph": {
        "loadcollection1": {
            "process_id": "load_collection",
            "arguments": {
                "bands": [
                    "VH",
                    "VV"
                ],
                "id": "SENTINEL1_GRD",
                "spatial_extent": {
                    "from_parameter": "spatial_extent"
                },
                "temporal_extent": {
                    "from_parameter": "temporal_extent"
                }
            }
        },
        "sarbackscatter1": {
            "process_id": "sar_backscatter",
            "arguments": {
                "coefficient": "sigma0-ellipsoid",
                "contributing_area": false,
                "data": {
                    "from_node": "loadcollection1"
                },
                "elevation_model": "COPERNICUS_30",
                "ellipsoid_incidence_angle": false,
                "local_incidence_angle": false,
                "mask": false,
                "

In [16]:
!cat s2_weekly_statistics.json

{
    "process_graph": {
        "loadcollection1": {
            "process_id": "load_collection",
            "arguments": {
                "bands": [
                    "B02",
                    "B03",
                    "B04",
                    "B05",
                    "B06",
                    "B07",
                    "B08",
                    "B8A",
                    "B11",
                    "B12"
                ],
                "id": "SENTINEL2_L2A",
                "properties": {
                    "eo:cloud_cover": {
                        "process_graph": {
                            "lte1": {
                                "process_id": "lte",
                                "arguments": {
                                    "x": {
                                        "from_parameter": "value"
                                    },
                                    "y": 75.0
                                },
                                "resul

We can ingest the pipelines to the EOTDL. First, create a folder with the metadata (README.md) and the pipelines.

In [17]:
text = """---
name: EuroCropsPipeline
authors: 
  - eotdl
license: free
source: https://github.com/earthpulse/eotdl/tree/main/tutorials/usecases/openEO
---

# EuroCropsPipeline

This pipeline will extract features from a S1 and S2 time series for a given set of parcels in the EuroCrops dataset.
"""

os.makedirs('pipeline', exist_ok=True)
with open(f"pipeline/README.md", "w") as outfile:
    outfile.write(text)
    
!cp s1_weekly_statistics.json pipeline/.
!cp s2_weekly_statistics.json pipeline/.
!cat pipeline/README.md

---
name: EuroCropsPipeline
authors: 
  - eotdl
license: free
source: https://github.com/earthpulse/eotdl/tree/main/tutorials/usecases/openEO
---

# EuroCropsPipeline

This pipeline will extract features from a S1 and S2 time series for a given set of parcels in the EuroCrops dataset.


Then, ingest to EOTDL.

In [18]:
# os.environ['EOTDL_API_URL'] = 'http://localhost:8000/'

In [22]:
from eotdl.fe import ingest_openeo 

ingest_openeo('pipeline')

Ingesting directory: pipeline


Ingesting files: 100%|██████████| 3/3 [00:00<00:00, 59.44it/s]


PosixPath('pipeline/catalog.parquet')

In [23]:
!eotdl pipelines ingest -p pipeline

Ingesting directory: pipeline
Ingesting files: 100%|███████████████████████████| 3/3 [00:00<00:00, 100.19it/s]
No new version was created, your dataset has not changed.


In [24]:
!eotdl pipelines list

['EuroCropsPipeline']


## 3.2 Running the pipeline

We can retrieve the pipeline from the EOTDL very easily.

In [3]:
from eotdl.fe import stage_pipeline 

stage_pipeline('EuroCropsPipeline', path="pipeline2")

'pipeline2/EuroCropsPipeline'

In [4]:
!eotdl pipelines get EuroCropsPipeline -p pipeline2 -a -f

Staging assets: 100%|█████████████████████████████| 3/3 [00:00<00:00, 77.16it/s]
Data available at pipeline2/EuroCropsPipeline


In [5]:
!ls pipeline2/EuroCropsPipeline

README.md                 catalog.v2.parquet        s1_weekly_statistics.json
catalog.v1.parquet        qrweqweqw.txt             s2_weekly_statistics.json


But openeo needs access to the pipelines from public links

> The url should return the actual json, not the json file (download).


In [12]:
from eotdl.files import get_file_content_url

s1_weekly_statistics_url = get_file_content_url('s1_weekly_statistics.json', 'EuroCropsPipeline', 'pipelines')
s2_weekly_statistics_url = get_file_content_url('s2_weekly_statistics.json', 'EuroCropsPipeline', 'pipelines')

s1_weekly_statistics_url, s2_weekly_statistics_url

('http://localhost:8000/pipelines/683d8961a5dae84af45603d7/raw/s1_weekly_statistics.json',
 'http://localhost:8000/pipelines/683d8961a5dae84af45603d7/raw/s2_weekly_statistics.json')

In [13]:
import requests

response = requests.get(s1_weekly_statistics_url)
print(response.text)


{"process_graph":{"loadcollection1":{"process_id":"load_collection","arguments":{"bands":["VH","VV"],"id":"SENTINEL1_GRD","spatial_extent":{"from_parameter":"spatial_extent"},"temporal_extent":{"from_parameter":"temporal_extent"}}},"sarbackscatter1":{"process_id":"sar_backscatter","arguments":{"coefficient":"sigma0-ellipsoid","contributing_area":false,"data":{"from_node":"loadcollection1"},"elevation_model":"COPERNICUS_30","ellipsoid_incidence_angle":false,"local_incidence_angle":false,"mask":false,"noise_removal":true}},"aggregatetemporalperiod1":{"process_id":"aggregate_temporal_period","arguments":{"data":{"from_node":"sarbackscatter1"},"period":"week","reducer":{"process_graph":{"mean1":{"process_id":"mean","arguments":{"data":{"from_parameter":"data"}},"result":true}}}}},"applydimension1":{"process_id":"apply_dimension","arguments":{"data":{"from_node":"aggregatetemporalperiod1"},"dimension":"t","process":{"process_graph":{"quantiles1":{"process_id":"quantiles","arguments":{"data"

Run feature engineering with openEO, creating temporal metrics from a S1 and S2 time series (temporally optimised for crops classe of interest). 

In [14]:
import geopandas as gpd

gdf = gpd.read_file('data/filtered_gdf.shp')

gdf.shape

(20, 15)

Add urls to the gdf

In [17]:
gdf['s1_weekly_statistics_url'] = s1_weekly_statistics_url
gdf['s2_weekly_statistics_url'] = s2_weekly_statistics_url

gdf.head()

Unnamed: 0,taotlusaas,pollu_id,pindala_ha,taotletud_,taotletu_1,niitmise_t,niitmise_1,viimase_mu,taotletu_2,taotleja_n,taotleja_r,EC_trans_n,EC_hcat_n,EC_hcat_c,geometry,s1_weekly_statistics_url,s2_weekly_statistics_url
0,2021,21145929,0.59,"maarjaohakas, harilik",Põllukultuurid,,,2021/05/19 20:58:54.000,Kliimat ja keskkonda säästvate põllumajandusta...,LAKENIIDU TALU OÜ,14696264.0,Milk thistle,marian_thistles,3301061300,"POLYGON ((27.193 57.69425, 27.19309 57.69427, ...",http://localhost:8000/pipelines/683d8961a5dae8...,http://localhost:8000/pipelines/683d8961a5dae8...
1,2021,20734858,4.95,"maarjaohakas, harilik",Põllukultuurid,,,2021/05/19 10:35:57.000,Kliimat ja keskkonda säästvate põllumajandusta...,UNESTE TALL OÜ,12908746.0,Milk thistle,marian_thistles,3301061300,"POLYGON ((23.68208 58.91499, 23.68199 58.91497...",http://localhost:8000/pipelines/683d8961a5dae8...,http://localhost:8000/pipelines/683d8961a5dae8...
2,2021,20650905,10.01,"maarjaohakas, harilik",Põllukultuurid,,,2021/06/15 14:30:53.000,Kliimat ja keskkonda säästvate põllumajandusta...,TIIGIKALDA OÜ,11489116.0,Milk thistle,marian_thistles,3301061300,"POLYGON ((28.1353 59.39166, 28.13618 59.39145,...",http://localhost:8000/pipelines/683d8961a5dae8...,http://localhost:8000/pipelines/683d8961a5dae8...
3,2021,21018823,1.0,"maarjaohakas, harilik",Põllukultuurid,,,2021/05/19 12:16:58.000,Kliimat ja keskkonda säästvate põllumajandusta...,FIE,,Milk thistle,marian_thistles,3301061300,"POLYGON ((25.81268 58.30422, 25.81274 58.30421...",http://localhost:8000/pipelines/683d8961a5dae8...,http://localhost:8000/pipelines/683d8961a5dae8...
4,2021,20408606,2.28,"maarjaohakas, harilik",Põllukultuurid,,,2021/05/14 10:06:16.000,Kliimat ja keskkonda säästvate põllumajandusta...,BALTIC BARLEY OÜ,14391487.0,Milk thistle,marian_thistles,3301061300,"POLYGON ((25.79092 58.19135, 25.79096 58.19134...",http://localhost:8000/pipelines/683d8961a5dae8...,http://localhost:8000/pipelines/683d8961a5dae8...


> will run one job per parcel, very slow and not cost effective (~5mins/parcel, can speed up with `parallel_jobs`)

In [42]:
from eotdl.fe.openeo import eurocrops_point_extraction 

eurocrops_point_extraction(
    gdf, 
    start_date = "2024-01-01", 
    nb_months = 2, 
    job_tracker = 'jobs4.csv', 
    parallel_jobs=10, 
    extra_cols=['EC_hcat_n']
)


Authenticated using refresh token.


AssertionError: Unexpected keyword arguments: {'s1_weekly_statistics_url': 'https://obs.eu-nl.otc.t-systems.com/eotdl-data/68306c8ee2cef594e0c0ef07/s1_weekly_statistics.json?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=XSO2NXISXUSPJ5HDP58Y%2F20250523%2Feu-nl%2Fs3%2Faws4_request&X-Amz-Date=20250523T124034Z&X-Amz-Expires=604800&X-Amz-SignedHeaders=host&X-Amz-Signature=a2460b73093fdce8c4983c4615c4b469ef665179a6cc455d4cc6896715d6a29e', 's2_weekly_statistics_url': 'https://obs.eu-nl.otc.t-systems.com/eotdl-data/68306c8ee2cef594e0c0ef07/s2_weekly_statistics.json?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=XSO2NXISXUSPJ5HDP58Y%2F20250523%2Feu-nl%2Fs3%2Faws4_request&X-Amz-Date=20250523T124034Z&X-Amz-Expires=604800&X-Amz-SignedHeaders=host&X-Amz-Signature=8f0a10ea851f8590bf6301e8856a60041601cc63b9dd03c6c2e9f284c6e96051'}

In [4]:
import pandas as pd 

job = pd.read_csv('jobs.csv')
job

Unnamed: 0,fid,geometry,crs,temporal_extent,id,backend_name,status,start_time,running_start_time,cpu,memory,duration
0,,"POLYGON ((24.11064338 58.26119788, 24.11063604...",EPSG:4326,"['2024-01-01', '2024-03-01']",j-25051610175047429329ed6feb96dc20,cdse,finished,2025-05-16T10:17:50Z,,213.518821386 cpu-seconds,1183999.60546875 mb-seconds,126 seconds
1,,"POLYGON ((24.59143788 59.20213941, 24.59144036...",EPSG:4326,"['2024-01-01', '2024-03-01']",j-2505161018084db998fe92f976077609,cdse,finished,2025-05-16T10:18:08Z,,127.460649337 cpu-seconds,835593.53515625 mb-seconds,103 seconds
2,,"POLYGON ((24.734337 59.14479386, 24.73418153 5...",EPSG:4326,"['2024-01-01', '2024-03-01']",j-25051610182546a38540155e6dc862e0,cdse,finished,2025-05-16T10:18:25Z,,180.935664807 cpu-seconds,759126.931640625 mb-seconds,123 seconds
3,,"POLYGON ((24.80384407 58.36675872, 24.80394854...",EPSG:4326,"['2024-01-01', '2024-03-01']",j-2505161018424cf0ae7815d18f278e7c,cdse,finished,2025-05-16T10:18:42Z,,227.41301282900002 cpu-seconds,795693.4453125 mb-seconds,138 seconds
4,,"POLYGON ((24.60552724 59.22486752, 24.60602263...",EPSG:4326,"['2024-01-01', '2024-03-01']",j-2505161018584abea5903b1fad46efcc,cdse,finished,2025-05-16T10:18:58Z,,142.529833755 cpu-seconds,709820.935546875 mb-seconds,111 seconds
5,,"POLYGON ((24.59730459 59.20052431, 24.59728423...",EPSG:4326,"['2024-01-01', '2024-03-01']",j-250516101916430aa489f9f704370f29,cdse,finished,2025-05-16T10:19:16Z,2025-05-16T10:21:45Z,173.42491585800002 cpu-seconds,997323.6796875 mb-seconds,132 seconds
6,,"POLYGON ((26.76171867 57.97226022, 26.76168713...",EPSG:4326,"['2024-01-01', '2024-03-01']",j-2505161019344b0089b9eed895f0a803,cdse,finished,2025-05-16T10:19:34Z,2025-05-16T10:21:45Z,188.659103819 cpu-seconds,1162275.666015625 mb-seconds,161 seconds
7,,"POLYGON ((24.11337928 58.26099873, 24.11331102...",EPSG:4326,"['2024-01-01', '2024-03-01']",j-25051610195045fbbad3794af8313574,cdse,finished,2025-05-16T10:19:50Z,2025-05-16T10:21:45Z,209.11050119899997 cpu-seconds,1066614.70703125 mb-seconds,130 seconds
8,,"POLYGON ((27.12280053 58.12659856, 27.12281697...",EPSG:4326,"['2024-01-01', '2024-03-01']",j-2505161020064465b5eed42de26d9a63,cdse,finished,2025-05-16T10:20:06Z,2025-05-16T10:21:45Z,231.61191986400001 cpu-seconds,2017177.75 mb-seconds,152 seconds
9,,"POLYGON ((26.73512739 58.27442748, 26.73589497...",EPSG:4326,"['2024-01-01', '2024-03-01']",j-2505161020234fe888a2c9f1bb45b4e7,cdse,finished,2025-05-16T10:20:23Z,2025-05-16T10:21:45Z,209.677308211 cpu-seconds,717685.546875 mb-seconds,132 seconds


In [11]:
# Initialize an empty list to store all dataframes
all_data = []

# Loop through each job and read its parquet file
for idx, _job in job.iterrows():
    try:
        job_data = pd.read_parquet(f'job_{_job["id"]}/timeseries.parquet')
        # Add job_id as a column to identify the source
        job_data['job_id'] = _job["id"]
        all_data.append(job_data)
    except Exception as e:
        print(f"Error reading job {_job['id']}: {e}")

# Concatenate all dataframes into one
if all_data:
    data = pd.concat(all_data, ignore_index=True)
    print(f"Successfully merged {len(all_data)} time series datasets")
else:
    data = pd.DataFrame()
    print("No time series data was loaded")

data

Successfully merged 20 time series datasets


Unnamed: 0,geometry,feature_index,B02_P10,B02_P25,B02_P50,B02_P75,B02_P90,B03_P10,B03_P25,B03_P50,...,VH_P25,VH_P50,VH_P75,VH_P90,VV_P10,VV_P25,VV_P50,VV_P75,VV_P90,job_id
0,"b'\x01\x03\x00\x00\x00\x01\x00\x00\x00""\x00\x0...",0,,,,,,,,,...,0.005915,0.009845,0.013578,0.016449,0.039836,0.05057,0.072721,0.122933,0.147181,j-25051610175047429329ed6feb96dc20
1,"b""\x01\x03\x00\x00\x00\x01\x00\x00\x00:\x00\x0...",0,,,,,,,,,...,0.003498,0.005155,0.007258,0.011056,0.021004,0.026354,0.035084,0.046927,0.063484,j-2505161018084db998fe92f976077609
2,b'\x01\x03\x00\x00\x00\x01\x00\x00\x00\x17\x00...,0,,,,,,,,,...,0.002803,0.003923,0.005395,0.007169,0.020292,0.024544,0.031204,0.040079,0.053383,j-25051610182546a38540155e6dc862e0
3,"b""\x01\x03\x00\x00\x00\x01\x00\x00\x00\x1b\x00...",0,,,,,,,,,...,0.006838,0.009548,0.014226,0.020363,0.031789,0.037089,0.049702,0.068509,0.106466,j-2505161018424cf0ae7815d18f278e7c
4,b'\x01\x03\x00\x00\x00\x01\x00\x00\x00#\x00\x0...,0,,,,,,,,,...,0.003203,0.0045,0.006347,0.008537,0.020882,0.024984,0.031692,0.041187,0.052576,j-2505161018584abea5903b1fad46efcc
5,b'\x01\x03\x00\x00\x00\x01\x00\x00\x00\x11\x00...,0,,,,,,,,,...,0.00595,0.008022,0.010358,0.014889,0.032923,0.039042,0.049017,0.06354,0.082097,j-250516101916430aa489f9f704370f29
6,b'\x01\x03\x00\x00\x00\x02\x00\x00\x00b\x00\x0...,0,,,,,,,,,...,0.006577,0.009207,0.012007,0.015258,0.030508,0.039642,0.052435,0.069091,0.091799,j-2505161019344b0089b9eed895f0a803
7,b'\x01\x03\x00\x00\x00\x01\x00\x00\x00\x11\x00...,0,,,,,,,,,...,0.00377,0.005692,0.007832,0.011728,0.027132,0.031695,0.038928,0.047977,0.057112,j-25051610195045fbbad3794af8313574
8,b'\x01\x03\x00\x00\x00\x03\x00\x00\x00\xe6\x00...,0,,,,,,,,,...,0.007683,0.010565,0.0137,0.01697,0.030198,0.037728,0.048462,0.062068,0.075866,j-2505161020064465b5eed42de26d9a63
9,b'\x01\x03\x00\x00\x00\x01\x00\x00\x00\x0c\x00...,0,,,,,,,,,...,0.00197,0.003242,0.005464,0.008266,0.018165,0.023109,0.029741,0.038527,0.046957,j-2505161020234fe888a2c9f1bb45b4e7


In [12]:
data.columns

Index(['geometry', 'feature_index', 'B02_P10', 'B02_P25', 'B02_P50', 'B02_P75',
       'B02_P90', 'B03_P10', 'B03_P25', 'B03_P50', 'B03_P75', 'B03_P90',
       'B04_P10', 'B04_P25', 'B04_P50', 'B04_P75', 'B04_P90', 'B05_P10',
       'B05_P25', 'B05_P50', 'B05_P75', 'B05_P90', 'B06_P10', 'B06_P25',
       'B06_P50', 'B06_P75', 'B06_P90', 'B07_P10', 'B07_P25', 'B07_P50',
       'B07_P75', 'B07_P90', 'B08_P10', 'B08_P25', 'B08_P50', 'B08_P75',
       'B08_P90', 'B8A_P10', 'B8A_P25', 'B8A_P50', 'B8A_P75', 'B8A_P90',
       'B11_P10', 'B11_P25', 'B11_P50', 'B11_P75', 'B11_P90', 'B12_P10',
       'B12_P25', 'B12_P50', 'B12_P75', 'B12_P90', 'VH_P10', 'VH_P25',
       'VH_P50', 'VH_P75', 'VH_P90', 'VV_P10', 'VV_P25', 'VV_P50', 'VV_P75',
       'VV_P90', 'job_id'],
      dtype='object')

In [13]:
data.to_csv('data/features.csv')


## 4. Train a model with EOTDL

We will train a simple random forest model on the features.


In [35]:
data = pd.read_csv('data/features.csv')

data

Unnamed: 0.1,Unnamed: 0,geometry,feature_index,B02_P10,B02_P25,B02_P50,B02_P75,B02_P90,B03_P10,B03_P25,...,VH_P25,VH_P50,VH_P75,VH_P90,VV_P10,VV_P25,VV_P50,VV_P75,VV_P90,job_id
0,0,"b'\x01\x03\x00\x00\x00\x01\x00\x00\x00""\x00\x0...",0,,,,,,,,...,0.005915,0.009845,0.013578,0.016449,0.039836,0.05057,0.072721,0.122933,0.147181,j-25051610175047429329ed6feb96dc20
1,1,"b""\x01\x03\x00\x00\x00\x01\x00\x00\x00:\x00\x0...",0,,,,,,,,...,0.003498,0.005155,0.007258,0.011056,0.021004,0.026354,0.035084,0.046927,0.063484,j-2505161018084db998fe92f976077609
2,2,b'\x01\x03\x00\x00\x00\x01\x00\x00\x00\x17\x00...,0,,,,,,,,...,0.002803,0.003923,0.005395,0.007169,0.020292,0.024544,0.031204,0.040079,0.053383,j-25051610182546a38540155e6dc862e0
3,3,"b""\x01\x03\x00\x00\x00\x01\x00\x00\x00\x1b\x00...",0,,,,,,,,...,0.006838,0.009548,0.014226,0.020363,0.031789,0.037089,0.049702,0.068509,0.106466,j-2505161018424cf0ae7815d18f278e7c
4,4,b'\x01\x03\x00\x00\x00\x01\x00\x00\x00#\x00\x0...,0,,,,,,,,...,0.003203,0.0045,0.006347,0.008537,0.020882,0.024984,0.031692,0.041187,0.052576,j-2505161018584abea5903b1fad46efcc
5,5,b'\x01\x03\x00\x00\x00\x01\x00\x00\x00\x11\x00...,0,,,,,,,,...,0.00595,0.008022,0.010358,0.014889,0.032923,0.039042,0.049017,0.06354,0.082097,j-250516101916430aa489f9f704370f29
6,6,b'\x01\x03\x00\x00\x00\x02\x00\x00\x00b\x00\x0...,0,,,,,,,,...,0.006577,0.009207,0.012007,0.015258,0.030508,0.039642,0.052435,0.069091,0.091799,j-2505161019344b0089b9eed895f0a803
7,7,b'\x01\x03\x00\x00\x00\x01\x00\x00\x00\x11\x00...,0,,,,,,,,...,0.00377,0.005692,0.007832,0.011728,0.027132,0.031695,0.038928,0.047977,0.057112,j-25051610195045fbbad3794af8313574
8,8,b'\x01\x03\x00\x00\x00\x03\x00\x00\x00\xe6\x00...,0,,,,,,,,...,0.007683,0.010565,0.0137,0.01697,0.030198,0.037728,0.048462,0.062068,0.075866,j-2505161020064465b5eed42de26d9a63
9,9,b'\x01\x03\x00\x00\x00\x01\x00\x00\x00\x0c\x00...,0,,,,,,,,...,0.00197,0.003242,0.005464,0.008266,0.018165,0.023109,0.029741,0.038527,0.046957,j-2505161020234fe888a2c9f1bb45b4e7


In [36]:
parcels = gpd.read_file('data/filtered_gdf.shp')

parcels



Unnamed: 0,taotlusaas,pollu_id,pindala_ha,taotletud_,taotletu_1,niitmise_t,niitmise_1,viimase_mu,taotletu_2,taotleja_n,taotleja_r,EC_trans_n,EC_hcat_n,EC_hcat_c,geometry
0,2021,21339359,0.34,"võilill, harilik",Põllukultuurid,Ei kuulu jälgimisele,,2021/05/20 20:08:34.000,Kliimat ja keskkonda säästvate põllumajandusta...,ERAISIK,,Dandelion common,dandelions,3301081400,"POLYGON ((24.11064 58.2612, 24.11064 58.26108,..."
1,2021,20435919,2.12,"võilill, harilik",Põllukultuurid,Niidetud,03.08.2021-09.08.2021,2021/05/14 14:14:50.000,Kliimat ja keskkonda säästvate põllumajandusta...,FIE,,Dandelion common,dandelions,3301081400,"POLYGON ((24.59144 59.20214, 24.59144 59.20215..."
2,2021,21666530,2.33,"võilill, harilik",Põllukultuurid,Niidetud,10.07.2021-11.07.2021,2021/05/22 20:46:35.000,Kliimat ja keskkonda säästvate põllumajandusta...,ERAISIK,,Dandelion common,dandelions,3301081400,"POLYGON ((24.73434 59.14479, 24.73418 59.14473..."
3,2021,21781505,0.17,"võilill, harilik",Põllukultuurid,Ei kuulu jälgimisele,,2021/05/26 15:43:20.000,Kliimat ja keskkonda säästvate põllumajandusta...,OSAÜHING VIIVEKA,10040905.0,Dandelion common,dandelions,3301081400,"POLYGON ((24.80384 58.36676, 24.80395 58.36691..."
4,2021,20435920,0.93,"võilill, harilik",Põllukultuurid,Ei kuulu jälgimisele,,2021/05/14 14:14:50.000,Kliimat ja keskkonda säästvate põllumajandusta...,FIE,,Dandelion common,dandelions,3301081400,"POLYGON ((24.60553 59.22487, 24.60602 59.22505..."
5,2021,20435918,0.46,"võilill, harilik",Põllukultuurid,Ei kuulu jälgimisele,,2021/05/14 14:14:50.000,Kliimat ja keskkonda säästvate põllumajandusta...,FIE,,Dandelion common,dandelions,3301081400,"POLYGON ((24.5973 59.20052, 24.59728 59.20044,..."
6,2021,21625865,2.89,"võilill, harilik",Põllukultuurid,Niidetud,30.07.2021-05.08.2021,2021/06/12 10:41:17.000,Keskkonnasõbraliku majandamise toetus;Kliimat ...,FIE,,Dandelion common,dandelions,3301081400,"POLYGON ((26.76172 57.97226, 26.76169 57.97218..."
7,2021,21339355,0.38,"võilill, harilik",Põllukultuurid,Ei kuulu jälgimisele,,2021/05/20 20:08:34.000,Kliimat ja keskkonda säästvate põllumajandusta...,ERAISIK,,Dandelion common,dandelions,3301081400,"POLYGON ((24.11338 58.261, 24.11331 58.26101, ..."
8,2021,21520681,2.92,"võilill, harilik",Põllukultuurid,Ei kuulu jälgimisele,,2021/05/21 14:27:56.000,Kliimat ja keskkonda säästvate põllumajandusta...,ERAISIK,,Dandelion common,dandelions,3301081400,"POLYGON ((27.1228 58.1266, 27.12282 58.12662, ..."
9,2021,20494621,0.47,"võilill, harilik",Põllukultuurid,Ei kuulu jälgimisele,,2021/05/16 09:36:00.000,Kliimat ja keskkonda säästvate põllumajandusta...,ERAISIK,,Dandelion common,dandelions,3301081400,"POLYGON ((26.73513 58.27443, 26.73589 58.27454..."


> How can I match the features to the parcels? The only common column is the geometry...

Assuming both dataframes have same order (which is not likely the case).

In [38]:
data['target'] = parcels.EC_hcat_n

data

Unnamed: 0.1,Unnamed: 0,geometry,feature_index,B02_P10,B02_P25,B02_P50,B02_P75,B02_P90,B03_P10,B03_P25,...,VH_P50,VH_P75,VH_P90,VV_P10,VV_P25,VV_P50,VV_P75,VV_P90,job_id,target
0,0,"b'\x01\x03\x00\x00\x00\x01\x00\x00\x00""\x00\x0...",0,,,,,,,,...,0.009845,0.013578,0.016449,0.039836,0.05057,0.072721,0.122933,0.147181,j-25051610175047429329ed6feb96dc20,dandelions
1,1,"b""\x01\x03\x00\x00\x00\x01\x00\x00\x00:\x00\x0...",0,,,,,,,,...,0.005155,0.007258,0.011056,0.021004,0.026354,0.035084,0.046927,0.063484,j-2505161018084db998fe92f976077609,dandelions
2,2,b'\x01\x03\x00\x00\x00\x01\x00\x00\x00\x17\x00...,0,,,,,,,,...,0.003923,0.005395,0.007169,0.020292,0.024544,0.031204,0.040079,0.053383,j-25051610182546a38540155e6dc862e0,dandelions
3,3,"b""\x01\x03\x00\x00\x00\x01\x00\x00\x00\x1b\x00...",0,,,,,,,,...,0.009548,0.014226,0.020363,0.031789,0.037089,0.049702,0.068509,0.106466,j-2505161018424cf0ae7815d18f278e7c,dandelions
4,4,b'\x01\x03\x00\x00\x00\x01\x00\x00\x00#\x00\x0...,0,,,,,,,,...,0.0045,0.006347,0.008537,0.020882,0.024984,0.031692,0.041187,0.052576,j-2505161018584abea5903b1fad46efcc,dandelions
5,5,b'\x01\x03\x00\x00\x00\x01\x00\x00\x00\x11\x00...,0,,,,,,,,...,0.008022,0.010358,0.014889,0.032923,0.039042,0.049017,0.06354,0.082097,j-250516101916430aa489f9f704370f29,dandelions
6,6,b'\x01\x03\x00\x00\x00\x02\x00\x00\x00b\x00\x0...,0,,,,,,,,...,0.009207,0.012007,0.015258,0.030508,0.039642,0.052435,0.069091,0.091799,j-2505161019344b0089b9eed895f0a803,dandelions
7,7,b'\x01\x03\x00\x00\x00\x01\x00\x00\x00\x11\x00...,0,,,,,,,,...,0.005692,0.007832,0.011728,0.027132,0.031695,0.038928,0.047977,0.057112,j-25051610195045fbbad3794af8313574,dandelions
8,8,b'\x01\x03\x00\x00\x00\x03\x00\x00\x00\xe6\x00...,0,,,,,,,,...,0.010565,0.0137,0.01697,0.030198,0.037728,0.048462,0.062068,0.075866,j-2505161020064465b5eed42de26d9a63,dandelions
9,9,b'\x01\x03\x00\x00\x00\x01\x00\x00\x00\x0c\x00...,0,,,,,,,,...,0.003242,0.005464,0.008266,0.018165,0.023109,0.029741,0.038527,0.046957,j-2505161020234fe888a2c9f1bb45b4e7,dandelions


In [39]:
from sklearn.model_selection import train_test_split

# drop columns with nans

data_clean = data.dropna(axis=1)

# drop unused columns

data_clean = data_clean.drop(columns=['Unnamed: 0', 'geometry', 'job_id', 'feature_index'])

# split train/test

X_train, X_test, y_train, y_test = train_test_split(data_clean.drop(columns=['target']), data_clean['target'], test_size=0.2, random_state=42)

In [42]:
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

model.score(X_test, y_test)


0.75

TODO:
- ingest model to EOTDL
- ingest feature recipe to EOTDL

## 5. Run inference with EOTDL

In [None]:
# sample = X_test.iloc[3]

# pred = model.predict(sample.values.reshape(1, -1))

# pred

Let's perform inference on some new parcels.

In [57]:
ix = np.random.randint(0, len(shapefiles))
country = shapefiles[ix]

country

'data/DK_2019_EC21.shp'

In [58]:
gdf = gpd.read_file(path)

gdf.head()

Unnamed: 0,taotlusaas,pollu_id,pindala_ha,taotletud_,taotletu_1,niitmise_t,niitmise_1,viimase_mu,taotletu_2,taotleja_n,taotleja_r,EC_trans_n,EC_hcat_n,EC_hcat_c,geometry
0,2021,19994165,0.25,Karjatamine väljaspool põllumaj. maad,Karjatamine väljaspool põllumaj. maad,,,2021/05/02 14:37:52.000,,FIE,,Rough grazings,pasture_meadow_grassland_grass,3302000000,"POLYGON ((26.50243 59.31839, 26.50244 59.31843..."
1,2021,19990783,1.7,rohttaimed,Püsirohumaa,Niidetud,28.06.2021-04.07.2021,2021/05/02 06:59:17.000,Kliimat ja keskkonda säästvate põllumajandusta...,ERAISIK,,grasses,pasture_meadow_grassland_grass,3302000000,"POLYGON ((24.54648 58.86884, 24.54674 58.86879..."
2,2021,19990784,0.49,rohttaimed,Püsirohumaa,Ei kuulu jälgimisele,,2021/05/02 06:59:17.000,Kliimat ja keskkonda säästvate põllumajandusta...,ERAISIK,,grasses,pasture_meadow_grassland_grass,3302000000,"POLYGON ((24.54597 58.86827, 24.54668 58.86816..."
3,2021,19996106,0.54,talinisu allakülvita,Põllukultuurid,,,2021/05/02 20:58:12.000,Kliimat ja keskkonda säästvate põllumajandusta...,ERAISIK,,Winter wheat,winter_common_soft_wheat,3301010101,"POLYGON ((27.42837 58.11975, 27.42839 58.11972..."
4,2021,19990620,2.48,"punane ristik (vähemalt 80% ristikut, kuni 20%...",Põllukultuurid,Niidetud,06.07.2021-11.07.2021,2021/07/05 07:26:35.000,Kliimat ja keskkonda säästvate põllumajandusta...,TAMSAMÄE OÜ,11350602.0,Red clover (at least 80% clover up to 20% gras...,clover,3301090303,"POLYGON ((26.66816 57.82049, 26.66815 57.8205,..."


In [59]:
gdf = gdf.sample(n=3)
gdf

Unnamed: 0,taotlusaas,pollu_id,pindala_ha,taotletud_,taotletu_1,niitmise_t,niitmise_1,viimase_mu,taotletu_2,taotleja_n,taotleja_r,EC_trans_n,EC_hcat_n,EC_hcat_c,geometry
162397,2021,22113589,21.37,talinisu allakülvita,Põllukultuurid,,,2021/06/14 15:52:11.000,Keskkonnasõbraliku majandamise toetus;Kliimat ...,OÜ KÕO AGRO,10070214,Winter wheat,winter_common_soft_wheat,3301010101,"POLYGON ((25.66627 58.63345, 25.66638 58.63344..."
42649,2021,20657638,10.17,rohttaimed,Püsirohumaa,Niidetud,16.08.2021-21.08.2021,2021/05/17 16:18:18.000,Kliimat ja keskkonda säästvate põllumajandusta...,AKTSIASELTS METSAKÜLA PIIM,10014380,grasses,pasture_meadow_grassland_grass,3302000000,"POLYGON ((24.37944 59.35179, 24.37945 59.35179..."
65776,2021,21102710,3.62,küüslauk,Põllukultuurid,,,2021/05/19 16:40:15.000,Kliimat ja keskkonda säästvate põllumajandusta...,OSAÜHING KASKEMA TALU,11017417,garlic,garlic,3301220200,"POLYGON ((24.36846 58.85519, 24.36852 58.85522..."


In [61]:
from eotdl.fe.openeo import point_extraction

# should be the same start_data and nb_monts; how can we save this  in the feature recipe?

point_extraction(gdf, start_date = "2024-01-01", nb_months = 2, job_tracker = 'jobs-inference.csv', parallel_jobs=10)

Authenticated using refresh token.


In [63]:
job = pd.read_csv("jobs-inference.csv")
job

Unnamed: 0,fid,geometry,crs,temporal_extent,id,backend_name,status,start_time,running_start_time,cpu,memory,duration
0,,"POLYGON ((25.6662734 58.63345021, 25.66637754 ...",EPSG:4326,"['2024-01-01', '2024-03-01']",j-2505161103024444b6e50166f8aeb88b,cdse,finished,2025-05-16T11:03:02Z,2025-05-16T11:04:52Z,229.88571932 cpu-seconds,1565364.35546875 mb-seconds,150 seconds
1,,"POLYGON ((24.37944058 59.35179394, 24.37944933...",EPSG:4326,"['2024-01-01', '2024-03-01']",j-250516110319411dbdd4ebb4189c99ec,cdse,finished,2025-05-16T11:03:19Z,2025-05-16T11:05:53Z,228.19191047700002 cpu-seconds,1323684.845703125 mb-seconds,193 seconds
2,,"POLYGON ((24.3684642 58.85519494, 24.36851719 ...",EPSG:4326,"['2024-01-01', '2024-03-01']",j-25051611033745fea98123d8907b6e1c,cdse,finished,2025-05-16T11:03:37Z,2025-05-16T11:05:53Z,169.140624351 cpu-seconds,1299268.8515625 mb-seconds,172 seconds


In [64]:
# Initialize an empty list to store all dataframes
all_data = []

# Loop through each job and read its parquet file
for idx, _job in job.iterrows():
    try:
        job_data = pd.read_parquet(f'job_{_job["id"]}/timeseries.parquet')
        # Add job_id as a column to identify the source
        job_data['job_id'] = _job["id"]
        all_data.append(job_data)
    except Exception as e:
        print(f"Error reading job {_job['id']}: {e}")

# Concatenate all dataframes into one
if all_data:
    data = pd.concat(all_data, ignore_index=True)
    print(f"Successfully merged {len(all_data)} time series datasets")
else:
    data = pd.DataFrame()
    print("No time series data was loaded")

data

Successfully merged 3 time series datasets


Unnamed: 0,geometry,feature_index,B02_P10,B02_P25,B02_P50,B02_P75,B02_P90,B03_P10,B03_P25,B03_P50,...,VH_P25,VH_P50,VH_P75,VH_P90,VV_P10,VV_P25,VV_P50,VV_P75,VV_P90,job_id
0,b'\x01\x03\x00\x00\x00\x01\x00\x00\x00p\x00\x0...,0,,,,,,,,,...,0.003686,0.005511,0.007969,0.014515,0.032455,0.040394,0.053392,0.071857,0.116476,j-2505161103024444b6e50166f8aeb88b
1,b'\x01\x03\x00\x00\x00\x16\x00\x00\x00<\x04\x0...,0,,,,,,,,,...,0.005453,0.00784,0.010888,0.014063,0.036952,0.046182,0.060499,0.07805,0.096651,j-250516110319411dbdd4ebb4189c99ec
2,"b'\x01\x03\x00\x00\x00\x01\x00\x00\x00,\x00\x0...",0,,,,,,,,,...,0.002787,0.004535,0.007065,0.013067,0.036071,0.045936,0.062672,0.084033,0.129215,j-25051611033745fea98123d8907b6e1c


In [66]:
# drop columns with nans (should be defined in the feature recipe, how?)

data_clean = data.dropna(axis=1)

# drop unused columns (should be defined in the feature recipe, how? maybe better to define which columns to keep)

data_clean = data_clean.drop(columns=['geometry', 'job_id', 'feature_index'])

data_clean

Unnamed: 0,VH_P10,VH_P25,VH_P50,VH_P75,VH_P90,VV_P10,VV_P25,VV_P50,VV_P75,VV_P90
0,0.002636,0.003686,0.005511,0.007969,0.014515,0.032455,0.040394,0.053392,0.071857,0.116476
1,0.003829,0.005453,0.00784,0.010888,0.014063,0.036952,0.046182,0.060499,0.07805,0.096651
2,0.001747,0.002787,0.004535,0.007065,0.013067,0.036071,0.045936,0.062672,0.084033,0.129215


In [67]:
preds = model.predict(data_clean.values)

preds




array(['spring_rapeseed_rape', 'spring_rapeseed_rape',
       'spring_rapeseed_rape'], dtype=object)

In [68]:
gdf.EC_hcat_n

162397          winter_common_soft_wheat
42649     pasture_meadow_grassland_grass
65776                             garlic
Name: EC_hcat_n, dtype: object

Of course model is not good, need to train with more parcels & classes. You can use this notebook to do so.

TODO:
- Stage model from EOTDL
- Stage feature recipe from EOTDL