In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import os

os.environ['EOTDL_API_URL'] = 'https://api.eotdl.com/'
# os.environ['EOTDL_API_URL'] = 'http://localhost:8000/'

In this use case we show how to perform feature engineering with openEO within EOTDL.

https://github.com/earthpulse/eotdl/issues/190


1. stage the EuroCrops dataset with EOTDL.
2. filter the EuroCrops Dataset to create a subset of parcels, e.g., 8 crop classes, each with 1000 examples, for one country
3. run feature engineering with openEO, creating temporal metrics from a S1 and S2 time series (temporally optimised for crops classe of interest). Store feature engineering process graph with the training datsets in EOTDL
4. Use EOTDL functionality to train a model (for this the features need to be retrieved..). Store the model along with the openEO process graph in EOTDL.
5. Use the model to run inference (from within EOTDL?) in an openEO platform such as CDSE or openEO platform. Make use of the feature engineering process graph stored along with the EOTDL model.

## 1 Stage EuroCrops from EOTDL

Dataset can be found at https://www.eotdl.com/datasets/EuroCrops/. The dataset contains a zip file, which in turn contains zip files for each country with the shapefiles (16 total).

> Uncomment the following cells to stage the dataset.

In [4]:
# !eotdl datasets get EuroCrops -v 1 -f -a
# !unzip -o ~/.cache/eotdl/datasets/EuroCrops/EuroCrops.zip -d data/

Staging assets:   0%|                                     | 0/1 [00:00<?, ?it/s]^C
Staging assets:   0%|                                     | 0/1 [00:14<?, ?it/s]


In [5]:
# from glob import glob

# zips = glob('data/*.zip')

# zips

In [6]:
# # unzip shapefiles

# import zipfile

# for zip_file in zips:
# 	with zipfile.ZipFile(zip_file, 'r') as zip_ref:
# 		zip_ref.extractall('data/')


In [7]:
# cleanup

# !rm -rf data/*.zip

List of all the shapefiles in the dataset.

In [2]:
from glob import glob

shapefiles = glob('C:/Users/VROMPAYH/OneDrive - VITO/Desktop/openeo/data/**/*.shp', recursive=True)

shapefiles

['C:/Users/VROMPAYH/OneDrive - VITO/Desktop/openeo/data\\EuroCrops\\AT_2021_EC21.shp',
 'C:/Users/VROMPAYH/OneDrive - VITO/Desktop/openeo/data\\EuroCrops\\BE_VLG_2021_EC21.shp',
 'C:/Users/VROMPAYH/OneDrive - VITO/Desktop/openeo/data\\EuroCrops\\DE_LS_2021_EC21.shp',
 'C:/Users/VROMPAYH/OneDrive - VITO/Desktop/openeo/data\\EuroCrops\\DE_NRW_2021_EC21.shp',
 'C:/Users/VROMPAYH/OneDrive - VITO/Desktop/openeo/data\\EuroCrops\\DK_2019_EC21.shp',
 'C:/Users/VROMPAYH/OneDrive - VITO/Desktop/openeo/data\\EuroCrops\\EE_2021_EC21.shp',
 'C:/Users/VROMPAYH/OneDrive - VITO/Desktop/openeo/data\\EuroCrops\\LT_2021_EC.shp',
 'C:/Users/VROMPAYH/OneDrive - VITO/Desktop/openeo/data\\EuroCrops\\LV_2021_EC21.shp',
 'C:/Users/VROMPAYH/OneDrive - VITO/Desktop/openeo/data\\EuroCrops\\NL_2020_EC21.shp',
 'C:/Users/VROMPAYH/OneDrive - VITO/Desktop/openeo/data\\EuroCrops\\SI_2021_EC21.shp',
 'C:/Users/VROMPAYH/OneDrive - VITO/Desktop/openeo/data\\EuroCrops\\SK_2021_EC21.shp',
 'C:/Users/VROMPAYH/OneDrive - VIT

In [3]:
import geopandas as gpd

path = shapefiles[9]
gdf = gpd.read_file(path)
gdf.head()


Unnamed: 0,ID,GERK_PID,SIFRA_KMRS,AREA,RASTLINA,CROP_LAT_E,COLOR,EC_NUTS3,EC_trans_n,EC_hcat_n,EC_hcat_c,geometry
0,7010691,233087,5,3386.6741,koruza za zrnje,Zea mays L.,804040,HR064,Maize,grain_maize_corn_popcorn,3301010600,"POLYGON ((555072.79 96582.665, 555053.735 9659..."
1,6184203,238296,204,3557.380688,trajno travinje,Permanent grassland,00FF00,HR064,Permanent grassland,pasture_meadow_grassland_grass,3302000000,"POLYGON ((554581.35 95741.51, 554527.685 95766..."
2,6694259,283013,204,9426.362488,trajno travinje,Permanent grassland,00FF00,HR064,Permanent grassland,pasture_meadow_grassland_grass,3302000000,"POLYGON ((546614.045 111420.645, 546610.125 11..."
3,6793446,286463,204,7615.182612,trajno travinje,Permanent grassland,00FF00,HR064,Permanent grassland,pasture_meadow_grassland_grass,3302000000,"POLYGON ((555892.23 100213.705, 555893.76 1002..."
4,6365666,399199,5,2695.138562,koruza za zrnje,Zea mays L.,804040,HR064,Maize,grain_maize_corn_popcorn,3301010600,"POLYGON ((554523.545 95946.53, 554527.295 9594..."


## 2. Filter EuroCropsDataset

Filter the EuroCropsDataset to create a subset of parcels, e.g., 8 crop classes, each with 1000 examples, for one country

In [4]:
crop_classes = gdf['EC_hcat_n'].unique()
num_samples_per_class = {class_: len(gdf[gdf['EC_hcat_n'] == class_]) for class_ in crop_classes}
num_samples_per_class = dict(sorted(num_samples_per_class.items(), key=lambda x: x[1], reverse=True))

crop_classes

array(['grain_maize_corn_popcorn', 'pasture_meadow_grassland_grass',
       'winter_barley', 'clover', 'winter_common_soft_wheat',
       'fresh_vegetables', 'temporary_grass', 'soy_soybeans',
       'winter_unspecified_cereals', 'winter_spelt', 'alfalfa_lucerne',
       'spring_common_soft_wheat', 'winter_triticale', 'potatoes',
       'arable_crops', 'fallow_land_not_crop', 'walnuts', 'winter_oats',
       'winter_rapeseed_rape', 'vineyards_wine_vine_rebland_grapes',
       'orchards_fruits', 'apples', 'peach', 'plums', 'spring_barley',
       'buckwheat', 'not_known_and_other', 'winter_rye', 'asparagus',
       'olive_plantations', 'beans', 'other_arable_land_crops', 'hops',
       'aronia_chokeberries', 'nectarine', 'cherry_cherries',
       'sweet_chestnuts', 'lavender_lavandula', 'strawberries',
       'nurseries_nursery', 'raspberry_raspberries', 'durum_hard_wheat',
       'winter_meslin', 'other_permanent_crops_plantations', 'camelina',
       'sunflower', 'phacelia', 'hemp_can

In [5]:
# filter 1000 examples per class
import numpy as np

#Â Each job runs separately, so we need to limit the number of classes and samples per class

samples = 100
num_classes = 10

# keep classes with at least 1000 samples
classes = [class_ for class_, count in num_samples_per_class.items() if count >= samples]

# random 8 classes
classes = np.random.choice(classes, num_classes, replace=False)

filtered_gdf = gdf[gdf['EC_hcat_n'].isin(classes)]
filtered_gdf = filtered_gdf.groupby('EC_hcat_n').sample(n=samples, random_state=42)
filtered_gdf.head()


Unnamed: 0,ID,GERK_PID,SIFRA_KMRS,AREA,RASTLINA,CROP_LAT_E,COLOR,EC_NUTS3,EC_trans_n,EC_hcat_n,EC_hcat_c,geometry
249112,6142379,3376530,405,395.823181,"me?ana raba (zelenjadnice, polj??ine, di?avnic...","Mixed use (vegetables, crops, aromatic plants ...",804040,SI034,"Mixed use (vegetables, crops, aromatic plants ...",arable_crops,3301000000,"POLYGON ((556117.67 122496.741, 556058.157 122..."
36595,6093356,3035069,405,197.66265,"me?ana raba (zelenjadnice, polj??ine, di?avnic...","Mixed use (vegetables, crops, aromatic plants ...",804040,SI035,"Mixed use (vegetables, crops, aromatic plants ...",arable_crops,3301000000,"POLYGON ((509277.92 111179.825, 509273.82 1111..."
139162,6176693,6056378,405,275.203647,"me?ana raba (zelenjadnice, polj??ine, di?avnic...","Mixed use (vegetables, crops, aromatic plants ...",804040,SI032,"Mixed use (vegetables, crops, aromatic plants ...",arable_crops,3301000000,"POLYGON ((563233.247 136904.826, 563234.999 13..."
158523,6699510,3279955,405,1004.529363,"me?ana raba (zelenjadnice, polj??ine, di?avnic...","Mixed use (vegetables, crops, aromatic plants ...",804040,SI044,"Mixed use (vegetables, crops, aromatic plants ...",arable_crops,3301000000,"POLYGON ((411357.665 33640.345, 411357.395 336..."
731525,6559033,3420790,405,207.861729,"me?ana raba (zelenjadnice, polj??ine, di?avnic...","Mixed use (vegetables, crops, aromatic plants ...",804040,SI037,"Mixed use (vegetables, crops, aromatic plants ...",arable_crops,3301000000,"POLYGON ((480522.649 63964.449, 480497.59 6394..."


We want to perform the polygon data extractions for S1/S2 in an efficient manner, to do so we need to group multiple geometries. Doing so allows us to execute multiple extractions in a single openEO job.

In [None]:
from dataframe_utils import split_s2sphere, combine_to_featurecollections
from pathlib import Path

split_gdf = split_s2sphere(filtered_gdf, max_points=50)
jobs_df = combine_to_featurecollections(
    split_gdf,
    property_fields=["ID", "EC_hcat_n"]
)
# Add temporal extent for each job
jobs_df["temporal_extent"] = [["2024-05-01", "2024-06-01"]] * len(jobs_df)
jobs_df["crs"] = ['EPSG:32721'] * len(jobs_df)
jobs_df["resolution"] = [10] * len(jobs_df)
jobs_df["temporal_extent"] = [["2024-05-01", "2024-06-01"]] * len(jobs_df)
jobs_df["dataset"] = [Path(path).stem] * len(jobs_df)

jobs_df

print(f"Combined: {len(filtered_gdf)} into {len(split_gdf)} jobs")
jobs_df

Combined: 1000 into 32 jobs


Unnamed: 0,s2sphere_cell_id,feature_count,properties,geometry,temporal_extent,crs,resolution,dataset
0,5144676479015059456,56,"[{'ID': 6142379, 'EC_hcat_n': 'arable_crops'},...","{'type': 'FeatureCollection', 'features': [{'t...","[2024-05-01, 2024-06-01]",EPSG:32721,10,SI_2021_EC21
1,5144641294642970624,69,"[{'ID': 6093356, 'EC_hcat_n': 'arable_crops'},...","{'type': 'FeatureCollection', 'features': [{'t...","[2024-05-01, 2024-06-01]",EPSG:32721,10,SI_2021_EC21
2,5147456044410077184,82,"[{'ID': 6176693, 'EC_hcat_n': 'arable_crops'},...","{'type': 'FeatureCollection', 'features': [{'t...","[2024-05-01, 2024-06-01]",EPSG:32721,10,SI_2021_EC21
3,5150798559758516224,19,"[{'ID': 6699510, 'EC_hcat_n': 'arable_crops'},...","{'type': 'FeatureCollection', 'features': [{'t...","[2024-05-01, 2024-06-01]",EPSG:32721,10,SI_2021_EC21
4,5144500557154615296,18,"[{'ID': 6559033, 'EC_hcat_n': 'arable_crops'},...","{'type': 'FeatureCollection', 'features': [{'t...","[2024-05-01, 2024-06-01]",EPSG:32721,10,SI_2021_EC21
5,5144535741526704128,75,"[{'ID': 6654316, 'EC_hcat_n': 'arable_crops'},...","{'type': 'FeatureCollection', 'features': [{'t...","[2024-05-01, 2024-06-01]",EPSG:32721,10,SI_2021_EC21
6,5150657822270160896,48,"[{'ID': 6540050, 'EC_hcat_n': 'arable_crops'},...","{'type': 'FeatureCollection', 'features': [{'t...","[2024-05-01, 2024-06-01]",EPSG:32721,10,SI_2021_EC21
7,5144711663387148288,78,"[{'ID': 6377164, 'EC_hcat_n': 'arable_crops'},...","{'type': 'FeatureCollection', 'features': [{'t...","[2024-05-01, 2024-06-01]",EPSG:32721,10,SI_2021_EC21
8,5144465372782526464,17,"[{'ID': 6331530, 'EC_hcat_n': 'arable_crops'},...","{'type': 'FeatureCollection', 'features': [{'t...","[2024-05-01, 2024-06-01]",EPSG:32721,10,SI_2021_EC21
9,5147596781898432512,11,"[{'ID': 6230469, 'EC_hcat_n': 'arable_crops'},...","{'type': 'FeatureCollection', 'features': [{'t...","[2024-05-01, 2024-06-01]",EPSG:32721,10,SI_2021_EC21


In [16]:
row = jobs_df.iloc[0]
list(row['properties'][0].keys())[0]

'ID'

The single openEO job will now process multiple , 'neighbouring', polygons per job. These polygons have been combined in a feature collection. 

In [23]:
#!pip install openeo
import openeo
import pandas as pd

def start_job(row: pd.Series, connection: openeo.Connection, **kwargs) -> openeo.BatchJob:

        temporal_extent = row["temporal_extent"]
        geometry = row["geometry"]
        crs = row["crs"]
        resolution = float(row["resolution"])
        dataset = row["dataset"]


        #run the s1 and s2 udp
        s1 = connection.datacube_from_process(
                "s1_weekly_statistics",
                namespace="https://raw.githubusercontent.com/earthpulse/eotdl/refs/heads/hv_openeoexample/tutorials/notebooks/openeo/s1_weekly_statistics.json",
                temporal_extent=temporal_extent,
                ).filter_spatial(geometry).resample_spatial(resolution=resolution, projection=crs, method='bilinear')
        
        s2 = connection.datacube_from_process(
                "s2_weekly_statistics",
                namespace="https://raw.githubusercontent.com/earthpulse/eotdl/refs/heads/hv_openeoexample/tutorials/notebooks/openeo/s2_weekly_statistics.json",
                temporal_extent=temporal_extent,
                ).filter_spatial(geometry).resample_spatial(resolution=resolution, projection=crs, method='bilinear')
        
        #merge both cubes and filter across the feature collection
        result = s2.merge_cubes(s1)
        
        #dedicated job settings to save the individual features within a collection seperately
        job = result.create_job(
                out_format="NetCDF",
                sample_by_feature = True,
                feature_id_property=list(row['properties'][0].keys())[0], #TODO issue; this is not consistent in all the files I checked; so now I assume ID is given first en then EChat
                filename_prefix = row["dataset"] + "_id_"
        )

        return job

In [None]:
test_df = jobs_df.iloc[:1]
from openeo.extra.job_management import MultiBackendJobManager, CsvJobDatabase

# Authenticate and add the backend
connection = openeo.connect(url="openeo.dataspace.copernicus.eu").authenticate_oidc()

# initialize the job manager
manager = MultiBackendJobManager()
manager.add_backend("cdse", connection=connection, parallel_jobs=2)

job_tracker = 'jobs.csv'
job_db = CsvJobDatabase(path=job_tracker)
df = manager._normalize_df(test_df)
job_db.persist(df)

manager.run_jobs(start_job=start_job, job_db=job_db)

Authenticated using refresh token.


## 4. Train a model with EOTDL

We will train a simple random forest model on the features.


In [35]:
data = pd.read_csv('data/features.csv')

data

Unnamed: 0.1,Unnamed: 0,geometry,feature_index,B02_P10,B02_P25,B02_P50,B02_P75,B02_P90,B03_P10,B03_P25,...,VH_P25,VH_P50,VH_P75,VH_P90,VV_P10,VV_P25,VV_P50,VV_P75,VV_P90,job_id
0,0,"b'\x01\x03\x00\x00\x00\x01\x00\x00\x00""\x00\x0...",0,,,,,,,,...,0.005915,0.009845,0.013578,0.016449,0.039836,0.05057,0.072721,0.122933,0.147181,j-25051610175047429329ed6feb96dc20
1,1,"b""\x01\x03\x00\x00\x00\x01\x00\x00\x00:\x00\x0...",0,,,,,,,,...,0.003498,0.005155,0.007258,0.011056,0.021004,0.026354,0.035084,0.046927,0.063484,j-2505161018084db998fe92f976077609
2,2,b'\x01\x03\x00\x00\x00\x01\x00\x00\x00\x17\x00...,0,,,,,,,,...,0.002803,0.003923,0.005395,0.007169,0.020292,0.024544,0.031204,0.040079,0.053383,j-25051610182546a38540155e6dc862e0
3,3,"b""\x01\x03\x00\x00\x00\x01\x00\x00\x00\x1b\x00...",0,,,,,,,,...,0.006838,0.009548,0.014226,0.020363,0.031789,0.037089,0.049702,0.068509,0.106466,j-2505161018424cf0ae7815d18f278e7c
4,4,b'\x01\x03\x00\x00\x00\x01\x00\x00\x00#\x00\x0...,0,,,,,,,,...,0.003203,0.0045,0.006347,0.008537,0.020882,0.024984,0.031692,0.041187,0.052576,j-2505161018584abea5903b1fad46efcc
5,5,b'\x01\x03\x00\x00\x00\x01\x00\x00\x00\x11\x00...,0,,,,,,,,...,0.00595,0.008022,0.010358,0.014889,0.032923,0.039042,0.049017,0.06354,0.082097,j-250516101916430aa489f9f704370f29
6,6,b'\x01\x03\x00\x00\x00\x02\x00\x00\x00b\x00\x0...,0,,,,,,,,...,0.006577,0.009207,0.012007,0.015258,0.030508,0.039642,0.052435,0.069091,0.091799,j-2505161019344b0089b9eed895f0a803
7,7,b'\x01\x03\x00\x00\x00\x01\x00\x00\x00\x11\x00...,0,,,,,,,,...,0.00377,0.005692,0.007832,0.011728,0.027132,0.031695,0.038928,0.047977,0.057112,j-25051610195045fbbad3794af8313574
8,8,b'\x01\x03\x00\x00\x00\x03\x00\x00\x00\xe6\x00...,0,,,,,,,,...,0.007683,0.010565,0.0137,0.01697,0.030198,0.037728,0.048462,0.062068,0.075866,j-2505161020064465b5eed42de26d9a63
9,9,b'\x01\x03\x00\x00\x00\x01\x00\x00\x00\x0c\x00...,0,,,,,,,,...,0.00197,0.003242,0.005464,0.008266,0.018165,0.023109,0.029741,0.038527,0.046957,j-2505161020234fe888a2c9f1bb45b4e7


In [36]:
parcels = gpd.read_file('data/filtered_gdf.shp')

parcels



Unnamed: 0,taotlusaas,pollu_id,pindala_ha,taotletud_,taotletu_1,niitmise_t,niitmise_1,viimase_mu,taotletu_2,taotleja_n,taotleja_r,EC_trans_n,EC_hcat_n,EC_hcat_c,geometry
0,2021,21339359,0.34,"vÃµilill, harilik",PÃµllukultuurid,Ei kuulu jÃ¤lgimisele,,2021/05/20 20:08:34.000,Kliimat ja keskkonda sÃ¤Ã¤stvate pÃµllumajandusta...,ERAISIK,,Dandelion common,dandelions,3301081400,"POLYGON ((24.11064 58.2612, 24.11064 58.26108,..."
1,2021,20435919,2.12,"vÃµilill, harilik",PÃµllukultuurid,Niidetud,03.08.2021-09.08.2021,2021/05/14 14:14:50.000,Kliimat ja keskkonda sÃ¤Ã¤stvate pÃµllumajandusta...,FIE,,Dandelion common,dandelions,3301081400,"POLYGON ((24.59144 59.20214, 24.59144 59.20215..."
2,2021,21666530,2.33,"vÃµilill, harilik",PÃµllukultuurid,Niidetud,10.07.2021-11.07.2021,2021/05/22 20:46:35.000,Kliimat ja keskkonda sÃ¤Ã¤stvate pÃµllumajandusta...,ERAISIK,,Dandelion common,dandelions,3301081400,"POLYGON ((24.73434 59.14479, 24.73418 59.14473..."
3,2021,21781505,0.17,"vÃµilill, harilik",PÃµllukultuurid,Ei kuulu jÃ¤lgimisele,,2021/05/26 15:43:20.000,Kliimat ja keskkonda sÃ¤Ã¤stvate pÃµllumajandusta...,OSAÃHING VIIVEKA,10040905.0,Dandelion common,dandelions,3301081400,"POLYGON ((24.80384 58.36676, 24.80395 58.36691..."
4,2021,20435920,0.93,"vÃµilill, harilik",PÃµllukultuurid,Ei kuulu jÃ¤lgimisele,,2021/05/14 14:14:50.000,Kliimat ja keskkonda sÃ¤Ã¤stvate pÃµllumajandusta...,FIE,,Dandelion common,dandelions,3301081400,"POLYGON ((24.60553 59.22487, 24.60602 59.22505..."
5,2021,20435918,0.46,"vÃµilill, harilik",PÃµllukultuurid,Ei kuulu jÃ¤lgimisele,,2021/05/14 14:14:50.000,Kliimat ja keskkonda sÃ¤Ã¤stvate pÃµllumajandusta...,FIE,,Dandelion common,dandelions,3301081400,"POLYGON ((24.5973 59.20052, 24.59728 59.20044,..."
6,2021,21625865,2.89,"vÃµilill, harilik",PÃµllukultuurid,Niidetud,30.07.2021-05.08.2021,2021/06/12 10:41:17.000,KeskkonnasÃµbraliku majandamise toetus;Kliimat ...,FIE,,Dandelion common,dandelions,3301081400,"POLYGON ((26.76172 57.97226, 26.76169 57.97218..."
7,2021,21339355,0.38,"vÃµilill, harilik",PÃµllukultuurid,Ei kuulu jÃ¤lgimisele,,2021/05/20 20:08:34.000,Kliimat ja keskkonda sÃ¤Ã¤stvate pÃµllumajandusta...,ERAISIK,,Dandelion common,dandelions,3301081400,"POLYGON ((24.11338 58.261, 24.11331 58.26101, ..."
8,2021,21520681,2.92,"vÃµilill, harilik",PÃµllukultuurid,Ei kuulu jÃ¤lgimisele,,2021/05/21 14:27:56.000,Kliimat ja keskkonda sÃ¤Ã¤stvate pÃµllumajandusta...,ERAISIK,,Dandelion common,dandelions,3301081400,"POLYGON ((27.1228 58.1266, 27.12282 58.12662, ..."
9,2021,20494621,0.47,"vÃµilill, harilik",PÃµllukultuurid,Ei kuulu jÃ¤lgimisele,,2021/05/16 09:36:00.000,Kliimat ja keskkonda sÃ¤Ã¤stvate pÃµllumajandusta...,ERAISIK,,Dandelion common,dandelions,3301081400,"POLYGON ((26.73513 58.27443, 26.73589 58.27454..."


> How can I match the features to the parcels? The only common column is the geometry...

Assuming both dataframes have same order (which is not likely the case).

In [38]:
data['target'] = parcels.EC_hcat_n

data

Unnamed: 0.1,Unnamed: 0,geometry,feature_index,B02_P10,B02_P25,B02_P50,B02_P75,B02_P90,B03_P10,B03_P25,...,VH_P50,VH_P75,VH_P90,VV_P10,VV_P25,VV_P50,VV_P75,VV_P90,job_id,target
0,0,"b'\x01\x03\x00\x00\x00\x01\x00\x00\x00""\x00\x0...",0,,,,,,,,...,0.009845,0.013578,0.016449,0.039836,0.05057,0.072721,0.122933,0.147181,j-25051610175047429329ed6feb96dc20,dandelions
1,1,"b""\x01\x03\x00\x00\x00\x01\x00\x00\x00:\x00\x0...",0,,,,,,,,...,0.005155,0.007258,0.011056,0.021004,0.026354,0.035084,0.046927,0.063484,j-2505161018084db998fe92f976077609,dandelions
2,2,b'\x01\x03\x00\x00\x00\x01\x00\x00\x00\x17\x00...,0,,,,,,,,...,0.003923,0.005395,0.007169,0.020292,0.024544,0.031204,0.040079,0.053383,j-25051610182546a38540155e6dc862e0,dandelions
3,3,"b""\x01\x03\x00\x00\x00\x01\x00\x00\x00\x1b\x00...",0,,,,,,,,...,0.009548,0.014226,0.020363,0.031789,0.037089,0.049702,0.068509,0.106466,j-2505161018424cf0ae7815d18f278e7c,dandelions
4,4,b'\x01\x03\x00\x00\x00\x01\x00\x00\x00#\x00\x0...,0,,,,,,,,...,0.0045,0.006347,0.008537,0.020882,0.024984,0.031692,0.041187,0.052576,j-2505161018584abea5903b1fad46efcc,dandelions
5,5,b'\x01\x03\x00\x00\x00\x01\x00\x00\x00\x11\x00...,0,,,,,,,,...,0.008022,0.010358,0.014889,0.032923,0.039042,0.049017,0.06354,0.082097,j-250516101916430aa489f9f704370f29,dandelions
6,6,b'\x01\x03\x00\x00\x00\x02\x00\x00\x00b\x00\x0...,0,,,,,,,,...,0.009207,0.012007,0.015258,0.030508,0.039642,0.052435,0.069091,0.091799,j-2505161019344b0089b9eed895f0a803,dandelions
7,7,b'\x01\x03\x00\x00\x00\x01\x00\x00\x00\x11\x00...,0,,,,,,,,...,0.005692,0.007832,0.011728,0.027132,0.031695,0.038928,0.047977,0.057112,j-25051610195045fbbad3794af8313574,dandelions
8,8,b'\x01\x03\x00\x00\x00\x03\x00\x00\x00\xe6\x00...,0,,,,,,,,...,0.010565,0.0137,0.01697,0.030198,0.037728,0.048462,0.062068,0.075866,j-2505161020064465b5eed42de26d9a63,dandelions
9,9,b'\x01\x03\x00\x00\x00\x01\x00\x00\x00\x0c\x00...,0,,,,,,,,...,0.003242,0.005464,0.008266,0.018165,0.023109,0.029741,0.038527,0.046957,j-2505161020234fe888a2c9f1bb45b4e7,dandelions


In [39]:
from sklearn.model_selection import train_test_split

# drop columns with nans

data_clean = data.dropna(axis=1)

# drop unused columns

data_clean = data_clean.drop(columns=['Unnamed: 0', 'geometry', 'job_id', 'feature_index'])

# split train/test

X_train, X_test, y_train, y_test = train_test_split(data_clean.drop(columns=['target']), data_clean['target'], test_size=0.2, random_state=42)

In [42]:
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

model.score(X_test, y_test)


0.75

TODO:
- ingest model to EOTDL
- ingest feature recipe to EOTDL

## 5. Run inference with EOTDL

In [None]:
# sample = X_test.iloc[3]

# pred = model.predict(sample.values.reshape(1, -1))

# pred

Let's perform inference on some new parcels.

In [57]:
ix = np.random.randint(0, len(shapefiles))
country = shapefiles[ix]

country

'data/DK_2019_EC21.shp'

In [58]:
gdf = gpd.read_file(path)

gdf.head()

Unnamed: 0,taotlusaas,pollu_id,pindala_ha,taotletud_,taotletu_1,niitmise_t,niitmise_1,viimase_mu,taotletu_2,taotleja_n,taotleja_r,EC_trans_n,EC_hcat_n,EC_hcat_c,geometry
0,2021,19994165,0.25,Karjatamine vÃ¤ljaspool pÃµllumaj. maad,Karjatamine vÃ¤ljaspool pÃµllumaj. maad,,,2021/05/02 14:37:52.000,,FIE,,Rough grazings,pasture_meadow_grassland_grass,3302000000,"POLYGON ((26.50243 59.31839, 26.50244 59.31843..."
1,2021,19990783,1.7,rohttaimed,PÃ¼sirohumaa,Niidetud,28.06.2021-04.07.2021,2021/05/02 06:59:17.000,Kliimat ja keskkonda sÃ¤Ã¤stvate pÃµllumajandusta...,ERAISIK,,grasses,pasture_meadow_grassland_grass,3302000000,"POLYGON ((24.54648 58.86884, 24.54674 58.86879..."
2,2021,19990784,0.49,rohttaimed,PÃ¼sirohumaa,Ei kuulu jÃ¤lgimisele,,2021/05/02 06:59:17.000,Kliimat ja keskkonda sÃ¤Ã¤stvate pÃµllumajandusta...,ERAISIK,,grasses,pasture_meadow_grassland_grass,3302000000,"POLYGON ((24.54597 58.86827, 24.54668 58.86816..."
3,2021,19996106,0.54,talinisu allakÃ¼lvita,PÃµllukultuurid,,,2021/05/02 20:58:12.000,Kliimat ja keskkonda sÃ¤Ã¤stvate pÃµllumajandusta...,ERAISIK,,Winter wheat,winter_common_soft_wheat,3301010101,"POLYGON ((27.42837 58.11975, 27.42839 58.11972..."
4,2021,19990620,2.48,"punane ristik (vÃ¤hemalt 80% ristikut, kuni 20%...",PÃµllukultuurid,Niidetud,06.07.2021-11.07.2021,2021/07/05 07:26:35.000,Kliimat ja keskkonda sÃ¤Ã¤stvate pÃµllumajandusta...,TAMSAMÃE OÃ,11350602.0,Red clover (at least 80% clover up to 20% gras...,clover,3301090303,"POLYGON ((26.66816 57.82049, 26.66815 57.8205,..."


In [59]:
gdf = gdf.sample(n=3)
gdf

Unnamed: 0,taotlusaas,pollu_id,pindala_ha,taotletud_,taotletu_1,niitmise_t,niitmise_1,viimase_mu,taotletu_2,taotleja_n,taotleja_r,EC_trans_n,EC_hcat_n,EC_hcat_c,geometry
162397,2021,22113589,21.37,talinisu allakÃ¼lvita,PÃµllukultuurid,,,2021/06/14 15:52:11.000,KeskkonnasÃµbraliku majandamise toetus;Kliimat ...,OÃ KÃO AGRO,10070214,Winter wheat,winter_common_soft_wheat,3301010101,"POLYGON ((25.66627 58.63345, 25.66638 58.63344..."
42649,2021,20657638,10.17,rohttaimed,PÃ¼sirohumaa,Niidetud,16.08.2021-21.08.2021,2021/05/17 16:18:18.000,Kliimat ja keskkonda sÃ¤Ã¤stvate pÃµllumajandusta...,AKTSIASELTS METSAKÃLA PIIM,10014380,grasses,pasture_meadow_grassland_grass,3302000000,"POLYGON ((24.37944 59.35179, 24.37945 59.35179..."
65776,2021,21102710,3.62,kÃ¼Ã¼slauk,PÃµllukultuurid,,,2021/05/19 16:40:15.000,Kliimat ja keskkonda sÃ¤Ã¤stvate pÃµllumajandusta...,OSAÃHING KASKEMA TALU,11017417,garlic,garlic,3301220200,"POLYGON ((24.36846 58.85519, 24.36852 58.85522..."


In [61]:
from eotdl.fe.openeo import point_extraction

# should be the same start_data and nb_monts; how can we save this  in the feature recipe?

point_extraction(gdf, start_date = "2024-01-01", nb_months = 2, job_tracker = 'jobs-inference.csv', parallel_jobs=10)

Authenticated using refresh token.


In [63]:
job = pd.read_csv("jobs-inference.csv")
job

Unnamed: 0,fid,geometry,crs,temporal_extent,id,backend_name,status,start_time,running_start_time,cpu,memory,duration
0,,"POLYGON ((25.6662734 58.63345021, 25.66637754 ...",EPSG:4326,"['2024-01-01', '2024-03-01']",j-2505161103024444b6e50166f8aeb88b,cdse,finished,2025-05-16T11:03:02Z,2025-05-16T11:04:52Z,229.88571932 cpu-seconds,1565364.35546875 mb-seconds,150 seconds
1,,"POLYGON ((24.37944058 59.35179394, 24.37944933...",EPSG:4326,"['2024-01-01', '2024-03-01']",j-250516110319411dbdd4ebb4189c99ec,cdse,finished,2025-05-16T11:03:19Z,2025-05-16T11:05:53Z,228.19191047700002 cpu-seconds,1323684.845703125 mb-seconds,193 seconds
2,,"POLYGON ((24.3684642 58.85519494, 24.36851719 ...",EPSG:4326,"['2024-01-01', '2024-03-01']",j-25051611033745fea98123d8907b6e1c,cdse,finished,2025-05-16T11:03:37Z,2025-05-16T11:05:53Z,169.140624351 cpu-seconds,1299268.8515625 mb-seconds,172 seconds


In [64]:
# Initialize an empty list to store all dataframes
all_data = []

# Loop through each job and read its parquet file
for idx, _job in job.iterrows():
    try:
        job_data = pd.read_parquet(f'job_{_job["id"]}/timeseries.parquet')
        # Add job_id as a column to identify the source
        job_data['job_id'] = _job["id"]
        all_data.append(job_data)
    except Exception as e:
        print(f"Error reading job {_job['id']}: {e}")

# Concatenate all dataframes into one
if all_data:
    data = pd.concat(all_data, ignore_index=True)
    print(f"Successfully merged {len(all_data)} time series datasets")
else:
    data = pd.DataFrame()
    print("No time series data was loaded")

data

Successfully merged 3 time series datasets


Unnamed: 0,geometry,feature_index,B02_P10,B02_P25,B02_P50,B02_P75,B02_P90,B03_P10,B03_P25,B03_P50,...,VH_P25,VH_P50,VH_P75,VH_P90,VV_P10,VV_P25,VV_P50,VV_P75,VV_P90,job_id
0,b'\x01\x03\x00\x00\x00\x01\x00\x00\x00p\x00\x0...,0,,,,,,,,,...,0.003686,0.005511,0.007969,0.014515,0.032455,0.040394,0.053392,0.071857,0.116476,j-2505161103024444b6e50166f8aeb88b
1,b'\x01\x03\x00\x00\x00\x16\x00\x00\x00<\x04\x0...,0,,,,,,,,,...,0.005453,0.00784,0.010888,0.014063,0.036952,0.046182,0.060499,0.07805,0.096651,j-250516110319411dbdd4ebb4189c99ec
2,"b'\x01\x03\x00\x00\x00\x01\x00\x00\x00,\x00\x0...",0,,,,,,,,,...,0.002787,0.004535,0.007065,0.013067,0.036071,0.045936,0.062672,0.084033,0.129215,j-25051611033745fea98123d8907b6e1c


In [66]:
# drop columns with nans (should be defined in the feature recipe, how?)

data_clean = data.dropna(axis=1)

# drop unused columns (should be defined in the feature recipe, how? maybe better to define which columns to keep)

data_clean = data_clean.drop(columns=['geometry', 'job_id', 'feature_index'])

data_clean

Unnamed: 0,VH_P10,VH_P25,VH_P50,VH_P75,VH_P90,VV_P10,VV_P25,VV_P50,VV_P75,VV_P90
0,0.002636,0.003686,0.005511,0.007969,0.014515,0.032455,0.040394,0.053392,0.071857,0.116476
1,0.003829,0.005453,0.00784,0.010888,0.014063,0.036952,0.046182,0.060499,0.07805,0.096651
2,0.001747,0.002787,0.004535,0.007065,0.013067,0.036071,0.045936,0.062672,0.084033,0.129215


In [67]:
preds = model.predict(data_clean.values)

preds




array(['spring_rapeseed_rape', 'spring_rapeseed_rape',
       'spring_rapeseed_rape'], dtype=object)

In [68]:
gdf.EC_hcat_n

162397          winter_common_soft_wheat
42649     pasture_meadow_grassland_grass
65776                             garlic
Name: EC_hcat_n, dtype: object

Of course model is not good, need to train with more parcels & classes. You can use this notebook to do so.

TODO:
- Stage model from EOTDL
- Stage feature recipe from EOTDL