In [1]:
%load_ext autoreload
%autoreload 2

In [1]:
import os

os.environ['EOTDL_API_URL'] = 'https://api.eotdl.com/'
# os.environ['EOTDL_API_URL'] = 'http://localhost:8000/'

In this use case we show how to perform feature engineering with openEO within EOTDL.

https://github.com/earthpulse/eotdl/issues/190


1. stage the EuroCrops dataset with EOTDL.
2. filter the EuroCrops Dataset to create a subset of parcels, e.g., 8 crop classes, each with 1000 examples, for one country
3. run feature engineering with openEO, creating temporal metrics from a S1 and S2 time series (temporally optimised for crops classe of interest). Store feature engineering process graph with the training datsets in EOTDL
4. Use EOTDL functionality to train a model (for this the features need to be retrieved..). Store the model along with the openEO process graph in EOTDL.
5. Use the model to run inference (from within EOTDL?) in an openEO platform such as CDSE or openEO platform. Make use of the feature engineering process graph stored along with the EOTDL model.

## 1 Stage EuroCrops from EOTDL

Dataset can be found at https://www.eotdl.com/datasets/EuroCrops/. The dataset contains a zip file, which in turn contains zip files for each country with the shapefiles (16 total).

> Uncomment the following cells to stage the dataset.

In [4]:
# !eotdl datasets get EuroCrops -v 1 -f -a
# !unzip -o ~/.cache/eotdl/datasets/EuroCrops/EuroCrops.zip -d data/

Staging assets:   0%|                                     | 0/1 [00:00<?, ?it/s]^C
Staging assets:   0%|                                     | 0/1 [00:14<?, ?it/s]


In [5]:
# from glob import glob

# zips = glob('data/*.zip')

# zips

In [6]:
# # unzip shapefiles

# import zipfile

# for zip_file in zips:
# 	with zipfile.ZipFile(zip_file, 'r') as zip_ref:
# 		zip_ref.extractall('data/')


In [None]:
# cleanup

# !rm -rf data/*.zip

'C:/Users/VROMPAYH/OneDrive - VITO/Desktop/openeo/data\\EuroCrops\\LT_2021_EC.shp'

List of all the shapefiles in the dataset.

In [1]:
from glob import glob
import geopandas as gpd
import numpy as np


shapefiles = glob('C:/Users/VROMPAYH/OneDrive - VITO/Desktop/openeo/data/**/*.shp', recursive=True)
path = shapefiles[np.random.randint(0, len(shapefiles))]
gdf = gpd.read_file(path)
gdf.head()


Unnamed: 0,FEATURE,REFSIGPAC,CP,CMUNICIPIO,MUNICIPIO,POLIGONO,PARCELA,CRECINTO,IDUSO21,USO21,...,COMARCA,REGION,SUPINTEECO,GEOM_AREA,GEOM_PERI,BEGINLIFE,EC_trans_n,EC_hcat_n,EC_hcat_c,geometry
0,25000237,10100017001,0,1,Abáigar,1,17,1,FO,FORESTAL,...,Tierra Estella,,,1193.98,162.77,01/12/2020,FOREST,tree_wood_forest,3306000000,"POLYGON ((569939.795 4722602.27, 569954.121 47..."
1,25000237,10100024001,0,1,Abáigar,1,24,1,PR,PASTO ARBUSTIVO,...,Tierra Estella,503.0,,3406.23,266.73,01/12/2020,SHRUBRY PASTURE,pasture_meadow_grassland_grass,3302000000,"POLYGON ((570160.749 4722441.89, 570164.54 472..."
2,25000237,10100024002,0,1,Abáigar,1,24,2,IM,IMPRODUCTIVOS,...,Tierra Estella,,,264.75,78.14,01/12/2020,UNPRODUCTIVE,not_known_and_other,3399000000,"POLYGON ((570160.749 4722441.89, 570160.657 47..."
3,25000237,10100025001,0,1,Abáigar,1,25,1,TA,TIERRA ARABLE,...,Tierra Estella,701.0,,6231.07,380.89,01/12/2020,ARABLE LAND,arable_crops,3301000000,"POLYGON ((570097.62 4722313.609, 570095.036 47..."
4,25000237,10100025002,0,1,Abáigar,1,25,2,PR,PASTO ARBUSTIVO,...,Tierra Estella,701.0,,529.55,260.36,01/12/2020,SHRUBRY PASTURE,pasture_meadow_grassland_grass,3302000000,"POLYGON ((570026.714 4722383.344, 570018.504 4..."


## 2. Filter EuroCropsDataset

Filter the EuroCropsDataset to create a subset of parcels, e.g., 8 crop classes, each with 1000 examples, for one country

In [2]:
crop_classes = gdf['EC_hcat_n'].unique()
num_samples_per_class = {class_: len(gdf[gdf['EC_hcat_n'] == class_]) for class_ in crop_classes}
num_samples_per_class = dict(sorted(num_samples_per_class.items(), key=lambda x: x[1], reverse=True))

crop_classes

array(['tree_wood_forest', 'pasture_meadow_grassland_grass',
       'not_known_and_other', 'arable_crops', 'fresh_vegetables',
       'olive_plantations', 'orchards_fruits',
       'vineyards_wine_vine_rebland_grapes', 'nuts',
       'greenhouse_foil_film'], dtype=object)

In [3]:
# filter 1000 examples per class
import numpy as np

# Each job runs separately, so we need to limit the number of classes and samples per class
samples = 1000
num_classes = 5

# keep classes with at least 1000 samples
classes = [class_ for class_, count in num_samples_per_class.items() if count >= samples]

# random 8 classes
classes = np.random.choice(classes, num_classes, replace=False)

filtered_gdf = gdf[gdf['EC_hcat_n'].isin(classes)]
filtered_gdf = filtered_gdf.groupby('EC_hcat_n').sample(n=samples, random_state=42)
filtered_gdf.head()


Unnamed: 0,FEATURE,REFSIGPAC,CP,CMUNICIPIO,MUNICIPIO,POLIGONO,PARCELA,CRECINTO,IDUSO21,USO21,...,COMARCA,REGION,SUPINTEECO,GEOM_AREA,GEOM_PERI,BEGINLIFE,EC_trans_n,EC_hcat_n,EC_hcat_c,geometry
227212,25000237,681300582001,0,68,Cascante,13,582,1,TA,TIERRA ARABLE,...,Ribera Baja,1601,,12769.2,451.43,01/12/2020,ARABLE LAND,arable_crops,3301000000,"POLYGON ((607494.578 4655185.828, 607453.102 4..."
892454,25000237,6901900406001,0,690,Bardenas Reales,19,406,1,TA,TIERRA ARABLE,...,Ribera Baja,301,,18070.99,579.28,01/12/2020,ARABLE LAND,arable_crops,3301000000,"POLYGON ((629434.456 4676726.739, 629449.109 4..."
313748,25000237,860500041001,0,86,Valle de Egüés / Eguesibar,5,41,1,TA,TIERRA ARABLE,...,Cuenca de Pamplona,901,,30496.2,694.58,01/12/2020,ARABLE LAND,arable_crops,3301000000,"POLYGON ((619886.211 4740694.319, 619888.546 4..."
788899,25000237,2370200243001,0,237,Unciti,2,243,1,TA,TIERRA ARABLE,...,Pirineos,901,,1581.5,199.23,01/12/2020,ARABLE LAND,arable_crops,3301000000,"POLYGON ((623981.986 4731952.193, 623968.854 4..."
933286,25000237,1710100293009,0,171,Miranda de Arga,1,293,9,TA,TIERRA ARABLE,...,Ribera Alta-Aragón,503,,277.09,137.33,01/12/2020,ARABLE LAND,arable_crops,3301000000,"POLYGON ((598367.569 4703227.237, 598354.416 4..."


We want to perform the polygon data extractions for S1/S2 in an efficient manner, to do so we need to group multiple geometries. Doing so allows us to execute multiple extractions in a single openEO job.

In [9]:
from pathlib import Path
from dataframe_utils import FeatureCollectionBuilder

builder = FeatureCollectionBuilder(
     resolution=10,
     property_fields=['EC_hcat_n'],
     max_points=50,
     start_level=6
 )

#  Build feature collections
jobs_df = builder.build([filtered_gdf])

#add additional info
jobs_df["temporal_extent"] = [["2024-05-01", "2024-06-01"]] * len(jobs_df)
jobs_df["dataset"] = [Path(path).stem] * len(jobs_df)

jobs_df



Input #1: split into 266 parts, kept 266
Total partitions: 266
Generated 266 records, skipped 0
Final DataFrame: 266 rows from 5000 input features


Unnamed: 0,s2_cell_id,feature_count,properties,geometry,temporal_extent,dataset
0,962085868443533312,23,"[{'FEATURE': 25000237, 'EC_hcat_n': 'arable_cr...","{'type': 'FeatureCollection', 'features': [{'t...","[2024-05-01, 2024-06-01]",ES_NA_2020_EC21
1,962091366001672192,45,"[{'FEATURE': 25000237, 'EC_hcat_n': 'arable_cr...","{'type': 'FeatureCollection', 'features': [{'t...","[2024-05-01, 2024-06-01]",ES_NA_2020_EC21
2,962093565024927744,14,"[{'FEATURE': 25000237, 'EC_hcat_n': 'arable_cr...","{'type': 'FeatureCollection', 'features': [{'t...","[2024-05-01, 2024-06-01]",ES_NA_2020_EC21
3,962095764048183296,12,"[{'FEATURE': 25000237, 'EC_hcat_n': 'arable_cr...","{'type': 'FeatureCollection', 'features': [{'t...","[2024-05-01, 2024-06-01]",ES_NA_2020_EC21
4,962097963071438848,23,"[{'FEATURE': 25000237, 'EC_hcat_n': 'arable_cr...","{'type': 'FeatureCollection', 'features': [{'t...","[2024-05-01, 2024-06-01]",ES_NA_2020_EC21
...,...,...,...,...,...,...
261,959543797560115200,12,"[{'FEATURE': 25000237, 'EC_hcat_n': 'greenhous...","{'type': 'FeatureCollection', 'features': [{'t...","[2024-05-01, 2024-06-01]",ES_NA_2020_EC21
262,959618564350803968,27,"[{'FEATURE': 25000237, 'EC_hcat_n': 'arable_cr...","{'type': 'FeatureCollection', 'features': [{'t...","[2024-05-01, 2024-06-01]",ES_NA_2020_EC21
263,959759301839159296,2,"[{'FEATURE': 25000237, 'EC_hcat_n': 'orchards_...","{'type': 'FeatureCollection', 'features': [{'t...","[2024-05-01, 2024-06-01]",ES_NA_2020_EC21
264,961237045466890240,22,"[{'FEATURE': 25000237, 'EC_hcat_n': 'arable_cr...","{'type': 'FeatureCollection', 'features': [{'t...","[2024-05-01, 2024-06-01]",ES_NA_2020_EC21


The single openEO job will now process multiple , 'neighbouring', polygons per job. These polygons have been combined in a feature collection. 

In [10]:
#!pip install openeo
import openeo
import pandas as pd

def start_job(row: pd.Series, connection: openeo.Connection, **kwargs) -> openeo.BatchJob:

        temporal_extent = row["temporal_extent"]
        geometry = row["geometry"]

        #run the s1 and s2 udp
        s1 = connection.datacube_from_process(
                "s1_weekly_statistics",
                namespace="https://raw.githubusercontent.com/earthpulse/eotdl/refs/heads/hv_openeoexample/tutorials/notebooks/openeo/s1_weekly_statistics.json",
                temporal_extent=temporal_extent,
                ) #TODO reprojected needed as this is the geometry of geometry passed in filter spatial
        
        s2 = connection.datacube_from_process(
                "s2_weekly_statistics",
                namespace="https://raw.githubusercontent.com/earthpulse/eotdl/refs/heads/hv_openeoexample/tutorials/notebooks/openeo/s2_weekly_statistics.json",
                temporal_extent=temporal_extent,
                )
        
        #merge both cubes and filter across the feature collection
        merged = s2.merge_cubes(s1)
        result = merged.resample_spatial(resolution = 0.00009, projection='EPSG:4326', method='bilinear').filter_spatial(geometry)
        
        #dedicated job settings to save the individual features within a collection seperately
        job = result.create_job(
                out_format="NetCDF",
                sample_by_feature = True,
                feature_id_property=list(row['properties'][0].keys())[0], #add the ID marker to the output name
                filename_prefix = row["dataset"] + "_id"
        )

        return job

In [None]:
test_df = jobs_df.iloc[:2]
from openeo.extra.job_management import MultiBackendJobManager, CsvJobDatabase

# Authenticate and add the backend
connection = openeo.connect(url="openeo.dataspace.copernicus.eu").authenticate_oidc()

# initialize the job manager
manager = MultiBackendJobManager()
manager.add_backend("cdse", connection=connection, parallel_jobs=10)

job_tracker = 'jobs.csv'
job_db = CsvJobDatabase(path=job_tracker)
df = manager._normalize_df(test_df)
job_db.persist(df)

manager.run_jobs(start_job=start_job, job_db=job_db)

Authenticated using refresh token.


## 4. Train a model with EOTDL

We will train a simple random forest model on the features.


In [35]:
data = pd.read_csv('data/features.csv')

data

Unnamed: 0.1,Unnamed: 0,geometry,feature_index,B02_P10,B02_P25,B02_P50,B02_P75,B02_P90,B03_P10,B03_P25,...,VH_P25,VH_P50,VH_P75,VH_P90,VV_P10,VV_P25,VV_P50,VV_P75,VV_P90,job_id
0,0,"b'\x01\x03\x00\x00\x00\x01\x00\x00\x00""\x00\x0...",0,,,,,,,,...,0.005915,0.009845,0.013578,0.016449,0.039836,0.05057,0.072721,0.122933,0.147181,j-25051610175047429329ed6feb96dc20
1,1,"b""\x01\x03\x00\x00\x00\x01\x00\x00\x00:\x00\x0...",0,,,,,,,,...,0.003498,0.005155,0.007258,0.011056,0.021004,0.026354,0.035084,0.046927,0.063484,j-2505161018084db998fe92f976077609
2,2,b'\x01\x03\x00\x00\x00\x01\x00\x00\x00\x17\x00...,0,,,,,,,,...,0.002803,0.003923,0.005395,0.007169,0.020292,0.024544,0.031204,0.040079,0.053383,j-25051610182546a38540155e6dc862e0
3,3,"b""\x01\x03\x00\x00\x00\x01\x00\x00\x00\x1b\x00...",0,,,,,,,,...,0.006838,0.009548,0.014226,0.020363,0.031789,0.037089,0.049702,0.068509,0.106466,j-2505161018424cf0ae7815d18f278e7c
4,4,b'\x01\x03\x00\x00\x00\x01\x00\x00\x00#\x00\x0...,0,,,,,,,,...,0.003203,0.0045,0.006347,0.008537,0.020882,0.024984,0.031692,0.041187,0.052576,j-2505161018584abea5903b1fad46efcc
5,5,b'\x01\x03\x00\x00\x00\x01\x00\x00\x00\x11\x00...,0,,,,,,,,...,0.00595,0.008022,0.010358,0.014889,0.032923,0.039042,0.049017,0.06354,0.082097,j-250516101916430aa489f9f704370f29
6,6,b'\x01\x03\x00\x00\x00\x02\x00\x00\x00b\x00\x0...,0,,,,,,,,...,0.006577,0.009207,0.012007,0.015258,0.030508,0.039642,0.052435,0.069091,0.091799,j-2505161019344b0089b9eed895f0a803
7,7,b'\x01\x03\x00\x00\x00\x01\x00\x00\x00\x11\x00...,0,,,,,,,,...,0.00377,0.005692,0.007832,0.011728,0.027132,0.031695,0.038928,0.047977,0.057112,j-25051610195045fbbad3794af8313574
8,8,b'\x01\x03\x00\x00\x00\x03\x00\x00\x00\xe6\x00...,0,,,,,,,,...,0.007683,0.010565,0.0137,0.01697,0.030198,0.037728,0.048462,0.062068,0.075866,j-2505161020064465b5eed42de26d9a63
9,9,b'\x01\x03\x00\x00\x00\x01\x00\x00\x00\x0c\x00...,0,,,,,,,,...,0.00197,0.003242,0.005464,0.008266,0.018165,0.023109,0.029741,0.038527,0.046957,j-2505161020234fe888a2c9f1bb45b4e7


In [36]:
parcels = gpd.read_file('data/filtered_gdf.shp')

parcels



Unnamed: 0,taotlusaas,pollu_id,pindala_ha,taotletud_,taotletu_1,niitmise_t,niitmise_1,viimase_mu,taotletu_2,taotleja_n,taotleja_r,EC_trans_n,EC_hcat_n,EC_hcat_c,geometry
0,2021,21339359,0.34,"võilill, harilik",Põllukultuurid,Ei kuulu jälgimisele,,2021/05/20 20:08:34.000,Kliimat ja keskkonda säästvate põllumajandusta...,ERAISIK,,Dandelion common,dandelions,3301081400,"POLYGON ((24.11064 58.2612, 24.11064 58.26108,..."
1,2021,20435919,2.12,"võilill, harilik",Põllukultuurid,Niidetud,03.08.2021-09.08.2021,2021/05/14 14:14:50.000,Kliimat ja keskkonda säästvate põllumajandusta...,FIE,,Dandelion common,dandelions,3301081400,"POLYGON ((24.59144 59.20214, 24.59144 59.20215..."
2,2021,21666530,2.33,"võilill, harilik",Põllukultuurid,Niidetud,10.07.2021-11.07.2021,2021/05/22 20:46:35.000,Kliimat ja keskkonda säästvate põllumajandusta...,ERAISIK,,Dandelion common,dandelions,3301081400,"POLYGON ((24.73434 59.14479, 24.73418 59.14473..."
3,2021,21781505,0.17,"võilill, harilik",Põllukultuurid,Ei kuulu jälgimisele,,2021/05/26 15:43:20.000,Kliimat ja keskkonda säästvate põllumajandusta...,OSAÜHING VIIVEKA,10040905.0,Dandelion common,dandelions,3301081400,"POLYGON ((24.80384 58.36676, 24.80395 58.36691..."
4,2021,20435920,0.93,"võilill, harilik",Põllukultuurid,Ei kuulu jälgimisele,,2021/05/14 14:14:50.000,Kliimat ja keskkonda säästvate põllumajandusta...,FIE,,Dandelion common,dandelions,3301081400,"POLYGON ((24.60553 59.22487, 24.60602 59.22505..."
5,2021,20435918,0.46,"võilill, harilik",Põllukultuurid,Ei kuulu jälgimisele,,2021/05/14 14:14:50.000,Kliimat ja keskkonda säästvate põllumajandusta...,FIE,,Dandelion common,dandelions,3301081400,"POLYGON ((24.5973 59.20052, 24.59728 59.20044,..."
6,2021,21625865,2.89,"võilill, harilik",Põllukultuurid,Niidetud,30.07.2021-05.08.2021,2021/06/12 10:41:17.000,Keskkonnasõbraliku majandamise toetus;Kliimat ...,FIE,,Dandelion common,dandelions,3301081400,"POLYGON ((26.76172 57.97226, 26.76169 57.97218..."
7,2021,21339355,0.38,"võilill, harilik",Põllukultuurid,Ei kuulu jälgimisele,,2021/05/20 20:08:34.000,Kliimat ja keskkonda säästvate põllumajandusta...,ERAISIK,,Dandelion common,dandelions,3301081400,"POLYGON ((24.11338 58.261, 24.11331 58.26101, ..."
8,2021,21520681,2.92,"võilill, harilik",Põllukultuurid,Ei kuulu jälgimisele,,2021/05/21 14:27:56.000,Kliimat ja keskkonda säästvate põllumajandusta...,ERAISIK,,Dandelion common,dandelions,3301081400,"POLYGON ((27.1228 58.1266, 27.12282 58.12662, ..."
9,2021,20494621,0.47,"võilill, harilik",Põllukultuurid,Ei kuulu jälgimisele,,2021/05/16 09:36:00.000,Kliimat ja keskkonda säästvate põllumajandusta...,ERAISIK,,Dandelion common,dandelions,3301081400,"POLYGON ((26.73513 58.27443, 26.73589 58.27454..."


> How can I match the features to the parcels? The only common column is the geometry...

Assuming both dataframes have same order (which is not likely the case).

In [38]:
data['target'] = parcels.EC_hcat_n

data

Unnamed: 0.1,Unnamed: 0,geometry,feature_index,B02_P10,B02_P25,B02_P50,B02_P75,B02_P90,B03_P10,B03_P25,...,VH_P50,VH_P75,VH_P90,VV_P10,VV_P25,VV_P50,VV_P75,VV_P90,job_id,target
0,0,"b'\x01\x03\x00\x00\x00\x01\x00\x00\x00""\x00\x0...",0,,,,,,,,...,0.009845,0.013578,0.016449,0.039836,0.05057,0.072721,0.122933,0.147181,j-25051610175047429329ed6feb96dc20,dandelions
1,1,"b""\x01\x03\x00\x00\x00\x01\x00\x00\x00:\x00\x0...",0,,,,,,,,...,0.005155,0.007258,0.011056,0.021004,0.026354,0.035084,0.046927,0.063484,j-2505161018084db998fe92f976077609,dandelions
2,2,b'\x01\x03\x00\x00\x00\x01\x00\x00\x00\x17\x00...,0,,,,,,,,...,0.003923,0.005395,0.007169,0.020292,0.024544,0.031204,0.040079,0.053383,j-25051610182546a38540155e6dc862e0,dandelions
3,3,"b""\x01\x03\x00\x00\x00\x01\x00\x00\x00\x1b\x00...",0,,,,,,,,...,0.009548,0.014226,0.020363,0.031789,0.037089,0.049702,0.068509,0.106466,j-2505161018424cf0ae7815d18f278e7c,dandelions
4,4,b'\x01\x03\x00\x00\x00\x01\x00\x00\x00#\x00\x0...,0,,,,,,,,...,0.0045,0.006347,0.008537,0.020882,0.024984,0.031692,0.041187,0.052576,j-2505161018584abea5903b1fad46efcc,dandelions
5,5,b'\x01\x03\x00\x00\x00\x01\x00\x00\x00\x11\x00...,0,,,,,,,,...,0.008022,0.010358,0.014889,0.032923,0.039042,0.049017,0.06354,0.082097,j-250516101916430aa489f9f704370f29,dandelions
6,6,b'\x01\x03\x00\x00\x00\x02\x00\x00\x00b\x00\x0...,0,,,,,,,,...,0.009207,0.012007,0.015258,0.030508,0.039642,0.052435,0.069091,0.091799,j-2505161019344b0089b9eed895f0a803,dandelions
7,7,b'\x01\x03\x00\x00\x00\x01\x00\x00\x00\x11\x00...,0,,,,,,,,...,0.005692,0.007832,0.011728,0.027132,0.031695,0.038928,0.047977,0.057112,j-25051610195045fbbad3794af8313574,dandelions
8,8,b'\x01\x03\x00\x00\x00\x03\x00\x00\x00\xe6\x00...,0,,,,,,,,...,0.010565,0.0137,0.01697,0.030198,0.037728,0.048462,0.062068,0.075866,j-2505161020064465b5eed42de26d9a63,dandelions
9,9,b'\x01\x03\x00\x00\x00\x01\x00\x00\x00\x0c\x00...,0,,,,,,,,...,0.003242,0.005464,0.008266,0.018165,0.023109,0.029741,0.038527,0.046957,j-2505161020234fe888a2c9f1bb45b4e7,dandelions


In [39]:
from sklearn.model_selection import train_test_split

# drop columns with nans

data_clean = data.dropna(axis=1)

# drop unused columns

data_clean = data_clean.drop(columns=['Unnamed: 0', 'geometry', 'job_id', 'feature_index'])

# split train/test

X_train, X_test, y_train, y_test = train_test_split(data_clean.drop(columns=['target']), data_clean['target'], test_size=0.2, random_state=42)

In [42]:
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

model.score(X_test, y_test)


0.75

TODO:
- ingest model to EOTDL
- ingest feature recipe to EOTDL

## 5. Run inference with EOTDL

In [None]:
# sample = X_test.iloc[3]

# pred = model.predict(sample.values.reshape(1, -1))

# pred

Let's perform inference on some new parcels.

In [57]:
ix = np.random.randint(0, len(shapefiles))
country = shapefiles[ix]

country

'data/DK_2019_EC21.shp'

In [58]:
gdf = gpd.read_file(path)

gdf.head()

Unnamed: 0,taotlusaas,pollu_id,pindala_ha,taotletud_,taotletu_1,niitmise_t,niitmise_1,viimase_mu,taotletu_2,taotleja_n,taotleja_r,EC_trans_n,EC_hcat_n,EC_hcat_c,geometry
0,2021,19994165,0.25,Karjatamine väljaspool põllumaj. maad,Karjatamine väljaspool põllumaj. maad,,,2021/05/02 14:37:52.000,,FIE,,Rough grazings,pasture_meadow_grassland_grass,3302000000,"POLYGON ((26.50243 59.31839, 26.50244 59.31843..."
1,2021,19990783,1.7,rohttaimed,Püsirohumaa,Niidetud,28.06.2021-04.07.2021,2021/05/02 06:59:17.000,Kliimat ja keskkonda säästvate põllumajandusta...,ERAISIK,,grasses,pasture_meadow_grassland_grass,3302000000,"POLYGON ((24.54648 58.86884, 24.54674 58.86879..."
2,2021,19990784,0.49,rohttaimed,Püsirohumaa,Ei kuulu jälgimisele,,2021/05/02 06:59:17.000,Kliimat ja keskkonda säästvate põllumajandusta...,ERAISIK,,grasses,pasture_meadow_grassland_grass,3302000000,"POLYGON ((24.54597 58.86827, 24.54668 58.86816..."
3,2021,19996106,0.54,talinisu allakülvita,Põllukultuurid,,,2021/05/02 20:58:12.000,Kliimat ja keskkonda säästvate põllumajandusta...,ERAISIK,,Winter wheat,winter_common_soft_wheat,3301010101,"POLYGON ((27.42837 58.11975, 27.42839 58.11972..."
4,2021,19990620,2.48,"punane ristik (vähemalt 80% ristikut, kuni 20%...",Põllukultuurid,Niidetud,06.07.2021-11.07.2021,2021/07/05 07:26:35.000,Kliimat ja keskkonda säästvate põllumajandusta...,TAMSAMÄE OÜ,11350602.0,Red clover (at least 80% clover up to 20% gras...,clover,3301090303,"POLYGON ((26.66816 57.82049, 26.66815 57.8205,..."


In [59]:
gdf = gdf.sample(n=3)
gdf

Unnamed: 0,taotlusaas,pollu_id,pindala_ha,taotletud_,taotletu_1,niitmise_t,niitmise_1,viimase_mu,taotletu_2,taotleja_n,taotleja_r,EC_trans_n,EC_hcat_n,EC_hcat_c,geometry
162397,2021,22113589,21.37,talinisu allakülvita,Põllukultuurid,,,2021/06/14 15:52:11.000,Keskkonnasõbraliku majandamise toetus;Kliimat ...,OÜ KÕO AGRO,10070214,Winter wheat,winter_common_soft_wheat,3301010101,"POLYGON ((25.66627 58.63345, 25.66638 58.63344..."
42649,2021,20657638,10.17,rohttaimed,Püsirohumaa,Niidetud,16.08.2021-21.08.2021,2021/05/17 16:18:18.000,Kliimat ja keskkonda säästvate põllumajandusta...,AKTSIASELTS METSAKÜLA PIIM,10014380,grasses,pasture_meadow_grassland_grass,3302000000,"POLYGON ((24.37944 59.35179, 24.37945 59.35179..."
65776,2021,21102710,3.62,küüslauk,Põllukultuurid,,,2021/05/19 16:40:15.000,Kliimat ja keskkonda säästvate põllumajandusta...,OSAÜHING KASKEMA TALU,11017417,garlic,garlic,3301220200,"POLYGON ((24.36846 58.85519, 24.36852 58.85522..."


In [61]:
from eotdl.fe.openeo import point_extraction

# should be the same start_data and nb_monts; how can we save this  in the feature recipe?

point_extraction(gdf, start_date = "2024-01-01", nb_months = 2, job_tracker = 'jobs-inference.csv', parallel_jobs=10)

Authenticated using refresh token.


In [63]:
job = pd.read_csv("jobs-inference.csv")
job

Unnamed: 0,fid,geometry,crs,temporal_extent,id,backend_name,status,start_time,running_start_time,cpu,memory,duration
0,,"POLYGON ((25.6662734 58.63345021, 25.66637754 ...",EPSG:4326,"['2024-01-01', '2024-03-01']",j-2505161103024444b6e50166f8aeb88b,cdse,finished,2025-05-16T11:03:02Z,2025-05-16T11:04:52Z,229.88571932 cpu-seconds,1565364.35546875 mb-seconds,150 seconds
1,,"POLYGON ((24.37944058 59.35179394, 24.37944933...",EPSG:4326,"['2024-01-01', '2024-03-01']",j-250516110319411dbdd4ebb4189c99ec,cdse,finished,2025-05-16T11:03:19Z,2025-05-16T11:05:53Z,228.19191047700002 cpu-seconds,1323684.845703125 mb-seconds,193 seconds
2,,"POLYGON ((24.3684642 58.85519494, 24.36851719 ...",EPSG:4326,"['2024-01-01', '2024-03-01']",j-25051611033745fea98123d8907b6e1c,cdse,finished,2025-05-16T11:03:37Z,2025-05-16T11:05:53Z,169.140624351 cpu-seconds,1299268.8515625 mb-seconds,172 seconds


In [64]:
# Initialize an empty list to store all dataframes
all_data = []

# Loop through each job and read its parquet file
for idx, _job in job.iterrows():
    try:
        job_data = pd.read_parquet(f'job_{_job["id"]}/timeseries.parquet')
        # Add job_id as a column to identify the source
        job_data['job_id'] = _job["id"]
        all_data.append(job_data)
    except Exception as e:
        print(f"Error reading job {_job['id']}: {e}")

# Concatenate all dataframes into one
if all_data:
    data = pd.concat(all_data, ignore_index=True)
    print(f"Successfully merged {len(all_data)} time series datasets")
else:
    data = pd.DataFrame()
    print("No time series data was loaded")

data

Successfully merged 3 time series datasets


Unnamed: 0,geometry,feature_index,B02_P10,B02_P25,B02_P50,B02_P75,B02_P90,B03_P10,B03_P25,B03_P50,...,VH_P25,VH_P50,VH_P75,VH_P90,VV_P10,VV_P25,VV_P50,VV_P75,VV_P90,job_id
0,b'\x01\x03\x00\x00\x00\x01\x00\x00\x00p\x00\x0...,0,,,,,,,,,...,0.003686,0.005511,0.007969,0.014515,0.032455,0.040394,0.053392,0.071857,0.116476,j-2505161103024444b6e50166f8aeb88b
1,b'\x01\x03\x00\x00\x00\x16\x00\x00\x00<\x04\x0...,0,,,,,,,,,...,0.005453,0.00784,0.010888,0.014063,0.036952,0.046182,0.060499,0.07805,0.096651,j-250516110319411dbdd4ebb4189c99ec
2,"b'\x01\x03\x00\x00\x00\x01\x00\x00\x00,\x00\x0...",0,,,,,,,,,...,0.002787,0.004535,0.007065,0.013067,0.036071,0.045936,0.062672,0.084033,0.129215,j-25051611033745fea98123d8907b6e1c


In [66]:
# drop columns with nans (should be defined in the feature recipe, how?)

data_clean = data.dropna(axis=1)

# drop unused columns (should be defined in the feature recipe, how? maybe better to define which columns to keep)

data_clean = data_clean.drop(columns=['geometry', 'job_id', 'feature_index'])

data_clean

Unnamed: 0,VH_P10,VH_P25,VH_P50,VH_P75,VH_P90,VV_P10,VV_P25,VV_P50,VV_P75,VV_P90
0,0.002636,0.003686,0.005511,0.007969,0.014515,0.032455,0.040394,0.053392,0.071857,0.116476
1,0.003829,0.005453,0.00784,0.010888,0.014063,0.036952,0.046182,0.060499,0.07805,0.096651
2,0.001747,0.002787,0.004535,0.007065,0.013067,0.036071,0.045936,0.062672,0.084033,0.129215


In [67]:
preds = model.predict(data_clean.values)

preds




array(['spring_rapeseed_rape', 'spring_rapeseed_rape',
       'spring_rapeseed_rape'], dtype=object)

In [68]:
gdf.EC_hcat_n

162397          winter_common_soft_wheat
42649     pasture_meadow_grassland_grass
65776                             garlic
Name: EC_hcat_n, dtype: object

Of course model is not good, need to train with more parcels & classes. You can use this notebook to do so.

TODO:
- Stage model from EOTDL
- Stage feature recipe from EOTDL