In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import os

os.environ['EOTDL_API_URL'] = 'https://api.eotdl.com/'
# os.environ['EOTDL_API_URL'] = 'http://localhost:8000/'

In this use case we show how to perform feature engineering with openEO within EOTDL.

https://github.com/earthpulse/eotdl/issues/190


1. Stage the EuroCrops dataset with EOTDL.
2. Filter the EuroCrops dataset to create a subset of parcels, e.g., 8 crop classes, each with 1000 examples, for one country.
3. Run feature engineering with openEO, creating temporal metrics from S1 and S2 time series (temporally optimised for the crop classes of interest). Store the feature engineering process graph with the training datasets in EOTDL.
4. Use EOTDL functionality to train a model (for this, the features need to be retrieved). Store the model along with the openEO process graph in EOTDL.
5. Use the model to run inference (from within EOTDL?) on an openEO platform such as CDSE or openEO Platform. Make use of the feature engineering process graph stored along with the EOTDL model.

## 1. Stage EuroCrops from EOTDL

The dataset can be found at https://www.eotdl.com/datasets/EuroCrops/. It contains a single zip file, which in turn contains one zip file per country with the shapefiles (16 in total).

> Uncomment the following cells to stage the dataset.

In [4]:
# !eotdl datasets get EuroCrops -v 1 -f -a
# !unzip -o ~/.cache/eotdl/datasets/EuroCrops/EuroCrops.zip -d data/



In [5]:
# from glob import glob

# zips = glob('data/*.zip')

# zips

In [6]:
# # unzip shapefiles

# import zipfile

# for zip_file in zips:
# 	with zipfile.ZipFile(zip_file, 'r') as zip_ref:
# 		zip_ref.extractall('data/')


In [7]:
# cleanup

# !rm -rf data/*.zip

List of all the shapefiles in the dataset.

In [2]:
from glob import glob

shapefiles = glob('C:/Users/VROMPAYH/OneDrive - VITO/Desktop/openeo/data/**/*.shp', recursive=True)

shapefiles

['C:/Users/VROMPAYH/OneDrive - VITO/Desktop/openeo/data\\EuroCrops\\AT_2021_EC21.shp',
 'C:/Users/VROMPAYH/OneDrive - VITO/Desktop/openeo/data\\EuroCrops\\BE_VLG_2021_EC21.shp',
 'C:/Users/VROMPAYH/OneDrive - VITO/Desktop/openeo/data\\EuroCrops\\DE_LS_2021_EC21.shp',
 'C:/Users/VROMPAYH/OneDrive - VITO/Desktop/openeo/data\\EuroCrops\\DE_NRW_2021_EC21.shp',
 'C:/Users/VROMPAYH/OneDrive - VITO/Desktop/openeo/data\\EuroCrops\\DK_2019_EC21.shp',
 'C:/Users/VROMPAYH/OneDrive - VITO/Desktop/openeo/data\\EuroCrops\\EE_2021_EC21.shp',
 'C:/Users/VROMPAYH/OneDrive - VITO/Desktop/openeo/data\\EuroCrops\\LT_2021_EC.shp',
 'C:/Users/VROMPAYH/OneDrive - VITO/Desktop/openeo/data\\EuroCrops\\LV_2021_EC21.shp',
 'C:/Users/VROMPAYH/OneDrive - VITO/Desktop/openeo/data\\EuroCrops\\NL_2020_EC21.shp',
 'C:/Users/VROMPAYH/OneDrive - VITO/Desktop/openeo/data\\EuroCrops\\SI_2021_EC21.shp',
 'C:/Users/VROMPAYH/OneDrive - VITO/Desktop/openeo/data\\EuroCrops\\SK_2021_EC21.shp',
 'C:/Users/VROMPAYH/OneDrive - VIT

In [3]:
import geopandas as gpd

path = shapefiles[0]
gdf = gpd.read_file(path)
gdf.head()


Unnamed: 0,fid,FS_KENNUNG,SNAR_BEZEI,SL_FLAECHE,GEO_ID,INSPIRE_ID,GML_ID,GML_IDENTI,SNAR_CODE,GEO_PART_K,LOG_PKEY,GEOM_DATE_,FART_ID,GEO_TYPE,GML_LENGTH,EC_trans_n,EC_hcat_n,EC_hcat_c,geometry
0,1.0,91101535.0,MÄHWIESE/-WEIDE ZWEI NUTZUNGEN,1.420018,1769861.0,https://data.inspire.gv.at/0095/89e7d5e0-385f-...,AT.0095.89e7d5e0-385f-40b2-9e42-fb3bb29e52a9.e...,https://data.inspire.gv.at/0095/89e7d5e0-385f-...,716,14.0,2.0,2018/07/31 14:03:57.000,1696.0,POLYGON,4469.0,MOWING MEADOW / PASTURE (TWO USES),pasture_meadow_grassland_grass,3302000000,"POLYGON ((462338.342 506712.461, 462414.438 50..."
1,2.0,91101833.0,MÄHWIESE/-WEIDE DREI UND MEHR NUTZUNGEN,4.378824,102165978.0,https://data.inspire.gv.at/0095/89e7d5e0-385f-...,AT.0095.89e7d5e0-385f-40b2-9e42-fb3bb29e52a9.e...,https://data.inspire.gv.at/0095/89e7d5e0-385f-...,717,14.0,2.0,2019/05/05 19:59:57.000,1696.0,POLYGON,4439.0,MOWING MEADOW / PASTURE (THREE AND MORE USES),pasture_meadow_grassland_grass,3302000000,"POLYGON ((507455.76 508698.679, 507466.157 508..."
2,3.0,91101838.0,"WECHSELWIESE (EGART, ACKERWEIDE)",0.741652,1669265.0,https://data.inspire.gv.at/0095/89e7d5e0-385f-...,AT.0095.89e7d5e0-385f-40b2-9e42-fb3bb29e52a9.e...,https://data.inspire.gv.at/0095/89e7d5e0-385f-...,636,14.0,2.0,2018/07/31 13:58:35.000,1696.0,POLYGON,846.0,ALTERNATE MEADOW (EGART,pasture_meadow_grassland_grass,3302000000,"POLYGON ((507773.855 508548.978, 507861.236 50..."
3,4.0,91101843.0,KLEEGRAS,1.113695,103675744.0,https://data.inspire.gv.at/0095/89e7d5e0-385f-...,AT.0095.89e7d5e0-385f-40b2-9e42-fb3bb29e52a9.e...,https://data.inspire.gv.at/0095/89e7d5e0-385f-...,634,14.0,2.0,2020/10/14 11:25:53.000,1696.0,POLYGON,1656.0,CLOVER-GRASS,clover,3301090303,"POLYGON ((507687.863 508733.104, 507693.01 508..."
4,5.0,91101841.0,SPEISEKARTOFFELN / FELDGEMÜSE,0.128307,103244270.0,https://data.inspire.gv.at/0095/89e7d5e0-385f-...,AT.0095.89e7d5e0-385f-40b2-9e42-fb3bb29e52a9.e...,https://data.inspire.gv.at/0095/89e7d5e0-385f-...,526,14.0,2.0,2020/05/15 09:37:56.000,1696.0,POLYGON,448.0,EDIBLE POTATOES / FIELD VEGETABLES,potatoes,3301030000,"POLYGON ((507855.951 508860.244, 507864.729 50..."


In [5]:
# columns
gdf.columns

Index(['fid', 'FS_KENNUNG', 'SNAR_BEZEI', 'SL_FLAECHE', 'GEO_ID', 'INSPIRE_ID',
       'GML_ID', 'GML_IDENTI', 'SNAR_CODE', 'GEO_PART_K', 'LOG_PKEY',
       'GEOM_DATE_', 'FART_ID', 'GEO_TYPE', 'GML_LENGTH', 'EC_trans_n',
       'EC_hcat_n', 'EC_hcat_c', 'geometry'],
      dtype='object')

## 2. Filter the EuroCrops dataset

Filter the EuroCrops dataset to create a subset of parcels, e.g., 8 crop classes, each with 1000 examples, for one country.

In [7]:
# pick a random country (note: the cells below keep using `gdf`, loaded from shapefiles[0])

import numpy as np

ix = np.random.randint(0, len(shapefiles))
country = shapefiles[ix]

country

'C:/Users/VROMPAYH/OneDrive - VITO/Desktop/openeo/data\\EuroCrops\\HR\\HR_2020_EC21.shp'

In [8]:
crop_classes = gdf['EC_hcat_n'].unique()

crop_classes

array(['pasture_meadow_grassland_grass', 'clover', 'potatoes',
       'tree_wood_forest', 'winter_triticale', 'green_silo_maize',
       'not_known_and_other', 'other_arable_land_crops', 'summer_barley',
       'winter_rye', 'fallow_land_not_crop', 'winter_barley',
       'summer_oats', 'winter_meslin', 'fresh_vegetables',
       'winter_common_soft_wheat', 'millet_sorghum',
       'winter_rapeseed_rape', 'shrubberries_shrubs',
       'grain_maize_corn_popcorn', 'peas', 'winter_durum_hard_wheat',
       'soy_soybeans', 'sugar_beet', 'pumpkin_squash_gourd',
       'vineyards_wine_vine_rebland_grapes', 'apples', 'beans',
       'orchards_fruits', 'sunflower', 'winter_spelt', 'alfalfa_lucerne',
       'oilseed_crops', 'summer_durum_hard_wheat', 'pears', 'apricots',
       'winter_poppy', 'mustard', 'summer_meslin',
       'spring_common_soft_wheat', 'summer_rye', 'vetches',
       'poaceae_grasses', 'summer_poppy', 'cucumber_pickle',
       'flax_linseed', 'hemp_cannabis', 'phacelia', 'sw

In [9]:
# number of samples per class

num_samples_per_class = {class_: len(gdf[gdf['EC_hcat_n'] == class_]) for class_ in crop_classes}

num_samples_per_class = dict(sorted(num_samples_per_class.items(), key=lambda x: x[1], reverse=True))

num_samples_per_class

{'pasture_meadow_grassland_grass': 1317640,
 'tree_wood_forest': 162470,
 'vineyards_wine_vine_rebland_grapes': 161483,
 'fallow_land_not_crop': 147379,
 'grain_maize_corn_popcorn': 125321,
 'winter_common_soft_wheat': 111399,
 'not_known_and_other': 73231,
 'clover': 61228,
 'green_silo_maize': 54482,
 'winter_barley': 50158,
 'soy_soybeans': 36925,
 'winter_triticale': 34760,
 'pumpkin_squash_gourd': 23941,
 'winter_rye': 23874,
 'potatoes': 23323,
 'summer_barley': 22436,
 'summer_oats': 21614,
 'sunflower': 12785,
 'other_arable_land_crops': 12646,
 'sugar_beet': 12579,
 'winter_rapeseed_rape': 12084,
 'alfalfa_lucerne': 11394,
 'winter_spelt': 11242,
 'fresh_vegetables': 10013,
 'millet_sorghum': 7790,
 'peas': 6160,
 'apples': 5253,
 'winter_durum_hard_wheat': 4600,
 'beans': 3922,
 'spring_common_soft_wheat': 3348,
 'greenhouse_foil_film': 3176,
 'summer_durum_hard_wheat': 2362,
 'vetches': 2079,
 'nuts': 2026,
 'buckwheat': 1876,
 'orchards_fruits': 1833,
 'summer_meslin': 1803

In [11]:
# import matplotlib.pyplot as plt

# plt.figure(figsize=(5, 25))
# plt.barh(list(num_samples_per_class.keys()), list(num_samples_per_class.values()))
# plt.tight_layout()
# plt.show()

In [10]:
# filter a fixed number of examples per class

# Each job runs separately, so we need to limit the number of classes and samples per class
# samples = 1000
# num_classes = 8

samples = 100
num_classes = 10

# keep classes with at least `samples` examples
classes = [class_ for class_, count in num_samples_per_class.items() if count >= samples]

# pick `num_classes` classes at random
classes = np.random.choice(classes, num_classes, replace=False)

classes


array(['winter_rye', 'caraway', 'winter_meslin',
       'vineyards_wine_vine_rebland_grapes', 'poaceae_grasses',
       'sweet_chestnuts', 'winter_emmer', 'nurseries_nursery',
       'hemp_cannabis', 'summer_barley'], dtype='<U47')

In [11]:
filtered_gdf = gdf[gdf['EC_hcat_n'].isin(classes)]

filtered_gdf = filtered_gdf.groupby('EC_hcat_n').sample(n=samples, random_state=42)

filtered_gdf.head()

Unnamed: 0,fid,FS_KENNUNG,SNAR_BEZEI,SL_FLAECHE,GEO_ID,INSPIRE_ID,GML_ID,GML_IDENTI,SNAR_CODE,GEO_PART_K,LOG_PKEY,GEOM_DATE_,FART_ID,GEO_TYPE,GML_LENGTH,EC_trans_n,EC_hcat_n,EC_hcat_c,geometry
413866,413867.0,91488395.0,WINTERKÜMMEL,1.60351,104433312.0,https://data.inspire.gv.at/0095/89e7d5e0-385f-...,AT.0095.89e7d5e0-385f-40b2-9e42-fb3bb29e52a9.e...,https://data.inspire.gv.at/0095/89e7d5e0-385f-...,536,14.0,2.0,2021/05/11 08:07:24.000,1696.0,POLYGON,416.0,WINTER CARAWAY,caraway,3301061211,"POLYGON ((464634.63 482248.36, 464630.139 4823..."
1986825,1986826.0,92243310.0,WINTERKÜMMEL,0.351564,104410218.0,https://data.inspire.gv.at/0095/89e7d5e0-385f-...,AT.0095.89e7d5e0-385f-40b2-9e42-fb3bb29e52a9.e...,https://data.inspire.gv.at/0095/89e7d5e0-385f-...,536,14.0,2.0,2021/05/07 11:26:49.000,1696.0,POLYGON,1092.0,WINTER CARAWAY,caraway,3301061211,"POLYGON ((431692.496 473860.954, 431694.285 47..."
2267711,2267712.0,92004596.0,WINTERKÜMMEL,5.74716,103227531.0,https://data.inspire.gv.at/0095/89e7d5e0-385f-...,AT.0095.89e7d5e0-385f-40b2-9e42-fb3bb29e52a9.e...,https://data.inspire.gv.at/0095/89e7d5e0-385f-...,536,14.0,2.0,2020/05/14 10:32:32.000,1696.0,POLYGON,1836.0,WINTER CARAWAY,caraway,3301061211,"POLYGON ((427105.523 462977.917, 427143.281 46..."
1988678,1988679.0,92210422.0,WINTERKÜMMEL,1.7014,101467861.0,https://data.inspire.gv.at/0095/89e7d5e0-385f-...,AT.0095.89e7d5e0-385f-40b2-9e42-fb3bb29e52a9.e...,https://data.inspire.gv.at/0095/89e7d5e0-385f-...,536,10.0,2.0,2019/02/02 18:35:30.000,1696.0,POLYGON,990.0,WINTER CARAWAY,caraway,3301061211,"POLYGON ((389345.402 476938.001, 389342.779 47..."
2135057,2135058.0,92270299.0,WINTERKÜMMEL,2.322768,5063992.0,https://data.inspire.gv.at/0095/89e7d5e0-385f-...,AT.0095.89e7d5e0-385f-40b2-9e42-fb3bb29e52a9.e...,https://data.inspire.gv.at/0095/89e7d5e0-385f-...,536,14.0,2.0,2018/07/31 17:52:30.000,1696.0,POLYGON,888.0,WINTER CARAWAY,caraway,3301061211,"POLYGON ((477086.947 474743.395, 477062.969 47..."
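
The counting and balanced sampling steps above can also be written with `value_counts` and `groupby(...).sample`. The following is a minimal sketch on toy data, not the notebook's actual parcels; the `toy` frame and its `parcel` column are made up for illustration, and only the `EC_hcat_n` column name comes from the dataset.

```python
import pandas as pd

# Toy stand-in for the parcel table; only the class column matters here.
toy = pd.DataFrame({
    "EC_hcat_n": ["clover"] * 5 + ["potatoes"] * 3 + ["peas"] * 1,
    "parcel": range(9),
})

samples = 2       # parcels to keep per class
num_classes = 2   # number of classes to keep

# value_counts() returns the per-class counts sorted in descending order.
counts = toy["EC_hcat_n"].value_counts()

# Keep only classes with enough parcels, then draw a reproducible random subset.
eligible = counts[counts >= samples].index.to_series()
chosen = eligible.sample(n=num_classes, random_state=42).tolist()

# Draw the same number of parcels from every chosen class.
balanced = (
    toy[toy["EC_hcat_n"].isin(chosen)]
    .groupby("EC_hcat_n")
    .sample(n=samples, random_state=42)
)

print(balanced["EC_hcat_n"].value_counts().to_dict())
```

Passing `random_state` to both sampling calls makes the class selection reproducible as well, which `np.random.choice` without a seed is not.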


To extract the S1/S2 data for the polygons efficiently, we group multiple geometries together. This lets us run multiple extractions in a single openEO job.

In [29]:
#!pip install s2sphere
import s2sphere
from typing import List

def split_s2sphere(
    gdf: gpd.GeoDataFrame, max_points=500, start_level=8
) -> List[gpd.GeoDataFrame]:
    """
    EXPERIMENTAL
    Split a GeoDataFrame into multiple groups based on the S2geometry cell ID of each geometry.

    S2geometry is a library that provides a way to index and query spatial data. This function splits
    the GeoDataFrame into groups by the S2 cell ID of each geometry's centroid.

    If a cell contains more points than max_points, it will be recursively split into
    smaller cells until each cell contains at most max_points points.

    More information on S2geometry can be found at https://s2geometry.io/
    An overview of the S2 cell hierarchy can be found at https://s2geometry.io/resources/s2cell_statistics.html

    :param gdf: GeoDataFrame containing points to split
    :param max_points: Maximum number of points per group
    :param start_level: Starting S2 cell level
    :return: List of GeoDataFrames containing the split groups
    """

    if "geometry" not in gdf.columns:
        raise ValueError("The GeoDataFrame must contain a 'geometry' column.")

    if gdf.crs is None:
        raise ValueError("The GeoDataFrame must contain a CRS")

    # Store the original CRS of the GeoDataFrame and reproject to EPSG:3857
    original_crs = gdf.crs
    gdf = gdf.to_crs(epsg=3857)

    # Add a centroid column to the GeoDataFrame (computed in EPSG:3857)
    gdf["centroid"] = gdf.geometry.centroid

    # Reproject the GeoDataFrame to its original CRS
    gdf = gdf.to_crs(original_crs)

    # Set the GeoDataFrame's geometry to the centroid column and reproject to EPSG:4326
    gdf = gdf.set_geometry("centroid")
    gdf = gdf.to_crs(epsg=4326)

    # Create a dictionary to store points by their S2 cell ID
    cell_dict = {}

    # Iterate over each point in the GeoDataFrame
    for _, row in gdf.iterrows():
        # Get the S2 cell ID for the point at a given level
        cell_id = _get_s2cell_id(row.centroid, start_level)

        if cell_id not in cell_dict:
            cell_dict[cell_id] = []

        cell_dict[cell_id].append(row)

    result_groups = []

    # Function to recursively split cells if they contain more points than max_points
    def _split_s2cell(cell_id, points, current_level=start_level):
        if len(points) <= max_points:
            if len(points) > 0:
                points = gpd.GeoDataFrame(
                    points, crs=original_crs, geometry="geometry"
                ).drop(columns=["centroid"])
                points["s2sphere_cell_id"] = cell_id
                points["s2sphere_cell_level"] = current_level
                result_groups.append(gpd.GeoDataFrame(points))
        else:
            children = s2sphere.CellId(cell_id).children()
            child_cells = {child.id(): [] for child in children}

            for point in points:
                child_cell_id = _get_s2cell_id(point.centroid, current_level + 1)
                child_cells[child_cell_id].append(point)

            for child_cell_id, child_points in child_cells.items():
                _split_s2cell(child_cell_id, child_points, current_level + 1)

    # Split cells that contain more points than max_points
    for cell_id, points in cell_dict.items():
        _split_s2cell(cell_id, points)

    return result_groups


def _get_s2cell_id(point, level):
    lat, lon = point.y, point.x
    cell_id = s2sphere.CellId.from_lat_lng(
        s2sphere.LatLng.from_degrees(lat, lon)
    ).parent(level)
    return cell_id.id()

split_gdf = split_s2sphere(filtered_gdf, max_points=500)

print(f"Combined: {len(filtered_gdf)} into {len(split_gdf)} jobs")


Combined: 1000 into 59 jobs


Each openEO job will now process multiple neighbouring polygons.

In [28]:
split_gdf[0].head()

Unnamed: 0,fid,FS_KENNUNG,SNAR_BEZEI,SL_FLAECHE,GEO_ID,INSPIRE_ID,GML_ID,GML_IDENTI,SNAR_CODE,GEO_PART_K,...,GEOM_DATE_,FART_ID,GEO_TYPE,GML_LENGTH,EC_trans_n,EC_hcat_n,EC_hcat_c,geometry,s2sphere_cell_id,s2sphere_cell_level
413866,413867.0,91488395.0,WINTERKÜMMEL,1.60351,104433312.0,https://data.inspire.gv.at/0095/89e7d5e0-385f-...,AT.0095.89e7d5e0-385f-40b2-9e42-fb3bb29e52a9.e...,https://data.inspire.gv.at/0095/89e7d5e0-385f-...,536,14.0,...,2021/05/11 08:07:24.000,1696.0,POLYGON,416.0,WINTER CARAWAY,caraway,3301061211,"POLYGON ((464634.63 482248.361, 464630.139 482...",5148617128689008640,8
2090623,2090624.0,94788240.0,WINTERKÜMMEL,0.633456,102058666.0,https://data.inspire.gv.at/0095/89e7d5e0-385f-...,AT.0095.89e7d5e0-385f-40b2-9e42-fb3bb29e52a9.e...,https://data.inspire.gv.at/0095/89e7d5e0-385f-...,536,14.0,...,2019/04/25 09:43:46.000,1696.0,POLYGON,482.0,WINTER CARAWAY,caraway,3301061211,"POLYGON ((462888.948 480074.27, 462943.689 480...",5148617128689008640,8
2092688,2092689.0,94790917.0,WINTERKÜMMEL,1.580566,104119759.0,https://data.inspire.gv.at/0095/89e7d5e0-385f-...,AT.0095.89e7d5e0-385f-40b2-9e42-fb3bb29e52a9.e...,https://data.inspire.gv.at/0095/89e7d5e0-385f-...,536,14.0,...,2021/04/07 15:21:39.000,1696.0,POLYGON,1023.0,WINTER CARAWAY,caraway,3301061211,"POLYGON ((464625.148 479362.26, 464619.38 4793...",5148617128689008640,8
1533372,1533373.0,91431719.0,WINTERKÜMMEL,4.447597,103997550.0,https://data.inspire.gv.at/0095/89e7d5e0-385f-...,AT.0095.89e7d5e0-385f-40b2-9e42-fb3bb29e52a9.e...,https://data.inspire.gv.at/0095/89e7d5e0-385f-...,536,14.0,...,2021/03/18 15:14:58.000,1696.0,POLYGON,1161.0,WINTER CARAWAY,caraway,3301061211,"POLYGON ((470375.9 485246.698, 470413.503 4852...",5148617128689008640,8
1129154,1129155.0,91516094.0,WINTERKÜMMEL,0.079359,104238644.0,https://data.inspire.gv.at/0095/89e7d5e0-385f-...,AT.0095.89e7d5e0-385f-40b2-9e42-fb3bb29e52a9.e...,https://data.inspire.gv.at/0095/89e7d5e0-385f-...,536,14.0,...,2021/04/21 08:44:12.000,1696.0,POLYGON,415.0,WINTER CARAWAY,caraway,3301061211,"POLYGON ((455993.348 488968.862, 455981.374 48...",5148617128689008640,8
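
To make the recursive splitting idea easier to follow, here is a simplified sketch that uses a plain quadtree on a unit square instead of S2 cells. It is an analogue for illustration only, not the S2 algorithm: the `split_points` helper, the toy centroids and the `max_level` guard are all assumptions introduced here.

```python
from typing import Dict, List, Tuple

Point = Tuple[float, float]  # (x, y)
Box = Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax)


def split_points(points: List[Point], box: Box, max_points: int,
                 level: int = 0, max_level: int = 20) -> List[Dict]:
    """Recursively quarter `box` until each cell holds at most `max_points`."""
    if not points:
        return []
    if len(points) <= max_points or level >= max_level:
        return [{"level": level, "box": box, "points": points}]

    xmin, ymin, xmax, ymax = box
    xmid, ymid = (xmin + xmax) / 2, (ymin + ymax) / 2
    # Assign every point to exactly one of the four child cells.
    children: Dict[Box, List[Point]] = {
        (xmin, ymin, xmid, ymid): [],
        (xmid, ymin, xmax, ymid): [],
        (xmin, ymid, xmid, ymax): [],
        (xmid, ymid, xmax, ymax): [],
    }
    for x, y in points:
        cx = (xmin, xmid) if x < xmid else (xmid, xmax)
        cy = (ymin, ymid) if y < ymid else (ymid, ymax)
        children[(cx[0], cy[0], cx[1], cy[1])].append((x, y))

    groups: List[Dict] = []
    for child_box, child_points in children.items():
        groups.extend(split_points(child_points, child_box, max_points,
                                   level + 1, max_level))
    return groups


# Toy centroids: a dense cluster in one corner and a few scattered points.
centroids = [(0.1 + 0.01 * i, 0.1 + 0.01 * j) for i in range(5) for j in range(5)]
centroids += [(0.9, 0.9), (0.6, 0.2), (0.2, 0.8)]

groups = split_points(centroids, box=(0.0, 0.0, 1.0, 1.0), max_points=10)
print([(g["level"], len(g["points"])) for g in groups])
```

Dense areas end up in deeper, smaller cells while sparse areas stay in large ones, which is why the number of parcels per job varies. The `max_level` guard stops the recursion when many points coincide.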


Now we turn this list of dataframes into an openEO-compatible job dataframe.

In [34]:
#!pip install geojson
import pandas as pd
from typing import List
import geopandas as gpd
from geojson import Feature, FeatureCollection

def chunks_to_featurecollections(
    split_jobs: List[gpd.GeoDataFrame],
    id_field: str = "s2sphere_cell_id",
    level_field: str = "s2sphere_cell_level"
) -> pd.DataFrame:
    """
    Convert a list of GeoDataFrame chunks into a DataFrame of job-ready rows.
    Each row has:
      - the chunk's S2 cell ID (the level is currently not stored)
      - feature_count
      - all original feature attributes (as a list of dicts)
      - a geojson.FeatureCollection of the chunk
    """
    records = []
    for job in split_jobs:
        # grab s2 metadata
        cell_id    = job[id_field].iloc[0]
        feat_count = len(job)

        # build properties list (all columns except geometry)
        props = job.drop(columns=["geometry"]).to_dict(orient="records")

        # build FeatureCollection
        features = [
            Feature(geometry=geom.__geo_interface__, properties=prop)
            for geom, prop in zip(job.geometry, props)
        ]
        fc = FeatureCollection(features)

        records.append({
            id_field:   cell_id,
            "feature_count": feat_count,
            "properties":    props,
            "feature_collection": fc
        })

    df = pd.DataFrame(records)
    # If you really want a GeoDataFrame, you can union each chunk's geometries:
    # df["geometry"] = [g.unary_union for g in split_jobs]
    # return gpd.GeoDataFrame(df, geometry="geometry", crs=split_jobs[0].crs)
    return df

jobs_df = chunks_to_featurecollections(split_gdf)
jobs_df


Unnamed: 0,s2sphere_cell_id,feature_count,properties,feature_collection
0,5148617128689008640,39,"[{'fid': 413867.0, 'FS_KENNUNG': 91488395.0, '...","{'type': 'FeatureCollection', 'features': [{'t..."
1,5148757866177363968,15,"[{'fid': 1986826.0, 'FS_KENNUNG': 92243310.0, ...","{'type': 'FeatureCollection', 'features': [{'t..."
2,5148793050549452800,23,"[{'fid': 1988679.0, 'FS_KENNUNG': 92210422.0, ...","{'type': 'FeatureCollection', 'features': [{'t..."
3,5148652313061097472,60,"[{'fid': 2135058.0, 'FS_KENNUNG': 92270299.0, ...","{'type': 'FeatureCollection', 'features': [{'t..."
4,5146893094456655872,39,"[{'fid': 41572.0, 'FS_KENNUNG': 91538119.0, 'S...","{'type': 'FeatureCollection', 'features': [{'t..."
5,5146752356968300544,11,"[{'fid': 377774.0, 'FS_KENNUNG': 91137143.0, '...","{'type': 'FeatureCollection', 'features': [{'t..."
6,5148828234921541632,16,"[{'fid': 2474415.0, 'FS_KENNUNG': 92702378.0, ...","{'type': 'FeatureCollection', 'features': [{'t..."
7,5149250447386607616,3,"[{'fid': 2248025.0, 'FS_KENNUNG': 92234033.0, ...","{'type': 'FeatureCollection', 'features': [{'t..."
8,5148722681805275136,19,"[{'fid': 1282300.0, 'FS_KENNUNG': 92146295.0, ...","{'type': 'FeatureCollection', 'features': [{'t..."
9,5148511575572742144,10,"[{'fid': 267874.0, 'FS_KENNUNG': 91924666.0, '...","{'type': 'FeatureCollection', 'features': [{'t..."


## 4. Train a model with EOTDL

We will train a simple random forest model on the features.


In [35]:
data = pd.read_csv('data/features.csv')

data

Unnamed: 0.1,Unnamed: 0,geometry,feature_index,B02_P10,B02_P25,B02_P50,B02_P75,B02_P90,B03_P10,B03_P25,...,VH_P25,VH_P50,VH_P75,VH_P90,VV_P10,VV_P25,VV_P50,VV_P75,VV_P90,job_id
0,0,"b'\x01\x03\x00\x00\x00\x01\x00\x00\x00""\x00\x0...",0,,,,,,,,...,0.005915,0.009845,0.013578,0.016449,0.039836,0.05057,0.072721,0.122933,0.147181,j-25051610175047429329ed6feb96dc20
1,1,"b""\x01\x03\x00\x00\x00\x01\x00\x00\x00:\x00\x0...",0,,,,,,,,...,0.003498,0.005155,0.007258,0.011056,0.021004,0.026354,0.035084,0.046927,0.063484,j-2505161018084db998fe92f976077609
2,2,b'\x01\x03\x00\x00\x00\x01\x00\x00\x00\x17\x00...,0,,,,,,,,...,0.002803,0.003923,0.005395,0.007169,0.020292,0.024544,0.031204,0.040079,0.053383,j-25051610182546a38540155e6dc862e0
3,3,"b""\x01\x03\x00\x00\x00\x01\x00\x00\x00\x1b\x00...",0,,,,,,,,...,0.006838,0.009548,0.014226,0.020363,0.031789,0.037089,0.049702,0.068509,0.106466,j-2505161018424cf0ae7815d18f278e7c
4,4,b'\x01\x03\x00\x00\x00\x01\x00\x00\x00#\x00\x0...,0,,,,,,,,...,0.003203,0.0045,0.006347,0.008537,0.020882,0.024984,0.031692,0.041187,0.052576,j-2505161018584abea5903b1fad46efcc
5,5,b'\x01\x03\x00\x00\x00\x01\x00\x00\x00\x11\x00...,0,,,,,,,,...,0.00595,0.008022,0.010358,0.014889,0.032923,0.039042,0.049017,0.06354,0.082097,j-250516101916430aa489f9f704370f29
6,6,b'\x01\x03\x00\x00\x00\x02\x00\x00\x00b\x00\x0...,0,,,,,,,,...,0.006577,0.009207,0.012007,0.015258,0.030508,0.039642,0.052435,0.069091,0.091799,j-2505161019344b0089b9eed895f0a803
7,7,b'\x01\x03\x00\x00\x00\x01\x00\x00\x00\x11\x00...,0,,,,,,,,...,0.00377,0.005692,0.007832,0.011728,0.027132,0.031695,0.038928,0.047977,0.057112,j-25051610195045fbbad3794af8313574
8,8,b'\x01\x03\x00\x00\x00\x03\x00\x00\x00\xe6\x00...,0,,,,,,,,...,0.007683,0.010565,0.0137,0.01697,0.030198,0.037728,0.048462,0.062068,0.075866,j-2505161020064465b5eed42de26d9a63
9,9,b'\x01\x03\x00\x00\x00\x01\x00\x00\x00\x0c\x00...,0,,,,,,,,...,0.00197,0.003242,0.005464,0.008266,0.018165,0.023109,0.029741,0.038527,0.046957,j-2505161020234fe888a2c9f1bb45b4e7


In [36]:
parcels = gpd.read_file('data/filtered_gdf.shp')

parcels



Unnamed: 0,taotlusaas,pollu_id,pindala_ha,taotletud_,taotletu_1,niitmise_t,niitmise_1,viimase_mu,taotletu_2,taotleja_n,taotleja_r,EC_trans_n,EC_hcat_n,EC_hcat_c,geometry
0,2021,21339359,0.34,"võilill, harilik",Põllukultuurid,Ei kuulu jälgimisele,,2021/05/20 20:08:34.000,Kliimat ja keskkonda säästvate põllumajandusta...,ERAISIK,,Dandelion common,dandelions,3301081400,"POLYGON ((24.11064 58.2612, 24.11064 58.26108,..."
1,2021,20435919,2.12,"võilill, harilik",Põllukultuurid,Niidetud,03.08.2021-09.08.2021,2021/05/14 14:14:50.000,Kliimat ja keskkonda säästvate põllumajandusta...,FIE,,Dandelion common,dandelions,3301081400,"POLYGON ((24.59144 59.20214, 24.59144 59.20215..."
2,2021,21666530,2.33,"võilill, harilik",Põllukultuurid,Niidetud,10.07.2021-11.07.2021,2021/05/22 20:46:35.000,Kliimat ja keskkonda säästvate põllumajandusta...,ERAISIK,,Dandelion common,dandelions,3301081400,"POLYGON ((24.73434 59.14479, 24.73418 59.14473..."
3,2021,21781505,0.17,"võilill, harilik",Põllukultuurid,Ei kuulu jälgimisele,,2021/05/26 15:43:20.000,Kliimat ja keskkonda säästvate põllumajandusta...,OSAÜHING VIIVEKA,10040905.0,Dandelion common,dandelions,3301081400,"POLYGON ((24.80384 58.36676, 24.80395 58.36691..."
4,2021,20435920,0.93,"võilill, harilik",Põllukultuurid,Ei kuulu jälgimisele,,2021/05/14 14:14:50.000,Kliimat ja keskkonda säästvate põllumajandusta...,FIE,,Dandelion common,dandelions,3301081400,"POLYGON ((24.60553 59.22487, 24.60602 59.22505..."
5,2021,20435918,0.46,"võilill, harilik",Põllukultuurid,Ei kuulu jälgimisele,,2021/05/14 14:14:50.000,Kliimat ja keskkonda säästvate põllumajandusta...,FIE,,Dandelion common,dandelions,3301081400,"POLYGON ((24.5973 59.20052, 24.59728 59.20044,..."
6,2021,21625865,2.89,"võilill, harilik",Põllukultuurid,Niidetud,30.07.2021-05.08.2021,2021/06/12 10:41:17.000,Keskkonnasõbraliku majandamise toetus;Kliimat ...,FIE,,Dandelion common,dandelions,3301081400,"POLYGON ((26.76172 57.97226, 26.76169 57.97218..."
7,2021,21339355,0.38,"võilill, harilik",Põllukultuurid,Ei kuulu jälgimisele,,2021/05/20 20:08:34.000,Kliimat ja keskkonda säästvate põllumajandusta...,ERAISIK,,Dandelion common,dandelions,3301081400,"POLYGON ((24.11338 58.261, 24.11331 58.26101, ..."
8,2021,21520681,2.92,"võilill, harilik",Põllukultuurid,Ei kuulu jälgimisele,,2021/05/21 14:27:56.000,Kliimat ja keskkonda säästvate põllumajandusta...,ERAISIK,,Dandelion common,dandelions,3301081400,"POLYGON ((27.1228 58.1266, 27.12282 58.12662, ..."
9,2021,20494621,0.47,"võilill, harilik",Põllukultuurid,Ei kuulu jälgimisele,,2021/05/16 09:36:00.000,Kliimat ja keskkonda säästvate põllumajandusta...,ERAISIK,,Dandelion common,dandelions,3301081400,"POLYGON ((26.73513 58.27443, 26.73589 58.27454..."


> How can I match the features to the parcels? The only common column is the geometry...

For now we assume both dataframes are in the same order (which is unlikely to be the case).
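
One way to avoid depending on row order is to carry a stable identifier through the extraction and merge on it. The sketch below uses toy data and assumes a hypothetical `parcel_id` column in both tables; `features.csv` as shown above does not contain such a column, so this only works if one is added when the jobs are created.

```python
import pandas as pd

# Toy stand-ins; `parcel_id` is a hypothetical key, not a column of features.csv.
parcels_toy = pd.DataFrame({
    "parcel_id": [101, 102, 103],
    "EC_hcat_n": ["clover", "potatoes", "peas"],
})
# Features arrive in a different order, as they might after parallel jobs.
features_toy = pd.DataFrame({
    "parcel_id": [103, 101, 102],
    "VV_P50": [0.031, 0.072, 0.035],
})

# validate="one_to_one" raises if a key is duplicated on either side;
# how="inner" keeps only parcels present in both tables.
labelled = features_toy.merge(
    parcels_toy[["parcel_id", "EC_hcat_n"]],
    on="parcel_id",
    how="inner",
    validate="one_to_one",
).rename(columns={"EC_hcat_n": "target"})

# Positional assignment, by contrast, pairs rows purely by index.
positional = features_toy.assign(target=parcels_toy["EC_hcat_n"])

print(labelled)
print(positional)
```

In this toy case the positional version assigns the wrong label to every row. Matching on the geometry bytes is possible in principle, but it is fragile if coordinates are reprojected or rounded between the two tables.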

In [38]:
data['target'] = parcels.EC_hcat_n

data

Unnamed: 0.1,Unnamed: 0,geometry,feature_index,B02_P10,B02_P25,B02_P50,B02_P75,B02_P90,B03_P10,B03_P25,...,VH_P50,VH_P75,VH_P90,VV_P10,VV_P25,VV_P50,VV_P75,VV_P90,job_id,target
0,0,"b'\x01\x03\x00\x00\x00\x01\x00\x00\x00""\x00\x0...",0,,,,,,,,...,0.009845,0.013578,0.016449,0.039836,0.05057,0.072721,0.122933,0.147181,j-25051610175047429329ed6feb96dc20,dandelions
1,1,"b""\x01\x03\x00\x00\x00\x01\x00\x00\x00:\x00\x0...",0,,,,,,,,...,0.005155,0.007258,0.011056,0.021004,0.026354,0.035084,0.046927,0.063484,j-2505161018084db998fe92f976077609,dandelions
2,2,b'\x01\x03\x00\x00\x00\x01\x00\x00\x00\x17\x00...,0,,,,,,,,...,0.003923,0.005395,0.007169,0.020292,0.024544,0.031204,0.040079,0.053383,j-25051610182546a38540155e6dc862e0,dandelions
3,3,"b""\x01\x03\x00\x00\x00\x01\x00\x00\x00\x1b\x00...",0,,,,,,,,...,0.009548,0.014226,0.020363,0.031789,0.037089,0.049702,0.068509,0.106466,j-2505161018424cf0ae7815d18f278e7c,dandelions
4,4,b'\x01\x03\x00\x00\x00\x01\x00\x00\x00#\x00\x0...,0,,,,,,,,...,0.0045,0.006347,0.008537,0.020882,0.024984,0.031692,0.041187,0.052576,j-2505161018584abea5903b1fad46efcc,dandelions
5,5,b'\x01\x03\x00\x00\x00\x01\x00\x00\x00\x11\x00...,0,,,,,,,,...,0.008022,0.010358,0.014889,0.032923,0.039042,0.049017,0.06354,0.082097,j-250516101916430aa489f9f704370f29,dandelions
6,6,b'\x01\x03\x00\x00\x00\x02\x00\x00\x00b\x00\x0...,0,,,,,,,,...,0.009207,0.012007,0.015258,0.030508,0.039642,0.052435,0.069091,0.091799,j-2505161019344b0089b9eed895f0a803,dandelions
7,7,b'\x01\x03\x00\x00\x00\x01\x00\x00\x00\x11\x00...,0,,,,,,,,...,0.005692,0.007832,0.011728,0.027132,0.031695,0.038928,0.047977,0.057112,j-25051610195045fbbad3794af8313574,dandelions
8,8,b'\x01\x03\x00\x00\x00\x03\x00\x00\x00\xe6\x00...,0,,,,,,,,...,0.010565,0.0137,0.01697,0.030198,0.037728,0.048462,0.062068,0.075866,j-2505161020064465b5eed42de26d9a63,dandelions
9,9,b'\x01\x03\x00\x00\x00\x01\x00\x00\x00\x0c\x00...,0,,,,,,,,...,0.003242,0.005464,0.008266,0.018165,0.023109,0.029741,0.038527,0.046957,j-2505161020234fe888a2c9f1bb45b4e7,dandelions


In [39]:
from sklearn.model_selection import train_test_split

# drop columns with nans

data_clean = data.dropna(axis=1)

# drop unused columns

data_clean = data_clean.drop(columns=['Unnamed: 0', 'geometry', 'job_id', 'feature_index'])

# split train/test

X_train, X_test, y_train, y_test = train_test_split(data_clean.drop(columns=['target']), data_clean['target'], test_size=0.2, random_state=42)

In [42]:
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

model.score(X_test, y_test)


0.75

TODO:
- ingest model to EOTDL
- ingest feature recipe to EOTDL

## 5. Run inference with EOTDL

In [None]:
# sample = X_test.iloc[3]

# pred = model.predict(sample.values.reshape(1, -1))

# pred

Let's perform inference on some new parcels.

In [57]:
ix = np.random.randint(0, len(shapefiles))
country = shapefiles[ix]

country

'data/DK_2019_EC21.shp'

In [58]:
# note: this reads `path` (set earlier), not the `country` picked above
gdf = gpd.read_file(path)

gdf.head()

Unnamed: 0,taotlusaas,pollu_id,pindala_ha,taotletud_,taotletu_1,niitmise_t,niitmise_1,viimase_mu,taotletu_2,taotleja_n,taotleja_r,EC_trans_n,EC_hcat_n,EC_hcat_c,geometry
0,2021,19994165,0.25,Karjatamine väljaspool põllumaj. maad,Karjatamine väljaspool põllumaj. maad,,,2021/05/02 14:37:52.000,,FIE,,Rough grazings,pasture_meadow_grassland_grass,3302000000,"POLYGON ((26.50243 59.31839, 26.50244 59.31843..."
1,2021,19990783,1.7,rohttaimed,Püsirohumaa,Niidetud,28.06.2021-04.07.2021,2021/05/02 06:59:17.000,Kliimat ja keskkonda säästvate põllumajandusta...,ERAISIK,,grasses,pasture_meadow_grassland_grass,3302000000,"POLYGON ((24.54648 58.86884, 24.54674 58.86879..."
2,2021,19990784,0.49,rohttaimed,Püsirohumaa,Ei kuulu jälgimisele,,2021/05/02 06:59:17.000,Kliimat ja keskkonda säästvate põllumajandusta...,ERAISIK,,grasses,pasture_meadow_grassland_grass,3302000000,"POLYGON ((24.54597 58.86827, 24.54668 58.86816..."
3,2021,19996106,0.54,talinisu allakülvita,Põllukultuurid,,,2021/05/02 20:58:12.000,Kliimat ja keskkonda säästvate põllumajandusta...,ERAISIK,,Winter wheat,winter_common_soft_wheat,3301010101,"POLYGON ((27.42837 58.11975, 27.42839 58.11972..."
4,2021,19990620,2.48,"punane ristik (vähemalt 80% ristikut, kuni 20%...",Põllukultuurid,Niidetud,06.07.2021-11.07.2021,2021/07/05 07:26:35.000,Kliimat ja keskkonda säästvate põllumajandusta...,TAMSAMÄE OÜ,11350602.0,Red clover (at least 80% clover up to 20% gras...,clover,3301090303,"POLYGON ((26.66816 57.82049, 26.66815 57.8205,..."


In [59]:
gdf = gdf.sample(n=3)
gdf

Unnamed: 0,taotlusaas,pollu_id,pindala_ha,taotletud_,taotletu_1,niitmise_t,niitmise_1,viimase_mu,taotletu_2,taotleja_n,taotleja_r,EC_trans_n,EC_hcat_n,EC_hcat_c,geometry
162397,2021,22113589,21.37,talinisu allakülvita,Põllukultuurid,,,2021/06/14 15:52:11.000,Keskkonnasõbraliku majandamise toetus;Kliimat ...,OÜ KÕO AGRO,10070214,Winter wheat,winter_common_soft_wheat,3301010101,"POLYGON ((25.66627 58.63345, 25.66638 58.63344..."
42649,2021,20657638,10.17,rohttaimed,Püsirohumaa,Niidetud,16.08.2021-21.08.2021,2021/05/17 16:18:18.000,Kliimat ja keskkonda säästvate põllumajandusta...,AKTSIASELTS METSAKÜLA PIIM,10014380,grasses,pasture_meadow_grassland_grass,3302000000,"POLYGON ((24.37944 59.35179, 24.37945 59.35179..."
65776,2021,21102710,3.62,küüslauk,Põllukultuurid,,,2021/05/19 16:40:15.000,Kliimat ja keskkonda säästvate põllumajandusta...,OSAÜHING KASKEMA TALU,11017417,garlic,garlic,3301220200,"POLYGON ((24.36846 58.85519, 24.36852 58.85522..."


In [61]:
from eotdl.fe.openeo import point_extraction

# should use the same start_date and nb_months as in training; how can we save this in the feature recipe?

point_extraction(gdf, start_date = "2024-01-01", nb_months = 2, job_tracker = 'jobs-inference.csv', parallel_jobs=10)

Authenticated using refresh token.


In [63]:
job = pd.read_csv("jobs-inference.csv")
job

Unnamed: 0,fid,geometry,crs,temporal_extent,id,backend_name,status,start_time,running_start_time,cpu,memory,duration
0,,"POLYGON ((25.6662734 58.63345021, 25.66637754 ...",EPSG:4326,"['2024-01-01', '2024-03-01']",j-2505161103024444b6e50166f8aeb88b,cdse,finished,2025-05-16T11:03:02Z,2025-05-16T11:04:52Z,229.88571932 cpu-seconds,1565364.35546875 mb-seconds,150 seconds
1,,"POLYGON ((24.37944058 59.35179394, 24.37944933...",EPSG:4326,"['2024-01-01', '2024-03-01']",j-250516110319411dbdd4ebb4189c99ec,cdse,finished,2025-05-16T11:03:19Z,2025-05-16T11:05:53Z,228.19191047700002 cpu-seconds,1323684.845703125 mb-seconds,193 seconds
2,,"POLYGON ((24.3684642 58.85519494, 24.36851719 ...",EPSG:4326,"['2024-01-01', '2024-03-01']",j-25051611033745fea98123d8907b6e1c,cdse,finished,2025-05-16T11:03:37Z,2025-05-16T11:05:53Z,169.140624351 cpu-seconds,1299268.8515625 mb-seconds,172 seconds
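
The `temporal_extent` values in the job tracker above are consistent with the end date being `start_date` plus `nb_months` calendar months. The sketch below reproduces that arithmetic with the standard library. The `temporal_extent` helper is a name introduced here, and how `point_extraction` actually derives the extent is an assumption.

```python
from datetime import date


def temporal_extent(start_date: str, nb_months: int) -> list:
    """Return [start, end] where end is `nb_months` calendar months after start."""
    start = date.fromisoformat(start_date)
    # Count months from year 0 so the year rolls over correctly.
    total = start.year * 12 + (start.month - 1) + nb_months
    year, month = divmod(total, 12)
    # Assumes the start day exists in the target month (true for day 1).
    end = start.replace(year=year, month=month + 1)
    return [start.isoformat(), end.isoformat()]


print(temporal_extent("2024-01-01", 2))
print(temporal_extent("2023-11-01", 3))
```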


In [64]:
# Initialize an empty list to store all dataframes
all_data = []

# Loop through each job and read its parquet file
for idx, _job in job.iterrows():
    try:
        job_data = pd.read_parquet(f'job_{_job["id"]}/timeseries.parquet')
        # Add job_id as a column to identify the source
        job_data['job_id'] = _job["id"]
        all_data.append(job_data)
    except Exception as e:
        print(f"Error reading job {_job['id']}: {e}")

# Concatenate all dataframes into one
if all_data:
    data = pd.concat(all_data, ignore_index=True)
    print(f"Successfully merged {len(all_data)} time series datasets")
else:
    data = pd.DataFrame()
    print("No time series data was loaded")

data

Successfully merged 3 time series datasets


Unnamed: 0,geometry,feature_index,B02_P10,B02_P25,B02_P50,B02_P75,B02_P90,B03_P10,B03_P25,B03_P50,...,VH_P25,VH_P50,VH_P75,VH_P90,VV_P10,VV_P25,VV_P50,VV_P75,VV_P90,job_id
0,b'\x01\x03\x00\x00\x00\x01\x00\x00\x00p\x00\x0...,0,,,,,,,,,...,0.003686,0.005511,0.007969,0.014515,0.032455,0.040394,0.053392,0.071857,0.116476,j-2505161103024444b6e50166f8aeb88b
1,b'\x01\x03\x00\x00\x00\x16\x00\x00\x00<\x04\x0...,0,,,,,,,,,...,0.005453,0.00784,0.010888,0.014063,0.036952,0.046182,0.060499,0.07805,0.096651,j-250516110319411dbdd4ebb4189c99ec
2,"b'\x01\x03\x00\x00\x00\x01\x00\x00\x00,\x00\x0...",0,,,,,,,,,...,0.002787,0.004535,0.007065,0.013067,0.036071,0.045936,0.062672,0.084033,0.129215,j-25051611033745fea98123d8907b6e1c


In [66]:
# drop columns with nans (should be defined in the feature recipe, how?)

data_clean = data.dropna(axis=1)

# drop unused columns (should be defined in the feature recipe, how? maybe better to define which columns to keep)

data_clean = data_clean.drop(columns=['geometry', 'job_id', 'feature_index'])

data_clean

Unnamed: 0,VH_P10,VH_P25,VH_P50,VH_P75,VH_P90,VV_P10,VV_P25,VV_P50,VV_P75,VV_P90
0,0.002636,0.003686,0.005511,0.007969,0.014515,0.032455,0.040394,0.053392,0.071857,0.116476
1,0.003829,0.005453,0.00784,0.010888,0.014063,0.036952,0.046182,0.060499,0.07805,0.096651
2,0.001747,0.002787,0.004535,0.007065,0.013067,0.036071,0.045936,0.062672,0.084033,0.129215
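
The questions in the comments above (which columns to keep, and which extraction parameters were used) could be answered by storing a small recipe next to the model. The following is one possible convention, sketched on toy data; the recipe layout is hypothetical and not an EOTDL format, and the parameter values are only illustrative.

```python
import json
import pandas as pd

# Toy training features: the optical column is entirely NaN, as in the table above.
train = pd.DataFrame({
    "B02_P10": [float("nan"), float("nan")],
    "VH_P50": [0.0098, 0.0052],
    "VV_P50": [0.0727, 0.0351],
    "job_id": ["j-a", "j-b"],
})

# Decide the feature columns once, at training time.
non_features = ["geometry", "job_id", "feature_index"]
feature_columns = [
    c for c in train.dropna(axis=1).columns if c not in non_features
]

# Hypothetical recipe layout; EOTDL may define its own format.
recipe_json = json.dumps({
    "start_date": "2024-01-01",
    "nb_months": 2,
    "feature_columns": feature_columns,
})

# At inference time, load the recipe and select exactly those columns, in order.
recipe = json.loads(recipe_json)
new_data = pd.DataFrame({
    "VV_P50": [0.0534],
    "B02_P10": [0.021],   # present here, but unused by the trained model
    "VH_P50": [0.0055],
    "job_id": ["j-c"],
})
X_new = new_data[recipe["feature_columns"]]  # KeyError if a feature is missing
print(list(X_new.columns))
```

Selecting by an explicit column list avoids a subtle problem with calling `dropna(axis=1)` again at inference time: the set of columns containing NaNs may differ between the training and inference data, so the model could receive different features than it was trained on.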


In [67]:
preds = model.predict(data_clean.values)

preds




array(['spring_rapeseed_rape', 'spring_rapeseed_rape',
       'spring_rapeseed_rape'], dtype=object)

In [68]:
gdf.EC_hcat_n

162397          winter_common_soft_wheat
42649     pasture_meadow_grassland_grass
65776                             garlic
Name: EC_hcat_n, dtype: object
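
The comparison between the predictions and the true labels can be made explicit. This small sketch copies the values from the two outputs above into plain lists; the names `predicted` and `truth` are introduced here.

```python
# Values copied from the two outputs above.
predicted = ["spring_rapeseed_rape", "spring_rapeseed_rape", "spring_rapeseed_rape"]
truth = ["winter_common_soft_wheat", "pasture_meadow_grassland_grass", "garlic"]

# Count exact matches between predicted and true class names.
correct = sum(p == t for p, t in zip(predicted, truth))
accuracy = correct / len(truth)
print(f"{correct}/{len(truth)} correct, accuracy = {accuracy:.2f}")
```

A classifier can only predict classes it saw during training (listed in `model.classes_`), so parcels of unseen classes are always misclassified.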

Of course, the model is not good: it needs to be trained with more parcels and classes. You can use this notebook to do so.

TODO:
- Stage model from EOTDL
- Stage feature recipe from EOTDL