https://github.com/earthpulse/eotdl/issues/190


1. Find and explore the EuroCropsDataset and stage it in the EOTDL workspace.
2. Filter the EuroCropsDataset using EOTDL functionality to create a subset of parcels,
   e.g., 8 crop classes, each with 1000 examples, for one country.
3. Run feature engineering with openEO, creating temporal metrics from S1 and S2 time series (temporally optimised for the crop classes of interest). Store the feature engineering process graph with the training datasets in EOTDL.
4. Use EOTDL functionality to train a model (for this, the features need to be retrieved). Store the model along with the openEO process graph in EOTDL.
5. Use the model to run inference (from within EOTDL?) on an openEO backend such as CDSE or openEO Platform. Make use of the feature engineering process graph stored along with the EOTDL model.

## 1. Ingest EuroCrops to EOTDL

The Q0 version of the dataset is already available at https://www.eotdl.com/datasets/EuroCrops/. It contains a single zip file, which in turn contains one zip file of shapefiles per country (16 in total).

In [4]:
# !eotdl datasets get EuroCrops -v 1
# !unzip -o ~/.cache/eotdl/datasets/EuroCrops/v1/EuroCrops.zip -d data/

In [5]:
# from glob import glob

# zips = glob('data/*.zip')

# zips

In [6]:
# # unzip shapefiles

# import zipfile

# for zip_file in zips:
# 	with zipfile.ZipFile(zip_file, 'r') as zip_ref:
# 		zip_ref.extractall('data/')


In [7]:
# cleanup

# !rm -rf data/*.zip

In [8]:
from glob import glob

shapefiles = glob('data/**/*.shp', recursive=True)

shapefiles

['data/DE_NRW_2021_EC21.shp',
 'data/EE_2021_EC21.shp',
 'data/LV_2021_EC21.shp',
 'data/SK_2021_EC21.shp',
 'data/NL_2020_EC21.shp',
 'data/BE_VLG_2021_EC21.shp',
 'data/DK_2019_EC21.shp',
 'data/SI_2021_EC21.shp',
 'data/LT_2021_EC.shp',
 'data/AT_2021_EC21.shp',
 'data/DE_LS_2021_EC21.shp',
 'data/RO/RO_ny_EC21.shp',
 'data/SE/SE_2021_EC21.shp',
 'data/FR/FR_2018_EC21.shp',
 'data/HR/HR_2020_EC21.shp',
 'data/NA/ES_NA_2020_EC21.shp']

In [10]:
import geopandas as gpd

path = shapefiles[0]

gdf = gpd.read_file(path)

gdf.head()


Unnamed: 0,ID,INSPIRE_ID,FLIK,AREA_HA,CODE,CODE_TXT,USE_CODE,USE_TXT,D_PG,CROPDIV,EFA,ELER,WJ,DAT_BEARB,EC_trans_n,EC_hcat_n,EC_hcat_c,geometry
0,4598773,https://geodaten.nrw.de/id/inspire-lu-ts/exist...,DENWLI0544130746,1.5204,311,Winterraps,OE,Ölsaaten,N,N,N,N,2021,2021-03-12,Winter rape,winter_rapeseed_rape,3301060401,"POLYGON ((428647.74 5711831.893, 428651.689 57..."
1,4598772,https://geodaten.nrw.de/id/inspire-lu-ts/exist...,DENWLI0544130596,2.2812,131,Wintergerste,GT,Getreide,N,N,N,N,2021,2021-03-12,Winter barley,winter_barley,3301010401,"POLYGON ((427717.449 5710011.129, 427709.347 5..."
2,4598771,https://geodaten.nrw.de/id/inspire-lu-ts/exist...,DENWLI0544130402,0.8311,115,Winterweichweizen,GT,Getreide,N,N,N,N,2021,2021-03-12,Winter soft wheat,winter_common_soft_wheat,3301010101,"POLYGON ((427337.557 5710068.068, 427332.544 5..."
3,5447571,https://geodaten.nrw.de/id/inspire-lu-ts/exist...,DENWLI0548091835,4.7241,459,Grünland (Dauergrünland),GL,Dauergrünland,Y,N,N,Y,2021,2021-09-24,Grassland (permanent grassland),pasture_meadow_grassland_grass,3302000000,"POLYGON ((376283.353 5665431.25, 376308.653 56..."
4,5447586,https://geodaten.nrw.de/id/inspire-lu-ts/exist...,DENWLI0548091988,6.1005,459,Grünland (Dauergrünland),GL,Dauergrünland,Y,N,N,Y,2021,2021-09-24,Grassland (permanent grassland),pasture_meadow_grassland_grass,3302000000,"POLYGON ((376495.069 5665848.269, 376496.653 5..."


In [12]:
# columns
gdf.columns

Index(['ID', 'INSPIRE_ID', 'FLIK', 'AREA_HA', 'CODE', 'CODE_TXT', 'USE_CODE',
       'USE_TXT', 'D_PG', 'CROPDIV', 'EFA', 'ELER', 'WJ', 'DAT_BEARB',
       'EC_trans_n', 'EC_hcat_n', 'EC_hcat_c', 'geometry'],
      dtype='object')

TODO: create and ingest Q1/Q2.

## 2. Filter EuroCropsDataset

Filter the EuroCropsDataset using EOTDL functionality to create a subset of parcels, e.g., 8 crop classes, each with 1000 examples, for one country.

Filter from Q0.

In [29]:
# random country

import numpy as np

ix = np.random.randint(0, len(shapefiles))
country = shapefiles[ix]

country

'data/HR/HR_2020_EC21.shp'

In [34]:
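# NOTE: this loads `path` (shapefiles[0], DE_NRW), not the randomly selected `country`,
# so the outputs below are for DE_NRW. Use gpd.read_file(country) to load the sampled country.
gdf = gpd.read_file(path)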

In [None]:
crop_classes = gdf['EC_hcat_n'].unique()

crop_classes

In [39]:
# number of samples per class

num_samples_per_class = {class_: len(gdf[gdf['EC_hcat_n'] == class_]) for class_ in crop_classes}

num_samples_per_class = dict(sorted(num_samples_per_class.items(), key=lambda x: x[1], reverse=True))

num_samples_per_class

{'pasture_meadow_grassland_grass': 314671,
 'green_silo_maize': 75863,
 'winter_common_soft_wheat': 64827,
 'winter_barley': 43119,
 'not_known_and_other': 32067,
 'flowers_ornamental_plants': 31132,
 'grain_maize_corn_popcorn': 25188,
 'winter_triticale': 18743,
 'winter_rye': 13833,
 'sugar_beet': 13284,
 'winter_rapeseed_rape': 11240,
 'unmaintained': 11060,
 'potatoes': 10534,
 'orchards_fruits': 9213,
 'fallow_land_not_crop': 7047,
 'clover': 5434,
 'other_arable_land_crops': 4741,
 'summer_oats': 3931,
 'summer_barley': 3647,
 'beans': 3550,
 'peas': 2172,
 'asparagus': 1851,
 'alfalfa_lucerne': 1830,
 'winter_spelt': 1746,
 'nurseries_nursery': 1685,
 'strawberries': 1669,
 'tree_wood_forest': 1619,
 'legumes_dried_pulses_protein_crops': 1573,
 'spring_common_soft_wheat': 1316,
 'winter_durum_hard_wheat': 1164,
 'fresh_vegetables': 1009,
 'carrots_daucus': 973,
 'arable_land_seed_seedlings': 918,
 'alliums': 882,
 'berries_berry_species': 783,
 'miscanthus_silvergrass': 758,
 'r

In [46]:
# import matplotlib.pyplot as plt

# plt.figure(figsize=(5, 25))
# plt.barh(list(num_samples_per_class.keys()), list(num_samples_per_class.values()))
# plt.tight_layout()
# plt.show()

In [51]:
# filter 1000 examples per class

samples = 1000
num_classes = 8

# keep classes with at least 1000 samples
classes = [class_ for class_, count in num_samples_per_class.items() if count >= samples]

# random 8 classes
classes = np.random.choice(classes, num_classes, replace=False)

classes


array(['winter_triticale', 'winter_rapeseed_rape', 'strawberries',
       'green_silo_maize', 'winter_common_soft_wheat',
       'grain_maize_corn_popcorn', 'flowers_ornamental_plants',
       'legumes_dried_pulses_protein_crops'], dtype='<U34')

In [52]:
filtered_gdf = gdf[gdf['EC_hcat_n'].isin(classes)]

filtered_gdf = filtered_gdf.groupby('EC_hcat_n').sample(n=samples, random_state=42)

filtered_gdf.head()

Unnamed: 0,ID,INSPIRE_ID,FLIK,AREA_HA,CODE,CODE_TXT,USE_CODE,USE_TXT,D_PG,CROPDIV,EFA,ELER,WJ,DAT_BEARB,EC_trans_n,EC_hcat_n,EC_hcat_c,geometry
94898,4633224,https://geodaten.nrw.de/id/inspire-lu-ts/exist...,DENWLI0541170967,0.1515,574,Blühstreifen (nur AUM),SL,Stilllegung,N,Y,N,Y,2021,2021-03-19,Flower strips (only AUM),flowers_ornamental_plants,3301080000,"POLYGON ((474949.457 5741132.298, 475047.27 57..."
653406,5448987,https://geodaten.nrw.de/id/inspire-lu-ts/exist...,DENWLI0538201932,0.1968,575,Blühfläche (nur AUM),SL,Stilllegung,N,Y,Y,Y,2021,2021-09-24,Flowering area (only AUM),flowers_ornamental_plants,3301080000,"POLYGON ((500364.956 5773221.209, 500365.782 5..."
254522,4926625,https://geodaten.nrw.de/id/inspire-lu-ts/exist...,DENWLI0552050700,0.2338,574,Blühstreifen (nur AUM),SL,Stilllegung,N,Y,N,Y,2021,2021-04-28,Flower strips (only AUM),flowers_ornamental_plants,3301080000,"POLYGON ((331347.924 5619497.387, 331400.804 5..."
75332,5119898,https://geodaten.nrw.de/id/inspire-lu-ts/exist...,DENWLI0553053426,0.3979,574,Blühstreifen (nur AUM),SL,Stilllegung,N,Y,N,Y,2021,2021-05-10,Flower strips (only AUM),flowers_ornamental_plants,3301080000,"POLYGON ((332401.151 5616932.313, 332401.348 5..."
194946,4844845,https://geodaten.nrw.de/id/inspire-lu-ts/exist...,DENWLI0538174279,0.2443,575,Blühfläche (nur AUM),SL,Stilllegung,N,Y,N,Y,2021,2021-04-20,Flowering area (only AUM),flowers_ornamental_plants,3301080000,"POLYGON ((472403.55 5778300.636, 472405.889 57..."


In [55]:
assert len(filtered_gdf) == num_classes * samples

# save to disk
filtered_gdf.to_file('data/filtered_gdf.shp')


TODO: Use STAC/GeoDB to filter the dataset. This will return a STAC catalog with the filtered items, that can be staged with EOTDL.

Note: GeoDB only stores the STAC metadata. This filtering needs the actual data (crop type), which is not part of the STAC metadata. Hence, the filtering cannot be done directly with either GeoDB or the STAC metadata (even locally).


## 3. Feature Engineering with openEO

Run feature engineering with openEO, creating temporal metrics from S1 and S2 time series (temporally optimised for the crop classes of interest). Store the feature engineering process graph with the training datasets in EOTDL.

In [2]:
import geopandas as gpd

gdf = gpd.read_file('data/filtered_gdf.shp')

gdf.shape

(8000, 18)

## Transform GeoDataFrame for MultiBackendJobManager

The `process_geodataframe` function (imported from `dataframe_utils` below) processes an input GeoDataFrame and prepares it for use with openEO's **MultiBackendJobManager**. The job manager launches and tracks multiple openEO jobs simultaneously, which is essential for large-scale data extractions.

### Note

For this simple example, the geometries are not grouped into feature collections; that utility is only illustrated in the more advanced example. As a result, a single openEO job is launched for each polygon, which makes the extraction workflow more time- and cost-intensive.
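To illustrate what grouping could look like, here is a minimal sketch that batches GeoJSON geometries into FeatureCollections, so that one job could cover many polygons. The helper name `batch_into_feature_collections`, the `fid` property and the batch size are assumptions made for illustration; this is not the utility from the advanced example.

```python
def batch_into_feature_collections(geometries, batch_size):
    """Group GeoJSON geometry dicts into FeatureCollections of at most batch_size features."""
    collections = []
    for start in range(0, len(geometries), batch_size):
        features = [
            # "fid" is a hypothetical running index used to map results back to parcels
            {"type": "Feature", "properties": {"fid": start + i}, "geometry": geom}
            for i, geom in enumerate(geometries[start:start + batch_size])
        ]
        collections.append({"type": "FeatureCollection", "features": features})
    return collections


# Toy input: five point geometries, batched two at a time
points = [{"type": "Point", "coordinates": [float(i), 50.0]} for i in range(5)]
batches = batch_into_feature_collections(points, batch_size=2)
print([len(fc["features"]) for fc in batches])
```

With the 8000 parcels used here and a batch size of 100, this kind of grouping would reduce the number of jobs from 8000 to 80.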


### Parameters

#### Temporal Parameters:
- **Start Date:** Start of the temporal extent (e.g., `"2020-01-01"`).  
- **Number of Months:** Duration of the temporal extent in months.
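
The implementation of `process_geodataframe` is not shown in this notebook. As a rough sketch of how the two temporal parameters could be turned into a temporal extent, assuming the end date is simply the start date shifted by whole calendar months (the helper name `temporal_extent` is hypothetical):

```python
from datetime import date


def temporal_extent(start_date: str, nb_months: int) -> list[str]:
    """Return [start, end] as ISO dates, with end = start shifted by nb_months months."""
    start = date.fromisoformat(start_date)
    # Count months linearly so that divmod handles the year rollover
    total_months = start.year * 12 + (start.month - 1) + nb_months
    year, month_index = divmod(total_months, 12)
    # Assumes the start day also exists in the target month (always true for day 1)
    end = start.replace(year=year, month=month_index + 1)
    return [start.isoformat(), end.isoformat()]


print(temporal_extent("2020-01-01", 3))
```

For the constants used in this notebook, this gives `2020-01-01` to `2020-04-01`, which agrees with the `temporal_extent` column of the job table.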

In [3]:
from dataframe_utils import *

# Constants
start_date = "2020-01-01"
nb_months = 3

job_df = process_geodataframe(gdf, start_date, nb_months)

job_df

Unnamed: 0,fid,geometry,crs,temporal_extent
0,,"POLYGON ((8.63655 51.82046, 8.63797 51.82068, ...",EPSG:4326,"[2020-01-01, 2020-04-01]"
1,,"POLYGON ((9.00533 52.10954, 9.00534 52.10951, ...",EPSG:4326,"[2020-01-01, 2020-04-01]"
2,,"POLYGON ((6.6116 50.7028, 6.61234 50.70289, 6....",EPSG:4326,"[2020-01-01, 2020-04-01]"
3,,"POLYGON ((6.62766 50.68006, 6.62766 50.68006, ...",EPSG:4326,"[2020-01-01, 2020-04-01]"
4,,"POLYGON ((8.59662 52.15451, 8.59666 52.1545, 8...",EPSG:4326,"[2020-01-01, 2020-04-01]"
...,...,...,...,...
7995,,"POLYGON ((7.01404 52.02925, 7.01405 52.02929, ...",EPSG:4326,"[2020-01-01, 2020-04-01]"
7996,,"POLYGON ((6.74059 50.61102, 6.74074 50.61105, ...",EPSG:4326,"[2020-01-01, 2020-04-01]"
7997,,"POLYGON ((7.89622 51.98207, 7.89623 51.98211, ...",EPSG:4326,"[2020-01-01, 2020-04-01]"
7998,,"POLYGON ((7.90675 51.7403, 7.90674 51.74055, 7...",EPSG:4326,"[2020-01-01, 2020-04-01]"


## Start Job with Standardized UDPs and Feature Collection Filtering

The `start_job` function initializes an openEO batch job using standardized **User-Defined Processes (UDPs)** for Sentinel-1 and Sentinel-2 data processing. It applies a spatial aggregation to obtain a time series per polygon.

### Key Features

1. **Use of Standardized UDPs**  
   - **S1 Weekly Statistics:** Computes weekly statistics from Sentinel-1 data.  
   - **S2 Weekly Statistics:** Computes weekly statistics from Sentinel-2 data.  
   - UDPs are defined in external JSON files.

2. **Spatial aggregation across polygons**  
   - An average is calculated for each individual polygon.

3. **Cube Merging**  
   - Merges Sentinel-1 and Sentinel-2 datacubes for combined analysis.

4. **Job Configuration**  
   - Outputs results in **parquet** format with filenames derived

In [4]:
import openeo
import pandas as pd
from s3proxy_utils import upload_geoparquet_file

def start_job(row: pd.Series, connection: openeo.Connection, **kwargs) -> openeo.BatchJob:
    """Create one openEO batch job for a single row (polygon) of the job table."""
    temporal_extent = row["temporal_extent"]

    # polygon over which the merged cube is spatially aggregated
    geometry = row["geometry"]

    # run the S1 and S2 UDPs
    s1 = connection.datacube_from_process(
        "s1_weekly_statistics",
        namespace="https://raw.githubusercontent.com/earthpulse/eotdl/refs/heads/hv_openeoexample/tutorials/notebooks/openeo/s1_weekly_statistics.json",
        temporal_extent=temporal_extent,
    )

    s2 = connection.datacube_from_process(
        "s2_weekly_statistics",
        namespace="https://raw.githubusercontent.com/earthpulse/eotdl/refs/heads/hv_openeoexample/tutorials/notebooks/openeo/s2_weekly_statistics.json",
        temporal_extent=temporal_extent,
    )

    # merge both cubes and compute the mean per polygon
    merged = s2.merge_cubes(s1)
    result = merged.aggregate_spatial(geometries=geometry, reducer="mean")

    # create a batch job that writes the result as Parquet
    job = result.create_job(
        out_format="parquet",
    )

    return job

### Submit Extraction Jobs

Using the openEO backend, we authenticate and submit the jobs to process the EO data.
Each job extracts Sentinel-1 and Sentinel-2 training features.

In [23]:
import openeo
from openeo.extra.job_management import MultiBackendJobManager, ParquetJobDatabase

# Authenticate and add the backend

job_tracker = 'jobs.parquet'

# initialize the job manager
manager = MultiBackendJobManager()
connection = openeo.connect(url="openeo.dataspace.copernicus.eu").authenticate_oidc()
manager.add_backend("cdse", connection=connection, parallel_jobs=2)

# job_db = CsvJobDatabase(path=job_tracker)
job_db = ParquetJobDatabase(path=job_tracker)
if not job_db.exists():
    df = manager._normalize_df(job_df)
    job_db.persist(df)

manager.run_jobs(start_job=start_job, job_db=job_tracker)


Authenticated using refresh token.


Preflight process graph validation failed: Object of type ndarray is not JSON serializable


TypeError: Object of type ndarray is not JSON serializable

In [15]:
openeo.__version__

'0.37.0'
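
One plausible cause of the `ndarray is not JSON serializable` error, not verified here, is that list-valued columns such as `temporal_extent` come back as NumPy arrays when the job table is read from `jobs.parquet`, and then end up in the process graph unchanged. A minimal sketch of a workaround under that assumption converts anything array-like to plain Python types before it is passed to openEO. The helper name `to_json_safe` is hypothetical.

```python
import json


def to_json_safe(value):
    """Recursively convert array-like values (anything exposing .tolist()) to plain Python."""
    if hasattr(value, "tolist"):  # e.g. NumPy arrays and NumPy scalars
        return to_json_safe(value.tolist())
    if isinstance(value, (list, tuple)):
        return [to_json_safe(v) for v in value]
    if isinstance(value, dict):
        return {k: to_json_safe(v) for k, v in value.items()}
    return value


class FakeArray:
    """Stand-in for a NumPy array so that this sketch runs without NumPy installed."""

    def __init__(self, items):
        self._items = list(items)

    def tolist(self):
        return list(self._items)


extent = FakeArray(["2020-01-01", "2020-04-01"])
print(json.dumps({"temporal_extent": to_json_safe(extent)}))
```

Inside `start_job`, this would amount to `temporal_extent = to_json_safe(row["temporal_extent"])` before calling `datacube_from_process`.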

TODO: fix error, ingest resulting parquet files to EOTDL (features as a new dataset), save process graph (json) as a reusable Feature recipe (should be used for inference).

## 4. Train a model with EOTDL

TODO:
- Load parquet files with features (staged from EOTDL)
- Split train/test
- Train model
- Evaluate model
- Export model (ONNX)
- Ingest model to EOTDL
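
For the train/test split step, a stratified split would preserve the balance of 1000 samples per class in both subsets. The following is a dependency-free sketch of the idea; the function name, the 20% test fraction and the seed are assumptions. In practice, `sklearn.model_selection.train_test_split` with its `stratify` argument would be the more usual choice.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.2, seed=42):
    """Return (train_idx, test_idx) so that each class contributes the same share to the test set."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Toy labels: two classes with ten samples each
labels = ["winter_triticale"] * 10 + ["strawberries"] * 10
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
```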

## 5. Run inference with EOTDL

TODO:
- Stage model from EOTDL
- Stage feature recipe from EOTDL
- Generate new subset of parcels
- Compute features with reusable feature recipe
- Run inference with model
- Explore results