## EO Data Extraction Workflow
This notebook demonstrates a streamlined workflow for extracting and processing Earth Observation (EO) data 
using the **openEO** Python client. 

### Key Steps:
1. Load and align input data (shapefile).
2. Transform the GeoDataFrame for the MultiBackendJobManager.
3. Run the extraction process using openEO backends.
4. View and analyze the outputs (e.g., Parquet files).

### Required Libraries:
- `openeo` for interacting with EO backends.
- `openeo-gfmap` for handling geospatial data.

### Step 1: Load the shapefile

In [2]:
import os
import zipfile
import eotdl
from eotdl.datasets import download_dataset

download_dataset("EuroCrops", version=1, path="data", force=True)

os.makedirs("data/EuroCrops", exist_ok=True)

with zipfile.ZipFile("./data/EuroCrops/v1/EuroCrops.zip", 'r') as zip_ref:
    zip_ref.extractall("data/EuroCrops")


ModuleNotFoundError: No module named 'eotdl'
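
The `ModuleNotFoundError` above indicates that `eotdl` was not installed in the environment that ran the cell. A minimal pre-flight check, sketched here with the standard library only, can report such gaps before any download starts. The helper name `missing_modules` is hypothetical, and the list of module names is an assumption based on the imports used in this notebook.

```python
from importlib.util import find_spec
from typing import Iterable, List


def missing_modules(names: Iterable[str]) -> List[str]:
    """Return the top-level module names that cannot be found."""
    # find_spec returns None when a top-level module is not installed
    return [name for name in names if find_spec(name) is None]


# Assumed list of modules, taken from the import statements in this notebook
required = ["eotdl", "openeo", "geopandas", "pandas"]
missing = missing_modules(required)
if missing:
    print("Install before continuing:", ", ".join(missing))
```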

## Extract the desired GeoDataFrame

In [3]:
import os
import geopandas as gpd

# Define the file path relative to the notebook directory
file_path = os.path.join("data", "EuroCrops", "BE_VLG_2021", "BE_VLG_2021_EC21.shp")

# Load the shapefile
gdf = gpd.read_file(file_path)


## Subset the dataset for testing purposes

In [4]:
gdf = gdf[0:5]
gdf

Unnamed: 0,fid,GRAF_OPP,REF_ID,GWSCOD_V,GWSNAM_V,GWSCOD_H,GWSNAM_H,GWSGRP_H,GWSGRPH_LB,GWSCOD_N,...,PRC_NIS,X_REF,Y_REF,WGS84_LG,WGS84_BG,EC_NUTS3,EC_trans_n,EC_hcat_n,EC_hcat_c,geometry
0,56.0,1.0038,2195943000.0,,,898,Permacultuur,Andere subsidiabele gewassen groenten - gebrui...,Overige gewassen,,...,31033,59203.23,191139.13,"3°4'28""","51°1'24""",BE251,permaculture,permanent_crops_perennial,3303000000,"POLYGON ((59139.6 191171.57, 59230.75 191204.2..."
1,72.0,1.76,2192143000.0,,,898,Permacultuur,Andere subsidiabele gewassen groenten - gebrui...,Overige gewassen,,...,31040,66180.66,206540.11,"3°10'13""","51°9'46""",BE251,permaculture,permanent_crops_perennial,3303000000,"POLYGON ((66098.5 206582.75, 66141.25 206653.7..."
2,87.0,0.9844,2077538000.0,,,8,Volkstuinpark,,Overige gewassen,,...,31005,68503.64,222895.71,"3°11'59""","51°18'36""",BE251,People garden park,not_known_and_other,3399000000,"POLYGON ((68451.17 222959.44, 68445.16 222966...."
3,110.0,0.0418,2197113000.0,,,8,Volkstuinpark,,Overige gewassen,,...,31042,60749.72,214797.48,"3°5'26""","51°14'10""",BE251,People garden park,not_known_and_other,3399000000,"POLYGON ((60736.35 214800.35, 60753.1 214812.4..."
4,193.0,0.0828,1963867000.0,,,81,Braakliggend land zonder minimale activiteit,,Overige gewassen,,...,31033,60481.1,196996.21,"3°5'28""","51°4'34""",BE251,Derelict land with no minimum activity,unmaintained,3308000000,"POLYGON ((60447.13 197005.27, 60513.69 196999...."


# Transform GeoDataFrame for MultiBackendJobManager

This function processes an input GeoDataFrame and prepares it for use with openEO's **MultiBackendJobManager**. The job manager enables launching and tracking multiple openEO jobs simultaneously, which is essential for large-scale data extractions. 

### Note

Note that in this simple example the geometries are not grouped into feature collections; that utility is only illustrated in the more advanced example. As a consequence, a single openEO job must be launched for each polygon, which makes the extraction workflow slower and more costly.
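
To illustrate what grouping would change, the following sketch batches polygons into GeoJSON FeatureCollections so that one job could cover several parcels. It is not the implementation used in the advanced example: the helper name `group_into_feature_collections` and the `chunk_size` parameter are hypothetical, and geometries are assumed to be GeoJSON-like dictionaries.

```python
from typing import Any, Dict, List, Sequence


def group_into_feature_collections(
    geometries: Sequence[Dict[str, Any]],
    fids: Sequence[Any],
    chunk_size: int,
) -> List[Dict[str, Any]]:
    """Batch GeoJSON geometries into FeatureCollections of at most chunk_size features."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    collections = []
    for start in range(0, len(geometries), chunk_size):
        stop = start + chunk_size
        features = [
            {"type": "Feature", "geometry": geom, "properties": {"fid": fid}}
            for geom, fid in zip(geometries[start:stop], fids[start:stop])
        ]
        collections.append({"type": "FeatureCollection", "features": features})
    return collections


# Five placeholder point geometries stand in for the five test parcels
demo_geometries = [{"type": "Point", "coordinates": [3.0 + i, 51.0]} for i in range(5)]
demo_fids = [56, 72, 87, 110, 193]
batches = group_into_feature_collections(demo_geometries, demo_fids, chunk_size=2)
print([len(batch["features"]) for batch in batches])
```

With a chunk size of 2, the five test parcels would be spread over three collections, so three jobs instead of five.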


### Parameters

#### Temporal Parameters:
- **Start Date:** Start of the temporal extent (e.g., `"2020-01-01"`).  
- **Number of Months:** Duration of the temporal extent in months.
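
As a sketch of how these two parameters could translate into a temporal extent, the snippet below adds `nb_months` calendar months to the start date. The actual logic lives in `process_geodataframe` and is not shown in this notebook, so the helper `compute_temporal_extent` is a hypothetical stand-in; it assumes the end date is simply the start date shifted by whole months.

```python
import calendar
from datetime import date
from typing import List


def compute_temporal_extent(start_date: str, nb_months: int) -> List[str]:
    """Return [start, end] as ISO dates, with end = start + nb_months calendar months."""
    start = date.fromisoformat(start_date)
    # Count months from zero so that integer division yields the year offset
    total_months = start.month - 1 + nb_months
    year = start.year + total_months // 12
    month = total_months % 12 + 1
    # Clamp the day for shorter target months (e.g. 31 January + 1 month)
    day = min(start.day, calendar.monthrange(year, month)[1])
    return [start.isoformat(), date(year, month, day).isoformat()]


print(compute_temporal_extent("2020-01-01", 3))
```

For a start date of `"2020-01-01"` and three months, this yields an extent ending on `2020-04-01`.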




In [9]:
from dataframe_utils import process_geodataframe

# Constants
start_date = "2020-01-01"
nb_months = 3

job_df = process_geodataframe(gdf, start_date, nb_months)

job_df

Unnamed: 0,fid,geometry,crs,temporal_extent
0,56.0,"POLYGON ((3.07366 51.02361, 3.07495 51.02392, ...",EPSG:4326,"[2020-01-01, 2020-04-01]"
1,72.0,"POLYGON ((3.16928 51.16317, 3.16988 51.16382, ...",EPSG:4326,"[2020-01-01, 2020-04-01]"
2,87.0,"POLYGON ((3.19923 51.31069, 3.19914 51.31075, ...",EPSG:4326,"[2020-01-01, 2020-04-01]"
3,110.0,"POLYGON ((3.09062 51.23622, 3.09086 51.23633, ...",EPSG:4326,"[2020-01-01, 2020-04-01]"
4,193.0,"POLYGON ((3.09086 51.07625, 3.09181 51.07621, ...",EPSG:4326,"[2020-01-01, 2020-04-01]"


# Start Job with Standardized UDPs and Feature Collection Filtering

This function initializes an openEO batch job using standardized **User-Defined Processes (UDPs)** for Sentinel-1 and Sentinel-2 data processing. It employs a spatial aggregation in order to get a time series per polygon.

### Key Features

1. **Use of Standardized UDPs**  
   - **S1 Weekly Statistics:** Computes weekly statistics from Sentinel-1 data.  
   - **S2 Weekly Statistics:** Computes weekly statistics from Sentinel-2 data.  
   - UDPs are defined in external JSON files.

2. **Spatial aggregation across polygons**  
   - An average is calculated for each individual polygon.

3. **Cube Merging**  
   - Merges Sentinel-1 and Sentinel-2 datacubes for combined analysis.

4. **Job Configuration**  
   - Outputs results in **parquet** format with filenames derived

In [14]:
import openeo
import pandas as pd

def start_job(row: pd.Series, connection: openeo.Connection, **kwargs) -> openeo.BatchJob:
    """Create a batch job that extracts S1 and S2 weekly statistics for one polygon."""
    temporal_extent = row["temporal_extent"]

    # Polygon over which the merged cube is spatially aggregated
    geometry = row["geometry"]

    # Run the S1 and S2 UDPs
    s1 = connection.datacube_from_process(
        "s1_weekly_statistics",
        namespace="https://raw.githubusercontent.com/earthpulse/eotdl/refs/heads/hv_openeoexample/tutorials/notebooks/openeo/s1_weekly_statistics.json",
        temporal_extent=temporal_extent,
    )

    s2 = connection.datacube_from_process(
        "s2_weekly_statistics",
        namespace="https://raw.githubusercontent.com/earthpulse/eotdl/refs/heads/hv_openeoexample/tutorials/notebooks/openeo/s2_weekly_statistics.json",
        temporal_extent=temporal_extent,
    )

    # Merge both cubes and compute the mean over the polygon
    merged = s2.merge_cubes(s1)
    result = merged.aggregate_spatial(geometries=geometry, reducer="mean")

    # Create the batch job; results are written in Parquet format
    job = result.create_job(
        out_format="parquet",
    )

    return job

### Submit Extraction Jobs

Using the openEO backend, we authenticate and submit the jobs that process the EO data.
Each job extracts Sentinel-1 and Sentinel-2 training features.

In [15]:
import openeo
from openeo.extra.job_management import MultiBackendJobManager, CsvJobDatabase

# Authenticate and add the backend
connection = openeo.connect(url="openeo.dataspace.copernicus.eu").authenticate_oidc()

# initialize the job manager
manager = MultiBackendJobManager()
manager.add_backend("cdse", connection=connection, parallel_jobs=2)

job_tracker = 'jobs.csv'
job_db = CsvJobDatabase(path=job_tracker)
if not job_db.exists():
    df = manager._normalize_df(job_df)
    job_db.persist(df)

manager.run_jobs(start_job=start_job, job_db=job_db)


Authenticated using refresh token.


defaultdict(int,
            {'job_db persist': 20,
             'track_statuses': 15,
             'job_db get_by_status': 10,
             'start_job call': 5,
             'job get status': 10,
             'job start': 5,
             'job launch': 5,
             'run_jobs loop': 15,
             'sleep': 15,
             'job describe': 25,
             'job started running': 5,
             'job finished': 5})
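
The counters above summarise the manager's internal activity; the per-job state is persisted in the job tracker file. A minimal sketch for tallying job states from that CSV with the standard library follows. It assumes the tracker contains a `status` column, and the helper name `count_job_statuses` is hypothetical.

```python
import csv
from collections import Counter
from pathlib import Path


def count_job_statuses(path: str) -> Counter:
    """Count how many jobs are in each status according to the tracker CSV."""
    with open(path, newline="") as handle:
        reader = csv.DictReader(handle)
        # Rows with a missing or empty status are counted as "unknown"
        return Counter((row.get("status") or "unknown") for row in reader)


tracker = Path("jobs.csv")
if tracker.exists():
    print(count_job_statuses(str(tracker)))
```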

In [17]:
import pandas as pd

# Load the Parquet file
df = pd.read_parquet('job_j-241205a25cbe43b39749aac9314dae6a/timeseries.parquet')

# Display the first few rows
print(df.head())

# Print information about the data (info() prints directly and returns None)
df.info()

# Check the column names and types
print(df.dtypes)

                                            geometry  feature_index  \
0  b"\x01\x03\x00\x00\x00\x03\x00\x00\x00\x15\x00...              0   

      B02_P10     B02_P25     B02_P50     B02_P75      B02_P90     B03_P10  \
0  590.226804  618.757732  674.304124  753.104381  1267.213918  675.652062   

      B03_P25     B03_P50  ...   VH_P10    VH_P25    VH_P50    VH_P75  \
0  712.110825  796.889175  ...  0.01587  0.018861  0.022346  0.026172   

     VH_P90    VV_P10    VV_P25    VV_P50    VV_P75    VV_P90  
0  0.029244  0.090603  0.117799  0.142298  0.164967  0.188902  

[1 rows x 62 columns]
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1 entries, 0 to 0
Data columns (total 62 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   geometry       1 non-null      object 
 1   feature_index  1 non-null      int64  
 2   B02_P10        1 non-null      float64
 3   B02_P25        1 non-null      float64
 4   B02_P50        1 non-null      f
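
The `geometry` column in the output above holds raw bytes rather than geometry objects. The prefix `\x01\x03\x00\x00\x00` is consistent with little-endian Well-Known Binary (WKB) for a polygon, although the notebook does not state the encoding. Under that assumption, the header can be inspected with the standard library as sketched below. The helper `wkb_header` is hypothetical; for real work a geometry library such as shapely (`shapely.wkb.loads`) would be the more usual choice.

```python
import struct
from typing import Any, Dict

# Subset of the standard WKB geometry type codes
WKB_TYPES = {
    1: "Point",
    2: "LineString",
    3: "Polygon",
    4: "MultiPoint",
    5: "MultiLineString",
    6: "MultiPolygon",
}


def wkb_header(data: bytes) -> Dict[str, Any]:
    """Decode byte order and geometry type from the first bytes of a WKB blob."""
    if len(data) < 5:
        raise ValueError("WKB data must be at least 5 bytes long")
    # First byte: 1 = little endian, 0 = big endian
    little = data[0] == 1
    endian = "<" if little else ">"
    (type_code,) = struct.unpack(endian + "I", data[1:5])
    info: Dict[str, Any] = {
        "byte_order": "little" if little else "big",
        "type": WKB_TYPES.get(type_code, f"unknown ({type_code})"),
    }
    # For polygons, the next 4-byte integer is the number of rings
    if type_code == 3 and len(data) >= 9:
        (info["rings"],) = struct.unpack(endian + "I", data[5:9])
    return info


# Leading bytes as printed in the output above
print(wkb_header(b"\x01\x03\x00\x00\x00\x03\x00\x00\x00"))
```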

Step 4: Run the various openEO jobs. Note that all data is downloaded locally as NetCDF files named after the `file_name` property of the individual features (see `process_file`).

TODO:
1. Simplified example (not cost-efficient): point extraction (parcel averaged)
2. Integrate with the eotdl example
3. Export workspace
4. Advanced patch extraction
5. Signed URL