# Land Cover Mapping Feature Extraction

This notebook shows how to use GFMap and openEO to extract point features for training a machine learning model for land cover mapping.

The example uses the following steps:
- Load the labelled points and distribute them into spatial hexagons.
- Define the pre-processing steps for extracting the features from Sentinel-1 and Sentinel-2 data.
- Set-up the Sentinel-1 and Sentinel-2 fetchers with GFMap and launch the openEO jobs to fetch the data.
- Combine the results from all the batch jobs into one dataframe.
- Train a random forest classifier using the extracted features.

In [1]:
import openeo
from openeo.extra.spectral_indices.spectral_indices import compute_and_rescale_indices
import openeo.processes as eop

import geopandas as gpd
import pandas as pd
import json
import geojson
from pathlib import Path
import datetime
from shapely.geometry import box
from typing import List
import logging

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_recall_fscore_support, ConfusionMatrixDisplay

from openeo_gfmap.manager import _log
from openeo_gfmap import TemporalContext, Backend, BackendContext, FetchType
from openeo_gfmap.manager.job_splitters import split_job_hex
from openeo_gfmap.manager.job_manager import GFMAPJobManager
from openeo_gfmap.backend import cdse_connection, vito_connection
from openeo_gfmap.fetching import build_sentinel2_l2a_extractor, build_sentinel1_grd_extractor

In [2]:
_log.setLevel(logging.DEBUG)

stream_handler = logging.StreamHandler()
_log.addHandler(stream_handler)

formatter = logging.Formatter('%(asctime)s|%(name)s|%(levelname)s:  %(message)s')
stream_handler.setFormatter(formatter)

# Exclude the other loggers from other libraries
class MyLoggerFilter(logging.Filter):
    def filter(self, record):
        return record.name == _log.name

stream_handler.addFilter(MyLoggerFilter())

## Distribute labelled points

First, we load a dataset with target labels, clipped to a bounding box. The model needs integer target labels, so we derive them from the first letter of the `LC1` land cover code. We also reduce each target polygon to a single target point by taking its centroid.

In [3]:
mask = box(4.4, 50.2, 5.6, 51.2)
input_gpkg = gpd.read_file("https://artifactory.vgt.vito.be/auxdata-public/openeo/LUCAS_2018_Copernicus.gpkg",mask=mask)
input_gpkg["geometry"] = input_gpkg["geometry"].apply(lambda x: x.centroid)
input_gpkg["target"] = input_gpkg["LC1"].apply(lambda x: ord(x[0])-65)
input_gpkg = input_gpkg[['target','geometry', 'YEAR']]
input_gpkg

Unnamed: 0,target,geometry,YEAR
0,1,POINT (4.41237 50.62403),2018
1,1,POINT (4.40775 50.65988),2018
2,2,POINT (4.42316 50.76876),2018
3,2,POINT (4.45993 50.69850),2018
4,2,POINT (4.45355 50.75219),2018
...,...,...,...
216,4,POINT (5.59582 50.69350),2018
217,1,POINT (5.52965 51.08706),2018
218,4,POINT (5.55633 51.10611),2018
219,1,POINT (5.55983 51.07032),2018
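
The integer targets come from the first letter of the `LC1` code: `ord(x[0]) - 65` maps `A` to 0, `B` to 1, and so on. A small standalone sketch of that mapping and its inverse follows; the helper names and the sample codes are illustrative and not part of the notebook.

```python
def lc1_to_target(lc1_code: str) -> int:
    """Map the leading letter of an LC1 code to an integer: 'A' -> 0, 'B' -> 1, ..."""
    return ord(lc1_code[0]) - ord("A")


def target_to_letter(target: int) -> str:
    """Inverse mapping, handy when reporting predictions per class letter."""
    return chr(target + ord("A"))


# Sample codes chosen for illustration only
sample_codes = ["A11", "B16", "C10", "E20"]
targets = [lc1_to_target(code) for code in sample_codes]
print(targets)  # [0, 1, 2, 4]
print([target_to_letter(t) for t in targets])  # ['A', 'B', 'C', 'E']
```

Keeping the inverse around makes it easier to translate model output back to the letter classes of the source dataset.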


To extract the features at the target points, we use GFMap to distribute the points over multiple hexagons. Each hexagon is extracted in a separate openEO job.
Splitting up the jobs is necessary because processing a large area in a single job would cause memory issues.

We use `split_job_hex` for distributing the target points over multiple hexagons.

In [4]:
input_split = split_job_hex(input_gpkg, max_points=500, grid_resolution=4)
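
The idea behind the split can be illustrated without GFMap: assign every point to a grid cell, group the points per cell, and cut groups larger than `max_points` into chunks. The sketch below uses a plain square longitude/latitude grid as a stand-in for H3 hexagons, and `split_points` and `cell_size` are hypothetical names. It is not GFMap's implementation.

```python
from collections import defaultdict
import math


def split_points(points, cell_size=0.5, max_points=500):
    """Group (lon, lat) points per square grid cell, then chunk oversized groups."""
    cells = defaultdict(list)
    for lon, lat in points:
        # Square cells are an assumption here; GFMap uses H3 hexagons instead
        key = (math.floor(lon / cell_size), math.floor(lat / cell_size))
        cells[key].append((lon, lat))

    jobs = []
    for key in sorted(cells):
        group = cells[key]
        # Cut every group into chunks of at most max_points
        for start in range(0, len(group), max_points):
            jobs.append(group[start:start + max_points])
    return jobs


points = [(4.41, 50.62), (4.42, 50.76), (4.46, 50.70), (5.59, 50.69), (5.53, 51.09)]
jobs = split_points(points, cell_size=0.5, max_points=2)
print([len(job) for job in jobs])  # [2, 1, 1, 1]
```

Each element of `jobs` plays the role of one entry in `input_split`: a small, spatially compact set of points that one batch job can handle.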




We then create a dataframe where each row represents a single hexagon, and thus a single batch job.

In [5]:
def create_job_dataframe(split_jobs: List[gpd.GeoDataFrame]) -> pd.DataFrame:
    """Create a dataframe from the split jobs, containing all the necessary information to run the job."""
    rows = []
    for job in split_jobs:
        start_date = datetime.datetime(job.YEAR.min(), 1, 1)
        end_date = datetime.datetime(job.YEAR.max(), 12, 31)
        rows.append(pd.Series({
            'out_prefix': 'S1S2-stats',
            'out_extension': '.csv',
            'start_date': start_date,
            'end_date': end_date,
            'geometry': job.to_json()
        }))
    return pd.DataFrame(rows)

job_df = create_job_dataframe(input_split)

In [6]:
# job_df = job_df.iloc[:3]  # testing: only run the first three jobs for now

## Define feature extraction

Next, we define which features we want to extract with openEO.

First we define the process graph, except for the actual loading of the collections. That part is handled by the GFMap-specific methods.

In [7]:
def timesteps_as_bands(base_features):
    band_names = [band + "_m" + str(i+1) for band in base_features.metadata.band_names for i in range(12)]
    result =  base_features.apply_dimension(
        dimension='t', 
        target_dimension='bands', 
        process=lambda d: eop.array_create(data=d)
    )
    return result.rename_labels('bands', band_names)

def compute_statistics(base_features):
    """
    Computes MEAN, STDDEV, MIN, P25, MEDIAN, P75, MAX over the time dimension of a datacube.
    """
    def computeStats(input_timeseries):
        result = eop.array_concat(
            input_timeseries.mean(),
            input_timeseries.sd()
        )
        result = eop.array_concat(result, input_timeseries.min())
        result = eop.array_concat(result, input_timeseries.quantiles(probabilities=[0.25]))
        result = eop.array_concat(result, input_timeseries.median())
        result = eop.array_concat(result, input_timeseries.quantiles(probabilities=[0.75]))
        result = eop.array_concat(result, input_timeseries.max())
        return result
    
    stats = base_features.apply_dimension(dimension='t', target_dimension='bands', process=computeStats)
    all_bands = [band + "_" + stat for band in base_features.metadata.band_names for stat in ["mean", "stddev", "min", "p25", "median", "p75", "max"]]
    return stats.rename_labels('bands', all_bands)

def get_s1_features(
        s1_datacube
) -> openeo.DataCube:
    s1 = s1_datacube.linear_scale_range(0, 30, 0, 30000)
    s1_month = s1.aggregate_temporal_period(period="month", reducer="mean")

    s1_month = s1_month.apply_dimension(dimension="t", process="array_interpolate_linear")

    s1_features = timesteps_as_bands(s1_month)
    return s1_features

def get_s2_features(
        s2_datacube,
        s2_list,
        s2_index_dict,
) -> openeo.DataCube:
    # TODO compare with BAP or NDVIweighted
    s2 = s2_datacube.process("mask_scl_dilation", data=s2_datacube, scl_band_name="SCL").filter_bands(s2_datacube.metadata.band_names[:-1])

    indices = compute_and_rescale_indices(s2, s2_index_dict, append=False)
    idx_dekad = indices.aggregate_temporal_period("dekad", reducer="mean")
    idx_stats = compute_statistics(idx_dekad)

    s2_monthly = s2.filter_bands(s2_list).aggregate_temporal_period("month", reducer="mean")  # TODO check whether to use mean or median
    s2_monthly = s2_monthly.apply_dimension(dimension="t", process="array_interpolate_linear")
    s2_features = timesteps_as_bands(s2_monthly).merge_cubes(idx_stats)
    return s2_features

def preprocess_features(
        s2_datacube,
        s1_datacube,
) -> openeo.DataCube:
    s2_list = ["B02", "B03", "B04", "B08", "B11", "B12"]
    s2_index_dict = {
        "collection": {
            "input_range": [0, 8000],
            "output_range": [0, 30_000]
        },
        "indices": {
            "NDVI": {"input_range": [-1,1], "output_range": [0, 30_000]}
        }
    }
    
    s2_features = get_s2_features(s2_datacube, s2_list, s2_index_dict)
    s1_features = get_s1_features(s1_datacube)
    # TODO add topographic features: elevation, slope and aspect

    features = s2_features.merge_cubes(s1_features)
    return features
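
To make the resulting feature names concrete, here is a local, openEO-free sketch of what `compute_statistics` and `timesteps_as_bands` produce for a single time series. It uses Python's `statistics` module. That the inclusive quantile method matches the backend's `quantiles` process is an assumption, and the input values are made up.

```python
import statistics

STAT_NAMES = ["mean", "stddev", "min", "p25", "median", "p75", "max"]


def summarize(series):
    """Return the seven statistics in the same order as STAT_NAMES."""
    # Assumption: inclusive quartiles approximate the backend's quantiles
    p25, _, p75 = statistics.quantiles(series, n=4, method="inclusive")
    return [
        statistics.mean(series),
        statistics.stdev(series),  # sample standard deviation
        min(series),
        p25,
        statistics.median(series),
        p75,
        max(series),
    ]


def stat_band_names(band_names):
    """Band-major naming, e.g. NDVI_mean, NDVI_stddev, ..."""
    return [f"{band}_{stat}" for band in band_names for stat in STAT_NAMES]


def monthly_band_names(band_names, n_months=12):
    """Band-major naming for monthly composites, e.g. VH_m1 ... VH_m12."""
    return [f"{band}_m{i + 1}" for band in band_names for i in range(n_months)]


ndvi_series = [10, 20, 30, 40, 50]  # made-up values
features = dict(zip(stat_band_names(["NDVI"]), summarize(ndvi_series)))
print(features["NDVI_p25"], features["NDVI_median"], features["NDVI_p75"])
print(monthly_band_names(["VH", "VV"])[:3])  # ['VH_m1', 'VH_m2', 'VH_m3']
```

With six Sentinel-2 bands and two Sentinel-1 bands at twelve months each, plus seven statistics for NDVI, this naming scheme accounts for every column of the final feature table.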

## Fetching the data

### Set-up the Sentinel-1 and Sentinel-2 fetchers

Next we use the extractor methods of GFMap to load the collections. These methods make the loading backend-independent (e.g. whether or not the backscatter still has to be calculated on the S1 data).

The loaded collections are pre-processed and then aggregated for the target points.

In [8]:
def sentinel2_collection(
        row: pd.Series,
        connection: openeo.Connection,
        geometry: geojson.FeatureCollection
    ) -> openeo.DataCube:
    bands = ["B02", "B03", "B04", "B05", "B06", "B07", "B08", "B11", "B12", "SCL"]
    bands_with_platform = ["S2-L2A-" + band for band in bands]

    extractor = build_sentinel2_l2a_extractor(
        backend_context=BackendContext(Backend(row.backend_name)),
        bands=bands_with_platform,
        fetch_type=FetchType.POINT,
    )

    temporal_context = TemporalContext(row.start_date, row.end_date)

    s2 = extractor.get_cube(connection, geometry, temporal_context)
    # TODO add max_cloud_cover 80
    s2 = s2.rename_labels("bands", bands)
    return s2

def sentinel1_collection(
        row: pd.Series,
        connection: openeo.Connection,
        geometry: geojson.FeatureCollection,
    ) -> openeo.DataCube:
    bands = ["VH", "VV"]
    bands_with_platform = ["S1-SIGMA0-" + band for band in bands]

    extractor = build_sentinel1_grd_extractor(
        backend_context=BackendContext(Backend(row.backend_name)),
        bands=bands_with_platform,
        fetch_type=FetchType.POINT,
    )

    temporal_context = TemporalContext(row.start_date, row.end_date)

    s1 = extractor.get_cube(connection, geometry, temporal_context)
    s1 = s1.rename_labels("bands", bands)
    return s1

def load_lc_features(
    row: pd.Series,
    connection: openeo.Connection,
    **kwargs
):
    geometry = geojson.loads(row.geometry)
    
    s2_collection = sentinel2_collection(
        row=row,
        connection=connection,
        geometry=geometry
    )

    s1_collection = sentinel1_collection(
        row=row,
        connection=connection,
        geometry=geometry
    )

    features = preprocess_features(s2_collection, s1_collection)

    # Currently, aggregate_spatial and vectorcubes do not keep the band names, so we'll need to rename them later on
    global final_band_names
    final_band_names = [b.name for b in features.metadata.band_dimension.bands]

    features = features.aggregate_spatial(geometry, reducer="median")
    
    job_options = {
        "executor-memory": "3G", # Increase this value if a job fails due to memory issues
        "executor-memoryOverhead": "2G",
        "soft-errors": True
    }

    return features.create_job(
        out_format="csv",
        title=f"GFMAP_Extraction_{geometry.features[0].properties['h3index']}",
        job_options=job_options,
    )

# Global variable to store the final band names
final_band_names = None

### Launch the openEO jobs to fetch the data

In order to launch the jobs, we have to define a function that will determine the output file name, and create the job manager.

In [9]:
def generate_output_path(
    root_folder: Path,
    tmp_path: Path,
    geometry_index: int,
    row: pd.Series
) -> Path:
    features = geojson.loads(row.geometry)
    h3index = features[geometry_index].properties['h3index']
    result = root_folder / f"{row.out_prefix}_{h3index}_{geometry_index}{row.out_extension}"
    print("output_path:", result)
    return result
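
The `geometry` column of each row holds a GeoJSON FeatureCollection as a string, and the `h3index` property of its features identifies the hexagon. The sketch below rebuilds the file name with only the standard library. The feature collection is a hand-written stand-in for `job.to_json()`, and `output_file_name` is a hypothetical helper.

```python
import json
from pathlib import Path

# Hand-written stand-in for one row's `geometry` value (coordinates are arbitrary)
row_geometry = json.dumps({
    "type": "FeatureCollection",
    "features": [{
        "type": "Feature",
        "properties": {"h3index": "841fa01ffffffff", "target": 1},
        "geometry": {"type": "Point", "coordinates": [5.0, 50.7]},
    }],
})


def output_file_name(root_folder, geometry_json, geometry_index,
                     out_prefix="S1S2-stats", out_extension=".csv"):
    """Build <prefix>_<h3index>_<index><extension> below root_folder."""
    features = json.loads(geometry_json)["features"]
    h3index = features[geometry_index]["properties"]["h3index"]
    return Path(root_folder) / f"{out_prefix}_{h3index}_{geometry_index}{out_extension}"


path = output_file_name("output/20240313-15h25", row_geometry, 0)
print(path.name)  # S1S2-stats_841fa01ffffffff_0.csv
```

Encoding the hexagon index in the file name is what later allows the results to be matched back to the input points of the same hexagon.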

In [10]:
base_output_path = Path("output")
base_output_path.mkdir(exist_ok=True)

timenow = datetime.datetime.now()
timestr = timenow.strftime("%Y%m%d-%Hh%M")
print(f"Timestr: {timestr}")
tracking_file = base_output_path / f"tracking_{timestr}.csv"


Timestr: 20240313-15h25


In [11]:
manager = GFMAPJobManager(
    output_dir=base_output_path / timestr,
    output_path_generator=generate_output_path,
    poll_sleep=60,
    n_threads=1,
    collection_id="LC_feature_extraction",
)

In [12]:
manager.add_backend(Backend.CDSE, cdse_connection, parallel_jobs=2)

We then run the prepared jobs.

In [13]:
manager.run_jobs(
    job_df,
    load_lc_features,
    tracking_file
)

2024-03-13 15:25:02,116|openeo_gfmap.manager|INFO:  Starting job manager using 1 worker threads.
2024-03-13 15:25:02,119|openeo_gfmap.manager|INFO:  Workers started, creating and running jobs.
2024-03-13 15:25:02,123|openeo_gfmap.manager|DEBUG:  Normalizing dataframe. Columns: Index(['out_prefix', 'out_extension', 'start_date', 'end_date', 'geometry',
       'status', 'id', 'start_time', 'cpu', 'memory', 'duration',
       'backend_name', 'description', 'costs'],
      dtype='object')


Authenticated using refresh token.
DataCube(<PGNode 'dimension_labels' at 0x23c2aa5afd0>)
DataCube(<PGNode 'dimension_labels' at 0x23c2aa69ea0>)
DataCube(<PGNode 'dimension_labels' at 0x23c2aa8fa70>)
DataCube(<PGNode 'dimension_labels' at 0x23c2aa8ddb0>)


2024-03-13 15:26:38,688|openeo_gfmap.manager|DEBUG:  Status of job j-24031351b99d42edab6191abe29cb2df is running (on backend Backend.CDSE).
2024-03-13 15:26:39,220|openeo_gfmap.manager|DEBUG:  Status of job j-2403132379804fc582d1d8a0f70f91bb is running (on backend Backend.CDSE).
2024-03-13 15:27:41,293|openeo_gfmap.manager|DEBUG:  Status of job j-24031351b99d42edab6191abe29cb2df is running (on backend Backend.CDSE).
2024-03-13 15:27:42,868|openeo_gfmap.manager|DEBUG:  Status of job j-2403132379804fc582d1d8a0f70f91bb is running (on backend Backend.CDSE).
2024-03-13 15:28:43,635|openeo_gfmap.manager|DEBUG:  Status of job j-24031351b99d42edab6191abe29cb2df is running (on backend Backend.CDSE).
2024-03-13 15:28:44,218|openeo_gfmap.manager|DEBUG:  Status of job j-2403132379804fc582d1d8a0f70f91bb is running (on backend Backend.CDSE).
2024-03-13 15:29:44,670|openeo_gfmap.manager|DEBUG:  Status of job j-24031351b99d42edab6191abe29cb2df is running (on backend Backend.CDSE).
2024-03-13 15:29:46,

DataCube(<PGNode 'dimension_labels' at 0x23c2aabc410>)
DataCube(<PGNode 'dimension_labels' at 0x23c2aabd720>)


2024-03-13 15:41:09,660|openeo_gfmap.manager|DEBUG:  Generating output path for asset timeseries.csv from job j-24031351b99d42edab6191abe29cb2df...
2024-03-13 15:41:09,663|openeo_gfmap.manager|DEBUG:  Downloading asset timeseries.csv from job j-24031351b99d42edab6191abe29cb2df -> output\20240313-15h25\S1S2-stats_841fa01ffffffff_0.csv


output_path: output\20240313-15h25\S1S2-stats_841fa01ffffffff_0.csv


2024-03-13 15:41:10,809|openeo_gfmap.manager|INFO:  Downloaded asset timeseries.csv from job j-24031351b99d42edab6191abe29cb2df -> output\20240313-15h25\S1S2-stats_841fa01ffffffff_0.csv
2024-03-13 15:41:12,974|openeo_gfmap.manager|INFO:  Added 0 items to the STAC collection.
2024-03-13 15:41:12,976|openeo_gfmap.manager|INFO:  Job j-24031351b99d42edab6191abe29cb2df and post job action finished successfully.
2024-03-13 15:42:23,309|openeo_gfmap.manager|DEBUG:  Status of job j-2403132379804fc582d1d8a0f70f91bb is running (on backend Backend.CDSE).
2024-03-13 15:42:24,726|openeo_gfmap.manager|DEBUG:  Status of job j-24031367e1d04f1986cda2836e72c6d5 is running (on backend Backend.CDSE).
2024-03-13 15:43:25,173|openeo_gfmap.manager|DEBUG:  Status of job j-2403132379804fc582d1d8a0f70f91bb is running (on backend Backend.CDSE).
2024-03-13 15:43:25,508|openeo_gfmap.manager|DEBUG:  Status of job j-24031367e1d04f1986cda2836e72c6d5 is running (on backend Backend.CDSE).
2024-03-13 15:44:25,748|openeo

DataCube(<PGNode 'dimension_labels' at 0x23c2aabcff0>)
DataCube(<PGNode 'dimension_labels' at 0x23c2aabd310>)


2024-03-13 15:45:27,939|openeo_gfmap.manager|DEBUG:  Generating output path for asset timeseries.csv from job j-2403132379804fc582d1d8a0f70f91bb...
2024-03-13 15:45:27,941|openeo_gfmap.manager|DEBUG:  Downloading asset timeseries.csv from job j-2403132379804fc582d1d8a0f70f91bb -> output\20240313-15h25\S1S2-stats_841fa05ffffffff_0.csv


output_path: output\20240313-15h25\S1S2-stats_841fa05ffffffff_0.csv


2024-03-13 15:45:28,831|openeo_gfmap.manager|INFO:  Downloaded asset timeseries.csv from job j-2403132379804fc582d1d8a0f70f91bb -> output\20240313-15h25\S1S2-stats_841fa05ffffffff_0.csv
2024-03-13 15:45:30,266|openeo_gfmap.manager|INFO:  Added 0 items to the STAC collection.
2024-03-13 15:45:30,267|openeo_gfmap.manager|INFO:  Job j-2403132379804fc582d1d8a0f70f91bb and post job action finished successfully.
2024-03-13 15:46:46,334|openeo_gfmap.manager|DEBUG:  Status of job j-24031367e1d04f1986cda2836e72c6d5 is running (on backend Backend.CDSE).
2024-03-13 15:46:46,754|openeo_gfmap.manager|DEBUG:  Status of job j-240313c5da8b4937a5dc258ac40044c0 is running (on backend Backend.CDSE).
2024-03-13 15:47:47,197|openeo_gfmap.manager|DEBUG:  Status of job j-24031367e1d04f1986cda2836e72c6d5 is running (on backend Backend.CDSE).
2024-03-13 15:47:48,297|openeo_gfmap.manager|DEBUG:  Status of job j-240313c5da8b4937a5dc258ac40044c0 is running (on backend Backend.CDSE).
2024-03-13 15:49:02,855|openeo

output_path: output\20240313-15h25\S1S2-stats_841fa09ffffffff_0.csv


2024-03-13 16:13:22,224|openeo_gfmap.manager|INFO:  Downloaded asset timeseries.csv from job j-24031367e1d04f1986cda2836e72c6d5 -> output\20240313-15h25\S1S2-stats_841fa09ffffffff_0.csv
2024-03-13 16:13:23,580|openeo_gfmap.manager|DEBUG:  Status of job j-240313c5da8b4937a5dc258ac40044c0 is running (on backend Backend.CDSE).


DataCube(<PGNode 'dimension_labels' at 0x23c2aabdea0>)
DataCube(<PGNode 'dimension_labels' at 0x23c2ab548c0>)


2024-03-13 16:14:47,593|openeo_gfmap.manager|INFO:  Added 0 items to the STAC collection.
2024-03-13 16:14:47,596|openeo_gfmap.manager|INFO:  Job j-24031367e1d04f1986cda2836e72c6d5 and post job action finished successfully.
2024-03-13 16:16:24,207|openeo_gfmap.manager|DEBUG:  Status of job j-240313c5da8b4937a5dc258ac40044c0 is running (on backend Backend.CDSE).
2024-03-13 16:16:43,128|openeo_gfmap.manager|DEBUG:  Status of job j-240313c94f574951a60df88536847159 is running (on backend Backend.CDSE).
2024-03-13 16:17:45,775|openeo_gfmap.manager|DEBUG:  Status of job j-240313c5da8b4937a5dc258ac40044c0 is running (on backend Backend.CDSE).
2024-03-13 16:17:49,244|openeo_gfmap.manager|DEBUG:  Status of job j-240313c94f574951a60df88536847159 is running (on backend Backend.CDSE).
2024-03-13 16:18:49,538|openeo_gfmap.manager|DEBUG:  Status of job j-240313c5da8b4937a5dc258ac40044c0 is running (on backend Backend.CDSE).
2024-03-13 16:18:54,556|openeo_gfmap.manager|DEBUG:  Status of job j-240313c

output_path: output\20240313-15h25\S1S2-stats_841fa0dffffffff_0.csv


2024-03-13 19:52:34,919|openeo_gfmap.manager|DEBUG:  Status of job j-240313c94f574951a60df88536847159 is finished (on backend Backend.CDSE).
2024-03-13 19:52:34,919|openeo_gfmap.manager|INFO:  Job j-240313c94f574951a60df88536847159 finished successfully, queueing on_job_done...


DataCube(<PGNode 'dimension_labels' at 0x23c2aabe670>)
DataCube(<PGNode 'dimension_labels' at 0x23c2aabe2b0>)


2024-03-13 19:52:35,965|openeo_gfmap.manager|INFO:  Downloaded asset timeseries.csv from job j-240313c5da8b4937a5dc258ac40044c0 -> output\20240313-15h25\S1S2-stats_841fa0dffffffff_0.csv
2024-03-13 19:52:39,244|openeo_gfmap.manager|INFO:  Added 0 items to the STAC collection.
2024-03-13 19:52:39,244|openeo_gfmap.manager|INFO:  Job j-240313c5da8b4937a5dc258ac40044c0 and post job action finished successfully.
2024-03-13 19:52:39,244|openeo_gfmap.manager|DEBUG:  Worker thread Thread-5 (_post_job_worker): polled finished job with status PostJobStatus.FINISHED.
2024-03-13 19:52:41,182|openeo_gfmap.manager|DEBUG:  Generating output path for asset timeseries.csv from job j-240313c94f574951a60df88536847159...
2024-03-13 19:52:41,182|openeo_gfmap.manager|DEBUG:  Downloading asset timeseries.csv from job j-240313c94f574951a60df88536847159 -> output\20240313-15h25\S1S2-stats_841fa41ffffffff_0.csv


output_path: output\20240313-15h25\S1S2-stats_841fa41ffffffff_0.csv
DataCube(<PGNode 'dimension_labels' at 0x23c2ab86940>)
DataCube(<PGNode 'dimension_labels' at 0x23c2ab872f0>)


2024-03-13 19:53:05,900|openeo_gfmap.manager|INFO:  Downloaded asset timeseries.csv from job j-240313c94f574951a60df88536847159 -> output\20240313-15h25\S1S2-stats_841fa41ffffffff_0.csv
2024-03-13 19:53:07,956|openeo_gfmap.manager|INFO:  Added 0 items to the STAC collection.
2024-03-13 19:53:07,957|openeo_gfmap.manager|INFO:  Job j-240313c94f574951a60df88536847159 and post job action finished successfully.
2024-03-13 19:54:22,051|openeo_gfmap.manager|DEBUG:  Status of job j-240313acfc5c4008b960ae613a47780f is running (on backend Backend.CDSE).
2024-03-13 19:54:22,551|openeo_gfmap.manager|DEBUG:  Status of job j-24031327147749e092bd2c6a7e827cd2 is running (on backend Backend.CDSE).
2024-03-13 19:55:23,341|openeo_gfmap.manager|DEBUG:  Status of job j-240313acfc5c4008b960ae613a47780f is running (on backend Backend.CDSE).
2024-03-13 19:55:23,842|openeo_gfmap.manager|DEBUG:  Status of job j-24031327147749e092bd2c6a7e827cd2 is running (on backend Backend.CDSE).
Ignoring connection error (con

DataCube(<PGNode 'dimension_labels' at 0x23c2aae0c30>)
DataCube(<PGNode 'dimension_labels' at 0x23c2aae2670>)


2024-03-15 08:59:42,316|openeo_gfmap.manager|DEBUG:  Generating output path for asset timeseries.csv from job j-240313acfc5c4008b960ae613a47780f...
2024-03-15 08:59:42,325|openeo_gfmap.manager|DEBUG:  Downloading asset timeseries.csv from job j-240313acfc5c4008b960ae613a47780f -> output\20240313-15h25\S1S2-stats_841fa43ffffffff_0.csv


output_path: output\20240313-15h25\S1S2-stats_841fa43ffffffff_0.csv
DataCube(<PGNode 'dimension_labels' at 0x23c2aae14a0>)
DataCube(<PGNode 'dimension_labels' at 0x23c2aad47d0>)


2024-03-15 08:59:59,559|openeo_gfmap.manager|INFO:  Downloaded asset timeseries.csv from job j-240313acfc5c4008b960ae613a47780f -> output\20240313-15h25\S1S2-stats_841fa43ffffffff_0.csv
2024-03-15 09:00:50,444|openeo_gfmap.manager|INFO:  Added 0 items to the STAC collection.
2024-03-15 09:00:50,444|openeo_gfmap.manager|INFO:  Job j-240313acfc5c4008b960ae613a47780f and post job action finished successfully.
2024-03-15 09:00:50,444|openeo_gfmap.manager|DEBUG:  Worker thread Thread-5 (_post_job_worker): polled finished job with status PostJobStatus.FINISHED.
2024-03-15 09:01:33,260|openeo_gfmap.manager|DEBUG:  Status of job j-240315c62071442abde062b425dc268f is created (on backend Backend.CDSE).
2024-03-15 09:01:35,001|openeo_gfmap.manager|DEBUG:  Status of job j-24031513bf6b4a62a936968d9566e7a8 is created (on backend Backend.CDSE).
2024-03-15 09:02:14,271|openeo_gfmap.manager|DEBUG:  Generating output path for asset timeseries.csv from job j-24031327147749e092bd2c6a7e827cd2...
2024-03-15

output_path: output\20240313-15h25\S1S2-stats_841fa45ffffffff_0.csv


2024-03-15 09:02:22,149|openeo_gfmap.manager|INFO:  Downloaded asset timeseries.csv from job j-24031327147749e092bd2c6a7e827cd2 -> output\20240313-15h25\S1S2-stats_841fa45ffffffff_0.csv
2024-03-15 09:02:35,597|openeo_gfmap.manager|DEBUG:  Status of job j-240315c62071442abde062b425dc268f is running (on backend Backend.CDSE).
2024-03-15 09:02:36,384|openeo_gfmap.manager|DEBUG:  Status of job j-24031513bf6b4a62a936968d9566e7a8 is running (on backend Backend.CDSE).
2024-03-15 09:02:41,115|openeo_gfmap.manager|INFO:  Added 0 items to the STAC collection.
2024-03-15 09:02:41,115|openeo_gfmap.manager|INFO:  Job j-24031327147749e092bd2c6a7e827cd2 and post job action finished successfully.
2024-03-15 09:03:37,460|openeo_gfmap.manager|DEBUG:  Status of job j-240315c62071442abde062b425dc268f is running (on backend Backend.CDSE).
2024-03-15 09:03:37,789|openeo_gfmap.manager|DEBUG:  Status of job j-24031513bf6b4a62a936968d9566e7a8 is running (on backend Backend.CDSE).
2024-03-15 09:04:38,587|openeo

DataCube(<PGNode 'dimension_labels' at 0x23c2aaf21c0>)
DataCube(<PGNode 'dimension_labels' at 0x23c2aaf13b0>)


2024-03-15 09:11:07,515|openeo_gfmap.manager|DEBUG:  Generating output path for asset timeseries.csv from job j-24031513bf6b4a62a936968d9566e7a8...
2024-03-15 09:11:07,515|openeo_gfmap.manager|DEBUG:  Downloading asset timeseries.csv from job j-24031513bf6b4a62a936968d9566e7a8 -> output\20240313-15h25\S1S2-stats_841fa61ffffffff_0.csv


output_path: output\20240313-15h25\S1S2-stats_841fa61ffffffff_0.csv


2024-03-15 09:11:45,806|openeo_gfmap.manager|INFO:  Downloaded asset timeseries.csv from job j-24031513bf6b4a62a936968d9566e7a8 -> output\20240313-15h25\S1S2-stats_841fa61ffffffff_0.csv
2024-03-15 09:12:16,393|openeo_gfmap.manager|INFO:  Added 0 items to the STAC collection.
2024-03-15 09:12:16,393|openeo_gfmap.manager|INFO:  Job j-24031513bf6b4a62a936968d9566e7a8 and post job action finished successfully.
2024-03-15 09:12:51,029|openeo_gfmap.manager|DEBUG:  Status of job j-240315c62071442abde062b425dc268f is running (on backend Backend.CDSE).
2024-03-15 09:12:51,630|openeo_gfmap.manager|DEBUG:  Status of job j-24031578e8fc4dcd8a79ab170a80a046 is running (on backend Backend.CDSE).
2024-03-15 09:13:52,676|openeo_gfmap.manager|DEBUG:  Status of job j-240315c62071442abde062b425dc268f is running (on backend Backend.CDSE).
2024-03-15 09:13:53,145|openeo_gfmap.manager|DEBUG:  Status of job j-24031578e8fc4dcd8a79ab170a80a046 is running (on backend Backend.CDSE).
2024-03-15 09:15:00,712|openeo

DataCube(<PGNode 'dimension_labels' at 0x23c2aaf2c10>)
DataCube(<PGNode 'dimension_labels' at 0x23c2ab3e170>)


2024-03-15 09:58:09,540|openeo_gfmap.manager|DEBUG:  Generating output path for asset timeseries.csv from job j-240315c62071442abde062b425dc268f...
2024-03-15 09:58:09,545|openeo_gfmap.manager|DEBUG:  Downloading asset timeseries.csv from job j-240315c62071442abde062b425dc268f -> output\20240313-15h25\S1S2-stats_841fa47ffffffff_0.csv


output_path: output\20240313-15h25\S1S2-stats_841fa47ffffffff_0.csv


2024-03-15 09:58:49,799|openeo_gfmap.manager|INFO:  Downloaded asset timeseries.csv from job j-240315c62071442abde062b425dc268f -> output\20240313-15h25\S1S2-stats_841fa47ffffffff_0.csv
2024-03-15 09:59:00,881|openeo_gfmap.manager|INFO:  Added 0 items to the STAC collection.
2024-03-15 09:59:00,882|openeo_gfmap.manager|INFO:  Job j-240315c62071442abde062b425dc268f and post job action finished successfully.
2024-03-15 09:59:00,883|openeo_gfmap.manager|DEBUG:  Worker thread Thread-5 (_post_job_worker): polled finished job with status PostJobStatus.FINISHED.
2024-03-15 09:59:18,017|openeo_gfmap.manager|DEBUG:  Status of job j-240315943f4140fab4cabb697441225c is running (on backend Backend.CDSE).
2024-03-15 09:59:40,050|openeo_gfmap.manager|DEBUG:  Generating output path for asset timeseries.csv from job j-24031578e8fc4dcd8a79ab170a80a046...
2024-03-15 09:59:40,056|openeo_gfmap.manager|DEBUG:  Downloading asset timeseries.csv from job j-24031578e8fc4dcd8a79ab170a80a046 -> output\20240313-1

output_path: output\20240313-15h25\S1S2-stats_841fa63ffffffff_0.csv


2024-03-15 10:00:06,833|openeo_gfmap.manager|INFO:  Downloaded asset timeseries.csv from job j-24031578e8fc4dcd8a79ab170a80a046 -> output\20240313-15h25\S1S2-stats_841fa63ffffffff_0.csv
2024-03-15 10:00:18,336|openeo_gfmap.manager|DEBUG:  Status of job j-240315943f4140fab4cabb697441225c is running (on backend Backend.CDSE).
2024-03-15 10:00:24,572|openeo_gfmap.manager|INFO:  Added 0 items to the STAC collection.
2024-03-15 10:00:24,573|openeo_gfmap.manager|INFO:  Job j-24031578e8fc4dcd8a79ab170a80a046 and post job action finished successfully.
2024-03-15 10:01:18,729|openeo_gfmap.manager|DEBUG:  Status of job j-240315943f4140fab4cabb697441225c is running (on backend Backend.CDSE).
2024-03-15 10:02:22,305|openeo_gfmap.manager|DEBUG:  Status of job j-240315943f4140fab4cabb697441225c is running (on backend Backend.CDSE).
2024-03-15 10:03:22,733|openeo_gfmap.manager|DEBUG:  Status of job j-240315943f4140fab4cabb697441225c is running (on backend Backend.CDSE).
2024-03-15 10:04:23,401|openeo

output_path: output\20240313-15h25\S1S2-stats_841fa6bffffffff_0.csv


2024-03-15 10:55:11,237|openeo_gfmap.manager|INFO:  Downloaded asset timeseries.csv from job j-240315943f4140fab4cabb697441225c -> output\20240313-15h25\S1S2-stats_841fa6bffffffff_0.csv
2024-03-15 10:55:39,216|openeo_gfmap.manager|INFO:  Added 0 items to the STAC collection.
2024-03-15 10:55:39,218|openeo_gfmap.manager|INFO:  Job j-240315943f4140fab4cabb697441225c and post job action finished successfully.


## Combine the results

We combine all the different extractions into one dataframe to train and test the model.

In [14]:
## Run these lines to post-process older results
# timestr = "20240312-09h58"
# tracking_file = base_output_path / f"tracking_{timestr}.csv"

In [16]:

tracker_df = pd.read_csv(tracking_file)
df = pd.DataFrame(columns = final_band_names + ['target', 'geometry'])

for index, row in tracker_df.iterrows():
    if row.status == "finished":
        try:
            # Get the target and geometry from the input
            geometry = gpd.read_file(row.geometry, driver='geojson')
            geometry['id'] = geometry['id'].astype('int64')
            h3index = geometry.iloc[0]['h3index']
            filename = f"S1S2-stats_{h3index}_0.csv"
            target_df = geometry[['id', 'target', 'geometry']]

            # Read the stats
            stats_df = pd.read_csv(base_output_path/timestr/filename)
            stats_df.columns = ['id'] + final_band_names

            # Merge the target and geometry with the stats
            stats_df = stats_df.merge(target_df, how='left', on='id')
            stats_df = stats_df.drop(columns=['id'])

            # Append to the dataframe
            df = pd.concat([df, stats_df])
        except FileNotFoundError:
            print(f"File not found: {filename}")
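
The merge step can be pictured with plain Python: the downloaded CSV has one row per point but no meaningful band names, so the columns are renamed positionally and the labels are joined on the point id. All values and the CSV header below are made up for illustration.

```python
import csv
import io

# Made-up stand-ins: a job result with anonymous band columns,
# the band names remembered from the process graph, and the labels per point id
result_csv = "feature_index,band_0,band_1\n0,812.0,15021.0\n1,790.5,14876.0\n"
band_names = ["B02_m1", "NDVI_mean"]
targets = {0: 1, 1: 2}  # point id -> target label

reader = csv.reader(io.StringIO(result_csv))
next(reader)  # discard the header that lost the band names
rows = []
for record in reader:
    point_id = int(record[0])
    # Rename positionally: the n-th value belongs to the n-th band name
    features = dict(zip(band_names, map(float, record[1:])))
    # Left join on the id: None when a point has no label
    features["target"] = targets.get(point_id)
    rows.append(features)

print(rows[0])  # {'B02_m1': 812.0, 'NDVI_mean': 15021.0, 'target': 1}
```

The positional renaming only works if the band order in the CSV matches the order of the stored band names, which is why the notebook records `final_band_names` right before the spatial aggregation.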




Features that contain NaN values can be filtered out here. These often correspond to the months January and December. The cell below is left commented out, so by default no columns are dropped.

In [None]:
## drop NA columns
# nan_columns = df.columns[df.isna().any()].tolist()
# print(f"Dropping columns containing NaN: {nan_columns}")
# df.drop(nan_columns, axis=1, inplace=True)

In [17]:
df.to_csv(base_output_path / timestr / "features.csv", index=False)

## Training and testing a random forest model
The following is just an example of training a random forest locally.

In [18]:
X = df.drop(columns=['target', 'geometry'])
y = df['target'].astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

In [19]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {
    'max_depth': [100, None],
    'max_features': [4, 'log2'],
    'min_samples_leaf': [1, 2],
    'min_samples_split': [2, 3],
    'n_estimators': [100, 200, 300]
}
rf = RandomForestClassifier()
grid_search = GridSearchCV(estimator = rf, param_grid = param_grid, cv = 3, n_jobs = -1, verbose = 2)

grid_search.fit(X_train, y_train)
grid_search.best_params_

Fitting 3 folds for each of 48 candidates, totalling 144 fits




{'max_depth': 100,
 'max_features': 'log2',
 'min_samples_leaf': 2,
 'min_samples_split': 2,
 'n_estimators': 200}

In [20]:
y_pred = grid_search.predict(X_test)
print("Accuracy on test set: "+str(accuracy_score(y_test,y_pred))[0:5])

Accuracy on test set: 0.865
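
Overall accuracy can hide weak classes, especially when some classes have few samples. A dependency-free sketch of per-class precision and recall follows. The labels are made up, and `per_class_scores` is a hypothetical helper; in the notebook itself, the already imported `precision_recall_fscore_support` from scikit-learn would be the natural tool for the same numbers.

```python
from collections import Counter


def per_class_scores(y_true, y_pred):
    """Return {label: (precision, recall)} computed from two label sequences."""
    true_pos = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    predicted = Counter(y_pred)
    actual = Counter(y_true)
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        # Guard against division by zero for labels never predicted or never present
        precision = true_pos[label] / predicted[label] if predicted[label] else 0.0
        recall = true_pos[label] / actual[label] if actual[label] else 0.0
        scores[label] = (precision, recall)
    return scores


# Made-up labels for illustration
y_true = [1, 1, 2, 2, 4, 4]
y_pred = [1, 2, 2, 2, 4, 1]
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
scores = per_class_scores(y_true, y_pred)
print(f"accuracy: {accuracy:.3f}")
for label, (precision, recall) in scores.items():
    print(f"class {label}: precision={precision:.2f} recall={recall:.2f}")
```

In this toy example class 2 is always recovered, while classes 1 and 4 are each missed half of the time, a difference the single accuracy figure does not show.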
