# Multi-step join of monthly Sentinel-2 data to points of interest

This notebook requires the following packages:

In [None]:
import geopandas as gpd
import geoengine as ge

First, connect to the Geo Engine instance:

In [2]:
ge.initialize("http://localhost:3030/api", credentials=("admin@localhost", "admin1234"))

In [3]:
session = ge.get_session()
user_id = session.user_id
session

Server:              http://localhost:3030/api
User Id:             d5328854-6190-4af9-ad69-4e74b0961ac9
Session Id:          1c740831-66ef-42a4-8829-622358f2ae2c
Session valid until: 2023-03-31T22:29:52.497Z

To track how much work is done, get the used quota:

In [4]:
used_quota_start = ge.get_quota(user_id)['used']
used_quota_start

0
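
The quota value appears to be a running total for the user (an assumption based on how it is read in this notebook), so the cost of a single step is the difference between two consecutive readings. A minimal sketch of that bookkeeping follows. `QuotaLog` is a hypothetical helper, not part of the `geoengine` package, and the readings are illustrative rather than values from this run.

```python
class QuotaLog:
    """Records cumulative quota readings and reports per-step usage."""

    def __init__(self, start: int) -> None:
        self._readings = [("start", start)]

    def checkpoint(self, label: str, used_total: int) -> int:
        """Store a cumulative reading and return the quota used since the previous one."""
        previous = self._readings[-1][1]
        self._readings.append((label, used_total))
        return used_total - previous

    def per_step(self) -> dict:
        """Quota attributable to each labelled step."""
        return {
            label: total - self._readings[i][1]
            for i, (label, total) in enumerate(self._readings[1:])
        }


# Illustrative cumulative readings (hypothetical, not from this notebook).
log = QuotaLog(start=100)
log.checkpoint("download", 250)
log.checkpoint("aggregate", 900)
log.checkpoint("join", 1000)
steps = log.per_step()
```

In the notebook, each `used_total` would come from `ge.get_quota(user_id)['used']`. Subtracting the previous reading, rather than a previous difference, keeps the per-step values correct even when the starting quota is not zero.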

Set the area of interest. It is defined as a bounding box in EPSG:32632 (WGS 84 / UTM zone 32N).
It is located in North Rhine-Westphalia (NRW), Germany, and covers the area between Willingen, Lippstadt and Werl.

In [5]:
bounds_array = [421395,  5681078, 476201, 5727833]
xmin = bounds_array[0]
ymin = bounds_array[1]
xmax = bounds_array[2]
ymax = bounds_array[3]

(xmin, ymin, xmax, ymax)

(421395, 5681078, 476201, 5727833)

Using the bounding box, a time interval and a resolution, we define the area of interest as a temporal raster space-time cube.

In [6]:
from datetime import datetime
time_start = datetime(2021, 1, 1)
time_end = datetime(2021, 12, 31)

study_area = ge.api.RasterQueryRectangle(
    spatialBounds=ge.SpatialPartition2D(xmin, ymin, xmax, ymax).to_api_dict(),
    timeInterval=ge.TimeInterval(time_start, time_end).to_api_dict(),
    spatialResolution=ge.SpatialResolution(10.0, 10.0).to_api_dict(),
)
study_area

{'spatialBounds': {'upperLeftCoordinate': {'x': 421395, 'y': 5727833},
  'lowerRightCoordinate': {'x': 476201, 'y': 5681078}},
 'timeInterval': {'start': '2021-01-01T00:00:00.000+00:00',
  'end': '2021-12-31T00:00:00.000+00:00'},
 'spatialResolution': {'x': 10.0, 'y': 10.0}}
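
To get a feeling for the size of this cube, the bounding box and the 10 m resolution can be turned into a pixel grid. The sketch below is plain Python and does not call Geo Engine. Rounding partial pixels up is an assumption made for illustration, and `grid_shape` is a hypothetical helper.

```python
import math


def grid_shape(xmin, ymin, xmax, ymax, res_x, res_y):
    """Return (columns, rows) needed to cover the box at the given pixel size."""
    cols = math.ceil((xmax - xmin) / res_x)
    rows = math.ceil((ymax - ymin) / res_y)
    return cols, rows


# Bounding box and resolution as defined above (metres in EPSG:32632).
cols, rows = grid_shape(421395, 5681078, 476201, 5727833, 10.0, 10.0)
width_km = (476201 - 421395) / 1000
height_km = (5727833 - 5681078) / 1000
pixels_per_time_step = cols * rows
```

The box is roughly 55 km wide and 47 km high, which gives about 25.6 million pixels per band and time step.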

For the training, we use the Sentinel-2 bands 02, 03, 04, and 08. The NDVI is calculated using an expression on bands 4 and 8.
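
For reference, the usual NDVI definition is (NIR - red) / (NIR + red), with band 8 as near-infrared and band 4 as red. The sketch below shows this standard formula only. The exact expression used by the workflow blueprint is not shown in this notebook, so treating it as identical is an assumption.

```python
from typing import Optional


def ndvi(red: float, nir: float) -> Optional[float]:
    """Normalized difference vegetation index from red (B04) and near-infrared (B08)."""
    total = nir + red
    if total == 0:
        return None  # undefined when both values are zero
    return (nir - red) / total


# Illustrative reflectance values (hypothetical, not taken from the data).
dense_vegetation = ndvi(red=500.0, nir=3500.0)
bare_surface = ndvi(red=1000.0, nir=1000.0)
```

Values close to 1 indicate dense green vegetation, while values around 0 indicate bare or built-up surfaces.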

For each band as well as the mask (SCL, the scene classification layer), we create a workflow that downloads the Sentinel-2 data for the area of interest and stores it as a new dataset.

In [7]:
status_download = {}

for b in ["B02", "B03", "B04", "B08", "SCL"]:
    print(b)
    # Build and register the workflow that provides this band.
    sentinel2_band_workflow = ge.unstable.workflow_blueprints.sentinel2_band(b)
    sentinel2_band_workflow_id = ge.register_workflow(sentinel2_band_workflow.to_workflow_dict())
    # Store the band for the study area as a new dataset and wait for the task to finish.
    sentinel2_band_workflow_dataset_task = sentinel2_band_workflow_id.save_as_dataset(study_area, f"sentinel2_nrw_crop_10m_{b}")
    sentinel2_band_workflow_dataset_task.wait_for_finish(print_status=False)
    # Fetch the final status once and keep it for later lookup of the dataset id.
    status = sentinel2_band_workflow_dataset_task.get_status()
    print(status)
    status_download[b] = status

status_download

B02
status=completed, time_started=2023-03-31 21:29:52.914000+00:00, info={'dataset': 'c5b7184d-6f64-400b-a12a-14db8955e32a', 'upload': '9e2ff289-1038-4fea-9aa8-cf4f8bb6e540'}, time_total=00:40:04
B03
status=completed, time_started=2023-03-31 22:10:01.207000+00:00, info={'dataset': '4cb8f322-4308-40c4-96f4-14d7178bc2b0', 'upload': '7caa690c-2472-432e-a6b2-90d9c3d98506'}, time_total=00:40:12
B04
status=completed, time_started=2023-03-31 22:50:14.504000+00:00, info={'dataset': '707454bb-c6f6-474a-86fb-50ffd1fa822e', 'upload': '58914921-2acc-418c-bce7-8a37c18c8042'}, time_total=00:40:17
B08
status=completed, time_started=2023-03-31 23:30:32.830000+00:00, info={'dataset': '0f11c660-e9d6-458c-a9a5-67ea99f864b0', 'upload': 'c56b98e8-bd30-4723-b70d-05021e03b115'}, time_total=00:41:20
SCL
status=completed, time_started=2023-04-01 00:11:56.178000+00:00, info={'dataset': '8554cc40-8825-4da9-94c1-7148f3399bba', 'upload': 'cc694f72-093d-4ea7-9793-4db0ed6156e5'}, time_total=00:04:31


{'B02': TaskStatusInfo(status='completed', time_started=datetime.datetime(2023, 3, 31, 21, 29, 52, 914000, tzinfo=datetime.timezone.utc),info = {'dataset': 'c5b7184d-6f64-400b-a12a-14db8955e32a', 'upload': '9e2ff289-1038-4fea-9aa8-cf4f8bb6e540'}, time_total = '00:40:04'),
 'B03': TaskStatusInfo(status='completed', time_started=datetime.datetime(2023, 3, 31, 22, 10, 1, 207000, tzinfo=datetime.timezone.utc),info = {'dataset': '4cb8f322-4308-40c4-96f4-14d7178bc2b0', 'upload': '7caa690c-2472-432e-a6b2-90d9c3d98506'}, time_total = '00:40:12'),
 'B04': TaskStatusInfo(status='completed', time_started=datetime.datetime(2023, 3, 31, 22, 50, 14, 504000, tzinfo=datetime.timezone.utc),info = {'dataset': '707454bb-c6f6-474a-86fb-50ffd1fa822e', 'upload': '58914921-2acc-418c-bce7-8a37c18c8042'}, time_total = '00:40:17'),
 'B08': TaskStatusInfo(status='completed', time_started=datetime.datetime(2023, 3, 31, 23, 30, 32, 830000, tzinfo=datetime.timezone.utc),info = {'dataset': '0f11c660-e9d6-458c-a9a5-6

Check the quota used for downloading the bands:

In [8]:
used_quota_download = ge.get_quota(user_id)['used'] - used_quota_start
used_quota_download

17460

Using the dataset ids of the bands, we create a map from band name to Geo Engine "InternalDataId". The "InternalDataId" is used to reference the datasets in the workflow.

In [9]:
# If you ran the download workflows above, you can get the ids from the task status.
band_data_map = {name: task_status.info['dataset'] for name, task_status in status_download.items()}
# If the data has already been added, you can simply define the ids instead.
#band_data_map = {
#    'b02': '48d30acd-a378-4f2a-89b3-d73430c0f29e',
#    'b03': '200ed06c-fcd1-41a3-b2dd-68b5d0b338a9',
#    'b04': 'b9ececef-5a0c-4294-be16-334e8017f60f',
#    'b08': '8e635367-35ff-42f9-bb21-88ae9db8be8d',
#    'scl': '64636605-10da-4576-bf21-a265cc1f7d9c'
#    }

# Wrap each dataset id in an InternalDataId so that workflows can reference it.
band_data_id_map = {
    name: ge.api.InternalDataId(type="internal", datasetId=dataset_id)
    for name, dataset_id in band_data_map.items()
}

band_data_id_map

{'B02': {'type': 'internal',
  'datasetId': 'c5b7184d-6f64-400b-a12a-14db8955e32a'},
 'B03': {'type': 'internal',
  'datasetId': '4cb8f322-4308-40c4-96f4-14d7178bc2b0'},
 'B04': {'type': 'internal',
  'datasetId': '707454bb-c6f6-474a-86fb-50ffd1fa822e'},
 'B08': {'type': 'internal',
  'datasetId': '0f11c660-e9d6-458c-a9a5-67ea99f864b0'},
 'SCL': {'type': 'internal',
  'datasetId': '8554cc40-8825-4da9-94c1-7148f3399bba'}}
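
When the ids are typed in by hand, two things can easily go wrong: a malformed id and a band name in a different case than the one used in the loops (`"B02"`, `"SCL"`, ...). The following sketch normalizes both with the standard library only. `build_internal_ids` is a hypothetical helper that returns plain dicts shaped like the output above, and the need for upper-case keys is an assumption based on that output.

```python
import uuid


def build_internal_ids(dataset_ids: dict) -> dict:
    """Map upper-cased band names to dicts shaped like the output above."""
    result = {}
    for name, dataset_id in dataset_ids.items():
        # uuid.UUID raises ValueError for malformed ids, catching copy-paste errors early.
        canonical = str(uuid.UUID(dataset_id))
        result[name.upper()] = {"type": "internal", "datasetId": canonical}
    return result


# Lower-case keys as they might be entered manually.
manual = {
    "b02": "c5b7184d-6f64-400b-a12a-14db8955e32a",
    "scl": "8554cc40-8825-4da9-94c1-7148f3399bba",
}
ids = build_internal_ids(manual)
```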

Now we create a workflow to aggregate each band to monthly means. We also create a monthly mean of the NDVI, which is calculated using an expression on bands 4 and 8. The results are stored as new datasets.

In [10]:
status_agg = {}

for b in ["B02", "B03", "B04", "B08", "NDVI"]:
    print(b)
    # Build and register the workflow that computes the cloud-free monthly mean of this band.
    sentinel2_band_workflow = ge.unstable.workflow_blueprints.s2_cloud_free_aggregated_band_custom_input(
        b, band_data_id_map, granularity="months", window_size=1, aggregation_type="mean"
    )
    sentinel2_band_workflow_id = ge.register_workflow(sentinel2_band_workflow.to_workflow_dict())
    # Store the aggregated band as a new dataset and wait for the task to finish.
    sentinel2_band_workflow_dataset_task = sentinel2_band_workflow_id.save_as_dataset(study_area, f"sentinel2_nrw_crop_10m_cf_monthly_{b}")
    sentinel2_band_workflow_dataset_task.wait_for_finish(print_status=False)
    # Fetch the final status once and keep it for later lookup of the dataset id.
    status = sentinel2_band_workflow_dataset_task.get_status()
    print(status)
    status_agg[b] = status

status_agg

B02
status=completed, time_started=2023-04-01 00:16:35.650000+00:00, info={'dataset': 'acda443c-30d3-4437-b0d1-49e36f94b977', 'upload': '799772f4-be57-4239-aa8e-7f8460607c1b'}, time_total=00:01:26
B03
status=completed, time_started=2023-04-01 00:18:05.762000+00:00, info={'dataset': '4b67722c-db22-404a-bb93-ab18c90251e3', 'upload': 'a240ec5d-8f2d-4fa1-a5ab-e4a3dc0d8b75'}, time_total=00:01:23
B04
status=completed, time_started=2023-04-01 00:19:30.871000+00:00, info={'dataset': '45cc22e0-37b2-48f4-bcb6-198ba6321114', 'upload': '6175db1e-fade-4bb1-9595-f15b5e2b19d7'}, time_total=00:01:25
B08
status=completed, time_started=2023-04-01 00:20:55.972000+00:00, info={'dataset': '920c598d-50e0-4c65-9b7f-da2e4c56dc93', 'upload': '7253607a-190b-4362-9e00-6253dd7ef067'}, time_total=00:01:27
NDVI
status=completed, time_started=2023-04-01 00:22:26.071000+00:00, info={'dataset': '12f35dbd-7188-4bb1-8296-db6ab15e8c84', 'upload': 'dc649734-6a61-4c12-bf58-697f3d3cf5af'}, time_total=00:01:54


{'B02': TaskStatusInfo(status='completed', time_started=datetime.datetime(2023, 4, 1, 0, 16, 35, 650000, tzinfo=datetime.timezone.utc),info = {'dataset': 'acda443c-30d3-4437-b0d1-49e36f94b977', 'upload': '799772f4-be57-4239-aa8e-7f8460607c1b'}, time_total = '00:01:26'),
 'B03': TaskStatusInfo(status='completed', time_started=datetime.datetime(2023, 4, 1, 0, 18, 5, 762000, tzinfo=datetime.timezone.utc),info = {'dataset': '4b67722c-db22-404a-bb93-ab18c90251e3', 'upload': 'a240ec5d-8f2d-4fa1-a5ab-e4a3dc0d8b75'}, time_total = '00:01:23'),
 'B04': TaskStatusInfo(status='completed', time_started=datetime.datetime(2023, 4, 1, 0, 19, 30, 871000, tzinfo=datetime.timezone.utc),info = {'dataset': '45cc22e0-37b2-48f4-bcb6-198ba6321114', 'upload': '6175db1e-fade-4bb1-9595-f15b5e2b19d7'}, time_total = '00:01:25'),
 'B08': TaskStatusInfo(status='completed', time_started=datetime.datetime(2023, 4, 1, 0, 20, 55, 972000, tzinfo=datetime.timezone.utc),info = {'dataset': '920c598d-50e0-4c65-9b7f-da2e4c56d
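
Conceptually, the monthly aggregation groups all observations of a pixel by calendar month and averages the ones that are not masked out. The sketch below illustrates this idea for a single pixel in plain Python. It is not the Geo Engine implementation, and representing masked (for example cloudy) observations as `None` is an assumption made for illustration.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean


def monthly_means(observations):
    """Average (timestamp, value) pairs per calendar month, ignoring masked values (None)."""
    buckets = defaultdict(list)
    for timestamp, value in observations:
        if value is not None:
            buckets[(timestamp.year, timestamp.month)].append(value)
    return {month: mean(values) for month, values in sorted(buckets.items())}


# Illustrative observations of one pixel (hypothetical values).
observations = [
    (datetime(2021, 1, 3), 400.0),
    (datetime(2021, 1, 13), None),  # masked, e.g. classified as cloud
    (datetime(2021, 1, 23), 600.0),
    (datetime(2021, 2, 2), 550.0),
]
result = monthly_means(observations)
```

In this sketch, a month in which every observation is masked produces no entry at all.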

Check the quota used for aggregating the bands (and calculating NDVI):

In [11]:
used_quota_agg = ge.get_quota(user_id)['used'] - used_quota_start - used_quota_download
used_quota_agg

58704

Again, we create a map from band name to Geo Engine "InternalDataId" for the aggregated bands.

In [12]:
monthly_band_data_map = { name: task_status.info['dataset'] for name, task_status in status_agg.items() }

#monthly_band_data_map = {
#    'b02': 'a291c42d-ec5a-4702-954e-a76b17154752',
#    'b03': '8a15acd3-e135-4cc1-83b3-a883a52d69d7',
#    'b04': '9d3f5b09-3cb6-4dac-a673-9e5ead88a221',
#    'b08': 'd7775c4c-ab71-4f8e-8fac-ccedb154861a',
#    'ndvi': '60fb38a6-9002-4254-8554-15ac293876c7'
#    }

# Wrap each dataset id in an InternalDataId so that workflows can reference it.
monthly_band_data_id_map = {
    name: ge.api.InternalDataId(type="internal", datasetId=dataset_id)
    for name, dataset_id in monthly_band_data_map.items()
}


Upload the points to the Geo Engine:

In [13]:
points_df = gpd.read_file("group_sample_frac1_inspireId_utm32n.gpkg")
points_id = ge.upload_dataframe(points_df, "group_sample_frac1_inspireId")
points_id

0b7ddaf2-27b8-4a66-8c36-56804be46f19

Create a source operator that provides the points to a workflow:

In [14]:
points_source_operator = ge.unstable.workflow_operators.OgrSource(str(points_id))
points_source_operator.to_workflow_dict()

{'type': 'Vector',
 'operator': {'type': 'OgrSource',
  'params': {'data': {'type': 'internal',
    'datasetId': '0b7ddaf2-27b8-4a66-8c36-56804be46f19'},
   'attributeProjection': None,
   'attributeFilters': None}}}

To attach the Sentinel-2 data to the points, we use the raster-vector join operator. It takes a vector source (the points) and raster sources (the aggregated bands) as input and creates a point-time-series as output.

In [15]:
# Reprojection is only needed if the input is not already projected to EPSG:32632:
# projected_points = ge.unstable.workflow_operators.Reprojection(points_source_operator, target_spatial_reference="EPSG:32632")

points_with_s2_cloud_free = ge.unstable.workflow_operators.RasterVectorJoin(
    raster_sources=[ge.unstable.workflow_operators.GdalSource(x) for x in monthly_band_data_id_map.values()],
    vector_source=points_source_operator,  # or projected_points
    new_column_names=list(monthly_band_data_id_map.keys()),
)

workflow = ge.register_workflow(points_with_s2_cloud_free.to_workflow_dict())
workflow

4a732008-56da-5f11-8972-cd2baf6688b8
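
The core idea of a raster-vector join for points is a lookup of the raster cell that contains each point. The sketch below shows that lookup for a north-up grid in plain Python. It is only an illustration of the principle, not Geo Engine's implementation, and `pixel_index` and `sample` are hypothetical helpers.

```python
import math


def pixel_index(x, y, xmin, ymax, res_x, res_y):
    """Return (row, col) of the pixel containing point (x, y); row 0 is the northern edge."""
    col = math.floor((x - xmin) / res_x)
    row = math.floor((ymax - y) / res_y)
    return row, col


def sample(raster, x, y, xmin, ymax, res_x, res_y):
    """Look up the raster value under a point, or None if the point lies outside the grid."""
    row, col = pixel_index(x, y, xmin, ymax, res_x, res_y)
    if 0 <= row < len(raster) and 0 <= col < len(raster[0]):
        return raster[row][col]
    return None


# Tiny 2x3 raster with 10 m pixels; its upper-left corner sits at (421395, 5727833).
raster = [
    [1, 2, 3],
    [4, 5, 6],
]
value = sample(raster, 421412.0, 5727820.0, 421395, 5727833, 10.0, 10.0)
outside = sample(raster, 421390.0, 5727820.0, 421395, 5727833, 10.0, 10.0)
```

Repeating this lookup for every raster source and every time step yields one row per point and time interval, with one column per band.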

Create datetime objects for the start and end of the time interval we use to query the final workflow:

In [16]:
start_dt = datetime(2021, 1, 1, 0, 0, 0)
end_dt = datetime(2022, 1, 1, 0, 0, 0)

start_dt, end_dt

(datetime.datetime(2021, 1, 1, 0, 0), datetime.datetime(2022, 1, 1, 0, 0))
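
With an end of 2022-01-01, the query interval covers all twelve monthly time steps of 2021. The sketch below lists these steps as half-open intervals. Treating them as half-open is an assumption, and `monthly_intervals` is a hypothetical helper that expects a start on the first day of a month.

```python
from datetime import datetime


def monthly_intervals(start, end):
    """Yield (month_start, next_month_start) pairs from start up to end."""
    current = start
    while current < end:
        if current.month == 12:
            following = current.replace(year=current.year + 1, month=1)
        else:
            following = current.replace(month=current.month + 1)
        yield current, following
        current = following


intervals = list(monthly_intervals(datetime(2021, 1, 1), datetime(2022, 1, 1)))
```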

Now, we query the workflow that attaches the Sentinel-2 data to the points from Python and store the result in a GeoPackage (gpkg) file.

In [17]:
gp_res = await workflow.vector_stream_into_geopandas(
    ge.QueryRectangle(
        spatial_bounds=ge.BoundingBox2D(
            xmin=xmin,
            ymin=ymin,
            xmax=xmax,
            ymax=ymax,
        ),
        time_interval=ge.TimeInterval(
            start=start_dt,
            end=end_dt,
        ),
        resolution=ge.SpatialResolution(
            10.0,
            10.0,
        ),
        srs="EPSG:32632",
))

gp_res.to_file("gp_res_10_frac1_monthly_utm32n_multi_steps.gpkg", driver="GPKG")
gp_res

Unnamed: 0,INSPIRE_ID,B04,B03,CODE,ID,index,B02,B08,NDVI,geometry,time_start,time_end
0,https://geodaten.nrw.de/id/inspire-lu-ts/exist...,,,311,0,0,,,0.033419,MULTIPOINT (428690.027 5711938.189),2021-01-01 00:00:00+00:00,2021-02-01 00:00:00+00:00
1,https://geodaten.nrw.de/id/inspire-lu-ts/exist...,,,131,1,1,,,0.040039,MULTIPOINT (427819.337 5710040.545),2021-01-01 00:00:00+00:00,2021-02-01 00:00:00+00:00
2,https://geodaten.nrw.de/id/inspire-lu-ts/exist...,,,115,2,2,,,0.043195,MULTIPOINT (427320.866 5710158.178),2021-01-01 00:00:00+00:00,2021-02-01 00:00:00+00:00
3,https://geodaten.nrw.de/id/inspire-lu-ts/exist...,,,459,3,265,,,0.025341,MULTIPOINT (431527.388 5693772.886),2021-01-01 00:00:00+00:00,2021-02-01 00:00:00+00:00
4,https://geodaten.nrw.de/id/inspire-lu-ts/exist...,,,459,4,266,,,0.016331,MULTIPOINT (431535.193 5693614.690),2021-01-01 00:00:00+00:00,2021-02-01 00:00:00+00:00
...,...,...,...,...,...,...,...,...,...,...,...,...
593083,https://geodaten.nrw.de/id/inspire-lu-ts/exist...,623.0,780.0,115,49419,732423,596.0,3531.0,0.122704,MULTIPOINT (472357.075 5696612.529),2021-12-01 00:00:00+00:00,2022-01-01 00:00:00+00:00
593084,https://geodaten.nrw.de/id/inspire-lu-ts/exist...,723.0,537.0,115,49420,732424,466.0,1102.0,0.093200,MULTIPOINT (472016.875 5697690.039),2021-12-01 00:00:00+00:00,2022-01-01 00:00:00+00:00
593085,https://geodaten.nrw.de/id/inspire-lu-ts/exist...,463.0,530.0,115,49421,732425,354.0,2438.0,0.132107,MULTIPOINT (471981.413 5696219.338),2021-12-01 00:00:00+00:00,2022-01-01 00:00:00+00:00
593086,https://geodaten.nrw.de/id/inspire-lu-ts/exist...,609.0,489.0,411,49422,732426,473.0,1214.0,0.088917,MULTIPOINT (471704.064 5697043.769),2021-12-01 00:00:00+00:00,2022-01-01 00:00:00+00:00


Check the quota used for querying the workflow:

In [18]:
used_quota_rvjoin = ge.get_quota(user_id)['used'] - used_quota_start - used_quota_download - used_quota_agg
used_quota_rvjoin

12355