# Masking data-cubes using geometry objects

In [None]:
import matplotlib.pyplot as plt

from earthkit import transforms as ekt
from earthkit import data as ekd

from earthkit.data.testing import earthkit_remote_test_data_file

## Load some test data

All `earthkit-transforms` methods can be called with `earthkit-data` objects (Readers and Wrappers) or with
pre-loaded `xarray` or `geopandas` objects.

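In this example we will use hourly ERA5 2m temperature data on a 0.5°x0.5° spatial grid for the year 2015 as
our physical data, and the NUTS geometries, which are stored in a geojson file. By default the cell below loads
the smaller `era5_temperature_europe_20150101.grib` sample; the full-year file is commented out because it is large.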

First we lazily load the ERA5 data and NUTS geometries from our test-data repository.

Note that the data is only downloaded when we use it, e.g. at the `.to_xarray` line. The download is also
cached, so the next time you run this cell you will not need to re-download the file (unless it has been a
very long time since you last ran the code; see the `earthkit-data` tutorials for more details on cache
management).

In [None]:
# Get some demonstration ERA5 data. This could be any URL or path to an ERA5 GRIB or netCDF file.
# remote_era5_file = earthkit_remote_test_data_file("test-data", "era5_temperature_europe_2015.grib") # Large file
remote_era5_file = earthkit_remote_test_data_file("test-data", "era5_temperature_europe_20150101.grib")
era5_data = ekd.from_source("url", remote_era5_file)

# Open as an xarray dataset, renaming the 2m temperature variable to something more manageable
era5_xr = era5_data.to_xarray(time_dim_mode="valid_time").rename({"2t": "t2m"})
era5_xr

In [None]:
# Use some demonstration polygons. This could be any URL or path to a geojson file.
remote_nuts_url = earthkit_remote_test_data_file("test-data", "NUTS_RG_60M_2021_4326_LEVL_0.geojson")
nuts_data = ekd.from_source("url", remote_nuts_url)

nuts_data.to_pandas()[:5]

## Mask dataarray with geodataframe

`spatial.mask` with `union_geometries=True` applies all the features in the geometry object (`nuts_data`) to
the data object (`era5_xr`). It returns an xarray object of the same shape and type as the input xarray
object, with all points outside the geometry masked.
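
To make the idea of masking concrete, here is a minimal pure-Python sketch of the concept: grid cells whose centre falls outside a polygon are replaced with NaN, and the grid keeps its shape. This is an illustration only and not how `earthkit-transforms` implements masking; the helper names (`point_in_polygon`, `mask_grid`), the 4x4 grid, and the square polygon are all hypothetical.

```python
import math


def point_in_polygon(x, y, polygon):
    """Ray-casting test: True if (x, y) lies inside the polygon.

    `polygon` is a list of (x, y) vertices; points exactly on an edge
    are not treated specially in this sketch.
    """
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through y can be crossed
        if (y1 > y) != (y2 > y):
            # x-coordinate at which the edge crosses that horizontal line
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def mask_grid(values, lats, lons, polygon):
    """Return a grid of the same shape with cells outside the polygon set to NaN."""
    return [
        [v if point_in_polygon(lon, lat, polygon) else math.nan for v, lon in zip(row, lons)]
        for row, lat in zip(values, lats)
    ]


# Hypothetical 4x4 grid of temperatures (K) and a square "country"
lats = [3.0, 2.0, 1.0, 0.0]
lons = [0.0, 1.0, 2.0, 3.0]
values = [[280.0 + i + j for j in range(4)] for i in range(4)]
square = [(0.5, 0.5), (2.5, 0.5), (2.5, 2.5), (0.5, 2.5)]

masked = mask_grid(values, lats, lons, square)
```

Only the four cells inside the square keep their values; everything else becomes NaN, which is why reductions such as `.mean()` on the masked data ignore the points outside the geometry.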

In [None]:
single_masked_data = ekt.spatial.mask(era5_xr, nuts_data, union_geometries=True)
single_masked_data

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(10,4))
era5_xr.t2m.mean(dim='valid_time').plot(ax=axes[0])
axes[0].set_title('Original data')
# Single masked data
single_masked_data.t2m.mean(dim='valid_time').plot(ax=axes[1])
axes[1].set_title('Masked data')

Without `union_geometries=True`, `spatial.mask` applies each feature in the geometry object (`nuts_data`) to
the data object (`era5_xr`) separately. It returns an xarray object with an additional dimension, and
coordinate variable, corresponding to the features in the geometry object.
By default this is the index of the input geodataframe; in this example the index is just an integer
count, so it takes the default name `index`.
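
The extra dimension can be pictured as one masked copy of the grid per feature, stacked under a label for that feature. The sketch below shows this with plain Python structures; it is a conceptual illustration, not the library's implementation. The feature identifiers (`"AA"`, `"BB"`), the bounding boxes standing in for geometries, and the helper `mask_with_box` are hypothetical.

```python
import math

# Hypothetical 2x4 grid
lats = [1.0, 0.0]
lons = [0.0, 1.0, 2.0, 3.0]
values = [[1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0]]

# Two hypothetical features, simplified to boxes: (lon_min, lon_max, lat_min, lat_max)
features = {
    "AA": (-0.5, 1.5, -0.5, 1.5),
    "BB": (1.5, 3.5, -0.5, 1.5),
}


def mask_with_box(values, lats, lons, box):
    """Keep values whose (lon, lat) falls inside the box; set the rest to NaN."""
    lon_min, lon_max, lat_min, lat_max = box
    return [
        [
            v if (lon_min <= lon <= lon_max and lat_min <= lat <= lat_max) else math.nan
            for v, lon in zip(row, lons)
        ]
        for row, lat in zip(values, lats)
    ]


# One masked grid per feature: the dictionary keys play the role of the
# new dimension's coordinate values
stacked = {fid: mask_with_box(values, lats, lons, box) for fid, box in features.items()}
```

Selecting `stacked["AA"]` here is the analogue of selecting one label along the new dimension of the masked xarray object.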

In [None]:
masked_data = ekt.spatial.mask(era5_xr, nuts_data)
masked_data

It is possible to specify a column in the geodataframe to use for the new dimension, for example the NUTS
`FID` (feature ID) column, which contains the two-letter identifier code for each feature:

In [None]:
masked_data = ekt.spatial.mask(era5_xr, nuts_data, mask_dim="FID")
masked_data

Here we demonstrate what we have done by plotting the masked objects we have produced:

In [None]:
fig, axes = plt.subplots(1, 3, figsize=(15,3))
era5_xr.t2m.mean(dim='valid_time').plot(ax=axes[0])
axes[0].set_title('Original data')
masked_data.t2m.sel(FID='DE').mean(dim='valid_time').plot(ax=axes[1])
axes[1].set_title('Masked for Germany')
germany_data = masked_data.sel(FID='DE').dropna(dim='latitude', how='all').dropna(dim='longitude', how='all')
germany_data.t2m.mean(dim='valid_time').plot(ax=axes[2])
axes[2].set_title('Masked Germany Zoom')
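
The "zoom" in the last panel comes from the two `dropna(..., how='all')` calls, which discard latitude rows and longitude columns that contain only NaN after masking. A minimal pure-Python sketch of that trimming step, using a hypothetical helper `drop_all_nan` and a made-up 4x4 grid, might look like this:

```python
import math


def drop_all_nan(grid):
    """Remove rows, then columns, that contain only NaN values."""
    rows = [row for row in grid if not all(math.isnan(v) for v in row)]
    if not rows:
        return []
    keep_cols = [
        j for j in range(len(rows[0])) if not all(math.isnan(row[j]) for row in rows)
    ]
    return [[row[j] for j in keep_cols] for row in rows]


nan = math.nan
# Hypothetical masked grid: only a small region holds data
grid = [
    [nan, nan, nan, nan],
    [nan, 1.0, nan, nan],
    [nan, 2.0, 3.0, nan],
    [nan, nan, nan, nan],
]

trimmed = drop_all_nan(grid)
```

Rows and columns that still hold at least one value are kept whole, so NaN cells inside the bounding box of the feature remain, just as in the plotted Germany panel.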