# LakeCREST ESA CCI Lakes v1.1
## 1. Preprocessing
In this script we will use [**xarray**](https://docs.xarray.dev/en/latest/index.html) and [**dask**](https://dask.org/) to load, prepare and explore the [**ESA CCI Lakes v1.1**](https://catalogue.ceda.ac.uk/uuid/ef1627f523764eae8bbb6b81bf1f7a0a) dataset. Using xarray and dask, two free and open-source libraries, allows us to fully utilize the computing power of our machine in parallelized workflows scaled to the system at hand.

### 1.1 Importing modules
First, we import the necessary Python modules for the preprocessing steps.

In [None]:
import pathlib
import xarray as xr
import numpy as np
import time
import pandas as pd
from dask.distributed import Client, LocalCluster

### 1.2 User inputs
Here we define the desired lake and data paths. More information on available variables and a full list of all lake IDs can be found at the end of the [**D4.3: Product User Guide (PUG)**](https://climate.esa.int/media/documents/CCI-LAKES-0029-PUG_v1.1_signed_CA.pdf).

In [None]:
# Load in table of lakes with lake coordinates and the CCI_lakeid
# based on D4.3 Product User Guide (PUG) - Annex B: List of lakes
df = pd.read_csv('lakelist_v1.1.csv', delimiter=';')

# Define lake
lakename = 'Garda'

# Find lakeid for specified lake
lakeid = df.loc[df['name'] == lakename]['cci_lakeid'].values[0]
print(f'The lakeid for Lake {lakename} is {lakeid}.')

# Get a preview of table
df.head(5)
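
The lookup above fails with an `IndexError` if the lake name is not in the table. The following sketch shows the same lookup with a clearer error message, using only the standard library. The table excerpt (names, IDs, coordinates) and the helper `find_lakeid` are made up for illustration; only the column names are taken from the notebook.

```python
import csv
import io

# Hypothetical excerpt with the same columns as the lake list (all values made up)
table = io.StringIO(
    'name;cci_lakeid;latitude;longitude\n'
    'Alpha;101;45.0;10.0\n'
    'Beta;102;46.5;8.5\n'
)
rows = list(csv.DictReader(table, delimiter=';'))

def find_lakeid(rows, lakename):
    """Hypothetical helper: return the first matching lake ID or raise a clear error."""
    matches = [int(r['cci_lakeid']) for r in rows if r['name'] == lakename]
    if not matches:
        raise KeyError(f'No lake named {lakename!r} in the table')
    return matches[0]

print(find_lakeid(rows, 'Beta'))
```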

We can use [**hvplot**](https://hvplot.holoviz.org/) to plot the pandas table on a map overview.

In [None]:
import hvplot.pandas

# Plot pandas table to map-view with hvplot
df.hvplot.points(x='longitude', y='latitude', 
                 color='red', alpha=0.5,
                 geo=True, tiles='OSM', 
                 hover_cols='all',
                 xlabel='Longitude', ylabel='Latitude')

The paths are initialized as Python raw strings to avoid errors due to escape sequences and then converted to pathlib.Path objects. The pathlib library lets us handle paths consistently across operating systems and gives us access to many powerful functions.

In [None]:
# Define data directory path and filenames
path_data = r'D:\lakecrest\esa_cci_lp\v1.1' # CCI data folder path
path_mask = r'D:\lakecrest\esa_cci_lp\mask\ESACCI-LAKES_mask_v1.nc' # mask path
path_dask = r'C:\Users\Micha\Desktop\dask' # Temporary dask workerspace

# Get filepaths and convert to pathlib.Path
path_data = pathlib.Path(path_data)
path_mask = pathlib.Path(path_mask)
path_dask = pathlib.Path(path_dask)
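
To illustrate why raw strings matter here: in a normal string literal, the `\v` in `...\v1.1` would be read as a vertical-tab escape. The sketch below uses `pathlib.PureWindowsPath` so that it behaves the same on any operating system; the joined mask path simply mirrors the folder layout used above.

```python
from pathlib import PureWindowsPath

# Raw string: backslashes are kept literally, so '\v' is not turned into a vertical tab
raw = r'D:\lakecrest\esa_cci_lp\v1.1'
escaped = 'D:\\lakecrest\\esa_cci_lp\\v1.1'   # same path with escaped backslashes

path = PureWindowsPath(raw)

# Path objects can be joined with '/' regardless of the separator used by the OS
mask_path = path.parent / 'mask' / 'ESACCI-LAKES_mask_v1.nc'

print(raw == escaped)
print(path.name)
print(mask_path)
```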

### 1.3 Dask initialization
We initialize a local Dask client with our specified number of workers, threads per worker and memory limit (per worker). Displaying the client shows its address, so we can access the Dask dashboard in the browser. A good starting point for the settings is to set *n_workers* to the number of physical cores and *threads_per_worker* to the number of virtual cores divided by *n_workers*.

In [None]:
# Define according to system specs
n_workers = 8              # (e.g. number of physical cores)
threads_per_worker = 1     # (e.g. virtual cores / n_workers)
memory_limit = '4GB'       # (e.g. max memory / n_workers)

local_directory = path_dask
cluster = LocalCluster(n_workers=n_workers, 
                       threads_per_worker=threads_per_worker, 
                       memory_limit=memory_limit,
                       local_directory=local_directory
                      )
client = Client(address=cluster.scheduler_address)
client
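
The rule of thumb above can be written down as a small helper. This is only a sketch: `suggest_cluster_settings` and the `reserve_gb` headroom are assumptions for illustration, not part of Dask, and the machine specs in the example are made up so that the result matches the settings used above.

```python
def suggest_cluster_settings(physical_cores, logical_cores, total_memory_gb, reserve_gb=4):
    """Hypothetical helper: derive LocalCluster settings from machine specs."""
    n_workers = physical_cores
    # Spread the logical cores evenly over the workers (at least one thread each)
    threads_per_worker = max(1, logical_cores // n_workers)
    # Leave some memory for the OS and the notebook, split the rest between workers
    usable_gb = max(1, total_memory_gb - reserve_gb)
    memory_limit = f'{usable_gb // n_workers}GB'
    return {'n_workers': n_workers,
            'threads_per_worker': threads_per_worker,
            'memory_limit': memory_limit}

settings = suggest_cluster_settings(physical_cores=8, logical_cores=8, total_memory_gb=36)
print(settings)
```

The resulting dictionary could be unpacked into `LocalCluster(**settings)`, but the values are worth checking against the actual workload before relying on them.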

### 1.4 Specify chunk size
We use xarray to load the large multi-file dataset. xarray allows us to initialize and load the entire dataset by only providing the necessary filepaths. Instead of loading the entire dataset (>350GB) to memory, we can make use of xarray's ability to lazy-load chunks of data. This means that only the necessary subset of each individual .nc file will be loaded into memory at the time it is needed. For this we can either pre-define a chunk size or let xarray automatically define a size.

In [None]:
# Set chunk size based on .nc file dims and a divider
#input_lat_sz = 21600
#input_lon_sz = 43200
#divider=100

#chunk_lat_sz = int(input_lat_sz/divider)
#chunk_lon_sz = int(input_lon_sz/divider)
#chunks={'lat':chunk_lat_sz,
#        'lon':chunk_lon_sz,
#        'time':1
#        }

# Alternatively set chunks to 'auto' (dask decides chunk size)
chunks='auto'
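
To get a feeling for what the manual chunk definition would produce, the arithmetic can be checked separately. The sketch below uses the grid dimensions and divider from the commented-out lines; the 4 bytes per value (float32) is an assumption, and the stored data type of a given variable may differ.

```python
import math

def chunk_plan(lat_size, lon_size, divider, itemsize=4):
    """Return chunk shape, chunk count per time step and chunk size in MiB."""
    chunk_lat = lat_size // divider
    chunk_lon = lon_size // divider
    n_chunks = math.ceil(lat_size / chunk_lat) * math.ceil(lon_size / chunk_lon)
    chunk_mib = chunk_lat * chunk_lon * itemsize / 2**20   # itemsize: bytes per value (assumed)
    return chunk_lat, chunk_lon, n_chunks, chunk_mib

chunk_lat, chunk_lon, n_chunks, chunk_mib = chunk_plan(21600, 43200, divider=100)
print(f'{chunk_lat} x {chunk_lon} cells per chunk, '
      f'{n_chunks} chunks per time step, {chunk_mib:.2f} MiB each')
```

Very small chunks like these mean many tasks per file, which adds scheduling overhead; this is one reason to let dask pick the size with `chunks='auto'`.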

### 1.5 Load an individual .nc file
To test out xarray and get a preview of the ESA CCI Lakes dataset we can load a single file from the dataset. For this we will use the [xarray.open_dataset](https://docs.xarray.dev/en/stable/generated/xarray.open_dataset.html) function. We can get a preview of the loaded data, its attributes and variables in the console view.

In [None]:
# Use pathlib.Path.rglob function to recursively find all .nc files within the data folder
paths_data = list(path_data.rglob('*fv1.1.nc'))

# Get the first filepath from the list of .nc files
path_fn_first = paths_data[0]

# Load the file with xarray
DS_preview = xr.open_dataset(filename_or_obj=path_fn_first,
                             engine='netcdf4',
                             chunks=chunks)
DS_preview

### 1.5 Spatial subsetting and variable selection
Because we are only interested in a specific lake and a subset of the 50+ available variables, we subset the dataset. To do so, we set up a function **preprocess(ds)** that reduces each daily .nc file to the desired variables within a bounding box derived from the coordinates of the desired lake. The preprocess function is run on every daily .nc file when the multi-file dataset is initialized.

In [None]:
# List of the variables we want to load
variables = ['lake_surface_water_temperature',
              'lswt_quality_level',
              'lswt_uncertainty'
              ]

# Load CCI lakes mask
DS_mask = xr.open_dataset(filename_or_obj=path_mask,
                          engine='netcdf4',
                          decode_cf=True,
                          chunks=chunks
                          )

# Get logical True/False lake mask over full globe
mask_full = (DS_mask.CCI_lakeid == lakeid)

# Get lake mask cropped to the ROI (lake ID inside the lake, NaN elsewhere)
mask_roi = DS_mask.CCI_lakeid.where(mask_full, 
                                   drop=True)

# Get bounds coordinates
lat_min = mask_roi.lat[0].values
lat_max = mask_roi.lat[-1].values
lon_min = mask_roi.lon[0].values
lon_max = mask_roi.lon[-1].values

# Subset the lake mask to same coords
#DS_mask_roi = DS_mask.sel(lat=slice(lat_min, lat_max), 
#                          lon=slice(lon_min, lon_max))

def preprocess(ds):
    '''Keeps only the necessary lat/lon slice and necessary variables when opening .nc files'''
    return ds[variables].sel(lat=slice(lat_min, lat_max), 
                             lon=slice(lon_min, lon_max))

print(f'Created preprocess(ds) function to subset with bbox ' \
      f'lat: ({lat_min:0.1f}, {lat_max:0.1f}), ' \
      f'lon: ({lon_min:0.1f}, {lon_max:0.1f}) for Lake {lakename}.')

We can check the lake mask and the computed bounding box based on the mask file on a mapview using [**hvplot**](https://hvplot.holoviz.org/) and [**GeoViews**](https://geoviews.org/). Both objects are based on the [**HoloViews**](https://holoviews.org/) library and can be easily combined.

In [None]:
import hvplot.xarray
import geoviews as gv

# Create map-plot of xarray dataarray using hvplot
hv_map = mask_roi.hvplot(geo=True, tiles='CartoLight', colorbar=False, 
                         xlabel='Longitude', ylabel='Latitude')

# Create boundingbox using geoviews
gv_bbox = gv.Rectangles([(lon_min, lat_min, lon_max, lat_max)]).opts(color='none', line_width=2, line_color='red')

# Combine objects as overlay
hv_map * gv_bbox
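
The bounding-box logic can also be sketched with plain NumPy on a tiny synthetic ID raster. The helper `bbox_from_mask` and all values are hypothetical. Unlike taking the first and last coordinate, it uses the minimum and maximum, so it does not depend on whether the latitude axis is stored in ascending or descending order.

```python
import numpy as np

def bbox_from_mask(mask, lat, lon):
    """Return (lat_min, lat_max, lon_min, lon_max) of the True cells in a 2D mask."""
    rows = np.flatnonzero(mask.any(axis=1))   # latitude indices containing lake cells
    cols = np.flatnonzero(mask.any(axis=0))   # longitude indices containing lake cells
    if rows.size == 0:
        raise ValueError('mask contains no lake cells')
    lats, lons = lat[rows], lon[cols]
    return lats.min(), lats.max(), lons.min(), lons.max()

# Synthetic grid with a descending latitude axis and two "lakes" (IDs 7 and 3)
lat = np.array([46.0, 45.9, 45.8, 45.7, 45.6])
lon = np.array([10.5, 10.6, 10.7, 10.8])
ids = np.array([[0, 0, 0, 0],
                [0, 7, 7, 0],
                [0, 0, 7, 0],
                [0, 0, 0, 3],
                [0, 0, 0, 0]])

bbox = bbox_from_mask(ids == 7, lat, lon)
print(bbox)
```

Note that label-based slicing in xarray follows the order of the coordinate, so on a descending latitude axis the slice would have to run from the maximum to the minimum.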

### 1.6 Load full dataset as xarray.Dataset
Now, we can initialize the full dataset using xarray's [**xarray.open_mfdataset**](https://docs.xarray.dev/en/latest/generated/xarray.open_mfdataset.html) function. We will set ***decode_cf*** to **False** for now and decode the dataset later. During the loading process we can monitor the progress and the task stream of our workers in the Dask dashboard (output from *1.3 Dask initialization*).

The xarray documentation has an extensive [user-guide](https://xarray.pydata.org/en/stable/user-guide/io.html) with explanations and best-practices to load large datasets.

In [None]:
# Setup timer to time the loading process
start_time = time.time()

DS = xr.open_mfdataset(paths=paths_data,
                       combine='by_coords',
                       parallel=True,
                       engine='netcdf4',
                       decode_cf=False,
                       preprocess=preprocess,
                       chunks=chunks
                       )

print(f'Xarray dataset with variables: {variables} initialized after ' \
      f'{(time.time()-start_time):0.1f} seconds')

Once the dataset has been loaded, we can get an overview of the xarray.Dataset object and its variables and attributes.

In [None]:
DS

### 1.7 Apply lake-mask
Next we apply the lake mask to mask cells of possible other lakes in the same bounding box. To make sure that the lake mask has identical cell coordinates we align it to the coordinates of our dataset.

In [None]:
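# Reindex coordinates to make sure that our lake mask cells are aligned to data cells
da_mask = mask_roi.reindex_like(other=DS, method='nearest')

# Convert the ID raster (lake ID inside the lake, NaN elsewhere) to a boolean mask,
# because NaN would otherwise count as True in the where-condition
da_mask = da_mask.notnull()

# Get subset with lakemask and xarray.Dataset.where
DS_clip = DS.where(cond=da_mask, drop=True)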
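
One detail worth checking when a mask is a float raster of IDs with NaN outside the lake, rather than a boolean array: NumPy treats NaN as non-zero and therefore as truthy in a where-condition. The sketch below uses made-up values to show the difference between passing such a raster directly and passing an explicit boolean condition.

```python
import numpy as np

# Hypothetical ID raster: 7 marks the lake, NaN marks everything else
lake_ids = np.array([[np.nan, 7.0],
                     [7.0, np.nan]])
data = np.array([[280.0, 281.0],
                 [282.0, 283.0]])

# NaN is non-zero, so every cell passes the condition and nothing is masked
naive = np.where(lake_ids, data, np.nan)

# An explicit boolean condition masks the cells outside the lake
masked = np.where(~np.isnan(lake_ids), data, np.nan)

print(naive)
print(masked)
```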

### 1.8 Decode the data
xarray will handle the data-decoding of the NetCDF format with the scaling- and offset-attributes found in the loaded files. We can use [xarray.decode_cf](https://docs.xarray.dev/en/latest/generated/xarray.decode_cf.html) to automatically decode the data.

In [None]:
# Decode data
DS_clip = xr.decode_cf(DS_clip)
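
For reference, the scale/offset part of the decoding follows the CF packing convention, `value = raw * scale_factor + add_offset`, with fill values turned into NaN. The NumPy sketch below applies it by hand; `decode_packed` is a hypothetical helper and the scale factor, offset and fill value are illustrative, not the ones stored in the CCI Lakes files.

```python
import numpy as np

def decode_packed(raw, scale_factor, add_offset, fill_value):
    """Apply the CF packing convention: value = raw * scale_factor + add_offset."""
    raw = np.asarray(raw)
    decoded = raw.astype(np.float64) * scale_factor + add_offset
    decoded[raw == fill_value] = np.nan   # fill values become missing data
    return decoded

# Illustrative packed values (int16) with an assumed fill value of -32768
raw = np.array([1200, 1850, -32768], dtype=np.int16)
decoded = decode_packed(raw, scale_factor=0.01, add_offset=273.15, fill_value=-32768)
print(decoded)
```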

### 1.9 Drop days with no data
We can drop days with all-nan values for the variable 'lake_surface_water_temperature' by using the [**xarray.Dataset.dropna**](https://docs.xarray.dev/en/latest/generated/xarray.Dataset.dropna.html) function with the parameter ***how='all'***.

In [None]:
DS_clip_dropna = DS_clip.dropna(dim='time', 
                                how='all',
                                subset=['lake_surface_water_temperature'])

dropcount = DS_clip.time.size - DS_clip_dropna.time.size
print(f'Dropped {dropcount} days with all-nan cells (from {DS_clip.time.size} to {DS_clip_dropna.time.size})')

DS_clip_dropna

### 1.10 Mask LSWT with quality flag


Using the variable 'lswt_quality_level' we can mask out cells that don't meet the required quality level. The levels range from 0 to 5:

- 0: unprocessed 
- 1: bad 
- 2: marginal 
- 3: intermediate
- 4: good 
- 5: best

In the [D4.1: Product Validation and Intercomparison Report](https://climate.esa.int/media/documents/CCI-LAKES-0031-PVIR_v1.4.pdf), the authors of the ESA CCI Lakes product recommend limiting the use of LSWT for lake-climate applications to quality levels 4-5 (good and best).

In [None]:
# Use quality flag variable 'lswt_quality_level' to mask out all cells except good and best
qmask = (DS_clip_dropna['lswt_quality_level']>3)
DS_clip_dropna['lake_surface_water_temperature'] = DS_clip_dropna['lake_surface_water_temperature'].where(cond=qmask)
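
As a minimal illustration of the quality filter, the same rule can be applied to a handful of made-up values with NumPy. The name `min_quality` is an assumption for this sketch; `quality >= 4` is equivalent to the `> 3` comparison used above for integer levels.

```python
import numpy as np

# Made-up temperatures (K) with one value per quality level 0..5
lswt = np.array([284.2, 285.0, 283.7, 286.1, 284.9, 285.5])
quality = np.array([0, 1, 2, 3, 4, 5])

min_quality = 4   # keep "good" (4) and "best" (5) only
lswt_good = np.where(quality >= min_quality, lswt, np.nan)

print(lswt_good)
```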

### 1.11 Export subset (optional)
If necessary, the subset with the masked data can now be exported using [**xarray.Dataset.to_netcdf**](https://xarray.pydata.org/en/stable/generated/xarray.Dataset.to_netcdf.html). We can get the encoding settings (e.g. compression, fillvalue, scale-factor, offset) for the NetCDF export from the previously loaded dataset. Instead of exporting all data we can also export a temporal subset by slicing the data in time. Exporting the data to NetCDF format is slow, since the data is first loaded, decompressed and decoded and then encoded and compressed again for storage.

In [None]:
path_dst = rf'D:\lakecrest\output\ESACCI-LAKES-L3S-LK_PRODUCTS-MERGED-{lakename}-fv2.0.nc' # raw f-string keeps the backslashes literal
path_dst = pathlib.Path(path_dst)

# Get encoding settings from DS.encoding and DS.attrs to ensure the encoding setting are identical
DS_enc = {}
encoding_enc = ['zlib', 'shuffle', 'complevel', 
              'fletcher32', 'contiguous', 'dtype']
attr_enc = ['_FillValue', 'scale_factor', 'add_offset']
for var in variables:
    DS_enc_encoding = DS.get(var).encoding
    DS_enc_fromenc = {k:v for k, v in DS_enc_encoding.items() \
                      if k in encoding_enc}
    DS_enc_attrs = DS.get(var).attrs
    DS_enc_fromattrs = {k:v for k, v in DS_enc_attrs.items() \
                        if (k in attr_enc and hasattr(DS.get(var), k))}
    DS_enc[var] = {**DS_enc_fromenc, **DS_enc_fromattrs}
print('DS encoding:', DS_enc)

# Set starting time for timer
start_time = time.time() 

# We can also slice the data in time before exporting
#timeslice = slice('2019-01-01', '2020-01-01')
#DS_clip_dropna = DS_clip_dropna.sel(time=timeslice)

DS_clip_dropna.to_netcdf(path=path_dst,
                         mode='w',
                         engine='netcdf4',
                         encoding=DS_enc
                        )

print(f'ESA CCI Lakes v1.1 subset of Lake {lakename} ' \
      f'exported after {(time.time()-start_time):0.1f} seconds.')

Now we can load the entire lake data from a single .nc file instead of the 9000+ daily files of the dataset. This improves the efficiency of future computations.

In [None]:
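# Reassign dataset variable with our single-file based subset
# (combine/parallel are open_mfdataset options and are not accepted by open_dataset;
# decode_cf=True applies scale/offset and decodes the time axis used below)
DS_clip_dropna = xr.open_dataset(filename_or_obj=path_dst,
                                 engine='netcdf4',
                                 decode_cf=True,
                                 chunks=chunks
                                 )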

## 2. Explore dataset
### 2.1 LSWT timeseries animation
Let's explore the LSWT time series. To do this, we first extract the variable 'lake_surface_water_temperature' from the [xarray.Dataset](https://docs.xarray.dev/en/latest/generated/xarray.Dataset.html) as an [xarray.DataArray](https://docs.xarray.dev/en/latest/generated/xarray.DataArray.html) and convert the values from K to °C.

In [None]:
lswt = DS_clip_dropna['lake_surface_water_temperature']-273.15

Now we can create an interactive animation that displays the temperature data in a cartographic reference system using [Holoviews](https://holoviews.org/) and [Cartopy](https://scitools.org.uk/cartopy/docs/latest/).

In [None]:
import hvplot.xarray
import cartopy.crs as ccrs

# Define CRS
crs = ccrs.PlateCarree(central_longitude=0, globe=None)

# Plot time-series data
lswt.hvplot(
    geo=True, tiles='CartoLight',
    groupby="time",  # adds a widget for time
    clim=(0, 25),  # sets colormap limits
    crs=crs,
    cmap='jet',
    widget_type="scrubber",
    widget_location="bottom",
    #width=300,
    xlabel='Longitude', ylabel='Latitude',
    #title=f'ESA CCI Lakes v1.1\nLake {lakename} LSWT animation',
    clabel='Lake surface water temperature (°C)'
)

### 2.2 Mean LSWT plot
We can use [**xarray.DataArray.mean**](https://docs.xarray.dev/en/latest/generated/xarray.DataArray.mean.html) with the parameters ***dim*** and ***skipna*** to plot the spatially aggregated daily LSWT mean over time.

In [None]:
import hvplot.xarray

# Slice data with timeslice
timeslice = slice('2006-01-01', '2007-12-31')
lswt_slice = lswt.sel(time=timeslice)

# Compute daily means
lswt_slice_mean = lswt_slice.mean(dim=["lat", "lon"], skipna=True).load()

# Create a line plot using hvplot
lineplot = lswt_slice_mean.hvplot.line(ylabel='LSWT (°C)', 
                                       title=f'ESA CCI Lakes v1.1\nLake {lakename} LSWT mean')

# Create a scatter plot using hvplot
scatterplot = lswt_slice_mean.hvplot.scatter(c='k')

# Overlay plots to create output
lineplot * scatterplot

As we can see, there are days with outliers. We can set a threshold for a minimal lake coverage and replot the data. For this, we count the unmasked cells in our lake mask.

In [None]:
# Set up threshold for minimal lake coverage
coverage_tresh = 20 # threshold in %

# Calculate cell threshold to keep daily mean
lakecells = np.count_nonzero(~np.isnan(mask_roi)) # count of lakecells
nancells_tresh = int(lakecells*((coverage_tresh)/100)) # minimal number of valid cells
print(f'Dropping days with fewer than {nancells_tresh} valid cells ({coverage_tresh}% of Lake {lakename}).')

# Use xarray.DataArray.dropna with threshold to get rid of days with low-coverage
lswt_slice_tresh = lswt_slice.dropna(dim='time', thresh=nancells_tresh)

# Compute daily means
lswt_slice_tresh_mean = lswt_slice_tresh.mean(dim=["lat", "lon"], skipna=True)

# Create plot
lineplot = lswt_slice_tresh_mean.hvplot.line(ylabel='LSWT (°C)', 
                                       title=f'ESA CCI Lakes v1.1\nLake {lakename} LSWT mean (min. {coverage_tresh}% coverage)')

scatterplot = lswt_slice_tresh_mean.hvplot.scatter(c='k')

print(f'Dropped {lswt_slice_mean.size - lswt_slice_tresh_mean.size} datapoints.')

lineplot * scatterplot
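
The coverage filter can be reproduced on a small synthetic stack to see which days survive. This is a NumPy sketch with made-up values; it mirrors the logic of counting valid cells per day and keeping days with at least the required number.

```python
import numpy as np

# Synthetic stack: 4 days on a 2 x 4 grid, NaN = no valid observation
lswt = np.full((4, 2, 4), np.nan)
lswt[0] = 10.0            # day 0: all 8 cells valid
lswt[1, 0, :2] = 11.0     # day 1: 2 cells valid
lswt[2, 0, 0] = 30.0      # day 2: a single valid cell (potential outlier)
                          # day 3: no valid cells

lake_cells = 8
coverage_thresh = 25                                    # required coverage in %
min_cells = int(lake_cells * coverage_thresh / 100)     # minimal number of valid cells

valid_per_day = np.isfinite(lswt).sum(axis=(1, 2))      # valid cells per day
keep = valid_per_day >= min_cells                       # days with enough coverage
daily_mean = np.nanmean(lswt[keep], axis=(1, 2))        # mean over the remaining days

print(valid_per_day, keep, daily_mean)
```

Day 2, whose mean would rest on a single cell, is removed, which is the kind of outlier the threshold is meant to suppress.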

### 2.3 Visualize data availability
As the mean plots show, data availability changes over time, in line with the mission timelines of the data sources. We can use the time dimension to calculate, per cell and year, the mean time (in days) between valid observations, referred to here as the coverage frequency.

In [None]:
# Set up a function to compute the mean time delay between valid (non-NaN) values along the time dimension
def getCovFreq(darr):
    """Takes xarray.DataArray and returns dataarray with mean timedelay (d) between valid values"""
    lat_sz, lon_sz = darr.lat.size, darr.lon.size # get lat and lon sizes
    mask = darr.to_masked_array() # create a np.masked_array from xr.dataarray

    t = darr.time.values # get time dimension as array
    t_steps = len(darr.time)

    arr_t3d = np.broadcast_to(t[:, np.newaxis, np.newaxis], (t_steps, lat_sz, lon_sz)) # expand time dimension to match dataarray
    arr_t3d_masked = np.ma.masked_where(mask.mask, arr_t3d).filled(np.datetime64('NaT'))  # mask it and set masked cells to NaT

    def mean_cfreq(x):
        """Takes np.datetime64 array and returns np.float32 array with temp-diff (d), dropping nat-values"""
        arr = x[~np.isnat(x)] # drop nats
        diff_td64 = np.diff(arr)
        diff_d = (diff_td64 / np.timedelta64(1,'D')).astype(np.float32)
        return np.mean(diff_d) # return mean of differences

    # Apply mean_cfreq along time-dimension
    cfreq = np.apply_along_axis(mean_cfreq, 0, arr_t3d_masked)
    xr_cfreq = xr.DataArray(data=cfreq, coords=(darr.lat.values, darr.lon.values), dims=('lat', 'lon'), name='mean_cfreq')
    return(xr_cfreq)

# Slice data with timeslice
timeslice = slice('2005-01-01', '2008-12-31')
lswt_slice = lswt.sel(time=timeslice)

# Group LSWT by years and compute the yearly mean cell coverage frequency
lswt_cfrwq = lswt_slice.groupby("time.year").map(getCovFreq)
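
To make the per-cell computation concrete, the following sketch computes the mean gap between valid observations for a single cell. The helper `mean_revisit_days` and the ten-day series are made up for illustration; the function above applies the same idea to every cell along the time axis.

```python
import numpy as np

def mean_revisit_days(times, values):
    """Mean gap in days between consecutive valid (non-NaN) observations of one cell."""
    valid_times = times[~np.isnan(values)]
    if valid_times.size < 2:
        return np.nan                                   # no gap defined for fewer than two observations
    gaps = np.diff(valid_times) / np.timedelta64(1, 'D')  # timedelta64 -> float days
    return float(gaps.mean())

# Ten daily time steps with valid observations on days 1, 4, 6 and 10
times = np.arange('2006-01-01', '2006-01-11', dtype='datetime64[D]')
values = np.array([5.1, np.nan, np.nan, 5.3, np.nan, 5.2, np.nan, np.nan, np.nan, 5.0])

print(mean_revisit_days(times, values))
```

The gaps here are 3, 2 and 4 days, so the mean is 3 days. Lower values therefore mean more frequent coverage.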

In [None]:
# Create a violin plot of the distribution of yearly mean coverage frequency
lswt_cfrwq.hvplot.violin(y='mean_cfreq', by='year', 
                         title=f'ESA CCI Lakes v1.1\nLake {lakename} - Coverage frequency', 
                         ylabel='Mean yearly coverage frequency (d)', 
                         grid=True, 
                         ylim=[0,60],
                        )

In [None]:
# Plot yearly mean coverage frequency maps 
lswt_cfrwq.hvplot(geo=True, tiles='CartoLight',
                           groupby="year",  # adds a widget for time
                           clim=(0, 25),  # sets colormap limits
                           crs=crs,
                           cmap='jet',
                           widget_type="scrubber",
                           widget_location="bottom",
                           #width=300,
                           xlabel='Longitude', ylabel='Latitude',
                           #title=f'ESA CCI Lakes v1.1\nLake {lakename} LSWT coverage frequency',
                           clabel='mean coverage frequency (d)'
)

### 2.4 Linear trend analysis
We can compute the linear trend for each lake cell using the [xarray.DataArray.polyfit](https://docs.xarray.dev/en/stable/generated/xarray.DataArray.polyfit.html) function.

In [None]:
polyfit_results = lswt.polyfit(dim='time', deg=1, skipna=True)

ns_per_year = 3.154e+16 # approximate number of nanoseconds per year
ns_per_decade = ns_per_year*10
trend_coef = polyfit_results.polyfit_coefficients.sel(degree=1)
# Convert °C/ns to °C/10y and keep the result in memory
# (persist must be called on the array, not on the float constant)
trend_coef_perDec = (trend_coef*ns_per_decade).persist()
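
The unit conversion can be checked on a synthetic series with a known trend. This sketch uses `np.polyfit` on time expressed in nanoseconds, which is the unit in which the slope comes out when fitting against a datetime axis; the series, the 0.4 °C per decade trend and the constant names are made up for the check.

```python
import numpy as np

NS_PER_YEAR = 365.25 * 24 * 60 * 60 * 1e9   # Julian year in nanoseconds (~3.156e16)
NS_PER_DECADE = 10 * NS_PER_YEAR

# Synthetic daily series over about 20 years, warming by exactly 0.4 °C per decade
t_ns = np.arange(0, 20 * 365) * 86400e9     # time in nanoseconds since the start
lswt = 12.0 + 0.4 * t_ns / NS_PER_DECADE

# Degree-1 fit: the slope is in °C per nanosecond
slope_per_ns, intercept = np.polyfit(t_ns, lswt, deg=1)
trend_per_decade = slope_per_ns * NS_PER_DECADE

print(f'{trend_per_decade:.3f} °C/10y')
```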

Plot the linear coefficients on a map.

In [None]:
import math

vmax = np.nanpercentile(trend_coef_perDec, 95)
vmax = math.ceil(vmax*10)/10

trend_coef_perDec.hvplot(
    clabel='trend (°C/10y)', 
    label=f'Lake {lakename}, Linear trend',
    geo=True, 
    tiles='StamenTerrainRetina', # plot backgroundmap
    #features={'rivers':'10m'},
    cmap='bwr',
    clim=(-vmax, vmax),
    crs=crs,
    #width=300
)

Calculate the mean trend (°C/10y) over the entire lake by averaging the per-cell trends.

In [None]:
np.nanmean(trend_coef_perDec)