# LakeCREST ESA CCI Lakes v2.0
## 1. Preprocessing
In this script we will use [**xarray**](https://docs.xarray.dev/en/latest/index.html) and [**dask**](https://dask.org/) to load, prepare and explore the [**ESA CCI Lakes v2.0**](https://climate.esa.int/en/projects/lakes/data/) dataset. Using dask, a free and open-source library for parallel computing, allows us to fully utilize the computing power of our machine and scale our workflow easily.

### 1.1 Importing modules
First, we import the Python modules needed for the preprocessing steps.

In [None]:
import pathlib
import xarray as xr
import numpy as np
import time
from dask.distributed import Client, LocalCluster

import colorama # for colored text outputs
from colorama import Fore, Back, Style

### 1.2 User inputs
Here we define the desired lake, variables and paths. More information on available variables and a full list of all lake IDs can be found at the end of the [**D4.3: Product User Guide (PUG)**](https://climate.esa.int/media/documents/CCI-LAKES-0029-PUG_v1.1_signed_CA.pdf) (currently only for v1.1).

In [None]:
lakes = {
    'Michigan':6,
    'Superior':2,
    'Erie':12,
    'Huron':5,
    'Ontario':15,
    'Kariba':35,
    'Garda':505,
    'Baikal':8
    }
lakename = 'Erie'
lakeid = lakes.get(lakename)

In [None]:
variables = ['lake_surface_water_temperature',
             'lswt_quality_level',
             'lswt_uncertainty']

The paths are defined as Python raw strings to avoid errors caused by escape sequences and are then converted to pathlib.Path objects. The pathlib library lets us handle paths consistently across operating systems and gives us access to many convenient path functions.

In [None]:
# Define data directory path and filenames
path_data = r'D:\lakecrest\esa_cci_lp\v1.1' # CCI data folder path
path_mask = r'D:\lakecrest\esa_cci_lp\mask\ESACCI-LAKES_mask_v1.nc' # mask path
path_dask = r'C:\Users\Micha\Desktop\dask' # Temporary dask worker space

# Get filepaths and convert to pathlib.Path
path_data = pathlib.Path(path_data)
path_mask = pathlib.Path(path_mask)
path_dask = pathlib.Path(path_dask)
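
As a small illustration of why the raw-string prefix matters for these paths, the sketch below compares a normal and a raw string literal and then builds a path with `pathlib.PureWindowsPath`. `PureWindowsPath` is used here only so the example behaves the same on any operating system (the notebook itself uses `pathlib.Path`), and the file name `example.nc` is hypothetical.

```python
import pathlib

# In a normal string literal '\v' is the vertical-tab escape, so the
# backslash and the 'v' collapse into a single control character.
normal_len = len('\v1.1')
raw_len = len(r'\v1.1')      # the raw string keeps both characters
print(normal_len, raw_len)

# PureWindowsPath parses Windows-style paths on any operating system
data_dir = pathlib.PureWindowsPath(r'D:\lakecrest\esa_cci_lp\v1.1')
example_file = data_dir / 'example.nc'   # hypothetical file name

print(data_dir.parts)
print(example_file.name, example_file.suffix)
print(example_file.parent == data_dir)
```

Without the `r` prefix, the folder name `v1.1` in the data path would silently lose its leading `\v`, which is the kind of error the raw strings avoid.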

### 1.3 Dask initialization
We initialize a local Dask client with a specified number of workers, threads per worker and memory limit (per worker). Displaying the client shows its address, so we can access it through its web interface. A good starting point is to set *n_workers* to the number of physical cores and *threads_per_worker* to the number of virtual cores divided by *n_workers*.

In [None]:
# Define according to system specs
n_workers = 8              # (e.g. number of physical cores)
threads_per_worker = 1     # (e.g. virtual cores / n_workers)
memory_limit = '4GB'       # (e.g. max memory / n_workers)

local_directory = path_dask
cluster = LocalCluster(n_workers=n_workers, 
                       threads_per_worker=threads_per_worker, 
                       memory_limit=memory_limit,
                       local_directory=local_directory
                      )
client = Client(address=cluster.scheduler_address)
client
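
The rule of thumb described above can be sketched as a small helper. The function name `suggest_cluster_settings` and the example machine (8 physical cores, 16 virtual cores, 32 GB of memory) are assumptions made for illustration only; suitable values depend on the actual system and workload.

```python
def suggest_cluster_settings(physical_cores, logical_cores, total_memory_gb):
    """Rule-of-thumb settings for a local Dask cluster (hypothetical helper)."""
    n_workers = physical_cores                               # one worker per physical core
    threads_per_worker = max(1, logical_cores // n_workers)  # virtual cores / n_workers
    memory_limit = f'{total_memory_gb // n_workers}GB'       # max memory / n_workers
    return {'n_workers': n_workers,
            'threads_per_worker': threads_per_worker,
            'memory_limit': memory_limit}

# Assumed example machine: 8 physical cores, 16 virtual cores, 32 GB RAM
settings = suggest_cluster_settings(physical_cores=8,
                                    logical_cores=16,
                                    total_memory_gb=32)
print(settings)
```

The resulting dictionary uses the same keyword names as the `LocalCluster` call above, so it could be passed on as `LocalCluster(**settings, local_directory=path_dask)`.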

### 1.4 Specify chunk size
We use xarray to load the large multi-file dataset. xarray can initialize the entire dataset from just the file paths. Instead of loading the entire dataset (>350 GB) into memory, we make use of xarray's ability to lazily load chunks of data, so that only the required subset of each individual .nc file is read into memory when it is needed. We can either define a chunk size ourselves or let xarray choose one automatically. Small chunk sizes are preferred because individual lakes are very small compared to the global dataset.

In [None]:
# Set chunk size based on .nc file dims and a divider
#input_lat_sz = 21600
#input_lon_sz = 43200
#divider=100

#chunk_lat_sz = int(input_lat_sz/divider)
#chunk_lon_sz = int(input_lon_sz/divider)
#chunks={'lat':chunk_lat_sz,
#        'lon':chunk_lon_sz,
#        'time':1
#        }

# Alternatively set chunks to 'auto' (dask decides chunk size)
chunks='auto'
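
To get a feeling for what the manual setting would mean, the following sketch works through the arithmetic of the commented-out values above. The assumption of 4 bytes per value (32-bit data) is made for illustration only; the actual storage type depends on the variable.

```python
# Grid dimensions and divider as given in the commented-out settings above
input_lat_sz = 21600
input_lon_sz = 43200
divider = 100

chunk_lat_sz = input_lat_sz // divider    # cells per chunk along lat
chunk_lon_sz = input_lon_sz // divider    # cells per chunk along lon
chunks_per_timestep = divider * divider   # chunks covering one global time step

bytes_per_value = 4                       # assumption: 32-bit values
chunk_bytes = chunk_lat_sz * chunk_lon_sz * 1 * bytes_per_value  # time chunk = 1

print(f'chunk shape (time, lat, lon): (1, {chunk_lat_sz}, {chunk_lon_sz})')
print(f'chunks per time step and variable: {chunks_per_timestep}')
print(f'chunk size: {chunk_bytes / 1024:0.1f} KiB')
```

With these numbers a single chunk holds well under 1 MiB per variable, which keeps reads small when only the area around one lake is needed.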

### 1.5 Spatial subsetting
Next we set up a function **preprocess** that subsets each daily .nc file to the bounding box of the desired lake. The preprocess function is applied to every daily .nc file when the multi-file dataset is initialized.

In [None]:
# Load CCI lakes mask
DS_mask = xr.open_dataset(filename_or_obj=path_mask,
                          engine='netcdf4',
                          decode_cf=False,
                          chunks=chunks
                          )

# Get logical True/False lake mask over full globe
mask_full = (DS_mask.CCI_lakeid == lakeid)

# Get lake-ID mask trimmed to the ROI (cells outside the lake become NaN)
mask_roi = DS_mask.CCI_lakeid.where(mask_full, 
                                    drop=True)

# Count number of lake cells
# (np.count_nonzero on mask_roi would also count the NaN cells)
lake_cells = int(mask_full.sum())

# Get bounds coordinates (in WGS84)
lat_min = mask_roi.lat[0].values
lat_max = mask_roi.lat[-1].values
lon_min = mask_roi.lon[0].values
lon_max = mask_roi.lon[-1].values
print(f'{Fore.RED}Subsetting to masked ROI with bbox ' \
      f'lat: ({lat_min:0.1f}, {lat_max:0.1f}), ' \
      f'lon: ({lon_min:0.1f}, {lon_max:0.1f}) for Lake {lakename}')

# Subset mask to same coords
DS_mask_roi = DS_mask.sel(lat=slice(lat_min, lat_max), 
                          lon=slice(lon_min, lon_max))

def preprocess(ds):
    '''Keeps only the necessary lat/lon slice when opening .nc files'''
    return ds.sel(lat=slice(lat_min, lat_max), 
                  lon=slice(lon_min, lon_max))
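
The bounding-box logic can be tried out on a toy grid before running it on the global mask. In the sketch below, the grid size, coordinates and lake IDs are made up, and `toy_preprocess` is a hypothetical stand-in for the `preprocess` function above.

```python
import numpy as np
import xarray as xr

# Toy 5 x 6 grid of lake IDs (0 = no lake); all values are made up
lat = np.arange(0.0, 5.0)       # 0, 1, 2, 3, 4
lon = np.arange(10.0, 16.0)     # 10, 11, ..., 15
ids = np.zeros((lat.size, lon.size), dtype=int)
ids[1:3, 2:5] = 12              # toy lake with ID 12
ids[1, 2] = 0                   # cut one corner so the lake is not rectangular
ids[4, 0] = 5                   # a second lake elsewhere on the grid
toy_mask = xr.DataArray(ids, coords={'lat': lat, 'lon': lon}, dims=('lat', 'lon'))

is_lake = toy_mask == 12
roi = toy_mask.where(is_lake, drop=True)   # trimmed to the lake's bounding box

bbox = (float(roi.lat[0]), float(roi.lat[-1]),
        float(roi.lon[0]), float(roi.lon[-1]))
n_lake_cells = int(is_lake.sum())          # cells that belong to the lake
n_bbox_cells = roi.size                    # all cells in the box, incl. NaN cells

def toy_preprocess(ds):
    '''Keeps only the lat/lon slice of the toy bounding box'''
    return ds.sel(lat=slice(bbox[0], bbox[1]),
                  lon=slice(bbox[2], bbox[3]))

toy_ds = xr.Dataset({'value': toy_mask.astype(float)})
subset = toy_preprocess(toy_ds)
print(bbox, n_lake_cells, n_bbox_cells, dict(subset.sizes))
```

Because the lake is not rectangular, the bounding box contains more cells than the lake itself, which is why the lake mask is applied again after loading the data.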

### 1.6 Load data as xarray.Dataset
Initialize the dataset using xarray's [**xarray.open_mfdataset**](https://docs.xarray.dev/en/latest/generated/xarray.open_mfdataset.html) function. The xarray documentation has an extensive [user guide](https://xarray.pydata.org/en/stable/user-guide/io.html) with best practices for loading large datasets. We set ***decode_cf*** to **False** for now and decode the dataset later.

In [None]:
# Use pathlib.Path.rglob function to recursively find all .nc files within folder
paths_data = list(path_data.rglob('*fv1.1.nc'))   # search for all files *fv1.1.nc in folders and subfolders in the path

start_time = time.time()

DS = xr.open_mfdataset(paths=paths_data,
                       combine='by_coords',
                       parallel=True,
                       engine='netcdf4',
                       decode_cf=False,
                       decode_times=True,
                       preprocess=preprocess,
                       chunks=chunks
                       )

print(f'{Fore.RED}Xarray dataset with variables: {variables} initialized after ' \
      f'{(time.time()-start_time):0.1f} seconds')

We can get an overview of the xarray.Dataset and its variables and attributes.

In [None]:
DS

### 1.7 Apply lake-mask
Next we apply the lake mask to mask out cells of any other lakes that fall within the same bounding box.

In [None]:
# Compute the lake mask with specified CCI_lakeid
# use reindex to make sure mask cells are aligned to data cells
da_mask = (DS_mask_roi.CCI_lakeid == lakeid).reindex_like(other=DS, 
                                                          method='nearest')

# Get subset with lake mask using xarray.Dataset.where
DS_clip = DS.where(cond=da_mask, 
                   drop=True)
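
The reason for the `reindex_like` step can be seen on a toy example: xarray aligns arrays on their coordinate labels, so a mask whose coordinates differ from the data coordinates by a tiny floating-point amount has to be snapped to the data grid first. The arrays and the coordinate offset of 1e-7 below are made up for illustration.

```python
import numpy as np
import xarray as xr

lat = np.array([45.0, 45.1, 45.2])
lon = np.array([10.0, 10.1])

# Toy data on the "data grid"
data = xr.DataArray(np.arange(6.0).reshape(3, 2),
                    coords={'lat': lat, 'lon': lon},
                    dims=('lat', 'lon'))

# Toy mask whose coordinates are slightly off (made-up offset)
mask = xr.DataArray([[True, False],
                     [True, True],
                     [False, False]],
                    coords={'lat': lat + 1e-7, 'lon': lon - 1e-7},
                    dims=('lat', 'lon'))

# Snap the mask to the data coordinates, then clip the data
aligned = mask.reindex_like(data, method='nearest')
clipped = data.where(aligned, drop=True)
print(clipped.values)
```

After reindexing, the mask carries exactly the coordinates of the data, so `where` matches every cell; the last latitude row is dropped because it contains no lake cells.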

### 1.8 Decode the data
xarray will handle the data-decoding of the NetCDF format with the scaling- and offset-attributes found in the loaded files. We can use [xarray.decode_cf](https://docs.xarray.dev/en/latest/generated/xarray.decode_cf.html) to automatically decode the data.

In [None]:
# Decode data
DS_clip = xr.decode_cf(DS_clip)
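
What the decoding does can be illustrated with a tiny packed variable. The attribute values below (`_FillValue`, `scale_factor`, `add_offset`) and the variable name `lswt` are made up for this sketch and are not taken from the product files.

```python
import numpy as np
import xarray as xr

# Packed toy variable: integers plus CF packing attributes (made-up values)
raw = xr.Dataset(
    {'lswt': (('time',),
              np.array([-32768, 0, 1500], dtype='int16'),
              {'_FillValue': np.int16(-32768),
               'scale_factor': 0.01,
               'add_offset': 273.15})}
)

# decode_cf replaces the fill value by NaN and applies
# unpacked = packed * scale_factor + add_offset
decoded = xr.decode_cf(raw)
decoded_values = decoded['lswt'].values
print(decoded_values)
```

The first value becomes NaN because it equals the fill value, and the other two are unpacked to roughly 273.15 and 288.15.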

### 1.9 Drop days with no data
We can drop days with all-NaN values for the variable 'lake_surface_water_temperature' by using the [**xarray.Dataset.dropna**](https://docs.xarray.dev/en/latest/generated/xarray.Dataset.dropna.html) function with the parameter ***how='all'***. Alternatively, we can use the parameter ***thresh***, which sets the minimum number of non-NaN cells a day must have to be kept; days with fewer valid cells are dropped.

In [None]:
DS_clip_dropna = DS_clip.dropna(dim='time', 
                                how='all',
                                subset=['lake_surface_water_temperature'])

# alternative using threshold (e.g. keep only days with at least 50% valid lake cells)
#valid_thresh = int(lake_cells*0.5) 
#DS_clip_dropna = DS_clip.dropna(dim='time',
#                                thresh=valid_thresh,
#                                subset=['lake_surface_water_temperature'])

dropcount = DS_clip.time.size - DS_clip_dropna.time.size
print(f'{Fore.RED}Dropped {dropcount} days with all-nan cells (from {DS_clip.time.size} to {DS_clip_dropna.time.size})')

# alternative with threshold
#print(f'{Fore.RED}Dropped {dropcount} days with fewer than {valid_thresh/lake_cells*100:0.0f}% valid cells (from {DS_clip.time.size} to {DS_clip_dropna.time.size})')

DS_clip_dropna
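
The difference between the two `dropna` options can be checked on a toy dataset with three days and four cells per day. The values and the variable name `lswt` are made up for this sketch.

```python
import numpy as np
import xarray as xr

nan = np.nan
values = np.array([
    [[1.0, 2.0], [3.0, 4.0]],    # day 0: 4 valid cells
    [[nan, nan], [nan, nan]],    # day 1: 0 valid cells
    [[5.0, nan], [nan, nan]],    # day 2: 1 valid cell
])
time = np.arange('2019-01-01', '2019-01-04', dtype='datetime64[D]').astype('datetime64[ns]')
toy = xr.Dataset({'lswt': (('time', 'lat', 'lon'), values)},
                 coords={'time': time})

# how='all' drops only days without any valid cell
drop_all = toy.dropna(dim='time', how='all', subset=['lswt'])

# thresh=2 keeps only days with at least 2 valid cells
drop_thresh = toy.dropna(dim='time', thresh=2, subset=['lswt'])

print(drop_all.sizes['time'], drop_thresh.sizes['time'])
```

Here `how='all'` removes only day 1, while `thresh=2` also removes day 2 because it has a single valid cell.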

### 1.10 Mask LSWT by quality flag
We extract the variable 'lake_surface_water_temperature' from the [xarray.Dataset](https://docs.xarray.dev/en/latest/generated/xarray.Dataset.html) as an [xarray.DataArray](https://docs.xarray.dev/en/latest/generated/xarray.DataArray.html) and convert the values from K to °C.

In [None]:
lswt_celsius = (DS_clip_dropna.lake_surface_water_temperature-273.15)

Using the variable 'lswt_quality_level', we can mask out cells that do not meet the required quality level. Quality levels range from 0 to 5:

- 0: unprocessed 
- 1: bad 
- 2: marginal 
- 3: intermediate
- 4: good 
- 5: best

In the [D4.1: Product Validation and Intercomparison Report](https://climate.esa.int/media/documents/CCI-LAKES-0031-PVIR_v1.4.pdf), the authors of the ESA CCI Lakes product recommend limiting the use of LSWT for lake-climate applications to quality levels 4 and 5 (good and best).

In [None]:
# Use quality flag variable 'lswt_quality_level' to mask out all cells except good and best
qmask = (DS_clip_dropna['lswt_quality_level']>3)
lswt_celsius = lswt_celsius.where(cond=qmask)
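
The unit conversion and the quality masking can be followed on four toy cells. The temperatures and quality levels below are made up for illustration.

```python
import numpy as np
import xarray as xr

# Four toy cells: temperature in K and the matching quality level (made up)
toy_kelvin = xr.DataArray([275.15, 280.15, 285.15, 290.15], dims='cell')
toy_quality = xr.DataArray([1, 3, 4, 5], dims='cell')

toy_celsius = toy_kelvin - 273.15                    # K -> °C
toy_masked = toy_celsius.where(toy_quality > 3)      # keep levels 4 and 5 only
print(toy_masked.values)
```

Cells with quality levels 1 and 3 are set to NaN, while the cells with levels 4 and 5 keep their temperatures of about 12 °C and 17 °C.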

### 1.11 Export subset
If necessary, the subset with the masked data can be exported using [**xarray.Dataset.to_netcdf**](https://xarray.pydata.org/en/stable/generated/xarray.Dataset.to_netcdf.html). We can get the encoding settings (e.g. compression, fillvalue, scale-factor, offset) for the NetCDF export from the previously loaded dataset.

In [None]:
path_dst = rf'D:\lakecrest\output\ESACCI-LAKES-L3S-LK_PRODUCTS-MERGED-{lakename}-fv2.0.nc'
path_dst = pathlib.Path(path_dst)

# Get encoding settings from DS.encoding and DS.attrs to ensure the encoding settings are identical
DS_enc = {}
encoding_enc = ['zlib', 'shuffle', 'complevel', 
              'fletcher32', 'contiguous', 'dtype']
attr_enc = ['_FillValue', 'scale_factor', 'add_offset']
for var in variables:
    DS_enc_encoding = DS.get(var).encoding
    DS_enc_fromenc = {k:v for k, v in DS_enc_encoding.items() \
                      if k in encoding_enc}
    DS_enc_attrs = DS.get(var).attrs
    DS_enc_fromattrs = {k:v for k, v in DS_enc_attrs.items() \
                        if (k in attr_enc and hasattr(DS.get(var), k))}
    DS_enc[var] = {**DS_enc_fromenc, **DS_enc_fromattrs}
print(f'{Fore.RED}DS encoding:', DS_enc)

# Set time slice to get a temporal subset

timeslice = slice('2019-01-01', '2020-01-01')

# Set starting time for timer
start_time = time.time() 

export = lswt_celsius.sel(time=timeslice)
                  
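# Only pass the encoding of the exported variable; encoding entries for
# variables that are not part of the exported data cause an error
export.to_netcdf(path=path_dst,
                 mode='w',
                 engine='netcdf4',
                 encoding={export.name: DS_enc[export.name]}
                 )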

print(f'{Fore.RED}ESA CCI Lakes v2.0 subset of Lake {lakename} ' \
      f'exported after {(time.time()-start_time):0.1f} seconds.')

## 2. Explore dataset
### 2.1 LSWT timeseries animation
Create an interactive timeseries animation that displays the temperature data in a cartographic reference system using [HoloViews](https://holoviews.org/) (through its hvPlot interface) and [Cartopy](https://scitools.org.uk/cartopy/docs/latest/).

In [None]:
import hvplot.xarray
import cartopy.crs as ccrs

crs = ccrs.PlateCarree(central_longitude=0, globe=None)
lswt_celsius.hvplot(
    geo=True, tiles='CartoLight',
    groupby="time",  # adds a widget for time
    clim=(0, 25),  # sets colormap limits
    crs=crs,
    cmap='jet',
    widget_type="scrubber",
    widget_location="bottom",
    width=300
)

### 2.2 Mean LSWT plot
We can use [**xarray.DataArray.mean**](https://docs.xarray.dev/en/latest/generated/xarray.DataArray.mean.html) with the parameters ***dim*** and ***skipna*** to plot the spatially aggregated daily LSWT mean over time. An extensive documentation of the plotting functionality compatible with xarray datatypes can be found at https://docs.xarray.dev/en/latest/user-guide/plotting.html.

In [None]:
timeslice = slice('2006-01-01', '2007-12-31')

lswt_celsius_mean = lswt_celsius.mean(dim=["lat", "lon"], skipna=True)
lswt_celsius_mean.sel(time=timeslice).plot.line("b-^")

### 2.3 Visualize data availability

As the mean plots show, data availability changes with the timeline of the contributing sensor missions. We can use the time dimension to calculate the yearly average coverage frequency.
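
One possible way to compute such a yearly coverage frequency is sketched below on a synthetic two-year series. The toy array, its gap pattern and the names `toy_lswt`, `has_data` and `coverage` are assumptions for illustration, not part of the original workflow.

```python
import numpy as np
import xarray as xr

# Synthetic daily series for 2019 and 2020 on a 2 x 2 grid (made-up values)
time = np.arange('2019-01-01', '2021-01-01', dtype='datetime64[D]').astype('datetime64[ns]')
values = np.full((time.size, 2, 2), 10.0)

# Remove every second day of 2019 to imitate a sparser observation period
is_2019 = time < np.datetime64('2020-01-01')
gap_days = is_2019 & (np.arange(time.size) % 2 == 1)
values[gap_days] = np.nan

toy_lswt = xr.DataArray(values, coords={'time': time},
                        dims=('time', 'lat', 'lon'))

# A day counts as covered if at least one cell holds a valid value
has_data = toy_lswt.notnull().any(dim=['lat', 'lon'])

# Fraction of covered days per calendar year
coverage = has_data.astype(float).groupby('time.year').mean()
print(coverage.values)
```

In this toy series about half of the days in 2019 and all days in 2020 are covered. Since days without data were already dropped from the notebook's LSWT data, the same idea would there amount to counting the remaining time steps per year and dividing by the number of days in that year.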

### 2.4 Compute linear trend
We can compute the linear trend for each lake cell using the [xarray.DataArray.polyfit](https://docs.xarray.dev/en/stable/generated/xarray.DataArray.polyfit.html) function.

In [None]:
date_start = "1993-01-01"
date_end = "2018-12-31"
lswt_celsius_subset = lswt_celsius.loc[dict(time=slice(date_start, date_end))].persist()
polyfit_results = lswt_celsius_subset.polyfit(dim='time', 
                                              deg=1, 
                                              skipna=True)
ns_per_year = 3.154e+16            # approx. number of nanoseconds per year
ns_per_decade = 10 * ns_per_year   # approx. number of nanoseconds per decade
trend_coef = polyfit_results.polyfit_coefficients.sel(degree=1)
trend_coef_perDec = trend_coef*ns_per_decade # convert °C/ns to °C/10y
trend_coef_perDec.load() # load results into memory for faster plotting
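
The unit conversion from °C/ns to °C/10y can be checked on a synthetic series with a known trend. The sketch below uses `numpy.polyfit` as a stand-in for the xarray fit and assumes years of 365.25 days, which differs from the rounded constant used above by well below 0.1 %; the trend of 0.3 °C per decade is a made-up value.

```python
import numpy as np

ns_per_day = 86400 * 1e9
ns_per_decade_exact = 10 * 365.25 * ns_per_day    # assumption: 365.25-day years

# Ten years of daily time stamps, expressed as nanoseconds since 1970
time = np.arange('2009-01-01', '2019-01-01', dtype='datetime64[D]')
t_ns = time.astype('datetime64[ns]').astype('int64').astype(float)

# Synthetic temperatures with a known linear trend (made-up value)
true_trend_per_decade = 0.3
toy_temps = 10.0 + true_trend_per_decade * (t_ns - t_ns[0]) / ns_per_decade_exact

# Fitting against time in nanoseconds yields a slope in °C per nanosecond
slope_per_ns, intercept = np.polyfit(t_ns, toy_temps, deg=1)
slope_per_decade = slope_per_ns * ns_per_decade_exact
print(f'{slope_per_decade:0.3f} °C/10y')
```

The fitted slope per nanosecond is a very small number; multiplying it by the number of nanoseconds per decade recovers the trend that was put into the synthetic series.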

Plot the data using HoloViews

In [None]:
import math

# Symmetric color limit: 95th percentile of the trend, rounded up to one decimal
vmax = np.nanpercentile(trend_coef_perDec, 95)
vmax = math.ceil(vmax*10)/10

trend_coef_perDec.hvplot(
    clabel='trend (°C/10y)', 
    label=f'Lake {lakename}, Linear trend',
    geo=True, 
    tiles='StamenTerrainRetina', # plot backgroundmap
    #features={'rivers':'10m'},
    cmap='bwr',
    clim=(-vmax, vmax),
    crs=crs,
    #width=300
)

In [None]:
stop = math.ceil(vmax*10)/10
steps = int(stop/0.1)*2+1
levels = np.linspace(-stop, stop, steps)

trend_coef_perDec.hvplot.contourf(
    clabel='trend (°C/10y)', 
    label=f'Lake {lakename}, Linear trend',
    geo=True, 
    tiles='StamenTerrainRetina', # plot backgroundmap
    #features={'rivers':'10m'},
    levels=levels,  
    cmap='bwr',
    crs=crs,
    #width=300
)

Calculate the mean trend (°C/10y) over the entire lake.

In [None]:
np.nanmean(trend_coef_perDec)

Plot with matplotlib

In [None]:
import matplotlib.pyplot as plt

map_proj = ccrs.PlateCarree()
data_proj = ccrs.PlateCarree()

# create figure
fig = plt.figure(dpi=150)
ax = fig.add_subplot(111, projection=map_proj)

# draw filled contour plot
trend_plot = ax.contourf(polyfit_results.lon.values,
                         polyfit_results.lat.values,
                         polyfit_results.polyfit_coefficients.sel(degree=1)*3.154e+17, # convert from °C/ns to °C/10y
                         cmap='bwr',
                         levels=np.linspace(-1, 1, 11),
                         transform=data_proj,
                         extend='both')

# add colorbar
axpos = ax.get_position()
cbar_ax = fig.add_axes([axpos.x1+0.03,axpos.y0,0.03,axpos.height])
cbar = fig.colorbar(trend_plot, cax=cbar_ax)
cbar.ax.tick_params(labelsize=12)
cbar.set_label('LSWT trend (°C/10y)', fontsize=12)