# LakeCREST notebooks
*Version: 08.04.2022 14:45*
## 1. Unpack ESA CCI Lakes (multiple lakes, from disk)
In this script we will use [**xarray**](https://docs.xarray.dev/en/latest/index.html) and [**dask**](https://dask.org/) to load, mask and subset the [**ESA CCI Lakes v1.1**](https://catalogue.ceda.ac.uk/uuid/ef1627f523764eae8bbb6b81bf1f7a0a). The unpacking in this script is based on loading the pre-downloaded dataset from a local disk. Using xarray and dask, two free and open-source libraries, allows us to fully utilize the computing power of our machine in parallelized workflows scaled to the system at hand.

### 1.1 Importing modules
First, we import the Python modules needed for the following steps.

In [None]:
import pathlib
import xarray as xr
import numpy as np
import time
import pandas as pd
from dask.distributed import Client, LocalCluster

### 1.2 Define ROI
Here we define the desired lakes and data paths. These are the only user inputs needed to export the LSWT+LIC subsets. Depending on the system, the Dask settings in *1.3 Dask initialization* can also be adapted to fit the available memory and processing limits. More information on exporting the other available variables can be found in the [**D4.3: Product User Guide (PUG)**](https://climate.esa.int/media/documents/CCI-LAKES-0029-PUG_v1.1_signed_CA.pdf). To mask the lakes from the global dataset we use the lake mask "ESACCI-LAKES_mask_v1.nc", which can be accessed as part of [**ESA CCI Lakes v1.0**](https://catalogue.ceda.ac.uk/uuid/3c324bb4ee394d0d876fe2e1db217378).
#### Define filepaths (✦ User inputs)

In [None]:
# Define lakes to unpack
lakenames = ['Ontario', 'Huron', 'Michigan', 'Kariba', 'Garda']

# Define data directory path and filenames
path_data = pathlib.Path(r'D:\lakecrest\esa_cci_lp\v1.1') # Path to ESA CCI Lakes data folder
path_mask = pathlib.Path(r'D:\lakecrest\esa_cci_lp\mask\ESACCI-LAKES_mask_v1.nc') # Path to lakemask
path_dask = pathlib.Path(r'C:\Users\Micha\Desktop\dask') # Path to temporary dask workerspace

Now we can search for the CCI_lakeid corresponding to the defined lakes using the provided table.

In [None]:
# Load in table of lakes with lake coordinates and the CCI_lakeid
# based on D4.3 Product User Guide (PUG) - Annex B: List of lakes
try:
    df = pd.read_csv('lakelist_v1.1.csv', delimiter=';')
except FileNotFoundError:
    print('Error: Did not find the lakelist .csv file, check that it is in the same folder!')
    raise  # Stop here, the following cells need the table

for lake in lakenames:
    if lake not in df['name'].values:
        print(f'Warning: Lake {lake} was not found, check spelling!')

# Get Preview of selected lakes
df[df['name'].isin(lakenames)]
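
The lookup above can be tried without the real table. The following is a minimal sketch that builds a small in-memory table with the same `name` and `cci_lakeid` columns. The IDs are made-up placeholders, not real CCI lake IDs, and `find_lake_ids` is a hypothetical helper that is not part of the notebook.

```python
import io
import pandas as pd

# Stand-in for lakelist_v1.1.csv (semicolon-delimited); the IDs are placeholders
csv_text = "name;cci_lakeid\nOntario;1\nHuron;2\nGarda;3\n"
df_demo = pd.read_csv(io.StringIO(csv_text), delimiter=';')

def find_lake_ids(df, names):
    """Return ({name: id} for the names found, [names not found])."""
    found = df[df['name'].isin(names)]
    ids = dict(zip(found['name'], found['cci_lakeid']))
    missing = [n for n in names if n not in ids]
    return ids, missing

ids, missing = find_lake_ids(df_demo, ['Ontario', 'Garda', 'Gardaa'])
print(ids)      # lakes that were matched
print(missing)  # misspelled or unknown lakes
```

Collecting the unmatched names in one list makes it easy to stop early before any data is loaded.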

### 1.3 Dask initialization
We initialize a local Dask client with our specified number of workers, threads per worker and memory limit (per worker). Calling the client outputs the client address, so we can access the client through its web interface. A good starting point is to set *n_workers* to the number of physical cores, *threads_per_worker* to the number of virtual cores divided by *n_workers*, and *memory_limit* to the available memory divided by *n_workers*.

In [None]:
# Define according to system specs
n_workers = 2            # (e.g. number of physical cores)
threads_per_worker = 8     # (e.g. virtual cores / n_workers)
memory_limit = '16GB'       # (e.g. max memory / n_workers)

local_directory = path_dask
cluster = LocalCluster(n_workers=n_workers, 
                       threads_per_worker=threads_per_worker, 
                       memory_limit=memory_limit,
                       local_directory=local_directory
                      )
client = Client(address=cluster.scheduler_address)
client
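
As a rough illustration of the rule of thumb given in the comments of the cell above, the following sketch splits cores and memory evenly across workers. `suggest_dask_settings` is a hypothetical helper, and the example numbers (16 virtual cores, 32 GB RAM) are assumptions chosen only to reproduce the values used in this notebook.

```python
import os

def suggest_dask_settings(logical_cores, total_memory_gb, n_workers):
    """Split logical cores and memory evenly across n_workers (hypothetical helper)."""
    threads_per_worker = max(1, logical_cores // n_workers)
    memory_limit = f'{total_memory_gb // n_workers}GB'
    return {'n_workers': n_workers,
            'threads_per_worker': threads_per_worker,
            'memory_limit': memory_limit}

# Assumed example machine: 16 virtual cores, 32 GB RAM, 2 workers
settings = suggest_dask_settings(logical_cores=16, total_memory_gb=32, n_workers=2)
print(settings)

# os.cpu_count() reports the logical cores of the current machine (may be None)
print(os.cpu_count())
```

In practice it can be sensible to leave some memory for the operating system rather than handing all of it to the workers.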

### 1.4 Specify chunk size
We use xarray to load the large multi-file dataset. xarray allows us to initialize and load the entire dataset by only providing the necessary filepaths. Instead of loading the entire dataset (>350GB) to memory, we can make use of xarray's ability to lazy-load chunks of data. This means that only the necessary subset of each individual .nc file will be loaded into memory at the time it is needed. For this we can either pre-define a chunk size or let xarray automatically define a size.

In [None]:
# dask decides chunk size
chunks='auto'

# Alternatively set chunks to specified size
# chunks={'lat':10,
#         'lon':10,
#         'time':1
#         }
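
To get a feeling for what a manual chunk size means in memory, the size of a single chunk can be estimated from its shape and data type. This is a small sketch, assuming a 4-byte `float32` value per cell; the actual data type of the decoded variables may differ, and `chunk_nbytes` is a hypothetical helper.

```python
import math
import numpy as np

def chunk_nbytes(chunks, dtype):
    """Bytes held by one chunk with the given per-dimension sizes."""
    return math.prod(chunks.values()) * np.dtype(dtype).itemsize

# The commented-out example above: 10 x 10 cells, one time step
small = chunk_nbytes({'lat': 10, 'lon': 10, 'time': 1}, 'float32')
# A larger, hypothetical alternative
large = chunk_nbytes({'lat': 1000, 'lon': 1000, 'time': 1}, 'float32')

print(small)        # bytes per chunk
print(large / 1e6)  # megabytes per chunk
```

Very small chunks generally lead to many tasks and more scheduling overhead, while very large chunks may not fit into the memory limit of a worker.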

### 1.5 Spatial subsetting
Because we are only interested in specific lakes, we define a bounding box for each desired lake based on the provided lake mask file.

In [None]:
# Load CCI lakes mask
DS_mask = xr.open_dataset(filename_or_obj=path_mask,
                          engine='netcdf4',
                          decode_cf=True,
                          chunks=chunks
                          )
lakedict = {}

for lake in lakenames:
    # Get lakeid
    lakeid = df.loc[df['name'] == lake, 'cci_lakeid'].values[0]
    
    # Get logical True/False lake mask over full globe
    mask_full = (DS_mask.CCI_lakeid == lakeid)

    # Get logical True/False lake mask sliced over ROI only
    mask_roi = mask_full.where(mask_full, drop=True)

    # Get bounds coordinates
    lat_min = mask_roi.lat[0].values
    lat_max = mask_roi.lat[-1].values
    lon_min = mask_roi.lon[0].values
    lon_max = mask_roi.lon[-1].values
    
    lakedict[lake] = {
        'name': lake,
        'id': lakeid,
        'lat_min': lat_min,
        'lat_max': lat_max,
        'lon_min': lon_min,
        'lon_max': lon_max,
        'mask_roi': mask_roi}
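
The bounding box logic can be illustrated with plain NumPy on a tiny grid. This is a sketch with made-up coordinates and a made-up lake ID of 7. Unlike the cell above, which takes the first and last coordinate of the cropped mask, it uses `min()` and `max()`, so it would also work if a coordinate axis were stored in descending order.

```python
import numpy as np

# Hypothetical 4 x 5 grid; latitude is deliberately descending
lat = np.array([46.0, 45.9, 45.8, 45.7])
lon = np.array([10.5, 10.6, 10.7, 10.8, 10.9])
lake_id = np.array([[0, 0, 0, 0, 0],
                    [0, 7, 7, 0, 0],
                    [0, 0, 7, 7, 0],
                    [0, 0, 0, 0, 0]])

# Logical True/False mask for the lake with ID 7
mask = lake_id == 7

# Rows and columns that contain at least one lake cell
rows = np.any(mask, axis=1)
cols = np.any(mask, axis=0)

# Bounding box from the coordinates of those rows and columns
lat_min, lat_max = lat[rows].min(), lat[rows].max()
lon_min, lon_max = lon[cols].min(), lon[cols].max()
print(lat_min, lat_max, lon_min, lon_max)
```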

We can check the lake masks and the bounding boxes computed from the mask file on a map view using [**hvplot**](https://hvplot.holoviz.org/) and [**GeoViews**](https://geoviews.org/). Both libraries are based on the [**HoloViews**](https://holoviews.org/) library, so their plot objects can be easily combined.

In [None]:
import hvplot.xarray
import geoviews as gv
import holoviews as hv
import warnings

# Ignore warnings about unevenly sampled axes, 
# this is expected because we are working with geographic coordinates
warnings.filterwarnings("ignore", message="Image dimension.*")

plots = []

for lake in lakedict.values():
    # Create mapplot of mask
    hv_mask = lake['mask_roi'].hvplot(geo=True, tiles='CartoLight', colorbar=False, 
                              xlabel='Longitude', ylabel='Latitude', title=f'Lake {lake["name"]}',
                                     width=300, height=300)
    # Create mapplot of bounding box
    gv_bbox = gv.Rectangles([(lake['lon_min'], lake['lat_min'], 
                              lake['lon_max'], lake['lat_max'])])\
    .opts(color='none', line_width=2, line_color='red')
    # Combine mapplots and add to list of plots
    plots.append(hv_mask*gv_bbox)

# Plot ROIs together
combined = hv.Layout(plots).cols(3)
combined.opts(shared_axes=False)

### 1.6 Load an individual .nc file
To test out xarray and get a preview of the ESA CCI Lakes dataset we can load a single file from the dataset. For this we will use the [xarray.open_dataset](https://docs.xarray.dev/en/stable/generated/xarray.open_dataset.html) function. We can get a preview of the loaded data, its attributes and variables in the console view.

In [None]:
# Use pathlib.Path.rglob function to recursively find all .nc files within the data folder
paths_data = list(path_data.rglob('*fv1.1.nc'))

# Get the first filepath from the list of .nc files
path_fn_first = paths_data[0]

# Load the file with xarray
DS_preview = xr.open_dataset(filename_or_obj=path_fn_first,
                             engine='netcdf4',
                             decode_cf=False,
                             chunks=chunks)
DS_preview
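
The order in which `rglob` returns files depends on the file system, so "the first file" is not necessarily the earliest one. The following sketch shows the recursive search on a temporary folder and sorts the result to make it deterministic. The folder layout and file names are placeholders, not the real ESA CCI Lakes structure.

```python
import pathlib
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    root = pathlib.Path(tmp)
    # Hypothetical nested layout with placeholder file names
    for rel in ['2019/01/b-fv1.1.nc', '2019/01/a-fv1.1.nc',
                '2018/12/c-fv1.1.nc', '2019/01/notes.txt']:
        f = root / rel
        f.parent.mkdir(parents=True, exist_ok=True)
        f.touch()

    # Recursively find all matching .nc files and sort them by path
    found = sorted(root.rglob('*fv1.1.nc'))
    names = [p.relative_to(root).as_posix() for p in found]

print(names)
```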

### 1.7 Load full dataset as xarray.Dataset
Now we can initialize the full dataset using xarray's [**xarray.open_mfdataset**](https://docs.xarray.dev/en/latest/generated/xarray.open_mfdataset.html) function. Xarray will handle the decoding of the NetCDF data using the scale and offset attributes found in the loaded files. During the loading process we can monitor the progress and the task stream of our workers in the Dask web interface (output from *1.3 Dask initialization*). Once the dataset is loaded, we'll get a preview.

The xarray documentation has an extensive [user-guide](https://xarray.pydata.org/en/stable/user-guide/io.html) with explanations and best-practices to load large datasets.

In [None]:
# Define variables (rest is dropped from dataset)
variables = ['lake_surface_water_temperature',
              'lswt_quality_level',
              'lswt_uncertainty',
              'lake_ice_cover'
              ]

# Create preprocess function that drops unnecessary variables
def preprocess(ds):
    return ds[variables]

# Setup timer to time the loading process
start_time = time.time()

DS = xr.open_mfdataset(paths=paths_data,
                       combine='by_coords',
                       parallel=True,
                       engine='netcdf4',
                       decode_cf=True,
                       chunks=chunks,
                       preprocess=preprocess)

print(f'Xarray dataset with variables: {variables} initialized after ' \
      f'{(time.time()-start_time):0.1f} seconds')

DS # Load preview
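
The decoding with scale and offset attributes mentioned above follows the usual packed-data convention: `decoded = packed * scale_factor + add_offset`, with the fill value turned into NaN. The sketch below reproduces this by hand with NumPy. The packed values and the attribute values are illustrative assumptions, not taken from the ESA CCI Lakes files.

```python
import numpy as np

# Hypothetical packed integers as stored on disk
packed = np.array([28315, 28400, -32768], dtype=np.int16)
scale_factor, add_offset, fill_value = 0.01, 0.0, -32768  # assumed attributes

# Decode: replace the fill value with NaN, then scale and shift
decoded = np.where(packed == fill_value, np.nan,
                   packed * scale_factor + add_offset)
print(decoded)

# Encode again: invert the formula and restore the fill value
repacked = np.where(np.isnan(decoded), fill_value,
                    np.round((decoded - add_offset) / scale_factor)).astype(np.int16)
print(repacked)
```

The same attributes are needed again when writing the data back to disk in a packed form.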

### 1.8 Export subsets
The subset with the masked data can now be exported using [**xarray.Dataset.to_zarr**](https://xarray.pydata.org/en/stable/generated/xarray.Dataset.to_zarr.html). We can get the encoding settings (e.g. compression, fill value, scale factor, offset) for the export from the previously loaded preview (*1.6 Load an individual .nc file*) of the dataset.

We apply the lake mask to mask cells of other lakes in the same bounding box. To make sure that the lake mask has identical cell coordinates, we first align it to the coordinates of our dataset. Now we can export the subset. Exporting the files to the [zarr](https://zarr.readthedocs.io/en/stable/) format instead of NetCDF is more efficient for parallel writing and compression.
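
Why the alignment step matters can be shown on a tiny synthetic grid. In this sketch the mask coordinates differ from the data coordinates by a small, made-up offset, so they would not match exactly; `reindex_like` with `method='nearest'` moves the mask onto the data grid before it is applied. All values and coordinates are assumptions for illustration.

```python
import numpy as np
import xarray as xr

# Hypothetical data grid
lat = np.array([10.0, 10.1, 10.2])
lon = np.array([20.0, 20.1, 20.2])
data = xr.DataArray(np.arange(9, dtype=float).reshape(3, 3),
                    coords={'lat': lat, 'lon': lon}, dims=('lat', 'lon'))

# Lake mask whose coordinates are off by a tiny amount
mask = xr.DataArray(np.array([[True, True, False],
                              [False, True, False],
                              [False, False, False]]),
                    coords={'lat': lat + 1e-7, 'lon': lon - 1e-7},
                    dims=('lat', 'lon'))

# Snap the mask onto the data coordinates using the nearest cell
mask_aligned = mask.reindex_like(data, method='nearest')

# Keep values inside the mask, set everything else to NaN
masked = data.where(mask_aligned)
print(masked.values)
```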

In [None]:
# Define output path in /subsets folder
# Form pathlib objects
path_wrk = pathlib.Path().absolute()
path_ss = path_wrk.joinpath('subsets')
# Create subsets folder if it doesn't exist yet
path_ss.mkdir(parents=True, exist_ok=True)

# Get exports for each lake
for lake in lakedict.values():
    
    ### Slice data, apply lakemask and subset variables
    ### -*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-
    
    #slice
    DS_lake = DS.sel(lat=slice(lake['lat_min'], lake['lat_max']), lon=slice(lake['lon_min'], lake['lon_max']))
    
    # Get mask from dict
    mask_roi = lake['mask_roi']
    
    # Reindex coordinates to make sure that our lake mask cells are aligned to data cells
    da_mask = mask_roi.reindex_like(other=DS_lake, method='nearest')
    # Load mask to memory
    da_mask.load()

    # Keep only cells inside the lake mask (all other cells become NaN)
    DS_lake_masked = DS_lake.where(cond=(da_mask==True))
    
    # Rechunk subset
    lat_chunk = DS_lake_masked.lat.size
    lon_chunk = DS_lake_masked.lon.size
    DS_lake_masked = DS_lake_masked.chunk({'lat':lat_chunk, 'lon':lon_chunk, 'time':1})
    DS_lake_masked

    ### Run export for subset
    ### -*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-

    import zarr
    
    # Define an output path for the subset file
    lakename = lake["name"]
    path_dst = path_ss.joinpath(fr'ESACCI-LAKES-L3S-LK_PRODUCTS-MERGED-LSWT_LIC-{lakename}-fv1.1.zarr')

    # Get encoding settings from un-decoded DS_preview
    DS_enc = {}
    encoding_enc = ['dtype'] # ,'zlib', 'shuffle', 'complevel', 'fletcher32', 'contiguous']
    attr_enc = ['_FillValue', 'scale_factor', 'add_offset']
    for var in variables:
        DS_enc_encoding = DS_preview.get(var).encoding
        DS_enc_fromenc = {k:v for k, v in DS_enc_encoding.items() \
                          if k in encoding_enc}
        DS_enc_attrs = DS_preview.get(var).attrs
        DS_enc_fromattrs = {k:v for k, v in DS_enc_attrs.items() \
                            if (k in attr_enc and hasattr(DS_preview.get(var), k))}
        DS_enc[var] = {**DS_enc_fromenc, **DS_enc_fromattrs}

    # Setup zarr-compressor using Blosc and zstd-compression
    compressor = zarr.Blosc(cname="zstd", clevel=3, shuffle=2)

    # Set compression settings in encoding
    for enc in DS_enc.values():
        enc['compressor'] = compressor

    #print('DS encoding:')
    #for var in DS_enc.items(): 
    #    print(var)

    # Set starting time for timer
    start_time = time.time() 

    print(f'Exporting for Lake {lakename} started.. ', end='')
    DS_lake_masked.to_zarr(store=path_dst,
                           mode='w',
                           encoding=DS_enc)
    print(f'Subset exported after {(time.time()-start_time):0.0f} seconds.')

print('All subsets generated!')
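
The way the encoding dictionary is assembled in the export loop can be followed with plain dictionaries. Because the preview was opened with `decode_cf=False`, attributes such as `_FillValue`, `scale_factor` and `add_offset` are still found in `.attrs`, while `dtype` comes from `.encoding`. In this sketch the two input dictionaries are made-up stand-ins for one variable, `build_encoding` is a hypothetical helper, and the compressor is a placeholder string instead of a real compressor object.

```python
# Stand-ins for one variable's .encoding and .attrs (values are assumptions)
raw_encoding = {'dtype': 'int16', 'zlib': True, 'complevel': 4, 'source': 'file.nc'}
raw_attrs = {'_FillValue': -32768, 'scale_factor': 0.01, 'add_offset': 0.0,
             'long_name': 'lake surface water temperature', 'units': 'Kelvin'}

encoding_keys = ['dtype']
attr_keys = ['_FillValue', 'scale_factor', 'add_offset']

def build_encoding(encoding, attrs, compressor=None):
    """Keep only the keys relevant for writing and attach a compressor."""
    enc = {k: v for k, v in encoding.items() if k in encoding_keys}
    enc.update({k: v for k, v in attrs.items() if k in attr_keys})
    if compressor is not None:
        enc['compressor'] = compressor
    return enc

enc = build_encoding(raw_encoding, raw_attrs, compressor='placeholder-compressor')
print(enc)
```

Filtering by key keeps NetCDF-specific settings such as `zlib` out of the zarr export.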

#### Convert subset zarr-folders to zip-files
To make the zarr-folders easier to handle we can put them in zip-archives.

In [None]:
import shutil

#shutil.make_archive(output_filename, 'zip', dir_name)
paths_ss = list(path_ss.glob('*zarr'))
for path in paths_ss:
    # Set starting time for timer
    start_time = time.time()
    fn_dst = path.stem
    path_dst = path_ss.joinpath(fn_dst)
    # Create .zip if it doesn't exist yet
    if not(path_ss.joinpath(path_dst.name+('.zip')).exists()):
        print(f'Zipping zarr-folder of Lake {path.stem.split("-")[-2]}.. ', end='')
        shutil.make_archive(path_dst, 'zip', path)
        print(f'Finished zipping after {(time.time()-start_time):0.0f} seconds.')

print('All zarr-folders converted to .zip files!')
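
The zipping step can be tested in isolation on a temporary folder. The following sketch creates a fake zarr folder with placeholder content, zips it with `shutil.make_archive`, and lists the archive members. The folder name `DEMO-Garda-fv1.1.zarr` and the files inside it are assumptions for illustration only.

```python
import pathlib
import shutil
import tempfile
import zipfile

with tempfile.TemporaryDirectory() as tmp:
    root = pathlib.Path(tmp)

    # Fake zarr folder with placeholder content
    store = root / 'DEMO-Garda-fv1.1.zarr'
    (store / 'lake_ice_cover').mkdir(parents=True)
    (store / '.zgroup').write_text('{"zarr_format": 2}')
    (store / 'lake_ice_cover' / '0.0.0').write_bytes(b'\x00')

    # base_name has no extension, make_archive appends '.zip' itself
    base_name = root / store.stem
    archive = pathlib.Path(shutil.make_archive(str(base_name), 'zip', root_dir=store))

    archive_name = archive.name
    lake = store.stem.split('-')[-2]  # lake name is the second-to-last part
    with zipfile.ZipFile(archive) as zf:
        members = sorted(zf.namelist())

print(archive_name, lake)
print(members)
```

Writing the archive next to the folder, not inside it, avoids the archive including itself.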