# Reanalysis preprocessing
***

__Author__: Chus Casado<br>
__Date__:   31-05-2022<br>

__Introduction__:<br>
This code processes the raw EFAS reanalysis discharge data to extract the data necessary for the following steps in the skill analysis.

The raw discharge data were downloaded from the Climate Data Store (CDS) and consist of NetCDF files for every year of the analysis. These NetCDF files contain values for the complete EFAS domain, but the subsequent analysis only requires the time series at specific points: the selected reporting points. The code extracts these time series and saves the result in a folder of the repository.

In a second step, the discharge time series are compared against the discharge of a given return period to create a new binary time series of exceedance/non-exceedance over the specified threshold. To account for events in which the peak discharge is close to the threshold, there is an option to create a 3-class exceedance time series: 0, non-exceedance; 1, exceedance over the reduced threshold ($0.95\cdot Q_{rp}$); 2, exceedance over the actual threshold ($Q_{rp}$). The reduced threshold is $0.95\cdot Q_{rp}$ by default, but the reduction can be changed in the configuration.
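
The three classes described above can be written compactly. In this sketch, $Q_t$ is the discharge at time step $t$ and $f$ is the multiplier that defines the reduced threshold ($f = 0.95$ in the text). Both symbols are notation assumed here for illustration; they are not names used in the code.

$$
c_t =
\begin{cases}
0, & Q_t < f \cdot Q_{rp} \\
1, & f \cdot Q_{rp} \le Q_t < Q_{rp} \\
2, & Q_t \ge Q_{rp}
\end{cases}
$$

For instance, with $Q_{rp} = 100$ m³/s and $f = 0.95$, a discharge of 97 m³/s falls in class 1, while a discharge of exactly 100 m³/s falls in class 2.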

**Questions**:<br>

**To do**:<br>
* [ ] How to define in the configuration file (`config.yml`) the two reporting point input files?

In [1]:
import os
import numpy as np
import pandas as pd
import xarray as xr
os.environ['USE_PYGEOS'] = '0'
import geopandas as gpd
import yaml
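# `tqdm.tqdm_notebook` is deprecated; import the notebook progress bar under the same name
from tqdm.notebook import tqdm as tqdm_notebook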

path_root = os.getcwd()

## 1 Configuration

In [2]:
with open("../conf/config.yml", "r", encoding='utf8') as ymlfile:
    cfg = yaml.load(ymlfile, Loader=yaml.FullLoader)

# area threshold
area_threshold = cfg.get('reporting_points', {}).get('area', 500)

# return period
rp = cfg.get('return_period', {}).get('threshold', 5)

# percent buffer over the discharge threshold
reducing_factor = cfg.get('return_period', {}).get('reducing_factor', None)

# start and end of the study period
start = cfg.get('study_period', {}).get('start', None)
end = cfg.get('study_period', {}).get('end', None)

# PATHS

# local directory where I have saved the raw discharge data
path_in = cfg.get('paths', {}).get('input', {}).get('discharge', {}).get('reanalysis', '../data/discharge/reanalysis/')

# NetCDF file that contains the discharge thresholds for each reporting point
file_thresholds = cfg.get('return_period', {}).get('file', '../data/thresholds/return_levels.nc')

# reporting points
file_in_stations = cfg.get('paths', {}).get('input', {}).get('reporting_points', '../data/reporting_points/')
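# NOTE: the key 'output' is used here for consistency with the other output paths in this notebook
path_out_stations = cfg.get('paths', {}).get('output', {}).get('reporting_points', '../results/reporting_points/')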
file_out_stations = f'{path_out_stations}reporting_points_over_{area_threshold}km2.parquet'
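
The chained `.get()` calls above fall back to a default whenever a section or a key is missing from the configuration. The sketch below illustrates that behaviour with a plain dictionary; the dictionary content (including the dates) is a hypothetical excerpt, since the real `config.yml` is not shown here.

```python
# hypothetical excerpt of the parsed configuration (not the real config.yml)
cfg = {
    'return_period': {'threshold': 5},
    'study_period': {'start': '2020-10-14', 'end': None},
}

# section and key present: the configured value is returned
rp = cfg.get('return_period', {}).get('threshold', 2)

# section present, key missing: the inner default is returned
reducing_factor = cfg.get('return_period', {}).get('reducing_factor', None)

# section missing: the empty dict stands in for it, so the inner default is returned
area_threshold = cfg.get('reporting_points', {}).get('area', 500)

# key present but explicitly empty: the default is NOT used
end = cfg.get('study_period', {}).get('end', '2022-12-31')

print(rp, reducing_factor, area_threshold, end)
```

One consequence of this pattern is that a key left empty in the YAML file is parsed as `None` and returned as such, so defaults only apply to keys that are absent altogether.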

## 2 Stations

Load the table with all EFAS fixed reporting points and keep only those points for which discharge data will be extracted.

In [3]:
# load table of fixed reporting points
stations = pd.read_csv(file_in_stations, index_col='station_id')
# stations.index = stations.index.astype(str)
stations.index.name = 'id'
# filter stations and fields
mask = (stations['DrainingArea.km2.LDD'] >= area_threshold) & (stations.FixedRepPoint == True)
stations = stations.loc[mask, ['StationName', 'LisfloodX', 'LisfloodY', 'DrainingArea.km2.LDD', 'Catchment', 'River', 'EC_Catchments', 'Country code']]
stations.columns = ['name', 'X', 'Y', 'area', 'subcatchment', 'river', 'catchment', 'country']
stations[['strahler', 'pfafstetter']] = np.nan

In [4]:
# load shapefile with edited river and catchment names
points_edited = gpd.read_file('../data/GIS/fixed_report_points_500.shp')
points_edited.set_index('station_id', inplace=True, drop=True)
points_edited.index = points_edited.index.astype(int)
points_edited = points_edited[['StationNam', 'LisfloodX', 'LisfloodY', 'DrainingAr', 'Subcatchme',
                               'River', 'Catchment', 'Country co', 'strahler', 'pfafstette']]
points_edited.columns = stations.columns
# select points with a Pfafstetter code
mask = points_edited.pfafstetter.isnull()
points_edited = points_edited.loc[~mask]

In [5]:
# correct names of catchments and rivers
ids = sorted(set(stations.index).intersection(points_edited.index))
stations = stations.loc[ids]
for stn_id in ids:
    for col in ['subcatchment', 'river', 'catchment']:
        # NaN never compares equal to anything, so missing values are tested with pd.notna
        if pd.notna(points_edited.loc[stn_id, col]):
            stations.loc[stn_id, col] = points_edited.loc[stn_id, col]

# add subcatchment and river order
stations.loc[ids, ['strahler', 'pfafstetter']] = points_edited.loc[ids, ['strahler', 'pfafstetter']]

# rename columns
#stations.columns = ['name', 'X', 'Y', 'area', 'subcatchment', 'river', 'catchment', 'country', 'strahler', 'subcatchment_order']

print('no. stations:\t{0}'.format(stations.shape[0]))

no. stations:	2371


In [6]:
# xarray DataArrays with the station coordinates that will be used to extract data
x = xr.DataArray(stations.X, dims='id')
y = xr.DataArray(stations.Y, dims='id')
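
Because both coordinate arrays share the dimension `id`, passing them to `.sel()` selects one grid cell per station (point-wise indexing) instead of the full rectangle of all X and Y combinations. The following is a minimal sketch on a toy grid; the grid values, station IDs and coordinates are made up for illustration.

```python
import numpy as np
import pandas as pd
import xarray as xr

# toy 3 x 4 grid standing in for a single time step of the discharge map
grid = xr.DataArray(
    np.arange(12).reshape(3, 4),
    coords={'y': [10.0, 20.0, 30.0], 'x': [0.0, 10.0, 20.0, 30.0]},
    dims=('y', 'x'),
)

# hypothetical stations whose coordinates do not fall exactly on grid nodes
stations = pd.DataFrame(
    {'X': [9.0, 31.0], 'Y': [21.0, 11.0]},
    index=pd.Index([101, 202], name='id'),
)

# the Series index becomes the 'id' coordinate of each DataArray
x = xr.DataArray(stations.X, dims='id')
y = xr.DataArray(stations.Y, dims='id')

# one value per station: the nearest grid cell to each (X, Y) pair
points = grid.sel(x=x, y=y, method='nearest')
print(points.dims, points.values)
```

The result has the single dimension `id`, which is what allows the later cells to loop over `dis.id.data` and write one file per station.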

## 3 Discharge data

This section loads the EFAS discharge reanalysis for the complete EFAS domain and extracts from it only the discharge time series for the previously selected reporting points and the study period. The time series are saved as one NetCDF file per station.

In [7]:
var = 'discharge'

# output folder
path_out = cfg.get('paths', {}).get('output', {}).get(var, {}).get('reanalysis', f'../data/{var}/reanalysis/')
os.makedirs(path_out, exist_ok=True)

# load dataset (lazily) and extract variable discharge
ds = xr.open_mfdataset(f'{path_in}*.nc')
dis = ds['dis06']

# trim data to the study period
dis = dis.sel(time=slice(start, end))

# extract discharge for the selected stations
dis = dis.sel(x=x, y=y, method='nearest')
dis = dis.drop_vars(['x', 'y', 'step', 'surface', 'latitude', 'longitude', 'valid_time'])

# compute the lazy DataArray, then release the source files
dis = dis.compute()
ds.close()

# add 6 h to the timesteps
dis = dis.rename({'time': 'datetime'})
dis['datetime'] = dis.datetime + np.timedelta64(6, 'h')

# save extraction as NetCDF files
dis.name = var
for stn in tqdm_notebook(dis.id.data):
    dis.sel(id=stn).to_netcdf(f'{path_out}{stn:>04}.nc')


  0%|          | 0/2371 [00:00<?, ?it/s]

## 4 Exceedance

### 4.1 Discharge thresholds

The discharge thresholds are the discharge values for return periods of 1.5, 2, 5, 10, 20, 50, 100, 200 and 500 years. The data are supplied in a NetCDF file that covers the whole river network in Europe. This file is loaded as an _xarray_ Dataset and the values corresponding to the selected reporting points are extracted.

In [8]:
# load thresholds
thresholds = xr.open_dataset(file_thresholds)

# extract thresholds for the selected stations
thresholds = thresholds.sel(x=x, y=y, method='nearest')
thresholds = thresholds.drop_vars(['x', 'y'])

# add thresholds to the DataFrame of stations
for var, da in thresholds.items():
    stations.loc[thresholds.id, var] = da.values.round(1)

# export DataFrame of stations
stations.to_parquet(file_out_stations)

### 4.2 Exceedance over threshold

This block of code computes the exceedances over the selected return period (5 years by default) from the discharge time series and thresholds that were previously extracted. The results are saved as one NetCDF file per station, to be used in the subsequent analysis.

In [9]:
var = 'exceedance'

# output folder
path_out = cfg.get('paths', {}).get('output', {}).get(var, {}).get('reanalysis', f'../data/{var}/reanalysis/')
os.makedirs(path_out, exist_ok=True)

# compute exceedance
thr = f'rl{rp}'
exceedance = dis >= thresholds[thr]

if reducing_factor is not None:
    # compute exceedance over the reduced threshold;
    # reducing_factor is applied as a fraction subtracted from 1 (e.g. 0.05 gives 0.95 * threshold)
    exceedance_buffer = dis >= thresholds[thr] * (1 - reducing_factor)

    # create a 3-class exceedance DataArray:
    # 0: non-exceedance
    # 1: exceedance over the reduced threshold
    # 2: exceedance over the threshold
    exceedance = np.maximum(exceedance.astype(int) * 2, exceedance_buffer.astype(int))

# save as NetCDF files
exceedance.name = var
for stn in tqdm_notebook(exceedance.id.data):
    exceedance.sel(id=stn).to_netcdf(f'{path_out}{stn:>04}.nc')


  0%|          | 0/2371 [00:00<?, ?it/s]
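
The exceedance cell relies on xarray aligning the thresholds (one value per `id`) with the discharge (one value per `datetime` and `id`) by dimension name. The sketch below reproduces that logic on toy data so the 3-class result can be checked by eye. All values, station IDs and the name `buffer` are assumptions made for illustration; `buffer` plays the role that `reducing_factor` has in the expression `1 - reducing_factor`.

```python
import numpy as np
import xarray as xr

# toy discharge: 3 time steps x 2 stations
times = np.array(['2020-01-01T06', '2020-01-01T12', '2020-01-01T18'],
                 dtype='datetime64[ns]')
dis = xr.DataArray(
    [[80.0, 40.0], [97.0, 49.0], [120.0, 55.0]],
    coords={'datetime': times, 'id': [101, 202]},
    dims=('datetime', 'id'),
)

# hypothetical return levels, one per station
q_rp = xr.DataArray([100.0, 50.0], coords={'id': [101, 202]}, dims='id')

# assumed fractional buffer below the threshold
buffer = 0.05

# the comparison broadcasts q_rp along 'datetime' automatically
exceedance = dis >= q_rp
exceedance_buffer = dis >= q_rp * (1 - buffer)

# 0: below both thresholds, 1: over the reduced one only, 2: over the actual one
classes = np.maximum(exceedance.astype(int) * 2, exceedance_buffer.astype(int))
print(classes.values)
```

Multiplying the strict exceedance by 2 before taking the maximum is what keeps class 2 from being overwritten by class 1, since every time step over the actual threshold is also over the reduced one.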