# Reanalysis: exceedance and events
***

__Author__: Chus Casado<br>
__Date__: 23-01-2023<br>

__Introduction__:<br>
The objective of this notebook is to count the number of flood events that occurred in the 2-year period from 14-10-2020 to 09-10-2022. I consider a flood event to occur each time the discharge time series of a reporting point rises above the discharge of the 5-year return period. This count is not the same as the number of timesteps at which discharge exceeds the threshold: for every exceedance period we count one event, no matter whether it lasted 6 h or several days.

Criteria used to select an event:

* [x] A period of time in which discharge exceeds the 5-year return period.
* [ ] That period of time must last for a fixed number of timesteps.
* [x] Between two events, discharge must be at some point lower than the 2-year return period.

__Tasks to do__:<br>
* [ ] Remove stations too close. 
    * For that, the river assigned to each station would need to be correct (I saw errors in the Danube, Ebro, Duero, Tajo).
    * What if we use only calibration points? They avoid stations with less than 10% increment in catchment area.
* [ ] Check observed data for the station 2996 Burguillo

In [1]:
import os
import glob
import numpy as np
import pandas as pd
import xarray as xr
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
import cartopy.crs as ccrs
import cartopy.feature as cf
from datetime import datetime, timedelta
import seaborn as sns

path_root = os.getcwd()

import warnings
warnings.filterwarnings("ignore")

os.environ['USE_PYGEOS'] = '0'
import geopandas as gpd

os.chdir('../py/')
from notifications import *
os.chdir(path_root)

## 1 Data

Three types of data are used:
* A table (CSV) of reporting points.
* A NetCDF file with the discharge thresholds for the whole EFAS domain.
* A set of CSV files with the reanalysis discharge timeseries extracted at the reporting points.

### 1.1 Reporting points

Load the table with all EFAS reporting points and filter the points for which discharge data will be extracted. At the moment, only reporting points fulfilling the following criteria are selected:
* Catchment area larger than 'area_threshold' (500 km2).
* The point is considered a fixed reporting point.
* The point was used for the calibration of LISFLOOD. That means that the field 'EC_Calib' is different from 0 or NaN.
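
The last criterion involves a test for missing values, which is easy to get wrong in pandas. The toy example below is only an illustrative sketch (the table `toy` and its column names are made up, not the real station table): it shows that comparing against `np.nan` never filters anything, while `notna()` does.

```python
import numpy as np
import pandas as pd

# hypothetical miniature station table (names and values are invented)
toy = pd.DataFrame({'area': [600, 450, 800, 1200],
                    'fixed': [True, True, False, True],
                    'calib': [1.0, 2.0, 1.0, np.nan]},
                   index=['a', 'b', 'c', 'd'])

# NaN is never equal to anything, so this comparison is True for every row
naive = toy.calib != np.nan

# the three criteria, with the missing-value test done by notna()
mask = (toy.area >= 500) & toy.fixed & (toy.calib != 0) & toy.calib.notna()
selected = toy.index[mask].tolist()
print(selected)
```

Only station `a` passes all three criteria; station `d` would slip through if the `!= np.nan` comparison were used instead of `notna()`.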

In [4]:
# area threshold
area = 500

# load table of fixed reporting points
stations = pd.read_csv('../data/Station-2022-10-27v12.csv', index_col='station_id')
stations.index = stations.index.astype(str)
# filter stations and fields (NaN must be tested with notna(): `!= np.nan` is always True)
mask = (stations['DrainingArea.km2.LDD'] >= area) & (stations.FixedRepPoint == True) & ((stations.EC_calib != 0) & stations.EC_calib.notna())
stations = stations.loc[mask, ['StationName', 'LisfloodX', 'LisfloodY', 'DrainingArea.km2.LDD', 'Catchment', 'River', 'EC_Catchments', 'Country code']]
stations.columns = ['name', 'X', 'Y', 'area', 'subcatchment', 'river', 'catchment', 'country']
stations[['strahler', 'pfafstetter']] = np.nan

# load shapefile with edited river and catchment names
points_edited = gpd.read_file('../data/GIS/fixed_report_points_500.shp')
points_edited.set_index('station_id', inplace=True, drop=True)
points_edited = points_edited[['StationNam', 'LisfloodX', 'LisfloodY', 'DrainingAr', 'Subcatchme',
                               'River', 'Catchment', 'Country co', 'strahler', 'pfafstette']]
points_edited.columns = stations.columns

# correct names of catchments and rivers
ids = list(set(stations.index).intersection(points_edited.index))
for stn in ids:
    for var in ['subcatchment', 'river', 'catchment']:
        # NaN never compares equal to itself, so missing values must be tested with pd.notna()
        if pd.notna(points_edited.loc[stn, var]):
            stations.loc[stn, var] = points_edited.loc[stn, var]
        
# add subcatchment and river order
stations.loc[ids, ['strahler', 'pfafstetter']] = points_edited.loc[ids, ['strahler', 'pfafstetter']]

# rename columns
#stations.columns = ['name', 'X', 'Y', 'area', 'subcatchment', 'river', 'catchment', 'country', 'strahler', 'subcatchment_order']

print('no. stations:\t{0}'.format(stations.shape[0]))

no. stations:	2219


#### Analysis of autocorrelation across stations

In this section we will explore all reporting points river by river, and we will remove the points whose catchment area does not exceed that of the preceding station by at least a threshold (10 %).
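
The function `filter_points` lives in the imported module and its body is not shown here. The snippet below is therefore only a rough sketch of how such a filter could work, under two assumptions of mine: stations are compared within each river after sorting by catchment area, and each station is compared against the last station that was kept. The helper name `filter_by_area_increase` and the toy table are hypothetical.

```python
import pandas as pd

def filter_by_area_increase(points, threshold=0.1):
    """Keep a station only if its catchment area exceeds that of the
    last kept station in the same river by at least `threshold` (hypothetical helper)."""
    keep = []
    for _, group in points.groupby('river'):
        last_area = None
        # walk the river from the smallest to the largest catchment area
        for stn, area in group['area'].sort_values().items():
            if last_area is None or area >= last_area * (1 + threshold):
                keep.append(stn)
                last_area = area
    return points.loc[keep]

# invented example: s2 adds only 4 % of area to s1, so it is dropped
toy = pd.DataFrame({'river': ['A', 'A', 'A', 'B'],
                    'area': [500, 520, 600, 1000]},
                   index=['s1', 's2', 's3', 's4'])
kept = filter_by_area_increase(toy, threshold=0.1).index.tolist()
print(kept)
```

This sketch ignores the river topology (confluences, tributaries with the same name), which is one reason why correct river names matter for this step.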

In [6]:
# filter stations according to increase in catchment area
# stations_sel = filter_points(stations_sel, threshold=.0)

Applying the threshold on the increase of catchment area we have removed more than 300 reporting points.

### 1.2 Discharge thresholds

The discharge thresholds are the discharge values for return periods of 1.5, 2, 5, 10, 20, 50, 100, 200 and 500 years. The data are supplied in a NetCDF file that covers the whole river network in Europe. This NetCDF file is loaded as an _xarray_ dataset, and the values corresponding to the selected reporting points are extracted and added to the reporting points dataframe.

In [7]:
# load thresholds and extract 5-year discharge
path_reanalysis = f'../data/CDS/thresholds/'
thresholds = xr.open_dataset(f'{path_reanalysis}return_levels.nc')

# coordinate arrays necessary to extract the data
x = xr.DataArray(stations.X, dims='id')
y = xr.DataArray(stations.Y, dims='id')
    
# extract thresholds for each return period and station
variables = list(thresholds.keys())
for var in variables:
    # extract X-year discharge for the stations
    da = thresholds[var].sel(x=x, y=y, method='nearest')
    
    # add threshold to the stations data frame
    stations[var] = da.data

### 1.3 Discharge reanalysis

This data represents EFAS simulation results when forced with observed meteorological data. It will be regarded as our observed discharge, even though it is modelled.

The original NetCDF files were previously preprocessed in another [notebook](2_1_reanalysis_preprocessing.ipynb). As a result, we have for each year a CSV file with the discharge timeseries for the EFAS reporting points. Since forecast discharge for EFAS v4.0 is only available from 14-10-2020 at 12 pm, this will be the starting point of our analysis.

In [10]:
# load discharge reanalysis data: one CSV file per year, concatenated along the time axis
# (files are sorted so that the resulting index is in chronological order)
files = sorted(glob.glob('../data/CDS/reanalysis/*202*.csv'))
dis = pd.concat([pd.read_csv(file, parse_dates=True, index_col=0) for file in files], axis=0)
    
# cut the timeseries from 14-10-2020 12 pm and the selected reporting points
dis = dis.loc['2020-10-14 12:00:00':, stations.index]

print('Discharge timeseries:\n{0}\ttimesteps\n{1}\tstations'.format(*dis.shape))

Discharge timeseries:
3231	timesteps
2219	stations


### 1.4 Exceedance reanalysis

In [29]:
# return period
rp = 5

# compute exceedance
exc = dis >= stations[f'rl{rp}']

# export
exc.to_parquet(f'../data/exceedance/reanalysis/exceedance_rl{rp}.parquet')

## 2 Events

First, I will count the number of events that exceeded the 5-year return period, without taking into account if they belong to the same flood. Then, I will try different combinations of thresholds with the objective of skipping exceedances of the 5-year return period that correspond to a single flood event (for instance, discharge goes up and down around the threshold).

I have created a function called 'identify_events' that includes both procedures.
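
The body of `identify_events` is defined in the imported module and is not shown in this notebook. To make the two procedures concrete, here is a minimal sketch for a single station. It assumes that an event is flagged only at its onset (so that summing the flags counts events) and that a new event can only start after discharge has dropped below the lower bound. The name `identify_events_sketch` and the toy series are hypothetical, and the real function may differ.

```python
import pandas as pd

def identify_events_sketch(discharge, upper, lower=None):
    """Flag the onset of each event in a discharge series (hypothetical helper).

    With `lower=None` every new exceedance of `upper` is a new event.
    """
    if lower is None:
        lower = upper
    onsets = pd.Series(False, index=discharge.index)
    armed = True  # a new event is only possible once discharge has been below `lower`
    for t, q in discharge.items():
        if armed and q >= upper:
            onsets.loc[t] = True
            armed = False
        elif not armed and q < lower:
            armed = True
    return onsets

# invented series: discharge oscillates around the upper threshold (5)
q = pd.Series([1, 6, 4, 7, 1, 8])
single = identify_events_sketch(q, upper=5)           # single threshold
double = identify_events_sketch(q, upper=5, lower=2)  # double threshold
print(single.sum(), double.sum())
```

In the toy series, the second exceedance (value 7) is counted as a separate event with the single threshold, but is merged into the first event with the double threshold because discharge did not drop below the lower bound in between.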

In [24]:
# dictionary where all the results will be saved
events = {}

### 2.1 Single exceedance threshold
#### 2.1.1 Exceeding Q5

In [28]:
# return period
rp = 5

# IDENTIFY EVENTS
# ---------------

# identify events only with upper bound
key = str(rp)
events[key] = identify_events(dis, stations[f'rl{rp}'])
events[key].to_parquet(f'../data/exceedance/reanalysis/events_rl{rp}.parquet')

In [None]:
# count number of events per station
col = f'n_events_{rp}'
stations_sel[col] = events[key].sum()
print('no. stations with at least one event:\t{0}'.format((stations_sel[col] > 0).sum()))
print('total no. of events:\t\t\t{0}'.format(stations_sel[col].sum()))

# PLOT MAP
# --------

out_folder = f'results/no_events/{rp}/'
# create the output folder if it does not exist yet
os.makedirs(out_folder, exist_ok=True)
plot_events_map(stations_sel.X, stations_sel.Y, stations_sel[col], save=f'{out_folder}no_events_map.png')

# PLOT TIMESERIES
# ---------------

# Select the stations in a catchment
catchment = 'Ebro'
mask = (stations_sel.catchment == catchment) & (stations_sel[col] > 0)
stns_catchment = stations_sel.loc[mask].sort_values(['river', 'area']).index

# plot timeseries of the station with the most events
# stn = stations_sel[col].idxmax() # in total
# stn = stn_catchment[col].idxmax() # in that catchment
# plot_events_timeseries(dis[stn], events[key][stn], thresholds=stations_sel.loc[stn, ['rl1.5', 'rl2', 'rl5', 'rl20']],
#                        title='{0} - {1} ({2})'.format(stn, *stations_sel.loc[stn, ['StationName', 'Catchment']]),
#                        save=None)

# plot timeseries for all the stations in the catchment
for stn in stns_catchment:
    plot_events_timeseries(dis[stn], events[key][stn], thresholds=stations_sel.loc[stn, ['rl1.5', 'rl2', 'rl5', 'rl20']],
                           title='{0} - {1} ({2})'.format(stn, *stations_sel.loc[stn, ['name', 'catchment']]),
                          )#save=f'{out_folder}/no_events_{stn.zfill(4)}.png')

### 2.2 Double threshold
#### 2.2.1 Exceeding Q5 and below Q2

In [None]:
# return periods for the upper and lower bound
ub = 5
lb = 2

# IDENTIFY EVENTS
# ---------------

# identify events with both the upper and the lower bound
key = f'{ub}_{lb}'
events[key] = identify_events(dis, stations_sel[f'rl{ub}'], stations_sel[f'rl{lb}'])

# count number of events per station
col = f'n_events_{ub}_{lb}'
stations_sel[col] = events[key].sum()
print('no. stations with at least one event:\t{0}\t({1} with fewer events)'.format((stations_sel[col] > 0).sum(),
                                                                      ((events['5'].sum() - events[key].sum()) > 0).sum()))
print('total no. of events:\t\t\t{0}\t({1} fewer)'.format(stations_sel[col].sum(),
                                                         events['5'].sum().sum() - events[key].sum().sum()))

# PLOT MAP
# --------

out_folder = f'results/no_events/{ub}-{lb}/'
# create the output folder if it does not exist yet
os.makedirs(out_folder, exist_ok=True)

plot_events_map(stations_sel.X, stations_sel.Y, stations_sel[col], save=f'{out_folder}no_events_map.png')

# PLOT TIMESERIES
# ---------------

# select the stations with fewer events when the lower bound is applied
dif = events['5'].sum() - events[key].sum()
dif = dif[dif > 0]
# station with the largest difference in the number of events (not used in the loop below)
stn = dif.idxmax()

for stn in dif.index:
    plot_events_timeseries(dis[stn], events[key][stn], events['5'][stn], thresholds=stations_sel.loc[stn, ['rl1.5', 'rl2', 'rl5', 'rl20']],
                           title='{0} - {1} ({2})'.format(stn, *stations_sel.loc[stn, ['name', 'catchment']]),
                          )#save=f'{out_folder}/no_events_{stn.zfill(4)}.png')

### 2.3 Comparison

In [None]:
catchment = 'Ebro'
stn_catchment = stations_sel.loc[stations_sel.catchment == catchment].sort_values(['area', 'river'], ascending=False)

print('no. stations:\t{0}'.format(stn_catchment.shape[0]))

I remove the stations in the catchment that are highly correlated with another station.

In [None]:
# correlation matrix
corr = dis[stn_catchment.index].corr(method='spearman').abs()
rho = .85

# keep only the strict lower triangle (the diagonal and the upper triangle are set to NaN)
corr = corr.where(np.tril(np.ones(corr.shape, dtype=bool), k=-1))

Let's see which rivers have highly correlated stations ($\rho \ge 0.85$). For each station with at least one correlation higher than the threshold, I will print its ID, its river and the names of the rivers to which its highly correlated stations belong. I want to check whether correlation appears only between nearby rivers or also between rivers far apart in the catchment.
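
As a side note, the highly correlated pairs can also be listed directly from the lower triangle of the correlation matrix. The snippet below is a small sketch on invented data (the series `u`, `v` and `w` are not real stations): `v` is a monotonic transformation of `u`, so their Spearman correlation is 1, while `w` is a shuffled series.

```python
import numpy as np
import pandas as pd

# invented discharge-like series
u = np.arange(10)
toy = pd.DataFrame({'u': u,
                    'v': u ** 2,                          # same ranks as u
                    'w': [5, 2, 8, 1, 9, 3, 7, 0, 6, 4]}) # shuffled ranks

rho = .85
corr_toy = toy.corr(method='spearman').abs()

# keep each pair only once: strict lower triangle of the matrix
lower = corr_toy.where(np.tril(np.ones(corr_toy.shape, dtype=bool), k=-1))

# long format: one row per pair of stations, filtered by the threshold
pairs = lower.stack()
pairs = pairs[pairs > rho]
print(pairs)
```

Each entry of `pairs` is indexed by the two station IDs, which makes it straightforward to look up the rivers of both stations afterwards.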

In [None]:
for stn in corr.index:
    exc = corr.loc[stn] > rho
    if exc.any():
        stns_corr = stn_catchment.loc[exc]
        rivers = stns_corr.river.unique()
        if len(rivers) > 1:
            print(stn, stn_catchment.loc[stn, 'river'], '\t', rivers)

There's a cluster of correlated stations in the Aragon river and its tributaries: Arga (Araquil), Esca and Irati. The Ega and Zadorra rivers, close to the Aragon, also belong to this group of stations. More interestingly, there are stations in the Ebro river that belong to this cluster.

There's a second cluster of correlated stations in the rivers Noguera Pallaresa, Esera and Noguera Ribagorzana (all of them rivers in the Segre subcatchment).

In [None]:
# show correlation between stations in the Ebro river
compare_discharge(dis, ['606', '2849'], stations_sel.rl5)
compare_discharge(dis, ['606', '644'], stations_sel.rl5)
compare_discharge(dis, ['606', '652'], stations_sel.rl5)

In [None]:
# remove highly correlated stations
rho = .85
corr_ = corr.copy()
for stn in corr.index[1:]:
    if (corr_.loc[stn] >= rho).any():
        corr_.drop(stn, axis=0, inplace=True)
        corr_.drop(stn, axis=1, inplace=True)

# plot correlation matrix
sns.heatmap(corr_, xticklabels=corr_.index, yticklabels=corr_.columns, vmin=0, vmax=1, cmap='viridis');

print('no. stations selected:\t{0}'.format(corr_.shape[0]))

Once we have filtered out highly correlated stations, we can check how many flood events remain. 

In [None]:
vnts = events['5'].loc[:, corr_.index]
print('no. total events:\t{0}'.format(vnts.sum().sum()))

mask_col = vnts.any(axis=0)
mask_row = vnts.any(axis=1)
vnts = vnts.loc[mask_row, mask_col]
print('no. stations:\t\t{0}'.format(vnts.shape[1]))
print('no. timesteps:\t\t{0}'.format(vnts.shape[0]))

There is a total of 15 events that happened in 11 different stations. There's one event that occurred at the same time in two stations; let's find it.
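
The search for simultaneous events can also be written without an explicit loop over timesteps. The following is a sketch on an invented boolean table (rows are timesteps, columns are stations, `True` marks the onset of an event); it is not the real `vnts` table.

```python
import pandas as pd

# invented event table: s2 and s3 have an event at the same timestep
toy = pd.DataFrame({'s1': [True, False, False],
                    's2': [False, True, False],
                    's3': [False, True, True]},
                   index=pd.to_datetime(['2021-01-10', '2021-09-01', '2021-12-05']))

# number of stations with an event at each timestep
n_stations = toy.sum(axis=1)

# positions (row numbers) of the timesteps shared by more than one station
positions = [i for i, n in enumerate(n_stations) if n > 1]

# stations involved in each simultaneous event
simultaneous = toy.loc[n_stations > 1]
stations_involved = {t: row[row].index.tolist() for t, row in simultaneous.iterrows()}
print(positions, stations_involved)
```

The list `positions` plays the same role as `ts` in the loop of this notebook: it holds the row numbers that can be passed to `iloc` to inspect a specific timestep.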

In [None]:
ts = []
for t in range(vnts.shape[0]):
    stns = vnts.loc[:, vnts.iloc[t]].columns.tolist()
    if len(stns) > 1:
        ts.append(t)
ts

In [None]:
t = 3

stns = vnts.loc[:, vnts.iloc[t]].columns.tolist()
compare_discharge(dis, stns, stations_sel.rl5)
stations_sel.loc[stns]

The stations belong to the rivers Jiloca and Huerva, two close rivers, but in different subcatchments. That's why they suffered a flood event at the same time, but their correlation is rather low.

I wonder how many stations remain in each river after removing highly correlated stations.

In [None]:
# stations in the catchment selected according to correlation
stns_corr = stations_sel.loc[corr_.index]

# compare data for stations in the same river
stns_by_river = stns_corr.river.value_counts()
for river in stns_by_river.index:
    if stns_by_river[river] > 1:
        stns_river = stns_corr.loc[stns_corr.river == river].index
        if len(stns_river) > 1:
            compare_discharge(dis, stns_river, stations_sel.rl5, title=river, alpha=.1)

#### Overlapping among events

In [None]:
# discharge series for the catchment
dis_cat = dis[stn_catchment.index]
# discharge series for the stations selected according to correlation
dis_corr = dis[stns_corr.index]

In [None]:
exceedances_timeline(dis, stn_catchment, thresholds=['rl5', 'rl20'], figsize=(12, 8))

In [None]:
exceedances_timeline(dis, stns_corr, thresholds=['rl5', 'rl20'], yticks=True, figsize=(12, 8))

In [None]:
stns = ['618', '613', '633', '642', '4548']
compare_discharge(dis, stns, stations_sel.rl5)
stn_catchment.loc[stns, ['area', 'catchment', 'river']]

These five stations belong to the Jalon (Jalon, Piedra and Jiloca rivers) and the Huerva subcatchments, which are adjacent. All of them had a flood event in September 2021, so the events are evidently related. The Spearman correlation coefficient is high in some cases (up to 0.76), but below the threshold used to discard stations.

In [None]:
stns = ['606', '652', '594', '615']
compare_discharge(dis, stns, stations_sel.rl5)
stn_catchment.loc[stns, ['area', 'catchment', 'river']]