# Skill assessment - computation
***

**Author**: Chus Casado Rodríguez<br>
**Date**: 17-05-2023<br>


**Introduction**:<br>
In this notebook I will analyse the skill of EFAS in predicting flood events in general, i.e., checking whether events were predicted at some point in time, regardless of either the offset or the duration of the event.

**Questions**:<br>

* [ ] Take into account the model spread?
* [ ] Aggregate results by river/administrative area? EFAS aims at alerting administrations about incoming events in their administrative area; shouldn't that aggregation be included in the results?
* [ ] Remove extremely poorly performing stations?

**Pending tasks**:<br>

* [x] Weighting the model average by the Brier score?
* [x] Sort stations by catchment area (or other order)?
* [x] Persistence
* [ ] Analyse only the periods/stations close to an observed event and compute f1 for this subset. Later on, another metric (e.g., the false alarm ratio) must be computed on the complementary subset of the data to account for false positives.
* [ ] Rename approach 'current' as '1_deterministic_+_1_probabilistic'
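
The f1 score and the false alarm ratio mentioned in the pending tasks can both be derived from the counts of hits, misses and false alarms. The following is a minimal sketch, assuming those counts are available as plain integers; the function names `f1_score` and `false_alarm_ratio` are illustrative and are not taken from `computations.py`.

```python
def f1_score(hits: int, misses: int, false_alarms: int) -> float:
    """Harmonic mean of precision and recall, written in terms of counts."""
    denominator = 2 * hits + misses + false_alarms
    return 2 * hits / denominator if denominator > 0 else float('nan')


def false_alarm_ratio(hits: int, false_alarms: int) -> float:
    """Fraction of issued notifications that were not followed by an event."""
    issued = hits + false_alarms
    return false_alarms / issued if issued > 0 else float('nan')


# made-up counts, only to show the calls
print(f1_score(hits=6, misses=2, false_alarms=4))
print(false_alarm_ratio(hits=6, false_alarms=4))
```

Both functions return NaN when the denominator is zero, so stations without events or notifications can be told apart from stations with a score of 0.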


**Interesting links**<br>
[Evaluation metrics for imbalanced classification](https://machinelearningmastery.com/tour-of-evaluation-metrics-for-imbalanced-classification/)<br>
[Cross entropy for machine learning](https://machinelearningmastery.com/cross-entropy-for-machine-learning/)<br>
[Probability metrics for imbalanced classification](https://machinelearningmastery.com/probability-metrics-for-imbalanced-classification/)<br>
[ROC curves and precision-recall curves for imbalanced classification](https://machinelearningmastery.com/roc-curves-and-precision-recall-curves-for-imbalanced-classification/)<br>
[Instructions for sending EFAS flood notifications](https://efascom.smhi.se/confluence/display/EDC/Instructions+for+sending%2C+upgrading+and+deactivating+EFAS+Flood+Notifications)

In [1]:
import os
import sys
import operator
import glob
import numpy as np
import pandas as pd
import xarray as xr
# import matplotlib as mpl
import matplotlib.pyplot as plt
import cartopy.crs as ccrs
import cartopy.feature as cf
from datetime import datetime, timedelta
from tqdm import tqdm_notebook

path_root = os.getcwd()

import warnings
warnings.filterwarnings("ignore")

os.chdir('../py/')
from computations import *
from plots import *
os.chdir(path_root)

## 1 Configuration

In [2]:
# minimum catchment area
area_threshold = 500

# disaggregate the analysis by seasons?
seasonality = False

### 1.1 Notification criteria

#### Probability threshold

In [3]:
# probability thresholds
thresholds = np.arange(.05, .96, .025).round(3)
# thresholds = np.round(sigmoid(np.linspace(-10, 10, 50)), 5)
probabilities = xr.DataArray(thresholds, dims=['probability'], coords={'probability': thresholds})
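
Storing the thresholds along their own `probability` dimension lets a forecast be compared against every threshold in one operation. The sketch below reproduces that idea with plain NumPy broadcasting instead of xarray; the forecast values are made up for illustration.

```python
import numpy as np

# same grid of probability thresholds as in the cell above
thresholds = np.arange(.05, .96, .025).round(3)

# hypothetical exceedance probabilities of 4 forecasts (made-up values)
forecast = np.array([0.0, 0.2, 0.55, 1.0])

# broadcasting adds a trailing axis: shape (forecast, probability)
exceeds = forecast[:, None] >= thresholds[None, :]

print(thresholds.size)       # number of thresholds
print(exceeds.shape)         # one row per forecast, one column per threshold
print(exceeds.sum(axis=1))   # number of thresholds met by each forecast
```

In the notebook the same comparison is written as `forecast >= probabilities`, where xarray aligns the arrays by dimension name rather than by axis position.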

#### Persistence

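A list of tuples with two values, read as `k/n`: the first value is the minimum number of positive forecasts, and the second value is the width of the rolling-sum window in which they must occur for a notification to be raised. For instance, `(2, 3)` requires 2 positives within 3 consecutive forecasts.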

In [4]:
persistence = [(1, 1), (2, 2), (2, 3), (3, 3), (2, 4), (3, 4), (4, 4)]
persistence = {'/'.join([str(i) for i in pers]): pers for pers in persistence}
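
The persistence criterion can be pictured as a rolling sum over consecutive forecasts. The sketch below assumes each tuple is read as "k positives out of n consecutive forecasts", which is the only reading under which a pair such as `(2, 3)` can ever be met. The helper `apply_persistence` is hypothetical and is not the `compute_events` function used later in the notebook.

```python
def apply_persistence(exceedance, min_positives, window):
    """Flag the forecasts at which at least `min_positives` of the last
    `window` forecasts (current one included) predicted an exceedance."""
    flags = []
    for i in range(len(exceedance)):
        recent = exceedance[max(0, i - window + 1): i + 1]
        flags.append(sum(recent) >= min_positives)
    return flags


# made-up sequence of consecutive forecasts (1 = exceedance predicted)
exceedance = [0, 1, 0, 1, 1, 0, 0]

for label, (k, n) in {'1/1': (1, 1), '2/3': (2, 3), '3/3': (3, 3)}.items():
    print(label, apply_persistence(exceedance, k, n))
```

Stricter criteria delay or suppress notifications: in this sequence `1/1` reacts to every positive forecast, while `3/3` never triggers because no three consecutive forecasts are positive.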

#### Leadtime

Notifications are only sent with a minimum leadtime (h).

In [5]:
min_leadtime = 'all'

### 1.2 Paths

In [6]:
# path where results will be saved
path_data = '../data/'
path_forecast = f'{path_data}exceedance/forecast/Q5_5percent_buffer/'
path_reanalysis = f'{path_data}exceedance/reanalysis/'
path_out = f'{path_data}hits/Q5_5percent_buffer/'
if seasonality:
    path_out += 'seasonal/'
os.makedirs(path_out, exist_ok=True)

## 2 Data

### 2.1 Stations

I load all the stations that were selected in a previous [notebook](3_0_select_stations.ipynb).

In [8]:
# load table of fixed reporting points
stations = pd.read_parquet(f'../results/reporting_points/reporting_points_over_{area_threshold}km2.parquet')

### 2.2 Exceedance reanalysis

In [None]:
# # load probability of exceeding the discharge threshold in the REANALYSIS data

# rean_exc = pd.read_parquet(f'{path_reanalysis}/exceedance_rl5_3classes.parquet')
# rean_exc.columns = rean_exc.columns.astype(int)
# # rean_exc = rean_exc.loc[pd.to_datetime(pred.datetime.data), stations.index.tolist()]

# # compute onsets of the flood events
# rean_onsets = rean_exc.astype(int).diff(axis=0) == 1
# rean_onsets.iloc[0,:] = rean_exc.iloc[0,:]

# # create a DataArray with observed threshold exceedance
# obs = df2da(rean_exc, dims=['id', 'datetime'], plot=False, figsize=(16, 20), title='observed exceedance')
# del rean_exc

# # expected probability of an exceedance
# obs = obs.astype(int)

# if seasonality:
#     obs = disaggregate_by_season(obs)

# print(obs.dims)
# print(obs.shape)

In [24]:
xr.open_dataarray(f'{path_reanalysis}0001.nc')

In [25]:
# load probability of exceeding the discharge threshold in the REANALYSIS data
files = glob.glob(f'{path_reanalysis}*.nc')
rean_exc = xr.open_mfdataset(files, combine='nested', concat_dim='id')['exceedance']

OSError: [Errno -51] NetCDF: Unknown file format: b'E:\\casadje\\GitHub\\EFAS_skill\\data\\exceedance\\reanalysis\\4466.nc'

In [15]:
# # load probability of exceeding the discharge threshold in the REANALYSIS data
# rean_exc = pd.read_csv(f'{path_reanalysis}exceedance_rl5_3classes.csv', parse_dates=True, index_col=0)
# rean_exc.columns = rean_exc.columns.astype(int)

# # create a DataArray with observed threshold exceedance
# rean_exc = df2da(rean_exc, ['id', 'datetime'])

if seasonality:
    rean_exc = disaggregate_by_season(rean_exc)

print(rean_exc.dims)
print(rean_exc.shape)

Frozen({'datetime': 7829, 'id': 2371})


AttributeError: 'Dataset' object has no attribute 'shape'

In [16]:
rean_exc

| | Array | Chunk |
|---|---|---|
| Bytes | 70.81 MiB | 30.58 kiB |
| Shape | (2371, 7829) | (1, 7829) |
| Dask graph | 2371 chunks in 7114 graph layers | |
| Data type | int32 numpy.ndarray | |


### 2.3 Exceedance forecast

In [17]:
# load probability of exceeding the discharge threshold in the FORECAST data
fore_exc = xr.open_mfdataset(f'{path_forecast}*.nc', combine='nested', concat_dim='id')
fore_exc['id'] = fore_exc.id.astype(int)

In [None]:
# reshape the DataArray of forecasted exceedance
fore_exc = xr.Dataset({label: reshape_DataArray(da, trim=True) for label, da in fore_exc.items()})
fore_exc = fore_exc.transpose('id', 'datetime', 'leadtime', 'model')

# extract starting and ending dates
if ('start' not in locals()) or ('end' not in locals()):
    start = pd.to_datetime(fore_exc.datetime.min().data)
    end = pd.to_datetime(fore_exc.datetime.max().data)

# recalculate the exceedance datasets to convert the 3 classes (>Q5, >0.95Q5, <0.95Q5) to only 2 (exceedance, non-exceedance)
rean_exc, fore_exc = recompute_exceedance(rean_exc.sel(datetime=slice(start, end)), fore_exc['high'], fore_exc['low'])

### 2.4 Weighting factors

In [None]:
# by the number of members
weights_member = xr.open_dataarray(f'{path_data}weights_member.nc')

# by the Brier score
weights_brier = xr.open_dataarray(f'{path_data}weights_brier.nc', engine='netcdf4')

# heatmap of weights
fig, axes = plt.subplots(nrows=2, figsize=(6, 3), constrained_layout=True, sharex=True, sharey=True)
Weights = xr.Dataset({'no. member': weights_member, 'Brier score': weights_brier})
for i, (ax, (var, da)) in enumerate(zip(axes, Weights.items())):
    htm = plot_DataArray(da, vmin=0, vmax=1, ax=ax, ytick_step=1, xtick_step=1, title=f'weighted by {var}', cbar_kws={'shrink': .66})
    if i == len(axes) - 1:
        ax.set_xlabel('leadtime (h)')

## 3 Computations
### 3.1 Hits, misses and false alarms

***

In [None]:
for stn in tqdm_notebook(stations.index):

    # check if the output file already exists
    file_out = f'{path_out}{stn:>04}.nc'
    if os.path.exists(file_out):
        continue
        
    # FORECAST EXCEEDANCE PROBABILITY
    forecast = fore_exc.sel(id=stn)
        

    # TOTAL PROBABILITY OF EXCEEDANCE

    # exceedance according to current criteria
    deterministic = (forecast.sel(model=['EUD', 'DWD']) >= probabilities).any('model')
    probabilistic = (forecast.sel(model=['EUE', 'COS']) >= probabilities).any('model')
    current = deterministic & probabilistic

    # exceedance according to mean over models
    model_mean = forecast.mean('model', skipna=True) >= probabilities

    # exceedance according to the mean over models weighted by the number of members
    member_weighted = forecast.weighted(weights_member).mean('model', skipna=True) >= probabilities

    # exceedance according to the mean over models weighted by the inverse Brier score
    brier_weighted = forecast.weighted(weights_brier.fillna(0)).mean('model', skipna=True) >= probabilities

    # merge all total probability approaches in a single DataArray
    total_exc = xr.Dataset({
                            'current': current,
                            'model_mean': model_mean,
                            'member_weighted': member_weighted,
                            'brier_weighted': brier_weighted,
                            }).to_array(dim='approach')

    del forecast

    # HITS, MISSES, FALSE ALARMS
      
    hits = {}
    for label, pers in persistence.items():

        # compute predicted events
        pred = compute_events(total_exc, persistence=pers, min_leadtime=min_leadtime)
               
        # disaggregate seasonally
        if seasonality:
            pred = disaggregate_by_season(pred)

        # compute hits, misses and false alarms
        if 'leadtime' in pred.dims:
            aux = compute_hits(rean_exc.sel(id=stn), pred, center=True, w=5)
        else:
            aux = compute_hits(rean_exc, pred, center=True, w=5)
        aux = aux.assign_coords(persistence=label)
        hits[label] = aux.expand_dims(dim='persistence')
        
    hits = xr.concat(hits.values(), dim='persistence')
    
    print(f'Exporting file {file_out}', end='\r')
    hits.to_netcdf(file_out)

    del pred, hits
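
The loop above derives the total exceedance in several ways before looking for events. The sketch below illustrates three of them for a single forecast and a single probability threshold, using plain NumPy. The probabilities and weights are made up, and the split into deterministic (`EUD`, `DWD`) and probabilistic (`EUE`, `COS`) models follows the cell above. It is only an illustration of the logic, not a replacement for the xarray code.

```python
import numpy as np

models = ['COS', 'DWD', 'EUD', 'EUE']
# made-up exceedance probabilities of one forecast; NaN = model not available
prob = np.array([0.2, 1.0, 0.0, np.nan])
# made-up weights, e.g. proportional to the number of members
weights = np.array([0.3, 0.05, 0.05, 0.6])
threshold = 0.3

is_det = np.isin(models, ['EUD', 'DWD'])
is_prob = np.isin(models, ['EUE', 'COS'])

# comparisons with NaN are False, so missing models never trigger
exceeds = prob >= threshold
# 'current': at least one deterministic AND one probabilistic model exceed
current = bool(exceeds[is_det].any() and exceeds[is_prob].any())

# 'model_mean': plain average over the available models
model_mean = bool(np.nanmean(prob) >= threshold)

# weighted mean: weights are renormalised over the available models
valid = ~np.isnan(prob)
weighted = np.sum(prob[valid] * weights[valid]) / np.sum(weights[valid])
member_weighted = bool(weighted >= threshold)

print(current, model_mean, member_weighted)
```

With these made-up numbers the approaches disagree: only the unweighted mean exceeds the threshold, which is the kind of difference the `approach` dimension is meant to expose.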

```Python
    # draft: obs_season and obs4s (observations disaggregated by season) are assumed to be defined beforehand
    hits4s = {}
    for label, pers in persistence.items():

        # compute predicted events
        pred = compute_events(total_exc, persistence=pers, min_leadtime=min_leadtime)
               
        # disaggregate seasonally
        if seasonality:
            pred4s = disaggregate_by_season(pred)
        
        hits_season = {}
        for season in obs_season.season.data:
            # compute hits, misses and false alarms
            if 'leadtime' in pred.dims:
                aux = compute_hits(obs4s.sel(id=stn, season=season), pred4s.sel(season=season), center=True, w=5)
            else:
                aux = compute_hits(obs4s.sel(season=season), pred4s.sel(season=season), center=True, w=5)
            aux = aux.assign_coords(season=season)
            hits_season[season] = aux.expand_dims(dim='season')
        hits_season = xr.concat(hits_season.values(), dim='season')
        hits_season = hits_season.assign_coords(persistence=label)
        hits4s[label] = hits_season.expand_dims(dim='persistence')

    hits4s = xr.concat(hits4s.values(), dim='persistence')
```

Does the result differ when `compute_hits` is applied to the complete DataArray (where season is a dimension) compared with applying it in a loop over seasons?

### 3.2 Number of observed events

In [None]:
# compute onsets of the flood events
rean_onsets = rean_exc.diff('datetime') == 1
rean_onsets = xr.concat((rean_exc.isel(datetime=0).astype(bool), rean_onsets), 'datetime')
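
The onset logic above can be checked on a short series. The sketch below uses NumPy instead of xarray, with a made-up exceedance series for one station.

```python
import numpy as np

# made-up exceedance series of one station (1 = above threshold)
exceedance = np.array([1, 1, 0, 0, 1, 1, 1, 0, 1])

# an onset is a 0 -> 1 transition between consecutive timesteps
onsets = np.diff(exceedance) == 1
# diff drops the first timestep: an exceedance there counts as an onset
onsets = np.concatenate(([bool(exceedance[0])], onsets))

print(onsets.astype(int))
print(onsets.sum())   # number of events in the series
```

The input must be numeric for this to work. On a boolean array `np.diff` computes an exclusive or, which flags ends of events as well as onsets, hence the `.astype(int)` in the 2-class cell further down.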

In [None]:
n_events_3c = rean_onsets.sum('datetime').to_pandas()

In [None]:
# save number of observed events
stations['n_events_obs'] = rean_onsets.sum('datetime')#.to_pandas()

print('No. stations with observed events:\t{0}'.format((stations.n_events_obs > 0).sum()))
print('No. observed events:\t\t\t{0}'.format(stations.n_events_obs.sum()))

# export the stations table
stations.to_parquet(f'../results/reporting_points/reporting_points_over_{area_threshold}km2.parquet')

# compute number of events per season
if seasonality:
    rean_onsets4s = disaggregate_by_season(rean_onsets, dim='datetime')
    cols = ['n_events_obs_winter', 'n_events_obs_spring', 'n_events_obs_summer', 'n_events_obs_autumn']
    stations[cols] = rean_onsets4s.sum('datetime').to_pandas().transpose()

***

In [None]:
# load probability of exceeding the discharge threshold in the REANALYSIS data
rean_exc_2c = pd.read_parquet(f'{path_reanalysis}exceedance_rl5.parquet')
rean_exc_2c.columns = rean_exc_2c.columns.astype(int)

# create a DataArray with observed threshold exceedance
rean_exc_2c = df2da(rean_exc_2c, ['id', 'datetime'])

# cut to the study period
rean_exc_2c = rean_exc_2c.sel(datetime=slice(start, end))

if seasonality:
    rean_exc_2c = disaggregate_by_season(rean_exc_2c)

print(rean_exc_2c.dims)
print(rean_exc_2c.shape)

In [None]:
# compute onsets of the flood events
rean_onsets_2c = rean_exc_2c.astype(int).diff('datetime') == 1
rean_onsets_2c = xr.concat((rean_exc_2c.isel(datetime=0).astype(bool), rean_onsets_2c), 'datetime')

In [None]:
n_events_2c = rean_onsets_2c.sum('datetime').to_pandas()

In [None]:
n_events_3c.sum(), n_events_2c.sum()

In [None]:
mask = (n_events_3c - n_events_2c) > 0

In [None]:
stn_diff = stations.loc[mask, ['subcatchment', 'river', 'catchment', 'country']].copy()
stn_diff['n_events_2c'] = n_events_2c[mask]
stn_diff['n_events_3c'] = n_events_3c[mask]

In [None]:
stn_diff.head()