# Forecast preprocessing
***

**Author**: Chus Casado<br>
**Date**: 22-05-2023<br>

**Introduction**:<br>
This notebook computes the probability that the forecast discharge exceeds the threshold associated with a specific return period.

The input data are:
* The discharge forecasts for the complete set of reporting points, all the numerical weather prediction (NWP) models and the complete study period. These data are stored in NetCDF format on a local hard disk; because of their size, they cannot be included in the GitHub repository.
* The discharge thresholds associated with each reporting point, from which the one for a specific return period (5 years by default) is used.

The output is the probability of exceeding the discharge threshold of the specified return period. It is a multidimensional array with dimensions station, NWP model, date-time and lead time. Two such arrays are actually computed: one with the probability of exceeding the specified threshold ($Q_{rp}$), and another with the probability of exceeding a slightly lower threshold ($0.95\cdot Q_{rp}$). The lower threshold is used to avoid false positives or false negatives when the forecast and the observation are very close to each other but fall on opposite sides of the discharge threshold.

The results are saved inside the repository in NetCDF format, one file per reporting point.
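
As a minimal sketch of the idea (not the repository's `compute_exceedance` function), the exceedance probability of one forecast can be taken as the fraction of ensemble members above the threshold. The discharge values, the threshold and the function name `exceedance_probability` are illustrative assumptions.

```python
import numpy as np

def exceedance_probability(discharge, threshold):
    """Fraction of ensemble members whose discharge exceeds the threshold.

    `discharge` has shape (member, leadtime); the result has shape (leadtime,).
    """
    return (np.asarray(discharge) > threshold).mean(axis=0)

# hypothetical forecast: 4 ensemble members x 3 lead times (m3/s)
discharge = np.array([[90., 98., 105.],
                      [85., 96., 101.],
                      [80., 99., 97.],
                      [70., 94., 99.]])
q_rp = 100.0  # assumed discharge threshold of the chosen return period

p_high = exceedance_probability(discharge, q_rp)         # over Q_rp
p_low = exceedance_probability(discharge, 0.95 * q_rp)   # over 0.95 * Q_rp
```

At the second lead time all the members fall just below $Q_{rp}$, so `p_high` is 0 while `p_low` is 0.75; this is the kind of near miss that the lower threshold is meant to capture.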

**Questions**:<br>


**Tasks to do**:<br>
* [ ] The `compute_exceedance` function could probably be improved so that the exceedance over both discharge thresholds is computed in a single pass. The function needs to load all the forecast discharge records, so the current procedure wastes time by loading them twice (once per discharge threshold).
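
One possible way to approach this task is to compare the discharge against both thresholds at once through broadcasting. The sketch below assumes that the discharge of one forecast is already in memory as an array of shape (member, leadtime); the function name `exceedance_both` and its arguments are hypothetical, and the way `compute_exceedance` loads the files is not reproduced here.

```python
import numpy as np

def exceedance_both(discharge, threshold, reducing_factor=0.05):
    """Exceedance probability over Q_rp and (1 - reducing_factor) * Q_rp.

    `discharge` has shape (member, leadtime). The returned array has shape
    (2, leadtime): row 0 is the full threshold, row 1 the reduced one.
    """
    levels = threshold * np.array([1.0, 1.0 - reducing_factor])
    # broadcast (2, 1, 1) against (member, leadtime) -> (2, member, leadtime)
    exceeds = np.asarray(discharge) > levels[:, None, None]
    # average over the ensemble members
    return exceeds.mean(axis=1)

# hypothetical forecast: 2 ensemble members x 3 lead times (m3/s)
discharge = np.array([[90., 98., 105.],
                      [85., 96., 101.]])
both = exceedance_both(discharge, threshold=100.0)
```

With this layout the discharge records are read once and the extra threshold only adds a leading dimension to the comparison.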

**Interesting links**<br>

In [1]:
import os
path_root = os.getcwd()
import glob
import numpy as np
import pandas as pd
import xarray as xr
from datetime import datetime, timedelta
import time
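# `tqdm.tqdm_notebook` is deprecated; import the notebook progress bar from `tqdm.notebook`
from tqdm.notebook import tqdm as tqdm_notebook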
import yaml

import warnings
warnings.filterwarnings("ignore")

os.chdir('../py/')
from computations import *
from plots import *
os.chdir(path_root)

## Configuration

In [10]:
with open("../conf/config.yml", "r", encoding='utf8') as ymlfile:
    cfg = yaml.safe_load(ymlfile)

# area threshold
area_threshold = cfg.get('reporting_points', {}).get('area', 500)

# return period
rp = cfg.get('return_period', {}).get('threshold', 5)

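# fraction by which the discharge threshold is reduced to define the lower (buffer) threshold;
# the default of 0.05 corresponds to the 0.95 * Q_rp threshold described in the introduction
reducing_factor = cfg.get('return_period', {}).get('reducing_factor', 0.05)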

# PATHS

# local directory where the raw discharge data is saved
path_in = cfg.get('paths', {}).get('input', {}).get('discharge', {}).get('forecast', '../data/discharge/reanalysis/')

# path where the output exceedance datasets will be saved
path_out = cfg.get('paths', {}).get('output', {}).get('exceedance', {}).get('forecast', '../data/exceedance/forecast/')
os.makedirs(path_out, exist_ok=True)
    
# reporting points
path_stations = cfg.get('paths', {}).get('input', {}).get('reporting_points', '../data/reporting_points/')
file_stations = f'{path_stations}reporting_points_over_{area_threshold}km2.parquet'
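
The chained `.get(..., {})` calls above can be wrapped in a small helper. The following is only a sketch: `get_nested` is a hypothetical function that is not part of the repository, and the configuration dictionary and default path are made up for illustration.

```python
def get_nested(cfg, *keys, default=None):
    """Walk nested dictionaries, returning `default` if any key is missing."""
    node = cfg
    for key in keys:
        if not isinstance(node, dict) or key not in node:
            return default
        node = node[key]
    return node

# hypothetical configuration, mirroring the structure read above
cfg = {'return_period': {'threshold': 5},
       'paths': {'output': {'exceedance': {'forecast': '../data/exceedance/forecast/'}}}}

rp = get_nested(cfg, 'return_period', 'threshold', default=5)
# 'input' is missing in this example, so the (assumed) default is returned
path_in = get_nested(cfg, 'paths', 'input', 'discharge', 'forecast',
                     default='../data/discharge/forecast/')
path_out = get_nested(cfg, 'paths', 'output', 'exceedance', 'forecast')
```

Unlike the chained calls, this version also returns the default when an intermediate key exists but is empty in the YAML file (and is therefore loaded as `None`).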

## 1 Discharge forecast

### 1.1 List available data

In [3]:
# list the forecast files of each model; they are sorted so that the first and last file bound the period
fore_files = {model: [] for model in models}
for year in [2021, 2022]:
    for month in range(1, 13):
        for model in models:
            fore_files[model] += sorted(glob.glob(f'{path_in}{model}/{year}/{month:02d}/*.nc'))

# count files and check if all are available
n_files = pd.Series(data=[len(fore_files[model]) for model in models], index=models)

# forecast dates in the period for which all the models have data
start, end = datetime(1900, 1, 1), datetime(2100, 1, 1)
for model in models:
    st, en = [datetime.strptime(fore_files[model][step][-13:-3], '%Y%m%d%H') for step in [0, -1]]
    start = max(st, start)
    end = min(en, end)
dates = pd.date_range(start, end, freq='12h')

# find missing files
if any(n_files != len(dates)):
    missing = {}
    for model in models:
        filedates = [datetime.strptime(file[-13:-3], '%Y%m%d%H') for file in fore_files[model]]    
        missing[model] = [date for date in dates if date not in filedates]
    print('missing files:', missing)

# trim files to the period where all models are available
for model in models:
    fore_files[model] = [file for file in fore_files[model] if start <= datetime.strptime(file[-13:-3], '%Y%m%d%H') <= end]
    print('{0}:\t{1} files'.format(model, len(fore_files[model])))

COS:	1460 files
DWD:	1460 files
EUD:	1460 files
EUE:	1460 files
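
The cell above relies on the slice `file[-13:-3]` to read the forecast time, which suggests that every file name ends in a 10-digit `YYYYMMDDHH` stamp followed by `.nc`. A slightly more defensive way to parse it, under that assumption, is sketched below; `forecast_time` and the example file names are hypothetical.

```python
import re
from datetime import datetime
from pathlib import Path

def forecast_time(file):
    """Parse the 10-digit YYYYMMDDHH stamp that ends the file name."""
    match = re.search(r'(\d{10})$', Path(file).stem)
    if match is None:
        raise ValueError(f'no forecast timestamp in {file!r}')
    return datetime.strptime(match.group(1), '%Y%m%d%H')

# hypothetical file names; only the trailing timestamp matters
files = ['COS/2021/01/dis_2021010112.nc', 'COS/2021/01/dis_2021010100.nc']

# sort chronologically instead of relying on the order returned by glob
files_sorted = sorted(files, key=forecast_time)
```

Files that do not follow the naming pattern raise an error instead of silently producing a wrong date.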


## 2 Analysis

### 2.1 Stations 

If the discharge forecast only had to be preprocessed at the selected reporting points, the code to run would be the following:

```Python
# load selected points for all the catchments
stations = pd.DataFrame()
catchments = []
results_path = '../results/reporting_points/'
for folder in os.listdir(results_path):
    try:
        stn_cat = pd.read_csv(f'{results_path}{folder}/points_selected.csv', index_col='station_id')
        stations = pd.concat((stations, stn_cat))
        catchments.append(folder)
    except OSError:
        # skip entries without a table of selected points
        continue
print('no. stations:\t\t\t{0}'.format(stations.shape[0]))
```

Instead, to preprocess all the reporting points above a certain area threshold, the code to run is this:

In [11]:
# load table of fixed reporting points
stations = pd.read_parquet(file_stations)

### 2.2 Reforecast data: exceedance probability

This section iteratively (station by station) loads all the available forecasts and computes the probability of exceeding the discharge threshold for each of the meteorological forcings. The result is one NetCDF file per station that contains the exceedance probability. These files will later be used in the skill assessment.

In [5]:
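# select stations that haven't been processed before
files = glob.glob(f'{path_out}*.nc')
if len(files) > 0:
    # take the station ID from the file name, whatever the path separator of the operating system
    old_stations = [int(os.path.splitext(os.path.basename(file))[0]) for file in files]
    # recent pandas versions do not accept a set as indexer, so convert it to a sorted list
    new_stations = sorted(set(stations.index).difference(old_stations))
    stations = stations.loc[new_stations]
print('no. new stations:\t\t\t{0}'.format(stations.shape[0]))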

# generate a DataArray with the discharge threshold of the stations in the catchment
thresholds = xr.DataArray(stations[f'rl{rp}'], dims='id', coords={'id': stations.index.astype(str).tolist()})

no. new stations:			2371
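
The selection of unprocessed stations can be illustrated in isolation. This is a sketch with made-up station IDs and file names; `pending_stations` is a hypothetical helper, and it assumes that each output file is named after its station ID, as in the export cell of this notebook.

```python
from pathlib import Path

def pending_stations(all_ids, processed_files):
    """Return, in sorted order, the station IDs without an output file."""
    # the file stem is the zero-padded station ID, e.g. '0012' -> 12
    done = {int(Path(file).stem) for file in processed_files}
    return sorted(set(all_ids) - done)

# hypothetical station IDs and already exported files
all_ids = [12, 345, 6789]
processed = ['../data/exceedance/forecast/0012.nc']

new_ids = pending_stations(all_ids, processed)
```

The resulting list can be passed directly to `stations.loc[...]`.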


In [6]:
start = time.perf_counter()

# exceedance over the discharge threshold (Q5 by default)
exceedance_q5 = compute_exceedance(fore_files, thresholds)
# exceedance over the reduced threshold (95% of Q5 when `reducing_factor` is 0.05)
exceedance_buffer = compute_exceedance(fore_files, thresholds * (1 - reducing_factor))
# join both DataArrays in a single Dataset
exceedance = xr.Dataset({'high': exceedance_q5, 'low': exceedance_buffer})

In [7]:
# export files station by station, skipping those that already exist
for stn in tqdm_notebook(exceedance.id.data):
    file = f'{stn:>04}.nc'
    if os.path.exists(f'{path_out}{file}'):
        print(f'File {file} already exists')
        continue
    exceedance.sel(id=stn).to_netcdf(f'{path_out}{file}')
        
end = time.perf_counter()

print('execution time: {0:.1f} s'.format(end - start))

  0%|          | 0/2371 [00:00<?, ?it/s]

execution time: 1256.8 s
