# Create dataset - time series
***

**Author:** Chus Casado Rodríguez<br>
**Date:** 26-05-2025<br>

**Introduction:**<br>
This code creates the time series for the reservoirs in ResOpsUS. The time series include records from ResOpsUS and simulations from GloFAS.

The result is a time series that combines the observed data from ResOpsUS with the simulation from GloFASv4 (when possible). For each reservoir, these time series are exported in both CSV and NetCDF formats.

Records are cleaned to avoid errors:

* Outliers in the **storage** time series are filtered by comparison with a moving median (7-day window). If the relative difference between a given storage value and the moving median exceeds a threshold, the value is removed. This procedure is encapsulated in the function `lisfloodreservoirs.utils.timeseries.clean_storage()`.
* Outliers in the **inflow** time series are removed using two conditions: one based on the gradient, and the other on an inflow estimated from the water balance. When both conditions are met, the value is removed. Since inflow time series cannot contain missing values when used in the reservoir simulation, a simple linear interpolation is used to fill in gaps of up to 7 days. This procedure is encapsulated in the function `lisfloodreservoirs.utils.timeseries.clean_inflow()`.
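
The storage filter described above can be illustrated with a minimal sketch. This is not the code of `clean_storage()`: the function name `filter_by_moving_median`, its parameter names and the choice of a centred window are assumptions made only for illustration. The defaults mirror the 7-day window and the 0.1 threshold used later in this notebook.

```python
import numpy as np
import pandas as pd


def filter_by_moving_median(series: pd.Series, window: int = 7, rel_thr: float = 0.1) -> pd.Series:
    """Hypothetical helper: mask values that deviate too much from a moving median."""
    # centred moving median (assumption: the real function may align the window differently)
    median = series.rolling(window, center=True, min_periods=1).median()
    # relative difference between each value and the moving median
    rel_diff = (series - median).abs() / median.abs()
    # values above the threshold become NaN; the rest are kept
    return series.mask(rel_diff > rel_thr)


# synthetic daily storage with one obvious spike
dates = pd.date_range('2000-01-01', periods=8, freq='D')
storage = pd.Series([100., 101., 102., 500., 103., 104., 105., 106.], index=dates)
cleaned = filter_by_moving_median(storage)
print(cleaned)
```

In this synthetic series only the spike of 500 is replaced by `NaN`, because the median is robust to a single extreme value within the window.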

**To do:**<br>
* [ ] <font color='red'> Is the ResOpsUS raw data in the same time zone as GloFAS?</font>
* [ ] Demand time series

In [1]:
import numpy as np
import pandas as pd
import xarray as xr
from datetime import datetime, timedelta
from tqdm.auto import tqdm

from lisfloodreservoirs.utils import DatasetConfig
from lisfloodreservoirs import read_attributes
from lisfloodreservoirs.utils.plots import plot_resops, reservoir_analysis, compare_flows
from lisfloodreservoirs.utils.timeseries import clean_storage, clean_inflow, time_encoding
from lisfloodreservoirs.utils.timezone import convert_to_utc, reindex_to_00utc

## Configuration

In [2]:
cfg = DatasetConfig('config_ResOpsUS_v21.yml')

print(f'Time series will be saved in {cfg.PATH_TS}')

Time series will be saved in Z:\nahaUsers\casadje\datasets\reservoirs\ResOpsUS\v2.1\time_series


## Data

### Attributes


In [3]:
# import all tables of attributes
attributes = read_attributes(cfg.PATH_ATTRS)
print(f'{attributes.shape[0]} reservoirs in the attribute tables')

677 reservoirs in the attribute tables


### Time series
#### Observed: ResOpsUS

In [4]:
path_plots = cfg.PATH_TS / 'plots'
path_plots.mkdir(parents=True, exist_ok=True)
resops_ts = {}
for grand_id in tqdm(attributes.index, desc='Reading observed time series'): # ID refers to GRanD

    # load timeseries
    file = cfg.PATH_OBS_TS / f'ResOpsUS_{grand_id}.csv'
    if file.is_file():
        series = pd.read_csv(file, parse_dates=True, index_col='date')
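    else:
        print(f"{file} doesn't exist")
        # skip this reservoir; otherwise `series` would be undefined or left over from the previous iteration
        continue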

    # remove duplicated index
    series = series[~series.index.duplicated(keep='first')]
    # trim to GloFAS long run period
    series = series.loc[cfg.START:cfg.END,:]
    if series.empty:
        print(f'Reservoir {grand_id} has no observations in the time period from {cfg.START} to {cfg.END}')
        continue
    # ensure there aren't gaps in the index
    dates = pd.date_range(series.first_valid_index(), series.last_valid_index(), freq='D')
    series = series.reindex(dates)
    series.index.name = 'date'

    # remove negative values
    series[series < 0] = np.nan
    # clean storage time series
    series.storage = clean_storage(series.storage, w=7, error_thr=0.1)
    # clean inflow time series
    series.inflow = clean_inflow(
        series.inflow, 
        storage=series.storage if attributes.loc[grand_id, 'STORAGE'] == 1 else None, 
        outlfow=series.outflow if attributes.loc[grand_id, 'OUTFLOW'] == 1 else None, 
        grad_thr=1e4, 
        balance_thr=5, 
        int_method='linear'
    )

    # convert time series to UTC (with offset)
    series = convert_to_utc(
        lon=attributes.loc[grand_id, 'LON'], 
        lat=attributes.loc[grand_id, 'LAT'], 
        series=series
    )
    # interpolate values to 00 UTC
    series = reindex_to_00utc(series)
    
    # save in dictionary
    series.index = pd.DatetimeIndex(series.index.date, name='date')
    resops_ts[grand_id] = series

    # plot observed time series
    plot_resops(
        series.storage,
        series.elevation,
        series.inflow,
        series.outflow,
        attributes.loc[grand_id, ['CAP_MCM', 'CAP_GLWD']].values,
        title=grand_id,
        save=path_plots / f'{grand_id:04}_lineplot.jpg'
        )

print(f'{len(resops_ts)} reservoirs in ResOpsUS time series')

Reservoir 288 has no observations in the time period from 1975-01-01 00:00:00 to None
676 reservoirs in ResOpsUS time series
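
The last two steps of the loop above (conversion to UTC and interpolation to 00 UTC) can be sketched as follows. The internals of `convert_to_utc` and `reindex_to_00utc` are not shown in this notebook, so this is only an illustration under stated assumptions: the real helper receives the reservoir's longitude and latitude, whereas the hypothetical function `to_midnight_utc` takes a fixed UTC offset (`utc_offset_hours`) directly and uses time-weighted linear interpolation.

```python
from datetime import timedelta, timezone

import pandas as pd


def to_midnight_utc(series: pd.Series, utc_offset_hours: int) -> pd.Series:
    """Hypothetical helper: shift a local-time daily series to values at 00 UTC."""
    # attach the local (fixed-offset) time zone and express the index in UTC
    local_tz = timezone(timedelta(hours=utc_offset_hours))
    utc = series.tz_localize(local_tz).tz_convert('UTC')
    # daily timestamps at 00 UTC that lie within the observed period
    target = pd.date_range(utc.index[0].ceil('D'), utc.index[-1].floor('D'), freq='D')
    # add the target timestamps, interpolate linearly in time and keep only 00 UTC
    combined = utc.reindex(utc.index.union(target)).interpolate(method='time')
    return combined.reindex(target)


# daily values recorded at local midnight in a UTC-6 time zone
local = pd.Series(
    [0.0, 1.0, 2.0, 3.0],
    index=pd.date_range('2000-01-01', periods=4, freq='D'),
)
at_00utc = to_midnight_utc(local, utc_offset_hours=-6)
print(at_00utc)
```

Local midnight at UTC-6 corresponds to 06:00 UTC, so each value at 00 UTC lies three quarters of the way between two consecutive observations, and the resulting series is one step shorter than the input.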


In [5]:
# convert to xarray.Dataset
xarray_list = []
for key, df in resops_ts.items():
    ds = xr.Dataset.from_dataframe(df)
    ds = ds.assign_coords(GRAND_ID=key)
    xarray_list.append(ds)
obs = xr.concat(xarray_list, dim='GRAND_ID')

print(f'{len(obs.GRAND_ID)} reservoirs and {len(obs)} variables in the observed time series')

676 reservoirs and 5 variables in the observed time series


#### Simulated: GloFAS

This snippet is legacy code. It imports the reservoir variables (inflow, storage and release) from the long-run simulation of GloFAS4. Because GloFAS4 did not consider all the reservoirs in the dataset, these time series are no longer used.

```Python
# import time series
glofas_ts = {}
mask = ~attributes.GLOFAS_ID.isnull()
for grand_id, glofas_id in tqdm(attributes[mask].GLOFAS_ID.items(), total=mask.sum(), desc='Reading simulated time series'):
    file = cfg.PATH_SIM_TS / f'{glofas_id:03.0f}.csv'
    if file.is_file():
        series = pd.read_csv(file, parse_dates=True, dayfirst=False, index_col='date')
        series.index -= pd.Timedelta(days=1)
        series.storage *= attributes.loc[grand_id, 'CAP_GLWD']
        series[series < 0] = np.nan
        # series.columns = [f'{col.lower()}_sim' for col in series.columns]
        glofas_ts[grand_id] = series
    else:
        print(f"{file} doesn't exist")
        
print(f'{len(glofas_ts)} reservoirs in GloFAS time series')

# convert to xarray.Dataset
new_dim = 'GRAND_ID'
xarray_list = []
for key, df in glofas_ts.items():
    ds = xr.Dataset.from_dataframe(df)
    ds = ds.assign_coords({new_dim: key})
    xarray_list.append(ds)
sim = xr.concat(xarray_list, dim=new_dim)

# rename variables in the simulated time series
sim = sim.rename_vars({var: f'{var}_glofas' for var in list(sim)})
```

In [6]:
# load time series
var = 'dis24'
path_inflow = cfg.PATH_RESOPS / 'ancillary' / 'ncextract' / var
sim = xr.open_mfdataset(path_inflow.glob('*.nc'), combine='nested', concat_dim='id')

# rename variables and coordinates
sim = sim.rename({
    'id': 'GRAND_ID', 
    'valid_time': 'date',
    var: 'inflow_sim'
})
sim = sim.drop_vars(['surface', 'lat', 'latitude', 'lon', 'longitude'], errors='ignore')

# correct and trim time
sim['date'] = sim['date'] - pd.Timedelta(days=1)
sim = sim.sel(date=slice(cfg.START, cfg.END))

# compute
sim = sim.compute()

# # Create a CRS variable and set its attributes
# crs_attrs = {
#     'epsg_code': 'EPSG:4326',
#     'semi_major_axis': 6378137.0,  # WGS 84
#     'inverse_flattening': 298.257223563,  # WGS 84
#     'grid_mapping_name': 'latitude_longitude'
#     }
# sim['crs'] = xr.DataArray(data=0, attrs=crs_attrs)  # CRS variable with its attributes

# # define attributes
# sim.attrs['Units'] = 'inflow: simulated discharge from GloFASv4 (m3/s)'
# sim.time.attrs['timezone'] = 'UTC+00'
# sim.GRAND_ID.attrs['Description'] = 'The identifier of the reservoir in GRanD (Global Reservoir and Dam database)'
# lat_attrs = {
#     'Units': 'degrees_north',
#     'standard_name': 'latitude',
#     'grid_mapping': 'crs'
# }
# lon_attrs = {
#     'Units': 'degrees_east',
#     'standard_name': 'longitude',
#     'grid_mapping': 'crs'
# }
# sim.latitude.attrs = lat_attrs
# sim.longitude.attrs = lon_attrs

print(f'{len(sim.GRAND_ID)} reservoirs in the simulated inflow time series')

677 reservoirs in the simulated inflow time series


#### Meteorology: areal

Time series of catchment-average meteorology generated with the LISFLOOD utility `catchstats`.

In [7]:
# load meteorological time series
path_meteo_areal = cfg.PATH_RESOPS / 'ancillary' / 'catchstats'
rename_vars = {
    'id': 'GRAND_ID',
    'time': 'date',
    'e0': 'evapo_areal',
    'tp': 'precip_areal',
    'ta': 'temp_areal',
}
variables = [x.stem for x in path_meteo_areal.iterdir() if x.is_dir() & (x.stem in rename_vars)]
meteo_areal = xr.Dataset({f'{var}': xr.open_mfdataset(f'{path_meteo_areal}/{var}/*.nc')[f'{var}_mean'] for var in variables})

# rename variables and coordinates
meteo_areal = meteo_areal.rename(rename_vars)

# correct and trim time
meteo_areal['date'] = meteo_areal['date'] - pd.Timedelta(days=1) # WARNING!! One day lag compared with LISFLOOD
meteo_areal = meteo_areal.sel(date=slice(cfg.START, cfg.END))

# keep catchments in the attributes
IDs = list(attributes.index.intersection(meteo_areal.GRAND_ID.data))
meteo_areal = meteo_areal.sel(GRAND_ID=IDs)

# compute
meteo_areal = meteo_areal.compute()

# # define attributes
# meteo_units = 'evapo_areal: catchment-average potential evaporation from open water from ERA5 [mm/d]\n' \
#     'precip_areal: catchment-average precipitation from ERA5 [mm/d]\n' \
#     'temp_areal: catchment-average air temperature from ERA5 [°C]\n'
# meteo_areal.attrs['Units'] = meteo_units
# meteo_areal.time.attrs['timezone'] = 'UTC+00'
# meteo_areal.GRAND_ID.attrs['Description'] = 'The identifier of the reservoir in GRanD (Global Reservoir and Dam database)'

print(f'{len(meteo_areal.GRAND_ID)} reservoirs and {len(meteo_areal)} variables in the areal meteorological time series')

633 reservoirs and 3 variables in the areal meteorological time series


#### Meteorology: point

Time series of reservoir point meteorology extracted with the LISFLOOD utility `ncextract`.

In [8]:
# load meteorological time series
path_meteo_point = cfg.PATH_RESOPS / 'ancillary' / 'ncextract' / 'meteo'
rename_vars = {
    'id': 'GRAND_ID',
    'time': 'date',
    'e0': 'evapo_point',
    'tp': 'precip_point',
    'ta': 'temp_point',
}
variables = [x.stem for x in path_meteo_point.iterdir() if x.is_dir() & (x.stem in rename_vars)]
meteo_point = xr.Dataset({f'{var}': xr.open_mfdataset(f'{path_meteo_point}/{var}/*.nc')[var] for var in variables})

# rename variables and coordinates
meteo_point = meteo_point.rename(rename_vars)
meteo_point = meteo_point.drop_vars(['surface', 'lat', 'latitude', 'lon', 'longitude'], errors='ignore')

# correct and trim time
meteo_point['date'] = meteo_point['date'] - pd.Timedelta(days=1) # WARNING!! One day lag compared with LISFLOOD

# keep catchments in the attributes
IDs = list(attributes.index.intersection(meteo_point.GRAND_ID.data))
meteo_point = meteo_point.sel(GRAND_ID=IDs)

# meteo_point = meteo_point.drop_vars(['lon', 'lat'], errors='ignore')

# compute
meteo_point = meteo_point.compute()

# # define attributes
# meteo_units = 'evapo_point: potential evaporation at the reservoir location from open water from ERA5 [mm/d]\n' \
#     'precip_point: precipitation at the reservoir location from ERA5 [mm/d]\n' \
#     'temp_point: air temperature  at the reservoir location from ERA5 [°C]\n'
# meteo_point.attrs['Units'] = meteo_units
# meteo_point.time.attrs['timezone'] = 'UTC+00'
# meteo_point.GRAND_ID.attrs['Description'] = 'The identifier of the reservoir in GRanD (Global Reservoir and Dam database)'

print(f'{len(meteo_point.GRAND_ID)} reservoirs and {len(meteo_point)} variables in the point meteorological time series')

677 reservoirs and 3 variables in the point meteorological time series


## Prepare dataset

### Convert units

In [9]:
if cfg.NORMALIZE:

    # reservoir attributes used to normalize the dataset
    area_sm = xr.DataArray.from_series(attributes.AREA_SKM) * 1e6 # m2
    capacity_cm = xr.DataArray.from_series(attributes.CAP_MCM) * 1e6 # m3
    catchment_sm = xr.DataArray.from_series(attributes.CATCH_SKM) * 1e6 # m2
    
    # Observed timeseries
    # -------------------
    for var, da in obs.items():
        # convert variables in hm3 to fraction of reservoir capacity [-]
        if var in ['storage', 'evaporation']:
            obs[f'{var}_norm'] = obs[var] * 1e6 / capacity_cm
        # convert variables in m3/s to fraction of reservoir capacity [-]
        elif var in ['inflow', 'outflow']:
            obs[f'{var}_norm'] = obs[var] * 24 * 3600 / capacity_cm

    # Simulated timeseries
    # -------------------
    for var, da in sim.items():
        # convert variables in hm3 to fraction of reservoir capacity [-]
        if var.split('_')[0] in ['storage']:
            sim[f'{var}_norm'] = sim[var] * 1e6 / capacity_cm
        # convert variables in m3/s to fraction of reservoir capacity [-]
        elif var.split('_')[0] in ['inflow', 'outflow']:
            sim[f'{var}_norm'] = sim[var] * 24 * 3600 / capacity_cm
            
    # Catchment meteorology
    # ---------------------
    # convert areal evaporation and precipitation from mm to fraction filled
    for var in ['evapo', 'precip']:
        meteo_areal[f'{var}_areal_norm'] = meteo_areal[f'{var}_areal'] * catchment_sm * 1e-3 / capacity_cm

    # Point meteorology
    # ---------------------
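    # convert point evaporation and precipitation from mm to fraction filled
    # NOTE: the depth is scaled by the catchment area (`catchment_sm`); the reservoir surface area (`area_sm`) is defined above but not used
    for var in ['evapo', 'precip']:
        meteo_point[f'{var}_point_norm'] = meteo_point[f'{var}_point'] * catchment_sm * 1e-3 / capacity_cm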
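
The conversions in the cell above reduce to three formulas that express a volume, a daily flow and a water depth as a fraction of the reservoir capacity. The sketch below restates them with scalar numbers so the orders of magnitude are easy to check. The function names and the example values are hypothetical and are not part of `lisfloodreservoirs`.

```python
SECONDS_PER_DAY = 24 * 3600


def volume_to_fraction(volume_hm3: float, capacity_hm3: float) -> float:
    """Volume in hm3 as a fraction of the reservoir capacity [-]."""
    return volume_hm3 * 1e6 / (capacity_hm3 * 1e6)


def flow_to_fraction(flow_m3s: float, capacity_hm3: float) -> float:
    """Daily volume of a flow in m3/s as a fraction of the reservoir capacity [-]."""
    return flow_m3s * SECONDS_PER_DAY / (capacity_hm3 * 1e6)


def depth_to_fraction(depth_mm: float, area_km2: float, capacity_hm3: float) -> float:
    """Water depth in mm over an area in km2 as a fraction of the reservoir capacity [-]."""
    return depth_mm * 1e-3 * area_km2 * 1e6 / (capacity_hm3 * 1e6)


# made-up reservoir with a capacity of 100 hm3
capacity = 100.0
storage_norm = volume_to_fraction(40.0, capacity)        # 40 hm3 stored
inflow_norm = flow_to_fraction(50.0, capacity)           # 50 m3/s during one day
precip_norm = depth_to_fraction(10.0, 1000.0, capacity)  # 10 mm over 1000 km2
print(storage_norm, inflow_norm, precip_norm)
```

With these numbers, one day of inflow at 50 m3/s amounts to 4.32 hm3, roughly 4 % of the capacity, while 10 mm of precipitation over 1000 km2 amounts to 10 hm3, or 10 % of the capacity.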

### Export

In [10]:
path_csv = cfg.PATH_TS / 'csv'
path_csv.mkdir(parents=True, exist_ok=True)
path_nc = cfg.PATH_TS / 'netcdf'
path_nc.mkdir(parents=True, exist_ok=True)

for grand_id in tqdm(attributes.index, desc='Exporting time series'):    

    # concatenate time series
    if grand_id in obs.GRAND_ID.data:
        ds = obs.sel(GRAND_ID=grand_id).drop_vars(['GRAND_ID'])
    else:
        print(f'Reservoir {grand_id} does not have observations. Skipping to the next reservoir')
        continue
    if grand_id in sim.GRAND_ID.data:
        ds = xr.merge((ds, sim.sel(GRAND_ID=grand_id).drop_vars(['GRAND_ID'])))
    if grand_id in meteo_areal.GRAND_ID.data:
        ds = xr.merge((ds, meteo_areal.sel(GRAND_ID=grand_id).drop_vars(['GRAND_ID'])))
    if grand_id in meteo_point.GRAND_ID.data:
        ds = xr.merge((ds, meteo_point.sel(GRAND_ID=grand_id).drop_vars(['GRAND_ID'])))
        
    # delete empty variables
    for var in list(ds.data_vars):
        if (ds[var].isnull().all()):
            del ds[var]

    # trim time series to the observed period
    start, end = attributes.loc[grand_id, ['TIME_SERIES_START', 'TIME_SERIES_END']].values
    ds = ds.sel(date=slice(start, end))

    # create time series of temporal attributes
    ds['year'] = ds.date.dt.year
    ds['month'] = ds.date.dt.month
    ds['month_sin'], ds['month_cos'] = time_encoding(ds['month'], period=12)
    ds['weekofyear'] = ds.date.dt.isocalendar().week
    ds['woy_sin'], ds['woy_cos'] = time_encoding(ds['weekofyear'], period=52)
    ds['dayofyear'] = ds.date.dt.dayofyear
    ds['doy_sin'], ds['doy_cos'] = time_encoding(ds['dayofyear'], period=365)
    ds['dayofweek'] = ds.date.dt.dayofweek
    ds['dow_sin'], ds['dow_cos'] = time_encoding(ds['dayofweek'], period=7)  # dayofweek takes 7 values (0-6)
        
    # export CSV
    # ..........
    ds.to_pandas().to_csv(path_csv / f'{grand_id}.csv')

    # export NetCDF
    # .............
    ds.to_netcdf(path_nc / f'{grand_id}.nc')

Reservoir 288 does not have observations. Skipping to the next reservoir
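
The temporal attributes created in the export loop rely on a sine/cosine encoding of cyclic variables. The implementation of `time_encoding` is not shown in this notebook, so the sketch below only illustrates the general technique, assuming the usual mapping of a value onto the unit circle. The function `cyclic_encoding` is hypothetical.

```python
import numpy as np


def cyclic_encoding(values, period: float):
    """Hypothetical helper: map a cyclic variable onto the unit circle."""
    angle = 2 * np.pi * np.asarray(values, dtype=float) / period
    return np.sin(angle), np.cos(angle)


# day of the week takes the values 0 (Monday) to 6 (Sunday), i.e. 7 distinct values
dayofweek = np.arange(7)
dow_sin, dow_cos = cyclic_encoding(dayofweek, period=7)

# each day gets a unique (sin, cos) pair, and Sunday ends up next to Monday
for day, s, c in zip(dayofweek, dow_sin, dow_cos):
    print(day, round(float(s), 3), round(float(c), 3))
```

With this mapping the period has to equal the number of distinct values of the variable. A shorter period makes the first and last values collapse onto the same point of the circle.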
