# Data

## Introduction

```{figure} ../images/figure_1_study_region.png
---
alt: Study region polygons and HYSETS monitored catchment polygons.
name: study-region-fig
width: 500px
align: center
---
Study region polygons with active (green triangles) and historical (yellow triangles) WSC and USGS streamflow monitoring stations.
```

```{note}
Before running the computations in this notebook, download the streamflow time series and (optionally) the catchment boundaries from the [HYSETS open data repository](https://osf.io/rpc3w/).  Some data are provided in the `data/` folder of this repository.  Data pre-processing can be skipped by downloading the input data files from (add dataset repository link)
```

The data used in this study come from *The Hydrometeorological Sandbox École de Technologie Supérieure* (HYSETS) {cite}`arsenault2020comprehensive`.  The HYSETS data, including streamflow time series and attributes for 14,425 catchments, can be accessed at [https://osf.io/rpc3w/](https://osf.io/rpc3w/).  We use a subset of approximately 1,620 catchments located in the major basins that cover and border British Columbia, as shown in {numref}`Figure {number} <study-region-fig>`.  Ten climate indices were processed from [Daymet](https://daymet.ornl.gov/) for this subset; for details, see *BCUB - A large sample ungauged basin attribute dataset for British Columbia, Canada* {cite}`kovacek2024bcub` ([https://doi.org/10.5194/essd-2023-508](https://doi.org/10.5194/essd-2023-508)).


For this experiment we use the following files:

* **Daily average streamflow time series**: filenames follow the convention `<official_id>.csv`.  These should be downloaded from the open data repository linked above and saved under `data/hysets_streamflow_timeseries/`.
* **Catchment attributes**: filename `BCUB_watershed_attributes_updated.csv`.  This file is provided in the `data/` folder.  It was derived from the original HYSETS file `HYSETS_watershed_properties.txt` by adding ten climate indices, as described in {cite}`kovacek2024bcub`.
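
Before running the notebook, it can help to confirm that a streamflow file exists for every station of interest.  The following is a minimal sketch, assuming the `<official_id>.csv` naming convention and the `data/hysets_streamflow_timeseries/` folder described above.  The helper name `find_missing_streamflow_files` and the example station IDs are hypothetical and are not part of this repository.

```python
from pathlib import Path


def find_missing_streamflow_files(official_ids, streamflow_dir):
    """Return the station IDs that have no `<official_id>.csv` file in `streamflow_dir`."""
    streamflow_dir = Path(streamflow_dir)
    return [
        stn for stn in official_ids
        if not (streamflow_dir / f"{stn}.csv").is_file()
    ]


# Hypothetical station IDs, for illustration only
example_ids = ["STATION_A", "STATION_B"]
missing = find_missing_streamflow_files(example_ids, "data/hysets_streamflow_timeseries")
print(f"{len(missing)} of {len(example_ids)} streamflow files are missing")
```

Passing the array of station IDs used in the study instead of `example_ids` would list the stations whose time series still need to be downloaded.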

## Get input data from outside sources

The experiments here use data from several sources, including HYSETS, ...

Download the following files and update the file paths below to match your local file system:

1. HYSETS updated streamflow files: [HYSETS_2023_update_QC_stations.nc](https://osf.io/rpc3w/files/dropbox/HYSETS_2023_update_QC_stations.nc) (if the link changes, see the [OSF site, under files](https://osf.io/rpc3w/files)).

The attributes file is small enough to include in this repository.



In [10]:
from pathlib import Path
import pandas as pd
import numpy as np
import geopandas as gpd
from shapely.geometry import Point

# import data_processing_functions as dpf

# update this to the folder where you stored `HYSETS_2023_update_QC_stations.nc`;
# `HYSETS_watershed_properties.txt` is read from the same folder
HYSETS_DIR = Path('/home/danbot/code/common_data/HYSETS')

# import the HYSETS attributes data
hysets_df = pd.read_csv(HYSETS_DIR / 'HYSETS_watershed_properties.txt', sep=';')
# lookup tables keyed by Official_ID: drainage area (km2) and HYSETS watershed ID
da_dict = hysets_df.set_index('Official_ID')['Drainage_Area_km2'].to_dict()
official_id_dict = hysets_df.set_index('Official_ID')['Watershed_ID'].to_dict()


### Import the study region stations

In [11]:
station_fpath = 'data/study_region_stations.geojson'
bcub_gdf = gpd.read_file(station_fpath)
# get the number of unique stations in the dataset
unique_stations = np.unique(bcub_gdf['Official_ID'])
print(f'{len(unique_stations)} unique monitored catchments in the dataset')

1618 unique monitored catchments in the dataset


In [12]:
# prepare the catchment locations (centroids) for plotting:
# reproject to Web Mercator (EPSG:3857) to match the map tiles
gdf = bcub_gdf.copy().to_crs(3857)
# bounding box as (minx, miny, maxx, maxy)
bbox = gdf.geometry.total_bounds

In [13]:
# visualize the catchment centroid locations
from bokeh.plotting import figure, show
from bokeh.io import output_notebook
output_notebook()

# range bounds supplied in web mercator coordinates
p = figure(x_axis_type="mercator", y_axis_type="mercator", width=700, height=400,
          x_range=(bbox[0], bbox[2]), y_range=(bbox[1], bbox[3]))
p.add_tile("CartoDB Positron", retina=True)
p.scatter(x=gdf.geometry.x, y=gdf.geometry.y, color='orange', size=4)

show(p)
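
The `x_range` and `y_range` passed to the figure are in Web Mercator (EPSG:3857) metres, which is why the station locations are reprojected with `to_crs(3857)` first.  As a rough illustration of what that reprojection does to a single point, the standard spherical Web Mercator forward formula can be sketched as follows.  The function name `lonlat_to_web_mercator` and the example coordinates are illustrative assumptions and are not used elsewhere in this notebook.

```python
import math

# Radius of the sphere used by Web Mercator (WGS 84 semi-major axis), in metres
EARTH_RADIUS_M = 6378137.0


def lonlat_to_web_mercator(lon_deg, lat_deg):
    """Convert longitude/latitude in degrees to EPSG:3857 x/y in metres."""
    x = EARTH_RADIUS_M * math.radians(lon_deg)
    y = EARTH_RADIUS_M * math.log(math.tan(math.pi / 4 + math.radians(lat_deg) / 2))
    return x, y


# Illustrative point in southwestern British Columbia (approximate coordinates)
x, y = lonlat_to_web_mercator(-123.0, 49.0)
print(f"x = {x:.0f} m, y = {y:.0f} m")
```

In practice the `to_crs` call handles this for the whole GeoDataFrame; the sketch is only meant to show why the plotted coordinates are large values in metres rather than degrees.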

## Import streamflow timeseries


```{note}
At the top of `data_processing_functions.py`, update the `STREAMFLOW_DIR` variable to match where the HYSETS streamflow time series are stored.  
```


In [14]:
from data_processing_functions import load_and_filter_streamflow_timeseries as load_ts

ds = load_ts(unique_stations, hysets_df, HYSETS_DIR)

In [16]:
def retrieve_timeseries_discharge(stn, ds):
    """Return daily discharge for station `stn` with unit-area and log-transformed columns."""
    watershed_id = official_id_dict[stn]
    da = da_dict[stn]  # drainage area [km2]
    df = ds['discharge'].sel(watershed=str(watershed_id)).to_dataframe(name='discharge').reset_index()
    df = df.set_index('time')[['discharge']].dropna()
    # clip minimum flow to 1e-4 so the log transform below is defined for zero flows
    df['discharge'] = np.clip(df['discharge'], 1e-4, None)
    df = df.rename(columns={'discharge': stn})
    # unit area runoff [L/s/km2]
    df[f'{stn}_uar'] = 1000 * df[stn] / da
    # runoff depth [mm/day]: 86400 s/day * 1000 mm/m / 1e6 m2/km2 = 24 * 3.6
    df[f'{stn}_mm'] = df[stn] * (24 * 3.6 / da)
    df['log_x'] = np.log(df[f'{stn}_uar'])
    return df
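
The unit conversions in `retrieve_timeseries_discharge` can be checked by hand.  The sketch below assumes discharge in m³/s and drainage area in km², as implied by the `Drainage_Area_km2` column and the flow axis label used in this notebook.  The function names and the numbers are illustrative only and are not part of this repository.

```python
import math

SECONDS_PER_DAY = 24 * 3600


def unit_area_runoff_l_s_km2(q_m3_s, area_km2):
    """Discharge (m^3/s) to unit area runoff (L/s/km^2), using 1 m^3 = 1000 L."""
    return 1000 * q_m3_s / area_km2


def runoff_depth_mm_day(q_m3_s, area_km2):
    """Discharge (m^3/s) to runoff depth over the catchment (mm/day)."""
    volume_m3_per_day = q_m3_s * SECONDS_PER_DAY
    area_m2 = area_km2 * 1e6
    return 1000 * volume_m3_per_day / area_m2  # convert m to mm


# Illustrative values: 10 m^3/s from a 100 km^2 catchment
q, area = 10.0, 100.0
print(unit_area_runoff_l_s_km2(q, area))  # 100.0 L/s/km^2
print(runoff_depth_mm_day(q, area))       # 8.64 mm/day

# The explicit conversion agrees with the compact factor 24 * 3.6 / area
assert math.isclose(runoff_depth_mm_day(q, area), q * (24 * 3.6 / area))

# A zero flow clipped to 1e-4 m^3/s still has a finite log-transformed value
clipped = max(0.0, 1e-4)
print(math.log(unit_area_runoff_l_s_km2(clipped, area)))
```

Writing the conversion out in steps makes the origin of the `24 * 3.6` factor visible: 86,400 seconds per day times 1,000 mm per metre, divided by 10⁶ m² per km².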

In [17]:
# test loading streamflow time series for a pair of monitoring stations
s1, s2 = unique_stations[0], unique_stations[1]

df1 = retrieve_timeseries_discharge(s1, ds)
df2 = retrieve_timeseries_discharge(s2, ds)
# note: both frames carry a `log_x` column, so `test_df` has two columns with that name
test_df = pd.concat([df1, df2], axis=1)

flow_fig = figure(width=700, height=350, x_axis_type='datetime')
flow_fig.line(test_df.index, test_df[s1], color='navy', legend_label=s1)
flow_fig.line(test_df.index, test_df[s2], color='dodgerblue', legend_label=s2)
flow_fig.yaxis.axis_label = r'$$\text{Flow } \frac{m^3}{s}$$'
flow_fig.xaxis.axis_label = r'$$\text{Date}$$'
show(flow_fig)

## Next steps

In the subsequent chapters, the monitoring network station catchments are updated, and the catchment attributes are re-extracted and compared against the HYSETS values.  The target variables are then derived from the streamflow time series, and gradient boosting models are trained to test how well those variables can be predicted from catchment attributes.

## Citations

```{bibliography}
:filter: docname in docnames
```