# Data

## Introduction

```{figure} ../images/figure_1_study_region.png
---
alt: Study region polygons and HYSETS monitored catchment polygons.
name: study-region-fig
width: 500px
align: center
---
Study region polygons and WSC + USGS active (green triangles) and historical (yellow triangles) streamflow monitoring stations.  
```

```{note}
Before proceeding with the computations in the notebook, the streamflow time series and (optionally) catchment boundaries from the HYSETS dataset must be downloaded from the [HYSETS open data repository](https://osf.io/rpc3w/).  Some data are provided in the `data/` folder as part of this repository.  Data pre-processing can be skipped by downloading the input data files from (add dataset repository link)
```

The data used in this study come from *The Hydrometeorological Sandbox École de Technologie Supérieure* (HYSETS) {cite}`arsenault2020comprehensive`.  The HYSETS data, including streamflow time series and attributes for 14,425 catchments, can be accessed at [https://osf.io/rpc3w/](https://osf.io/rpc3w/).  We use a subset of approximately 1620 catchments contained in major basins covering and bounding British Columbia, as shown in {numref}`Figure {number} <study-region-fig>`.  Ten climate indices were processed from [Daymet](https://daymet.ornl.gov/) for this subset; for details, see *BCUB - A large sample ungauged basin attribute dataset for British Columbia, Canada* {cite}`kovacek2025bcub` ([https://doi.org/10.5194/essd-2023-508](https://doi.org/10.5194/essd-2023-508)).


For this experiment we use the following files:

* **Daily average streamflow time series**: filenames follow the convention `<official_id>.csv`.  These should be downloaded from the open data repository linked above and saved under `data/hysets_streamflow_timeseries/`.
* **Catchment attributes**: `BCUB_watershed_attributes_updated.csv` (the notebook below loads a revision-dated copy, `BCUB_watershed_attributes_updated_<rev_date>.csv`).  This file is provided in the `data/` folder.  It was modified from the original HYSETS file `HYSETS_watershed_properties.txt` by adding ten climate indices, as described in {cite}`kovacek2025bcub`.

## Get input data from outside sources

The experiments here use data from several sources, including HYSETS, ...

Download the following files and update the file paths below to your local file system:

1. HYSETS updated streamflow files: [HYSETS_2023_update_QC_stations.nc](https://osf.io/rpc3w/files/dropbox/HYSETS_2023_update_QC_stations.nc) (or if the link changes, see the [OSF site, under files](https://osf.io/rpc3w/files)).

The attributes file is small enough to include in this repository.



In [1]:
import os
from pathlib import Path
import pandas as pd
import numpy as np
import geopandas as gpd
from shapely.geometry import Point

import data_processing_functions as dpf



In [2]:
# update this to the path where you stored `HYSETS_2023_update_QC_stations.nc`
HYSETS_DIR = Path('/home/danbot/code/common_data/HYSETS')

# import the HYSETS attributes data
hysets_df = pd.read_csv(HYSETS_DIR / 'HYSETS_watershed_properties.txt', sep=';')
# map each official station ID to its drainage area [km²] and to its HYSETS watershed ID
da_dict = dict(zip(hysets_df['Official_ID'], hysets_df['Drainage_Area_km2']))
official_id_dict = dict(zip(hysets_df['Official_ID'], hysets_df['Watershed_ID']))

### Import the study region stations

In [12]:
station_fpath = 'data/study_region_stations.geojson'
bcub_gdf = gpd.read_file(station_fpath)
# look up the HYSETS watershed ID for each station (None if the station is not in HYSETS)
bcub_gdf['watershedID'] = bcub_gdf['Official_ID'].apply(lambda x: official_id_dict.get(x, None))
# get the number of unique stations in the dataset
unique_stations = np.unique(bcub_gdf['Official_ID'])
print(f'{len(unique_stations)} unique monitored catchments in the dataset')
# what is the minimum drainage area of the BCUB stations?
min_da = bcub_gdf['Drainage_Area_km2'].min()
print(f'Minimum drainage area of the BCUB stations: {min_da:.3f} km²')

1618 unique monitored catchments in the dataset
Minimum drainage area of the BCUB stations: 1.010 km²


In [13]:
# prepare the catchment locations (centroids) for plotting:
# reproject to Web Mercator (EPSG:3857) to match the map tiles
gdf = bcub_gdf.copy().to_crs(3857)
bbox = gdf.geometry.total_bounds

In [14]:
# visualize the catchment centroid locations
from bokeh.plotting import figure, show
from bokeh.io import output_notebook
from bokeh.palettes import Colorblind, Sunset10
output_notebook()

# range bounds supplied in web mercator coordinates
p = figure(x_axis_type="mercator", y_axis_type="mercator", width=700, height=400,
          x_range=(bbox[0], bbox[2]), y_range=(bbox[1], bbox[3]))
p.add_tile("CartoDB Positron", retina=True)
p.scatter(x=gdf.geometry.x, y=gdf.geometry.y, color='orange', size=4)

show(p)

## Import streamflow timeseries




In [16]:
import xarray as xr
# Load dataset
streamflow = xr.open_dataset(HYSETS_DIR / 'HYSETS_2023_update_QC_stations.nc')

# Promote 'watershedID' to a coordinate on 'watershed'
streamflow = streamflow.assign_coords(watershedID=("watershed", streamflow["watershedID"].data))

# Set 'watershedID' as index
streamflow = streamflow.set_index(watershed="watershedID")

# Select only the watershedIDs in bcub_gdf that are present in the streamflow dataset
valid_ids = [int(wid) for wid in bcub_gdf['watershedID'].values if wid in streamflow.watershed.values]
ds = streamflow.sel(watershed=valid_ids)
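
The indexing steps above can be tried on a small synthetic dataset before loading the full HYSETS file. The sketch below is illustrative only: the dataset `toy`, its values, and the requested IDs are made up, and it uses `set_coords` followed by `set_index` as one way to make the `watershed` dimension selectable by ID.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Synthetic stand-in for the HYSETS file: three watersheds, four daily values.
time = pd.date_range("2000-01-01", periods=4, freq="D")
toy = xr.Dataset(
    data_vars={
        "discharge": (("watershed", "time"), np.arange(12, dtype=float).reshape(3, 4)),
        "watershedID": (("watershed",), np.array([101, 102, 103])),
    },
    coords={"time": time},
)

# Promote the ID variable to a coordinate, then use it as the index of the
# 'watershed' dimension so that .sel() accepts watershed IDs as labels.
toy = toy.set_coords("watershedID").set_index(watershed="watershedID")

# Keep only the requested IDs that exist in the index (999 is not present).
requested = [103, 999, 101]
available = set(toy["watershed"].values.tolist())
valid = [wid for wid in requested if wid in available]

subset = toy.sel(watershed=valid)
selected_ids = subset["watershed"].values.tolist()
series_103 = subset["discharge"].sel(watershed=103).values.tolist()
print(selected_ids, series_103)
```

Filtering the requested IDs first matters because `.sel()` raises a `KeyError` when a label is missing from the index.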

In [44]:
def retrieve_timeseries_discharge(stn):
    """Return the daily discharge record for station `stn` with area-normalized columns."""
    watershed_id = official_id_dict[stn]
    da = da_dict[stn]  # drainage area [km²]
    df = ds['discharge'].sel(watershed=str(watershed_id)).to_dataframe(name='discharge').reset_index()
    # keep only the discharge column, indexed by date, and drop missing days
    df = df.set_index('time')[['discharge']].dropna()
    # clip minimum flow to 1e-4 m³/s so the log transform below is defined
    df['discharge'] = np.clip(df['discharge'], 1e-4, None)
    df = df.rename(columns={'discharge': stn})
    # unit area runoff [L/s/km²]
    df[f'{stn}_uar'] = 1000 * df[stn] / da
    # runoff depth [mm/day]: 86400 s/day * 1000 mm/m / 1e6 m²/km² = 24 * 3.6
    df[f'{stn}_mm'] = df[stn] * (24 * 3.6 / da)
    # natural log of unit area runoff
    df['log_x'] = np.log(df[f'{stn}_uar'])
    return df
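
The two area-normalized columns follow from simple unit conversions. The worked example below uses hypothetical values (10 m³/s from a 100 km² catchment) and hypothetical helper names to check that the `24 * 3.6 / da` shortcut agrees with the conversion written out step by step.

```python
import math

SECONDS_PER_DAY = 24 * 3600


def unit_area_runoff_l_s_km2(q_m3s, area_km2):
    """Convert discharge [m³/s] to unit area runoff [L/s/km²]."""
    return 1000.0 * q_m3s / area_km2


def runoff_depth_mm_day(q_m3s, area_km2):
    """Convert discharge [m³/s] to an equivalent runoff depth [mm/day]."""
    volume_m3 = q_m3s * SECONDS_PER_DAY   # daily volume [m³]
    area_m2 = area_km2 * 1e6              # catchment area [m²]
    return 1000.0 * volume_m3 / area_m2   # depth [m] converted to [mm]


q, area = 10.0, 100.0  # hypothetical flow [m³/s] and drainage area [km²]
uar = unit_area_runoff_l_s_km2(q, area)
depth = runoff_depth_mm_day(q, area)
shortcut = q * (24 * 3.6 / area)  # same expression as in the function above

print(uar)                             # 100.0 L/s/km²
print(round(depth, 2))                 # 8.64 mm/day
print(math.isclose(depth, shortcut))   # True
```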

In [45]:
# test loading streamflow time series for a pair of monitoring stations
s1, s2 = unique_stations[0], unique_stations[1]

df1 = retrieve_timeseries_discharge(s1)
df2 = retrieve_timeseries_discharge(s2)
test_df = pd.concat([df1, df2], axis=1)       

flow_fig = figure(width=700, height=350, x_axis_type='datetime')
flow_fig.line(test_df.index, test_df[unique_stations[0]], color='navy', legend_label=unique_stations[0])
flow_fig.line(test_df.index, test_df[unique_stations[1]], color='dodgerblue', legend_label=unique_stations[1])
flow_fig.yaxis.axis_label = r'$$\text{Flow } \frac{m^3}{s}$$'
flow_fig.xaxis.axis_label = r'$$\text{Date}$$'
show(flow_fig)

The climate attribute plots in the Catchment Attributes section below suggest that the last three attributes, `low_prcp_freq`, `high_prcp_duration`, and `high_prcp_freq`, carry very little information.

## Streamflow Data Validation


Here we set the minimum record length (in complete years, `min_years_of_record` below) required to define a period-of-record (POR) flow duration curve.
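
For context, a period-of-record flow duration curve ranks all observed flows and pairs each with the fraction of time it is equalled or exceeded. The sketch below is a minimal illustration, not the method used later in this work: the function name `flow_duration_curve` is hypothetical, and the Weibull plotting position `rank / (n + 1)` is an assumed convention.

```python
import numpy as np


def flow_duration_curve(flows):
    """Return exceedance probabilities and flows sorted from largest to smallest."""
    flows = np.asarray(flows, dtype=float)
    flows = flows[~np.isnan(flows)]          # drop missing observations
    sorted_flows = np.sort(flows)[::-1]      # largest flow first
    ranks = np.arange(1, len(sorted_flows) + 1)
    # Weibull plotting position (an assumed convention for this sketch)
    exceedance = ranks / (len(sorted_flows) + 1)
    return exceedance, sorted_flows


exceedance, sorted_flows = flow_duration_curve([3.0, np.nan, 1.0, 2.0, 4.0])
for p, q in zip(exceedance, sorted_flows):
    print(f"P(exceed) = {p:.1f}: {q:.1f}")
```

With only a few observations the curve is coarse, which is one reason to require a minimum number of complete years.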

In [97]:
def count_complete_years(stn):
    """Return (station, number of complete years, list of complete years)."""
    df = retrieve_timeseries_discharge(stn)
    if df.empty:
        return (stn, 0, [])
    date_column = 'time'
    df = df.reset_index()
    # convert to datetime only if necessary
    if not np.issubdtype(df[date_column].dtype, np.datetime64):
        df[date_column] = pd.to_datetime(df[date_column])

    # keep only days with an observation (copy to avoid SettingWithCopyWarning)
    valid_data = df[df[f'{stn}_uar'].notna()].copy()

    # extract year, month, and day
    valid_data['year'] = valid_data[date_column].dt.year
    valid_data['month'] = valid_data[date_column].dt.month
    valid_data['day'] = valid_data[date_column].dt.day

    # count the distinct observed days in each year-month group
    month_counts = valid_data.groupby(['year', 'month'])['day'].nunique()

    # a month is complete if it has at least 20 observed days
    complete_months = (month_counts >= 20)

    # count how many complete months there are in each year
    complete_month_counts = complete_months.groupby(level=0).sum()

    # a year is complete if all 12 of its months are complete
    complete_years = complete_month_counts[complete_month_counts == 12]
    return (stn, len(complete_years), complete_years.index.tolist())
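
The completeness rule can be checked on synthetic dates. In the sketch below the function `complete_years` is a hypothetical standalone version of the same rule (a month needs at least 20 observed days, a year needs 12 complete months), applied to two made-up years, one of which is missing most of June.

```python
import pandas as pd


def complete_years(dates, min_days_per_month=20):
    """Return the years in which every month has at least `min_days_per_month` observed days."""
    dates = pd.DatetimeIndex(dates).normalize().unique()
    frame = pd.DataFrame({"year": dates.year, "month": dates.month})
    days_per_month = frame.groupby(["year", "month"]).size()
    complete_months = days_per_month >= min_days_per_month
    months_per_year = complete_months.groupby(level="year").sum()
    return months_per_year[months_per_year == 12].index.tolist()


dates_2001 = pd.date_range("2001-01-01", "2001-12-31", freq="D")
dates_2002 = pd.date_range("2002-01-01", "2002-12-31", freq="D")
# Drop June 11-30, 2002, leaving only 10 observed days in that month.
dates_2002 = dates_2002[~((dates_2002.month == 6) & (dates_2002.day > 10))]

years = complete_years(dates_2001.append(dates_2002))
print(years)  # [2001]
```

Note that under this rule a year can count as complete while still missing up to about a third of the days in each month.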

In [98]:
# parallelize the counting of complete years across all stations
from multiprocessing import Pool

with Pool(18) as pool:
    results = pool.map(count_complete_years, unique_stations)

In [None]:

# build a dictionary of the complete years of record for each station
min_years_of_record = 5
# don't filter by record length here; keep all information to allow filtering at the point of application
results = [
    (stn, n_years, years)
    for stn, n_years, years in results
    if isinstance(n_years, int) and isinstance(years, list)
]
complete_year_dict = {stn: {'complete_years': years, 'n_complete_years': n_years} for stn, n_years, years in results}

# save as json
import json
with open('data/complete_years.json', 'w') as f:
    json.dump(complete_year_dict, f, indent=4)

In [103]:
# keep the stations with at least `min_years_of_record` complete years of data
validated_stations = sorted(
    stn for stn, info in complete_year_dict.items()
    if info['n_complete_years'] >= min_years_of_record
)
N = len(validated_stations)
print(f'There are {N} stations with at least {min_years_of_record} complete years of data.')


There are 1030 stations with at least 5 complete years of data.


## Catchment Attributes 

In [105]:
rev_date = '20250227'
attribute_file = f'BCUB_watershed_attributes_updated_{rev_date}.csv'
attribute_fpath = os.path.join('data', attribute_file)
df = pd.read_csv(attribute_fpath, dtype={'official_id': str})
df = df[[c for c in df.columns if 'unnamed:' not in c.lower()]]
# exclude = ['15039900','15031000']
df.columns = [c.lower() for c in df.columns]
# assert '12414900' in df['official_id'].values
df.sort_values('official_id', inplace=True)
df.reset_index(drop=True, inplace=True)

df['n_complete_years'] = df['official_id'].apply(lambda x: complete_year_dict.get(x, {}).get('n_complete_years', np.nan))

In [107]:
climate_attributes = ['tmean', 'prcp', 'vp', 'swe', 'srad', 'low_prcp_duration', 'low_prcp_freq', 'high_prcp_duration', 'high_prcp_freq']
terrain_attributes = ['slope_deg', 'aspect_deg', 'elevation_m', 'log_drainage_area_km2']
soil_attributes = ['porosity_x100', 'logk_ice_x100']
land_cover_attributes = ['land_use_forest', 'land_use_shrubs', 'land_use_grass', 'land_use_wetland', 'land_use_crops', 
                       'land_use_urban', 'land_use_water', 'land_use_snow_ice']

if 'tmean' not in df.columns:
    # compute the mean temperature for each catchment
    df['tmean'] = (df['tmax'] + df['tmin']) / 2
if 'log_drainage_area_km2' not in df.columns:
    # natural log of (drainage area + 1)
    df['log_drainage_area_km2'] = np.log(df['drainage_area_km2'] + 1)

# save the dataframe with attributes
df.to_csv(attribute_fpath, index=False)




In [108]:
from pathlib import Path
from bokeh.plotting import figure, show, gridplot
from bokeh.io import output_notebook
import numpy as np
output_notebook()


def compute_empirical_cdf(values):
    """Compute the empirical cumulative distribution function (CDF) of the given values."""
    sorted_values = np.sort(values)
    cdf = np.arange(1, len(sorted_values) + 1) / len(sorted_values)
    return sorted_values, cdf


def plot_cdf(values, label):
    """Plot the empirical CDF of `values`; `label` names the legend entry and the x-axis."""
    fig = figure(width=700, height=400)
    x, y = compute_empirical_cdf(values)
    fig.line(x, y, legend_label=label, line_width=2)
    # land use fractions cluster near zero, so move their legend out of the way
    fig.legend.location = 'bottom_right' if label.startswith('land_use') else 'top_left'

    fig.xaxis.axis_label = label
    fig.yaxis.axis_label = 'Cumulative Probability'
    fig.legend.background_fill_alpha = 0.6
    fig = dpf.format_fig_fonts(fig, font_size=14)
    return fig
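
As a quick check of the empirical CDF calculation, the sketch below applies the same `k / n` rule to four made-up values. The function name `empirical_cdf` is used only for this illustration.

```python
import numpy as np


def empirical_cdf(values):
    """Return the sorted values and cumulative probabilities k/n for k = 1..n."""
    sorted_values = np.sort(np.asarray(values, dtype=float))
    n = len(sorted_values)
    cdf = np.arange(1, n + 1) / n
    return sorted_values, cdf


x, p = empirical_cdf([3.0, 1.0, 2.0, 2.0])
for xi, pi in zip(x, p):
    print(f"{xi:.1f} -> {pi:.2f}")
# 1.0 -> 0.25
# 2.0 -> 0.50
# 2.0 -> 0.75
# 3.0 -> 1.00
```

Tied values appear as a vertical step in the plotted curve. `np.sort` places any NaN values at the end and they still count toward `n`, so it may be worth dropping missing values before plotting.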


In [109]:
figs = []
for c in terrain_attributes:
    values = df[c].values
    print(f'{c} - {np.mean(values):.2f} [{np.min(values):.2f}, {np.max(values):.2f}]')
    cdf_fig = plot_cdf(values, label=c)
    cdf_fig = dpf.format_fig_fonts(cdf_fig, font_size=14)
    figs.append(cdf_fig)

lt = gridplot(figs, ncols=4, width=300, height=300)
show(lt)

slope_deg - 16.86 [0.38, 35.12]
aspect_deg - 183.37 [0.22, 359.81]
elevation_m - 1081.06 [23.42, 2438.27]
log_drainage_area_km2 - 5.32 [0.53, 12.52]


In [110]:
figs = []
for c in climate_attributes:
    values = df[c].values
    cdf_fig = plot_cdf(values, label=c)
    figs.append(cdf_fig)

lt = gridplot(figs, ncols=4, width=300, height=300)
show(lt)

In [111]:
figs = []
for c in soil_attributes:
    values = df[c].values
    cdf_fig = plot_cdf(values, label=c)
    figs.append(cdf_fig)

lt = gridplot(figs, ncols=4, width=300, height=300)
show(lt)

In [112]:
figs = []
for c in land_cover_attributes:
    values = df[f'{c}_frac_2010'].values
    cdf_fig = plot_cdf(values, label=c)
    figs.append(cdf_fig)

lt = gridplot(figs, ncols=4, width=300, height=300)
show(lt)

## Next steps

In the subsequent chapters, the monitoring network station catchments are updated, and the catchment attributes are re-extracted and compared against the HYSETS values.  The target variables are then derived from the streamflow time series before training the gradient boosting models to test their predictability from catchment attributes.

## Citations

```{bibliography}
:filter: docname in docnames
```