# Data

In this notebook we:

1.  Describe the information derived from other data sources.
2.  Review catchments for quality: minimum period of record, and regulation or other data quality issues related to human influence on the runoff regime.
3.  Compute the empirical (reference) distributions for all catchments meeting minimum data requirements.
4.  Compute the global mean PDF / PMF across all catchments.
5.  View the distribution of catchment attributes (used in the first method, parametric prediction of FDCs).

## Introduction

```{figure} ../images/figure_1_study_region.png
---
alt: Study region polygons and HYSETS monitored catchment polygons.
name: study-region-fig
width: 700px
align: center
---
Study region polygons and WSC + USGS active (green triangles) and historical (yellow triangles) streamflow monitoring stations.  The purple dots represent ungauged catchments characterized in the BCUB dataset {cite}`kovacek2025bcub`, but they are not used in this study.  
```

The streamflow data used in this study comes from *The Hydrometeorological Sandbox École de Technologie Supérieure* (HYSETS) {cite}`arsenault2020comprehensive`.  The HYSETS data, including streamflow time series and attributes for 14,425 catchments can be accessed at [https://osf.io/rpc3w/](https://osf.io/rpc3w/) (As of 2025-07-04, the streamflow timeseries and attribute filename is `HYSETS_2023_update_QC_stations.nc`).  We use a subset of approximately 1620 catchments contained in major basins covering and bounding British Columbia, as shown in {numref}`Figure {number} <study-region-fig>`. 



### Catchment attributes

Catchment attributes are used in all three models. They are derived from four geospatial data sources, listed below together with the streamflow data source:

**Table: Summary of input data sources used to characterize attributes of monitored catchments**

| **Data Type**           | **Source Name**                                                      | **Reference**                            |
|-------------------------|----------------------------------------------------------------------|------------------------------------------|
| Daily streamflow        | Large sample hydrology dataset for N. America and Mexico (HYSETS)   | {cite}`arsenault2020comprehensive` |
| Terrain                 | USGS 1 arc-second Digital Elevation Data (3DEP)                      | {cite}`3dep`                                   |
| Land cover              | North American Land Change Monitoring System (NALCMS)               | {cite}`latifovic2010nalcms` |
| Soil properties         | Global hydrogeological dataset (GLHYMPS)                            | {cite}`gleeson2014glimpse` |
| Meteorological forcings | Daily surface weather and climatological summaries ([Daymet](https://daymet.ornl.gov/))         | {cite}`thornton2022daymet` |


For details on the data processing pipeline for the catchment attributes, see *BCUB - A large sample ungauged basin attribute dataset for British Columbia, Canada* {cite}`kovacek2025bcub` ([https://doi.org/10.5194/essd-2023-508](https://doi.org/10.5194/essd-2023-508)).  Pre-processed catchment attributes are provided in the `data/` folder of this repository, and they can be used directly in the notebook.    



### Daily Meteorological Forcings

The catchment attributes related to meteorological forcings are single-valued catchment indices of each variable. However, training the LSTM neural network model requires daily meteorological forcings for the catchments in the study region.  These are derived from the Daymet dataset {cite}`thornton2022daymet`, which provides daily meteorological data at a 1 km resolution.  The forcings include:

* **Precipitation**: total daily precipitation in mm
* **Minimum daily temperature**: minimum daily temperature in degrees Celsius
* **Maximum daily temperature**: maximum daily temperature in degrees Celsius
* **Shortwave radiation**: average daily shortwave radiation in W/m²
* **Vapour pressure**: daily average vapour pressure in Pa
* **Snow water equivalent**: total daily snow water equivalent in mm

These must be processed to catchment-average daily timeseries in NetCDF format for each catchment, according to the [NeuralHydrology documentation](https://neuralhydrology.readthedocs.io/en/latest/tutorials/add-dataset.html).  The daily timeseries have been processed for the sample of catchments in this study, and they can be accessed at [https://doi.org/10.5683/SP3/65FXAS](https://doi.org/10.5683/SP3/65FXAS).  The full replication code for processing the meteorological forcings from the Daymet dataset is provided at [https://github.com/dankovacek/process_metforcings](https://github.com/dankovacek/process_metforcings).
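
Conceptually, a catchment-average forcing is an area-weighted mean of the grid cells that intersect the catchment polygon. The sketch below illustrates that idea with NumPy on made-up numbers. The array names and the cell fractions are hypothetical, and it is not the processing pipeline used for the published timeseries.

```python
import numpy as np

# hypothetical daily precipitation (mm) for 3 days over 4 grid cells
precip = np.array([
    [2.0, 4.0, 6.0, 8.0],
    [0.0, 0.0, 1.0, 3.0],
    [5.0, 5.0, 5.0, 5.0],
])

# hypothetical fraction of each grid cell lying inside the catchment
cell_fraction = np.array([1.0, 0.5, 0.25, 0.0])

# normalize the fractions so the weights sum to one
weights = cell_fraction / cell_fraction.sum()

# weighted mean over the cell axis gives one value per day
catchment_mean = precip @ weights
print(catchment_mean)
```

A spatially uniform field (the third day) returns the same value regardless of the weights, which is a quick sanity check for this kind of averaging.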



### Pre-processed data files

The following pre-processed files are included in the `data/` folder of the repository at [https://github.com/dankovacek/distribution_estimation](https://github.com/dankovacek/distribution_estimation):

```{note}
Before proceeding with the computations in the notebook, the streamflow time series and (optionally) catchment boundaries from the HYSETS dataset must be downloaded from the [HYSETS open data repository](https://osf.io/rpc3w/).  Some data are provided in the `data/` folder as part of this repository.  Data pre-processing can be skipped by downloading the input data files from (add dataset repository link)
```

### Additional data from external sources

Download the following files and update the file paths below to your local file system:

**FDC estimation by log-normal distribution parameter prediction**:
* catchment attributes: `data/BCUB_watershed_attributes_updated_20250227.csv`
* streamflow summary statistics (see Notebook 3): `data/catchment_attributes_with_runoff_stats.csv`

**FDC estimation by k-nearest neighbours**:
* catchment attributes as above
* daily streamflow timeseries (as published in HYSETS): `data/HYSETS_2023_update_QC_stations.nc`.  Must be downloaded from the HYSETS open data repository at [https://osf.io/rpc3w/](https://osf.io/rpc3w/).

**FDC estimation by recurrent neural network model (LSTM)**:
* catchment attributes as above are used as conditioning variables
* The LSTM FDC estimation is done using the [NeuralHydrology](https://neuralhydrology.readthedocs.io/en/latest/) Python library.  The LSTM model uses daily meteorological timeseries for the HYSETS stations in the study region.  Processing catchment-average daily timeseries is computationally intensive, so pre-processed timeseries are provided for six meteorological variables (precipitation, min and max daily temperature, shortwave radiation, vapour pressure, snow water equivalent).
* Pre-processed daily meteorological forcings are provided at [https://doi.org/10.5683/SP3/65FXAS](https://doi.org/10.5683/SP3/65FXAS) and should be downloaded to replicate the LSTM modelling component.  
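
As a rough illustration of the first approach, a log-normal FDC is fully determined by the mean and standard deviation of log-transformed unit area runoff. The sketch below uses only the Python standard library. The function name `lognormal_fdc` and the parameter values are hypothetical and are not taken from the study code.

```python
import math
from statistics import NormalDist


def lognormal_fdc(mu_log, sigma_log, exceedance_probs):
    """Return flows exceeded with the given probabilities under a log-normal model.

    mu_log and sigma_log are the mean and standard deviation of ln(UAR).
    """
    std_normal = NormalDist()
    flows = []
    for p in exceedance_probs:
        # an exceedance probability p corresponds to the (1 - p) quantile
        z = std_normal.inv_cdf(1.0 - p)
        flows.append(math.exp(mu_log + sigma_log * z))
    return flows


# hypothetical parameters: median UAR of 10 L/s/km^2, log-space std of 1.2
probs = [0.05, 0.5, 0.95]
fdc = lognormal_fdc(math.log(10.0), 1.2, probs)
for p, q in zip(probs, fdc):
    print(f"flow exceeded {p:.0%} of the time: {q:.2f} L/s/km^2")
```

Under this model the flow exceeded 50% of the time equals `exp(mu_log)`, so predicting the two log-space parameters is enough to reconstruct the whole curve.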



## View the data

In [70]:
import os
import json
from time import time
from pathlib import Path
import pandas as pd
import numpy as np
import geopandas as gpd
from multiprocessing import Pool

from utils.kde_estimator import KDEEstimator
from utils import data_processing_functions as dpf

In [2]:
# update this to the path where you stored `HYSETS_2023_update_QC_stations.nc`
BASE_DIR = Path(os.getcwd())
HYSETS_DIR = Path('/home/danbot/code/common_data/HYSETS')

# import the HYSETS attributes data
hysets_df = pd.read_csv(HYSETS_DIR / 'HYSETS_watershed_properties.txt', sep=';')
da_dict = {row['Official_ID']: row['Drainage_Area_km2'] for _, row in hysets_df.iterrows()}
official_id_dict = {row['Official_ID']: row['Watershed_ID'] for _, row in hysets_df.iterrows()}

In [18]:
camels_df = pd.read_csv('data/camels/camels_hydro.txt', sep=';')
camels_df['gauge_id'] = camels_df['gauge_id'].astype(str)
camels_df.head()


Unnamed: 0,gauge_id,q_mean,runoff_ratio,slope_fdc,baseflow_index,stream_elas,q5,q95,high_q_freq,high_q_dur,low_q_freq,low_q_dur,zero_q_freq,hfd_mean
0,1013500,1.699155,0.543437,1.528219,0.585226,1.845324,0.241106,6.373021,6.1,8.714286,41.35,20.170732,0.0,207.25
1,1022500,2.173062,0.602269,1.77628,0.554478,1.702782,0.204734,7.123049,3.9,2.294118,65.15,17.144737,0.0,166.25
2,1030500,1.820108,0.555859,1.87111,0.508441,1.377505,0.107149,6.854887,12.25,7.205882,89.25,19.402174,0.0,184.9
3,1031500,2.030242,0.576289,1.494019,0.445091,1.648693,0.111345,8.010503,18.9,3.286957,94.8,14.697674,0.0,181.0
4,1047000,2.18287,0.656868,1.415939,0.473465,1.510238,0.196458,8.095148,14.95,2.577586,71.55,12.776786,0.0,184.8


### Import the study region stations

In [3]:
station_fpath = 'data/study_region_stations.geojson'
bcub_gdf = gpd.read_file(station_fpath)
bcub_gdf['watershedID'] = bcub_gdf['Official_ID'].apply(lambda x: official_id_dict.get(x, None))
# get the number of unique stations in the dataset
unique_stations = np.unique(bcub_gdf['Official_ID'])
print(f'{len(unique_stations)} unique monitored catchments in the dataset')
# what is the minimum drainage area of the BCUB stations?
min_da = bcub_gdf['Drainage_Area_km2'].min()
print(f'Minimum drainage area of the BCUB stations: {min_da:.3f} km²')

1618 unique monitored catchments in the dataset
Minimum drainage area of the BCUB stations: 1.010 km²


In [4]:
# visualize the locations (centroids) of the catchments
# convert to geodataframe
# convert coordinate reference system to 3857 for plotting
gdf = bcub_gdf.copy().to_crs(3857)
bbox = gdf.geometry.total_bounds

In [5]:
# visualize the catchment centroid locations
from bokeh.plotting import figure, show
from bokeh.io import output_notebook
from bokeh.palettes import Colorblind, Sunset10
output_notebook()

# range bounds supplied in web mercator coordinates
p = figure(x_axis_type="mercator", y_axis_type="mercator", width=700, height=400,
          x_range=(bbox[0], bbox[2]), y_range=(bbox[1], bbox[3]))
p.add_tile("CartoDB Positron", retina=True)
p.scatter(x=gdf.geometry.x, y=gdf.geometry.y, color='orange', size=4)

show(p)

### Import the catchment attributes

```{note}
Some stations are excluded from the analysis due to data quality issues.  These are listed in the `exclude_stations` list below.
```

In [None]:
def match_with_padding(oid):
    if oid in hysets_df['Official_ID'].values:
        return oid
    print(f'{oid} not found in HYSETS data, trying padded versions...')
    for pad in range(1, 4):
        padded = oid.zfill(len(oid) + pad)
        if padded in hysets_df['Official_ID'].values:
            print(f'    Found padded version: {padded}')
            return padded
    raise ValueError(f"Official ID {oid} not found in HYSETS data, even with padding.")

rev_date = '20250227'
attribute_file = f'BCUB_watershed_attributes_updated_{rev_date}.csv'
updated_attribute_file = 'catchment_attributes_with_runoff_stats.csv'
if not os.path.exists(os.path.join('data', updated_attribute_file)):
    print(f'Updated attribute file {updated_attribute_file} not found. Using {attribute_file} instead.')
    updated_attribute_path = os.path.join('data', attribute_file)
    process_statistics = True
else:
    updated_attribute_path = os.path.join(os.getcwd(), 'data', updated_attribute_file)
    process_statistics = False

attr_df = pd.read_csv(updated_attribute_path, dtype={'official_id': str})
attr_df['official_id'] = attr_df['official_id'].apply(lambda x: match_with_padding(x))
attr_df = attr_df[[c for c in attr_df.columns if 'unnamed:' not in c.lower()]]
attr_df.columns = [c.lower() for c in attr_df.columns]
attr_df.sort_values('official_id', inplace=True)
attr_df.reset_index(drop=True, inplace=True)
print(len(attr_df), 'catchments in the attribute file.')

# filter the bcub_gdf for stations in attr_df
# bcub_gdf = bcub_gdf[bcub_gdf['Official_ID'].isin(attr_df['official_id'].values)]
# print(f'{len(bcub_gdf)} catchments in the BCUB dataset after filtering for attributes.')

Updated attribute file catchment_attributes_with_runoff_stats.csv not found. Using BCUB_watershed_attributes_updated_20250227.csv instead.
212414900 not found in HYSETS data, trying padded versions...
    Found padded version: 0212414900
5010500 not found in HYSETS data, trying padded versions...
    Found padded version: 05010500
5012000 not found in HYSETS data, trying padded versions...
    Found padded version: 05012000
5014000 not found in HYSETS data, trying padded versions...
    Found padded version: 05014000
5014500 not found in HYSETS data, trying padded versions...
    Found padded version: 05014500
5017500 not found in HYSETS data, trying padded versions...
    Found padded version: 05017500
1308 catchments in the attribute file.
1307 catchments in the BCUB dataset after filtering for attributes.
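
The padding messages above appear to arise where leading zeros have been lost from station identifiers. A minimal, self-contained sketch of the same matching idea is shown below. It uses a set for membership tests and a small made-up collection of reference IDs, so the names `pad_to_known_id` and `known_ids` are illustrative and not part of the notebook.

```python
def pad_to_known_id(station_id, known_ids, max_pad=3):
    """Return station_id, left-padded with zeros if needed to match known_ids."""
    if station_id in known_ids:
        return station_id
    for pad in range(1, max_pad + 1):
        candidate = station_id.zfill(len(station_id) + pad)
        if candidate in known_ids:
            return candidate
    raise ValueError(f"{station_id} not found, even with zero padding")


# hypothetical reference IDs stored as strings (leading zeros preserved)
known_ids = {"05010500", "0212414900", "08NC003"}

print(pad_to_known_id("5010500", known_ids))
print(pad_to_known_id("212414900", known_ids))
print(pad_to_known_id("08NC003", known_ids))
```

Reading identifier columns as strings at every step (for example with `dtype=str` in `pd.read_csv`) avoids the problem at the source, since the zeros are only lost when an ID is parsed as a number.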


## Streamflow data validation

Given the range of environmental conditions and the dynamic nature of rivers, streamflow monitoring is a challenging task.  It is common for stations to be damaged by high flows or affected by ice, erosion, or sediment deposition.  Streamflow monitoring stations require periodic maintenance, and gaps in records are common.  The figure below illustrates a critical issue underlying hydrological studies: the continuity of streamflow records.

```{figure} images/weekly_data_availability.png
---
alt: A visualization of weekly data availability for the streamflow monitoring stations in the study region shows many gaps in the records.
name: data-continuity-fig
width: 800px
align: center
---
Discontinuous and non-overlapping records are a problem underlying any hydrological analysis, and the problem is compounded for large sample studies.
```

### Import streamflow timeseries

In [28]:
import xarray as xr
# Load dataset
streamflow = xr.open_dataset(HYSETS_DIR / 'HYSETS_2023_update_QC_stations.nc')

# Promote 'watershedID' to a coordinate on 'watershed'
streamflow = streamflow.assign_coords(watershedID=("watershed", streamflow["watershedID"].data))

# Set 'watershedID' as index
streamflow = streamflow.set_index(watershed="watershedID")

# Select only watershedIDs present in bcub_gdf
valid_ids = [int(wid) for wid in bcub_gdf['watershedID'].values if wid in streamflow.watershed.values]
ds = streamflow.sel(watershed=valid_ids)

In [29]:
def retrieve_timeseries_discharge(stn):
    watershed_id = official_id_dict[stn]
    da = da_dict[stn]
    try:
        df = ds['discharge'].sel(watershed=str(watershed_id)).to_dataframe(name='discharge').reset_index()
    except KeyError:
        print(f"Warning: Station {stn} not found in dataset under watershedID {watershed_id}.")
        return pd.DataFrame()
    
    df = df.set_index('time')[['discharge']]
    df.dropna(inplace=True)
    df['zero_flow_flag'] = df['discharge'] == 0
    # df['discharge'] = np.clip(df['discharge'], 1e-4, None)
    # df.rename(columns={'discharge': stn}, inplace=True)
    df[f'{stn}_uar'] = 1000 * df['discharge'] / da
    df[f'{stn}_mm'] = df['discharge'] * (24 * 3.6 / da)
    df['replaced_zero_flow_uar'] = df['discharge'].clip(1e-4) * (1000 / da)
    df['log_uar'] = np.log(df['replaced_zero_flow_uar'])
    return df
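
The unit conversions in `retrieve_timeseries_discharge` can be checked by hand. The sketch below assumes discharge in m³/s and drainage area in km², with made-up values, and confirms that the unit area runoff and runoff depth expressions are consistent with each other.

```python
# hypothetical inputs
discharge_m3s = 2.5        # mean daily discharge, m^3/s
drainage_area_km2 = 50.0   # drainage area, km^2

# unit area runoff: 1 m^3/s = 1000 L/s, normalized by drainage area
uar_l_s_km2 = 1000 * discharge_m3s / drainage_area_km2

# runoff depth: (m^3/s * 86400 s/day) / (km^2 * 1e6 m^2/km^2) * 1000 mm/m
runoff_mm_day = discharge_m3s * (24 * 3.6 / drainage_area_km2)

# the two are related by 1 L/s/km^2 = 0.0864 mm/day
print(uar_l_s_km2, runoff_mm_day)
```

The factor `24 * 3.6` is 86.4, which combines the seconds per day with the conversions from km² to m² and from m to mm.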

In [30]:
# test loading streamflow time series for a pair of monitoring stations
s1, s2 = unique_stations[0], unique_stations[1]

df1 = retrieve_timeseries_discharge(s1)
df2 = retrieve_timeseries_discharge(s2)
test_df = pd.concat([df1, df2], axis=1)       

flow_fig = figure(width=700, height=350, x_axis_type='datetime')
flow_fig.line(test_df.index, test_df[f'{s1}_uar'], color='navy', legend_label=s1)
flow_fig.line(test_df.index, test_df[f'{s2}_uar'], color='dodgerblue', legend_label=s2)
flow_fig.yaxis.axis_label = r'$$\text{Unit Area Runoff } L/s/\text{km}^2$$'
flow_fig.xaxis.axis_label = r'$$\text{Date}$$'
show(flow_fig)

From the above climate plots, it is clear there is very little information in the last three attributes, `low_prcp_freq`, `high_prcp_duration`, `high_prcp_freq`.  

### Extra catchments to exclude

* Kakuhan Creek Near Haines AK - 15056030

```{figure} images/kakuhan_creek.png
---
width: 600px
alt: Example of a catchment polygon with delineation issues.
---
There is uncertainty in the delineation of the Kakuhan Creek catchment polygon.  The historical station location does not align with the stream network derived from 30m DEM data.
```

### Excluded due to no complete years of data (seasonal / <= 90% complete)

* Genessee Creek at the Mouth - 08FA009
* McNair Creek near Port Mellon - 08GA037
* Canoe River near Valemount - 08NC003
* Big Quilcene River Near Quilcene, WA - 12052500
* Morey Creek above McChord Afb near Parkland, WA - 12090480
* North Fork Newaukum Creek Near Enumclaw, WA - 12107950
* Newaukum Creek Tributary Near Black Diamond, WA - 12108450
* May Creek near Issaquah, WA - 12119300
* Honey Creek near Renton, WA - 12119450
* Carpenter Creek near Bacon Rod near Mount Vernon, WA - 12200684
* Unnamed Tributary Massacre Bay on Orcas Island, WA - 12200762
* Whatcom Creek near Bellingham, WA - 12203000
* Hall Creek at Inchelium, WA - 12409500
* Dayebas Creek Near Haines, AK - 15056070
* Bonne Creek near Klawock, AK - 15081510

### Stations representing regulated streams that QA appears to have missed

* 12323760 - Silver Lake Dam
* 12143700 - Boxley Creek near Cedar Falls (unregulated but heavily influenced by seepage from adjacent reservoir -- Chester Morse Lake)
* 12143900 - Boxley Creek Near Edgewick, WA (also unregulated but heavily influenced by seepage from adjacent reservoir -- Chester Morse Lake)
* 12117500 - Cedar River at Landsburg below the diversion dam
* 12398000 - Sullivan Lake
* 12058800 - NF SKOKOMISH R BELOW LWR CUSHMAN DAM NR POTLATCH, WA
* 12137800 - Sultan River below diversion dam
* 12100000 - White River near Buckley - Several upstream diversions for a) power generation, b) flood control

In [31]:
exclude_stations = ['08FA009', '08GA037', '08NC003', '12052500', '12090480', '12107950', '12108450', '12119300', 
                    '12119450', '12200684', '12200762', '12203000', '12409500', '15056070', '15081510',
                    '12323760', '12143700', '12143900', '12398000', '12058800', '12137800', '12100000']

for ex_stn in exclude_stations:
    # check if it is in the QC Hysets list
    df = retrieve_timeseries_discharge(ex_stn)
    if df.empty:
        print(f'Station not included in HYSETS.')
        continue
    else:
        print(f'{ex_stn} found in HYSETS despite known issues.')
    

Station not included in HYSETS.
Station not included in HYSETS.
Station not included in HYSETS.
Station not included in HYSETS.
Station not included in HYSETS.
Station not included in HYSETS.
Station not included in HYSETS.
Station not included in HYSETS.
Station not included in HYSETS.
Station not included in HYSETS.
Station not included in HYSETS.
Station not included in HYSETS.
Station not included in HYSETS.
Station not included in HYSETS.
Station not included in HYSETS.
12323760 found in HYSETS despite known issues.
12143700 found in HYSETS despite known issues.
12143900 found in HYSETS despite known issues.
12398000 found in HYSETS despite known issues.
12058800 found in HYSETS despite known issues.
12137800 found in HYSETS despite known issues.
12100000 found in HYSETS despite known issues.


### Streamflow data validation for length of record


Here we set the minimum record length required to define a period-of-record (POR) flow duration curve.

In [32]:
def count_complete_years(stn):
    """Return (station, number of complete years, list of complete years)."""
    df = retrieve_timeseries_discharge(stn)
    if df.empty:
        return (stn, 0, [])
    date_column = 'time'
    df = df.reset_index()
    # Convert to datetime only if necessary
    if not np.issubdtype(df[date_column].dtype, np.datetime64):
        df[date_column] = pd.to_datetime(df[date_column])

    # Filter out missing values first (copy to avoid modifying a view)
    valid_data = df[df[f'{stn}_uar'].notna()].copy()

    # Extract year, month and day
    valid_data['year'] = valid_data[date_column].dt.year
    valid_data['month'] = valid_data[date_column].dt.month
    valid_data['day'] = valid_data[date_column].dt.day

    # Count the number of observed days in each year-month group
    month_counts = valid_data.groupby(['year', 'month'])['day'].nunique()

    # Identify complete months (at least 20 observed days)
    complete_months = (month_counts >= 20)

    # Count how many complete months there are in each year
    complete_month_counts = complete_months.groupby(level=0).sum()

    # A complete year has 12 complete months
    complete_years = complete_month_counts[complete_month_counts == 12]
    return (stn, len(complete_years), complete_years.index.tolist())
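
The completeness rule in `count_complete_years` (at least 20 observed days in each of the 12 months) can be tried on synthetic dates. The sketch below is a simplified stand-in that skips the streamflow retrieval. The date range and the gap are made up for illustration.

```python
import pandas as pd

# two years of synthetic daily dates, with most of June 2001 removed
dates = pd.date_range("2000-01-01", "2001-12-31", freq="D")
gap = (dates.year == 2001) & (dates.month == 6) & (dates.day > 10)
dates = dates[~gap]

obs = pd.DataFrame({"year": dates.year, "month": dates.month, "day": dates.day})

# a month is complete with at least 20 observed days
days_per_month = obs.groupby(["year", "month"])["day"].nunique()
complete_months = (days_per_month >= 20).groupby(level="year").sum()

# a year is complete when all 12 months pass the test
complete_years = complete_months[complete_months == 12].index.tolist()
print(complete_years)
```

Only 10 days remain in June 2001, so that year fails the test even though the other eleven months are fully observed.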

In [33]:
# minimum number of complete years required for a station to be retained
min_years_of_record = 5

complete_yr_fpath = 'data/complete_years.json'
if not os.path.exists(complete_yr_fpath):
    with Pool() as pool:
        results = pool.map(count_complete_years, unique_stations)
    # don't filter here, keep all information to allow filtering at the point of application
    results = [
        (stn, n_years, years)
        for stn, n_years, years in results
        if isinstance(n_years, int) and isinstance(years, list)
    ]
    complete_year_dict = {stn: {'complete_years': years, 'n_complete_years': n_years} for stn, n_years, years in results}
    with open(complete_yr_fpath, 'w') as f:
        json.dump(complete_year_dict, f, indent=4)
else:
    with open(complete_yr_fpath, 'r') as f:
        complete_year_dict = json.load(f)

In [38]:
# filter stations by record length and by known regulation or human influence
validated_stations = sorted(list(complete_year_dict.keys()))
# temporally validated
validated_stations = [stn for stn in validated_stations if complete_year_dict[stn]['n_complete_years'] >= min_years_of_record]
# remove excluded for human influence
validated_stations = [stn for stn in validated_stations if stn not in exclude_stations]
# N = len(validated_stations)
attr_df = attr_df[attr_df['official_id'].isin(validated_stations)]
print(f'There are {len(attr_df)} unregulated monitoring stations with at least {min_years_of_record} complete years of data.')


There are 1017 unregulated monitoring stations with at least 5 complete years of data.


In [62]:
from lmoments3 import distr

def compute_runoff_stats(data):
    out = {}
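    # NOTE: the statistics for both labels are computed on log-transformed values:
    # 'log_uar' is already in log space, and the linear series is transformed below
    for label in ['replaced_zero_flow_uar', 'log_uar']: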
        if label.startswith('log_'):
            vals = data[label].values
        else:
            vals = np.log(data[label].values)
        vals = vals[~np.isnan(vals) & ~np.isinf(vals)]
        # classical moments
        m   = vals.mean()
        median = np.median(vals)
        s   = vals.std(ddof=1)
        mad = np.mean(np.abs(vals - m))
        sk  = pd.Series(vals).skew()
        kt  = pd.Series(vals).kurtosis()

        # l-moments
        # params = distr.gev.lmom_fit(vals)

        out.update({
            f'{label}_mean': m,
            f'{label}_median': median,
            f'{label}_mad': mad,
            f'{label}_std': s,
            f'{label}_skew': sk,
            f'{label}_kurt': kt,
            # f'{label}_lmom_xi': params['c'],
            # f'{label}_lmom_loc': params['loc'],
            # f'{label}_lmom_scale': params['scale'],
        })
    return out

def process_row(row):
    """Compute runoff statistics for one row (station) of the attribute table."""
    stn = str(row['official_id'])
    data = retrieve_timeseries_discharge(stn)

    # Compute the runoff statistics
    runoff_data = compute_runoff_stats(data)

    # Look up the CAMELS mean flow, if the station is also a CAMELS gauge
    camels_data = camels_df[camels_df['gauge_id'] == stn]
    if len(camels_data) > 1:
        raise Exception(f'Multiple CAMELS data found for {stn}.')
    camels_q = camels_data['q_mean'].values[0] if not camels_data.empty else np.nan

    # Combine the runoff statistics with the CAMELS mean flow
    out = {
        **runoff_data,
        'camels_q_mean_mm': camels_q,
    }
    return pd.Series(out)


In [63]:
updated_attribute_file = 'catchment_attributes_with_runoff_stats.csv'
if not os.path.exists(os.path.join('data', updated_attribute_file)):
    print(f'Updated attribute file {updated_attribute_file} not found. Using {attribute_file} instead.')
    updated_attribute_path = os.path.join('data', attribute_file)
    process_statistics = True
else:
    updated_attribute_path = os.path.join(os.getcwd(), 'data', updated_attribute_file)
    process_statistics = False


Updated attribute file catchment_attributes_with_runoff_stats.csv not found. Using BCUB_watershed_attributes_updated_20250227.csv instead.


In [64]:
if process_statistics:
    print(f'Processing runoff statistics for {len(validated_stations)} stations')
    updated_fpath = os.path.join(os.getcwd(), 'data', 'catchment_attributes_with_runoff_stats.csv')
    # compute runoff statistics for each station (row) and add them as new columns
    stats_results = attr_df.apply(process_row, axis=1)
    attr_df.loc[stats_results.index, stats_results.columns] = stats_results
    print(f'   Saving updated attributes with runoff statistics for {len(attr_df)} catchments to:', updated_fpath)
    attr_df.to_csv(updated_fpath)

Processing runoff statistics for 1023 stations
   Saving updated attributes with runoff statistics for 1017 catchments to: /home/danbot/code/distribution_estimation/docs/notebooks/data/catchment_attributes_with_runoff_stats.csv


In [65]:
# import the HYSETS attributes data
ws_id_dict = hysets_df.set_index('Official_ID')['Watershed_ID'].to_dict()
da_dict = hysets_df.set_index('Official_ID')['Drainage_Area_km2'].to_dict()
official_id_dict = {row['Official_ID']: row['Watershed_ID'] for _, row in hysets_df.iterrows()}

In [78]:
# import the catchment attributes (with runoff statistics) for the BCUB study region
bcub_df = pd.read_csv(os.path.join('data', 'catchment_attributes_with_runoff_stats.csv'), dtype={'official_id': str})
bcub_df['official_id'] = bcub_df['official_id'].astype(str)
bcub_df = bcub_df[bcub_df['official_id'].isin(validated_stations)].copy()
# map the Hysets watershed IDs to the BCUB watershed IDs
# create a dict to map HYSETS watershed IDs to the Official station IDs
bcub_df['watershedID'] = bcub_df['official_id'].apply(lambda x: official_id_dict.get(x, None))
validated_stations = sorted(bcub_df['official_id'].unique())
print(f'   Found {len(bcub_df)} catchments in the BCUB region with runoff statistics.')

   Found 1017 catchments in the BCUB region with runoff statistics.


## Compute reference (observed) distributions

Here we compute the baseline, or empirical probability distribution of daily unit area runoff (UAR, in L/s/km²) for each monitored catchment in the study region based on observed streamflow records. For each station, we construct an empirical probability mass function (PMF) over a common support $\Omega$, discretized into $N = 2^{12}$ bins of equal width in log space. This approach ensures that all catchments are evaluated on a consistent, physically meaningful scale, capturing the full range of observed runoff values from $5 \times 10^{-6}$ to $10^{4}$ L/s/km².  

The number of bins corresponds to 12-bit encoding, where bin edges lie within about $\pm 0.26\%$ of the quantized value (bin midpoint). The full bin width is therefore about $0.52\%$, well under $1\%$, which represents precision beyond what is typically achievable with daily mean streamflow estimation. The resulting reference distributions serve as the baseline for evaluating predicted flow duration (reliability) curves.
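
The precision of the quantization can be checked directly from the grid definition. The sketch below rebuilds the log-spaced edges from the stated bounds and bin count and reports the relative bin size. The variable names are chosen for illustration.

```python
import numpy as np

n_bits = 12
log_min, log_max = np.log(5e-6), np.log(1e4)  # bounds in L/s/km^2
log_edges = np.linspace(log_min, log_max, 2**n_bits + 1)

# bins have equal width in log space
log_width = np.diff(log_edges)[0]

# relative distance from a bin midpoint to its upper edge (same for every bin)
half_width_pct = 100 * (np.exp(log_width / 2) - 1)
# relative size of a whole bin, measured against its lower edge
full_width_pct = 100 * (np.exp(log_width) - 1)

print(f"{2**n_bits} bins, log width {log_width:.6f}")
print(f"edges are within about +/- {half_width_pct:.3f}% of the midpoint")
print(f"full bin width is {full_width_pct:.3f}% of the lower edge")
```

Because the bins are uniform in log space, the relative precision is the same at every flow magnitude, from the lowest to the highest bin.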

The empirical PMF is computed for the evaluation of "ground truth" over the common support $\Omega$, allowing states to have zero probability.  This is a deliberate choice to represent what has been observed as faithfully as possible.  Streamflow in nature is a continuous variable, so the commonly observed bins with zero probability lying between bins with nonzero probability are a result of finite sampling.  A secondary "baseline" is computed here using a kernel density estimator (KDE) with a log-normal kernel and a bandwidth set according to an assumed measurement error model. These KDE-based distributions are used as part of the FDC estimation models described in Notebook 2.
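
The effect of finite sampling on the empirical PMF can be seen with a small synthetic sample. The sketch below is not the `KDEEstimator` used in this notebook: it applies a Gaussian kernel to log-transformed values with an arbitrary fixed bandwidth, as a stand-in for the measurement-error-based bandwidth, and the sample itself is made up.

```python
import numpy as np

rng = np.random.default_rng(0)
# hypothetical sample: 200 log-normally distributed UAR values (L/s/km^2)
sample = rng.lognormal(mean=2.0, sigma=1.0, size=200)

log_edges = np.linspace(np.log(5e-6), np.log(1e4), 2**12 + 1)
log_x = 0.5 * (log_edges[:-1] + log_edges[1:])

# empirical PMF: normalized counts per bin
counts, _ = np.histogram(np.log(sample), bins=log_edges)
pmf_obs = counts / counts.sum()

# smoothed PMF: Gaussian kernels in log space with an assumed bandwidth
bandwidth = 0.1  # assumption for illustration only
z = (log_x[:, None] - np.log(sample)[None, :]) / bandwidth
pmf_kde = np.exp(-0.5 * z**2).sum(axis=1)
pmf_kde /= pmf_kde.sum()

# count empty bins between the lowest and highest occupied bins
occupied = np.flatnonzero(pmf_obs)
inside = slice(occupied[0], occupied[-1] + 1)
n_gaps = int((pmf_obs[inside] == 0).sum())
print(f"{n_gaps} empty bins lie between the lowest and highest observed bins")
print(f"smallest smoothed probability in that range: {pmf_kde[inside].min():.2e}")
```

The empirical PMF leaves many interior bins empty, while the smoothed version assigns every bin in the observed range a small positive probability.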

In [67]:
# set global bounding values of UAR from the 
# max_streamflow = ds['discharge'].max().values.item()
# max_streamflow = # it's actually 19400 in the dataset
global_min = np.log(5e-6) # L/s/km^2
global_max = np.log(1e4) # L/s/km^2
log_edges = np.linspace(global_min, global_max, 2**12 + 1)
log_x = 0.5 * (log_edges[:-1] + log_edges[1:])
lin_x = np.exp(log_x)
log_w = np.diff(log_edges)
base_kde_estimator = KDEEstimator(log_edges)

Bin edges are +/- 0.262% from the bin midpoints.


In [68]:
class ReferenceDistribution:
    def __init__(self, **kwargs):

        for k, v in kwargs.items():
            setattr(self, k, v)

    def _initialize_station(self, stn):
        self.stn = stn
        self.df = retrieve_timeseries_discharge(stn)
        self.da = self.da_dict[stn]
        self.n_observations = len(self.df.dropna())

In [69]:
def compute_empirical_pmf(log_edges, data):
    """Compute the empirical PMF for a given station."""
    log_runoff = np.log(data)
    pdf, _ = np.histogram(log_runoff, bins=log_edges, density=True)
    # Convert to PMF
    w = np.diff(log_edges)
    pmf = pdf * w
    pmf /= np.sum(pmf)
    assert np.isclose(pmf.sum(), 1.0), f'Empirical PMF does not sum to 1, sum={pmf.sum():.4f}'
    return pmf, pdf

In [75]:

# from kde_estimator import KDEEstimator
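output_folder = Path(os.getcwd()) / 'data' / 'results' / 'baseline_distributions'
# make sure the output folder exists before any files are written
output_folder.mkdir(parents=True, exist_ok=True)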

shared_config = {
    'da_dict': da_dict,
    'complete_year_dict': complete_year_dict,
    'kde_obj': base_kde_estimator,
    }

distribution_dict = {}

output_obs_fname_pmf = output_folder / f'pmf_obs.csv'
output_obs_fname_pdf = output_folder / f'pdf_obs.csv'
output_kde_fname_pmf = output_folder / f'pmf_kde.csv'
output_kde_fname_pdf = output_folder / f'pdf_kde.csv'
output_paths = [output_obs_fname_pmf, output_obs_fname_pdf, output_kde_fname_pmf, output_kde_fname_pdf]
if np.all([os.path.exists(f) for f in output_paths]):
    print(f'Files already exist, skipping computation.')
    for f in output_paths:
        k = f.stem  # file name without extension, e.g. 'pmf_obs' or 'pdf_kde'
        distribution_dict[k] = pd.read_csv(f)
else:
    # compute the PDF and PMF for each station    
    baseline_distribution = ReferenceDistribution(**shared_config)
    kde_results, obs_results = [], []
    log_x = base_kde_estimator.log_x
    log_w = base_kde_estimator.log_w
    log_edges = base_kde_estimator.log_edges
    for i, stn in enumerate(bcub_df['official_id'].values):
        t_kde0 = time()
        baseline_distribution._initialize_station(stn)
        data = baseline_distribution.df['replaced_zero_flow_uar'].dropna().values
        # compute KDE-based distribution
        kde_pmf, kde_pdf = base_kde_estimator.compute(data, shared_config['da_dict'][stn])
        kde_results.append((stn, kde_pmf, kde_pdf))
        assert np.isclose(kde_pmf.sum(), 1.0), f'KDE PMF for {stn} does not sum to 1, sum={kde_pmf.sum():.4f}'
        assert np.isclose(np.trapezoid(kde_pdf, x=log_x), 1.0), f'KDE PDF for {stn} does not integrate to 1, integral={np.trapezoid(kde_pdf, x=log_x):.4f}'
        t_kde1 = time()
        t_emp0 = time()
        # compute the strictly empirical distribution based on the "global" log grid
        obs_pmf, obs_pdf = compute_empirical_pmf(log_edges, data)
        assert np.isclose(obs_pmf.sum(), 1.0, atol=1e-4), f'Empirical PMF for {stn} does not sum to 1, sum={obs_pmf.sum():.4f}'
        assert np.isclose(np.trapezoid(obs_pdf, x=log_x), 1.0, atol=1e-4), f'Empirical PDF for {stn} does not integrate to 1, integral={np.trapezoid(obs_pdf, x=log_x):.4f}'
        t_emp1 = time()
        obs_results.append((stn, obs_pmf, obs_pdf))
        if len(kde_results) % 100 == 0:
            print(f'Processed {len(kde_results)}/{len(validated_stations)} stations...')

    # concatenate the results
    for label, results in zip(['kde', 'obs'], [kde_results, obs_results]):
        stations, pmfs, pdfs = zip(*results)
        pdf_df = pd.DataFrame(np.stack(pdfs, axis=1), columns=stations)
        pmf_df = pd.DataFrame(np.stack(pmfs, axis=1), columns=stations)

        pdf_df['log_x'] = log_x
        pdf_df['lin_x'] = np.exp(log_x)
        pdf_df['left_log_edges'] = log_edges[:-1]
        pdf_df['right_log_edges'] = log_edges[1:]
        pmf_df['log_x'] = log_x
        pmf_df['lin_x'] = np.exp(log_x)
        pmf_df['left_log_edges'] = log_edges[:-1]
        pmf_df['right_log_edges'] = log_edges[1:]  

        # save the pdf and pmf files
        # pdf_df.set_index('lin_x', inplace=True)
        # pmf_df.set_index('lin_x', inplace=True)
        pmf_sum = pmf_df[list(stations)].copy().sum()
        min_sum, max_sum = np.min(pmf_sum), np.max(pmf_sum)
        assert np.all(np.isclose(pmf_sum, 1.0, atol=1e-4)), f'PMFs do not sum to 1, min={min_sum:.4f}, max={max_sum:.4f}'
        pdf_df.to_csv(output_folder / f'pdf_{label}.csv', index=False)
        pmf_df.to_csv(output_folder / f'pmf_{label}.csv', index=False)
        


Files already exist, skipping computation.


In [79]:
def downsample_distribution(dist_df, stns, dist_type, end_bits=12):
    """Downsample the distribution to the given bitrate end_bits."""
    quant_cols = ['log_x', 'lin_x', 'left_log_edges', 'right_log_edges']
    start_bits = int(np.log2(len(dist_df)))
    if start_bits == end_bits:
        for s in stns:
            assert s in dist_df.columns, f'Station {s} not in distribution dataframe'
        df = dist_df[stns+quant_cols].copy()
        new_log_x = df['log_x'].values
        new_log_edges = sorted(list(set(df['left_log_edges'].values) | set(df['right_log_edges'].values)))
        new_log_w = np.diff(new_log_edges)        
        return df[stns], new_log_x, new_log_w, new_log_edges

    assert start_bits > end_bits, f'Cannot downsample from {start_bits} bits to {end_bits} bits'
    assert end_bits >= 1, 'End bits must be at least 1'
        
    factor = 2 ** (start_bits - end_bits)
    # get the sum of the pmfs based on the new quantization
    dists = dist_df[stns].values.reshape(-1, factor, len(stns)).sum(axis=1)
    # get the new left log edges from the FIRST VALUE in each group
    left_log_edges = dist_df['left_log_edges'].values.reshape(-1, factor)[:, 0]
    # get the new right log edges from the LAST VALUE in each group
    right_log_edges = dist_df['right_log_edges'].values.reshape(-1, factor)[:, -1]
    new_log_edges = sorted(list(set(left_log_edges) | set(right_log_edges)))
    new_log_x = 0.5 * (np.array(new_log_edges[:-1]) + np.array(new_log_edges[1:]))
    new_log_w = np.diff(new_log_edges)
    # re-normalize the distributions
    if dist_type == 'pdf':
        new_area = np.trapezoid(dists, x=new_log_x, axis=0)
        new_min, new_max = new_area.min(), new_area.max()
        dists /= new_area
        new_area = np.trapezoid(dists, x=new_log_x, axis=0)
        assert all(np.isclose(new_area, 1.0, atol=1e-5)), f'Downsampled PDFs do not integrate to 1, min={new_min:.4f}, max={new_max:.4f}'
    else:
        dists /= dists.sum(axis=0)
        sums = dists.sum(axis=0)
        assert np.all(np.isclose(sums, 1.0)), f'Downsampled PMFs do not sum to 1, min={sums.min():.4f}, max={sums.max():.4f}'
    df = pd.DataFrame(dists, columns=stns)
    return df, new_log_x, new_log_w, new_log_edges
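
The reshape-and-sum step in `downsample_distribution` can be illustrated on a single toy PMF. The sketch below merges adjacent bins of a 3-bit PMF into a 2-bit PMF; the array `pmf_8` and the edge values are made-up numbers for illustration, not values from the study.

```python
import numpy as np

# Toy 3-bit PMF (8 bins) on a uniform log grid; values are illustrative only
pmf_8 = np.array([0.05, 0.10, 0.20, 0.25, 0.15, 0.10, 0.10, 0.05])
log_edges_8 = np.linspace(-2.0, 6.0, 9)  # 9 edges bound 8 bins

start_bits = int(np.log2(len(pmf_8)))
end_bits = 2
factor = 2 ** (start_bits - end_bits)  # number of adjacent bins merged into one

# Sum the probability mass within each group of `factor` adjacent bins
pmf_4 = pmf_8.reshape(-1, factor).sum(axis=1)

# Every `factor`-th edge bounds a merged bin; centres are the edge midpoints
log_edges_4 = log_edges_8[::factor]
log_x_4 = 0.5 * (log_edges_4[:-1] + log_edges_4[1:])

print(pmf_4)
print(log_edges_4)
print(log_x_4)
```

Because merging only adds probability mass, the downsampled PMF still sums to 1 before any re-normalization; the re-normalization in the function guards against accumulated floating point error.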


In [80]:
# resample the station PMFs to each bit depth and save them for the entropy calculations below

bits = list(range(2, 13)) # set a range that is both too low and too high for the data
entropy_output_folder = Path(os.getcwd()) / 'data' / 'results' / 'entropy_results'
if not entropy_output_folder.exists():
    entropy_output_folder.mkdir(parents=True, exist_ok=True)

# eps = 1e-22 # set a small epsilon to avoid numerical issues
kde_pmf_df = pd.DataFrame(distribution_dict['pmf_obs']).reset_index(drop=False)
for b in bits:
    # resample the PMF by q_values to the number of states
    resampled_df, new_log_x, new_log_w, new_log_edges = downsample_distribution(
        kde_pmf_df, validated_stations, 'pmf', end_bits=b
    )
    resampled_df['log_x'] = new_log_x
    resampled_df['lin_x'] = np.exp(new_log_x)
    resampled_df['left_log_edges'] = new_log_edges[:-1]
    resampled_df['right_log_edges'] = new_log_edges[1:]
    # save the resampled PMF
    resampled_df.to_csv(entropy_output_folder / f'kde_pmf_resampled_{b}bits.csv', index=False)


## Compare the Adaptive Bandwidth KDE PMFs vs. the FFTKDE PMFs


In [81]:

from utils.evaluation_metrics import EvaluationMetrics
from KDEpy import FFTKDE
eval_obj = EvaluationMetrics(base_kde_estimator.log_x, base_kde_estimator.log_w)

In [82]:
for l in ['kde', 'obs']:
    distribution_dict['pdf_' + l] = pd.read_csv(output_folder / f'pdf_{l}.csv', index_col='log_x')
    distribution_dict['pmf_' + l] = pd.read_csv(output_folder / f'pmf_{l}.csv', index_col='log_x')

In [83]:
all_results = {}
log_x = base_kde_estimator.log_x
log_w = base_kde_estimator.log_w
for stn in validated_stations:

    akde_pmf = distribution_dict['pmf_kde'][stn].values
    # assert the PMF sums to 1
    assert np.isclose(np.sum(akde_pmf), 1, atol=1e-4), f'PMF for {stn} does not sum to 1: {np.sum(akde_pmf):.4f}'

    baseline_distribution = ReferenceDistribution(**shared_config)
    baseline_distribution._initialize_station(stn)
    kde_obj = shared_config['kde_obj']
    data = baseline_distribution.df['replaced_zero_flow_uar'].dropna().values
    try:
        fft_kde_pdf = FFTKDE(bw="silverman").fit(np.log(data)).evaluate(log_x)
    # if FFTKDE raises a ValueError, report the station ID and skip the station
    except ValueError as e:
        print(f'Warning: FFTKDE failed for station {stn} with error: {e}')
        continue

    # convert the FFTKDE PDF to a PMF on the log grid
    fft_pmf = fft_kde_pdf * log_w
    fft_pmf /= np.sum(fft_pmf)
    assert np.isclose(np.sum(fft_pmf), 1, atol=1e-4), f'PMF for {stn} does not sum to 1: {np.sum(fft_pmf):.4f}'
    # compute measures between the adaptive KDE and the FFTKDE
    measures = eval_obj._evaluate_fdc_metrics_from_pmf(akde_pmf, fft_pmf)
    all_results[stn] = measures
    



In [86]:
# format to a DataFrame with the station ID as index
results_df = pd.DataFrame(all_results).T
results_df.index.name = 'station_id'
results_df.reset_index(inplace=True)
results_df.to_csv('data/results/kde_fft_comparison.csv', index=False)

In [87]:
# rank by KL divergence, largest first; the ten stations kept below are the highest-KLD stations
results_df.sort_values(by=['kld'], inplace=True, ascending=False)
bad_rmse_stns = results_df['station_id'].values[:10]
results_df.head(10)

|     | station_id | pct_vol_bias | mean_error | mean_abs_rel_error | rmse | nse | kge | ve | vb_pmf | vb_fdc | kld | emd | mean_frac_diff |
|-----|------------|--------------|------------|--------------------|------|-----|-----|----|--------|--------|-----|-----|----------------|
| 7   | 05AB022  | 0.017894  | 0.009737  | 0.32134  | 0.391285 | 0.889393 | 0.875018 | 0.924046 | -0.035555 | -0.017894 | 0.478748 | 0.099  | 0.266062 |
| 454 | 08NL037  | -0.051933 | -0.073157 | 0.210093 | 0.277263 | 0.98806  | 0.982268 | 0.942761 | 0.065768  | 0.051933  | 0.223518 | 0.1612 | 0.324935 |
| 379 | 08NA056  | 0.008412  | 0.008085  | 0.124055 | 0.190746 | 0.984797 | 0.977052 | 0.959454 | -0.128509 | -0.008412 | 0.170862 | 0.1797 | 0.084792 |
| 378 | 08NA052  | -0.015798 | -0.280211 | 0.031081 | 0.045706 | 0.99779  | 0.983306 | 0.973346 | 0.018706  | 0.015798  | 0.098907 | 0.5795 | 0.029888 |
| 680 | 12111500 | -0.135738 | -3.271116 | 0.107856 | 0.14276  | 0.998438 | 0.990524 | 0.827649 | 0.16059   | 0.135738  | 0.09777  | 4.9752 | 0.062484 |
| 647 | 12091060 | -0.168498 | -1.461314 | 0.056886 | 0.108637 | 0.999115 | 0.992682 | 0.815214 | 0.224845  | 0.168498  | 0.063916 | 2.3585 | 0.126885 |
| 346 | 08MF048  | -0.033046 | -0.787682 | 0.073199 | 0.147401 | 0.997166 | 0.990462 | 0.958724 | 0.044119  | 0.033046  | 0.062621 | 1.3053 | 0.061809 |
| 117 | 08CC002  | -0.05694  | -2.887706 | 0.064164 | 0.081075 | 0.998227 | 0.983404 | 0.911539 | 0.073638  | 0.05694   | 0.061913 | 5.5046 | 0.055688 |
| 127 | 08DA010  | -0.024279 | -1.373964 | 0.029151 | 0.042473 | 0.998747 | 0.983119 | 0.963007 | 0.028364  | 0.024279  | 0.057677 | 2.3751 | 0.032478 |
| 104 | 08AA008  | -0.01637  | -0.073195 | 0.038677 | 0.106179 | 0.998753 | 0.971455 | 0.979283 | 0.018569  | 0.01637   | 0.057658 | 0.1174 | 0.044364 |
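
The comparison above converts a density evaluated on the log grid to a PMF and then scores the two PMFs against each other. The sketch below illustrates that conversion and a standard Kullback-Leibler divergence in bits. It is not the `EvaluationMetrics` implementation: the helper names `pdf_to_pmf` and `kl_divergence_bits`, the direction $D(p\,\|\,q)$, the base-2 logarithm, and the two Gaussian test densities are all assumptions made for illustration.

```python
import numpy as np

def pdf_to_pmf(pdf, bin_widths):
    """Convert a density evaluated at bin centres to a PMF over the bins."""
    pmf = pdf * bin_widths
    return pmf / pmf.sum()

def kl_divergence_bits(p, q):
    """KL divergence D(p || q) in bits; assumes q > 0 wherever p > 0."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log2(p[mask] / q[mask])))

def normal_pdf(x, mu, sigma):
    """Gaussian density, used here only to build two test distributions."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Hypothetical uniform log grid (64 bins)
log_edges = np.linspace(-4.0, 4.0, 65)
log_x = 0.5 * (log_edges[:-1] + log_edges[1:])
log_w = np.diff(log_edges)

p = pdf_to_pmf(normal_pdf(log_x, 0.0, 1.0), log_w)
q = pdf_to_pmf(normal_pdf(log_x, 0.5, 1.2), log_w)

print(f'D(p||q) = {kl_divergence_bits(p, q):.3f} bits')
print(f'D(p||p) = {kl_divergence_bits(p, p):.3f} bits')
```

The divergence is zero only when the two PMFs agree on every bin, which is why it is a convenient key for ranking the stations where the two KDE methods disagree most.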


## Baseline Reliability (Flow Duration) Curve 

As a baseline for the FDC prediction models tested in this study, we could use the maximum-uncertainty (uniform) distribution, but such a low bar would provide little insight.  Instead, we compute the mean PDF across all stations in the study region and use it as the baseline FDC against which the curves from the different models are compared.  As an upper bound, we compute the Bayes posterior log-normal fit for each station.

### Compute the mean PDF across all stations

Given a state space $\Omega$ with $M$ discrete states and $N$ spatially distributed sensors (streamflow monitoring stations) with PMFs $P=\{p_j\}_{j=1}^N$, we define the mean PMF across all sensors, and the corresponding density (PDF), in terms of the observed states $\omega$ as follows:

$$\Omega=\{\omega_i\}_{i=1}^M,\quad P\in\mathbb{R}^{M\times N},\quad P_{i j}=p_j(\omega_i),\ \sum_{i=1}^M P_{i j}=1.$$

$$P_{i j}=\Pr_j(B_i),\qquad \bar p_i =\frac{1}{N}\sum_{j=1}^N P_{i j}\quad(\text{mean PMF over sensors}).$$

where $B_i$ denotes the $i$-th quantization bin (interval) in log-unit-area-runoff (L-UAR), i.e., $B_i = [y_i, y_{i+1})$ with $y_i = \log(x_i)$. Each $P_{ij}$ is the probability assigned by sensor $j$ to bin $B_i$.

$$\text{Density per log-}x\ (\text{piecewise constant on }B_i):\quad h_i=\frac{\bar p_i}{\Delta y_i},\qquad \sum_i h_i\,\Delta y_i=1$$
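
As a minimal numerical sketch of the definitions above, the matrix `P` below holds one randomly generated PMF per column, and the mean PMF $\bar p_i$ and density $h_i$ follow directly. The sizes `M` and `N`, the random PMFs, and the bin edges `y_edges` are placeholders chosen for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
M, N = 8, 5  # M bins, N stations (toy sizes)

# Columns of P are station PMFs, so each column sums to 1
P = rng.random((M, N))
P /= P.sum(axis=0, keepdims=True)

# Bin edges y_i in log unit area runoff and bin widths (illustrative values)
y_edges = np.linspace(-2.0, 6.0, M + 1)
dy = np.diff(y_edges)

p_bar = P.mean(axis=1)  # mean PMF over stations
h = p_bar / dy          # piecewise-constant density per log-x

print(p_bar.sum())      # total probability of the mean PMF
print(np.sum(h * dy))   # integral of the density over all bins
```

Averaging column-normalized PMFs preserves normalization, so no re-scaling of $\bar p$ is needed when every station contributes a full PMF.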


### Entropy of the mean PDF and fraction of $b$-bit quantization capacity

$$M = 2^{b},\qquad \bar{\mathbf p} = (\bar p_1,\ldots,\bar p_M)^T,\quad \bar p_i \ge 0,\ \sum_{i=1}^M \bar p_i = 1$$

$$\text{Shannon entropy (bits):}\quad H_2(\bar{\mathbf p}) \;=\; -\sum_{i:\,\bar p_i>0} \bar p_i \log_2 \bar p_i \;\le\; b.
$$

$$\text{Normalized entropy (fraction of capacity):}\quad \rho_b \;=\; \frac{H_2(\bar{\mathbf p})}{b} \in [0,1].$$

$$\text{Perplexity (effective bins):}\quad \mathcal P(\bar{\mathbf p}) \;=\; 2^{H_2(\bar{\mathbf p})},\qquad
\text{occupancy fraction } \phi_b \;=\; \frac{\mathcal P}{2^b} \;=\; 2^{H_2(\bar{\mathbf p})-b}.
$$
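
The quantities defined above can be checked on two small PMFs where the answers are known by hand. This is a sketch only: the helper `entropy_summary` is a hypothetical name, and it assumes the PMF length is an exact power of two.

```python
import numpy as np

def entropy_summary(p_bar):
    """Return (H in bits, normalized entropy, perplexity, occupancy fraction)."""
    p_bar = np.asarray(p_bar, dtype=float)
    b = np.log2(len(p_bar))      # bit depth, assuming len(p_bar) == 2**b
    mask = p_bar > 0             # terms with zero probability contribute 0
    H = -np.sum(p_bar[mask] * np.log2(p_bar[mask]))
    rho = H / b                  # fraction of the b-bit capacity
    perplexity = 2.0 ** H        # effective number of occupied bins
    phi = perplexity / len(p_bar)
    return H, rho, perplexity, phi

# Uniform 4-bit PMF: every bin equally likely, so all capacity is used
uniform = np.full(16, 1 / 16)
# Peaked 4-bit PMF: mass on 4 of 16 bins, H = 0.5*1 + 0.25*2 + 2*0.125*3 = 1.75 bits
peaked = np.array([0.5, 0.25, 0.125, 0.125] + [0.0] * 12)

for name, pmf in [('uniform', uniform), ('peaked', peaked)]:
    H, rho, perplexity, phi = entropy_summary(pmf)
    print(f'{name}: H={H:.2f}, rho={rho:.2f}, perplexity={perplexity:.2f}, phi={phi:.2f}')
```

For example, the 12-bit mean PMF reported in the cell output below has $H_2 = 10.36$ bits, so $\rho_{12} \approx 10.36/12 \approx 0.86$ and $\phi_{12} \approx 1311.68/4096 \approx 0.32$.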

In [89]:
from bokeh.palettes import RdYlGn

pdf_fig = figure(title=f"Mean PDF across {len(results_df)} Stations", width=750, height=400, x_axis_type='log')
entropy_distributions = {}
quant_cols = ['log_x', 'lin_x', 'left_log_edges', 'right_log_edges']
states = [2**b for b in bits]
stations = results_df['station_id']
for b in bits[2:]:
    pmf_path = entropy_output_folder / f'kde_pmf_resampled_{b}bits.csv'
    pmf_resampled = pd.read_csv(pmf_path)
    pmf = pmf_resampled.copy()[stations]
    pmf = pmf.mean(axis=1)  # mean PMF across stations

    # Compute bin edges and widths from linear bin centers
    # centers = pmf_resampled.index.astype(float).values
    # edges = logspace_edges_from_linear_centers(centers)
    # dx = np.diff(edges)
    left_edges = pmf_resampled['left_log_edges'].values
    right_edges = pmf_resampled['right_log_edges'].values
    log_edges = np.concatenate([left_edges, [right_edges[-1]]])
    log_w = np.diff(log_edges)

    # Convert PMF to PDF and normalize
    pdf = pmf / log_w
    pdf /= np.trapezoid(pdf, x=pmf_resampled['log_x'].values)  # ensure integral = 1

    mean_dist_dict = {'pmf': pmf, 'pdf': pdf, 'log_x': pmf_resampled['log_x'], 'lin_x': pmf_resampled['lin_x'],
                     'left_log_edges': pmf_resampled['left_log_edges'], 'right_log_edges': pmf_resampled['right_log_edges']}
    mean_df = pd.DataFrame(mean_dist_dict)
    mean_df.to_csv(BASE_DIR / 'data' / 'results' / f'mean_distribution_{b}bits.csv')

    # Entropy of PMF (still valid for info content calc)
    mask = pmf > 0
    entropy = -np.sum(pmf[mask] * np.log2(pmf[mask]))
    ratio = entropy / b
    perplexity = 2 ** entropy

    # Plot PDF
    pdf_fig.quad(
        top=pdf, bottom=0,
        left=np.exp(log_edges[:-1]), right=np.exp(log_edges[1:]),
        fill_color=RdYlGn[11][states.index(int(2**b)) % len(RdYlGn[11])],
        line_color=None, alpha=0.7,
        legend_label=f'{b:.0f}b (H={entropy:.1f}, ρ={100*ratio:.0f}%, perplexity={perplexity:.1f})'
    )
    # count the number of zero bins
    n_zeros = np.sum(pmf == 0)
    print(f'{b} bits: Entropy={entropy:.2f}, Ratio={ratio:.2f}, Perplexity={perplexity:.2f}, Zero bins={n_zeros}/{len(pmf)}')

# Final formatting
pdf_fig.legend.click_policy = 'hide'
pdf_fig.legend.location = 'top_left'
pdf_fig.xaxis.axis_label = "Unit Area Runoff (L/s/km²)"
pdf_fig.yaxis.axis_label = "Probability Density"
pdf_fig.legend.background_fill_alpha = 0.3
pdf_fig = dpf.format_fig_fonts(pdf_fig, font_size=14)

show(pdf_fig)

# export the figure to an html file
from bokeh.resources import CDN
from bokeh.embed import file_html
html = file_html(pdf_fig, CDN, f'Mean PDF across {len(results_df)} Stations')
with open('data/results/mean_pdf_across_stations.html', 'w') as f:
    f.write(html)

4 bits: Entropy=2.44, Ratio=0.61, Perplexity=5.44, Zero bins=0/16
5 bits: Entropy=3.41, Ratio=0.68, Perplexity=10.61, Zero bins=0/32
6 bits: Entropy=4.40, Ratio=0.73, Perplexity=21.08, Zero bins=0/64
7 bits: Entropy=5.39, Ratio=0.77, Perplexity=42.02, Zero bins=9/128
8 bits: Entropy=6.39, Ratio=0.80, Perplexity=83.78, Zero bins=33/256
9 bits: Entropy=7.39, Ratio=0.82, Perplexity=167.19, Zero bins=92/512
10 bits: Entropy=8.38, Ratio=0.84, Perplexity=332.78, Zero bins=232/1024
11 bits: Entropy=9.37, Ratio=0.85, Perplexity=661.43, Zero bins=552/2048
12 bits: Entropy=10.36, Ratio=0.86, Perplexity=1311.68, Zero bins=1278/4096


## Catchment Attributes 


Here we view the distribution of each catchment attribute across the sample.  Catchment attributes are the primary information source for the first experiment (prediction of log-normal distribution parameters), and they are used as conditioning variables for the second and third experiments (k-nearest neighbours and LSTM daily unit area runoff estimation).  The attributes are derived from the four geospatial data sources summarized in the table of input data sources in the Introduction.

In [None]:
rev_date = '20250227'
attribute_file = f'BCUB_watershed_attributes_updated_{rev_date}.csv'
attribute_fpath = os.path.join('data', attribute_file)
df = pd.read_csv(attribute_fpath, dtype={'official_id': str})
df = df[[c for c in df.columns if 'unnamed:' not in c.lower()]]
# exclude = ['15039900','15031000']
df.columns = [c.lower() for c in df.columns]
# assert '12414900' in df['official_id'].values
df.sort_values('official_id', inplace=True)
df.reset_index(drop=True, inplace=True)

df['n_complete_years'] = df['official_id'].apply(lambda x: complete_year_dict.get(x, {}).get('n_complete_years', np.nan))

In [None]:
climate_attributes = ['tmean', 'prcp', 'vp', 'swe', 'srad', 'low_prcp_duration', 'low_prcp_freq', 'high_prcp_duration', 'high_prcp_freq']
terrain_attributes = ['slope_deg', 'aspect_deg', 'elevation_m', 'log_drainage_area_km2']
soil_attributes = ['porosity_x100', 'logk_ice_x100']
land_cover_attributes = ['land_use_forest', 'land_use_shrubs', 'land_use_grass', 'land_use_wetland', 'land_use_crops', 
                       'land_use_urban', 'land_use_water', 'land_use_snow_ice']

if 'tmean' not in df.columns:
    # compute the mean temperature for each catchment
    df['tmean'] = (df['tmax'] + df['tmin']) / 2
if 'log_drainage_area_km2' not in df.columns:
    df['log_drainage_area_km2'] = np.log(df['drainage_area_km2'] + 1)

# save the dataframe with attributes
df.to_csv(attribute_fpath, index=False)




In [None]:
from pathlib import Path
from bokeh.plotting import figure, show, gridplot
from bokeh.io import output_notebook
import numpy as np
output_notebook()


def compute_empirical_cdf(values):
    """Compute the empirical cumulative distribution function (CDF) of the given values."""
    sorted_values = np.sort(values)
    cdf = np.arange(1, len(sorted_values) + 1) / len(sorted_values)
    return sorted_values, cdf


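def plot_cdf(values, label=None):
    """Plot the empirical CDF of `values`; `label` names the legend entry and the x-axis."""
    fig = figure(width=700, height=400)
    x, y = compute_empirical_cdf(values)
    if label is None:
        # no label given: draw the curve without a legend entry
        fig.line(x, y, line_width=2)
    else:
        fig.line(x, y, legend_label=label, line_width=2)
        # land-use panels place the legend at the bottom right
        fig.legend.location = 'bottom_right' if label.startswith('land_use') else 'top_left'
        fig.legend.background_fill_alpha = 0.6
        fig.xaxis.axis_label = label
    fig.yaxis.axis_label = 'Cumulative Probability'
    fig = dpf.format_fig_fonts(fig, font_size=14)
    return fig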


### Terrain attributes

```{figure} images/terrain_attributes.png
---
alt: Terrain attributes are shown for an example catchment.
name: terrain-attributes
width: 700px
align: center
---
Terrain attributes are shown for an example catchment.  Images are from a video presentation on Streamflow Monitoring Network Optimization prepared for the 2024 Canadian Water Resources Association (CWRA) Annual Conference.  The video is available at [https://vimeo.com/1094107902](https://vimeo.com/1094107902).
```

In [None]:
figs = []
for c in terrain_attributes:
    values = df[c].values
    print(f'{c} - {np.mean(values):.2f} [{np.min(values):.2f}, {np.max(values):.2f}]')
    cdf_fig = plot_cdf(values, label=c)
    cdf_fig = dpf.format_fig_fonts(cdf_fig, font_size=14)
    figs.append(cdf_fig)

lt = gridplot(figs, ncols=4, width=300, height=300)
show(lt)

### Climate attributes

```{figure} images/climate_attributes.png
---
alt: Climate attributes are shown for an example catchment.
name: climate-attributes
width: 700px
align: center
---
Climate attributes are shown for an example catchment.
```


In [None]:
figs = []
for c in climate_attributes:
    values = df[c].values
    cdf_fig = plot_cdf(values, label=c)
    figs.append(cdf_fig)

lt = gridplot(figs, ncols=4, width=300, height=300)
show(lt)

The plots above reveal a problem with the precision of the derived precipitation frequency attributes.  In particular, the high precipitation frequency and duration attributes take very few unique values.  As a result, they are not expected to contain much information or to be useful for predictive modelling.
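
One way to quantify this kind of precision problem is to count the distinct values each attribute takes. The sketch below does so on a synthetic table; `toy_df` and its values are invented for illustration and are not the study attributes, although the column names mirror two of the climate attributes.

```python
import numpy as np
import pandas as pd

# Hypothetical attribute table: one continuous and one coarsely quantized column
rng = np.random.default_rng(1)
toy_df = pd.DataFrame({
    'prcp': rng.gamma(shape=2.0, scale=500.0, size=200),
    'high_prcp_freq': rng.choice([0.0, 0.05, 0.10], size=200),
})

# Number of distinct values per attribute and their share of the sample size
n_unique = toy_df.nunique()
summary = pd.DataFrame({
    'n_unique': n_unique,
    'unique_frac': n_unique / len(toy_df),
})
print(summary)
```

An attribute whose number of unique values is small relative to the sample size can distinguish only a few groups of catchments, which limits what a model can learn from it.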

### Soil attributes

```{figure} images/soil_attributes.png
---
alt: Soil attributes are shown for an example catchment.
name: soil-attributes
width: 600px
align: center
---
Soil attributes are shown for an example catchment.
```

In [None]:
figs = []
for c in soil_attributes:
    values = df[c].values
    cdf_fig = plot_cdf(values, label=c)
    figs.append(cdf_fig)

lt = gridplot(figs, ncols=4, width=300, height=300)
show(lt)

### Land cover attributes


```{figure} images/land_cover_attributes.png
---
alt: Land cover classifications are shown for an example catchment.
name: land-cover-classifications
width: 700px
align: center
---

Land cover classifications are shown for an example catchment.
```

In [None]:
figs = []
for c in land_cover_attributes:
    values = df[f'{c}_frac_2010'].values
    cdf_fig = plot_cdf(values, label=c)
    figs.append(cdf_fig)

lt = gridplot(figs, ncols=4, width=300, height=300)
show(lt)

## Citations

```{bibliography}
:filter: docname in docnames
```