# Data

In this notebook we:

1.  Describe the information derived from other data sources.
2.  Review catchments for minimum period of record, and for regulation or other data quality issues related to human influence on the runoff regime.
3.  Compute the empirical (reference) distributions for all catchments meeting minimum data requirements.
4.  Compute the global mean PDF / PMF across all catchments.
5.  View the distribution of catchment attributes (used in the first method, parametric prediction of FDCs).

## Introduction

```{figure} ../images/figure_1_study_region.png
---
alt: Study region polygons and HYSETS monitored catchment polygons.
name: study-region-fig
width: 700px
align: center
---
Study region polygons and WSC + USGS active (green triangles) and historical (yellow triangles) streamflow monitoring stations.  The purple dots represent ungauged catchments characterized in the BCUB dataset {cite}`kovacek2025bcub`, but they are not used in this study.  
```

The streamflow data used in this study come from *The Hydrometeorological Sandbox École de Technologie Supérieure* (HYSETS) {cite}`arsenault2020comprehensive`.  The HYSETS data, including streamflow time series and attributes for 14,425 catchments, can be accessed at [https://osf.io/rpc3w/](https://osf.io/rpc3w/) (as of 2025-07-04, the streamflow time series and attribute filename is `HYSETS_2023_update_QC_stations.nc`).  We use a subset of approximately 1620 catchments contained in major basins covering and bounding British Columbia, as shown in {numref}`Figure {number} <study-region-fig>`.



### Catchment attributes

Catchment attributes are used in all three models.  They are derived from four geospatial data sources, listed below alongside the daily streamflow source:

**Table: Summary of input data sources used to characterize attributes of monitored catchments**

| **Data Type**           | **Source Name**                                                      | **Reference**                            |
|-------------------------|----------------------------------------------------------------------|------------------------------------------|
| Daily streamflow        | Large sample hydrology dataset for N. America and Mexico (HYSETS)   | {cite}`arsenault2020comprehensive` |
| Terrain                 | USGS 1 arc-second Digital Elevation Data (3DEP)                      | {cite}`3dep`                                   |
| Land cover              | North American Land Change Monitoring System (NALCMS)               | {cite}`latifovic2010nalcms` |
| Soil properties         | Global hydrogeological dataset (GLHYMPS)                            | {cite}`gleeson2014glimpse` |
| Meteorological forcings | Daily surface weather and climatological summaries ([Daymet](https://daymet.ornl.gov/))         | {cite}`thornton2022daymet` |


For details on the data processing pipeline for the catchment attributes, see *BCUB - A large sample ungauged basin attribute dataset for British Columbia, Canada* {cite}`kovacek2025bcub` ([https://doi.org/10.5194/essd-2023-508](https://doi.org/10.5194/essd-2023-508)).  Pre-processed catchment attributes are provided in the `data/` folder of this repository, and they can be used directly in the notebook.    



### Daily Meteorological Forcings

The catchment attributes related to meteorological forcings represent a single catchment-level index of each variable.  The LSTM neural network model, however, requires daily meteorological forcings for training on the catchments in the study region.  These are derived from the Daymet dataset {cite}`thornton2022daymet`, which provides daily meteorological data at 1 km resolution.  The forcings include:

* **Precipitation**: total daily precipitation in mm
* **Minimum daily temperature**: minimum daily temperature in degrees Celsius
* **Maximum daily temperature**: maximum daily temperature in degrees Celsius
* **Shortwave radiation**: average daily shortwave radiation in W/m²
* **Vapour pressure**: daily average vapour pressure in Pa
* **Snow water equivalent**: total daily snow water equivalent in mm

These must be processed to catchment-average daily time series in NetCDF format for each catchment, following the [NeuralHydrology documentation](https://neuralhydrology.readthedocs.io/en/latest/tutorials/add-dataset.html).  The daily time series have been processed for the sample of catchments in this study, and they can be accessed at [https://doi.org/10.5683/SP3/65FXAS](https://doi.org/10.5683/SP3/65FXAS).  The full replication code for processing the meteorological forcings from the Daymet dataset is provided at [https://github.com/dankovacek/process_metforcings](https://github.com/dankovacek/process_metforcings).



### Pre-processed data files

The following pre-processed files are included in the `data/` folder of the repository at [https://github.com/dankovacek/distribution_estimation](https://github.com/dankovacek/distribution_estimation):

```{note}
Before proceeding with the computations in the notebook, the streamflow time series and (optionally) catchment boundaries from the HYSETS dataset must be downloaded from the [HYSETS open data repository](https://osf.io/rpc3w/).  Some data are provided in the `data/` folder as part of this repository.  Data pre-processing can be skipped by downloading the input data files from (add dataset repository link)
```

### Additional data from external sources

Download the following files and update the file paths below to your local file system:

**FDC estimation by log-normal distribution parameter prediction**:
* catchment attributes: `data/BCUB_watershed_attributes_updated_20250227.csv`
* streamflow summary statistics (see Notebook 3): `data/catchment_attributes_with_runoff_stats.csv`

**FDC estimation by k-nearest neighbours**:
* catchment attributes as above
* daily streamflow timeseries (as published in HYSETS): `data/HYSETS_2023_update_QC_stations.nc`.  Must be downloaded from the HYSETS open data repository at [https://osf.io/rpc3w/](https://osf.io/rpc3w/).

**FDC estimation by recurrent neural network model (LSTM)**:
* catchment attributes as above are used as conditioning variables
* The LSTM FDC estimation uses the [NeuralHydrology](https://neuralhydrology.readthedocs.io/en/latest/) Python library.  The LSTM model takes daily meteorological time series for the HYSETS stations in the study region as input.  Computing catchment-average daily time series is computationally intensive, so pre-processed time series are provided for six meteorological variables (precipitation, minimum and maximum daily temperature, shortwave radiation, vapour pressure, and snow water equivalent).
* Pre-processed daily meteorological forcings are provided at [https://doi.org/10.5683/SP3/65FXAS](https://doi.org/10.5683/SP3/65FXAS) and should be downloaded to replicate the LSTM modelling component.  



## View the data

In [1]:
import os
from time import time
from pathlib import Path
import pandas as pd
import numpy as np
import geopandas as gpd

from utils.kde_estimator import KDEEstimator
from utils.evaluation_metrics import EvaluationMetrics
from utils import data_processing_functions as dpf

In [2]:
# update this to the path where you stored `HYSETS_2023_update_QC_stations.nc`
BASE_DIR = Path(os.getcwd())
HYSETS_DIR = Path('/home/danbot/code/common_data/HYSETS')

# import the HYSETS attributes data
hysets_df = pd.read_csv(HYSETS_DIR / 'HYSETS_watershed_properties.txt', sep=';')
# lookup tables keyed by Official_ID: drainage area (km²) and HYSETS Watershed_ID
da_dict = dict(zip(hysets_df['Official_ID'], hysets_df['Drainage_Area_km2']))
official_id_dict = dict(zip(hysets_df['Official_ID'], hysets_df['Watershed_ID']))

In [3]:
camels_df = pd.read_csv('data/camels/camels_hydro.txt', sep=';')
camels_df['gauge_id'] = camels_df['gauge_id'].astype(str)
camels_df.head()


Unnamed: 0,gauge_id,q_mean,runoff_ratio,slope_fdc,baseflow_index,stream_elas,q5,q95,high_q_freq,high_q_dur,low_q_freq,low_q_dur,zero_q_freq,hfd_mean
0,1013500,1.699155,0.543437,1.528219,0.585226,1.845324,0.241106,6.373021,6.1,8.714286,41.35,20.170732,0.0,207.25
1,1022500,2.173062,0.602269,1.77628,0.554478,1.702782,0.204734,7.123049,3.9,2.294118,65.15,17.144737,0.0,166.25
2,1030500,1.820108,0.555859,1.87111,0.508441,1.377505,0.107149,6.854887,12.25,7.205882,89.25,19.402174,0.0,184.9
3,1031500,2.030242,0.576289,1.494019,0.445091,1.648693,0.111345,8.010503,18.9,3.286957,94.8,14.697674,0.0,181.0
4,1047000,2.18287,0.656868,1.415939,0.473465,1.510238,0.196458,8.095148,14.95,2.577586,71.55,12.776786,0.0,184.8


### Import the study region stations

In [4]:
station_fpath = 'data/study_region_stations.geojson'
bcub_gdf = gpd.read_file(station_fpath)
bcub_gdf['watershedID'] = bcub_gdf['Official_ID'].apply(lambda x: official_id_dict.get(x, None))
# get the number of unique stations in the dataset
unique_stations = np.unique(bcub_gdf['Official_ID'])
print(f'{len(unique_stations)} unique monitored catchments in the dataset')
# what is the minimum drainage area of the BCUB stations?
min_da = bcub_gdf['Drainage_Area_km2'].min()
print(f'Minimum drainage area of the BCUB stations: {min_da:.3f} km²')

1618 unique monitored catchments in the dataset
Minimum drainage area of the BCUB stations: 1.010 km²


In [5]:
# visualize the locations (centroids) of the catchments
# convert to geodataframe
# convert coordinate reference system to 3857 for plotting
gdf = bcub_gdf.copy().to_crs(3857)
bbox = gdf.geometry.total_bounds

In [6]:
# visualize the catchment centroid locations
from bokeh.plotting import figure, show
from bokeh.io import output_notebook
from bokeh.palettes import Colorblind, Sunset10
output_notebook()

# range bounds supplied in web mercator coordinates
p = figure(x_axis_type="mercator", y_axis_type="mercator", width=700, height=400,
          x_range=(bbox[0], bbox[2]), y_range=(bbox[1], bbox[3]))
p.add_tile("CartoDB Positron", retina=True)
p.scatter(x=gdf.geometry.x, y=gdf.geometry.y, color='orange', size=4)

show(p)

### Import the catchment attributes

```{note}
Some stations are excluded from the analysis due to data quality issues.  These are listed in the `exclude_stations` list below.
```

In [7]:
def match_with_padding(oid):
    if oid in hysets_df['Official_ID'].values:
        return oid
    print(f'{oid} not found in HYSETS data, trying padded versions...')
    for pad in range(1, 4):
        padded = oid.zfill(len(oid) + pad)
        if padded in hysets_df['Official_ID'].values:
            print(f'    Found padded version: {padded}')
            return padded
    raise ValueError(f"Official ID {oid} not found in HYSETS data, even with padding.")

rev_date = '20250227'
attribute_file = f'BCUB_watershed_attributes_updated_{rev_date}.csv'
updated_attribute_file = 'catchment_attributes_with_runoff_stats.csv'
if not os.path.exists(os.path.join('data', updated_attribute_file)):
    print(f'Updated attribute file {updated_attribute_file} not found. Using {attribute_file} instead.')
    updated_attribute_path = os.path.join('data', attribute_file)
    process_statistics = True
else:
    updated_attribute_path = os.path.join(os.getcwd(), 'data', updated_attribute_file)
    process_statistics = False

attr_df = pd.read_csv(updated_attribute_path, dtype={'official_id': str})
attr_df['official_id'] = attr_df['official_id'].apply(lambda x: match_with_padding(x))
attr_df = attr_df[[c for c in attr_df.columns if 'unnamed:' not in c.lower()]]
attr_df.columns = [c.lower() for c in attr_df.columns]
attr_df.sort_values('official_id', inplace=True)
attr_df.reset_index(drop=True, inplace=True)
print(len(attr_df), 'catchments in the attribute file.')

# filter the bcub_gdf for stations in attr_df
# bcub_gdf = bcub_gdf[bcub_gdf['Official_ID'].isin(attr_df['official_id'].values)]
# print(f'{len(bcub_gdf)} catchments in the BCUB dataset after filtering for attributes.')

1017 catchments in the attribute file.
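
The ID matching step above tries zero-padded variants of each station ID, presumably because some IDs lost their leading zeros in an upstream file. The following is a minimal, self-contained sketch of that idea. The function name `match_with_padding_sketch`, the `max_pad` argument, and the example IDs are illustrative assumptions and are not part of the notebook's utilities.

```python
def match_with_padding_sketch(oid, known_ids, max_pad=3):
    """Return the ID in known_ids matching oid, trying up to max_pad leading zeros."""
    known = set(known_ids)  # set membership is faster than scanning a column each time
    for pad in range(max_pad + 1):
        # pad=0 tests the ID as given; larger values prepend zeros
        candidate = oid.zfill(len(oid) + pad)
        if candidate in known:
            return candidate
    raise ValueError(f"{oid} not found, even with up to {max_pad} leading zeros.")


# illustrative IDs only (hypothetical reference list)
known_ids = ["01013500", "08MA001", "12302055"]
matched = [match_with_padding_sketch(oid, known_ids) for oid in ["1013500", "08MA001"]]
print(matched)
```

Building the set once outside the function would avoid repeating that work when matching many IDs.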


## Streamflow data validation

Given the range of environmental conditions and the dynamic nature of rivers, streamflow monitoring is a challenging task.  It is common for stations to be damaged by high flows or affected by ice, erosion, or sediment deposition.  Streamflow monitoring stations require periodic maintenance, and gaps in records are common.  The figure below illustrates a critical issue underlying hydrological studies: the continuity of streamflow records.

```{figure} images/weekly_data_availability.png
---
alt: A visualization of weekly data availability for the streamflow monitoring stations in the study region shows many gaps in the records.
name: data-continuity-fig
width: 800px
align: center
---
Discontinuous and non-overlapping records are a problem underlying any hydrological analysis, and the problem is compounded in large-sample studies.
```

### Import streamflow timeseries from HYSETS 2023 update file

In [8]:
import xarray as xr
# Load dataset
streamflow = xr.open_dataset(HYSETS_DIR / 'HYSETS_2023_update_QC_stations.nc')

# Promote 'watershedID' to a coordinate on 'watershed'
streamflow = streamflow.assign_coords(watershedID=("watershed", streamflow["watershedID"].data))

# Set 'watershedID' as index
streamflow = streamflow.set_index(watershed="watershedID")

# Select only watershedIDs present in bcub_gdf
valid_ids = [int(wid) for wid in bcub_gdf['watershedID'].values if wid in streamflow.watershed.values]
ds = streamflow.sel(watershed=valid_ids)

In [9]:
def retrieve_and_preprocess_timeseries_discharge(stn, zero_flow_threshold=1e-3):
    """
    Retrieve and preprocess timeseries discharge data for a given station.
    The zero flow values reflect a threshold below which discharge 
    is considered indistinguishable from zero.  
    We want to flag the zero flows such that we can handle them separately.
    
    Parameters:
        stn (str): Official ID of the station.
        zero_flow_threshold (float): Threshold below which discharge is considered zero flow.

    Returns:
        tuple: (pd.DataFrame, bool) with the preprocessed discharge timeseries and a flag
            indicating whether any flow is at or below the threshold.  An empty DataFrame
            and False are returned if the station is not found.
    """
    watershed_id = official_id_dict[stn]
    da = da_dict[stn]

    try:
        df = ds['discharge'].sel(watershed=str(watershed_id)).to_dataframe(name='discharge').reset_index()
    except KeyError:
        print(f"Warning: Station {stn} not found in dataset under watershedID {watershed_id}.")
        # return a 2-tuple so callers can always unpack `df, flag`
        return pd.DataFrame(), False
    
    df = df.set_index('time')[['discharge']]
    df.dropna(inplace=True, subset=['discharge'])
    zero_flow_flag = (df['discharge'] <= zero_flow_threshold).any()
    # df['discharge'] = np.clip(df['discharge'], 1e-4, None)
    # df.rename(columns={'discharge': stn}, inplace=True)
    df[f'{stn}_mm'] = df['discharge'] * (24 * 3.6 / da)
    df[f'{stn}_uar'] = 1000 * df['discharge'] / da

    return df, zero_flow_flag
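
The two unit conversions in the function above can be checked with a short worked example. This is a sketch with made-up inputs. The helper name `discharge_to_runoff` is an assumption for illustration and is not defined in the notebook.

```python
def discharge_to_runoff(q_cms, area_km2):
    """Convert discharge [m^3/s] to runoff depth [mm/day] and unit area runoff [L/s/km^2]."""
    seconds_per_day = 24 * 3600
    # m^3/s * s/day -> m^3/day; / area in m^2 -> m/day; * 1000 -> mm/day
    mm_per_day = q_cms * seconds_per_day / (area_km2 * 1e6) * 1000
    # m^3/s * 1000 L/m^3 -> L/s; / area in km^2 -> L/s/km^2
    uar = 1000 * q_cms / area_km2
    return mm_per_day, uar


# hypothetical station: 10 m^3/s from a 100 km^2 catchment
mm, uar = discharge_to_runoff(q_cms=10.0, area_km2=100.0)
print(f"{mm:.2f} mm/day, {uar:.1f} L/s/km^2")
```

The factor `24 * 3.6 / da` used in the function is the same conversion with the constants collected: 86,400 s/day × 1000 mm/m ÷ 10⁶ m²/km² = 86.4.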

In [10]:
# test loading streamflow time series for a pair of monitoring stations
s1, s2 = unique_stations[0], unique_stations[1]

df1, _ = retrieve_and_preprocess_timeseries_discharge(s1)
df2, _ = retrieve_and_preprocess_timeseries_discharge(s2)
test_df = pd.concat([df1, df2], axis=1)

flow_fig = figure(width=700, height=350, x_axis_type='datetime')
flow_fig.line(test_df.index, test_df[f'{s1}_uar'], color='navy', legend_label=s1)
flow_fig.line(test_df.index, test_df[f'{s2}_uar'], color='dodgerblue', legend_label=s2)
flow_fig.yaxis.axis_label = r'$$\text{Unit Area Runoff } L/s/\text{km}^2$$'
flow_fig.xaxis.axis_label = r'$$\text{Date}$$'
show(flow_fig)

In [11]:
# draw 10 random stations (printed for reference only)
import random
color_palette = Sunset10
sample_stations = random.sample(list(unique_stations), 10)
print(sample_stations)
# override with a fixed sample so the figures below are reproducible
sample_stations = ['12302055', '05BF018',  '08KH003', '08BB001', '10AB001', '08EE004', '08NP004',  '08MA001','08LG048', '08KB006',]
print('Sample stations:', sample_stations)

example_dfs = [retrieve_and_preprocess_timeseries_discharge(stn)[0] for stn in sample_stations]
example_df = pd.concat(example_dfs, axis=1)

example_df = example_df.loc['1982-01-01':'1984-12-31', [c for c in example_df.columns if c.endswith('_uar') and c != 'log_uar']]
temporal_mean = example_df[[f'{s}_uar' for s in sample_stations]].mean(1)



['10AC003', '08NB010', '12039005', '15081614', '12414900', '05AB046', '07EF004', '10CD004', '08CC001', '12359800']
Sample stations: ['12302055', '05BF018', '08KH003', '08BB001', '10AB001', '08EE004', '08NP004', '08MA001', '08LG048', '08KB006']


In [12]:
# set a common support discretization
x_bins = np.exp(np.linspace(np.log(0.5), np.log(600), 2**8+1))
# compute empirical PMFs for each station on the common support
x_vals = 0.5 * (x_bins[1:] + x_bins[:-1])
cdfs, pdfs = {}, {}
for s in sample_stations:
    hist, _ = np.histogram(example_df[f'{s}_uar'].clip(1e-6), bins=x_bins, density=True)
    pdfs[s] = hist
    # convert from density to mass
    pm = hist * np.diff(x_bins)
    pm /= pm.sum()    # normalize to 1
    cs = pm.cumsum()
    cdfs[s] = cs / cs[-1]    # normalize to 1

# compute the empirical PDF of the temporal mean
temporal_mean_pdf, _ = np.histogram(temporal_mean.clip(1e-6).dropna().values, bins=x_bins, density=True)
pm = temporal_mean_pdf * np.diff(x_bins)
temporal_mean_pmf = pm / pm.sum()
temporal_mean_cdf = temporal_mean_pmf.cumsum()
temporal_mean_cdf /= temporal_mean_cdf[-1]

average_density = pd.DataFrame(pdfs).mean(axis=1)
average_density *= np.diff(x_bins)
average_density /= average_density.sum()
average_density = average_density.cumsum()

dist_fig = figure(width=600, height=350, toolbar_location='above', x_axis_type='log', )

for i, s in enumerate(sample_stations):
    dist_fig.line(x_vals, cdfs[s], line_width=1, line_alpha=0.7,
                  color='grey', legend_label='Ensemble donors')
dist_fig.line(x_vals, temporal_mean_cdf, line_width=2.0, line_alpha=0.7,
                color='red', legend_label=f'Temporal mean')
dist_fig.line(x_vals, average_density, line_width=2, line_alpha=0.7,
                color='black', legend_label=f'Average density')

dist_fig.xaxis.axis_label = r'$$\text{Unit Area Runoff } L/s/\text{km}^2$$'
dist_fig.yaxis.axis_label = r'$$\text{Cumulative Density}$$'
dist_fig.legend.click_policy="hide"
dist_fig.legend.background_fill_alpha = 0.5
dist_fig.legend.location = "top_left"
dist_fig = dpf.format_fig_fonts(dist_fig, font_size=14)
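
The cell above contrasts two ways of pooling donor stations: averaging their distributions, and taking the distribution of their time-averaged series. The sketch below reproduces that contrast on synthetic data. The station names, log-normal parameters, and the helper `empirical_pmf` are assumptions for illustration. Unlike the cell above, the sketch clips values to both ends of the support so that every observation is counted.

```python
import numpy as np

rng = np.random.default_rng(42)
# synthetic unit area runoff series [L/s/km^2] for three hypothetical stations
series = {f"stn_{i}": rng.lognormal(mean=mu, sigma=1.0, size=1000)
          for i, mu in enumerate([1.0, 2.0, 3.0])}

# log-spaced bin edges shared by all stations (same bounds as the cell above)
edges = np.exp(np.linspace(np.log(0.5), np.log(600), 2**8 + 1))


def empirical_pmf(values, edges):
    """Probability mass per bin; values outside the support fall in the edge bins."""
    clipped = np.clip(values, edges[0], edges[-1])
    counts, _ = np.histogram(clipped, bins=edges)
    return counts / counts.sum()


# (a) average of the station distributions
pmfs = np.array([empirical_pmf(v, edges) for v in series.values()])
mean_pmf = pmfs.mean(axis=0)

# (b) distribution of the station-averaged time series
temporal_mean = np.mean(list(series.values()), axis=0)
temporal_mean_pmf = empirical_pmf(temporal_mean, edges)

mean_cdf = mean_pmf.cumsum()
temporal_mean_cdf = temporal_mean_pmf.cumsum()
```

Averaging the series first smooths out day-to-day variability, so the two CDFs generally differ even though both integrate to one.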


In [13]:
from bokeh.layouts import row

flow_fig = figure(width=600, height=350, x_axis_type='datetime', y_axis_type='log', toolbar_location='above')

for i, s in enumerate(sample_stations):
    flow_fig.line(example_df.index, example_df[f'{s}_uar'], line_width=1, line_alpha=0.7,
                  color='grey', legend_label='Ensemble donors')
flow_fig.line(example_df.index, temporal_mean.values, line_width=1.5, line_alpha=0.7,
                color='red', legend_label=f'Temporal mean')
flow_fig.yaxis.axis_label = r'$$\text{Unit Area Runoff } L/s/\text{km}^2$$'
flow_fig.xaxis.axis_label = r'$$\text{Date}$$'
flow_fig.legend.click_policy="hide"
flow_fig.legend.background_fill_alpha = 0.5
flow_fig.legend.location = "top_left"
flow_fig = dpf.format_fig_fonts(flow_fig, font_size=14)
show(row(flow_fig, dist_fig))

### Streamflow data validation for length of record


Here we set criteria for minimum record length and completeness to define a period-of-record (POR) flow duration curve.

First, we evaluate the prevalence of very low flows in the monitoring network.

In [14]:
from multiprocessing import Pool
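# prepend an explicit zero threshold; convert to a list *after* np.exp so that
# `+` concatenates lists instead of broadcasting over a NumPy array
lf_thresholds = [0.00] + np.exp(np.linspace(-10, -2, 500)).tolist()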
# compute the probability of zero-flow days for each station
def compute_low_flow_probabilities(stn):    
    # mean_annual_precip = attr_df[attr_df['official_id'] == stn]['prcp'].values[0]
    df, _ = retrieve_and_preprocess_timeseries_discharge(stn) 
    flow_values = df['discharge'].values
    # compute the probability of flow less than each lf_threshold
    p_low_flow = [np.mean(flow_values <= lf) for lf in lf_thresholds]
    p_zero_flow = (df['discharge'] <= 1e-4).sum() / len(df)
    threshold_df = pd.DataFrame({'lf_threshold': lf_thresholds, 'p_low_flow': p_low_flow})
    return {'stn': stn, 'p_lf': threshold_df, 'p_zero_flow': p_zero_flow}


results = []
# with Pool() as pool:
#     results = pool.map(compute_low_flow_probabilities, attr_df['official_id'].values)
if results:
    lf_results = [res['p_lf'].set_index('lf_threshold') for res in results]
    # set the columns to the station_ids
    lf_df = pd.concat(lf_results, axis=1)
    lf_df.columns = [res['stn'] for res in results]
    lf_df = lf_df.T
    pct_above_thresh = []
    for lf in lf_thresholds:
        pct_above = (lf_df.loc[:, lf] > 0.0).sum() / len(lf_df)
        pct_above_thresh.append(pct_above)

    p = figure(width=500, height=300, title='Fraction of Stations with Non-Zero Probability of Low Flows', x_axis_type='log')
    p.line(lf_thresholds, pct_above_thresh, line_width=2)
    p.xaxis.axis_label = 'Low Flow Threshold (m³/s)'
    p.yaxis.axis_label = 'Fraction of Stations'
    p = dpf.format_fig_fonts(p, font_size=14)
    show(p)
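
The low-flow summary above can be illustrated on a toy example. The sketch below uses two hypothetical stations and a handful of thresholds. All values are synthetic. `np.concatenate` is used to prepend the zero threshold because adding a Python list to a NumPy array broadcasts elementwise rather than prepending.

```python
import numpy as np

# thresholds [m^3/s]: an explicit zero followed by log-spaced values
thresholds = np.concatenate(([0.0], np.exp(np.linspace(-10, -2, 5))))

# synthetic daily discharge [m^3/s] for two hypothetical stations
flows = {
    "stn_a": np.array([0.0, 0.0, 0.002, 0.5, 1.2]),   # records zero-flow days
    "stn_b": np.array([0.05, 0.2, 0.4, 0.9, 3.0]),    # never below 0.05 m^3/s
}

# P(Q <= threshold): rows are stations, columns are thresholds
p_low = np.array([[np.mean(q <= t) for t in thresholds] for q in flows.values()])

# fraction of stations with a non-zero probability of flow at or below each threshold
frac_stations = (p_low > 0).mean(axis=0)
print(frac_stations)
```

In this toy case only `stn_a` contributes at the smallest thresholds, and both stations contribute once the threshold exceeds the minimum flow at `stn_b`.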


### Preprocess all streamflow records for completeness

Find all complete years (>= 20 days of record in all 12 months) by calendar year and by hydrological year (October 1 - September 30).

Then look at mean annual total volume for catchments and compare hydrological to calendar years.
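
As a minimal sketch of the completeness criterion described above, the example below labels each month with its calendar year or its water year (October to September, labelled by the ending year) and keeps years in which all 12 months have at least 20 days of data. The function `complete_years` and the synthetic series are assumptions for illustration. The grouping is done with explicit year labels rather than a resampling frequency alias.

```python
import numpy as np
import pandas as pd

# synthetic daily series, Oct 1, 2000 to Dec 31, 2002, missing all of November 2001
idx = pd.date_range("2000-10-01", "2002-12-31", freq="D")
s = pd.Series(np.ones(len(idx)), index=idx)
s = s[~((s.index.year == 2001) & (s.index.month == 11))]


def complete_years(series, min_days=20, start_month=1):
    """Years (labelled by ending year) where all 12 months have >= min_days of data."""
    month_ok = series.resample("MS").count() >= min_days
    months = month_ok.index.month.to_numpy()
    years = month_ok.index.year.to_numpy()
    if start_month > 1:
        # months at or after start_month belong to the year ending in the next calendar year
        years = years + (months >= start_month).astype(int)
    n_ok = month_ok.groupby(years).sum()
    return n_ok[n_ok == 12].index.tolist()


cal_years = complete_years(s, start_month=1)    # calendar years, Jan to Dec
hyd_years = complete_years(s, start_month=10)   # water years, Oct to Sep
print(cal_years, hyd_years)
```

The November 2001 gap removes calendar year 2001 and water year 2002, so the two definitions retain different years from the same record.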

In [15]:

complete_year_stats_fpath = f'data/complete_year_stats.npy'

# Anchor month for the hydrological year.  September can still carry significant
# glacial melt influence in some parts of BC.  Both September and October anchors
# are in common use; this study concerns long-term distributions rather than
# temporal dynamics, so either may be fine.
# Note: the pandas alias 'YE-OCT' bins years ending October 31 (Nov-Oct);
# 'YE-SEP' would give a strict October 1 to September 30 year.
hyd_ms = 'OCT'

if os.path.exists(complete_year_stats_fpath):
    complete_year_stats = np.load(complete_year_stats_fpath, allow_pickle=True).item()
    mean_annual_hyd = [complete_year_stats[stn]['mean_annual_hyd'] for stn in complete_year_stats.keys()]
    mean_annual_cal = [complete_year_stats[stn]['mean_annual_cal'] for stn in complete_year_stats.keys()]
    mean_annual_vol_hyd = [complete_year_stats[stn]['annual_vol_hyd_bounds'] for stn in complete_year_stats.keys()]
    mean_annual_vol_cal = [complete_year_stats[stn]['annual_vol_cal_bounds'] for stn in complete_year_stats.keys()]
else:
    complete_year_stats = {}
    mean_annual_hyd, mean_annual_cal, mean_annual_vol_hyd, mean_annual_vol_cal = [], [], [], []
    for i, stn in enumerate(attr_df['official_id'].values):
        df, _ = retrieve_and_preprocess_timeseries_discharge(stn)
        # convert mean daily discharge to volume in M m³/day
        df['daily_vol_mcm'] = df['discharge'] * 24 * 3600 / 1e6    # M m³/day
        s = df.loc[:, f'{stn}_uar'].sort_index().dropna()
        v = df.loc[:, 'daily_vol_mcm'].sort_index().dropna()
        mcount = s.resample('MS').count()  # days with data per month
        min_nonzero_uar = s[s > 0].min()

        # hydrological year: Oct-Sep        
        # get complete months and years by calendar and hydrological year
        cal_sum = mcount.ge(20).groupby(pd.Grouper(freq=f'YE-DEC')).sum().eq(12)
        hyd_sum = mcount.ge(20).groupby(pd.Grouper(freq=f'YE-{hyd_ms}')).sum().eq(12)
        # compute mean annual total volume discharge for complete years 
        annual_cal = pd.DataFrame({'mean': s.resample('YE-DEC').mean(),
                            'total': s.resample('YE-DEC').sum()}).loc[cal_sum[cal_sum].index]
        annual_hyd = pd.DataFrame({'mean': s.resample(f'YE-{hyd_ms}').mean(),
                                'total': s.resample(f'YE-{hyd_ms}').sum()}).loc[hyd_sum[hyd_sum].index]
        
        annual_cal_vol = pd.DataFrame({'mean': v.resample('YE-DEC').mean(),
                            'total': v.resample('YE-DEC').sum()}).loc[cal_sum[cal_sum].index]
        annual_hyd_vol = pd.DataFrame({'mean': v.resample(f'YE-{hyd_ms}').mean(),
                            'total': v.resample(f'YE-{hyd_ms}').sum()}).loc[hyd_sum[hyd_sum].index]

        annual_cal.index = annual_cal.index.year  # calendar year label
        annual_hyd.index = annual_hyd.index.year  # water year = ending year (Oct–Sep)
        complete_year_stats[stn] = {'cal_years': annual_cal.index.tolist(),
                                    'hyd_years': annual_hyd.index.tolist(),
                                    'mean_annual_cal': annual_cal['mean'].mean(),
                                    'mean_annual_hyd': annual_hyd['mean'].mean(),
                                    'annual_vol_cal': annual_cal_vol['total'].mean(),
                                    'annual_vol_hyd': annual_hyd_vol['total'].mean(),
                                    'annual_vol_cal_bounds': np.percentile(annual_cal_vol['total'].values, (2.5, 50, 97.5)),
                                    'annual_vol_hyd_bounds': np.percentile(annual_hyd_vol['total'].values, (2.5, 50, 97.5)),
                                    'min_nonzero_uar': min_nonzero_uar,
                                    'max_uar': s.max(),
                                    'min_nonzero_discharge': df[df['discharge'] > 0]['discharge'].min(),
                                    'max_discharge': df['discharge'].max(),
                                    }
        
        mean_annual_hyd.append(complete_year_stats[stn]['mean_annual_hyd'])
        mean_annual_cal.append(complete_year_stats[stn]['mean_annual_cal'])
        mean_annual_vol_hyd.append(complete_year_stats[stn]['annual_vol_hyd_bounds'])
        mean_annual_vol_cal.append(complete_year_stats[stn]['annual_vol_cal_bounds'])
    
    # save the dict as a pickle file
    np.save(complete_year_stats_fpath, complete_year_stats)

In [16]:
complete_year_stats['08JE005']
meet_minimum_hyd_years = []
for stn in complete_year_stats.keys():
    if len(complete_year_stats[stn]['hyd_years']) < 5:
        # use double quotes inside the f-string so this also parses on Python < 3.12
        print(f'Station {stn} has {len(complete_year_stats[stn]["hyd_years"])} complete hydrological years of data.')
    else:
        meet_minimum_hyd_years.append(stn)

Station 05BG002 has 4 complete hydrological years of data.
Station 08JE005 has 4 complete hydrological years of data.
Station 08ME015 has 4 complete hydrological years of data.
Station 08NG004 has 4 complete hydrological years of data.
Station 08PA001 has 4 complete hydrological years of data.
Station 12073000 has 4 complete hydrological years of data.
Station 12164000 has 4 complete hydrological years of data.
Station 12392895 has 4 complete hydrological years of data.
Station 12444490 has 4 complete hydrological years of data.
Station 15081614 has 4 complete hydrological years of data.


### Find and set the global (nonzero) min and max UAR

The global min and max are used to set the common support discretization.

In [17]:
stats_df = pd.DataFrame()
stats_df['min_nonzero_discharge'] = [complete_year_stats[stn]['min_nonzero_discharge'] for stn in complete_year_stats.keys()]
stats_df['max_discharge'] = [complete_year_stats[stn]['max_discharge'] for stn in complete_year_stats.keys()]
stats_df['min_nonzero_uar'] = [complete_year_stats[stn]['min_nonzero_uar'] for stn in complete_year_stats.keys()]
stats_df['max_uar'] = [complete_year_stats[stn]['max_uar'] for stn in complete_year_stats.keys()]
stats_df['official_id'] = complete_year_stats.keys()
stats_df['drainage_area_km2'] = [da_dict[stn] for stn in complete_year_stats.keys()]
print(f'Min non-zero discharge across all stations: {stats_df["min_nonzero_discharge"].min():.6f} m³/s')
print(f'Min non-zero uar across all stations: {stats_df["min_nonzero_uar"].min():.7f} L/s/km²')
print(f'Max discharge across all stations: {stats_df["max_discharge"].max():.0f} m³/s')
print(f'Max uar across all stations: {stats_df["max_uar"].max():.0f} L/s/km²')
print(f'Min drainage area across all stations: {stats_df["drainage_area_km2"].min():.3f} km²')

Min non-zero discharge across all stations: 0.000283 m³/s
Min non-zero uar across all stations: 0.0000964 L/s/km²
Max discharge across all stations: 19400 m³/s
Max uar across all stations: 8374 L/s/km²
Min drainage area across all stations: 1.010 km²


In [18]:
global_min_uar = 5e-5
global_max_uar = 1e4
global_max_flow = 2e4
assert stats_df['min_nonzero_uar'].min() >= global_min_uar, "Min non-zero UAR below global minimum."
assert stats_df['max_uar'].max() <= global_max_uar, "Max global UAR above global maximum."
assert stats_df['max_discharge'].max() <= global_max_flow, "Max global discharge above global maximum."
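
The bounds asserted above can be turned into a common log-spaced support. This is a sketch only. The number of bins `n_bins` is an assumed value and is not taken from the notebook.

```python
import numpy as np

global_min_uar, global_max_uar = 5e-5, 1e4   # bounds from the cell above [L/s/km^2]
n_bins = 2**10                               # assumed resolution, for illustration only

# bin edges evenly spaced in log space, with geometric-mean bin centres
log_edges = np.linspace(np.log(global_min_uar), np.log(global_max_uar), n_bins + 1)
edges = np.exp(log_edges)
centres = np.exp(0.5 * (log_edges[1:] + log_edges[:-1]))

# every bin has the same width in log space, but not in linear space
log_width = np.diff(log_edges)
linear_width = np.diff(edges)
print(f"{len(centres)} bins, log width {log_width[0]:.4f}")
```

Because the linear bin widths grow with runoff, densities must be multiplied by `linear_width` before they are compared or summed as probability masses.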

### Compare mean annual volumes by hydrological and calendar years.

In [19]:
from bokeh.models import ColumnDataSource
from scipy.stats import linregress

cal_df = pd.DataFrame(mean_annual_vol_cal, columns=['lb', 'median', 'ub'])
hyd_df = pd.DataFrame(mean_annual_vol_hyd, columns=['lb', 'median', 'ub'])

# plot a regression of mean annual hydrological year vs calendar year
reg_fig = figure(width=500, height=400, title='Total Annual Volume: Hydrological vs Calendar Year')
# plot the 95% CI for each point as a quad in Bokeh
source = ColumnDataSource(data=dict(
    x=cal_df['median'],
    y=hyd_df['median'],
    lower_x=cal_df['lb'],
    upper_x=cal_df['ub'],
    lower_y=hyd_df['lb'],
    upper_y=hyd_df['ub'],
))
reg_fig.quad(left='lower_x', right='upper_x', bottom='lower_y', top='upper_y', source=source,
             fill_alpha=0.6, line_alpha=0.0, fill_color='lightblue', legend_label='95% CI')
# plot the median values
reg_fig.scatter(cal_df['median'], hyd_df['median'], size=5, color='dodgerblue', alpha=0.8)
max_val = max(max(cal_df['ub']), max(hyd_df['ub'])) * 1.1
slope, intercept, r_value, p_value, std_err = linregress(cal_df['median'], hyd_df['median'])
reg_fig.line([0, max_val], [0, max_val], line_width=2, color='black', alpha=0.8, legend_label='1:1 Line', line_dash='dotted')
label = f'Y={slope:.2f}X+{intercept:.2f}, R²={r_value**2:.4f}'
reg_fig.line([0, max_val], [intercept, slope * max_val + intercept], line_width=2, color='red', alpha=0.8, line_dash='dashed', legend_label=label)
reg_fig.xaxis.axis_label = 'Avg. Ann. Vol (Calendar) [Mcm]'
reg_fig.yaxis.axis_label = 'Avg. Ann. Vol. (Hydrological) [Mcm]'
reg_fig = dpf.format_fig_fonts(reg_fig, font_size=14)
reg_fig.legend.location = 'top_left'
reg_fig.legend.background_fill_alpha = 0.5
reg_fig.xaxis.major_label_orientation = np.pi / 4
show(reg_fig)

### Extra catchments to exclude

* Kakuhan Creek Near Haines AK - 15056030

```{figure} images/kakuhan_creek.png
---
width: 600px
alt: Example of a catchment polygon with delineation issues.
---
There is uncertainty in the delineation of the Kakuhan Creek catchment polygon.  The historical station location does not align with the stream network derived from 30m DEM data.
```

### Excluded due to no complete years of data (seasonal / <= 90% complete)

* Genessee Creek at the Mouth - 08FA009
* McNair Creek near Port Mellon - 08GA037
* Canoe River near Valemount - 08NC003
* Big Quilcene River Near Quilcene, WA - 12052500
* Morey Creek above McChord Afb near Parkland, WA - 12090480
* North Fork Newaukum Creek Near Enumclaw, WA - 12107950
* Newaukum Creek Tributary Near Black Diamond, WA - 12108450
* May Creek near Issaquah, WA - 12119300
* Honey Creek near Renton, WA - 12119450
* Carpenter Creek near Bacon Road near Mount Vernon, WA - 12200684
* Unnamed Tributary Massacre Bay on Orcas Island, WA - 12200762
* Whatcom Creek near Bellingham, WA - 12203000
* Hall Creek at Inchelium, WA - 12409500
* Dayebas Creek Near Haines, AK - 15056070
* Bonne Creek near Klawock, AK - 15081510

### Stations representing regulated streams that QA appears to have missed

* 12323760 - Silver Lake Dam
* 12143700 - Boxley Creek near Cedar Falls (unregulated but heavily influenced by seepage from adjacent reservoir -- Chester Morse Lake)
* 12143900 - Boxley Creek Near Edgewick, WA (also unregulated but heavily influenced by seepage from adjacent reservoir -- Chester Morse Lake)
* 12117500 - Cedar River at Landsburg below the diversion dam
* 12398000 - Sullivan Lake
* 12058800 - North Fork Skokomish River below Lower Cushman Dam near Potlatch, WA
* 12137800 - Sultan River below diversion dam
* 12100000 - White River near Buckley - Several upstream diversions for a) power generation, b) flood control

In [20]:
exclude_stations = ['08FA009', '08GA037', '08NC003', '12052500', '12090480', '12107950', '12108450', '12119300', 
                    '12119450', '12200684', '12200762', '12203000', '12409500', '15056070', '15081510',
                    '12323760', '12143700', '12143900', '12398000', '12058800', '12137800', '12100000']

for ex_stn in exclude_stations:
    # check whether the station is present in the quality-controlled HYSETS list
    df, _ = retrieve_and_preprocess_timeseries_discharge(ex_stn)
    if df.empty:
        print(f'{ex_stn} not included in HYSETS.')
    else:
        print(f'{ex_stn} found in HYSETS.  See notes for quality issues.')
    

08FA009 found in HYSETS.  See notes for quality issues.
08GA037 found in HYSETS.  See notes for quality issues.
08NC003 found in HYSETS.  See notes for quality issues.
12052500 found in HYSETS.  See notes for quality issues.
12090480 found in HYSETS.  See notes for quality issues.
12107950 found in HYSETS.  See notes for quality issues.
12108450 found in HYSETS.  See notes for quality issues.
12119300 found in HYSETS.  See notes for quality issues.
12119450 found in HYSETS.  See notes for quality issues.
12200684 found in HYSETS.  See notes for quality issues.
12200762 found in HYSETS.  See notes for quality issues.
12203000 found in HYSETS.  See notes for quality issues.
12409500 found in HYSETS.  See notes for quality issues.
15056070 found in HYSETS.  See notes for quality issues.
15081510 found in HYSETS.  See notes for quality issues.
12323760 found in HYSETS.  See notes for quality issues.
12143700 found in HYSETS.  See notes for quality issues.
12143900 found in HYSETS.  See not

In [21]:
min_years_of_record = 5
#create a binary matrix of the stations (rows) and complete years (columns)
# year_matrix = np.zeros((len(bcub_stations), len(all_years)), dtype=int)
validated_stations = sorted(list(set(list(complete_year_stats.keys()) + meet_minimum_hyd_years)))
# temporally validated
validated_stations = [stn for stn in validated_stations if len(complete_year_stats[stn]['hyd_years']) >= min_years_of_record]
# remove excluded for human influence
validated_stations = [stn for stn in validated_stations if stn not in exclude_stations]
# N = len(validated_stations)
attr_df = attr_df[attr_df['official_id'].isin(validated_stations)]
print(f'There are {len(attr_df)} unregulated monitoring stations with at least {min_years_of_record} complete years of data.')


There are 1007 unregulated monitoring stations with at least 5 complete years of data.


In [22]:
def compute_runoff_stats(data):
    out = {}
    for label in ['uar', 'log_uar']:
        vals = data[label].values
        vals = vals[~np.isnan(vals) & ~np.isinf(vals)]
        # classical moments
        m   = vals.mean()
        median = np.median(vals)
        s   = vals.std(ddof=1)
        mad = np.mean(np.abs(vals - m))
        sk  = pd.Series(vals).skew()
        kt  = pd.Series(vals).kurtosis()

        # l-moments
        # params = distr.gev.lmom_fit(vals)

        out.update({
            f'{label}_mean': m,
            f'{label}_median': median,
            f'{label}_mad': mad,
            f'{label}_std': s,
            f'{label}_skew': sk,
            f'{label}_kurt': kt,
            # f'{label}_lmom_xi': params['c'],
            # f'{label}_lmom_loc': params['loc'],
            # f'{label}_lmom_scale': params['scale'],
        })
    return out

# compute the runoff statistics (and the CAMELS mean runoff, where available) for one attribute row
def process_row(data):
    stn = str(data['official_id'])
    data, _ = retrieve_and_preprocess_timeseries_discharge(stn)
    
    # Compute the runoff statistics
    runoff_data = compute_runoff_stats(data)
    camels_data = camels_df[camels_df['gauge_id'] == stn].copy()
    if len(camels_data) > 1:
        raise Exception(f'Multiple CAMELS data found for {stn}.')
    # use the CAMELS mean runoff where available, otherwise NaN
    camels_q = camels_data['q_mean'].values[0] if not camels_data.empty else np.nan

    # Merge your existing mm‐based mean + the new metrics
    out = {
      **runoff_data,
      'camels_q_mean_mm': camels_q,
    }
    return pd.Series(out)


In [23]:
updated_attribute_file = 'catchment_attributes_with_runoff_stats.csv'
if not os.path.exists(os.path.join('data', updated_attribute_file)):
    print(f'Updated attribute file {updated_attribute_file} not found. Using {attribute_file} instead.')
    updated_attribute_path = os.path.join('data', attribute_file)
    process_statistics = True
else:
    updated_attribute_path = os.path.join(os.getcwd(), 'data', updated_attribute_file)
    process_statistics = False


In [24]:
if process_statistics:
    print(f'Processing runoff statistics for {len(validated_stations)} stations')
    updated_fpath = os.path.join(os.getcwd(), 'data', updated_attribute_file)
    # compute per-station runoff statistics and write them into the attribute table
    stats_results = attr_df.apply(process_row, axis=1)
    target_cols = stats_results.columns.tolist()
    attr_df.loc[stats_results.index, stats_results.columns] = stats_results
    print(f'   Saving updated attributes with runoff statistics for {len(attr_df)} catchments to:', updated_fpath)
    attr_df.to_csv(updated_fpath)

In [25]:
# build lookups from HYSETS official station IDs to watershed IDs and drainage areas
ws_id_dict = hysets_df.set_index('Official_ID')['Watershed_ID'].to_dict()
da_dict = hysets_df.set_index('Official_ID')['Drainage_Area_km2'].to_dict()
official_id_dict = dict(zip(hysets_df['Official_ID'], hysets_df['Watershed_ID']))

In [26]:
# load the catchment attributes (with runoff statistics) for the BCUB study region
bcub_df = pd.read_csv(os.path.join('data', 'catchment_attributes_with_runoff_stats.csv'), dtype={'official_id': str})
bcub_df['official_id'] = bcub_df['official_id'].astype(str)
bcub_df = bcub_df[bcub_df['official_id'].isin(validated_stations)].copy()
# map each official station ID to its HYSETS watershed ID (None where no match exists)
bcub_df['watershedID'] = bcub_df['official_id'].apply(lambda x: official_id_dict.get(x, None))
print(f'   Found {len(bcub_df)} catchments in the BCUB region with runoff statistics.')

   Found 1007 catchments in the BCUB region with runoff statistics.


## Compute reference (observed) distributions

Here we compute the baseline, or empirical, probability distribution of daily unit area runoff (UAR, in L/s/km²) for each monitored catchment in the study region from observed streamflow records. For each station, we construct an empirical probability mass function (PMF) over a common support $\Omega$, discretized into $2^{b}$ bins of equal width in log space, where $b$ is the bitrate; the code below uses $b \in \{5, 6, 8, 10\}$. A common support ensures that all catchments are evaluated on a consistent, physically meaningful scale that captures the full range of observed runoff values, from $5 \times 10^{-6}$ to $10^{4}$ L/s/km².  

At the finest resolution used here (10-bit encoding), bin edges lie within $\sim 1\%$ of the quantized value (bin midpoint), which is precision beyond what is typically achievable with daily mean streamflow estimation. The resulting reference distributions serve as the baseline for evaluating predicted flow duration (reliability) curves.

The empirical PMF serves as the "ground truth" for evaluation. It is computed over the common support $\Omega$ and allows states to have zero probability, a deliberate choice to represent what has been observed as faithfully as possible.  Streamflow in nature is a continuous variable, so the commonly observed zero-probability bins lying between bins with nonzero probability are an artifact of finite sampling.  A secondary "baseline" is computed here using a kernel density estimator (KDE) with a log-normal kernel and a bandwidth set according to an assumed measurement error model; these KDE-smoothed distributions are used as part of the FDC estimation models described in Notebook 2.  
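
The discretization described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the notebook's implementation: the helper names `build_log_edges` and `empirical_pmf` are hypothetical, the runoff series is synthetic, the support limits are the values quoted in the text, and the station-specific zero-equivalent flow threshold handled by `ReferenceDistribution` is ignored, so anything below the support simply falls in the first bin.

```python
import numpy as np


def build_log_edges(min_uar, max_uar, bits):
    """Log-space bin edges with one extra bin prepended for zero-equivalent flows.

    2**bits edges give 2**bits - 1 bins over the support; the prepended bin
    brings the total to 2**bits bins. (Hypothetical helper.)
    """
    log_edges = np.linspace(np.log(min_uar), np.log(max_uar), 2**bits)
    width = log_edges[1] - log_edges[0]
    return np.concatenate(([log_edges[0] - width], log_edges))


def empirical_pmf(uar, log_edges_extended):
    """Count observations per bin and normalize; unobserved bins keep zero mass."""
    n_bins = len(log_edges_extended) - 1
    edges = np.exp(log_edges_extended)
    idx = np.digitize(uar, edges, right=False) - 1
    # values outside the support are assigned to the first or last bin
    idx = np.clip(idx, 0, n_bins - 1)
    counts = np.bincount(idx, minlength=n_bins)
    return counts / counts.sum()


# synthetic daily UAR series (L/s/km²) with 25 zero-flow days; assumed data
rng = np.random.default_rng(42)
uar = rng.lognormal(mean=2.0, sigma=1.0, size=3650)
uar[:25] = 0.0

log_edges = build_log_edges(5e-6, 1e4, bits=8)
pmf = empirical_pmf(uar, log_edges)

# relative distance from a bin midpoint to its edges, in percent
half_width_pct = 100 * (np.exp(np.diff(log_edges)[-1] / 2) - 1)
print(f'{len(pmf)} bins, {np.sum(pmf == 0)} empty, edges within {half_width_pct:.1f}% of midpoints')
```

Keeping the empty bins at exactly zero is what distinguishes this empirical PMF from the KDE-smoothed version, which spreads mass into neighbouring bins.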

In [27]:
class ReferenceDistribution:
    def __init__(self, **kwargs):

        for k, v in kwargs.items():
            setattr(self, k, v)

        self._initialize_station()
        self._filter_complete_hydrological_years()
        self._digitize_uar_series()


    def _initialize_station(self):
        self.da = self.da_dict[self.stn]
        self.df, self.zero_flow_flag = retrieve_and_preprocess_timeseries_discharge(self.stn)
        self.df['uar'] = 1000 * self.df['discharge'] / self.da    


    def _filter_complete_hydrological_years(self):
        s = self.df.loc[:, ['discharge']].sort_index()  # daily discharge series

        # Calendar years (A-DEC)
        ok_cal = (s.resample('MS').count().ge(20)
                    .groupby(pd.Grouper(freq='YE-DEC')).sum().eq(12))
        ok_cal.index = ok_cal.index.to_period('Y-DEC')       # <- PeriodIndex

        per_cal = s.index.to_period('Y-DEC')
        mask_cal = ok_cal.reindex(per_cal, fill_value=False).to_numpy()
        self.cal_df = s[mask_cal].copy()  # daily values in complete calendar years
        # compute unit area runoff (UAR) from discharge and drainage area
        self.cal_df['uar'] = 1000 * self.cal_df['discharge'] / self.da

        # Hydrologic years, e.g. Oct–Sep -> A-SEP
        hyd_ms = 'SEP'
        ok_hyd = (s.resample('MS').count().ge(20)
                    .groupby(pd.Grouper(freq=f'YE-{hyd_ms}')).sum().eq(12))
        
        ok_hyd.index = ok_hyd.index.to_period(f'Y-{hyd_ms}')    # <- PeriodIndex

        per_hyd = s.index.to_period(f'Y-{hyd_ms}')
        mask_hyd = ok_hyd.reindex(per_hyd, fill_value=False).to_numpy()
        self.hyd_df = s[mask_hyd].copy()
        self.hyd_df['uar'] = 1000 * self.hyd_df['discharge'] / self.da
        self.hyd_df.dropna(subset=['uar'], inplace=True)
    
    
    def _digitize_uar_series(self):
        # digitize the uar series
        lin_edges_extended = np.exp(self.log_edges_extended)
        self.minimum_uar_threshold = float(1000.0 * self.zero_equiv_flow_threshold / self.da)
        self.hyd_df['uar_bin'] = np.digitize(self.hyd_df['uar'], lin_edges_extended, right=False) - 1 # hydrologic year data
        self.cal_df['uar_bin'] = np.digitize(self.cal_df['uar'], lin_edges_extended, right=False) - 1 # calendar year data
        self.zero_bin_index = max(0, np.digitize(self.minimum_uar_threshold, lin_edges_extended, right=False) - 1)
        
        # map the quantized bin values back to the series
        self.lin_x_extended = np.exp(0.5 *(self.log_edges_extended[1:] + self.log_edges_extended[:-1]))
        # check that the bin indices are in range before indexing; negative indices are clipped to the first bin
        assert self.hyd_df['uar_bin'].max() < len(self.lin_x_extended), f"uar_bin index out of range. {self.hyd_df['uar_bin'].max()} >= {len(self.lin_x_extended)}"
        self.hyd_df['uar_discrete'] = self.lin_x_extended[self.hyd_df['uar_bin'].clip(lower=0).to_numpy()]
        
        # start from the raw bin indices, sending values below the support (index -1, e.g. zero flows)
        # to the first bin so they cannot be counted in the last bin through negative indexing
        self.hyd_df['uar_bin_adjusted'] = self.hyd_df['uar_bin'].clip(lower=0)
        # handle values below the minimum measurable threshold
        self.hyd_df['uar_zero_adjusted'] = self.hyd_df['uar'].copy()
        if self.hyd_df['uar_bin'].min() < self.zero_bin_index and self.zero_bin_index > 0:
            # get the minimum log value
            # the discrete x value to the left of the bin containing the "minimum measurable value"
            min_uar = self.lin_x_extended[self.zero_bin_index - 1] 
            # min_uar2 = self.lin_x_extended[self.zero_bin_index]
            # check that the assigned min_uar corresponds to the correct 
            # # bin x value (midpoint in log space)
            # print(f"Values found below minimum measurable threshold: {minimum_uar_threshold:.5f}")
            # print(min_uar, minimum_uar_threshold, min_uar2)
            # print(self.lin_x_extended[self.zero_bin_index - 2:self.zero_bin_index + 3])
            # print(lin_edges_extended[self.zero_bin_index - 2:self.zero_bin_index + 3])
            self.hyd_df.loc[self.hyd_df['uar_bin'] < 0, 'uar_discrete'] = np.float32(min_uar)
            # self.hyd_df.loc[self.hyd_df['uar_bin'] < 0, 'uar'] = np.float32(min_uar)
            
            # adjust the uar bin where the bin index is smaller than the zero bin index            
            self.hyd_df.loc[self.hyd_df['uar_bin_adjusted'] < self.zero_bin_index, 'uar_bin_adjusted'] = 0
            # adjust the uar values below the minimum measurable threshold
            self.hyd_df.loc[self.hyd_df['uar_bin_adjusted'] < self.zero_bin_index, 'uar_zero_adjusted'] = np.float32(min_uar)
            # print(self.hyd_df[self.hyd_df['uar_bin'] == self.hyd_df['uar_bin'].min()].copy(), self.zero_bin_index)
            # raise ValueError(f"uar_bin index negative: {self.hyd_df['uar_bin'].min()} < 0")
        # else:
        #     print(f"No values below minimum measurable threshold: {minimum_uar_threshold:.5f}")
        #     print(self.hyd_df[self.hyd_df['uar_bin'] == self.hyd_df['uar_bin'].min()].copy(), self.zero_bin_index)
        
    
    def _compute_kl_divergence(self, p, q, lam=1e-9):
        """Compute the KL divergence between two probability distributions."""
        p = p.astype(float)
        q = q.astype(float)

        # q = q * (1 - lam) + lam / len(q)
        # q = q / q.sum()

        mask = (p > 0) & (q > 0)

        with np.errstate(divide='ignore', invalid='ignore'):
            log_term = np.zeros_like(p)
            log_term[mask] = np.log(p[mask] / q[mask])
            terms = np.zeros_like(p)
            terms[mask] = p[mask] * log_term[mask]

        bad_idx = ~np.isfinite(terms)
        if np.any(bad_idx):
            print("[DEBUG] Invalid values in KL divergence computation:")
            for i in np.flatnonzero(bad_idx):
                print(
                    f"  i={i}, p={p[i]:.3e}, q={q[i]:.3e}, "
                    f"p/q={p[i]/q[i]:.3e}, log(p/q)={log_term[i]:.3e}, "
                    f"term={terms[i]:.3e}"
                )

        return np.sum(terms[mask])  # Only use valid terms
    
    
    def build_station_pmf(self):
        """
        Build the discretized and KDE smoothed PMFs for the daily uar timeseries.
        """

        # counts = np.histogram(log_uar, bins=self.pos_edges, density=False)[0].astype(np.int64, copy=False)
        unique_bin_idxs, bin_counts = np.unique(self.hyd_df['uar_bin_adjusted'].values, return_counts=True)

        # initialize the PMF 
        pmf = np.zeros(len(self.lin_x_extended))

        # assign the observation counts to the pmf by bin index
        pmf[unique_bin_idxs] = bin_counts.astype(int)

        # assert the counts match
        assert pmf.sum() == len(self.hyd_df), f"PMF counts {pmf.sum()} do not match number of observations {len(self.hyd_df)}"

        # normalize to PMF
        pmf /= pmf.sum()
        
        # KDE on the same grid given the observed uar values
        # with the smallest value adjusted to the minimum measurable threshold
        positive_values = self.hyd_df[self.hyd_df['uar'] >= self.minimum_uar_threshold]['uar'].values
        N_p = len(positive_values)
        N_n = len(self.hyd_df) - N_p
        pmf_kde_raw, _ = self.kde_estimator.compute(
            positive_values,
            self.drainage_area_km2
            )  # returns pmf over pos_edges intervals
        
        
        kde_counts = (pmf_kde_raw * N_p)

        assert len(kde_counts) == len(self.lin_x_extended), f"KDE counts length {len(kde_counts)} does not match PMF length {len(self.lin_x_extended)}"
        
        if N_n > 0:
            pmf_kde = np.zeros_like(kde_counts)
            if self.zero_bin_index == 0: 
                # the minimum measurable threshold is below the support, N_n are zero flows.
                # all zero flows go to the first bin and there is no lower bin mass to consider
                low_probability_mass = N_n 
            else:
                # compute the low probability mass from the KDE below the zero bin index
                low_probability_mass = N_n + kde_counts[:self.zero_bin_index].sum()
            
            # if zero_bin_index == 0, the zero index bin will be reassigned in the second step
            pmf_kde[self.zero_bin_index:] = kde_counts[self.zero_bin_index:]
            pmf_kde[0] = low_probability_mass
            # print(N_n, low_probability_mass, pmf_kde[0], self.zero_bin_index)
            
        else:
            assert N_p == len(self.hyd_df)
            pmf_kde = (pmf_kde_raw * len(self.hyd_df))

        count_match = int(pmf_kde.sum() - len(self.hyd_df))
            
        assert np.isclose(count_match, 0), f"PMF counts {pmf_kde.sum()} do not match number of observations {len(self.hyd_df)} after zero bin adjustment"
        pmf_kde /= pmf_kde.sum()  # renormalize to PMF
        assert np.isclose(pmf_kde.sum(), 1.0), f"KDE PMF does not sum to 1: {pmf_kde.sum()}"
        assert np.isclose(pmf.sum(), 1.0), f"Discrete PMF does not sum to 1: {pmf.sum()}"
        return pmf, pmf_kde


In [28]:

bitrates = [5, 6, 8, 10]

# the threshold below which flows are considered indistinguishable from zero
zero_equivalent_flow_threshold = 1e-4
distribution_dict = {}
all_dkls = []

for bitrate in bitrates:
    # compute log edges: 2**bitrate edges give 2**bitrate - 1 bins over the support,
    # leaving one bin (prepended below) for zero-equivalent flows, for 2**bitrate bins in total
    log_edges_uar = np.linspace(np.log(global_min_uar), np.log(global_max_uar), 2**bitrate)
    log_x_uar = 0.5 * (log_edges_uar[1:] + log_edges_uar[:-1])
    log_w = np.diff(log_x_uar)
    log_edges_extended = np.concatenate(([log_edges_uar[0] - log_w[0]], log_edges_uar))
    log_w_extended = np.diff(log_edges_extended)
    log_x_extended = 0.5 * (log_edges_extended[1:] + log_edges_extended[:-1])
    pct_diff = 100 * (np.exp((log_edges_uar[1:] - log_edges_uar[:-1]) / 2) - 1)

    print(f'{bitrate}-bit bin edges are +/- {pct_diff.max():.0f}% from the bin midpoints.')
    print(f'Processing baseline distributions for bitrate: {bitrate} bits')
    output_folder = Path(os.getcwd()) / 'data' / 'baseline_distributions' / f'{bitrate:02d}_bits'

    if not os.path.exists(output_folder):
        os.makedirs(output_folder)
        print(f'Created output folder: {output_folder}')

    process_distributions = True
    if os.path.exists(output_folder / 'pmf_obs.csv'):
        print(f'Baseline distribution files already exist in {output_folder}. Skipping processing.')
        process_distributions = False

    if not process_distributions:
        distribution_dict[bitrate] = {
            # 'pdf_kde': pd.read_csv(output_folder / 'pdf_kde.csv'),
            'pmf_kde': pd.read_csv(output_folder / 'pmf_kde.csv'),
            # 'pdf_obs': pd.read_csv(output_folder / 'pdf_obs.csv'),
            'pmf_obs': pd.read_csv(output_folder / 'pmf_obs.csv'),
        }
    else:
        shared_config = {
            'da_dict': da_dict,
            'complete_year_dict': complete_year_stats,
            }
        
        # start_no = 210

        stations  = bcub_df['official_id'].values#[start_no:]
        nstations = len(stations)
        nbins     = int(2**bitrate) # 2**bitrate - 1 support bins plus one bin for zero-equivalent flows

        # pre-allocate arrays
        obs_pmf_arr = np.zeros((nbins, nstations))
        kde_pmf_arr = np.zeros((nbins, nstations))
        kde_estimator = KDEEstimator(log_edges_extended)
        for j, stn in enumerate(stations):
            # initialize the station (use the left edge from binning)
            baseline  = ReferenceDistribution(
                stn=stn,
                kde_estimator=kde_estimator,
                zero_equiv_flow_threshold=zero_equivalent_flow_threshold,
                drainage_area_km2=da_dict[stn],
                log_edges_extended=log_edges_extended,
                **shared_config
                )

            obs_pmf_stn, kde_pmf_stn = baseline.build_station_pmf()

            obs_pmf_arr[:, j] = obs_pmf_stn
            kde_pmf_arr[:, j] = kde_pmf_stn

            # compute the KL divergence of the KDE PMF from the observed PMF, D_KL(obs || KDE)
            kl_div = baseline._compute_kl_divergence(obs_pmf_stn, kde_pmf_stn)
            assert kl_div >= 0, f"KL divergence must be non-negative: {kl_div}"
            # print(f'    Station {stn}: D_KL(obs||KDE): {kl_div:.6f}')
            all_dkls.append({'station': stn, 'bitrate': bitrate, 'D_KL': kl_div})

            if (j + 1) % 50 == 0:
                print(f"Processed {j+1}/{nstations} stations")

        # Build DataFrames to output results
        meta_cols = {
            'log_x_uar': log_x_extended.astype(np.float32), # log bin midpoints (flow units)
        }

        def to_df(vals):
            df_out = pd.DataFrame(vals, columns=stations)
            for k, v in meta_cols.items():
                df_out[k] = v
            return df_out
        
        pmf_kde_df = to_df(kde_pmf_arr)
        pmf_obs_df = to_df(obs_pmf_arr)

        # Optional inexpensive sanity: exact pmf sums
        sums = pmf_kde_df[stations].sum(axis=0).values
        if not np.allclose(sums, 1.0, atol=1e-4):
            raise AssertionError(f'KDE PMFs not normalized: min={sums.min():.6f}, max={sums.max():.6f}')

        # Save to file
        pmf_kde_df.to_csv(output_folder / 'pmf_kde.csv', index=False)
        pmf_obs_df.to_csv(output_folder / 'pmf_obs.csv', index=False)
        # pdf_kde_df.to_csv(output_folder / 'pdf_kde.csv', index=False)
        # pdf_obs_df.to_csv(output_folder / 'pdf_obs.csv', index=False)
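        # all_dkls accumulates across bitrates; save only the entries for the current bitrate
        kld_df = pd.DataFrame([d for d in all_dkls if d['bitrate'] == bitrate])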
        kld_df.to_csv(output_folder / f'discrete_vs_kde_KLDs_{bitrate}.csv', index=False)


5-bit bin edges are +/- 36% from the bin midpoints.
Processing baseline distributions for bitrate: 5 bits
Baseline distribution files already exist in /home/danbot/code/distribution_estimation/docs/notebooks/data/baseline_distributions/05_bits. Skipping processing.
6-bit bin edges are +/- 16% from the bin midpoints.
Processing baseline distributions for bitrate: 6 bits
Baseline distribution files already exist in /home/danbot/code/distribution_estimation/docs/notebooks/data/baseline_distributions/06_bits. Skipping processing.
8-bit bin edges are +/- 4% from the bin midpoints.
Processing baseline distributions for bitrate: 8 bits
Baseline distribution files already exist in /home/danbot/code/distribution_estimation/docs/notebooks/data/baseline_distributions/08_bits. Skipping processing.
10-bit bin edges are +/- 1% from the bin midpoints.
Processing baseline distributions for bitrate: 10 bits
Baseline distribution files already exist in /home/danbot/code/distribution_estimation/docs/note

In [29]:
def downsample_distribution(dist_df, stns, start_bits, end_bits=10):
    """Downsample the distribution to the given bitrate end_bits."""
    quant_cols = ['log_x_uar']
    if start_bits == end_bits:
        for s in stns:
            assert s in dist_df.columns, f'Station {s} not in distribution dataframe'
        df = dist_df[stns+quant_cols].copy()
        return df[stns]

    assert start_bits > end_bits, f'Cannot downsample from {start_bits} bits to {end_bits} bits'
    assert end_bits >= 1, f'End bits must be at least 1'    
    new_log_edges = np.linspace(np.log(global_min_uar), np.log(global_max_uar), 2**end_bits)

    # get indices for grouping based on new edges
    bin_indices = np.digitize(dist_df['log_x_uar'].values[1:], new_log_edges) 
    
    dist_df.loc[1:, 'bin_index'] = bin_indices
    dist_df.loc[0, 'bin_index'] = 0  # zero-flow bin remains the first bin

    new_dists = dist_df.groupby('bin_index').sum()[stns]
    
    # re-normalize the distributions
    new_dists /= new_dists.sum(axis=0)
    sums = new_dists.sum(axis=0)
    assert np.all(np.isclose(sums, 1.0)), f'Downsampled PMFs do not sum to 1, min={sums.min():.4f}, max={sums.max():.4f}'

    # add back in the log_x_uar bin midpoints; the zero-flow bin is one bin width wide and sits
    # immediately below the support, so its midpoint is half a bin width below the first edge
    new_w = new_log_edges[1] - new_log_edges[0]
    new_dists['log_x_uar'] = [new_log_edges[0] - 0.5 * new_w] + list(0.5 * (new_log_edges[:-1] + new_log_edges[1:]))
    return new_dists


In [30]:
# resample the 10-bit observed PMFs to each coarser bitrate and save them for the entropy analysis

bits = list(range(2, 11)) # set a range that is both too low and too high for the data
entropy_output_folder = Path(os.getcwd()) / 'data' / 'results' / 'entropy_results'
if not entropy_output_folder.exists():
    entropy_output_folder.mkdir(parents=True, exist_ok=True)

# eps = 1e-22 # set a small epsilon to avoid numerical issues
start_bits = 10
pmf_df = pd.DataFrame(distribution_dict[start_bits]['pmf_obs'])
for b in bits:
    # resample the PMF by q_values to the number of states
    resampled_df = downsample_distribution(
        pmf_df, validated_stations, start_bits=start_bits, end_bits=b
    )
    # save the resampled PMF
    resampled_df.to_csv(entropy_output_folder / f'kde_pmf_resampled_{b}bits.csv', index=False)


## Baseline Reliability (Flow Duration) Curve 

As a baseline to compare against the FDC prediction models tested in this study, we could use the maximum uncertainty or uniform distribution, but such a low bar may not provide much insight.  Instead, we compute the mean PDF across all stations in the study region, and use this as a baseline FDC to compare against the curves yielded by the different models.  As an upper bound, we compute the Bayes posterior log-normal fit for each station.

### Compute the mean PDF across all stations

Given a state space $\Omega$ with $M$ discrete states, and $N$ spatially distributed sensors (streamflow monitoring stations) with PMFs $P=\{p_j\}_{j=1}^N$, we can define the mean PDF across all sensors in terms of the observed states $\omega$ as follows:

$$\Omega=\{\omega_i\}_{i=1}^M,\quad P\in\mathbb{R}^{M\times N},\quad P_{i j}=p_j(\omega_i),\ \sum_{i=1}^M P_{i j}=1.$$

$$P_{i j}=\Pr_j(B_i),\qquad \bar p_i =\frac{1}{N}\sum_{j=1}^N P_{i j}\quad(\text{mean PMF over all sensors}).$$

where $B_i$ denotes the $i$-th quantization bin (interval) in log-unit-area-runoff (L-UAR), i.e., $B_i = [y_i, y_{i+1})$ with $y_i = \log(x_i)$ the log of the $i$-th bin edge $x_i$, and $\Delta y_i = y_{i+1} - y_i$ is the bin width in log space. Each $P_{ij}$ is the probability assigned by sensor $j$ to bin $B_i$.

$$\text{Density per log-}x\ (\text{piecewise constant on }B_i):\quad h_i=\frac{\bar p_i}{\Delta y_i},\qquad \sum_i h_i\,\Delta y_i=1$$
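
As a concrete illustration of the definitions above, the sketch below forms the mean PMF and the corresponding piecewise-constant density on a small example. The matrix `P`, its dimensions, and the bin edges are made up for illustration and are not the study data.

```python
import numpy as np

rng = np.random.default_rng(0)
M, N = 8, 5  # hypothetical sizes: M bins (states), N stations (sensors)

# P[i, j] is the probability that station j assigns to bin i; columns sum to 1
P = rng.random((M, N))
P /= P.sum(axis=0, keepdims=True)

# mean PMF over all stations: average each row across the station columns
p_bar = P.mean(axis=1)

# assumed log-space bin edges y_i; bin widths are dy_i = y_{i+1} - y_i
y_edges = np.linspace(np.log(0.01), np.log(100.0), M + 1)
dy = np.diff(y_edges)

# density per unit log-x, constant within each bin
h = p_bar / dy

print(p_bar.sum(), (h * dy).sum())
```

Because every column of `P` sums to one, the row-wise mean is itself a valid PMF, and dividing by the bin width gives a density that integrates to one over the support.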


### Entropy of the mean PDF and fraction of $b$-bit quantization capacity

$$M = 2^{b},\qquad \bar{\mathbf p} = (\bar p_1,\ldots,\bar p_M)^T,\quad \bar p_i \ge 0,\ \sum_{i=1}^M \bar p_i = 1$$

$$\text{Shannon entropy (bits):}\quad H_2(\bar{\mathbf p}) \;=\; -\sum_{i:\,\bar p_i>0} \bar p_i \log_2 \bar p_i \;\le\; b.
$$

$$\text{Normalized entropy (fraction of capacity):}\quad \rho_b \;=\; \frac{H_2(\bar{\mathbf p})}{b} \in [0,1].$$

$$\text{Perplexity (effective bins):}\quad \mathcal P(\bar{\mathbf p}) \;=\; 2^{H_2(\bar{\mathbf p})},\qquad
\text{occupancy fraction } \phi_b \;=\; \frac{\mathcal P}{2^b} \;=\; 2^{H_2(\bar{\mathbf p})-b}.
$$
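
The quantities defined above can be computed directly from a PMF. The following is a minimal sketch; the function name `entropy_summary` and the two test distributions are illustrative assumptions, chosen because their entropies are known in closed form.

```python
import numpy as np


def entropy_summary(p_bar, bits):
    """Shannon entropy (bits), normalized entropy, perplexity and occupancy fraction.

    Hypothetical helper: p_bar is a PMF over 2**bits states.
    """
    p = np.asarray(p_bar, dtype=float)
    nz = p > 0  # zero-probability states contribute nothing to the sum
    H = float(-np.sum(p[nz] * np.log2(p[nz])))
    perplexity = 2.0 ** H
    return {
        'H': H,                          # entropy in bits, at most `bits`
        'rho': H / bits,                 # fraction of the b-bit capacity
        'perplexity': perplexity,        # effective number of occupied bins
        'phi': perplexity / 2 ** bits,   # occupancy fraction
    }


b = 5
uniform = np.full(2 ** b, 1.0 / 2 ** b)  # maximum-entropy case: H = b
point = np.zeros(2 ** b)
point[3] = 1.0                            # all mass in one bin: H = 0

summary_uniform = entropy_summary(uniform, b)
summary_point = entropy_summary(point, b)
print(summary_uniform)
print(summary_point)
```

The uniform and point-mass cases bracket the possible values: $\rho_b = 1$ and $\phi_b = 1$ for the uniform PMF, and $H_2 = 0$ with a perplexity of one bin for the point mass.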


## Compute $D_\text{KL}(P_\text{ref} \| P_\text{mean})$ for each model and each station

Above we defined the mean PMF across all stations in the study region.  As a baseline, we now compute the KL divergence of each station's reference PMF from the mean PMF, i.e. $D_\text{KL}(P_\text{ref} \| P_\text{mean})$.  This can be interpreted as the additional information (in bits) required to encode each observed (reference) PMF when the global mean PMF is used as the codebook instead of the posterior PMF for that station.
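
A minimal sketch of this baseline calculation is given below. It assumes a small made-up matrix of station PMFs, and the function name `kl_divergence_bits` is hypothetical. It uses a base-2 logarithm so the result is in bits, whereas the `_compute_kl_divergence` method earlier in this notebook uses the natural logarithm.

```python
import numpy as np


def kl_divergence_bits(p_ref, p_mean):
    """D_KL(p_ref || p_mean) in bits, summed over bins where p_ref > 0.

    Hypothetical helper. Returns inf if p_mean is zero on a bin where
    p_ref has mass, since the divergence is unbounded in that case.
    """
    p = np.asarray(p_ref, dtype=float)
    q = np.asarray(p_mean, dtype=float)
    support = p > 0
    if np.any(q[support] == 0):
        return np.inf
    return float(np.sum(p[support] * np.log2(p[support] / q[support])))


# made-up example: rows are bins, columns are stations, each column sums to 1
P = np.array([
    [0.5, 0.1, 0.0],
    [0.5, 0.3, 0.2],
    [0.0, 0.6, 0.8],
])
p_mean = P.mean(axis=1)

dkl = np.array([kl_divergence_bits(P[:, j], p_mean) for j in range(P.shape[1])])
print(dkl)
```

Because the mean PMF is strictly positive wherever any station has mass, the divergence of each station from the mean is always finite, which is not guaranteed when two individual empirical PMFs are compared.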

In [31]:
from bokeh.palettes import RdYlGn, Bokeh5

pdf_fig = figure(title=f"Mean PDF across Station Sample", width=750, height=400, x_axis_type='log')
entropy_distributions = {}
# quant_cols = ['log_x', 'lin_x', 'left_log_edges', 'right_log_edges']
# states = [2**b for b in bits]
non_stn_cols = ['log_x_uar']
# obs_pmf_df = distribution_dict[bitrate]['pmf_obs']
# stations = [c for c in obs_pmf_df.columns if c not in non_stn_cols]
for i, b in enumerate([5, 6, 8, 10]):

    # read the pre-processed empirical PMFs for all stations
    pmf_folder = Path(os.getcwd()) / 'data' / 'baseline_distributions' / f'{b:02d}_bits'
    pmf_path = pmf_folder / f'pmf_obs.csv'
    assert os.path.exists(pmf_path), f'PMF file not found at {pmf_path}'
    pmf_resampled = pd.read_csv(pmf_path)
    stn_cols = [c for c in pmf_resampled.columns if c != 'log_x_uar']
    pmf = pmf_resampled[stn_cols].mean(axis=1)  # mean PMF across stations
    assert np.isclose(pmf.sum(), 1.0, atol=1e-6), f'Mean PMF not normalized for {b} bits.'

    mean_dist_dict = {'pmf': pmf, 'log_x_uar': pmf_resampled['log_x_uar'].values}
    mean_df = pd.DataFrame(mean_dist_dict)
    mean_df.to_csv(BASE_DIR / 'data' / 'results' / 'sample_distribution_mixture' / f'mean_distribution_{b}bits.csv')

    # Entropy of PMF (still valid for info content calc)
    mask = pmf > 0
    entropy = -np.sum(pmf[mask] * np.log2(pmf[mask]))
    ratio = entropy / b
    perplexity = 2 ** entropy

    log_edges_uar = np.linspace(np.log(global_min_uar), np.log(global_max_uar), 2**b)
    log_x_uar = 0.5 * (log_edges_uar[1:] + log_edges_uar[:-1])

    log_edges_extended = np.concatenate(([log_edges_uar[0] - (log_edges_uar[1] - log_edges_uar[0])], log_edges_uar))
    log_x_extended = 0.5 * (log_edges_extended[1:] + log_edges_extended[:-1])
    log_w = np.diff(log_edges_extended)
    bin_edges = np.exp(log_edges_extended)  # linear-space bin edges (one more edge than bins)

    # Plot the PDF (density per unit log UAR), one quad per bin spanning its left and right edges
    pdf_fig.quad(
        top=pmf.values / log_w, bottom=0,
        left=bin_edges[:-1], right=bin_edges[1:],
        fill_color=Bokeh5[i],
        line_color=None, fill_alpha=0.8,
        legend_label=f'{b:.0f}b (H={entropy:.1f}, ρ={100*ratio:.0f}%, perplexity={perplexity:.1f})'
    )
    # count the number of zero bins
    n_zeros = np.sum(pmf == 0)
    print(f'{b} bits: Entropy={entropy:.2f}, Ratio={ratio:.2f}, Perplexity={perplexity:.2f}, Zero bins={n_zeros}/{len(pmf)}')

# Final formatting
pdf_fig.legend.click_policy = 'hide'
pdf_fig.legend.location = 'top_left'
pdf_fig.xaxis.axis_label = "UAR [L/s/km²]"
pdf_fig.yaxis.axis_label = "Probability density (per unit log UAR)"
pdf_fig.legend.background_fill_alpha = 0.3
pdf_fig = dpf.format_fig_fonts(pdf_fig, font_size=14)

show(pdf_fig)

# export the figure to an html file
from bokeh.resources import CDN
from bokeh.embed import file_html
html = file_html(pdf_fig, CDN, f'Mean PDF across station sample')
with open('data/results/mean_pdf_across_stations.html', 'w') as f:
    f.write(html)

5 bits: Entropy=3.51, Ratio=0.70, Perplexity=11.42, Zero bins=1/32
6 bits: Entropy=4.52, Ratio=0.75, Perplexity=22.96, Zero bins=3/64
8 bits: Entropy=6.52, Ratio=0.82, Perplexity=91.80, Zero bins=25/256
10 bits: Entropy=8.51, Ratio=0.85, Perplexity=363.79, Zero bins=176/1024




In [32]:
# for b in [5, 10]:#[5, 6, 8, 10]:
#     mean_pmf_fpath = BASE_DIR / 'data' / 'results' / 'sample_distribution_mixture' / f'mean_distribution_{b}bits.csv'
#     mean_pmf_df = pd.read_csv(mean_pmf_fpath)

#     assert np.isclose(mean_pmf_df['pmf'].sum(), 1.0), f'Mean PMF does not sum to 1, sum={mean_pmf_df["pmf"].sum():.4f}'
#     log_x = mean_pmf_df['log_x_uar'].values
#     log_w = np.diff(np.concatenate(([log_x[0] - (log_x[1]-log_x[0])], log_x)))
#     eval = EvaluationMetrics(log_x=log_x, bitrate=b)
#     mean_baseline_dict = {}
#     mean_pmf = mean_pmf_df['pmf'].values
#     for station in stations:
#         kde_pmf = distribution_dict[b]['pmf_kde'][station].values
#         obs_pmf = distribution_dict[b]['pmf_obs'][station].values
#         assert np.isclose(kde_pmf.sum(), 1.0), f'KDE PMF not normalized for {station} at {b} bits.'
#         assert np.isclose(obs_pmf.sum(), 1.0), f'Obs PMF not normalized for {station} at {b} bits.'
#         assert np.isclose(mean_pmf.sum(), 1.0), f'Mean PMF not normalized at {b} bits.'
#         obs_eval_result = eval._evaluate_fdc_metrics_from_pmf(mean_pmf, obs_pmf)
#         kde_eval_result = eval._evaluate_fdc_metrics_from_pmf(mean_pmf, kde_pmf)
#         mean_baseline_dict[station] = {f'obs_{k}': v for k, v in obs_eval_result.items()}
#         mean_baseline_dict[station].update({f'kde_{k}': v for k, v in kde_eval_result.items()})

In [33]:
# mean_obs_klds = [v['obs_kld'] for v in mean_baseline_dict.values()]
# mean_kde_klds = [v['kde_kld'] for v in mean_baseline_dict.values()]

# p = figure(width=550, height=400, title='KLD to Mean PMF across Stations', x_axis_type='log')
# # compute the cdf from mean_obs_klds
# sorted_obs_klds = np.sort(mean_obs_klds)
# cdf = np.arange(1, len(sorted_obs_klds) + 1) / len(sorted_obs_klds)
# p.line(sorted_obs_klds, cdf, line_width=2, color='navy', legend_label='Empirical CDF of Observed KLD')
# # compute the cdf from mean_kde_klds
# sorted_kde_klds = np.sort(mean_kde_klds)
# cdf_kde = np.arange(1, len(sorted_kde_klds) + 1) / len(sorted_kde_klds)
# p.line(sorted_kde_klds, cdf_kde, line_width=2, color='orange', legend_label='Empirical CDF of KDE KLD')
# p.xaxis.axis_label = 'KLD to Mean PMF (bits)'
# p.yaxis.axis_label = 'CDF'
# p.legend.location = 'top_left'
# p = dpf.format_fig_fonts(p, font_size=14)
# show(p)

In [34]:
# mean_dkl_df = pd.DataFrame(mean_baseline_dict).T.reset_index().rename(columns={'index': 'station_id'})
# mean_dkl_df.to_csv('data/results/sample_distribution_mixture/kld_to_mean_pmf_across_stations.csv', index=False)

### Ephemeral streams and precipitation

Here we compare the precipitation and streamflow time series to estimate, for each catchment, the probability that the river is dry.
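
One way such a comparison could be set up is sketched below. It assumes a daily record with one streamflow column and one precipitation column, and it treats a day as dry when the recorded flow is at or below a tolerance. The column names `flow` and `prcp`, the function name `dry_day_probabilities`, and the zero-flow definition are all illustrative assumptions, not the method used in this study.

```python
import numpy as np
import pandas as pd


def dry_day_probabilities(daily, flow_col="flow", prcp_col="prcp", zero_tol=0.0):
    """Estimate P(dry) and P(dry | no precipitation) from a daily record.

    Hypothetical helper: a day is 'dry' when flow <= zero_tol.
    Days with missing flow or precipitation are excluded.
    """
    rec = daily[[flow_col, prcp_col]].dropna()
    dry = rec[flow_col] <= zero_tol
    no_rain = rec[prcp_col] <= 0
    p_dry = float(dry.mean())
    p_dry_given_no_rain = float(dry[no_rain].mean()) if no_rain.any() else float("nan")
    return p_dry, p_dry_given_no_rain


# toy record (made-up values): six days, one with missing flow
toy_record = pd.DataFrame({
    "flow": [0.0, 0.0, 1.2, 0.4, 0.0, np.nan],
    "prcp": [0.0, 0.0, 5.0, 0.0, 2.0, 0.0],
})
p_dry, p_dry_no_rain = dry_day_probabilities(toy_record)
print(p_dry, p_dry_no_rain)
```

Conditioning on days without precipitation is only one possible choice; a lagged or accumulated precipitation window may be more appropriate where runoff response is delayed.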



## Catchment Attributes 


This section shows the distribution of each catchment attribute across the sample.  Catchment attributes are the primary information source for the first experiment (prediction of log-normal distribution parameters), and they serve as conditioning variables for the second and third experiments (k-nearest neighbours and LSTM daily unit area runoff estimation).  The attributes are derived from the four geospatial data sources summarized in the Introduction.

In [35]:
rev_date = '20250227'
attribute_file = f'BCUB_watershed_attributes_updated_{rev_date}.csv'
attribute_fpath = os.path.join('data', attribute_file)
df = pd.read_csv(attribute_fpath, dtype={'official_id': str})
df = df[[c for c in df.columns if 'unnamed:' not in c.lower()]]
df.columns = [c.lower() for c in df.columns]
df.sort_values('official_id', inplace=True)
df.reset_index(drop=True, inplace=True)

df['n_complete_years'] = df['official_id'].apply(lambda x: len(complete_year_stats.get(x, {}).get('hyd_years', [])))
df.head()

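|   | region | official_id | drainage_area_km2 | centroid_lon_deg_e | centroid_lat_deg_n | logk_ice_x100 | porosity_x100 | land_use_forest_frac_2010 | land_use_shrubs_frac_2010 | land_use_grass_frac_2010 | ... | high_prcp_duration | high_prcp_freq | q_mean_mm | q_sd_mm | record_len_years | camels_q_mean_mm | n_years | log_drainage_area_km2 | n_complete_years | tmean |
|---|--------|-------------|-------------------|--------------------|--------------------|---------------|---------------|---------------------------|---------------------------|--------------------------|-----|--------------------|----------------|-----------|---------|------------------|------------------|---------|-----------------------|------------------|-------|
| 0 | ERK | 05AA008 | 402.9 | -114.555134 | 49.634643 | -1507.61 | 15.45272 | 0.64 | 0.1 | 0.21 | ... | 1.0 | 0.1 | 1.09905 | 1.388195 | 69 |  | 68 | 6.001167 | 58 | 1.75 |
| 1 | ERK | 05AA022 | 820.6 | -114.367911 | 49.374374 | -1397.57 | 8.62704 | 0.55 | 0.15 | 0.25 | ... | 1.0 | 0.1 | 1.625479 | 2.671128 | 68 |  | 68 | 6.711254 | 72 | 2.1 |
| 2 | ERK | 05AA023 | 1447.8 | -114.471124 | 49.959981 | -1545.38 | 17.50991 | 0.71 | 0.1 | 0.18 | ... | 1.0 | 0.1 | 0.765007 | 1.286813 | 59 |  | 59 | 7.278491 | 58 | 0.95 |
| 3 | ERK | 05AA035 | 1837.9 | -114.392622 | 49.938909 | -1486.82 | 15.64992 | 0.62 | 0.14 | 0.21 | ... | 1.0 | 0.1 | 0.773177 | 1.449857 | 7 |  | 7 | 7.516923 | 13 | 1.45 |
| 4 | ERK | 05AB022 | 1.4 | -114.068624 | 50.101887 | -1520.0 | 19.0 | 0.87 | 0.05 | 0.06 | ... | 1.0 | 0.1 | 0.054656 | 0.243414 | 7 |  | 7 | 0.875469 | 5 | 2.9 |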


In [36]:
climate_attributes = ['tmean', 'prcp', 'vp', 'swe', 'srad', 'low_prcp_duration', 'low_prcp_freq', 'high_prcp_duration', 'high_prcp_freq']
terrain_attributes = ['slope_deg', 'aspect_deg', 'elevation_m', 'log_drainage_area_km2']
soil_attributes = ['porosity_x100', 'logk_ice_x100']
land_cover_attributes = ['land_use_forest', 'land_use_shrubs', 'land_use_grass', 'land_use_wetland', 'land_use_crops', 
                       'land_use_urban', 'land_use_water', 'land_use_snow_ice']

if 'tmean' not in df.columns:
    # compute the mean temperature for each catchment
    df['tmean'] = (df['tmax'] + df['tmin']) / 2
if 'log_drainage_area_km2' not in df.columns:
    # natural log of (area + 1), which stays non-negative for areas below 1 km2
    df['log_drainage_area_km2'] = np.log(df['drainage_area_km2'] + 1)

# save the dataframe with attributes
df.to_csv(attribute_fpath, index=False)

In [37]:
from pathlib import Path
from bokeh.plotting import figure, show, gridplot
from bokeh.io import output_notebook
import numpy as np
output_notebook()


def compute_empirical_cdf(values):
    """Compute the empirical cumulative distribution function (CDF) of the given values."""
    sorted_values = np.sort(values)
    cdf = np.arange(1, len(sorted_values) + 1) / len(sorted_values)
    return sorted_values, cdf


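def plot_cdf(values, label=None):
    """Plot the empirical CDF of the given values, labelled by attribute name."""
    fig = figure(width=700, height=400)
    x, y = compute_empirical_cdf(values)
    if label is None:
        # no label: draw the line without a legend entry or x-axis label
        fig.line(x, y, line_width=2)
    else:
        fig.line(x, y, legend_label=label, line_width=2)
        # land-use attributes place the legend at the bottom right instead
        fig.legend.location = 'bottom_right' if label.startswith('land_use') else 'top_left'
        fig.legend.background_fill_alpha = 0.6
        fig.xaxis.axis_label = label
    fig.yaxis.axis_label = 'Cumulative Probability'
    fig = dpf.format_fig_fonts(fig, font_size=14)
    return fig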
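
As a quick check of the plotting position used by `compute_empirical_cdf` (rank divided by sample size on the sorted values), the sketch below applies the same function to toy arrays. The toy values are illustrative only. The second part shows that `np.sort` places NaN values last, so missing values still count toward the sample size unless they are dropped first; whether the attribute table contains missing values is not established here.

```python
import numpy as np


def compute_empirical_cdf(values):
    """Compute the empirical cumulative distribution function (CDF) of the given values."""
    sorted_values = np.sort(values)
    cdf = np.arange(1, len(sorted_values) + 1) / len(sorted_values)
    return sorted_values, cdf


# four distinct values: the CDF steps up by 1/4 at each sorted value
x, p = compute_empirical_cdf(np.array([3.0, 1.0, 2.0, 4.0]))
print(x, p)

# a missing value is sorted to the end and inflates the sample size
with_gap = np.array([3.0, np.nan, 1.0])
x_raw, p_raw = compute_empirical_cdf(with_gap)
x_clean, p_clean = compute_empirical_cdf(with_gap[~np.isnan(with_gap)])
print(p_raw, p_clean)
```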


### Terrain attributes

```{figure} images/terrain_attributes.png
---
alt: Terrain attributes are shown for an example catchment.
name: terrain-attributes
width: 700px
align: center
---
Terrain attributes are shown for an example catchment.  Images are from a video presentation on Streamflow Monitoring Network Optimization prepared for the 2024 Canadian Water Resources Association (CWRA) Annual Conference.  The video is available at [https://vimeo.com/1094107902](https://vimeo.com/1094107902).
```

In [38]:
figs = []
for c in terrain_attributes:
    values = df[c].values
    print(f'{c} - {np.mean(values):.2f} [{np.min(values):.2f}, {np.max(values):.2f}]')
    # plot_cdf already applies dpf.format_fig_fonts, so no second call is needed
    cdf_fig = plot_cdf(values, label=c)
    figs.append(cdf_fig)

lt = gridplot(figs, ncols=4, width=300, height=300)
show(lt)

slope_deg - 16.86 [0.38, 35.12]
aspect_deg - 183.37 [0.22, 359.81]
elevation_m - 1081.06 [23.42, 2438.27]
log_drainage_area_km2 - 5.32 [0.53, 12.52]


### Climate attributes

```{figure} images/climate_attributes.png
---
alt: Climate attributes are shown for an example catchment.
name: climate-attributes
width: 700px
align: center
---
Climate attributes are shown for an example catchment.
```


In [39]:
figs = []
for c in climate_attributes:
    values = df[c].values
    cdf_fig = plot_cdf(values, label=c)
    figs.append(cdf_fig)

lt = gridplot(figs, ncols=4, width=300, height=300)
show(lt)

The plots above reveal a precision problem in the derived precipitation frequency attributes.  The high precipitation frequency and duration attributes in particular take very few unique values, so they are not expected to carry much information or be useful for predictive modelling.
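
A simple way to quantify this is to count the distinct values each attribute takes across the sample. The sketch below does so on a toy table; the helper name `count_unique_values` is hypothetical and the numbers are made up for illustration, not taken from the attribute file.

```python
import pandas as pd


def count_unique_values(frame, columns):
    """Number of distinct non-null values per column, sorted ascending."""
    return frame[columns].nunique().sort_values()


# toy table (made-up values) with two coarse attributes and one continuous one
toy_attrs = pd.DataFrame({
    "high_prcp_freq": [0.1, 0.1, 0.1, 0.2],
    "high_prcp_duration": [1.0, 1.0, 1.0, 1.0],
    "prcp": [812.3, 455.1, 1290.7, 640.2],
})
unique_counts = count_unique_values(toy_attrs, list(toy_attrs.columns))
print(unique_counts)
```

In the notebook, the equivalent call would be `count_unique_values(df, climate_attributes)`; attributes with only a handful of distinct values are candidates for exclusion from the predictor set.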

### Soil attributes

```{figure} images/soil_attributes.png
---
alt: Soil attributes are shown for an example catchment.
name: soil-attributes
width: 600px
align: center
---
Soil attributes are shown for an example catchment.
```

In [40]:
figs = []
for c in soil_attributes:
    values = df[c].values
    cdf_fig = plot_cdf(values, label=c)
    figs.append(cdf_fig)

lt = gridplot(figs, ncols=4, width=300, height=300)
show(lt)

### Land cover attributes


```{figure} images/land_cover_attributes.png
---
alt: Land cover classifications are shown for an example catchment.
name: land-cover-classifications
width: 700px
align: center
---

Land cover classifications are shown for an example catchment.
```

In [41]:
figs = []
for c in land_cover_attributes:
    values = df[f'{c}_frac_2010'].values
    cdf_fig = plot_cdf(values, label=c)
    figs.append(cdf_fig)

lt = gridplot(figs, ncols=4, width=300, height=300)
show(lt)

## Citations

```{bibliography}
:filter: docname in docnames
```