# Methods

In this section:

1. We define the Flow Duration Curve (FDC) in terms of the discrete probability mass function of the streamflow distribution.
2. We define the three models used to estimate FDCs in ungauged locations, and the data requirements for each.
3. We define the spatial network segmentation that keeps cross validation training and test sets independent across folds, and we check the distribution of the target variables (streamflow statistics) across the folds.
4. We introduce the performance metrics used to evaluate the models, and discuss their interpretation in the context of FDC estimation.

## Flow Duration Curve Estimation: widely varying input data requirements and model complexity

The experiments presented in this notebook are intentionally varied in their data requirements and model complexity.  The purpose is to highlight some of the nuance in model evaluation and the interpretation of metrics.  The FDC does not contain information about the timing of flow, but it is widely used as a diagnostic tool in hydrological model evaluation {cite}`gupta2008reconciling` to assess the ability of a model to reproduce the distribution of flow.

The FDC can be estimated from a variety of data sources: from empirical functional mapping of physical catchment descriptors to runoff percentiles, from direct mapping of streamflow observations at other locations, or from physical conceptual models describing the processes governing the rainfall-runoff response of a basin.  How the quality of the estimate is evaluated ultimately depends on the question being asked.  For example, if you are a run-of-river hydropower developer interested in the potential of a river for energy generation, the range of quantiles above or below what can be put through the plant does not affect your revenue model (though it still affects design, such as powerhouse location, turbine centreline elevation, and spillway elevation and design).  A planner, however, may be interested in the low flow characteristics to a) assess the availability of water for environmental flow release and b) meet specific energy generation incentives and/or avoid penalties.

Following a brief introduction to motivate the epistemic nature of the FDC, an overview of the three experiments is provided.  The methods focus on supporting information not elaborated on in the code, in particular assumptions related to data quality, data leakage between training and validation (testing), and methodological assumptions that may have a material effect on the results.  The code is provided in the respective notebooks, but the focus here is on the methods and the interpretation of results.


## Introduction & Motivation: discrete representation of continuous streamflow

The Water Survey of Canada (WSC) publishes the [HYDAT](https://www.canada.ca/en/environment-climate-change/services/water-overview/quantity/monitoring/survey/data-products-services/national-archive-hydat.html) database of estimated daily (and in some cases hourly) streamflow at over 1000 stations in Canada.  Mean daily streamflow series in the HYDAT dataset use three decimal places of precision.  Given an example range of 0.1 to 100 $m^3/s$, three-decimal precision implies $(100-0.1) / 0.001 = 99900$ unique states, which requires 17 bits ($2^{17} = 131,072$) to represent.
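The arithmetic above can be checked with a few lines of Python. This is a small sketch only; the helper name `bits_required` is hypothetical and the range and resolution are the example values from the text, not properties of any particular station.

```python
import math

def bits_required(q_min, q_max, resolution):
    """Count the states of size `resolution` in [q_min, q_max] and the bits needed to index them."""
    n_states = round((q_max - q_min) / resolution)
    n_bits = math.ceil(math.log2(n_states))
    return n_states, n_bits

# example range from the text: 0.1 to 100 m^3/s at three-decimal precision
n_states, n_bits = bits_required(0.1, 100.0, 0.001)
print(n_states, n_bits)  # → 99900 17
```

Sixteen bits would give only 65,536 states, which is why the example rounds up to 17.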

The flow of water in a stream is a continuous quantity.  Water level in a stream rises and falls, tracing a smooth line in time as opposed to stepping up and down abruptly at fixed time intervals.  When recording streamflow observations quantitatively, the continuous values representing streamflow at a moment in time are converted to a discrete form and stored on a computer in 32- or 64-bit floating point format.  These two formats can represent approximately $4.3\times 10^9$ and $1.8 \times 10^{19}$ distinct states, or roughly 7 and 16 significant decimal digits respectively.  This level of precision is much more than can be justified by actual streamflow observation because of measurement uncertainty.  The uncertainty in streamflow measurement is multiplicative, meaning the uncertainty varies *in proportion to the magnitude*.  

We cannot escape this discrete representation of natural phenomena, and we run against its implications no matter which way we approach the analysis.  The "computational boundedness" {cite}`wolfram2023second` of our observation of the natural world is a fundamental limitation of our ability to make informed decisions from uncertain data.  In the case of streamflow, the challenge is to find a balance between the precision of the representation, the uncertainty in the measurements, and the questions we ask of the data.  

In [None]:
import os
import numpy as np
import pandas as pd
from pathlib import Path
import xarray as xr

from utils.kde_estimator import KDEEstimator
import utils.data_processing_functions as dpf
BASE_DIR = Path(os.getcwd())

# update this to the path where you stored `HYSETS_2023_update_QC_stations.nc`
HYSETS_DIR = Path('/home/danbot/code/common_data/HYSETS')

from bokeh.plotting import figure, show
from bokeh.io import output_notebook
from bokeh.palettes import Sunset10, Vibrant7
from bokeh.layouts import gridplot
output_notebook()


In [None]:

def discrete_series(wl, bits):
    """Quantize `wl` onto 2**bits equal-width bins and return the bin midpoints."""
    # pad the range slightly so the min and max fall strictly inside the outer edges
    min_w, max_w = np.min(wl) - 1e-9, np.max(wl) + 1e-9
    # 2**bits bins require 2**bits + 1 edges
    edges = np.linspace(min_w, max_w, 2**bits + 1)
    midpoints = (edges[1:] + edges[:-1]) / 2

    # np.digitize returns 1-based bin indices, so shift to 0-based and clip
    digits = np.digitize(wl, edges) - 1
    digits = np.clip(digits, 0, len(midpoints) - 1)

    return midpoints[digits]
    
# Generate example data
time = np.linspace(0, 10, 500)
wl_0 = 60 + 40 * np.sin(time) + 23 * np.cos(3 * time) 
p = figure(title="Discrete to continuous streamflow", width=800, height=300)

n = 0
for b in range(2,9):
    wl = discrete_series(wl_0, b)
    p.line(time, wl, color=Vibrant7[n], line_width=3,
           legend_label=str(b)+' bits')
    n += 1
p.line(time, wl_0, legend_label='continuous', line_dash='dashed', color='red',
      line_width=3)    
p.yaxis.axis_label = 'Water level'
p.add_layout(p.legend[0], 'right')
p.xaxis.axis_label = 'Time'
p.legend.click_policy='mute'
show(p)

The figure above illustrates how increasing the number of states representing the observed series converges to the continuous function: $$y(t) = 60 + 40\sin(t) + 23\cos(3t)$$ In the example above, even 5 bits gives a close representation of the continuous function, though this is due in large part to the range of inputs and the nature of the function.  Click on the legend labels to toggle series and see the effect more clearly.  But "close" depends on what answers we need from the data.  

Duration curves provide a complementary view of the data that better supports decisions.  The cell below sorts each quantized series from the example above to form its duration curve.

In [None]:
from bokeh.layouts import column

def plot_fdc(wl_series, bits_range, label_base=''):
    p_fdc = figure(title="Flow Duration Curves", width=800, height=300)
    for n, b in enumerate(bits_range):
        wl_q = discrete_series(wl_series, b)
        sorted_vals = np.sort(wl_q)[::-1]
        exceedance = np.linspace(0, 100, len(sorted_vals))
        p_fdc.line(exceedance, sorted_vals, color=Vibrant7[n],
                   line_width=3, legend_label=f"{label_base}{b} bits")
    sorted_vals = np.sort(wl_series)[::-1]
    exceedance = np.linspace(0, 100, len(sorted_vals))
    p_fdc.line(exceedance, sorted_vals, color='red', line_dash='dashed',
               line_width=3, legend_label=f"{label_base}continuous")
    p_fdc.xaxis.axis_label = 'Exceedance probability (%)'
    p_fdc.yaxis.axis_label = 'Flow'
    p_fdc.add_layout(p_fdc.legend[0], 'right')
    p_fdc.legend.click_policy = 'mute'
    return p_fdc

bits_range = range(2, 9)
fdc_plot = plot_fdc(wl_0, bits_range)
show(column(p, fdc_plot))

But streams don't really look like the example above, where water is more evenly distributed across the range of flows.  Below is a synthetic example that is more representative of a flow duration curve for a stream in British Columbia, Canada, where the flow is highly skewed towards low flows.  

In [None]:
wl_1 = 200 * np.random.beta(1.0, 40, size=5000)  # More density near 0
mean, median = np.mean(wl_1), np.median(wl_1)
print(f"Mean: {mean:.2f}, Median: {median:.2f}")
p2 = plot_fdc(wl_1, bits_range)
show(p2)

There are many questions that warrant more precision around where the flow spends most of its time.  Using the logarithmic scale to quantize the series emphasizes flow values where the system spends most of its time.  The lower bitrate series are more striking illustrations of how the two approaches shift how precision is defined.

In [None]:
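def plot_fdc_log(wl, bits, eps=1e-3):
    """Plot FDCs of `wl` quantized on log-spaced bins for each bit depth in `bits`."""
    p_fdc = figure(title="Flow Duration Curves", width=800, height=300)
    n = 0
    for b in bits:
        # floor the lower edge at eps so the log transform stays defined for zero flows
        min_w, max_w = max(wl.min(), eps) * 0.999, wl.max() * 1.001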
        log_edges = np.linspace(np.log(min_w), np.log(max_w), 2**b + 1)
        edges = np.exp(log_edges)
        midpoints = (edges[1:] + edges[:-1]) / 2
        digits = np.digitize(wl, edges) - 1
        digits = np.clip(digits, 0, len(midpoints) - 1)
        wl_q = midpoints[digits]

        sorted_vals = np.sort(wl_q)[::-1]
        exceedance = np.linspace(0, 100, len(sorted_vals))
        p_fdc.line(exceedance, sorted_vals, color=Vibrant7[n],
                   line_width=3, legend_label=f"{b} bits")
        n += 1
    sorted_vals = np.sort(wl)[::-1]
    exceedance = np.linspace(0, 100, len(sorted_vals))
    p_fdc.line(exceedance, sorted_vals, color='red', line_dash='dashed',
               line_width=3, legend_label="continuous")
    p_fdc.xaxis.axis_label = 'Exceedance probability (%)'
    p_fdc.yaxis.axis_label = 'Flow'
    p_fdc.add_layout(p_fdc.legend[0], 'right')
    p_fdc.legend.click_policy = 'mute'
    return p_fdc

p_log = plot_fdc_log(wl_1, range(2, 9))
show(p_log)

## Experiment 1: Predicting Hydrological Signatures from Catchment Attributes

{cite}`mcmillan2021review` presents a comprehensive review of approaches to hydrological signature prediction from catchment attributes, covering a large number of signatures and their links to hydrological processes.  Several signatures relate to specific exceedance percentiles, which may be seen as first order characteristics since they represent positions in the FDC, and others may be seen as second order since they describe slopes (for example the slope of the FDC between the log-transformed 33rd and 66th streamflow percentiles).  The mean is a summary statistic in the sense that it encapsulates all observations.  

### Predicting parameters of the log-normal distribution

```{figure} images/param_prediction_test_result.png
---
width: 800px
alt: Example result showing the prediction error after each group of predictors is added, a scatter plot of observed and predicted mean runoff predicted from catchment attributes based on all predictors, the learning curve of training and test sets, and the distribution of target variables.
---
Example result showing the prediction error after each group of predictors is added, a scatter plot of observed and predicted mean runoff predicted from catchment attributes based on all predictors, the learning curve of training and test sets, and the distribution of target variables.
```

The capacity of catchment attributes to predict hydrological signatures has been shown to be linked to spatial smoothness {cite}`addor2018ranking`.  Some summary statistics are both hydrological signatures and sufficient statistics for a parametric distribution (the log-mean and log-standard deviation in the case of the log-normal), so the terms are used somewhat interchangeably here.  The log-normal distribution has long been used to describe streamflow distributions, but it is limited to a single mode, and many catchments exhibit more than one.  The aim is to see how this simple but rigid model compares to more complex approaches with far greater input data requirements.  We test the accuracy of the log-normal in two ways: using location and scale parameters estimated by the method of moments from the mean and standard deviation of daily runoff, and by directly predicting the log-mean and log-standard deviation from catchment attributes.  The target variables are summarized in the table below.  Each target is tested for predictability using catchment attributes, and the order in which catchment attribute groups are added to the training covariate matrix (feature set) is permuted to test the influence of groups of related attributes.  In addition, to show that the gradient boosting model is learning from structure in the data, we repeat the experiment with randomly permuted attributes.


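```{list-table} Summary of Model Scenarios
:header-rows: 1
:name: model-summary-table

* - Number
  - Target Variable
  - Description
  - Variations
* - 1
  - Mean daily unit area discharge $L/s/km^2$ (MEAN)
  - Predict the mean using catchment attributes.
  - Vary the ordering of catchment attribute groups to test influence of groups of related attributes.
* - 2
  - Median daily unit area discharge $L/s/km^2$ (MEDIAN)
  - Predict the median using catchment attributes.
  - Vary the ordering of catchment attribute groups to test influence of groups of related attributes.
* - 3
  - Standard deviation of daily unit area discharge $L/s/km^2$ (SD)
  - Predict the standard deviation using catchment attributes.
  - Vary the ordering of catchment attribute groups to test influence of groups of related attributes.
* - 4
  - Mean absolute deviation of daily unit area discharge $L/s/km^2$ (MAD)
  - Predict the mean absolute deviation using catchment attributes.
  - Vary the ordering of catchment attribute groups to test influence of groups of related attributes.
* - 5
  - The log of each of the above variables (LOG_MEAN, LOG_MEDIAN, LOG_SD, LOG_MAD)
  - Predict the log of each of the above variables ($mm/day$) using catchment attributes.
  - Vary the ordering of catchment attribute groups to test influence of groups of related attributes.
```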
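As a rough illustration of the method-of-moments step, the sketch below converts a mean and standard deviation of daily runoff into log-normal location and scale parameters and reads FDC values off the fitted distribution. The function names and the input values (a mean of 20 and SD of 30 $L/s/km^2$) are hypothetical and are not taken from the study data.

```python
import math
from statistics import NormalDist

def lognormal_params_from_moments(mean, sd):
    """Match the mean and SD of a log-normal distribution to the given sample moments."""
    sigma2 = math.log(1.0 + (sd / mean) ** 2)
    mu = math.log(mean) - 0.5 * sigma2
    return mu, math.sqrt(sigma2)

def lognormal_quantile(p, mu, sigma):
    """Inverse CDF of the log-normal: exp(mu + sigma * z_p)."""
    return math.exp(mu + sigma * NormalDist().inv_cdf(p))

# hypothetical moments of daily unit area runoff (L/s/km^2)
mu, sigma = lognormal_params_from_moments(mean=20.0, sd=30.0)

# the flow exceeded a fraction p of the time is the (1 - p) quantile
exceedance = [0.05, 0.50, 0.95]
fdc = [lognormal_quantile(1.0 - p, mu, sigma) for p in exceedance]
for p, q in zip(exceedance, fdc):
    print(f"exceeded {p:.0%} of the time: {q:.2f}")
```

Because the fitted curve has only two parameters, it cannot reproduce a second mode regardless of how well the moments are predicted.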

### Gradient Boosted Decision Trees

Gradient Boosting Decision Tree (GBDT) is a widely used machine learning algorithm that builds an ensemble of (simple) decision trees in a sequential manner. Trees are constructed to gradually improve the overall prediction accuracy by adapting the function (decision tree) in each round to minimize the residuals of the previous model. 

General procedure:

1. **Initialization**: The algorithm starts with an initial prediction, often the mean of the target values.
2. **Iterative Learning**: In each iteration (boosting round), a new decision tree is trained to predict the **residual errors** (differences between the actual values and the current predictions) of the ensemble built so far.
3. **Functional Space Optimization**: XGBoost approximates the loss function with a second-order Taylor expansion around the current prediction and, in each round, adds the function (base learner $f_t(x_i)$) that gives the greatest improvement to the approximated loss.  To first order this amounts to fitting $f_t(x_i)$ to the negative gradients $-g_i$, where $g_i$ is the gradient of the loss function $L(y_i, \hat y_i)$ (e.g. squared or absolute error between the observed and predicted values $y_i$ and $\hat y_i$) with respect to $\hat y_i$.
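The procedure above can be sketched in plain Python for the squared-error case, where the negative gradient is simply the residual. This is a toy illustration and not XGBoost: it uses one-split trees (stumps) on a single feature, has no regularization or second-order terms, and the helper names `fit_stump` and `boost` and the sine-curve data are made up for the example.

```python
import math

def fit_stump(x, residuals):
    """Fit a one-split regression tree (stump) to the residuals by least squares."""
    best = None
    values = sorted(set(x))
    for lo, hi in zip(values[:-1], values[1:]):
        threshold = 0.5 * (lo + hi)
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda xi: left_mean if xi <= threshold else right_mean

def boost(x, y, n_rounds=50, learning_rate=0.1):
    """Gradient boosting for squared error: each stump fits the current residuals."""
    initial = sum(y) / len(y)            # step 1: start from the mean
    pred = [initial] * len(y)
    trees = []
    for _ in range(n_rounds):            # step 2: one new tree per round
        residuals = [yi - pi for yi, pi in zip(y, pred)]  # negative gradient
        tree = fit_stump(x, residuals)
        trees.append(tree)
        pred = [pi + learning_rate * tree(xi) for xi, pi in zip(x, pred)]
    return lambda xi: initial + learning_rate * sum(t(xi) for t in trees)

def mse(y_true, y_pred):
    return sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true)

# toy nonlinear target
x = [i / 10 for i in range(40)]
y = [math.sin(xi) for xi in x]
model = boost(x, y)

mse_mean = mse(y, [sum(y) / len(y)] * len(y))
mse_boosted = mse(y, [model(xi) for xi in x])
print(f"MSE of mean predictor: {mse_mean:.4f}, after boosting: {mse_boosted:.4f}")
```

The learning rate shrinks each tree's contribution so that later rounds can correct earlier ones, which is the same role it plays in full GBDT libraries.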

The GBDT approach is used for its strength in representing nonlinear relationships in high-dimensional input feature sets, for the ability to set up training and testing for robust model training, and for the ability to test the relative importance of features.  One advantage of GBDT over the random forest (RF) approach used in {cite}`addor2018ranking` for hydrological signature prediction from attributes is that **the training data is not limited by incomplete feature sets**: XGBoost handles missing values natively by learning a default branch direction at each split, so samples with missing attributes can remain in the training data.

### Model Validation / Testing of Gradient Boosting Decision Tree Experiments

To address the problem of overfitting, 5-fold cross validation is used such that each catchment in the sample is tested out of sample once.  The GBDT model procedure is carried out as follows: 

1. Split the catchment sample into five spatially distributed folds (subsets).  Verify that the distribution of target variables is similar across the validation folds; this ensures the differences between tests are attributable to the models themselves as opposed to effects of data partitioning.
2. Run repeated experiments with randomly permuted gradient boosting hyperparameters.  The purpose of this step is to a) see the sensitivity of the results to hyperparameter settings, and b) find the hyperparameter set that yields the best average validation score.
3. Retrain a model using the full training set based on the hyperparameters associated with the median result, then generate predictions on the hold-out test set to determine the model performance on unseen data.
4. (Possibly) repeat steps 1-3 several times to evaluate selection bias in the hold-out set. 
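One way to carry out the check in step 1 is to compare each validation fold against the remaining folds with a two-sample Kolmogorov-Smirnov statistic. This is only a sketch under assumptions: the KS statistic is our choice for illustration and is not stated as the method used in the notebooks, and the target values are synthetic log-normal draws standing in for a station list already sorted spatially.

```python
import random

def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: the largest gap between empirical CDFs."""
    a, b = sorted(a), sorted(b)
    i = j = 0
    d = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        # advance both pointers past every value <= x
        while i < len(a) and a[i] <= x:
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        d = max(d, abs(i / len(a) - j / len(b)))
    return d

random.seed(0)
# synthetic target values, listed in the (hypothetical) spatially sorted station order
targets = [random.lognormvariate(3.0, 1.0) for _ in range(1000)]

n_folds = 5
fold_ids = [i % n_folds for i in range(len(targets))]  # alternating assignment

fold_stats = []
for fold in range(n_folds):
    held_out = [t for t, f in zip(targets, fold_ids) if f == fold]
    training = [t for t, f in zip(targets, fold_ids) if f != fold]
    fold_stats.append(ks_statistic(held_out, training))
    print(f"fold {fold}: n = {len(held_out)}, KS = {fold_stats[-1]:.3f}")
```

Small and similar KS values across folds suggest that differences in validation scores are not driven by how the data were partitioned.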

### Visualize the partitioning of the data on a map

Partitioning of the dataset is done by spatially distributing the catchments into five folds, ensuring that each fold is representative of the overall study region.  Below we visualize the cross validation folds on a map.

In [None]:
import os
import geopandas as gpd
from shapely.geometry import Point

from bokeh.plotting import figure, show
from bokeh.layouts import gridplot, row, column
from bokeh.transform import factor_cmap
from bokeh.models import ColumnDataSource
from bokeh.io import output_notebook
from bokeh.palettes import Sunset10, Vibrant7, Category20, Bokeh6, Bokeh7, Bokeh8

from sklearn.cluster import AgglomerativeClustering
output_notebook()
import xyzservices.providers as xyz
from utils import data_processing_functions as dpf

BASE_DIR = os.getcwd()

tiles = xyz['USGS']['USTopo']

In [None]:
# load the catchment characteristics
fname = 'BCUB_watershed_attributes_updated_20250227.csv'
attr_df = pd.read_csv(os.path.join('data', fname))
attr_df.columns = [c.lower() for c in attr_df.columns]
station_ids = attr_df['official_id'].values
print(f'There are {len(station_ids)} monitored basins in the attribute set.')

### Map Colouring Partition

Instead of spatial cluster-based partitioning, we define N classes by sorting the stations spatially and assigning them to classes in alternation.  This results in classes that each cover roughly the same full region, but with $1/N$ of the point density.  This is shown in the map below.

In [None]:
def alternate_partition_geopandas(gdf, n_classes=2):
    # Extract coordinates from the GeoDataFrame
    gdf['coords'] = gdf.geometry.apply(lambda geom: (geom.x, geom.y))
    
    # Sort by coordinates, for example, first by x and then by y
    sorted_gdf = gdf.sort_values(by=['coords'])
    
    # Alternate assignment of nodes to N classes
    sorted_gdf['class'] = np.arange(len(sorted_gdf)) % n_classes
    
    return sorted_gdf

In [None]:
n_classes = 5
sorted_df = attr_df.sort_values(by=['centroid_lon_deg_e', 'centroid_lat_deg_n'])
# Alternate assignment of nodes to N classes
sorted_df[f'{n_classes}_spatial'] = np.arange(len(sorted_df)) % n_classes
sorted_df['geometry'] = sorted_df.apply(lambda row: Point(row['centroid_lon_deg_e'], row['centroid_lat_deg_n']), axis=1)

sorted_gdf = gpd.GeoDataFrame(sorted_df, geometry='geometry', crs='EPSG:4326')
sorted_gdf.to_crs(epsg=3857, inplace=True)
sorted_gdf['lat'] = sorted_gdf.geometry.y
sorted_gdf['lon'] = sorted_gdf.geometry.x

In [None]:
glyphs = ['circle', 'square', 'triangle', 'diamond', 'inverted_triangle']  # glyphs
colors = Category20[10]  # 2 colors * 5 glyphs = 10 unique symbols

def map_cluster_to_glyph_color(cluster_id):
    cluster_id = int(cluster_id)  # Ensure cluster_id is an integer
    return glyphs[cluster_id % len(glyphs)], colors[cluster_id % len(colors)]

In [None]:
# Create a Bokeh plot
p = figure(title="Spatial Partitioning II", 
           tools="pan,wheel_zoom,reset", match_aspect=True, 
           width=900, height=650)

# Plot each fold with its own marker and colour
for cluster_id in sorted_gdf[f'{n_classes}_spatial'].unique():
    # Get marker and color for each cluster
    marker, color = map_cluster_to_glyph_color(cluster_id)
    
    # Filter data for the current cluster
    cols = [c for c in sorted_gdf.columns if c != 'geometry']
    cluster_data = sorted_gdf[sorted_gdf[f'{n_classes}_spatial'] == cluster_id].copy()[cols]
    
    # Plot using scatter with the marker and color
    p.scatter('lon', 'lat', source=cluster_data, marker=marker, size=5, 
              color=color, legend_label=f'{cluster_id}',
             line_color='black', line_alpha=0.5, line_width=1)

# Customize the plot
p.add_tile(tiles, retina=True)
p.grid.visible = False
# Customize and sort the legend
p.legend.title = "Fold #"
p.legend.ncols = 1
p.legend.label_text_font_size = '8pt'
p.legend[0].items = sorted(p.legend[0].items, key=lambda t: f'{int(t.label.value):02d}')
p.add_layout(p.legend[0], 'right')

# Show the plot
show(p)

From the plots above, the target variable distribution is similar across all validation folds.

In [None]:
# save the partitions for later use
sorted_gdf.to_file(f'data/stn_attributes_with_{n_classes}_spatial_partitions.geojson')

Below we show that there are 5 classes and each has roughly the same number of points (261 or 262).

In [None]:
unique, counts = np.unique(sorted_gdf[f'{n_classes}_spatial'], return_counts=True)
print(unique)
counts

## Experiment 2: k Nearest Neighbours for Daily Runoff Prediction

The second experiment tests the existing (and historical) streamflow monitoring network for its capacity to provide information about ungauged locations.  Each monitored location is in turn used as an "ungauged" location, and the nearest monitored locations are used to derive estimates of daily (unit area) runoff.  The period of record (POR) runoff distribution is computed from these estimates by kernel density estimation (KDE). 

We test ensembles of 1 to 10 nearest neighbours to see the effect of the ensemble size on the accuracy of the FDC estimate.  We test different ensemble member selection criteria (spatial distance and hydrological similarity) and different ensemble weightings (inverse distance and inverse square distance weighting) to control the influence of each member on the final estimate.  The final ensemble estimate is compared to the reference (period of record) distribution (pre-processed in Notebook 4) by several evaluation metrics, described later in this notebook.

The estimated distribution of runoff is computed from k contributing neighbours in two ways: 1) by temporal ensemble of the daily runoff values and estimating the distribution from the temporally averaged ensemble, and 2) by computing distributions on individual ensemble members and then averaging the resulting probability densities.
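The two ways of combining neighbours can be sketched as follows. This is a simplified illustration: a fixed-bin histogram stands in for the kernel density estimate used in the experiments, and the runoff series, distances, bin edges, and helper names are all hypothetical.

```python
from bisect import bisect_right

def idw_weights(distances, power=1):
    """Normalized inverse-distance weights; power=2 gives inverse-square weighting."""
    raw = [1.0 / d ** power for d in distances]
    total = sum(raw)
    return [w / total for w in raw]

def histogram_pmf(values, edges):
    """Fraction of values falling in each bin defined by consecutive edges."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        k = min(max(bisect_right(edges, v) - 1, 0), len(counts) - 1)
        counts[k] += 1
    return [c / len(values) for c in counts]

# hypothetical daily unit area runoff at k = 3 neighbours on the same dates
neighbours = [
    [1.0, 2.0, 8.0, 3.0, 1.5, 1.0],
    [0.5, 1.0, 5.0, 4.0, 2.0, 1.0],
    [2.0, 2.5, 3.0, 9.0, 4.0, 2.0],
]
distances_km = [10.0, 20.0, 40.0]
weights = idw_weights(distances_km)
edges = [0.0, 2.0, 4.0, 6.0, 8.0, 10.0]

# 1) average the neighbours in time, then estimate one distribution
mean_series = [sum(w * q for w, q in zip(weights, day)) for day in zip(*neighbours)]
pmf_time_first = histogram_pmf(mean_series, edges)

# 2) estimate a distribution per neighbour, then average the probabilities
member_pmfs = [histogram_pmf(series, edges) for series in neighbours]
pmf_density_first = [
    sum(w * pmf[k] for w, pmf in zip(weights, member_pmfs))
    for k in range(len(edges) - 1)
]

print([round(p, 3) for p in pmf_time_first])
print([round(p, 3) for p in pmf_density_first])
```

Averaging in time tends to smooth out peaks that do not coincide across neighbours, whereas averaging the distributions retains each neighbour's extremes in proportion to its weight.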

### Practical interpretation of k-nearest neighbours

```{figure} images/weekly_data_availability.png
---
alt: A visualization of weekly data availability for the streamflow monitoring stations in the study region shows many gaps in the records.
name: data-continuity-fig
width: 800px
align: center
---
Discontinuous and non-overlapping records are a problem underlying any hydrological analysis, and the problem is compounded for large sample studies.
```

The figure above shows the weekly data availability for the streamflow monitoring stations in the study region, and it illustrates a key problem in k-nearest neighbour streamflow estimation: records are discontinuous and often do not overlap.  This changes the interpretation of 'k'.  Studies either infill gaps by similar means (the interpretation is then "at least k nearest") or use what is available (with the corresponding interpretation of "at most k nearest").  The problem is compounded in large sample studies, where the number of stations is large and the probability of overlap is low enough to complicate the interpretation of results.

In this study, we are interested in estimating the POR distribution at an ungauged location because it resembles the question of predicting water availability for a period we do not have data for.  For simplicity, we use the interpretation of "at most k nearest".
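One possible reading of "at most k nearest" is sketched below: on each day, the estimate uses whichever of the k selected neighbours have an observation, so the effective ensemble size varies from day to day. The function name, the equal weighting, and the use of `None` to mark missing days are assumptions made for the example.

```python
def at_most_k_mean(neighbour_series, k):
    """Daily mean over the (up to) k nearest neighbours that have data on that day.

    `neighbour_series` is ordered nearest first; None marks a missing observation.
    """
    selected = neighbour_series[:k]
    estimate = []
    for day_values in zip(*selected):
        available = [v for v in day_values if v is not None]
        estimate.append(sum(available) / len(available) if available else None)
    return estimate

# hypothetical records with gaps, nearest neighbour first
records = [
    [1.0, None, None, 2.0],
    [3.0, 4.0, None, None],
    [None, 6.0, None, 4.0],
]

print(at_most_k_mean(records, k=2))  # → [2.0, 4.0, None, 2.0]
print(at_most_k_mean(records, k=3))  # → [2.0, 5.0, None, 3.0]
```

Days on which none of the selected neighbours report remain missing, which is the cost of not infilling.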



## Experiment 3: LSTM Neural Network for Daily Runoff Prediction

For the last FDC estimation approach, daily meteorological inputs are used to predict daily runoff at out-of-sample locations.  Cross validation is similarly used in this experiment; however, 12 folds are used to a) ensure a large and diverse sample of catchments for training and b) follow the methods used in the original experiments {cite}`kratzert2019towards`.  The [NeuralHydrology](https://neuralhydrology.readthedocs.io/en/latest/) library {cite}`kratzert2022joss` is used with default settings wherever possible; departures from the defaults are detailed here or in the code in Notebook 5.

There are several components of the model setup that for conciseness were left out of the accompanying paper, but we include them here for completeness.

1. **Adding a new dataset**: we want to use the large sample of monitored catchments covering British Columbia for our experiments, so we followed the [documentation](https://neuralhydrology.readthedocs.io/en/latest/tutorials/add-dataset.html) in adding a new `GenericDataset`. 
2. **Specification of training and testing data**:  The LSTM model requires `.yml` files to specify which station ids are part of the training and testing sets.  An additional set of `.yml` configuration files is created to specify the training and validation periods for all training stations and all testing stations.  
3. **Model Training, Validation, and Testing**: The training stations must have a minimum of 5 years of data.  We specify a 60/40 training/validation split for *training stations*.  For one experiment, the LSTM is trained (over 30 epochs) on 11 of the 12 folds of the total sample.  The remaining fold is held out for testing; there is no training, validation, or fine-tuning on the test set, and the trained model is simply applied to the held-out test fold.  This is repeated ten times using different random seeds to ensure robustness of the results and to better understand their variability.  The entire process is repeated for each of the 12 folds, such that each fold is used as a test set once and the remaining folds are used for training.  The results are then averaged across all folds to obtain an ensemble prediction, but we keep the individual ensemble predictions to explore the variability among them.

### Assignment of random seeds

For replicability, we specify the seed for the random number generator in a shell script that calls the training script for each ensemble simulation.  This way, the same set of random seeds is used for the ensemble of 10 simulations in each of the 12 folds.  The shell script is provided below:

```bash
#!/bin/bash

# Directory containing your config files
CONFIG_DIR="bcub_test/batch_config_files"
RUNS_DIR="./runs"
OUTPUT_PREFIX="bcub_test_expt"

# set an array of random seeds
random_seeds=(42 123 113 752 101 72 13 1617 188 202 2223 252 627 4 3741 555 96 854 1202 442)

# Loop over 12 folds
for fold in {0..11}; do
    CONFIG_FILE="$CONFIG_DIR/bcub_experiment_${fold}.yml"

    # Count how many runs already exist for this fold
    run_count=$(find "$RUNS_DIR" -maxdepth 1 -type d -name "${OUTPUT_PREFIX}_${fold}_*" | wc -l)

    
    # Repeat 10 times per fold for ensemble
    for ((run=run_count; run<10; run++)); do
        seed=${random_seeds[$((run % ${#random_seeds[@]}))]}
        echo "Running Fold $fold, Ensemble Member $run with seed $seed"
        
        # insert the random seed into the config file where the "seed:" line 
        # is located
        sed -i "s/seed: .*/seed: ${seed}/" "$CONFIG_FILE"
        nh-run train --config-file "$CONFIG_FILE"
    done
done
```


## Evaluation Metrics

The evaluation metrics used to assess the performance of the FDC estimation methods are as follows:


1. **PB**: Percent Bias, a measure of the average error relative to the observed values, expressed as a percentage. It is calculated as the absolute error divided by the observed value, multiplied by 100.
2. **NAE**: Normalized Absolute Error, a measure of the average absolute errors between predicted and observed values, normalized by the mean of the observed values.
3. **RMSE**: Root Mean Square Error, a measure of the average magnitude of the errors between predicted and observed values.
4. **NSE**: Nash-Sutcliffe Efficiency, a normalized statistic that compares the variance of the residuals to the variance of the observed data. It ranges from $-\infty$ to 1, where 1 indicates perfect prediction, 0 indicates that the model is as good as the mean of the observed data, and negative values indicate the model is worse than the mean.
5. **KGE**: *(not presented in the paper)* Kling-Gupta Efficiency, a multi-objective metric that combines correlation, bias, and variability into a single score. It ranges from $-\infty$ to 1, where 1 indicates perfect prediction. Unlike NSE, a score of 0 is not the mean-flow benchmark: using the mean of the observations as the prediction gives $KGE = 1 - \sqrt{2} \approx -0.41$.
6. **KLD**: Kullback-Leibler Divergence, a measure of how one probability distribution differs from a second, reference probability distribution. It emphasizes differences where probability mass is concentrated, meaning states that are more frequently observed.
7. **EMD**: *(not presented in the paper)* Earth Mover's Distance, a measure of the distance between two probability distributions. It quantifies the "work" needed to move probability mass from one distribution to the other and is expressed in the units of the target variable, which makes it useful for comparing distributions with different shapes.

The first four evaluation metrics are commonly used in hydrological model evaluation, and we follow convention from the literature in calculating them on the set of runoff values corresponding to the 1st to 99th percentiles of the FDC.  The KLD and EMD metrics are evaluated over a global evaluation grid so that all distributions share a consistent support; the support at an ungauged location is not known *a priori*, and adjusting it *post hoc* would represent data leakage.  The implementation of these computations is nuanced, and care has been taken to be explicit about the assumptions and limitations of the methodology.
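A minimal sketch of the two distribution-based metrics on a shared grid is given below. It assumes both distributions are already expressed as probability mass functions over the same equally spaced states, uses base-2 logarithms so the divergence is in bits (our choice for the example), and the function names are hypothetical.

```python
import math

def kld_bits(p, q):
    """Kullback-Leibler divergence D(p || q) in bits for PMFs on the same grid."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi > 0:
            if qi == 0:
                return math.inf  # q assigns zero probability to an observed state
            total += pi * math.log2(pi / qi)
    return total

def emd_1d(p, q, dx):
    """1-D Earth Mover's Distance: the accumulated CDF gap times the grid spacing."""
    gap = 0.0
    work = 0.0
    for pi, qi in zip(p, q):
        gap += pi - qi          # running difference between the two CDFs
        work += abs(gap) * dx
    return work

# reference (p) and estimate (q) on a common 3-state grid with spacing 2.0
p = [0.5, 0.5, 0.0]
q = [0.25, 0.5, 0.25]
print(f"KLD = {kld_bits(p, q):.3f} bits, EMD = {emd_1d(p, q, dx=2.0):.3f}")

# swapping the arguments places mass where the second distribution has none
print(kld_bits(q, p))  # → inf
```

The infinite divergence in the second call is the reason the estimated distribution must have nonzero probability wherever the reference does.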



## Uniform Noise Mixture

The simplest way to smooth an empirical distribution is to mix it with a uniform distribution.  This ensures all states in $\Omega$ have nonzero probability, and the amount of smoothing is controlled by the mixture weight $\lambda$.  The smoothed PMF is given by:
$$p_{smoothed}(x) = (1-\lambda)p_{empirical}(x) + \lambda p_{uniform}(x)$$

where $p_{uniform}(x) = 1/|\Omega|$ for all $x \in \Omega$.  The uniform distribution is the maximum entropy distribution for a given range, and mixing with it increases the entropy of the empirical distribution.  
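The mixture can be written in a few lines. In this sketch the empirical PMF and the mixture weight of 0.1 are arbitrary example values, and the helper names are hypothetical.

```python
import math

def mix_with_uniform(pmf, lam):
    """Return (1 - lam) * pmf + lam * uniform over the same set of states."""
    n_states = len(pmf)
    return [(1.0 - lam) * p + lam / n_states for p in pmf]

def entropy_bits(pmf):
    """Shannon entropy in bits, skipping zero-probability states."""
    return -sum(p * math.log2(p) for p in pmf if p > 0)

empirical = [0.5, 0.5, 0.0, 0.0]      # two states were never observed
smoothed = mix_with_uniform(empirical, lam=0.1)

print(smoothed)
print(entropy_bits(empirical), entropy_bits(smoothed))
```

With four states and a weight of 0.1, each unobserved state receives probability 0.025 and the observed states drop from 0.5 to 0.475, so the result still sums to one.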

## Citations

```{bibliography}
:filter: docname in docnames
```