# Methods

## Introduction: Discrete vs. Continuous

It is natural to think of water flowing in a stream as a continuous quantity.  Water level in a stream rises and falls, tracing a smooth line in time rather than stepping up and down abruptly at fixed time intervals.  When streamflow observations are recorded quantitatively, the continuous values representing streamflow at a moment in time are converted to a discrete form and stored on a computer in 32-bit or 64-bit floating point format.  These two formats can represent approximately $4.3\times 10^9$ and $1.8 \times 10^{19}$ distinct states, or roughly 7 and 16 significant decimal digits, respectively.  Streamflow observations don't need that much precision because measurement (rating curve) uncertainty is the limiting factor.

Streamflow measurement uncertainty is multiplicative, meaning the uncertainty varies *in proportion to the magnitude*.  The Water Survey of Canada publishes the [HYDAT](https://www.canada.ca/en/environment-climate-change/services/water-overview/quantity/monitoring/survey/data-products-services/national-archive-hydat.html) database of estimated daily (and hourly in some cases) streamflow at 1000+ stations in Canada.  Mean daily streamflow series from the HYDAT dataset use 3 decimal precision.  Given an example range of 0.1 to 100 $m^3/s$, this precision suggests $(100-0.1) / 0.001 = 99900$ unique states, which requires roughly 17 bits ($2^{17} = 131{,}072$).
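As a quick check of the arithmetic above, the number of bits needed to index the states can be computed directly. This is a minimal sketch: the variable names are illustrative, and the flow range and resolution are the example values from the text.

```python
import math

q_min, q_max = 0.1, 100.0   # example flow range from the text [m^3/s]
resolution = 0.001          # 3 decimal precision

n_states = round((q_max - q_min) / resolution)
bits = math.log2(n_states)  # about 16.6, so 17 whole bits are needed

print(n_states, math.ceil(bits))
```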

The figure below illustrates how increasing the number of states representing the observed series converges to the continuous function $y(t) = 5 + \sin(t) + 0.5\sin(3t)$. In the example below, even 5 bits gives a close representation of the continuous function, though this is due in large part to the range of inputs and the nature of the function.  Click on the legend labels to toggle series and see the effect more clearly. 

In [1]:
import numpy as np
from bokeh.plotting import figure, show
from bokeh.io import output_notebook
from bokeh.palettes import Vibrant7
output_notebook()

def discrete_series(wl, bits):
    """Quantize wl into 2**bits equal-width states and return the bin midpoints."""
    # pad the range slightly so the extremes fall strictly inside the outer bins
    min_w, max_w = np.min(wl) - 1e-9, np.max(wl) + 1e-9
    # 2**bits states (bins) require 2**bits + 1 edges
    edges = np.linspace(min_w, max_w, 2**bits + 1)
    midpoints = (edges[1:] + edges[:-1]) / 2

    # np.digitize returns 1-based bin indices: shift to 0-based and clip
    digits = np.digitize(wl, edges) - 1
    digits = np.clip(digits, 0, len(midpoints) - 1)

    return midpoints[digits]
    
# Generate example data
time = np.linspace(0, 10, 500)
wl_0 = 5 + np.sin(time) + 0.5 * np.sin(3 * time)
p = figure(title="Discrete to continuous streamflow", width=800, height=300)

n = 0
for b in range(2,9):
    wl = discrete_series(wl_0, b)
    p.line(time, wl, color=Vibrant7[n], line_width=3,
           legend_label=str(b)+' bits')
    n += 1
p.line(time, wl_0, legend_label='continuous', line_dash='dashed', color='red',
      line_width=3)    
p.yaxis.axis_label = 'Water level'
p.add_layout(p.legend[0], 'right')
p.xaxis.axis_label = 'Time'
p.legend.click_policy='mute'
show(p)

The introduction above describes two key assumptions in the methods of this study: the uncertainty in the observations themselves, and the detail or resolution in which we represent (uncertain) observations.  

### Streams as Systems with Discrete States

Streamflow can be viewed as a system observed in various states defined by $X$, where states are defined by monotonically increasing intervals over a range of flow.  The number and width of intervals determine how precisely the system states can represent streamflow, and how much information is lost when converting continuous streamflow observations to discrete values.  The effect of quantization is a key assumption tested in this study.

Signal quantization is the process of mapping a continuous range of signal values to a finite set of discrete levels. This is often done in digital signal processing to represent a signal with a limited number of bits. However, quantization introduces an error or "noise" because the original signal's fine details are lost when it is forced into discrete levels represented by a single value for each level. The quantization error represents the difference between the actual signal and the quantized version, introducing uncertainty into the representation of the signal.  
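To make the quantization error concrete, the sketch below quantizes the example signal from the introduction at several bit depths and reports the root mean square error. The `quantize` helper is an illustrative assumption (equal-width bins over the observed range, each represented by its midpoint), not necessarily the scheme used in the analysis.

```python
import numpy as np

def quantize(x, bits):
    """Map x onto the midpoints of 2**bits equal-width bins spanning its range."""
    edges = np.linspace(x.min(), x.max(), 2**bits + 1)
    idx = np.clip(np.digitize(x, edges) - 1, 0, 2**bits - 1)
    return (edges[idx] + edges[idx + 1]) / 2

t = np.linspace(0, 10, 500)
signal = 5 + np.sin(t) + 0.5 * np.sin(3 * t)

for bits in (2, 4, 6, 8):
    err = signal - quantize(signal, bits)      # quantization error
    rmse = np.sqrt(np.mean(err**2))
    print(f"{bits} bits: RMSE = {rmse:.4f}")
```

The error of any single value is bounded by half a bin width, so each added bit halves the worst-case error.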


### Streamflow as a Noisy Signal

In hydrology, streamflow is derived from water level (stage) measurements using a stage-discharge relationship, often modeled by a power law or interpolation. This process, like quantization, introduces uncertainty into the estimation of streamflow. Quantization noise is in a limited sense analogous to rating curve uncertainty because in both cases error is introduced when mapping continuous observations to discrete values through an imperfect model.  {cite}`hamilton2012quantifying` describes many other sources of uncertainty in defining the stage-discharge relationship.

In this study we compute a number of information measures based on discrete distributions of streamflow timeseries, and test the effect of quantization on the predictability of each measure from catchment attributes.  If too many quantization levels are used, false precision is introduced and the quantization could be overfitting to noise.  In contrast, too few levels result in a loss of information by oversimplifying the data and underestimating variability in the signal.  By testing a wide range of quantization schemes, a balance is sought in quantization to reflect uncertainty in the noisy streamflow signal.

## Entropy and Distance Measures 

## Predicting Hydrological Signatures from Catchment Attributes

{cite}`mcmillan2021review` provides a comprehensive review of approaches to hydrological signature prediction from catchment attributes; in particular, it notes the diversity of signatures and their links to hydrological processes.  Several signatures relate to specific exceedance percentiles, which may be seen as first order characteristics since they represent positions on the flow duration curve (FDC).  Others may be seen as second order since they describe slopes (for example the slope of the FDC between the log-transformed 33rd and 66th streamflow percentiles).  The mean is a summary statistic in the sense that it encapsulates all observations.  F-divergence measures are more closely aligned with this last kind of signature since they characterize the full distribution.  However, the test setup differs when testing the predictability of divergence metrics because the target variable is a function of two probability distributions representing two locations.  

The capacity of catchment attributes to predict hydrological signatures has been shown to be linked to spatial smoothness {cite}`addor2018ranking`.  Two levels of model complexity are tested in this study.  The first is a binary classification model which tests the model's ability to predict whether a regional donor/proxy catchment provides a more informative model (smaller f-divergence) than the maximum uncertainty prior (uniform model Q).  The second is a nonlinear regression model that outputs predictions of the target variables.  A summary of tests is provided in {numref}`model-summary-table`.


```{list-table} Summary of Model Scenarios
:header-rows: 1
:name: model-summary-table

* - Number
  - Target Variable
  - Description
  - Variations
* - 1
  - Mean Annual Unit Area Runoff (MAUR)
  - Predict the long-term MAUR, computed from observed daily mean streamflow series, from catchment attributes.
  - \-
* - 2
  - Shannon Entropy (H)
  - Predict H computed from unit area runoff distributions at single locations from catchment attributes.
  - Vary the encoding dictionary size between 4, 6, and 8 bits.
* - 3
  - KL Divergence
  - Predict the Kullback-Leibler (KL) divergence of pseudo-simulated (Q) from observed (P) unit area runoff distributions between pairs of locations from catchment attributes.
  - Vary the encoding dictionary size between 4, 6, and 8 bits, vary the prior applied to Q, test concurrent and nonconcurrent periods of record, test the effect of probabilistic treatment of observation error (10% uniform).
* - 4
  - Earth Mover's Distance (EMD)
  - Predict the EMD between observed (P) and pseudo-simulated (Q) unit area runoff distributions between pairs of locations from catchment attributes.
  - Vary the encoding dictionary size between 4, 6, and 8 bits, test concurrent and nonconcurrent periods of record, test the effect of probabilistic treatment of observation error (10% uniform).
* - 5
  - Total Variation Distance (TVD)
  - Predict the total variation distance between observed (P) and pseudo-simulated (Q) unit area runoff distributions between pairs of locations from catchment attributes.
  - Vary the encoding dictionary size between 4, 6, and 8 bits, test concurrent and nonconcurrent periods of record, test the effect of probabilistic treatment of observation error (10% uniform).
```





### Mean Annual Unit Area Runoff (MAUR)
{cite}`addor2018ranking` predicts MAUR from catchment attributes for approximately 600 catchments over the continental US from the CAMELS {cite}`arsenault2020comprehensive` dataset.  A similar approach is taken to replicate those results here.  For a series of $n$ unit area runoff observations $X_i$, $\text{MAUR}(X)$ is defined as:

$$\text{MAUR}(X) = \frac{1}{n} \sum_{i=1}^n X_i $$


### Shannon Entropy

{cite}`shannon1948mathematical` describes entropy as a measure of the uncertainty or randomness in a probability distribution. It quantifies the amount of information required to describe the state of a system. Mathematically, for a discrete random variable $X$ with possible outcomes (states) $\{x_1, x_2, \dots, x_n\}$ and corresponding probabilities $P(X=x_i) = p_i$, the Shannon entropy $H(X)$ is defined as:

$$H(X) = -\sum_{i=1}^n P(x_i) \log_2 P(x_i)$$
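The definition above can be sketched in a few lines. This example assumes equal-width bins over the observed range and uses a synthetic lognormal series in place of real unit area runoff; both choices are illustrative assumptions rather than the study's configuration.

```python
import numpy as np

def shannon_entropy(x, bits):
    """Entropy (in bits) of x quantized into 2**bits equal-width states."""
    counts, _ = np.histogram(x, bins=2**bits)
    p = counts / counts.sum()
    p = p[p > 0]                      # 0 * log(0) is taken as 0
    return float(-np.sum(p * np.log2(p)))

rng = np.random.default_rng(0)
runoff = rng.lognormal(mean=1.0, sigma=1.0, size=5000)   # synthetic, not real data

for bits in (4, 6, 8):
    print(bits, round(shannon_entropy(runoff, bits), 3))
```

The entropy can never exceed the dictionary size in bits, which is reached only when all states are equally likely.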

### Kullback-Leibler Divergence

KL divergence is a measure of how one probability distribution diverges from a second, expected probability distribution. It quantifies the amount of information lost when approximating one distribution with another. Mathematically, for two discrete probability distributions $P(X)$ and $Q(X)$ over the variable $X$, the KL divergence from $Q$ to $P$ is defined as follows (here we use log base 2 so the result is expressed in bits; base $e$, which gives "nats", is also common):

$$D_{KL}(P(X)||Q(X)) = \sum_{i=1}^n P(x_i) \log_2\frac{P(x_i)}{Q(x_i)}$$

In the context of observed streamflow, KL divergence can be used to compare the true distribution of streamflow values at a given location (derived from measurements) to a modelled or assumed distribution. For example, given a simple model that predicts streamflow at ungauged locations, KL divergence helps us quantify how much information is missing or lost when a model’s predictions are used instead of having actual observed values.  It represents the extra number of yes or no questions you would have to ask on average to get the correct answer above the minimum number of yes or no questions required given perfect frequency estimates.  The optimal sequence of yes or no questions simply divides the overall probability into as close to 50/50 as possible each time until an unambiguous answer is arrived at.  Imperfect frequency estimates result in sub-optimally dividing the space when asking yes/no questions, so a few more questions must be asked in the long run.  The children's board game Guess Who is an excellent example.
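A minimal sketch of the KL divergence for two discrete distributions defined on the same states follows. The function name and example frequencies are illustrative assumptions; the function returns infinity when $Q$ assigns zero probability to a state that $P$ observes.

```python
import numpy as np

def kl_divergence(p, q):
    """D_KL(P||Q) in bits for discrete distributions on the same states."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0                      # terms with p_i = 0 contribute nothing
    if np.any(q[mask] == 0):
        return float("inf")           # Q rules out a state that P observes
    return float(np.sum(p[mask] * np.log2(p[mask] / q[mask])))

p = [0.5, 0.25, 0.25]                 # "observed" frequencies (illustrative)
q = [1/3, 1/3, 1/3]                   # uniform model

print(kl_divergence(p, q))            # extra bits per observation when coding P with Q
print(kl_divergence(p, p))            # a perfect model costs no extra bits
```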

* In the information theory literature, the KL divergence is described as the inefficiency in the encoding dictionary due to incorrect estimation of the frequency of symbols or states.
    * The Kolmogorov complexity represents a lower bound on optimal encoding.
    * The optimality of encoding translates to the average length of message transmitted, expressed in bits
    * A better model represents compression of the information in the observed signal
    * There is a fundamental link between compression and prediction.  "A model that compresses well generalizes well" (HUTTER).

### Total Variation Distance

Total Variation Distance (TVD) is a measure of the difference between two probability distributions. It quantifies the largest difference in the probability that the two distributions assign to the same event, where an event is any set of states. For two discrete probability distributions $P$ and $Q$ over the variable $X$, this maximum equals half the sum of absolute differences over states, so the TVD is defined as:

$$\text{TVD}(P,Q) = \frac{1}{2}\sum_i|P(x_i) - Q(x_i)|$$

In the context of observed streamflow, TVD can be used to compare the true distribution of streamflow values at a given location (derived from measurements) to a modelled or assumed distribution. For example, if we have a model that predicts streamflow at various locations, TVD helps us quantify the **maximum discrepancy** between the model’s predictions and the actual observed values.
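The sketch below computes the TVD from the definition and checks it against the largest probability gap over any set of states. The distributions are illustrative values only.

```python
import numpy as np

def total_variation_distance(p, q):
    """Half the L1 distance between two discrete distributions."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return 0.5 * float(np.sum(np.abs(p - q)))

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.2, 0.3, 0.5])
tvd = total_variation_distance(p, q)

# The same value is the largest gap in probability over any set of states:
# collect every state where P exceeds Q and sum the excess.
gap = float(np.sum(np.clip(p - q, 0, None)))
print(tvd, gap)
```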

### Earth Mover's (Wasserstein) Distance

The Earth Mover's Distance (EMD), also known as the Wasserstein Distance, is a measure of the difference between two probability distributions that incorporates how far the "mass" of probability must be moved to transform one distribution to another. For two discrete probability distributions $P$ and $Q$ over the variable $X$, the EMD is defined as:

$$W(P, Q) = \inf_{\gamma \in \Gamma(P, Q)} \mathbb{E}_{(x, y)\sim \gamma}\left[ d(x, y) \right]$$

where $\Gamma (P, Q)$ denotes the set of all possible joint distributions $\gamma(x, y)$ whose marginals are $P$ and $Q$ respectively, and $d(x, y)$ is a distance metric between points $x$ and $y$ which is in the units of the target variable, in this case unit area runoff $\left[ \frac{L}{s \cdot \text{km}^2} \right]$.

In the context of observed streamflow, the EMD can be used to compare the true distribution of unit area runoff at a given location (derived from measurements) to a modelled or assumed distribution. For example, if we have a model that predicts streamflow at various locations, the EMD helps us quantify how different the predicted distribution is from the distribution of observed values in the same units of the distributions, in this case UAR ($L/s/\text{km}^2$).  
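For one-dimensional distributions on a shared set of ordered states, the infimum above has a closed form: the area between the two cumulative distributions. The sketch below assumes $d(x, y) = |x - y|$ and bin centers sorted in ascending order; the function name and values are illustrative.

```python
import numpy as np

def emd_1d(p, q, centers):
    """1-D Earth Mover's Distance between histograms p and q on shared bin centers."""
    p, q, centers = (np.asarray(a, float) for a in (p, q, centers))
    cdf_gap = np.abs(np.cumsum(p - q))[:-1]      # |F_P - F_Q| between adjacent centers
    return float(np.sum(cdf_gap * np.diff(centers)))

centers = np.array([1.0, 2.0, 3.0, 4.0])         # bin centers in L/s/km^2 (illustrative)
p = np.array([1.0, 0.0, 0.0, 0.0])
q = np.array([0.0, 0.0, 0.0, 1.0])

print(emd_1d(p, q, centers))                     # all mass moves 3 L/s/km^2
```

Unlike the TVD and KL divergence, the result depends on how far apart the states are, which is why it carries the units of the variable.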

## Predictive Model

### Gradient Boosted Decision Trees

Gradient Boosting Decision Tree (GBDT) is a widely used machine learning algorithm that builds an ensemble of decision trees in a sequential manner. Each tree is trained to correct the errors of the previous trees, gradually improving the overall prediction accuracy. The method combines the strengths of decision trees with boosting, a technique that builds a strong predictor by sequentially adding weak learners, each focused on the errors that remain.

General procedure:

1. **Initialization**: The algorithm starts with an initial prediction, often the mean of the target values.
2. **Iterative Learning**: In each iteration, a new decision tree is trained to predict the residual errors (differences between the actual values and the current predictions).
3.  **Gradient Descent**: The algorithm uses gradient descent to minimize the loss function, guiding the training of each new tree to correct the errors of the ensemble.
4. **Combination**: The predictions of all trees are combined to form the final prediction, typically by summing the weighted outputs of the trees.
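The procedure above can be sketched from scratch with single-split trees (stumps) and squared loss, for which the negative gradient is simply the residual. This is a toy illustration on synthetic data, not the library implementation used in the study; all function names are hypothetical.

```python
import numpy as np

def fit_stump(x, r):
    """Return (split, left_value, right_value) minimizing squared error on r."""
    best_sse, best = np.inf, (None, r.mean(), r.mean())
    for s in np.unique(x)[:-1]:
        left, right = r[x <= s], r[x > s]
        sse = np.sum((left - left.mean())**2) + np.sum((right - right.mean())**2)
        if sse < best_sse:
            best_sse, best = sse, (s, left.mean(), right.mean())
    return best

def predict_stump(stump, x):
    split, left_value, right_value = stump
    if split is None:                       # no valid split: constant prediction
        return np.full(x.shape, left_value)
    return np.where(x <= split, left_value, right_value)

def fit_gbdt(x, y, n_trees=50, learning_rate=0.1):
    f0 = y.mean()                           # step 1: initial prediction
    pred, stumps = np.full(y.shape, f0), []
    for _ in range(n_trees):
        residual = y - pred                 # steps 2-3: negative gradient of squared loss
        stump = fit_stump(x, residual)
        pred = pred + learning_rate * predict_stump(stump, x)
        stumps.append(stump)
    return f0, learning_rate, stumps

def predict_gbdt(model, x):
    f0, learning_rate, stumps = model       # step 4: sum of weighted tree outputs
    return f0 + learning_rate * sum(predict_stump(s, x) for s in stumps)

rng = np.random.default_rng(1)
x = np.linspace(0, 6, 200)
y = np.sin(x) + rng.normal(scale=0.1, size=x.size)   # synthetic data

model = fit_gbdt(x, y)
mse_mean = np.mean((y - y.mean())**2)
mse_gbdt = np.mean((y - predict_gbdt(model, x))**2)
print(mse_mean, mse_gbdt)
```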

The GBDT model is selected for its strength in detecting nonlinear relationships in high-dimensional input feature sets, for the ability to set up training and testing for robust model training, and for the ability to test the relative importance of features.  One advantage of GBDT over the random forest (RF) approach used in {cite}`addor2018ranking` for hydrological signature prediction from attributes is that the training data is not limited by incomplete feature sets: the ensemble tree construction method, which uses random subsamples of both rows and columns, allows samples with missing attributes to remain in the training data.



### Application of GBDT for Predicting Entropy and Divergence Measures

In the context of predicting various information and divergence measures, GBDT can be used to model complex relationships between features and target variables. We use GBDT models to capture nonlinear patterns and predict KL divergence, Total Variation Distance, and Earth Mover's Distance by learning from a large set of catchment attributes.  

### Model Validation / Testing

To address the problem of overfitting, 5-fold cross validation is done on a training dataset which is separated at the outset from a test set to prevent information leakage from training to validation.  The GBDT model procedure is carried out as follows: 

1. Split the data into 95% training and 5% testing.  Note that the split is done by station id, not by row since rows contain pairwise information.  Since data from two stations is contained in each row, separating by station ID ensures that data from individual stations does not appear in both training and testing.
2. Run a number of cross validation iterations within the (95%) training set and determine which iteration yields the best average validation score, also evaluating the variability of validation performance scores for all folds per iteration.
3. Train on the (95%) training set with the approximately optimal hyperparameters, then generate predictions on the held-out 5% test set to determine the model performance on truly unseen data.
4. (Possibly) repeat steps 1-3 several times to evaluate selection bias in the held-out set. 
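The station-based split in step 1 can be sketched as follows. The function name and the synthetic pair table are hypothetical, and dropping rows that mix a training station with a held-out station is an assumption made here to keep the two sets disjoint; the study may treat such rows differently.

```python
import numpy as np

def split_pairs_by_station(pairs, test_frac=0.05, seed=0):
    """Split (proxy_id, target_id) rows so no station appears on both sides."""
    pairs = np.asarray(pairs)
    stations = np.unique(pairs)
    rng = np.random.default_rng(seed)
    n_test = max(2, int(round(test_frac * len(stations))))
    test_ids = rng.choice(stations, size=n_test, replace=False)

    in_test = np.isin(pairs, test_ids)            # shape (n_rows, 2)
    test_rows = in_test.all(axis=1)               # both stations held out
    train_rows = ~in_test.any(axis=1)             # neither station held out
    # ASSUMPTION: rows mixing a training and a test station are dropped
    return np.flatnonzero(train_rows), np.flatnonzero(test_rows)

# hypothetical example: all ordered pairs among 40 station ids
ids = np.arange(40)
pairs = np.array([(a, b) for a in ids for b in ids if a != b])
train_idx, test_idx = split_pairs_by_station(pairs)
print(len(train_idx), len(test_idx))
```

Cross validation folds within the training set can be built the same way, by partitioning station ids rather than rows.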

## Model Sensitivity to Key Assumptions

### Prior Distribution

The EMD and TVD are not affected by models that have frequencies of zero.  The KL divergence, however, must address $q_i = 0$, which is done by applying a prior distribution that represents the strength of belief in the model.  Priors are added by incorporating a uniform distribution of pseudo-counts into the simulated distribution, which is in fact the observed distribution of a proxy or donor catchment. This means that the confidence in the model, represented by the prior, is linked to the length of the observation period. Consequently, models that produce zero probabilities are penalized more as the length of the observation record increases, reflecting greater divergence.  

We test the effect of the prior by running the analysis under a range of prior distributions, using $10^c$ pseudo-counts for $c=-2, -1, 0, 1, 2$.
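A sketch of how a uniform pseudo-count prior removes zero frequencies is shown below. Whether the $10^c$ pseudo-counts are applied per state or in total is not specified above; this example assumes the total is spread evenly over all states. The counts are illustrative, not observed data.

```python
import numpy as np

def apply_uniform_prior(counts, pseudo_counts):
    """Add pseudo_counts, spread uniformly over all states, then normalize."""
    counts = np.asarray(counts, float)
    # ASSUMPTION: pseudo_counts is a total shared equally by all states
    smoothed = counts + pseudo_counts / counts.size
    return smoothed / smoothed.sum()

def kl_bits(p, q):
    mask = p > 0
    return float(np.sum(p[mask] * np.log2(p[mask] / q[mask])))

p_counts = np.array([40, 30, 20, 10])     # illustrative target counts
q_counts = np.array([50, 50, 0, 0])       # illustrative proxy counts with empty states
p = p_counts / p_counts.sum()

kl_by_c = {}
for c in (-2, -1, 0, 1, 2):
    q = apply_uniform_prior(q_counts, 10.0**c)
    kl_by_c[c] = kl_bits(p, q)
    print(c, round(kl_by_c[c], 3))
```

Because the pseudo-counts are added to raw counts, a fixed prior carries less weight as the proxy record grows, which is the link to record length described above.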

### Quantization

The dictionary size represents the number of states defined for the system, and it reflects the precision of observation.  Given that the streamflow observations are high precision in relation to rating curve uncertainty, the information content of the streamflow time series includes information about the observation setup and the data pre-processing, which are difficult to separate.  The number of quantization levels represents how much information is removed from the original time series data.  Using fewer symbols reflects a diminishing strength of belief in the observed data since a wider range of values is reassigned to a single value corresponding to the center of a bin.  Another interpretation is that it neglects information in the signal that is relevant to comparing long-term distributions.  In other words, if the shape of high-resolution runoff distributions truly reflects physical processes, then the smoothing effect of using fewer bins to describe the distribution ignores detail in proportion to the number of bins.  

We test the sensitivity of the predictive models to the dictionary size by running the analysis using quantizations of $2^b$ states ($b$ bits) for $b=2, 4, 6$.

### Partial Counts

Similarly to quantization effects, if observations are treated probabilistically by assuming an error distribution, the observed frequency of each state can reflect partial observations based on the proportion of the observation distribution within each bin. This smoothing effect increases with larger dictionary sizes, as observation density is low at both extremes. The impact of assuming a 10% uniform error distribution on streamflow observations is compared to the impact of not assuming such a distribution.
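The partial-count idea can be sketched as follows. The example assumes that "10% uniform" means each observation $x$ is treated as uniformly distributed on $[0.9x, 1.1x]$, that all observations are positive, and that any mass falling outside the bin edges is discarded; the function name and values are illustrative.

```python
import numpy as np

def partial_counts(x, edges, rel_err=0.10):
    """Fractional bin counts when each x is uniform on [x(1-rel_err), x(1+rel_err)]."""
    x = np.asarray(x, float)[:, None]             # column vector for broadcasting
    edges = np.asarray(edges, float)
    lo, hi = x * (1 - rel_err), x * (1 + rel_err)
    # length of each observation's interval that falls inside each bin
    overlap = np.minimum(hi, edges[1:]) - np.maximum(lo, edges[:-1])
    frac = np.clip(overlap, 0, None) / (hi - lo)
    return frac.sum(axis=0)

edges = np.linspace(0, 12, 7)                     # six bins of width 2 (illustrative)
obs = [1.0, 2.0, 5.0, 9.5]
counts = partial_counts(obs, edges)
print(counts)
```

An observation sitting on a bin edge, such as 2.0 here, contributes half a count to each neighbouring bin instead of a whole count to one.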

### Temporal Concurrency of Streamflow Observations

Regression and other regional information transfer models normally use concurrent data to control, in part, for temporal effects.  However, this criterion limits the amount of observed data used to train models.  The divergence metrics tested in this study neglect the temporal component of prediction and instead focus on similarity and difference in the long-term distributions.  Annual runoff distributions vary from year to year, which means non-concurrent years are expected to introduce uncertainty, but the distribution is expected to converge as the period of record increases.  The analysis is run twice: first training the model using distributions based on concurrent data only, and then using distributions derived from all observations including non-concurrent periods.

## Notes on f-divergence and consistency

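An f-divergence is the expectation, under $Q$, of a convex function $f$ of the likelihood ratio, with $f(1) = 0$: $D_f(P||Q) = \sum_i Q(x_i)\, f\left(\frac{P(x_i)}{Q(x_i)}\right)$.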
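As a hedged illustration, the standard form $D_f(P||Q) = \sum_i Q(x_i) f(P(x_i)/Q(x_i))$, with $f$ convex and $f(1)=0$, recovers the KL divergence and the TVD for particular choices of $f$. The sketch assumes $Q(x_i) > 0$ for all states; the function names and example values are illustrative.

```python
import numpy as np

def f_divergence(p, q, f):
    """D_f(P||Q) = sum_i q_i * f(p_i / q_i), assuming q_i > 0 for all i."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum(q * f(p / q)))

def f_kl(t):
    # f(t) = t * log2(t), with the limiting value 0 at t = 0
    return np.where(t > 0, t * np.log2(np.where(t > 0, t, 1.0)), 0.0)

def f_tv(t):
    # f(t) = |t - 1| / 2
    return 0.5 * np.abs(t - 1.0)

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.25, 0.25, 0.5])
print(f_divergence(p, q, f_kl))   # KL divergence in bits
print(f_divergence(p, q, f_tv))   # total variation distance
```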

### Types of Expectation of the Loss Function

In general, consistency means that as the number of data points gets large, the estimate converges to the right answer.  One way to test consistency is to look at bias and variance separately.  The frequentist expectation $\mathbb{E}_\theta$ is concerned with "other sample points beyond the ones you have", while the Bayesian expectation focuses on "fixed X", the sample points one does have.

1.  The unconditional (frequentist) and conditional (Bayesian) expectation of the loss function:

* **Frequentist**: $\mathbb{E}_{\theta}\, \ell \left(\delta(X), \theta \right)$ is the "expectation over the sample space": $\theta$ is held fixed and the expectation is taken with respect to $X$, whose probability distribution is "indexed by" $\theta$.
* **Bayesian**: $\mathbb{E} \left[ \ell \left(\delta(X), \theta \right) | X \right]$, where $\theta$ is the random variable and the expectation is conditional on $X$.

Where:  
* $\ell$ is some loss function (square, absolute, zero-one, etc.),
* $\delta(X)$ is the statistical procedure, which for simplicity is also denoted $\hat \theta$,
* $\theta$ is the parameterization of some distribution, and $X$ is the target variable.

E.g. the frequentist risk of the square loss $\ell$:
* square loss: $\ell(\delta(X), \theta) = \ell(\hat \theta, \theta) = (\hat \theta - \theta)^2$
    * the $\hat \theta$ is a random quantity from the frequentist point of view, and
    * $\theta$ is a random quantity from the Bayesian point of view.
* The risk: $R_{\theta} = \mathbb{E}_{\theta} \left( \hat \theta - \theta \right)^2 = \mathbb{E}_{\theta} \left( \hat \theta - \mathbb{E}_{\theta} \hat \theta - (\theta - \mathbb{E}_{\theta} \hat \theta) \right)^2 = \mathbb{E}_{\theta} \left( \hat \theta - \mathbb{E}_{\theta} \hat \theta \right)^2 + \left(\theta - \mathbb{E}_{\theta} \hat \theta \right)^2$
    * the first quantity $\mathbb{E}_{\theta} \left( \hat \theta - \mathbb{E}_{\theta} \hat \theta \right)^2$ is the variance,
    * the second quantity $\left( \theta - \mathbb{E}_{\theta} \hat \theta \right)^2$ is the squared bias,
    * the cross term vanishes because $\mathbb{E}_{\theta} \left( \hat \theta - \mathbb{E}_{\theta} \hat \theta \right) = 0$,
    * this is the frequentist expectation because it is over the sample space.
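The decomposition of the frequentist risk into variance plus squared bias can be checked numerically by simulating many samples with $\theta$ held fixed. The estimator below (a shrunken sample mean) is a deliberately biased, hypothetical choice made only for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
theta, n, n_rep = 2.0, 10, 20000          # fixed parameter, sample size, repetitions

# delta(X): shrunken sample mean, evaluated on n_rep independent samples
samples = rng.normal(loc=theta, scale=1.0, size=(n_rep, n))
theta_hat = 0.8 * samples.mean(axis=1)

risk = np.mean((theta_hat - theta) ** 2)  # expectation over the sample space
variance = np.var(theta_hat)
bias_sq = (np.mean(theta_hat) - theta) ** 2

print(risk, variance + bias_sq)
```

The two printed values agree up to floating point error, and both are close to the theoretical risk $0.8^2/10 + (0.8 \cdot 2 - 2)^2 = 0.224$.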



### Surrogate Loss Functions, f-Divergences, and Experimental Design


* **Problem**: Find the decision $(Q, \gamma)$ that minimizes the probability of error $P\left(Y \neq \gamma(Z)\right)$

* The decision has two parts: determining the quantizer and the discriminant function **jointly**.  Historically, the two main approaches each address half of the problem.

In statistical machine learning:
* Q is assumed known and the problem is to find the discriminant function $\gamma$
* done by minimization of a "surrogate loss function", as in boosting, logistic regression, and SVMs.
* decision-theoretic, "consistency results"?

In signal processing:
* all the probabilities are known
* discriminant can be determined by Bayes rule (how?)
* but focuses on getting Q: once Q is found, Bayes rule gives the discriminant.
* focused on f-divergences as a heuristic -- "want to push distributions apart" -- then a principle for determining the quantizer is maximizing the f-divergence (or experimental design in general)

Goal is to find the discriminant function and the quantizer that minimize the probability of error.  

Either Q is assumed known and the task is to find the discriminant $\gamma$ (machine learning), or all probabilities are assumed known and the focus is on finding the quantizer Q, after which the discriminant follows from Bayes rule (signal processing).


### Blackwell (1951)

* If a procedure A has a smaller f-divergence than a procedure B (for some fixed $f$), then there exists some set of prior probabilities such that procedure A has a smaller probability of error than procedure B.
* Given that it is intractable to minimize probability of error, f-divergences have been used as surrogates for probability of error
    * i.e. choose a quantizer Q by maximizing an f-divergence between $P(Z|Y=1)$ and $P(Z|Y=-1)$
        * Hellinger distance (Kailath 1967; Longo et al. 1990)
        * Chernoff distance (Chamberland & Veeravalli, 2003)
 
* Supporting arguments from asymptotics
    * Kullback-Leibler divergence in the Neyman-Pearson setting
    * Chernoff distance in the Bayesian setting
      

From {cite}`nguyen2009surrogate`:

>"An interesting example of this more elaborate formulation is a “distributed detection” problem, in which individual components of the d-dimensional covariate vector are measured at spatially separated locations, and there are communication constraints that limit the rate at which the measurements can be forwarded to a central location where the classification decision is made."

In the example above, the d-dimensional covariate vector X represents d independent observation points with their own signals.  The quantizer Q is the decision of how to map the signals of X to some lower dimension such that a classification ...

Given a number of sensors outputting independent signals X, where there is a communication constraint preventing the processing of the complete set of raw signals, a quantizer Q is some (fixed) mapping of X to (lower dimensional) Z.  One half of the problem is to determine Q that minimizes prediction error $P\left(Y \neq \gamma(Z)\right)$ and the other half of the problem is to determine the discriminant function that ...

Can we map the streamflow monitoring network problem in similar terms?

## The General Problem Description

Consider the set of points defining river network confluences as a space of potential locations to monitor streamflow.  A simple approach to predicting streamflow in unmonitored basins (PUB) might be to assume: 
1) a range of streamflow that could occur $[a, b]$, and
2) a prior distribution over some set of states $\omega$ defining streamflow at a point in time.

The states could be defined in relation to some application of interest, or in relation to measurement error/uncertainty.  For example, if the rating curve (stage-discharge relationship) uncertainty is 10% for flows between 0 and 1 $m^3/s$, the appropriate precision is in the order of $0.1 m^3/s$, or to define ten states (discrete intervals) between 0 and 1.  The definition of states is referred to as the quantization of the streamflow signal.

Observing streamflow at one location provides information that can be used to predict streamflow at other locations.  A simple and common first approximation for prediction in ungauged basins (PUB) is to assume the discharge at distinct locations is equal on a unit area basis (UAR model).  This is a reasonable and useful assumption where the locations receive similar climate inputs due to close spatial proximity, or where the catchments are similar in terms of the characteristics related to the mechanisms governing the rainfall runoff response, such as the vegetation, soil, or slope which play a role in how water moves across the surface, into and through the soil, and into the atmosphere.  
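The equal unit area runoff assumption amounts to scaling the proxy flow by the ratio of drainage areas. A minimal sketch follows; the function name, flows, and areas are illustrative values, not data from the study.

```python
import numpy as np

def uar_transfer(q_proxy, area_proxy_km2, area_target_km2):
    """Estimate target flow by assuming equal unit area runoff at both sites."""
    uar = np.asarray(q_proxy, float) / area_proxy_km2     # m^3/s per km^2
    return uar * area_target_km2

q_proxy = np.array([1.2, 3.4, 0.8])          # illustrative daily flows [m^3/s]
q_target = uar_transfer(q_proxy, area_proxy_km2=100.0, area_target_km2=250.0)
print(q_target)
```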

A **proxy** location is a monitored catchment whose observations can be used to estimate runoff at a **target** location defining a different catchment. Both locations are watersheds, defined as land areas where all precipitation drains to a common outlet.  A simple PUB model uses the distribution of UAR $(\mathbf{Q})$ at a proxy location to estimate the distribution of UAR $(\mathbf{P})$ at a target location.  Both locations can be characterized by a set of measurable attributes associated with the rainfall-runoff response.

The goal in this exercise is to select the sensor location (or N locations as resources permit) that minimizes the probability of prediction error, in this case incorrect frequency estimation across the unmonitored space.  Stated another way, the goal is to determine how to prioritize adding stations so as to maximize the expected reduction in uncertainty over the unmonitored space.  The expectation of uncertainty reduction from adding a sensor (monitoring station) is estimated by developing a relationship between catchment attributes and the similarity of $(\mathbf{P})$ and $(\mathbf{Q})$.  In cases where minimizing the probability of error is intractable (support why, namely), f-divergences have been used as surrogates for probability of error (Blackwell 1951, Nguyen/Jordan 2009).  (An f-divergence is the expectation, under $\mathbf{Q}$, of a convex function of the likelihood ratio.)

Extending the example of the UAR PUB model above, the maximum uncertainty prior, a uniform distribution $\mathbb{U}$ over $\omega$, is in many cases a better model for $\mathbf{P}$ than $\mathbf{Q}$.  "Better" in this case means the distributions are more similar, or equivalently that the expected message length needed to transmit observations drawn from the true distribution is shorter.  This feature leads to a possible binary classification / prediction problem.


## Binary prediction formulation 

Consider an unmonitored stream network where we assume a uniform prior distribution at all possible observation locations.  It is not the case that adding a sensor in some location will provide predictive information for all unmonitored locations.  For many catchments, the uniform prior will be closer to the "true" distribution $\mathbf{P}$.  The term "true" refers to the distribution based on the set of observations held out of sample for computing the loss.  The probability that a potential monitoring location is a "better" proxy than the uniform prior for the UAR distribution at another location describes a binary prediction problem formulation (test the balance of data across bitrates!).  

The discriminant function is what quantifies a proxy distribution as "better".  Since the minimization of prediction error in such a formulation is not readily computable or is intractable (because?), the aim is to determine the f-divergence that maps to an equivalent (loss or discriminant) function that is a (consistent) surrogate of prediction error.  The measures profiled for consistency are the total variation distance (TVD) and the Kullback-Leibler divergence (KLD), which are f-divergences, and the Earth Mover's Distance (EMD), also known as the Wasserstein distance, which is an optimal transport distance rather than an f-divergence.

Consider a covariate vector $\mathbf{X}$ of catchment attributes representing proxy (model) and target (predicted) catchments, and a quantization $\mathbf{C}$ mapping streamflow $\mathbf{S}$ from continuous to discrete $\mathbf{Z}$.  $\mathbf{P}$ and $\mathbf{Q}$ are the target and proxy distributions of $\mathbf{Z}$, respectively.  The discriminant is then a multi-part problem where (what is known and what isn't?) 

From {cite}`nguyen2009surrogate`, the total variation distance is the f-divergence consistent with the 0-1 loss in procedures that optimize simultaneously over the discriminant function $\gamma$ and the quantizer $Q$.



* The loss function penalizes incorrect predictions that a proxy location UAR distribution is a better "model" than the uniform distribution when it actually isn't. The goal is to minimize the (Bayes) error $R_{\text{Bayes}}(\gamma, C) := \mathbb{P} \left(Y \neq \text{sign}\left(D_{KL}(\mathbf{P}||\mathbb{U}) - D_{KL}(\mathbf{P}||\mathbf{Q})\right) \right)$
* For each f-divergence, there is a one-to-many mapping from divergence to consistent loss functions.  That is, there is a set of loss functions that yield (Bayes) consistency, meaning they approach the "true" value as the sample size increases.

$$\text{min } P \left(Y \neq \gamma (X, C(S) \rightarrow Z) \right)$$




$$\text{min } P \left(D_{KL}(P||Q) < D_{KL}(P|| \mathbb{U} ) |X \right)$$
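The binary target in this formulation can be sketched as a label that is positive when the proxy distribution is closer (in KL divergence) to the target distribution than the uniform distribution is. The function names and distributions are illustrative; the sketch assumes the proxy has nonzero probability wherever the target does, and treats ties as negative.

```python
import numpy as np

def kl_bits(p, q):
    mask = p > 0
    return float(np.sum(p[mask] * np.log2(p[mask] / q[mask])))

def proxy_beats_uniform(p, q):
    """Label +1 if proxy Q is closer to target P (in KL) than the uniform U, else -1."""
    u = np.full_like(p, 1.0 / p.size)
    return 1 if kl_bits(p, u) - kl_bits(p, q) > 0 else -1

p = np.array([0.6, 0.2, 0.1, 0.1])          # target (illustrative)
q_good = np.array([0.5, 0.3, 0.1, 0.1])     # similar proxy
q_poor = np.array([0.05, 0.05, 0.1, 0.8])   # dissimilar proxy

print(proxy_beats_uniform(p, q_good), proxy_beats_uniform(p, q_poor))
```

A classifier would then be trained to predict this label from the catchment attributes of the proxy and target pair.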

### Universal equivalence among surrogate loss functions

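For a given (general, i.e. not necessarily 0-1) surrogate loss function $\phi$, the $\phi$-risk can be represented as $R_{\phi}(\gamma, \mathbf{Q}) = \sum_z \phi (\gamma(z))\mu(z) + \phi(-\gamma(z))\pi(z)$, where, following {cite}`nguyen2009surrogate`, $\mu(z) = P(Y=1, Z=z)$ and $\pi(z) = P(Y=-1, Z=z)$.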

Take the infimum over the discriminant function:
* **Infimum (inf)**: The infimum of a set is the greatest value that is less than or equal to every element of the set. It is a type of "lower bound." In cases where the set has a minimum value, the infimum is equal to the minimum. However, if the set does not have a minimum (but has a lower bound), the infimum is the largest value that does not exceed any element of the set.

## Citations

```{bibliography}
:filter: docname in docnames
```