In [1]:
import os
from time import time
from scipy.stats import entropy
import multiprocessing as mp
import data_processing_functions as dpf
import numpy as np
import pandas as pd
import geopandas as gpd

# visualize the catchment centroid locations
from bokeh.plotting import figure, show
from bokeh.layouts import gridplot
from bokeh.io import output_notebook, export_png
from bokeh.models import ColumnDataSource, LinearAxis, Range1d
from bokeh.palettes import Colorblind, Sunset10
output_notebook()

from scipy.stats import lognorm, expon, kappa4, gaussian_kde
from scipy.special import kl_div, gammaln
from scipy.optimize import minimize

import jax
import jax.numpy as jnp
from jax.scipy.stats import gaussian_kde as jkde

BASE_DIR = os.getcwd()

## Streamflow observation as a model of streamflow observation

The streamflow monitoring network is a set of "models" of streamflow at a specific set of locations.  The rating curve equation *compresses* streamflow observations at monitoring stations since only the (three) parameters defining the power law relationship are needed to know the flow for any water level.  Let $X_m$ represent streamflow at a monitored location.

Streamflow records at monitored locations also contain predictive information about streamflow in ungauged basins (PUB).  "Predictive" can mean "the ungauged catchment will generate runoff as a function of the observed catchment runoff $f(X_m)$" or it can mean "the overall basin response to precipitation is a function of the observed catchment response $g(\text{Pr}(X=x))$".  

The overarching motivating question asks "*how well does a monitoring network represent a region **overall**?*", so the latter interpretation of prediction is the focus here.  In other words, what arrangement of monitoring stations provides the closest representation of an unmonitored region *overall* in terms of long-term behaviour?  This question doesn't solve the problem of predicting the timing and magnitude of flow at all ungauged locations, but it corresponds to the objective of covering the range of basin response "shapes" for the ungauged region, which may not be reflected in the existing network.  

The expected shape of an unobserved catchment is not estimated directly.  Instead, the divergence from another distribution is estimated based on a very large sample of distribution comparisons using catchment attributes of both catchments as predictors.  In other words, the physical descriptors of catchments are used to predict how different the distributions are expected to be.


## Characterizing Runoff Variability

In hydrology, the overall basin response to precipitation is described by the flow duration curve (FDC), which is a non-parametric representation of runoff that expresses the percentage of time that a certain flow rate is equaled or exceeded over some period.  A probability density function (PDF) describes the underlying probability density that the FDC is derived from, which emphasizes the frequency that certain ranges of flows are expected to occur.  The PDF captures the overall runoff behaviour in a catchment, reflecting how the system of storages and flux controls shape the hydrological response.

To evaluate how well the monitoring network reflects the runoff behaviour of a region *overall*, we compare PDFs between many different locations and quantify the differences in a pairwise manner.  The difference or divergence between two probability distributions is quantified by the Kullback-Leibler (KL) divergence, defined as the "extra description length" when using an approximate distribution $Q$ to represent the "ground truth" distribution $P$.  In this definition, the approximation $Q$ is the distribution generated by a rainfall-runoff or regional transfer model that simulates $P$.
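
For reference, the standard discrete form of the KL divergence can be written as a difference of two description lengths.  This is the textbook identity rather than anything specific to this notebook, and the symbols $p_i$, $q_i$ (state probabilities under $P$ and $Q$) and $H(\cdot)$ are introduced here for illustration:

$$D_\text{KL}(P||Q) = \sum_i p_i \log_2 \frac{p_i}{q_i} = \underbrace{-\sum_i p_i \log_2 q_i}_{H(P, Q)} - \underbrace{\left(-\sum_i p_i \log_2 p_i\right)}_{H(P)}$$

The first term is the average number of bits per sample when data from $P$ are encoded with a code built for $Q$, and the second is the minimum achievable with a code built for $P$ itself, so their difference is the "extra description length".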

Defining differences between PDFs in terms of "extra description length" reflects a fundamental link between prediction and compression.  If the process that generates $P$ can be defined by its physics or by some relationship with a related observable variable, then direct observation of runoff is not required.  Instead, a *compressed* representation of the process can be stored in the form of a governing equation, or an encoding that minimizes the amount of computer disk space needed to store the sequence of bits and the instructions needed to decode these exactly into the original observations. 



## Optimal Encoding

>*It's all a story of the interplay between underlying computational irreducibility and our nature as computationally bounded observers.* -Stephen Wolfram (from The Second Law)

Streamflow observations (and environmental signals generally) are observed, stored, and operated on computationally, and in computer systems data are fundamentally bound by nested layers of encoding to allow computers to handle a range of data types optimally for storage and computational operations.  One encoding might use more disk space to make read-write operations faster, etc.  

A series of observations can similarly be *encoded* by defining a set of distinct 'states' or 'symbols' that represent different values within the data.  The ‘dictionary’ of these states can be designed based on the frequency of each state’s occurrence, allowing for a compressed representation that maximizes information content while preserving the essential patterns of the original signal.  Optimal encoding of time series observations relies on the algorithm used to assign bit strings to symbols based on their frequencies, and it also depends critically on how the states are defined in the first place.  Algorithms like Huffman and arithmetic coding can be counted on to approach the theoretical limit of compression described by the (Shannon) Entropy of the source distribution.  The compression limit is the minimum average description length needed to perfectly encode the information contained in a sequence of symbols.
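
As a rough illustration of the claim that Huffman coding approaches the entropy limit, the sketch below builds Huffman code lengths for a small, made-up symbol distribution and compares the average code length with the Shannon entropy.  The function names and the example probabilities are assumptions for illustration only and are not part of the analysis that follows.

```python
import heapq
import math

def huffman_code_lengths(probs):
    """Return the Huffman code length (in bits) assigned to each symbol."""
    # heap items: (subtree probability, unique tie-breaker, symbols in the subtree)
    heap = [(p, i, [i]) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    lengths = [0] * len(probs)
    counter = len(probs)
    while len(heap) > 1:
        p1, _, s1 = heapq.heappop(heap)
        p2, _, s2 = heapq.heappop(heap)
        # merging two subtrees adds one bit to every symbol beneath them
        for s in s1 + s2:
            lengths[s] += 1
        heapq.heappush(heap, (p1 + p2, counter, s1 + s2))
        counter += 1
    return lengths

def entropy_bits(probs):
    """Shannon entropy in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

probs = [0.5, 0.25, 0.125, 0.125]  # made-up symbol frequencies
lengths = huffman_code_lengths(probs)
avg_len = sum(p * n for p, n in zip(probs, lengths))
print(lengths, avg_len, entropy_bits(probs))
```

For probabilities that are exact powers of two, as here, the average length equals the entropy.  In general a Huffman code is only guaranteed to fall within one bit per symbol of the entropy.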

> *For an optimal encoding, the true distribution $P$ provides the shortest possible average description length for data sampled from $P$.*

A catch in this definition is that the "true" probability distribution must be defined in a form that supports quantitative analysis.  

## Probabilistic Representations of Observed Sequences

The probability distribution can be described in several ways, each with its own set of implications for data representation and computation.  Distributions are expressed in parametric and nonparametric, continuous and discrete forms:

* **(continuous) parametric**: the parameters of some distribution are calibrated to the data, often using Maximum Likelihood Estimation (MLE).  This is itself an approximation of the data, and there is no one best parametric form for all streamflow time series.  Log-normal has been found to work well in some places, exponential in arid places, and the four-parameter Kappa "best overall" (Castellarin, 2007).  No matter what family of distributions is chosen, this approach still represents curve fitting.
* **nonparametric**:
    * **continuous**: the kernel density estimator (KDE) treats individual observations as distributions themselves, and their sum creates a smooth, continuous distribution function based (almost) purely on the data.  The caveat is the kernel represents the probability distribution of individual observations, which is difficult to quantify for individual observations, and probably varies considerably.  The simplest form is to assume a constant kernel across all observations, and the computational cost of this approach is significant to begin with.
    * **discrete**: a simpler approach is to define a set of intervals that describe "bins", or ranges of flow to represent the expected precision of streamflow observations, or some other question of practical interest.  This approach is somewhat analogous to assuming some kind of kernel since defining bin edges means that all observations falling within any interval are treated equal.  Too coarse a binning risks erasing real information, and too fine binning is both inefficient and potentially misleading assuming there is uncertainty in the observations.
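
The three representations listed above can be sketched on a synthetic sample.  Everything in this block is an assumption made for illustration: the sample is drawn from a log-normal distribution rather than from a streamflow record, the KDE is fit in log space with SciPy's default bandwidth, and 16 log-spaced bins are used for the discrete case.

```python
import numpy as np
from scipy.stats import lognorm, gaussian_kde

rng = np.random.default_rng(0)
# synthetic stand-in for a runoff sample (not real data)
x = rng.lognormal(mean=1.0, sigma=0.8, size=2000)

# (a) continuous parametric: fit a log-normal by MLE with location fixed at 0
shape, loc, scale = lognorm.fit(x, floc=0)

# (b) continuous nonparametric: Gaussian KDE on log-transformed values,
#     using the same (default) kernel bandwidth for every observation
kde = gaussian_kde(np.log(x))

# (c) discrete nonparametric: relative frequencies over 16 log-spaced bins
edges = np.logspace(np.log10(x.min()), np.log10(x.max()), 17)
counts, _ = np.histogram(x, bins=edges)
pmf = counts / counts.sum()

print(f"fitted sigma={shape:.2f}, fitted median={scale:.2f}")
print(f"KDE density at log(2): {kde(np.log([2.0]))[0]:.3f}")
print(f"largest bin probability: {pmf.max():.3f}")
```

Each representation requires a modelling choice (the family, the bandwidth, or the bin edges) that is not determined by the data alone.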

The three categories above are "data driven" approaches to representing a "true distribution" $P$.  In each case, there is some kind of model applied, whether it is a) the parametric form, b) the kernel, or c) the quantization.  Each of these "models" introduces epistemic (model structure) uncertainty.


### Motivating Example

Imagine a simple system that we know nothing about besides that it generates (small) non-negative integers.  Before making any observations, assume we had reason to be curious about the number of times the following values were produced: i) 0, ii) between 1 and 4 inclusive, and iii) between 5 and 9 inclusive.  These questions can be reflected by defining the system states as $\Omega = \{[0, 1), [1, 5), [5, 10)\}$.  Assume four observations are then recorded, $X = [1, 3, 4, 1]$, yielding a likelihood of $\text{Pr}(X|\Omega) =  [0, 1, 0]$.  

Ignoring deep nuance in coding theory, we could define a coding dictionary for this system as $\mathcal{C} = \{ [0, 1): \text{00}, [1, 5): \text{01}, [5, 10): \text{10}\}$, yielding the encoded sequence $\mathcal{C}(X) = [01, 01, 01, 01]$.  A perfect model could be achieved by simply predicting the mean (2.25), or any constant in the range $\omega_1 = [1, 5)$.  An infinite number of models exist to satisfy these predictions perfectly, and none of them are useful or informative.  Our choice of quantization, that is the intervals we defined our system over, eliminated the variance in the original signal.  

Finding our model uninformative, we could collect more observations to see if the system visits other states of interest (test ergodicity).  Alternatively we could redefine $\Omega$ if there is something of practical interest in the observed variance $X = [1, 3, 4, 1]$.  An alternate set of states could be defined from the observations as $\Omega_1 = \{[0, 2), [2, 4), [4, 6)\}$ which for the same $X$ yields a likelihood $\text{Pr}(X|\Omega_1) = [0.5, 0.25, 0.25]$.  The same coding dictionary yields $\mathcal{C}(X) = [00, 01, 10, 00]$ which clearly contains less redundancy and therefore preserves more of the original signal variance.  

The (Shannon) entropy describes the lower bound on the average description length per symbol for this distribution, given by $-\sum p \log_2 p = -(0.5 \log_2 0.5 + 2\cdot ( 0.25 \log_2 0.25)) = 1.5$ bits.  
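
The two quantizations in this example can be checked numerically.  The sketch below is a minimal illustration with helper names chosen here (`state_probabilities`, `entropy_bits`).  Note that `np.histogram` closes the right-most bin on the right, which makes no difference for these four observations.

```python
import numpy as np

def state_probabilities(x, edges):
    """Relative frequency of observations in each interval defined by edges."""
    counts, _ = np.histogram(x, bins=edges)
    return counts / counts.sum()

def entropy_bits(p):
    """Shannon entropy in bits, ignoring empty states."""
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

X = np.array([1, 3, 4, 1])
p_omega = state_probabilities(X, [0, 1, 5, 10])   # original states
p_omega1 = state_probabilities(X, [0, 2, 4, 6])   # redefined states

print(p_omega, entropy_bits(p_omega))
print(p_omega1, entropy_bits(p_omega1))
```

The first quantization places all observations in a single state, so its entropy is zero bits, while the second recovers the 1.5 bits computed above.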


## Epistemic Noise

In this section we revisit the idea of a "True distribution P" and how it is affected by model choices and measurement uncertainty.  How do we choose a representation that best suits a particular problem, that supports the broadest possible questions, or better yet anticipates future questions?  The short answer is by retaining the most information in the original signal.  Ultimately we want a representation of P that preserves information that can be used to support decisions.  

There is plenty to explore on the side of reducing uncertainty in the observation itself, and developing error models to reduce uncertainty after the fact.  Since we are working with a large database of historical observations, we can't change how the observations were made in the past.  Developing a model of uncertainty that covers data collection across different governing bodies, across generations of technologists and measurement technologies, and across quality assurance adjustments that cannot be replicated is not attempted here, so we assume there are varying degrees of uncertainty in the data and that we do not know its structure.  

Instead, we test the information content of the signal by varying the degree of noise added (or variance removed) to the observations used to train a predictive model.  A central component of this model is quantifying the difference between distributions representing simulated and observed runoff.

## Comparing Probability Distributions

The interpretation of the divergence between two distributions is affected by how the distributions are represented.  The example quantization preserved no information about the set of observations (besides a range of values).  The question is how do we preserve the most information from the original signal?  Any information we lose from quantization is unavailable for subsequent computation (or learning).  

Coming back to simulating the process that generates $X$, let the distribution of simulated values be denoted by $Q$.  If we want to quantify how much $Q$ diverges from $P$, the quantization should aim to preserve as much information in $P$ as possible.  But what about the information in $Q$?  If we compare the distributions based on a range most suitable to preserving information in $P$, then we are bound to find many cases where the quantization is not suited to preserve information in $Q$, in particular when the ranges do not align.  

We could quantize $P$ and $Q$ to:
1. the range of the target $P$: best aligns the chosen quantization method with the "ground truth", so preserves the most information for a given quantization.
2. the range of $Q$: a more intuitive way of approaching PUB since the target range is normally unknown.  Since $D_{KL}(P||Q)$ is not symmetric, this
3. some general range (practical question): An example would be based on measurement uncertainty, or on an overall desired precision for "bands" or ranges of unit area runoff to cover all possible states in all possible catchments.
4. **Maximum uncertainty: retains the most information in $P$** by setting bin edges according to equiprobable states.  This yields highly variable bin widths according to frequency of observation.

### Target/Proxy Range Binning


### Application-based Binning


### Maximum Uncertainty Binning

A binning scheme that maximizes uncertainty also maximizes the information in the distribution.  Since ultimately we are measuring how much information is lost when approximating $P$ with $Q$, if $P$ is binned to maximize uncertainty (uniform distribution), equal probability binning reduces bias introduced by "arbitrary" bin edges which risk overrepresenting areas with few observations.  The tradeoff is that heavy-tailed distributions, which distributions of streamflow observations are typically characterized by, lose the greatest amount of precision, or in other words the added noise is concentrated in the high magnitude bins.  
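
A minimal sketch of equal-probability binning is given below, assuming a synthetic heavy-tailed (log-normal) sample in place of a real streamflow record and an arbitrary choice of 8 states.  Bin edges are taken from sample quantiles, so each state holds roughly the same number of observations while the bin widths vary.

```python
import numpy as np

rng = np.random.default_rng(42)
# synthetic heavy-tailed sample standing in for unit area runoff
flow = rng.lognormal(mean=0.0, sigma=1.5, size=10_000)

n_states = 8  # illustrative dictionary size (3 bits)
# equiprobable bin edges: the 0th, 12.5th, ..., 100th percentiles
edges = np.quantile(flow, np.linspace(0, 1, n_states + 1))

counts, _ = np.histogram(flow, bins=edges)
p = counts / counts.sum()
entropy = -np.sum(p * np.log2(p))
widths = np.diff(edges)

print(counts)                      # near-equal counts per state
print(round(float(entropy), 3))    # close to log2(n_states)
print(widths.round(3))             # widths grow sharply toward the upper tail
```

The widest bin is the one covering the largest flows, which is where the loss of precision described above is concentrated.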

## Visualize the Noise added by Quantization and the Prior Assumption

The objective is to determine the description length "cost" associated with an (imperfect) simulation $\hat X_T$ instead of having the "ground truth" $X_T$ observations.  The description length cost is a function of the accuracy of the state frequencies, or the accuracy of the simulated flow duration curve (FDC).  A poor simulation of the target catchment will "cost" $D_\text{KL}(T||\hat T)$ bits per sample because errors in estimating the "ground truth" FDC result in mis-allocation of bits.  The encoding algorithm is near optimal *assuming the frequencies*, so incorrect frequencies mean that the expected message length will be longer because the true frequencies were not used to create the coding dictionary.  The cost is the KL divergence, and it is being proposed as a discriminant function to choose one proxy over another in simulating a target.  
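
The "mis-allocation of bits" can be made concrete with idealized code lengths of $-\log_2 q_i$ bits for state $i$.  The sketch below uses two made-up four-state distributions (assumptions for illustration, not derived from the data) and shows that the expected message length under the wrong dictionary exceeds the optimum by exactly the KL divergence.

```python
import numpy as np

p = np.array([0.5, 0.25, 0.125, 0.125])  # "true" state frequencies (made up)
q = np.array([0.25, 0.25, 0.25, 0.25])   # frequencies assumed by the dictionary

len_for_p = -np.log2(p)  # ideal code lengths if the dictionary were built from p
len_for_q = -np.log2(q)  # ideal code lengths for a dictionary built from q

optimal_bits = np.sum(p * len_for_p)   # entropy H(P)
actual_bits = np.sum(p * len_for_q)    # cross-entropy H(P, Q)
kl_bits = np.sum(p * np.log2(p / q))   # D_KL(P||Q)

print(optimal_bits, actual_bits, kl_bits)
```

Here the optimal dictionary needs 1.75 bits per sample, the mismatched dictionary needs 2 bits, and the difference of 0.25 bits per sample is the KL divergence.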

The calculated cost of frequency estimation error $D_\text{KL}(T||\hat T)$ is affected by the methodology in two ways:  
* using small dictionaries reduces the variance of the original signal, and
* assuming a prior on the likelihood changes the shape of the distribution in proportion to the prior.

The goal is to measure how much the dictionary size and the choice of prior affect the KL divergence in comparison to the discriminant value being used to choose between proxies for the better simulation (compression) of a target.  If the discriminant is larger than the vestigial effects of the model, then we have confidence that the epistemic a.k.a methodological uncertainty does not overpower the "genuine differences" between any proxy $S$ and any target $T$.  

For two catchments where streamflow monitoring observations have been recorded, let one location represent a **Surrogate** (a.k.a. proxy) catchment that will be used to simulate runoff at a **Target** catchment.  In this case the target catchment is also observed such that the model can be evaluated against "ground truth" observations.

1. Import full streamflow timeseries of both.  Test for minimum 1 year concurrent data.
    * 1 complete year contains 12 months with less than 5 days missing in any month.
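
The completeness criterion can be sketched as follows.  This is not the implementation of `dpf.count_complete_years`, which is not shown in this notebook.  It is a hypothetical stand-in that assumes a daily series with a `DatetimeIndex` and interprets "less than 5 days missing" as at most 4 missing days per month.

```python
import numpy as np
import pandas as pd

def count_complete_years_sketch(series, max_missing_days=4):
    """Count calendar years in which all 12 months have at most
    `max_missing_days` missing daily values (hypothetical helper)."""
    valid = series.dropna()
    ym = pd.DataFrame({"year": valid.index.year, "month": valid.index.month})
    # number of valid daily observations per (year, month)
    n_obs = ym.groupby(["year", "month"]).size().to_dict()
    n_complete = 0
    for year in sorted(set(ym["year"])):
        months_ok = 0
        for month in range(1, 13):
            n_days = pd.Period(year=int(year), month=month, freq="M").days_in_month
            if n_days - n_obs.get((int(year), month), 0) <= max_missing_days:
                months_ok += 1
        n_complete += int(months_ok == 12)
    return n_complete

# two years of synthetic daily data with a 10-day gap in March 2001
idx = pd.date_range("2000-01-01", "2001-12-31", freq="D")
flow = pd.Series(np.ones(len(idx)), index=idx)
flow.loc["2001-03-05":"2001-03-14"] = np.nan
print(count_complete_years_sketch(flow))
```

With the 10-day gap, only the year 2000 counts as complete under this interpretation.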


In [2]:
# import streamflow record timeseries
attributes_filename = 'BCUB_watershed_attributes_updated.csv'
attributes_fpath = os.path.join(os.getcwd(), 'data', attributes_filename)
attr_df = pd.read_csv(attributes_fpath)
attr_df.columns = [e.lower() for e in attr_df.columns]
# df.columns
filtered_stns = sorted(list(set(attr_df['official_id'].values)))

In [3]:
# choose an example proxy/target pair of stations
test_proxy, test_target = filtered_stns[15], filtered_stns[16]

# Retrieve the data for both stations
# this is all data, including non-concurrent
adf = dpf.retrieve_nonconcurrent_data(test_proxy, test_target)

# add a very small amount of noise to prevent duplicate bin edges
for stn in [test_proxy, test_target]:
    depth = 1e-9
    noise = np.random.uniform(-depth, depth, size=len(adf))
    adf[stn] += noise

df = adf.copy().dropna(subset=[test_proxy, test_target], how='any')
df[[test_proxy, test_target]] = df[[test_proxy, test_target]].round(3)
if df.empty:
    num_complete_concurrent_years = 0
else:
    df.reset_index(inplace=True)
    num_complete_concurrent_years = dpf.count_complete_years(df, 'time', test_proxy)
    print(num_complete_concurrent_years, 'concurrent years of record')
    

24 concurrent years of record


2. Let $X_S$ and $X_T$ represent unit area runoff timeseries observations of the **Surrogate** and **Target** catchments
2. Simulate runoff at the target as $\hat X_T = X_S \cdot \frac{A_T}{A_S}$, where $A_T$ and $A_S$ are the target and surrogate catchment areas.

In [4]:
# simulate runoff at the target based on equal unit area runoff scaling
target_da = attr_df.loc[attr_df['official_id'] == test_target, 'drainage_area_km2'].values[0]
proxy_da = attr_df.loc[attr_df['official_id'] == test_proxy, 'drainage_area_km2'].values[0]
# by the equal unit area runoff model assumption, the simulated target runoff
# is equal to the observed proxy unit area runoff
sim_target = f'{test_target}_sim'
df[sim_target] = df[test_proxy] * (target_da / proxy_da)
# convert to Unit Area Runoff
df[sim_target] = (1000 * df[sim_target] / target_da).round(1).clip(0.1)
df[test_target] = (1000 * df[test_target] / target_da).round(1).clip(0.1)
df[test_proxy] = (1000 * df[test_proxy] / proxy_da).round(1).clip(0.1)
print(df[[sim_target, test_target, test_proxy]])

            05BF016_sim  05BF016  05BB001
time                                     
1962-03-06          3.6      1.1      3.6
1962-03-19          3.7      2.2      3.7
1962-03-26          3.7      2.2      3.7
1962-04-16          5.7      4.3      5.7
1962-04-17          5.7      3.2      5.7
...                 ...      ...      ...
2016-05-02         19.2      9.7     19.2
2016-05-30         30.5     37.6     30.5
2016-06-29         31.7     14.0     31.7
2016-08-15         24.6      7.5     24.6
2016-11-04          9.9      4.3      9.9

[14044 rows x 3 columns]


3. Compute $2^b - 2$ log-spaced bins over the **observed target range** $[\min (X_T), \max (X_T)]$ for $b = \{ 4, 5, \dots, 14\}$ bits
    * binning over the target range ensures that the ground truth distribution $T \sim \text{Pr}(X_T = x)$ aligns with the ground-truth range and not the simulated range.
    * and also that the simulated distribution $\hat T \sim \text{Pr}(\hat X_T = x)$ better reflects the "ground truth" in terms of binning, assuming the log-spaced binning procedure.

In [5]:
bitrates = list(range(4, 17))
bitrate = 4  # bitrates[-1]

def digitize_by_log_binning(df, label, n_bins):
    """Return n_bins + 1 log-spaced bin edges spanning the range of df[label]."""
    min_log_val = np.log10(df[label].min())
    # pad the upper edge slightly so the maximum value falls inside the last bin
    max_log_val = np.log10(df[label].max()) + 1e-3

    # set the bin edges to be evenly spaced (in log space) over the
    # observed range of df[label]
    # np.digitize will assign 0 for out-of-range values at left
    # and n_bins + 1 for out-of-range values at right
    log_bin_edges = np.linspace(min_log_val, max_log_val, n_bins + 1)

    # convert bin edges back to linear space
    return [10**e for e in log_bin_edges]

# reserve two states for out-of-range values at left and right
n_bins = 2**bitrate - 2
# bin edges are defined over the observed target range
target_bin_edges = digitize_by_log_binning(df, test_target, n_bins)

4. Quantize the simulated and observed series:
    * this quantization represents a "dictionary optimized on a simulated distribution"
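
The behaviour of `np.digitize` at and beyond the bin edges determines which state each value is assigned to.  The toy edges and values below are made up for illustration and show how the two reserved out-of-range states arise.

```python
import numpy as np

edges = np.array([1.0, 10.0, 100.0])  # three edges define two interior bins
values = np.array([0.5, 1.0, 5.0, 10.0, 99.9, 100.0, 250.0])

# with the default right=False, state i satisfies edges[i-1] <= x < edges[i]
states = np.digitize(values, edges)
print(states)
```

Values below the first edge map to state 0 and values at or above the last edge map to state `len(edges)`, so a value exactly equal to the upper edge lands in the overflow state unless the upper edge is padded slightly.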

In [6]:
obs_q_label = f'{test_target}_quantized'
df[obs_q_label] = np.digitize(df[test_target], target_bin_edges)
sim_q_label = f'{test_target}_sim_quantized'
df[sim_q_label] = np.digitize(df[sim_target], target_bin_edges)

5. Compute the quantization noise

In [7]:
# Compute bin midpoints
# bin_midpoints = (np.array(target_bin_edges[:-1]) + np.array(target_bin_edges[1:])) / 2


6. Compute the likelihood of $\text{Pr}(S|\text{data})$ and $\text{Pr}(T|\text{data})$.

In [8]:
# count the occurrences of each quantized value
# the "simulated" series is the proxy/donor series
# and the "observed" series is the target location
obs_count_df = df.groupby(obs_q_label).count()
sim_count_df = df.groupby(sim_q_label).count()

count_df = pd.DataFrame(index=range(2**bitrate))
count_df[test_target] = 0
count_df[sim_target] = 0

count_df[test_target] += obs_count_df[test_target]
count_df[sim_target] += sim_count_df[sim_target]
count_df.fillna(0, inplace=True)

In [9]:
# plot the observed and simulated histograms
def plot_histograms(df, col1, col2, bins):
    """
    Plot two overlaid histograms for specified columns in a DataFrame using Bokeh.
    
    Parameters:
    - df: pandas DataFrame containing the data.
    - bins: list or array of bin edges for the histograms.
    - col1: name of the first column in df to plot.
    - col2: name of the second column in df to plot.
    """
    # Compute histograms for each column
    counts1, counts2 = df[col1].values, df[col2].values
    
    # Normalize histograms to make them comparable (optional)
    counts1 = counts1 / counts1.sum()
    counts2 = counts2 / counts2.sum()

    # Compute bin width in log space and extend bins
    log_width_factor = bins[1] / bins[0]
    # Extend bins by dividing and multiplying the edges by the log factor
    extended_bins = [bins[0] / log_width_factor] + list(bins) + [bins[-1] * log_width_factor]
    
    # Prepare data for Bokeh plotting
    left_edges = extended_bins[:-1]
    right_edges = extended_bins[1:]
    
    source1 = ColumnDataSource(data={'left': left_edges, 'right': right_edges, 'top': counts1})
    source2 = ColumnDataSource(data={'left': left_edges, 'right': right_edges, 'top': counts2})
    
    # Create the Bokeh plot
    p = figure(title="", width=700, height=400, x_axis_type='log')
    p.quad(source=source1, top='top', bottom=0, left='left', right='right', 
           fill_color="grey", fill_alpha=0.5, line_color="grey", legend_label="P(x)")
    p.quad(source=source2, top='top', bottom=0, left='left', right='right', 
           fill_color="red", fill_alpha=0.5, line_color="red", legend_label="Q(x)")
    
    # Add labels and legend
    p.xaxis.axis_label = r'$$\text{Flow} [m^3/s]$$'
    p.yaxis.axis_label = r'$$\text{Pr}(X)$$'
    p.legend.location = "top_left"
    p.legend.click_policy = "hide"
    return p

In [10]:
plot = plot_histograms(count_df, test_target, sim_target, target_bin_edges)
plot = dpf.format_fig_fonts(plot)
show(plot)

The figure above compares observed runoff at the target catchment to runoff simulated from a surrogate/proxy catchment based on an equal unit area runoff model.  Both distributions have zero probability in the left- and right-most states.  This is a feature of the quantization, which reserves a state at each extreme to capture simulated values outside of the range of the observed series.  These states are empty in this case since the variance of the observed series is slightly greater than that of the proxy.  The left side of the distribution shows that the simulated values do not provide support coverage of the target distribution.  This scenario presents a problem since the discriminant $D_\text{KL}(P||Q) = \sum_i p_i \log (p_i/q_i)$ is undefined (the simulation is inadmissible) wherever $p_i > 0$ and $q_i = 0$ for any $i$, due to division by zero.  This case is addressed by assuming a Dirichlet prior which incorporates pseudo-counts to ensure all (simulated) states have non-zero probability.
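
The support coverage problem can be reproduced with a few made-up counts.  In the sketch below, the counts, the helper name `kl_divergence_bits`, and the pseudo-count value of 0.1 are all assumptions chosen for illustration.

```python
import numpy as np

def kl_divergence_bits(p, q):
    """D_KL(p||q) in bits; infinite if q lacks support where p > 0."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    if np.any((p > 0) & (q == 0)):
        return np.inf
    mask = p > 0
    return float(np.sum(p[mask] * np.log2(p[mask] / q[mask])))

obs_counts = np.array([10, 30, 40, 20])  # "observed" counts per state
sim_counts = np.array([0, 35, 45, 20])   # "simulated" counts miss the first state

p = obs_counts / obs_counts.sum()
q_raw = sim_counts / sim_counts.sum()

alpha = 0.1  # uniform pseudo-count added to every simulated state
q_post = (sim_counts + alpha) / np.sum(sim_counts + alpha)

print(kl_divergence_bits(p, q_raw))   # unbounded without pseudo-counts
print(kl_divergence_bits(p, q_post))  # finite once every state has mass
```

The finite value depends on the pseudo-count chosen, which is why the sensitivity to the prior is examined in the following sections.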


## {good, bad}^2

The "closeness" of one distribution to another can be described in terms of the implications of mismatch:

* **good good**: a good approximation of $P(x)$ with complete support coverage,
* **bad good**: a bad approximation of $P(x)$ with complete support coverage,
* **bad bad**: incomplete support coverage of $P(x)$ with high distortion rate, and
* **good bad**: incomplete support coverage of $P(x)$ with small distortion rate.

In [11]:
def normalize(hist):
    return hist / np.sum(hist)
bins = 16
x = np.linspace(0, 1, bins)
P = (0.7 * np.exp(-((x - 0.4)**2) / 0.02) + 0.4 * np.exp(-(np.log(x + 0.1) - 0.3)**2 / 0.1)).clip(0)
P = normalize(P)

# Q1: Has support coverage of P and is a good representation of P with slight noise
Q1 = normalize((P + np.random.uniform(-0.02, 0.02, bins)).clip(0))

# Q2: Has support coverage of P but extends beyond it (adds more mass in certain regions)
Q2 = normalize(P + np.random.uniform(0, 0.05, bins) + 0.37 * (x > 0.8))

# Q3: Does not have support coverage of P where P is concentrated (zeros in main regions of P)
Q3 = P * (x < 0.45) + P * (x > 0.15)
Q3 = normalize(np.where(Q3 > 0.04, Q3, 0))

# Q4: Does not have support coverage of P, but only where P is small (focuses on P's large values)
Q4 = normalize(P * (P > 0.01))
ex_plots = []
for q in [Q1, Q2, Q3, Q4]:
    temp_df = pd.DataFrame({'p': P, 'q': q})
    p = plot_histograms(temp_df, 'p', 'q', target_bin_edges)
    p = dpf.format_fig_fonts(p)
    # p.legend.location = "top_right"
    p.legend.background_fill_alpha = 0.6
    ex_plots.append(p)

In [12]:
layout = gridplot(ex_plots, ncols=2, width=500, height=400)
show(layout)

## Dealing with Underspecification: Assuming a Prior on Q

The joint likelihood of observing the data given $Q$ and the prior probability of $Q$ is given by $\text{Pr}(\text{data}|Q)\cdot \text{Prior}(Q)$, and the normalized posterior is given by:

$$\text{Pr}(Q|\text{data}) = \frac{\text{Pr}(\text{data}|Q) \cdot \text{Prior}(Q)}{\text{Pr}(\text{data})}$$

However this doesn't address our (fairly common) problem where some $q_i = 0$.  Normal Bayesian updating still yields $q_i = 0$ because a zero likelihood multiplied by any prior remains zero.

Let: 
* $Q = (q_1, q_2, \dots, q_k)$ be a probability vector representing a discrete distribution over $k$ categories (states).
* The observed data consist of counts $(n_1, n_2, \dots, n_k)$ for each category
* The prior distribution for $Q$ is a Dirichlet distribution with parameters $\alpha = (\alpha_1, \alpha_2, \dots, \alpha_k)$.  Because the Dirichlet is conjugate to the multinomial likelihood of the counts, the posterior is also a Dirichlet distribution, whose samples satisfy $\sum_{i=1}^k q_i = 1, q_i \geq 0 \text{ for all } i$:

$$Q|\text{data}\sim \text{Dirichlet}(\alpha_1 + n_1, \alpha_2 + n_2, \dots, \alpha_k + n_k)$$

The posterior mean (the renormalized sum of counts and pseudo-counts) is given by:

$$\mathbb{E}[Q|\text{data}] = \left( \frac{\alpha_1 + n_1}{\sum_{j=1}^k(\alpha_j + n_j)}, \frac{\alpha_2 + n_2}{\sum_{j=1}^k(\alpha_j + n_j)}, \dots, \frac{\alpha_k + n_k}{\sum_{j=1}^k(\alpha_j + n_j)} \right)$$

The KL divergence can't be computed where $\hat t_i = 0$.  To address this case we assume a Dirichlet prior on $\hat T$ in the form of a uniform distribution of pseudo-counts.  The assumption of a prior adds noise to the distribution, and the goal is to minimize the influence this procedural step has on the discriminant value in the final objective function.  To illustrate this point, we apply a range of priors to $\hat T$ and evaluate the degree to which it adds noise to the simulated distribution.

4.  For a range of priors representing strength of belief in a model, compute the posterior given by:

$$\mathbb{E}[t_i \mid \text{data}] = \frac{\alpha_i + n_i}{\sum_{j=1}^k (\alpha_j + n_j)}$$

where $\alpha_i$ represents a Dirichlet prior of pseudo-counts, and $n_i$ represents the number of values counted in state $i$ of the distribution the prior is applied to (here the simulated target series).
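
The posterior mean can be evaluated directly for a range of uniform pseudo-counts.  The sketch below uses made-up counts and a hypothetical helper name (`dirichlet_posterior_mean`) to show how larger priors pull the distribution toward uniform.

```python
import numpy as np

def dirichlet_posterior_mean(counts, alpha):
    """Posterior mean of a Dirichlet with uniform pseudo-count alpha."""
    counts = np.asarray(counts, dtype=float)
    return (counts + alpha) / np.sum(counts + alpha)

counts = np.array([0, 5, 80, 15])  # made-up state counts, one empty state

for exponent in (-3, 0, 3):
    post = dirichlet_posterior_mean(counts, 10.0**exponent)
    print(exponent, post.round(4))
```

With a very small prior the empty state receives a tiny but non-zero probability and the rest of the distribution is essentially unchanged.  With a prior much larger than the total count, the posterior mean approaches the uniform value of $1/k$ in every state.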



In [13]:
priors = np.arange(-3., 5.5, 1)
for prior in priors:
    pseudo_counts = 10**prior
    post_label = f'{prior}_pseudo_counts'
    count_df[post_label] = count_df[sim_target] + pseudo_counts

# normalize the counts to probabilities
probabilities = count_df / count_df.sum()
col_sums = probabilities.sum().values
assert np.all(np.isclose(col_sums, 1.0)), col_sums

5. Compute the noise on $T$ from assuming a prior as $\eta_{pr} = D_\text{KL}(T||Q)$, where
    * $T$ is the likelihood 
    * $Q$ is the posterior distribution of T after assuming a Dirichlet prior:
        * $Q|data \sim \text{Dirichlet}(\alpha_1 + n_1, \alpha_2 + n_2, \dots, \alpha_k + n_k)$ with
        * each $q_i = \mathbb{E} [q_i|\text{data}] = \frac{\alpha_i+n_i}{\sum_{j=1}^k(\alpha_j + n_j)}$ is the updated expectation of the probability of category $i$

In [14]:
def compute_noise(df, obs_label, posterior_sim_label):
    """Compute DKL(p||q) in bits between two probability columns of df.

    When obs_label is the simulated likelihood this is DKL(Q||Q_post),
    the divergence of the posterior from the likelihood due to the prior.
    """
    p, q = df[obs_label].values, df[posterior_sim_label].values
    # states with p = 0 contribute nothing; q = 0 is excluded to avoid division by zero
    mask = (q > 0) & (p > 0)
    return np.sum(p[mask] * np.log2(p[mask] / q[mask]))

In [15]:
print(test_proxy, test_target)
probabilities.columns

05BB001 05BF016


Index(['05BF016', '05BF016_sim', '-3.0_pseudo_counts', '-2.0_pseudo_counts',
       '-1.0_pseudo_counts', '0.0_pseudo_counts', '1.0_pseudo_counts',
       '2.0_pseudo_counts', '3.0_pseudo_counts', '4.0_pseudo_counts',
       '5.0_pseudo_counts'],
      dtype='object')

In [16]:
# plot the posterior distribution for each prior
from bokeh.palettes import TolRainbow14
from bokeh.layouts import row, column, gridplot

post_fig = figure(width=600, height=450, x_axis_type='log')
# Compute bin width in log space and extend bins
log_width_factor = target_bin_edges[1] /target_bin_edges[0]
# Extend bins by dividing and multiplying the edges by the log factor
extended_bins = [target_bin_edges[0] / log_width_factor] + list(target_bin_edges) + [target_bin_edges[-1] * log_width_factor]
bin_midpoints = np.add(extended_bins[1:], extended_bins[:-1]) / 2 

post_fig.line(bin_midpoints, probabilities[test_target], line_width=3, color='black', 
              line_dash='dashed', legend_label='P(x)')

n = 0
min_post_label = f'{min(priors)}_pseudo_counts'
print(min_post_label)
noise_1, noise_2 = [], []
discriminants = []
for prior in priors:
    post_label = f'{prior}_pseudo_counts'
    noise_1.append(compute_noise(probabilities, sim_target, post_label))
    discriminants.append(compute_noise(probabilities, test_target, post_label))
    post_fig.line(bin_midpoints, probabilities[post_label], legend_label=str(prior),
                 line_width=2.5, color=TolRainbow14[n])
    n += 1

post_fig.xaxis.axis_label = r'$$\text{Flow} [m^3/s]$$'
post_fig.yaxis.axis_label = r'$$\text{Pr}(X)$$'
# post_fig.legend.location = "top_left"
post_fig.add_layout(post_fig.legend[0], 'right')
post_fig.legend.click_policy = "hide"
post_fig = dpf.format_fig_fonts(post_fig)

noise_fig = figure(width=600, height=450)
noise_fig.line(priors, noise_1, line_width=3, color='crimson', 
               line_dash='dotted', legend_label=f'DKL(Q||Q_post)')
noise_fig.line(priors, discriminants, line_width=3, color='black', legend_label='DKL(P||Q_post)')

noise_fig.xaxis.axis_label = r'$$\text{Prior } (\alpha =  10^x) \text{ Pseudo-Counts} $$'
noise_fig.yaxis.axis_label = r'$$D_\text{KL}(P||Q_\text{approx})$$'
noise_fig.legend.location = 'top_right'
noise_fig.legend.click_policy = "hide"
noise_fig = dpf.format_fig_fonts(noise_fig)

-3.0_pseudo_counts


In [17]:
layout = gridplot([post_fig, noise_fig], ncols=2, width=500, height=400)
show(layout)

To address incomplete support coverage of the "ground truth" $P(x)$ by the simulated $Q(x)$ distribution in the discrete representation, we assume a Dirichlet prior on $Q(x)$ expressed as pseudo-counts, so that the posterior $Q$ takes the form $\text{Dirichlet}(10^\alpha + n_i)$, where $n_i$ is the number of observations in category $i$.
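
A minimal sketch of this update, assuming hypothetical bin counts `sim_counts` and a helper named `dirichlet_posterior_mean` (neither is part of the notebook). Under a symmetric prior of $10^\alpha$ pseudo-counts per bin, the posterior mean probability of bin $i$ is $(n_i + 10^\alpha) / \sum_j (n_j + 10^\alpha)$, which gives every bin nonzero probability even where the simulation has no observations.

```python
import numpy as np

def dirichlet_posterior_mean(counts, log10_pseudo_count):
    """Posterior mean bin probabilities under a symmetric Dirichlet prior.

    `log10_pseudo_count` is the exponent alpha, so each bin receives
    10**alpha pseudo-counts (hypothetical helper, for illustration only).
    """
    counts = np.asarray(counts, dtype=float)
    posterior_counts = counts + 10.0 ** log10_pseudo_count
    return posterior_counts / posterior_counts.sum()

# Hypothetical simulated counts with two empty bins (incomplete support)
sim_counts = np.array([0, 12, 40, 30, 8, 0])
q_post = dirichlet_posterior_mean(sim_counts, log10_pseudo_count=-1.0)
print(q_post.round(4))
```

The empty bins end up with small but nonzero probability, so a KL divergence taken against `q_post` stays finite wherever $P$ has mass.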

The plot above-left shows the flattening influence of the increasing prior on the approximated distribution $Q(x)$ against the "ground truth" $P(X)$, where the choice of prior has minimal effect below approximately $10^1$ pseudo-counts.  At right, the dotted red line shows that the influence of the prior ($D_\text{KL}(Q||Q_\text{post})$) is very small until the prior grows beyond $10^1$ pseudo-counts.  The plot above-right also compares the noise added by the prior with the sensitivity of the discriminant to the prior.  If the noise added by the prior (red dotted line) is greater than the discriminant (black line), then the discriminant isn't meaningful to use in a decision.

Smaller priors preserve information in the approximate distribution $Q$ and add negligible noise, but smaller priors (on $Q$) inflate the KL divergence between the simulated ($Q$) and "ground truth" ($P$) distributions, in particular where the simulation does not provide support coverage of the ground truth.  The result at right shows there is more to consider about the effect of the prior than just how it influences the posterior $Q$.  
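
The trade-off can be sketched on a toy example. The counts below are hypothetical, and `kl_bits` is a stand-in written for this illustration (the notebook's own `compute_noise` may differ in detail). The simulation deliberately misses the two upper bins of the target, so small priors leave almost no mass there and the discriminant is large, while large priors distort $Q$ itself.

```python
import numpy as np

def kl_bits(p, q):
    """D_KL(p || q) in bits over the bins where both are positive."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = (p > 0) & (q > 0)
    return float(np.sum(p[mask] * np.log2(p[mask] / q[mask])))

# Hypothetical counts: the simulation has no observations in the top two bins
obs_counts = np.array([5, 20, 40, 25, 8, 2])
sim_counts = np.array([10, 35, 40, 15, 0, 0])
P = obs_counts / obs_counts.sum()
Q = sim_counts / sim_counts.sum()

exponents = [-3, -1, 0, 1, 2]
noise_vals, disc_vals = [], []
for exponent in exponents:
    post_counts = sim_counts + 10.0 ** exponent
    Q_post = post_counts / post_counts.sum()
    noise_vals.append(kl_bits(Q, Q_post))   # D_KL(Q || Q_post): noise added by the prior
    disc_vals.append(kl_bits(P, Q_post))    # D_KL(P || Q_post): the discriminant
    print(f'10^{exponent}: noise={noise_vals[-1]:.4f}, discriminant={disc_vals[-1]:.4f}')
```

In this toy case the prior noise grows with the pseudo-count, while the discriminant first drops sharply as the empty bins gain mass.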

In general the prior should be chosen to minimize its influence on the input distributions. However, the plot at right also shows that where the prior is minimized, the noise added to $Q$ is roughly equal to the discriminant, which erases its meaning.  Next we run this analysis on a sample of $6\times10^5$ pairwise comparisons to see how much the influence of the prior on the discriminant and on the posterior $Q$ varies over a wide range of simple equal unit area runoff regional transfer simulations.

The point of the comparison is that the noise due to the prior can "push" $Q$ further away from $P$, and it can also bring it closer.  Since we can't know this a priori, we should ensure that, regardless of the direction in which the different sources of epistemic noise "push" $P$ and $Q$ relative to each other, the noise is not significant in comparison to the magnitude of the discriminant.  The example below illustrates how the prior flattens $Q$ and can make it more or less similar to the target $P$.

In [18]:
def generate_and_plot_lognormal(mu1, sigma1, mu2, sigma2, b, prior_pseudo_count, y_range=(0, 0.035)):
    """
    Generates two log-normal distributions, quantizes them with 2^b symbols,
    applies a uniform pseudo-count prior to one, and plots the distributions
    with Bokeh (including the posterior as a dashed line).

    Parameters:
    - mu1, sigma1: Parameters for the first log-normal distribution.
    - mu2, sigma2: Parameters for the second log-normal distribution.
    - b: Number of symbols = 2^b for quantization.
    - prior_pseudo_count: Uniform pseudo-count to apply to one distribution.
    """
    # Define the quantization bins over a fixed range
    minx, maxx = 0.0, 5
    bins = np.linspace(minx, maxx, 2 ** b + 1)
    bins_midpoints = (bins[:-1] + bins[1:]) / 2

    # Evaluate the continuous log-normal densities at the bin midpoints
    dist1 = lognorm.pdf(bins_midpoints, sigma1, scale=np.exp(mu1))
    dist2 = lognorm.pdf(bins_midpoints, sigma2, scale=np.exp(mu2))

    # Quantize the distributions
    # Normalize the distributions to form proper PDFs over the quantized bins
    P = dist1 / np.sum(dist1)
    Q = dist2 / np.sum(dist2)

    # Apply the uniform prior as pseudo-counts to the second distribution
    prior_counts = np.full_like(Q, prior_pseudo_count)
    posterior_counts = prior_counts + Q * np.sum(dist2)  # Bayesian update
    R = posterior_counts / np.sum(posterior_counts)  # Renormalize

    assert np.abs(sum(P) - 1) < 0.001, 'P does not sum to 1'
    assert np.abs(sum(Q) - 1) < 0.001, 'Q does not sum to 1'
    assert np.abs(sum(R) - 1) < 0.001, 'Q_hat does not sum to 1'

    kl_pq = kl_div(P, Q)
    kl_pr = kl_div(P, R)

    # Prepare data for Bokeh plotting
    source1 = ColumnDataSource(data=dict(x=bins_midpoints, y=P))
    source2 = ColumnDataSource(data=dict(x=bins_midpoints, y=Q))
    source2_posterior = ColumnDataSource(data=dict(x=bins_midpoints, y=R))

    # Create the Bokeh plot
    p = figure(title="", y_range=y_range,
               x_axis_label='x', y_axis_label=r'$$\text{Pr}(X)$$', width=500, height=350)

    p.line('x', 'y', source=source1, line_width=2, color='black', legend_label=f"P(x)=LN(x|{mu1:.1f},{sigma1:.2f})")
    p.line('x', 'y', source=source2, line_width=2, color='red', legend_label=f"Q(x)=LN(x|{mu2:.1f},{sigma2:.2f})")
    p.line('x', 'y', source=source2_posterior, line_width=2, line_dash='dashed',
           color='red', legend_label=f"Q_hat(X) (Prior={prior_pseudo_count})")

    # Configure the legend and show the plot
    p.legend.location = "top_right"
    p.legend.click_policy = "hide"

    return p, sum(kl_pq), sum(kl_pr)

In [19]:
mu, sigma = 0.25, 0.35
p1, klpq1, klpr1 = generate_and_plot_lognormal(mu, sigma, mu+0.01, sigma-0.1, 8, 0.03, y_range=(0, 0.035))
p2, klpq2, klpr2 = generate_and_plot_lognormal(mu, sigma, mu+0.01, sigma+0.1, 8, 0.03)
p1, p2 = dpf.format_fig_fonts(p1), dpf.format_fig_fonts(p2)

kl1_text = "    -->Q is closer to P than Q_hat"
if klpr1 < klpq1:
    kl1_text = "    -->Q_hat is closer to P than Q"
kl2_text = "    -->Q is closer to P than Q_hat"
if klpr2 < klpq2:
    kl2_text = "    -->Q_hat is closer to P than Q"

print(f'DKL_1(P||Q) = {klpq1:.2f}, DKL(P||Q_hat) = {klpr1:.2f}')
print(kl1_text)
print(f'DKL_2(P||Q) = {klpq2:.2f}, DKL(P||Q_hat) = {klpr2:.2f}')
print(kl2_text)
layout = gridplot([p1, p2], ncols=2, width=500, height=350)
show(layout)

DKL_1(P||Q) = 0.14, DKL(P||Q_hat) = 0.10
    -->Q_hat is closer to P than Q
DKL_2(P||Q) = 0.05, DKL(P||Q_hat) = 0.12
    -->Q is closer to P than Q_hat


## Quantization Noise

The discrete representation of a probability distribution also adds noise, because the quantization resolution limits the number of unique observed values to the number of symbols defined in the dictionary.  A 4-bit quantization, shown earlier in this notebook, provides 16 distinct symbols, each defined by a range of continuous values.  Quantization reduces variance because all values mapped to the same symbol are treated as equal.  As discussed previously in this notebook, a dictionary with enough symbols preserves all unique values from the original time series.  Decreasing the dictionary size has the effect of smoothing the distribution, since the frequency of each consolidated bin averages over a larger subset of the sample.  The loss of variance from this smoothing changes the frequency estimates, and thus is a form of noise.

The example that follows computes the discrete distribution frequencies estimated from a 5-bit (32 symbol) dictionary and those from a 4-bit (16 symbol) dictionary, and computes the KL divergence between the two to express the divergence of the lower-resolution distribution ($Q_{b=4}$) from the higher ($Q_{b=5}$).  Also plotted in the example are log-normal, exponential, and 4-parameter kappa distributions fit to the data by Maximum Likelihood Estimation (MLE), along with a kernel density estimate (KDE).  These extra series are included to compare the fit of several parametric distributions widely used in hydrology, and to serve as a preview of a subsequent exercise that links the current analysis of quantization noise to an analogous "noise" quantity related to parametric distribution fitting error, or "parametric fit noise".
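
As a minimal, self-contained sketch of the quantization-noise calculation, the block below uses a synthetic log-normal sample as a stand-in for a flow record (an assumption, not notebook data). The 4-bit distribution is formed by consolidating adjacent pairs of 5-bit bins, and its mass is then spread evenly back over the 5-bit bins so the two can be compared state by state.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in for a daily flow record (assumption)
flows = rng.lognormal(mean=1.0, sigma=1.0, size=5000)

# Higher-resolution distribution: 2**5 = 32 log-spaced bins over the data range
edges5 = np.logspace(np.log10(flows.min()), np.log10(flows.max()), 2 ** 5 + 1)
counts5, _ = np.histogram(flows, bins=edges5)
p5 = counts5 / counts5.sum()

# Lower-resolution distribution: consolidate adjacent pairs into 16 bins
q4 = p5.reshape(-1, 2).sum(axis=1)

# Upscale: split each 4-bit bin's mass evenly over its two 5-bit bins
q4_up = np.repeat(q4, 2) / 2.0

# D_KL(Q_{b=5} || Q_{b=4}) in bits, over bins with nonzero mass in both
mask = (p5 > 0) & (q4_up > 0)
quantization_noise = float(np.sum(p5[mask] * np.log2(p5[mask] / q4_up[mask])))
print(f'quantization noise: {quantization_noise:.4f} bits')
```

Because each low-resolution bin is split across exactly two high-resolution bins, the divergence in this construction is bounded between 0 and 1 bit.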

In [20]:
import warnings
warnings.filterwarnings("ignore", category=RuntimeWarning)

def neg_log_likelihood(params, data):
    # negative log-likelihood of the 4-parameter kappa distribution for `data`
    h, k, loc, scale = params
    if scale <= 0:
        return np.inf  # Enforce positive scale
    try:
        logpdf = kappa4.logpdf(data, h, k, loc=loc, scale=scale)
        if np.any(np.isnan(logpdf) | np.isinf(logpdf)):
            return np.inf
        return -np.sum(logpdf)
    except Exception:
        return np.inf

def fit_kappa4_mle(data):
    if len(data) < 2 or np.std(data) <= 0:
        # Handle edge cases where MLE is not feasible
        return {'h': np.nan, 'k': np.nan, 'loc': np.nan, 'scale': np.nan}
    
    # Define the initial guesses for the parameters
    initial_params = [0.0, 0.0, np.mean(data), np.std(data)]
    
    # Bounds for the parameters: h and k between -2 and 2, scale > 1e-5
    bounds = [(-2, 2), (-2, 2), (None, None), (1e-5, None)]
    
    # Minimize the negative log-likelihood of the data passed to this function
    # (passed explicitly via `args` rather than read from a global variable)
    result = minimize(neg_log_likelihood, initial_params, args=(data,),
                      method='L-BFGS-B', bounds=bounds)
    
    if result.success:
        h, k, loc, scale = result.x
    else:
        h, k, loc, scale = np.nan, np.nan, np.nan, np.nan
        # Log or handle fitting failure if needed
    
    return {'h': h, 'k': k, 'loc': loc, 'scale': scale}

def compute_pdf_cdf_kde_scipy(data, grid_points=1000, bandwidth=None):
    """
    Computes the probability density function (pdf) and cumulative distribution function (cdf)
    from an array of values using Kernel Density Estimation (KDE) with scipy.stats.

    Parameters:
    data (array-like): Input data array.
    grid_points (int): Number of points in the grid where the pdf and cdf are evaluated.
    bandwidth (float or str, optional): The bandwidth of the kernel. If None, Scott's Rule is used.
                                         Can also be a string for methods like 'scott' or 'silverman'.

    Returns:
    x_grid (numpy.ndarray): Grid points where the pdf and cdf are evaluated.
    pdf_values (numpy.ndarray): Estimated pdf values corresponding to x_grid.
    cdf_values (numpy.ndarray): Estimated cdf values corresponding to x_grid.
    """
    # Convert input data to a numpy array
    data = np.asarray(data)
    
    # Create a Gaussian KDE object
    kde = gaussian_kde(data, bw_method=bandwidth)
    
    # Create a grid over which to evaluate the KDE
    x_min = data.min() - 1.0 * data.std()
    x_max = data.max() + 1.0 * data.std()
    x_grid = np.linspace(x_min, x_max, grid_points)
    
    # Evaluate the pdf over the grid
    pdf_values = kde.evaluate(x_grid)
    
    # Compute the cdf by integrating the pdf
    cdf_values = np.array([kde.integrate_box_1d(-np.inf, xi) for xi in x_grid])
    
    return x_grid, pdf_values, cdf_values



In [21]:
def compute_ecdf(data):
    """Compute the empirical CDF of a dataset."""
    sorted_data = np.sort(data)
    n = len(sorted_data)
    y = np.arange(1, n+1) / n
    return sorted_data, y

In [22]:
def compute_discrete_distributions(data, b):
    bin_dict = {}
    total_count = len(data)
    # 1. Compute the histogram using equal-width bins
    counts, bin_edges = np.histogram(data, bins=2**b, density=False)
    densities, _ = np.histogram(data, bins=2**b, density=True)
    freqs = counts / len(data)
    bin_widths = bin_edges[1:] - bin_edges[:-1] 
    bin_dict['equal'] = {'edges': bin_edges, 'freqs': freqs, 'densities': densities, 'widths': bin_widths}

    # 2. Compute the histogram using log-spaced bins
    minx, maxx = np.min(data), np.max(data)
    log_edges = np.logspace(np.log10(minx), np.log10(maxx), 2**b + 1)
    bin_widths = log_edges[1:] - log_edges[:-1]
    log_densities, _ = np.histogram(data, bins=log_edges, density=True)
    log_counts, _ = np.histogram(data, bins=log_edges, density=False)
    log_freqs = log_counts / sum(log_counts)
    assert abs(sum(log_freqs) - 1) < 0.001, sum(log_freqs)
    bin_dict['log'] = {'edges': log_edges, 'freqs': log_freqs, 'densities': log_densities, 'widths': bin_widths}

    # 3. Compute the histogram using uniform (probability) bins
    quantiles = np.linspace(0, 1, 2**b + 1)
    uniform_edges = np.quantile(data, quantiles)
    bin_widths = uniform_edges[1:] - uniform_edges[:-1]
    uniform_densities, _ = np.histogram(data, bins=uniform_edges, density=True)
    uniform_counts, _ = np.histogram(data, bins=uniform_edges, density=False)
    uniform_freqs = uniform_counts / total_count
    # assert abs(sum(uniform_freqs) - 1) < 0.001, sum(uniform_freqs)
    bin_dict['uniform'] = {'edges': uniform_edges, 'freqs': uniform_freqs, 'densities': uniform_densities, 'widths': bin_widths}
    return bin_dict

In [23]:
def remap_low_to_high_resolution(data, b1, b2, which_binning='log'):
    """
    Vectorized distribution of low-resolution frequencies over high-resolution bins.
    Expands each low-res bin's frequency uniformly across the corresponding high-res bins.
    """
    b1_bin_dict = compute_discrete_distributions(data, b1)  # Low-res (Q)
    b2_bin_dict = compute_discrete_distributions(data, b2)  # High-res (P)
    b1_bins = b1_bin_dict[which_binning]['edges']
    b1_freqs = b1_bin_dict[which_binning]['freqs']
    b2_bins = b2_bin_dict[which_binning]['edges']
    b2_freqs = b2_bin_dict[which_binning]['freqs']
    # Normalize up-scaled low-res frequencies and convert to probabilities
    b1_probs = b1_freqs / np.sum(b1_freqs)
    # Determine the low-res bin for each high-res bin
    bin_indices = np.digitize(b2_bins[:-1], b1_bins) - 1  # Map high-res bins to low-res bins
    
    # Compute counts of high-res bins falling into each low-res bin
    counts_per_low_bin = np.bincount(bin_indices, minlength=len(b1_probs))

    # Broadcast low-res frequencies to high-res bins, dividing by the count to distribute uniformly
    high_res_probs = b1_probs[bin_indices] / counts_per_low_bin[bin_indices]
    # Normalize to ensure the result sums to 1
    high_res_probs /= np.sum(high_res_probs)
    
    return high_res_probs, b1_bin_dict, b2_bin_dict

In [24]:
def compute_kl_divergence(p, q):
    # compute DKL(P||Q) in bits, summed over the bins
    # where both p and q are nonzero
    mask = (q > 0) & (p > 0)
    return np.sum(p[mask] * np.log2(p[mask] / q[mask]))

In [25]:
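def fit_continuous_distributions(data, bin_edges):
    """Fit continuous distributions (log-normal, exponential, kappa4, KDE) to `data`
    and integrate each over the bins defined by `bin_edges`.
    Note: the output column labels use the global station ID `stn`."""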
    # fit and plot a lognormal distribution
    ln_shape, ln_loc, ln_scale = lognorm.fit(data, floc=0)  # Fixing location to 0
    ex_loc, ex_scale = expon.fit(data, floc=0)
    kp = fit_kappa4_mle(data)
    # kp = constrained_optimization(data)
    kde = gaussian_kde(np.log10(data), bw_method='scott')

    edges = list(bin_edges)
    ln_cdf_vals = lognorm.cdf(edges, ln_shape, 
                              loc=ln_loc, scale=ln_scale)
    expon_cdf_vals = expon.cdf(edges, loc=ex_loc, 
                               scale=ex_scale)
    kappa4_cdf_vals = kappa4.cdf(edges, kp['h'], kp['k'],
                                 loc=kp['loc'], scale=kp['scale'])
    
    # Compute the KDE-based CDF at the evaluation points
    kde_cdf_vals = np.array([kde.integrate_box_1d(-np.inf, np.log10(xi)) for xi in edges])
    
    p_sim = pd.DataFrame()
    p_sim[f'{stn}_LN'] = np.diff(ln_cdf_vals)
    p_sim[f'{stn}_EXP'] = np.diff(expon_cdf_vals)
    p_sim[f'{stn}_KP4'] = np.diff(kappa4_cdf_vals)
    p_sim[f'{stn}_KDE'] = np.diff(kde_cdf_vals)
    # normalize the distributions 
    p_sim /= p_sim.sum()
    # make sure all distributions sum to 1
    assert np.isclose(p_sim.sum(), 1, atol=0.0001).all(), p_sim.sum()
    bin_midpoints = (np.array(bin_edges[:-1]) + np.array(bin_edges[1:])) / 2
    # replace the last bin midpoint with half the previous bin's width
    # because our right bin edge is np.inf
    # right_bin_midpoint = edges[-2] + (edges[-2] - edges[-3]) / 2.0
    # bin_midpoints[-1] = right_bin_midpoint
    p_sim['bin_midpoints'] = bin_midpoints
    return p_sim

In [26]:
def compute_distribution_comparisons(data, b1, b2, binning_method='log'):
    bin_dict = {}
    # compute the discrete and continuous representations at each bit depth
    for b in [b1, b2]:
        discrete_dict = compute_discrete_distributions(data, b)
        parametric = fit_continuous_distributions(data, discrete_dict[binning_method]['edges'])
        bin_dict[b] = {
                'discrete': discrete_dict,
                'continuous': parametric
            }
    return bin_dict

In [27]:
def plot_distribution_comparison(b1, b2, data, binning_method='log', log_axis=False):
    # use a log-scaled x-axis when requested
    test_fig = figure(title=None, width=1000, height=400,
                      x_axis_type='log' if log_axis else 'linear')
        
    # plot empirical (discrete) distributions using linear and log binning
    # bin_dict = compute_distribution_comparisons(data, b1, b2, binning_method)
    high_res_probs, b1_bin_dict, b2_bin_dict = remap_low_to_high_resolution(data, b1, b2, which_binning=binning_method)
    continuous_b1 = fit_continuous_distributions(data, b1_bin_dict[binning_method]['edges'])
    continuous_b2 = fit_continuous_distributions(data, b2_bin_dict[binning_method]['edges'])

    b2_edges = b2_bin_dict[binning_method]['edges']
    b2_freqs = b2_bin_dict[binning_method]['freqs']

    # lower-resolution frequencies, upscaled to the higher-resolution bins
    test_fig.quad(left=b2_edges[:-1], right=b2_edges[1:], top=high_res_probs, 
                  bottom=[0 for _ in range(len(b2_edges[1:]))],  fill_color='gray', 
                  legend_label=f'{b1} bits (upscaled)',
                  line_color='gray', line_width=0.5, fill_alpha=0.4)
    # higher-resolution frequencies
    test_fig.quad(left=b2_edges[:-1], right=b2_edges[1:], top=b2_freqs, 
                  bottom=[0 for _ in range(len(b2_edges[1:]))], fill_color='crimson', 
                  legend_label=f'{b2} bits',
                  line_color='crimson', line_width=0.5, fill_alpha=0.4)

    b1_bin_midpoints = continuous_b1['bin_midpoints']
    b2_bin_midpoints = continuous_b2['bin_midpoints']
    # print(parametric_high.keys())
    # p_sim = parametric_high.copy()
    test_fig.line(b2_bin_midpoints, continuous_b2[f'{stn}_LN'], color='black', legend_label='LogNorm MLE', line_width=2)
    test_fig.line(b2_bin_midpoints, continuous_b2[f'{stn}_EXP'], color='gray', legend_label='Expon. MLE', line_width=2)
    test_fig.line(b2_bin_midpoints, continuous_b2[f'{stn}_KP4'], color='purple', legend_label='Kappa MLE', line_width=2)
    test_fig.line(b2_bin_midpoints, continuous_b2[f'{stn}_KDE'], color='magenta', legend_label='KDE', line_width=2)

    test_fig.legend.background_fill_alpha = 0.6
    test_fig.legend.location = 'top_right'
    test_fig.legend.click_policy='hide'
    test_fig.xaxis.axis_label = r'$$\text{Mean Daily Flow } [m^3/s]$$'
    test_fig.yaxis.axis_label = r'$$\text{Pr}(X)$$'
    # test_fig.add_layout(LinearAxis(y_range_name="ecdf", axis_label="CDF"), "right")
    test_fig.add_layout(test_fig.legend[0], 'right')
    test_fig = dpf.format_fig_fonts(test_fig)
    return test_fig

In [28]:
from bokeh.layouts import column

test_df = dpf.get_timeseries_data(test_target)
approximated_bitrate = 4
ground_truth_bitrate = 5
data = test_df[test_target].dropna().values
pmf_comparison_plot_lin = plot_distribution_comparison(approximated_bitrate, ground_truth_bitrate, data)
pmf_comparison_plot_log = plot_distribution_comparison(approximated_bitrate, ground_truth_bitrate, data, log_axis=True)
show(column(pmf_comparison_plot_lin, pmf_comparison_plot_log))

The above plots compare discrete and continuous methods for representing observed data as probability distributions. Two discrete distributions are shown, quantized to different dictionary sizes: the first (pink) uses five bits, while the second (gray) uses four bits. The frequencies from the four-bit distribution are remapped to the larger dictionary and renormalized to align with the states of the five-bit distribution. Additionally, four continuous distributions are fit to the data: three parametric models (log-normal, exponential, and 4-parameter kappa) using MLE best-fit parameters, and a KDE-based distribution. Each continuous distribution is integrated over the bins of the five-bit discrete distribution (as differences of its CDF at the bin edges), plotted at the bin midpoints, and normalized to sum to 1 to allow comparison with the discrete empirical estimates. 

The plot illustrates the point that all methods for representing observational data as distributions involve decisions that affect interpretation. The three parametric fits limit the interpretation to single-mode distributions. KDE, while more flexible, can exhibit limitations in regions with sparse data or sharp changes in density, particularly for small values approaching zero, where the kernel may under- or over-smooth depending on its size. Discrete binning, on the other hand, preserves as much of the variability in the data as the chosen quantization allows, while parametric fits inherently smooth the distribution, embedding assumptions about long-term behavior rather than simply acting as a low-pass filter to remove high-frequency variability to emphasize broader trends or more general patterns. The KDE can act as an intermediate approach: larger kernels emphasize smoothing, similar to parametric fits, while smaller kernels approach the variance-preserving characteristics of discrete binning.

A limitation of KDE is that the kernel bandwidth is typically constant across the whole domain, which is why the KDE fit behaves very differently on the left and right sides of the distribution (more evident when viewed in log space).
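
A short sketch of the constant-bandwidth point, using a synthetic log-normal sample as a stand-in for the flow record (an assumption). The KDE is fit to $\log_{10}$ flows, as in `fit_continuous_distributions`, so one scalar bandwidth in log units is shared by every kernel, and the flow range that a single kernel covers depends on where in the distribution it sits.

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(2)
flows = rng.lognormal(mean=1.0, sigma=1.0, size=2000)  # synthetic stand-in

# Fit the KDE to log10 flows with Scott's rule
kde_log = gaussian_kde(np.log10(flows), bw_method='scott')

# A single scalar bandwidth (in log10 units) is shared by every kernel
h_log = float(np.sqrt(kde_log.covariance[0, 0]))

# Flow range covered by +/- one bandwidth around a low and a high quantile
widths = {}
for q in (0.05, 0.95):
    x = np.quantile(flows, q)
    widths[q] = x * (10 ** h_log - 10 ** (-h_log))
    print(f'q={q:.2f}: flow={x:.2f}, kernel span={widths[q]:.2f}')
```

The same kernel spans a much wider range of flows in the upper tail than in the lower tail, which is one way a fixed bandwidth can smooth the two sides of the distribution unevenly.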

### Quantization Noise and Representation Bias

The representation that retains the most information and introduces the least noise or error is the discrete quantization with a dictionary large enough to preserve the full precision of the observations. Assuming the larger of the two discrete distributions is the "ground truth", the next example compares two sources of approximation error: the quantization noise introduced by reducing the number of bins, and the error from continuous representations. The parametric continuous fits impose structural assumptions on the data, and the nonparametric methods impose a degree of smoothing, set by the kernel choice for KDE or by the reduction in dictionary size for the discrete representation.  By computing the KL divergence against the higher-precision quantization, the deviation of each representation from the "ground truth" distribution is quantified in information terms.
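
The continuous side of this comparison can be sketched as follows, again on a synthetic log-normal sample (an assumption, not notebook data). A log-normal is fit by MLE, its probability mass in each bin is taken from differences of the fitted CDF at the bin edges, and the KL divergence of the fit from the empirical bin frequencies gives a "parametric fit noise" in bits.

```python
import numpy as np
from scipy.stats import lognorm

rng = np.random.default_rng(1)
flows = rng.lognormal(mean=1.0, sigma=0.8, size=5000)  # synthetic stand-in

# Empirical frequencies over 2**5 log-spaced bins
bits = 5
edges = np.logspace(np.log10(flows.min()), np.log10(flows.max()), 2 ** bits + 1)
counts, _ = np.histogram(flows, bins=edges)
p_emp = counts / counts.sum()

# MLE log-normal fit with location fixed at zero; bin masses from the CDF
shape, loc, scale = lognorm.fit(flows, floc=0)
q_fit = np.diff(lognorm.cdf(edges, shape, loc=loc, scale=scale))
q_fit /= q_fit.sum()

# D_KL(empirical || fitted) in bits over bins with nonzero mass in both
mask = (p_emp > 0) & (q_fit > 0)
fit_noise = float(np.sum(p_emp[mask] * np.log2(p_emp[mask] / q_fit[mask])))
print(f'log-normal fit noise: {fit_noise:.4f} bits')
```

Here the sample really is log-normal, so the fit noise is small; for observed flows the same quantity also reflects how poorly the assumed family matches the data.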


In [28]:
def continuous_klds(stn, freqs, continuous, binning_method='log', print_results=True):
        
    # compute the KL divergence between the discrete and continuous forms, 
    # where the discrete form is the "ground truth"
    d = {}
    # adj_freqs = np.array(list(discrete['freqs']) + [0])
    kld_ln = compute_kl_divergence(freqs, continuous[f'{stn}_LN'].values)
    d['LN'] = kld_ln
    kld_exp = compute_kl_divergence(freqs, continuous[f'{stn}_EXP'].values)
    d['EXP'] = kld_exp
    kld_kp4 = compute_kl_divergence(freqs, continuous[f'{stn}_KP4'].values)
    d['KP4'] = kld_kp4
    kld_kde = compute_kl_divergence(freqs, continuous[f'{stn}_KDE'].values)
    d['KDE'] = kld_kde
    
    return d

In [None]:
b2= 16
binning_method = 'log'
quant_noise, ln_noise, exp_noise, kp4_noise, kde_noise = [], [], [], [], []
bitrates = [4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]
priors = np.arange(-3., 4, 0.25)
kld_dict = {}
for b1 in bitrates:
    print(f'Computing quantization noise for {b2} to {b1} bit downscaling')
    if np.var(data) == 0:
        raise ValueError("KDE cannot be computed for data with zero variance.")
    # bin_dict, remapped_discrete_freqs = compute_distribution_comparisons(data, b1, b2, binning_method)
    upsampled_freqs, b1_bin_dict, b2_bin_dict = remap_low_to_high_resolution(data, b1, b2, which_binning=binning_method)
    continuous_b1 = fit_continuous_distributions(data, b1_bin_dict[binning_method]['edges'])
    # continuous_b2 = fit_continuous_distributions(data, b2_bin_dict[binning_method]['edges'])
    # compute quantization noise from b2 to b1 downscaling
    kld_discrete = compute_kl_divergence(b2_bin_dict[binning_method]['freqs'], upsampled_freqs)      
    kld_dict[b1] = kld_discrete
    quant_noise.append(kld_discrete)
    
    kld_dict_continuous = continuous_klds(stn, b1_bin_dict[binning_method]['freqs'], continuous_b1)
    kld_dict.update(kld_dict_continuous)    
    ln_n, exp_n, kp4_n, kde_n = kld_dict['LN'], kld_dict['EXP'], kld_dict['KP4'], kld_dict['KDE']
    ln_noise.append(ln_n)
    exp_noise.append(exp_n)
    kp4_noise.append(kp4_n)
    kde_noise.append(kde_n)

In [None]:
noise_fig = figure(width=600, height=400)
noise_fig.extra_y_ranges = {"delta": Range1d(start=0.0, end=0.5)}

noise_fig.line(bitrates, quant_noise, legend_label='Discrete', color='royalblue', line_dash='dashed', line_width=2)
noise_fig.line(bitrates, ln_noise, legend_label='LogNorm', color='black', line_dash='solid', line_width=2)
noise_fig.line(bitrates, exp_noise, legend_label='Exponential', color='grey', line_dash='solid', line_width=2)
noise_fig.line(bitrates, kp4_noise, legend_label='Kappa 4', color='purple', line_dash='solid', line_width=2)
noise_fig.line(bitrates, kde_noise, legend_label='KDE', color='magenta', line_dash='solid', line_width=2)

noise_fig.xaxis.axis_label = r'$$\text{Downsampled Dictionary Size } (2^b \text{ symbols})$$'
noise_fig.yaxis.axis_label = r'$$\text{Representation Noise } (\text{bits/sample})$$'
noise_fig.add_layout(LinearAxis(y_range_name="delta", axis_label=r"Continuous minus Discrete Noise"), 'right')
noise_fig.legend.location = 'center_left'
noise_fig.legend.background_fill_alpha = 0.5
noise_fig.legend.click_policy = 'hide'
noise_fig = dpf.format_fig_fonts(noise_fig, font_size=14)
show(noise_fig)

The plot above shows the 'representation noise' as a function of the dictionary size used to approximate the "ground truth" distribution. The dictionary size does not affect the continuous distribution fits relative to each other (differences are negligible compared to the magnitude of the noise itself).  The preceding analysis examined the noise from individual distribution representations, and the next step is to compare the *total* noise (epistemic uncertainty) to the discriminant value.  

In [None]:
test_proxy, test_target = filtered_stns[15], filtered_stns[16]

# Retrieve the data for both stations
# this is all data, including non-concurrent
adf = dpf.retrieve_nonconcurrent_data(test_proxy, test_target)

# add a very small amount of noise to prevent duplicate bin edges
for stn in [test_proxy, test_target]:
    depth = 1e-9
    noise = np.random.uniform(-depth, depth, size=len(adf))
    adf[stn] += noise

df = adf.copy().dropna(subset=[test_proxy, test_target], how='any')
if df.empty:
    num_complete_concurrent_years = 0
else:
    df.reset_index(inplace=True)
    num_complete_concurrent_years = dpf.count_complete_years(df, 'time', test_proxy)
    print(num_complete_concurrent_years, 'concurrent years of record')

In [None]:
# simulate runoff at the target based on equal unit area runoff scaling
target_da = attr_df.loc[attr_df['official_id'] == test_target, 'drainage_area_km2'].values[0]
proxy_da = attr_df.loc[attr_df['official_id'] == test_proxy, 'drainage_area_km2'].values[0]
# by the equal unit area runoff model assumption, the simulated target runoff
# is equal to the observed proxy unit area runoff
sim_target = f'{test_target}_sim'
df[sim_target] = df[test_proxy] * (target_da / proxy_da)

In [None]:
def compute_discrete_probabilities(df, test_target, n_bins):
    # define states / binning
    target_bin_edges = digitize_by_log_binning(df, test_target, n_bins)

    # quantize the observed and simulated series according to the defined states
    obs_q_label = f'{test_target}_quantized'
    sim_q_label = f'{test_target}_sim_quantized'
    df[obs_q_label] = np.digitize(df[test_target], target_bin_edges)    
    df[sim_q_label] = np.digitize(df[sim_target], target_bin_edges)

    # convert quantized series to discrete distributions
    obs_count_df = df.groupby(obs_q_label).count()
    sim_count_df = df.groupby(sim_q_label).count()
    
    count_df = pd.DataFrame(index=range(n_bins))
    count_df[test_target] = 0
    count_df[sim_target] = 0
    
    count_df[test_target] += obs_count_df[test_target]
    count_df[sim_target] += sim_count_df[sim_target]
    count_df.fillna(0, inplace=True)

    # assume a prior on the surrogate/proxy distribution
    for prior in priors:
        pseudo_counts = 10**prior
        post_label = f'{prior}_pseudo_counts'
        count_df[post_label] = count_df[sim_target] + pseudo_counts
        # count_df[post_label] /= count_df[post_label].sum()
    
    # normalize the counts to probabilities
    probabilities = count_df / count_df.sum()
    assert np.all(probabilities[test_target].sum().round(3) == 1.0)
    return probabilities, target_bin_edges



In [None]:
def remap_probabilities(b2_bins, b1_bins, b1_probs):
    bin_indices = np.digitize(b2_bins[:-1], b1_bins) - 1  # Map high-res bins to low-res bins
    
    # Compute counts of high-res bins falling into each low-res bin
    counts_per_low_bin = np.bincount(bin_indices, minlength=len(b1_probs))

    # Broadcast low-res frequencies to high-res bins, dividing by the count to distribute uniformly
    high_res_probs = b1_probs[bin_indices] / counts_per_low_bin[bin_indices]
    # Normalize to ensure the result sums to 1
    high_res_probs /= np.sum(high_res_probs)
    return high_res_probs

In [None]:
# set the "baseline" probabilities using a large dictionary
bitrates = np.arange(4, 16.25, 0.25)
large_dict_bins = 2**16
large_dict_probabilities, large_dict_edges = compute_discrete_probabilities(df.copy(), test_target, large_dict_bins)
print(f'{len(large_dict_edges)} edges')

In [None]:
noise_dict = {'discrete': {}, 'continuous': {}}
noise_ratios = {}
noise_decomp = []
for b in bitrates:
    print(f'{b} bits')
    n_bins = int(2**b - 2)
    # compute the quantization noise of the observed TARGET series
    # mapped by log-binning to the TARGET observed range
    prob_df, bin_edges = compute_discrete_probabilities(df, test_target, int(2**b))
    obs_remapped_discrete = remap_probabilities(large_dict_edges, bin_edges, prob_df[test_target].values)
    quantization_noise_observed = compute_kl_divergence(large_dict_probabilities[test_target].values, obs_remapped_discrete)
    # print(f'Target quantization noise ({max(bitrates) + 1} to {b} bits quantization): {quantization_noise_observed:.2f}')
    # noise_dict['discrete'][b] = {'obs': quantization_noise_observed}
    
    # compute the parametric fits of OBSERVED data aligned with the discrete bin edges
    obs_data = df[test_target].dropna().values
    obs_continuous = fit_continuous_distributions(obs_data, bin_edges)
    # compute the parametric fits of SIMULATED data aligned with the discrete TARGET bin edges
    sim_data = df[sim_target].dropna().values
    sim_continuous = fit_continuous_distributions(sim_data, bin_edges)
    
    rep_noise_dict = {}
    # compute the "noise" associated with continuous distribution representation 
    # of both observed and simulated distributions
    for cd in ['LN', 'EXP', 'KP4', 'KDE']:
        rep_noise_obs = compute_kl_divergence(prob_df[test_target].values, obs_continuous[f'{stn}_{cd}'].values)
        rep_noise_sim = compute_kl_divergence(prob_df[sim_target].values, sim_continuous[f'{stn}_{cd}'].values)
        # print(f'    Representation noise: obs (sim) = {rep_noise_obs:.2f} ({rep_noise_sim:.2f}) ({cd} continuous to {b} bits discrete)')
        # noise_dict['continuous'][b] = {'obs': rep_noise_obs, 'sim': rep_noise_sim} 
        rep_noise_dict[cd] = {'obs': rep_noise_obs, 'sim': rep_noise_sim}

    # compute prior noise and DKL(observed || simulated_with_prior)
    # prior_noise, discriminants = [], []
    prior_dict = {}
    noise_ratios[b] = []
    for prior in priors:
        post_label = f'{prior}_pseudo_counts'
        prior_noise = compute_noise(prob_df, sim_target, post_label)
        dkl_discriminant = compute_noise(prob_df, test_target, post_label) 
        # in this case it's the DKL(Target || Posterior Simulated)
        prior_dict[prior] = {'prior_noise': prior_noise, 'discriminant': dkl_discriminant}

        # compute the quantization (discrete) noise
        # relative to a baseline (large) dictionary for the OBSERVED data
        # remap the current dictionary frequencies to the higher resolution "baseline" bins
        # observed_probs = probabilities[test_target]
        # obs_remapped_discrete = remap_probabilities(large_dict_edges, bin_edges, observed_probs.values)
        # quantization_noise_observed = compute_kl_divergence(large_dict_probabilities[test_target], obs_remapped_discrete)

        simulated_probs = prob_df[sim_target].values
        sim_remapped_discrete = remap_probabilities(large_dict_edges, bin_edges, simulated_probs)
        quantization_noise_simulated = compute_kl_divergence(large_dict_probabilities[sim_target], sim_remapped_discrete)
        # print(f'    quantization noise: observed={quantization_noise_observed:.2f}, simulated={quantization_noise_simulated:.2f} + {prior_noise:.2f} prior ({max(bitrates) + 1} to {b} bits quantization)')

        # compute the difference between the noise added from quantization and prior on the simulated data 
        # and the noise added from quantization on the observed data
        discrete_noise_diff = abs(quantization_noise_observed - (quantization_noise_simulated + prior_noise))
        SNR = dkl_discriminant / discrete_noise_diff
        # print(f'    Noise ratio = {noise_ratio:.2f} (prior: 10^{prior} pseudo-counts)')
        noise_ratios[b].append(SNR)
        noise_decomp += [(b, prior, quantization_noise_observed, quantization_noise_simulated, prior_noise, dkl_discriminant)]


In [None]:
noise_comp_df = pd.DataFrame(noise_decomp, columns=['bitrate', 'prior', 'obs_noise', 'sim_noise', 'prior_noise', 'dkl_discriminant'])
noise_comp_df.head()

In [None]:
from bokeh.palettes import Sunset11, TolRainbow23

noise_ratio_plot = figure(width=1200, height=500, y_axis_type='log')
noise_ratio_plot.varea(x=[-3, 3.75], y1=[1, 1], y2=[1e4, 1e4], legend_label='Signal > Noise', color='lightgreen', fill_alpha=0.5)
noise_ratio_plot.varea(x=[-3, 3.75], y1=[0.01, 0.01], y2=[1, 1], legend_label='Signal <= Noise', color='crimson', fill_alpha=0.3)
n = 0
line_dash='solid'
line_switch = 0
for b in bitrates:
    c = TolRainbow23[n]
    nrs = noise_ratios[b]
    noise_ratio_plot.line(priors, nrs, color=c, legend_label=f'{b:.2f} bits', line_width=2, line_dash=line_dash)
    n += 1
    if n > 22:
        if line_switch == 0:
            line_dash='dotted'
        else:
            line_dash='dashed'
        n = 0
        line_switch = 1        
    
noise_ratio_plot.xaxis.axis_label = r'$$\text{Prior } (\alpha =  10^x) \text{ Pseudo-Counts} $$'
noise_ratio_plot.yaxis.axis_label = r'$$\text{SNR}$$'
noise_ratio_plot.legend.click_policy = 'hide'
noise_ratio_plot.legend.location = 'top_left'
noise_ratio_plot.legend.ncols = 4
noise_ratio_plot.add_layout(noise_ratio_plot.legend[0], 'right')
noise_ratio_plot = dpf.format_fig_fonts(noise_ratio_plot)
show(noise_ratio_plot)

The plot above shows how the combination of quantization noise and the prior influence compares to the discriminant $D_\text{KL}(T||S_\text{post})$.  The discriminant can't be considered meaningful if it is not greater than the epistemic uncertainty introduced by the models and methods.

The quantization noise has the same "flattening" effect on both the simulated and observed distributions, but different amounts of noise are added to each because they are distinct to begin with.  So it is the difference in how much the distributions are influenced that is important in comparison to the discriminant.  

**Observation**: Somewhere between 6 and 7 bits there is a threshold.  Below this threshold, the quantization adds more noise to the observed series, and above the threshold the quantization adds more noise to the simulated series. 
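
As a minimal sketch of the ratio plotted above, assuming the three noise terms and the discriminant have already been computed: the function name and the numeric values below are hypothetical and are not results from the dataset. Unlike the loop above, the sketch guards against a zero denominator.

```python
def signal_to_noise(discriminant, obs_noise, sim_noise, prior_noise):
    """Ratio of the discriminant to the net difference between the noise
    added to the simulated series (quantization + prior) and the noise
    added to the observed series (quantization only)."""
    net_noise = abs(obs_noise - (sim_noise + prior_noise))
    if net_noise == 0:
        # the noise terms cancel exactly, so the ratio is unbounded
        return float('inf')
    return discriminant / net_noise

# made-up values, all in the same units as the divergences
snr = signal_to_noise(discriminant=0.40, obs_noise=0.05, sim_noise=0.03, prior_noise=0.01)
print(f'SNR = {snr:.1f}')
```

A ratio above 1 falls in the green "Signal > Noise" band of the plot, and a ratio at or below 1 falls in the red band.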

In [None]:
def compute_adjusted_noise(row):
    a, b, c = row['obs_noise'], row['sim_noise'], row['prior_noise']
    return abs(b + c - a)

In [None]:
from bokeh.palettes import tol
# noise_ratio_plot = figure(width=800, height=500)
n = 0
noise_decomp_plots = []
N = 8
for b in bitrates:
    if (b % 1 != 0) & (b >= 8):
        continue
    # c = Sunset10[n]
    # nrs = noise_ratios[b]
    # noise_ratio_plot.line(priors, nrs, color=c, legend_label=f'{b} bits', line_width=2)
    noise_decomp_plot = figure(title=f'{b} bits dictionary', width=400, height=300)
    dat = noise_comp_df[noise_comp_df['bitrate'] == b].copy()
    # dat['quant_noise_diff'] = abs(dat['obs_noise'] - dat['sim_noise'])
    dat['adjusted_error'] = dat.apply(lambda row: compute_adjusted_noise(row), axis=1)
    pstrings = ['adjusted_error', 'prior_noise']
    if b % 1 == 0:
        print(b)
        print(dat[['obs_noise', 'sim_noise', 'prior'] + pstrings].head())
    # print(asdf)
    
    noise_decomp_plot.varea_stack(stackers=pstrings, x='prior', color=tol['Sunset'][N][-2:], 
                                  fill_alpha=0.75, legend_label=pstrings, source=dat)
    noise_decomp_plot.line(dat['prior'], dat['dkl_discriminant'], line_width=3, line_dash='dotted', 
                           color='black', legend_label='DKL(P||Q_post)')
    noise_decomp_plot.xaxis.axis_label = r'$$\text{Prior } (\alpha =  10^x) \text{ Pseudo-Counts} $$'
    noise_decomp_plot.yaxis.axis_label = r'$$\text{Noise (bits/sample)}$$'
    noise_decomp_plot.legend.location = 'top_center'
    noise_decomp_plot.legend.background_fill_alpha = 0.5
    noise_decomp_plots.append(noise_decomp_plot)
    n += 1

In [None]:
layout = gridplot(noise_decomp_plots, ncols=3, width=400, height=300)
show(layout)

## Formalize the problem statement

1. Let $X_S$ be the time series runoff at a (**S**urrogate) catchment used as a source/basis for regional information transfer.
2. Let $X_T$ be the time series runoff at a (**T**arget) catchment where we want to simulate streamflow.
3. Let $\hat X_T$ be a simulation of $X_T$ based on a simple equal unit area runoff regional transfer model using the surrogate catchment:
    * $\hat X_T = \frac{A_T}{A_S} X_S$

We want to evaluate how well the surrogate catchment represents the overall catchment response of the target.  A perfect representation corresponds to an exact match between flow duration curves (FDCs), and otherwise the difference in overall basin response is quantified by the (Kullback-Leibler) divergence of the simulated distribution ($\hat T = \text{Pr}(\hat X_T)$) from the "ground truth" observed distribution ($T = \text{Pr}(X_T)$), expressed as $D_\text{KL}(T||\hat T)$.  The divergence represents the additional description length incurred when observations drawn from $T$ are encoded with a code optimized for $\hat T$, reflecting the errors in frequency estimation relative to the 'ground truth' distribution. The KL divergence can be interpreted as a discriminant function, quantifying how well a surrogate model approximates the 'ground truth' distribution. When comparing the divergence values for different surrogate catchments simulating the same target, the model with the lower divergence is preferred since it represents a closer match.  To compute the KL divergence, we first need to represent $X_T$ and $\hat X_T$ as distributions.  The problem setup is formalized as follows:

4. Let $T(b)$ be a quantized representation of $X_T$ using a dictionary of $2^b$ symbols, each representing a range of continuous values at $b$-bit precision.
    * A dictionary of $2^b$ symbols is defined by $2^b - 1$ finite bin edges, where
    * the set of bin edges $B_T$ is logarithmically spaced, with a constant spacing in log space of $[\log(\max(X_T)) - \log(\min(X_T))] / (2^b - 2)$.
    * The $(2^b - 2)$ denominator reserves two bins, one at each end of the distribution.
    * The quantization is 1-indexed, so
        * the quantized representation of $X_T$ will fall in the range $[1, 2^b - 2]$ (i.e. 14 symbols if $b = 4$),
        * simulated values $t_i < \min(X_T)$ are assigned index 0 (i.e. the 15th symbol if $b = 4$), and
        * simulated values $t_i > \max(X_T)$ are assigned index $2^b - 1$ (i.e. the 16th symbol if $b = 4$).
5. $\hat T(b)$ is then the quantized representation of the simulated series $\hat X_T$ using the bin edges $B_T$ defined on the "ground truth" range.
6.  Compute the discriminant, the quantization noise, and the noise due to the prior:
    *  $D_\text{KL}(T(b)||\hat T(b))$: the discriminant quantifies the divergence between simulation $\hat T(b)$ and ground truth distributions
    *  $D_\text{KL}(T(b)|| T(b_\text{max}))$: the quantization noise for $T$,
    *  $D_\text{KL}(\hat T(b)|| \hat T(b_\text{max}))$: the quantization noise for $\hat T$.
    *  $b_\text{max}$ is the dictionary size (in bits) at which the noise added is uniform across all samples.  It represents a dictionary size that retains the full variance of the original series by assigning a unique symbol to each unique observation.  For the sample of observed series in the dataset, a dictionary size of $2^{12}$ adds constant noise within 1% to all streamflow time series.
7.  Compute the (signal to epistemic noise) ratio of the discriminant to the total noise to determine the strength of "genuine difference" between the simulated and observed distributions $T$ and $\hat T$.
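
Steps 3 to 6 can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used elsewhere in this notebook: the function names, the drainage areas, and the flow values are hypothetical. The sketch reads the binning rule as $2^b - 2$ log-spaced interior bins over the observed target range, with index 0 and index $2^b - 1$ reserved for simulated values that fall below and above that range. A small uniform pseudo-count is added to the simulated counts so the divergence stays finite.

```python
import numpy as np

def quantize_to_target_range(values, target_obs, b):
    """Map values to symbol indices in [0, 2**b - 1] using log-spaced
    bin edges defined on the observed target range."""
    values = np.asarray(values, dtype=float)
    lo, hi = float(np.min(target_obs)), float(np.max(target_obs))
    n_interior = 2**b - 2
    edges = np.logspace(np.log10(lo), np.log10(hi), n_interior + 1)
    # np.digitize returns 0 below the first edge and len(edges)
    # (= 2**b - 1) at or above the last edge
    idx = np.digitize(values, edges)
    # guard against rounding at the end edges: anything inside the
    # observed range must land on an interior symbol
    in_range = (values >= lo) & (values <= hi)
    idx[in_range] = np.clip(idx[in_range], 1, n_interior)
    return idx

def symbol_probabilities(idx, b, pseudo_count=0.0):
    """Relative symbol frequencies, optionally with a uniform prior."""
    counts = np.bincount(idx, minlength=2**b) + pseudo_count
    return counts / counts.sum()

def kl_divergence_bits(p, q):
    """D_KL(p || q) in bits; terms with p == 0 contribute zero."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log2(p[mask] / q[mask])))

# hypothetical example: surrogate runoff scaled by the drainage area ratio
b = 4
area_target, area_surrogate = 50.0, 100.0  # km^2, made-up values
x_target = np.array([1.0, 10.0, 100.0, 1000.0])
x_surrogate = np.array([1.0, 30.0, 400.0, 5000.0])
x_sim = (area_target / area_surrogate) * x_surrogate

idx_obs = quantize_to_target_range(x_target, x_target, b)
idx_sim = quantize_to_target_range(x_sim, x_target, b)
p_obs = symbol_probabilities(idx_obs, b)
p_sim = symbol_probabilities(idx_sim, b, pseudo_count=0.01)
discriminant = kl_divergence_bits(p_obs, p_sim)
print(idx_obs, idx_sim, round(discriminant, 3))
```

In this toy case the smallest and largest simulated values fall outside the observed range, so they are assigned the two reserved symbols rather than being dropped.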

In [29]:
# create a new output filename 
attributes_filename = 'BCUB_watershed_attributes_updated.csv'
attributes_fpath = os.path.join(os.getcwd(), 'data', attributes_filename)
attr_df = pd.read_csv(attributes_fpath)
attr_df.columns = [e.lower() for e in attr_df.columns]
# df.columns
filtered_stns = sorted(list(set(attr_df['official_id'].values)))

In [30]:
def compute_kl_divergence(P, Q):
    """Compute the KL divergence DKL(P || Q)."""
    return np.sum(P * np.log(P / Q), where=(P != 0))

def generate_and_plot_kl_vs_prior(mu1, sigma1, mu2, sigma2, b, priors, y_range=(0, 0.02)):
    """
    Generates two log-normal distributions, quantizes them with 2^b symbols,
    computes the posterior with varying priors, and plots KL divergences DKL(P||R) and DKL(Q||R).

    Parameters:
    - mu1, sigma1: Parameters for the first log-normal distribution.
    - mu2, sigma2: Parameters for the second log-normal distribution.
    - b: Number of symbols = 2^b for quantization.
    - priors: Array of prior pseudo-counts to apply to Q.
    - y_range: Range for the y-axis in the plot.
    """
    # Create the two log-normal distributions
    minx, maxx = 0.0, 5.0
    bins = np.linspace(minx, maxx, 2 ** b + 1)
    bins_midpoints = (bins[:-1] + bins[1:]) / 2

    # Evaluate the distributions at the bin midpoints
    dist1 = lognorm.pdf(bins_midpoints, sigma1, scale=np.exp(mu1))
    dist2 = lognorm.pdf(bins_midpoints, sigma2, scale=np.exp(mu2))

    # Normalize to create PDFs
    P = dist1 / np.sum(dist1)
    Q = dist2 / np.sum(dist2)

    kl_p_r_list = []
    kl_q_r_list, ratio_list = [], []

    # Compute KL divergences for each prior value
    for prior_pseudo_count in priors:
        prior_counts = np.full_like(Q, prior_pseudo_count)
        posterior_counts = Q * np.sum(dist2) + prior_counts  # Adding pseudo-counts
        R = posterior_counts / np.sum(posterior_counts)  # Renormalize

        # Ensure valid PDFs
        assert np.abs(sum(R) - 1) < 0.001, 'R does not sum to 1'

        # Compute KL divergences
        kl_p_r = compute_kl_divergence(P, R)
        kl_q_r = compute_kl_divergence(Q, R)

        kl_p_r_list.append(kl_p_r)
        kl_q_r_list.append(kl_q_r)
        
        # Compute ratio as percentage
        ratio = (kl_q_r / kl_p_r) * 100 if kl_p_r != 0 else np.nan
        ratio_list.append(ratio)

    # Prepare data for plotting
    source = ColumnDataSource(data=dict(
        prior=priors,
        kl_p_r=kl_p_r_list,
        kl_q_r=kl_q_r_list,
        ratio=ratio_list,
    ))

    # the upper limit of the ratio axis is fixed at 10%
    ratio_range = (min(ratio_list) * 0.98, 10)

    # Create the Bokeh plot
    p = figure(title="",
               x_axis_label='Prior Pseudo-count',
               y_axis_label='KL Divergence',
               x_axis_type='log',
               y_range=y_range,
               width=600, height=400)

    # Add secondary y-axis for the ratio
    p.extra_y_ranges = {"ratio": Range1d(*ratio_range)}
    p.add_layout(LinearAxis(y_range_name="ratio", axis_label='Noise (%)'), 'right')

    p.line('prior', 'kl_p_r', source=source, line_width=2, color='black', legend_label='DKL(P || R)')
    p.line('prior', 'kl_q_r', source=source, line_width=2, color='red', legend_label='DKL(Q || R)')

    # Plot the ratio on the secondary y-axis
    p.line('prior', 'ratio', source=source, line_width=2, color='red', 
           line_dash='dashed', y_range_name="ratio",
           legend_label='Prior Influence (%)')

    p.line(priors, [5 for _ in priors], line_width=2, color='red',
           line_dash='dotted', y_range_name='ratio',
           legend_label='5% noise limit')

    # Configure the legend and show the plot
    p.legend.location = "top_left"
    p.legend.click_policy = "hide"
    return p

In [31]:
mu, sigma = 0.25, 0.35
priors = np.logspace(-6, -2, 100)
prior_vs_kld = generate_and_plot_kl_vs_prior(mu, sigma, mu+0.025, sigma - 0.02, 8, priors)
prior_vs_kld = dpf.format_fig_fonts(prior_vs_kld)
show(prior_vs_kld)

Let's run through an example computation to see the difference between 4, 6, and 8 bit quantization, how each represents the total measurement range, and how each quantization aligns with your own expectation of heteroscedastic rating curve uncertainty.
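
Before working with real records, a rough sketch helps set expectations. It assumes a hypothetical measurement range of 0.1 to 1000 m³/s and log-spaced bin edges. Because adjacent log-spaced edges share a constant ratio, every bin spans the same percentage of its lower edge, which is the sense in which log binning resembles a heteroscedastic (proportional) error model. For this assumed range the relative bin width works out to roughly 78%, 15%, and 4% for 4, 6, and 8 bits respectively.

```python
import numpy as np

q_min, q_max = 0.1, 1000.0  # hypothetical measurement range [m^3/s]
relative_widths = {}
for b in (4, 6, 8):
    edges = np.logspace(np.log10(q_min), np.log10(q_max), 2**b + 1)
    # the ratio of adjacent edges is constant across the whole range
    relative_widths[b] = edges[1] / edges[0] - 1
    print(f'{b} bits: {2**b} bins, relative width {100 * relative_widths[b]:.1f}%, '
          f'first bin {edges[1] - edges[0]:.4f} m^3/s wide, '
          f'last bin {edges[-1] - edges[-2]:.1f} m^3/s wide')
```

Comparing the printed relative widths against the percentage uncertainty you would expect from a rating curve gives a first indication of which dictionary size is finer than the measurements can support.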

In [32]:
# Step 1: Compute L-moments
def compute_l_moments(data):
    sorted_data = np.sort(data)
    n = len(sorted_data)
    k = np.arange(n)

    # Unbiased probability weighted moments (PWMs) b0..b3
    b0 = np.mean(sorted_data)
    b1 = np.sum(k / (n - 1) * sorted_data) / n
    b2 = np.sum(k * (k - 1) / ((n - 1) * (n - 2)) * sorted_data) / n
    b3 = np.sum(k * (k - 1) * (k - 2) / ((n - 1) * (n - 2) * (n - 3)) * sorted_data) / n

    # L-moments as linear combinations of the PWMs
    l1 = b0
    l2 = 2 * b1 - b0
    l3 = 6 * b2 - 6 * b1 + b0
    l4 = 20 * b3 - 30 * b2 + 12 * b1 - b0

    # L-moment ratios (L-skewness and L-kurtosis)
    tau3 = l3 / l2
    tau4 = l4 / l2

    return l1, l2, tau3, tau4

# Step 2: Estimate initial Kappa parameters using L-moments
def initial_kappa_params(data):
    l1, l2, tau3, tau4 = compute_l_moments(data)
    
    # Use L-moment formulas to estimate the shape and scale of Kappa distribution
    h = (1 - tau3) / 2 if tau3 < 1 else 0.1  # Example heuristic
    k = (1 - tau4) / 2 if tau4 < 1 else 0.1  # Example heuristic

    loc = l1  # Initial guess for location
    scale = l2  # Initial guess for scale
    print('initial params: ', h, k, loc, scale)
    return h, k, loc, scale

# Step 3: Define constraints based on Castellarin (e.g., ensuring kK <= 0 and hK > 0)
def constrained_optimization(data):
    initial_params = initial_kappa_params(data)

    # Define the objective function as the negative log-likelihood
    def neg_log_likelihood(params):
        h, k, loc, scale = params
        if scale <= 0:  # Enforce positive scale
            return np.inf
        return -np.sum(kappa4.logpdf(data, h, k, loc=loc, scale=scale))

    # Constrain h > 0, k <= 0, and enforce scale > 0
    constraints = [
        {'type': 'ineq', 'fun': lambda x: x[0]},  # h > 0
        {'type': 'ineq', 'fun': lambda x: -x[1]},  # k <= 0
        {'type': 'ineq', 'fun': lambda x: x[3] - 1e-5}  # scale > 0
    ]
    
    # Run the constrained optimization
    result = minimize(
        neg_log_likelihood,
        initial_params,
        constraints=constraints,
        method='SLSQP'
    )

    if result.success:
        h, k, loc, scale = result.x
        return {'h': h, 'k': k, 'loc': loc, 'scale': scale}
    else:
        raise ValueError("Optimization did not converge")


In [33]:
def create_MLE_fit_plot(b, df, stn):
    test_fig = figure(title=None,
                     width=600, height=500, x_axis_type='log')  

    # plot empirical (discrete) distributions using linear and log binning
    label = f'{b}_bit_log'
    bin_edges, freqs, log_bin_edges, log_freqs = compute_discrete_distributions(df, b, label, stn)
    test_fig.quad(left=bin_edges[:-1], right=bin_edges[1:], top=freqs, bottom=[0 for _ in freqs], 
                  legend_label=f'{b} bits linear bins', color=Sunset10[0], fill_alpha=0.4, line_color=None)
    test_fig.quad(left=log_bin_edges[:-1], right=log_bin_edges[1:], top=log_freqs, 
                  bottom=[0 for _ in log_freqs], legend_label=f'{b} bits log bins', color=Sunset10[8], 
                  line_color=None, fill_alpha=0.6)

    # fit and plot a lognormal distribution
    ln_shape, ln_loc, ln_scale = lognorm.fit(df[stn], floc=0)  # Fixing location to 0
    x = np.logspace(-2, 3, 1000)
    ln_mle_pdf = lognorm.pdf(x, ln_shape, loc=0, scale=ln_scale)

    # fit and plot an exponential distribution
    ex_loc, ex_scale = expon.fit(df[stn], floc=0)
    ex_mle_pdf = expon.pdf(x, loc=0, scale=ex_scale)

    test_fig.line(x, ln_mle_pdf, color='black', legend_label='LN MLE pdf', line_width=2)
    test_fig.line(x, ex_mle_pdf, color='grey', legend_label='EXP MLE pdf', line_width=2)
    test_fig.legend.background_fill_alpha = 0.6
    test_fig.legend.location = 'top_right'
    test_fig.legend.click_policy='hide'
    test_fig.xaxis.axis_label = r'$$\text{Mean Daily Flow } [m^3/s]$$'
    test_fig.yaxis.axis_label = r'$$P(X)$$'
    test_fig = dpf.format_fig_fonts(test_fig)
    return test_fig


In [34]:
def create_logspace_bins(data, num_bins):
    """Create linearly spaced bins in log space over the range of `data`."""
    min_val, max_val = np.min(data), np.max(data)
    log_min, log_max = np.log10(min_val), np.log10(max_val)
    bin_edges = np.logspace(log_min, log_max, num_bins + 1)
    return bin_edges

def calculate_probabilities(data, bin_edges):
    """Calculate the probabilities of data falling into each bin given `bin_edges`."""
    counts, _ = np.histogram(data, bins=bin_edges)
    probabilities = counts / counts.sum()  # Normalize to get probabilities
    return probabilities

def aggregate_probabilities(high_res_probs, high_res_bins, low_res_bins):
    """
    Aggregate high-resolution probabilities to match low-resolution bins.
    This will sum the high_res_probs that fall within each low-res bin.
    """
    low_res_probs = []
    for i in range(len(low_res_bins) - 1):
        # Find indices of high-resolution bins that fall within the current low-res bin
        indices = np.where((high_res_bins[:-1] >= low_res_bins[i]) & (high_res_bins[:-1] < low_res_bins[i+1]))[0]
        # Sum probabilities within the range
        low_res_probs.append(np.sum(high_res_probs[indices]))
    return np.array(low_res_probs)

def compute_kl_divergence(p, q):
    """Compute KL divergence DKL(p || q) in nats between two discrete distributions."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    # terms with p == 0 contribute zero by convention (0 * log 0 = 0)
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

def compute_discrete_distributions(df, b, label, stn, log_bins=True):
    # 1. Determine the log-spaced bin edges
    vals = df[stn].values
    minx, maxx = np.min(vals), np.max(vals)
    log_min = np.log10(minx)  # assumes minx > 0 (log of zero is undefined)
    log_max = np.log10(maxx)
    log_edges = np.logspace(log_min, log_max, 2**b + 1)

    # 2. Compute the histogram using log-spaced bins
    log_freqs, log_bin_edges = np.histogram(vals, bins=log_edges, density=True)

    # 3. Calculate the bin midpoints in log space
    log_bin_midpoints = (log_bin_edges[:-1] * log_bin_edges[1:]) ** 0.5  # Geometric mean

    freqs, bin_edges = np.histogram(vals, bins=2**b, density=True)
    bin_midpoints = (bin_edges[:-1] + bin_edges[1:]) / 2
    
    return bin_edges, freqs, log_bin_edges, log_freqs


In [35]:
quantization_noise_results = {}
for stn in filtered_stns:
    test_df = dpf.get_timeseries_data(stn)
    test_df.dropna(subset=[stn], inplace=True)
    test_bs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16]
    quantization_noise_results[stn] = [kl_divergence_between_quantizations(test_df, b, max(test_bs), stn) for b in test_bs]

NameError: name 'kl_divergence_between_quantizations' is not defined
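
The cell above fails because `kl_divergence_between_quantizations` is not defined in this part of the notebook. The following is a hypothetical sketch of what such a helper could compute, inferred only from the call signature and from the quantization-noise definition used earlier; it is not the original function. It assumes log-spaced bins over the observed range, with the $2^b$-bin histogram obtained by merging groups of the $2^{b_\text{max}}$ fine bins, and it returns $D_\text{KL}(P_\text{high} || P_\text{low})$ in bits per sample after spreading each coarse probability uniformly over its fine bins.

```python
import numpy as np

def kl_divergence_between_quantizations(df, b, b_max, stn):
    """Hypothetical sketch: quantization noise (bits/sample) of a 2**b-bin
    log-spaced histogram relative to a 2**b_max-bin baseline."""
    vals = np.asarray(df[stn], dtype=float)
    vals = vals[np.isfinite(vals) & (vals > 0)]
    edges = np.logspace(np.log10(vals.min()), np.log10(vals.max()), 2**b_max + 1)
    # pin the end edges to the data range so rounding cannot drop min/max
    edges[0], edges[-1] = vals.min(), vals.max()
    counts, _ = np.histogram(vals, bins=edges)
    p_high = counts / counts.sum()
    # each coarse bin is the union of `group` adjacent fine bins
    group = 2**(b_max - b)
    p_low = p_high.reshape(-1, group).sum(axis=1)
    # spread each coarse probability uniformly over its fine bins
    q = np.repeat(p_low / group, group)
    mask = p_high > 0
    return float(np.sum(p_high[mask] * np.log2(p_high[mask] / q[mask])))

# synthetic example (made-up data, not a station record)
rng = np.random.default_rng(0)
flows = {'stn_a': rng.lognormal(mean=0.0, sigma=1.0, size=5000)}
noise = [kl_divergence_between_quantizations(flows, b, 8, 'stn_a') for b in (2, 4, 6, 8)]
print([round(v, 3) for v in noise])
```

Under these assumptions the noise is zero when `b == b_max` and does not decrease as the dictionary gets smaller, which matches the qualitative behaviour the percentile plot in the next cell is meant to show.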

In [None]:
qn_df = pd.DataFrame.from_dict(quantization_noise_results, orient='index', columns=test_bs)
bounds = pd.DataFrame()
qn_fig = figure(width=700, height=400)
for p in [2.5, 25, 50, 75, 97.5]:
    bounds[p] = [np.percentile(qn_df[c], p) for c in test_bs]
bounds.index = test_bs
qn_fig.varea(x=test_bs, y1=bounds[2.5], y2=bounds[97.5], alpha=0.4, color='grey', legend_label='95% CI')
qn_fig.varea(x=test_bs, y1=bounds[25], y2=bounds[75], alpha=0.4, color='black', legend_label='IQR')
qn_fig.line(x=test_bs, y=bounds[50], color='crimson', legend_label='Median', line_dash='dashed', line_width=3)
qn_fig.yaxis.axis_label = r"$$\text{Noise [bits/sample]}$$"
qn_fig.xaxis.axis_label = r"$$\text{Dictionary Size (b)} [2^b = \text{N symbols}]$$"
qn_fig = dpf.format_fig_fonts(qn_fig)
show(qn_fig)

In [None]:
from bokeh.io import export_png

test_stn = filtered_stns
for stn in filtered_stns:
    output_folder = 'MLE_plots'
    plot_fpath = os.path.join(output_folder, f"{stn}_LN_and_expon_fits.png")
    if os.path.exists(plot_fpath):
        continue
    test_df = dpf.get_timeseries_data(stn)
    test_df.dropna(subset=[stn], inplace=True)
    minx, maxx = test_df[stn].min(), test_df[stn].max()
    # print(f'X range is {minx:.1f} to {maxx:.1f} cms')
    # print('')
    
    test_fig = create_MLE_fit_plot(8, test_df, stn)
    
    export_png(test_fig, filename=plot_fpath)

In [None]:
n = 0
for stn in filtered_stns:
    df = dpf.get_timeseries_data(stn)
    # drop missing values before fitting, as in the cells above
    df.dropna(subset=[stn], inplace=True)
    ln_shape, ln_loc, ln_scale = lognorm.fit(df[stn], floc=0)
    expon_loc, expon_scale = expon.fit(df[stn], floc=0)
    # kh, kk, k_loc, k_scale = fit_kappa4_mle(df[stn])
    # kh, kk, k_loc, k_scale = constrained_optimization(df[stn])
    attr_df.loc[attr_df['official_id'] == stn, ['ln_shape', 'ln_loc', 'ln_scale']] = (ln_shape, ln_loc, ln_scale)
    attr_df.loc[attr_df['official_id'] == stn, ['expon_loc', 'expon_scale']] = (expon_loc, expon_scale)
    # attr_df.loc[attr_df['official_id'] == stn, ['kappa_h', 'kappa_k', 'kappa_loc', 'kappa_scale']] = (kh, kk, k_loc, k_scale)
    for b in test_bs:
        attr_df.loc[attr_df['official_id'] == stn, f'{b}_quantization_noise'] = kl_divergence_between_quantizations(df, b, max(test_bs), stn) 
    n += 1
    if n % 150 == 0:
        print(f'    ...{n}/{len(filtered_stns)} completed.')

In [None]:
# convert the MLE parameters to dicts for easier access
ln_dict = (
    attr_df
    .set_index('official_id')[['ln_shape', 'ln_loc', 'ln_scale']]
    .to_dict(orient='index')
)
expon_dict = (
    attr_df
    .set_index('official_id')[['expon_loc', 'expon_scale']]
    .to_dict(orient='index')
)

# kappa_dict = (
#     attr_df
#     .set_index('official_id')[['kappa_a', 'kappa_b', 'kappa_c', 'kappa_d']]
#     .to_dict(orient='index')
# )

## Compute the noise added to a discriminant value due to assuming an error distribution as a prior

Rating curve uncertainty is a hard problem in hydrology.  Instead of treating daily flow observations as discrete measurements with the fixed (often overzealous) precision at which they are published by governing agencies, we can assume a basic error model and test how much the error model distorts the information in the distribution.  In other words, how much noise/uncertainty is added by a given error model?

Below we'll test a range of uniform error distributions as models for the observations.  We'll take an example streamflow record and quantize it to a range of dictionary sizes in two ways.  The first is to bin the observations as they are; we'll refer to this as the "deterministic" treatment.  The second, the "stochastic" treatment, is to apply a series of error distribution models and bin the observations by counting the fraction of each observation's error interval that lies in each bin.  In other words, we'll count partial observations in proportion to where they fall over the binning intervals, as opposed to counting a whole observation in the single bin that contains it.

The quantization takes a bitrate $b$, log-transforms the measured interval to $(\log(x_\text{min}),\log(x_\text{max}))$, and divides it into $2^b$ bins of equal width in log space.  
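
The "stochastic" treatment can be sketched as follows. This is a hypothetical illustration and not necessarily how `dpf.error_adjusted_fractional_bin_counts` is implemented, since that function is not shown here. It assumes each observation $x$ is spread uniformly (in linear space) over $[x(1 - e), x(1 + e)]$, and that any part of the interval falling outside the binned range is discarded before normalizing.

```python
import numpy as np

def fractional_bin_counts(values, bin_edges, error=0.1):
    """Spread each observation uniformly over +/- `error` (as a fraction of
    its value) and add to each bin the fraction of that interval it covers."""
    if error <= 0:
        raise ValueError('error must be positive')
    values = np.asarray(values, dtype=float)
    bin_edges = np.asarray(bin_edges, dtype=float)
    lo, hi = values * (1 - error), values * (1 + error)
    left, right = bin_edges[:-1], bin_edges[1:]
    # overlap length between each error interval (rows) and each bin (columns)
    overlap = np.minimum(hi[:, None], right[None, :]) - np.maximum(lo[:, None], left[None, :])
    overlap = np.clip(overlap, 0, None)
    # convert lengths to fractions of each observation's interval
    fractions = overlap / (hi - lo)[:, None]
    return fractions.sum(axis=0)

# toy example with four log-spaced bins between 1 and 100
edges = np.logspace(0, 2, 5)
obs = np.array([3.0, 10.0, 50.0])
counts = fractional_bin_counts(obs, edges, error=0.1)
probs = counts / counts.sum()
print(np.round(counts, 3))
```

The intermediate array has one row per observation and one column per bin, so for long records and large dictionaries a chunked or sorted-interval approach would use less memory.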

In [36]:
def compute_log_uniform_bins(df, stn, bitrate):
    n_bins = 2**bitrate
    min_log_val = np.log10(df[stn].min())
    max_log_val = np.log10(df[stn].max())

    # set the bin edges to be evenly spaced between the
    # observed range of the proxy/donor series
    # np.digitize will assign 0 for out-of-range values at left
    # and n_bins + 1 for out-of-range values at right
    log_bin_edges = np.linspace(
        min_log_val,
        max_log_val,
        n_bins + 1,
    ).flatten()

    # convert back to linear space
    bin_edges = [10**e for e in log_bin_edges]

    # there should be n_bins + 1 edges, which define n_bins bins
    # spanning the observed range
    assert len(bin_edges) == n_bins + 1
    return bin_edges

In [37]:
def apply_error_to_observations(df, stn, bitrate=None, error=0.1):
    min_q, max_q = df[stn].min() - 1e-9, df[stn].max() + 1e-9
    assert min_q > 0
    # use equal width bins in log10 space
    bin_edges = compute_log_uniform_bins(df, stn, bitrate)
    # df[f'{bitrate}_bits_quantized'] = np.digitize(df[stn], bin_edges)
    fractional_obs_counts = dpf.error_adjusted_fractional_bin_counts(
        df[stn], np.array(bin_edges), bitrate, error_factor=error
    )
    label = f'{stn}_{int(100*error)}_error'
    count_df = pd.DataFrame(index=range(2**bitrate))
    count_df[label] = 0
    count_df[label] += fractional_obs_counts
    count_df.fillna(0, inplace=True)
    n_obs = np.nansum(count_df[label])
    # normalize p_obs and p_sim
    return count_df[label].values / n_obs
    

In [38]:
def compute_unadjusted_counts(df, stn, bitrate):
    bin_edges = compute_log_uniform_bins(df, stn, bitrate)
    label = f'{stn}_simple_{bitrate}bits'
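    # np.digitize returns 1..n_bins for in-range values (and n_bins + 1 for the maximum),
    # so shift to zero-based bin indices to match the count_df index
    df[label] = np.clip(np.digitize(df[stn], bin_edges) - 1, 0, 2**bitrate - 1)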
    # print(df[[stn, f'{stn}_quantized_{bitrate}bits']].head(4))
    # count the occurrences of each quantized value
    # the "simulated" series is the proxy/donor series
    # and the "observed" series is the target location
    obs_count_df = df.groupby(label).count()
    count_df = pd.DataFrame(index=range(2**bitrate))
    count_df[label] = 0
    count_df[label] += obs_count_df[stn]
    count_df.fillna(0, inplace=True)
    adjusted_p = count_df / obs_count_df[stn].sum()
    return adjusted_p.values.flatten()

In [39]:
def compute_distortion(inputs):
    df, stn, b, err = inputs
    simple_frequencies = compute_unadjusted_counts(df, stn, b)
    error_adjusted_frequencies = apply_error_to_observations(df, stn, bitrate=b, error=err)
    # compute KL divergence between the simple and adjusted frequencies
    # this represents the distortion due to the error model
    mask = (simple_frequencies > 0) & (error_adjusted_frequencies > 0)
    distortion = np.zeros_like(simple_frequencies)
    distortion[mask] = simple_frequencies[mask] * np.log2(simple_frequencies[mask] / error_adjusted_frequencies[mask])
    kld = sum(distortion)
    return stn, kld, b, err

## Pairwise Processing



In [40]:
import itertools

# generate all combinations of pairs of station ids
id_pairs = list(itertools.combinations(filtered_stns, 2))
print(f' There are {len(id_pairs)} unique pairings in the dataset')
# shuffle the pairs to make testing smaller batches more robust
np.random.seed(42)
np.random.shuffle(id_pairs)

 There are 877150 unique pairings in the dataset


In [41]:
# load the attributes file with catchment geometries
geom_file = 'BCUB_watershed_attributes_updated.geojson'
bcub_gdf = gpd.read_file(os.path.join(os.getcwd(), 'data', geom_file))
bcub_gdf.columns = [c.lower() for c in bcub_gdf.columns]

In [42]:
# set a revision date for the results output file
revision_date = '20241125'

# how many pairs to compute in each batch
batch_size = 5000
# batch_size = 10

# # what percentage of 365 observations in a year counts as a "complete" year
# completeness_threshold = 0.9
# min_observations = 365 * 0.9

# station pairs with less than min_years concurrent years of data are excluded (for concurrent analysis),
# stations with less than min_years are excluded (for non-concurrent analysis),
min_years = 1 #[2, 3, 4, 5, 10]

# a prior is applied to q in the form of a uniform array of 10**c pseudo-counts "c"
# this prior is used to test the effect of the choice of prior on the model
pseudo_counts = [-5, -4, -3, -2, -1, -0.5, -0.2, -0.1, 0, 0.1, 0.2, 0.5, 1, 2, 3, 4, 5]

# set the number of quantization levels to test, equal to 2^bitrate
bitrates = [3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]

# Preload all records into a dictionary for fast lookup
records_dict = bcub_gdf.copy().set_index('official_id').to_dict(orient='index')

In [43]:
temp_dir = os.path.join(os.getcwd(), 'data/', 'temp')
if not os.path.exists(temp_dir):
    os.makedirs(temp_dir)

In [44]:
def input_batch_generator(df, id_pairs_filtered, min_years, use_partial_counts):
    batch_inputs = []
    for proxy, target in id_pairs_filtered:
        
        proxy_dict = records_dict.get(proxy, {})
        target_dict = records_dict.get(target, {})

        proxy_dict['official_id'] = proxy
        target_dict['official_id'] = target

        assert 'geometry' in proxy_dict.keys(), proxy_dict.keys()
        assert 'geometry' in target_dict.keys(), target_dict.keys()
        
        batch = [proxy_dict, target_dict, min_years,]
        batch_inputs.append(batch)
    return batch_inputs

In [45]:
def compute_kl_estimates(P_dd, Q_dd, epsilon=1e-10):
    """
    Compute row-wise KL divergence estimates while ensuring positivity and handling edge cases.

    Args:
        P_dd: 2D array of P_dd values (with NaNs).
        Q_dd: 2D array of Q_dd values (with NaNs).
        epsilon: Small constant to prevent division by zero.

    Returns:
        kl_estimates: 1D array of row-wise KL divergence estimates.
    """
    # Replace NaNs with 0 in P_dd and epsilon in Q_dd
    P_dd_safe = np.where(~np.isnan(P_dd), P_dd, 0)
    Q_dd_safe = np.where(~np.isnan(Q_dd), Q_dd, epsilon)

    # Ensure rows are normalized
    P_dd_safe = P_dd_safe / np.nansum(P_dd_safe, axis=1, keepdims=True)
    Q_dd_safe = Q_dd_safe / np.nansum(Q_dd_safe, axis=1, keepdims=True)

    # Compute log2(P_dd / Q_dd) safely
    log_ratios = np.log2(P_dd_safe / Q_dd_safe)
    log_ratios = np.where(P_dd_safe > 0, log_ratios, 0)  # Ignore invalid ratios

    # Compute row-wise KL divergence
    kl_estimates = np.nansum(P_dd_safe * log_ratios, axis=1)

    return kl_estimates

In [46]:
def adjust_kde_bandwidth(data, factor=1.0, existing_bw=None):
    """
    Adjust the bandwidth of a Gaussian KDE by scaling Scott's bandwidth.

    Args:
        data: 1D array of data points for KDE.
        factor: Scaling factor for the bandwidth (default is 1.0, i.e., no adjustment).
        existing_bw: Optional bandwidth to scale instead of computing Scott's bandwidth.

    Returns:
        adjusted_bw: The scaled bandwidth.
    """
    # Compute Scott's bandwidth
    n = data.size
    if existing_bw is None:
        scott_bw = jnp.std(data) * n**(-1 / 5)  # Scott's rule
    else:
        scott_bw = existing_bw

    # Scale the bandwidth
    adjusted_bw = factor * scott_bw
    print(f'    KDE bandwidth adjusted from {scott_bw:.3f} to {adjusted_bw:.3f}')

    return adjusted_bw
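
When a scalar is passed as `bw_method`, it helps to know how the KDE interprets it. In SciPy's `gaussian_kde`, a scalar is used as the multiplicative *factor* applied to the sample standard deviation, not as an absolute kernel width. The sketch below checks this on synthetic data (an assumption); `jax.scipy.stats.gaussian_kde` is modelled on the same interface, but its behaviour should be confirmed against the installed version.

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
log_q = rng.normal(loc=1.0, scale=0.5, size=2000)  # synthetic log10 flows (assumption)

kde = gaussian_kde(log_q, bw_method='scott')
scott_factor = log_q.size ** (-1 / 5)         # Scott's factor for 1D data
kernel_std = np.sqrt(kde.covariance[0, 0])    # kernel standard deviation actually used

print(kde.factor, scott_factor)
print(kernel_std, scott_factor * np.std(log_q, ddof=1))

# A scalar bw_method is taken as the factor, so widening the kernel by 1% is:
kde_wide = gaussian_kde(log_q, bw_method=1.01 * kde.factor)
print(kde_wide.factor)
```

If the same convention holds for the JAX KDE, passing a value that already includes the standard deviation would apply the standard deviation twice, which is worth verifying before relying on the adjusted bandwidth.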

In [47]:
def assert_alignment(a, b):
    # Define a mask for valid kde_density values
    valid_mask_1 = ~jnp.isnan(a)
    
    # Define a mask for valid x1_diffs values
    valid_mask_2 = ~jnp.isnan(b)
    
    # Assert that wherever kde_density is valid, x1_diffs is also valid
    assert jnp.all(jnp.logical_not(valid_mask_1) | valid_mask_2), (
        "x1_diffs contains invalid (NaN) values where kde_density is valid."
    )

In [48]:
def justify(a, invalid_val=np.nan, axis=1, side='right'):    
    """
    Justifies a 2D array by pushing valid entries to one side.

    Parameters
    ----------
    a : ndarray
        Input array to be justified
    invalid_val : scalar
        Value marking invalid entries (default np.nan)
    axis : int
        Axis along which justification is to be made
    side : str
        Direction of justification. It could be 'left', 'right', 'up', 'down'
        It should be 'left' or 'right' for axis=1 and 'up' or 'down' for axis=0.
    """

    if invalid_val is np.nan:
        mask = ~np.isnan(a)
    else:
        mask = a!=invalid_val
    justified_mask = np.sort(mask,axis=axis)
    if (side=='up') | (side=='left'):
        justified_mask = np.flip(justified_mask,axis=axis)
    out = np.full(a.shape, invalid_val) 
    if axis==1:
        out[justified_mask] = a[mask]
    else:
        out.T[justified_mask.T] = a.T[mask.T]
    return out


def compute_adjusted_cdf_and_density(data):
    """
    Compute adjusted CDF and density for each row of a 2D array using JAX.

    Args:
        data: 2D JAX array, where each row represents a sample set.

    Returns:
        adjusted_cdfs: 2D array of adjusted CDFs, padded with NaN where necessary.
        adjusted_densities: 2D array of adjusted densities, padded with NaN where necessary.
        unique_vals: 2D array of unique values per row, padded with NaN where necessary.
    """
    # Sort rows; NaNs are sorted to the end
    sorted_data = jnp.sort(data, axis=1)

    # Count consecutive duplicates to determine unique values and counts
    diffs = jnp.diff(sorted_data, axis=1, prepend=0)
    unique_mask = diffs != 0  # Marks unique values in each row

    # Create unique values and compute their counts
    unique_vals = jnp.where(unique_mask, sorted_data, jnp.nan)
    counts_per_row = jnp.sum(unique_mask, axis=1, keepdims=True)

    # Compute cumulative counts for unique values only
    adjusted_cdf = jnp.where(
        unique_mask,
        jnp.cumsum(unique_mask / counts_per_row, axis=1),
        jnp.nan
    )

    sorted_cdf = justify(adjusted_cdf, side='left')
    sorted_unique = justify(unique_vals, side='left')

    # Compute the density (difference of CDF)
    density = jnp.diff(jnp.concatenate([jnp.zeros_like(sorted_cdf[:, :1]), sorted_cdf], axis=1), axis=1)    

    return sorted_cdf, density, sorted_unique
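
For reference, the vectorized function above can be compared against a single-row version. The sketch below (the helper name `unique_ecdf_row` is an assumption) assigns equal mass $1/k$ to each of the $k$ unique values in a row, which is what the cumulative sum over the unique-value mask does; repeated values do not receive extra weight. It assumes strictly positive, finite inputs.

```python
import numpy as np

def unique_ecdf_row(row):
    """Single-row reference (illustrative): equal-weight ECDF over unique values."""
    vals = np.unique(row[~np.isnan(row)])           # sorted unique values
    cdf = np.arange(1, vals.size + 1) / vals.size   # steps of 1/k
    density = np.diff(cdf, prepend=0.0)             # mass assigned to each unique value
    return cdf, density, vals

row = np.array([3.0, 1.0, 3.0, 2.0, 1.0])
cdf, density, vals = unique_ecdf_row(row)
print(vals)
print(cdf)
print(density)
```

Here the three unique values each receive a mass of one third, matching the left-justified, NaN-padded row that the vectorized version would return for the same input.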

In [49]:
def compute_hist_and_log_edges(x, n_bins=64):
    min_log_val = np.log10(np.nanmin(x))
    max_log_val = np.log10(np.nanmax(x))

    # set the bin edges to be evenly spaced between the
    # observed range of the proxy/donor series
    # np.digitize will assign 0 for out-of-range values at left
    # and n_bins + 1 for out-of-range values at right
    log_bin_edges = np.linspace(
        min_log_val,
        max_log_val,
        n_bins + 1,
    ).flatten()
    bin_edges = [10**e for e in log_bin_edges]
    # hist, edges = np.histogram(x, bins=bin_edges) 
    log_counts, _ = np.histogram(x, bins=bin_edges, density=False)
    freqs = log_counts / sum(log_counts)
    # hist = hist / np.nansum(hist)
    assert abs(sum(freqs) - 1) < 0.0001
    return freqs, bin_edges, n_bins
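
The log-spaced binning can also be written with `np.geomspace`, which returns edges that are evenly spaced in log space and pins the first and last edge to the supplied endpoints. This is a minimal sketch on synthetic flows (an assumption), not a replacement for the function above.

```python
import numpy as np

rng = np.random.default_rng(1)
flows = rng.lognormal(mean=2.0, sigma=1.0, size=5000)  # synthetic flows (assumption)

n_bins = 64
# n_bins + 1 edges, evenly spaced in log space over the observed range
edges = np.geomspace(flows.min(), flows.max(), n_bins + 1)
counts, _ = np.histogram(flows, bins=edges)
freqs = counts / counts.sum()

print(len(edges), counts.sum(), freqs.sum())
```

Because the outer edges equal the observed minimum and maximum, every sample falls inside a bin and the frequencies sum to one.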
    

In [None]:
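def kde_test(x1, x2, sim_kde):
    # Stub completed as an assumption: histogram both series on log-spaced bins
    # (sim_kde is unused; only the observed bin edges are returned).
    hist_obs, edges, _ = compute_hist_and_log_edges(x1)
    hist_sim, _, _ = compute_hist_and_log_edges(x2)
    return hist_obs, hist_sim, edges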

In [50]:
from bokeh.io import reset_output

def plot_and_save_cdf_kde_histogram(x1, x2, ci, sim_kde, output_path="plot.png"):
    """
    Plot and save histograms of the observed and simulated series with their KDE fits using Bokeh.

    Args:
        x1: 1D array of observed values.
        x2: 1D array of simulated values.
        ci: KL divergence percentiles ordered as (2.5th, median, 97.5th).
        sim_kde: KDE fitted to the log10-transformed simulated values.
        output_path: Path to save the PNG output.
    """
    hist_obs, edges_obs, n_bins = compute_hist_and_log_edges(x1)
    hist_sim, edges_sim, n_bins = compute_hist_and_log_edges(x2)

    # Create the figure
    p = figure(width=800, height=600,
        title=f"KL divergence (median: {ci[1]:.3f}, 95% CI: {ci[0]:.3f}-{ci[2]:.3f})",
        x_axis_label="X", y_axis_label="Density / Probability",
              x_axis_type='log')

    # Add histogram (muted grey with alpha)
    p.quad(top=hist_obs, bottom=0, left=edges_obs[:-1], right=edges_obs[1:],
           fill_color="grey", fill_alpha=0.4, line_color=None, legend_label="Observed")
    p.quad(top=hist_sim, bottom=0, left=edges_sim[:-1], right=edges_sim[1:],
           fill_color="crimson", fill_alpha=0.4, line_color=None, legend_label="Simulated")

    # Compute the KDE-based CDF at the evaluation points    
    kde_obs_fit = jkde(np.log10(x1))
    
    # x_obs_range = np.percentile(x1, np.linspace(0, 100, n_bins))
    bin_midpoints = (np.array(edges_obs[:-1]) + np.array(edges_obs[1:])) / 2
    kde_obs_cdf = np.array([kde_obs_fit.integrate_box_1d(-np.inf, np.log10(xi)) for xi in edges_obs])
    # kde_obs = np.diff(np.concatenate(([0], kde_obs_cdf)))
    kde_obs = np.diff(kde_obs_cdf)
    # kde_obs = kde_obs_fit.evaluate(np.log10(x_obs_range))
    kde_sim_cdf = np.array([sim_kde.integrate_box_1d(-np.inf, np.log10(xi)) for xi in edges_obs])#vmap_kde(np.log10(x_obs_range))
    # kde_sim = np.diff(np.concatenate(([0], kde_sim_cdf)))
    kde_sim = np.diff(kde_sim_cdf)

    # kde_obs = kde_obs / np.nansum(kde_obs)
    # kde_sim = kde_sim / np.nansum(kde_sim)

    print(f'kde  obs={np.nansum(kde_obs):.3f} sim={np.nansum(kde_sim):.3f}')

    # Add KDE lines series 
    p.line(bin_midpoints, kde_obs, line_width=3, line_color="black", line_dash='solid', legend_label="KDE (obs)",)
    p.line(bin_midpoints, kde_sim, line_width=3, line_color="black", line_dash='dashed', legend_label="KDE (sim)",)

    # Add empirical CDF (orange line)
    # p.line(x, cdf, line_width=2, line_color="orange", legend_label="Empirical CDF",)

    # Style the legend
    p.legend.location = "top_right"
    # p.legend.title = "Legend"
    p = dpf.format_fig_fonts(p)

    # Save as PNG
    try:
        export_png(p, filename=output_path)
    finally:
        # Clean up Bokeh output state
        reset_output()

    print(f"Plot saved to {output_path}")
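
The KDE curves in the plot are bin masses obtained by differencing the KDE's CDF at the bin edges. The sketch below illustrates that step with SciPy's `gaussian_kde.integrate_box_1d` on synthetic log-transformed data (an assumption); the edge range is deliberately padded so that nearly all of the KDE mass is captured.

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(2)
log_q = rng.normal(loc=1.0, scale=0.5, size=3000)  # synthetic log10 flows (assumption)
kde = gaussian_kde(log_q)

# Pad the range well beyond the data so the tails are included
log_edges = np.linspace(log_q.min() - 2.0, log_q.max() + 2.0, 65)
cdf_at_edges = np.array([kde.integrate_box_1d(-np.inf, e) for e in log_edges])
bin_mass = np.diff(cdf_at_edges)  # probability mass per bin

print(bin_mass.sum())
```

When the edges span only the observed range of one series, as in the plotting function, the summed mass falls below one by the amount of KDE probability lying outside that range, which is what the printed diagnostic reports.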

In [51]:
def kl_divergence_convergence(proxy, target, x1, x2, n_resamples=100, epsilon=1e-10, bw_method='scott', existing_bw=None, min_q_allowed=1e-7):
    """
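    Vectorized estimation of KL divergence convergence using bootstrap resampling.

    Parameters:
    - proxy, target: Station IDs, used to name the saved diagnostic plot.
    - x1, x2: Arrays of observed (P) and simulated (Q) values.
    - n_resamples: Number of bootstrap resamples.
    - epsilon: Small constant (currently unused in this function).
    - bw_method: Bandwidth method or scalar passed to the KDE.
    - existing_bw: Previously adjusted bandwidth, if any.
    - min_q_allowed: Smallest KDE density tolerated before the bandwidth is widened.

    Returns:
    - kl_estimates: Array of KL divergence estimates.
    - ci: 2.5th, 50th and 97.5th percentiles of the bootstrap estimates.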
    """
    n = len(x1)
    print(f'N={n} samples')
    # Empirical CDF for each resample
    cdf1 = np.arange(1, n + 1) / n
    
    P_density = np.diff(np.tile(cdf1, (n_resamples, 1)), axis=1)
    P_density = np.pad(P_density, ((0, 0), (0, 1)), mode="edge")

    # Precompute bootstrap indices for x1 and x2
    # resample_size = min(n, 365)
    resample_size=n
    x1_resampled = np.random.choice(x1, size=(n_resamples, resample_size), replace=True)

    # Sort the resampled data for empirical CDF calculation
    sorted_x1_vals_raw = np.sort(x1_resampled, axis=1)

    # Compute adjusted CDFs, densities, and unique x1 values. vectorize and compute over rows
    adjusted_cdfs, adjusted_densities, sorted_x1_vals = compute_adjusted_cdf_and_density(x1_resampled)

    # now fit the KDE to X2 (simulated values)
    log_x2 = jnp.log10(x2)
    sim_kde_jax = jkde(log_x2, bw_method=bw_method)
    # evaluate the KDE fit at all points in X1
    vmap_kde = jax.vmap(sim_kde_jax.evaluate)
    
    # evaluate on the observed values
    log_x1_sorted = jnp.log10(sorted_x1_vals)
    kde_density = vmap_kde(log_x1_sorted)
    
    min_kde_density = jnp.nanmin(jnp.nanmin(kde_density, axis=1))

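    if min_kde_density < min_q_allowed:
        # Q(x_i) fell below min_q_allowed: widen the bandwidth slightly and retry,
        # carrying the caller's settings through the recursive call
        smoothing_factor = 1.01
        adj_bw = adjust_kde_bandwidth(x2, factor=smoothing_factor, existing_bw=existing_bw)
        print(f'KDE returned P(x_i) = 0, adjusting bandwidth by a factor of {smoothing_factor}')
        return kl_divergence_convergence(
            proxy, target, x1, x2, n_resamples=n_resamples, epsilon=epsilon,
            bw_method=adj_bw, existing_bw=adj_bw, min_q_allowed=min_q_allowed,
        )
    else:
        print(f'min Q(x) = {min_kde_density}')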

    # prepend a zero so the diff works out
    zeros_to_prepend = jnp.zeros((sorted_x1_vals.shape[0], 1))
    sorted_x1_vals_with_zero = jnp.concatenate([zeros_to_prepend, sorted_x1_vals], axis=1)
    log_x1_with_zero = jnp.concatenate([zeros_to_prepend, log_x1_sorted], axis=1)
    
    # Compute the differences
    x1_diffs = jnp.diff(sorted_x1_vals_with_zero, axis=1)
    log_x1_diffs = jnp.diff(log_x1_with_zero, axis=1)

    # Assert no values are zero
    assert not jnp.any(x1_diffs == 0), "The array contains zero values!"
   
    # divide each density point by the width delta x
    assert_alignment(adjusted_densities, x1_diffs)
    P_dd = adjusted_densities / x1_diffs # delta_x1i
    
    # assert that where kde_density is valid that x1_diffs is also valid
    assert_alignment(kde_density, x1_diffs)

    Q_dd = kde_density / log_x1_diffs # delta_x1i
    
    # Compute KL divergence for all resamples
    kl_estimates = compute_kl_estimates(P_dd, Q_dd)

    # Compute median and confidence intervals
    ci = np.percentile(kl_estimates, [2.5, 50, 97.5])
    output_path = f'KDE_fit_plots/sim_{target}_using_{proxy}.png'
    plot_and_save_cdf_kde_histogram(x1, x2, ci, sim_kde_jax, output_path)
    return kl_estimates, ci
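
The interval reported above is a bootstrap percentile interval: resample with replacement, recompute the statistic on each resample, and take the 2.5th, 50th and 97.5th percentiles. A minimal sketch of that pattern follows, using the sample median as a placeholder statistic and synthetic data (both assumptions) in place of the KL divergence estimate.

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.lognormal(mean=0.0, sigma=1.0, size=1000)  # synthetic sample (assumption)

n_resamples = 200
# Each row is one bootstrap resample drawn with replacement
resamples = rng.choice(x, size=(n_resamples, x.size), replace=True)
stat = np.median(resamples, axis=1)                # placeholder statistic per resample
lb, med, ub = np.percentile(stat, [2.5, 50, 97.5])

print(f'{med:.3f} ({lb:.3f}-{ub:.3f})')
```

Drawing all resamples as rows of one 2D array is what allows the statistic to be computed in a single vectorized pass.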

In [52]:
def process_batch(inputs):    
    proxy, target, min_concurrent_years = inputs
    
    proxy_id, target_id = proxy['official_id'], target['official_id']

    # create a result dict object for tracking results of the batch comparison
    result = {
        "proxy": proxy_id,
        "target": target_id,
        "min_concurrent_years": min_concurrent_years,
    }
    station_info = {"proxy": proxy, "target": target}

    # check if the polygons are nested
    result["nested_catchments"] = dpf.check_if_nested(proxy, target)

    # for stn in pair:
    proxy = dpf.Station(station_info["proxy"])
    target = dpf.Station(station_info["target"])
    target.ln_pdf_label = f'{target.id}_sim_lognorm_pdf'
    target.ln_cdf_label = f'{target.id}_sim_lognorm_cdf'
    target.expon_pdf_label = f'{target.id}_sim_expon_pdf'
    target.expon_cdf_label = f'{target.id}_sim_expon_cdf'

    # compute spatial distance
    p1, p2 = (
        station_info["proxy"]["geometry"].centroid,
        station_info["target"]["geometry"].centroid,
    )
    # compute the distance between catchment centroids (km)
    centroid_distance = p1.distance(p2) / 1000
    result["centroid_distance"] = round(centroid_distance, 2)
    if centroid_distance > 1000:
        return None

    if np.isnan(target.drainage_area_km2):
        raise ValueError(f"No drainage area for {target_id}")
    if np.isnan(proxy.drainage_area_km2):
        raise ValueError(f"No drainage area for {proxy_id}")

    # Retrieve the data for both stations
    # this is all data, including non-concurrent
    adf = dpf.retrieve_nonconcurrent_data(proxy_id, target_id)

    assert not adf.empty, "No data returned."

    for stn in [proxy, target]:
        adf = dpf.transform_and_jitter(adf, stn)

    # simulate flow at the target based on equal unit area runoff scaling
    adf[target.sim_label] = adf[proxy.id] * (
        target.drainage_area_km2 / proxy.drainage_area_km2
    )

    # filter for the concurrent data
    df = adf.copy().dropna(subset=[proxy_id, target_id], how="any")
    result["num_concurrent_obs"] = len(df)
    
    if df.empty:
        num_complete_concurrent_years = 0
    else:
        df.reset_index(inplace=True)
        num_complete_concurrent_years = dpf.count_complete_years(df, 'time', proxy_id)
        
    # count all (including non-concurrent) observations at each station
    counts = adf.count(axis=0)
    proxy.n_obs, target.n_obs = counts[proxy_id], counts[target_id]

    # skip pairs with an empty record before dividing by the counts
    if (proxy.n_obs == 0) or (target.n_obs == 0):
        print("   Zero observations.  Skipping.")
        return None

    result["proxy_n_obs"] = proxy.n_obs
    result["target_n_obs"] = target.n_obs
    result["proxy_frac_concurrent"] = len(df) / proxy.n_obs
    result["target_frac_concurrent"] = len(df) / target.n_obs

    # process the PMFs and divergences for concurrent data 
    # using a range of uniform priors via pseudo counts    
    if num_complete_concurrent_years > min_concurrent_years:
        # df is concurrent data, so the results
        # are updating concurrent data here    
        df.dropna(subset=[proxy.id, target.id], inplace=True)
        # x1 is the observed, x2 is the simulated
        x1, x2 = df[target.id].values, df[f'{target.id}_sim'].values
        # fit kde to proxy values
        _, dkl_interval  = kl_divergence_convergence(proxy.id, target.id, x1, x2)
        result['kld_lb'] = dkl_interval[0]
        result['kld_median'] = dkl_interval[1]
        result['kld_ub'] = dkl_interval[2]
    
    return result
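
The simulation step in `process_batch` is equal unit area runoff scaling: the proxy flow is multiplied by the ratio of target to proxy drainage area, and only time steps observed at both stations are kept for the comparison. The sketch below shows this on a toy table; the station IDs, areas, and flows are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical station IDs, drainage areas, and flows for illustration only
proxy_id, target_id = 'proxy_stn', 'target_stn'
area_km2 = {proxy_id: 200.0, target_id: 50.0}

adf = pd.DataFrame({
    proxy_id: [10.0, 12.0, np.nan, 8.0],
    target_id: [2.0, np.nan, 3.0, 2.5],
})

# Simulated target flow = proxy flow scaled by the drainage area ratio
sim_label = f'{target_id}_sim'
adf[sim_label] = adf[proxy_id] * (area_km2[target_id] / area_km2[proxy_id])

# Keep only rows where both stations have observations (concurrent record)
df = adf.dropna(subset=[proxy_id, target_id], how='any')
print(df)
```

Only the first and last rows survive the concurrency filter, so the divergence would be computed from two observed and simulated pairs in this toy case.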

In [53]:
from jax.scipy.stats import norm

# Evaluate a Gaussian KDE on grid points for each sample set
def kde_per_row(row, bw, grid_points):
    # Compute Gaussian kernel values for each grid point
    kernels = norm.pdf((grid_points[:, None] - row) / bw)
    return jnp.mean(kernels, axis=1) / bw

def kde_fit_per_row(samples, bandwidth, grid_points):
    # Vectorized computation across all rows (one bandwidth per row)
    return jax.vmap(kde_per_row, in_axes=(0, 0, None))(samples, bandwidth, grid_points)

# the 'process' variable is here so jupyter doesn't go computing 
# a million rows per iteration when the book is built for pushing to github pages.
process = True
partial_counts = False
if process: 
    
    print(f'Processing empirical CDF pairs (partial counts={partial_counts})')
    results_fname = f'KL_kde_fits_{revision_date}.csv'

    out_fpath = os.path.join('data/', 'nonparametric_divergence_test', results_fname)

    n_batches = max(len(id_pairs) // batch_size, 1)
    batches = np.array_split(np.array(id_pairs, dtype=object), n_batches)
    n_pairs = len(id_pairs)
    print(f"    Processing {n_pairs} pairs in {n_batches} batches.")
    batch_no = 1
    batch_files = []
    t0 = time()
    # error_df = error_model_df[error_model_df['bitrate'] == bitrate].copy()
    for batch_ids in batches[:1]:
        print(f'Starting batch {batch_no}/{len(batches)} processing.')
        batch_fname = results_fname.replace('.csv', f'_batch_{batch_no:04d}.csv')
        batch_output_fpath = os.path.join(temp_dir, batch_fname)
        print(batch_output_fpath)
        if os.path.exists(batch_output_fpath):
            # batch already processed: reuse the saved file and move on
            batch_files.append(batch_output_fpath)
            batch_no += 1
            continue
        
        # define the input array for multiprocessing
        inputs = input_batch_generator(bcub_gdf, batch_ids, min_years, partial_counts)

        print(len(inputs))

        # with mp.Pool(20) as pool:
        #     results = pool.map(process_batch, inputs)
        #     results = [r for r in results if r is not None]
        results = []
        for inp in inputs[:10]:
            res = process_batch(inp)
            if res is not None:
                results.append(res)
            # results = [r for r in res if r is not None]

        batch_result = pd.DataFrame(results)
        print(batch_result)
        print(asdf)  # debugging stop: 'asdf' is undefined, so this raises NameError
        if batch_result.empty:
            print('Empty batch.  Skipping')
        else:
            batch_result.to_csv(batch_output_fpath, index=False)
            print(f"    Saved {len(batch_result)} new results to file.")
        
        batch_files.append(batch_output_fpath)
        t2 = time()
        print(f'    Processed {len(batch_ids)} pairs in {t2 - t0:.1f} seconds')
        batch_no += 1
        
    print(f'    Concatenating {len(batch_files)} batch files.')
    if len(batch_files) > 0:
        all_results = pd.concat([pd.read_csv(f, engine='pyarrow') for f in batch_files], axis=0)
        all_results.to_csv(out_fpath, index=False)
        if os.path.exists(out_fpath):
            for f in batch_files:
                os.remove(f)
        print(f'    Wrote {len(all_results)} results to {out_fpath}')
    else:
        print('    No new results to write to file.')

Processing empirical CDF pairs (partial counts=False)
    Processing 877150 pairs in 175 batches.
Starting batch 1/175 processing.
/home/danbot2/code_5820/24/divergence_measures/docs/notebooks/data/temp/KL_kde_fits_20241125_batch_0001.csv
5013
N=5557 samples
min Q(x) = 0.0557931549847126
Plot saved to KDE_fit_plots/sim_08LF051_using_05CA001.png
N=20241 samples
    KDE bandwidth adjusted from 0.677 to 0.683
KDE returned P(x_i) = 0, adjusting bandwidth by a factor of 1.01
N=20241 samples
min Q(x) = 1.648605518766999e-07
Plot saved to KDE_fit_plots/sim_12205000_using_08LE031.png
N=1138 samples
min Q(x) = 0.05337715148925781
Plot saved to KDE_fit_plots/sim_12458000_using_08MH004.png
N=1921 samples
    KDE bandwidth adjusted from 0.178 to 0.180
KDE returned P(x_i) = 0, adjusting bandwidth by a factor of 1.01
N=1921 samples
    KDE bandwidth adjusted from 0.180 to 0.182
KDE returned P(x_i) = 0, adjusting bandwidth by a factor of 1.01
N=1921 samples
    KDE bandwidth adjusted from 0.182 to 0.

NameError: name 'asdf' is not defined

In [None]:
result_df = pd.read_csv('data/parametric_divergence_test/KL_parametric_fits_4bits_20241125.csv')
# print(len(result_df))
foo = result_df.copy()
foo.dropna(subset=['kld_lb'], inplace=True)
foo['kld_ci_range'] = foo['kld_ub'] - foo['kld_lb']
# print(foo['kld_ci_range'].max())
# foo.head()

In [None]:
foo = foo.sort_values('num_concurrent_obs').reset_index(drop=True)
bin_size = int(len(foo) / 50)
print(f'{bin_size} samples per bin')
foo['bin'] = foo.index // bin_size
foo['n_years'] = foo['num_concurrent_obs'] / 365.24

bin_data = []
for i in sorted(list(set(foo['bin'].values))):
    data = foo[foo['bin'] == i].copy()
    vals = data['kld_ci_range'].values
    sample_size = data['n_years'].values
    bin_data.append([np.median(sample_size)] + np.percentile(vals, [2.5, 25, 50, 75, 97.5]).tolist())
    
bdf = pd.DataFrame(bin_data, columns=['N', 'lb', '25', 'median', '75', 'ub'])
# grouped['bin_center'] = foo.groupby('bin')['num_concurrent_obs'].median().values

p = figure(width=600, height=400, x_axis_type='log')
p.varea(x=bdf['N'], y1=bdf['lb'], y2=bdf['ub'], legend_label='95% CI', fill_alpha=0.4, color='grey')
p.varea(x=bdf['N'], y1=bdf['25'], y2=bdf['75'], legend_label='IQR', fill_alpha=0.4, color='black')
p.line(bdf['N'], y=bdf['median'], legend_label='Median', line_width=2, color='red', line_dash='solid')
p.xaxis.axis_label = "Concurrent Record Length [years]"
p.yaxis.axis_label = r"$$95\% \text{ CI } \hat D_\text{KL}(P||Q)$$"
p = dpf.format_fig_fonts(p, font_size=16)
show(p)

The plot above shows how the width of the bootstrapped 95% confidence interval of the KL divergence estimate decreases as the concurrent record length increases.  The series are computed over 50 record-length intervals of equal sample size, approximately 6000 samples per interval.
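
The equal-sample-size binning behind the plot can be sketched with a sort followed by integer division of the row index. The example below uses synthetic values (an assumption) and reuses the column names `n_years` and `kld_ci_range` only to mirror the cell above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(4)
n_rows, n_bins = 1000, 50
demo = pd.DataFrame({
    'n_years': rng.uniform(1, 60, n_rows),
    'kld_ci_range': rng.gamma(2.0, 0.1, n_rows),
})  # synthetic values (assumption)

# Sort by record length, then assign consecutive rows to equal-count bins
demo = demo.sort_values('n_years').reset_index(drop=True)
demo['bin'] = demo.index // (n_rows // n_bins)

grouped = demo.groupby('bin')['kld_ci_range']
summary = pd.DataFrame({
    'N': demo.groupby('bin')['n_years'].median(),
    'lb': grouped.quantile(0.025),
    'median': grouped.median(),
    'ub': grouped.quantile(0.975),
})
print(summary.head())
```

Equal-count bins keep the percentile estimates in each interval comparably stable, at the cost of intervals that are unevenly spaced in record length.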

## Citations

```{bibliography}
:filter: docname in docnames
```