# Predict Kullback-Leibler (KL) Divergence

## Introduction

Streamflow prediction in ungauged basins (PUB) has classically focused on minimizing one or more loss functions, typically squared error, Nash-Sutcliffe efficiency (NSE), or Kling-Gupta efficiency (KGE), between observed daily streamflow values and predictions generated by some type of model, whether an empirical regionalization, a statistical machine learning model, or a process-based rainfall-runoff model.

Regionalization and machine learning models for PUB rely on an existing streamflow monitoring network, and the performance of these models is tied to how well the network represents the ungauged space.  This connection leads to the question of how the arrangement of streamflow monitoring stations within the network affects the overall performance of PUB models, particularly in terms of expected prediction error across all ungauged locations. Furthermore, are there environmental signals, orthogonal to streamflow, that contain enough information to differentiate between network arrangements such that the prediction error over ungauged areas can be minimized?

A simple interpretation of the loss functions commonly used in the PUB literature might be "how close are mean daily streamflow predictions to observed values?"  A much simpler question to ask of observational data is "will a given model outperform random guessing in the long run?"  This binary question represents a starting point for approaching the optimal streamflow monitoring network problem.  The justification for asking such a basic question is that an expectation of the uncertainty reduction over the unmonitored space provided by a given monitoring arrangement supports a discriminant function to compare unique arrangements.  A simple question can be formulated and tested on real data, in this case an ungauged space of over 1 million ungauged catchments and a set of over 1600 monitored catchments with which to train a model.

The binary prediction problem is followed by a regression problem where the goal is to minimize the expected prediction error based on the **Kullback-Leibler divergence** $D_{KL}$, a surrogate loss function from the class of *information discriminant* measures that is consistent with the exponential loss {cite}`nguyen2009surrogate`.


## Data Preview

As a first step, it's important to understand how the sample is distributed over the target variable that we're trying to predict.  The distribution reflects what we're asking the model to tell us, and it helps set our expectations for interpreting the model loss.  For example, if we know that our target variable is bounded in (0, 10] and our predictive model produces a mean absolute error or RMSE of about 1, is this a good result?  It depends on 1) whether the application can accept such an error, and 2) the distribution of the target.  If the target variable is heavily skewed and the median value is 1, then for half of the samples an error of 1 amounts to 100% or more of the target value, which doesn't sound very good.
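
As a rough illustration of the example above, the sketch below draws a synthetic, right-skewed target bounded in (0, 10] with a median near 1 and checks how often a fixed absolute error of 1 equals or exceeds the target value itself. The log-normal shape, the sample size, and all variable names are assumptions made for illustration; this is not the notebook's data.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic skewed target (assumption): log-normal with median exp(0) = 1,
# clipped so that it stays within the bounded range (0, 10].
target = np.clip(rng.lognormal(mean=0.0, sigma=1.0, size=10_000), 1e-6, 10.0)

abs_error = 1.0                      # a fixed absolute error, e.g. an MAE of 1
relative_error = abs_error / target  # the same error expressed relative to the target

# Share of samples where the error is at least as large as the target itself
share_over_100pct = float(np.mean(relative_error >= 1.0))
print(f"median target = {np.median(target):.2f}")
print(f"share of samples with relative error >= 100%: {share_over_100pct:.2f}")
```

Because the median of the synthetic target is close to 1, roughly half of the samples carry a relative error of 100% or more, even though the absolute error looks small against the upper bound of 10.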

For the target variable tested in this notebook, we also test two key assumptions: i) the quantization (referred to here as bitrate), which reflects the precision used to represent a "continuous" signal as a discrete set of states represented by $2^b$ symbols, and ii) the prior we must apply to the simulated distribution $Q(X)$ where models are underspecified.  A prior must be assumed to handle the (common) case where a state is observed *a posteriori* with $p_i > 0$ but the corresponding model predicts $q_i = 0$.  Since KL divergence is a function of $\log \frac{P}{Q}$, we can't have $q_i=0$ in the denominator.
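
The two assumptions can be illustrated with a minimal sketch: quantize two synthetic signals into $2^b$ states, then compare $D_{KL}$ with and without a uniform pseudo-count prior. The signals, the bin range, the helper names (`quantize_counts`, `kl_bits`), and the choice of one pseudo-count spread over all states are assumptions for illustration, not the processing used to produce the result files.

```python
import numpy as np

def quantize_counts(values, bin_edges):
    """Count observations per state for a fixed set of bin edges."""
    counts, _ = np.histogram(values, bins=bin_edges)
    return counts.astype(float)

def kl_bits(p, q):
    """D_KL(P||Q) in bits; states with p_i = 0 contribute nothing."""
    mask = p > 0
    with np.errstate(divide="ignore"):
        return float(np.sum(p[mask] * np.log2(p[mask] / q[mask])))

b = 4                                    # bitrate: 2^b = 16 states
edges = np.linspace(0.0, 10.0, 2**b + 1)

rng = np.random.default_rng(0)
obs = rng.uniform(0.0, 10.0, 2000)       # synthetic "observed" signal, covers all states
sim = rng.uniform(0.0, 5.0, 2000)        # synthetic "model", covers only the lower half

p_counts = quantize_counts(obs, edges)
q_counts = quantize_counts(sim, edges)
p = p_counts / p_counts.sum()
q = q_counts / q_counts.sum()

# Without a prior, any state with p_i > 0 and q_i = 0 makes the divergence infinite.
dkl_no_prior = kl_bits(p, q)

# A uniform pseudo-count prior (one count spread over all states) keeps it finite.
pseudo = np.full(2**b, 1.0 / 2**b)
r = (q_counts + pseudo) / (q_counts + pseudo).sum()
dkl_with_prior = kl_bits(p, r)

print(dkl_no_prior, dkl_with_prior)
```

The finite value depends directly on the size of the prior, which is why the sensitivity to the prior is examined in the sections that follow.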

First we preload the results to view the distribution.


## Data Import and Model Setup

In [None]:
import os
import pandas as pd
import numpy as np
from time import time

import geopandas as gpd
from shapely.geometry import Point
import xyzservices.providers as xyz

from bokeh.plotting import figure, show
from bokeh.layouts import gridplot, row, column
from bokeh.transform import factor_cmap, linear_cmap
from bokeh.models import ColumnDataSource, LinearAxis, Range1d
from bokeh.io import output_notebook
from bokeh.palettes import Sunset10, Vibrant7, Category20, Bokeh6, Bokeh7, Bokeh8, Greys256

import xgboost as xgb
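# config_context is a context manager and has no effect unless used in a
# `with` block; set_config applies the setting globally
xgb.set_config(verbosity=2)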

from sklearn.cluster import AgglomerativeClustering

from sklearn.metrics import (
    root_mean_squared_error,
    mean_absolute_error,
    roc_auc_score,
    roc_curve, auc,
    accuracy_score,
    confusion_matrix,
)

import data_processing_functions as dpf

from scipy.stats import linregress
from scipy.stats import lognorm
from scipy.special import kl_div
# from sklearn.model_selection import StratifiedKFold
output_notebook()


In [None]:
BASE_DIR = os.getcwd()
tiles = xyz['USGS']['USTopo']

In [None]:
# load the catchment characteristics
fname = 'BCUB_watershed_attributes_updated.csv'
attr_df = pd.read_csv(os.path.join('data', fname))
attr_df.columns = [c.lower() for c in attr_df.columns]
station_ids = attr_df['official_id'].values
print(f'There are {len(station_ids)} monitored basins in the attribute set.')

In [None]:
# open an example pairwise results file
input_folder = os.path.join(
    BASE_DIR, "data", "processed_divergence_inputs",
)
pairs_files = os.listdir(input_folder)

test_df = pd.read_csv(os.path.join(input_folder, pairs_files[0]), nrows=1000)

### Pre-load the results to avoid repeat loads in the model training iterations

In [None]:

# the input data file has an associated revision date
# (earlier revision: '20240812')
revision_date = '20241016'

result_dict = {}
nrows = None

bitrates = [4, 6, 8, 10]
for bitrate in bitrates:
    if bitrate in [5, 7]:
        continue
    print(f'bitrate = {bitrate}')
    # two result files are loaded per bitrate: "whole counts", and
    # "partial counts" where observation counts incorporate a 10% uniform uncertainty
    whole_counts_fname = f"KL_results_{bitrate}bits_{revision_date}.csv"
    partial_counts_fname = f"KL_results_{bitrate}bits_{revision_date}_partial_counts.csv"

    wc_input_data_fpath = os.path.join(input_folder, whole_counts_fname)
    pc_input_data_fpath = os.path.join(input_folder, partial_counts_fname)
        
    df_partial = pd.read_csv(pc_input_data_fpath, nrows=nrows, low_memory=False)
    df_whole = pd.read_csv(wc_input_data_fpath, nrows=nrows, low_memory=False)
    
    result_dict[bitrate] = {'partial': df_partial, 'whole': df_whole}

In [None]:
order_dict = {}
percentiles = np.linspace(0, 100, 1000)
distribution_plots = []
for b, set_dict in result_dict.items():
    if b in [7, 9, 11]:
        continue
    fig_kwargs = dict(title=f"{b} bits quantization", width=600, height=450)
    if len(distribution_plots) > 0:
        # share axis ranges with the previous panel so the plots pan and zoom together
        fig_kwargs.update(
            x_range=distribution_plots[-1].x_range,
            y_range=distribution_plots[-1].y_range,
        )
    dfig = figure(**fig_kwargs)
    
    partial_counts = set_dict['partial']
    whole_counts = set_dict['whole']

    concurrent_prior_cols = [c for c in whole_counts if c.startswith('dkl_concurrent_post')]
    nonconcurrent_prior_cols = [c for c in whole_counts if c.startswith('dkl_nonconcurrent_post')]
    uniform_col = 'dkl_concurrent_uniform'
    assert uniform_col in partial_counts.columns
    
    n = 0
    for c in concurrent_prior_cols:
        prior = float(c.split('_')[-1].split('R')[0])
        # the distributions converge when the prior is 10^5
        if prior > 5:
            continue
        # compute empirical cdf of "Actual" Target
        # values = partial_counts[c].copy().dropna()
        values = whole_counts[c].copy().dropna()
        minv, maxv = min(values), max(values)
        # print(f'{b}bits, 10^{prior} prior min dkl: {minv:.1e}, max dkl: {maxv:.1F}')
        sample_vals = np.percentile(values, percentiles)
        # the CDF value at each sampled quantile is its percentile level
        cdf_values = percentiles / 100
        dfig.line(sample_vals, cdf_values, color=Category20[17][n], 
                  line_width=2.5, legend_label=f'10^{prior}')
        n += 1

    values = whole_counts[uniform_col].copy().dropna()
    sample_percentiles = np.percentile(values, percentiles)
    cdf_values = percentiles / 100
    dfig.line(sample_percentiles, cdf_values, color='red', 
              line_width=3, legend_label='Q(X)=U', line_dash='dashed')
        
    dfig.legend.location ='bottom_right'
    dfig.xaxis.axis_label = r'$$D_{\text{KL}} [\text{bits}/\text{sample}]$$'
    dfig.yaxis.axis_label = r'$$\text{Pr}(x \leq X)$$'
    dfig.legend.ncols = 1
    dfig.legend.click_policy = 'hide'
    dfig.add_layout(dfig.legend[0], 'right')
    distribution_plots.append(dfig)


In [None]:
layout = gridplot(distribution_plots, ncols=3, width=400, height=500)
show(layout)

### Information Loss due to the Prior

Adding pseudo-counts to the simulated distribution $Q(X)$ shifts it to a posterior distribution $R(X)$.

The information loss from adding pseudo-counts is then the KL divergence between the empirical distribution $Q(X)$ and the posterior $R(X)$:
$$D_\text{KL}(Q||R) = \sum_{i=1}^k Q(x_i) \log_2 \left( \frac{Q(x_i)}{R(x_i)} \right)$$

Where:
* $q_i = Q(x_i)$ is the empirical frequency of state $i$,
* $k$ is the dictionary size ($k=2^b$), and
* $c_i$ is the pseudo-count added for state $i$.

After adding a prior distribution $\alpha(x_i)=\{ c_1, c_2, \dots, c_k\}$, the unnormalized posterior is $q_i + c_i$, which we **normalize** to obtain $R(x_i) = r_i$:

$$R(x_i) = \frac{q_i + c_i}{\sum_{j=1}^k(q_j+c_j)}$$
$$\quad = \frac{q_i + c_i}{\sum_{j=1}^kq_j+ \sum_{j=1}^k c_j}$$

We know $\sum_{j=1}^k q_j = 1$, and we let $\sum_{j=1}^k c_j = C$, so $$R(x_i) = \frac{q_i + c_i}{1+ C}$$


The ratio $\frac{q_i}{r_i}$ becomes:

$$\frac{q_i}{r_i} = \frac{q_i}{\frac{q_i + c_i}{1 +C}} =  \frac{(1 + C) \cdot q_i}{q_i + c_i}$$

Substituting $\frac{q_i}{r_i}$ into $D_\text{KL}(Q||R)$ gives:

$$D_\text{KL}(Q||R) = \sum_{i=1}^k q_i \log_2 \left( (1+C) \cdot \frac{q_i}{q_i + c_i} \right)$$

$$=\sum_{i=1}^k q_i \left[ \log_2(1+C) + \log_2 \left( \frac{q_i}{q_i + c_i} \right) \right]$$
$$= \log_2(1+C) \sum_{i=1}^k q_i + \sum_{i=1}^k q_i \log_2 \left( \frac{q_i}{q_i + c_i} \right)$$

Since again $\sum_{i=1}^k q_i = 1$, the expression for the information loss due to the prior simplifies to:

$$D_\text{KL}(Q||R) = \log_2(1+C) + \sum_{i=1}^k q_i \log_2 \left( \frac{q_i}{q_i + c_i} \right)$$

If the prior is itself a normalized distribution, that is $\sum_{j=1}^k c_j = 1$ (so $C = 1$), then $R(x_i)$ becomes:

$$R(x_i) = \frac{q_i + c_i}{\sum_{j=1}^k q_j + \sum_{j=1}^k c_j} = \frac{q_i + c_i}{2}$$

And the KL divergence between the empirical $Q(x)$ and posterior $R(x)$ distributions is then:

$$D_\text{KL}(Q||R) = \sum_{i=1}^k q_i \log_2 \left( \frac{q_i}{R(x_i)} \right) = \sum_{i=1}^k q_i \log_2 \left( \frac{2\cdot q_i}{q_i + c_i} \right) $$

In the general expression, the first term $\log_2(1+C)$ is constant regardless of what the distribution $Q(X)$ looks like, so to limit the information loss we should keep the total pseudo-count $C$ small.  The total can be held equal across dictionary sizes by using fractional counts, that is, by assuming some total pseudo-count and dividing it by the dictionary size.

If we assume the maximum uncertainty prior (uniform distribution), the posterior becomes $R(x_i) = \frac{q_i + \frac{1}{k}}{2}$.

As $q_i$ approaches the (uniform) prior $c_i = \frac{1}{k}$, $R(x_i) = \frac{q_i + \frac{1}{k}}{2} \rightarrow \frac{\frac{2}{k}}{2} = \frac{1}{k}$

At the opposite extreme, as $q_i \rightarrow 0$, $R(x_i) \rightarrow \frac{0 + \frac{1}{k}}{2} = \frac{1}{2} \cdot \frac{1}{k}$

So the per-state distortion vanishes when $q_i = c_i$ (where $r_i = q_i$), and the relative change $r_i / q_i$ is largest as $q_i \rightarrow 0$, where the posterior assigns mass $\frac{1}{2k}$ to a state the model never produced.
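
The closed form derived above, $D_\text{KL}(Q||R) = \log_2(1+C) + \sum_i q_i \log_2 \frac{q_i}{q_i + c_i}$, can be checked numerically against the direct definition. The sketch below uses an arbitrary random $Q$ and an arbitrary total pseudo-count $C = 0.5$; both are assumptions chosen only to exercise the algebra.

```python
import numpy as np

rng = np.random.default_rng(42)

k = 2**4                          # dictionary size for b = 4
q = rng.dirichlet(np.ones(k))     # arbitrary Q(x_i), sums to 1
C = 0.5                           # total pseudo-count (arbitrary choice)
c = np.full(k, C / k)             # uniform prior, c_i = C / k

r = (q + c) / (1.0 + C)           # normalized posterior R(x_i)

# Direct definition of D_KL(Q||R) in bits
direct = float(np.sum(q * np.log2(q / r)))

# Closed form: log2(1 + C) + sum_i q_i log2(q_i / (q_i + c_i))
closed_form = float(np.log2(1.0 + C) + np.sum(q * np.log2(q / (q + c))))

print(f"direct = {direct:.6f}, closed form = {closed_form:.6f}")
```

The two values agree to floating-point precision, and setting `C = 1.0` reproduces the normalized-prior case where $R(x_i) = (q_i + c_i)/2$.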


### Check CDFs based on support coverage flags

Separate the dataset into two groups based on the `underspecified_model_flag`, which marks models where the support of $Q$ does not cover the support of $P$, and compare the CDFs of $D_{KL}$ for the two groups.

In [None]:
prior_cols = concurrent_prior_cols = [c for c in result_dict[4]['whole'].columns if c.startswith('dkl_concurrent_post')]
priors = list(sorted([c.split('_')[-1].split('R')[0] for c in prior_cols]))
priors = ['-4', '-3', '-2', '-1', '-0.5', '-0.2', '-0.1', '0', '0.1', '0.2', '0.5', '1', '2', '3', '4']

In [None]:
plots = []
pcts = np.linspace(0, 100, 200)
for pr in priors:
    cfig = figure(width=500, height=400, title=f'10^{pr} pseudo-counts')
    if len(plots) > 0:
        cfig = figure(width=500, height=400, title=f'10^{pr} pseudo-counts', x_range=plots[-1].x_range)
    c = 0
    print(f'   Processing 10^{pr} prior')
    for b, set_dict in result_dict.items():
        if b in [7, 9, 11]:
            continue
        dkl_col = f'dkl_concurrent_post_{pr}R'
        # part = set_dict['partial'].copy()
        # part.dropna(subset=[dkl_col], inplace=True)
        whole = set_dict['whole'].copy()
        whole.dropna(subset=[dkl_col], inplace=True)    

        # p_underspec = part.loc[part['underspecified_model_flag'] == 1, dkl_col]
        w_underspec = whole.loc[whole['underspecified_model_flag'] == 1, dkl_col]
        # p_covered = part.loc[part['underspecified_model_flag'] == 0, dkl_col]
        w_covered = whole.loc[whole['underspecified_model_flag'] == 0, dkl_col]
                
        # p_under_cdf = np.percentile(p_underspec, pcts) 
        w_under_cdf = np.percentile(w_underspec, pcts) 
        # p_cover_cdf = np.percentile(p_covered, pcts) 
        w_cover_cdf = np.percentile(w_covered, pcts) 
        # cfig.line(p_under_cdf, pcts, color=Bokeh6[c], line_width=2, legend_label=f'{b}b underspec')
        # cfig.line(p_cover_cdf, pcts, color=Bokeh6[c], line_width=2, legend_label=f'{b}b covered', line_dash='dashed')
        cfig.line(w_under_cdf, pcts, color=Bokeh6[c], line_width=2, legend_label=f'{b}b underspec')
        cfig.line(w_cover_cdf, pcts, color=Bokeh6[c], line_width=2, legend_label=f'{b}b covered', line_dash='dashed')

        c += 1
        cfig.xaxis.axis_label = dkl_col
        cfig.yaxis.axis_label = r'$$\text{Pr}(x \leq X)$$'
        cfig.axis.background_fill_alpha = 0.6
        cfig.legend.location = 'bottom_right'
        cfig.legend.click_policy = 'hide'
        cfig = dpf.format_fig_fonts(cfig, font_size=16)
    plots.append(cfig)
        

In [None]:
layout = gridplot(plots, ncols=3, width=400, height=350)
show(layout)

### Information Loss Due to the Quantization

The quantization itself sheds some of the information in both $P$ and $Q$.  We can quantify this on individual signals by mapping one quantization $Q(x_i | \rho_1)$ to another $Q(x_i | \rho_2)$, where $\rho_b$ denotes the quantization scheme corresponding to a dictionary size of $2^b$ symbols.

1. Choose a baseline bitrate. I think it makes sense to set the highest bitrate (largest dictionary size is 12 bits) as the baseline since it reflects preserving the most information from the signal.
2. Compute the ratio of KL divergence between bitrates $D_{KL}(P||R, \rho_b) / D_{KL}(P||R, \rho_{12})$ for $b \in \{10, 8, 6, 4\}$

**NEED TO RETHINK THIS**

### Distortion Due to the Assumed Error

Measurement (rating curve) uncertainty carries potentially the biggest distortion of both P and Q.  If instead of discrete points we treated observations as distributions relative to an assumed error distribution, then observations could count in multiple states in proportion to how much of the error interval overlaps with each bin.  The effect of this is greatest on large dictionaries with many empty bins, and the effect is the same as adding a prior -- information is lost.  

Compare $D_{KL}(P_p||R_p) / D_{KL}(P_w||R_w)$ where $p$ represents "partial counts", or the set where the observations are treated as probability distributions over some assumption of error (in this case, 10% uniform), and $w$ represents "whole counts", i.e. the state frequencies are computed strictly by the bin the observation falls in.
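
To make the "partial counts" idea concrete, the sketch below spreads each observation over the bins in proportion to how much of its error interval overlaps each bin, and compares the result with whole counts. The function name `partial_counts`, the symmetric interval $[v(1-\epsilon), v(1+\epsilon)]$ with $\epsilon = 0.1$, and the clipping to the binned range are assumptions; the processing behind the `_partial_counts.csv` files may differ.

```python
import numpy as np

def partial_counts(values, bin_edges, rel_err=0.1):
    """Spread each observation over bins in proportion to the overlap of the
    interval [v(1 - rel_err), v(1 + rel_err)] with each bin.

    Assumes every value is positive and its interval intersects the binned range.
    """
    bin_edges = np.asarray(bin_edges, dtype=float)
    lo_edges, hi_edges = bin_edges[:-1], bin_edges[1:]
    counts = np.zeros(len(bin_edges) - 1)
    for v in values:
        # uniform error interval, clipped so each observation still sums to one count
        lo = max(v * (1 - rel_err), bin_edges[0])
        hi = min(v * (1 + rel_err), bin_edges[-1])
        overlap = np.clip(np.minimum(hi, hi_edges) - np.maximum(lo, lo_edges), 0, None)
        counts += overlap / overlap.sum()
    return counts

edges = np.linspace(0.0, 8.0, 9)   # eight unit-width bins
values = [2.0, 5.0]                # two observations sitting on bin edges

whole = np.histogram(values, bins=edges)[0].astype(float)
partial = partial_counts(values, edges, rel_err=0.1)

print("whole:  ", whole)
print("partial:", partial)
```

Each observation still contributes a total of one count, but the mass is shared with neighbouring bins, which fills otherwise empty states much like a prior does.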

**HERE AGAIN NEED TO RETHINK IF THIS MAKES SENSE ON INDIVIDUAL VS. PAIRED BASIS**

In [None]:
# load the error model distortion values for individual stations
error_model_distortion_fname = 'data/error_model_distortion/error_model_distortion_test.csv'
err_df = pd.read_csv(error_model_distortion_fname)
err_df.head()

### Basis for a "Distinguishability Criteria"

To compute $D_{KL}(P||Q)$ on discrete distributions $P(x)$ and $Q(x)$, where $Q$ is a simulation or model of $P$, we must add a prior to prevent division by zero in $D_{KL}(P||Q) = \sum_{i=1}^k P(x_i)\log_2\frac{P(x_i)}{Q(x_i)}$ if $q_i = 0$ for any state $i$ where $p_i > 0$.

The assumption of a uniform prior in the form of pseudo-counts adds noise to $Q$.  If we want to compare two models $Q_a$ and $Q_b$ to determine which is more representative of $P$ based on the KL divergence, we should check that $|D_{KL}(P||Q_a) - D_{KL}(P||Q_b)| > D_{KL}(Q_a||R_a) + D_{KL}(Q_b||R_b)$ where $R(x_i) = (q_i + c_i) / (1 + C)$ and $C = \sum_{i=1}^k c_i$ for a general prior distribution $\alpha(x_i) = \{c_1, c_2, \dots, c_k\}$.
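
A minimal sketch of this check is given below. It evaluates the divergences from $P$ against the posteriors $R_a$ and $R_b$ rather than against $Q_a$ and $Q_b$ directly, since $D_{KL}(P||Q)$ is undefined wherever $q_i = 0$; that substitution, the function names, and the toy distributions are assumptions of the sketch.

```python
import numpy as np

def kl_bits(p, q):
    """D_KL(P||Q) in bits; states with p_i = 0 contribute nothing."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log2(p[mask] / q[mask])))

def posterior(q, c):
    """R(x_i) = (q_i + c_i) / (1 + C), with C = sum(c)."""
    return (q + c) / (1.0 + c.sum())

def distinguishable(p, q_a, q_b, c):
    """Return (flag, gap, distortion), where flag is True when the gap between
    the two models exceeds the combined distortion introduced by the prior."""
    r_a, r_b = posterior(q_a, c), posterior(q_b, c)
    gap = abs(kl_bits(p, r_a) - kl_bits(p, r_b))
    distortion = kl_bits(q_a, r_a) + kl_bits(q_b, r_b)
    return gap > distortion, gap, distortion

p = np.array([0.40, 0.30, 0.20, 0.10])
q_a = np.array([0.35, 0.30, 0.25, 0.10])   # close to P
q_b = np.array([0.10, 0.20, 0.30, 0.40])   # far from P

weak_prior = np.full(4, 0.01 / 4)          # C = 0.01
strong_prior = np.full(4, 100.0 / 4)       # C = 100, swamps both models

ok_weak, gap_weak, dist_weak = distinguishable(p, q_a, q_b, weak_prior)
ok_strong, gap_strong, dist_strong = distinguishable(p, q_a, q_b, strong_prior)

print(ok_weak, round(gap_weak, 3), round(dist_weak, 5))
print(ok_strong, round(gap_strong, 3), round(dist_strong, 3))
```

With the weak prior the two models are clearly separated, while the strong prior pulls both posteriors toward uniform so that the remaining gap is smaller than the distortion the prior itself introduced.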

In [None]:
def generate_and_plot_lognormal(mu1, sigma1, mu2, sigma2, b, prior_pseudo_count, y_range=(0, 0.035)):
    """
    Generates two log-normal distributions, quantizes them with 2^b symbols,
    applies a uniform pseudo-count prior to one, and plots the distributions
    with Bokeh (including the posterior as a dashed line).

    Parameters:
    - mu1, sigma1: Parameters for the first log-normal distribution.
    - mu2, sigma2: Parameters for the second log-normal distribution.
    - b: Number of symbols = 2^b for quantization.
    - prior_pseudo_count: Uniform pseudo-count to apply to one distribution.
    """
    # Define the quantization bins (2^b symbols) and their midpoints
    minx, maxx = 0.0, 5
    bins = np.linspace(minx, maxx, 2 ** b + 1)
    bins_midpoints = (bins[:-1] + bins[1:]) / 2

    # Evaluate the log-normal densities at the bin midpoints
    dist1 = lognorm.pdf(bins_midpoints, sigma1, scale=np.exp(mu1))
    dist2 = lognorm.pdf(bins_midpoints, sigma2, scale=np.exp(mu2))

    # Quantize the distributions
    # Normalize the distributions to form proper PDFs over the quantized bins
    P = dist1 / np.sum(dist1)
    Q = dist2 / np.sum(dist2)

    # Apply the uniform prior as pseudo-counts to the second distribution
    prior_counts = np.full_like(Q, prior_pseudo_count)
    posterior_counts = Q * np.sum(dist2) + prior_counts  # Adding pseudo-counts
    R = posterior_counts / np.sum(posterior_counts)  # Renormalize

    assert np.abs(sum(P) - 1) < 0.001, 'P does not sum to 1'
    assert np.abs(sum(Q) - 1) < 0.001, 'Q does not sum to 1'
    assert np.abs(sum(R) - 1) < 0.001, 'R does not sum to 1'

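    # scipy.special.kl_div is elementwise and uses the natural logarithm,
    # so the sums returned below are in nats rather than bits
    kl_pq = kl_div(P, Q)
    kl_pr = kl_div(P, R)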

    # Prepare data for Bokeh plotting
    source1 = ColumnDataSource(data=dict(x=bins_midpoints, y=P))
    source2 = ColumnDataSource(data=dict(x=bins_midpoints, y=Q))
    source2_posterior = ColumnDataSource(data=dict(x=bins_midpoints, y=R))

    # Create the Bokeh plot
    p = figure(title="", y_range=y_range,
               x_axis_label='x', y_axis_label=r'$$\text{Pr}(X)$$', width=500, height=350)

    p.line('x', 'y', source=source1, line_width=2, color='black', legend_label=f"P(x)=LN(x|{mu1:.1f},{sigma1:.2f})")
    p.line('x', 'y', source=source2, line_width=2, color='red', legend_label=f"Q(x)=LN(x|{mu2:.1f},{sigma2:.2f})")
    p.line('x', 'y', source=source2_posterior, line_width=2, line_dash='dashed',
           color='red', legend_label=f"R(X) (Prior={prior_pseudo_count})")

    # Configure the legend and show the plot
    p.legend.location = "top_right"
    p.legend.click_policy = "hide"

    return p, sum(kl_pq), sum(kl_pr)

In [None]:
mu, sigma = 0.25, 0.35
p1, klpq1, klpr1 = generate_and_plot_lognormal(mu, sigma, mu+0.01, sigma-0.1, 8, 0.03, y_range=(0, 0.035))
p2, klpq2, klpr2 = generate_and_plot_lognormal(mu, sigma, mu+0.01, sigma+0.1, 8, 0.03)
p1, p2 = dpf.format_fig_fonts(p1), dpf.format_fig_fonts(p2)

kl1_text = "    -->Q1 is closer to P than R1"
if klpr1 < klpq1:
    kl1_text = "    -->R1 is closer to P than Q1"
kl2_text = "    -->Q2 is closer to P than R2"
if klpr2 < klpq2:
    kl2_text = "    -->R2 is closer to P than Q2"

print(f'DKL_1(P||Q) = {klpq1:.2f}, DKL(P||R) = {klpr1:.2f}')
print(kl1_text)
print(f'DKL_2(P||Q) = {klpq2:.2f}, DKL(P||R) = {klpr2:.2f}')
print(kl2_text)
layout = gridplot([p1, p2], ncols=2, width=500, height=350)
show(layout)

In [None]:
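def compute_kl_divergence(P, Q):
    """Compute the KL divergence DKL(P || Q) in nats, skipping states where P = 0."""
    # mask before taking the log so zero-probability states never reach np.log
    mask = P != 0
    return np.sum(P[mask] * np.log(P[mask] / Q[mask]))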

def generate_and_plot_kl_vs_prior(mu1, sigma1, mu2, sigma2, b, priors, y_range=(0, 0.02)):
    """
    Generates two log-normal distributions, quantizes them with 2^b symbols,
    computes the posterior with varying priors, and plots KL divergences DKL(P||R) and DKL(Q||R).

    Parameters:
    - mu1, sigma1: Parameters for the first log-normal distribution.
    - mu2, sigma2: Parameters for the second log-normal distribution.
    - b: Number of symbols = 2^b for quantization.
    - priors: Array of prior pseudo-counts to apply to Q.
    - y_range: Range for the y-axis in the plot.
    """
    # Create the two log-normal distributions
    minx, maxx = 0.0, 5.0
    bins = np.linspace(minx, maxx, 2 ** b + 1)
    bins_midpoints = (bins[:-1] + bins[1:]) / 2

    # Evaluate the distributions at the bin midpoints
    dist1 = lognorm.pdf(bins_midpoints, sigma1, scale=np.exp(mu1))
    dist2 = lognorm.pdf(bins_midpoints, sigma2, scale=np.exp(mu2))

    # Normalize to create PDFs
    P = dist1 / np.sum(dist1)
    Q = dist2 / np.sum(dist2)

    kl_p_r_list = []
    kl_q_r_list, ratio_list = [], []

    # Compute KL divergences for each prior value
    for prior_pseudo_count in priors:
        prior_counts = np.full_like(Q, prior_pseudo_count)
        posterior_counts = Q * np.sum(dist2) + prior_counts  # Adding pseudo-counts
        R = posterior_counts / np.sum(posterior_counts)  # Renormalize

        # Ensure valid PDFs
        assert np.abs(sum(R) - 1) < 0.001, 'R does not sum to 1'

        # Compute KL divergences
        kl_p_r = compute_kl_divergence(P, R)
        kl_q_r = compute_kl_divergence(Q, R)

        kl_p_r_list.append(kl_p_r)
        kl_q_r_list.append(kl_q_r)
        
        # Compute ratio as percentage
        ratio = (kl_q_r / kl_p_r) * 100 if kl_p_r != 0 else np.nan
        ratio_list.append(ratio)

    # Prepare data for plotting
    source = ColumnDataSource(data=dict(
        prior=priors,
        kl_p_r=kl_p_r_list,
        kl_q_r=kl_q_r_list,
        ratio=ratio_list,
    ))

    # fixed upper limit of 10% on the secondary (distortion) axis
    ratio_range = (min(ratio_list) * 0.98, 10)

    # Create the Bokeh plot
    p = figure(title="",
               x_axis_label='Prior Pseudo-count',
               y_axis_label='KL Divergence',
               x_axis_type='log',
               y_range=y_range,
               width=600, height=400)

    # Add secondary y-axis for the ratio
    p.extra_y_ranges = {"ratio": Range1d(*ratio_range)}
    p.add_layout(LinearAxis(y_range_name="ratio", axis_label='Distortion (%)'), 'right')

    p.line('prior', 'kl_p_r', source=source, line_width=2, color='black', legend_label='DKL(P || R)')
    p.line('prior', 'kl_q_r', source=source, line_width=2, color='red', legend_label='DKL(Q || R)')

    # Plot the ratio on the secondary y-axis
    p.line('prior', 'ratio', source=source, line_width=2, color='red', 
           line_dash='dashed', y_range_name="ratio",
           legend_label='Prior Distortion (%)')

    p.line(priors, [5 for _ in priors], line_width=2, color='red',
           line_dash='dotted', y_range_name='ratio',
           legend_label='5% distortion limit')

    # Configure the legend and show the plot
    p.legend.location = "top_left"
    p.legend.click_policy = "hide"
    return p

In [None]:
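mu, sigma = 0.25, 0.35
# use a separate name so the list of prior labels (`priors`) used by other cells is not overwritten
prior_values = np.logspace(-6, -2, 100)
# set the bitrate explicitly instead of relying on a loop variable left over from an earlier cell
b = bitrates[-1]
prior_vs_kld = generate_and_plot_kl_vs_prior(mu, sigma, mu+0.025, sigma - 0.02, b, prior_values)
prior_vs_kld = dpf.format_fig_fonts(prior_vs_kld)
show(prior_vs_kld)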

Compute the $D_{KL}$ distortion curve on all samples.

1. Set the prior array according to what has been computed for the dataset.
2. Compute the ratio $D_{KL}(Q||R) / D_{KL}(P||R)$ for all priors. Both terms are pre-computed in `KL_results_<b>bits_<revision_date>.csv`.
3. For each prior, compute the median, the interquartile range, and the 90% interval (5th to 95th percentiles) of the distortion ratios.

In [None]:
# the priors used in processing DKL 
pseudo_counts = [-4, -3, -2, -1, -0.5, -0.2, -0.1, 0, 0.1, 0.2, 0.5, 1, 2, 3, 4]
pcts = {}
for bitrate in bitrates:
    partial_label = 'partial' # or 'whole'
    tdf =  result_dict[bitrate][partial_label]
    pcts[bitrate] = []
    for pc in pseudo_counts:
        # the distortion label
        dkl_qr_label = f'dkl_prior_distortion_post_{pc}R'
        # the posterior label
        dkl_pr_label = f'dkl_concurrent_post_{pc}R'
        
        pc_data = tdf.copy().dropna(subset=[dkl_qr_label, dkl_pr_label])
        ratios = pc_data[dkl_qr_label] / pc_data[dkl_pr_label]
        vals = 100 * np.percentile(ratios, [5, 25, 50, 75, 95])
        pcts[bitrate].append(vals)

In [None]:
distortion_figs = []
for bitrate in bitrates:
    if len(distortion_figs) > 0:
        basis_y_range = distortion_figs[-1].y_range
        basis_x_range = distortion_figs[-1].x_range
        dfig = figure(title=f'{bitrate} bits', width=600, height=400, y_range=basis_y_range, x_range=basis_x_range)
    else:
        dfig = figure(title=f'{bitrate} bits', width=600, height=400, y_range=(-5, 20))       
    
    dist_df = pd.DataFrame(pcts[bitrate], columns=['lb', 'lq', 'median', 'uq', 'ub'])
    dist_df['prior'] = pseudo_counts
    source = ColumnDataSource(dist_df)
    
    dfig.varea(x='prior', y1='lb', y2='ub', source=source, 
               legend_label='90% CI', color='grey', fill_alpha=0.4)
    dfig.varea(x='prior', y1='lq', y2='uq', source=source, 
               legend_label='IQR', color='black', fill_alpha=0.4)
    dfig.line('prior', 'median', source=source, legend_label='median', 
              color='red', line_width=2, line_dash='dashed')
    dfig.line([-5, 5], [5, 5], color='red', line_dash='dotted', 
              legend_label='distortion limit')
    dfig.xaxis.axis_label = r'$$\text{Prior } [10^x \text{pseudo-counts}]$$'
    if len(distortion_figs) == 0:
        dfig.yaxis.axis_label = r'$$\text{distortion} [\%]$$'
    dfig.legend.background_fill_alpha = 0.45
    dfig.legend.location = 'top_left'
    dfig = dpf.format_fig_fonts(dfig)
    distortion_figs.append(dfig)

In [None]:
layout = gridplot(distortion_figs, ncols=4, width=400, height=350)
show(layout)

## Problem Formulation

Now that we have determined a basis for interpreting the magnitude of $D_\text{KL}$ in terms of the sources of "distortion" to the signal (the prior, the bitrate, and some estimate of the rating curve uncertainty), we can use these "uncertainties" when comparing models.  That is, we can filter out model comparisons where the KL divergence is small relative to the sources of uncertainty.

To determine a suitable approach, it helps to first understand what the distributions of our target variables look like.

Given what we know about the distortion from the prior, and understanding that underspecified models can be particularly sensitive to the prior if they mis-specify any $p_i$ with large probability, it is helpful to know for what proportion of the dataset the model $Q$ underspecifies the target $P$.  Stated otherwise, for how many sample pairs does the support of $Q$ not cover the support of $P$?
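
The support-coverage test itself is simple to express. The sketch below shows one plausible definition of the flag for a single pair of state-count vectors; the function name `is_underspecified` and the toy counts are assumptions, and the flag stored in the result files may have been computed differently.

```python
import numpy as np

def is_underspecified(p_counts, q_counts):
    """True when some state has observations in P but none in Q,
    i.e. the support of Q does not cover the support of P."""
    p_counts = np.asarray(p_counts)
    q_counts = np.asarray(q_counts)
    return bool(np.any((p_counts > 0) & (q_counts == 0)))

# Q covers every state that P occupies
flag_covered = is_underspecified([0, 3, 5, 0], [1, 2, 4, 0])

# P occupies the last state but Q never does
flag_uncovered = is_underspecified([0, 3, 5, 1], [1, 2, 4, 0])

print(flag_covered, flag_uncovered)
```

A state that is empty in both $P$ and $Q$ does not trigger the flag, because it contributes nothing to $D_{KL}(P||Q)$.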

In [None]:
pcounts, wcounts, brs = [], [], []
for b, set_dict in result_dict.items():
    if b in [7, 9, 11]:
        continue
    partial_counts = set_dict['partial']
    whole_counts = set_dict['whole']
    partial_counts_underspecified = partial_counts['underspecified_model_flag'].sum()
    whole_counts_underspecified = whole_counts['underspecified_model_flag'].sum()
    partial_pct = 100-100*partial_counts_underspecified / len(partial_counts)
    whole_pct = 100-100*whole_counts_underspecified / len(whole_counts)
    print(f'{b} bits: {whole_pct:.1f}% (whole counts), {partial_pct:.1f}% (partial counts)')
    brs.append(b)
    pcounts.append(partial_pct)
    wcounts.append(whole_pct)

In [None]:
pc_fig = figure(width=500, height=400)
pc_fig.line(brs, wcounts, legend_label='Deterministic', line_width=2,
           color='black')
pc_fig.line(brs, pcounts, legend_label='Probabilistic', line_width=2, 
           line_dash='dashed', color='black')
pc_fig.legend.location = 'bottom_left'
pc_fig.xaxis.axis_label = r'$$\text{Dictionary Size } [2^x \text{bits}]$$'
pc_fig.yaxis.axis_label = r'$$\text{Support coverage rate [\%] }$$'
pc_fig = dpf.format_fig_fonts(pc_fig)
show(pc_fig)

### Combine Error Model Distortion and $D_{KL}$ CDF

Compare the distortion due to an assumed error model versus the distribution of DKL values. 
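
The plotting cell that follows looks up percentile bounds of the error-model distortion by bitrate in a dictionary. One way such a lookup could be assembled from a per-station table is sketched below. The column names `bitrate`, `err`, and `value` match those used for `err_df` elsewhere in this notebook, but the function name `build_error_bounds`, the percentile levels, and the synthetic table are assumptions.

```python
import numpy as np
import pandas as pd

def build_error_bounds(err_df, percentiles=(2.5, 25, 50, 75, 97.5)):
    """Summarize per-station distortion values into percentile bounds.

    Returns {bitrate: DataFrame} where each frame has an 'err' column and
    one column per requested percentile level.
    """
    bounds = {}
    for bitrate, by_bitrate in err_df.groupby("bitrate"):
        rows = []
        for err, group in by_bitrate.groupby("err"):
            row = {"err": err}
            row.update(zip(percentiles, np.percentile(group["value"], percentiles)))
            rows.append(row)
        bounds[bitrate] = pd.DataFrame(rows)
    return bounds

# Synthetic stand-in for the per-station distortion table (assumption)
demo_df = pd.DataFrame({
    "bitrate": [4] * 5 + [6] * 5,
    "err": [0.1] * 10,
    "value": [0.0, 1.0, 2.0, 3.0, 4.0, 10.0, 11.0, 12.0, 13.0, 14.0],
})

bounds = build_error_bounds(demo_df)
row_4bit = bounds[4].loc[bounds[4]["err"] == 0.1, :].to_dict("records")[0]
median_4bit = row_4bit[50]
print(median_4bit)
```

The percentile levels are a parameter so that they can be matched to whichever keys the consuming cell indexes.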

In [None]:
plots = []
pcts = np.linspace(0, 100, 200)
for b, set_dict in result_dict.items():
    if b in [7, 9, 11]:
        continue
    cfig = figure(width=500, height=400, title=f'{b} bits')
    if len(plots) > 0:
        cfig = figure(width=500, height=400, title=f'{b} bits', 
                      x_range=plots[-1].x_range, y_range=plots[-1].y_range)
    c = 0
    print(f'   Processing {b} bits')
    # plot the error model distortion "bounds"
    err_model = err_bound_dict[b].copy()
    err_pct = 0.1
    err_bounds = err_model.loc[err_model['err'] == err_pct, :].to_dict('records')[0]
    cfig.harea(x1=[err_bounds[2.5], err_bounds[2.5]], x2=[err_bounds[98.5], err_bounds[98.5]],
               y=[0, 100], color='grey', fill_alpha=0.4, legend_label='95% CI')
    cfig.harea(x1=[err_bounds[25], err_bounds[25]], x2=[err_bounds[75], err_bounds[75]],
               y=[0, 100], color='black', fill_alpha=0.4, legend_label='IQR')
    cfig.line([err_bounds[50], err_bounds[50]], [0, 100], color='red', line_width=3, 
              line_dash='dotted', legend_label=f'{int(100*err_pct)}% error')

    for pr in priors:
        dkl_col = f'dkl_concurrent_post_{pr}R'
        part = set_dict['partial'].copy()
        part.dropna(subset=[dkl_col], inplace=True)
        whole = set_dict['whole'].copy()
        whole.dropna(subset=[dkl_col], inplace=True)    

        p_underspec = part.loc[part['underspecified_model_flag'] == 1, dkl_col]
        w_underspec = whole.loc[whole['underspecified_model_flag'] == 1, dkl_col]
        p_covered = part.loc[part['underspecified_model_flag'] == 0, dkl_col]
        w_covered = whole.loc[whole['underspecified_model_flag'] == 0, dkl_col]

        p_under_cdf = np.percentile(p_underspec, pcts) 
        w_under_cdf = np.percentile(w_underspec, pcts) 
        p_cover_cdf = np.percentile(p_covered, pcts) 
        w_cover_cdf = np.percentile(w_covered, pcts) 
        cfig.line(p_under_cdf, pcts, color=Category20[17][c], line_width=2, legend_label=f'{pr} underspec')
        cfig.line(p_cover_cdf, pcts, color=Category20[17][c], line_width=2, legend_label=f'{pr} covered', line_dash='dashed')

        c += 1
        cfig.xaxis.axis_label = dkl_col
        cfig.yaxis.axis_label = r'$$\text{Pr}(x \leq X)$$'
        cfig.axis.background_fill_alpha = 0.6
        cfig.legend.location = 'bottom_right'
        cfig.legend.click_policy = 'hide'
        cfig = dpf.format_fig_fonts(cfig, font_size=16)
    plots.append(cfig)

In [None]:
layout = gridplot(plots, ncols=3, width=450, height=700)
show(layout)

In [None]:
error_model_pct = 0.1
for b, set_dict in result_dict.items():
    if b in [7, 9, 11]:
        continue
    
    print(f'Processing {b} bits')
    # get the error model distortion by bitrate
    filtered_err = err_df[(err_df['bitrate'] == b) & (err_df['err'] == error_model_pct)].copy()
    filtered_err.set_index('official_id', inplace=True)
    err_by_station = filtered_err[['value']].to_dict('index')
    
    for pr in priors:
        dkl_col = f'dkl_concurrent_post_{pr}R'
        part = set_dict['partial'].copy()
        part.dropna(subset=[dkl_col], inplace=True)
        whole = set_dict['whole'].copy()
        whole.dropna(subset=[dkl_col], inplace=True)

        # create a boolean flag to indicate that the distortion 
        # due to the assumed error is greater than the DKL
        # here i use the distortion of P only.
        part[f'P_distortion_{pr}_prior'] = part['proxy'].map(lambda x: err_by_station.get(x, {}).get('value'))
        whole[f'P_distortion_{pr}_prior'] = whole['proxy'].map(lambda x: err_by_station.get(x, {}).get('value'))
        part[f'P_distortion_flag_{pr}'] = part[f'P_distortion_{pr}_prior'] >= part[dkl_col]
        whole[f'P_distortion_flag_{pr}'] = whole[f'P_distortion_{pr}_prior'] >= whole[dkl_col]
        pct_flags_partial_counts = 100*part[f'P_distortion_flag_{pr}'].sum() / len(part)
        pct_flags_whole_counts = 100*whole[f'P_distortion_flag_{pr}'].sum() / len(whole)
        print(f'    {pct_flags_partial_counts:.1f}%/{pct_flags_whole_counts:.1f}% partial/whole counts are flagged as < {int(100*error_model_pct)}% RC error distortion given 10^{pr} prior')


### Binary Classification Problem Part 1: Prediction and Feature Group Importance

The simulated runoff distribution is said to be "optimized" on the proxy.  That is, the bin edges are defined to suit the observed range of the streamflow from the catchment that is used as a donor/proxy (the model $Q(x_i)$) to simulate the general hydrological response of some target.  The $D_{KL}(P||Q)$ then represents the information cost of poorly estimated frequencies compared to the target $P(x_i)$ observed *a posteriori*.

The support of a probability distribution $P(x_i)$ is the set of states $x_i \in X$ for which $P(x_i) > 0$, and likewise for $Q(x_i)$.  If the supports of $P$ and $Q$ do not match, the KL divergence behaves differently depending on which distribution has the wider support. By convention (justified by the limit $\lim_{p \to 0^+} p\log_2 p = 0$, which follows from L'Hôpital's rule), $0\log_2(0/0) = 0$ and $0\log_2(0/q_i) = 0$ for $q_i > 0$. 

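If the support of $P$ does not cover $Q$, then the states where $p_i = 0$ and $q_i > 0$ contribute nothing to the sum.  However, because $Q$ spends probability mass on those states, it must under-represent at least one state that $P$ does visit, so we are bound to have some $p_i/q_i > 1$, which yields $p_i\log_2(p_i/q_i)$ greater than zero, weighted by the frequency $p_i$.  The divergence remains finite in this case.  Conversely, if the support of $Q$ does not cover $P$, then we have a problem, since $q_i = 0$ where $p_i > 0$ leads to $\log_2(p_i / 0)$, which is unbounded.  This issue was addressed above where we assumed a prior: $$R(x_i) = \frac{q_i + c_i}{\sum_{j=1}^k (q_j + c_j)} $$  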
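
The two support-mismatch cases can be illustrated with a toy example.  The sketch below is not the implementation used in this study: the function names `kl_divergence_bits` and `apply_prior`, the four-state distributions, and the uniform pseudo-count $c_i = 10^{-2}$ added to raw bin counts are all assumptions made for illustration.

```python
import numpy as np

def kl_divergence_bits(p, q):
    """D_KL(P||Q) in bits, using the convention 0 * log2(0 / q) = 0."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0  # states with p_i = 0 contribute nothing to the sum
    if np.any(q[mask] == 0):
        return np.inf  # Q assigns zero probability to a state that P visits
    return float(np.sum(p[mask] * np.log2(p[mask] / q[mask])))

def apply_prior(q_counts, prior_exponent):
    """R(x_i) = (q_i + c) / sum_j (q_j + c), with an assumed uniform c = 10**prior_exponent."""
    q_counts = np.asarray(q_counts, dtype=float)
    smoothed = q_counts + 10.0 ** prior_exponent
    return smoothed / smoothed.sum()

p = np.array([0.5, 0.5, 0.0, 0.0])           # target P visits two states
q_wide = np.array([0.25, 0.25, 0.25, 0.25])  # Q covers P (and more)
q_narrow_counts = np.array([10, 0, 0, 0])    # Q misses a state that P visits

d_wide = kl_divergence_bits(p, q_wide)
d_narrow = kl_divergence_bits(p, q_narrow_counts / q_narrow_counts.sum())
d_smoothed = kl_divergence_bits(p, apply_prior(q_narrow_counts, -2))
print(d_wide, d_narrow, d_smoothed)
```

In this example `d_wide` is 1 bit (a finite penalty for a model that is too wide), `d_narrow` is infinite, and `d_smoothed` is finite but considerably larger than `d_wide`, which is why underspecified models are sensitive to the choice of prior.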

This prior adds some noise to the signal, and the amount of noise is a function of the bitrate, the prior, and the distribution $Q(x_i)$. If $Q$ covers $P$, then the smallest prior adds the least noise to $Q$.  However, we do not know ahead of time whether the support of $Q$ covers $P$.  Since the range of $P(x_i)$ is unknown beforehand (though it could be predicted from catchment attributes), it is helpful to know whether the support of a model $Q(x_i)$ covers that of some target $P(x_i)$.  This question can be formulated as a binary classification problem.  
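
As a minimal sketch of the binary label, the check below flags a pair as underspecified when the target visits at least one state that the model never does.  The function name `is_underspecified` and the example frequencies are hypothetical, and it is an assumption that the `underspecified_model_flag` column in the dataset encodes this same condition.

```python
import numpy as np

def is_underspecified(p, q):
    """True when some state has p_i > 0 but q_i = 0, i.e. Q does not cover P."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return bool(np.any((p > 0) & (q == 0)))

# hypothetical binned frequencies for a target (P) and a proxy model (Q)
p = np.array([0.0, 0.6, 0.4, 0.0])
q = np.array([0.1, 0.5, 0.3, 0.1])

flag_q_models_p = is_underspecified(p, q)  # Q used as the model of P
flag_p_models_q = is_underspecified(q, p)  # roles reversed
print(flag_q_models_p, flag_p_models_q)
```

The label is directional: here $Q$ covers $P$, but $P$ would be an underspecified model of $Q$.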


In [None]:
# load the cluster information from the Methods section 
# where we partitioned the graph to evaluate the distribution
# of the target variable across folds
# n_clusters = 15
# for spatial clustering
# cluster_fname = f'stn_attributes_with_assigned_cluster_{n_clusters}.geojson'
# for distributed classification (alternating labels spatially)
n_classes = 5
cluster_fname = f'stn_attributes_with_{n_classes}_spatial_partitions.geojson'
cluster_ids = gpd.read_file(os.path.join('data', cluster_fname))
cluster_ids.head()

In [None]:
# attr_df['cluster_id'] = attr_df['official_id'].apply(lambda x: cluster_ids.loc[cluster_ids['official_id'] == x, 'cluster'].values[0])
attr_df['cluster_id'] = attr_df['official_id'].apply(lambda x: cluster_ids.loc[cluster_ids['official_id'] == x, f'{n_classes}_spatial'].values[0])
# check the number of stations per fold
f_unique, f_counts = np.unique(attr_df['cluster_id'], return_counts=True)
print(f_unique)
f_counts

### Load pairwise attribute comparisons

Load a few rows from one of the pairwise data files.  These contain attributes about divergence measures that are computed on concurrent and non-concurrent time series at two monitored locations.

In [None]:
kld_columns = [c for c in test_df.columns if 'dkl' in c]

binary_results_folder = os.path.join(BASE_DIR, 'data', 'kld_prediction_results_binary')
os.makedirs(binary_results_folder, exist_ok=True)

### Define attribute groupings

In [None]:
terrain = ['drainage_area_km2', 'elevation_m', 'slope_deg', 'aspect_deg'] #'gravelius', 'perimeter',
land_cover = [
    'land_use_forest_frac_2010', 'land_use_grass_frac_2010', 'land_use_wetland_frac_2010', 'land_use_water_frac_2010', 
    'land_use_urban_frac_2010', 'land_use_shrubs_frac_2010', 'land_use_crops_frac_2010', 'land_use_snow_ice_frac_2010']
soil = ['logk_ice_x100', 'porosity_x100']
climate = ['prcp', 'srad', 'swe', 'tmax', 'tmin', 'vp', 'high_prcp_freq', 'high_prcp_duration', 'low_prcp_freq', 'low_prcp_duration']
all_attributes = terrain + land_cover + soil + climate
len(all_attributes)

In [None]:
# define the amount of data to set aside for final testing
# n_cv_folds = 5
n_boost_rounds = 2500
random_seed = 42
loss_function = 'binary:logistic'  # binary classification objective

#define if testing concurrent or nonconcurrent data
concurrent = 'concurrent'

# partial counts refer to the test where observations were assigned
# a uniform distribution to approximate error and allow fractional 
# observations in state space
partial_counts = False

# cross validation parameters
optimize_cv_folds = False
cv_fold_seed = 83561
n_cv_fold_optimization_trials = 10
# limit the maximum distance to make the network 
# graph of station pairs more separable
max_centroid_distance = 1000

attribute_set_names = ['proximity', '+climate', '+terrain', '+land_cover', '+soil']
attribute_group_sets = [['centroid_distance'], climate, terrain, land_cover, soil]

In [None]:
# create the fold dictionary to initialize the train/test split for each fold
def create_fold_dict(df):
    # note: relies on the global attr_df for the station-to-fold assignment
    cluster_ids = sorted(set(attr_df['cluster_id'].values))
    print(f'The fold ids are: {cluster_ids}')
    fold_dict = {}
    for c in cluster_ids:
        cluster_stns = attr_df.loc[attr_df['cluster_id'] == c, 'official_id'].values
        proxy_in = df['proxy'].isin(cluster_stns)
        target_in = df['target'].isin(cluster_stns)
        # in-group edges (both stations belong to the fold)
        dkl_sample_AND = df[proxy_in & target_in]
        # out-of-group edges (neither station belongs to the fold)
        dkl_sample_NOR = df[~proxy_in & ~target_in]
        # assert that the two groups share no stations
        and_official_ids = set(dkl_sample_AND['proxy']) | set(dkl_sample_AND['target'])
        nor_official_ids = set(dkl_sample_NOR['proxy']) | set(dkl_sample_NOR['target'])
        assert and_official_ids.isdisjoint(nor_official_ids), 'train and test sets share stations'
        fold_dict[c] = {
            'test': dkl_sample_AND.index.values,
            'train': dkl_sample_NOR.index.values,
        }
    return fold_dict
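
The station-disjoint split can be checked on a toy network.  The sketch below uses a hypothetical four-station assignment (`station_fold`) and a hypothetical helper (`split_pairs`) rather than the study data; it only illustrates why pairs that straddle a fold boundary have to be discarded.

```python
import pandas as pd

# hypothetical toy data: four stations assigned to two folds
station_fold = {'A': 0, 'B': 0, 'C': 1, 'D': 1}
pairs = pd.DataFrame(
    [(p, t) for p in station_fold for t in station_fold if p != t],
    columns=['proxy', 'target'],
)

def split_pairs(pairs, station_fold, fold):
    """Test on pairs fully inside `fold`; train on pairs fully outside it."""
    in_fold = {s for s, f in station_fold.items() if f == fold}
    proxy_in = pairs['proxy'].isin(in_fold)
    target_in = pairs['target'].isin(in_fold)
    test = pairs[proxy_in & target_in]
    train = pairs[~proxy_in & ~target_in]
    return train, test

train, test = split_pairs(pairs, station_fold, fold=0)
train_stations = set(train['proxy']) | set(train['target'])
test_stations = set(test['proxy']) | set(test['target'])
# pairs with one station in the fold and one outside are used by neither set
n_dropped = len(pairs) - len(train) - len(test)
print(len(train), len(test), n_dropped)
```

Of the 12 directed pairs, 2 are used for testing, 2 for training, and 8 are dropped.  This loss of mixed pairs is the cost of avoiding leakage, and it grows as the pair graph becomes more connected.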

In [None]:
def format_features(input_attributes):
    features = []
    for a in input_attributes:
        features.append(f"proxy_{a}".lower())
        features.append(f"target_{a}".lower())
    return features

def add_attributes(attr_df, df, attribute_cols):
    """
    Adds attributes from attr_df to df based on the 'proxy' and 'target' columns
    using map for efficient lookups.

    Parameters:
    attr_df (pd.DataFrame): DataFrame with 'official_id' and attribute columns.
    df (pd.DataFrame): DataFrame with 'proxy' and 'target' columns.
    attribute_cols (list of str): List of attribute columns to add to df.

    Returns:
    pd.DataFrame: Updated df with added target_*, proxy_*, and *_diff columns.
    """
    # Create dictionaries for each attribute for quick lookup
    attr_dicts = {col: attr_df.set_index('official_id')[col].to_dict() for col in attribute_cols}

    # Add target attributes
    for col in attribute_cols:
        df[f'target_{col}'] = df['target'].map(attr_dicts[col])

    # Add proxy attributes
    for col in attribute_cols:
        df[f'proxy_{col}'] = df['proxy'].map(attr_dicts[col])

    for col in attribute_cols:
        df[f'{col}_diff'] = df[f'target_{col}'] - df[f'proxy_{col}'] 

    return df

In [None]:
def train_binary_model(
    input_data, train_idxs, test_idxs, attributes, target, params, num_boost_rounds,
):
    train_data = input_data.iloc[train_idxs]
    test_data = input_data.iloc[test_idxs]

    X_train = train_data[attributes].values
    X_test = test_data[attributes].values
    # the target is a boolean flag, so cast it to 0/1 labels (no log transform)
    Y_train = train_data[target].values.astype(int)
    Y_test = test_data[target].values.astype(int)

    dtrain = xgb.DMatrix(X_train, label=Y_train)
    dtest = xgb.DMatrix(X_test, label=Y_test)

    eval_list = [(dtrain, "train"), (dtest, "eval")]

    params['eval_metric'] = 'auc'
    evals_result = {}

    bst = xgb.train(
        params,
        dtrain,
        num_boost_rounds,
        evals=eval_list,
        verbose_eval=0,
        early_stopping_rounds=None,
        evals_result=evals_result,
    )

    # with the 'binary:logistic' objective, predict() already returns
    # probabilities, so no additional sigmoid transform is applied
    predicted_y = bst.predict(dtest)
    test_results = pd.DataFrame(
        {
            "predicted": predicted_y,
            "actual": Y_test,
        }
    )

    return bst, test_results, evals_result

In [None]:
def run_binary_trials_custom_CV(
    set_name,
    attributes,
    target,
    input_data,
    fold_dict, 
    n_optimization_rounds,
    num_boost_rounds,
    results_folder,
    loss='binary:logistic',
    random_seed=42
):
    """
    Custom CV refers to cross validation.  Custom cross validation means the 
    held-out set must be determined in a more robust way to avoid "data leakage".
    That is, the pairs making up the training, validation, and test sets must 
    be made up of pairings from unique sets of stations.
    """
    # select random hyperparameters for n_optimization_rounds
    sample_choices = np.arange(0.5, 0.9, 0.02)  # subsample and colsample percentages
    lr_choices = np.arange(0.001, 0.1, 0.0005)  # learning rates
    learning_rates = np.random.choice(lr_choices, n_optimization_rounds)
    subsamples = np.random.choice(sample_choices, n_optimization_rounds)
    colsamples = np.random.choice(sample_choices, n_optimization_rounds)

    all_results = []
    best_params = None
    best_mean_test_perf = 0
    # initialize the outputs so the return statement is valid
    # even if no trial improves on the initial score
    best_learning_curve = None
    best_trial_predictions = None

    for trial in range(n_optimization_rounds):
        lr, ss, cs = learning_rates[trial], subsamples[trial], colsamples[trial]
        params = {
            "objective": loss,
            "eta": lr,
            # "max_depth": 6,  # use default max_depth
            # "min_child_weight": 1, # use colsample and subsample instead of min_child_weight
            "subsample": ss,
            "colsample_bytree": cs,
            "seed": random_seed,
            "device": "cuda",  # note, change this to 'cpu' if your system doesn't have a CUDA GPU
            "sampling_method": "gradient_based",
            "tree_method": "hist",
        }

        results_fname = (
            f"{set_name}_{bitrate}_bits_{lr:.3f}_lr_{ss:.3f}_sub_{cs:.3f}_col.csv"
        )
        results_fpath = os.path.join(results_folder, results_fname)

        # k-fold cross validation
        n_samples = len(input_data)
        fold_no = 0
        fold_scores = []
        fold_balances = []
        learning_curves = []
        all_test_set_predictions = []
        for fold_no, cv_data in fold_dict.items():

            cv_model, cv_test, evals_result = train_binary_model(
                input_data,
                cv_data['train'],
                cv_data['test'],
                attributes,
                target,
                params,
                num_boost_rounds,
            )
            obs, pred = cv_test['actual'].values, cv_test['predicted'].values
            test_balance = sum(cv_test['actual'].values) / len(cv_test)
            
            obs_set = np.unique(obs)
            if len(obs_set) == 1:
                # AUC is undefined when the test fold contains a single class
                raise Exception('All observations have the same class')
            fpr, tpr, thresholds = roc_curve(obs, pred)
            auc_score = auc(fpr, tpr)
            
            # error = fbeta_score(obs, pred, 1)
            # print(f'Accuracy: {accuracy:.2f}')#, F1 beta: {error:.2f}')
            fold_scores.append(auc_score)
            learning_curves.append(evals_result)
            all_test_set_predictions.append(cv_test)
            fold_balances.append(test_balance)

        cv_mean, cv_std = np.mean(fold_scores), np.std(fold_scores)
        fold_balance_mean, fold_balance_std = np.mean(fold_balances), np.std(fold_balances)
        results_dict = {
            "trial": trial,
            "test_auc_mean": cv_mean,
            "test_auc_stdev": cv_std,
            "test_balance_mean": fold_balance_mean,
            "test_balance_std": fold_balance_std,
        }
        results_cols = list(results_dict.keys())
        results_dict.update(params)

        # append each trial once (a duplicate append would bias the summary statistics)
        all_results.append(results_dict)
        if (trial > 0) and (trial % 10 == 0):
            print(f"   completed {trial}/{n_optimization_rounds}")

        if round(cv_mean, 2) > round(best_mean_test_perf, 2):
            best_params = params
            best_mean_test_perf = cv_mean
            best_learning_curve = learning_curves
            best_trial_predictions = all_test_set_predictions
            print(f'    New best result: AUC={cv_mean:.2f} (trial {trial})')
    
    # save the best trial results
    # best_trial_test_predictions.to_csv(results_fpath)
    results_all_trials = pd.DataFrame(all_results)
    # get the mean and standard deviation of the error metrics over all trials
    all_trials_mean = results_all_trials[f"test_auc_mean"].mean()
    all_trials_stdev = results_all_trials[f"test_auc_mean"].std()
    all_fold_balances_mean = results_all_trials[f"test_balance_mean"].mean()
    # standard deviation (across trials) of the mean fold balance
    all_fold_balances_std = results_all_trials["test_balance_mean"].std()
    print(
        f"    {all_trials_mean:.2f} ± {2*all_trials_stdev:.3f} mean (95% CI) AUC ({100*all_fold_balances_mean:.0f}% mean fold balance +/-{all_fold_balances_std:.2f}) (of {n_optimization_rounds} hyperparameter optimization rounds.)"
    )
    
    return best_params, best_mean_test_perf, best_learning_curve, results_all_trials, best_trial_predictions



In [None]:
def predict_underspecification_from_attributes(attr_df, target_variable, max_centroid_distance, results_folder, prior,
                                loss_function=None, partial_counts=False, n_boost_rounds=100, random_seed=42, 
                              optimize_cv_folds=True, n_cv_fold_optimization_trials=20, cv_fold_seed=42):
    # partial_counts is a boolean, so do not compare it against the string "False"
    counts_key = 'partial' if partial_counts else 'whole'
    
    all_results = {}
    for bitrate in [4, 6, 8, 10]:
        all_results[bitrate] = {}
        print(f'bitrate = {bitrate}')
        # drop rows with a missing target without modifying result_dict in place
        df = result_dict[bitrate][counts_key].dropna(subset=[target_variable])

        # check the target variable balance
        true_vals = len(df[df[target_variable] == True])
        pct_true = int(100*true_vals / len(df))
        print(f'{true_vals}/{len(df)} ({pct_true}%) of the dataset are underspecified')
                        
        # reduce the maximum distance separating pairs such that the 
        # graph can be more evenly separated.  Too permissive a distance
        # filter increases the graph connectivity, making it difficult to
        # create cross validation folds with no data leakage.
        df = df[df['centroid_distance'] < max_centroid_distance].reset_index(drop=True)
        print(f'  {len(df)} pairs remaining after filtering by max distance of {max_centroid_distance} km')        
        # add the attributes into the input dataset
        df = add_attributes(attr_df, df.copy(), all_attributes)
        fold_dict = create_fold_dict(df)

        # add attribute groups successively
        predictor_attributes = []
        # make sure the order of attributes matches the attribute_set_names list
        for attribute_set, set_name in zip(attribute_group_sets, attribute_set_names):
            print(f'  Processing {set_name} attribute set: {target_variable}')            
            # initialize the predictor variables (features)
            predictor_attributes += attribute_set
            if set_name == 'proximity':
                features = ['centroid_distance']
            else:
                non_dist_features = [c for c in predictor_attributes if c != 'centroid_distance']
                pair_features = format_features(non_dist_features)
                diff_features = [f'{c}_diff' for c in non_dist_features]
                features = ['centroid_distance'] + pair_features + diff_features                

            best_params, best_mean_test_perf, best_learning_curve, results_all_trials, best_trial_predictions = run_binary_trials_custom_CV(
                set_name, features, target_variable, df, fold_dict, 
                n_cv_fold_optimization_trials, n_boost_rounds, results_folder, 
                loss=loss_function, random_seed=random_seed)
                
            # store the test set predictions and actuals
            all_results[bitrate][set_name] = {
                'best_params': best_params,
                'all_results': results_all_trials, 
                'learning_curve': best_learning_curve,
                'test_predictions': best_trial_predictions,
                # 'test_rmse': best_mean_rmse,
                'test_auc': best_mean_test_perf,
                # 'target_cdfs': target_cdfs,
            } 
    return all_results

In [None]:
# target_col = f'dkl_{concurrent}_post_{prior}R'
target_column = 'underspecified_model_flag'
loss_function = 'binary:logistic'
partial_counts = False
binary_test_results_fname = f'binary_classification_coverage_test.npy'
if partial_counts is True:
    binary_test_results_fname = binary_test_results_fname.replace('.npy', '_partial_counts.npy')
binary_test_results_fpath = os.path.join(binary_results_folder, binary_test_results_fname)
if os.path.exists(binary_test_results_fpath):
    print('Processed and loading: ', binary_test_results_fname)
    all_test_results = np.load(binary_test_results_fpath, allow_pickle=True).item()
else:
    all_test_results = predict_underspecification_from_attributes(
        attr_df, target_column, max_centroid_distance, binary_results_folder, prior,
        loss_function=loss_function, partial_counts=partial_counts, n_boost_rounds=n_boost_rounds, random_seed=random_seed,
        optimize_cv_folds=optimize_cv_folds, n_cv_fold_optimization_trials=n_cv_fold_optimization_trials, cv_fold_seed=cv_fold_seed,
    )
    np.save(binary_test_results_fpath, all_test_results)

### Notes for Discussion

1. The binary classification balance ranges from 46% to 58%, meaning a very large proportion of the models in the dataset are underspecified, that is, the support of $Q(x_i)$ does not cover $P(x_i)$.  Underspecified models are sensitive to the prior assumed in computing the KL divergence.  
2. The binary classification tells us whether $\forall x_i \in X, P(x_i) > 0 \rightarrow Q(x_i) > 0$.  This is different from knowing whether the two supports are equal, **as $Q$ could still cover a wider range**.  In other words, the support of $P(x_i)$ could be a subset of the support of $Q(x_i)$, but not the other way around.  However, since the pairs are directional and both directions are contained in the dataset, we can create a third target variable for prediction that describes the combined case where the model $Q(x_i)$ covers the support of $P(x_i)$ **and** $P(x_i)$ covers the support of $Q(x_i)$.  This boolean combination describes matching coverage.
3. The support coverage can be predicted as a binary variable from catchment attributes.  Predicting whether a model (proxy) $Q(x_i)$ will cover the support of the target $P(x_i)$ **based on proximity alone** does almost no better than random guessing.  However, the AUC score improves substantially with the addition of climate attributes, likely because of precipitation.  Adding terrain attributes again results in a marked improvement, with the AUC reaching nearly 0.9.  This improvement is likely due to the addition of the drainage area.  Adding land cover and soil attributes to the covariate vector does not improve the AUC.  The annual precipitation and drainage area in combination explain much of what is expected in terms of the range of unit area runoff, and we can test this by training the model with those two features exclusively.  More nuanced differences can plausibly be explained by any number of physical processes, but no such tests are attempted here.
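
The matching-coverage target described in note 2 could be derived by joining each directed pair to its reverse.  The following is a sketch on a hypothetical three-station table; the `reverse_flag` and `matching_coverage` column names are illustrative assumptions and do not exist in the dataset.

```python
import pandas as pd

# hypothetical directed pairs: flag = 1 means the proxy does not cover the target
pairs = pd.DataFrame({
    'proxy':  ['A', 'B', 'A', 'C'],
    'target': ['B', 'A', 'C', 'A'],
    'underspecified_model_flag': [0, 0, 1, 0],
})

# swap the proxy and target columns so each row can look up its reversed pair
reverse = pairs.rename(columns={
    'proxy': 'target',
    'target': 'proxy',
    'underspecified_model_flag': 'reverse_flag',
})
merged = pairs.merge(reverse, on=['proxy', 'target'], how='left')

# matching coverage: neither direction of the pair is underspecified
merged['matching_coverage'] = (
    (merged['underspecified_model_flag'] == 0) & (merged['reverse_flag'] == 0)
)
print(merged)
```

Only the A/B pair has matching coverage here: C covers A, but A does not cover C, so both directions of that pair are labelled `False`.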

In [None]:
def plot_confusion_matrix(all_test_predictions, threshold=0.5):
    predicted_labels = (all_test_predictions['predicted'] >= threshold).astype(int)
    cm = confusion_matrix(all_test_predictions['actual'], predicted_labels)
    cm_percent = cm / cm.sum() * 100  # Convert to percentages
    # Prepare data for Bokeh plot (flatten the 2x2 matrix)
    labels = ["True Neg", "False Pos", "False Neg", "True Pos"]
    counts = [float(e) for e in cm_percent.flatten()]
    text_counts = [f'{c:.1f}%' for c in counts]
    x = [0, 1, 0, 1]  # Columns of the confusion matrix
    y = [1, 1, 0, 0]  # Rows of the confusion matrix

    # Create a data source for Bokeh
    source = ColumnDataSource(data=dict(x=x, y=y, counts=counts, labels=labels, text_counts=text_counts))

    # Set up color mapping
    mapper = linear_cmap(field_name='counts', palette='Blues9', low=min(counts), high=max(counts))

    # Create the figure
    p = figure(
        title=None,
        x_axis_location="above",
        tools="",
        width=400,
        height=400,
        x_range=(-0.5, 1.5),
        y_range=(-0.5, 1.5),
    )
    p.grid.visible = False

    # Add squares for each cell in the confusion matrix
    p.rect(x='x', y='y', width=1, height=1, source=source, fill_color=mapper, line_color="black")

    # Add text labels to the squares
    p.text(
        x='x', y='y', text='text_counts', source=source,
        text_font_size="16pt", text_align="center", text_baseline="middle",
        background_fill_color="white", background_fill_alpha=0.6  # White background with alpha
    )
    # Customize the axes
    # Remove only the ticks from both axes
    p.xaxis.major_tick_line_color = None  # Hide x-axis major ticks
    p.yaxis.major_tick_line_color = None  # Hide y-axis major ticks
    p.xaxis.minor_tick_line_color = None  # Hide x-axis minor ticks
    p.yaxis.minor_tick_line_color = None  # Hide y-axis minor ticks
    p.xaxis.major_label_overrides = {0: "Pred False", 1: "Pred True", -0.5: "", 0.5: "", 1.5: ""}
    p.yaxis.major_label_overrides = {0: "Actual True", 1: "Actual False", -0.5: "", 0.5: "", 1.5: ""}


    # p.xaxis.axis_label = "Predicted Label"
    # p.yaxis.axis_label = "Actual Label"
    return p

In [None]:
def find_optimal_threshold(all_test_predictions):
    # Compute ROC curve
    fpr, tpr, thresholds = roc_curve(all_test_predictions['actual'], all_test_predictions['predicted'])
    
    # Compute Youden's Index for each threshold
    youden_index = tpr - fpr
    optimal_idx = np.argmax(youden_index)  # Index of the best threshold
    optimal_threshold = thresholds[optimal_idx]
    
    return optimal_threshold
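
To show what the threshold selection does, the sketch below evaluates Youden's index $J = TPR - FPR$ directly at every candidate threshold for a small set of made-up scores.  It is a plain NumPy illustration with a hypothetical function name (`youden_threshold`), not a replacement for the `roc_curve`-based function above.

```python
import numpy as np

def youden_threshold(actual, predicted):
    """Return (threshold, J) maximizing J = TPR - FPR over the observed scores."""
    actual = np.asarray(actual).astype(bool)
    predicted = np.asarray(predicted, dtype=float)
    best_j, best_t = -np.inf, None
    for t in np.unique(predicted):
        labels = predicted >= t  # classify as positive at or above the threshold
        tpr = np.sum(labels & actual) / np.sum(actual)
        fpr = np.sum(labels & ~actual) / np.sum(~actual)
        if tpr - fpr > best_j:
            best_j, best_t = tpr - fpr, t
    return float(best_t), float(best_j)

# made-up labels and scores, for illustration only
actual = [0, 0, 0, 1, 0, 1, 1, 1]
predicted = [0.1, 0.2, 0.3, 0.35, 0.6, 0.7, 0.8, 0.9]
threshold, j = youden_threshold(actual, predicted)
print(threshold, j)
```

With these scores the selected threshold is 0.35 rather than the default 0.5: it captures all four positives at the cost of one false positive, giving $J = 0.75$.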

### Binary Classification Results

In [None]:
layout_dict = {}
reg_plots_dict = {}
res_r2_dict = {}

plots = []
reg_plots_dict[prior] = {}
res_r2_dict[prior] = {}

binary_results = np.load(binary_test_results_fpath, allow_pickle=True).item()

for b, binary_result in binary_results.items():
    test_set = 'test_auc'

    y2 = [binary_result[e][test_set] for e in attribute_set_names]
    source = ColumnDataSource({'x': attribute_set_names, 'y2': y2})
        
    title = f'{b} bits'
    if len(plots) == 0:
        fig = figure(title=title, x_range=attribute_set_names, toolbar_location='above')
    else:
        fig = figure(title=title, x_range=attribute_set_names, y_range=plots[0].y_range, 
                     output_backend='webgl', toolbar_location='above')

    fig.line('x', 'y2', legend_label='AUC', color='dodgerblue', source=source, line_width=3)
    fig.legend.background_fill_alpha = 0.6
    fig.legend.location = 'bottom_right'
    fig.yaxis.axis_label = 'AUC'
    fig.xaxis.axis_label = 'Attribute Group (additive)'
    plots.append(fig)
    
    result_df = pd.DataFrame({'set': attribute_set_names, 'auc': y2})
    best_auc_idx = result_df['auc'].idxmax()
    best_auc_set = result_df.loc[best_auc_idx, 'set']
    foo = binary_result[best_auc_set]
    
    # plot the test set convergence for the 'best' trial
    cfig = figure(title=f'AUC ({best_auc_set} set)',)
    learning_curve_sets = binary_result[best_auc_set]['learning_curve']
    train_curve_dfs, test_curve_dfs = [], []
    n = 0
    for lc_dict in learning_curve_sets:
        train_df = pd.DataFrame(lc_dict['train']['auc'], columns=[n])
        test_df = pd.DataFrame(lc_dict['eval']['auc'], columns=[n])
        test_curve_dfs.append(test_df)
        train_curve_dfs.append(train_df)
        n += 1

    train_curve_df = pd.concat(train_curve_dfs, axis=1)
    train_curve_df['mean'] = train_curve_df.mean(axis=1)
    test_curve_df = pd.concat(test_curve_dfs, axis=1)
    test_curve_df['mean'] = test_curve_df.mean(axis=1)
    
    for fn in range(len(learning_curve_sets)):
        cfig.line(train_curve_df.index, train_curve_df[fn], line_alpha=0.6, line_color='grey', line_dash='dotted')
        cfig.line(test_curve_df.index, test_curve_df[fn], line_alpha=0.6, line_color='red', line_dash='dotted')
    cfig.line(train_curve_df.index, train_curve_df['mean'], line_alpha=0.5, line_color='grey', 
              line_width=2, legend_label='CV Mean (Train)')
    cfig.line(test_curve_df.index, test_curve_df['mean'], line_alpha=0.5, line_color='red', 
              line_width=2, legend_label='CV Mean (Test)')

    # find the min predictive risk (optimal complexity)
    min_pred_risk_idx = test_curve_df['mean'].idxmax()
    if min_pred_risk_idx == max(test_curve_df['mean'].index):
        print(f'Min prediction risk occurs at the maximum iteration, try increasing the number of boosting rounds')
    
    min_pred_risk = test_curve_df.loc[min_pred_risk_idx, 'mean']
    cfig.line([min_pred_risk_idx, min_pred_risk_idx], [test_curve_df['mean'].min(), min_pred_risk], 
              legend_label='Min risk', color='green', line_width=2, line_dash='dashed')

    cfig.xaxis.axis_label = r'$$\text{Iteration}$$'
    cfig.yaxis.axis_label = r'$$\text{AUC} $$'
    cfig.legend.background_fill_alpha = 0.5
    cfig.legend.location = 'bottom_right'
    plots.append(cfig)

    all_test_predictions = pd.concat(binary_result[best_auc_set]['test_predictions'], axis=0)
    # find a threshold to balance (minimize) false positives and false negatives
    optimal_threshold = find_optimal_threshold(all_test_predictions)
    # plot the confusion matrix
    cm_plot = plot_confusion_matrix(all_test_predictions, threshold=optimal_threshold)
    plots.append(cm_plot)

    cdf_plot = figure(title=f'Predicted probability CDF ({best_auc_set} set) Threshold={optimal_threshold:.2f}',)
    # define the percentiles once, outside the loop, since they are reused below
    percentiles = np.linspace(1, 100, 1000)
    for preds in binary_result[best_auc_set]['test_predictions']:
        cdf = np.percentile(preds['predicted'], percentiles)
        cdf_plot.line(percentiles, cdf, color='grey', line_width=1.5, line_alpha=0.5, legend_label='test folds')
    all_mean = np.percentile(all_test_predictions['predicted'], percentiles)
    pct_true = all_test_predictions['actual'].sum() / len(all_test_predictions)
    cdf_plot.line(percentiles, all_mean, color='red', alpha=0.5, line_width=2,
                  legend_label=f'{100*pct_true:.0f}% True')
    cdf_plot.legend.location = 'bottom_right'
    plots.append(cdf_plot)
    


In [None]:
binary_layout = gridplot(plots, ncols=4, width=300, height=275)
show(binary_layout)

### Binary Classification Problem Part 2: Feature Importance Hypothesis Test

In Part 1 of the binary classification test, findings of interest included: i) proximity alone barely beats random guessing for predicting when a model catchment will provide support coverage of a target, ii) climate attributes improve the predictive performance substantially, iii) adding terrain attributes achieves an AUC of roughly 0.9, iv) the misclassified samples in the out-of-sample test set were split nearly evenly between false positives and false negatives, and v) overall the model correctly labels out-of-sample catchments roughly 8 out of 10 times.

Intuitively, it seems that knowing the mean precipitation and the drainage area provides most of the picture of the general range of the distribution, so we re-test the model using just these two attributes. 

In [None]:
def predict_underspecification_from_key_attributes(attr_df, target_variable, max_centroid_distance, results_folder, prior,
                                loss_function=None, partial_counts=False, n_boost_rounds=100, random_seed=42, 
                              optimize_cv_folds=True, n_cv_fold_optimization_trials=20, cv_fold_seed=42):
    # partial_counts is a boolean, so do not compare it against the string "False"
    counts_key = 'partial' if partial_counts else 'whole'
    
    all_results = {}
    for bitrate in [4, 6, 8, 10]:
        all_results[bitrate] = {}
        print(f'bitrate = {bitrate}')
        # drop rows with a missing target without modifying result_dict in place
        df = result_dict[bitrate][counts_key].dropna(subset=[target_variable])

        # check the target variable balance
        true_vals = len(df[df[target_variable] == True])
        pct_true = int(100*true_vals / len(df))
        print(f'{true_vals}/{len(df)} ({pct_true}%) of the dataset are underspecified')
                        
        # reduce the maximum distance separating pairs such that the 
        # graph can be more evenly separated.  Too permissive a distance
        # filter increases the graph connectivity, making it difficult to
        # create cross validation folds with no data leakage.
        df = df[df['centroid_distance'] < max_centroid_distance].reset_index(drop=True)
        print(f'  {len(df)} pairs remaining after filtering by max distance of {max_centroid_distance} km')        
        # add the attributes into the input dataset
        df = add_attributes(attr_df, df.copy(), all_attributes)
        fold_dict = create_fold_dict(df)

        # add attribute groups successively
        predictor_attributes = []
        # make sure the order of attributes matches the attribute_set_names list
        group_sets = [['prcp',], ['drainage_area_km2',], ['prcp', 'drainage_area_km2',]]
        small_set_names = ['precipitation', 'area', 'precip+area']
        for attribute_set, set_name in zip(group_sets, small_set_names):
            print(f'  Processing {set_name} attribute set: {target_variable}')
            # initialize the predictor variables (features)
            predictor_attributes = attribute_set
            non_dist_features = [c for c in predictor_attributes if c != 'centroid_distance']
            pair_features = format_features(non_dist_features)
            # inter-catchment differences of attributes 
            diff_features = [f'{c}_diff' for c in non_dist_features]
            features = pair_features + diff_features

            best_params, best_mean_test_perf, best_learning_curve, results_all_trials, best_trial_predictions = run_binary_trials_custom_CV(
                set_name, features, target_variable, df, fold_dict, 
                n_cv_fold_optimization_trials, n_boost_rounds, results_folder, 
                loss=loss_function, random_seed=random_seed)
                
            # store the test set predictions and actuals
            all_results[bitrate][set_name] = {
                'best_params': best_params,
                'all_results': results_all_trials, 
                'learning_curve': best_learning_curve,
                'test_predictions': best_trial_predictions,
                # 'test_rmse': best_mean_rmse,
                'test_auc': best_mean_test_perf,
                # 'target_cdfs': target_cdfs,
            } 
    return all_results

In [None]:
# target_col = f'dkl_{concurrent}_post_{prior}R'
target_column = 'underspecified_model_flag'
loss_function = 'binary:logistic'
partial_counts = True
binary_test_results_fname = f'binary_classification_coverage_test_precip_DA.npy'
if partial_counts is True:
    binary_test_results_fname = binary_test_results_fname.replace('.npy', '_partial_counts.npy')
binary_test_results_fpath = os.path.join(binary_results_folder, binary_test_results_fname)
if os.path.exists(binary_test_results_fpath):
    print('Processed and loading: ', binary_test_results_fname)
    all_test_results = np.load(binary_test_results_fpath, allow_pickle=True).item()
else:
    all_test_results = predict_underspecification_from_key_attributes(
        attr_df, target_column, max_centroid_distance, binary_results_folder, prior,
        loss_function=loss_function, partial_counts=partial_counts, n_boost_rounds=n_boost_rounds, random_seed=random_seed,
        optimize_cv_folds=optimize_cv_folds, n_cv_fold_optimization_trials=n_cv_fold_optimization_trials, cv_fold_seed=cv_fold_seed,
    )
    np.save(binary_test_results_fpath, all_test_results)

In [None]:
layout_dict = {}
reg_plots_dict = {}
res_r2_dict = {}

plots = []
reg_plots_dict[prior] = {}
res_r2_dict[prior] = {}

binary_results = np.load(binary_test_results_fpath, allow_pickle=True).item()
small_set_names = ['precipitation', 'area', 'precip+area']
for b, binary_result in binary_results.items():
    test_set = 'test_auc'

    y2 = [binary_result[e][test_set] for e in small_set_names]
    source = ColumnDataSource({'x': small_set_names, 'y2': y2})
        
    title = f'{b} bits'
    if len(plots) == 0:
        fig = figure(title=title, x_range=small_set_names, toolbar_location='above')
    else:
        fig = figure(title=title, x_range=small_set_names, y_range=plots[0].y_range, 
                     output_backend='webgl', toolbar_location='above')

    fig.line('x', 'y2', legend_label='AUC', color='dodgerblue', source=source, line_width=3)
    fig.legend.background_fill_alpha = 0.6
    fig.legend.location = 'bottom_right'
    fig.yaxis.axis_label = 'AUC'
    fig.xaxis.axis_label = 'Attribute Group (additive)'
    plots.append(fig)
    
    result_df = pd.DataFrame({'set': small_set_names, 'auc': y2})
    best_auc_idx = result_df['auc'].idxmax()
    best_auc_set = result_df.loc[best_auc_idx, 'set']
    foo = binary_result[best_auc_set]
    
    # plot the test set convergence for the 'best' trial
    cfig = figure(title=f'AUC ({best_auc_set} set)',)
    learning_curve_sets = binary_result[best_auc_set]['learning_curve']
    train_curve_dfs, test_curve_dfs = [], []
    n = 0
    for lc_dict in learning_curve_sets:
        train_df = pd.DataFrame(lc_dict['train']['auc'], columns=[n])
        test_df = pd.DataFrame(lc_dict['eval']['auc'], columns=[n])
        test_curve_dfs.append(test_df)
        train_curve_dfs.append(train_df)
        n += 1

    train_curve_df = pd.concat(train_curve_dfs, axis=1)
    train_curve_df['mean'] = train_curve_df.mean(axis=1)
    test_curve_df = pd.concat(test_curve_dfs, axis=1)
    test_curve_df['mean'] = test_curve_df.mean(axis=1)
    
    for fn in range(len(learning_curve_sets)):
        cfig.line(train_curve_df.index, train_curve_df[fn], line_alpha=0.6, line_color='grey', line_dash='dotted')
        cfig.line(test_curve_df.index, test_curve_df[fn], line_alpha=0.6, line_color='red', line_dash='dotted')
    cfig.line(train_curve_df.index, train_curve_df['mean'], line_alpha=0.5, line_color='grey', 
              line_width=2, legend_label='CV Mean (Train)')
    cfig.line(test_curve_df.index, test_curve_df['mean'], line_alpha=0.5, line_color='red', 
              line_width=2, legend_label='CV Mean (Test)')

    # find the iteration with the highest mean test AUC (minimum predictive risk, optimal complexity)
    min_pred_risk_idx = test_curve_df['mean'].idxmax()
    if min_pred_risk_idx == max(test_curve_df['mean'].index):
        print(f'Min prediction risk occurs at the maximum iteration, try increasing the number of boosting rounds')
    
    min_pred_risk = test_curve_df.loc[min_pred_risk_idx, 'mean']
    cfig.line([min_pred_risk_idx, min_pred_risk_idx], [test_curve_df['mean'].min(), min_pred_risk], 
              legend_label='Min risk', color='green', line_width=2, line_dash='dashed')

    cfig.xaxis.axis_label = r'$$\text{Iteration}$$'
    cfig.yaxis.axis_label = r'$$\text{AUC} $$'
    cfig.legend.background_fill_alpha = 0.5
    cfig.legend.location = 'bottom_right'
    plots.append(cfig)

    all_test_predictions = pd.concat(binary_result[best_auc_set]['test_predictions'], axis=0)
    # find a threshold to balance (minimize) false positives and false negatives
    optimal_threshold = find_optimal_threshold(all_test_predictions)
    # plot the confusion matrix
    cm_plot = plot_confusion_matrix(all_test_predictions, threshold=optimal_threshold)
    plots.append(cm_plot)

    cdf_plot = figure(title=f'AUC CDF ({best_auc_set} set) Threshold={optimal_threshold:.2f}',)
    # define the percentiles once, outside the loop, since they are reused below
    percentiles = np.linspace(1, 100, 1000)
    for preds in binary_result[best_auc_set]['test_predictions']:
        cdf = np.percentile(preds['predicted'], percentiles)
        cdf_plot.line(percentiles, cdf, color='grey', line_width=1.5, line_alpha=0.5, legend_label='test folds')
    all_mean = np.percentile(all_test_predictions['predicted'], percentiles)
    pct_true = all_test_predictions['actual'].sum() / len(all_test_predictions)
    cdf_plot.line(percentiles, all_mean, color='red', alpha=0.5, line_width=2,
                  legend_label=f'{100*pct_true:.0f}% True')
    cdf_plot.legend.location = 'bottom_right'
    plots.append(cdf_plot)
    

In [None]:
binary_layout = gridplot(plots, ncols=4, width=300, height=275)
show(binary_layout)

### Regression Problem

The regression problem uses the same setup: the goal is to optimize the discriminant function and the input signal quantization simultaneously to minimize the error in predicting the KL divergence from catchment attributes.

The key difference in this step is that the target variable is not a scalar measure characterizing a single location but a measure of the difference in runoff between **pairs of locations**.  This approach asks whether the **Kullback-Leibler divergence** (KLD) between the distributions of unit area runoff at two locations can be predicted from the attributes of both catchments (and their differences) using the gradient boosted decision tree method, which can also predict continuous variables, in this case $D_{KL}$.
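
For reference, the discrete KL divergence in bits is $D_{KL}(P||Q) = \sum_i p_i \log_2 (p_i / q_i)$. The sketch below evaluates it for two sets of quantized (binned) counts. How the notebook computes its `dkl_*` target columns is not shown in this section, so this is only an illustration under stated assumptions: the function name `kl_divergence_bits` is hypothetical, and the additive pseudo-count `alpha` applied to $Q$ is a guess at the role of the Dirichlet prior mentioned in the plot titles further down (there parameterized as $10^{\text{prior}}$).

```python
import numpy as np


def kl_divergence_bits(p_counts, q_counts, alpha=1.0):
    """D_KL(P||Q) in bits from binned counts (illustrative sketch).

    alpha is a symmetric pseudo-count added to every bin of Q so that
    Q has no empty bins; treating it as the Dirichlet prior is an assumption.
    """
    p_counts = np.asarray(p_counts, dtype=float)
    q_counts = np.asarray(q_counts, dtype=float)
    p = p_counts / p_counts.sum()
    # smoothed estimate of Q: posterior mean under a symmetric Dirichlet(alpha)
    q = (q_counts + alpha) / (q_counts.sum() + alpha * len(q_counts))
    # bins where p == 0 contribute nothing to the sum
    mask = p > 0
    return float(np.sum(p[mask] * np.log2(p[mask] / q[mask])))


# identical distributions diverge by zero bits
same = kl_divergence_bits([5, 5], [5, 5])
# an empty bin in Q stays finite because of the pseudo-count
smoothed = kl_divergence_bits([5, 5], [10, 0], alpha=0.01)
print(same, smoothed)
```

A smaller `alpha` penalizes bins that the proxy distribution never visits more heavily, which is one reason the choice of prior is worth testing over several orders of magnitude.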

### Set trial parameters

In [None]:
# define the amount of data to set aside for final testing
# n_cv_folds = 5
n_boost_rounds = 2500
priors_to_test = [-2, -1, 0, 1, 2]
random_seed = 42
loss_function = 'reg:absoluteerror'  # L1 objective function for regression

#define if testing concurrent or nonconcurrent data
concurrent = 'concurrent'

# partial counts refer to the test where observations were assigned
# a uniform distribution to approximate error and allow fractional 
# observations in state space
partial_counts = False

# cross validation parameters
optimize_cv_folds = False
cv_fold_seed = 83561
n_cv_fold_optimization_trials = 20
# limit the maximum distance to make the network 
# graph of station pairs more separable
max_centroid_distance = 1000

attribute_set_names = ['proximity', '+climate', '+terrain', '+land_cover', '+soil']
attribute_group_sets = [['centroid_distance'], climate, terrain, land_cover, soil]

In [None]:
posterior_columns = [c for c in df_partial.columns if c.startswith(f'dkl_{concurrent}_post')]
print(posterior_columns)

In [None]:
results_folder = os.path.join(BASE_DIR, 'data', 'kld_prediction_results')
if not os.path.exists(results_folder):
    os.makedirs(results_folder)

### Train-Test Split

The input dataset consists of pairwise comparisons of just over 1300 streamflow-monitored catchments, their attributes, and the attribute differences.  After filtering for data concurrency (minimum 1 year, < 5 days missing per month) and maximum distance between basin centroids (500 km), roughly 225K pairs remain.  Because of the pairwise setup, each station's data appears in more than one row.  As a result, the attributes of a station can end up in both the training and test sets if rows are assigned to them at random.  We can't simply cut edges until the graph is separated, because that is a [generalization of the "keeping a subset of vertices" problem in graph theory, which is NP-hard](https://en.wikipedia.org/wiki/Independent_set_(graph_theory)).

To address this issue, we split the dataset spatially to create a partition for each fold, drawing samples from within each "cluster" and filtering out all edges between the fold and the rest of the set.  The end goal is to generate folds in which no official_id appears in both the training and test sets, in either the proxy or the target column.  One problem remains: the target variable distributions of the training and test sets should match to some degree.  This is less critical, however, than ensuring there is no data leakage between training and test data.
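
The leak-free constraint can be sketched as follows. This is a minimal illustration and not the notebook's `create_fold_dict`, which is defined elsewhere: the function name `station_disjoint_folds` is hypothetical, and the station-to-fold mapping is supplied by hand here, whereas the notebook derives its clusters spatially. The returned structure (`'train'` and `'test'` arrays of row positions per fold) is modelled on how `cv_data` is consumed by the training function below.

```python
import numpy as np
import pandas as pd


def station_disjoint_folds(pairs, station_fold):
    """Build folds in which no station appears in both train and test.

    pairs: DataFrame with 'proxy' and 'target' station id columns.
    station_fold: dict mapping each station id to a fold number (assumed given).
    """
    proxy_fold = pairs['proxy'].map(station_fold)
    target_fold = pairs['target'].map(station_fold)
    folds = {}
    for k in sorted(set(station_fold.values())):
        # test pairs: both stations belong to fold k
        test_mask = (proxy_fold == k) & (target_fold == k)
        # train pairs: neither station belongs to fold k
        train_mask = (proxy_fold != k) & (target_fold != k)
        # pairs with exactly one station in fold k cross the boundary and are dropped
        folds[k] = {
            'train': np.flatnonzero(train_mask.to_numpy()),
            'test': np.flatnonzero(test_mask.to_numpy()),
        }
    return folds


# made-up example: stations A, B form fold 0 and C, D form fold 1
station_fold = {'A': 0, 'B': 0, 'C': 1, 'D': 1}
pairs = pd.DataFrame({
    'proxy': ['A', 'A', 'C', 'B', 'D'],
    'target': ['B', 'C', 'D', 'D', 'C'],
})
folds = station_disjoint_folds(pairs, station_fold)
print(folds)
```

The cost of this guarantee is that every pair straddling a fold boundary is discarded, which is why a more connected graph (a larger maximum centroid distance) makes the folds harder to balance.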

Next we create training/test splits and visualize how the target variable distributions compare.


In [None]:
def compute_empirical_cdf(data):
    sorted_data = np.sort(data)
    cdf = np.arange(1, len(sorted_data) + 1) / len(sorted_data)
    return sorted_data, cdf

In [None]:
def load_result_by_prior(prior, binary=False, partial_counts=False):
    rf = os.path.join('data', 'kld_prediction_results')
    fname = f'dkl_{concurrent}_post_{prior}R_{prior}_prior_results.npy'
    if partial_counts == True:
        fname = fname.replace('.npy', '_partial_counts.npy')
    fpath = os.path.join(rf, fname)
    return np.load(fpath, allow_pickle=True).item()

In [None]:
def train_xgb_model(
    input_data, fold_no, cv_data, attributes, target, params, num_boost_rounds, loss
):    
    test_idxs = cv_data['test']
    train_idxs = cv_data['train']   

    test_data = input_data.iloc[test_idxs, :].copy()
    train_data = input_data.iloc[train_idxs, :].copy()

    X_train = train_data[attributes].values
    Y_train = np.log10(train_data[target].values)
    X_test = test_data[attributes].values
    Y_test = np.log10(test_data[target].values)

    # model = xgb.XGBRegressor(**params)

    dtrain = xgb.DMatrix(X_train, label=Y_train)
    dtest = xgb.DMatrix(X_test, label=Y_test)

    eval_list = [(dtrain, "train"), (dtest, "eval")]
    evals_result = {}
    bst = xgb.train(
        params,
        dtrain,
        num_boost_rounds,
        evals=eval_list,
        evals_result=evals_result,
        verbose_eval=0,
    )
    eval_keys = list(evals_result['train'].keys())
    if len(eval_keys) > 1:
        print(f' setting eval key to {eval_keys[0]} from {eval_keys}')

    eval_key = eval_keys[0]
    # Convert the lists to NumPy arrays with dtype=object
    train_perf = np.array(evals_result['train'][eval_key], dtype=object)
    test_perf = np.array(evals_result['eval'][eval_key], dtype=object)
    fold_progress = pd.DataFrame({
        'train': evals_result['train'][eval_key],
        'test': evals_result['eval'][eval_key],
        'fold': [fold_no]*len(evals_result['train'][eval_key]),
    })
    predicted = bst.predict(dtest)

    return train_perf, test_perf, fold_progress, predicted, Y_test

In [None]:
def run_xgb_trials_custom_CV(
    set_name,
    attributes,
    target,
    input_data,
    fold_dict, 
    n_optimization_rounds,
    num_boost_rounds,
    results_folder,
    loss='reg:squarederror',
    random_seed=42
):
    """
    Run random-search hyperparameter trials using custom cross validation (CV).
    The held-out sets must be determined in a way that avoids "data leakage":
    the pairs making up the training and test sets of each fold must be
    drawn from disjoint sets of stations.
    """
    # select random hyperparameters for n_optimization_rounds
    sample_choices = np.arange(0.5, 0.9, 0.02)  # subsample and colsample percentages
    lr_choices = np.arange(0.001, 0.1, 0.0005)  # learning rates
    learning_rates = np.random.choice(lr_choices, n_optimization_rounds)
    subsamples = np.random.choice(sample_choices, n_optimization_rounds)
    colsamples = np.random.choice(sample_choices, n_optimization_rounds)
    eval_key = loss.split(':')[1]

    all_results = []
    best_result = (None, np.inf, None)
    best_params = None
    best_mean_test_perf = np.inf
    best_convergence_df = pd.DataFrame()
    best_trial_test_predictions = None
    output_target_cdfs = None
    for trial in range(n_optimization_rounds):
        lr, ss, cs = learning_rates[trial], subsamples[trial], colsamples[trial]
        params = {
            "objective": loss,
            "eta": lr,
            # "max_depth": 6,  # use default max_depth
            # "min_child_weight": 1, # use colsample and subsample instead of min_child_weight
            "subsample": ss,
            "colsample_bytree": cs,
            "seed": random_seed,
            "device": "cuda",  # note, change this to 'cpu' if your system doesn't have a CUDA GPU
            "sampling_method": "gradient_based",
            "tree_method": "hist",
        }

        results_fname = (
            f"{set_name}_{bitrate}_bits_{lr:.3f}_lr_{ss:.3f}_sub_{cs:.3f}_col.csv"
        )
        results_fpath = os.path.join(results_folder, results_fname)

        # k-fold cross validation
        fold_scores = []
        n_samples = len(input_data)
        fold_no = 0
        best_train_fold_perf, best_test_fold_perf, best_test_rounds = [], [], []
        all_fold_results = []
        fold_scores, fold_arrays = [], []
        target_cdfs = []
        for fold_no, cv_data in fold_dict.items():
            train_perf, test_perf, fold_progress, predicted, Y_test = train_xgb_model(
                input_data,
                fold_no,
                cv_data,
                attributes,
                target,
                params,
                num_boost_rounds,
                loss
            )            
            fold_arrays.append(fold_progress)
            
            test_ids = input_data.loc[cv_data['test'], ['proxy', 'target']].values
            test_ids = [f'{e[0]}_{e[1]}' for e in test_ids]

            test_results = pd.DataFrame(
                {"predicted": predicted, "actual": Y_test, "proxy_target": test_ids}
            )
            
            ordered_data, fold_cdf = compute_empirical_cdf(Y_test)
            target_cdfs += [(ordered_data, fold_cdf)]
            
            # Get the round with the best validation score (out-of-sample performance)
            best_perf_round_train, best_perf_round_test = np.argmin(train_perf), np.argmin(test_perf)
            
            # Store the metrics at the best round (minimum risk)
            train_perf_best = train_perf[best_perf_round_train]
            test_perf_best = test_perf[best_perf_round_test]

            best_train_fold_perf.append(train_perf_best)
            best_test_fold_perf.append(test_perf_best)
            best_test_rounds.append(best_perf_round_test)
            all_fold_results.append(test_results)
            fold_no += 1

        all_test_predictions_df = pd.concat(all_fold_results)
        convergence_df = pd.concat(fold_arrays)    
        
        mean_test_perf = np.mean(best_test_fold_perf)
        stdev_test_perf = np.std(best_test_fold_perf)
        # # track the trial error metrics
        results_dict = {
            'trial': trial,
            f'test_{eval_key}_mean': mean_test_perf,
            f'test_{eval_key}_stdev': stdev_test_perf,
        }
        
        results_cols = list(results_dict.keys())
        results_dict.update(params)
        all_results.append(results_dict)
        if (trial > 0) & (trial % 10 == 0):
            print(f"   completed {trial}/{n_optimization_rounds}")
            
        if round(mean_test_perf,2) < round(best_mean_test_perf, 2):
            best_params = params
            best_mean_test_perf = mean_test_perf
            best_convergence_df = convergence_df
            best_trial_test_predictions = all_test_predictions_df
            output_target_cdfs = target_cdfs
            # keep the output path named after the best trial's hyperparameters
            best_results_fpath = results_fpath
            print(f'    New best result: {eval_key}={mean_test_perf:.2f} (trial {trial})')
    
    # save the best trial results under the best trial's file name
    best_trial_test_predictions.to_csv(best_results_fpath)
    results_all_trials = pd.DataFrame(all_results)
    # get the mean and standard deviation of the error metrics over all trials
    all_trials_mean = results_all_trials[f"test_{eval_key}_mean"].mean()
    all_trials_stdev = results_all_trials[f"test_{eval_key}_mean"].std()
    print(
        f"    {all_trials_mean:.2f} ± {2*all_trials_stdev:.3f} mean (95% CI) {eval_key} (of {len(results_all_trials)} hyperparameter optimization rounds.)"
    )
    return best_params, best_mean_test_perf, best_convergence_df, best_trial_test_predictions, results_all_trials, output_target_cdfs



In [None]:
def predict_KLD_from_attributes(attr_df, target_variable, max_centroid_distance, results_folder, prior,
                                loss_function=None, partial_counts=False, n_boost_rounds=100, random_seed=42, 
                              optimize_cv_folds=True, n_cv_fold_optimization_trials=20, cv_fold_seed=42):
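    # select the partial (fractional) or whole observation counts;
    # partial_counts is a boolean, so it must not be compared to the string "False"
    counts_key = 'partial' if partial_counts else 'whole'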
    
    all_results = {}
    for bitrate in [4, 6, 8, 10]:
        all_results[bitrate] = {}
        t0 = time()
        print(f'bitrate = {bitrate} (prior=10^{prior})')
        input_data_fpath = os.path.join(input_folder, fname)
        nrows = None
        # df = pd.read_csv(input_data_fpath, nrows=nrows, low_memory=False)
        df = result_dict[bitrate][counts_key]
        df.dropna(subset=[target_variable], inplace=True)
                        
        # reduce the maximum distance separating pairs such that the 
        # graph can be more evenly separated.  Too permissible a distance
        # filter increases the graph connectivity, making it difficult to
        # create cross validation folds with no data leakage.
        df = df[df['centroid_distance'] < max_centroid_distance]
        print(f'  {len(df)} pairs remaining after filtering by max distance of {max_centroid_distance} km')        
        # add the attributes into the input dataset
        df.reset_index(inplace=True, drop=True)
        df = add_attributes(attr_df, df.copy(), all_attributes)
        fold_dict = create_fold_dict(df)

        # add attribute groups successively
        predictor_attributes = []
        # make sure the order of attributes matches the attribute_set_names list
        for attribute_set, set_name in zip(attribute_group_sets, attribute_set_names):
            print(f'  Processing {set_name} attribute set: {target_variable}')
            
            # initialize the predictor variables (features)
            predictor_attributes += attribute_set
            if set_name == 'proximity':
                features = ['centroid_distance']
            else:
                non_dist_features = [c for c in predictor_attributes if c != 'centroid_distance']
                pair_features = format_features(non_dist_features)
                diff_features = [f'{c}_diff' for c in non_dist_features]
                features = ['centroid_distance'] + pair_features + diff_features                

            best_params, best_mean_rmse, best_convg_df, best_trial_test_predictions, results_all_trials, target_cdfs = run_xgb_trials_custom_CV(
                    set_name, features, target_variable, df, fold_dict, 
                    n_cv_fold_optimization_trials, n_boost_rounds, results_folder, loss=loss_function, random_seed=random_seed,
            )
            # store the test set predictions and actuals
            all_results[bitrate][set_name] = {
                'best_params': best_params,
                'all_results': results_all_trials, 
                'convergence': best_convg_df,
                'oos_predictions': best_trial_test_predictions,
                # 'test_rmse': best_mean_rmse,
                'test_mae': best_mean_rmse,
                'target_cdfs': target_cdfs,
            } 
    return all_results

In [None]:
results_folder

## Run Regression Models

In [None]:
priors_to_test = [-2, -1, 0, 1, 2]
rev_date = '20241202'
for prior in priors_to_test:
    target_col = f'dkl_{concurrent}_post_{prior}R'
    # NOTE: rev_date is currently not included in the results file name
    if partial_counts is False:
        test_results_fname = f'{target_col}_{prior}_prior_results.npy'
    else:
        test_results_fname = f'{target_col}_{prior}_prior_results_partial_counts.npy'
    test_results_fpath = os.path.join(results_folder, test_results_fname)
    if os.path.exists(test_results_fpath):
        print('processed and loading: ', test_results_fname)
        all_test_results = np.load(test_results_fpath, allow_pickle=True).item()
    else:
        all_test_results = predict_KLD_from_attributes(
            attr_df, target_col, max_centroid_distance, results_folder, prior,
            loss_function=loss_function, partial_counts=partial_counts, n_boost_rounds=n_boost_rounds, random_seed=random_seed,
            optimize_cv_folds=optimize_cv_folds, n_cv_fold_optimization_trials=n_cv_fold_optimization_trials, cv_fold_seed=cv_fold_seed,
        )                               
        np.save(test_results_fpath, all_test_results)


In [None]:
def format_fig_fonts(fig, font_size=20, font='Bitstream Charter', legend=True):
    fig.xaxis.axis_label_text_font_size = f'{font_size}pt'
    fig.yaxis.axis_label_text_font_size = f'{font_size}pt'
    fig.xaxis.major_label_text_font_size = f'{font_size-2}pt'
    fig.yaxis.major_label_text_font_size = f'{font_size-2}pt'
    fig.yaxis.axis_label_text_font = font
    fig.xaxis.axis_label_text_font = font
    fig.xaxis.major_label_text_font = font
    fig.yaxis.major_label_text_font = font
    if legend == True:
        fig.legend.label_text_font_size = f'{font_size-2}pt'
        fig.legend.label_text_font = font
    return fig

In [None]:
def load_result_by_prior(prior, rev_date, binary=False, partial_counts=False):
    rf = os.path.join('data', 'kld_prediction_results')
    # NOTE: rev_date is currently not used in the file name
    fname = f'dkl_{concurrent}_post_{prior}R_{prior}_prior_results.npy'
    if partial_counts == True:
        fname = fname.replace('.npy', '_partial_counts.npy')
    fpath = os.path.join(rf, fname)
    return np.load(fpath, allow_pickle=True).item()

In [None]:
pt = -1
bt = 6
# select the results for the chosen prior (pt) and bitrate (bt)
result = load_result_by_prior(pt, rev_date, binary=False)[bt]
test_set = 'test_mae'
y2 = [result[e][test_set] for e in attribute_set_names]

result_df = pd.DataFrame({'set': attribute_set_names, 'mae': y2})
# best_rmse_idx = result_df['rmse'].idxmin()
best_mae_idx = result_df['mae'].idxmin()

# best_rmse_set = result_df.loc[best_rmse_idx, 'set']
best_mae_set = result_df.loc[best_mae_idx, 'set']
foo = result[best_mae_set]
best_result = result[best_mae_set]['oos_predictions']
xx, yy = best_result['actual'], best_result['predicted']

In [None]:

x_min, x_max = np.min(np.log10(xx)), np.max(np.log10(xx))
edges = np.linspace(x_min, x_max, 20)
edges = np.power(10, edges)
hist, edges = np.histogram(xx, bins=edges)
# Prepare data for Bokeh
source = ColumnDataSource(data={
    "top": hist,
    "left": edges[:-1],
    "right": edges[1:]
})
p = figure(title=f'{bt} bits', toolbar_location='above', x_axis_type='log', height=400, width=550)
# Add quad glyph for the histogram
p.xaxis.axis_label = r'$$\text{Observed } D_\text{KL}(P||Q)$$'
p.yaxis.axis_label = r'$$\text{Pr}(X)$$'
p.quad(top="top", bottom=0, left="left", right="right", source=source, 
       fill_color="blue", line_color="black", alpha=0.4)
p = dpf.format_fig_fonts(p, font_size=14)
show(p)

## Plot Results of $D_{KL}$ Regression Test

In [None]:
layout_dict = {}
reg_plots_dict = {}
res_r2_dict = {}
for prior in priors_to_test:
    print(prior)
    plots = []
    reg_plots_dict[prior] = {}
    res_r2_dict[prior] = {}
    result_by_prior = load_result_by_prior(prior, rev_date, binary=False)
    for b in result_by_prior.keys():
        result = result_by_prior[b]

        test_rmse, test_mae = [], []
        test_set = 'test_mae'

        y2 = [result[e][test_set] for e in attribute_set_names]
        source = ColumnDataSource({'x': attribute_set_names, 'y2': y2})
            
        title = f'{b} bits (Q(θ|D)∼Dirichlet(α=10^{prior}))'
        if len(plots) == 0:
            fig = figure(title=title, x_range=attribute_set_names, toolbar_location='above',
                        output_backend='webgl')
        else:
            fig = figure(title=title, x_range=attribute_set_names, y_range=plots[0].y_range, 
                         output_backend='webgl', toolbar_location='above',
                        )
        # fig.line('x', 'y1', legend_label='rmse', color='green', source=source, line_width=3)
        fig.line('x', 'y2', legend_label='mae', color='dodgerblue', source=source, line_width=3)
        fig.legend.background_fill_alpha = 0.6
        fig.yaxis.axis_label = 'Error'
        fig.xaxis.axis_label = 'Attribute Group (additive)'
        plots.append(fig)
        
        result_df = pd.DataFrame({'set': attribute_set_names, 'mae': y2})
        # best_rmse_idx = result_df['rmse'].idxmin()
        best_mae_idx = result_df['mae'].idxmin()
        # best_rmse_set = result_df.loc[best_rmse_idx, 'set']
        best_mae_set = result_df.loc[best_mae_idx, 'set']
        foo = result[best_mae_set]
        best_result = result[best_mae_set]['oos_predictions']
        xx, yy = best_result['actual'], best_result['predicted']
        xmin, ymin = np.nanmin(xx), np.nanmin(yy)
        print(xmin, ymin)
        # print(asdf)
        slope, intercept, r, p, se = linregress(xx, yy)
        
        # sfig = figure(title=f'Test: {b} bits best model {best_rmse_set} (N={len(best_result)})', toolbar_location='above')
        sfig = figure(title=f'{b} bits', toolbar_location='above')#, x_axis_type='log', y_axis_type='log')
        # sfig.scatter(xx, yy, size=1, alpha=0.6)
        # Create hexbin plot
        binsize=0.05
        hex_renderer, hex_data = sfig.hexbin(xx, yy, size=binsize, hover_color="pink", hover_alpha=0.8)
        
        # Add color mapping based on bin counts
        counts = hex_data['counts']  # Extract the counts from the source
        mapper = linear_cmap(field_name='counts', palette=Greys256[::-1], low=min(counts), high=max(counts))
        
        # Plot the hex tiles using the color mapping
        sfig.hex_tile(q="q", r="r", size=binsize, line_color=None, source=hex_data, fill_color=mapper)
        
        xpred = np.linspace(min(xx), max(xx), 100)
        ybf = [slope * e + intercept for e in xpred]
        sfig.line(xpred, ybf, color='red', line_width=3, line_dash='dashed', legend_label=f'R²={r**2:.2f}')   
        # plot a 1:1 line
        sfig.line([min(yy), max(yy)], [min(yy), max(yy)], color='black', line_dash='dotted', 
                  line_width=2, legend_label='1:1')
        sfig.xaxis.axis_label = r'Actual $$D_{KL}$$ [bits/sample]'
        sfig.yaxis.axis_label = r'Predicted $$D_{KL}$$ [bits/sample]'
        sfig.legend.background_fill_alpha = 0.6
        sfig.legend.location = 'top_left'
        reg_plots_dict[prior][b] = sfig
        res_r2_dict[prior][b] = r**2
        
        plots.append(sfig)   
    
        # plot the test set convergence for the 'best' trial
        cfig = figure(title=f'Loss Curve ({best_mae_set} set)',)
        convergence_df = result[best_mae_set]['convergence']
    
        # Pivot the data to get separate columns for each fold
        train_pivot = convergence_df.pivot(columns='fold', values='train')
        test_pivot = convergence_df.pivot(columns='fold', values='test')
    
        # Rename the columns to indicate folds
        train_pivot.columns = [f'fold_{col}' for col in train_pivot.columns]
        test_pivot.columns = [f'fold_{col}' for col in test_pivot.columns]
        train_pivot['mean'] = train_pivot.mean(axis=1)
        test_pivot['mean'] = test_pivot.mean(axis=1)
        fold_nos = sorted(list(set(convergence_df['fold'])))
        
        for fn in fold_nos:
            cfig.line(test_pivot.index, test_pivot[f'fold_{fn}'], line_alpha=0.6, line_color='red', line_dash='dotted')
            cfig.line(train_pivot.index, train_pivot[f'fold_{fn}'], line_alpha=0.6, line_color='grey', line_dash='dotted')
        cfig.line(train_pivot.index, train_pivot['mean'], line_alpha=0.5, line_color='grey', 
                  line_width=2, legend_label='CV Mean (Train)')
        cfig.line(test_pivot.index, test_pivot['mean'], line_alpha=0.5, line_color='red', 
                  line_width=2, legend_label='CV Mean (Test)')
    
        # find the minimum predictive risk (optimal complexity)
        min_pred_risk_idx = test_pivot['mean'].idxmin()
        if min_pred_risk_idx == max(test_pivot['mean'].index):
            print(f'Min prediction risk occurs at the maximum iteration, try increasing the number of boosting rounds')
        
        min_pred_risk = test_pivot.loc[min_pred_risk_idx, 'mean']
        cfig.line([min_pred_risk_idx, min_pred_risk_idx], [train_pivot['mean'].min(), min_pred_risk], 
                  legend_label='Min risk', color='green', line_width=2, line_dash='dashed')
    
        cfig.xaxis.axis_label = r'$$\text{Iteration}$$'
        cfig.yaxis.axis_label = r'$$\text{|x-y|} $$'
        cfig.legend.background_fill_alpha = 0.5
        cfig.legend.location = 'top_right'
        plots.append(cfig)
        # plot a 1:1 line
        sfig.line([min(ybf), max(ybf)], [min(ybf), max(ybf)], color='black', line_dash='dotted', 
                  line_width=2, legend_label='1:1')
    
        # plot the cdfs of the target variables in each fold to compare
        cdffig = figure(title=f'Target Variable CDFs by fold', x_axis_type='log')
        cdf_arrays = result[best_mae_set]['target_cdfs']
        for (cdfx, cdfy) in cdf_arrays:
            cdffig.line(cdfx, cdfy, color='black', line_alpha=0.6, line_width=2)
        plots.append(cdffig)
        cdffig.xaxis.axis_label = r'$$\text{Observed Values } \text{[bits/sample]}$$'
        cdffig.yaxis.axis_label = r'$$\text{Pr}(x\leq X)$$'
        
    layout_dict[prior] = gridplot(plots, ncols=4, width=300, height=275)

In [None]:
show(layout_dict[-2])

In [None]:
show(layout_dict[-1])

In [None]:
show(layout_dict[0])

In [None]:
show(layout_dict[1])

In [None]:
show(layout_dict[2])

In [None]:
sample_plots = []
prior = 0
for b in [4, 6, 8, 10, 12]:
    plot = reg_plots_dict[prior][b]
    sample_plots.append(plot)

In [None]:
sample_layout = gridplot(sample_plots, ncols=5, width=250, height=250)
show(sample_layout)

In [None]:
from bokeh.transform import linear_cmap
from bokeh.models import ColorBar, ColumnDataSource
from bokeh.layouts import gridplot
from bokeh.palettes import Viridis256, gray, magma, Category20

# Convert the nested dict into a DataFrame
df = pd.DataFrame(res_r2_dict).T  # Transpose so priors are rows (index) and bitrates are columns
df.index.name = 'Prior'
df.columns.name = 'Bitrate'

In [None]:
# Melt the DataFrame to a long format
df_melted = df.reset_index().melt(id_vars='Prior', var_name='Bitrate', value_name='Value')
# Ensure the Bitrate values are ordered correctly (increasing order)
df_melted['Bitrate'] = pd.Categorical(df_melted['Bitrate'], categories=sorted(df_melted['Bitrate'].unique(), reverse=False), ordered=True)
df_melted['Value'] = df_melted['Value'].round(2)
# Create a Bokeh ColumnDataSource
source = ColumnDataSource(df_melted)

# Create a figure for the heatmap
p = figure(title="KL divergence from attributes: R² of test set by Prior and Bitrate",width=600, height=500,
           tools="hover", tooltips=[('Value', '@Value{0.00}')], toolbar_location=None)

# Create a color mapper
mapper = linear_cmap(field_name='Value', palette=magma(256), low=df_melted.Value.min(), high=df_melted.Value.max())

# Add rectangles to the plot
p.rect(x="Prior", y="Bitrate", width=1, height=1, source=source,
       line_color=None, fill_color=mapper)

# Add color bar
color_bar = ColorBar(color_mapper=mapper['transform'], width=8, location=(0,0))
p.add_layout(color_bar, 'right')

# Format plot
p.axis.axis_line_color = None
p.axis.major_tick_line_color = None
p.xaxis.axis_label = r'$$Q(\theta|D)\sim\text{Dirichlet}(\alpha = 10^{a})$$'
p.yaxis.axis_label = r'$$\text{Quantization Bitrate (dictionary size)}$$'
p.axis.major_label_text_font_size = "10pt"
p.axis.major_label_standoff = 0
p.xaxis.major_label_orientation = 1.0

# Display the plot (uncomment output_file to also write it to an HTML file)
# output_file("heatmap.html")
show(p)

## Discussion

The KL divergence $D_{KL}(P||Q) = \sum_{i=1}^{2^b} p_i\log\left(\frac{p_i}{q_i}\right)$ diverges to $+\infty$ when any $q_i \rightarrow 0$ while the corresponding $p_i > 0$. To avoid this, the simulated $Q$ is treated as a posterior distribution by assuming a uniform (symmetric Dirichlet) prior $\alpha = [\alpha_1, \dots, \alpha_n]$ with all $\alpha_i$ equal. The prior $\alpha$ is an additive array of uniform pseudo-counts used to address the (commonly occurring) case where $q_i = 0$, that is, where the model does not predict an observed state $i$.  In this experiment we tested a wide range of priors on an exponential scale, $\alpha = 10^a, a \in \{-2, -1, 0, 1, 2\}$.  
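
The pseudo-count treatment described above can be sketched as follows. This is a minimal illustration rather than the code used in the experiment: it assumes that $Q$ is taken as the posterior mean of the simulated state counts under a symmetric Dirichlet prior, and that the logarithm is base 2 (consistent with the bits/sample units used in the figures). The function names and the example counts are hypothetical.

```python
import numpy as np

def smoothed_pmf(counts, alpha):
    """Posterior-mean PMF under a symmetric Dirichlet prior (assumed form)."""
    counts = np.asarray(counts, dtype=float)
    return (counts + alpha) / (counts.sum() + alpha * counts.size)

def kl_divergence_bits(p, q):
    """D_KL(P||Q) in bits; states with p_i = 0 contribute nothing to the sum."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log2(p[mask] / q[mask])))

# Hypothetical 2-bit example (4 states); the counts are illustrative only.
observed_counts = np.array([40, 30, 20, 10])
simulated_counts = np.array([50, 35, 15, 0])  # the model never predicts the last state

p = observed_counts / observed_counts.sum()
dkl_by_prior = {}
for a in [-2, -1, 0, 1, 2]:
    q = smoothed_pmf(simulated_counts, alpha=10.0 ** a)
    dkl_by_prior[a] = kl_divergence_bits(p, q)
    print(f"a = {a:+d}: D_KL = {dkl_by_prior[a]:.3f} bits")
```

Because every $q_i$ is strictly positive once $\alpha > 0$, the divergence stays finite even though the simulated counts contain an empty state.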

The scale of the pseudo-count can be interpreted as a strength of belief in the model. A small $a$ represents strong belief that the model produces a distribution $Q$ that is representative of the "true" (observed posterior) distribution $P$. For a fixed $p_i$, decreasing $a$ yields a stronger $D_{KL}$ penalty for a model that assigns zero probability to an observed state.  Loss functions penalize overconfidence in incorrect predictions, and assigning zero probability to a state that is actually observed should arguably be treated as confidence in an incorrect prediction and penalized as such.  A large $a$ represents weak belief that the model produces a distribution $Q$ that is representative of $P$, since $Q$ approaches the uniform distribution $\mathbb{U}$ as $a$ increases.  
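
The two limiting cases can be made explicit under the assumption (not stated in the text above) that $Q$ is the posterior mean of a symmetric Dirichlet prior, where $c_i$ denotes the simulated count in state $i$ and $N = \sum_i c_i$:

$$
q_i = \frac{c_i + \alpha}{N + 2^b\alpha}, \qquad \lim_{\alpha \to 0} q_i = \frac{c_i}{N}, \qquad \lim_{\alpha \to \infty} q_i = \frac{1}{2^b}
$$

For an unpredicted state ($c_i = 0$) that is observed with $p_i > 0$, the contribution to $D_{KL}$ is $p_i \log\left(\frac{p_i (N + 2^b\alpha)}{\alpha}\right)$, which grows without bound as $\alpha \to 0$ and tends to $p_i\log(2^b p_i)$ as $\alpha \to \infty$.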

Adding pseudo-counts dilutes the signal that the gradient boosting model can exploit to minimize prediction error.  Analogously, varying the bitrate, or the size of the dictionary used to quantize continuous streamflow into discrete states, adds quantization noise: the original streamflow signals are stored at three-decimal precision and are quantized into as few as 4 bits (16-symbol dictionary) and as many as 12 bits (4096-symbol dictionary).  The range of dictionary sizes is set to cover the expected range of rating curve uncertainty, which is generally considered multiplicative and expressed as a percentage of the observed value.
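
The effect of dictionary size can be illustrated with a short sketch. The bin layout here is an assumption: log-spaced edges are used because they give each state a roughly constant relative width, which matches the multiplicative view of rating curve uncertainty, but the quantization scheme used in the experiment may differ. The function name and the synthetic lognormal series are hypothetical.

```python
import numpy as np

def quantize_to_counts(flow, bitrate, lo, hi):
    """Map positive flows to 2**bitrate log-spaced states and count occurrences."""
    n_states = 2 ** bitrate
    edges = np.logspace(np.log10(lo), np.log10(hi), n_states + 1)
    # Using only the interior edges yields state indices in [0, n_states - 1].
    states = np.digitize(flow, edges[1:-1])
    return np.bincount(states, minlength=n_states)

# Synthetic flows rounded to three decimals (illustrative, not observed data).
rng = np.random.default_rng(0)
flow = np.round(rng.lognormal(mean=0.0, sigma=1.0, size=5000), 3)
flow = np.clip(flow, 0.001, None)  # keep values strictly positive for log-spaced bins

for b in [4, 6, 8, 10, 12]:
    counts = quantize_to_counts(flow, b, lo=flow.min(), hi=flow.max())
    n_empty = int((counts == 0).sum())
    print(f"{b:2d} bits: {counts.size:4d} states, {n_empty:4d} empty")
```

As the dictionary grows relative to the sample size, more states are left with zero counts, which is the situation the pseudo-count prior is meant to handle.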

As shown by the results, priors representing the addition of $10^1 \text{ to } 10^2$ pseudo-counts diminish the performance of the gradient boosted decision tree model, regardless of the dictionary size (the number of possible values provided by the quantization).  Heavily penalizing unpredicted states does not have as great an impact as anticipated, perhaps because the corresponding $p_i$ values are also small.



How do the prior and the bitrate affect the distribution of "actual" $D_{KL}$?

## Citations

```{bibliography}
:filter: docname in docnames
```