# Predict Kullback-Leibler (KL) Divergence

## Introduction

Streamflow prediction in ungauged basins has classically focused on minimizing one or more loss functions, typically square error, NSE, and KGE, between observed (daily streamflow) values and predictions generated by some type of model, whether it be an empirical regionalization, statistical machine learning, or process-based rainfall runoff model. 

Regionalization and machine learning models for PUB rely on existing streamflow monitoring network, and the performance of these models is tied to how well the network represents the ungauged space.  This connection leads to the question of how the arrangement of streamflow monitoring stations within the network impacts the overall performance of PUB models, particularly in terms of expected prediction error across all ungauged locations. Furthermore, are there environmental signals, orthogonal to streamflow, that contain enough information to differentiate between network arrangements such that the prediction error over ungauged areas can be minimized?  

A simple interpretation of the loss functions commonly used in the PUB literature might be "how close are mean daily streamflow predictions to observed values?"  A much simpler question to ask of observational data is: "will a given model outperform random guessing in the long run?".  This binary question represents a starting point to approach the optimal streamflow monitoring network problem.  The justification for asking such a basic question is that an expectation of the uncertainty reduction over the unmonitored space provided by a given monitoring arrangement supports a discriminant function to compare unique arrangements.  A simple question can be formulated to test on real data, in this case an ungauged space of over 1 million ungauged catchments and a set of over 1600 monitored catchments with which to train a model.

The binary prediction problem is followed by a regresson problem where the goal is to minimize the expectation of prediction error based on the **Kullback-Leibler divergence** $D_{KL}$, (a surrogate loss function from the class of *information discriminant* measures which is consistent with the exponential loss {cite}`nguyen2009surrogate`).


## Problem Formulation

### Binary Prediction Problem 

The first question "is a given model better than random guessing in the long run" is formulated into a binary classification problem as follows: 

1) The streamflow prediction model assumes that discharge at one location is equal to that a different lcoation on a unit area basis, commonly referred to as an equal unit area runoff (UAR) model.
2) Given the equal UAR model, an observed (proxy) catchment is a potential model for predicting the distribution of UAR for any unobserved (target) location,
3) A proxy is "informative" to a target if the proxy UAR is closer to the posterior target UAR than the maximum uncertainty (uniform distribution) prior.
4) A proxy is "disinformative" to a target if the maximum uncertainty (uniform distribution) prior is closer to the posterior target UAR than the proxy UAR.
5) The "closeness" of distributions is characterized by three measures from the general class of f-divergences, namely the total variation distance (TVD), the Kullback-Leibler divergence ($D_{KL}$), and the earth mover's distance (EMD), also known as the Wasserstein distance.    

   For the Kullback-Leibler divergence $D_{KL}$:        
7) The (posterior) target UAR distribution is denoted by $P$, and the proxy UAR distribution (model) is denoted by $Q$.    
8) A proxy model is informative for some target location if $D_{KL}(P||\mathbb{U}) - D_{KL}(P||Q) > 0$    


   The binary problem formulation is then:    
10) The discriminant function maps the difference in the two divergences to a binary outcome corresponding to the sign of the resulting quantity $Y = +1 \text{ if } D_{KL}(P||\mathbb{U}) - D_{KL}(P||Q) > 0 \text{ or } -1 \text{ otherwise}$
11) The goal is to miminize the probability of incorrect predictions, defined by the (Bayes) error $R_{\textbf{Bayes}}(\gamma, C) := \mathbb{P} \left(Y \neq \text{sign}(D_{KL}(\mathbf{P}||\mathbb{U}) - D_{KL}(\mathbf{P}||\mathbf{Q}) \right)$


### Regression Problem

The same problem setup applies for the regression prediction problem which is to optimize the discriminant function and the input signal quantization simultaneously to minimize the error in predicting the KL divergence from catchment attributes.  

Instead of predicting a scalar measure which is a feature of a single location, the key difference in this step is the target variable describes a measure of the difference in runoff between **pairs of locations**.  This approach asks whether the **Kullback-Leibler Divergence** (KLD) of the distribution of unit area runoff between two locations can be predicted from the attributes of both catchments (and their differences) using the gradient boosted decision tree method, which is also capable of predicting continuous variables, in this case $D_{KL}$.

...

## Data Import and Model Setup

In [1]:
import os
import pandas as pd
import numpy as np
from time import time

from bokeh.plotting import figure, show
from bokeh.layouts import gridplot, row, column
from bokeh.models import ColumnDataSource
from bokeh.io import output_notebook
from bokeh.palettes import Sunset10, Vibrant7, Category20, Bokeh6, Bokeh7, Bokeh8

import xgboost as xgb
xgb.config_context(verbosity=2)

from sklearn.metrics import (
    root_mean_squared_error,
    mean_absolute_error,
    roc_auc_score,
    accuracy_score,
    confusion_matrix,
)

import data_processing_functions as dpf

from scipy.stats import linregress
from sklearn.model_selection import StratifiedKFold
output_notebook()

BASE_DIR = os.getcwd()

In [2]:
# load the catchment characteristics
fname = 'BCUB_watershed_attributes_updated.csv'
attr_df = pd.read_csv(os.path.join('data', fname))
attr_df.columns = [c.lower() for c in attr_df.columns]
station_ids = attr_df['official_id'].values
print(f'There are {len(station_ids)} monitored basins in the attribute set.')

There are 1325 monitored basins in the attribute set.


### Load pairwise attribute comparisons

Load a few rows from one of the pairwise data files.  These contain attributes about divergence measures that are computed on concurrent and non-concurrent time series at two monitored locations.

In [3]:
# open an example pairwise results file
input_folder = os.path.join(
    BASE_DIR, "data", "processed_divergence_inputs",
)
pairs_files = os.listdir(input_folder)
print(pairs_files[:4])
test_df = pd.read_csv(os.path.join(input_folder, pairs_files[0]), nrows=1000)

['KL_results_4bits_20240812.csv', 'BCUB_watershed_attributes_updated.csv', 'KL_results_6bits_20240812.csv', 'KL_results_4bits_20240812_partial_counts.csv']


In [4]:
kld_columns = [c for c in test_df.columns if 'dkl' in c]
kld_columns

['dkl_concurrent_uniform',
 'dkl_concurrent_post_-5R',
 'dkl_concurrent_post_-4R',
 'dkl_concurrent_post_-3R',
 'dkl_concurrent_post_-2R',
 'dkl_concurrent_post_-1R',
 'dkl_concurrent_post_0R',
 'dkl_concurrent_post_1R',
 'dkl_concurrent_post_2R',
 'dkl_concurrent_post_3R',
 'dkl_concurrent_post_4R',
 'dkl_concurrent_post_5R',
 'dkl_concurrent_post_6R',
 'dkl_concurrent_post_7R',
 'dkl_concurrent_post_8R',
 'dkl_concurrent_post_9R',
 'dkl_concurrent_post_10R',
 'dkl_nonconcurrent_uniform',
 'dkl_nonconcurrent_post_-5R',
 'dkl_nonconcurrent_post_-4R',
 'dkl_nonconcurrent_post_-3R',
 'dkl_nonconcurrent_post_-2R',
 'dkl_nonconcurrent_post_-1R',
 'dkl_nonconcurrent_post_0R',
 'dkl_nonconcurrent_post_1R',
 'dkl_nonconcurrent_post_2R',
 'dkl_nonconcurrent_post_3R',
 'dkl_nonconcurrent_post_4R',
 'dkl_nonconcurrent_post_5R',
 'dkl_nonconcurrent_post_6R',
 'dkl_nonconcurrent_post_7R',
 'dkl_nonconcurrent_post_8R',
 'dkl_nonconcurrent_post_9R',
 'dkl_nonconcurrent_post_10R']

In [6]:
results_folder = os.path.join(BASE_DIR, 'data', 'kld_prediction_results')
if not os.path.exists(results_folder):
    os.makedirs(binary_results_folder)

binary_results_folder = os.path.join(BASE_DIR, 'data', 'kld_prediction_results_binary')
if not os.path.exists(binary_results_folder):
    os.makedirs(binary_results_folder)

### Define attribute groupings

In [7]:
terrain = ['drainage_area_km2', 'elevation_m', 'slope_deg', 'aspect_deg'] #'gravelius', 'perimeter',
land_cover = [
    'land_use_forest_frac_2010', 'land_use_grass_frac_2010', 'land_use_wetland_frac_2010', 'land_use_water_frac_2010', 
    'land_use_urban_frac_2010', 'land_use_shrubs_frac_2010', 'land_use_crops_frac_2010', 'land_use_snow_ice_frac_2010']
soil = ['logk_ice_x100', 'porosity_x100']
climate = ['prcp', 'srad', 'swe', 'tmax', 'tmin', 'vp', 'high_prcp_freq', 'high_prcp_duration', 'low_prcp_freq', 'low_prcp_duration']
all_attributes = terrain + land_cover + soil + climate
len(all_attributes)

24

### Set trial parameters

In [10]:
# define the amount of data to set aside for final testing
nfolds = 5
n_boost_rounds = 100
n_optimization_rounds = 10
priors_to_test = [-2, -1, 0, 1, 2]
random_state = 42
loss = 'reg:absoluteerror'

#define if testing concurrent or nonconcurrent data
concurrent = 'concurrent'

# partial counts refer to the test where observations were assigned
# a uniform distribution to approximate error and allow fractional 
# observations in state space
partial_counts = False

# the input data file has an associated revision date
revision_date = '20240812'

all_test_results = {}
attribute_set_names = ['climate', '+terrain', '+land_cover', '+soil']

### Check the binary classification balance

The figure below shows how the binary classification balance changes as a function of both the dictionary size and the prior.  Coloured lines in the plot represent different dictionary sizes, and the log x-axis scale reflects the range of priors tested in the form of pseudo counts applied to $Q$, the simulated UAR distribution.  The y-axis represents the fraction of True Y values (the target variable), described above as proxy-target (observed-simulated) pairs where the simulated distribution is "closer to" the observed (posterior) compared to the uniform distribution.  

The balance of the target variable is sensitive to both the choice of prior and the dictionary size for the range of values tested.  Changing the the dictionary size between $2^4$ and $2^{12}$ bits changes the proportion of models that are "better than random guessing" by about 3-8% given any prior, while changing the prior from $10^{-5}$ to $10^5$  changes the same by between 10 and 25% for a given dictionary size.  Smaller priors correspond to larger penalties for underspecified models, so the prior controls the "selectivity" of the model with respect to the discriminant function, controlling the number of "informative models".  The smallest prior ($10^{-5}$ pseudo-counts) at 12 bits quantization yields the most selective model where 75% of the pairings are rejected, those whose KL divergence $D_{KL}(P||Q)$ is less than $D_{KL}(P||\mathbb{U})$.

The point is to create contrast within the sample set such that the "usefulness" of models and the predictors thereof can be most effectively distinguished by the gradient boosting procedure.  "Real" in this context is intended to mean "representative of information about processes governing long-term runoff" contained in the signal, as opposed to vestigial effects of data pre-processing, measurement uncertainty, or the model itself (epistemic & aleatoric uncertainty). 

A  discriminant function has the effect of attenuating the decision space of the sensor network arrangement problem, because fewer True values means fewer viable monitoring locations.

In [11]:
result_dict = {}
nrows = None

for bitrate in range(4,13):
    if bitrate in [5, 7]:
        continue
    print(f'bitrate = {bitrate}')
    fname = f"KL_results_{bitrate}bits_{revision_date}.csv"
    # if partial_counts is true, we load a separate result file
    # where observation counts incorporated a 10% uniform uncertainty
    whole_counts_fname = f"KL_results_{bitrate}bits_{revision_date}.csv"
    partial_counts_fname = f"KL_results_{bitrate}bits_{revision_date}_partial_counts.csv"

    wc_input_data_fpath = os.path.join(input_folder, whole_counts_fname)
    pc_input_data_fpath = os.path.join(input_folder, partial_counts_fname)
    
    df_partial = pd.read_csv(pc_input_data_fpath, nrows=nrows, low_memory=False)
    df_whole = pd.read_csv(wc_input_data_fpath, nrows=nrows, low_memory=False)
    
    result_dict[bitrate] = {'partial': df_partial, 'whole': df_whole}

bitrate = 4
bitrate = 6
bitrate = 8
bitrate = 9
bitrate = 10
bitrate = 11
bitrate = 12


In [12]:
posterior_columns = [c for c in df_partial.columns if c.startswith(f'dkl_{concurrent}_post')]

#### Compute the target column binary class balances by bitrate

In [13]:
def compute_target_balance(df, concurrent, posterior_columns):
    balances = []
    for target_col in posterior_columns:
        posterior = target_col.split('_')[-1].split('R')[0]
        # if DKL(P||Q) < DKL(P||U), then the model is a "better compressor"
        # of the target signal than a uniform distribution
        df['binary_target'] = df[target_col] < df[f'dkl_{concurrent}_uniform']
        ut, ct = np.unique(df['binary_target'].values, return_counts=True)
        pct_false = ct[0] / len(df)
        balances.append(1 - pct_false)
    return balances

In [14]:
class_balance_fpath = os.path.join(binary_results_folder, f'dkl_binary_balance_result.csv')

if os.path.exists(class_balance_fpath):
    class_balance_df = pd.read_csv(class_balance_fpath, index_col='prior')
else:
    class_balance_df = pd.DataFrame()
    for bitrate in range(4,13):
        if bitrate in [5, 7]:
            continue
            
        partial_count_df = result_dict[bitrate]['partial']
        whole_count_df = result_dict[bitrate]['whole']
        
        class_balance_df[f'{bitrate}_partial_counts'] = compute_target_balance(partial_count_df, concurrent, posterior_columns)
        class_balance_df[f'{bitrate}_whole_counts'] = compute_target_balance(whole_count_df, concurrent, posterior_columns)
        
    class_balance_df.index = [10**(int(c.split('_')[-1].split('R')[0])) for c in posterior_columns]
    class_balance_df.index.name = 'prior'
    class_balance_df.to_csv(class_balance_fpath, index=True)

In [15]:
class_balance_df.columns

Index(['4_partial_counts', '4_whole_counts', '6_partial_counts',
       '6_whole_counts', '8_partial_counts', '8_whole_counts',
       '9_partial_counts', '9_whole_counts', '10_partial_counts',
       '10_whole_counts', '11_partial_counts', '11_whole_counts',
       '12_partial_counts', '12_whole_counts'],
      dtype='object')

In [16]:
bal_figs = []
for count_type in ['whole', 'partial']:
    bal_fig = figure(width=550, height=400, x_axis_type='log', title=f'{count_type} counts',
                    y_range=(0.25, 0.55))
    class_cols = [f'{b}_{count_type}_counts' for b in [4, 6, 8, 9, 10, 11, 12]]
    n = 0
    for c in class_cols:
        bal_fig.line(class_balance_df.index, class_balance_df[c], color=Category20[8][n], 
                     line_width=2, legend_label=f'{c.split("_")[0]} bits')
        n += 1
    bal_fig.xaxis.axis_label = r'$$\text{Prior Q (Pseudo-counts)} [ 10^\alpha ]$$'
    bal_fig.yaxis.axis_label = r'$$\text{P(True) } [ \% ]$$'
    bal_fig.legend.location = 'bottom_right'
    bal_figs.append(bal_fig)


In [17]:
bal_fig_layout = row(bal_figs)
show(bal_fig_layout)

From the figure above at left, assuming the observations are "exact" yields a larger proportion of empty bins as the dictionary size increases.  This approach yields higher penalties for misaligned distributions by assuming the observation precision increases with the dictionary size.  The figure at right approximates measurement uncertainty by distributing observations between bins in proportion to the assumed measurement error, in this case +/- 10% uniformly distributed.  This approach reduces the influence of the prior since fewer bins are empty and the prior only impacts empty bins in Q.  

Decreasing the prior results in larger penalties for **underspecified models**, that is where some $q_i = 0$. The shift in the classification balance, that is the pairs that change from True to False as the prior decreases with respect to $D_{KL}(P||Q) < D_{KL}(P||\mathbb{U})$, includes catchment pairs where the proxy $Q$ underspecifies the state frequencies of $P$, and the tolerance of underspecification is proportional to the observed (posterior) state frequency.  Increasing the dictionary size preserves more of the original signal precision, and it also gives greater discriminatory power to the prior, as shown by the greater change in y in the plot above.  

Next we'll invert the plot to show how the binary label balance changes as a function of dictionary size (bitrate) for different priors.

In [18]:
cb_df = pd.DataFrame()
brates = []
for posterior in posterior_columns:
    balances_whole, balances_partial = [], []
    for bitrate in range(4, 13):
        if bitrate in [5, 7]:
            continue
        
        model_tests_whole = result_dict[bitrate]['whole'][posterior] < result_dict[bitrate]['whole'][f'dkl_{concurrent}_uniform']
        model_tests_partial = result_dict[bitrate]['partial'][posterior] < result_dict[bitrate]['partial'][f'dkl_{concurrent}_uniform']
        ut_whole, ct_whole = np.unique(model_tests_whole, return_counts=True)
        ut_partial, ct_partial = np.unique(model_tests_partial, return_counts=True)
        balances_whole.append(1-ct_whole[0] / len(model_tests_whole))
        balances_partial.append(1-ct_partial[0] / len(model_tests_partial))
    cb_df[f'{posterior}_partial'] = balances_partial
    cb_df[f'{posterior}_whole'] = balances_whole
cb_df.index = [2**e for e in [4, 6, 8, 9, 10, 11, 12]]
# class_balance_df.index = [10**(int(c.split('_')[-1].split('R')[0])) for c in posterior_columns]

In [19]:
bal_figs2 = []
for cb in ['whole', 'partial']:
    bal_fig2 = figure(width=600, height=450, x_axis_type='log', title=f'{cb} counts',
                     y_range=(0.25, 0.55))
    n = 0
    ccols = [c for c in cb_df.columns if cb in c]
    
    for c in ccols:
        a = c.split('_')[-2].split('R')[0]
        bal_fig2.line(cb_df.index, cb_df[c], color=Category20[20][n], 
                     line_width=2, legend_label=f'10^{a}')
        n += 1
    bal_fig2.xaxis.axis_label = r'$$\text{Dictionary Size}$$'
    bal_fig2.yaxis.axis_label = r'$$\text{P(True) } [ \% ]$$'
    bal_fig2.add_layout(bal_fig2.legend[0], 'right')
    bal_fig2.legend.location = 'bottom_right'
    bal_figs2.append(bal_fig2)


In [20]:
bal2_layout = row(bal_figs2)
show(bal2_layout)

Above, the balance of binary classification is affected least with increasing prior assumed on Q regardless of how the observation counts are tallied.  Assuming a uniform distribution on each observation reduces the effect of the dictionary size on the binary classification balance, with a slightly increasing proportion of True (informative) models.    

### Train-Test Split

The input dataset is pairwise comparisons of just over 1600 (streamflow) monitored catchments, their attributes, and the attribute differences.  After filtering for data concurrency (minimum 1 year, 90% complete) and maximum distance between basin centroids (1000 km) we are left with roughly 600K pairs.  The pairwise setup means that station data appears in more than one row.  As a result, the attributes of stations can end up in both training and test sets if we simply split by randomly assigning rows to training or test sets.  To address this issue we split the set of catchment IDs and generate pairs separately such that no ID appears in both training and test set.  One problem remains, and that is to generate training and test sets with some assurance that the target variable distributions match to some degree.  

Next we create training/test splits and visualize how the target variable distributions compare.

In [252]:
def add_attributes(attr_df, df_relations, attribute_cols):
    """
    Adds attributes from the df_attributes to the df_relations based on the 'proxy' and 'target' columns
    using map for efficient lookups.

    Parameters:
    df_attributes (pd.DataFrame): DataFrame with 'id' and attribute columns.
    df_relations (pd.DataFrame): DataFrame with 'proxy' and 'target' columns.
    attribute_cols (list of str): List of attribute columns to add to df_relations.

    Returns:
    pd.DataFrame: Updated df_relations with added attribute columns.
    """
    # Create dictionaries for each attribute for quick lookup
    attr_dicts = {col: attr_df.set_index('official_id')[col].to_dict() for col in attribute_cols}

    # Add target attributes
    for col in attribute_cols:
        df_relations[f'target_{col}'] = df_relations['target'].map(attr_dicts[col])

    # Add proxy attributes
    for col in attribute_cols:
        df_relations[f'proxy_{col}'] = df_relations['proxy'].map(attr_dicts[col])

    return df_relations

In [354]:
import networkx as nx
from networkx.algorithms.community import asyn_fluidc

def graph_based_kfold(df, n_folds, proxy_col='proxy', target_col='target', random_seed=42):
    """
    Use graph-based partitioning to create k-folds from a dataset representing pairs of locations.

    Parameters:
    -----------
    df : pd.DataFrame
        The input dataframe containing proxy and target columns representing pairs of locations.
    proxy_col : str
        The name of the column representing proxy IDs (locations).
    target_col : str
        The name of the column representing target IDs (locations).
    n_folds : int
        The number of folds to create.

    Returns:
    --------
    dict
        A dictionary where the keys are fold numbers and the values are the indices of rows in that fold.
    """
    # Step 1: Create a graph from the proxy and target pairs
    G = nx.Graph()
    G.add_edges_from(zip(df[proxy_col], df[target_col]))

    # Step 2: Use a graph partitioning algorithm to split the graph into n_folds
    
    partitions = list(asyn_fluidc(G, n_folds, seed=random_seed))  # Partition the graph into n_folds using fluid communities
    
    # Step 3: Assign rows in the dataframe to the fold based on the partition of the graph
    fold_assignments = {}
    for fold_num, partition in enumerate(partitions):
        # Assign rows to the fold where both proxy and target IDs are in the same partition
        fold_mask = df[proxy_col].isin(partition) & df[target_col].isin(partition)
        fold_assignments[fold_num] = df[fold_mask].index.tolist()

    return fold_assignments

In [355]:
def check_fold_sizes_and_distributions(df, fold_assignments, target_col, random_seed=42):
    n_tot = 0
    fold_dict = {}
    n = 0
    for fold in list(fold_assignments.keys()):
        idxs = fold_assignments[fold]
        fold_df = df.loc[fold_assignments[fold], :].copy()
        fold_vals = fold_df[target_col].values
        n_tot += len(fold_df)        
        fold_mean, fold_std = np.mean(fold_vals), np.std(fold_vals)
        fold_dict[n] = {'size': len(fold_df), 'mean': fold_mean, 'std': fold_std}
        n += 1
        print(f'      fold {n} has {len(fold_df)} elements, mean={fold_mean:.2f}, std={fold_std:.2f}')
    assert n_tot < len(df) # there should be fewer samples because we have to cut the graph
    return fold_dict

In [356]:
def graph_fold_separation_trial(df, target_col, n_folds, seed):
    fold_assignments = graph_based_kfold(df, n_folds, random_seed=seed)
    fold_dict = check_fold_sizes_and_distributions(df, fold_assignments, target_col)
    # Compute a score for this partitioning (based on variance in fold sizes and statistics)
    sizes = np.array([fold_dict[n]['size'] for n in list(fold_dict.keys())])
    means = np.array([fold_dict[n]['mean'] for n in list(fold_dict.keys())])
    stds = np.array([fold_dict[n]['std'] for n in list(fold_dict.keys())])
    
    # Score based on fold size balance and mean/std similarity
    mean_size, min_size, max_size = np.mean(sizes), np.min(sizes), np.max(sizes)
    mean_var, min_mean, max_mean = np.mean(means), np.min(means), np.max(means)
    mean_std, min_std, max_std = np.mean(stds), np.min(stds), np.max(stds)

    # Compute a weighted score
    weight_size, weight_mean, weight_std = (1, 1, 1)
    size_var_norm = weight_size * (mean_size - min_size) / (max_size - min_size) 
    mean_var_norm = weight_mean * (mean_var - min_mean) / (max_mean - min_mean) 
    std_var_norm = weight_std * (mean_std - min_std) / (max_std - min_std)
    score = size_var_norm + mean_var_norm + std_var_norm
    print(f"     Score for this trial: {score:.2f} (size_var={size_var_norm:.2f}, mean_var={mean_var_norm:.2f}, std_var={std_var_norm:.2f})")
    return fold_dict, score

In [357]:
def optimize_folds(df, nfolds, target_col, n_trials=20):
    best_fold_dict = None
    best_score = np.inf
    best_seed = None
    for trial in range(n_trials):        
        seed = np.random.randint(low=1, high=1e6)
        print(f"Trial {trial + 1}/{n_trials} with seed {seed}")        
        fold_assignments, score = graph_fold_separation_trial(df, target_col, nfolds, seed)
        # Check if this is the best partitioning so far
        if score < best_score:
            print(f'New best score = {score:.2f} (random seed={seed})')
            best_score = score
            best_fold_dict = fold_assignments
            best_seed = seed
    return best_fold_dict

In [360]:
def predict_KLD_from_attributes(attr_df, target_col_base, stations, nfolds, results_folder, 
                                loss_function=None, partial_counts=False, binary_test=False, random_seed=42,
                               optimize_cv_folds=True, n_trials=20, cv_fold_seed=42):

    counts_key = 'partial'
    if partial_counts == "False":
        counts_key = 'whole'
    
    all_test_results = {}
    for bitrate in [4, 6, 8, 10, 12]:
        t0 = time()
        print(f'bitrate = {bitrate}')
        input_data_fpath = os.path.join(input_folder, fname)
        nrows = None
        # df = pd.read_csv(input_data_fpath, nrows=nrows, low_memory=False)
        df = result_dict[bitrate][counts_key]
        df.dropna(subset=[target_col_base], inplace=True)
        # reduce the maximum distance separating pairs such that the 
        # graph can be more evenly separated
        max_distance = 500 # km
        df = df[df['centroid_distance'] < max_distance]
        print(f'  {len(df)} pairs remaining after filtering by max distance of {max_distance} km')
        
        # add the attributes into the input dataset
        df = add_attributes(attr_df, df.copy(), all_attributes)

        if binary_test == True:
            # training_stn_cv_sets, test_stn_sets = dpf.train_test_split_by_official_id(holdout_pct, stations, nfolds)
            # if DKL(P||Q) < DKL(P||U), then the model is a better compressor 
            # of the target signal than a uniform distribution
            df['binary_target'] = df[target_col_base] < df[f'dkl_{concurrent}_uniform']
            ut, ct = np.unique(df['binary_target'].values, return_counts=True)
            pct_false = ct[0] / len(df)
            # change target_col to the new binary target
            target_col = 'binary_target'
            print(f'The binary target variable balance is {100*pct_false:.0f}% False and {100*(1-pct_false):.0f}% True')
        else:
            target_col = target_col_base

        if optimize_cv_folds:
            best_fold_dict = optimize_folds(df, nfolds, target_col, n_trials)
        else:
            fold_assignments = graph_based_kfold(df, nfolds, random_seed=cv_fold_seed)
            fold_dict = check_fold_sizes_and_distributions(df, fold_assignments, target_col)

                   

        print(asdf)
        # return fold_assignment
        
    #     print(fold_assignment)
    #     print(asdf)
        
    #     # add attribute groups successively
    #     for attribute_set, set_name in zip([land_cover, terrain, soil, climate], attribute_set_names):
    #         print(f'  Processing {set_name} attribute set: {target_col}')
    #         input_attributes += attribute_set 
                        
    #         features = dpf.format_features(input_attributes)

    #         if binary_test == True:
    #             trial_df, test_df = dpf.run_binary_xgb_trials_custom_CV(
    #                 bitrate, set_name, features, target_col, df, n_optimization_rounds, 
    #                 nfolds, n_boost_rounds, results_folder, loss=loss_function, eval_metric='error'
    #             )
                
    #             obs, pred = test_df['actual'].values, test_df['predicted'].values
    #             tn, fp, fn, tp = confusion_matrix(obs, pred).ravel()
    #             test_accuracy = (tp + tn) / (tp + fp + fn + tn) 
                
    #             print(f'   held-out test accuracy: {test_accuracy:.2f}')
    #             # store the test set predictions and actuals
    #             all_test_results[bitrate][set_name] = {
    #                 'trials': trial_df, 'test_df': test_df,
    #                 'test_accuracy': test_accuracy} 
    #         else:
    #             trial_df, test_df = dpf.run_xgb_trials_custom_CV(
    #                 bitrate, set_name, features, target_col, df, n_optimization_rounds, 
    #                 nfolds, n_boost_rounds, results_folder, loss=loss_function
    #             )
    #             test_rmse = root_mean_squared_error(test_df['actual'], test_df['predicted'])
    #             test_mae = mean_absolute_error(test_df['actual'], test_df['predicted'])

    #             print(f'   Held-out test rmse: {test_rmse:.2f}, mae: {test_mae:.2f}')
    #             print('')
    #             # store the test set predictions and actuals
    #             all_test_results[bitrate][set_name] = {
    #                 'trials': trial_df, 'test_df': test_df,
    #                 'test_mae': test_mae, 'test_rmse': test_rmse} 
    # return all_test_results

In [361]:
loss_function = 'reg:absoluteerror'
optimize_cv_folds = False
cv_fold_seed = 83561
n_trials = 25
for prior in priors_to_test:
    target_col = f'dkl_{concurrent}_post_{prior}R'
    test_results_fname = f'{target_col}_{prior}_prior_results.npy'
    if partial_counts is False:
        test_results_fname = f'{target_col}_{prior}_prior_results.npy'
    else:
        test_results_fname = f'{target_col}_{prior}_prior_results_partial_counts.npy'
    test_results_fpath = os.path.join(results_folder, test_results_fname)
    if os.path.exists(test_results_fpath):
        print('processed and loading: ', test_results_fname)
        all_test_results = np.load(test_results_fpath, allow_pickle=True).item()
    else:
        all_test_results = predict_KLD_from_attributes(
            attr_df, target_col, station_ids, nfolds, results_folder, 
            loss_function=loss_function, partial_counts=partial_counts,
            optimize_cv_folds=optimize_cv_folds, n_trials=n_trials, cv_fold_seed=cv_fold_seed,
        )
        np.save(test_results_fpath, all_test_results)

bitrate = 4
  228413 pairs remaining after filtering by max distance of 500 km
      fold 1 has 32044 elements, mean=1.89, std=3.06
      fold 2 has 40189 elements, mean=2.06, std=3.06
      fold 3 has 18675 elements, mean=1.58, std=2.49
      fold 4 has 11968 elements, mean=1.24, std=2.13
      fold 5 has 12742 elements, mean=1.20, std=1.90
{0: {'size': 32044, 'mean': 1.8853656227476505, 'std': 3.05667743433809}, 1: {'size': 40189, 'mean': 2.0621906356413513, 'std': 3.057044591993696}, 2: {'size': 18675, 'mean': 1.578743584577704, 'std': 2.485098701937047}, 3: {'size': 11968, 'mean': 1.238713749817703, 'std': 2.134464416443258}, 4: {'size': 12742, 'mean': 1.1978954400570754, 'std': 1.895729332380075}}


NameError: name 'asdf' is not defined

## Run Binary Model

In [None]:
loss_function = 'binary:hinge'

for prior in priors_to_test:
    print(f'Starting tests based on 10^{prior} pseudo-count prior.')
    target_col = f'dkl_{concurrent}_post_{prior}R'
    if partial_counts is False:
        test_results_fname = f'{target_col}_{prior}_prior_results_binary.npy'
    else:
        test_results_fname = f'{target_col}_{prior}_prior_results_binary_partial_counts.npy'
    test_results_fpath = os.path.join(binary_results_folder, test_results_fname)
    if os.path.exists(test_results_fpath):
        all_results = np.load(test_results_fpath, allow_pickle=True).item()
    else:
        all_results = predict_KLD_from_attributes(
            attr_df, target_col, station_ids, nfolds, binary_results_folder, 
            loss_function=loss_function, binary_test=True, partial_counts=partial_counts)
        np.save(test_results_fpath, all_results)

## Run Regression Models

In [None]:
def load_result_by_prior(prior, binary=False, partial_counts=False):
    fname = f'dkl_{concurrent}_post_{prior}R_{prior}_prior_results.npy'
    if binary == True:
        fname = fname.replace('.npy', '_binary.npy')
        rf = binary_results_folder
    else:
        rf = results_folder
    if partial_counts == True:
        fname = fname.replace('.npy', '_partial_counts.npy')
    fpath = os.path.join(rf, fname)
    return np.load(fpath, allow_pickle=True).item()

## Plot Results

### Plot Results of Binary Classification Test

In [None]:
layout_dict = {}
c = 0
bin_plots = []
for prior in priors_to_test:
    result = load_result_by_prior(prior, binary=True, partial_counts=partial_counts)
    title = f'10^{prior} Prior: Binary Test Results '
    fig = figure(title=title, x_range=attribute_set_names, toolbar_location='above')
    fig.yaxis.axis_label = 'Accuracy score (tp+tn)/(N)'
    fig.xaxis.axis_label = 'Attribute Group (additive)'
    for b, set_dict in result.items():        
        y = [set_dict[e]['test_accuracy'] for e in attribute_set_names]
        source = ColumnDataSource({'x': attribute_set_names, 'y': y})
            
        fig.line('x', 'y', legend_label=f'{b}bits', 
                 color=Category20[10][c], source=source, line_width=3)
        fig.legend.background_fill_alpha = 0.6
        
        result_df = pd.DataFrame({'set': attribute_set_names, 'accuracy': y})
        c += 1
    bin_plots.append(fig)
    c = 0

In [None]:
layout = gridplot(bin_plots, ncols=3, width=350, height=300)
show(layout)

### Binary Results Discussion

Without additional context, the above plots might lead us to believe that the prior has very little effect on the accuracy performance of the xgboost model but perhaps a higher prior leads to slightly better performance.  Given the additional context provided by the classification balance, we might identify that the 

The equal UAR model is based on mapping "observed" streamflow at one location to another.  The dictionary size and the prior are two key assumptions that affect both the discriminant and loss functions upon which decisions and actions are ultimately based.  The dictionary size controls the precision, or the confidence we have about the "true" value of observations.  A perfect model will not be affected by the dictionary size, since however narrow we make intervals defining states, the model frequencies will equal the posterior (observed frequencies).  By the same token, the choice of prior does not affect the discriminant function of a perfect model since it always predicts observed states.  Our goal is to discriminate between imperfect models, to tune the dials of prior and quantization to bring into focus the true distinction between imperfect models without excessively introducing artifacts arising from incomplete information.  Too small a dictionary filters out real information that the discriminant function would otherwise aim to exploit.  Increasing the dictionary size leads to oversampling and increased sensitivity to outliers.  The choice of prior can also be thought of as controlling contrast between samples in the discriminant function where predicted state frequencies are small compared to the assumed prior.    

A low prior increases the penalty for "misaligned" distributions, that is models that yield 0 probability for some observed state(s).  This characteristic is reflected in the class balance of Y as the prior is varied.  ncreasing the dictionary size has a magnifying effect on the loss function as the "alignment" of bins 

## Plot Results of $D_{KL}$ Regression Test

In [None]:
layout_dict = {}
reg_plots_dict = {}
res_r2_dict = {}
for prior in priors_to_test:
    plots = []
    result = load_result_by_prior(prior, binary=False)
    reg_plots_dict[prior] = {}
    res_r2_dict[prior] = {}
    for b, set_dict in result.items():
        test_rmse, test_mae = [], []
        attribute_sets = list(set_dict.keys())
    
        y1 = [set_dict[e]['test_rmse'] for e in attribute_sets]
        y2 = [set_dict[e]['test_mae'] for e in attribute_sets]
        
        source = ColumnDataSource({'x': attribute_sets, 'y1': y1, 'y2': y2})
        
        title = f'{b} bits (Q(θ|D)∼Dirichlet(α=10^{prior}))'
        if len(plots) == 0:
            fig = figure(title=title, x_range=attribute_sets, toolbar_location='above')
        else:
            fig = figure(title=title, x_range=attribute_sets, y_range=plots[0].y_range, toolbar_location='above')
        fig.line('x', 'y1', legend_label='rmse', color='green', source=source, line_width=3)
        fig.line('x', 'y2', legend_label='mae', color='dodgerblue', source=source, line_width=3)
        fig.legend.background_fill_alpha = 0.6
        fig.yaxis.axis_label = 'Error'
        fig.xaxis.axis_label = 'Attribute Group (additive)'
        
        result_df = pd.DataFrame({'set': attribute_sets, 'rmse': y1, 'mae': y2})
        best_rmse_idx = result_df['rmse'].idxmin()
        best_mae_idx = result_df['mae'].idxmin()
        best_rmse_set = result_df.loc[best_rmse_idx, 'set']
        best_mae_set = result_df.loc[best_mae_idx, 'set']
        best_result = set_dict[best_rmse_set]['test_df']
        
        xx, yy = best_result['actual'], best_result['predicted']
        slope, intercept, r, p, se = linregress(xx, yy)
        
        # sfig = figure(title=f'Test: {b} bits best model {best_rmse_set} (N={len(best_result)})', toolbar_location='above')
        sfig = figure(title=f'{b} bits', toolbar_location='above')
        sfig.scatter(xx, yy, size=1, alpha=0.6)
        xpred = np.linspace(min(xx), max(xx), 100)
        ybf = [slope * e + intercept for e in xpred]
        sfig.line(xpred, ybf, color='red', line_width=3, line_dash='dashed', legend_label=f'R²={r**2:.2f}')   
        # plot a 1:1 line
        sfig.line([min(yy), max(yy)], [min(yy), max(yy)], color='black', line_dash='dotted', 
                  line_width=2, legend_label='1:1')
        sfig.xaxis.axis_label = r'Actual $$D_{KL}$$ [bits/sample]'
        sfig.yaxis.axis_label = r'Predicted $$D_{KL}$$ [bits/sample]'
        sfig.legend.background_fill_alpha = 0.6
        sfig.legend.location = 'top_left'
        reg_plots_dict[prior][b] = sfig
        res_r2_dict[prior][b] = r**2
        plots.append(fig)
        plots.append(sfig)
    layout_dict[prior] = gridplot(plots, ncols=2, width=350, height=300)

In [None]:
show(layout_dict[-2])

In [None]:
# show(layout_dict[-1])

In [None]:
# show(layout_dict[0])

In [None]:
# show(layout_dict[1])

In [None]:
# show(layout_dict[2])

In [None]:
sample_plots = []
prior = -2
for b in [4, 6, 8, 10, 12]:
    plot = reg_plots_dict[prior][b]
    sample_plots.append(plot)

In [None]:
sample_layout = gridplot(sample_plots, ncols=5, width=250, height=250)
show(sample_layout)

In [None]:
from bokeh.transform import linear_cmap
from bokeh.models import ColorBar, ColumnDataSource
from bokeh.layouts import gridplot
from bokeh.palettes import Viridis256, gray, magma, Category20

# Convert the nested dict into a DataFrame
df = pd.DataFrame(res_r2_dict).T  # Transpose to get priors as columns
df.index.name = 'Prior'
df.columns.name = 'Bitrate'

In [None]:
# Melt the DataFrame to a long format
df_melted = df.reset_index().melt(id_vars='Prior', var_name='Bitrate', value_name='Value')
# Ensure the Bitrate values are ordered correctly (increasing order)
df_melted['Bitrate'] = pd.Categorical(df_melted['Bitrate'], categories=sorted(df_melted['Bitrate'].unique(), reverse=False), ordered=True)

# Create a Bokeh ColumnDataSource
source = ColumnDataSource(df_melted)

# Create a figure for the heatmap
p = figure(title="KL divergence from attributes: R² of test set by Prior and Bitrate",width=600, height=500,
           tools="hover", tooltips=[('Value', '@Value')], toolbar_location=None)

# Create a color mapper
mapper = linear_cmap(field_name='Value', palette=magma(256), low=df_melted.Value.min(), high=df_melted.Value.max())

# Add rectangles to the plot
p.rect(x="Prior", y="Bitrate", width=1, height=1, source=source,
       line_color=None, fill_color=mapper)

# Add color bar
color_bar = ColorBar(color_mapper=mapper['transform'], width=8, location=(0,0))
p.add_layout(color_bar, 'right')

# Format plot
p.axis.axis_line_color = None
p.axis.major_tick_line_color = None
p.xaxis.axis_label = r'$$Q(θ|D)∼\text{Dirichlet}(\alpha = 10^{a})$$'
p.yaxis.axis_label = r'$$\text{Quantization Bitrate (dictionary size)}$$'
p.axis.major_label_text_font_size = "10pt"
p.axis.major_label_standoff = 0
p.xaxis.major_label_orientation = 1.0

# Output the plot to an HTML file and display it
# output_file("heatmap.html")
show(p)

## Discussion

Since KL divergence $D_{KL}(P||Q) = \sum_{i=1}^{2^b} p_i\log(\frac{p_i}{q_i}) = +\infty \text{ when any } q_i \rightarrow 0$, the simulated $Q$ is treated as a posterior distribution by assuming a uniform (Dirichlet) prior $\alpha = [a_1, \dots, a_n]$. The prior $\alpha$ is an additive array of uniform pseudo-counts used to address the (commonly occurring) case where $q_i = 0$, that is the model does not predict an observed state $i$.  In this experiment we tested a wide range of priors on an exponential scale, $\alpha = 10^a, a \in [-2, -1, 0, 1, 2]$.  

The scale of the pseudo-count can be interpreted as a strength of belief in the model. Small $a$ represents strong belief that the model produces a distribution $Q$ that is representative of the "true" (observed posterior) distribution $P$, and for a fixed $p_i$ the effect of a decreasing $a$ on the discriminant function $D_{KL}$ yields a stronger penalty for a model that predicts an observed state with 0 probability.  Loss functions penalize overconfidence in incorrect predictions, and a prediction of 0 probability of a state which is actually observed should perhaps be thought of as confidence in an incorrect prediction and penalized as such.  A large $a$ represents weak belief that the model produces a distribution $Q$ that is representative of $P$, since $Q$ approaches the uniform distribution $\mathbb{U}$ as $a$ increases.  

Adding pseudo-counts has the effect of diluting the signal for the gradient boosting model to exploit in minimizing prediction error.  Analogously, varying the bitrate, or the size of the dictionary used to quantize continuous streamflow into discrete states, also adds quantization noise since the original streamflow signals are stored in three decimal precision and they are quantized into as few as 4 bits (16 symbol dictionary) and as many as 12 bits (4096 symbol dictionary).  The range of dictionary sizes is set to cover the expected range of rating curve uncertainty, which is generally considered multiplicative and expressed as a \% of the observed value.

As shown by the results, priors representing the addition of $10^1 \text{ to } 10^2$ pseudo-counts diminishes the performance of the gradient boosted decision tree model, regardless of the dictionary size, or the number of possible values provided by the quantization.  Heavily penalizing unpredicted states does not have as great an impact as anticipated, perhaps as a result of the corresponding $p_i$ values also being small.



How do the prior and the birate affect the distribution of "actual" $D_{KL}$?.

In [None]:
dfig = figure(title=r"$$D_{\text{KL}}(\text{bitrate})$$", width=600, height=450)
n = 0
order_dict = {}
for b, set_dict in all_test_results.items():
    test_rmse, test_mae = [], []
    attribute_sets = list(set_dict.keys())
    
    y1 = [set_dict[e]['test_rmse'] for e in attribute_sets]
    y2 = [set_dict[e]['test_mae'] for e in attribute_sets]

    result_df = pd.DataFrame({'set': attribute_sets, 'rmse': y1, 'mae': y2})
    best_rmse_idx = result_df['rmse'].idxmin()
    best_mae_idx = result_df['mae'].idxmin()
    best_rmse_set = result_df.loc[best_rmse_idx, 'set']
    best_mae_set = result_df.loc[best_mae_idx, 'set']
    best_result = set_dict[best_rmse_set]['test_df']
    
    best_result = set_dict[best_rmse_set]['test_df']    

    # print(b, best_result['actual'].min())
    # xx, yy = best_result['actual'], best_result['predicted']
    # xx = best_result['actual']
    # slope, intercept, r, p, se = linregress(xx, yy)

    # compute empirical cdf of "Actual" TVD
    sorted_data = np.sort(best_result['actual'])
    # Calculate the CDF values
    cdf_values = np.arange(1, len(sorted_data) + 1) / len(sorted_data)
    dfig.line(sorted_data, cdf_values, color=Vibrant7[n], 
              line_width=2.5, legend_label=f'{b}')
    n += 1
dfig.legend.location ='bottom_right'
dfig.xaxis.axis_label = r'$$D_{\text{KL}} [\text{bits}/\text{sample}]$$'
dfig.yaxis.axis_label = r'$$\text{Pr}(D_{\text{KL}})$$'

In [None]:
show(dfig)

## Citations

```{bibliography}
:filter: docname in docnames
```