# Predictability of (Shannon) Entropy

## Introduction

In the previous section, we showed that mean annual discharge could be predicted well from catchment attributes.  In this step we aim to predict the (Shannon) entropy (H) of the streamflow distribution, a measure of the randomness of a river system, from catchment attributes.  Because the entropy of the distribution does not represent one specific process, it does not fit conventional classifications of hydrological signatures.  Because it encompasses the entire distribution, it can instead be interpreted as an aggregate representation of the complex interactions of the hydrologic cycle.

In the data preprocessing, we computed the entropy of the distribution of each individual streamflow time series in bits per sample.  We'll now use an ensemble decision tree method called XGBoost (eXtreme Gradient Boosting) {cite}`chen2016xgboost` to see if the entropy (or uncertainty) of a distribution can be predicted from catchment attributes.  The dictionary size (number of quantization levels) is varied to test whether the model can exploit the additional information in a more finely quantized distribution.  Input features are added one attribute group at a time (climate, land cover, terrain, then soil) so that the contribution of each group can be compared.
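As a rough illustration of the target variable, the sketch below quantizes a series into $2^b$ equal-width bins and computes the Shannon entropy of the resulting symbol frequencies in bits per sample. The function name `shannon_entropy_bits`, the equal-width binning, and the toy series are assumptions made for illustration; the preprocessing step may quantize flows differently (for example in log space).

```python
import math
from collections import Counter


def shannon_entropy_bits(values, bitrate):
    """Entropy (bits/sample) of `values` quantized into 2**bitrate equal-width bins.

    Hypothetical helper for illustration only; not the preprocessing code.
    """
    n_bins = 2 ** bitrate
    lo, hi = min(values), max(values)
    if hi == lo:
        return 0.0  # a constant series carries no uncertainty
    width = (hi - lo) / n_bins
    # map each value to a bin index, clamping the maximum into the last bin
    symbols = [min(int((v - lo) / width), n_bins - 1) for v in values]
    counts = Counter(symbols)
    n = len(symbols)
    # H = -sum(p * log2(p)) over the observed symbols
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


# toy series: four equally likely states, quantized at 2 bits (4 bins)
toy = [0.0, 1.0, 2.0, 3.0] * 25
h = shannon_entropy_bits(toy, bitrate=2)
print(f'H = {h:.2f} bits/sample')
```

By construction the result cannot exceed `bitrate`, since a dictionary of $2^b$ symbols has a maximum entropy of $b$ bits per sample.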

In [None]:
import os
import pandas as pd
import numpy as np

import xgboost as xgb
from sklearn.metrics import root_mean_squared_error, mean_absolute_error

from scipy.stats import linregress

import data_processing_functions as dpf

from bokeh.plotting import figure, show
from bokeh.layouts import gridplot
from bokeh.models import ColumnDataSource
from bokeh.io import output_notebook
from bokeh.palettes import Sunset10, Vibrant7


output_notebook()

BASE_DIR = os.getcwd()

## Load Input Data

In [None]:
# load the catchment characteristics
attributes_filename = 'BCUB_watershed_attributes_updated.csv'
df = pd.read_csv(os.path.join(BASE_DIR, 'data', 'processed_divergence_inputs', attributes_filename))
df.columns = [c.lower() for c in df.columns]

Print the available column names so the attributes can be subdivided into related classes: terrain, land cover, soil, and climate.

In [None]:
print(df.columns.tolist())

## Define Attribute Groups

In [None]:
terrain = ['drainage_area_km2', 'elevation_m', 'slope_deg', 'aspect_deg']
land_cover = [
    'land_use_forest_frac_2010', 'land_use_grass_frac_2010', 'land_use_wetland_frac_2010', 'land_use_water_frac_2010', 
    'land_use_urban_frac_2010', 'land_use_shrubs_frac_2010', 'land_use_crops_frac_2010', 'land_use_snow_ice_frac_2010']
soil = ['logk_ice_x100', 'porosity_x100']
climate = ['prcp', 'srad', 'swe', 'tmax', 'tmin', 'vp', 'high_prcp_freq', 'high_prcp_duration', 'low_prcp_freq', 'low_prcp_duration']
all_attributes = terrain + land_cover + soil + climate
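print(f'{len(all_attributes)} attributes selected')
# confirm every selected attribute is present in the dataframe
missing = [c for c in all_attributes if c not in df.columns]
assert not missing, f'Missing attribute columns: {missing}'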

In [None]:
results_folder = os.path.join(BASE_DIR, 'data', 'entropy_prediction_results')
os.makedirs(results_folder, exist_ok=True)

In [None]:
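def predict_entropy_from_attributes(df, holdout_pct, results_folder):
    """Predict entropy at each bitrate, adding attribute groups successively.

    Relies on notebook-level variables (the attribute groups, attribute_set_names,
    n_optimization_rounds, nfolds, n_boost_rounds) defined before the call.
    """
    # work on a re-indexed copy so the caller's dataframe is not modified
    df = df.reset_index(drop=True)
    # randomly select holdout_pct of the stations to leave out for a hold-out test set
    # to ensure none of the data are seen in training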
    train_indices, test_indices = dpf.train_test_split(df, holdout_pct)
    all_test_results = {}
    for bitrate in [4, 6, 8, 9, 10, 11, 12]:
        all_test_results[bitrate] = {}
        print(f'bitrate = {bitrate}')
        # set the target column
        target_column = f'h_{bitrate}_bits'
        input_attributes = []

        # add attribute groups successively
        for attribute_set, set_name in zip([climate, land_cover, terrain, soil], attribute_set_names):
            print(f'  Processing {set_name} attribute set.')
            input_attributes += attribute_set
            input_data = df[input_attributes + [target_column]].copy()

            trial_df, test_df = dpf.run_xgb_CV_trials(
                set_name, input_attributes, target_column, 
                input_data, train_indices, test_indices, n_optimization_rounds, 
                nfolds, n_boost_rounds, results_folder
            )
            
            test_rmse = root_mean_squared_error(test_df['actual'], test_df['predicted'])
            test_mae = mean_absolute_error(test_df['actual'], test_df['predicted'])

            print(f'    Held-out test rmse: {test_rmse:.2f}, mae: {test_mae:.2f}')
            print('')
            # store the test set predictions and actuals
            all_test_results[bitrate][set_name] = {
                'trials': trial_df, 'test_df': test_df,
                'test_mae': test_mae, 'test_rmse': test_rmse} 
    return all_test_results

## Set Trial Parameters
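The parameters in this section control how stations are partitioned: `holdout_pct` reserves a share of stations for final testing and `nfolds` divides the remainder for cross-validation. The sketch below shows one way such a partition could work. The internals of `dpf.train_test_split` and `dpf.run_xgb_CV_trials` are not shown here, so the function `holdout_and_folds`, its seed, and the station count of 1000 are hypothetical and illustrate the general pattern rather than the actual implementation.

```python
import random


def holdout_and_folds(n_stations, holdout_pct, nfolds, seed=42):
    """Hypothetical sketch: shuffle station indices, reserve a hold-out
    test set, and deal the remainder into `nfolds` cross-validation folds."""
    rng = random.Random(seed)
    indices = list(range(n_stations))
    rng.shuffle(indices)
    n_test = int(round(holdout_pct * n_stations))
    test_idx = indices[:n_test]
    train_idx = indices[n_test:]
    # fold k takes every nfolds-th training index, so fold sizes differ by at most 1
    folds = [train_idx[k::nfolds] for k in range(nfolds)]
    return train_idx, test_idx, folds


train_idx, test_idx, folds = holdout_and_folds(1000, holdout_pct=0.10, nfolds=5)
print(len(train_idx), len(test_idx), [len(f) for f in folds])
```

Splitting by station before any cross-validation keeps the hold-out stations entirely out of hyperparameter selection, which is the property the hold-out test relies on.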

In [None]:
# define the amount of data to set aside for final testing
holdout_pct = 0.10
nfolds = 5
n_boost_rounds = 2000
n_optimization_rounds = 20

all_test_results = {}
attribute_set_names = ['climate', '+land_cover', '+terrain', '+soil']


## Run XGBoost Models

In [None]:
test_results_fname = 'Entropy_prediction_results.npy'
test_results_fpath = os.path.join(results_folder, test_results_fname)
if os.path.exists(test_results_fpath):
    all_test_results = np.load(test_results_fpath, allow_pickle=True).item()
else:
    all_test_results = predict_entropy_from_attributes(df, holdout_pct, results_folder)
    np.save(test_results_fpath, all_test_results)

## View Results

In [None]:
plots = []
for b, set_dict in all_test_results.items():
    attribute_sets = list(set_dict.keys())

    y1 = [set_dict[e]['test_rmse'] for e in attribute_sets]
    y2 = [set_dict[e]['test_mae'] for e in attribute_sets]
    
    source = ColumnDataSource({'x': attribute_sets, 'y1': y1, 'y2': y2})
    
    title = f'{b} bits'
    if len(plots) == 0:
        fig = figure(title=title, x_range=attribute_sets)
    else:
        fig = figure(title=title, x_range=attribute_sets, y_range=plots[0].y_range)
    fig.line('x', 'y1', legend_label='rmse', color='green', source=source, line_width=3)
    fig.line('x', 'y2', legend_label='mae', color='dodgerblue', source=source, line_width=3)
    fig.legend.background_fill_alpha = 0.6
    # both RMSE and MAE are plotted, so label the axis generically
    fig.yaxis.axis_label = 'Error [bits/sample]'
    
    result_df = pd.DataFrame({'set': attribute_sets, 'rmse': y1, 'mae': y2})
    best_rmse_idx = result_df['rmse'].idxmin()
    best_mae_idx = result_df['mae'].idxmin()
    best_rmse_set = result_df.loc[best_rmse_idx, 'set']
    best_mae_set = result_df.loc[best_mae_idx, 'set']
    best_result = set_dict[best_rmse_set]['test_df']
    
    xx, yy = best_result['actual'], best_result['predicted']
    slope, intercept, r, p, se = linregress(xx, yy)
    
    sfig = figure(title=f'Test: {b} bits best model {best_rmse_set} (N={len(best_result)})')
    sfig.scatter(xx, yy, size=3, alpha=0.8)
    xpred = np.linspace(min(xx), max(xx), 100)
    ybf = [slope * e + intercept for e in xpred]
    sfig.line(xpred, ybf, color='red', line_width=3, line_dash='dashed', legend_label=f'R²={r**2:.2f}') 
    # plot a 1:1 line spanning the range of both actual and predicted values
    lims = [min(min(xx), min(yy)), max(max(xx), max(yy))]
    sfig.line(lims, lims, color='black', line_dash='dotted', 
              line_width=2, legend_label='1:1')
    sfig.xaxis.axis_label = 'Actual H [bits/sample]'
    sfig.yaxis.axis_label = 'Predicted H [bits/sample]'
    sfig.legend.location = 'top_left'
    
    plots.append(fig)
    plots.append(sfig)

In [None]:
layout = gridplot(plots, ncols=2, width=350, height=300)
show(layout)

## Discussion

Model performance metrics can't be compared directly across dictionary sizes because the distribution of the target variable differs and its scale changes as a function of dictionary size.  However, looking at the $R^2$ of the predicted vs. "actual" entropy plots (right column), it seems that the model works best at 10 bits, corresponding to a dictionary size of 1024 unique symbols (states).
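
The change of scale with dictionary size follows from the upper bound on entropy. As a brief worked statement, for a dictionary of $2^b$ symbols with probabilities $p_i$ (the symbols $b$ and $p_i$ are notation introduced here for illustration):

$$
H = -\sum_{i=1}^{2^b} p_i \log_2 p_i, \qquad 0 \le H \le \log_2 2^b = b
$$

The attainable range of the target therefore grows with $b$, so the same RMSE in bits per sample represents a smaller share of that range at 12 bits than at 4 bits.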



## Citations

```{bibliography}
:filter: docname in docnames
```