# Predict (Shannon) Entropy

## Introduction

In the previous section, we showed that mean annual discharge could be predicted well from catchment attributes.  In this step we aim to predict the (Shannon) entropy (H) of streamflow, a measure of the randomness of river systems, from catchment attributes.  Because the entropy of the distribution does not embody one specific process, it does not fit conventional classifications of hydrological signatures.  Because it encompasses the entire distribution, however, it can be interpreted as an aggregate representation of the complex interactions of the hydrologic cycle.

In the data preprocessing, we computed the entropy of the distribution of each individual streamflow time series in bits per sample.  We'll now use an ensemble decision tree method called XGBoost (eXtreme Gradient Boosting) {cite}`chen2016xgboost` to see if the entropy (or uncertainty) of a distribution can be predicted from catchment attributes.  The dictionary size (number of quantization levels, $2^b$ for a bitrate of $b$ bits) is varied to test whether the model can exploit the additional information in a more finely resolved distribution.  Input features are added in successive model tests to compare the contributions of catchment attribute groups related to climate, terrain, land cover, and soil.
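
The preprocessing code is not reproduced in this notebook, so the following is only a minimal sketch of how an entropy in bits per sample might be computed for a given bitrate.  The function name `quantized_entropy` is hypothetical, and the equal-width binning over the observed range is an assumption; the actual preprocessing may quantize differently.

```python
import numpy as np


def quantized_entropy(values, bitrate):
    """Shannon entropy (bits/sample) of a series quantized into 2**bitrate bins.

    Assumption: equal-width bins spanning the observed range. The
    preprocessing step may use a different quantization scheme.
    """
    values = np.asarray(values, dtype=float)
    n_symbols = 2 ** bitrate  # dictionary size
    counts, _ = np.histogram(values, bins=n_symbols)
    probs = counts[counts > 0] / counts.sum()  # skip empty bins (0 * log 0 := 0)
    return float(-np.sum(probs * np.log2(probs)))


# synthetic, right-skewed "streamflow" for illustration only
rng = np.random.default_rng(42)
flow = rng.lognormal(mean=1.0, sigma=1.0, size=10_000)

for b in [4, 6, 8, 10, 12]:
    print(f'{b:>2} bits: H = {quantized_entropy(flow, b):.2f} bits/sample')
```

Under this scheme the entropy can never exceed the bitrate, and it is also bounded by the base-2 logarithm of the number of samples, so the values level off once the dictionary is large relative to the record length.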

In [1]:
import os
import pandas as pd
import numpy as np

import xgboost as xgb
from sklearn.metrics import (
    root_mean_squared_error,
    mean_absolute_error,
    roc_auc_score,
    accuracy_score,
)

from scipy.stats import linregress

import data_processing_functions as dpf

from bokeh.plotting import figure, show
from bokeh.layouts import gridplot
from bokeh.models import ColumnDataSource
from bokeh.io import output_notebook
from bokeh.palettes import Sunset10, Vibrant7


output_notebook()

BASE_DIR = os.getcwd()

## Load Input Data

In [2]:
# load the catchment characteristics
attributes_filename = 'BCUB_watershed_attributes_updated.csv'
df = pd.read_csv(os.path.join(BASE_DIR, 'data', 'processed_divergence_inputs', attributes_filename))
df.columns = [c.lower() for c in df.columns]

Subdivide the attributes into related classes: terrain, land cover, soil, climate.

In [3]:
print(df.columns.tolist())

['region', 'official_id', 'drainage_area_km2', 'centroid_lon_deg_e', 'centroid_lat_deg_n', 'logk_ice_x100', 'porosity_x100', 'land_use_forest_frac_2010', 'land_use_shrubs_frac_2010', 'land_use_grass_frac_2010', 'land_use_wetland_frac_2010', 'land_use_crops_frac_2010', 'land_use_urban_frac_2010', 'land_use_water_frac_2010', 'land_use_snow_ice_frac_2010', 'lulc_check_2010', 'land_use_forest_frac_2015', 'land_use_shrubs_frac_2015', 'land_use_grass_frac_2015', 'land_use_wetland_frac_2015', 'land_use_crops_frac_2015', 'land_use_urban_frac_2015', 'land_use_water_frac_2015', 'land_use_snow_ice_frac_2015', 'lulc_check_2015', 'land_use_forest_frac_2020', 'land_use_shrubs_frac_2020', 'land_use_grass_frac_2020', 'land_use_wetland_frac_2020', 'land_use_crops_frac_2020', 'land_use_urban_frac_2020', 'land_use_water_frac_2020', 'land_use_snow_ice_frac_2020', 'lulc_check_2020', 'slope_deg', 'aspect_deg', 'median_el', 'mean_el', 'max_el', 'min_el', 'elevation_m', 'prcp', 'tmin', 'tmax', 'vp', 'swe', 's

## Define Attribute Groups

In [4]:
terrain = ['drainage_area_km2', 'elevation_m', 'slope_deg', 'aspect_deg']
land_cover = [
    'land_use_forest_frac_2010', 'land_use_grass_frac_2010', 'land_use_wetland_frac_2010', 'land_use_water_frac_2010', 
    'land_use_urban_frac_2010', 'land_use_shrubs_frac_2010', 'land_use_crops_frac_2010', 'land_use_snow_ice_frac_2010']
soil = ['logk_ice_x100', 'porosity_x100']
climate = ['prcp', 'srad', 'swe', 'tmax', 'tmin', 'vp', 'high_prcp_freq', 'high_prcp_duration', 'low_prcp_freq', 'low_prcp_duration']
all_attributes = terrain + land_cover + soil + climate
# confirm every selected attribute exists in the dataframe
missing = [c for c in all_attributes if c not in df.columns]
assert not missing, f'missing attribute columns: {missing}'

attribute_set_dict = {
    'climate': climate, 
    '+land_cover': land_cover,
    '+terrain': terrain, 
    '+soil': soil,
}
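
The `+` prefix on the group names signals that groups are added cumulatively: each model sees its own group plus every group added before it.  The sketch below illustrates this with abbreviated, illustrative group contents; the names `groups`, `order` and `cumulative` are hypothetical.  Note that the order in which groups are added is set by a list of group names, not by the order of the dictionary keys.

```python
from itertools import accumulate

# abbreviated, illustrative group contents (not the full attribute lists)
groups = {
    'climate': ['prcp', 'tmax'],
    '+terrain': ['elevation_m', 'slope_deg'],
    '+land_cover': ['land_use_forest_frac_2010'],
    '+soil': ['porosity_x100'],
}
order = ['climate', '+terrain', '+land_cover', '+soil']

# each model sees its own group plus every group that came before it
cumulative = dict(zip(order, accumulate(groups[name] for name in order)))

for name, features in cumulative.items():
    print(f'{name:<12} {len(features)} features: {features}')
```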

In [5]:
results_folder = os.path.join(BASE_DIR, 'data', 'entropy_prediction_results')
if not os.path.exists(results_folder):
    os.makedirs(results_folder)

In [6]:
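def predict_entropy_from_attributes(df, train_indices, test_indices, attribute_set_names, results_folder):
    """Train and test entropy models for each bitrate, adding attribute groups cumulatively.

    Relies on the notebook-level variables attribute_set_dict, nfolds,
    n_boost_rounds and n_optimization_rounds being defined before it is called.
    """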
    results = {}
    for bitrate in [4, 6, 8, 9, 10, 11, 12]:
        results[bitrate] = {}
        print(f'bitrate = {bitrate}')
        # set the target column
        target_column = f'h_{bitrate}_bits'
        input_attributes = []

        loss = 'reg:squarederror'

        # add attribute groups successively
        for group in attribute_set_names:
            print(f'  Processing {group} attribute set.')
            group_attributes = attribute_set_dict[group]
            input_attributes += group_attributes
            input_data = df[['official_id'] + input_attributes + [target_column]].copy()

            trial_df, test_df = dpf.run_xgb_CV_trials(
                group, input_attributes, target_column, 
                input_data, train_indices, test_indices, n_optimization_rounds, 
                nfolds, n_boost_rounds, results_folder, loss
            )
            
            test_rmse = root_mean_squared_error(test_df['actual'], test_df['predicted'])
            test_mae = mean_absolute_error(test_df['actual'], test_df['predicted'])

            print(f'    Held-out test rmse: {test_rmse:.2f}, mae: {test_mae:.2f}')
            print('')
            # store the test set predictions and actuals
            results[bitrate][group] = {
                'trials': trial_df, 'test_df': test_df,
                'test_mae': test_mae, 'test_rmse': test_rmse} 
    return results

## Set Trial Parameters

In [7]:
# define the amount of data to set aside for final testing
holdout_pct = 0.10
nfolds = 10
n_boost_rounds = 2500
n_optimization_rounds = 25

# attribute_set_names = ['climate', '+land_cover', '+terrain', '+soil']
attribute_set_names = ['climate', '+terrain', '+land_cover', '+soil']
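
To make the roles of `holdout_pct` and `nfolds` concrete, here is a rough sketch of one way a station list could be divided into a held-out test set and cross-validation folds.  The internals of `dpf.train_test_split` and `dpf.run_xgb_CV_trials` are not shown in this notebook, so the function `split_holdout_and_folds`, its `seed` argument and the station IDs are all assumptions for illustration.

```python
import numpy as np


def split_holdout_and_folds(ids, holdout_pct, nfolds, seed=0):
    """Shuffle ids, reserve a holdout set, split the remainder into folds."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(np.asarray(ids))
    n_test = int(round(holdout_pct * len(shuffled)))
    test_ids, train_ids = shuffled[:n_test], shuffled[n_test:]
    folds = np.array_split(train_ids, nfolds)  # fold sizes differ by at most 1
    return train_ids, test_ids, folds


station_ids = [f'stn_{i:04d}' for i in range(1000)]  # made-up IDs
train_ids, test_ids, folds = split_holdout_and_folds(station_ids, 0.10, 10)
print(len(train_ids), len(test_ids), [len(f) for f in folds])
```

With a fixed seed the same stations land in the test set on every run, which is the property the notebook depends on when it compares results across target variables.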

In [8]:
# reset the index to ensure the split is done correctly
df.reset_index(drop=True, inplace=True)
# dpf.train_test_split() sets np.random.seed internally, so it yields
# identical test indices in every notebook that uses it.  The call is
# disabled here because the split is instead recovered from the saved
# runoff prediction results (see the cell that loads them below).
# training_stns, test_stns = dpf.train_test_split(df, holdout_pct)

In [9]:
df.head()

| | region | official_id | drainage_area_km2 | centroid_lon_deg_e | centroid_lat_deg_n | logk_ice_x100 | porosity_x100 | land_use_forest_frac_2010 | land_use_shrubs_frac_2010 | land_use_grass_frac_2010 | ... | low_prcp_freq | high_prcp_duration | high_prcp_freq | h_4_bits | h_6_bits | h_8_bits | h_9_bits | h_10_bits | h_11_bits | h_12_bits |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 08A | 08AA004 | 695.3 | 60.490103 | -137.436239 | -1338.24 | 11.70636 | 0.3 | 0.11 | 0.47 | ... | 0.7 | 1.0 | 0.1 | 3.099421 | 5.289804 | 7.411504 | 8.213345 | 8.444623 | 8.468595 | 8.468595 |
| 1 | 08A | 08AA008 | 1226.8 | 61.446754 | -137.752162 | -1108.36 | 0.65119 | 0.2 | 0.21 | 0.51 | ... | 0.8 | 1.0 | 0.1 | 2.749133 | 4.336423 | 5.9298 | 6.354848 | 7.534496 | 8.376749 | 9.05972 |
| 2 | 08A | 08AA009 | 184.6 | 61.211074 | -136.863365 | -1068.02 | 0.05309 | 0.27 | 0.23 | 0.44 | ... | 0.8 | 1.0 | 0.1 | 2.83036 | 3.947816 | 4.843135 | 5.983963 | 7.013831 | 7.917109 | 8.405969 |
| 3 | 08A | 08AB001 | 15352.7 | 60.807985 | -137.677689 | -1324.77 | 2.82491 | 0.33 | 0.13 | 0.33 | ... | 0.7 | 1.0 | 0.1 | 2.801065 | 3.548239 | 6.045037 | 7.720094 | 9.006204 | 9.874433 | 10.134966 |
| 4 | 08A | 08AB002 | 27320.5 | 60.373054 | -137.607298 | -1327.49 | 3.90509 | 0.28 | 0.12 | 0.33 | ... | 0.7 | 1.0 | 0.1 | 2.907379 | 4.607341 | 6.595192 | 8.020174 | 8.79604 | 8.892634 | 8.895544 |


In [10]:
# retrieve the same training and test split stations from the previous 
# analysis to check whether the predictive performance is correlated 
# across target variables
runoff_results_folder = os.path.join(BASE_DIR, 'data', 'runoff_prediction_results')
runoff_test_file = f'Mean_runoff_prediction_results_{"".join(attribute_set_names)}.npy'
runoff_test_data = np.load(os.path.join(runoff_results_folder, runoff_test_file), allow_pickle=True).item()

runoff_test_df = runoff_test_data['+soil']['test_df']
test_stns = runoff_test_df['official_id'].values
training_stns = [e for e in df['official_id'].values if e not in test_stns]

## Run XGBoost Models

In [11]:
test_results_fname = 'Entropy_prediction_results.npy'
test_results_fpath = os.path.join(results_folder, test_results_fname)

if os.path.exists(test_results_fpath):
    test_results = np.load(test_results_fpath, allow_pickle=True).item()
else:
    test_results = predict_entropy_from_attributes(df, training_stns, test_stns, attribute_set_names, results_folder)
    np.save(test_results_fpath, test_results)

## View Results

In [26]:
plots = []
scatter_plots = []
for b, set_dict in test_results.items():
    attribute_sets = list(set_dict.keys())

    y1 = [set_dict[e]['test_rmse'] for e in attribute_sets]
    y2 = [set_dict[e]['test_mae'] for e in attribute_sets]
    
    source = ColumnDataSource({'x': attribute_sets, 'y1': y1, 'y2': y2})
    
    title = f'{b} bits'
    if len(plots) == 0:
        fig = figure(title=title, x_range=attribute_sets)
    else:
        fig = figure(title=title, x_range=attribute_sets, y_range=plots[0].y_range)
    fig.line('x', 'y1', legend_label='rmse', color='green', source=source, line_width=3)
    fig.line('x', 'y2', legend_label='mae', color='dodgerblue', source=source, line_width=3)
    fig.legend.background_fill_alpha = 0.6
    fig.yaxis.axis_label = 'Error'
    
    result_df = pd.DataFrame({'set': attribute_sets, 'rmse': y1, 'mae': y2})
    best_rmse_idx = result_df['rmse'].idxmin()
    best_mae_idx = result_df['mae'].idxmin()
    best_rmse_set = result_df.loc[best_rmse_idx, 'set']
    best_mae_set = result_df.loc[best_mae_idx, 'set']
    best_result = set_dict[best_rmse_set]['test_df']
    
    xx, yy = best_result['actual'], best_result['predicted']
    slope, intercept, r, p, se = linregress(xx, yy)
    
    sfig = figure(title=f'{b} bits quantization')# bits best model {best_rmse_set} (N={len(best_result)})')
    sfig.scatter(xx, yy, size=3, alpha=0.8)
    xpred = np.linspace(min(xx), max(xx), 100)
    ybf = [slope * e + intercept for e in xpred]
    sfig.line(xpred, ybf, color='red', line_width=3, line_dash='dashed', legend_label=f'R²={r**2:.2f}') 
    # plot a 1:1 line
    sfig.line([min(yy), max(yy)], [min(yy), max(yy)], color='black', line_dash='dotted', 
              line_width=2, legend_label='1:1')
    sfig.xaxis.axis_label = 'Actual H [bits/sample]'
    sfig.yaxis.axis_label = 'Predicted H [bits/sample]'
    sfig.legend.location = 'top_left'
    sfig.legend.background_fill_alpha = 0.6
    if b % 2 == 0:
        scatter_plots.append(sfig)
    
    plots.append(fig)
    plots.append(sfig)

In [23]:
layout = gridplot(plots, ncols=2, width=350, height=300)
show(layout)

In [27]:
scatter_layout = gridplot(scatter_plots, ncols=5, width=250, height=225)
show(scatter_layout)

## Discussion

Model performance metrics can't be compared directly across dictionary sizes because the distribution of the target variable differs, and its scale changes as a function of dictionary size.  However, looking at the $R^2$ of the predicted vs. "actual" entropy plots (right column), it seems that the model works best at 10 bits, corresponding to a dictionary size of 1024 unique symbols (states).
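
The scale dependence can be illustrated with synthetic numbers.  In the sketch below, which does not use the model results, the targets and predictions are multiplied by the same factor: the RMSE grows by that factor while the squared correlation is unchanged.  The arrays, the `scale` value and the helper names `rmse` and `r_squared` are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
actual = rng.uniform(2.0, 4.0, size=200)             # synthetic "entropy" targets
predicted = actual + rng.normal(0.0, 0.3, size=200)  # synthetic predictions


def rmse(a, p):
    """Root mean squared error, in the units of the target."""
    return float(np.sqrt(np.mean((np.asarray(p) - np.asarray(a)) ** 2)))


def r_squared(a, p):
    """Squared Pearson correlation, as shown in the scatter plot legends."""
    return float(np.corrcoef(a, p)[0, 1] ** 2)


scale = 2.5  # stand-in for the wider range of H at a larger dictionary size
print(f'rmse: {rmse(actual, predicted):.3f} -> {rmse(scale * actual, scale * predicted):.3f}')
print(f'R^2 : {r_squared(actual, predicted):.3f} -> {r_squared(scale * actual, scale * predicted):.3f}')
```

The squared correlation removes the effect of scale only.  It does not account for differences in the shape of the target distribution, so comparisons across dictionary sizes remain tentative.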

An important question about the model results, or the unexplained variance in the predictability of entropy, is whether the residuals correlate with the predictability of other, independent signatures.  Below we look at two things: first, the correlation between mean runoff and entropy, and second, the correlation of residuals between the two models.  Are the same catchments difficult to predict regardless of the target variable, and likewise are the same catchments predicted well?

In [14]:
# use n = 1 to match the ordering used in the mean runoff prediction notebook for group_1
runoff_results_fname = f'Mean_runoff_prediction_results_{"".join(attribute_set_names)}.npy'
runoff_results_folder = os.path.join(BASE_DIR, 'data', 'runoff_prediction_results')
runoff_results_fpath = os.path.join(runoff_results_folder, runoff_results_fname)
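if os.path.exists(runoff_results_fpath):
    runoff_test_results = np.load(runoff_results_fpath, allow_pickle=True).item()
else:
    # fail here rather than with a NameError in the next cell
    raise FileNotFoundError(f'results not found: {runoff_results_fpath}')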

In [15]:
resid_plots = []

for b in test_results.keys():
    print(f' bitrate = {b}')
    attr_groups = test_results[b].keys()
    # for group in attr_groups:
    group = '+soil'
    rdf = runoff_test_results[group]['test_df']
    rdf['residuals'] = rdf['predicted'] - rdf['actual']

    # define hdf before the station ID check that uses it
    hdf = test_results[b][group]['test_df']
    hdf['residuals'] = hdf['predicted'] - hdf['actual']

    # both test sets must contain the same stations
    assert (sorted(rdf['official_id']) == sorted(hdf['official_id']))

    # NOTE: residuals are paired by row position, which assumes both
    # test_df frames list the stations in the same order
    xx, yy = rdf['residuals'], hdf['residuals']
    slope, intercept, r, p, se = linregress(xx, yy)
    
    sfig = figure(title=f'Runoff & Entropy Prediction Residuals Correlation Test ({b} bits)', width=600, height=400)
    sfig.scatter(xx, yy, size=3, alpha=0.8)
    xpred = np.linspace(min(xx), max(xx), 100)
    ybf = [slope * e + intercept for e in xpred]
    sfig.line(xpred, ybf, color='red', line_width=3, line_dash='dashed', legend_label=f'R²={r**2:.2f}')
    # plot a 1:1 line
    sfig.line([min(yy), max(yy)], [min(yy), max(yy)], color='black', line_dash='dotted', 
              line_width=2, legend_label='1:1')
    sfig.xaxis.axis_label = 'Runoff Prediction Residuals [mm/day]'
    sfig.yaxis.axis_label = 'Entropy Prediction Residuals [bits/sample]'
    sfig.legend.location = 'top_left'
    
    resid_plots.append(sfig)

In [None]:
resid_layout = gridplot(resid_plots, ncols=2, width=500, height=350)
show(resid_layout)

The lack of correlation between the mean runoff and entropy prediction residuals suggests that the two target variables are independent.  Mean runoff captures a central tendency, while entropy captures the variability of the system.  Alternatively, the model could simply be limited in its capacity to capture the interdependence between the variability of the system and its average behaviour.
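
A residual comparison of this kind is only meaningful if the residuals are paired by station.  The sketch below shows one way to enforce that by merging the two test sets on the station ID before correlating.  The function `residual_correlation` and the small data frames are hypothetical, and the column names `actual` and `predicted` are assumed to match those of the `test_df` frames used above.

```python
import numpy as np
import pandas as pd


def residual_correlation(df_a, df_b, id_col='official_id'):
    """Pearson r between the residuals of two models, matched on station ID."""
    merged = df_a.merge(df_b, on=id_col, suffixes=('_a', '_b'), validate='one_to_one')
    resid_a = merged['predicted_a'] - merged['actual_a']
    resid_b = merged['predicted_b'] - merged['actual_b']
    return float(np.corrcoef(resid_a, resid_b)[0, 1]), len(merged)


# made-up values; the second frame is deliberately in a different row order
runoff = pd.DataFrame({
    'official_id': ['A', 'B', 'C', 'D'],
    'actual': [1.0, 2.0, 3.0, 4.0],
    'predicted': [1.5, 1.0, 3.5, 6.0],
})
entropy = pd.DataFrame({
    'official_id': ['D', 'C', 'B', 'A'],
    'actual': [8.0, 7.0, 6.0, 5.0],
    'predicted': [9.0, 7.25, 5.5, 5.25],
})

r, n = residual_correlation(runoff, entropy)
print(f'r = {r:.2f} over {n} matched stations')
```

Merging on the ID makes the pairing explicit, and `validate='one_to_one'` raises an error if a station appears more than once in either frame.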

## Citations

```{bibliography}
:filter: docname in docnames
```