# Predictability of Mean Runoff

## Introduction

During data preprocessing, we computed the entropy of the distribution of each streamflow time series in bits per sample. Here we use gradient boosting, an ensemble decision tree method, through the XGBoost (eXtreme Gradient Boosting) {cite}`chen2016xgboost` library to test whether mean runoff can be predicted from catchment attributes, as was shown in {cite}`addor2018ranking`. Input features are added one attribute group at a time (climate, terrain, land cover, soil) in successive model tests to compare the contribution of each group.
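
To make the term "gradient boosting" concrete, the sketch below fits a sequence of one-split trees (stumps) to the residuals of a squared-error model on synthetic data. It is a from-scratch illustration only, not how XGBoost is implemented; XGBoost adds regularization, second-order gradient information and deeper trees, and all function names here are hypothetical.

```python
import numpy as np

def fit_stump(x, residual):
    """Find the single-feature threshold split that minimizes squared error."""
    best = None
    for j in range(x.shape[1]):
        values = np.unique(x[:, j])
        # candidate thresholds: midpoints between consecutive unique values
        for t in (values[:-1] + values[1:]) / 2:
            left = x[:, j] <= t
            left_mean = residual[left].mean()
            right_mean = residual[~left].mean()
            sse = (((residual[left] - left_mean) ** 2).sum()
                   + ((residual[~left] - right_mean) ** 2).sum())
            if best is None or sse < best[0]:
                best = (sse, j, t, left_mean, right_mean)
    return best[1:]

def predict_stump(stump, x):
    j, t, left_mean, right_mean = stump
    return np.where(x[:, j] <= t, left_mean, right_mean)

def boost(x, y, n_rounds=50, learning_rate=0.1):
    """Start from the mean, then repeatedly fit a stump to the residuals."""
    base = y.mean()
    pred = np.full(len(y), base)
    stumps = []
    for _ in range(n_rounds):
        stump = fit_stump(x, y - pred)
        pred = pred + learning_rate * predict_stump(stump, x)
        stumps.append(stump)
    return base, stumps, pred

# synthetic data: a step in feature 0 plus a linear trend in feature 1
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=(200, 2))
y = 3.0 * (x[:, 0] > 0.5) + x[:, 1] + rng.normal(0, 0.1, size=200)

base, stumps, pred = boost(x, y)
rmse_start = np.sqrt(np.mean((y - base) ** 2))
rmse_end = np.sqrt(np.mean((y - pred) ** 2))
print(f'training RMSE: {rmse_start:.2f} -> {rmse_end:.2f}')
```

Each round can only lower the training error, which is why the notebook relies on cross validation and a held-out set rather than training error to judge the models.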

In [7]:
import os
import pandas as pd
import numpy as np

from bokeh.plotting import figure, show
from bokeh.layouts import gridplot
from bokeh.models import ColumnDataSource
from bokeh.io import output_notebook
from bokeh.palettes import Sunset10, Vibrant7

import xgboost as xgb
from sklearn.metrics import (
    root_mean_squared_error,
    mean_absolute_error,
    roc_auc_score,
    accuracy_score,
)

import data_processing_functions as dpf

from scipy.stats import linregress
output_notebook()

BASE_DIR = os.getcwd()

## Load Input Data

In [8]:
# load the catchment characteristics
attributes_filename = 'BCUB_watershed_attributes_updated.csv'
df = pd.read_csv(os.path.join(BASE_DIR, 'data', attributes_filename))
df.columns = [c.lower() for c in df.columns]

Compute the mean runoff for each streamflow timeseries.

In [9]:
if 'mean_runoff' not in df.columns:
    for i, row in df.iterrows():
        mean_runoff = dpf.compute_mean_runoff(row)
        df.loc[i, 'mean_runoff'] = mean_runoff
    df.to_csv(os.path.join(BASE_DIR, 'data', attributes_filename), index=False)

Subdivide the attributes into related classes: terrain, land cover, soil, climate.

In [10]:
# list all the attributes in the input dataframe
print(df.columns.tolist())

['region', 'official_id', 'drainage_area_km2', 'centroid_lon_deg_e', 'centroid_lat_deg_n', 'logk_ice_x100', 'porosity_x100', 'land_use_forest_frac_2010', 'land_use_shrubs_frac_2010', 'land_use_grass_frac_2010', 'land_use_wetland_frac_2010', 'land_use_crops_frac_2010', 'land_use_urban_frac_2010', 'land_use_water_frac_2010', 'land_use_snow_ice_frac_2010', 'lulc_check_2010', 'land_use_forest_frac_2015', 'land_use_shrubs_frac_2015', 'land_use_grass_frac_2015', 'land_use_wetland_frac_2015', 'land_use_crops_frac_2015', 'land_use_urban_frac_2015', 'land_use_water_frac_2015', 'land_use_snow_ice_frac_2015', 'lulc_check_2015', 'land_use_forest_frac_2020', 'land_use_shrubs_frac_2020', 'land_use_grass_frac_2020', 'land_use_wetland_frac_2020', 'land_use_crops_frac_2020', 'land_use_urban_frac_2020', 'land_use_water_frac_2020', 'land_use_snow_ice_frac_2020', 'lulc_check_2020', 'slope_deg', 'aspect_deg', 'median_el', 'mean_el', 'max_el', 'min_el', 'elevation_m', 'prcp', 'tmin', 'tmax', 'vp', 'swe', 's

## Define attribute groups

In [11]:
terrain = ['drainage_area_km2', 'elevation_m', 'slope_deg', 'aspect_deg']
land_cover = [
    'land_use_forest_frac_2010', 'land_use_grass_frac_2010', 'land_use_wetland_frac_2010', 'land_use_water_frac_2010', 
    'land_use_urban_frac_2010', 'land_use_shrubs_frac_2010', 'land_use_crops_frac_2010', 'land_use_snow_ice_frac_2010']
climate = ['prcp', 'srad', 'swe', 'tmax', 'tmin', 'vp', 'high_prcp_freq', 'high_prcp_duration', 'low_prcp_freq', 'low_prcp_duration']
soil = ['logk_ice_x100', 'porosity_x100']
all_attributes = terrain + land_cover + soil + climate
len(all_attributes)

24

In [12]:
assert len([c for c in all_attributes if c not in df.columns]) == 0

In [13]:
results_folder = os.path.join(BASE_DIR, 'data', 'runoff_prediction_results')
if not os.path.exists(results_folder):
    os.makedirs(results_folder)

In [14]:
def predict_runoff_from_attributes(df, train_indices, test_indices, group_order, results_folder):
        
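    # set the target column
    target_column = 'mean_runoff'
    test_attributes = []
    # collect results in a local dict so that separate calls (one per
    # ordering) do not overwrite each other's entries in a shared global
    all_test_results = {}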

    # add attribute groups successively
    for set_name in group_order:
        attribute_set = attribute_set_dict[set_name]
        print(f' Processing {set_name} attribute set')
        test_attributes += attribute_set
        input_data = df[test_attributes + [target_column]].copy()

        # run the XGBoost model with cross validation and test on holdout set
        trial_df, test_df = dpf.run_xgb_CV_trials(
            set_name, test_attributes, target_column, input_data, train_indices, 
            test_indices, n_optimization_rounds, nfolds, n_boost_rounds, results_folder
        )

        test_rmse = root_mean_squared_error(test_df['actual'], test_df['predicted'])
        test_mae = mean_absolute_error(test_df['actual'], test_df['predicted'])

        print(f'  {set_name}')
        print(f'   held-out test rmse: {test_rmse:.2f}, mae: {test_mae:.2f}')
        print('')
        # store the test set predictions and actuals
        all_test_results[set_name] = {
            'trials': trial_df, 'test_df': test_df,
            'test_mae': test_mae, 'test_rmse': test_rmse,
        } 
    return all_test_results
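
Because `test_attributes` is extended inside the loop, each step trains on the union of all groups seen so far, so a label such as `+terrain` means "everything added before, plus terrain". A small sketch of that accumulation, using shortened, hypothetical attribute lists rather than the full ones defined above:

```python
# Hypothetical, shortened attribute lists for illustration only.
attribute_sets = {
    'climate': ['prcp', 'tmax'],
    '+terrain': ['slope_deg'],
    '+soil': ['porosity_x100'],
}

def cumulative_features(group_order, attribute_sets):
    """Return the feature list used at each step of a given ordering."""
    features, steps = [], {}
    for name in group_order:
        features = features + attribute_sets[name]  # new list at each step
        steps[name] = features
    return steps

steps = cumulative_features(['climate', '+terrain', '+soil'], attribute_sets)
for name, cols in steps.items():
    print(f'{name:>10}: {cols}')
```

The same label therefore refers to a different feature set in each ordering, which matters when comparing results across orderings.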

In [15]:
# define the amount of data to set aside for final testing
holdout_pct = 0.10
nfolds = 5
n_boost_rounds = 2000
n_optimization_rounds = 20

all_test_results = {}
attribute_set_dict = {
    'climate': climate, 
    '+land_cover': land_cover,
    '+terrain': terrain, 
    '+soil': soil,
}

## Set Attribute Groupings

In [16]:
group_1 = ['climate', '+terrain', '+land_cover', '+soil']
group_2 = group_1[::-1]
group_3 = ['+land_cover', '+terrain', '+soil', 'climate']
group_4 = ['+soil', 'climate', '+land_cover', '+terrain']
attribute_group_orders = [group_1, group_2, group_3, group_4]

## Run XGBoost Models

Separate the hold-out test set once, at the outset, so that every attribute group ordering is evaluated on the same hold-out set, even though each ordering necessarily goes through its own training optimization. Any outliers in the hold-out set are then the same across all orderings.

In [17]:
# reset the index to ensure the split is done correctly
df.reset_index(drop=True, inplace=True)
train_indices, test_indices = dpf.train_test_split(df, holdout_pct)
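
`dpf.train_test_split` lives in the project's helper module and is not reproduced here. As a rough sketch of the idea, assuming a simple random split without stratification (the function name `split_holdout` and the seed are illustrative, not the project's API):

```python
import numpy as np

def split_holdout(n_rows, holdout_pct, seed=42):
    """Randomly partition row positions 0..n_rows-1 into train and test arrays."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_rows)
    n_test = int(round(n_rows * holdout_pct))
    return np.sort(shuffled[n_test:]), np.sort(shuffled[:n_test])

train_idx, test_idx = split_holdout(n_rows=1000, holdout_pct=0.10)
print(len(train_idx), len(test_idx))
```

Computing the split once and reusing the returned indices keeps the held-out rows identical for every ordering.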

In [18]:
n = 0
group_results = {}
for group in attribute_group_orders:
    print(f'Processing: {group} ordering.')
    n += 1
    test_results_fname = f'Mean_runoff_prediction_results_{n}.npy'
    test_results_fpath = os.path.join(results_folder, test_results_fname)
    if os.path.exists(test_results_fpath):
        all_test_results = np.load(test_results_fpath, allow_pickle=True).item()
    else:
        all_test_results = predict_runoff_from_attributes(df, train_indices, test_indices, group, results_folder)
        np.save(test_results_fpath, all_test_results)
    
    group_results[n] = {'order': group, 'results': all_test_results}
    

Processing: ['climate', '+terrain', '+land_cover', '+soil'] ordering.
 Processing climate attribute set
    39.74 ± 6.195 RMSE mean on the test set (N=20)
  climate
   held-out test rmse: 154.67, mae: 53.76

 Processing +terrain attribute set
    14.40 ± 1.756 RMSE mean on the test set (N=20)
  +terrain
   held-out test rmse: 32.60, mae: 9.70

 Processing +land_cover attribute set
    15.18 ± 1.654 RMSE mean on the test set (N=20)
  +land_cover
   held-out test rmse: 37.15, mae: 11.22

 Processing +soil attribute set
    14.91 ± 1.542 RMSE mean on the test set (N=20)
  +soil
   held-out test rmse: 33.49, mae: 10.16

Processing: ['+soil', '+land_cover', '+terrain', 'climate'] ordering.
 Processing +soil attribute set
    46.60 ± 2.693 RMSE mean on the test set (N=20)
  +soil
   held-out test rmse: 120.54, mae: 43.39

 Processing +land_cover attribute set
    36.09 ± 3.438 RMSE mean on the test set (N=20)
  +land_cover
   held-out test rmse: 71.90, mae: 28.95

 Processing +terrain attrib

## View Results

In [19]:
def create_results_plots(all_test_results, attribute_sets):
    
    plots = []

    y1 = [all_test_results[e]['test_rmse'] for e in attribute_sets]
    y2 = [all_test_results[e]['test_mae'] for e in attribute_sets]

    source = ColumnDataSource({'x': attribute_sets, 'y1': y1, 'y2': y2})

    title = 'Runoff Predictability'

    # left panel: held-out error after each attribute group is added
    fig = figure(title=title, x_range=attribute_sets)
    fig.line('x', 'y1', legend_label='rmse', color='green', source=source, line_width=3)
    fig.line('x', 'y2', legend_label='mae', color='dodgerblue', source=source, line_width=3)
    fig.legend.background_fill_alpha = 0.6
    fig.yaxis.axis_label = 'Error (RMSE, MAE)'

    result_df = pd.DataFrame({'set': attribute_sets, 'rmse': y1, 'mae': y2})
    best_rmse_idx = result_df['rmse'].idxmin()
    best_mae_idx = result_df['mae'].idxmin()
    best_rmse_set = result_df.loc[best_rmse_idx, 'set']
    best_mae_set = result_df.loc[best_mae_idx, 'set']
    best_result = all_test_results[best_rmse_set]['test_df']

    xx, yy = best_result['actual'], best_result['predicted']
    slope, intercept, r, p, se = linregress(xx, yy)

    sfig = figure(title=f'Test: best model {best_rmse_set} (N={len(best_result)})',
                 )
    sfig.scatter(xx, yy, size=3, alpha=0.8)
    x_obs = np.linspace(min(xx), max(xx), 1000)
    ybf = [slope * e + intercept for e in x_obs]
    sfig.line(x_obs, ybf, color='red', line_width=3, line_dash='dashed', legend_label=f'R²={r**2:.2f}')    
    sfig.xaxis.axis_label = r'$$\text{Observed Mean} \left[ m^3 / s \right]$$'
    sfig.yaxis.axis_label = r'$$\text{Predicted Mean} \left[ m^3 / s \right]$$'
    sfig.legend.location = 'top_left'
    plots.append(fig)
    plots.append(sfig)
    
    # plot a 1:1 line spanning the observed and predicted ranges
    lim = max(max(xx), max(yy))
    sfig.line([0, lim], [0, lim], color='black', line_dash='dotted', 
              line_width=2, legend_label='1:1')
    
    return plots
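
The R² shown in the legend comes from a least-squares line fitted to predicted versus observed values, so it measures how linear the relationship is, not how close the points lie to the 1:1 line. A sketch with made-up numbers showing that the two can disagree:

```python
import numpy as np

observed = np.array([1.0, 2.0, 4.0, 8.0, 16.0])
# A hypothetical model that always predicts double the observed value.
predicted = 2.0 * observed

# Squared Pearson correlation, the same quantity as r**2 from linregress.
r_squared = np.corrcoef(observed, predicted)[0, 1] ** 2
rmse = np.sqrt(np.mean((predicted - observed) ** 2))

print(f'R^2 = {r_squared:.2f}, RMSE = {rmse:.2f}')
```

This is why the scatter plot draws the 1:1 line alongside the fitted line: a biased model can still have a high R².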

In [20]:
n = 1
grp_1_plots = create_results_plots(group_results[n]['results'], group_results[n]['order'])
layout = gridplot(grp_1_plots, ncols=2, width=350, height=300)
show(layout)
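
In the printed results the held-out RMSE is roughly three times the MAE (for example 154.67 versus 53.76 for climate alone). That pattern is consistent with a small number of catchments having very large errors, since RMSE weights large errors more heavily than MAE. A sketch with made-up errors, not taken from the dataset:

```python
import math

def rmse(errors):
    return math.sqrt(sum(e * e for e in errors) / len(errors))

def mae(errors):
    return sum(abs(e) for e in errors) / len(errors)

# Made-up prediction errors: ten equal misses, then one replaced by an outlier.
typical = [1.0] * 10
with_outlier = [1.0] * 9 + [50.0]

for label, errors in [('typical', typical), ('with outlier', with_outlier)]:
    print(f'{label:>12}: rmse={rmse(errors):.2f}, mae={mae(errors):.2f}')
```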

### Test the sensitivity to the order of attribute groups

In [21]:
n = 2
grp_2_plots = create_results_plots(group_results[n]['results'], group_results[n]['order'])
layout = gridplot(grp_2_plots, ncols=2, width=350, height=300)
show(layout)

In [22]:
n = 3
grp_3_plots = create_results_plots(group_results[n]['results'], group_results[n]['order'])
layout = gridplot(grp_3_plots, ncols=2, width=350, height=300)
show(layout)

In [23]:
n = 4
grp_4_plots = create_results_plots(group_results[n]['results'], group_results[n]['order'])
layout = gridplot(grp_4_plots, ncols=2, width=350, height=300)
show(layout)

### Test randomly permuted target values

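As a final check, randomly permute the `mean_runoff` column so that each catchment's attributes are paired with another catchment's runoff. If the model is learning a real relationship between attributes and runoff, its skill should collapse.

The predictive power decreases substantially across all groupings of input attributes: with the shuffled target, held-out RMSE stays between 76.83 and 85.68 and no longer improves as attribute groups are added.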

In [24]:
test_results_fname = 'Mean_runoff_prediction_results_shuffled_Y.npy'
test_results_fpath = os.path.join('data', test_results_fname)
# use one ordering for both the model runs and the stored label
# (after the loop above, the leftover variable `group` is this same list)
shuffled_order = attribute_group_orders[-1]
if os.path.exists(test_results_fpath):
    all_test_results = np.load(test_results_fpath, allow_pickle=True).item()
else:
    shuffled_df = df.copy()
    # permutation returns a shuffled copy, so the original df is not modified
    shuffled_df['mean_runoff'] = np.random.permutation(df['mean_runoff'].to_numpy())
    all_test_results = predict_runoff_from_attributes(shuffled_df, train_indices, test_indices, shuffled_order, results_folder)
    np.save(test_results_fpath, all_test_results)

group_results['shuffled'] = {'order': shuffled_order, 'results': all_test_results}

 Processing +soil attribute set
    51.03 ± 5.401 RMSE mean on the test set (N=20)
  +soil
   held-out test rmse: 78.79, mae: 40.91

 Processing climate attribute set
    51.20 ± 5.260 RMSE mean on the test set (N=20)
  climate
   held-out test rmse: 76.83, mae: 42.37

 Processing +land_cover attribute set
    51.14 ± 5.381 RMSE mean on the test set (N=20)
  +land_cover
   held-out test rmse: 85.68, mae: 46.12

 Processing +terrain attribute set
    51.14 ± 5.352 RMSE mean on the test set (N=20)
  +terrain
   held-out test rmse: 85.56, mae: 48.41
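
Shuffling the target breaks the pairing between each catchment's attributes and its runoff while leaving the distribution of runoff values unchanged, so any skill that survives cannot come from the attributes. A sketch on synthetic data (variable names are illustrative); it uses the NumPy generator's `permutation` method, which returns a shuffled copy and leaves its input untouched:

```python
import numpy as np

rng = np.random.default_rng(0)
feature = rng.normal(size=500)
target = 2.0 * feature + rng.normal(scale=0.5, size=500)

shuffled_target = rng.permutation(target)  # copy; `target` is unchanged

r_original = np.corrcoef(feature, target)[0, 1]
r_shuffled = np.corrcoef(feature, shuffled_target)[0, 1]
print(f'correlation before: {r_original:.2f}, after shuffling: {r_shuffled:.2f}')
```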



### View results of shuffled target variable (mean runoff)

In [25]:
shuffled_results = group_results['shuffled']['results']
group_order = group_results['shuffled']['order']
shuffled_runoff_plots = create_results_plots(shuffled_results, group_order)
layout = gridplot(shuffled_runoff_plots, ncols=2, width=350, height=300)
show(layout)

## Discussion

- Results differ with the order in which attribute groups are added, which suggests the attributes interact during model training.
- Across all orderings, the terrain attributes appear to be the strongest predictors; surprisingly, the climate attributes are not.
- Randomly permuting the target variable, `mean_runoff`, erases the predictive power: held-out error no longer improves as attribute groups are added.
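
One way to read the contribution of each group is the change in held-out RMSE when that group is added. The sketch below does this for the first ordering only, using the four values printed in the model output (the output for the other orderings is truncated on this page):

```python
# Held-out RMSE for ordering 1, copied from the printed model output.
order = ['climate', '+terrain', '+land_cover', '+soil']
test_rmse = [154.67, 32.60, 37.15, 33.49]

# Change in RMSE relative to the previous step (negative = improvement).
deltas = {
    name: round(curr - prev, 2)
    for name, prev, curr in zip(order[1:], test_rmse[:-1], test_rmse[1:])
}
for name, delta in deltas.items():
    print(f'{name:>12}: {delta:+.2f}')
```

These differences depend on what was already in the model when a group was added, so they describe one ordering rather than a group's importance in general.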

### Need to test sensitivity to hold-out set

## Citations

```{bibliography}
:filter: docname in docnames
```