# License

***

Copyright 2018-2019 Lingyao Meng (danielle@h2o.ai), J. Patrick Hall (phall@h2o.ai), and the H2O.ai team. 

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

**DISCLAIMER:** This notebook is not legal compliance advice.

# Partial Dependence, Individual Conditional Expectation, and Surrogate Models

#### Start H2O cluster


The `os` commands below check whether this notebook is being run on the Aquarium platform. 

In [1]:
import os

startup = '/home/h2o/bin/aquarium_startup'
if os.path.exists(startup):
    os.system(startup)
    local_url = 'http://localhost:54321/h2o'
    aquarium = True
    !sleep 5
else:
    local_url = 'http://localhost:54321'
    aquarium = False

#### Python imports
In general, NumPy and Pandas are used for data manipulation, and H2O and XGBoost are used for modeling.

In [2]:
# for handling external processes to generate PNG file of decision tree
import time
import sys
import re
import subprocess

# in-notebook display of decision tree
from IPython.display import Image
from IPython.display import display

# to generate synthetic data w/ known interactions
from data_maker_and_getter import DataMakerAndGetter

import h2o                                                        # Python API for h2o library and server 
from h2o.estimators.random_forest import H2ORandomForestEstimator # h2o, for single tree
from h2o.backend import H2OLocalServer                            # h2o, for plotting local tree in-notebook

import matplotlib.pyplot as plt                                   # basic plotting
%matplotlib inline

import numpy as np                                                # basic math and array and matrix handling 
import pandas as pd                                               # basic Dataframe handling
import xgboost as xgb                                             # for training gradient boosting machines 

# ignore irrelevant warnings
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

#### Start h2o
H2O is both a library and a server. The machine learning algorithms in the library use the multithreaded, distributed architecture provided by the server to train models efficiently. The API for the library was imported above in cell 2, but the server still needs to be started.

>The parameters used in `h2o.init` will depend on your specific environment. Regardless of how H2O is installed, if you start a cluster, you will need to ensure that it is shut down when you are done.

In [3]:
h2o.init(url=local_url, max_mem_size='2G')
h2o.remove_all()    # remove any existing data structures from h2o memory

Checking whether there is an H2O instance running at http://localhost:54321/h2o .

H2OResponseError: Server error water.exceptions.H2ONotFoundArgumentException:
  Error: Resource /h2o/3/Cloud not found
  Request: GET /3/Cloud


#### Dataset with Known Signal Generating Functions

Create a dataset from a known signal-generating function with added noise:

$$ y = x_1 * x_4 + |x_8| * (x_9)^2 + e $$

## Data Generation and XGBoost Training

Generate synthetic data with:
* 200,000 rows
* Binary target
* A single, known signal-generating function (see above)
* With noise via label switching
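
The internals of `DataMakerAndGetter` are not shown in this notebook, so the following is only a rough NumPy sketch of what a dataset with this signal could look like. The function name `make_signal_data`, the standard-normal inputs, the median threshold used to binarize the signal, and the 5% flip rate are all assumptions made for illustration, not the helper's actual behavior.

```python
import numpy as np

def make_signal_data(nrows=1000, ncols=12, flip_rate=0.05, seed=12345):
    """Hypothetical stand-in for DataMakerAndGetter (not the notebook's helper)."""
    rng = np.random.default_rng(seed)
    # columns 0..ncols-1 play the role of x_1..x_ncols (assumed standard normal)
    x = rng.standard_normal((nrows, ncols))
    # signal from the formula above: x_1 * x_4 + |x_8| * (x_9)^2
    signal = x[:, 0] * x[:, 3] + np.abs(x[:, 7]) * x[:, 8] ** 2
    # assumption: binarize at the median so the two classes are balanced
    target = (signal > np.median(signal)).astype(int)
    # assumption: label switching flips a random fraction of the labels
    flip = rng.random(nrows) < flip_rate
    noisy_target = np.where(flip, 1 - target, target)
    return x, signal, target, noisy_target

x, signal, target, noisy_target = make_signal_data()
print('share of positive labels:', noisy_target.mean())
print('share of flipped labels:', (target != noisy_target).mean())
```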

In [None]:
ds_ = DataMakerAndGetter(nrows=200000, target='binary', one_function=True, noise=True)
rson = ds_.make_random_with_signal()

#### Histogram of important variable in generated data

In [None]:
rson_pd = rson.as_data_frame()
_ = rson_pd['num9'].plot(kind='hist', bins=30, title='Histogram: num9')

#### Assign modeling roles

In [None]:
y = 'target'
X = [name for name in rson.columns if name not in [y,'row_id','function','cat1','cat2','cat3']]
print(X)

#### Split data into training and validation sets

In [None]:
rson[y] = rson[y].asfactor()
train, valid, _ = rson.split_frame([0.4, 0.3], seed = 12345)
print(train.shape)
print(valid.shape)
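
The ratios `[0.4, 0.3]` produce three frames: training, validation, and a remainder of roughly 30% that is discarded here. H2O's splitting is probabilistic, so the resulting sizes are approximate rather than exact. The sketch below imitates that idea in NumPy; `probabilistic_split` is a hypothetical name and this is not H2O's implementation.

```python
import numpy as np

def probabilistic_split(n_rows, ratios, seed=12345):
    """Assign each row to a split by comparing a uniform draw to cumulative ratios."""
    rng = np.random.default_rng(seed)
    u = rng.random(n_rows)
    edges = np.cumsum(ratios)  # [0.4, 0.7] for ratios [0.4, 0.3]
    # 0 -> first split, 1 -> second split, 2 -> remainder
    return np.searchsorted(edges, u, side='right')

assignment = probabilistic_split(10000, [0.4, 0.3])
sizes = np.bincount(assignment, minlength=3)
print(sizes)  # close to, but not exactly, 4000 / 3000 / 3000
```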

#### Convert h2o frames into Pandas DataFrames for use below

In [None]:
rsontrain_pd = train.as_data_frame()
rsonvalid_pd = valid.as_data_frame()

#### Convert Pandas DataFrames into XGBoost DMatrices (XGBoost's internal data structure), required for training XGBoost

In [None]:
rsontrain_dm = xgb.DMatrix(rsontrain_pd[X],
                           rsontrain_pd[y])
rsonvalid_dm = xgb.DMatrix(rsonvalid_pd[X],
                           rsonvalid_pd[y])

#### Use average of y as XGBoost null model

In [None]:
ave_y = rsontrain_pd['target'].mean()
print(ave_y)
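
One way to see why the mean of the target is a sensible null model: among all constant probability predictions, the mean of a binary target minimizes log loss. The toy labels below are made up for illustration and are not the notebook's data.

```python
import numpy as np

y_toy = np.array([0, 0, 0, 1, 1, 0, 1, 0, 0, 0])  # toy labels, mean is 0.3
ave = y_toy.mean()

def log_loss(y_true, p):
    """Binary log loss of a constant predicted probability p."""
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

# evaluate a grid of constant predictions and keep the one with the lowest loss
candidates = np.linspace(0.05, 0.95, 19)
losses = [log_loss(y_toy, p) for p in candidates]
best = candidates[int(np.argmin(losses))]
print(ave, best)
```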

#### Train XGBoost GBM
Hyperparameters selected by Cartesian grid search: https://gist.github.com/jphall663/705595e3bc72e8fdfee8fa56220503a5

In [None]:
params = {
     'base_score': ave_y,
     'booster': 'gbtree',
     'colsample_bytree': 0.9,
     'eta': 0.01,
     'eval_metric': 'auc',
     'max_depth': 12,
     'nthread': 4,
     'objective': 'binary:logistic',
     'reg_alpha': 0.001,
     'reg_lambda': 0.01,
     'seed': 12345,
     'silent': 0,
     'subsample': 0.1}

watchlist = [(rsontrain_dm, 'train'), (rsonvalid_dm, 'eval')]

rson_model = xgb.train(params, 
                       rsontrain_dm, 
                       400,
                       early_stopping_rounds=50,
                       evals=watchlist, 
                       verbose_eval=True)

## Calculate and Display Partial Dependence and ICE Curves

#### Function for calculating partial dependence and ICE

In [None]:
def par_dep(xs, frame, model, resolution=20, bins=None):
    
    """ Creates Pandas DataFrame containing partial dependence for a 
        single variable.
    
    Args:
        xs: Variable for which to calculate partial dependence.
        frame: Pandas DataFrame for which to calculate partial dependence.
        model: XGBoost model for which to calculate partial dependence.
        resolution: The number of points across the domain of xs for which 
                    to calculate partial dependence, default 20.
        bins: List of values at which to set xs, default 20 equally-spaced 
              points between column minimum and maximum.
    
    Returns:
        Pandas DataFrame containing partial dependence values.
        
    """
    
    # turn off pesky Pandas copy warning
    pd.options.mode.chained_assignment = None
    
    # collect one row per bin; DataFrame.append was removed in pandas 2.0
    rows = []
    
    # cache original column values 
    col_cache = frame.loc[:, xs].copy(deep=True)
  
    # determine values at which to calculate partial dependence
    if bins is None:
        min_ = frame[xs].min()
        max_ = frame[xs].max()
        by = (max_ - min_)/resolution
        bins = np.arange(min_, max_, by)
        
    # calculate partial dependence  
    # by setting column of interest to constant 
    # and scoring the altered data and taking the mean of the predictions
    for j in bins:
        frame.loc[:, xs] = j
        dframe = xgb.DMatrix(frame)
        par_dep_j = model.predict(dframe).mean()
        rows.append({xs: j, 'partial_dependence': par_dep_j})
        
    # return input frame to original cached state    
    frame.loc[:, xs] = col_cache

    return pd.DataFrame(rows, columns=[xs, 'partial_dependence'])
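
The same procedure can be checked on a toy problem where the answer is known in closed form. The sketch below uses a hypothetical additive model in place of XGBoost, and `toy_par_dep` is an illustrative NumPy analogue of `par_dep`, not a replacement for it. For an additive model, partial dependence recovers the variable's effect exactly, up to a constant.

```python
import numpy as np

def additive_model(data):
    # hypothetical model with no interactions: f(x) = 2 * x0 + x1^2
    return 2.0 * data[:, 0] + data[:, 1] ** 2

def toy_par_dep(col, data, predict, grid):
    """Mean prediction with column `col` forced to each grid value in turn."""
    values = []
    for v in grid:
        altered = data.copy()   # leave the caller's array untouched
        altered[:, col] = v     # set the column of interest to a constant
        values.append(predict(altered).mean())
    return np.array(values)

rng = np.random.default_rng(0)
data = rng.standard_normal((500, 2))
grid = np.linspace(-2.0, 2.0, 5)

pd_values = toy_par_dep(0, data, additive_model, grid)
# closed form for this model: PD(v) = 2 * v + mean(x1^2)
closed_form = 2.0 * grid + (data[:, 1] ** 2).mean()
print(np.allclose(pd_values, closed_form))
```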

#### Calculate partial dependence for an important input variable in the GBM

In [None]:
par_dep_num9 = par_dep('num9', rsonvalid_pd[X], rson_model)

#### Display partial dependence for important variable

In [None]:
par_dep_num9

#### Bind XGBoost predictions to validation data and display

In [None]:
rson_preds = pd.DataFrame(rson_model.predict(rsonvalid_dm))
rson_decile_frame = pd.concat([rsonvalid_pd, rson_preds], axis=1)
rson_decile_frame = rson_decile_frame.rename(columns={0: 'predict'})
rson_decile_hframe = h2o.H2OFrame(rson_decile_frame)
rson_decile_frame.head()

#### Find percentiles of XGBoost predictions

In [None]:
rson_percentile_dict = ds_.get_percentile_dict('predict', 'row_id', rson_decile_hframe)

#### Display percentiles with row identifiers

In [None]:
rson_percentile_dict
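
The implementation of `get_percentile_dict` is not shown here. Judging only from how its result is used in this notebook (percentile keys mapped to row identifiers), a helper of this kind might look roughly like the sketch below. The name `percentile_row_ids`, the choice of keys (0, 10, ..., 90, 99), and the rank rounding are assumptions.

```python
import numpy as np

def percentile_row_ids(preds, row_ids):
    """Hypothetical sketch: map percentile -> row id of the prediction at that rank."""
    order = np.argsort(preds, kind='stable')  # positions sorted by prediction
    n = len(preds)
    out = {}
    for p in list(range(0, 100, 10)) + [99]:
        idx = min(int(n * p / 100), n - 1)    # rank of the p-th percentile
        out[p] = int(row_ids[order[idx]])
    return out

toy_preds = np.array([0.9, 0.1, 0.5, 0.3, 0.7])
toy_row_ids = np.array([10, 11, 12, 13, 14])
toy_percentiles = percentile_row_ids(toy_preds, toy_row_ids)
print(toy_percentiles)
```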

#### Calculate ICE curve values

In [None]:
# retrieve bins from original partial dependence calculation
bins_num9 = list(par_dep_num9['num9'])

# for each percentile in percentile_dict
# create a new column in the par_dep frame 
# representing the ICE curve for that percentile
# and the variables of interest
for i in sorted(rson_percentile_dict.keys()):
    
    col_name = 'Percentile_' + str(i)
    
    # ICE curves for num9 across percentiles at bins_num9 intervals
    par_dep_num9[col_name] = par_dep('num9', 
                                     rsonvalid_pd[rsonvalid_pd['row_id'] == int(rson_percentile_dict[i])][X], 
                                     rson_model, 
                                     bins=bins_num9)['partial_dependence']


#### Display partial dependence and ICE for num9 -- all calculated DIRECTLY from the XGBoost model

In [None]:
par_dep_num9

#### Plot partial dependence and ICE

In [None]:
def plot_par_dep_ICE(xs, par_dep_frame):
    
    """ Plots ICE overlaid onto partial dependence for a single variable.
    
    Args: 
        xs: Name of variable for which to plot ICE and partial dependence.
        par_dep_frame: Name of Pandas DataFrame containing ICE and partial
                       dependence values.
    
    """
    
    # initialize figure and axis
    fig, ax = plt.subplots()
    
    # plot ICE curves
    par_dep_frame.drop('partial_dependence', axis=1).plot(x=xs, 
                                                          colormap='gnuplot',
                                                          ax=ax)

    # overlay partial dependence, annotate plot
    par_dep_frame.plot(title='Partial Dependence and ICE for ' + str(xs),
                       x=xs, 
                       y='partial_dependence',
                       style='r-', 
                       linewidth=3, 
                       ax=ax)

    # add legend
    _ = plt.legend(bbox_to_anchor=(1.05, 0),
                   loc=3, 
                   borderaxespad=0.)

plot_par_dep_ICE('num9', par_dep_num9)

Notice the divergence of the ICE curves from the red partial dependence curve for `~-1 < num9 < ~1`. This divergence can be indicative of an interaction between input variables. The surrogate tree at the bottom of the notebook can provide further insight into which input variables are driving the detected interaction.
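
A small toy case shows why diverging ICE curves point to an interaction. The sketch below uses a hypothetical model made of a pure interaction between two inputs. Partial dependence is the average of the ICE curves, so here it is flat at zero even though every individual curve has a clear slope.

```python
import numpy as np

def interaction_model(data):
    # hypothetical model: a pure interaction between columns 0 and 1
    return data[:, 0] * data[:, 1]

# four toy rows; column 1 is -1 for half of them and +1 for the other half
toy_rows = np.array([[0.5, -1.0],
                     [0.2, 1.0],
                     [-0.3, -1.0],
                     [0.1, 1.0]])
grid = np.linspace(-1.0, 1.0, 5)

def ice_curve(row, col, grid):
    """Predictions for one row as column `col` sweeps over the grid."""
    repeated = np.tile(row, (len(grid), 1))
    repeated[:, col] = grid
    return interaction_model(repeated)

ice = np.array([ice_curve(row, 0, grid) for row in toy_rows])
par_dep_toy = ice.mean(axis=0)  # partial dependence is the mean ICE curve

print(ice)          # curves slope down when column 1 is -1, up when it is +1
print(par_dep_toy)  # the average hides the effect entirely
```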

## Train and Display Surrogate Decision Tree

#### Train single h2o decision tree as surrogate between the XGBoost predictions and inputs

In [None]:
model_id = 'dt_surrogate_mojo' # gives MOJO artifact a recognizable name

# initialize single tree surrogate model
surrogate = H2ORandomForestEstimator(ntrees=1,          # use only one tree
                                     sample_rate=1,     # use all rows in that tree
                                     mtries=-2,         # use all columns in that tree
                                     max_depth=5,       # shallow trees are easier to understand
                                     seed=12345,        # random seed for reproducibility
                                     model_id=model_id) # gives MOJO artifact a recognizable name

# train single tree surrogate model
surrogate.train(x=X, y='predict', training_frame=rson_decile_hframe)

# persist MOJO (compiled representation of the trained model)
# from which to generate a plot of the surrogate
mojo_path = surrogate.download_mojo(path='.')
print('Generated MOJO path:\n', mojo_path)
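
The idea behind a surrogate is to fit a simple, readable model to the predictions of the complex model rather than to the original target. As a minimal illustration, the sketch below fits a one-split regression tree (a stump) to made-up predictions with NumPy. `fit_stump` is a hypothetical helper and is far simpler than the depth-5 H2O tree trained above.

```python
import numpy as np

def fit_stump(x, preds):
    """Find the single split on x that minimizes squared error of preds."""
    best = None
    for t in np.unique(x)[:-1]:  # candidate thresholds; the last value leaves no right side
        left, right = preds[x <= t], preds[x > t]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, t, left.mean(), right.mean())
    return best[1:]

toy_x = np.array([-2.0, -1.0, -0.5, 0.5, 1.0, 2.0])
# stand-in for the complex model's predicted probabilities on those rows
toy_model_preds = np.array([0.1, 0.15, 0.1, 0.8, 0.9, 0.85])

threshold, left_value, right_value = fit_stump(toy_x, toy_model_preds)
print(threshold, left_value, right_value)
```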

#### Create GraphViz dot file of tree

In [None]:
# title for plot
title = 'Known Signal Data (with Validation) Decision Tree Surrogate'  

# locate h2o jar
hs = H2OLocalServer()
h2o_jar_path = hs._find_jar()
print('Discovered H2O jar path:\n', h2o_jar_path)

# construct command line call to generate GraphViz version of 
# the surrogate tree; for more information see: 
# http://docs.h2o.ai/h2o/latest-stable/h2o-genmodel/javadoc/index.html
# arguments are passed as a list so paths containing spaces are not split
gv_file_name = model_id + '.gv'
gv_args = ['java', '-cp', h2o_jar_path,
           'hex.genmodel.tools.PrintMojo', '--tree', '0',
           '-i', mojo_path, '-o', gv_file_name]
if title is not None:
    gv_args = gv_args + ['--title', title]
    
# call 
print()
print('Calling external process ...')
print(' '.join(gv_args))
_ = subprocess.call(gv_args)

#### Create PNG from GraphViz dot file and display

In [None]:
# construct call to generate PNG from 
# graphviz representation of the tree
png_file_name = model_id + '.png'
# arguments are passed as a list so file names containing spaces are not split
png_args = ['dot', '-Tpng', gv_file_name, '-o', png_file_name]

# call
print('Calling external process ...')
print(' '.join(png_args))
_ = subprocess.call(png_args)
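
`subprocess.call` returns the exit code but does not raise if `java` or `dot` is missing or fails, so a problem may only surface later as a missing image. One possible safeguard is sketched below. `run_checked` is a hypothetical helper, and the Python interpreter is used as a stand-in command so the example runs without GraphViz installed.

```python
import shutil
import subprocess
import sys

def run_checked(args):
    """Run an external command given as a list; raise if it is missing or fails."""
    if shutil.which(args[0]) is None:
        raise FileNotFoundError(f'{args[0]} not found on PATH')
    # check=True raises CalledProcessError on a non-zero exit code
    return subprocess.run(args, check=True, capture_output=True, text=True)

# stand-in for a call such as run_checked(png_args)
result = run_checked([sys.executable, '-c', 'print("ok")'])
print(result.stdout.strip())
```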

#### Display surrogate tree in-notebook
Double click to zoom.

In [None]:
display(Image(png_file_name))

In the decision tree above, there are parent-child node relationships between num9 and several other input variables for the range `~-1 < num9 < ~1`. These parent-child relationships can be indicative of the variables driving the interactions detected in the partial dependence and ICE plot above.

#### Shutdown H2O
After using h2o, it's typically best to shut it down. However, before doing so, users should ensure that they have saved any h2o data structures, such as models and H2OFrames, or scoring artifacts, such as POJOs and MOJOs.

In [None]:
# be careful, this can erase your work!
h2o.cluster().shutdown()

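**Conclusion**: In this notebook, a direct explanation technique (partial dependence and ICE, calculated from the XGBoost model itself) and an indirect one (a surrogate decision tree trained on the model's predictions) were combined to gain insight into the behavior of a complex ML model.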