# Introduction



## Install packages

Install treeinterpreter ([source](https://github.com/andosa/treeinterpreter/)) with:

    pip install treeinterpreter

Install ELI5 ([source](https://github.com/TeamHG-Memex/eli5)) with:

    pip install eli5

# TreeExplainer produces the same values as treeinterpreter

In [1]:
from time import time
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

from treeinterpreter import treeinterpreter as ti
from tree_explainer.tree_explainer import TreeExplainer

%load_ext autoreload


SEED = 17

# Generate data
X, y = make_classification(n_samples=1000, n_features=10, n_classes=2, random_state=SEED)
# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.20, random_state=SEED)

# Create and train classifier
RF = RandomForestClassifier(n_estimators=100, random_state=SEED, n_jobs=-1)
RF.fit(X_train, y_train);

In [2]:
%%time 

# Use treeinterpreter to compute feature contributions
# Call directly _predict_forest()
ti_prediction, ti_bias, ti_contributions = ti._predict_forest(RF, X_test, joint_contribution=False)

CPU times: user 104 ms, sys: 4.48 ms, total: 108 ms
Wall time: 106 ms


In [3]:
%%time

# Initialize TreeExplainer
TE = TreeExplainer(RF, X_test)

CPU times: user 25.2 ms, sys: 6.85 ms, total: 32 ms
Wall time: 110 ms


In [4]:
%%time

# Compute feature contributions
TE.explain_feature_contributions(joint_contributions=False)

CPU times: user 323 ms, sys: 3.83 ms, total: 327 ms
Wall time: 325 ms


<tree_explainer.tree_explainer.TreeExplainer at 0x1a18f134a8>

Make sure that results are the same:

In [5]:
# Are contribution values the same?
np.allclose(TE.contributions, 
            ti_contributions)

True

In [6]:
# Are the target probabilities at the root of each tree the same?
np.allclose(TE.target_probability_at_root.mean(axis=0), 
            ti_bias[0, :])

True

In [7]:
# Are predicted values the same?
np.allclose(TE.prediction_probabilities, 
            ti_prediction)

True

In [8]:
# Do predictions equal the sum of feature contributions and target probabilities at the root of each tree?
np.allclose(ti_prediction, 
            np.sum(ti_contributions, axis=1) + ti_bias)

True

In [9]:
# Do predictions equal the sum of feature contributions and target probabilities at the root of each tree?
np.allclose(TE.prediction_probabilities, 
            np.sum(TE.contributions, axis=1) + TE.target_probability_at_root.mean(axis=0))

True
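
The checks above rely on the fact that a tree's prediction can be written as the target probability at the root plus one contribution per split along the decision path. The following is a minimal sketch of that decomposition on a hand-written toy tree. The dictionary layout, the `decompose` helper, and all numbers are hypothetical and only illustrate the idea; they are not the internals of treeinterpreter or TreeExplainer.

```python
# Toy decision tree (hypothetical): every node stores class probabilities;
# internal nodes also store the split feature, the threshold and two children.
TREE = {
    "proba": [0.5, 0.5], "feature": 0, "threshold": 0.0,
    "left": {
        "proba": [0.8, 0.2], "feature": 1, "threshold": 1.0,
        "left": {"proba": [1.0, 0.0]},
        "right": {"proba": [0.4, 0.6]},
    },
    "right": {"proba": [0.2, 0.8]},
}


def decompose(tree, x, n_features):
    """Return (bias, contributions, prediction) for one observation x."""
    bias = list(tree["proba"])  # target probabilities at the root
    n_classes = len(bias)
    contributions = [[0.0] * n_classes for _ in range(n_features)]
    node = tree
    while "feature" in node:  # leaves have no split feature
        feature = node["feature"]
        child = node["left"] if x[feature] <= node["threshold"] else node["right"]
        # The split feature is credited with the change in class probabilities
        for c in range(n_classes):
            contributions[feature][c] += child["proba"][c] - node["proba"][c]
        node = child
    return bias, contributions, list(node["proba"])


bias, contributions, prediction = decompose(TREE, x=[-1.0, 2.0], n_features=2)

# Bias plus the contributions summed over features recovers the prediction
reconstructed = [
    bias[c] + sum(row[c] for row in contributions) for c in range(len(bias))
]
print(prediction, reconstructed)
```

For a forest, the same quantities are averaged across trees, which is why the comparison above uses the mean of `TE.target_probability_at_root` over trees.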

## ELI5 is cumbersome and slow, but offers a colorful output

In [10]:
%%time

from eli5.sklearn.explain_prediction import explain_prediction_tree_classifier
from eli5.formatters.as_dataframe import format_as_dataframe


# Iterate through each observation
RES = list()
for i_obs in range(X_test.shape[0]):
    res = format_as_dataframe(explain_prediction_tree_classifier(RF, X_test[i_obs, :]))
    # Add index of observation
    res['observation'] = i_obs
    RES.append(res)

# Concatenate results
all_RES = pd.concat(RES, axis=0, ignore_index=True, sort=False)

CPU times: user 16.3 s, sys: 1.63 s, total: 17.9 s
Wall time: 1min 11s


In [11]:
# Check target probability at root
# The <BIAS> rows of all_RES hold the target probability at the root
eli5_bias = (all_RES[all_RES['feature'] == '<BIAS>']
             .groupby(by='target')['weight']
             .mean()
             .values)

print(np.allclose(ti_bias[0, :], 
                  eli5_bias))

print(np.allclose(TE.target_probability_at_root.mean(axis=0), 
                  eli5_bias))

In [None]:
# Check that predictions are the same
eli5_predictions = all_RES.groupby(by=['observation', 'target'])['weight'].sum()
rows = eli5_predictions.to_frame().reset_index()['observation'].values
cols = eli5_predictions.to_frame().reset_index()['target'].values

print(np.allclose(ti_prediction[rows, cols], 
                  eli5_predictions.values))

print(np.allclose(TE.prediction_probabilities[rows, cols], 
                  eli5_predictions.values))

In [None]:
eli5_contributions = (all_RES
                      .groupby(by=['feature', 'target'])['weight']
                      .mean()
                      .to_frame()
                      .reset_index()
                      .pivot(index='feature', columns='target', values='weight')
                      .values[1:, :]  # first row refers to <BIAS>
                     )

eli5_contributions

These values are completely different from those of treeinterpreter. More importantly, they are not symmetric: the average contribution of a feature to one target is not matched by an equal contribution in the opposite direction to the other target.

In [None]:
ti_contributions.mean(axis=0)

In fact, ELI5 does not use information from the true labels. This means that the average contribution values refer to the predicted targets. If we group contributions by predicted target, the results are the same as those of treeinterpreter.
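
To see why averaging by predicted target breaks the symmetry, here is a small sketch with made-up numbers. The contributions and predicted classes are hypothetical and do not come from the model above. In a binary problem, each observation's contributions to the two classes are opposite, so their averages over all observations are opposite too; averaging each class only over the observations predicted as that class is not.

```python
# Hypothetical contributions of ONE feature for four observations,
# stored as [contribution to class 0, contribution to class 1]
contributions = [[0.3, -0.3], [0.1, -0.1], [-0.2, 0.2], [-0.4, 0.4]]
predicted = [0, 0, 1, 1]  # hypothetical predicted class of each observation


def mean(values):
    return sum(values) / len(values)


# Averaging over all observations keeps the two classes opposite
mean_all = [mean([row[c] for row in contributions]) for c in range(2)]

# Averaging each class only over the observations predicted as that class
mean_by_predicted = [
    mean([row[c] for row, p in zip(contributions, predicted) if p == c])
    for c in range(2)
]

print(mean_all)           # the two values cancel out
print(mean_by_predicted)  # the two values do not cancel out
```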

In [None]:
predicted_targets = ti_prediction.argmax(axis=1) 
ti_contributions_by_predicted_target = np.vstack((ti_contributions[predicted_targets == 0, :, 0].mean(0), 
                                                  ti_contributions[predicted_targets == 1, :, 1].mean(0))).T

np.allclose(ti_contributions_by_predicted_target, 
            eli5_contributions)

Despite all this, ELI5 offers a colorful output (in a Jupyter notebook):

In [None]:
from eli5 import show_prediction

# Again, one observation at a time. Let's see the first observation (index 0)
show_prediction(RF, X_test[0, :], show_feature_values=True)

If you can live without colors, TreeExplainer offers a similar interface:

In [None]:
# Transform X_test into a DataFrame, so we can use the same feature names as in ELI5
df_test = pd.DataFrame(X_test, columns=['x%i' % i for i in range(X_test.shape[1])])

# Rerun TreeExplainer by passing joint_contributions = True
TE = TreeExplainer(RF, df_test, y_test).explain_feature_contributions(joint_contributions=True);
# Let's analyze the rest of the tree structure
TE.analyze_tree_structure();

In [None]:
# Let's see the first observation (index 0)
TE.explain_single_prediction(observation_idx=0)

TreeExplainer outputs the `value` of each feature for the observation, alongside the confidence of the model (as quartiles) regarding the range of values of that feature that fall in the target class. The last column shows the contribution of each feature, as a percentage of the final decision.

### Conclusions

treeinterpreter is fast and simple. If you only need the values of feature contributions, it is the best option of the three tested here.

TreeExplainer offers more advanced features, and to do so it holds and scans the tree structure more thoroughly. This comes at a cost in speed. Most analyses can be carried out in parallel, but (in my experience) this hardly reduces computation time. Rather, the overhead usually slows the algorithm down, but your mileage may vary. More advanced options are available in most methods of this class, which port ideas from R packages to Python.

ELI5 offers many options, including more advanced ones, such as LIME. However, it is slow, and it doesn't support the inspection of a whole dataset. Moreover, even after looping through all observations, it is cumbersome to obtain average values of feature contributions. If you are simply after feature contributions, go for treeinterpreter; if you want to try LIME, use it directly instead of relying on a third-party library; for all other cases, there are better libraries than ELI5, and TreeExplainer might do for you.