# Anchor explanations for income prediction

In this example, we explain the predictions of a Random Forest classifier that predicts whether a person makes more or less than $50k, based on characteristics such as age, marital status, gender or occupation. The features are a mixture of ordinal and categorical data and are pre-processed accordingly.

In [1]:
!pip install alibi

Collecting alibi
Installing collected packages: alibi
Successfully installed alibi-0.3.2


In [0]:
import warnings
warnings.filterwarnings("ignore", category=FutureWarning)
warnings.filterwarnings("ignore", category=UserWarning)

# to reverse
# warnings.filterwarnings("default", category=[Name])


In [3]:
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from alibi.explainers import AnchorTabular
from alibi.datasets import fetch_adult

### Load adult dataset

The `fetch_adult` function returns a `Bunch` object containing the features, the targets, the feature names and a mapping from each categorical feature to its category values, which is required for formatting the output of the Anchor explainer.

In [4]:
adult = fetch_adult()
adult.keys()

dict_keys(['data', 'target', 'feature_names', 'target_names', 'category_map'])

In [0]:
data = adult.data
target = adult.target
feature_names = adult.feature_names
category_map = adult.category_map

For other datasets the utility function [gen_category_map](../api/alibi.utils.data.rst) can create the category map:

In [0]:
from alibi.utils.data import gen_category_map
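
As a rough illustration of what a category map holds, the sketch below builds one by hand for a toy table. The helper `build_category_map` and the toy rows are hypothetical and are not alibi's implementation; the sketch only assumes that a category map is a dictionary from column index to the list of category labels for that column.

```python
def build_category_map(rows, categorical_columns):
    """Encode string categories as integers and record the labels per column.

    Hypothetical helper for illustration only; not the alibi implementation.
    """
    category_map = {}
    encoded = [list(row) for row in rows]
    for col in categorical_columns:
        # Sorted unique labels; the position in this list is the integer code.
        labels = sorted({row[col] for row in rows})
        code_of = {label: code for code, label in enumerate(labels)}
        for row in encoded:
            row[col] = code_of[row[col]]
        category_map[col] = labels
    return encoded, category_map


# Toy table: (age, workclass, sex). Columns 1 and 2 are categorical.
toy_table = [
    (39, "Private", "Male"),
    (50, "Self-emp", "Female"),
    (28, "Private", "Female"),
]
toy_encoded, toy_category_map = build_category_map(toy_table, categorical_columns=[1, 2])
print(toy_category_map)
print(toy_encoded)
```

The integer codes are what the model sees, while the labels in the map are what makes an anchor readable.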

Define shuffled training and test set

In [0]:
np.random.seed(0)
data_perm = np.random.permutation(np.c_[data, target])
data = data_perm[:,:-1]
target = data_perm[:,-1]

In [0]:
idx = 30000
X_train,Y_train = data[:idx,:], target[:idx]
# Note: slicing from idx+1 leaves row `idx` out of both the training and the test set
X_test, Y_test = data[idx+1:,:], target[idx+1:]

### Create feature transformation pipeline
Create the feature pre-processor, which needs to have `fit` and `transform` methods. Different types of pre-processing can be applied to all or part of the features. In this example, we standardize the ordinal features and apply one-hot encoding to the categorical features.
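
To make the two transformations concrete, here is a minimal pure-Python sketch on toy columns. The functions `standardize` and `one_hot` are hypothetical stand-ins written for illustration; they mimic the idea behind `StandardScaler` and `OneHotEncoder` but are not the scikit-learn implementations.

```python
from statistics import mean, pstdev


def standardize(column):
    """Shift to zero mean and scale to unit (population) standard deviation."""
    mu, sigma = mean(column), pstdev(column)
    return [(value - mu) / sigma for value in column]


def one_hot(codes, n_categories):
    """Turn each integer category code into a 0/1 indicator vector."""
    return [[1 if code == k else 0 for k in range(n_categories)] for code in codes]


toy_ages = [20, 30, 40]        # ordinal feature
toy_workclass = [0, 2, 1]      # categorical feature, already integer-encoded

scaled_ages = standardize(toy_ages)
encoded_workclass = one_hot(toy_workclass, n_categories=3)
print(scaled_ages)
print(encoded_workclass)
```

Standardizing keeps the ordering of an ordinal feature, while one-hot encoding avoids imposing an artificial order on category codes.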

Ordinal features:

In [0]:
ordinal_features = [x for x in range(len(feature_names)) if x not in list(category_map.keys())]
ordinal_transformer = Pipeline(steps=[('imputer', SimpleImputer(strategy='median')),
                                      ('scaler', StandardScaler())])

Categorical features:

In [0]:
categorical_features = list(category_map.keys())
categorical_transformer = Pipeline(steps=[('imputer', SimpleImputer(strategy='median')),
                                          ('onehot', OneHotEncoder(handle_unknown='ignore'))])

Combine and fit:

In [11]:
preprocessor = ColumnTransformer(transformers=[('num', ordinal_transformer, ordinal_features),
                                               ('cat', categorical_transformer, categorical_features)])
preprocessor.fit(X_train)

ColumnTransformer(n_jobs=None, remainder='drop', sparse_threshold=0.3,
                  transformer_weights=None,
                  transformers=[('num',
                                 Pipeline(memory=None,
                                          steps=[('imputer',
                                                  SimpleImputer(add_indicator=False,
                                                                copy=True,
                                                                fill_value=None,
                                                                missing_values=nan,
                                                                strategy='median',
                                                                verbose=0)),
                                                 ('scaler',
                                                  StandardScaler(copy=True,
                                                                 with_mean=True,
                           

### Train Random Forest model

Fit the model on the pre-processed (imputed, one-hot encoded, standardized) data.

In [12]:
np.random.seed(0)
clf = RandomForestClassifier(n_estimators=50)
clf.fit(preprocessor.transform(X_train), Y_train)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=None, max_features='auto', max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=50,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)

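Define the predict function. It takes raw feature rows, applies the fitted pre-processor and returns the class predictions.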

In [13]:
predict_fn = lambda x: clf.predict(preprocessor.transform(x))
print('Train accuracy: ', accuracy_score(Y_train, predict_fn(X_train)))
print('Test accuracy: ', accuracy_score(Y_test, predict_fn(X_test)))

Train accuracy:  0.9655333333333334
Test accuracy:  0.855859375


### Initialize and fit anchor explainer for tabular data

In [0]:
explainer = AnchorTabular(predict_fn, feature_names, categorical_names=category_map)

Discretize the ordinal features into quartiles

In [0]:
explainer.fit(X_train, disc_perc=[25, 50, 75])
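
The idea of quartile discretization can be sketched in plain Python as follows. This is an illustration under stated assumptions, not alibi's code: the helpers `fit_cut_points` and `discretize` and the toy ages are hypothetical, and the convention that a value equal to a cut point falls into the lower bin is assumed here to match the `Age <= 28.00` style of the anchors.

```python
from bisect import bisect_left
from statistics import quantiles


def fit_cut_points(values):
    """Return the 25th, 50th and 75th percentiles of the values."""
    return quantiles(values, n=4, method="inclusive")


def discretize(value, cut_points):
    """Bin 0: value <= first cut; bin k: between cuts k-1 and k; last bin: above all cuts."""
    return bisect_left(cut_points, value)


toy_age_sample = [19, 23, 28, 31, 37, 42, 48, 55, 63]   # toy data, not the adult dataset
age_cuts = fit_cut_points(toy_age_sample)
age_bins = [discretize(age, age_cuts) for age in (28, 30, 48, 60)]
print(age_cuts)
print(age_bins)
```

Each bin then acts like a category, so a condition on an ordinal feature can be stated as a range.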

### Getting an anchor

Now get an anchor for the prediction of the first observation (row 0) of the test set. An anchor is a sufficient condition, i.e., when the anchor holds, the prediction should be the same as the prediction for this instance.

In [16]:
idx = 0
class_names = adult.target_names
print('Prediction: ', class_names[explainer.predict_fn(X_test[idx].reshape(1, -1))[0]])

Prediction:  <=50K


Set the precision threshold to 0.95, so that predictions on observations where the anchor holds will be the same as the prediction of the explained instance at least 95% of the time.

In [17]:
explanation = explainer.explain(X_test[idx], threshold=0.95)
print('Anchor: %s' % (' AND '.join(explanation['names'])))
print('Precision: %.2f' % explanation['precision'])
print('Coverage: %.2f' % explanation['coverage'])

Anchor: Marital Status = Separated AND Sex = Female
Precision: 0.96
Coverage: 0.11
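
Precision and coverage can be read as two proportions: coverage is the share of observations on which the anchor holds, and precision is the share of those observations that receive the same prediction as the explained instance. The sketch below computes both on a small hand-made table. The helper names and the toy rows are hypothetical, and using a fixed table is a simplification: the explainer estimates these quantities from sampled data rather than from a list like this one.

```python
def anchor_holds(row, anchor):
    """True if every feature condition in the anchor is met by the row."""
    return all(row[feature] == value for feature, value in anchor.items())


def precision_and_coverage(rows, predictions, anchor, target_prediction):
    """Empirical precision and coverage of an anchor on a fixed table."""
    covered = [pred for row, pred in zip(rows, predictions) if anchor_holds(row, anchor)]
    coverage = len(covered) / len(rows)
    precision = sum(pred == target_prediction for pred in covered) / len(covered) if covered else 0.0
    return precision, coverage


# Toy rows and toy model predictions, for illustration only.
toy_people = [
    {"marital": "Separated", "sex": "Female"},
    {"marital": "Separated", "sex": "Female"},
    {"marital": "Separated", "sex": "Female"},
    {"marital": "Separated", "sex": "Female"},
    {"marital": "Married", "sex": "Male"},
    {"marital": "Separated", "sex": "Male"},
    {"marital": "Married", "sex": "Female"},
    {"marital": "Married", "sex": "Male"},
]
toy_predictions = ["<=50K", "<=50K", "<=50K", ">50K", ">50K", "<=50K", ">50K", "<=50K"]
toy_anchor = {"marital": "Separated", "sex": "Female"}

toy_precision, toy_coverage = precision_and_coverage(
    toy_people, toy_predictions, toy_anchor, target_prediction="<=50K"
)
print('Precision: %.2f' % toy_precision)
print('Coverage: %.2f' % toy_coverage)
```

In this toy table the anchor holds for 4 of 8 rows, and 3 of those 4 share the target prediction.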


In [18]:
idx = 71
class_names = adult.target_names
print('Prediction: ', class_names[explainer.predict_fn(X_test[idx].reshape(1, -1))[0]])

explanation = explainer.explain(X_test[idx], threshold=0.95)
print('Anchor: %s' % (' AND '.join(explanation['names'])))
print('Precision: %.2f' % explanation['precision'])
print('Coverage: %.2f' % explanation['coverage'])

Prediction:  <=50K
Anchor: Relationship = Own-child AND Marital Status = Never-Married
Precision: 0.99
Coverage: 0.14


In [19]:
idx = 241
class_names = adult.target_names
print('Prediction: ', class_names[explainer.predict_fn(X_test[idx].reshape(1, -1))[0]])

explanation = explainer.explain(X_test[idx], threshold=0.95)
print('Anchor: %s' % (' AND '.join(explanation['names'])))
print('Precision: %.2f' % explanation['precision'])
print('Coverage: %.2f' % explanation['coverage'])

Prediction:  <=50K
Anchor: Relationship = Own-child AND Age <= 28.00
Precision: 0.99
Coverage: 0.12


In [21]:
idx = 299
class_names = adult.target_names
print('Prediction: ', class_names[explainer.predict_fn(X_test[idx].reshape(1, -1))[0]])

explanation = explainer.explain(X_test[idx], threshold=0.95)
print('Anchor: %s' % (' AND '.join(explanation['names'])))
print('Precision: %.2f' % explanation['precision'])
print('Coverage: %.2f' % explanation['coverage'])

Prediction:  <=50K
Anchor: Occupation = Blue-Collar AND Education = High School grad AND Capital Gain <= 0.00 AND Capital Loss <= 0.00 AND Country = United-States AND Race = White AND Workclass = Self-emp-not-inc
Precision: 0.96
Coverage: 0.01


In [22]:
idx = 59
class_names = adult.target_names
print('Prediction: ', class_names[explainer.predict_fn(X_test[idx].reshape(1, -1))[0]])

explanation = explainer.explain(X_test[idx], threshold=0.95)
print('Anchor: %s' % (' AND '.join(explanation['names'])))
print('Precision: %.2f' % explanation['precision'])
print('Coverage: %.2f' % explanation['coverage'])

Prediction:  <=50K


Could not find an anchor satisfying the 0.95 precision constraint. Now returning the best non-eligible anchor.


Anchor: Occupation = Blue-Collar AND Age <= 37.00 AND Capital Gain <= 0.00 AND Capital Loss <= 0.00
Precision: 0.95
Coverage: 0.15


In [23]:
idx = 26
class_names = adult.target_names
print('Prediction: ', class_names[explainer.predict_fn(X_test[idx].reshape(1, -1))[0]])

explanation = explainer.explain(X_test[idx], threshold=0.95)
print('Anchor: %s' % (' AND '.join(explanation['names'])))
print('Precision: %.2f' % explanation['precision'])
print('Coverage: %.2f' % explanation['coverage'])

Prediction:  <=50K
Anchor: Marital Status = Separated AND Education = High School grad
Precision: 0.98
Coverage: 0.11


### ...or not?

Now get an anchor for a different observation in the test set, one for which the prediction is `>50K`.

In [24]:
idx = 6
class_names = adult.target_names
print('Prediction: ', class_names[explainer.predict_fn(X_test[idx].reshape(1, -1))[0]])

explanation = explainer.explain(X_test[idx], threshold=0.95)
print('Anchor: %s' % (' AND '.join(explanation['names'])))
print('Precision: %.2f' % explanation['precision'])
print('Coverage: %.2f' % explanation['coverage'])

Prediction:  >50K


Could not find an anchor satisfying the 0.95 precision constraint. Now returning the best non-eligible anchor.


Anchor: Capital Loss > 0.00 AND Marital Status = Married AND Age > 37.00 AND Relationship = Husband AND Country = United-States AND Sex = Male AND Race = White
Precision: 0.66
Coverage: 0.02


Notice that no anchor satisfying the 0.95 precision threshold is found: the best candidate returned reaches a precision of only 0.66.

This is due to the imbalanced dataset (roughly 25:75 high:low earner proportion): during the sampling stage, feature ranges corresponding to low earners are oversampled. This behaviour is useful because it can point to an imbalanced dataset, but it can also be addressed by producing balanced datasets so that anchors can be found for either class.
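
One simple way to produce a balanced dataset is to downsample the majority class. The sketch below shows the idea on toy data with the same 25:75 proportion. The helper `downsample_majority` is hypothetical, and whether fitting the explainer on data balanced this way yields an anchor for the `>50K` instance is not verified here.

```python
import random


def downsample_majority(rows, labels, seed=0):
    """Keep as many rows of each class as the smallest class has."""
    rng = random.Random(seed)
    rows_by_class = {}
    for row, label in zip(rows, labels):
        rows_by_class.setdefault(label, []).append(row)
    n_smallest = min(len(group) for group in rows_by_class.values())

    balanced = []
    for label, group in rows_by_class.items():
        # Sample without replacement so no row is duplicated.
        balanced.extend((row, label) for row in rng.sample(group, n_smallest))
    rng.shuffle(balanced)
    return [row for row, _ in balanced], [label for _, label in balanced]


# Toy data: 75 low earners (label 0) and 25 high earners (label 1).
toy_labels = [0] * 75 + [1] * 25
toy_features = [[i] for i in range(100)]
balanced_X, balanced_y = downsample_majority(toy_features, toy_labels)
print(len(balanced_y), balanced_y.count(0), balanced_y.count(1))
```

Downsampling discards data from the majority class; oversampling the minority class is an alternative with the opposite trade-off.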