Copyright (c) Microsoft Corporation. All rights reserved.

Licensed under the MIT License.


# Explain binary classification model predictions with raw feature transformations
_**This notebook showcases how to use the Azure Machine Learning Interpretability SDK to explain and visualize a binary classification model that uses one-to-one and one-to-many feature transformations.**_


## Table of Contents

1. [Introduction](#Introduction)
1. [Setup](#Setup)
1. [Run model explainer locally at training time](#Explain)
    1. Apply feature transformations
    1. Train a binary classification model
    1. Explain the model on raw features
        1. Generate global explanations
        1. Generate local explanations
1. [Visualize results](#Visualize)
1. [Next steps](#Next%20steps)

## Introduction

This notebook illustrates creating explanations for a binary classification model (IBM employee attrition classification) that uses one-to-one and one-to-many feature transformations from raw data to engineered features. The one-to-many transformations apply one-hot encoding to categorical features. The one-to-one transformations apply standard scaling to numeric features. A tabular data explainer is then used to obtain importances for the raw features rather than the engineered ones.
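
As a rough illustration of the two kinds of transformation, the sketch below uses plain Python on five `Age` and `BusinessTravel` values taken from the data preview later in this notebook. The helper names `standard_scale` and `one_hot` are hypothetical; they only mimic, for a single column, what `StandardScaler` and `OneHotEncoder` do, and are not part of scikit-learn or the SDK.

```python
from statistics import mean, pstdev


def standard_scale(values):
    """One-to-one: each raw value maps to exactly one scaled value."""
    mu, sigma = mean(values), pstdev(values)  # population std, as StandardScaler uses
    return [(v - mu) / sigma for v in values]


def one_hot(values):
    """One-to-many: one raw column becomes one column per category."""
    categories = sorted(set(values))
    rows = [[1.0 if v == c else 0.0 for c in categories] for v in values]
    return categories, rows


ages = [41, 49, 37, 33, 27]
travel = ["Travel_Rarely", "Travel_Frequently", "Travel_Rarely",
          "Travel_Frequently", "Travel_Rarely"]

scaled_age = standard_scale(ages)
categories, encoded_travel = one_hot(travel)

# Two raw features (Age, BusinessTravel) become three engineered features.
engineered = [[a] + row for a, row in zip(scaled_age, encoded_travel)]
print(categories)
print(engineered[0])
```

Two raw columns turn into three engineered columns here, which is why an explanation computed on engineered features has to be mapped back before it can be read in terms of the raw features.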


We will showcase raw feature transformations with three tabular data explainers: TabularExplainer (SHAP), MimicExplainer (global surrogate), and PFIExplainer.

| ![Interpretability Toolkit Architecture](./img/interpretability-architecture.PNG) |
|:--:|
| *Interpretability Toolkit Architecture* |

Problem: IBM employee attrition classification with scikit-learn (run model explainer locally)

1. Transform raw features to engineered features
2. Train an SVC classification model using scikit-learn
3. Run 'explain_model' globally and locally with full dataset in local mode, which doesn't contact any Azure services.
4. Visualize the global and local explanations with the visualization dashboard.

---

## Setup

If you are using Jupyter notebooks, the extensions should be installed automatically with the package.
If you are using JupyterLab, run the following command:
```
(myenv) $ jupyter labextension install @jupyter-widgets/jupyterlab-manager
```


## Explain

### Run model explainer locally at training time

In [31]:
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.svm import SVC
import pandas as pd
import numpy as np
from azureml.core import Workspace, Dataset

# Explainers:
# 1. SHAP Tabular Explainer
from azureml.explain.model.tabular_explainer import TabularExplainer

# OR

# 2. Mimic Explainer
from azureml.explain.model.mimic.mimic_explainer import MimicExplainer
# You can use one of the following four interpretable models as a global surrogate to the black box model
from azureml.explain.model.mimic.models.lightgbm_model import LGBMExplainableModel
from azureml.explain.model.mimic.models.linear_model import LinearExplainableModel
from azureml.explain.model.mimic.models.linear_model import SGDExplainableModel
from azureml.explain.model.mimic.models.tree_model import DecisionTreeExplainableModel

# OR

# 3. PFI Explainer
from azureml.explain.model.permutation.permutation_importance import PFIExplainer 

### Load the IBM employee attrition data created before

In [29]:
# get the IBM employee attrition dataset from the workspace
ws = Workspace.from_config()
attritionData = ws.datasets['IBM-Employee-Attrition'].to_pandas_dataframe()
attritionData.head()

| Unnamed: 0 | Age | Attrition | BusinessTravel | DailyRate | Department | DistanceFromHome | Education | EducationField | EmployeeCount | EmployeeNumber | ... | RelationshipSatisfaction | StandardHours | StockOptionLevel | TotalWorkingYears | TrainingTimesLastYear | WorkLifeBalance | YearsAtCompany | YearsInCurrentRole | YearsSinceLastPromotion | YearsWithCurrManager |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 41 | 1 | Travel_Rarely | 1102 | Sales | 1 | 2 | Life Sciences | 1 | 1 | ... | 1 | 80 | 0 | 8 | 0 | 1 | 6 | 4 | 0 | 5 |
| 1 | 49 | 0 | Travel_Frequently | 279 | Research & Development | 8 | 1 | Life Sciences | 1 | 2 | ... | 4 | 80 | 1 | 10 | 3 | 3 | 10 | 7 | 1 | 7 |
| 2 | 37 | 1 | Travel_Rarely | 1373 | Research & Development | 2 | 2 | Other | 1 | 4 | ... | 2 | 80 | 0 | 7 | 3 | 3 | 0 | 0 | 0 | 0 |
| 3 | 33 | 0 | Travel_Frequently | 1392 | Research & Development | 3 | 4 | Life Sciences | 1 | 5 | ... | 3 | 80 | 0 | 8 | 3 | 3 | 8 | 7 | 3 | 0 |
| 4 | 27 | 0 | Travel_Rarely | 591 | Research & Development | 2 | 1 | Medical | 1 | 7 | ... | 4 | 80 | 1 | 6 | 3 | 3 | 2 | 2 | 2 | 2 |


In [37]:
# or, if you did not create the dataset in the first part of the workshop, you can just create it now
web_path ='https://raw.githubusercontent.com/danielsc/azureml-workshop-2019/master/data/IBM-Employee-Attrition.csv'
attritionData = Dataset.Tabular.from_delimited_files(path=web_path).to_pandas_dataframe()
attritionData.head()

| Unnamed: 0 | Age | Attrition | BusinessTravel | DailyRate | Department | DistanceFromHome | Education | EducationField | EmployeeCount | EmployeeNumber | ... | RelationshipSatisfaction | StandardHours | StockOptionLevel | TotalWorkingYears | TrainingTimesLastYear | WorkLifeBalance | YearsAtCompany | YearsInCurrentRole | YearsSinceLastPromotion | YearsWithCurrManager |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 41 | 1 | Travel_Rarely | 1102 | Sales | 1 | 2 | Life Sciences | 1 | 1 | ... | 1 | 80 | 0 | 8 | 0 | 1 | 6 | 4 | 0 | 5 |
| 1 | 49 | 0 | Travel_Frequently | 279 | Research & Development | 8 | 1 | Life Sciences | 1 | 2 | ... | 4 | 80 | 1 | 10 | 3 | 3 | 10 | 7 | 1 | 7 |
| 2 | 37 | 1 | Travel_Rarely | 1373 | Research & Development | 2 | 2 | Other | 1 | 4 | ... | 2 | 80 | 0 | 7 | 3 | 3 | 0 | 0 | 0 | 0 |
| 3 | 33 | 0 | Travel_Frequently | 1392 | Research & Development | 3 | 4 | Life Sciences | 1 | 5 | ... | 3 | 80 | 0 | 8 | 3 | 3 | 8 | 7 | 3 | 0 |
| 4 | 27 | 0 | Travel_Rarely | 591 | Research & Development | 2 | 1 | Medical | 1 | 7 | ... | 4 | 80 | 1 | 6 | 3 | 3 | 2 | 2 | 2 | 2 |


In [38]:
# Dropping Employee count as all values are 1 and hence attrition is independent of this feature
attritionData = attritionData.drop(['EmployeeCount'], axis=1)
# Dropping Employee Number since it is merely an identifier
attritionData = attritionData.drop(['EmployeeNumber'], axis=1)

# Dropping Over18 since all values are 'Y'
attritionData = attritionData.drop(['Over18'], axis=1)

# Dropping StandardHours since all values are 80
attritionData = attritionData.drop(['StandardHours'], axis=1)
target = attritionData["Attrition"]

attritionXData = attritionData.drop(['Attrition'], axis=1)

In [39]:
# Split data into train and test
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(attritionXData, 
                                                    target, 
                                                    test_size = 0.2,
                                                    random_state=0,
                                                    stratify=target)
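
The `stratify=target` argument keeps the share of each class the same in the train and test sets, which matters when one class is much rarer than the other. The sketch below shows the idea in plain Python on made-up labels. `stratified_split` is a hypothetical helper written for illustration, not scikit-learn's implementation.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_size=0.2, seed=0):
    """Split row indices so each class keeps the same share in both parts."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)  # per-class test count
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Toy labels: 20 positives and 80 negatives (invented, not the attrition data).
labels = [1] * 20 + [0] * 80
train_idx, test_idx = stratified_split(labels)

train_rate = sum(labels[i] for i in train_idx) / len(train_idx)
test_rate = sum(labels[i] for i in test_idx) / len(test_idx)
print(train_rate, test_rate)
```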

In [41]:
# collect the categorical and numerical column names in separate lists
categorical = []
for col, value in attritionXData.items():  # iteritems() was removed in pandas 2.0
    if value.dtype == 'object':
        categorical.append(col)
        
numerical = attritionXData.columns.difference(categorical)        

### Transform raw features

We can explain raw features by using either a `sklearn.compose.ColumnTransformer` or a list of fitted transformer tuples. The cell below uses `sklearn.compose.ColumnTransformer`. To run the example with the list of fitted transformer tuples instead, comment out the cell below and uncomment the cell that follows it.

In [42]:
from sklearn.compose import ColumnTransformer

# We create the preprocessing pipelines for both numeric and categorical data.
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())])

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

transformations = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numerical),
        ('cat', categorical_transformer, categorical)])

# Append classifier to preprocessing pipeline.
# Now we have a full prediction pipeline.
clf = Pipeline(steps=[('preprocessor', transformations),
                      ('classifier', SVC(kernel='linear', C = 1.0, probability=True))])
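
Because one-hot encoding spreads one raw column across several engineered columns, importances computed on engineered columns have to be folded back onto the raw features. The sketch below shows one simple way to do that: summing over a map from engineered column to raw column. The importance values are invented and `to_raw_importances` is a hypothetical helper. Summation is an illustrative assumption here, not a statement of how the SDK aggregates internally.

```python
raw_features = ["Age", "BusinessTravel"]

# Engineered columns, in order: scaled Age, BusinessTravel=Travel_Frequently,
# BusinessTravel=Travel_Rarely. Each entry is the index of its raw feature.
engineered_to_raw = [0, 1, 1]

# Invented importance values for the three engineered columns.
engineered_importances = [0.30, 0.10, 0.05]


def to_raw_importances(engineered_values, mapping, n_raw):
    """Sum engineered importances into the raw feature each one came from."""
    raw = [0.0] * n_raw
    for value, raw_idx in zip(engineered_values, mapping):
        raw[raw_idx] += value
    return raw


raw_importances = to_raw_importances(
    engineered_importances, engineered_to_raw, len(raw_features))
for name, value in zip(raw_features, raw_importances):
    print(name, round(value, 2))
```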

In [43]:
'''
# Uncomment below if sklearn-pandas is not installed
#!pip install sklearn-pandas
from sklearn_pandas import DataFrameMapper

# Impute, standardize the numeric features and one-hot encode the categorical features.    


numeric_transformations = [([f], Pipeline(steps=[('imputer', SimpleImputer(strategy='median')), ('scaler', StandardScaler())])) for f in numerical]

categorical_transformations = [([f], OneHotEncoder(handle_unknown='ignore', sparse=False)) for f in categorical]

transformations = numeric_transformations + categorical_transformations

# Append classifier to preprocessing pipeline.
# Now we have a full prediction pipeline.
clf = Pipeline(steps=[('preprocessor', DataFrameMapper(transformations)),
                      ('classifier', SVC(kernel='linear', C = 1.0, probability=True))]) 



'''

### Train an SVM classification model, which you want to explain

In [44]:
model = clf.fit(x_train, y_train)

### Explain predictions on your local machine

In [45]:
# 1. Using SHAP TabularExplainer
# clf.steps[-1][1] returns the trained classification model
explainer = TabularExplainer(clf.steps[-1][1], 
                             initialization_examples=x_train, 
                             features=attritionXData.columns, 
                             classes=["Not leaving", "leaving"], 
                             transformations=transformations)




# 2. Using MimicExplainer
# augment_data is optional; if True, it oversamples the initialization examples so that the surrogate model fits the original model more accurately. Useful for high-dimensional data where the number of rows is less than the number of columns.
# max_num_of_augmentations is optional and defines max number of times we can increase the input data size.
# LGBMExplainableModel can be replaced with LinearExplainableModel, SGDExplainableModel, or DecisionTreeExplainableModel
# explainer = MimicExplainer(clf.steps[-1][1], 
#                            x_train, 
#                            LGBMExplainableModel, 
#                            augment_data=True, 
#                            max_num_of_augmentations=10, 
#                            features=attritionXData.columns, 
#                            classes=["Not leaving", "leaving"], 
#                            transformations=transformations)





# 3. Using PFIExplainer

# Use the parameter "metric" to pass a metric name or function to evaluate the permutation. 
# Note that if a metric function is provided a higher value must be better.
# Otherwise, take the negative of the function or set the parameter "is_error_metric" to True.
# Default metrics: 
# F1 Score for binary classification, F1 Score with micro average for multiclass classification and
# Mean absolute error for regression

# explainer = PFIExplainer(clf.steps[-1][1], 
#                          features=x_train.columns, 
#                          transformations=transformations,
#                          classes=["Not leaving", "leaving"])
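
To make the role of the metric concrete, here is a minimal sketch of permutation feature importance in plain Python. Each column is shuffled in turn and the change in the metric is recorded. With a score metric (higher is better) the importance is the drop in score, and with an error metric the sign is flipped, which is what the `is_error_metric` flag mentioned above is for. The function names, toy data, and toy model are hypothetical, and this is not the `PFIExplainer` implementation.

```python
import random


def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def permutation_importance(predict, rows, y_true, metric,
                           is_error_metric=False, n_repeats=5, seed=0):
    """Average change in the metric when one column at a time is shuffled."""
    rng = random.Random(seed)
    baseline = metric(y_true, [predict(r) for r in rows])
    importances = []
    for j in range(len(rows[0])):
        changes = []
        for _ in range(n_repeats):
            column = [r[j] for r in rows]
            rng.shuffle(column)
            permuted = [r[:j] + [v] + r[j + 1:] for r, v in zip(rows, column)]
            score = metric(y_true, [predict(r) for r in permuted])
            # Score metric: importance is the drop. Error metric: the increase.
            changes.append(score - baseline if is_error_metric else baseline - score)
        importances.append(sum(changes) / n_repeats)
    return importances


def predict(row):
    """Toy model that only looks at the first feature."""
    return 1 if row[0] > 0.5 else 0


# Toy data: feature 0 determines the label, feature 1 is unrelated.
rows = [[i / 19, ((i * 7) % 20) / 19] for i in range(20)]
y = [predict(r) for r in rows]

importances = permutation_importance(predict, rows, y, accuracy)
print(importances)
```

Shuffling the feature the toy model ignores leaves every prediction unchanged, so its importance is exactly zero, while shuffling the feature it relies on lowers accuracy.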



### Generate global explanations
Explain overall model predictions (global explanation)

In [46]:
# Passing in test dataset for evaluation examples - note it must be a representative sample of the original data
# x_train can be passed as well, but with more examples explanations will take longer although they may be more accurate
global_explanation = explainer.explain_global(x_test)

# Note: if you used the PFIExplainer in the previous step, use the next line of code instead
# global_explanation = explainer.explain_global(x_test, true_labels=y_test)

In [47]:
# Sorted SHAP values
print('ranked global importance values: {}'.format(global_explanation.get_ranked_global_values()))
# Corresponding feature names
print('ranked global importance names: {}'.format(global_explanation.get_ranked_global_names()))
# Feature ranks (based on original order of features)
print('global importance rank: {}'.format(global_explanation.global_importance_rank))

# Note: PFIExplainer does not support per class explanations
# Per class feature names
print('ranked per class feature names: {}'.format(global_explanation.get_ranked_per_class_names()))
# Per class feature importance values
print('ranked per class feature values: {}'.format(global_explanation.get_ranked_per_class_values()))

ranked global importance values: [0.5134284805406166, 0.30030967354296456, 0.2998085905496084, 0.25167174562600797, 0.251665144329456, 0.24842790515897314, 0.23091624707167333, 0.2282622658043289, 0.20708691720443917, 0.20546454593806174, 0.20168882722987594, 0.19185389234256517, 0.18228916781256482, 0.17516570245223695, 0.16657039369934307, 0.16480897662304778, 0.1552006121906684, 0.13637139403401735, 0.12678964171011925, 0.12257630403924014, 0.04878692188912716, 0.035389982269014755, 0.032518503366939094, 0.029916525079299133, 0.029192633082262415, 0.018732191529540244, 0.017129450861988776, 0.013402471316552865, 0.009678550607134809, 0.004254087046426229]
ranked global importance names: ['OverTime', 'TotalWorkingYears', 'NumCompaniesWorked', 'MaritalStatus', 'JobSatisfaction', 'JobRole', 'YearsWithCurrManager', 'YearsSinceLastPromotion', 'BusinessTravel', 'EnvironmentSatisfaction', 'DistanceFromHome', 'RelationshipSatisfaction', 'YearsInCurrentRole', 'JobInvolvement', 'TrainingTimes

In [48]:
# Print out a dictionary that holds the sorted feature importance names and values
# print('global importance dict: {}'.format(global_explanation.get_feature_importance_dict()))

### Explain overall model predictions as a collection of local (instance-level) explanations

In [49]:
# SHAP values for all features and all data points in the evaluation data (x_test)
#print('local importance values: {}'.format(global_explanation.local_importance_values))
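
One common way to relate the two views is to take the mean absolute local value per feature as its global importance, then rank features by that number. The sketch below does this on a small invented matrix of local values. The numbers are made up, `global_importance` is a hypothetical helper, and treating mean absolute value as the aggregation rule is an assumption based on the usual SHAP convention rather than something this notebook states.

```python
feature_names = ["OverTime", "Age", "JobRole"]

# Invented local importance values: one row per data point, one column per feature.
local_values = [
    [0.6, -0.1, 0.2],
    [-0.4, 0.3, 0.0],
    [0.5, -0.2, -0.1],
]


def global_importance(local_rows):
    """Mean absolute local value per feature."""
    n_rows = len(local_rows)
    n_features = len(local_rows[0])
    return [sum(abs(row[j]) for row in local_rows) / n_rows
            for j in range(n_features)]


values = global_importance(local_values)
ranked = sorted(zip(feature_names, values), key=lambda pair: pair[1], reverse=True)
ranked_names = [name for name, _ in ranked]
ranked_values = [value for _, value in ranked]
print(ranked_names)
```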

### Generate local explanations
Explain local data points (individual instances)

In [50]:
# Note: PFIExplainer does not support local explanations
# You can pass a specific data point or a group of data points to the explain_local function

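# E.g., explain the first data point in the test set (positional index 0),
# so that the prediction looked up in the next cell refers to the same row
instance_num = 0
local_explanation = explainer.explain_local(x_test[instance_num:instance_num + 1])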

In [51]:
# Get the prediction for the first member of the test set and explain why model made that prediction
prediction_value = clf.predict(x_test)[instance_num]

sorted_local_importance_values = local_explanation.get_ranked_local_values()[prediction_value]
sorted_local_importance_names = local_explanation.get_ranked_local_names()[prediction_value]

print('local importance values: {}'.format(sorted_local_importance_values))
print('local importance names: {}'.format(sorted_local_importance_names))

local importance values: [[0.6233179710596306, 0.3728301775065386, 0.2889473339334777, 0.25712756485848065, 0.24148793764321672, 0.23408442449743197, 0.21590096835523745, 0.18557871316436628, 0.1658628112964533, 0.0802450933295773, 0.07861270726169912, 0.07512567866968925, 0.06692783763325688, 0.056363549710890186, 0.038747904191682066, 0.031599721441321144, 0.029335744203670838, 0.010189464347524898, 0.009649547044985423, -0.010341395331854016, -0.016834903675227244, -0.019498857839416586, -0.02125908444390076, -0.025057076003163807, -0.12678408691466947, -0.23017772957470045, -0.25423978357592675, -0.3220474134993808, -0.34499991862657625, -0.39177022068993317]]
local importance names: [['TotalWorkingYears', 'OverTime', 'WorkLifeBalance', 'Age', 'EducationField', 'YearsSinceLastPromotion', 'MaritalStatus', 'YearsAtCompany', 'DistanceFromHome', 'JobSatisfaction', 'NumCompaniesWorked', 'BusinessTravel', 'MonthlyIncome', 'RelationshipSatisfaction', 'StockOptionLevel', 'Department', 'Dai
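
The ranked output above runs from the largest positive value to the most negative one. For SHAP-style values, positive entries push the prediction toward the explained class and negative entries push it away. A small sketch of reading such a ranking follows, using invented values rather than the ones printed above. The variable names are illustrative only.

```python
# Invented local importance values for a single data point.
names = ["TotalWorkingYears", "OverTime", "JobRole", "Age"]
values = [0.62, 0.37, -0.39, -0.02]

# Rank by signed value, largest first, mirroring the ordering of the output above.
ranked = sorted(zip(names, values), key=lambda pair: pair[1], reverse=True)

supporting = [name for name, value in ranked if value > 0]
opposing = [name for name, value in ranked if value < 0]

print("pushes toward the class:", supporting)
print("pushes away from the class:", opposing)
```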

## Visualize
Load the visualization dashboard

In [52]:
from azureml.contrib.explain.model.visualize import ExplanationDashboard

In [53]:
ExplanationDashboard(global_explanation, model, x_test)


## Next steps
Learn about other use cases of the explain package:

1. [Training time: regression problem](./explain-regression-local.ipynb)
1. [Training time: binary classification problem](./explain-binary-classification-local.ipynb)
1. [Training time: multiclass classification problem](./explain-multiclass-classification-local.ipynb)
1. [Explain models with advanced feature transformations](./advanced-feature-transformations-explain-local.ipynb)
1. [Save model explanations via Azure Machine Learning Run History](../azure-integration/run-history/save-retrieve-explanations-run-history.ipynb)
1. [Run explainers remotely on Azure Machine Learning Compute (AMLCompute)](../azure-integration/remote-explanation/explain-model-on-amlcompute.ipynb)
1. Inferencing time: deploy a classification model and explainer:
    1. [Deploy a locally-trained model and explainer](../azure-integration/scoring-time/train-explain-model-locally-and-deploy.ipynb)
    1. [Deploy a remotely-trained model and explainer](../azure-integration/scoring-time/train-explain-model-on-amlcompute-and-deploy.ipynb)