Copyright (c) Microsoft Corporation. All rights reserved.

Licensed under the MIT License.

# Explain binary classification model predictions with raw feature transformations
_**This notebook showcases how to use the Azure Machine Learning Interpretability SDK to explain and visualize a binary classification model.**_


## Table of Contents

1. [Introduction](#Introduction)
1. [Setup](#Setup)
1. [Run model explainer locally at training time](#Explain)
    1. Apply feature transformations
    1. Train an EBM model
    1. Explain the model on raw features
        1. Generate global explanations
        1. Generate local explanations
1. [Visualize results](#Visualize)

## Introduction

This notebook demonstrates the Explainable Boosting Machine (EBM), an algorithm designed to be both accurate and intelligible. EBM applies modern machine learning techniques such as bagging and boosting to traditional GAMs (Generalized Additive Models). The resulting models are often as accurate as random forests and gradient boosted trees, while remaining easier to inspect and edit.
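
For reference, the additive structure that EBM inherits from GAMs can be written as below. This is the standard GAM form, given here as background rather than taken from this notebook: $g$ is the link function (the logit for binary classification), $\beta_0$ is the intercept, and $f_j$ is the shape function learned for feature $x_j$.

```latex
g\left(\mathbb{E}[y]\right) = \beta_0 + \sum_{j} f_j(x_j)
```

Because each $f_j$ depends on a single feature, its contribution to a prediction can be plotted and inspected on its own. EBM can also include pairwise interaction terms of the form $f_{ij}(x_i, x_j)$, which are omitted from the formula above for brevity.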




Problem: IBM employee attrition classification with scikit-learn (run model explainer locally)

1. Train an EBM model.
2. Generate global and local explanations (`explain_global` and `explain_local`) in local mode, which doesn't contact any Azure services.
3. Visualize the global and local explanations with the visualization dashboard.



## Explain

### Run model explainer locally at training time

In [4]:
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.svm import SVC
import pandas as pd
import numpy as np

# Glassbox model:
# Explainable Boosting Machine (EBM) classifier from InterpretML
from interpret.glassbox import ExplainableBoostingClassifier

### Load the IBM employee attrition data

In [5]:
# get the IBM employee attrition dataset
outdirname = 'dataset.6.21.19'
try:
    from urllib import urlretrieve
except ImportError:
    from urllib.request import urlretrieve
import zipfile
zipfilename = outdirname + '.zip'
urlretrieve('https://publictestdatasets.blob.core.windows.net/data/' + zipfilename, zipfilename)
with zipfile.ZipFile(zipfilename, 'r') as unzip:
    unzip.extractall('.')
attritionData = pd.read_csv('./WA_Fn-UseC_-HR-Employee-Attrition.csv')




# Drop EmployeeCount: every value is 1, so it carries no information about attrition
attritionData = attritionData.drop(['EmployeeCount'], axis=1)
# Drop EmployeeNumber: it is merely an identifier
attritionData = attritionData.drop(['EmployeeNumber'], axis=1)
# Drop Over18: every value is 'Y'
attritionData = attritionData.drop(['Over18'], axis=1)
# Drop StandardHours: every value is 80
attritionData = attritionData.drop(['StandardHours'], axis=1)

# Convert the target variable from string labels to numeric values
target_map = {'Yes': 1, 'No': 0}
attritionData["Attrition_numerical"] = attritionData["Attrition"].map(target_map)
target = attritionData["Attrition_numerical"]

attritionXData = attritionData.drop(['Attrition_numerical', 'Attrition'], axis=1)
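
The cell above drops columns that hold a single value by naming them explicitly. A minimal sketch of how such columns could be detected programmatically is shown here. The `toy` frame and the `constant_columns` helper are hypothetical illustrations, not part of the original notebook.

```python
import pandas as pd

# Toy stand-in for the attrition table; the values are illustrative only.
toy = pd.DataFrame({
    "Age": [41, 49, 37],
    "EmployeeCount": [1, 1, 1],
    "StandardHours": [80, 80, 80],
    "OverTime": ["Yes", "No", "Yes"],
})


def constant_columns(df):
    """Return the names of columns that hold a single distinct value."""
    n_unique = df.nunique(dropna=False)
    return n_unique[n_unique == 1].index.tolist()


constant_cols = constant_columns(toy)
print(constant_cols)
```

Running the same helper on `attritionData` before the drops would be one way to confirm which columns are constant.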

In [6]:
# Split data into train and test
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(attritionXData,
                                                    target,
                                                    test_size=0.2,
                                                    random_state=0,
                                                    stratify=target)
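
The `stratify=target` argument keeps the proportion of positive labels the same in the training and test sets, which matters when one class is a minority. The sketch below illustrates this on synthetic labels. The arrays are made up for the example and are not the attrition data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic imbalanced labels: 20 positives out of 100 rows (illustrative only).
y = np.array([1] * 20 + [0] * 80)
X = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# With stratification, both splits keep the original share of positives.
train_rate = y_tr.mean()
test_rate = y_te.mean()
print(f"positive rate: train={train_rate:.2f}, test={test_rate:.2f}")
```

The same check can be applied to `y_train` and `y_test` from the cell above.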

In [9]:
attritionXData.columns

Index(['Age', 'BusinessTravel', 'DailyRate', 'Department', 'DistanceFromHome',
       'Education', 'EducationField', 'EnvironmentSatisfaction', 'Gender',
       'HourlyRate', 'JobInvolvement', 'JobLevel', 'JobRole',
       'JobSatisfaction', 'MaritalStatus', 'MonthlyIncome', 'MonthlyRate',
       'NumCompaniesWorked', 'OverTime', 'PercentSalaryHike',
       'PerformanceRating', 'RelationshipSatisfaction', 'StockOptionLevel',
       'TotalWorkingYears', 'TrainingTimesLastYear', 'WorkLifeBalance',
       'YearsAtCompany', 'YearsInCurrentRole', 'YearsSinceLastPromotion',
       'YearsWithCurrManager'],
      dtype='object')

### Train an EBM model

In [11]:
from interpret.glassbox import ExplainableBoostingClassifier

ebm = ExplainableBoostingClassifier(feature_names=attritionXData.columns)
ebm.fit(x_train, y_train)

ExplainableBoostingClassifier(binning_strategy='uniform',
               data_n_episodes=2000, early_stopping_run_length=50,
               early_stopping_tolerance=1e-05,
               feature_names=['Age', 'BusinessTravel', 'DailyRate', 'Department', 'DistanceFromHome', 'Education', 'EducationField', 'EnvironmentSatisfaction', 'Gender', 'HourlyRate', 'JobInvolvement', 'JobLevel', 'JobRole', 'JobSatisfaction', 'MaritalStatus', 'MonthlyIncome', 'MonthlyRate', 'NumCompaniesWorked', ...Balance', 'YearsAtCompany', 'YearsInCurrentRole', 'YearsSinceLastPromotion', 'YearsWithCurrManager'],
               feature_step_n_inner_bags=0,
               feature_types=['continuous', 'categorical', 'continuous', 'categorical', 'continuous', 'continuous', 'categorical', 'continuous', 'categorical', 'continuous', 'continuous', 'continuous', 'categorical', 'continuous', 'categorical', 'continuous', 'continuous', 'continuous', 'categorical', 'continuous', 'categorical', 'continuous', 'continuous', 'con
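
The notebook does not evaluate the fitted model. A minimal sketch of one way to do so with scikit-learn metrics follows. The `summarize_binary_scores` helper and the toy values are hypothetical; with the objects defined above, one would presumably pass `y_test` and `ebm.predict_proba(x_test)[:, 1]` instead.

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score


def summarize_binary_scores(y_true, y_prob, threshold=0.5):
    """Return accuracy and ROC AUC for predicted positive-class probabilities."""
    y_true = np.asarray(y_true)
    y_prob = np.asarray(y_prob)
    # Turn probabilities into hard labels at the chosen threshold.
    y_pred = (y_prob >= threshold).astype(int)
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "roc_auc": roc_auc_score(y_true, y_prob),
    }


# Illustrative values only, not output from the EBM trained above.
toy_true = [0, 0, 1, 1]
toy_prob = [0.1, 0.4, 0.35, 0.8]
metrics = summarize_binary_scores(toy_true, toy_prob)
print(metrics)
```

ROC AUC is reported alongside accuracy because accuracy alone can look high on imbalanced data even when the minority class is poorly predicted.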

### Generate global explanations
Explain overall model predictions (global explanation)

In [12]:
from interpret import show

ebm_global = ebm.explain_global()
show(ebm_global)

### Generate local explanations
Explain local data points (individual instances)

In [13]:
ebm_local = ebm.explain_local(x_test, y_test)
show(ebm_local)
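
A local explanation for an additive model breaks one prediction into per-feature scores that, together with the intercept, sum to the log-odds of the positive class. The sketch below illustrates that arithmetic in plain Python. The function name and the scores are hypothetical and are not read from the fitted `ebm`.

```python
import math


def explain_additive_prediction(intercept, contributions):
    """Combine per-feature scores (in log-odds) into a probability.

    `contributions` maps a feature name to that feature's additive score
    for one row. Returns the probability and the features ranked by
    absolute contribution, largest first.
    """
    logit = intercept + sum(contributions.values())
    probability = 1.0 / (1.0 + math.exp(-logit))
    ranked = sorted(contributions.items(), key=lambda kv: abs(kv[1]), reverse=True)
    return probability, ranked


# Hypothetical scores for one employee; not taken from the fitted model.
example_scores = {"OverTime": 0.9, "MonthlyIncome": -0.4, "Age": 0.2}
prob, ranked = explain_additive_prediction(intercept=-1.5, contributions=example_scores)
print(f"predicted probability: {prob:.3f}")
for name, score in ranked:
    print(f"{name}: {score:+.2f}")
```

Positive scores push the prediction toward attrition and negative scores push it away, which is how the bars in the local explanation view can be read.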