<a id = 'top-of-notebook'></a>
# Data Science Quick Tip #006: Explaining AI with SHAP and LIME!
In this notebook, we'll use the Titanic dataset once again, this time to explain a trained model's predictions with SHAP and LIME.

## Table of Contents
- [Project Setup](#project-setup)
- [Data Cleansing](#data-cleansing)
- [Data Modelling](#data-modelling)
- [Model Validation](#model-validation)
- [Explainability with SHAP](#SHAP)
- [Explainability with LIME](#LIME)

<a id = 'project-setup'></a>
## Project Setup

In [46]:
# Importing necessary libraries
import pandas as pd

from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import roc_auc_score, accuracy_score, confusion_matrix

import warnings
warnings.filterwarnings('ignore')

In [2]:
# Importing the training data
df_train = pd.read_csv('../data/titanic/train.csv')

In [3]:
# Viewing the first few rows of the data
df_train.head()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


[*Return to top of notebook*](#top-of-notebook)

<a id = 'data-cleansing'></a>
## Data Cleansing

#### Dropping columns of no value

In [4]:
df_train.drop(columns = ['PassengerId', 'Name', 'Ticket'], inplace = True)

#### Creating new "age_bins" feature from current "Age" feature

In [5]:
# Defining / instantiating the necessary variables
age_bins = [-3, -1, 12, 18, 25, 50, 100]
age_labels = ['unknown', 'child', 'teen', 'young_adult', 'adult', 'elder']

In [6]:
# Filling null age values
df_train['Age'] = df_train['Age'].fillna(-2)

In [7]:
# Binning the ages appropriately as defined via the variables above
df_train['Age_Bins'] = pd.cut(df_train['Age'], bins = age_bins, labels = age_labels)

In [8]:
# Dropping the now unneeded 'Age' feature
df_train.drop(columns = 'Age', inplace = True)
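
To make the binning step concrete, here is a minimal sketch on a handful of made-up ages (the `ages` Series is hypothetical, not taken from the dataset). It reuses the `age_bins` and `age_labels` defined above and shows how the `-2` sentinel lands in the `unknown` bin, since `pd.cut` uses right-closed intervals by default.

```python
import numpy as np
import pandas as pd

age_bins = [-3, -1, 12, 18, 25, 50, 100]
age_labels = ['unknown', 'child', 'teen', 'young_adult', 'adult', 'elder']

# Hypothetical sample ages, including one missing value
ages = pd.Series([22.0, np.nan, 0.42, 12.0, 38.0, 80.0])

# Missing ages become the sentinel -2, which falls inside the (-3, -1] 'unknown' bin
binned = pd.cut(ages.fillna(-2), bins = age_bins, labels = age_labels)

print(binned.tolist())
# ['young_adult', 'unknown', 'child', 'child', 'adult', 'elder']
```

Note that an age of exactly 12 is labelled `child` because each bin includes its right edge.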

#### Creating a new "Section" feature using the current "Cabin" feature

In [9]:
# Grabbing the first character from the cabin section
df_train['Section'] = df_train['Cabin'].str[:1]

In [10]:
# Filling out the nulls
df_train['Section'] = df_train['Section'].fillna('No Cabin')

In [11]:
# Dropping former 'Cabin' feature
df_train.drop(columns = 'Cabin', inplace = True)
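
As a quick illustration of the "Section" feature, the sketch below applies the same two steps to a few made-up cabin values (the `cabins` Series is hypothetical). Slicing with `.str[:1]` leaves missing values as nulls, so the fill step is what turns them into `'No Cabin'`.

```python
import numpy as np
import pandas as pd

# Hypothetical cabin values, two of them missing
cabins = pd.Series(['C85', np.nan, 'C123', 'E46', np.nan])

# First character of the cabin code, with nulls replaced afterwards
section = cabins.str[:1].fillna('No Cabin')

print(section.tolist())
# ['C', 'No Cabin', 'C', 'E', 'No Cabin']
```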

#### Filling nulls in the "Embarked" feature

In [12]:
df_train['Embarked'] = df_train['Embarked'].fillna('Unknown')

#### One hot encoding the categorical variables

In [13]:
# Defining the categorical features
cat_feats = ['Pclass', 'Sex', 'Age_Bins', 'Embarked', 'Section']

In [14]:
# Instantiating OneHotEncoder
ohe = OneHotEncoder(categories = 'auto')

In [15]:
# Fitting the categorical variables to the encoder
cat_feats_encoded = ohe.fit_transform(df_train[cat_feats])

In [16]:
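# Creating a DataFrame with the encoded value information
# (scikit-learn 1.0 and later provide this method as get_feature_names_out; get_feature_names was removed in 1.2)
cat_df = pd.DataFrame(data = cat_feats_encoded.toarray(), columns = ohe.get_feature_names(cat_feats))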
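
The encoding step can be sketched on a tiny made-up frame (the `toy` DataFrame and the `ohe_demo` name are hypothetical). Because the method that returns column names differs between scikit-learn versions, this sketch checks which one is available rather than assuming a version.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical miniature of the categorical columns
toy = pd.DataFrame({'Sex': ['male', 'female', 'female'],
                    'Embarked': ['S', 'C', 'S']})

ohe_demo = OneHotEncoder(categories = 'auto')
encoded = ohe_demo.fit_transform(toy)  # returns a sparse matrix

# Newer scikit-learn exposes get_feature_names_out; older releases only have get_feature_names
if hasattr(ohe_demo, 'get_feature_names_out'):
    names = ohe_demo.get_feature_names_out(list(toy.columns))
else:
    names = ohe_demo.get_feature_names(list(toy.columns))

encoded_df = pd.DataFrame(data = encoded.toarray(), columns = list(names))
print(encoded_df.columns.tolist())
# ['Sex_female', 'Sex_male', 'Embarked_C', 'Embarked_S']
```

Each original column expands into one indicator column per category, with categories sorted alphabetically within a feature.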

#### Scaling numerical columns

In [17]:
# Defining the numerical features
num_feats = ['SibSp', 'Parch', 'Fare']

In [18]:
# Instantiating the StandardScaler object
scaler = StandardScaler()

In [19]:
# Fitting the data to the scaler
num_feats_scaled = scaler.fit_transform(df_train[num_feats])

In [20]:
# Creating DataFrame with numerically scaled data
num_df = pd.DataFrame(data = num_feats_scaled, columns = num_feats)
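
For reference, `StandardScaler` subtracts each column's mean and divides by its population standard deviation. The sketch below checks that against a manual calculation, using the five fares from the preview above as a toy input (so the resulting numbers will not match the full-dataset scaling).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# The five fares shown in df_train.head(), used here only as a toy column
fares = np.array([[7.25], [71.2833], [7.925], [53.1], [8.05]])

scaled = StandardScaler().fit_transform(fares)

# Manual equivalent: (x - mean) / std, where np.std defaults to ddof=0
manual = (fares - fares.mean(axis = 0)) / fares.std(axis = 0)

print(np.allclose(scaled, manual))
```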

#### Concatenating the encoded / scaled dataframes together to form "X" training data

In [21]:
X = pd.concat([cat_df, num_df], axis = 1)

#### Extracting target variable ("Survived") from df_train as "y"

In [22]:
# Selecting a 1-D Series so scikit-learn receives the target in the shape it expects
y = df_train['Survived']

#### Performing training / validation split on the data

In [23]:
# Performing train-test (validation) split on the data
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state = 42)
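
In the cells above, the encoder and scaler are fit on the full dataset before the split, so the validation rows influence the preprocessing. One possible alternative, sketched here on synthetic data, is to split first and wrap the preprocessing in a `Pipeline` so it is fit on the training rows only. The `toy` frame, the `target` rule, and every variable name below are assumptions made for illustration, not part of the original notebook.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the Titanic features (hypothetical data)
rng = np.random.default_rng(42)
n = 200
toy = pd.DataFrame({'Sex': rng.choice(['male', 'female'], size = n),
                    'Pclass': rng.choice([1, 2, 3], size = n),
                    'Fare': rng.gamma(2.0, 15.0, size = n)})
target = ((toy['Sex'] == 'female') | (toy['Pclass'] == 1)).astype(int)

# Split the raw data first, keeping the class balance similar in both parts
X_tr, X_va, y_tr, y_va = train_test_split(toy, target, random_state = 42, stratify = target)

# Encoder and scaler are fit only when the pipeline is fit on the training split
preprocess = ColumnTransformer([
    ('cat', OneHotEncoder(handle_unknown = 'ignore'), ['Sex', 'Pclass']),
    ('num', StandardScaler(), ['Fare']),
])
model = Pipeline([('prep', preprocess),
                  ('rfc', RandomForestClassifier(n_estimators = 10, random_state = 42))])
model.fit(X_tr, y_tr)

val_acc = model.score(X_va, y_va)
print(f'Validation accuracy: {val_acc:.3f}')
```

Setting `handle_unknown = 'ignore'` lets the encoder cope with a category that appears only in the validation rows.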

[*Return to top of notebook*](#top-of-notebook)

<a id = 'data-modelling'></a>
## Data Modelling

#### Performing GridSearch to find optimal parameters

In [33]:
# Defining the random forest parameters
params = {'n_estimators': [10, 50, 100],
          'min_samples_split': [2, 5, 10],
          'min_samples_leaf': [1, 2, 5],
          'max_depth': [10, 20, 50]
         }

In [34]:
# Instantiating RandomForestClassifier object and GridSearch object
rfc = RandomForestClassifier()
clf = GridSearchCV(estimator = rfc,
                   param_grid = params)

In [35]:
# Fitting the training data to the GridSearch object
clf.fit(X_train, y_train)

GridSearchCV(cv=None, error_score=nan,
             estimator=RandomForestClassifier(bootstrap=True, ccp_alpha=0.0,
                                              class_weight=None,
                                              criterion='gini', max_depth=None,
                                              max_features='auto',
                                              max_leaf_nodes=None,
                                              max_samples=None,
                                              min_impurity_decrease=0.0,
                                              min_impurity_split=None,
                                              min_samples_leaf=1,
                                              min_samples_split=2,
                                              min_weight_fraction_leaf=0.0,
                                              n_estimators=100, n_jobs=None,
                                              oob_score=False,
                                              ra

In [36]:
# Displaying the best parameters from the GridSearch object
clf.best_params_

{'max_depth': 10,
 'min_samples_leaf': 2,
 'min_samples_split': 2,
 'n_estimators': 10}

#### Training the model with the ideal params

In [41]:
# Instantiating RandomForestClassifier with ideal params
rfc = RandomForestClassifier(max_depth = 10,
                             min_samples_leaf = 2,
                             min_samples_split = 2,
                             n_estimators = 10)

In [42]:
# Fitting the training data to the model
rfc.fit(X_train, y_train)

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=10, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=2, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=10,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)
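
Because no `random_state` is set, refitting a forest with the best parameters yields a slightly different model on each run. A possible shortcut, sketched below on synthetic data, is to reuse the model that `GridSearchCV` already refits on the full training set by default (`refit=True`), available as `best_estimator_`. The dataset, the reduced grid, and the names `search` and `best_rfc` are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the training data (hypothetical)
X_demo, y_demo = make_classification(n_samples = 200, n_features = 8, random_state = 42)

# A reduced grid so the sketch runs quickly
small_grid = {'n_estimators': [10, 50],
              'max_depth': [5, 10]}

search = GridSearchCV(estimator = RandomForestClassifier(random_state = 42),
                      param_grid = small_grid,
                      cv = 5,
                      scoring = 'accuracy')
search.fit(X_demo, y_demo)

# The best parameter combination, already refit on all of X_demo
best_rfc = search.best_estimator_
print(search.best_params_)
print(f'Mean cross-validated accuracy: {search.best_score_:.3f}')
```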

[*Return to top of notebook*](#top-of-notebook)

<a id = 'model-validation'></a>
## Model Validation

In [43]:
# Checking out our predicted results using the validation dataset
pipeline_preds = rfc.predict(X_val)

val_accuracy = accuracy_score(y_val, pipeline_preds)
val_roc_auc = roc_auc_score(y_val, pipeline_preds)
val_confusion_matrix = confusion_matrix(y_val, pipeline_preds)

print(f'Accuracy Score: {val_accuracy}')
print(f'ROC AUC Score: {val_roc_auc}')
print(f'Confusion Matrix: \n{val_confusion_matrix}')

Accuracy Score: 0.820627802690583
ROC AUC Score: 0.8073536810330371
Confusion Matrix: 
[[117  17]
 [ 23  66]]
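
The scores above can be reproduced from the confusion matrix alone. Since `roc_auc_score` was given hard 0/1 predictions rather than probabilities, the value it returns equals the balanced accuracy, the mean of the true positive and true negative rates. The sketch below rebuilds label arrays from the four counts to check this; the array names are hypothetical.

```python
import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score, roc_auc_score

# Counts read from the confusion matrix above (rows = actual, columns = predicted)
tn, fp, fn, tp = 117, 17, 23, 66

# Rebuilding label arrays that produce exactly this confusion matrix
y_true = np.array([0] * (tn + fp) + [1] * (fn + tp))
y_pred = np.array([0] * tn + [1] * fp + [0] * fn + [1] * tp)

acc = accuracy_score(y_true, y_pred)
auc_from_labels = roc_auc_score(y_true, y_pred)
bal_acc = balanced_accuracy_score(y_true, y_pred)

print(f'Accuracy:          {acc}')
print(f'ROC AUC (labels):  {auc_from_labels}')
print(f'Balanced accuracy: {bal_acc}')
```

To get a threshold-independent ROC AUC, one option would be to pass the positive-class probabilities, `rfc.predict_proba(X_val)[:, 1]`, as the second argument instead of the predicted labels.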


[*Return to top of notebook*](#top-of-notebook)

<a id = 'SHAP'></a>
## Explainability with SHAP

In [None]:
# Selecting two people from the validation set: one who survived and one who did not
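
One way the selection could look is sketched below, using small made-up stand-ins for `X_val` and `y_val` (the index labels, feature values, and variable names are all hypothetical). It picks the first survivor and the first non-survivor by index label, then pulls the matching feature rows.

```python
import pandas as pd

# Hypothetical stand-ins for y_val and X_val, sharing the same index labels
y_demo = pd.Series([0, 0, 1, 0, 1], index = [101, 102, 103, 104, 105], name = 'Survived')
X_demo = pd.DataFrame({'Sex_female': [0.0, 0.0, 1.0, 0.0, 1.0],
                       'Fare': [-0.34, -0.44, 0.02, -0.49, -0.42]},
                      index = y_demo.index)

# First index label for each outcome
survivor_idx = y_demo.index[y_demo == 1][0]
non_survivor_idx = y_demo.index[y_demo == 0][0]

# Feature rows for the two selected passengers
selected = X_demo.loc[[survivor_idx, non_survivor_idx]]
print(selected)
```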


[*Return to top of notebook*](#top-of-notebook)

<a id = 'LIME'></a>
## Explainability with LIME

#### Instantiating our LIME tabular explainer

In [49]:
# Defining our LIME explainer
import lime.lime_tabular
lime_explainer = lime.lime_tabular.LimeTabularExplainer(X_train.values,
                                                        mode = 'classification',
                                                        feature_names = X_train.columns,
                                                        class_names = ['Did Not Survive', 'Survived']
                                                        )

In [50]:
# Defining a quick function that returns the model's class probabilities, which LIME calls on the perturbed samples
predict_rfc_prob = lambda x: rfc.predict_proba(x).astype(float)

In [53]:
X_val.head()

| | Pclass_1 | Pclass_2 | Pclass_3 | Sex_female | Sex_male | Age_Bins_adult | Age_Bins_child | Age_Bins_elder | Age_Bins_teen | Age_Bins_unknown | ... | Section_C | Section_D | Section_E | Section_F | Section_G | Section_No Cabin | Section_T | SibSp | Parch | Fare |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 709 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.432793 | 0.76763 | -0.341452 |
| 439 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | -0.474545 | -0.473674 | -0.437007 |
| 840 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | -0.474545 | -0.473674 | -0.488854 |
| 720 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | -0.474545 | 0.76763 | 0.016023 |
| 39 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.432793 | -0.473674 | -0.422074 |


In [51]:
# Explaining the first training instance using its 10 most influential features
first_instance = lime_explainer.explain_instance(X_train.iloc[0].values,
                                                 predict_rfc_prob,
                                                 num_features = 10)
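
To give a feel for what happens inside `explain_instance`, here is a heavily simplified sketch of the general idea behind LIME: perturb the instance, query the model on the perturbed samples, weight those samples by how close they are to the instance, and fit a weighted linear model whose coefficients act as local feature effects. This is not the `lime` library's actual implementation. The `black_box` function, the kernel width, the sample count, and the choice of `Ridge` are all assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)

def black_box(X):
    # Hypothetical model: probability rises with feature 0, falls with feature 1, ignores feature 2
    return 1.0 / (1.0 + np.exp(-(2.0 * X[:, 0] - 1.0 * X[:, 1])))

# The instance we want to explain (made-up values)
instance = np.array([0.5, -0.2, 1.0])

# 1. Perturb the instance with Gaussian noise
samples = instance + rng.normal(scale = 1.0, size = (1000, 3))

# 2. Query the black-box model on the perturbed samples
preds = black_box(samples)

# 3. Weight each sample by its proximity to the instance (exponential kernel)
distances = np.linalg.norm(samples - instance, axis = 1)
kernel_width = 0.75 * np.sqrt(instance.shape[0])
weights = np.exp(-(distances ** 2) / kernel_width ** 2)

# 4. Fit a weighted linear surrogate; its coefficients are the local explanation
surrogate = Ridge(alpha = 1.0).fit(samples, preds, sample_weight = weights)
local_effects = surrogate.coef_

for i, effect in enumerate(local_effects):
    print(f'feature {i}: {effect:+.3f}')
```

The surrogate should assign a positive effect to feature 0, a negative effect to feature 1, and an effect near zero to feature 2, mirroring how the hypothetical model behaves around the chosen instance.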

[*Return to top of notebook*](#top-of-notebook)