# Simple Example Using LIME.

LIME = (L)ocal (I)nterpretable (M)odel-agnostic (E)xplanations

## Overview

- About the example program.
 - Visualizes, row by row, how much each explanatory variable contributes to the predicted survival probability on the Titanic dataset, using LIME.
 - Created to get an intuitive, hands-on feel for how LIME behaves.
 - The model is a logistic regression for a binary classification problem.
 - Feature selection and missing-value handling are my own ad hoc choices.
 - Uses only the train data.
- How to install LIME.
 - pip install lime
- Future version upgrade plans.
 - Implement handling of multicollinearity.
    - However, "Simple and High Quality" is mandatory.
    - I tried get_dummies / LabelBinarizer, but the result was spaghetti code, so I gave up.
- Reference URL.
 - https://lime-ml.readthedocs.io/en/latest/index.html
 - https://github.com/marcotcr/lime
 - https://towardsdatascience.com/decrypting-your-machine-learning-model-using-lime-5adc035109b5
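
To make the idea behind LIME concrete before running the library, here is a minimal sketch of a local surrogate model written with NumPy only. It is a simplification and not the library's implementation: `local_surrogate` and `toy_predict_proba` are hypothetical names, the black box is a made-up logistic function, and plain weighted least squares is assumed for the surrogate.

```python
import numpy as np


def toy_predict_proba(X):
    # Hypothetical black box: feature 0 raises, feature 1 lowers,
    # and feature 2 does not affect the probability of class 1.
    p1 = 1.0 / (1.0 + np.exp(-(2.0 * X[:, 0] - 1.0 * X[:, 1])))
    return np.column_stack([1.0 - p1, p1])


def local_surrogate(predict_proba, x, scale, num_samples=2000, seed=0):
    """Fit a proximity-weighted linear model around the instance x."""
    rng = np.random.default_rng(seed)
    d = x.shape[0]
    kernel_width = 0.75 * np.sqrt(d)  # assumed kernel width for this sketch

    # 1. Perturb the instance with Gaussian noise.
    z = rng.normal(size=(num_samples, d))
    samples = x + z * scale

    # 2. Ask the black box for the probability of class 1.
    y = predict_proba(samples)[:, 1]

    # 3. Weight each sample by its proximity to x.
    dist = np.linalg.norm(z, axis=1)
    w = np.exp(-(dist ** 2) / kernel_width ** 2)

    # 4. Weighted least squares on [1, z]; the slopes are the local effects.
    A = np.hstack([np.ones((num_samples, 1)), z])
    sw = np.sqrt(w)
    coef, *_ = np.linalg.lstsq(A * sw[:, None], y * sw, rcond=None)
    return coef[1:]


weights = local_surrogate(toy_predict_proba, np.zeros(3), np.ones(3))
print(weights)
```

The sign and size of each entry in `weights` play the role of the bars in the LIME plot: positive values push the prediction toward class 1, negative values push it away.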


In [None]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# LIME
import lime.lime_tabular
from lime.explanation import Explanation

import warnings
warnings.filterwarnings('ignore')

## Preprocessing

In [None]:
# Load Titanic DataSet
df = pd.read_csv('../input/train.csv')
df.isnull().sum()  # check missing data.

In [None]:
# Remove Useless Attributes
df.drop(['PassengerId', 'Name', 'Pclass', 'SibSp',
         'Parch', 'Ticket', 'Cabin'], axis=1, inplace=True)

# Handle Missing Data
age_train_mean = df.groupby('Sex').Age.mean()
df.loc[df['Age'].isnull() & (df['Sex'] == 'male'),
       'Age'] = age_train_mean['male']
df.loc[df['Age'].isnull() & (df['Sex'] == 'female'),
       'Age'] = age_train_mean['female']

df.dropna(subset=['Embarked'], axis=0, inplace=True)

print(df.isnull().sum())

### Visualizing pre-encoded dataframe

In [None]:
df.head()

In [None]:
sns.catplot(data=df, kind='violin', hue='Survived',
            x='Embarked', y='Age', col='Sex')

In [None]:
X_train = df.drop(['Survived'],  axis=1, inplace=False)
y_train = df.Survived

## Encoding

- Note
 - LimeTabularExplainer expects categorical features as label-encoded integers, which makes it awkward to use get_dummies or LabelBinarizer directly.
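
As a small illustration of what label encoding produces, the sketch below encodes a toy column and builds a mapping in the same shape as the `categorical_names` argument. The values and the column index `1` are assumptions made up for the example, not taken from the dataset.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

embarked = np.array(['S', 'C', 'S', 'Q', 'C'])  # toy values, not the real data

le = LabelEncoder()
codes = le.fit_transform(embarked)

# classes_ is sorted, so the code of a value is its index in classes_.
print(le.classes_)
print(codes)

# Mapping in the shape LimeTabularExplainer takes: column index -> class labels.
# The column index 1 is an arbitrary assumption for this toy example.
categorical_names = {1: list(le.classes_)}
decoded = [categorical_names[1][c] for c in codes]
print(decoded)
```

Because the integer codes index straight into the class labels, the explainer can show readable category names while the data itself stays numeric.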

In [None]:
X_train_lbenc = X_train.copy()
cats = ['Sex', 'Embarked']  # categorical attributes that still need label encoding.

cat_dic = {}   # column index -> class names; also passed to LimeTabularExplainer.
cat_list = []  # categorical column indices; also used for one-hot encoding and LimeTabularExplainer.

le = LabelEncoder()
for s in cats:
    i = X_train_lbenc.columns.get_loc(s)
    # Assign the column directly so it is stored as integers, not objects.
    X_train_lbenc[s] = le.fit_transform(X_train_lbenc[s])
    cat_dic[i] = le.classes_
    cat_list.append(i)

X_train_lbenc.head()

In [None]:
print(cat_list, '\n',  cat_dic) # check

In [None]:
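# OneHotEncoder's categorical_features argument was removed in scikit-learn 0.22,
# so ColumnTransformer selects the categorical columns instead.
from sklearn.compose import ColumnTransformer

# Non-categorical features are always stacked to the right of the matrix.
oe = ColumnTransformer([('onehot', OneHotEncoder(), cat_list)],
                       remainder='passthrough', sparse_threshold=0)
oe_fit = oe.fit(X_train_lbenc)
X_train_ohenc = oe_fit.transform(X_train_lbenc)
X_train_ohenc[:5, :]  # show 5 samples.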

### Create Model

In [None]:
parameters = {
    'C': np.logspace(-5, 5, 10),
    'random_state': [0]
}

gs = GridSearchCV(
    LogisticRegression(),
    parameters,
    cv=5
)
gs.fit(X_train_ohenc, y_train)

print(gs.best_score_)
print(gs.best_params_)

model = LogisticRegression(**gs.best_params_)
model.fit(X_train_ohenc, y_train)

### Visualizing explanations for each data

In [None]:
explainer = lime.lime_tabular.LimeTabularExplainer(X_train_lbenc.values,  # Label Encoded Numpy Format
                                                   feature_names = X_train_lbenc.columns,
                                                   class_names = [
                                                       'dead', 'survive' ], # 0,1,...
                                                   categorical_features = cat_list,
                                                   categorical_names = cat_dic,
                                                   mode = 'classification'
                                                   )

In [None]:
def pred_fn(x):
    return model.predict_proba(oe_fit.transform(x)).astype(float)
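
LIME calls the prediction function with a 2-D array of perturbed, label-encoded rows and expects class probabilities of shape (n_samples, n_classes) in return. The sketch below checks that contract on synthetic data. All names (`X_toy`, `encoder`, `clf`, `toy_pred_fn`) are hypothetical, and the use of ColumnTransformer for the one-hot step is an assumption of this example.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder

rng = np.random.default_rng(0)

# Toy label-encoded data: column 0 is categorical (3 levels), column 1 is numeric.
X_toy = np.column_stack([rng.integers(0, 3, size=200), rng.normal(size=200)])
y_toy = (X_toy[:, 1] + (X_toy[:, 0] == 2) > 0.5).astype(int)

# One-hot encode column 0 and pass the numeric column through unchanged.
encoder = ColumnTransformer([('onehot', OneHotEncoder(), [0])],
                            remainder='passthrough', sparse_threshold=0)
clf = LogisticRegression().fit(encoder.fit_transform(X_toy), y_toy)


def toy_pred_fn(x):
    # x arrives label encoded; encode it the same way as the training data.
    return clf.predict_proba(encoder.transform(x))


probs = toy_pred_fn(X_toy[:5])
print(probs.shape)        # one row per sample, one column per class
print(probs.sum(axis=1))  # each row is a probability distribution
```

Keeping the encoding step inside the prediction function lets the explainer work on the readable label-encoded columns while the model still receives the one-hot matrix it was trained on.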

In [None]:
exp = explainer.explain_instance(X_train_lbenc.values[2, :],
                                 pred_fn,
                                 num_features=len(X_train_lbenc.columns)
                                 )
exp.show_in_notebook(show_all=False)