# LIME with XGBoost

In this notebook, we will again use the Titanic dataset, but this time we will use the LIME package to explain the predictions of an XGBoost model. 

In [None]:
# Install the necessary libraries

# !pip install -q dalex xgboost lime

In [None]:
import dalex as dx
import xgboost
import lime.lime_tabular  # load the submodule explicitly; it is used directly further down

import pandas as pd
import numpy as np

import warnings
warnings.filterwarnings("ignore")

### Load and Preprocess Data

Annoyingly, both LIME and XGBoost are particular about how categorical features are passed to them. We will fall back on plain one-hot encoding, which is less elegant than native categorical support but works reliably with both libraries.

In [None]:
df = dx.datasets.load_titanic()

X = df.drop(columns='survived')
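# dtype=float keeps the dummy columns numeric (recent pandas versions default to bool)
X = pd.get_dummies(X, columns=['gender', 'class', 'embarked'], drop_first=True, dtype=float)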
y = df.survived

In [None]:
X.head()
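To make the effect of `drop_first=True` concrete, here is a minimal sketch on a tiny, made-up frame (the `toy` data is hypothetical and not part of the Titanic set). For each encoded column, the first level in sorted order is dropped and becomes the implicit baseline.

```python
import pandas as pd

# Hypothetical toy data, only for illustration.
toy = pd.DataFrame({
    "age": [22.0, 38.0, 26.0],
    "gender": ["male", "female", "female"],
    "class": ["3rd", "1st", "3rd"],
})

# drop_first=True drops the first sorted level of each encoded column,
# so "female" and "1st" become the baselines and get no column of their own.
encoded = pd.get_dummies(toy, columns=["gender", "class"], drop_first=True, dtype=float)

print(encoded.columns.tolist())
print(encoded)
```

A row with zeros in every dummy column for a feature therefore belongs to that feature's baseline level.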

In [None]:
# Split the data
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

### Train the Model

As before, we will train an XGBoost model on the training data.

In [None]:
model = xgboost.XGBClassifier(
    n_estimators=200,
    max_depth=4,
    eval_metric="logloss"
)
model.fit(X_train, y_train)

### Explain the Model with LIME & dalex

dalex wraps the original `lime` package and exposes it through its own unified API.

The aim is a simpler interface than the one `lime` offers directly. We will create an `explainer` object just like before, but this time we will use its `predict_surrogate` method to explain the model's predictions.


In [None]:
explainer = dx.Explainer(model, X_train, y_train, label='XGBoost')

Our model outputs a continuous value between zero and one, but what we actually want is a hard 0 or 1 prediction. As you might imagine, the simplest way to do this is to use a cutoff of 0.5, but we can also use the mean of the target variable (the share of survivors) as the cutoff. This can help to compensate for class imbalance, which we certainly have here.

In [None]:
explainer.model_performance(cutoff=y.mean())
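The effect of the cutoff can be sketched with plain NumPy. The probabilities and labels below are made up, the helper `hard_predictions` is hypothetical, and the `>=` comparison is an assumption of this sketch rather than a statement about how dalex applies the cutoff internally.

```python
import numpy as np

# Hypothetical predicted probabilities and true labels (imbalanced: 2 of 6 positive).
proba = np.array([0.10, 0.20, 0.30, 0.40, 0.45, 0.80])
y_true = np.array([0, 0, 0, 1, 0, 1])

def hard_predictions(p, cutoff):
    """Label a row 1 when its probability reaches the cutoff (assumed rule)."""
    return (p >= cutoff).astype(int)

def recall(y, y_hat):
    """Share of true positives that were predicted as positive."""
    return (y_hat[y == 1] == 1).mean()

base_rate = y_true.mean()  # share of positives, used as the alternative cutoff

for cutoff in (0.5, base_rate):
    y_hat = hard_predictions(proba, cutoff)
    print(f"cutoff={cutoff:.2f} predictions={y_hat.tolist()} recall={recall(y_true, y_hat):.2f}")
```

In this toy case the lower cutoff recovers both positives at the cost of one false positive, which is the usual trade-off when the positive class is the minority.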

In [None]:
observation = X.iloc[[0]]
explainer.predict(observation)

Just like in the first mini-lab, we need to specify how to get predictions from the model. This time, we will use the `predict_proba` method, which returns the probability of each class. We will also cast the output to float, as LIME expects this.

In [None]:
predict_fn = lambda x: model.predict_proba(x).astype(float)
explanation = explainer.predict_surrogate(observation, predict_fn=predict_fn)

In [None]:
explanation.result

In [None]:
explanation.plot()

Be careful! The LIME algorithm, like many other explanation methods, involves randomness: it explains a prediction by sampling perturbed points around the observation, so a different random seed will result in a different explanation. Take a look below and see if the differences seem significant to you.

In [None]:
import random
import matplotlib.pyplot as plt

for seed in range(4):
    random.seed(seed)
    np.random.seed(seed)
    exp = explainer.predict_surrogate(observation, predict_fn=predict_fn)
    exp.plot(return_figure=True)
    plt.title(f'Seed of {seed}')
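To see where the seed dependence comes from, here is a heavily simplified sketch of the idea behind LIME, written with NumPy only. It is not the `lime` implementation (which also scales by the training data, discretizes features by default, and fits a regularized model); `black_box` and `lime_sketch` are hypothetical names, and the noise scale and kernel width are arbitrary choices for illustration.

```python
import numpy as np

def black_box(X):
    """Hypothetical nonlinear model returning P(class 1) for each row."""
    return 1.0 / (1.0 + np.exp(-(2.0 * X[:, 0] - X[:, 1] ** 2)))

def lime_sketch(x, predict, seed, n_samples=500, kernel_width=0.75):
    """Fit a locally weighted linear surrogate around x and return its slopes."""
    rng = np.random.default_rng(seed)
    # 1. Perturb the instance with Gaussian noise (this is the random part).
    Z = x + rng.normal(size=(n_samples, x.size))
    # 2. Weight each perturbed sample by its proximity to x.
    dist = np.linalg.norm(Z - x, axis=1)
    w = np.exp(-(dist ** 2) / kernel_width ** 2)
    # 3. Weighted least squares: intercept plus one slope per feature.
    A = np.column_stack([np.ones(n_samples), Z - x])
    sw = np.sqrt(w)
    coef, *_ = np.linalg.lstsq(A * sw[:, None], predict(Z) * sw, rcond=None)
    return coef[1:]

x0 = np.array([0.5, 1.0])
for seed in range(3):
    print(seed, np.round(lime_sketch(x0, black_box, seed), 3))
```

Each seed draws a different set of perturbed points, so the fitted slopes shift slightly from run to run while their signs stay the same. Raising `n_samples` should shrink that spread.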

### Explain with LIME package

We can also directly use the LIME package, which produces slightly different visualizations. Note that the underlying behaviour is the same, but the API is different.

In [None]:
lime_explainer = lime.lime_tabular.LimeTabularExplainer(
    training_data=X_train.values,
    feature_names=X_train.columns.tolist(),  # pass a plain list of names
    mode='classification',
)

In [None]:
lime_explanation = lime_explainer.explain_instance(
    # LIME works on a 1-D NumPy array, not a pandas Series
    data_row=observation.iloc[0].to_numpy(dtype=float),
    predict_fn=lambda d: model.predict_proba(d)
)

In [None]:
lime_explanation.as_list()

In [None]:
lime_explanation.as_pyplot_figure()

In [None]:
lime_explanation.show_in_notebook()