# Comparison with Partial Dependence Plot

In the "Interpretable Machine Learning" book, we [can read](https://christophm.github.io/interpretable-ml-book/pdp.html):

> The partial dependence plot (short PDP or PD plot) shows the marginal effect one or two features have on the predicted outcome of a machine learning model (Friedman, Jerome H. “Greedy function approximation: A gradient boosting machine.” Annals of statistics (2001): 1189-1232.). A partial dependence plot can show whether the relationship between the target and a feature is linear, monotonic or more complex. For example, when applied to a linear regression model, partial dependence plots always show a linear relationship.

Later in the book, the same idea is put differently:

> **Partial Dependence Plots:** “Let me show you what the model predicts on average when each data instance has the value v for that feature. I ignore whether the value v makes sense for all data instances.”
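
To make the second sentence of that quote concrete, here is a minimal sketch on hypothetical toy data (the samples, the `years_of_experience` column and the "work starts at 16" rule are assumptions for illustration, not part of the Adult dataset). Forcing a single age onto every sample produces rows that could never be observed, yet a PDP would still ask the model to score them.

```python
# Hypothetical toy samples: (age, years_of_experience).
samples = [(22, 1), (35, 12), (58, 35)]

v = 20  # value forced onto the "age" feature of every sample

# Overwrite age with v while leaving the other feature untouched.
forced = [(v, experience) for _, experience in samples]

# Assumed plausibility rule: nobody starts working before the age of 16.
implausible = [(age, exp) for age, exp in forced if exp > age - 16]

print(forced)       # every row now has age 20
print(implausible)  # rows that cannot exist under the assumed rule
```

Two of the three forced rows describe a 20-year-old with 12 or 35 years of experience; they would nonetheless contribute to the averaged prediction.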

Computing a PDP is straightforward:

1. Select a feature (e.g. "age")
2. Define a grid on the feature's domain (e.g. 20, 21, 22, ..., 59, 60)
3. For each value `v` of the grid:
    1. Replace the feature with `v` for all data samples
    2. Compute the predictions
    3. Take the average
4. Draw the curve `average_prediction = f(v)`
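
The steps above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: `partial_dependence_curve`, `toy_predict` and the two toy rows are hypothetical names and data, and the toy model is a simple linear function rather than the classifier trained later in this notebook.

```python
from statistics import mean


def partial_dependence_curve(predict, rows, feature, grid):
    """Return a list of (v, average prediction) pairs, one per grid value."""
    curve = []
    for v in grid:
        # Step 3.1: replace the feature with v in a copy of every sample.
        modified = [{**row, feature: v} for row in rows]
        # Steps 3.2 and 3.3: compute the predictions, then average them.
        curve.append((v, mean(predict(row) for row in modified)))
    return curve


# Hypothetical linear toy model, for illustration only.
def toy_predict(row):
    return 0.01 * row["age"] + 0.1 * row["hours"] / 40


rows = [
    {"age": 25, "hours": 40},
    {"age": 50, "hours": 20},
]

# Steps 1 and 2: pick the feature and a (very coarse) grid.
curve = partial_dependence_curve(toy_predict, rows, "age", [20, 40, 60])
for v, average_prediction in curve:
    print(v, average_prediction)
```

Because the toy model is linear in `age`, the averaged prediction grows by the same amount between consecutive grid values, which matches the remark quoted above that PDPs of a linear model always show a linear relationship.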

PDPs are used in Google's [What-If Tool](https://pair-code.github.io/what-if-tool/walkthrough.html). In this notebook, we compare this method with ours, Entropic Variable Boosting (EVB), on the "Adult" dataset (see the dedicated notebook for additional information).

In [41]:
import ethik
import lightgbm as lgb
import pandas as pd
import plotly.graph_objs as go
from sklearn import model_selection
import sklearn.inspection

url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data'
names = [
    'age', 'workclass', 'fnlwgt', 'education',
    'education-num', 'marital-status', 'occupation',
    'relationship', 'race', 'gender', 'capital-gain',
    'capital-loss', 'hours-per-week', 'native-country',
    'salary'
]
dtypes = {
    'workclass': 'category',
    'education': 'category',
    'marital-status': 'category',
    'occupation': 'category',
    'relationship': 'category',
    'race': 'category',
    'gender': 'category',
    'native-country': 'category'
}

X = pd.read_csv(url, names=names, header=None, dtype=dtypes)
y = X.pop('salary').map({' <=50K': False, ' >50K': True})

# scikit-learn's partial dependence tools don't handle string categories,
# so replace each categorical column with its integer codes
cat_columns = X.select_dtypes(['category']).columns
X[cat_columns] = X[cat_columns].apply(lambda x: x.cat.codes)

X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, shuffle=True, random_state=42)

model = lgb.LGBMClassifier(random_state=42).fit(X_train, y_train)
y_pred = pd.Series(model.predict_proba(X_test)[:, 1], name='>$50k')

explainer = ethik.Explainer()

Let's define helpers to compare PDP and EVB:

In [50]:
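def create_fig(feature):
    """Create an empty figure whose x-axis is labelled with the feature name."""
    fig = go.Figure()
    fig.update_layout(
        margin=dict(t=50, r=50),
        xaxis=dict(title=feature, zeroline=False),
        yaxis=dict(title="Average prediction", range=[0, 1], showline=True, tickformat="%"),
        plot_bgcolor="white",
    )
    return fig

def plot_partial_dependence(feature, fig=None):
    """Add the PDP of `feature` to `fig`, creating a new figure if none is given."""
    # The tuple unpacking below assumes a scikit-learn release in which
    # partial_dependence() returns (averaged_predictions, values).
    averaged_predictions, values = sklearn.inspection.partial_dependence(
        estimator=model,
        X=X_test,
        features=[X_test.columns.get_loc(feature)],
        grid_resolution=41,
    )
    x = values[0]  # grid of the single requested feature
    y = averaged_predictions[0]  # averaged predictions along that grid

    if fig is None:
        fig = create_fig(feature)
    fig.add_trace(go.Scatter(
        x=x,
        y=y,
        name="PDP"
    ))
    return fig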

def plot_evb(feature):
    explanation = explainer.explain_bias(
        X_test=X_test[feature],
        y_pred=y_pred
    )
    return explainer.make_bias_fig(explanation)[feature]

def plot_all(feature):
    fig = plot_evb(feature)
    return plot_partial_dependence(feature, fig=fig)

In [51]:
plot_all("age")