## DT8060
## Raffaello Baluyot

# Lab 3 - Feature Importance and Global Surrogates
In this lab we will use Feature Permutation to determine the feature importance of built models, both classification and regression models.

First, you are asked to implement feature permutation by yourself on a neural network classifier trained on a breast cancer dataset, then to compare your permutation scores to the prebuilt methods that exist.
Then, the same is to be performed on a neural network regressor.

Finally, you are to build a surrogate model to try and explain/interpret how a support vector machine (SVM) model makes its decisions by training a decision tree based on the predictions from the SVM.

## Package import

In [None]:
import pandas as pd
import numpy as np

import graphviz
from sklearn import metrics, datasets, tree
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.impute import SimpleImputer
from sklearn.datasets import load_breast_cancer, load_diabetes, load_iris
from sklearn.neural_network import MLPClassifier
from sklearn.neural_network import MLPRegressor

import pickle

## Classifier
Here, we provide a black box model using a Multi Layer Perceptron classifier that has been trained on the breast cancer dataset. Your task is to identify which features are most important for the model using feature permutation.

In [None]:
data = load_breast_cancer()
X = pd.DataFrame(data['data'])
y = data['target']
X.columns = data['feature_names']
X=(X-X.min())/(X.max()-X.min())

#X.head()
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, test_size=.33)

Load the black box model.

In [None]:
with open('../data/model_bc.pkl', 'rb') as f:
    clf = pickle.load(f)

If loading the black box model does not work, run the cell below to train the model yourself. It might take a while...

In [None]:
# # ONLY RUN IF YOU COULD NOT UPLOAD THE BLACK BOX MODELS

# parameters = {'solver': ['lbfgs'], 'max_iter': [2000], 'alpha': 10.0 ** -np.arange(1, 10), 'hidden_layer_sizes':np.arange(1,20), 'random_state':[42]}
# clf = GridSearchCV(MLPClassifier(), parameters, n_jobs=-1)
# clf.fit(X_train, y_train)
# clf = clf.best_estimator_
# with open('model_bc.pkl', 'wb') as out:
#     pickle.dump(clf, out)

### Todo
Implement feature permutation for the **clf** model and use the model's built-in scorer. Remember to permute each feature multiple times, since the results vary with how the data is permuted. Present the results as the mean and standard deviation of the importance for each feature.

In [None]:
# Implement feature permutation

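def permutation_importance_(model, X, y, k, scoring=None, seed=None):
    """Return the per-feature mean and std of the score drop over k permutations."""
    # Default to the model's built-in scorer
    # (accuracy for classifiers, R^2 for regressors).
    scoring = model.score if scoring is None else scoring
    rng = np.random.default_rng(seed)

    # Reference score on the unpermuted data.
    base_score = scoring(X, y)

    feature_importances = []

    for feature_name in X.columns:
        feature_scores = []
        for _ in range(k):
            # Shuffle a single column: this breaks its link to the target
            # while keeping its marginal distribution.
            perm_data = X.copy()
            perm_data[feature_name] = rng.permutation(
                perm_data[feature_name].to_numpy()
            )

            feature_scores.append(scoring(perm_data, y))
        feature_importances.append(feature_scores)

    # Shape (k, n_features): importance is the drop in score after permutation.
    feature_importances = base_score - np.asarray(feature_importances).T

    return feature_importances.mean(0), feature_importances.std(0)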

res = permutation_importance_(clf, X_test, y_test, k=100, seed=0)

for i in res[0].argsort()[::-1]:
    print(f'{X.columns[i]:<19} '
          f'{res[0][i]:.3f}'
          f' +/- {res[1][i]:.3f}')


Of course, functionality for feature permutation has already been implemented, for instance in **sklearn**.

See if your own feature permutation results align with the ones produced in the cell below.

In [None]:
from sklearn.inspection import permutation_importance

y_pred = clf.predict(X_test)
r = permutation_importance(clf, X_test, y_test,
                           n_repeats=100,
                           random_state=0)

for i in r.importances_mean.argsort()[::-1]:
    print(f'{X.columns[i]:<19} '
          f'{r.importances_mean[i]:.3f}'
          f' +/- {r.importances_std[i]:.3f}')


### Reflection
- What do the results above mean?
- What conclusions/insights can we draw about the model based on them?

The results show that permuting a high-importance feature, which breaks its relationship with the target, degrades the model's performance more than permuting a low-importance feature. In other words, the model relies more heavily on the features with higher importance.

Based on the results, `worst texture`, `worst radius`, `mean concave points`, and `worst concave points` are the most important features for the model. These features are also non-redundant, since the model loses a lot of performance when they are shuffled.

## Regression
The same task as before lies ahead of you, but in this instance we instead look towards the Multi Layer Perceptron for regression. This model has been trained on a diabetes dataset where the target variable is the disease progression.

In [None]:
data = load_diabetes()
X = pd.DataFrame(data['data'])
y = data['target']
X.columns = data['feature_names']

X.head()
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, test_size=.33)

Load the black box model.

In [None]:
with open('../data/model_diab.pkl', 'rb') as f:
    clf = pickle.load(f)

If loading the black box model does not work, run the cell below to train the model yourself. It might take a while...

In [None]:
# # ONLY RUN IF YOU COULD NOT UPLOAD THE BLACK BOX MODELS
# parameters = {'solver': ['lbfgs'], 'max_iter': [2000], 'alpha': 10.0 ** -np.arange(1, 10), 'hidden_layer_sizes':np.arange(1,20), 'random_state':[42]}
# clf = GridSearchCV(MLPRegressor(), parameters, n_jobs=-1)
# clf.fit(X_train, y_train)
# clf = clf.best_estimator_
# # Stores the produced model for the students' later use
# with open('model_diab.pkl', 'wb') as out:
#     pickle.dump(clf, out)


### Todo
Implement feature permutation for the **clf** model and use the model's built-in scorer. Remember to permute each feature multiple times, since the results will vary. Present the results as the mean and standard deviation of the importance for each feature.

In [None]:
#Implement feature permutation
res = permutation_importance_(clf, X_test, y_test, k=100, seed=0)

for i in res[0].argsort()[::-1]:
    print(f'{X.columns[i]:<19} '
          f'{res[0][i]:.3f}'
          f' +/- {res[1][i]:.3f}')

In [None]:
r = permutation_importance(clf, X_test, y_test,
                           n_repeats=100,
                           random_state=0)

for i in r.importances_mean.argsort()[::-1]:
    print(f"{X.columns[i]:<19} "
          f"{r.importances_mean[i]:.3f}"
          f" +/- {r.importances_std[i]:.3f}")


### Reflection
- What do the results above mean?
- What conclusions can we draw about the model based on them?

The results show that permuting a high-importance feature, which breaks its relationship with the target, degrades the model's performance more than permuting a low-importance feature. In other words, the model relies more heavily on the features with higher importance.

Based on the results, `s5`, `s1`, `bmi`, and `s2` are the most important features for the model. These features are also non-redundant, since the model loses a lot of performance when they are shuffled.


# Global Surrogate
In this part of the lab you will train Support Vector Machines (SVMs) on two given datasets. The SVMs are then subjected to the global surrogate approach, where you train CART models on the input data and the output from the SVMs.

## Classification

Here, you create an SVM. The SVM is then used to predict labels for the data, and those predictions are fed into the CART algorithm as training targets.
The CART model thus mimics the SVM and is cheaper to use.

In [None]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC

data = load_iris()
X = pd.DataFrame(data['data'])
X.columns = data['feature_names']
y = data['target']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, test_size=.33)


In [None]:
clf = SVC()
clf.fit(X_train, y_train)

In [None]:
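from sklearn.metrics import r2_score
# Train the surrogate on the SVM's predictions rather than on the true labels
surrogate = DecisionTreeClassifier()
surrogate.fit(X_train, clf.predict(X_train))
y_pred = surrogate.predict(X_test)

# Score the surrogate's test predictions against the true test labels
r2_score(y_test, y_pred)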

Visualize the tree and try and see how the CART approximates the SVM.

In [None]:
def display_tree(model, feature_names):
    g = tree.export_graphviz(
        model, 
        feature_names=feature_names,
        filled=True
    )

    print(dict(zip(
        feature_names,
        model.feature_importances_
    )))
    return graphviz.Source(g)

display_tree(surrogate, X_train.columns)

* How is the SVM approximated?
* What features give the most importance to the model?
* How does it operate?

The SVM was approximated well, judging by the surrogate's performance. The surrogate is also a fairly simple tree, since there are only three labels and the classes look close to linearly separable.
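
One way to put a number on "approximated well" is fidelity: how often the surrogate agrees with the SVM itself on unseen data, as opposed to how often it matches the true labels. The sketch below is a self-contained illustration, not the lab's own cell. The names `svm`, `fidelity`, and `accuracy`, and the fixed `random_state=0` on the tree, are assumptions made here for reproducibility.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=42, test_size=.33
)

svm = SVC().fit(X_train, y_train)

# The surrogate learns the SVM's predictions, not the true labels.
surrogate = DecisionTreeClassifier(random_state=0)
surrogate.fit(X_train, svm.predict(X_train))

svm_test_pred = svm.predict(X_test)
surrogate_test_pred = surrogate.predict(X_test)

# Fidelity: share of test samples where surrogate and SVM agree.
fidelity = accuracy_score(svm_test_pred, surrogate_test_pred)
# Accuracy: share of test samples where the surrogate matches the true label.
accuracy = accuracy_score(y_test, surrogate_test_pred)

print(f"fidelity to SVM: {fidelity:.3f}")
print(f"accuracy on true labels: {accuracy:.3f}")
```

A high fidelity says the tree is a trustworthy description of the SVM; a high accuracy only says the tree is a good classifier in its own right.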

`petal length` is the most important feature: the tree uses it for the splits that reduce impurity the most (Gini impurity by default for `DecisionTreeClassifier`; entropy is an alternative criterion).
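
To see how the surrogate operates without rendering a graph, the split rules can also be printed as plain text with `sklearn.tree.export_text`. This is a minimal sketch under assumptions that differ from the cells above: it fits on the full iris data and caps the tree at `max_depth=3` to keep the output short, so the exact thresholds may not match the graphviz figure.

```python
from sklearn.datasets import load_iris
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True, as_frame=True)

svm = SVC().fit(X, y)

# Shallow surrogate trained on the SVM's predictions (depth cap is an assumption).
surrogate = DecisionTreeClassifier(max_depth=3, random_state=0)
surrogate.fit(X, svm.predict(X))

# Each indented line is one split; leaves show the predicted class.
rules = export_text(surrogate, feature_names=list(X.columns))
print(rules)

# Impurity-based importances, normalized by sklearn to sum to 1.
importances = dict(zip(X.columns, surrogate.feature_importances_))
for name, value in sorted(importances.items(), key=lambda kv: -kv[1]):
    print(f"{name:<19} {value:.3f}")
```

Reading the rules from the root downwards gives the decision path the surrogate attributes to the SVM for any given flower.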

## Regression

In [None]:
from sklearn.tree import DecisionTreeRegressor
from sklearn.svm import SVR

data = load_diabetes()
X = pd.DataFrame(data['data'])
X.columns = data['feature_names']
y = data['target']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, test_size=.33)

In [None]:
clf = SVR()
clf.fit(X_train, y_train)

In [None]:
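# Train the surrogate on the SVR's predictions rather than on the true targets
surrogate = DecisionTreeRegressor()
surrogate.fit(X_train, clf.predict(X_train))
y_pred = surrogate.predict(X_test)

# Score the surrogate's test predictions against the true test targets
r2_score(y_test, y_pred)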

Visualize the tree and try and see how the CART approximates the SVR.

In [None]:
#Visualize the tree
display_tree(surrogate, X_train.columns)

* How is the SVR approximated?
* What features give the most importance to the model?
* How does it operate?

The SVR was approximated well, judging by the surrogate's performance. However, the resulting tree is not easy to interpret, which is typical of regression trees: the tree produces piecewise-constant steps even where the underlying relationship is linear.
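
For regression, the same distinction between fidelity and predictive quality can be checked with R². The sketch below is a self-contained illustration with assumed names (`svr`, `fidelity`, `svr_r2`, `surrogate_r2`) and a fixed `random_state=0` on the tree. It compares the surrogate against the SVR's predictions, and separately compares both models against the true targets.

```python
from sklearn.datasets import load_diabetes
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor

X, y = load_diabetes(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=42, test_size=.33
)

svr = SVR().fit(X_train, y_train)

# The surrogate regresses on the SVR's outputs, not on the true targets.
surrogate = DecisionTreeRegressor(random_state=0)
surrogate.fit(X_train, svr.predict(X_train))

svr_test_pred = svr.predict(X_test)
surrogate_test_pred = surrogate.predict(X_test)

# Fidelity: variance of the SVR's predictions explained by the surrogate.
fidelity = r2_score(svr_test_pred, surrogate_test_pred)
# Predictive quality of each model against the true targets.
svr_r2 = r2_score(y_test, svr_test_pred)
surrogate_r2 = r2_score(y_test, surrogate_test_pred)

print(f"fidelity (surrogate vs SVR): {fidelity:.3f}")
print(f"SVR R^2 on true targets:     {svr_r2:.3f}")
print(f"surrogate R^2 on targets:    {surrogate_r2:.3f}")
```

If the fidelity is high but both R² values against the true targets are low, the tree explains the SVR faithfully, and what it explains is a weak model.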

`s5`, `bmi`, and `bp` are the most important features: the tree uses them for the splits that reduce the squared error (the variance of the target within each node) the most.

# Final reflections

Reflect on your lab work. This includes, but is not limited to, the example points below.

* How does the feature permutation give us insight to explain the model?
* What conclusions can you draw from it?
* The Global surrogate model takes another approach to explain the model. Is it useful?
* How can you use the surrogate to explain your model?


Feature permutation provides insight into the model by shuffling one feature at a time, which removes that feature's relationship with the target. The model's performance on the shuffled data is then compared with its original performance, so the drop measures how much the model depends on that feature.
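
In symbols (the notation is introduced here for illustration and does not come from the lab text): let $s$ be the score on the unpermuted data and $s_{k,j}$ the score after the $k$-th of $K$ permutations of feature $j$. The mean and standard deviation reported by `permutation_importance_` above then correspond to:

$$
I_j = \frac{1}{K}\sum_{k=1}^{K}\left(s - s_{k,j}\right),
\qquad
\sigma_j = \sqrt{\frac{1}{K}\sum_{k=1}^{K}\left(\left(s - s_{k,j}\right) - I_j\right)^2}
$$

A value of $I_j$ near zero means that shuffling feature $j$ barely changes the score, and a negative value means the model happened to score better on the shuffled data.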

Feature permutation highlights the features that are non-redundant and that affect the model's predictions. Unfortunately, because it permutes one feature at a time, highly correlated features may appear unimportant: the model can still recover the information from the correlated partner.
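
The effect of correlated features can be reproduced on synthetic data. The sketch below is an illustration under assumptions that are not part of the lab: a `Ridge` model, a target that depends only on `x1`, and an exact duplicate column `x1_copy`. It measures the permutation importance of `x1` with and without the duplicate present.

```python
import numpy as np
import pandas as pd
from sklearn.inspection import permutation_importance
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
n = 1000
x1 = rng.normal(size=n)
data = pd.DataFrame({
    "x1": x1,
    "x1_copy": x1,                # perfectly correlated duplicate of x1
    "noise": rng.normal(size=n),  # unrelated to the target
})
y = 3 * x1 + rng.normal(scale=0.1, size=n)


def x1_importance(columns):
    """Fit on the first half of the rows, permute on the second half."""
    half = n // 2
    X_fit, X_eval = data[columns].iloc[:half], data[columns].iloc[half:]
    model = Ridge(alpha=1.0).fit(X_fit, y[:half])
    result = permutation_importance(
        model, X_eval, y[half:], n_repeats=20, random_state=0
    )
    return result.importances_mean[columns.index("x1")]


alone = x1_importance(["x1", "noise"])
with_copy = x1_importance(["x1", "x1_copy", "noise"])

print(f"importance of x1 alone:          {alone:.3f}")
print(f"importance of x1 with duplicate: {with_copy:.3f}")
```

With the duplicate present, the model spreads its weight across both columns, so shuffling `x1` alone destroys only part of the signal and its measured importance drops even though the underlying relationship is unchanged.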

Global surrogate models approximate the behavior of the target model with another, interpretable model. This is useful up to a certain level of complexity. Once the target model becomes too complex, the surrogate must also become complex to stay faithful, which makes it hard to interpret.
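
The trade-off between surrogate size and faithfulness can be explored by sweeping the tree depth. This is a hedged sketch, not part of the lab template: the depth grid, the names `svr` and `results`, and the use of R² against the SVR's predictions as the fidelity measure are choices made here for illustration.

```python
from sklearn.datasets import load_diabetes
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor

X, y = load_diabetes(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=42, test_size=.33
)

svr = SVR().fit(X_train, y_train)
svr_train_pred = svr.predict(X_train)
svr_test_pred = svr.predict(X_test)

results = []
for depth in (1, 2, 3, 5, None):  # None lets the tree grow without a depth cap
    surrogate = DecisionTreeRegressor(max_depth=depth, random_state=0)
    surrogate.fit(X_train, svr_train_pred)
    results.append({
        "max_depth": depth,
        "n_leaves": surrogate.get_n_leaves(),
        "train_fidelity": r2_score(svr_train_pred, surrogate.predict(X_train)),
        "test_fidelity": r2_score(svr_test_pred, surrogate.predict(X_test)),
    })

for row in results:
    print(row)
```

A small tree with few leaves is easy to read but may track the SVR poorly, while an uncapped tree reproduces the SVR's training predictions almost exactly at the cost of far too many leaves to inspect by hand.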

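How a global surrogate is used depends on which surrogate model was picked, since the surrogate serves as a proxy for understanding the target model's behavior. With a decision tree, for example, the explanation comes from reading the split rules and the feature importances.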