# Training school "Data and Models" INRAE 2024

## Exercise 3: Introduction to ML - Explanations/Interpretation
In this part, we are going to look at a few options for obtaining "explanations" of the behavior of ML models, which are normally treated as "black boxes". Let's start (again) by installing the openml library and downloading a data set.

In [None]:
!pip install openml
import openml

# load another data set
dataset = openml.datasets.get_dataset(24)

# obtain the data set as a dataframe
df, *_ = dataset.get_data()

# preprocessing to obtain just matrices of values
target_feature = dataset.default_target_attribute
other_features = [c for c in df.columns if c != target_feature]
print("The dataset has %d samples and %d features" % (df.shape[0], len(other_features)))

# convert categorical features to numerical values
categorical_columns = df[other_features].select_dtypes(include=['category', 'object', 'string'])
print("Categorical columns found:", categorical_columns.columns)
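for c in categorical_columns.columns :
  # replace each category with its integer code (0, 1, 2, ...); assigning the result back
  # avoids relying on an in-place replace on a column; missing values get the code -1
  df[c] = df[c].astype('category').cat.codes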

class_labels = df[target_feature].unique()
dictionary = {class_label : i for i, class_label in enumerate(class_labels)} # creates a dictionary to go from label name to an integer
for class_label in class_labels :
  print("For class label \"%s\", there are %d samples" % (class_label, df[df[target_feature] == class_label].shape[0]))

# finally, obtain matrices of numerical values
import numpy as np
X = df[other_features].values
y = np.array([dictionary[y_i] for y_i in df[target_feature].values])

Now, this time we are mostly interested in observing the explanations, so we will skip the k-fold cross-validation and just use a single train/test split of the data. We are going to train the model on the training set, and test it on the test set. Let's use a few different ML algorithms.

In [None]:
# split the data into train and test; since this is a classification problem, we
# are going to use a stratified split, so that each split preserves the same proportion
# of class labels
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y, shuffle=True)

# set up a few different classifiers
# our good old friend, Random Forest
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(random_state=42)

# logistic regression is using a linear model for the decision boundaries
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()

# support vector machines is another famous ML technique, creating a higher-dimensional
# space by combining the data set features, and then fitting a linear model in
# the higher-dimensional space
from sklearn.svm import SVC
svc = SVC()

# create a dictionary of classifiers, so that we can iterate over them
classifiers = {"Logistic Regression" : lr, "Random Forest" : rf, "Support Vector Machines" : svc}

In [None]:
# now, logistic regression and support vector machines are sensitive to the scale of the features:
# without normalization they may converge slowly or perform poorly
# in order to properly perform the needed pre-processing, we add a normalization step,
# using another algorithm from scikit-learn, that rescales each feature to zero mean and unit variance
# (note that the rescaled values are NOT bounded to a fixed interval such as [-1, 1])
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()

# normalization is LEARNED ON THE TRAINING SET and APPLIED TO THE TEST SET
# this is important: fitting the normalization on the whole data set can lead to
# the ML algorithm having access to information it should not have (for example,
# the mean and variance used for rescaling each feature would then also depend
# on the values in the test set)
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

# iterate over the classifiers, train them and test them
from sklearn.metrics import f1_score
for classifier_name, classifier in classifiers.items() :
  classifier.fit(X_train, y_train)
  y_pred = classifier.predict(X_test)
  f1_value = f1_score(y_test, y_pred)

  print("For classifier \"%s\", F1=%.4f" % (classifier_name, f1_value))

## Global explanations
Now that we have three trained classifier models, we can check the relative importance of the features they used to reach their decisions. Since the data set has 22 features, which is quite a lot to visualize, we are only going to look at the top 10 (most important), visualized as a bar chart.

In [None]:
# let's first take a look at Random Forest; we create a list with the name of the feature, and a value describing its relative importance
feature_importance_rf = [ [other_features[i], rf.feature_importances_[i]] for i in range(0, len(other_features))]
# sort the list from most important to least important, cut it at the first 10
feature_importance_rf = sorted(feature_importance_rf, reverse=True, key=lambda x : x[1])[:10]

# create the bar chart
import matplotlib.pyplot as plt
fig, ax = plt.subplots()
x = list(range(len(feature_importance_rf)))
ax.bar(x, [y[1] for y in feature_importance_rf])
ax.set_xticks(x, [y[0] for y in feature_importance_rf], rotation=90)
ax.set_title("Relative feature importance according to Random Forest")

So, for Random Forest, the most important feature for telling a poisonous mushroom from an edible one is its smell. Let's see if Logistic Regression agrees.

In [None]:
# logistic regression just uses the absolute values of the coefficients as relative importance
feature_importance_lr = [ [other_features[i], abs(lr.coef_[0,i])] for i in range(0, len(other_features))]
# sort the list from most important to least important, cut it at the first 10
feature_importance_lr = sorted(feature_importance_lr, reverse=True, key=lambda x : x[1])[:10]

# create the bar chart
import matplotlib.pyplot as plt
fig, ax = plt.subplots()
x = list(range(len(feature_importance_lr)))
ax.bar(x, [y[1] for y in feature_importance_lr])
ax.set_xticks(x, [y[0] for y in feature_importance_lr], rotation=90)
ax.set_title("Relative feature importance according to Logistic Regression")

They actually seem to disagree! Notice that the values on the y-axis make sense only for *the same algorithm*, and cannot be easily compared between different algorithms: what matters the most is the relative ranking of the features.
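
One way to compare rankings rather than raw values is a rank correlation. The snippet below is a minimal sketch on made-up importance scores (the feature names and numbers are hypothetical, not taken from the models above); it assumes `scipy` is available and uses Spearman's correlation, which depends only on the ordering of the values.

```python
import numpy as np
from scipy.stats import spearmanr

# hypothetical importance scores for the same five features, from two different models
feature_names = ["f0", "f1", "f2", "f3", "f4"]
importance_a = np.array([0.50, 0.20, 0.15, 0.10, 0.05])  # e.g. impurity-based importances
importance_b = np.array([3.1, 0.4, 2.2, 0.9, 0.1])       # e.g. absolute coefficients

# position of each feature in each ranking (0 = most important)
rank_a = np.argsort(np.argsort(-importance_a))
rank_b = np.argsort(np.argsort(-importance_b))
for name, r_a, r_b in zip(feature_names, rank_a, rank_b):
    print("%s: rank %d for model A, rank %d for model B" % (name, r_a, r_b))

# Spearman's rho only looks at the ranks, so the different scales do not matter
rho, p_value = spearmanr(importance_a, importance_b)
print("Spearman rank correlation: %.2f" % rho)
```

A value close to 1 means the two models rank the features in almost the same order; a value close to 0 means the rankings are largely unrelated.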

Well, let's see what Support Vector Machines thinks of the feature importance. SVM does not have a native way of returning the feature importance, because it is using "artificial features", created as combinations of the original features.
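
As a small illustration of this point, the sketch below trains two SVMs on synthetic data (generated with `make_classification`, so unrelated to the mushroom data set): with a linear kernel scikit-learn exposes one coefficient per original feature, while with the default RBF kernel no such attribute is available.

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# synthetic two-class data set, used only for illustration
X_demo, y_demo = make_classification(n_samples=200, n_features=5, n_informative=3, random_state=0)

linear_svc = SVC(kernel="linear").fit(X_demo, y_demo)
rbf_svc = SVC(kernel="rbf").fit(X_demo, y_demo)  # "rbf" is the default kernel of SVC

# with a linear kernel the model works in the original feature space, so coefficients exist
has_coef_linear = hasattr(linear_svc, "coef_")
# with a non-linear kernel there is no coefficient per original feature
has_coef_rbf = hasattr(rbf_svc, "coef_")

print("linear kernel exposes coef_:", has_coef_linear)
print("RBF kernel exposes coef_:", has_coef_rbf)
```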

However, there is a way out: one of the utilities in sklearn evaluates the relative importance of the features for _any classifier_, by randomly shuffling the values of each feature, one feature at a time, and checking how much the performance of the classifier drops. It's slower, because it requires several iterations, but it can be applied to any kind of trained model.
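
To make the idea concrete, here is a minimal hand-written sketch of a permutation-based importance, on synthetic data. It is a simplified illustration, not scikit-learn's implementation: the data set, the model and the number of repetitions (10) are assumptions chosen for the example. Each column of the test set is shuffled in turn, and the drop in accuracy with respect to the unshuffled data is recorded.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# synthetic data: with shuffle=False, the first 2 columns are informative, the last 3 are noise
X_demo, y_demo = make_classification(n_samples=500, n_features=5, n_informative=2,
                                     n_redundant=0, shuffle=False, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0, stratify=y_demo)
model = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)

rng = np.random.default_rng(0)
baseline = model.score(X_te, y_te)  # accuracy on the untouched test set

drops = []
for j in range(X_te.shape[1]):
    scores = []
    for _ in range(10):
        X_shuffled = X_te.copy()
        # shuffling one column breaks its link with the target, leaving the others intact
        X_shuffled[:, j] = rng.permutation(X_shuffled[:, j])
        scores.append(model.score(X_shuffled, y_te))
    # importance of feature j = average drop in accuracy when feature j is shuffled
    drops.append(baseline - np.mean(scores))

for j, drop in enumerate(drops):
    print("feature %d: mean accuracy drop %.4f" % (j, drop))
```

Features whose shuffling barely changes the score are the ones the model does not rely on.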

In [None]:
from sklearn.inspection import permutation_importance
result = permutation_importance(svc, X_test, y_test, n_repeats=10, random_state=42, n_jobs=2)

In [None]:
feature_importance_svc = [ [other_features[i], result["importances_mean"][i]] for i in range(0, len(other_features))]
# sort the list from most important to least important, cut it at the first 10
feature_importance_svc = sorted(feature_importance_svc, reverse=True, key=lambda x : x[1])[:10]

# create the bar chart
import matplotlib.pyplot as plt
fig, ax = plt.subplots()
x = list(range(len(feature_importance_svc)))
ax.bar(x, [y[1] for y in feature_importance_svc])
ax.set_xticks(x, [y[0] for y in feature_importance_svc], rotation=90)
ax.set_title("Relative feature importance according to Support Vector Machines")

Again, a disagreement, but we might notice that some of the features appear often in the top 10 of each algorithm.

What happens if we believe the classifiers, and retrain them using only the features that appear in the top 10 of all three? We expect a small-to-negligible drop in performance. Let's see if it is true!

In [None]:
most_important_features = [f for f, v in feature_importance_svc if f in [y[0] for y in feature_importance_rf] and f in [y[0] for y in feature_importance_lr]]
print("Features in the top 10 of all classifiers:", most_important_features)

# get the matrix with the values for just those features
X_reduced = df[most_important_features].values

# let's try the classification again
X_train_reduced, X_test_reduced, y_train, y_test = train_test_split(X_reduced, y, test_size=0.2, random_state=42, stratify=y, shuffle=True)
scaler.fit(X_train_reduced)
X_train_reduced = scaler.transform(X_train_reduced)
X_test_reduced = scaler.transform(X_test_reduced)

for classifier_name, classifier in classifiers.items() :
  classifier.fit(X_train_reduced, y_train)
  y_pred = classifier.predict(X_test_reduced)
  f1_value = f1_score(y_test, y_pred)

  print("For classifier \"%s\", F1=%.4f" % (classifier_name, f1_value))

If the results of this notebook are repeatable, we went from 22 to just 4 features, with only a small drop in performance! 4 features make a model way easier to understand for a human. The process we just went through is sometimes called **feature selection**, and can be performed in a number of different ways, to either (i) ease human interpretation or (ii) improve model performance, by removing useless or deceptive features.
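
As an example of one of these other ways, the sketch below uses a simple univariate filter from scikit-learn, `SelectKBest`, on synthetic data. This is a different technique from the "intersection of the top 10" used above, and the choices of scoring function (`f_classif`) and number of features to keep (`k=3`) are assumptions made just for the illustration.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# synthetic two-class data set with 8 features, used only for illustration
X_demo, y_demo = make_classification(n_samples=300, n_features=8, n_informative=3,
                                     n_redundant=0, shuffle=False, random_state=0)

# score each feature independently against the target and keep the 3 best-scoring ones
selector = SelectKBest(score_func=f_classif, k=3)
X_selected = selector.fit_transform(X_demo, y_demo)

# indices of the columns that were kept
selected_columns = selector.get_support(indices=True)
print("Kept columns:", selected_columns)
print("Shape before: %s, after: %s" % (X_demo.shape, X_selected.shape))
```

In a real experiment, the selector should be fitted on the training set only, for the same reason as the normalization step.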

## Local explanations
But what if we were interested in knowing the relative importance of the features for the classification of ONE sample in particular? Well, we could build a local linear approximation of the decision boundary around that sample, and look at the weights of that linear function. *Easy*.

Luckily, someone already did it for us, in the LIME library.
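
To give a feel for the general idea, here is a heavily simplified sketch on synthetic data. It is not LIME's actual implementation: the perturbation scheme, the kernel width and the use of `Ridge` as the linear surrogate are all assumptions made for this illustration. The steps are: generate points around the sample, ask the "black box" for its predictions, weight the points by their proximity to the sample, and fit a weighted linear model.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import Ridge

# synthetic data and a "black box" model, used only for illustration
X_demo, y_demo = make_classification(n_samples=500, n_features=4, n_informative=2,
                                     n_redundant=0, random_state=0)
black_box = RandomForestClassifier(random_state=0).fit(X_demo, y_demo)

rng = np.random.default_rng(0)
sample = X_demo[0]  # the single sample we want to explain

# 1. generate perturbed points around the sample
noise = rng.normal(scale=X_demo.std(axis=0), size=(1000, X_demo.shape[1]))
perturbed = sample + noise

# 2. ask the black box for the probability of class 1 on each perturbed point
probabilities = black_box.predict_proba(perturbed)[:, 1]

# 3. give more weight to the points that are close to the sample
distances = np.linalg.norm(perturbed - sample, axis=1)
kernel_width = 0.75 * np.sqrt(X_demo.shape[1])  # arbitrary choice for this sketch
weights = np.exp(-(distances ** 2) / kernel_width ** 2)

# 4. fit a weighted linear model; its coefficients act as the local explanation
surrogate = Ridge(alpha=1.0).fit(perturbed, probabilities, sample_weight=weights)
local_weights = surrogate.coef_

for j, w in enumerate(local_weights):
    print("feature %d: local weight %+.4f" % (j, w))
```

The sign of each weight indicates whether increasing that feature pushes the prediction towards or away from class 1, in the neighborhood of this one sample.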

In [None]:
!pip install lime

In [None]:
# create an 'explainer' object
from lime import lime_tabular
explainer = lime_tabular.LimeTabularExplainer(X_train, mode='classification', training_labels=y_train, feature_names=other_features)

# re-train the algorithms on the original (non-reduced) data
lr = LogisticRegression()
rf = RandomForestClassifier(random_state=42)

lr.fit(X_train, y_train)
rf.fit(X_train, y_train)

In [None]:
# obtain an explanation for a test sample from logistic regression; we use a function
# of logistic regression called "predict_proba" that shows pseudo-probabilities of a
# sample to belong to each class; don't worry too much about it
test_sample = X_test[100]
exp = explainer.explain_instance(test_sample, lr.predict_proba, num_features=5)
exp.show_in_notebook(show_table=True)

Interesting! Does Random Forest agree with Logistic Regression on where this sample should be classified, and why? Let's check.

In [None]:
exp = explainer.explain_instance(test_sample, rf.predict_proba, num_features=5)
exp.show_in_notebook(show_table=True)

Now, go back to the cell with the line
```
test_sample = X_test[100]
```
change it to evaluate another test sample, for example:
```
test_sample = X_test[0]
```
and re-run that cell and the following one. You can compare the explanations obtained for Logistic Regression and Random Forest on several samples, observing how they agree or disagree on classification, and why.
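
Rather than picking samples at random, one option is to look for test samples on which two classifiers give different predictions, since those are often the most instructive to explain. The sketch below shows the idea on synthetic data; the variable names (`lr_demo`, `rf_demo`, `disagreement_indices`) are hypothetical and the data is unrelated to the mushroom data set.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# synthetic two-class data set, used only for illustration
X_demo, y_demo = make_classification(n_samples=600, n_features=6, n_informative=3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2,
                                          random_state=0, stratify=y_demo)

lr_demo = LogisticRegression().fit(X_tr, y_tr)
rf_demo = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)

pred_lr = lr_demo.predict(X_te)
pred_rf = rf_demo.predict(X_te)

# indices of the test samples on which the two classifiers predict different classes
disagreement_indices = np.flatnonzero(pred_lr != pred_rf)
print("The classifiers disagree on %d of %d test samples" % (len(disagreement_indices), len(X_te)))
print("Indices of the first few disagreements:", disagreement_indices[:5])
```

Any of the returned indices can then be used to select the test sample passed to the explainer.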