
<br>
================================================================<br>
Permutation Importance vs Random Forest Feature Importance (MDI)<br>
================================================================<br>
In this example, we will compare the impurity-based feature importance of<br>
:class:`~sklearn.ensemble.RandomForestClassifier` with the<br>
permutation importance on the titanic dataset using<br>
:func:`~sklearn.inspection.permutation_importance`. We will show that the<br>
impurity-based feature importance can inflate the importance of numerical<br>
features.<br>
Furthermore, the impurity-based feature importance of random forests suffers<br>
from being computed on statistics derived from the training dataset: the<br>
importances can be high even for features that are not predictive of the target<br>
variable, as long as the model has the capacity to use them to overfit.<br>
This example shows how to use permutation importances as an alternative that<br>
can mitigate those limitations.<br>
.. topic:: References:<br>
   [1] L. Breiman, "Random Forests", Machine Learning, 45(1), 5-32,<br>
       2001. https://doi.org/10.1023/A:1010933404324<br>


In [None]:
print(__doc__)
import matplotlib.pyplot as plt
import numpy as np

In [None]:
from sklearn.datasets import fetch_openml
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.inspection import permutation_importance
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

############################################################################<br>
Data Loading and Feature Engineering<br>
------------------------------------<br>
Let's use pandas to load a copy of the titanic dataset. The following shows<br>
how to apply separate preprocessing on numerical and categorical features.<br>
<br>
We further include two random variables that are not correlated in any way<br>
with the target variable (``survived``):<br>
<br>
- ``random_num`` is a high cardinality numerical variable (as many unique<br>
  values as records).<br>
- ``random_cat`` is a low cardinality categorical variable (3 possible<br>
  values).

In [None]:
X, y = fetch_openml("titanic", version=1, as_frame=True, return_X_y=True)
rng = np.random.RandomState(seed=42)
X['random_cat'] = rng.randint(3, size=X.shape[0])
X['random_num'] = rng.randn(X.shape[0])

In [None]:
categorical_columns = ['pclass', 'sex', 'embarked', 'random_cat']
numerical_columns = ['age', 'sibsp', 'parch', 'fare', 'random_num']

In [None]:
X = X[categorical_columns + numerical_columns]

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42)

In [None]:
categorical_pipe = Pipeline([
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])
numerical_pipe = Pipeline([
    ('imputer', SimpleImputer(strategy='mean'))
])

In [None]:
preprocessing = ColumnTransformer(
    [('cat', categorical_pipe, categorical_columns),
     ('num', numerical_pipe, numerical_columns)])

In [None]:
rf = Pipeline([
    ('preprocess', preprocessing),
    ('classifier', RandomForestClassifier(random_state=42))
])
rf.fit(X_train, y_train)

############################################################################<br>
Accuracy of the Model<br>
---------------------<br>
Prior to inspecting the feature importances, it is important to check that<br>
the model's predictive performance is high enough. Indeed, there would be<br>
little interest in inspecting the important features of a non-predictive model.<br>
<br>
Here one can observe that the train accuracy is very high (the forest model<br>
has enough capacity to completely memorize the training set) but it can still<br>
generalize well enough to the test set thanks to the built-in bagging of<br>
random forests.<br>
<br>
It might be possible to trade some accuracy on the training set for a<br>
slightly better accuracy on the test set by limiting the capacity of the<br>
trees (for instance by setting ``min_samples_leaf=5`` or<br>
``min_samples_leaf=10``) so as to limit overfitting while not introducing too<br>
much underfitting.<br>
<br>
However, let's keep our high capacity random forest model for now so as to<br>
illustrate some pitfalls with feature importance on variables with many<br>
unique values.

In [None]:
print("RF train accuracy: %0.3f" % rf.score(X_train, y_train))
print("RF test accuracy: %0.3f" % rf.score(X_test, y_test))
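
The capacity trade-off mentioned above can be sketched on synthetic data. This is a minimal illustration rather than part of the Titanic pipeline: the data come from ``make_classification`` with ``flip_y=0.2`` (an assumption made here so that there is label noise to memorize), and the names ``X_demo``, ``y_demo`` and ``scores`` are hypothetical. Whether the constrained forest actually generalizes better depends on the dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a noisy dataset (assumption): flip_y=0.2 assigns a
# random class to about 20% of the samples, so there is noise to memorize.
X_demo, y_demo = make_classification(
    n_samples=600, n_features=8, n_informative=3, flip_y=0.2, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, stratify=y_demo, random_state=0)

scores = {}
for min_samples_leaf in (1, 10):
    forest = RandomForestClassifier(
        min_samples_leaf=min_samples_leaf, random_state=0)
    forest.fit(X_tr, y_tr)
    scores[min_samples_leaf] = {
        "train": forest.score(X_tr, y_tr),
        "test": forest.score(X_te, y_te),
    }
    print("min_samples_leaf=%d  train: %0.3f  test: %0.3f"
          % (min_samples_leaf,
             scores[min_samples_leaf]["train"],
             scores[min_samples_leaf]["test"]))
```

Comparing the two rows shows how much training accuracy is given up when the leaves are forced to hold at least 10 samples, and whether the test accuracy moves in return.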

############################################################################<br>
Tree's Feature Importance from Mean Decrease in Impurity (MDI)<br>
--------------------------------------------------------------<br>
The impurity-based feature importance ranks the numerical features as the<br>
most important features. As a result, the non-predictive ``random_num``<br>
variable is ranked the most important!<br>
<br>
This problem stems from two limitations of impurity-based feature<br>
importances:<br>
<br>
- impurity-based importances are biased towards high cardinality features;<br>
- impurity-based importances are computed on training set statistics and<br>
  therefore do not reflect the ability of a feature to be useful for making<br>
  predictions that generalize to the test set (when the model has enough<br>
  capacity).

In [None]:
ohe = (rf.named_steps['preprocess']
         .named_transformers_['cat']
         .named_steps['onehot'])
# get_feature_names_out replaces get_feature_names, which was removed in
# scikit-learn 1.2.
feature_names = ohe.get_feature_names_out(categorical_columns)
feature_names = np.r_[feature_names, numerical_columns]

In [None]:
tree_feature_importances = (
    rf.named_steps['classifier'].feature_importances_)
sorted_idx = tree_feature_importances.argsort()

In [None]:
y_ticks = np.arange(0, len(feature_names))
fig, ax = plt.subplots()
ax.barh(y_ticks, tree_feature_importances[sorted_idx])
# Set the tick positions before the labels so each label lines up with its bar.
ax.set_yticks(y_ticks)
ax.set_yticklabels(feature_names[sorted_idx])
ax.set_title("Random Forest Feature Importances (MDI)")
fig.tight_layout()
plt.show()
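
The two limitations listed above can be reproduced in isolation. The following is a minimal sketch on a toy dataset that is an assumption of this illustration (it is not the Titanic data): one informative binary feature, a low cardinality noise feature, a high cardinality noise feature, and a target in which 20% of the labels are flipped. All variable names are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng_toy = np.random.RandomState(0)
n_samples = 500

# One informative binary feature and two pure-noise features.
informative = rng_toy.randint(2, size=n_samples)
noise_low_card = rng_toy.randint(3, size=n_samples)  # 3 possible values
noise_high_card = rng_toy.randn(n_samples)  # continuous, many unique values

# The target copies the informative feature, with 20% of the labels flipped.
flip = rng_toy.rand(n_samples) < 0.2
y_toy = np.where(flip, 1 - informative, informative)
X_toy = np.column_stack([informative, noise_low_card, noise_high_card])

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_toy, y_toy)

# MDI is computed from the training data only and is normalized to sum to 1.
mdi = forest.feature_importances_
for name, value in zip(
        ["informative", "noise_low_card", "noise_high_card"], mdi):
    print("%-16s %0.3f" % (name, value))
```

Because the fully grown trees can only isolate the flipped labels by splitting repeatedly on the continuous noise column, that column collects a large share of the impurity decrease even though it carries no signal.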

############################################################################<br>
As an alternative, the permutation importances of ``rf`` are computed on a<br>
held out test set. This shows that the low cardinality categorical feature<br>
``sex`` is the most important feature.<br>
<br>
Also note that both random features have very low importances (close to 0) as<br>
expected.

In [None]:
result = permutation_importance(rf, X_test, y_test, n_repeats=10,
                                random_state=42, n_jobs=2)
sorted_idx = result.importances_mean.argsort()

In [None]:
fig, ax = plt.subplots()
ax.boxplot(result.importances[sorted_idx].T,
           vert=False, labels=X_test.columns[sorted_idx])
ax.set_title("Permutation Importances (test set)")
fig.tight_layout()
plt.show()
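
To make the mechanism concrete, here is a hand-rolled sketch of the idea behind permutation importance: shuffle one column at a time and record how much the score drops. It is a simplified illustration, not the implementation used by ``permutation_importance``. The helper ``manual_permutation_importance`` and the toy data are hypothetical, and the score is assumed to be the accuracy returned by ``model.score``.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split


def manual_permutation_importance(model, X, y, n_repeats=10, seed=0):
    """Mean drop in model.score when each column of X is shuffled in turn."""
    rng = np.random.RandomState(seed)
    baseline = model.score(X, y)
    drops = np.zeros((X.shape[1], n_repeats))
    for col in range(X.shape[1]):
        for rep in range(n_repeats):
            X_permuted = X.copy()
            # Shuffling one column breaks its link with the target while
            # keeping its marginal distribution unchanged.
            X_permuted[:, col] = rng.permutation(X_permuted[:, col])
            drops[col, rep] = baseline - model.score(X_permuted, y)
    return drops.mean(axis=1)


# Toy data (assumption): only column 0 determines the target.
rng_toy = np.random.RandomState(0)
X_toy = rng_toy.randn(600, 3)
y_toy = (X_toy[:, 0] > 0).astype(int)
X_toy_train, X_toy_test, y_toy_train, y_toy_test = train_test_split(
    X_toy, y_toy, random_state=0)

model = RandomForestClassifier(n_estimators=50, random_state=0)
model.fit(X_toy_train, y_toy_train)
manual_importances = manual_permutation_importance(
    model, X_toy_test, y_toy_test)
print(manual_importances)
```

Since the columns are shuffled on held-out data, a feature the model merely memorized during training has little effect on the score, which is why the random features end up close to 0.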

############################################################################<br>
It is also possible to compute the permutation importances on the training<br>
set. This reveals that ``random_num`` gets a significantly higher importance<br>
ranking than when computed on the test set. The difference between those two<br>
plots is a confirmation that the RF model has enough capacity to use that<br>
random numerical feature to overfit. You can further confirm this by<br>
re-running this example with a constrained RF using ``min_samples_leaf=10``.

In [None]:
result = permutation_importance(rf, X_train, y_train, n_repeats=10,
                                random_state=42, n_jobs=2)
sorted_idx = result.importances_mean.argsort()

In [None]:
fig, ax = plt.subplots()
ax.boxplot(result.importances[sorted_idx].T,
           vert=False, labels=X_train.columns[sorted_idx])
ax.set_title("Permutation Importances (train set)")
fig.tight_layout()
plt.show()
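
The suggested check with a constrained forest can be sketched without re-running the whole example. The snippet below uses a toy dataset that is an assumption of this illustration (one informative binary feature, one continuous noise feature, 20% flipped labels) and compares the training-set permutation importance of the noise feature for ``min_samples_leaf=1`` and ``min_samples_leaf=10``. The variable names are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

rng_toy = np.random.RandomState(0)
n_samples = 500
signal = rng_toy.randint(2, size=n_samples)
noise = rng_toy.randn(n_samples)  # unrelated to the target

# The target copies the signal feature, with 20% of the labels flipped.
flip = rng_toy.rand(n_samples) < 0.2
y_toy = np.where(flip, 1 - signal, signal)
X_toy = np.column_stack([signal, noise])

train_noise_importance = {}
for min_samples_leaf in (1, 10):
    forest = RandomForestClassifier(
        min_samples_leaf=min_samples_leaf, random_state=0)
    forest.fit(X_toy, y_toy)
    # Importances are deliberately computed on the training data here.
    result = permutation_importance(
        forest, X_toy, y_toy, n_repeats=10, random_state=0)
    train_noise_importance[min_samples_leaf] = result.importances_mean[1]
    print("min_samples_leaf=%d  noise importance (train): %0.3f"
          % (min_samples_leaf, train_noise_importance[min_samples_leaf]))
```

A smaller value for the constrained forest indicates that it relies less on the noise column to fit the training labels, which is the behaviour the paragraph above attributes to overfitting.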