In [1]:
## For data handling
import pandas as pd
import numpy as np

## For plotting
import matplotlib.pyplot as plt
from seaborn import set_style

## This sets the plot style
## to have a grid on a white background
set_style("whitegrid")

## Random forest for feature importances

Random forests can also provide feature importance scores. 

The `sklearn` algorithm measures importance in the following way. For each feature, it looks at every tree and identifies the nodes that split on that feature. It measures how much each of those splits reduced impurity, weighted by the share of training samples that reach the node, and averages the result over all the trees in the forest. After getting the average impurity reduction for each feature, `sklearn` scales the results so that the feature importances sum to $1$.
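To make the impurity-reduction idea concrete, here is a minimal sketch that recomputes the importances of a single decision tree by hand from its fitted `tree_` structure. The single `DecisionTreeClassifier`, its `random_state`, and fitting on the full iris data are choices made only for this illustration and are not part of the original notebook.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# One tree only (an assumption for illustration); a forest repeats this per tree
tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X, y)
t = tree.tree_

manual_importances = np.zeros(X.shape[1])
for node in range(t.node_count):
    left, right = t.children_left[node], t.children_right[node]
    if left == -1:
        # Leaves make no cut, so they contribute nothing
        continue
    # Sample-weighted impurity of the node minus that of its two children
    decrease = (t.weighted_n_node_samples[node] * t.impurity[node]
                - t.weighted_n_node_samples[left] * t.impurity[left]
                - t.weighted_n_node_samples[right] * t.impurity[right])
    manual_importances[t.feature[node]] += decrease

# Scale so the importances sum to 1
manual_importances /= manual_importances.sum()

print(manual_importances)
print(tree.feature_importances_)
```

The two printed arrays should agree up to floating point error, which suggests the hand computation matches what the tree reports for itself.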

We will demonstrate this on the `iris` data set.

In [2]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier

In [13]:
# load the data
iris = load_iris(as_frame=True)

X = iris['data']
X = X.rename(columns={'sepal length (cm)':'sepal_length',
                         'sepal width (cm)':'sepal_width',
                         'petal length (cm)':'petal_length',
                         'petal width (cm)':'petal_width'})
y = iris['target']

In [14]:
X_train, X_test, y_train, y_test = train_test_split(X.copy(), y.copy(),
                                                    shuffle=True,
                                                    random_state=153,
                                                    stratify=y,
                                                    test_size=.2)

In [17]:
forest = RandomForestClassifier(n_estimators=500,
                                max_depth=4,
                                random_state=8973489)

forest.fit(X_train, y_train)

RandomForestClassifier(max_depth=4, n_estimators=500, random_state=8973489)

The scaled impurity reduction that `sklearn` computes is stored in the fitted model's `feature_importances_` attribute. The entries follow the column order of `X_train`.

In [16]:
forest.feature_importances_

array([0.09352778, 0.01868163, 0.45875146, 0.42903913])
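As a quick check of the averaging step, the following sketch compares the forest's `feature_importances_` with the mean of the per-tree importances taken from `estimators_`. It refits a smaller forest on the full iris data so that it runs on its own. The `n_estimators=50` and `random_state=0` values are assumptions for illustration, not the settings used above.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

# Smaller forest than in the notebook, purely to keep the check fast
small_forest = RandomForestClassifier(n_estimators=50,
                                      max_depth=4,
                                      random_state=0).fit(X, y)

# Each fitted tree exposes its own normalized importances
per_tree = np.array([tree.feature_importances_
                     for tree in small_forest.estimators_])

# Average over the trees
averaged = per_tree.mean(axis=0)

print(np.allclose(averaged, small_forest.feature_importances_))
```

If every tree makes at least one split, which is the expected case on this data, the comparison should print `True`.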

We can make it a little more readable with a dataframe.

In [18]:
score_df = pd.DataFrame({'feature':X_train.columns,
                            'importance_score': forest.feature_importances_})

score_df.sort_values('importance_score',ascending=False)

| | feature | importance_score |
|---|---|---|
| 2 | petal_length | 0.458751 |
| 3 | petal_width | 0.429039 |
| 0 | sepal_length | 0.093528 |
| 1 | sepal_width | 0.018682 |


This is a nice feature of random forests. It shows us which variables the model relies on most, which can help us explain its predictions. The scores are also useful as another method for feature selection.
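One possible way to use the scores for feature selection is `sklearn`'s `SelectFromModel`, which keeps the features whose importance clears a threshold. The sketch below is a minimal example under assumptions that are not in the original notebook: it uses the mean importance as the threshold, a 100-tree forest, and the full iris data with its original column names. In a real workflow the selector would be fit on the training set only.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

iris = load_iris(as_frame=True)
X_full, y_full = iris["data"], iris["target"]

# Keep features whose importance is at least the mean importance
# (the threshold choice is an assumption made for this sketch)
selector = SelectFromModel(
    RandomForestClassifier(n_estimators=100, max_depth=4, random_state=0),
    threshold="mean",
)
selector.fit(X_full, y_full)

# Boolean mask over the columns -> names of the kept features
selected = list(X_full.columns[selector.get_support()])
print(selected)
```

Calling `selector.transform(X_full)` would then return only the selected columns, ready to pass to a downstream model.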

##### Extra Trees

Extra trees classifiers can also provide feature importance scores, again through the `feature_importances_` attribute.

In [18]:
et = ExtraTreesClassifier(n_estimators=500,
                          max_depth=4,
                          random_state=38383)

et.fit(X_train, y_train)

ExtraTreesClassifier(max_depth=4, n_estimators=500, random_state=38383)

In [19]:
et_score_df = pd.DataFrame({'feature':X_train.columns,
                            'importance_score': et.feature_importances_})

et_score_df.sort_values('importance_score',ascending=False)

| | feature | importance_score |
|---|---|---|
| 2 | petal_length | 0.439881 |
| 3 | petal_width | 0.40762 |
| 0 | sepal_length | 0.102791 |
| 1 | sepal_width | 0.049709 |
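To compare the two models directly, it can help to put both sets of scores in one dataframe. This is a small sketch, not part of the original notebook. It fits both models on the full iris data with fewer trees and an arbitrary `random_state`, so the exact numbers will differ from the tables above.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier

iris = load_iris(as_frame=True)
X_all, y_all = iris["data"], iris["target"]

# Hypothetical settings chosen for a quick side-by-side comparison
models = {
    "random_forest": RandomForestClassifier(n_estimators=100,
                                            max_depth=4,
                                            random_state=0),
    "extra_trees": ExtraTreesClassifier(n_estimators=100,
                                        max_depth=4,
                                        random_state=0),
}

# One column of importance scores per model, one row per feature
comparison = pd.DataFrame(
    {name: model.fit(X_all, y_all).feature_importances_
     for name, model in models.items()},
    index=X_all.columns,
)

print(comparison.sort_values("random_forest", ascending=False))
```

Each column sums to $1$, so the columns can be read as each model's share of total impurity reduction per feature.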


--------------------------

This notebook was written for the Erdős Institute Cőde Data Science Boot Camp by Matthew Osborne, Ph.D., 2022.

Any potential redistributors must seek and receive permission from Matthew Tyler Osborne, Ph.D. prior to redistribution. Redistribution of the material contained in this repository is conditional on acknowledgement of Matthew Tyler Osborne, Ph.D.'s original authorship and sponsorship of the Erdős Institute as subject to the license (see License.md).