 # More About Random Forest
 
In this notebook, we will explore two topics: 

1. How to "measure" the relative importance of the features in our data:
     * Random forest feature importance
     * Permutation importance
2. Visualizing the decision tree of a trained classifier



We are going to use the **Iron Ore** dataset. Note that:

* It was explored in week **05**
* We built SVM, KNN and RF classifiers on it in week **08**

The first part (from loading the data to building the RF classifier) is slightly modified from the week 08 notebook **am1-iron-ore-dataset**.

In [None]:
# Standard libraries
import numpy as np  # fast and robust library for numerical and matrix operations (its core is written in C)
import pandas as pd # data manipulation library, widely used for data analysis and built on top of NumPy
import matplotlib.pyplot as plt # for plotting
import seaborn as sns # plot nicely =)

# Auxiliary functions
from utils import *

# the following two lines tell IPython to reload imported modules automatically whenever
# utils.py is modified, without the need to restart the kernel.
%load_ext autoreload
%autoreload 2

# using the 'inline' backend, your matplotlib graphs will be included in your notebook, next to the code
%matplotlib inline

## Loading data

In [None]:
# reading dataset
df = pd.read_csv('../../data/iron_ore_study.csv')
df.head()

In [None]:
# adding label column

# Splits from oscar Fe>60%, SiO2<9, Al2O3<2, P<0.08
split_points = [
    ('FE', 60, [False, True]),
    ('SIO2', 9, [True, False]),
    ('AL2O3', 2, [True, False]),
    ('P', 0.08, [True, False]),  
]

# It's ore if everything is True
df['is_ore'] = np.vstack([
    pd.cut(df[elem], bins=[0, split, 100], labels=is_ore)
    for elem, split, is_ore in split_points
]).sum(axis=0) == 4

df.tail()
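The labelling rule above can also be written as plain boolean comparisons, which may be easier to read. The following is a minimal sketch on a small made-up table (the values are hypothetical; only the column names and thresholds come from the notebook). It assumes all grades lie in the interval (0, 100] and uses `<=` for the upper limits because `pd.cut` bins are right-closed by default.

```python
import pandas as pd

# Small hypothetical sample: column names follow the notebook, values are made up.
demo = pd.DataFrame({
    "FE":    [62.0, 58.0, 65.0, 61.0],
    "SIO2":  [5.0,  4.0,  12.0, 8.0],
    "AL2O3": [1.5,  1.0,  1.0,  2.5],
    "P":     [0.05, 0.02, 0.03, 0.04],
})

# Same thresholds as split_points; a sample is ore only if all four conditions hold.
demo["is_ore"] = (
    (demo["FE"] > 60)
    & (demo["SIO2"] <= 9)
    & (demo["AL2O3"] <= 2)
    & (demo["P"] <= 0.08)
)

print(demo)
```

Only the first row satisfies all four conditions; each of the other rows fails exactly one of them.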

Inspecting data balance. 

In [None]:
sns.countplot(x='is_ore', data=df);

Storing features and labels.

In [None]:
# Storing features and labels

X = df.iloc[:,:-1].copy(deep=True)  # our features: all columns but the last
y = df["is_ore"].values            # respective labels

unique, counts = np.unique(y, return_counts=True)

print('is ore == {}:'.format(unique[0]), counts[0])
print('is ore == {}:'.format(unique[1]), counts[1])
print('Proportion:', round(counts[0] / counts[1], 2), ': 1')

Now the features (variables) are stored in the pandas DataFrame `X`, and the associated labels are stored in the NumPy 1-D array `y`.

In [None]:
# sanity check!

display(X.head()) # features (or variables)

In [None]:
# sanity check! 

y # labels

### Splitting the Data into Train and Test Sets

In [None]:
from sklearn.model_selection import train_test_split 

x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3,  random_state=42
)

In [None]:
# Sanity check: train test split distribution

fig, axes = plt.subplots(1,2,figsize=(9,4), sharey=True, constrained_layout=True)

i = 0
axes[i].set_title("y_train", fontsize=20)
sns.countplot(x=y_train, ax=axes[i])

i += 1
axes[i].set_title("y_test", fontsize=20)
sns.countplot(x=y_test, ax=axes[i]);
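If the class proportions in the two plots above differ noticeably, one option is to stratify the split. The sketch below uses synthetic, imbalanced labels (an assumption, not the iron ore data) to show how the `stratify` argument of `train_test_split` keeps the positive share almost identical in both sets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the notebook data: 1000 samples, 20% positives.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(1000, 3))
y_demo = np.array([True] * 200 + [False] * 800)

# stratify=y_demo asks for the same class proportions in train and test.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

print(f"positive share, train: {y_tr.mean():.3f}")
print(f"positive share, test:  {y_te.mean():.3f}")
```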

## Random Forest classifier

Remember that the RF does not use the concept of distances in a $d$-dimensional space (with $d=$ number of features). So, we do **not** need to standardize the data. 

In [None]:
from sklearn.ensemble import RandomForestClassifier # implements random forest.

n_trees = 2 # number of trees in the forest.

# model definition
model = RandomForestClassifier(n_estimators=n_trees, random_state=42)      

# model training
model.fit(x_train, y_train)
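To illustrate the remark about standardization, the sketch below trains two forests on synthetic data (an assumption, not the iron ore data): one on raw features and one on standardized features. Tree splits depend only on the ordering of the values within each feature, so both forests are expected to agree on (nearly) all held-out predictions. Tiny floating-point effects could in principle cause rare differences.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic data: first 200 samples for training, last 100 held out.
X_demo, y_demo = make_classification(n_samples=300, n_features=5, random_state=0)
X_tr, X_te = X_demo[:200], X_demo[200:]
y_tr = y_demo[:200]

# Standardize using statistics from the training part only.
scaler = StandardScaler().fit(X_tr)

rf_raw = RandomForestClassifier(n_estimators=10, random_state=42).fit(X_tr, y_tr)
rf_scaled = RandomForestClassifier(n_estimators=10, random_state=42).fit(
    scaler.transform(X_tr), y_tr
)

# Fraction of held-out samples on which both forests predict the same class.
agreement = np.mean(rf_raw.predict(X_te) == rf_scaled.predict(scaler.transform(X_te)))
print(f"agreement between raw and standardized forests: {agreement:.3f}")
```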

Prediction using test data:

In [None]:
y_pred = model.predict(x_test)  

## Model evaluation

In [None]:
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report  

In [None]:
print(f"RF train accuracy: {model.score(x_train, y_train):.3f}")
print(f"RF test accuracy: {model.score(x_test, y_test):.3f}")

We can print a detailed model report!

In [None]:
# Detailed model report

print(f"Classification report for the classifier\n"
      f"{classification_report(y_test, y_pred)}\n")

The above metrics just summarize different nuances of the confusion matrix. Remember to plot that matrix to get more insight!

In [None]:
# Confusion matrix

cm = confusion_matrix(y_test,y_pred)

sns.heatmap(cm,annot=True,fmt="d")
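# scikit-learn puts the actual classes on the rows (y axis) and the predicted classes on the columns (x axis)
plt.xlabel("Predicted class")
plt.ylabel("Actual class");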
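To make the link between the report and the confusion matrix concrete, the sketch below recomputes precision, recall and accuracy directly from the four matrix entries. The labels are a tiny hand-made example (1 = ore, 0 = not ore), not output from the model above. In scikit-learn's convention, rows are actual classes and columns are predicted classes.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Tiny hand-made example: 1 = ore, 0 = not ore.
y_true_demo = np.array([1, 1, 1, 0, 0, 0, 0, 1])
y_pred_demo = np.array([1, 0, 1, 0, 1, 0, 0, 1])

# For binary labels sorted as (0, 1), ravel() yields tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()

precision = tp / (tp + fp)                  # of the predicted ore, how much is ore?
recall = tp / (tp + fn)                     # of the actual ore, how much was found?
accuracy = (tp + tn) / (tp + tn + fp + fn)  # overall share of correct predictions

print(f"precision={precision:.2f}, recall={recall:.2f}, accuracy={accuracy:.2f}")
```

The hand-computed values should match `precision_score` and `recall_score` for the same labels.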

# "Feature Importance"

For an in-depth discussion of Permutation Importance *vs* Random Forest Feature Importance, follow this [link](https://scikit-learn.org/stable/auto_examples/inspection/plot_permutation_importance.html#sphx-glr-auto-examples-inspection-plot-permutation-importance-py).

## Random Forest Feature Importance

In [None]:
# Random Forest feature importance 
#   from Mean Decrease in Impurity (MDI)

feature_names = X.columns.to_numpy()

# computing and storing result in a Pandas Series
mdi_importances = pd.Series(
    model.feature_importances_, index=feature_names
).sort_values(ascending=True)

# plotting
ax = mdi_importances.plot.barh()
ax.set_title("Random Forest Feature Importances (MDI, training set)")
ax.figure.tight_layout()
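As a rough illustration of where the MDI values come from, the sketch below fits a small forest on synthetic data (an assumption, not the iron ore data) and averages the impurity-based importances of the individual trees. For a forest in which every tree has at least one split, this average, renormalized to sum to one, should reproduce `feature_importances_`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data standing in for the notebook features.
X_demo, y_demo = make_classification(n_samples=300, n_features=5, random_state=0)
forest = RandomForestClassifier(n_estimators=5, random_state=42).fit(X_demo, y_demo)

# Each fitted tree exposes its own normalized impurity-based importances.
per_tree = np.array([t.feature_importances_ for t in forest.estimators_])

# Average over trees, then renormalize so the importances sum to 1.
averaged = per_tree.mean(axis=0)
averaged = averaged / averaged.sum()

print("averaged over trees:", np.round(averaged, 3))
print("forest attribute:   ", np.round(forest.feature_importances_, 3))
```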

## Permutation Importance

In [None]:
# Permutation Importance

from sklearn.inspection import permutation_importance

# computing permutation importance
pmi_res = permutation_importance(
    model, x_test, y_test, n_repeats=10, 
    random_state=42
)

# Storing result
sorted_importances_idx = pmi_res.importances_mean.argsort()
pmi_importances = pd.DataFrame(
    pmi_res.importances[sorted_importances_idx].T,
    columns=feature_names[sorted_importances_idx],
)

# plotting

ax = pmi_importances.plot.box(vert=False, whis=10)
ax.set_title("Permutation Importances (test set)")
ax.axvline(x=0, color="k", linestyle="--")
ax.set_xlabel("Decrease in accuracy score")
ax.figure.tight_layout()
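The idea behind permutation importance can be sketched by hand: shuffle one feature column of the held-out data, re-score the model, and record how much the accuracy drops. The example below is a simplified illustration on synthetic data (an assumption: column 0 fully determines the label and column 1 is pure noise); it is not meant as a replacement for `permutation_importance`.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(600, 2))        # column 0 drives the label, column 1 is noise
y_demo = (X_demo[:, 0] > 0).astype(int)

X_tr, X_te = X_demo[:400], X_demo[400:]
y_tr, y_te = y_demo[:400], y_demo[400:]

clf = RandomForestClassifier(n_estimators=20, random_state=42).fit(X_tr, y_tr)
baseline = accuracy_score(y_te, clf.predict(X_te))

drops = []
for col in range(X_te.shape[1]):
    scores = []
    for _ in range(10):                   # repeat the shuffle, like n_repeats=10
        X_perm = X_te.copy()
        X_perm[:, col] = rng.permutation(X_perm[:, col])
        scores.append(accuracy_score(y_te, clf.predict(X_perm)))
    # importance = mean decrease in accuracy after breaking this feature
    drops.append(baseline - np.mean(scores))

print(f"baseline accuracy: {baseline:.3f}")
print("mean accuracy drop per column:", np.round(drops, 3))
```

Shuffling the informative column should hurt accuracy much more than shuffling the noise column.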

# Visualizing the decision tree

In [None]:
# Visualizing the decision tree

from sklearn import tree

cn = [str(c) for c in np.unique(y_train)]  # class names as strings: ["False", "True"]

fig, axes = plt.subplots(1,2,figsize = (16,10))

# plotting only two trees!
for index in range(0, 2):
    tree.plot_tree(model.estimators_[index],
                   feature_names=list(feature_names),  # a plain list works across scikit-learn versions
                   class_names=cn,
                   filled=True,
                   ax=axes[index])
    
    axes[index].set_title('Estimator: ' + str(index), fontsize = 11)
    
fig.savefig("plotting_decision_trees.pdf")
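When a plotted tree is too large to read, a plain-text dump of the split rules can be a handy alternative. The sketch below uses `sklearn.tree.export_text` on a shallow tree from a small forest fitted to synthetic data; the feature names `f0` to `f3` and the `max_depth=2` limit are assumptions chosen only to keep the output short.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import export_text

# Synthetic data with hypothetical feature names.
X_demo, y_demo = make_classification(n_samples=200, n_features=4, random_state=0)
names = ["f0", "f1", "f2", "f3"]

# Shallow trees keep the printed rules short.
forest = RandomForestClassifier(
    n_estimators=2, max_depth=2, random_state=42
).fit(X_demo, y_demo)

# Text version of the first tree: one line per split or leaf.
tree_text = export_text(forest.estimators_[0], feature_names=names)
print(tree_text)
```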