<a href="https://colab.research.google.com/github/danielbauer1979/MSDIA_PredictiveModelingAndMachineLearning/blob/main/GB888_IV_10_RandomForestAndBoostingForClassification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Random Forest And Boosting For Classification


In this tutorial, we apply random forests and boosted trees to our Caravan insurance case study and analyze whether they improve on the learners considered so far.

As usual, let's start by loading the relevant libraries.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import statsmodels.formula.api as smf
import statsmodels.api as sm
import seaborn as sns

from sklearn.model_selection import train_test_split, cross_val_score
from io import StringIO
from sklearn.tree import DecisionTreeRegressor, DecisionTreeClassifier, export_graphviz
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier, BaggingRegressor, RandomForestRegressor, GradientBoostingRegressor
from sklearn.metrics import mean_squared_error,confusion_matrix, classification_report, roc_curve, auc

We also define the following helper function, which creates images of tree models using pydot, since sklearn's `export_graphviz` only writes a textual DOT description of a tree:

In [None]:
import pydot
from IPython.display import Image
def print_tree(estimator, features, class_names=None, filled=True):
  # Write the tree in Graphviz DOT format to an in-memory text buffer
  dot_data = StringIO()
  export_graphviz(estimator, out_file=dot_data, feature_names=features, class_names=class_names, filled=filled)
  # Parse the DOT text with pydot (recent pydot versions return a list of graphs)
  graph = pydot.graph_from_dot_data(dot_data.getvalue())
  return graph

## Case Study: Caravan Insurance Purchases

Let's go back to the `Caravan` insurance data:

In [None]:
!git clone https://github.com/danielbauer1979/MSDIA_PredictiveModelingAndMachineLearning.git

In [None]:
Caravan = pd.read_csv('MSDIA_PredictiveModelingAndMachineLearning/GB888_III_7_CaravanData.csv', index_col=0)

Let's split the dataset:

In [None]:
Caravan.Purchase = (Caravan.Purchase=='Yes')
train, test = train_test_split(Caravan, test_size=0.30, random_state=1)

X = train.drop(['Purchase'], axis=1)
y = train['Purchase']
Xtest = test.drop(['Purchase'], axis=1)
ytest = test['Purchase']
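
If purchasers make up only a small share of the data, a purely random split can leave the training and test sets with somewhat different purchase rates. The sketch below illustrates the `stratify` argument of `train_test_split` on a synthetic stand-in; the frame `demo` and its columns `x1` and `x2` are hypothetical and not part of the Caravan file.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the Caravan data (hypothetical columns, not the real file)
rng = np.random.default_rng(1)
demo = pd.DataFrame({
    "x1": rng.normal(size=1000),
    "x2": rng.normal(size=1000),
    "Purchase": rng.random(1000) < 0.06,  # rare positive class
})

# stratify keeps the share of purchasers (nearly) equal in both parts
demo_train, demo_test = train_test_split(
    demo, test_size=0.30, random_state=1, stratify=demo["Purchase"]
)

rate_train = demo_train["Purchase"].mean()
rate_test = demo_test["Purchase"].mean()
print(f"train rate: {rate_train:.3f}, test rate: {rate_test:.3f}")
```

Whether stratification matters for the actual case study depends on how rare purchases are in the data; the split used in this tutorial is left as is.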

### Boosting

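Let's run a boosting model. Note that we fit a regressor to the 0/1 purchase indicator, so its predictions are scores rather than class labels: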

In [None]:
boost = GradientBoostingRegressor(n_estimators=5000, learning_rate=0.01,random_state=1)
boost.fit(X, y)
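
The model above is a regressor fit to a binary target, so its predictions are not restricted to the interval [0, 1]; for ranking-based measures such as the ROC curve this does not matter. As an alternative, one could use `GradientBoostingClassifier`, whose `predict_proba` returns class probabilities. The following is a minimal sketch on synthetic data, not the tutorial's own approach; the names `X_demo`, `y_demo`, and `boost_clf` and the smaller `n_estimators` are assumptions made to keep the example quick.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in data (not the Caravan file)
X_demo, y_demo = make_classification(
    n_samples=600, n_features=10, weights=[0.9, 0.1], random_state=1
)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=1, stratify=y_demo
)

# Fewer trees than in the tutorial to keep the sketch fast
boost_clf = GradientBoostingClassifier(
    n_estimators=200, learning_rate=0.05, random_state=1
)
boost_clf.fit(Xd_train, yd_train)

# Column 1 holds the estimated probability of the positive class
proba_boost = boost_clf.predict_proba(Xd_test)[:, 1]
print(proba_boost[:5])
```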

To appraise what features matter, let's consider feature importance scores:

In [None]:
feature_importance = boost.feature_importances_*100
rel_imp = pd.Series(feature_importance, index=X.columns).sort_values(ascending=False, inplace=False)
rel_imp = rel_imp[0:20]
print(rel_imp)
rel_imp.plot(kind='barh', color='b', ).invert_yaxis()
plt.xlabel('Variable Importance')
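
The factor of 100 above turns the scores into percentages: for sklearn's tree ensembles, `feature_importances_` is normalized so that the entries sum to one. A small check on synthetic data (the names `X_demo`, `y_demo`, and `gbr` are placeholders for this sketch):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic stand-in data with a 0/1 target
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=3, random_state=1
)

gbr = GradientBoostingRegressor(n_estimators=50, learning_rate=0.1, random_state=1)
gbr.fit(X_demo, y_demo)

# Importances are non-negative shares of the total, so *100 gives percent
importance_pct = gbr.feature_importances_ * 100
total_pct = importance_pct.sum()
print(np.round(importance_pct, 1))
print("total:", round(total_pct, 6))
```

The scores are therefore relative: they say how the model's total impurity reduction is divided among the features, not how strong the model is overall.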

The predictions are:

In [None]:
pred_boost = boost.predict(Xtest)

This results in the following ROC curve and AUC:

In [None]:
fpr, tpr, threshold = roc_curve(ytest, pred_boost)
roc_auc = auc(fpr, tpr)
plt.title('Receiver Operating Characteristic')
plt.plot(fpr, tpr, 'b', label = 'AUC = %0.2f' % roc_auc)
plt.legend(loc = 'lower right')
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()
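
To make the AUC number more concrete, the toy example below computes it three ways: as the area under the curve from `roc_curve`, directly via `roc_auc_score`, and as the share of (purchaser, non-purchaser) pairs in which the purchaser receives the higher score. The arrays `y_toy` and `s_toy` are made-up values for illustration only.

```python
import numpy as np
from sklearn.metrics import roc_curve, auc, roc_auc_score

# Made-up labels and scores, purely for illustration
y_toy = np.array([0, 0, 1, 1, 0, 1])
s_toy = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7])

# 1) Area under the ROC curve, as in the cell above
fpr_toy, tpr_toy, _ = roc_curve(y_toy, s_toy)
auc_curve = auc(fpr_toy, tpr_toy)

# 2) The same quantity in a single call
auc_direct = roc_auc_score(y_toy, s_toy)

# 3) Share of positive/negative pairs ranked correctly (ties count one half)
pos = s_toy[y_toy == 1]
neg = s_toy[y_toy == 0]
diff = pos[:, None] - neg[None, :]
auc_pairs = ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size

print(auc_curve, auc_direct, auc_pairs)
```

Because only the ordering of the scores enters, the regression-style predictions used in this tutorial can be passed to `roc_curve` without rescaling.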

So the performance looks decent.

### Random Forest

Let's also run a random forest:

In [None]:
rf = RandomForestRegressor(n_estimators=300, random_state=1)
rf.fit(X, y)
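
As with boosting, a classifier variant is available. A convenient by-product of bagging is the out-of-bag (OOB) estimate: each observation is scored only by the trees whose bootstrap sample did not contain it, which gives a rough performance check without touching the test set. The sketch below uses synthetic data; `X_demo`, `y_demo`, `rf_clf`, and the smaller `n_estimators` are assumptions for illustration, not the tutorial's setup.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

# Synthetic, imbalanced stand-in data (not the Caravan file)
X_demo, y_demo = make_classification(
    n_samples=600, n_features=10, weights=[0.9, 0.1], random_state=1
)

# oob_score=True stores out-of-bag predictions during fitting
rf_clf = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=1)
rf_clf.fit(X_demo, y_demo)

# Column 1 of the OOB decision function: estimated probability of the positive class
oob_proba = rf_clf.oob_decision_function_[:, 1]
oob_auc = roc_auc_score(y_demo, oob_proba)
print("OOB accuracy:", rf_clf.oob_score_)
print("OOB AUC:", oob_auc)
```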

Feature importance scores are:

In [None]:
Importance_ = pd.DataFrame({'Importance':rf.feature_importances_*100}, index=X.columns)
Importance = Importance_.sort_values('Importance', axis=0, ascending=False)[0:20]
Importance.plot(kind='barh', color='b', legend=False).invert_yaxis()
plt.xlabel('Variable Importance')

So the ranking of the most important features differs quite a bit from that of the boosted model.

Let's look at the predictions:

In [None]:
pred_rf = rf.predict(Xtest)

And the corresponding ROC curve and AUC:

In [None]:
fpr, tpr, threshold = roc_curve(ytest, pred_rf)
roc_auc = auc(fpr, tpr)
plt.title('Receiver Operating Characteristic')
plt.plot(fpr, tpr, 'b', label = 'AUC = %0.2f' % roc_auc)
plt.legend(loc = 'lower right')
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()

So the random forest does not quite match the performance of the boosted model.
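
A comparison based on a single train/test split can be noisy, particularly when the positive class is small. One possible way to check whether such a ranking is stable is k-fold cross-validation with `cross_val_score` and `scoring="roc_auc"`. The sketch below does this for classifier versions of both learners on synthetic data; the data, the dictionary `models`, and the reduced tree counts are assumptions for illustration and say nothing about the Caravan results.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic, imbalanced stand-in data (not the Caravan file)
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, weights=[0.9, 0.1], random_state=1
)

# Small ensembles so the sketch runs quickly
models = {
    "boosting": GradientBoostingClassifier(
        n_estimators=100, learning_rate=0.05, random_state=1
    ),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=1),
}

cv_auc = {}
for name, model in models.items():
    # 5-fold cross-validated AUC; classifiers get stratified folds by default
    scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring="roc_auc")
    cv_auc[name] = scores.mean()
    print(f"{name}: mean AUC = {scores.mean():.3f} (std {scores.std():.3f})")
```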