<a href="https://colab.research.google.com/github/danielbauer1979/MSDIA_PredictiveModelingAndMachineLearning/blob/main/GB888_IV_10_RandomForestAndBoostingForClassification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Random Forest And Boosting For Classification


In this tutorial, we use random forests and boosted trees in our Caravan Insurance purchases case study, analyzing whether they can improve on the learners considered so far.

As usual, let's start by loading the relevant libraries.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import statsmodels.formula.api as smf
import statsmodels.api as sm
import seaborn as sns

from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier
from sklearn.metrics import mean_squared_error,confusion_matrix, classification_report, roc_curve, auc

## Case Study: Caravan Insurance Purchases

Let's go back to the `Caravan` insurance data:

In [None]:
!git clone https://github.com/danielbauer1979/MSDIA_PredictiveModelingAndMachineLearning.git

In [None]:
Caravan = pd.read_csv('MSDIA_PredictiveModelingAndMachineLearning/GB888_III_7_CaravanData.csv', index_col=0)

Let's split the dataset, using the same approach as we did in the previous module:

In [None]:
Caravan.Purchase = (Caravan.Purchase=='Yes')
train, test = train_test_split(Caravan, test_size=0.25, random_state=1)

X = train.drop(['Purchase'], axis=1)
y = train['Purchase']
Xtest = test.drop(['Purchase'], axis=1)
ytest = test['Purchase']

Recall that we previously considered a logistic regression model. It produced an AUC of .71 but did not identify a single true positive at a 50% cutoff. We also considered a pruned tree, which produced an AUC of .72 and was able to identify a few true positives!

### Random Forest

Let's start with a random forest (with default parameters, so no tuning for now):

In [None]:
rf = RandomForestClassifier(random_state=1)
rf.fit(X, y)

To appraise what features matter, let's consider feature importance scores:

In [None]:
Importance_ = pd.DataFrame({'Importance':rf.feature_importances_*100}, index=X.columns)
Importance = Importance_.sort_values('Importance', axis=0, ascending=False)[0:20]
Importance.plot(kind='barh', color='b', ).invert_yaxis()
plt.xlabel('Variable Importance')
plt.gca().legend_ = None
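
A side note on the scale of these scores: scikit-learn's impurity-based `feature_importances_` are normalized to sum to one, so multiplying by 100 expresses them as percentages. The following is a minimal sketch on synthetic data generated with `make_classification` (an assumption for illustration, not the Caravan data); the names `X_demo`, `y_demo`, `rf_demo`, and `importance_pct` are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (assumption: this is not the Caravan data)
X_demo, y_demo = make_classification(n_samples=500, n_features=10,
                                     n_informative=4, random_state=1)

rf_demo = RandomForestClassifier(n_estimators=50, random_state=1)
rf_demo.fit(X_demo, y_demo)

# Impurity-based importances are normalized, so the scaled scores sum to 100
importance_pct = rf_demo.feature_importances_ * 100
print(importance_pct.round(1))
print(importance_pct.sum())
```

Because the scores are relative shares, a bar in the plot is best read as a feature's portion of the total impurity reduction rather than as an absolute effect size.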



Let's look at the predictions:

In [None]:
pred_rf = rf.predict_proba(Xtest)
pred_rf = pred_rf[:,1]

And the ROC curve and AUC:

In [None]:
fpr, tpr, threshold = roc_curve(ytest, pred_rf)
roc_auc = auc(fpr, tpr)
plt.title('Receiver Operating Characteristic')
plt.plot(fpr, tpr, 'b', label = 'AUC = %0.2f' % roc_auc)
plt.legend(loc = 'lower right')
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()
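
To make the AUC number easier to interpret, it can help to recall that it equals the share of (positive, negative) pairs in which the positive case receives the higher score, with ties counted as one half. The sketch below checks this on a small made-up example; the labels and scores are invented for illustration and are not Caravan output.

```python
import numpy as np
from sklearn.metrics import roc_curve, auc

# Toy labels and scores (made up for illustration)
y_true_toy = np.array([0, 0, 1, 0, 1, 0, 0, 1])
scores_toy = np.array([0.10, 0.40, 0.35, 0.20, 0.80, 0.05, 0.50, 0.45])

# AUC as the area under the ROC curve, computed as in the cell above
fpr_toy, tpr_toy, _ = roc_curve(y_true_toy, scores_toy)
auc_curve = auc(fpr_toy, tpr_toy)

# AUC as the share of correctly ranked (positive, negative) pairs
pos = scores_toy[y_true_toy == 1]
neg = scores_toy[y_true_toy == 0]
diff = pos[:, None] - neg[None, :]
auc_pairs = ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size

print(auc_curve, auc_pairs)
```

Here 12 of the 15 pairs are ranked correctly, so both calculations give 0.8.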

In [None]:
threshold = 0.5
y_pred_class = (pred_rf > threshold).astype(int)

conf_matrix = confusion_matrix(ytest, y_pred_class)
print("Confusion Matrix:")
print(conf_matrix)
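
As a reading aid for the printed matrix: in scikit-learn's `confusion_matrix`, rows correspond to the true classes and columns to the predicted classes, so for a binary problem the flattened entries are TN, FP, FN, TP. A minimal sketch with made-up labels (not the Caravan predictions):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Toy true labels and predicted classes (made up for illustration)
y_true_toy = np.array([0, 0, 0, 0, 1, 1, 0, 1])
y_pred_toy = np.array([0, 1, 0, 0, 1, 0, 0, 0])

# Rows are true classes, columns are predicted classes
tn, fp, fn, tp = confusion_matrix(y_true_toy, y_pred_toy).ravel()

sensitivity = tp / (tp + fn)  # share of actual positives that are found
specificity = tn / (tn + fp)  # share of actual negatives that are left alone

print(tn, fp, fn, tp)
print(sensitivity, specificity)
```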

So the default forest does not quite match the performance of the pruned (!) tree or the logistic regression model. Let's try a second random forest with alternate parameters: more trees (400) and 45 features considered at each split:

In [None]:
rf = RandomForestClassifier(n_estimators=400, max_features=45, random_state=1)
rf.fit(X, y)
pred_rf = rf.predict_proba(Xtest)
pred_rf = pred_rf[:,1]

Let's look at the AUC:

In [None]:
fpr, tpr, threshold = roc_curve(ytest, pred_rf)
roc_auc = auc(fpr, tpr)
plt.title('Receiver Operating Characteristic')
plt.plot(fpr, tpr, 'b', label = 'AUC = %0.2f' % roc_auc)
plt.legend(loc = 'lower right')
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()

So it gets a little better but not much...
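
Rather than trying one alternative setting at a time, one option is to compare a few `max_features` values by cross-validated AUC using `cross_val_score`. The sketch below does this on synthetic, imbalanced data from `make_classification` (an assumption, not the Caravan data); the candidate values and the `_demo` names are chosen only for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic, imbalanced stand-in data (assumption: this is not the Caravan data)
X_demo, y_demo = make_classification(n_samples=600, n_features=20, n_informative=5,
                                     weights=[0.9, 0.1], random_state=1)

cv_auc = {}
for m in ['sqrt', 10, 20]:  # number of features considered at each split
    rf_demo = RandomForestClassifier(n_estimators=50, max_features=m, random_state=1)
    fold_auc = cross_val_score(rf_demo, X_demo, y_demo, cv=3, scoring='roc_auc')
    cv_auc[m] = fold_auc.mean()

for m, a in cv_auc.items():
    print(m, round(a, 3))
```

The same loop could be run on `X` and `y` from above; it only uses the training data, so the test set stays untouched for the final comparison.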

### Boosting

Let's run a gradient boosting model, again with the default parameters:

In [None]:
boost = GradientBoostingClassifier(random_state=1)
boost.fit(X, y)

To appraise what features matter, let's consider feature importance scores:

In [None]:
feature_importance = boost.feature_importances_*100
rel_imp = pd.Series(feature_importance, index=X.columns).sort_values(ascending=False, inplace=False)
rel_imp = rel_imp[0:20]
print(rel_imp)
rel_imp.plot(kind='barh', color='b', ).invert_yaxis()
plt.xlabel('Variable Importance')

So here the important features are quite different.

The predictions are:

In [None]:
pred_boost = boost.predict_proba(Xtest)
pred_boost = pred_boost[:,1]

Resulting in the following ROC curve and AUC:

In [None]:
fpr, tpr, threshold = roc_curve(ytest, pred_boost)
roc_auc = auc(fpr, tpr)
plt.title('Receiver Operating Characteristic')
plt.plot(fpr, tpr, 'b', label = 'AUC = %0.2f' % roc_auc)
plt.legend(loc = 'lower right')
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()

So, the performance improves quite a bit, and we are also beating the pruned tree and the logistic regression model.
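
The number of trees and the learning rate interact, so it can be useful to see how the test AUC evolves as trees are added. One way to sketch this is with `staged_predict_proba`, which yields the predicted probabilities after each boosting stage. The example below uses synthetic data from `make_classification` (an assumption, not the Caravan data), and all `_demo`, `_tr`, and `_te` names are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in data (assumption: this is not the Caravan data)
X_demo, y_demo = make_classification(n_samples=1000, n_features=20, n_informative=5,
                                     weights=[0.9, 0.1], random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25,
                                          stratify=y_demo, random_state=1)

gb_demo = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, random_state=1)
gb_demo.fit(X_tr, y_tr)

# One AUC value per boosting stage, i.e. after 1, 2, ..., 100 trees
staged_auc = [roc_auc_score(y_te, p[:, 1]) for p in gb_demo.staged_predict_proba(X_te)]

best_n = int(np.argmax(staged_auc)) + 1
print(best_n, round(max(staged_auc), 3))
```

Picking the number of trees this way on the test set would be optimistic; in practice a separate validation split or cross-validation would be the safer choice.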

Let's decrease the learning rate and use a few more trees:

In [None]:
boost = GradientBoostingClassifier(n_estimators=1000, learning_rate=0.001,random_state=1)
boost.fit(X, y)

pred_boost = boost.predict_proba(Xtest)
pred_boost = pred_boost[:,1]

fpr, tpr, threshold = roc_curve(ytest, pred_boost)
roc_auc = auc(fpr, tpr)
plt.title('Receiver Operating Characteristic')
plt.plot(fpr, tpr, 'b', label = 'AUC = %0.2f' % roc_auc)
plt.legend(loc = 'lower right')
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()

So the performance improves a little. Let's look at the confusion matrix:

In [None]:
threshold = 0.2 # 0.5 generates only zeros
y_pred_class = (pred_boost > threshold).astype(int)

conf_matrix = confusion_matrix(ytest, y_pred_class)
print("Confusion Matrix:")
conf_matrix
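
Since the choice of cutoff matters so much with rare positives, a quick way to see the trade-off is to tabulate the confusion matrix entries over several thresholds. This is a minimal sketch on synthetic data with a rare positive class (an assumption, not the Caravan data); the thresholds and variable names are illustrative only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic data with a rare positive class (assumption: this is not the Caravan data)
X_demo, y_demo = make_classification(n_samples=2000, n_features=20, n_informative=5,
                                     weights=[0.94, 0.06], random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25,
                                          stratify=y_demo, random_state=1)

gb_demo = GradientBoostingClassifier(n_estimators=50, random_state=1).fit(X_tr, y_tr)
prob_demo = gb_demo.predict_proba(X_te)[:, 1]

rows = []
for t in [0.05, 0.1, 0.2, 0.5]:
    pred_t = (prob_demo > t).astype(int)
    # labels=[0, 1] keeps the matrix 2x2 even if no positives are predicted
    tn, fp, fn, tp = confusion_matrix(y_te, pred_t, labels=[0, 1]).ravel()
    rows.append((t, int(tp), int(fp), int(fn), int(tn)))
    print(f"threshold={t:.2f}  TP={tp}  FP={fp}  FN={fn}  TN={tn}")
```

Raising the threshold can only reduce the number of flagged cases, so true positives and false positives both shrink (or stay flat) as the cutoff increases.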

### AdaBoost

Finally, let's look at AdaBoost:

In [None]:
ada = AdaBoostClassifier(random_state=1)  # Adjust n_estimators as needed
ada.fit(X, y)

With predictions:

In [None]:
pred_ada = ada.predict_proba(Xtest)[:, 1]

And let's generate an ROC curve:

In [None]:
fpr, tpr, threshold = roc_curve(ytest, pred_ada)
roc_auc = auc(fpr, tpr)
plt.title('Receiver Operating Characteristic (AdaBoost)')
plt.plot(fpr, tpr, 'b', label = 'AUC = %0.2f' % roc_auc)
plt.legend(loc = 'lower right')
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()

So again a pretty decent performance with an AUC of .75.

Let's also go with a lower learning rate but more trees:

In [None]:
ada = AdaBoostClassifier(n_estimators=1000,learning_rate=0.04,random_state=1)  # Adjust n_estimators as needed
ada.fit(X, y)

# Predictions
pred_ada = ada.predict_proba(Xtest)[:, 1]  # Probability of positive class

# ROC curve and AUC
fpr, tpr, threshold = roc_curve(ytest, pred_ada)
roc_auc = auc(fpr, tpr)
plt.title('Receiver Operating Characteristic (AdaBoost)')
plt.plot(fpr, tpr, 'b', label = 'AUC = %0.2f' % roc_auc)
plt.legend(loc = 'lower right')
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()

So not really an improvement here.

Of course, we could tune the models by going through the hyper-parameters more systematically, but we observe that the default parameters already produce decent models.
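
For reference, a systematic search of this kind could be sketched with `GridSearchCV`, scoring each parameter combination by cross-validated AUC. The example below runs on synthetic data from `make_classification` (an assumption, not the Caravan data), and the small grid is chosen only to keep the run short, not as a recommendation.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic, imbalanced stand-in data (assumption: this is not the Caravan data)
X_demo, y_demo = make_classification(n_samples=600, n_features=20, n_informative=5,
                                     weights=[0.9, 0.1], random_state=1)

# Illustrative grid; a real search would likely cover more values
param_grid = {
    'n_estimators': [50, 100],
    'learning_rate': [0.01, 0.1],
    'max_depth': [1, 3],
}

search = GridSearchCV(GradientBoostingClassifier(random_state=1),
                      param_grid, scoring='roc_auc', cv=3)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(round(search.best_score_, 3))
```

Replacing `X_demo` and `y_demo` with `X` and `y` would run the same search on the training data, after which `search.best_estimator_` could be evaluated once on the test set.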