<a href="https://colab.research.google.com/github/danielbauer1979/MSDIA_PredictiveModelingAndMachineLearning/blob/main/GB888_III_7_ClassificationTree.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction to Classification Trees

In this tutorial, we will go through an example of using a classification tree for prediction.

As usual, let's start by loading the relevant libraries.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
import graphviz
import pydot
from io import StringIO

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.tree import DecisionTreeClassifier, export_graphviz
from sklearn.metrics import confusion_matrix, classification_report, roc_curve, auc

The following function creates images of tree models using pydot:

In [None]:
from IPython.display import Image

def print_tree(estimator, features, class_names=None, filled=True):
    # Export the fitted tree to Graphviz DOT format, held in memory
    dot_data = StringIO()
    export_graphviz(estimator, out_file=dot_data, feature_names=features,
                    class_names=class_names, filled=filled)
    # pydot returns a list of graphs, hence the `graph, = print_tree(...)` unpacking later on
    graph = pydot.graph_from_dot_data(dot_data.getvalue())
    return graph

## Case Study: Caravan Insurance Purchases

We look at the `Caravan` insurance data set included in the ISLR textbook. As indicated in Section 4.6.6, it is a dataset that includes 85 predictors that measure demographic characteristics for 5,822 individuals and "Purchase," which indicates whether or not a given individual purchases a caravan insurance policy.

Let's load our dataset:

In [None]:
!git clone https://github.com/danielbauer1979/MSDIA_PredictiveModelingAndMachineLearning.git

In [None]:
Caravan = pd.read_csv('MSDIA_PredictiveModelingAndMachineLearning/GB888_III_7_CaravanData.csv', index_col=0)

### Some Exploration

In [None]:
Caravan.head()

Variables 1-43 represent sociodemographic data, variables 44-85 describe product ownership, and variable 86 (`Purchase`) indicates whether the customer purchased a caravan insurance policy.

Let's consider some aggregate statistics:

In [None]:
Caravan.describe()

And check how many people purchase insurance:

In [None]:
Caravan['Purchase'].value_counts()

So only roughly 6% of all people buy caravan insurance. That is costly for an insurance agent: of every 100 clients she or he visits, only about 6 will purchase a policy. So let's use our knowledge of classification to help out the sales force and try to identify individuals who, based on their characteristics, are more likely to purchase a policy.
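
The 6% figure also sets a benchmark: a model that always predicts "No" is already right about 94% of the time, so accuracy alone says little here. The following is a minimal sketch on a made-up toy column (the name `purchase` and its values are assumptions for illustration, not the Caravan data):

```python
import pandas as pd

# Toy stand-in for the Purchase column (hypothetical values, not the Caravan data)
purchase = pd.Series(['No'] * 94 + ['Yes'] * 6, name='Purchase')

# normalize=True returns class shares instead of raw counts
shares = purchase.value_counts(normalize=True)
print(shares)

# A classifier that always predicts the majority class reaches this accuracy
baseline_accuracy = shares.max()
print(f"Majority-class baseline accuracy: {baseline_accuracy:.2f}")
```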

## Predictive Modeling

Let's split the data into a training and a test set to get going:

In [None]:
Train, Test = train_test_split(Caravan, test_size=0.25, random_state=1)

X_train = Train.drop(['Purchase'], axis=1)
y_train = Train['Purchase']
X_test = Test.drop(['Purchase'], axis=1)
y_test = Test['Purchase']
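
With so few positives, a purely random split can leave the two parts with somewhat different purchase rates. One option, not used in this tutorial, is the `stratify` argument of `train_test_split`. A minimal sketch on a made-up toy data frame (the names `toy`, `toy_train`, and `toy_test` are assumptions for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy frame with 6% positives, standing in for Caravan
rng = np.random.default_rng(1)
toy = pd.DataFrame({
    'x': rng.normal(size=1000),
    'Purchase': np.where(np.arange(1000) < 60, 'Yes', 'No'),
})

# stratify keeps the Yes/No proportions the same in both parts
toy_train, toy_test = train_test_split(
    toy, test_size=0.25, random_state=1, stratify=toy['Purchase']
)

print(toy_train['Purchase'].value_counts(normalize=True))
print(toy_test['Purchase'].value_counts(normalize=True))
```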

### Logistic Regression

Let's start with a vanilla logistic regression model:

In [None]:
logistic_model = LogisticRegression(fit_intercept=True, max_iter=1000)
logistic_model.fit(X_train,y_train)
y_pred_logistic = logistic_model.predict(X_test)

Let's look at the confusion matrix resulting from our predictions (here the predicted probabilities are already coerced to classes):

In [None]:
confusion_matrix(y_test, y_pred_logistic)
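
For a binary logistic regression, `predict` amounts to cutting the predicted probability at 0.5. With a rare positive class, few or no probabilities exceed that value. The sketch below uses synthetic data from `make_classification` (an assumption, not the Caravan data), and the 0.2 cutoff is purely illustrative:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic imbalanced data (roughly 6% positives); a stand-in, not the Caravan data
X, y = make_classification(n_samples=2000, n_features=10, weights=[0.94],
                           random_state=1)
model = LogisticRegression(max_iter=1000).fit(X, y)

# Column 1 of predict_proba is the probability of the positive class (label 1)
p_pos = model.predict_proba(X)[:, 1]

# Thresholding at 0.5 mirrors what predict() does for a binary model
default_pred = (p_pos > 0.5).astype(int)
# A lower, purely illustrative cutoff flags more cases as positive
lower_pred = (p_pos > 0.2).astype(int)

print("Flagged at 0.5:", default_pred.sum())
print("Flagged at 0.2:", lower_pred.sum())
```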

We don't get a single positive case right, so performance is not great. Of course, we could choose a different cutoff. Let's evaluate the AUC, for which we first need the predicted probabilities rather than the predicted classes:

In [None]:
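# predict_proba returns one column per class, ordered as in logistic_model.classes_
y_pred_logistic = logistic_model.predict_proba(X_test)

def Extract(lst):
    # Keep the first column, i.e. the probability of the first class ('No')
    return [item[0] for item in lst]

y_pred_logistic = Extract(y_pred_logistic)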

In [None]:
# 'No' is treated as the positive class here, matching the P('No') scores extracted above
fpr, tpr, threshold = roc_curve((Test['Purchase'] == 'No'), y_pred_logistic)
roc_auc = auc(fpr, tpr)
plt.title('Receiver Operating Characteristic')
plt.plot(fpr, tpr, 'b', label = 'AUC = %0.2f' % roc_auc)
plt.legend(loc = 'lower right')
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()
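
The ROC code above scores the class "No" with its own probability. The AUC is the same if "Yes" is scored with the probability of "Yes" instead, because swapping the class and reversing the score ordering cancel out. A minimal sketch on synthetic data (an assumption, not the Caravan data), using `roc_auc_score` as a shortcut for `roc_curve` followed by `auc`:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Synthetic imbalanced data with string labels, mimicking the 'No'/'Yes' coding
X, y = make_classification(n_samples=2000, n_features=10, weights=[0.94],
                           random_state=1)
y = np.where(y == 1, 'Yes', 'No')
model = LogisticRegression(max_iter=1000).fit(X, y)

# classes_ gives the column order of predict_proba
print(model.classes_)
proba = model.predict_proba(X)

# AUC with 'No' as the positive class and P('No') as the score
auc_no = roc_auc_score(y == 'No', proba[:, 0])
# AUC with 'Yes' as the positive class and P('Yes') as the score
auc_yes = roc_auc_score(y == 'Yes', proba[:, 1])

print(auc_no, auc_yes)
```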

### Classification Tree

Let's try a classification tree. We start with a deliberately small tree by capping the number of leaves with `max_leaf_nodes`. Note that the complexity parameter `cp` familiar from R's `rpart` does not exist under that name in scikit-learn; tree size is controlled through arguments such as `max_leaf_nodes`, `min_samples_split`, and `min_impurity_decrease`. The idea is the same, though: a split only happens if it reduces the heterogeneity in the nodes sufficiently.

In [None]:
# Start with a small tree: at most four leaves, so the plot stays readable
Car_tree_first = DecisionTreeClassifier(max_leaf_nodes=4)
Car_tree_first.fit(X_train, y_train)
# print_tree returns a one-element list of pydot graphs
graph, = print_tree(Car_tree_first, features=X_train.columns)
Image(graph.create_png())

The issue with growing the tree is that there are few positives, leading to substantial "node purity" even after a few splits. We have to adjust the parameters to build a larger tree:

In [None]:
Car_tree = DecisionTreeClassifier(min_samples_split=5,min_impurity_decrease=0.0001)
Car_tree.fit(X_train, y_train)
graph, = print_tree(Car_tree, features=X_train.columns)
Image(graph.create_png())
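
To see why few positives imply high node purity, recall that scikit-learn's default split criterion is the Gini impurity, which for a two-class node with positive share $p$ equals $2p(1-p)$. The short sketch below (the helper `gini` is introduced here for illustration) shows how small that value already is at a 6% share, which leaves little impurity for any split to remove:

```python
def gini(p):
    """Gini impurity of a two-class node with positive share p."""
    return 2 * p * (1 - p)

# Impurity is largest at a 50/50 node and shrinks quickly as positives get rare
for share in (0.5, 0.2, 0.06, 0.01):
    print(f"share of positives = {share:.2f} -> Gini = {gini(share):.4f}")
```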

Let's look at the top features by importance scores (we will come back to these in the next module):

In [None]:
summary_tree = pd.DataFrame({'Features':X_train.columns,'Importance':Car_tree.feature_importances_})
summary_tree.sort_values(by=['Importance'], ascending=False)[0:10]

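The tree has the following number of final nodes (leaves):

In [None]:
# get_n_leaves() counts terminal nodes only; tree_.node_count would also include internal nodes
Car_tree.get_n_leaves()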

Let's look at the number of predicted "purchases" in the test set:

In [None]:
yhat = Car_tree.predict(X_test)
np.sum(yhat == "Yes")

And the confusion matrix is:

In [None]:
conf_matrix = confusion_matrix(y_test, yhat)

# Create a DataFrame for better visualization
conf_matrix_df = pd.DataFrame(conf_matrix,
                             index=['Actual No', 'Actual Yes'],
                             columns=['Predicted No', 'Predicted Yes'])
conf_matrix_df

So we are getting at least a few positives right!
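
For the sales force, the relevant question is how many of the predicted buyers actually buy (precision) and how that compares with the roughly 6% base rate. The sketch below shows the arithmetic on a made-up confusion matrix; the numbers in `cm` are hypothetical and are not the result of the tree above:

```python
import numpy as np

# Made-up confusion matrix for illustration only (not the tutorial's result);
# rows are actual No/Yes, columns are predicted No/Yes
cm = np.array([[1300, 70],
               [75, 11]])
tn, fp, fn, tp = cm.ravel()

# Share of predicted buyers who actually buy
precision = tp / (tp + fp)
# Share of actual buyers the model finds
recall = tp / (tp + fn)
# Share of buyers among everyone, i.e. the hit rate of visiting clients at random
base_rate = (tp + fn) / cm.sum()

print(f"precision = {precision:.3f}, recall = {recall:.3f}, base rate = {base_rate:.3f}")
```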

Let's look at the ROC curve:

In [None]:
# Probability of the first class ('No') for each test observation
y_pred_tree = Car_tree.predict_proba(X_test)
y_pred_tree = Extract(y_pred_tree)
# As for the logistic model, 'No' is treated as the positive class
fpr_tree, tpr_tree, threshold_tree = roc_curve((Test['Purchase'] == 'No'), y_pred_tree)
roc_auc_tree = auc(fpr_tree, tpr_tree)

plt.title('Receiver Operating Characteristic')
plt.plot(fpr_tree, tpr_tree, 'b', label = 'AUC = %0.2f' % roc_auc_tree)
plt.legend(loc = 'lower right')
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()

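Now this looks a bit strange. One likely reason is that a tree assigns the same predicted probability to every observation in a leaf, so there are at most as many distinct scores as leaves and the ROC curve consists of a few straight segments. So overall it's unclear whether the tree improves on the logistic regression.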
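
The coarse shape of a tree's ROC curve can be checked directly: the number of distinct predicted probabilities cannot exceed the number of leaves. A minimal sketch on synthetic data (an assumption, not the Caravan data; `max_leaf_nodes=5` is an arbitrary illustrative choice):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import roc_curve
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced data (roughly 6% positives); a stand-in, not the Caravan data
X, y = make_classification(n_samples=2000, n_features=10, weights=[0.94],
                           random_state=1)
tree = DecisionTreeClassifier(max_leaf_nodes=5, random_state=1).fit(X, y)

# Every observation in a leaf gets that leaf's class share as its score
scores = tree.predict_proba(X)[:, 1]
n_distinct = len(np.unique(scores))

# roc_curve evaluates one threshold per distinct score (plus a starting point)
fpr, tpr, thresholds = roc_curve(y, scores)

print("Leaves:", tree.get_n_leaves())
print("Distinct predicted probabilities:", n_distinct)
print("Points on the ROC curve:", len(fpr))
```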