# Initial Setup

We import the functions needed to train a machine learning model and download the dataset we want to use.

Finally, we define a helper function that we use to plot a decision tree.

In [None]:
#Importing Functions
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.model_selection import train_test_split

#Downloading the dataset
caravan = pd.read_csv("https://raw.githubusercontent.com/casbdai/datasets/main/caravan_data.csv", sep=";")

#Defining a function for plotting decision trees
def plot_tree_classification(treemodel, X):
    from sklearn import tree
    import matplotlib.pyplot as plt
    fig = plt.figure(figsize=(60,20))
    # Pass the column names as a plain list of strings
    _ = tree.plot_tree(treemodel, filled=True, class_names=['0','1'], feature_names=list(X.columns), proportion=True, precision=2, impurity=False)

# Inspect the Data Set

In [None]:
caravan

# Lab Session 1: Training a first Decision Tree

The dataset "caravan" consists of the label "CARAVAN" and 85 features. To train a machine learning model, we need to separate them into two objects: the label is stored as "y" and the features are stored as "X".


In [None]:
X = caravan.drop("CARAVAN", axis=1) # Copy of the "caravan" data set without the "CARAVAN" column
y = caravan["CARAVAN"] # Save the "CARAVAN" column as "y"
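The separation into features and label can be tried out on a tiny table first. The following is a minimal sketch on a hypothetical toy DataFrame; only the column name "CARAVAN" mirrors the real dataset, the other columns and all values are invented for illustration.

```python
import pandas as pd

# Hypothetical toy data: column names other than "CARAVAN" and all values are made up.
toy = pd.DataFrame({
    "AGE_GROUP": [1, 2, 3, 2],
    "INCOME_GROUP": [4, 2, 5, 3],
    "CARAVAN": [0, 0, 1, 0],
})

X_toy = toy.drop("CARAVAN", axis=1)  # features: every column except the label
y_toy = toy["CARAVAN"]               # label: a single column (pandas Series)

print(X_toy.shape)               # rows x feature columns
print(y_toy.shape)               # one value per row
print("CARAVAN" in toy.columns)  # drop() returns a copy, the original keeps the label
```

Because "drop" returns a new DataFrame, the original table is left unchanged.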

In [None]:
y

We first train a decision tree with a "max_depth" of 2. The tree stops growing once every path from the root to a leaf contains at most two splits.

In [None]:
tree = DecisionTreeClassifier(max_depth=2).fit(X,y)

We plot the trained decision tree.

In [None]:
plot_tree_classification(tree, X)

We create predictions: "predict_proba" returns the predicted class probabilities. The first column gives the probability that a customer **will not** purchase caravan insurance. The second column gives the probability that a customer **will** purchase caravan insurance. The two values in each row therefore add up to 1.

In [None]:
tree.predict_proba(X)
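The column order and the row sums can be checked directly. This is a small sketch that assumes synthetic data from "make_classification" as a stand-in for the caravan dataset; the names "X_demo", "y_demo" and "demo_tree" are chosen for illustration only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption): 200 samples, 5 features, two classes.
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=1)
demo_tree = DecisionTreeClassifier(max_depth=2, random_state=1).fit(X_demo, y_demo)

proba = demo_tree.predict_proba(X_demo)

print(demo_tree.classes_)  # the columns of proba follow this class order
print(proba[:3])           # probabilities for the first three samples
print(np.allclose(proba.sum(axis=1), 1.0))  # checks that every row sums to 1
```

The attribute "classes_" tells us which class each column of the probability matrix belongs to.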

"predict" provides us with the assignment to the classes "purchase" and "no purchase". 1 means that a customer will purchase a caravan insurance, 0 means the customer will not purchase. All customers with an estimated purchasing probability of >50% are set to the value 1.

In [None]:
tree.predict(X)
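To see how the class assignment relates to the probabilities, we can rebuild the output of "predict" from "predict_proba" by picking the class with the highest probability in each row. The sketch below again assumes synthetic stand-in data and illustrative variable names, not the caravan dataset.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption), not the caravan dataset.
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=1)
demo_tree = DecisionTreeClassifier(max_depth=2, random_state=1).fit(X_demo, y_demo)

proba = demo_tree.predict_proba(X_demo)
pred = demo_tree.predict(X_demo)

# Pick the class with the highest probability in each row.
pred_from_proba = demo_tree.classes_[np.argmax(proba, axis=1)]

print(np.array_equal(pred, pred_from_proba))  # compares both ways of predicting
```

With only two classes, picking the more probable class amounts to assigning 1 whenever the purchasing probability exceeds 50%.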

## Let's play around with Hyperparameters

Hyperparameters to be varied:

- **max_depth:** Maximum depth of the tree, i.e. the maximum number of successive splits between the root and any leaf.
- **min_samples_leaf:** The minimum number of customers (samples) that must be present in a leaf node.

In [None]:
tree = DecisionTreeClassifier(max_depth=2, min_samples_leaf=1).fit(X,y)
plot_tree_classification(tree, X)
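Besides plotting, the effect of the two hyperparameters can be summarized numerically. The sketch below assumes synthetic stand-in data and an arbitrary set of example settings; "get_depth" and "get_n_leaves" report the size of each fitted tree.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption), not the caravan dataset.
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=1)

# Example settings chosen for illustration: (max_depth, min_samples_leaf)
settings = [(2, 1), (4, 1), (4, 50)]

summary = []
for max_depth, min_samples_leaf in settings:
    model = DecisionTreeClassifier(
        max_depth=max_depth,
        min_samples_leaf=min_samples_leaf,
        random_state=1,
    ).fit(X_demo, y_demo)
    # Record the depth and number of leaves the fitted tree actually has.
    summary.append((max_depth, min_samples_leaf, model.get_depth(), model.get_n_leaves()))

for row in summary:
    print("max_depth=%d, min_samples_leaf=%d -> depth=%d, leaves=%d" % row)
```

A larger "max_depth" allows more leaves, while a larger "min_samples_leaf" can prevent splits and keep the tree smaller.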

# Lab Session 2: Determining the Accuracy of our Predictions

To determine whether the predictions are accurate, we need to split our data into training data and test data. We have already created two objects in a previous step: y and X.

Each of them is now split into a training part (70% of the rows) and a test part (30% of the rows).

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)
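The sizes of the resulting parts can be verified on a small example. This is a sketch with hypothetical toy arrays of 100 rows, so that "test_size=0.3" yields a 70/30 split; all names carry a "_toy" suffix to mark them as illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 rows, 2 feature columns, and a 0/1 label.
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.arange(100) % 2

X_train_toy, X_test_toy, y_train_toy, y_test_toy = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=1
)

print(X_train_toy.shape, X_test_toy.shape)  # training rows vs. test rows
print(y_train_toy.shape, y_test_toy.shape)

# The same random_state reproduces exactly the same split.
X_train_again, X_test_again, _, _ = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=1
)
print(np.array_equal(X_test_toy, X_test_again))
```

Fixing "random_state" makes the split reproducible, so results can be compared across runs.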

We train a new decision tree with a max_depth of 6, this time using only the training data.

In [None]:
tree = DecisionTreeClassifier(max_depth=6, random_state=1).fit(X_train, y_train)

We create predictions and store them in the "y_pred" object.

In [None]:
y_pred = tree.predict(X_test)
y_pred

In [None]:
y_test

Determine the accuracy of a model:

In [None]:
accuracy_score(y_test,y_pred)
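Accuracy is the share of predictions that match the true labels. The sketch below uses small hypothetical label arrays (not the caravan data) to show that "accuracy_score" gives the same value as counting the matches by hand.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical true labels and predictions: 4 of the 5 predictions are correct.
y_true_toy = np.array([0, 0, 1, 0, 1])
y_pred_toy = np.array([0, 0, 0, 0, 1])

# Share of positions where prediction and true label agree.
manual_accuracy = (y_true_toy == y_pred_toy).mean()
library_accuracy = accuracy_score(y_true_toy, y_pred_toy)

print(manual_accuracy, library_accuracy)
```

Both values are equal because "accuracy_score" computes exactly this share of correct predictions by default.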