This notebook shows how a decision tree classifier works.

# 1. Set up

# 2. Import necessary libraries

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score, roc_curve

import category_encoders as ce

In [3]:
import warnings

warnings.simplefilter("ignore")

# 3. Define global variables

In [4]:
INPUT_PATH = "../../data/credit_card_data/data_modified_binary_classification.csv"

# 4. Functions

# 5. Code

We are going to use credit card application data. The data is prepared in the notebook *00_transform_data_binary_classification.ipynb*, where all the relevant information about it is stored.

We'll proceed the same way as with the logistic regression so that these notebooks read consistently.

## 5.1. Load and transform data

First, we load the data, which contains both the features and the target variable, using the pandas library:

In [12]:
data = pd.read_csv(INPUT_PATH, sep=";")
data.head()

| Unnamed: 0 | car_owner | propert_owner | children | type_income | education | marital_status | housing_type | employed_days | mobile_phone | work_phone | phone | email_id | family_members | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Y | Y | 0 | Pensioner | Higher education | Married | House / apartment | 365243 | 1 | 0 | 0 | 0 | 2 | 1 |
| 1 | Y | N | 0 | Commercial associate | Higher education | Married | House / apartment | -586 | 1 | 1 | 1 | 0 | 2 | 1 |
| 2 | Y | N | 0 | Commercial associate | Higher education | Married | House / apartment | -586 | 1 | 1 | 1 | 0 | 2 | 1 |
| 3 | Y | N | 0 | Commercial associate | Higher education | Married | House / apartment | -586 | 1 | 1 | 1 | 0 | 2 | 1 |
| 4 | Y | N | 0 | Pensioner | Higher education | Married | House / apartment | -586 | 1 | 1 | 1 | 0 | 2 | 1 |


### Train / test split

Before applying any transformations to the data, we need to split it into train and test sets. **Remember that all transformations must be fitted on the training dataset only.**

In [13]:
X = data.drop("target", axis=1)
y = data["target"]

In [14]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

We need to transform all categorical columns to numeric. First, let's detect them:

In [15]:
categorical_cols = data.select_dtypes(include="object").columns

We are going to use a target encoder to transform the categorical data:

In [16]:
target_encoder = ce.TargetEncoder(cols = categorical_cols)

In [17]:
X_train_processed = target_encoder.fit_transform(X_train, y_train)
X_test_processed = target_encoder.transform(X_test)
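
To make the idea behind target encoding more concrete, here is a minimal sketch in plain pandas on a tiny made-up frame. It replaces each category with the mean of the target observed in the training data, and it learns that mapping from the training rows only. This is a simplification: the `category_encoders` implementation also applies smoothing, so its numbers will generally differ. All data and variable names below are illustrative.

```python
import pandas as pd

# Toy training and test data (made up for illustration)
train = pd.DataFrame({
    "type_income": ["Pensioner", "Pensioner", "Working", "Working", "Working"],
    "target": [1, 0, 0, 0, 1],
})
test = pd.DataFrame({"type_income": ["Working", "Pensioner", "Student"]})

# Learn the mapping on the training data only
global_mean = train["target"].mean()
category_means = train.groupby("type_income")["target"].mean()

# Apply the mapping to train and test;
# categories never seen in training fall back to the global mean
train_encoded = train["type_income"].map(category_means)
test_encoded = test["type_income"].map(category_means).fillna(global_mean)

print(test_encoded.tolist())
```

The "Student" category does not appear in the training rows, which is why a fallback value is needed when transforming the test set.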

This time, unlike with the logistic regression, we do not need to scale the variables, for the following reasons:

- Tree-based algorithms do not rely on distance metrics between data points.
- Tree-based algorithms do not learn coefficients whose magnitudes are compared across variables, so the features do not need to share a common scale.
- Tree-based algorithms make a binary decision at each node based on a single feature and a threshold. A split is chosen solely by how well it separates the classes, which depends on the ordering of the feature values and not on their scale.
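
The scale-invariance argument can be checked with a small experiment. The sketch below uses synthetic data from `make_classification` (not the credit card dataset) and fits two trees, one on the raw features and one on rescaled features. The variable names and scaling factors are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (not the credit card dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=5, random_state=0)
X_fit, X_new = X_demo[:200], X_demo[200:]
y_fit = y_demo[:200]

# Rescale every column by a different positive factor.
# Powers of two are used so the rescaling is exact in floating point.
factors = np.array([1.0, 2.0, 0.5, 1024.0, 64.0])

tree_raw = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_fit, y_fit)
tree_scaled = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_fit * factors, y_fit)

# Both trees should make the same decisions on unseen rows
same_predictions = np.array_equal(
    tree_raw.predict(X_new), tree_scaled.predict(X_new * factors)
)
print(same_predictions)
```

Only the split thresholds change with the scale; the partition of the rows, and therefore the predictions, is expected to stay the same.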

## 5.2. Training

Let's initialize the model and train it:

In [19]:
model = DecisionTreeClassifier(random_state=42)

model.fit(X_train_processed, y_train)

DecisionTreeClassifier(random_state=42)
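
As a rough illustration of what happens during training, the sketch below scores two candidate splits with the Gini impurity, which is the default criterion of `DecisionTreeClassifier`. It is a hand-written simplification and not the library's implementation; the helper functions and the toy values are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels: 1 - sum(p_k ** 2)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def weighted_gini(values, labels, threshold):
    """Weighted impurity of the two children produced by `value <= threshold`."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    n = len(labels)
    return len(left) / n * gini(left) + len(right) / n * gini(right)


# Hypothetical feature values and labels
family_members = [1, 2, 2, 3, 4, 5]
approved = [0, 0, 0, 1, 1, 1]

parent_impurity = gini(approved)
split_at_2_5 = weighted_gini(family_members, approved, 2.5)
split_at_1_5 = weighted_gini(family_members, approved, 1.5)

print(parent_impurity, split_at_2_5, split_at_1_5)
```

The threshold 2.5 separates the two classes perfectly, so it yields the lowest weighted impurity and would be preferred over 1.5.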

Once the model is trained, let's compute predictions on the X_test_processed data. Note that predict uses the default 0.5 threshold. We also store the predicted probabilities so that we can later apply the 0.1 threshold used in the logistic regression notebook:

In [21]:
y_pred = model.predict(X_test_processed)
y_pred_proba = model.predict_proba(X_test_processed)[:, 1]

## 5.3. Metrics calculation

First, let's calculate some metrics and show them:

In [None]:
accuracy = accuracy_score(y_test, y_pred)
roc_auc = roc_auc_score(y_test, y_pred_proba)

In [None]:
print(f"The accuracy value is: {accuracy}")
print(f"The roc_auc value is: {roc_auc}")

Now we will plot both the confusion matrix and the ROC curve:

**Confusion matrix:**

In [None]:
conf_matrix = confusion_matrix(y_test, y_pred, normalize="true")

In [None]:
plt.figure(figsize=(8, 6))
sns.heatmap(conf_matrix, annot=True, cmap='Blues', fmt='g', cbar=False)
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix')
plt.show()

We can observe a very poor result in the confusion matrix. This is because of the default 0.5 threshold used when calculating y_pred. We need to modify this threshold. We use 0.1, the same value used for the logistic regression, to account for the class imbalance:

In [None]:
y_pred_new = np.where(y_pred_proba >= 0.1, 1, 0)

In [None]:
accuracy = accuracy_score(y_test, y_pred_new)
print(f"The accuracy value is: {accuracy}")

In [None]:
conf_matrix = confusion_matrix(y_test, y_pred_new, normalize="true")

In [None]:
plt.figure(figsize=(8, 6))
sns.heatmap(conf_matrix, annot=True, cmap='Blues', fmt='g', cbar=False)
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix')
plt.show()

We can observe a better performance of the model now. However, remember that this is only meant to help understand the decision tree. If we wanted to find the best model, we would have to be more careful and more precise when selecting the threshold.
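
One possible way to select a threshold more systematically is to maximize Youden's J statistic (TPR minus FPR) over the candidate thresholds returned by `roc_curve`. The sketch below uses made-up labels and scores and is only one option among many. In practice the threshold should be chosen on a validation set rather than on the test set.

```python
import numpy as np
from sklearn.metrics import roc_curve

# Hypothetical labels and predicted probabilities
y_true_demo = np.array([0, 0, 0, 0, 1, 0, 1, 1, 0, 1])
y_score_demo = np.array([0.05, 0.10, 0.20, 0.30, 0.35, 0.40, 0.60, 0.70, 0.15, 0.90])

fpr_demo, tpr_demo, thresholds_demo = roc_curve(y_true_demo, y_score_demo)

# Youden's J statistic: TPR - FPR; keep the threshold that maximizes it
j_scores = tpr_demo - fpr_demo
best_threshold = thresholds_demo[np.argmax(j_scores)]

# Apply the selected threshold to obtain class predictions
y_pred_demo = (y_score_demo >= best_threshold).astype(int)

print(best_threshold, y_pred_demo)
```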

**ROC Curve:**

In [None]:
fpr, tpr, _ = roc_curve(y_test, y_pred_proba)

plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, color='darkorange', lw=2, label=f'ROC Curve (AUC = {round(roc_auc, 2)})')
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--', label='Random')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.legend(loc='lower right')
plt.show()

An AUC of 0.61 suggests that the model has some ability to discriminate between positive and negative instances. It is not a very sharp model, but its discriminatory power is slightly better than random guessing (AUC = 0.5).
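
To give the AUC a more tangible meaning: it equals the probability that a randomly chosen positive instance receives a higher score than a randomly chosen negative one, with ties counted as one half. The sketch below checks this on made-up labels and scores by comparing a brute-force pairwise count with `roc_auc_score`.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical labels and scores
y_true_demo = np.array([0, 0, 1, 0, 1, 1])
y_score_demo = np.array([0.1, 0.4, 0.35, 0.8, 0.7, 0.9])

pos = y_score_demo[y_true_demo == 1]
neg = y_score_demo[y_true_demo == 0]

# Compare every positive score with every negative score; ties count as half
diff = pos[:, None] - neg[None, :]
pairwise_auc = ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size

sklearn_auc = roc_auc_score(y_true_demo, y_score_demo)

print(pairwise_auc, sklearn_auc)
```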

## 5.4. Interpretation

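Decision trees do not have coefficients, so let's look at the feature importances instead:

In [None]:
# DecisionTreeClassifier exposes feature_importances_ (it has no coef_ or intercept_)
importance_df = pd.DataFrame({'Feature': X_train_processed.columns,
                              'Importance': model.feature_importances_})
importance_df.sort_values('Importance', ascending=False)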

The most influential variables are type_income and family_members. This makes sense considering that our dataset describes whether a credit card application was approved or not.

The results show a model that is neither very good nor very bad. Hyperparameter tuning (for example, limiting the depth of the tree) or a more powerful tree-based algorithm, such as an ensemble of trees, might give us the boost needed to get better results.
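
As a hedged sketch of what hyperparameter tuning could look like, the example below runs a small grid search over `max_depth` and `min_samples_leaf` with cross-validation. It uses synthetic, imbalanced data as a stand-in for the credit card dataset, and the grid values are arbitrary assumptions rather than recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in for the credit card data
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=10, weights=[0.9, 0.1], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=42
)

# Illustrative grid: shallower trees and larger leaves reduce overfitting
param_grid = {"max_depth": [2, 4, 6, None], "min_samples_leaf": [1, 10, 50]}

search = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid,
    scoring="roc_auc",
    cv=5,
)
search.fit(X_tr, y_tr)

best_params = search.best_params_
test_auc = search.score(X_te, y_te)  # ROC AUC of the refitted best model

print(best_params, test_auc)
```

On the real data, the same search could be run on X_train_processed and y_train, keeping the test set for the final evaluation only.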