# Day 55 - Decision Tree Classifier

## Introduction

In this notebook, I focus on the **Decision Tree Classifier**, a supervised learning algorithm that splits data into branches based on feature values to make predictions.  

I begin with a short theory on how decision trees work, covering key concepts such as **entropy, information gain, Gini impurity, and pruning**. 

Then I implement a Decision Tree classifier on a dataset, training and evaluating models:  
- With and without scaling (to show that scaling is not necessary for decision trees)  
- With different tree depths (`max_depth=3` and `max_depth=10`) to compare performance and generalization  

By the end of this notebook, it becomes clear why decision trees are scale-invariant, how tree depth impacts overfitting vs generalization, and how pruning parameters help control model complexity.

---

## 1. Decision Tree Classifier 

A **Decision Tree** is a supervised machine learning algorithm that is both powerful and intuitive. It models decisions by splitting data into subsets based on the values of its features, creating a flowchart-like structure. The goal is to build a tree that can predict a target variable by asking a series of questions.

---

### 1.1 Entropy

**Entropy** is a measure of the impurity or randomness of a dataset. In the context of a decision tree, a node has high entropy if its data points are a mix of different classes (e.g., half 'Up' and half 'Down'). A node has zero entropy if all its data points belong to the same class, meaning it is perfectly pure.

The formula for Entropy is:

$$E(S) = -\sum_{i=1}^{c} p_i \log_2(p_i)$$

Where:

  * $S$ is a set of training examples.
  * $c$ is the number of classes.
  * $p_i$ is the proportion of examples belonging to class $i$.
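
To make the formula concrete, here is a minimal sketch of entropy in plain Python. The function name `entropy` and the two example nodes are my own illustration, not part of scikit-learn.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy, in bits, of a sequence of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    # p * log2(1/p) equals -p * log2(p); classes that never occur contribute nothing
    return sum((count / n) * log2(n / count) for count in Counter(labels).values())

mixed_node = ["Up", "Down", "Up", "Down"]  # half 'Up', half 'Down'
pure_node = ["Up", "Up", "Up", "Up"]       # a single class

print(entropy(mixed_node))  # 1.0
print(entropy(pure_node))   # 0.0
```

For two classes, 1 bit is the maximum possible entropy, reached at a 50/50 mix.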

### 1.2 Gini Index (Gini Impurity)

The **Gini Index** is an alternative to Entropy for measuring impurity. A Gini Index of 0 means the node is perfectly pure (all data points are of the same class). The maximum is reached when the classes are evenly mixed and equals $1 - 1/c$ for $c$ classes: 0.5 for two classes, approaching 1 only as the number of classes grows. The algorithm seeks the split that minimizes the weighted Gini Impurity of the child nodes.

The formula for the Gini Index is:

$$Gini(S) = 1 - \sum_{i=1}^{c} p_i^2$$

Both Entropy and Gini Index measure node impurity and usually lead to similar trees. Gini is slightly cheaper to compute because it avoids the logarithm, and it is the default `criterion` in scikit-learn's `DecisionTreeClassifier`.
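
As a quick check of the formula, here is a small sketch of Gini impurity in plain Python. The helper name `gini` and the label lists are illustrative assumptions.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

print(gini([0, 0, 0, 0]))  # 0.0  (pure node)
print(gini([0, 0, 1, 1]))  # 0.5  (two classes, evenly mixed)
print(gini([0, 1, 2, 3]))  # 0.75 (four classes, evenly mixed)
```

With $c$ equally frequent classes the value is $1 - 1/c$, so a two-class problem like the one in this notebook never exceeds 0.5.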

### 1.3 Information Gain (IG)

**Information Gain (IG)** is the main criterion used by the decision tree algorithm to select the best feature for a split. It measures the reduction in entropy (or impurity) achieved by splitting the data on a particular feature. The algorithm always chooses the feature with the **highest Information Gain** to make the split, as this results in the purest possible child nodes.

The formula for Information Gain is:

$$IG(S, A) = E(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|} E(S_v)$$

Where:

  * $S$ is the dataset before the split.
  * $A$ is the feature being evaluated for the split.
  * $Values(A)$ is the set of all unique values in feature $A$.
  * $S_v$ is the subset of $S$ where feature $A$ has the value $v$.
  * $|S_v|$ and $|S|$ are the number of elements in the respective sets.
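
The formula can be sketched directly in Python. The toy labels and the two categorical features below (`age_group`, `gender`) are hypothetical and chosen so that one feature separates the classes perfectly and the other not at all.

```python
from collections import Counter, defaultdict
from math import log2

def entropy(labels):
    """Shannon entropy, in bits, of a sequence of class labels."""
    n = len(labels)
    return sum((c / n) * log2(n / c) for c in Counter(labels).values())

def information_gain(feature_values, labels):
    """IG(S, A): parent entropy minus the size-weighted entropy of each subset S_v."""
    subsets = defaultdict(list)
    for value, label in zip(feature_values, labels):
        subsets[value].append(label)
    n = len(labels)
    weighted_child_entropy = sum(len(s) / n * entropy(s) for s in subsets.values())
    return entropy(labels) - weighted_child_entropy

# Hypothetical toy data
labels = ["No", "No", "Yes", "Yes"]
age_group = ["young", "young", "old", "old"]   # splits the classes perfectly
gender = ["Male", "Female", "Male", "Female"]  # tells us nothing about the label

print(information_gain(age_group, labels))  # 1.0
print(information_gain(gender, labels))     # 0.0
```

A tree built on this data would split on `age_group` first, since it removes all of the parent's entropy.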

### 1.4 Pruning

**Pruning** is a technique used to combat **overfitting** in a decision tree. A tree that is too deep can become overly complex, memorizing the training data instead of learning general patterns. Pruning simplifies the tree by removing branches that are not essential.

There are two main types of pruning:

  * **Pre-pruning**: Stopping the tree from growing before it's fully developed by setting hyperparameters like `max_depth` or `min_samples_leaf`.
  * **Post-pruning**: Growing a full tree and then cutting back branches that do not contribute significantly to the model's performance.

**Common pre-pruning parameters in scikit-learn:**
- `max_depth`: maximum number of levels in the tree  
- `min_samples_split`: minimum samples required to split a node  
- `min_samples_leaf`: minimum samples required at a leaf node  
- `max_leaf_nodes`: maximum number of leaves  

These parameters limit tree complexity, which can improve generalization when an unconstrained tree overfits.
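
Here is a rough sketch of both pruning styles in scikit-learn. It assumes synthetic data from `make_classification` as a stand-in for a real dataset. Post-pruning uses `ccp_alpha` (minimal cost-complexity pruning), a scikit-learn parameter that is not listed above, and the choice of the middle alpha on the pruning path is arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic two-feature data (assumption: stands in for a real dataset)
X, y = make_classification(n_samples=400, n_features=2, n_informative=2,
                           n_redundant=0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Unconstrained tree: grows until every leaf is pure
full_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Pre-pruning: stop growth early with max_depth and min_samples_leaf
shallow_tree = DecisionTreeClassifier(
    max_depth=3, min_samples_leaf=5, random_state=0).fit(X_train, y_train)

# Post-pruning: compute the cost-complexity path, then refit with one alpha
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(
    X_train, y_train)
alpha = max(float(path.ccp_alphas[len(path.ccp_alphas) // 2]), 0.0)
pruned_tree = DecisionTreeClassifier(
    ccp_alpha=alpha, random_state=0).fit(X_train, y_train)

for name, tree in [("full", full_tree), ("pre-pruned", shallow_tree),
                   ("post-pruned", pruned_tree)]:
    print(name, "leaves:", tree.get_n_leaves(),
          "test accuracy:", tree.score(X_test, y_test))
```

In practice the value of `ccp_alpha` would be chosen by cross-validation rather than picked from the middle of the path.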

---

## 2. Steps to Build a Decision Tree

The process of building a decision tree from a dataset based on Information Gain follows these steps:

1.  **Understand the Dataset**: Analyze the features (independent variables) and the target (dependent variable).
2.  **Calculate Total Entropy**: Calculate the entropy of the target variable to understand the overall impurity of the dataset.
3.  **Calculate Information Gain**: For each feature, calculate its Information Gain.
4.  **Select the Root Node**: Choose the feature with the highest Information Gain to be the root node of the tree.
5.  **Repeat**: Repeat the process for each branch until all leaf nodes are pure (entropy is 0) or a stopping criterion is met (e.g., `max_depth`).
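
The steps above can be sketched as a small recursive function in plain Python. This is an ID3-style illustration on hypothetical categorical data, not what scikit-learn runs internally: scikit-learn builds binary trees with numeric thresholds (CART) and uses Gini impurity by default.

```python
from collections import Counter, defaultdict
from math import log2

def entropy(labels):
    n = len(labels)
    return sum((c / n) * log2(n / c) for c in Counter(labels).values())

def information_gain(rows, labels, feature):
    groups = defaultdict(list)
    for row, label in zip(rows, labels):
        groups[row[feature]].append(label)
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

def build_tree(rows, labels, features, depth=0, max_depth=3):
    # Stop when the node is pure, no features remain, or the depth limit is hit
    if len(set(labels)) == 1 or not features or depth == max_depth:
        return Counter(labels).most_common(1)[0][0]  # majority class as the leaf
    # Pick the feature with the highest information gain
    best = max(features, key=lambda f: information_gain(rows, labels, f))
    remaining = [f for f in features if f != best]
    # Partition the rows by the chosen feature's values and recurse
    branches = defaultdict(lambda: ([], []))
    for row, label in zip(rows, labels):
        branches[row[best]][0].append(row)
        branches[row[best]][1].append(label)
    return {best: {value: build_tree(r, l, remaining, depth + 1, max_depth)
                   for value, (r, l) in branches.items()}}

# Hypothetical toy data with discretized features
rows = [
    {"age": "young", "salary": "low"},
    {"age": "young", "salary": "high"},
    {"age": "old", "salary": "low"},
    {"age": "old", "salary": "high"},
    {"age": "old", "salary": "high"},
    {"age": "young", "salary": "high"},
]
labels = ["No", "No", "No", "Yes", "Yes", "No"]

tree = build_tree(rows, labels, ["age", "salary"])
print(tree)
```

On this data `age` has the higher information gain, so it becomes the root, and `salary` is only needed to separate the `old` branch.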

---


## Import Libraries

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, confusion_matrix
import warnings
warnings.filterwarnings('ignore')

## Load the dataset

In [2]:
dataset = pd.read_csv(r"C:\Users\Arman\Downloads\dataset\logit classification.csv")

In [3]:
dataset

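| | User ID | Gender | Age | EstimatedSalary | Purchased |
|---|---|---|---|---|---|
| 0 | 15624510 | Male | 19 | 19000 | 0 |
| 1 | 15810944 | Male | 35 | 20000 | 0 |
| 2 | 15668575 | Female | 26 | 43000 | 0 |
| 3 | 15603246 | Female | 27 | 57000 | 0 |
| 4 | 15804002 | Male | 19 | 76000 | 0 |
| ... | ... | ... | ... | ... | ... |
| 395 | 15691863 | Female | 46 | 41000 | 1 |
| 396 | 15706071 | Male | 51 | 23000 | 1 |
| 397 | 15654296 | Female | 50 | 20000 | 1 |
| 398 | 15755018 | Male | 36 | 33000 | 0 |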


## Feature Selection
## Split into features (X) and target (y)
- X: Features (Age, EstimatedSalary)
- y: Target (Purchased)

In [4]:
X = dataset[["Age", "EstimatedSalary"]].values
y = dataset["Purchased"].values

## Splitting the dataset into the Training set and Test set

In [5]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

## Feature Scaling
## Apply StandardScaler

In [6]:
sc = StandardScaler()
# Fit the scaler on the training set only, then reuse its statistics on the test set
X_train_sc = sc.fit_transform(X_train)
X_test_sc = sc.transform(X_test)

## Training and Evaluating Decision Tree Model

## With max_depth (3) Without Scaling

In [7]:
classifier1 = DecisionTreeClassifier(max_depth=3)
classifier1.fit(X_train, y_train)

y_pred1 = classifier1.predict(X_test)

print("Decision Tree (max_depth=3) without Scaling")
print("Accuracy:", accuracy_score(y_test, y_pred1))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred1))

Decision Tree (max_depth=3) without Scaling
Accuracy: 0.94
Confusion Matrix:
 [[64  4]
 [ 2 30]]
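
The reported accuracy can be recomputed by hand from the confusion matrix. The sketch below copies the matrix printed above and assumes scikit-learn's convention that rows are true classes and columns are predicted classes.

```python
# Confusion matrix copied from the output above: rows = true class, columns = predicted class
cm = [[64, 4],
      [2, 30]]

true_negatives, false_positives = cm[0]
false_negatives, true_positives = cm[1]

total = sum(sum(row) for row in cm)        # number of test samples
correct = true_negatives + true_positives  # the diagonal of the matrix

accuracy = correct / total
print(total, correct, accuracy)  # 100 94 0.94
```

So the model misclassifies 6 of the 100 test samples: 4 false positives and 2 false negatives.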


## With max_depth (10) Without Scaling

In [8]:
classifier2 = DecisionTreeClassifier(max_depth=10)
classifier2.fit(X_train, y_train)

y_pred2 = classifier2.predict(X_test)

print("Decision Tree (max_depth=10) without Scaling")
print("Accuracy:", accuracy_score(y_test, y_pred2))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred2))

Decision Tree (max_depth=10) without Scaling
Accuracy: 0.92
Confusion Matrix:
 [[64  4]
 [ 4 28]]


## With max_depth (3) With Scaling

In [9]:
classifier3 = DecisionTreeClassifier(max_depth=3)
classifier3.fit(X_train_sc, y_train)

y_pred3 = classifier3.predict(X_test_sc)

print("Decision Tree (max_depth=3) with Scaling")
print("Accuracy:", accuracy_score(y_test, y_pred3))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred3))

Decision Tree (max_depth=3) with Scaling
Accuracy: 0.94
Confusion Matrix:
 [[64  4]
 [ 2 30]]


## With max_depth (10) With Scaling

In [10]:
classifier4 = DecisionTreeClassifier(max_depth=10)
classifier4.fit(X_train_sc, y_train)

y_pred4 = classifier4.predict(X_test_sc)

print("Decision Tree (max_depth=10) with Scaling")
print("Accuracy:", accuracy_score(y_test, y_pred4))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred4))

Decision Tree (max_depth=10) with Scaling
Accuracy: 0.93
Confusion Matrix:
 [[64  4]
 [ 3 29]]


---
## Interpretation

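From the results, we observe that the **Decision Tree Classifier does not require feature scaling**.  
- At depth 3, the accuracy and confusion matrix are identical with and without scaling (0.94). At depth 10, the two runs differ by a single test sample (0.92 unscaled vs 0.93 scaled). The classifiers were created without a fixed `random_state`, so this gap most likely comes from random tie-breaking between equally good splits rather than from scaling.  
- This is because decision trees split data on feature thresholds (e.g., Age > 30). Standardization is a monotonic transformation: it changes the numeric value of each threshold but preserves the ordering of the samples, so the same partitions of the data remain available.  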
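
One way to check the scale-invariance claim without the tie-breaking noise is to fix `random_state` and compare predictions directly. The sketch below uses synthetic data with a made-up purchase rule and 10% label noise (all assumptions, since the original CSV is not bundled here). Under these assumptions, the two trees should agree on every test prediction.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for Age / EstimatedSalary (hypothetical rule plus label noise)
rng = np.random.default_rng(0)
n = 400
age = rng.uniform(18, 60, size=n)
salary = rng.uniform(15000, 150000, size=n)
purchased = ((age > 42) | (salary > 90000)).astype(int)
flip = rng.random(n) < 0.1
purchased[flip] = 1 - purchased[flip]
X = np.column_stack([age, salary])

X_train, X_test, y_train, y_test = train_test_split(
    X, purchased, test_size=0.25, random_state=0)

sc = StandardScaler()
X_train_sc = sc.fit_transform(X_train)
X_test_sc = sc.transform(X_test)

# Same depth and same random_state, so only the feature scale differs
raw_tree = DecisionTreeClassifier(max_depth=10, random_state=0).fit(X_train, y_train)
scaled_tree = DecisionTreeClassifier(max_depth=10, random_state=0).fit(X_train_sc, y_train)

same_predictions = np.array_equal(raw_tree.predict(X_test),
                                  scaled_tree.predict(X_test_sc))
print("Identical test predictions:", same_predictions)
```

The learned thresholds differ numerically between the two trees, but each one separates the same training samples.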

When comparing different tree depths:  
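- **Max Depth = 3:** Achieved the best test accuracy (0.94), with 4 false positives and 2 false negatives out of 100 test samples.  
- **Max Depth = 10:** Test accuracy was slightly lower (0.92–0.93). This is consistent with deeper trees **overfitting** the training data, although the difference is only one or two test samples and training accuracy was not measured, so the evidence is suggestive rather than conclusive.  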
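
Overfitting is easier to diagnose when training and test accuracy are compared side by side. The sketch below does this on synthetic data with 10% label noise (an assumption, not the notebook's dataset), so the exact numbers will differ from the results above.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (hypothetical purchase rule plus label noise)
rng = np.random.default_rng(0)
n = 400
age = rng.uniform(18, 60, size=n)
salary = rng.uniform(15000, 150000, size=n)
purchased = ((age > 42) | (salary > 90000)).astype(int)
flip = rng.random(n) < 0.1
purchased[flip] = 1 - purchased[flip]
X = np.column_stack([age, salary])

X_train, X_test, y_train, y_test = train_test_split(
    X, purchased, test_size=0.25, random_state=0)

scores = {}
for depth in (3, 10):
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_train, y_train)
    # score() returns mean accuracy on the given data
    scores[depth] = (tree.score(X_train, y_train), tree.score(X_test, y_test))
    print(f"max_depth={depth}: train={scores[depth][0]:.3f}, test={scores[depth][1]:.3f}")
```

A large gap between training and test accuracy at depth 10, with a smaller gap at depth 3, would be the usual signature of overfitting.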

**Conclusion:** Decision Trees are scale-invariant, and controlling the tree depth is more important than scaling for good performance.

---

## Summary

In this notebook, I studied the **Decision Tree Classifier**, beginning with the theoretical foundation of entropy, information gain, Gini impurity, and pruning. I then implemented and evaluated decision trees on the dataset under different settings:  
- With and without scaling  
- With maximum depths of 3 and 10  

The results showed that **feature scaling does not meaningfully affect decision tree performance**: accuracy was identical at depth 3 and differed by a single test sample at depth 10, most likely because the classifiers were trained without a fixed `random_state`. Tree depth had a small effect on this dataset. A shallow tree (max_depth=3) gave slightly better test accuracy, while a deeper tree (max_depth=10) scored one or two points lower, which is consistent with overfitting.

---

## Key Takeaways

- **Decision Trees do not require feature scaling**, since each split compares a feature to a threshold and depends only on the ordering of values, not on distances.  
- **Entropy and Information Gain** (or Gini) are used to decide the best splits in building the tree.  
- **Tree depth matters**:  
  - On this dataset, the shallow tree (depth=3) generalized slightly better.  
  - Deeper trees (e.g., depth=10) may overfit and reduce test accuracy.  
- **Pruning parameters** like `max_depth`, `min_samples_split`, and `min_samples_leaf` help prevent overfitting.  
- Decision Trees are **easy to interpret and visualize**, making them highly useful for understanding model decisions.  
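
To illustrate the interpretability point, scikit-learn's `export_text` prints a fitted tree as nested if/else rules. The sketch below fits a small tree on synthetic data. The feature names `Age` and `EstimatedSalary` are borrowed from this notebook only as labels, and the thresholds shown will not correspond to the real dataset.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic two-feature data (assumption: not the notebook's CSV)
X, y = make_classification(n_samples=200, n_features=2, n_informative=2,
                           n_redundant=0, random_state=0)

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# Render the learned splits as indented text rules
rules = export_text(tree, feature_names=["Age", "EstimatedSalary"])
print(rules)
```

Each indented line is one threshold test, and each `class:` line is a leaf prediction, so a path from the top to a leaf reads as a single decision rule.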

