# **📘 [LDATS2350] - DATA MINING**

## **📊 Python22 - Decision Tree Regressor**

**Prof. Robin Van Oirbeek**  

<br/>

**🧑‍🏫 Guillaume Deside** *(guillaume.deside@uclouvain.be)*  

---

## 🌳 What is a Decision Tree Regressor?
A **Decision Tree Regressor** is a machine learning model that predicts continuous values by recursively splitting the dataset into smaller subsets based on feature conditions.

### 🔹 Key Features:
- Works well for **non-linear relationships**.
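- Handles **both numerical and categorical data** in principle, although scikit-learn's `DecisionTreeRegressor` expects categorical features to be encoded as numbers first.
- Can capture **complex interactions** between features.
- Prone to **overfitting**, so it needs depth limits, minimum leaf sizes, or pruning.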

---

## 🏗️ How Does a Decision Tree Regressor Work?
1️⃣ **Splitting**: The algorithm selects the feature and threshold whose split gives the lowest prediction error in the two resulting child nodes.  
2️⃣ **Recursive Partitioning**: Each child node is split again in the same way until a stopping criterion is met (e.g., maximum depth or minimum samples per leaf).  
3️⃣ **Prediction**: A new sample is routed to a leaf, and the average of the training targets in that leaf is used as the prediction (the median when the MAE criterion is used).  
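
The splitting step can be illustrated with a minimal sketch on a single feature. The function `best_split` and the toy arrays are hypothetical names and values chosen for illustration; this is not scikit-learn's actual implementation, only the same idea of scanning candidate thresholds and keeping the one with the lowest weighted child-node MSE.

```python
import numpy as np

def best_split(x, y):
    """Return (threshold, weighted_mse) of the best binary split on one feature."""
    order = np.argsort(x)
    x_sorted, y_sorted = x[order], y[order]
    best_threshold, best_score = None, np.inf
    for i in range(1, len(x_sorted)):
        if x_sorted[i] == x_sorted[i - 1]:
            continue  # identical feature values cannot be separated by a threshold
        left, right = y_sorted[:i], y_sorted[i:]
        # Each child predicts its own mean, so its MSE is the variance of its targets
        score = (len(left) * left.var() + len(right) * right.var()) / len(y_sorted)
        if score < best_score:
            best_threshold = (x_sorted[i - 1] + x_sorted[i]) / 2
            best_score = score
    return best_threshold, best_score

# Toy data (hypothetical): two clearly separated groups
x = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
y = np.array([5.0, 6.0, 7.0, 20.0, 21.0, 22.0])

threshold, score = best_split(x, y)
print(f"best threshold: {threshold}, weighted MSE: {score:.3f}")
```

A full tree repeats this search over every feature and then recurses into each child node.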

### 🎯 Example of a Decision Tree Regression:
![Decision Tree Example](https://upload.wikimedia.org/wikipedia/commons/f/f3/Decision_Tree.jpg)

---

## 📊 Splitting Criteria for Regression
Unlike classification trees, which split on impurity measures such as Gini or entropy, regression trees pick the split that gives the largest **variance reduction** in the target.

### 1️⃣ **Mean Squared Error (MSE)**

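$$ MSE = \frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2 $$
- Here $N$ is the number of samples in a node and $\hat{y}_i$ is the node's prediction; when that prediction is the node mean, the MSE equals the **variance** of the targets in the node.
- The split is chosen to **minimize the weighted average MSE** of the two child nodes.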

### 2️⃣ **Mean Absolute Error (MAE)**

$$ MAE = \frac{1}{N} \sum_{i=1}^{N} |y_i - \hat{y}_i| $$
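- Measures the **absolute deviation** between targets and predictions within a node.
- Less sensitive to outliers than MSE; the value that minimizes it in a node is the **median** of the targets rather than the mean.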
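
To see how the two criteria differ in practice, the short sketch below compares the mean and the median as constant predictions for one node. The target values and the helper names `mse` and `mae` are made up for illustration.

```python
import numpy as np

# Hypothetical targets falling in one node, including one outlier
y = np.array([10.0, 12.0, 11.0, 13.0, 50.0])

def mse(y, c):
    """Mean squared error of the constant prediction c."""
    return np.mean((y - c) ** 2)

def mae(y, c):
    """Mean absolute error of the constant prediction c."""
    return np.mean(np.abs(y - c))

mean_pred = y.mean()
median_pred = np.median(y)

print(f"mean   = {mean_pred:.2f}: MSE = {mse(y, mean_pred):.2f}, MAE = {mae(y, mean_pred):.2f}")
print(f"median = {median_pred:.2f}: MSE = {mse(y, median_pred):.2f}, MAE = {mae(y, median_pred):.2f}")
```

The mean gives the lower MSE and the median gives the lower MAE, which is why the outlier pulls an MSE-based leaf prediction much further than an MAE-based one.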

---

## 🎯 **Advantages and Disadvantages of Decision Tree Regression**
### ✅ Advantages
✔️ Easy to interpret and visualize.  
✔️ Captures **non-linear** relationships.  
✔️ No need for feature scaling.  

### ❌ Disadvantages
⚠️ Prone to **overfitting** if not pruned properly.  
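⚠️ **High variance**: small changes in the training data can produce very different splits.  
⚠️ Cannot **extrapolate**: predictions are piecewise constant, so inputs beyond the training range all receive the value of an edge leaf.  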
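
The extrapolation limitation can be checked with a small sketch. The synthetic linear data, the seed, and the `max_depth=4` setting are assumptions chosen only for this demonstration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (assumption): a noisy linear trend on the interval [0, 10]
rng = np.random.default_rng(0)
X = np.linspace(0, 10, 100).reshape(-1, 1)
y = 2.0 * X.ravel() + rng.normal(scale=0.5, size=100)

tree = DecisionTreeRegressor(max_depth=4, random_state=0).fit(X, y)

# Query the upper edge of the training range and two points far beyond it
X_new = np.array([[10.0], [20.0], [100.0]])
preds = tree.predict(X_new)
print(preds)
```

All three queries fall into the same right-most leaf, so the tree returns the same value for each of them even though the underlying trend keeps increasing.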

---

## Data loading

In [3]:
import numpy as np
import pandas as pd
import warnings
warnings.filterwarnings("ignore")

import os
import matplotlib.pyplot as plt
import seaborn as sns

# Ensure output directory exists
os.makedirs("figures/decision_tree", exist_ok=True)

In [4]:
from keras.datasets import boston_housing

# Load the Boston Housing dataset
(X_train, y_train), (X_test, y_test) = boston_housing.load_data()

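# Drop observations whose target sits at the dataset's upper cap of 50
train_mask = y_train < 50
X_train, y_train = X_train[train_mask], y_train[train_mask]

test_mask = y_test < 50
X_test, y_test = X_test[test_mask], y_test[test_mask]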

# Print the shape of the training and test datasets
print("Training data shape:", X_train.shape)
print("Training targets shape:", y_train.shape)
print("Test data shape:", X_test.shape)
print("Test targets shape:", y_test.shape)

Training data shape: (391, 13)
Training targets shape: (391,)
Test data shape: (99, 13)
Test targets shape: (99,)


### **📌 Exercise: Hyperparameter Tuning for Decision Tree Regression**

#### **📝 Instructions:**
1. **Perform GridSearchCV** to find the best hyperparameters for a `DecisionTreeRegressor`.  
2. **Evaluate model performance** using **Mean Absolute Error (MAE)**, **Mean Squared Error (MSE)**, **Root Mean Squared Error (RMSE)**, and **R² score**.  
3. **Plot residuals** to visualize model performance.  

---

### **🔹 Step 1: Import Necessary Libraries**


### **🔹 Step 2: Define Regressor and Hyperparameters**

In [2]:


# Define Hyperparameter Grid
parameters = {
    "max_depth": [3, 5, 7, 10],  # maximum depth of the tree
    "min_samples_leaf": [5, 10, 20, 30]  # minimum number of samples required in a leaf
}
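
One possible way to run the search is sketched below. It uses synthetic stand-in data from `make_regression` so that it runs on its own; in the exercise, `X_demo` and `y_demo` would be replaced by `X_train` and `y_train`. The choices of `cv=5`, `scoring="neg_mean_squared_error"`, and `random_state=0` are assumptions, not settings prescribed by the exercise.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (assumption); use X_train, y_train in the exercise
X_demo, y_demo = make_regression(n_samples=300, n_features=13, noise=10.0, random_state=0)

# Same grid as in the cell above
parameters = {
    "max_depth": [3, 5, 7, 10],
    "min_samples_leaf": [5, 10, 20, 30],
}

grid = GridSearchCV(
    estimator=DecisionTreeRegressor(random_state=0),
    param_grid=parameters,
    scoring="neg_mean_squared_error",  # scikit-learn maximizes scores, hence the negative MSE
    cv=5,
)
grid.fit(X_demo, y_demo)

print("Best parameters:", grid.best_params_)
print("Best cross-validated MSE:", -grid.best_score_)
```

After fitting, `grid.cv_results_` holds the score of every parameter combination and `grid.best_estimator_` is the tree refitted on all the data passed to `fit`.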



### **🔹 Step 3: Analyze GridSearch Results**

### **🔹 Step 4: Evaluate Model Performance**

#### **📌 Compute Regression Metrics**
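
The four metrics from the instructions can be computed as in the following sketch. The arrays `y_true` and `y_pred` are hypothetical values for illustration; in the exercise they would be `y_test` and the predictions of the tuned model.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hypothetical values, for illustration only
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.0, 8.0, 9.5])

mae = mean_absolute_error(y_true, y_pred)
mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)  # RMSE is the square root of the MSE
r2 = r2_score(y_true, y_pred)

print(f"MAE:  {mae:.3f}")
print(f"MSE:  {mse:.3f}")
print(f"RMSE: {rmse:.3f}")
print(f"R²:   {r2:.3f}")
```

MAE and RMSE are expressed in the units of the target, which makes them easier to interpret than the MSE.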

### **🔹 Step 5: Visualizing Model Performance**
#### **📌 Residual Plot**

![actual_vs_predicted.png](attachment:007b276d-c674-4400-bf16-d3928332004b.png)

![residuals_plot.png](attachment:e90822c1-1d08-4f0a-a405-847a81f9f4bf.png)
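
Figures of this kind could be produced along the lines of the sketch below. It fits a tree on synthetic stand-in data so that it runs on its own; the hyperparameters, the train/test split, and the output file name `residuals_demo.png` are assumptions, and in the exercise the tuned model and `X_test`, `y_test` would be used instead.

```python
import os
import matplotlib.pyplot as plt
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (assumption); use the housing data in the exercise
X_demo, y_demo = make_regression(n_samples=300, n_features=13, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

model = DecisionTreeRegressor(max_depth=5, min_samples_leaf=5, random_state=0).fit(X_tr, y_tr)
y_pred = model.predict(X_te)
residuals = y_te - y_pred  # positive residual = the model under-predicted

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))

# Actual vs. predicted: points on the dashed diagonal are perfect predictions
ax1.scatter(y_te, y_pred, alpha=0.6)
lims = [min(y_te.min(), y_pred.min()), max(y_te.max(), y_pred.max())]
ax1.plot(lims, lims, "r--")
ax1.set_xlabel("Actual values")
ax1.set_ylabel("Predicted values")
ax1.set_title("Actual vs. predicted")

# Residuals vs. predicted: a good fit scatters evenly around zero
ax2.scatter(y_pred, residuals, alpha=0.6)
ax2.axhline(0, color="red", linestyle="--")
ax2.set_xlabel("Predicted values")
ax2.set_ylabel("Residuals")
ax2.set_title("Residual plot")

os.makedirs("figures/decision_tree", exist_ok=True)
fig.savefig("figures/decision_tree/residuals_demo.png", dpi=150, bbox_inches="tight")
```

Because a tree predicts one constant per leaf, the points in both panels line up in vertical or horizontal bands, one per leaf.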