# **📘 [LDATS2350] - DATA MINING**

## **📊 Python19 - Multiple Regression**

**Prof. Robin Van Oirbeek**  

<br/>

**🧑‍🏫 Guillaume Deside** *(guillaume.deside@uclouvain.be)*  

---


## **🔹 Multiple Linear Regression**
With several independent variables, the linear regression equation generalizes to:


$$Y = w_1 X_1 + w_2 X_2 + \dots + w_n X_n + b$$

where:
- $Y$ is the predicted target.
- $X_1, X_2, \dots, X_n$ are the input features.
- $w_1, w_2, \dots, w_n$ are the weights (coefficients) of the features.
- $b$ is the intercept (bias term).
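
To make the equation concrete, here is a minimal NumPy sketch that evaluates it for several samples at once. The data, weights, and intercept are made-up illustrative values, not estimates from the housing data.

```python
import numpy as np

# Hypothetical toy data: 3 samples (rows), 2 features (columns).
X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])
w = np.array([0.5, -1.0])  # one weight per feature (illustrative values)
b = 2.0                    # intercept (illustrative value)

# Y = w_1 * X_1 + w_2 * X_2 + b, computed for every row with one matrix product.
y = X @ w + b
print(y)  # values: 0.5, -0.5, -1.5
```

Writing the model as `X @ w + b` is the same sum as in the formula, just applied to all rows in a single operation.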



import warnings
import os

warnings.filterwarnings("ignore")
# Create subfolder for multiple regression figures
os.makedirs("figures/multiple_regression", exist_ok=True)


from keras.datasets import boston_housing

# Load the Boston Housing dataset
(X_train, y_train), (X_test, y_test) = boston_housing.load_data()

# Print the shape of the training and test datasets
print("Training data shape:", X_train.shape)
print("Training targets shape:", y_train.shape)
print("Test data shape:", X_test.shape)
print("Test targets shape:", y_test.shape)


Training data shape: (404, 13)
Training targets shape: (404,)
Test data shape: (102, 13)
Test targets shape: (102,)


### 🎓 **Exercise: Linear Regression & Residual Analysis**

You are provided with training and test datasets. Your task is to train a linear regression model, evaluate its performance, and analyze the residuals.

#### 📌 **Instructions**

#### 1. **Data Preparation**
- Check the correlation between features using a heatmap.
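
One possible way to draw such a heatmap is sketched below with NumPy and Matplotlib only. It assumes a synthetic array `X_demo` as a stand-in for `X_train` (the name and data are hypothetical); `seaborn.heatmap` would be a common alternative.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)

# Synthetic stand-in for X_train: 200 samples, 4 features (assumption).
X_demo = rng.normal(size=(200, 4))
# Make feature 3 strongly correlated with feature 0 so the heatmap shows structure.
X_demo[:, 3] = 0.8 * X_demo[:, 0] + 0.2 * rng.normal(size=200)

# Pearson correlation matrix; rowvar=False treats columns as variables.
corr = np.corrcoef(X_demo, rowvar=False)

fig, ax = plt.subplots(figsize=(5, 4))
im = ax.imshow(corr, cmap="coolwarm", vmin=-1, vmax=1)
fig.colorbar(im, ax=ax, label="Pearson correlation")
ax.set_xticks(range(corr.shape[1]))
ax.set_yticks(range(corr.shape[0]))
ax.set_title("Feature correlation heatmap")
```

Fixing the color scale to the range -1 to 1 keeps zero correlation at the neutral midpoint of the colormap, which makes strongly correlated pairs easy to spot.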

#### 2. **Linear Regression Model**
- Create a `LinearRegression` model from `sklearn.linear_model`.
- Use `GridSearchCV` with 3-fold cross-validation to train the model. Plain linear regression has no hyperparameters to tune, so the parameter grid can be left empty.
- Display:
  - Best cross-validated score (R² by default for regressors).
  - Intercept and coefficients of the model.
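
The step above could look roughly like the following sketch. It runs on synthetic data (`X_demo`, `y_demo`, and `true_w` are hypothetical names and values), so the printed numbers say nothing about the housing data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)

# Synthetic stand-in for the training set (assumption): linear signal plus small noise.
X_demo = rng.normal(size=(150, 3))
true_w = np.array([2.0, -1.0, 0.5])
y_demo = X_demo @ true_w + 4.0 + rng.normal(scale=0.1, size=150)

# An empty grid means a single candidate; GridSearchCV then only cross-validates it.
grid = GridSearchCV(LinearRegression(), param_grid={}, cv=3)
grid.fit(X_demo, y_demo)

# With the default refit=True, best_estimator_ is refitted on all the data passed to fit.
model = grid.best_estimator_
print("Best CV score (mean R^2):", grid.best_score_)
print("Intercept:", model.intercept_)
print("Coefficients:", model.coef_)
```

With no hyperparameters to search, `cross_val_score` would give the same validation scores; `GridSearchCV` is used here only because the exercise asks for it.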

#### 3. **Model Evaluation**
- Predict on both training and test sets.
- Calculate the following metrics on both sets:
  - **MAE** (Mean Absolute Error)
  - **MSE** (Mean Squared Error)
  - **RMSE** (Root Mean Squared Error)
  - **R² Score**
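
As a small worked example, the four metrics can be computed with `sklearn.metrics` as sketched below. The two arrays are hand-picked toy values, chosen so the results are easy to verify by hand.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Toy values (assumption): prediction errors are -1, 0, 1, 2.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.0, 5.0, 8.0, 11.0])

mae = mean_absolute_error(y_true, y_pred)  # (1 + 0 + 1 + 2) / 4 = 1.0
mse = mean_squared_error(y_true, y_pred)   # (1 + 0 + 1 + 4) / 4 = 1.5
rmse = np.sqrt(mse)                        # sqrt(1.5), about 1.2247
r2 = r2_score(y_true, y_pred)              # 1 - 6 / 20 = 0.7

print(f"MAE={mae:.4f}  MSE={mse:.4f}  RMSE={rmse:.4f}  R2={r2:.4f}")
```

Running the same four calls on the training and test predictions gives the two sets of metrics the exercise asks for; a large gap between them would suggest overfitting.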

#### 4. **Residual Analysis**
- Compute residuals (difference between predictions and actual values) for the **training set**.
- Plot a histogram of residuals.
- Scale the residuals using `StandardScaler`.
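
A minimal sketch of these three steps follows, on synthetic data (`X_demo` and `y_demo` are hypothetical stand-ins for the training set). It uses the sign convention stated above, predictions minus actual values; the opposite convention is also common and only flips the sign.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)

# Synthetic stand-in for the training set (assumption).
X_demo = rng.normal(size=(300, 2))
y_demo = X_demo @ np.array([1.5, -2.0]) + 3.0 + rng.normal(scale=0.5, size=300)

model = LinearRegression().fit(X_demo, y_demo)

# Residuals on the training data: predictions minus actual values.
residuals = model.predict(X_demo) - y_demo

fig, ax = plt.subplots()
ax.hist(residuals, bins=30)
ax.set_xlabel("Residual")
ax.set_ylabel("Count")
ax.set_title("Training residuals")

# StandardScaler expects a 2D array, so reshape to one column and flatten afterwards.
scaled_residuals = StandardScaler().fit_transform(residuals.reshape(-1, 1)).ravel()
```

After scaling, the residuals have zero mean and unit standard deviation, which makes them directly comparable to a standard normal distribution.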

#### 5. **Normality Check of Residuals**
- Fit a **normal distribution** to the residuals using `scipy.stats.norm.fit`.
- Generate a **QQ plot** to compare residuals to the fitted normal distribution.
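
The normality check could be sketched as follows. The `residuals` array here is synthetic normal noise (an assumption made so the example is self-contained), and `scipy.stats.probplot` is one of several ways to produce a QQ plot.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(2)

# Synthetic stand-in for the model residuals (assumption).
residuals = rng.normal(loc=0.0, scale=1.0, size=500)

# Maximum-likelihood fit of a normal distribution: returns (mean, std).
mu, sigma = stats.norm.fit(residuals)

# QQ plot against the fitted normal; sparams passes the fitted loc and scale.
fig, ax = plt.subplots()
(theoretical_q, ordered_vals), (slope, intercept, r) = stats.probplot(
    residuals, dist="norm", sparams=(mu, sigma), plot=ax
)
ax.set_title("QQ plot of residuals vs fitted normal")

print(f"mu={mu:.3f}, sigma={sigma:.3f}, r={r:.4f}")
```

If the residuals are close to normal, the points lie near the reference line and `r` is close to 1. Systematic curvature at the ends of the plot points to heavy tails or skewness.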

#### 6. **Distribution Comparison**
- Overlay the histogram of residuals with the PDF of the fitted normal distribution.
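
One way to build this overlay is sketched below, again with a synthetic `residuals` array as a stand-in (an assumption). The key detail is `density=True`, which rescales the histogram so that it is on the same scale as the PDF.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(3)

# Synthetic stand-in for the model residuals (assumption).
residuals = rng.normal(loc=0.0, scale=2.0, size=1000)

# Fit the normal distribution whose PDF will be drawn on top of the histogram.
mu, sigma = stats.norm.fit(residuals)

fig, ax = plt.subplots()
# density=True normalizes bar areas to sum to 1, matching the PDF's scale.
counts, edges, _ = ax.hist(residuals, bins=30, density=True, alpha=0.6, label="Residuals")

x_grid = np.linspace(edges[0], edges[-1], 200)
pdf = stats.norm.pdf(x_grid, loc=mu, scale=sigma)
ax.plot(x_grid, pdf, label=f"Normal fit (mu={mu:.2f}, sigma={sigma:.2f})")

ax.set_xlabel("Residual")
ax.set_ylabel("Density")
ax.legend()
```

Without `density=True` the bars would show raw counts and the PDF curve would appear flat along the bottom of the plot.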



![residuals_vs_normal.png](attachment:f4b8dfa4-9b9f-4c17-aced-889d9824ff1c.png)
![qq_plot_residuals.png](attachment:20392574-d117-4b2c-a9bc-3d16ec7ae3dc.png)
![residual_histogram_train.png](attachment:7a2ee435-aa16-4d48-91d5-403036220264.png)
![qq_percentile_residuals.png](attachment:5fb9abb1-5ebe-4a93-8447-cb71ab4f94d4.png)

![correlation_heatmap.png](attachment:e6729a0f-aab6-4ee3-ae4f-a8a6c8af3891.png)