# 🧪 LAB 4A: Regression Models (Marketing Dataset) (Student Version)
### Bologna Business School: Machine Learning Lab

**Dataset:** `regression_example.csv`

## 🎯 Objectives
- Load, explore, and clean the dataset (handle missing values)
- Visualise predictors vs target and compute correlations
- Univariate linear regression on the strongest predictor
- Multivariate linear regression on all predictors
- Decision Tree regression with cross-validated `max_depth`
- Random Forest regression with cross-validated `max_depth`
- Compare models using RMSE and R²

---

In [None]:
# 🛠️ Environment setup
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import scipy.stats as stats

from sklearn.model_selection import train_test_split, GridSearchCV, KFold, cross_val_score
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor, plot_tree
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import mean_squared_error, r2_score

random_state = 42
np.random.seed(random_state)

---
# PART I: Multivariate Regression (Marketing Dataset)


## 1️⃣ Load the dataset
We load the dataset and inspect the first rows. The target variable is `response`, and the predictors are `F0`–`F7`. The column `idx` is an identifier and will not be used for modelling.

In [None]:
df = pd.read_csv("regression_example.csv")
df.head()

## 2️⃣ Explore the dataset
We examine summary statistics and data types, and we verify whether there are missing values (NaNs).

In [None]:
df.describe()

In [None]:
df.info()

## 3️⃣ Handle missing values
Most scikit-learn regression estimators, including `LinearRegression`, reject inputs containing NaNs, so we remove rows with missing values.

**Note:** In real projects you may consider imputation; here we follow the lab instruction to drop unusable rows.

In [None]:
n_before = df.shape[0]
df_clean = df.dropna()
n_after = df_clean.shape[0]

print(f"Removed {n_before - n_after} rows with missing values")
print("Final shape:", df_clean.shape)
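As a point of comparison with the row-dropping approach above, the imputation mentioned in the note could look roughly like the sketch below. It is not part of the lab instructions: the toy DataFrame and its values are hypothetical, and the median strategy is just one reasonable choice.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frame standing in for the lab data (hypothetical values).
toy = pd.DataFrame({
    "F0": [1.0, 2.0, np.nan, 4.0],
    "F1": [10.0, np.nan, 30.0, 50.0],
})

# Replace each NaN with the median of its own column.
imputer = SimpleImputer(strategy="median")
toy_imputed = pd.DataFrame(
    imputer.fit_transform(toy), columns=toy.columns, index=toy.index
)
print(toy_imputed)
```

In a modelling workflow the imputer would normally be fitted on the training split only and then applied to the test split, so that no test-set information leaks into the fill values.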

## 4️⃣ Split into features (X) and target (y)
`idx` is removed because it is an identifier, not a predictive feature.

In [None]:
X = df_clean.drop(columns=["idx", "response"])
y = df_clean["response"]

print("X shape:", X.shape)
print("y shape:", y.shape)

## 5️⃣ Visual exploration: scatter plots (feature vs target)
These plots provide an initial qualitative understanding of how each feature relates to the target.

In [None]:
# TODO (Student): Implement this step.
# Hint: Use matplotlib scatter plots to visualise feature vs target.
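One possible layout for feature-vs-target scatter plots is a small grid with one panel per feature. The sketch below uses synthetic data, so `X_demo`, `y_demo`, and the column names are hypothetical stand-ins rather than the lab's `X` and `y`.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line inside Jupyter
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic stand-ins for a feature table and a target (hypothetical data).
X_demo = pd.DataFrame(rng.normal(size=(100, 4)), columns=["A", "B", "C", "D"])
y_demo = 2.0 * X_demo["A"] - X_demo["C"] + rng.normal(scale=0.5, size=100)

n_cols = 2
n_rows = int(np.ceil(X_demo.shape[1] / n_cols))
fig, axes = plt.subplots(n_rows, n_cols, figsize=(8, 3 * n_rows), squeeze=False)

# One scatter panel per feature.
for ax, col in zip(axes.ravel(), X_demo.columns):
    ax.scatter(X_demo[col], y_demo, s=10, alpha=0.6)
    ax.set_xlabel(col)
    ax.set_ylabel("target")

# Hide unused panels when the feature count is not a multiple of n_cols.
for ax in axes.ravel()[X_demo.shape[1]:]:
    ax.set_visible(False)

fig.tight_layout()
```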


## 6️⃣ Correlation analysis
We compute the correlation matrix and identify the feature with the highest absolute correlation with the target. This will be used for **univariate linear regression**.

In [None]:
# TODO (Student): Implement this step.
# Hint: Use plt.imshow(corr_matrix) to visualise correlations.
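To illustrate what "highest absolute correlation" means in practice, here is a minimal sketch on synthetic data. The frame `demo` and its columns are assumptions made for the example. The selection step ranks by absolute value so that a strong negative relationship is not overlooked.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.normal(size=(200, 3)), columns=["A", "B", "C"])
# The target depends strongly (and negatively) on B, only weakly on A.
demo["target"] = 0.3 * demo["A"] - 2.0 * demo["B"] + rng.normal(scale=0.5, size=200)

corr_matrix = demo.corr()  # Pearson correlation by default
corr_with_target = corr_matrix["target"].drop("target")

# Rank by absolute value: sign tells direction, magnitude tells strength.
strongest = corr_with_target.abs().idxmax()
print(corr_with_target.abs().sort_values(ascending=False))
print("Strongest predictor:", strongest)
```

In the lab, the analogous result is what later cells expect to find in the variable `best_feature`.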


## 7️⃣ Train/test split
We split the dataset into a training set (fit models) and a test set (evaluate generalisation).

**Important:** We set `random_state` for reproducibility.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=random_state
)

print("Training samples:", X_train.shape[0])
print("Test samples:", X_test.shape[0])

---
## 8️⃣ Experiment 1: Univariate Linear Regression
We fit a linear regression using only the feature most strongly correlated with the target (identified above).

In [None]:
# TODO (Student): Implement this step.
# Hint: Fit the model on the training set, then predict on the test set.
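The fit-on-train, predict-on-test pattern from the hint can be sketched on synthetic data as follows. All names and values here (`x_demo`, `y_demo`, the true slope of 3) are hypothetical and chosen only for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
x_demo = rng.uniform(0, 10, size=(200, 1))  # one feature, kept 2-D as sklearn expects
y_demo = 3.0 * x_demo[:, 0] + 1.0 + rng.normal(scale=1.0, size=200)

x_tr, x_te, y_tr, y_te = train_test_split(
    x_demo, y_demo, test_size=0.3, random_state=42
)

model = LinearRegression().fit(x_tr, y_tr)  # fit on the training split only
y_pred = model.predict(x_te)                # predict on held-out data

rmse = np.sqrt(mean_squared_error(y_te, y_pred))
r2 = r2_score(y_te, y_pred)
print(f"slope={model.coef_[0]:.3f}, intercept={model.intercept_:.3f}")
print(f"RMSE={rmse:.3f}, R2={r2:.3f}")
```

Selecting a single column from a DataFrame with double brackets (for example `X_train[[some_column]]`) keeps the 2-D shape that `fit` requires.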


### Optional: statistical significance (simple univariate t-test)
For a single predictor, we can evaluate whether the slope differs significantly from zero. This is a classical OLS-style test (educational).

In [None]:
# TODO (Student): Implement this step.
# Hint: Follow the instructions in the markdown above this cell.
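For reference, the classical slope test computes t = slope / SE(slope) with n - 2 degrees of freedom. The sketch below works this through on synthetic data (the arrays `x` and `y` are hypothetical) and cross-checks the result against `scipy.stats.linregress`.

```python
import numpy as np
import scipy.stats as stats

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=80)
y = 0.5 * x + 2.0 + rng.normal(scale=1.0, size=80)

n = x.size
x_mean = x.mean()
sxx = np.sum((x - x_mean) ** 2)

# OLS estimates for y = intercept + slope * x
slope = np.sum((x - x_mean) * (y - y.mean())) / sxx
intercept = y.mean() - slope * x_mean

# Residual variance uses n - 2 degrees of freedom (two fitted parameters).
residuals = y - (intercept + slope * x)
sigma2 = np.sum(residuals ** 2) / (n - 2)
se_slope = np.sqrt(sigma2 / sxx)

t_stat = slope / se_slope
p_value = 2 * stats.t.sf(np.abs(t_stat), df=n - 2)  # two-sided test of slope = 0

print(f"slope={slope:.4f}, t={t_stat:.2f}, p={p_value:.3g}")

# Cross-check against SciPy's built-in simple linear regression.
ref = stats.linregress(x, y)
print(f"linregress: slope={ref.slope:.4f}, p={ref.pvalue:.3g}")
```

A small p-value indicates that a slope this far from zero would be unlikely if the predictor had no linear relationship with the target.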


---
## 9️⃣ Experiment 2: Multivariate Linear Regression
We now use all predictors. We inspect coefficients and compute RMSE and R².

In [None]:
# TODO (Student): Implement this step.
# Hint: Fit the model on the training set, then predict on the test set.


### Cross-validation check (robust comparison)
A single train/test split can be noisy. We estimate average RMSE via 5-fold cross-validation for univariate and multivariate linear models.

In [None]:
cv = KFold(n_splits=5, shuffle=True, random_state=random_state)

scores_uni = cross_val_score(
    LinearRegression(), X[[best_feature]], y,
    scoring="neg_root_mean_squared_error", cv=cv
)
scores_multi = cross_val_score(
    LinearRegression(), X, y,
    scoring="neg_root_mean_squared_error", cv=cv
)

print("CV RMSE (univariate): ", -scores_uni.mean(), "+/-", scores_uni.std())
print("CV RMSE (multivariate):", -scores_multi.mean(), "+/-", scores_multi.std())

---
## 🔟 Experiment 3: Decision Tree Regression
We fit a decision tree and tune the `max_depth` hyperparameter using cross-validation.

In [None]:
dt = DecisionTreeRegressor(random_state=random_state)
dt.fit(X_train, y_train)

print("Depth of unconstrained tree:", dt.get_depth())

In [None]:
param_grid = {"max_depth": range(1, 15)}
grid = GridSearchCV(
    DecisionTreeRegressor(random_state=random_state),
    param_grid,
    cv=5,
    scoring="neg_mean_squared_error"
)
grid.fit(X_train, y_train)

best_dt = grid.best_estimator_
print("Best max_depth:", grid.best_params_["max_depth"])
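Because the search is scored with `neg_mean_squared_error`, the stored scores are negated MSE values. The sketch below, run on synthetic data from `make_regression` (an assumption, not the lab dataset), shows one way to turn them into a per-depth RMSE table. Note that this is the root of the mean fold MSE, which is not identical to the mean of the fold RMSEs.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression problem standing in for the training data.
X_demo, y_demo = make_regression(
    n_samples=300, n_features=5, noise=10.0, random_state=0
)

search = GridSearchCV(
    DecisionTreeRegressor(random_state=0),
    {"max_depth": range(1, 8)},
    cv=5,
    scoring="neg_mean_squared_error",
)
search.fit(X_demo, y_demo)

# Scores are negated MSE (higher is better), so flip the sign before the root.
results = pd.DataFrame({
    "max_depth": [p["max_depth"] for p in search.cv_results_["params"]],
    "cv_rmse": np.sqrt(-search.cv_results_["mean_test_score"]),
})
print(results)
print("Best depth:", search.best_params_["max_depth"],
      "| CV RMSE:", np.sqrt(-search.best_score_))
```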

In [None]:
# TODO (Student): Implement this step.
# Hint: Compute RMSE = sqrt(mean_squared_error(...)) and R² = r2_score(...).


### Visualise the tuned decision tree

In [None]:
# TODO (Student): Implement this step.
# Hint: Use sklearn.tree.plot_tree(best_dt, feature_names=..., filled=True).


---
## 1️⃣1️⃣ Experiment 4: Random Forest Regression
We fit a Random Forest regressor. For simplicity, we tune `max_depth` and keep a moderately large number of trees.

In [None]:
rf_param_grid = {"max_depth": range(1, 11)}
rf_grid = GridSearchCV(
    RandomForestRegressor(n_estimators=200, random_state=random_state),
    rf_param_grid,
    cv=5,
    scoring="neg_mean_squared_error",
    n_jobs=-1
)
rf_grid.fit(X_train, y_train)

best_rf = rf_grid.best_estimator_
print("Best max_depth:", rf_grid.best_params_["max_depth"])

In [None]:
# TODO (Student): Implement this step.
# Hint: Compute RMSE = sqrt(mean_squared_error(...)) and R² = r2_score(...).


---
## 1️⃣2️⃣ Model comparison (Marketing dataset)
We summarise RMSE and R² for all models and plot predictions vs true values on the test set.

In [None]:
comparison_marketing = pd.DataFrame({
    "Model": [
        f"Linear (univariate: {best_feature})",
        "Linear (multivariate)",
        "Decision Tree (tuned)",
        "Random Forest (tuned)"
    ],
    "RMSE": [rmse_uni, rmse_multi, rmse_dt, rmse_rf],
    "R²": [r2_uni, r2_multi, r2_dt, r2_rf]
})
comparison_marketing

In [None]:
# TODO (Student): Implement this step.
# Hint: Follow the instructions in the markdown above this cell.
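A predicted-vs-true plot is usually read against the diagonal: points on the line are perfect predictions, and wider scatter means larger errors. The sketch below shows one way to draw such panels. The data and the labels "Model A" and "Model B" are hypothetical and do not correspond to the lab's fitted models.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line inside Jupyter
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
y_true = rng.normal(loc=50, scale=10, size=100)
# Hypothetical predictions from two models of different quality.
preds = {
    "Model A": y_true + rng.normal(scale=3, size=100),
    "Model B": y_true + rng.normal(scale=8, size=100),
}

fig, axes = plt.subplots(
    1, len(preds), figsize=(5 * len(preds), 4), sharex=True, sharey=True
)
lims = [y_true.min(), y_true.max()]

for ax, (name, y_hat) in zip(np.atleast_1d(axes), preds.items()):
    ax.scatter(y_true, y_hat, s=10, alpha=0.6)
    ax.plot(lims, lims, "r--", linewidth=1)  # perfect-prediction diagonal
    ax.set_title(name)
    ax.set_xlabel("True value")
    ax.set_ylabel("Predicted value")

fig.tight_layout()
```

Sharing both axes across panels makes the spread around the diagonal directly comparable between models.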


---
## ✅ Submission checklist (Student)
- All TODO cells completed
- All figures rendered
- Metrics reported (RMSE, R²) where required
- Short answers to control questions included
