# 🧪 LAB 4B: Polynomial Regression (Temperature vs Power Demand) (Student Version)
### Bologna Business School, Machine Learning Lab

**Dataset:** `power_demand_vs_temperature.csv`

## 🎯 Objectives
- Load and inspect the dataset
- Build a baseline linear regression model
- Fit polynomial regression models (degrees 2, 3, 4)
- Evaluate models using RMSE and R² (plus F-statistic and p-value)
- Visualise fitted curves and discuss overfitting risk

---

In [None]:
# 🛠️ Environment setup
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import scipy.stats as stats

from sklearn.model_selection import train_test_split, GridSearchCV, KFold, cross_val_score
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor, plot_tree
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import mean_squared_error, r2_score

random_state = 42
np.random.seed(random_state)

## 1️⃣ Load the dataset
We load the CSV and set the date as the index (time information is not directly used for regression here).

In [None]:
df_temp = pd.read_csv("power_demand_vs_temperature.csv", parse_dates=["date"])
df_temp.set_index("date", inplace=True)
df_temp.head()

## 2️⃣ Inspect the data
We review summary statistics and count missing values per column to check that the data is complete.

In [None]:
display(df_temp.describe())
print("\nMissing values:")
display(df_temp.isna().sum())

## 3️⃣ Define features (X) and target (y)
- Feature: `temp`
- Target: `demand`

In [None]:
X2 = df_temp[["temp"]].values
y2 = df_temp["demand"].values

print("X shape:", X2.shape)
print("y shape:", y2.shape)

## 4️⃣ Train/test split

In [None]:
X2_train, X2_test, y2_train, y2_test = train_test_split(
    X2, y2, test_size=0.3, random_state=random_state
)

print("Training samples:", X2_train.shape[0])
print("Test samples:", X2_test.shape[0])

## 5️⃣ Evaluation function
We compute RMSE and R², and also report an F-statistic and p-value for educational purposes.

**Interpretation note:** The classical F-test relies on linear-model assumptions (independent, normally distributed errors with constant variance). Polynomial regression is linear in its parameters, so it fits within the same framework, but the p-value is only as trustworthy as those assumptions and should be interpreted cautiously in real applications.
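
One possible shape for such a function is sketched below. It is not the reference solution: the name `evaluate_regression`, its `n_predictors` argument, and the toy arrays are assumptions made for illustration. The F-statistic is derived from R² using the standard overall-significance formula, F = (R² / p) / ((1 − R²) / (n − p − 1)), where p is the number of fitted terms excluding the intercept.

```python
import numpy as np
from scipy import stats
from sklearn.metrics import mean_squared_error, r2_score


def evaluate_regression(y_true, y_pred, n_predictors):
    """Return RMSE, R², and the overall F-statistic with its p-value.

    n_predictors counts the fitted terms excluding the intercept
    (1 for a straight line, d for a degree-d polynomial in one variable).
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    n_samples = y_true.shape[0]

    rmse = np.sqrt(mean_squared_error(y_true, y_pred))
    r2 = r2_score(y_true, y_pred)

    # Overall F-test: explained variance per predictor divided by
    # unexplained variance per residual degree of freedom.
    df_model = n_predictors
    df_resid = n_samples - n_predictors - 1
    f_stat = (r2 / df_model) / ((1.0 - r2) / df_resid)
    p_value = stats.f.sf(f_stat, df_model, df_resid)  # upper-tail probability

    return {"rmse": rmse, "r2": r2, "f_stat": f_stat, "p_value": p_value}


# Toy check with made-up numbers (not taken from the lab dataset).
y_true_demo = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y_pred_demo = np.array([1.1, 1.9, 3.2, 3.8, 5.1, 5.9])
metrics_demo = evaluate_regression(y_true_demo, y_pred_demo, n_predictors=1)
print(metrics_demo)
```

The sketch does not guard against edge cases: a perfect fit (R² = 1) makes the denominator zero, and a negative R², which can occur on a test set, yields a negative F value that has no meaningful interpretation.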

In [None]:
# TODO (Student): Implement this step.
# Hint: Compute RMSE = sqrt(mean_squared_error(...)) and R² = r2_score(...).


---
## 6️⃣ Experiment 1: Linear regression
We start with a linear model and observe its limitations when the relationship is clearly non-linear.
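
To see what this limitation looks like in isolation, the sketch below fits a straight line to synthetic data. The U-shaped curve, the noise level, and every `*_demo` name are assumptions chosen for illustration; the real dataset may behave differently.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the lab data: demand is lowest at mild temperatures.
rng = np.random.default_rng(42)
temp_demo = rng.uniform(0, 35, size=300).reshape(-1, 1)
demand_demo = (
    0.01 * (temp_demo[:, 0] - 18) ** 2 + 2.0 + rng.normal(0, 0.2, size=300)
)

X_tr, X_te, y_tr, y_te = train_test_split(
    temp_demo, demand_demo, test_size=0.3, random_state=42
)

# Fit on the training split only, then score on the held-out split.
linear_demo = LinearRegression().fit(X_tr, y_tr)
y_hat_demo = linear_demo.predict(X_te)

linear_rmse = np.sqrt(mean_squared_error(y_te, y_hat_demo))
linear_r2 = r2_score(y_te, y_hat_demo)
print(f"Linear baseline: RMSE={linear_rmse:.3f}, R2={linear_r2:.3f}")
```

Because the synthetic curve is nearly symmetric around its minimum, a single slope cannot follow both arms of the U, so the straight line explains little of the variance.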

In [None]:
# TODO (Student): Implement this step.
# Hint: Fit the model on the training set, then predict on the test set.


---
## 7️⃣ Polynomial regression (degrees 2, 3, 4)
We expand the feature space using polynomial features and fit linear regression in the expanded space.
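
A minimal sketch of this idea follows, again on synthetic data rather than the lab CSV. Chaining `PolynomialFeatures` and `LinearRegression` with `make_pipeline` is one convenient option, not the only one; the `*_demo` names and the `poly_results` dictionary are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic U-shaped data (an assumption, not the lab dataset).
rng = np.random.default_rng(42)
temp_demo = rng.uniform(0, 35, size=300).reshape(-1, 1)
demand_demo = (
    0.01 * (temp_demo[:, 0] - 18) ** 2 + 2.0 + rng.normal(0, 0.2, size=300)
)
X_tr, X_te, y_tr, y_te = train_test_split(
    temp_demo, demand_demo, test_size=0.3, random_state=42
)

poly_results = {}
for degree in (2, 3, 4):
    # include_bias=False because LinearRegression fits its own intercept.
    model = make_pipeline(
        PolynomialFeatures(degree=degree, include_bias=False),
        LinearRegression(),
    )
    model.fit(X_tr, y_tr)  # the expansion is learned from the training split
    y_hat = model.predict(X_te)  # and re-applied unchanged to the test split
    poly_results[degree] = {
        "rmse": np.sqrt(mean_squared_error(y_te, y_hat)),
        "r2": r2_score(y_te, y_hat),
    }
    print(f"degree {degree}: {poly_results[degree]}")
```

Wrapping both steps in one pipeline keeps the feature expansion and the regression together, which avoids accidentally transforming the test set differently from the training set.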

In [None]:
# TODO (Student): Implement this step.
# Hint: For each degree, expand the features with PolynomialFeatures, fit LinearRegression on the training set, then predict on the test set.


## 8️⃣ Visualise fitted curves
We plot the data and the fitted curves for linear and polynomial models. This provides an intuitive understanding of why polynomial regression improves performance here.
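
One way to draw such a figure is sketched below on synthetic data. The data, the `*_demo` names, and the choice to fit on all points are assumptions for brevity; in the lab the models would be fitted on the training split. Predictions are made on a dense, evenly spaced grid so that each curve is drawn smoothly.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic U-shaped data (an assumption, not the lab dataset).
rng = np.random.default_rng(42)
temp_demo = rng.uniform(0, 35, size=300).reshape(-1, 1)
demand_demo = (
    0.01 * (temp_demo[:, 0] - 18) ** 2 + 2.0 + rng.normal(0, 0.2, size=300)
)

# Sorted grid spanning the observed temperature range.
temp_grid = np.linspace(temp_demo.min(), temp_demo.max(), 200).reshape(-1, 1)

fig, ax = plt.subplots(figsize=(8, 5))
ax.scatter(temp_demo[:, 0], demand_demo, s=10, alpha=0.4, color="grey", label="data")

for degree in (1, 2, 3, 4):  # degree 1 is the plain linear baseline
    model = make_pipeline(
        PolynomialFeatures(degree=degree, include_bias=False),
        LinearRegression(),
    )
    model.fit(temp_demo, demand_demo)
    ax.plot(temp_grid[:, 0], model.predict(temp_grid), label=f"degree {degree}")

ax.set_xlabel("temp")
ax.set_ylabel("demand")
ax.set_title("Linear vs polynomial fits (synthetic data)")
ax.legend()
# In a notebook the figure renders at the end of the cell;
# in a plain script, call plt.show() here.
```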

In [None]:
# TODO (Student): Implement this step.
# Hint: Use matplotlib scatter plots to visualise feature vs target.


---
# 🧠 Final observations
- Linear regression performs well only if the relationship is approximately linear.
- In the marketing dataset, one feature often dominates (identified via correlation/coefficients).
- Tree-based models capture non-linear interactions but can overfit; cross-validation helps select complexity.
- Polynomial regression can model curved relationships effectively, but higher degrees may overfit.
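
The last two points can be combined: cross-validation offers one way to choose a polynomial degree without relying on the training error alone. The sketch below uses synthetic data, and the degree range, the `StandardScaler` step (added only to keep high powers numerically well behaved), and the `*_demo` names are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic U-shaped data (an assumption, not the lab dataset).
rng = np.random.default_rng(42)
temp_demo = rng.uniform(0, 35, size=300).reshape(-1, 1)
demand_demo = (
    0.01 * (temp_demo[:, 0] - 18) ** 2 + 2.0 + rng.normal(0, 0.2, size=300)
)

cv = KFold(n_splits=5, shuffle=True, random_state=42)
cv_rmse = {}
for degree in range(1, 7):
    model = make_pipeline(
        StandardScaler(),  # rescale temp before raising it to high powers
        PolynomialFeatures(degree=degree, include_bias=False),
        LinearRegression(),
    )
    # scikit-learn maximises scores, so RMSE is reported with a negative sign.
    scores = cross_val_score(
        model, temp_demo, demand_demo, cv=cv,
        scoring="neg_root_mean_squared_error",
    )
    cv_rmse[degree] = -scores.mean()
    print(f"degree {degree}: mean CV RMSE = {cv_rmse[degree]:.3f}")

best_degree = min(cv_rmse, key=cv_rmse.get)
print("Degree with the lowest mean CV RMSE:", best_degree)
```

When several degrees give nearly identical cross-validated error, preferring the lowest of them is a common way to limit overfitting risk.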

## ❓ Control questions (submit short answers)
1. Which feature is the strongest predictor in the marketing dataset? How did you decide?
2. Compare univariate vs multivariate linear regression: when does adding features help or hurt?
3. Why does tuning `max_depth` affect overfitting in a decision tree?
4. Why do polynomial models improve the temperature–demand fit? What are the risks of high degrees?

## 📦 Deliverables
Submit:
- The cleaned dataset size and the number of removed rows
- The correlation ranking vs the target
- The model comparison table (RMSE and R²)
- The fitted-curve plot for polynomial regression
- Answers to the control questions


---
## ✅ Submission checklist (Student)
- All TODO cells completed
- All figures rendered
- Metrics reported (RMSE, R²) where required
- Short answers to control questions included
