GitHub repository: https://github.com/catebros/ML-fundamentals-2025/tree/main

In [3]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn
import scipy
import kagglehub

path = 'hour.csv'



# Task 1: Exploratory Data Analysis (EDA)
Lecture Material: Lecture 6 (slides 4–16), Lecture 7 (slides 3–9), Lecture 8 (slides 2–5)
- Load the hour.csv dataset. (Done)
- Examine the target variable (cnt) distribution and identify its skewness.
- Explore analytically the influence of temporal (hr, weekday, mnth, season), binary (holiday, workingday), and weather-related features (temp, atemp, hum, windspeed, weathersit) on cnt.
- Visualize relationships using, for example, scatterplots, boxplots, and line plots grouped by hour, day, and season, or any other analysis or plot you deem necessary.
- Identify any suspicious patterns, outliers, or anomalies.
- Consider dropping the columns instant, dteday, casual, and registered

In [4]:
df = pd.read_csv(path)

In [5]:
df.shape

(17379, 17)

In [6]:
df.head()

,instant,dteday,season,yr,mnth,hr,holiday,weekday,workingday,weathersit,temp,atemp,hum,windspeed,casual,registered,cnt
0,1,2011-01-01,1,0,1,0,0,6,0,1,0.24,0.2879,0.81,0.0,3,13,16
1,2,2011-01-01,1,0,1,1,0,6,0,1,0.22,0.2727,0.8,0.0,8,32,40
2,3,2011-01-01,1,0,1,2,0,6,0,1,0.22,0.2727,0.8,0.0,5,27,32
3,4,2011-01-01,1,0,1,3,0,6,0,1,0.24,0.2879,0.75,0.0,3,10,13
4,5,2011-01-01,1,0,1,4,0,6,0,1,0.24,0.2879,0.75,0.0,0,1,1


In [7]:
df.describe()

,instant,season,yr,mnth,hr,holiday,weekday,workingday,weathersit,temp,atemp,hum,windspeed,casual,registered,cnt
count,17379.0,17379.0,17379.0,17379.0,17379.0,17379.0,17379.0,17379.0,17379.0,17379.0,17379.0,17379.0,17379.0,17379.0,17379.0,17379.0
mean,8690.0,2.50164,0.502561,6.537775,11.546752,0.02877,3.003683,0.682721,1.425283,0.496987,0.475775,0.627229,0.190098,35.676218,153.786869,189.463088
std,5017.0295,1.106918,0.500008,3.438776,6.914405,0.167165,2.005771,0.465431,0.639357,0.192556,0.17185,0.19293,0.12234,49.30503,151.357286,181.387599
min,1.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.02,0.0,0.0,0.0,0.0,0.0,1.0
25%,4345.5,2.0,0.0,4.0,6.0,0.0,1.0,0.0,1.0,0.34,0.3333,0.48,0.1045,4.0,34.0,40.0
50%,8690.0,3.0,1.0,7.0,12.0,0.0,3.0,1.0,1.0,0.5,0.4848,0.63,0.194,17.0,115.0,142.0
75%,13034.5,3.0,1.0,10.0,18.0,0.0,5.0,1.0,2.0,0.66,0.6212,0.78,0.2537,48.0,220.0,281.0
max,17379.0,4.0,1.0,12.0,23.0,1.0,6.0,1.0,4.0,1.0,1.0,1.0,0.8507,367.0,886.0,977.0


In [8]:
df.dtypes

instant         int64
dteday         object
season          int64
yr              int64
mnth            int64
hr              int64
holiday         int64
weekday         int64
workingday      int64
weathersit      int64
temp          float64
atemp         float64
hum           float64
windspeed     float64
casual          int64
registered      int64
cnt             int64
dtype: object

 **Variables of the dataset**
 - list
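
The EDA steps listed for Task 1 could be sketched roughly as follows. This is a minimal illustration, not the notebook's actual analysis: the `toy` frame is a synthetic stand-in for `hour.csv` with made-up values (an assumption), so the printed numbers say nothing about the real data.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for hour.csv (assumption: same column names, made-up values).
rng = np.random.default_rng(0)
n = 24 * 60  # 60 days of hourly rows
hr = np.tile(np.arange(24), n // 24)
toy = pd.DataFrame({
    "instant": np.arange(1, n + 1),
    "hr": hr,
    "season": rng.integers(1, 5, size=n),
    "temp": rng.uniform(0.0, 1.0, size=n),
    # Counts with artificial peaks at hours 8 and 17, giving a long right tail.
    "cnt": rng.poisson(20 + 150 * np.isin(hr, [8, 17])) + 1,
})

# Skewness of the target: a positive value indicates a long right tail.
skew_raw = toy["cnt"].skew()
skew_log = np.log1p(toy["cnt"]).skew()
print(f"skew(cnt) = {skew_raw:.2f}, skew(log1p(cnt)) = {skew_log:.2f}")

# Mean demand per hour: the table behind a line plot grouped by hour.
hourly_mean = toy.groupby("hr")["cnt"].mean()
print(hourly_mean.sort_values(ascending=False).head(3))

# Pearson correlation between one weather feature and the target.
print(toy[["temp", "cnt"]].corr().loc["temp", "cnt"])

# Drop identifier columns (on the real data also dteday, casual, registered).
features = toy.drop(columns=["instant"])
```

On the real dataset the same calls would apply to `df`, and `hourly_mean.plot()` or `sns.boxplot(data=df, x="season", y="cnt")` would give the corresponding plots.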

# Task 2: Data Splitting
Lecture Material: Lecture 8 (slide 4), Lecture 12 (slides 4–5)

- Split the dataset into: Training set (60%), Validation set (20%), Test set (20%). If necessary, reevaluate these percentages when tuning the model.
- Use a random split while preserving temporal order if possible.
- Apply the split before performing any feature engineering or scaling to avoid leakage.
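
One way to read the splitting requirements is a chronological 60/20/20 split by row position, which keeps the temporal order intact. The sketch below assumes the rows are sorted by `instant`; the helper name `chronological_split` and the `toy` frame are hypothetical, introduced only for illustration.

```python
import numpy as np
import pandas as pd

def chronological_split(frame, train_frac=0.6, val_frac=0.2):
    """Split rows by position: first block train, next block validation, rest test."""
    n = len(frame)
    train_end = int(n * train_frac)
    val_end = train_end + int(n * val_frac)
    return frame.iloc[:train_end], frame.iloc[train_end:val_end], frame.iloc[val_end:]

# Synthetic stand-in for the hourly data (assumption: 'instant' orders the rows in time).
toy = pd.DataFrame({"instant": np.arange(1, 1001), "cnt": np.arange(1000) % 50})
toy = toy.sort_values("instant")

train, val, test = chronological_split(toy)
print(len(train), len(val), len(test))

# No validation or test row precedes a training row, so no future data leaks backwards.
print(train["instant"].max() < val["instant"].min() < test["instant"].min())
```

A shuffled alternative would be two calls to `sklearn.model_selection.train_test_split`, at the cost of mixing past and future hours across the three sets.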

# Task 3: Feature Engineering
Lecture Material: Lecture 9 (slides 2–8), Lecture 12 (slides 4–9)
- Encode cyclical features (hr, weekday) using sine and cosine transforms.
- One-hot encode categorical variables: season, weathersit, and mnth.
- Apply scaling (e.g., StandardScaler) to continuous features: temp, atemp, hum, and windspeed.
- Fit all transformations using only the training set, and apply them to validation and test sets.
- Consider interaction terms such as temp × humidity if they are justified by EDA.
- Remove leaky or redundant features (e.g., atemp if highly collinear with temp).
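
The transformations above might be combined as in the following sketch. It assumes a period of 24 for `hr` and 7 for `weekday`, uses synthetic data, and leaves out `mnth`, `atemp`, and `windspeed` for brevity; `add_cyclical` and `prepare` are hypothetical helper names. The encoder and scaler are fitted on the training rows only and then applied to the validation rows.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

def add_cyclical(frame, column, period):
    """Return a copy where a periodic integer column is replaced by sin/cos encodings."""
    out = frame.copy()
    angle = 2 * np.pi * out[column] / period
    out[f"{column}_sin"] = np.sin(angle)
    out[f"{column}_cos"] = np.cos(angle)
    return out.drop(columns=[column])

def prepare(frame):
    """Stateless step: cyclical encodings need no fitting, so they are safe on any split."""
    frame = add_cyclical(frame, "hr", 24)
    return add_cyclical(frame, "weekday", 7)

# Synthetic stand-in with a subset of the dataset's columns (values are made up).
rng = np.random.default_rng(0)
n = 500
toy = pd.DataFrame({
    "hr": rng.integers(0, 24, n),
    "weekday": rng.integers(0, 7, n),
    "season": rng.integers(1, 5, n),
    "weathersit": rng.integers(1, 5, n),
    "temp": rng.uniform(0, 1, n),
    "hum": rng.uniform(0, 1, n),
})
train, val = toy.iloc[:300], toy.iloc[300:]

pre = ColumnTransformer(
    [
        ("onehot", OneHotEncoder(handle_unknown="ignore"), ["season", "weathersit"]),
        ("scale", StandardScaler(), ["temp", "hum"]),
    ],
    remainder="passthrough",  # keep the sin/cos columns unchanged
    sparse_threshold=0,       # always return a dense array
)

X_train = pre.fit_transform(prepare(train))  # statistics learned from train only
X_val = pre.transform(prepare(val))          # reused, never refitted
print(X_train.shape, X_val.shape)
```

`handle_unknown="ignore"` is a defensive choice here: a category that appears only in the validation or test rows is encoded as all zeros instead of raising an error.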

# Task 4: Baseline Model – Linear Regression
Lecture Material: Lecture 9 (slides 4–7), Lecture 11 (slides 2–4)

- Train a Linear Regression model.
- Evaluate on the validation set using, at least:
    - Mean Squared Error (MSE)
    - Mean Absolute Error (MAE)
    - R² Score
- Plot at least the residuals and analyze their distribution.
- Reflect on bias and variance characteristics of this model.

Note: Make the training and evaluation of the models as uniform as possible to enable proper comparison. Use the same features to train all the models (before refinement and tuning) and use at least MSE, MAE, and R² Score.
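
A uniform evaluation routine makes the later comparisons easier. The sketch below assumes a hypothetical helper `evaluate` and synthetic, roughly linear data; it is meant to show the shape of the workflow, not results for the bike-sharing dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

def evaluate(model, X, y):
    """Return the three metrics required for every model in this notebook."""
    pred = model.predict(X)
    return {
        "MSE": mean_squared_error(y, pred),
        "MAE": mean_absolute_error(y, pred),
        "R2": r2_score(y, pred),
    }

# Synthetic data with a linear signal plus noise (made-up values).
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = 50 + 30 * X[:, 0] - 10 * X[:, 1] + rng.normal(scale=5, size=1000)
X_train, y_train = X[:600], y[:600]
X_val, y_val = X[600:800], y[600:800]

linreg = LinearRegression().fit(X_train, y_train)
val_metrics = evaluate(linreg, X_val, y_val)
print(val_metrics)

# Residuals on the validation set: their mean and spread summarize the error distribution.
residuals = y_val - linreg.predict(X_val)
print(f"residual mean = {residuals.mean():.3f}, std = {residuals.std():.3f}")
```

A histogram (`plt.hist(residuals, bins=30)`) or a scatter of residuals against predictions would cover the plotting requirement; a visible pattern in that scatter usually points to structure the linear model cannot capture, that is, high bias.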

# Task 5: Random Forest Regressor – Model Specification and Training
Lecture Material: Lecture 11 (slides 5–7), Lecture 12 (slides 4–5)
- Train a Random Forest Regressor.
- Use default or initial parameters (e.g., 100 trees, no depth limit) to establish a baseline.
- Evaluate using the same metrics as above.
- Compare with the baseline model and explain observed differences.
- Include at least a feature importance plot and comment on top predictors.
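
A baseline Random Forest with the initial parameters mentioned above (100 trees, no depth limit) could look like the following sketch. The data are synthetic, with an artificial rush-hour effect on `hr`, so the importance ranking only illustrates the mechanics.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Synthetic features and a made-up nonlinear target (peaks at hours 8 and 17).
rng = np.random.default_rng(0)
n = 1200
X = pd.DataFrame({
    "hr": rng.integers(0, 24, n),
    "temp": rng.uniform(0, 1, n),
    "windspeed": rng.uniform(0, 1, n),
})
y = 200 * X["hr"].isin([8, 17]) + 50 * X["temp"] + rng.normal(scale=10, size=n)

# Positional 60/20 split for train and validation; the remaining 20% is left as test.
X_train, y_train = X.iloc[:720], y.iloc[:720]
X_val, y_val = X.iloc[720:960], y.iloc[720:960]

forest = RandomForestRegressor(n_estimators=100, max_depth=None, random_state=0)
forest.fit(X_train, y_train)

pred = forest.predict(X_val)
forest_metrics = {
    "MSE": mean_squared_error(y_val, pred),
    "MAE": mean_absolute_error(y_val, pred),
    "R2": r2_score(y_val, pred),
}
print(forest_metrics)

# Impurity-based importances, sorted so the top predictors come first.
importances = pd.Series(forest.feature_importances_, index=X.columns)
importances = importances.sort_values(ascending=False)
print(importances)
```

`importances.plot.barh()` turns the series into the requested feature importance plot. Impurity-based importances can favor features with many distinct values, so `sklearn.inspection.permutation_importance` is a reasonable cross-check.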



# Task 6: Gradient Boosting Regressor – Model Specification and Training
Lecture Material: Lecture 12 (slides 4–7), Class 10 Training Notebook
- Train a Gradient Boosting Regressor (e.g., XGBoost or LightGBM).
- Use basic parameters to establish initial results.
- Plot at least residuals and compare performance with previous models.
- Note any early signs of overfitting or high variance.
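
The sketch below uses scikit-learn's `GradientBoostingRegressor` as a stand-in for XGBoost or LightGBM (a substitution made here to keep the example dependency-free, not the task's suggested library). It compares training and validation error and tracks validation error per boosting stage, which is one simple way to spot early overfitting. The data are synthetic.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error

# Synthetic features and a made-up nonlinear target.
rng = np.random.default_rng(0)
n = 1200
X = pd.DataFrame({
    "hr": rng.integers(0, 24, n),
    "temp": rng.uniform(0, 1, n),
    "hum": rng.uniform(0, 1, n),
})
y = 200 * X["hr"].isin([8, 17]) + 50 * X["temp"] + rng.normal(scale=10, size=n)
X_train, y_train = X.iloc[:720], y.iloc[:720]
X_val, y_val = X.iloc[720:960], y.iloc[720:960]

gbr = GradientBoostingRegressor(
    n_estimators=200, learning_rate=0.1, max_depth=3, random_state=0
)
gbr.fit(X_train, y_train)

# A large gap between these two numbers is a first sign of high variance.
train_mse = mean_squared_error(y_train, gbr.predict(X_train))
val_mse = mean_squared_error(y_val, gbr.predict(X_val))
print(f"train MSE = {train_mse:.1f}, validation MSE = {val_mse:.1f}")

# Validation error after each boosting stage: if the minimum occurs well before
# the last stage, the later trees are fitting noise.
val_curve = [mean_squared_error(y_val, p) for p in gbr.staged_predict(X_val)]
best_stage = int(np.argmin(val_curve)) + 1
print(f"lowest validation MSE at stage {best_stage} of {len(val_curve)}")
```

Residuals for the residual plot are `y_val - gbr.predict(X_val)`, computed the same way as for the earlier models so the plots stay comparable.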

# Task 7: Hyperparameter Tuning
Lecture Material: Lecture 12 (slides 6–9), Class 10 Training Notebook
- Tune the Random Forest Regressor:
    - Use Randomized Search CV with 5-fold cross-validation.
    - Tune the following hyperparameters: n_estimators, max_depth, min_samples_split, min_samples_leaf
    - Report: Best parameter combination, Validation performance, Updated feature importance

- Tune the Gradient Boosting Regressor:
    - Use Bayesian Optimization (e.g., via BayesSearchCV).
    - Tune the following hyperparameters: learning_rate, n_estimators, max_depth, subsample
    - Visualize convergence of the optimizer if possible.
    - Report: Best parameters, Cross-validated performance, Impact of tuning on generalization

- Explain whether tuning significantly improved performance or not, and hypothesize why (e.g., model variance, overfitting, flat loss surface, etc.).

Note: Compare pre- and post-tuning performance. Highlight overfitting, underfitting, or convergence issues. This task is part of a broader iterative loop — feel free to return to earlier tasks if the results are suboptimal.
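
The Random Forest part of the tuning task might be set up as in the sketch below. The search ranges, `n_iter=5`, and the synthetic data are assumptions chosen to keep the example fast; the Bayesian search for the Gradient Boosting model is not shown here.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import RandomizedSearchCV

# Synthetic features and a made-up nonlinear target.
rng = np.random.default_rng(0)
n = 600
X = pd.DataFrame({
    "hr": rng.integers(0, 24, n),
    "temp": rng.uniform(0, 1, n),
})
y = 200 * X["hr"].isin([8, 17]) + 50 * X["temp"] + rng.normal(scale=10, size=n)
X_train, y_train = X.iloc[:400], y.iloc[:400]
X_val, y_val = X.iloc[400:], y.iloc[400:]

# Hypothetical search space covering the four hyperparameters named in the task.
param_distributions = {
    "n_estimators": [20, 50, 100],
    "max_depth": [None, 5, 10],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
}

search = RandomizedSearchCV(
    RandomForestRegressor(random_state=0),
    param_distributions=param_distributions,
    n_iter=5,                          # number of sampled combinations
    cv=5,                              # 5-fold cross-validation on the training set
    scoring="neg_mean_squared_error",  # scikit-learn maximizes, hence the negated MSE
    random_state=0,
)
search.fit(X_train, y_train)

print("best parameters:", search.best_params_)
print("cross-validated MSE:", -search.best_score_)

# best_estimator_ is refitted on the full training set by default (refit=True).
tuned_val_mse = mean_squared_error(y_val, search.best_estimator_.predict(X_val))
print("validation MSE:", tuned_val_mse)
```

With an integer `cv`, scikit-learn uses unshuffled K-fold for regressors; passing `cv=TimeSeriesSplit(n_splits=5)` is an alternative if the folds should respect temporal order.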

# Task 8: Iterative Evaluation and Refinement
Lecture Material: Lecture 9 (slides 6–7), Lecture 11 (slides 3–4), Lecture 12 (slide 9)
- Based on model results, revisit EDA and feature engineering if needed. For example:
    - Do new interaction terms help?
    - Should you drop or transform certain features?
    - Are there outliers that are harming performance?

- Model tuning and evaluation (Tasks 4–8) are iterative. Based on performance, you may:
    - Revisit Task 3 (Feature Engineering)
    - Add new transformations
    - Adjust model complexity
- Document all iterations and reasoning thoroughly in the notebook.

Retrain, re-evaluate, and re-tune your models as needed.
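
One lightweight way to document iterations is to log the validation score of each feature set under a descriptive key. The sketch below checks whether a temp × humidity interaction helps a linear model; the target is synthetic and deliberately contains that interaction, so the improvement is built in and only illustrates the bookkeeping. The names `val_mse_for` and `iteration_log` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Synthetic data whose target depends on temp * hum by construction.
rng = np.random.default_rng(0)
n = 1000
toy = pd.DataFrame({
    "temp": rng.uniform(0, 1, n),
    "hum": rng.uniform(0, 1, n),
})
toy["cnt"] = 100 * toy["temp"] * toy["hum"] + rng.normal(scale=5, size=n)
toy["temp_x_hum"] = toy["temp"] * toy["hum"]  # candidate interaction term

train, val = toy.iloc[:600], toy.iloc[600:800]

def val_mse_for(columns):
    """Fit on the training rows with the given columns and return validation MSE."""
    model = LinearRegression().fit(train[columns], train["cnt"])
    return mean_squared_error(val["cnt"], model.predict(val[columns]))

# One entry per iteration, so every experiment stays traceable.
iteration_log = {
    "base (temp, hum)": val_mse_for(["temp", "hum"]),
    "with temp_x_hum": val_mse_for(["temp", "hum", "temp_x_hum"]),
}
for name, mse in iteration_log.items():
    print(f"{name}: validation MSE = {mse:.2f}")
```

Decisions about features should be made on the validation set only, leaving the test set for a single final evaluation.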