## Baseline Model Training – Notebook Overview

This notebook serves as a starting point for evaluating predictive performance using two simple regression models:  
- **Linear Regression** (as a statistical baseline)  
- **Random Forest Regressor** (as a basic tree-based baseline)

The goal is to establish reference scores for comparison with more advanced models (e.g., XGBoost, CatBoost) later in the pipeline.

### Dataset Splitting Strategy
The dataset is first split into **features (`X`)** and **target (`y`)**, then partitioned into:
- **Training set (80%)** – used to train the model  
- **Testing set (20%)** – held out to evaluate generalization

This split is performed using `train_test_split` from `sklearn.model_selection` with a fixed `random_state` to ensure reproducibility.

### Models Trained
1. **Linear Regression**  
   - A simple model that assumes a linear relationship between inputs and the target.
   - Useful to detect underfitting or feature quality issues.

2. **Random Forest Regressor**  
   - An ensemble of decision trees that often outperforms linear models when the data contain nonlinear patterns or feature interactions.
   - Provides a stronger, but still untuned, performance baseline.
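
As a rough illustration of why both baselines are useful, the sketch below fits the two models on synthetic data with a purely quadratic target, where a straight line cannot fit well. The data and variable names are assumptions for illustration only; they are not this project's dataset or settings.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic, illustrative data: y depends on x quadratically (nonlinear)
rng = np.random.default_rng(42)
X_demo = rng.uniform(-3, 3, size=(500, 1))
y_demo = X_demo[:, 0] ** 2 + rng.normal(0, 0.1, size=500)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Default hyperparameters; random_state only fixes the forest's randomness
demo_models = {
    "Linear Regression": LinearRegression(),
    "Random Forest": RandomForestRegressor(random_state=42),
}

demo_scores = {}
for name, model in demo_models.items():
    model.fit(X_tr, y_tr)
    demo_scores[name] = r2_score(y_te, model.predict(X_te))
    print(f"{name}: R2 = {demo_scores[name]:.3f}")
```

On data like this the forest should score far higher than the linear fit. On data that is close to linear, the gap can shrink or reverse, which is why both baselines are reported.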

### Evaluation Metrics
Both models are evaluated using:
- **Mean Absolute Error (MAE)**
- **Root Mean Squared Error (RMSE)**
- **R<sup>2</sup> Score (Coefficient of Determination)**

These metrics give an initial understanding of how well the models fit the data.  
All future models in the pipeline should **outperform these baselines** to justify their added complexity.
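
To make the three metrics concrete, here is a small worked example that computes them by hand with NumPy. The three price pairs are made up for illustration and do not come from the dataset.

```python
import numpy as np

# Made-up prices for illustration only (not from the dataset)
y_true = np.array([300_000.0, 400_000.0, 500_000.0])
y_hat = np.array([310_000.0, 380_000.0, 520_000.0])

errors = y_true - y_hat

mae = np.mean(np.abs(errors))                    # average absolute error
rmse = np.sqrt(np.mean(errors ** 2))             # penalizes large errors more
ss_res = np.sum(errors ** 2)                     # variation left unexplained
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation around the mean
r2 = 1.0 - ss_res / ss_tot

print(f"MAE  : {mae:.2f}")
print(f"RMSE : {rmse:.2f}")
print(f"R2   : {r2:.3f}")
```

Here MAE is about 16,667 and RMSE about 17,321, both in the same units as the target, while R<sup>2</sup> is 0.955 and unitless. RMSE exceeds MAE because squaring gives the two larger errors more weight.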

---


# Load the preprocessed ML file


In [1]:
import sys, os

# Add the project root to the Python path
project_root = os.path.abspath("../..")
sys.path.append(project_root)

# Third-party imports
import pandas as pd
from sklearn.model_selection import train_test_split

# Imports from local modules (resolved via project_root added above)
from utils.data_cleaner import DataCleaner
from utils.data_loader import DataLoader
from utils.constants import ML_READY_DATA_FILE

# Load the dataset
loader = DataLoader(ML_READY_DATA_FILE)
df = loader.load_data()

df.head(10)


Unnamed: 0,bedroomCount,bathroomCount,postCode,habitableSurface,buildingConstructionYear,facedeCount,toiletCount,log_price,is_big_property,room_count,...,epcScore_A++,epcScore_B,epcScore_C,epcScore_D,epcScore_E,epcScore_F,epcScore_G,hasLivingRoom,hasTerrace,price
0,2.0,1.0,1040.0,100.0,2004.0,1.0,1.0,12.896719,0.0,3.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,1.0,399000.0
1,2.0,1.0,1040.0,87.0,1970.0,2.0,1.0,13.049795,0.0,3.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,465000.0
2,1.0,1.0,1040.0,71.0,1906.0,2.0,1.0,12.574185,0.0,2.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,289000.0
3,2.0,1.0,1040.0,90.0,1958.0,2.0,1.0,12.834684,0.0,3.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,1.0,375000.0
4,1.0,1.0,1040.0,93.0,1947.0,2.0,1.0,12.601491,0.0,2.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,297000.0
5,2.0,1.0,1040.0,120.0,1932.0,2.0,1.0,12.983104,0.0,3.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,435000.0
6,3.0,1.0,1040.0,119.0,1944.0,2.0,1.0,12.821261,0.0,4.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,1.0,370000.0
7,2.0,1.0,1040.0,137.0,2024.0,2.0,1.0,13.507627,0.0,3.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,735000.0
8,2.0,1.0,1040.0,110.0,1946.0,2.0,1.0,12.793862,0.0,3.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,1.0,360000.0
9,3.0,1.0,1040.0,100.0,2014.0,2.0,1.0,13.01478,0.0,4.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,449000.0


# Dataset Splitting Strategy

In this step, we split the full dataset into:
- **Features (`X`)**: all columns except the target variable (`price`)
- **Target (`y`)**: the column we want to predict (here, `price`)

We use `train_test_split` from `sklearn.model_selection` to divide the data into:
- **Training set (80%)** – used to train the model
- **Testing set (20%)** – used to evaluate performance on unseen data

Setting `random_state=42` makes the split reproducible.  
Shapes of the resulting datasets are displayed for quick verification.


In [2]:
# Define features and target
# Replace 'price' with your actual target column name if different
X = df.drop(columns=["price"])
y = df["price"]

# Split into training and testing sets (80% / 20%)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Show shape for verification
print(f"X_train shape: {X_train.shape}")
print(f"X_test shape: {X_test.shape}")
print(f"y_train shape: {y_train.shape}")
print(f"y_test shape: {y_test.shape}")

X_train shape: (15210, 2837)
X_test shape: (3803, 2837)
y_train shape: (15210,)
y_test shape: (3803,)
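
One thing worth checking at this step: the `df.head(10)` preview includes a `log_price` column, and `X = df.drop(columns=["price"])` leaves it among the features. The sketch below uses three rows copied from that preview to test whether `log_price` is simply the natural log of `price`. If it is, the column leaks the target and would inflate any score computed from this split. The `leaky_columns` list and the tolerance are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Three rows copied from the df.head(10) preview (subset of columns)
preview = pd.DataFrame(
    {
        "habitableSurface": [100.0, 87.0, 71.0],
        "log_price": [12.896719, 13.049795, 12.574185],
        "price": [399000.0, 465000.0, 289000.0],
    }
)

# Does log_price reproduce ln(price)? The tolerance absorbs display rounding.
is_log_of_target = np.allclose(
    np.log(preview["price"]), preview["log_price"], atol=1e-4
)
print(f"log_price matches ln(price): {is_log_of_target}")

# Hypothetical fix: exclude target-derived columns from the features
leaky_columns = ["log_price"]
X_clean = preview.drop(columns=["price"] + leaky_columns)
y_clean = preview["price"]
print(list(X_clean.columns))
```

If the check also holds on the full data, re-running the split and the models without `log_price` should give a more honest baseline. The shapes and scores recorded in this notebook were produced with that column included.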


# Linear Regression Model – Training & Evaluation

In this step, we train a **Linear Regression** model on the training set to establish a simple statistical baseline.

The process includes:
- **Model initialization and training** using `LinearRegression()` from scikit-learn.
- **Prediction** on the test set to evaluate generalization performance.
- **Evaluation** using:
  - Mean Absolute Error (MAE)
  - Root Mean Squared Error (RMSE)
  - R-squared score (R<sup>2</sup>)

This baseline helps determine whether future machine learning models (e.g., Random Forest, XGBoost) provide significant improvements.


In [3]:
# Import Linear Regression and evaluation metrics
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import numpy as np

# Initialize and train the Linear Regression model
lr_model = LinearRegression()
lr_model.fit(X_train, y_train)

# Predict on test set
y_pred_lr = lr_model.predict(X_test)

# Evaluate performance
mae_lr = mean_absolute_error(y_test, y_pred_lr)
rmse_lr = np.sqrt(mean_squared_error(y_test, y_pred_lr))
r2_lr = r2_score(y_test, y_pred_lr)

# Display results
print("Linear Regression Performance:")
print(f"MAE  : {mae_lr:.2f}")
print(f"RMSE : {rmse_lr:.2f}")
print(f"R²   : {r2_lr:.2f}")


Linear Regression Performance:
MAE  : 29582.34
RMSE : 42122.14
R²   : 0.93


In [4]:

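# Visualizations using ModelVisualizer
# NOTE: ModelVisualizer is never imported in this notebook, which is why this
# cell raises the NameError shown below. Import it from the project module
# that defines it before running this cell.
visualizer_lr = ModelVisualizer(
    model=lr_model, X=X_test, y=y_test, model_name="Linear Regression"
)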
visualizer_lr.plot_predicted_vs_actual()
visualizer_lr.plot_residuals_vs_predicted()
visualizer_lr.plot_residual_distribution()
visualizer_lr.plot_price_range_residuals()

NameError: name 'ModelVisualizer' is not defined
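
The cell above fails because `ModelVisualizer` is not defined in the notebook's namespace. Until that import is restored, similar diagnostics can be drawn directly with matplotlib. The helper below is a minimal stand-in, not the project's class: the function name `plot_regression_diagnostics`, its three-panel layout, and the synthetic demo data are all assumptions.

```python
import matplotlib.pyplot as plt
import numpy as np


def plot_regression_diagnostics(y_true, y_pred, model_name="Model"):
    """Draw predicted-vs-actual, residuals-vs-predicted, and a residual histogram."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred

    fig, axes = plt.subplots(1, 3, figsize=(15, 4))

    # Predicted vs. actual: points on the dashed diagonal are perfect predictions
    axes[0].scatter(y_true, y_pred, alpha=0.4, s=10)
    lims = [min(y_true.min(), y_pred.min()), max(y_true.max(), y_pred.max())]
    axes[0].plot(lims, lims, linestyle="--", color="black")
    axes[0].set(xlabel="Actual", ylabel="Predicted",
                title=f"{model_name}: predicted vs. actual")

    # Residuals vs. predicted: a funnel or curve hints at a misspecified model
    axes[1].scatter(y_pred, residuals, alpha=0.4, s=10)
    axes[1].axhline(0.0, linestyle="--", color="black")
    axes[1].set(xlabel="Predicted", ylabel="Residual (actual - predicted)",
                title="Residuals vs. predicted")

    # Residual distribution: ideally centered on zero and roughly symmetric
    axes[2].hist(residuals, bins=30)
    axes[2].set(xlabel="Residual", ylabel="Count", title="Residual distribution")

    fig.tight_layout()
    return fig, residuals


# Synthetic demo data; in the notebook, pass y_test and y_pred_lr instead
rng = np.random.default_rng(0)
actual = rng.uniform(100_000, 800_000, size=200)
predicted = actual + rng.normal(0, 40_000, size=200)
fig, residuals = plot_regression_diagnostics(actual, predicted, "Linear Regression")
```

In the notebook itself the call would be `plot_regression_diagnostics(y_test, y_pred_lr, "Linear Regression")`. The per-price-range view suggested by `plot_price_range_residuals` is not reproduced here because its definition is not shown.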