# 03 – Baseline Modeling (Classification & Regression)

## 1. Objective

In this notebook, we will:

- Load the **feature-only dataset** generated in `02_feature_engineering.ipynb`:
  - `data/processed/housing_with_features.csv`
- Prepare target variables for modeling:
  - `Good_Investment` (classification target)
  - `Future_Price_5Y` (regression target)
- Define the feature matrix `X` and target vectors.
- Split the data into training and test sets.
- Train baseline models for:
  - Classification (Good_Investment)
  - Regression (Future_Price_5Y)
- Evaluate models using standard metrics.
- Save the best-performing models to the canonical locations specified in `src.config`:
  - `models/tuned_classification_model.pkl`
  - `models/tuned_regression_model.pkl`
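
The feature/target separation listed above can be sketched as follows. This is a minimal illustration, not the project's actual code: the three rows are copied from the `df.head()` output shown later in this notebook, and the choice of which columns to drop (`LEAKAGE_COLS`, `ID_COLS`) is an assumption about what the training helper does.

```python
import pandas as pd

# Three rows copied from the df.head() output in this notebook (subset of columns)
df_demo = pd.DataFrame({
    "ID": [38684, 64940, 3955],
    "BHK": [4, 2, 3],
    "Size_in_SqFt": [692, 3094, 4993],
    "Price_in_Lakhs": [256.62, 86.04, 237.86],
    "Transport_Score": [1, 1, 2],
    "Effective_Growth_Rate": [0.10475, 0.082349, 0.105652],
    "Future_Price_5Y": [422.289677, 127.801938, 393.019329],
    "ROI": [0.645584, 0.485378, 0.652314],
    "Good_Investment": [1, 0, 1],
})

TARGET_CLF = "Good_Investment"
TARGET_REG = "Future_Price_5Y"
# Assumption: columns computed while building the targets would leak label
# information if they stayed in X, so they are dropped together with the ID.
LEAKAGE_COLS = ["Effective_Growth_Rate", "ROI"]
ID_COLS = ["ID"]

X = df_demo.drop(columns=[TARGET_CLF, TARGET_REG, *LEAKAGE_COLS, *ID_COLS])
y_clf = df_demo[TARGET_CLF]
y_reg = df_demo[TARGET_REG]

print(list(X.columns))
```

Keeping the target-derived columns out of `X` matters here because `ROI` and `Future_Price_5Y` are simple functions of the growth rate and the current price.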


In [2]:
import sys
from pathlib import Path
import logging

# --------------------------------------------------
# Locate project root (directory containing "src/")
# --------------------------------------------------
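PROJECT_ROOT = None
cwd = Path.cwd().resolve()
# Check the current directory first, then walk up through its parents;
# Path.parents alone would skip the working directory itself.
for candidate in (cwd, *cwd.parents):
    if (candidate / "src").is_dir():
        PROJECT_ROOT = candidate
        break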

if PROJECT_ROOT is None:
    raise RuntimeError("Project root with 'src/' directory not found.")

if str(PROJECT_ROOT) not in sys.path:
    sys.path.insert(0, str(PROJECT_ROOT))

# --------------------------------------------------
# Imports
# --------------------------------------------------
import pandas as pd
import numpy as np

from src.data.load import load_raw_data
from src.features.build_features import run_feature_pipeline, validate_features
from src.targets import generate_targets

# Display config
pd.set_option("display.max_columns", None)
pd.set_option("display.float_format", lambda x: f"{x:,.2f}")

print(f"✅ Project root detected at: {PROJECT_ROOT}")




In [3]:
# ===============================
# Load processed feature dataset
# ===============================

DATA_PATH = PROJECT_ROOT / "data" / "processed" / "housing_with_features.csv"

if not DATA_PATH.exists():
    raise FileNotFoundError(f"Processed dataset not found at: {DATA_PATH}")

df = pd.read_csv(DATA_PATH)

print("Dataset loaded successfully")
print("Shape:", df.shape)

df.head()


Dataset loaded successfully
Shape: (50000, 33)


| | ID | State | City | Locality | Property_Type | BHK | Size_in_SqFt | Price_in_Lakhs | Price_per_SqFt | Year_Built | ... | Furnished_Status_Enc | Availability_Status_Enc | Transport_Score | Security_Score | Investment_Score | Annual_Growth_Rate | Effective_Growth_Rate | Future_Price_5Y | ROI | Good_Investment |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 38684 | Haryana | Gurgaon | Locality_123 | Independent House | 4 | 692 | 256.62 | 0.370838 | 2022 | ... | 1 | 1 | 1 | 0 | 3.364008 | 0.063456 | 0.10475 | 422.289677 | 0.645584 | 1 |
| 1 | 64940 | Andhra Pradesh | Vishakhapatnam | Locality_74 | Apartment | 2 | 3094 | 86.04 | 0.027809 | 2015 | ... | 2 | 0 | 1 | 0 | 3.449898 | 0.0638 | 0.082349 | 127.801938 | 0.485378 | 0 |
| 2 | 3955 | Madhya Pradesh | Bhopal | Locality_486 | Apartment | 3 | 4993 | 237.86 | 0.047639 | 1995 | ... | 0 | 1 | 2 | 0 | 3.826176 | 0.065305 | 0.105652 | 393.019329 | 0.652314 | 1 |
| 3 | 120375 | Punjab | Ludhiana | Locality_13 | Villa | 1 | 2461 | 339.41 | 0.137915 | 2018 | ... | 1 | 0 | 0 | 0 | 2.460123 | 0.05984 | 0.12 | 598.156391 | 0.762342 | 1 |
| 4 | 172862 | Haryana | Faridabad | Locality_22 | Independent House | 2 | 4535 | 124.99 | 0.027561 | 1991 | ... | 1 | 0 | 1 | 0 | 2.235174 | 0.058941 | 0.061643 | 168.565485 | 0.348632 | 0 |


In [4]:
# ===============================
# Optional sampling for faster iteration
# ===============================

SAMPLE_SIZE = 50_000
RANDOM_STATE = 42

if len(df) > SAMPLE_SIZE:
    df_small = df.sample(n=SAMPLE_SIZE, random_state=RANDOM_STATE).reset_index(drop=True)
    print(f"Using sample of {len(df_small)} rows out of {len(df)} total rows.")
else:
    df_small = df.copy()
    print(f"Dataset has only {len(df_small)} rows, using full data.")

df_small.head()


Dataset has only 50000 rows, using full data.


| | ID | State | City | Locality | Property_Type | BHK | Size_in_SqFt | Price_in_Lakhs | Price_per_SqFt | Year_Built | ... | Furnished_Status_Enc | Availability_Status_Enc | Transport_Score | Security_Score | Investment_Score | Annual_Growth_Rate | Effective_Growth_Rate | Future_Price_5Y | ROI | Good_Investment |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 38684 | Haryana | Gurgaon | Locality_123 | Independent House | 4 | 692 | 256.62 | 0.370838 | 2022 | ... | 1 | 1 | 1 | 0 | 3.364008 | 0.063456 | 0.10475 | 422.289677 | 0.645584 | 1 |
| 1 | 64940 | Andhra Pradesh | Vishakhapatnam | Locality_74 | Apartment | 2 | 3094 | 86.04 | 0.027809 | 2015 | ... | 2 | 0 | 1 | 0 | 3.449898 | 0.0638 | 0.082349 | 127.801938 | 0.485378 | 0 |
| 2 | 3955 | Madhya Pradesh | Bhopal | Locality_486 | Apartment | 3 | 4993 | 237.86 | 0.047639 | 1995 | ... | 0 | 1 | 2 | 0 | 3.826176 | 0.065305 | 0.105652 | 393.019329 | 0.652314 | 1 |
| 3 | 120375 | Punjab | Ludhiana | Locality_13 | Villa | 1 | 2461 | 339.41 | 0.137915 | 2018 | ... | 1 | 0 | 0 | 0 | 2.460123 | 0.05984 | 0.12 | 598.156391 | 0.762342 | 1 |
| 4 | 172862 | Haryana | Faridabad | Locality_22 | Independent House | 2 | 4535 | 124.99 | 0.027561 | 1991 | ... | 1 | 0 | 1 | 0 | 2.235174 | 0.058941 | 0.061643 | 168.565485 | 0.348632 | 0 |


In [5]:
# ===============================
# Target Variable Creation (Safely centralized)
# ===============================

YEARS = 5

df_small = df_small.copy()

# Import the centralized target generator
from src.targets import generate_targets

# Generate leakage-safe targets (reproducible using RANDOM_STATE)
# NOTE: generate_targets does NOT use price transforms like Price_per_SqFt as
# predictors for growth; it uses transport/security/age/investment signals and
# adds Gaussian noise for realism.

df_small = generate_targets(
    df_small,
    years=YEARS,
    base_growth=0.05,
    noise_std=0.03,
    random_state=RANDOM_STATE,
    growth_clip=(0.03, 0.12),
    investment_quantile=0.65,
)

# Verify distribution
class_distribution = df_small["Good_Investment"].value_counts()
print("Target distribution:")
print(class_distribution)

if len(class_distribution) < 2:
    raise ValueError("Single class detected! Adjust `investment_quantile` or provide more diverse data.")

print("Targets created successfully")
print(f"Growth rate range: {df_small['Effective_Growth_Rate'].min():.3f} to {df_small['Effective_Growth_Rate'].max():.3f}")
print(f"Future price range: {df_small['Future_Price_5Y'].min():.2f} to {df_small['Future_Price_5Y'].max():.2f} lakhs")


Target distribution:
Good_Investment
0    32500
1    17500
Name: count, dtype: int64
Targets created successfully
Growth rate range: 0.030 to 0.120
Future price range: 11.65 to 881.01 lakhs


In [6]:
# Persist processed dataset including targets for modeling and the Streamlit app
PROCESSED_PATH = PROJECT_ROOT / "data" / "processed" / "housing_with_features.csv"

# Overwrite intentionally: the modeling pipeline requires targets to be present for
# training and the Streamlit app expects this file as a single source of truth.
df_small.to_csv(PROCESSED_PATH, index=False)
print("Saved dataset with targets to:", PROCESSED_PATH)


Saved dataset with targets to: D:\Labmentix\2nd Project\Real_Estate_Investment_Advisor\data\processed\housing_with_features.csv



## Target construction assumptions & limitations

This dataset uses *synthetic* / rule-based targets created for demonstration and
training purposes. Key assumptions:

- Future price is modeled using a **base macro growth rate** (default 5%) plus
  signals from local attributes: Transport_Score, Security_Score, Age_of_Property,
  and Investment_Score (when available).
- To prevent deterministic inversion (and thus data leakage), we **add
  Gaussian noise** to the growth rate and **clip** growth within reasonable
  bounds (default 3%–12% annually).
- `Good_Investment` is a **relative** label based on ROI quantiles (top ~35% by
  default). It is not an absolute financial recommendation.

Limitations:
- These targets are synthetic and do not replace true time-series or market
  forecasting data. Treat them as illustrative: they teach model training,
  evaluation, and deployment practice but should be replaced with real
  historical forward-looking labels for production use.

If you accept these assumptions, the next step is to train models using these
targets and persist the final datasets and models for reproducible inference.
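
The assumptions above can be sketched in a few lines. This is an illustrative stand-in and not the project's `generate_targets`: the function name `sketch_targets`, the single `Transport_Score` signal, and its `0.005` weight are hypothetical. The compounding formula is grounded in the data: the first row of the `df.head()` output has a price of 256.62 lakhs and an effective growth rate of 0.10475, and `256.62 * (1 + 0.10475) ** 5` is approximately the 422.29 lakhs shown in `Future_Price_5Y`.

```python
import numpy as np
import pandas as pd


def sketch_targets(df, years=5, base_growth=0.05, noise_std=0.03,
                   growth_clip=(0.03, 0.12), investment_quantile=0.65,
                   random_state=42):
    """Illustrative stand-in; NOT the project's generate_targets."""
    rng = np.random.default_rng(random_state)
    out = df.copy()
    # Hypothetical signal weight: the real weighting lives in src.targets
    signal = 0.005 * out["Transport_Score"]
    noise = rng.normal(0.0, noise_std, size=len(out))
    # Base rate + local signal + Gaussian noise, clipped to the allowed band
    growth = (base_growth + signal + noise).clip(*growth_clip)
    out["Effective_Growth_Rate"] = growth
    # Compound the current price over the horizon
    out["Future_Price_5Y"] = out["Price_in_Lakhs"] * (1.0 + growth) ** years
    out["ROI"] = out["Future_Price_5Y"] / out["Price_in_Lakhs"] - 1.0
    # Relative label: rows above the ROI quantile are "good" investments
    threshold = out["ROI"].quantile(investment_quantile)
    out["Good_Investment"] = (out["ROI"] > threshold).astype(int)
    return out


demo = pd.DataFrame({
    "Price_in_Lakhs": np.linspace(50.0, 500.0, 200),
    "Transport_Score": np.tile([0, 1, 2, 3], 50),
})
labelled = sketch_targets(demo)
print(labelled["Good_Investment"].value_counts())

# Consistency check against the first row of df.head()
row0_future = 256.62 * (1 + 0.10475) ** 5
```

Because the label is defined by a quantile of ROI, the positive share stays near 35% regardless of how strong the underlying growth is, which is why it should be read as a relative ranking.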


## 4. Baseline Classification & Regression Models


In [7]:
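# train_baseline_models is the project's training helper; it must be imported
# before this cell runs (the import is not shown in this notebook).
clf_models, reg_models, clf_metrics, reg_metrics = train_baseline_models(
    df_small,
    model_dir="../models",
)
print("Baseline training finished")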


Baseline training finished
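
The internals of `train_baseline_models` are not shown in this notebook. As a rough sketch of what a baseline routine of this kind typically does (split once, fit one simple model per task, score on the held-out part), the following uses scikit-learn on synthetic stand-in data. The data, the 80/20 split, and the model choices are assumptions for illustration only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.metrics import f1_score, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (hypothetical; not the housing dataset)
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 4))
y_reg = X @ np.array([3.0, -2.0, 0.5, 0.0]) + rng.normal(scale=0.1, size=500)
# Label the top 35% as positive, mirroring the quantile-based Good_Investment rule
y_clf = (y_reg > np.quantile(y_reg, 0.65)).astype(int)

# One split shared by both tasks; stratify keeps the class ratio in both parts
X_train, X_test, yc_train, yc_test, yr_train, yr_test = train_test_split(
    X, y_clf, y_reg, test_size=0.2, random_state=42, stratify=y_clf
)

clf = LogisticRegression(max_iter=1000).fit(X_train, yc_train)
reg = LinearRegression().fit(X_train, yr_train)

clf_f1 = f1_score(yc_test, clf.predict(X_test))
reg_r2 = r2_score(yr_test, reg.predict(X_test))
print(f"F1: {clf_f1:.3f}  R2: {reg_r2:.3f}")
```

Stratifying on the classification target keeps the 65/35 class ratio identical in the training and test parts, which makes F1 comparisons between models more stable.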


In [10]:
# -------------------------------
# Show metrics
# -------------------------------
# Build each metrics table once: one row per model, one column per metric
clf_metrics_df = pd.DataFrame(clf_metrics).T
reg_metrics_df = pd.DataFrame(reg_metrics).T

print("Classification metrics:")
display(clf_metrics_df)

print("Regression metrics:")
display(reg_metrics_df)

# -------------------------------
# Find best models (for display only)
# -------------------------------
best_clf_name = clf_metrics_df["f1"].idxmax()  # higher F1 is better
best_reg_name = reg_metrics_df["r2"].idxmax()  # higher R2 is better

print("Best classification model (by F1):", best_clf_name)
print("Best regression model (by R2):", best_reg_name)


Classification metrics:


| Model | accuracy | precision | recall | f1 | roc_auc |
|---|---|---|---|---|---|
| logistic_regression | 0.6663 | 0.543935 | 0.288286 | 0.376844 | 0.657748 |
| random_forest_classifier | 0.6722 | 0.565914 | 0.272286 | 0.36767 | 0.664822 |
| xgboost_classifier | 0.671 | 0.558269 | 0.287429 | 0.379479 | 0.66668 |


Regression metrics:


| Model | mae | rmse | r2 |
|---|---|---|---|
| linear_regression | 38.926643 | 51.396206 | 0.941859 |
| random_forest_regressor | 37.255838 | 50.751262 | 0.943309 |
| xgboost_regressor | 37.16292 | 50.447544 | 0.943986 |


Best classification model (by F1): xgboost_classifier
Best regression model (by R2): xgboost_regressor
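
To see how the classification metrics relate to each other, the short check below rebuilds the `logistic_regression` row from confusion-matrix counts. The counts are inferred, not reported: they assume a 10,000-row test set with 3,500 positives, which a 20% stratified split of the 50,000 rows would give, but the split itself is not shown in this notebook.

```python
# Counts inferred from the logistic_regression row above, assuming a
# 10,000-row test set with 3,500 positives (not stated in the notebook).
tp, fp, fn, tn = 1009, 846, 2491, 5654
total = tp + fp + fn + tn

precision = tp / (tp + fp)   # share of predicted positives that are correct
recall = tp / (tp + fn)      # share of actual positives that are found
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / total
majority_baseline = (tn + fp) / total  # accuracy of always predicting class 0

print(f"precision={precision:.6f} recall={recall:.6f} f1={f1:.6f}")
print(f"accuracy={accuracy:.4f} majority_baseline={majority_baseline:.4f}")
```

Under that assumption, the accuracy of about 0.67 sits only slightly above the 0.65 obtained by always predicting class 0, and recall below 0.29 means most good investments are missed. That is why F1, rather than accuracy, is the more informative selection metric here.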


## 5. Save Best Models

In [9]:
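MODELS_DIR = Path("..") / "models"

# BEST_CLASSIFIER and BEST_REGRESSOR are file-name constants that must be
# defined before this cell runs (presumably in src.config; the import is not
# shown in this notebook).
best_clf_path = MODELS_DIR / BEST_CLASSIFIER
best_reg_path = MODELS_DIR / BEST_REGRESSOR

# Confirm that training wrote both model files to disk (True means the file exists)
print("Best classification model :", best_clf_path.exists())
print("Best regression model :", best_reg_path.exists())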

Best classification model : True
Best regression model : True
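
The cell above only checks that the files exist. A fuller round-trip check (save, reload, compare predictions) might look like the sketch below. Whether the project serializes with `joblib` or `pickle` is not shown in this notebook, so `joblib` is an assumption, and the file name `demo_regression_model.pkl` is hypothetical.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression

# Tiny stand-in model (hypothetical; not one of the notebook's trained models)
X = np.arange(10, dtype=float).reshape(-1, 1)
y = 2.0 * X.ravel() + 1.0
model = LinearRegression().fit(X, y)

with tempfile.TemporaryDirectory() as tmp:
    model_path = Path(tmp) / "demo_regression_model.pkl"  # hypothetical name
    joblib.dump(model, model_path)        # serialize the fitted estimator
    file_existed = model_path.exists()
    restored = joblib.load(model_path)    # reload it as a new object

# The reloaded model should reproduce the original predictions
same_predictions = np.allclose(model.predict(X), restored.predict(X))
print("File written:", file_existed, "| predictions match:", same_predictions)
```

Comparing predictions after reloading catches problems that an existence check cannot, such as a truncated file or a library version mismatch.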


## 6. Summary & Next Steps

### 🔍 Classification (Good_Investment)

- Trained baseline models:
  - **Logistic Regression**
  - **Random Forest Classifier**
  - **XGBoost Classifier**
- Best model selected based on **F1-score**.
- Final classification model saved to:
  - `models/best_classification_model.pkl`

---

### 📈 Regression (Future_Price_5Y)

- Trained baseline models:
  - **Linear Regression**
  - **Random Forest Regressor**
  - **XGBoost Regressor**
- Best model selected based on **R² score**.
- Final regression model saved to:
  - `models/best_regression_model.pkl`

---

### 🚀 Next Steps

- Perform **hyperparameter tuning** for Random Forest and XGBoost to improve model performance.
- Integrate the saved models into:
  - `src/models/predict.py`
  - `Streamlit_app.py`
- Build and deploy the **Real Estate Investment Advisor** Streamlit web application.
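
As a starting point for the tuning step, a randomized search over a small Random Forest grid could look like the sketch below. The synthetic data, the parameter ranges, and the use of `class_weight="balanced"` are all assumptions for illustration; none of them come from the project's code.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in with a 65/35 class balance similar to Good_Investment
X, y = make_classification(
    n_samples=300, n_features=8, weights=[0.65, 0.35], random_state=42
)

# Hypothetical search space; real ranges should be chosen for the housing data
param_distributions = {
    "n_estimators": [50, 100],
    "max_depth": [3, 5, None],
    "min_samples_leaf": [1, 5],
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42, class_weight="balanced"),
    param_distributions=param_distributions,
    n_iter=4,          # number of sampled parameter combinations
    scoring="f1",      # same selection metric as the baseline comparison
    cv=3,
    random_state=42,
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

Scoring the search on F1 keeps the tuned models comparable with the baseline table, and class weighting is one option worth trying given the low baseline recall.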
