# 04 – Model Tuning & Selection (Classification + Regression)

## 🎯 Objective

In this notebook, we perform **hyperparameter tuning and model selection** for both
classification and regression tasks using the processed real estate dataset.

---

## 🔍 Tuned Classification – *Good_Investment*

We train and tune the following classification models using **RandomizedSearchCV**:

- **Random Forest Classifier**
- **XGBoost Classifier**

### Evaluation Strategy
- Primary metric: **F1-score**
- Cross-validation during the search (`cv=3` in this notebook), so scores do not hinge on a single train/validation split
- Best model selected based on highest validation F1-score
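
As a rough illustration of the strategy above, the sketch below runs `RandomizedSearchCV` with `scoring="f1"` and 3-fold cross-validation on a synthetic dataset. The data, the search ranges, and the small `n_iter` are assumptions chosen to keep the example quick; they are not the notebook's actual settings for the real estate data.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic stand-in for the feature matrix and the Good_Investment target.
X, y = make_classification(n_samples=400, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Illustrative search space (assumption: the real ranges are not shown here).
param_distributions = {
    "n_estimators": randint(20, 60),
    "max_depth": randint(2, 8),
    "min_samples_leaf": randint(1, 5),
}

search = RandomizedSearchCV(
    estimator=RandomForestClassifier(random_state=42),
    param_distributions=param_distributions,
    n_iter=5,        # number of randomly sampled configurations
    scoring="f1",    # primary metric
    cv=3,            # 3-fold cross-validation on the training split
    random_state=42,
)
search.fit(X_train, y_train)

best_params = search.best_params_
cv_f1 = search.best_score_  # mean cross-validated F1 of the best configuration
test_f1 = f1_score(y_test, search.best_estimator_.predict(X_test))

print(f"Best params: {best_params}")
print(f"CV F1: {cv_f1:.3f} | test F1: {test_f1:.3f}")
```

The held-out test split is scored only once, after the search has finished, so it plays no part in choosing hyperparameters.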

### Output
The best-performing classification model is saved to `models/tuned_classification_model.pkl`.


# 2. Imports & configuration

In [1]:
# ======================================================
# Project Setup (Single Source of Truth)
# ======================================================

import sys
from pathlib import Path
import logging
import pandas as pd

# ------------------------------------------------------
# Locate project root (directory containing 'src/')
# ------------------------------------------------------
PROJECT_ROOT = None
cwd = Path.cwd().resolve()
# Check the current directory first, then each ancestor
for candidate in [cwd, *cwd.parents]:
    if (candidate / "src").is_dir():
        PROJECT_ROOT = candidate
        break

if PROJECT_ROOT is None:
    raise RuntimeError("Project root with 'src/' directory not found")

# Add project root to PYTHONPATH once
if str(PROJECT_ROOT) not in sys.path:
    sys.path.insert(0, str(PROJECT_ROOT))

print(f"✅ Project root set to: {PROJECT_ROOT}")

# ------------------------------------------------------
# Imports from project
# ------------------------------------------------------
from src.features.build_features import validate_features
from src.data.load import load_raw_data
from src.models import set_mlflow_experiment, log_requirements
from src.models import train_baseline_models

# ------------------------------------------------------
# Load processed dataset
# ------------------------------------------------------
PROCESSED_PATH = PROJECT_ROOT / "data" / "processed" / "housing_with_features.csv"

if not PROCESSED_PATH.exists():
    raise FileNotFoundError(
        "Processed dataset not found. Run feature engineering first."
    )

df = pd.read_csv(PROCESSED_PATH)

# Validate dataset integrity
validate_features(df, require_targets=False)

print("✅ Dataset loaded:", df.shape)

# ------------------------------------------------------
# MLflow setup
# ------------------------------------------------------
import mlflow
logging.getLogger("mlflow.models.model").setLevel(logging.ERROR)

set_mlflow_experiment("real_estate_investment")
log_requirements()


✅ Project root set to: D:\Labmentix\2nd Project\Real_Estate_Investment_Advisor
✅ Dataset loaded: (50000, 33)



# 3. Load processed dataset


In [2]:
DATA_PATH = PROJECT_ROOT / "data" / "processed" / "housing_with_features.csv"

try:
    df = pd.read_csv(DATA_PATH)
    print(f"✔ Processed Data loaded successfully: {df.shape[0]} rows, {df.shape[1]} columns")
    display(df.head())
except FileNotFoundError:
    print("❌ ERROR: Dataset not found. Check file path.")
    raise  # stop here; later cells cannot run without the dataset


✔ Processed Data loaded successfully: 50000 rows, 33 columns


| Unnamed: 0 | ID | State | City | Locality | Property_Type | BHK | Size_in_SqFt | Price_in_Lakhs | Price_per_SqFt | Year_Built | ... | Furnished_Status_Enc | Availability_Status_Enc | Transport_Score | Security_Score | Investment_Score | Annual_Growth_Rate | Effective_Growth_Rate | Future_Price_5Y | ROI | Good_Investment |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 38684 | Haryana | Gurgaon | Locality_123 | Independent House | 4 | 692 | 256.62 | 0.370838 | 2022 | ... | 1 | 1 | 1 | 0 | 3.364008 | 0.063456 | 0.10475 | 422.289677 | 0.645584 | 1 |
| 1 | 64940 | Andhra Pradesh | Vishakhapatnam | Locality_74 | Apartment | 2 | 3094 | 86.04 | 0.027809 | 2015 | ... | 2 | 0 | 1 | 0 | 3.449898 | 0.0638 | 0.082349 | 127.801938 | 0.485378 | 0 |
| 2 | 3955 | Madhya Pradesh | Bhopal | Locality_486 | Apartment | 3 | 4993 | 237.86 | 0.047639 | 1995 | ... | 0 | 1 | 2 | 0 | 3.826176 | 0.065305 | 0.105652 | 393.019329 | 0.652314 | 1 |
| 3 | 120375 | Punjab | Ludhiana | Locality_13 | Villa | 1 | 2461 | 339.41 | 0.137915 | 2018 | ... | 1 | 0 | 0 | 0 | 2.460123 | 0.05984 | 0.12 | 598.156391 | 0.762342 | 1 |
| 4 | 172862 | Haryana | Faridabad | Locality_22 | Independent House | 2 | 4535 | 124.99 | 0.027561 | 1991 | ... | 1 | 0 | 1 | 0 | 2.235174 | 0.058941 | 0.061643 | 168.565485 | 0.348632 | 0 |


# 4. Run Hyperparameter Tuning (Random Forest + XGBoost)

In [3]:
# Assumption: the tuning helper is exported by src.models, next to
# train_baseline_models; adjust this import if it is defined elsewhere.
from src.models import tune_models_with_random_search

clf_results, reg_results = tune_models_with_random_search(
    df=df,
    sample_size=50_000,
    n_iter=15,
    cv=3,
    models_dir=PROJECT_ROOT / "models",
)

print("\nClassification – tuned model comparison (test metrics):")
display(pd.DataFrame(clf_results).T)

print("\nRegression – tuned model comparison (test metrics):")
display(pd.DataFrame(reg_results).T)

# 5. Verify Tuned Models Saved to Disk


In [None]:
from src.config import BEST_CLASSIFIER, BEST_REGRESSOR

# PROJECT_ROOT is defined in the setup cell (project_root was never defined)
models_dir = PROJECT_ROOT / "models"

tuned_clf_path = models_dir / BEST_CLASSIFIER
tuned_reg_path = models_dir / BEST_REGRESSOR

print("Tuned classification model exists:", tuned_clf_path.exists())
print("Tuned regression model exists    :", tuned_reg_path.exists())

## 6. Summary & Next Steps

### 🔍 Tuned Classification (Good_Investment)

- Tuned models:
  - **Random Forest Classifier** (RandomizedSearchCV)
  - **XGBoost Classifier** (RandomizedSearchCV)
- Evaluation metric: **F1-score** on the test set.
- Selected the best tuned classifier based on **F1-score**.
- Saved the tuned classification model to:
  - `models/tuned_classification_model.pkl`
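
To make the selection rule concrete, here is a minimal pure-Python sketch that computes F1 from true positives, false positives, and false negatives, then picks the higher-scoring candidate. The labels, the predictions, and the names `model_a` and `model_b` are made-up toy values for illustration only, not results from this project.

```python
def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """F1 is the harmonic mean of precision and recall."""
    if tp == 0:
        return 0.0  # no true positives: F1 is defined as 0 here
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def f1_binary(y_true, y_pred, positive=1) -> float:
    """Count TP/FP/FN for the positive class and return F1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    return f1_from_counts(tp, fp, fn)


# Toy test labels and predictions from two hypothetical tuned models.
y_true = [1, 0, 1, 1, 0, 1, 1, 0]
predictions = {
    "model_a": [1, 0, 1, 1, 0, 0, 1, 0],
    "model_b": [1, 0, 1, 0, 0, 0, 1, 1],
}

scores = {name: f1_binary(y_true, y_pred) for name, y_pred in predictions.items()}
best_name = max(scores, key=scores.get)

print(scores)
print("Selected:", best_name)
```

Because F1 ignores true negatives, it suits a target like `Good_Investment` when missed good investments and false alarms both matter more than overall accuracy.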

---

### 📈 Tuned Regression (Future_Price_5Y)

- Tuned models:
  - **Random Forest Regressor** (RandomizedSearchCV)
  - **XGBoost Regressor** (RandomizedSearchCV)
- Evaluation metric: **R² score** on the test set.
- Selected the best tuned regressor based on **R² score**.
- Saved the tuned regression model to:
  - `models/tuned_regression_model.pkl`
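
For reference, the regression metrics named in this notebook can be computed by hand as sketched below. The prices are invented toy numbers, not values from the dataset, and the helper name `regression_metrics` is hypothetical.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return MAE, RMSE and R² for two equal-length sequences."""
    n = len(y_true)
    mean_true = sum(y_true) / n
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)          # total sum of squares
    if ss_tot == 0:
        raise ValueError("R² is undefined when all true values are identical")
    return {
        "mae": sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n,
        "rmse": math.sqrt(ss_res / n),
        "r2": 1.0 - ss_res / ss_tot,
    }


# Toy 5-year price targets and predictions (in lakhs), for illustration only.
y_true = [100.0, 150.0, 200.0, 250.0]
y_pred = [110.0, 140.0, 210.0, 240.0]

metrics = regression_metrics(y_true, y_pred)
print(metrics)
```

An R² of 1 means perfect predictions, while 0 means the model does no better than always predicting the mean of the true values.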

---

### 📊 MLflow Experiment Tracking

- Each tuning run was logged to **MLflow**, including:
  - Hyperparameters searched
  - Best hyperparameters
  - Cross-validation scores
  - Test metrics (F1, ROC-AUC, MAE, RMSE, R², etc.)
  - Tuned models as MLflow artifacts

---

### 🚀 Next Steps

- Load the tuned models in `src/models/predict.py` and `Streamlit_app.py`:
  - `models/tuned_classification_model.pkl`
  - `models/tuned_regression_model.pkl`
- Update the Streamlit app so its predictions come from these **tuned models**.
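
One possible shape for the loading step is sketched below. It assumes the `.pkl` files were written with the standard `pickle` module; if the project saves models with `joblib`, `joblib.load` would replace `pickle.load`. The helper name `load_model` is hypothetical, and the demo pickles a plain dictionary as a stand-in for a real model.

```python
import pickle
import tempfile
from pathlib import Path


def load_model(path):
    """Load a pickled model, failing early with a clear message if it is missing."""
    path = Path(path)
    if not path.exists():
        raise FileNotFoundError(
            f"Model file not found: {path}. Run the tuning notebook first."
        )
    with path.open("rb") as fh:
        return pickle.load(fh)  # only unpickle files from a trusted source


# Demo: write a stand-in object to a temporary directory, then load it back.
with tempfile.TemporaryDirectory() as tmp:
    model_path = Path(tmp) / "tuned_classification_model.pkl"
    stand_in = {"name": "stand-in model", "threshold": 0.5}
    with model_path.open("wb") as fh:
        pickle.dump(stand_in, fh)

    loaded = load_model(model_path)

    # A missing file should raise instead of returning None.
    missing_raised = False
    try:
        load_model(Path(tmp) / "missing.pkl")
    except FileNotFoundError:
        missing_raised = True

print(loaded, missing_raised)
```

Raising on a missing file keeps the app from starting with no model loaded, which is easier to diagnose than a failure at prediction time.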
