# 02 – Feature Engineering 

## 1. Objective

In this notebook, we will:

- Start from the **dataset** finalized in `01_eda.ipynb`.
- Apply **minimal preprocessing**, strictly based on EDA findings:
  - No missing values were found, so no imputation is needed.
  - No duplicate rows were found, so no deduplication is needed.
  - Only **outliers** and basic consistency checks are handled (as decided in EDA).
- Run a **feature engineering pipeline** to create clean, model-ready features
  (encodings and derived numeric features).
- Leave the **target variables** untouched:
  - `Future_Price_5Y` (regression)
  - `Good_Investment` (classification)

  Both are **intentionally NOT created or inspected** in this notebook; they are handled
  separately in downstream modeling notebooks to avoid data leakage.
- Save a single, final **feature-only dataset** to:
  - `../data/processed/housing_with_features.csv`

This notebook serves as the bridge between **EDA** and **modeling**.


## 2. Imports & configuration


In [1]:
# ============================================
# Project Setup & Imports (STANDARDIZED)
# ============================================

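import os
import sys
from pathlib import Path

# --- Locate project root (folder containing 'src/') ---
# Check the current directory first, then walk up through its parents,
# so the lookup also works when the notebook is run from the project root.
PROJECT_ROOT = None
_cwd = Path.cwd().resolve()
for candidate in (_cwd, *_cwd.parents):
    if (candidate / "src").is_dir():
        PROJECT_ROOT = candidate
        break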

if PROJECT_ROOT is None:
    raise RuntimeError("Project root with 'src/' directory not found.")

# Add project root to Python path
sys.path.insert(0, str(PROJECT_ROOT))

# -----------------------------------------------
# Core libraries
# -----------------------------------------------
import pandas as pd
import numpy as np

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Project imports
from src.features.build_features import validate_features, run_feature_pipeline
from src.data.load import load_raw_data

# Display settings
pd.set_option("display.max_columns", None)
pd.set_option("display.float_format", lambda x: f"{x:,.2f}")
sns.set_style("whitegrid")

print(f"✅ Project root set to: {PROJECT_ROOT}")


✅ Project root set to: D:\Labmentix\2nd Project\Real_Estate_Investment_Advisor


## 3. Loading the Dataset

In [2]:
DATA_PATH = "../data/raw/india_housing_prices.csv"
df = load_raw_data(DATA_PATH)
print("Dataset shape:", df.shape)
df.head()

Dataset shape: (250000, 23)


| | ID | State | City | Locality | Property_Type | BHK | Size_in_SqFt | Price_in_Lakhs | Price_per_SqFt | Year_Built | Furnished_Status | Floor_No | Total_Floors | Age_of_Property | Nearby_Schools | Nearby_Hospitals | Public_Transport_Accessibility | Parking_Space | Security | Amenities | Facing | Owner_Type | Availability_Status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Tamil Nadu | Chennai | Locality_84 | Apartment | 1 | 4740 | 489.76 | 0.1 | 1990 | Furnished | 22 | 1 | 35 | 10 | 3 | High | No | No | Playground, Gym, Garden, Pool, Clubhouse | West | Owner | Ready_to_Move |
| 1 | 2 | Maharashtra | Pune | Locality_490 | Independent House | 3 | 2364 | 195.52 | 0.08 | 2008 | Unfurnished | 21 | 20 | 17 | 8 | 1 | Low | No | Yes | Playground, Clubhouse, Pool, Gym, Garden | North | Builder | Under_Construction |
| 2 | 3 | Punjab | Ludhiana | Locality_167 | Apartment | 2 | 3642 | 183.79 | 0.05 | 1997 | Semi-furnished | 19 | 27 | 28 | 9 | 8 | Low | Yes | No | Clubhouse, Pool, Playground, Gym | South | Broker | Ready_to_Move |
| 3 | 4 | Rajasthan | Jodhpur | Locality_393 | Independent House | 2 | 2741 | 300.29 | 0.11 | 1991 | Furnished | 21 | 26 | 34 | 5 | 7 | High | Yes | Yes | Playground, Clubhouse, Gym, Pool, Garden | North | Builder | Ready_to_Move |
| 4 | 5 | Rajasthan | Jaipur | Locality_466 | Villa | 4 | 4823 | 182.9 | 0.04 | 2002 | Semi-furnished | 3 | 2 | 23 | 4 | 9 | Low | No | Yes | Playground, Garden, Gym, Pool, Clubhouse | East | Builder | Ready_to_Move |


## 4. Preprocessing Alignment with EDA

Based on `01_eda.ipynb`, the dataset:
- Contains **no missing values**
- Contains **no duplicate rows**

Therefore, **no imputation or deduplication** is applied in this notebook.

Only the **outlier handling strategy decided in EDA** is applied later during feature engineering.
No additional preprocessing is performed here.
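
The two EDA findings this section relies on are cheap to re-check before the pipeline runs. The sketch below shows one way to do that with pandas; `check_no_missing_or_duplicates` is a hypothetical helper (it is not part of `src/`), and the toy frame only stands in for the housing data.

```python
import pandas as pd


def check_no_missing_or_duplicates(df: pd.DataFrame) -> dict:
    """Count missing cells and fully duplicated rows in a DataFrame."""
    return {
        "missing_cells": int(df.isna().sum().sum()),
        "duplicate_rows": int(df.duplicated().sum()),
    }


# Toy frame standing in for the housing data (illustrative values only).
toy = pd.DataFrame(
    {
        "ID": [1, 2, 3],
        "City": ["Chennai", "Pune", "Ludhiana"],
        "Price_in_Lakhs": [489.76, 195.52, 183.79],
    }
)

report = check_no_missing_or_duplicates(toy)
if report["missing_cells"] or report["duplicate_rows"]:
    raise ValueError(f"EDA assumptions no longer hold: {report}")
```

Running the same check on `df` right after loading would make the notebook fail loudly if the input file ever changes.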


## 5. Feature Engineering Plan

Using `run_feature_pipeline()` from `src/features/build_features.py`, the following
EDA-driven transformations will be applied:

1. **Numerical Features**
   - Retain all validated numerical features
   - Apply outlier capping at the 99th percentile for price-related columns
     (as decided in `01_eda.ipynb`)

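2. **Categorical Features**
   - Encode selected categorical variables as integer-coded columns
     (`Furnished_Status_Enc`, `Availability_Status_Enc`, `Transport_Score`, `Security_Score`);
     the original categorical columns are retained and no one-hot expansion is applied here
   - Integer codes imply an ordering, so they are limited to columns where EDA justifies one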

3. **Target Variables**
   - `Good_Investment` (classification target) and `Future_Price_5Y` (regression target)
     are not present in the input data and are **not created** in this notebook
   - They are prepared in the downstream modeling notebooks, so feature construction
     never depends on them (avoids leakage)

4. **Final Output**
   - A clean, model-ready feature matrix
   - Targets kept separate for downstream modeling notebooks
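
As a rough illustration of the outlier step in the plan above, the sketch below caps columns at their 99th percentile with pandas. It is not the code in `src/features/build_features.py`: `cap_at_quantile` is a hypothetical helper, and treating `Price_in_Lakhs` as a "price-related" column is an assumption.

```python
import pandas as pd


def cap_at_quantile(df: pd.DataFrame, columns, q: float = 0.99) -> pd.DataFrame:
    """Return a copy of df with each listed column clipped at its q-th quantile."""
    capped = df.copy()
    for col in columns:
        upper = capped[col].quantile(q)  # per-column upper bound
        capped[col] = capped[col].clip(upper=upper)
    return capped


# Toy data: 99 ordinary prices and one extreme value (illustrative only).
toy = pd.DataFrame({"Price_in_Lakhs": [100.0] * 99 + [10_000.0]})

capped = cap_at_quantile(toy, ["Price_in_Lakhs"])
print(toy["Price_in_Lakhs"].max(), "->", capped["Price_in_Lakhs"].max())
```

Capping keeps every row and only pulls the extreme tail down to the threshold, which fits the decision not to drop any records.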


## 6. Run feature engineering pipeline


In [3]:
df_features = run_feature_pipeline(df)
print("Featured Dataset shape:", df_features.shape)
df_features.head()

Featured Dataset shape: (250000, 29)


| | ID | State | City | Locality | Property_Type | BHK | Size_in_SqFt | Price_in_Lakhs | Price_per_SqFt | Year_Built | Furnished_Status | Floor_No | Total_Floors | Age_of_Property | Nearby_Schools | Nearby_Hospitals | Public_Transport_Accessibility | Parking_Space | Security | Amenities | Facing | Owner_Type | Availability_Status | Furnished_Status_Enc | Availability_Status_Enc | Transport_Score | Security_Score | Investment_Score | Annual_Growth_Rate |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Tamil Nadu | Chennai | Locality_84 | Apartment | 1 | 4740 | 489.76 | 0.1 | 1990 | Furnished | 22 | 1 | 35 | 10 | 3 | High | No | No | Playground, Gym, Garden, Pool, Clubhouse | West | Owner | Ready_to_Move | 2 | 1 | 2 | 0 | 3.72 | 0.06 |
| 1 | 2 | Maharashtra | Pune | Locality_490 | Independent House | 3 | 2364 | 195.52 | 0.08 | 2008 | Unfurnished | 21 | 20 | 17 | 8 | 1 | Low | No | Yes | Playground, Clubhouse, Pool, Gym, Garden | North | Builder | Under_Construction | 0 | 0 | 0 | 0 | 2.48 | 0.06 |
| 2 | 3 | Punjab | Ludhiana | Locality_167 | Apartment | 2 | 3642 | 183.79 | 0.05 | 1997 | Semi-furnished | 19 | 27 | 28 | 9 | 8 | Low | Yes | No | Clubhouse, Pool, Playground, Gym | South | Broker | Ready_to_Move | 1 | 1 | 0 | 0 | 2.17 | 0.06 |
| 3 | 4 | Rajasthan | Jodhpur | Locality_393 | Independent House | 2 | 2741 | 300.29 | 0.11 | 1991 | Furnished | 21 | 26 | 34 | 5 | 7 | High | Yes | Yes | Playground, Clubhouse, Gym, Pool, Garden | North | Builder | Ready_to_Move | 2 | 1 | 2 | 0 | 3.72 | 0.06 |
| 4 | 5 | Rajasthan | Jaipur | Locality_466 | Villa | 4 | 4823 | 182.9 | 0.04 | 2002 | Semi-furnished | 3 | 2 | 23 | 4 | 9 | Low | No | Yes | Playground, Garden, Gym, Pool, Clubhouse | East | Builder | Ready_to_Move | 1 | 1 | 0 | 0 | 2.39 | 0.06 |
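
The `Furnished_Status_Enc`, `Availability_Status_Enc` and `Transport_Score` values in the five rows above are consistent with simple integer mappings. The sketch below reproduces them under that reading. The mappings are inferred from those rows only, the `"Medium"` level is an assumption (it does not appear in the sample), and the actual logic in `run_feature_pipeline()` may differ.

```python
import pandas as pd

# Hypothetical mappings inferred from the sample output; not taken from src/.
FURNISHED_CODES = {"Unfurnished": 0, "Semi-furnished": 1, "Furnished": 2}
AVAILABILITY_CODES = {"Under_Construction": 0, "Ready_to_Move": 1}
TRANSPORT_CODES = {"Low": 0, "Medium": 1, "High": 2}  # "Medium" is assumed


def add_integer_codes(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with integer-coded versions of three categoricals."""
    out = df.copy()
    out["Furnished_Status_Enc"] = out["Furnished_Status"].map(FURNISHED_CODES)
    out["Availability_Status_Enc"] = out["Availability_Status"].map(AVAILABILITY_CODES)
    out["Transport_Score"] = out["Public_Transport_Accessibility"].map(TRANSPORT_CODES)
    return out


# First three rows of the sample, restricted to the relevant columns.
sample = pd.DataFrame(
    {
        "Furnished_Status": ["Furnished", "Unfurnished", "Semi-furnished"],
        "Availability_Status": ["Ready_to_Move", "Under_Construction", "Ready_to_Move"],
        "Public_Transport_Accessibility": ["High", "Low", "Low"],
    }
)

encoded = add_integer_codes(sample)
print(encoded[["Furnished_Status_Enc", "Availability_Status_Enc", "Transport_Score"]])
```

`Series.map` returns `NaN` for any category missing from the dictionary, so an unexpected level shows up as a missing value rather than a silent wrong code.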


## 7. Target Handling (Separated for Modeling)

Target variables are **not inspected or engineered** in this notebook.

They are:
- Prepared explicitly in the modeling notebooks
- Kept separate from feature engineering to avoid leakage

This notebook outputs **features only**.
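
One way to enforce the features-only rule is a small guard that runs before the dataset is saved. The sketch below is a minimal version of that idea; `assert_no_targets` is a hypothetical helper, and only the two target names come from this notebook.

```python
import pandas as pd

# Target names as listed in the Objective section.
TARGET_COLUMNS = ("Future_Price_5Y", "Good_Investment")


def assert_no_targets(df: pd.DataFrame, targets=TARGET_COLUMNS) -> None:
    """Raise if any target column is present in the feature DataFrame."""
    leaked = [col for col in targets if col in df.columns]
    if leaked:
        raise ValueError(f"Target columns found in feature set: {leaked}")


# Toy feature frame (illustrative values only).
features = pd.DataFrame({"BHK": [1, 3], "Transport_Score": [2, 0]})
assert_no_targets(features)  # passes silently: no target columns present
```

Calling `assert_no_targets(df_features)` just before the save step would catch a target column that slipped into the pipeline output.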


## 8. Save processed dataset for modeling


In [4]:
PROCESSED_DIR = "../data/processed"
os.makedirs(PROCESSED_DIR, exist_ok=True)

FINAL_PATH = os.path.join(PROCESSED_DIR, "housing_with_features.csv")

df_features.to_csv(FINAL_PATH, index=False)

print("✓ Saved processed dataset with engineered features:")
print(FINAL_PATH)

# Quick confirmation read
pd.read_csv(FINAL_PATH).head()

✓ Saved processed dataset with engineered features:
../data/processed\housing_with_features.csv


| | ID | State | City | Locality | Property_Type | BHK | Size_in_SqFt | Price_in_Lakhs | Price_per_SqFt | Year_Built | Furnished_Status | Floor_No | Total_Floors | Age_of_Property | Nearby_Schools | Nearby_Hospitals | Public_Transport_Accessibility | Parking_Space | Security | Amenities | Facing | Owner_Type | Availability_Status | Furnished_Status_Enc | Availability_Status_Enc | Transport_Score | Security_Score | Investment_Score | Annual_Growth_Rate |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Tamil Nadu | Chennai | Locality_84 | Apartment | 1 | 4740 | 489.76 | 0.1 | 1990 | Furnished | 22 | 1 | 35 | 10 | 3 | High | No | No | Playground, Gym, Garden, Pool, Clubhouse | West | Owner | Ready_to_Move | 2 | 1 | 2 | 0 | 3.72 | 0.06 |
| 1 | 2 | Maharashtra | Pune | Locality_490 | Independent House | 3 | 2364 | 195.52 | 0.08 | 2008 | Unfurnished | 21 | 20 | 17 | 8 | 1 | Low | No | Yes | Playground, Clubhouse, Pool, Gym, Garden | North | Builder | Under_Construction | 0 | 0 | 0 | 0 | 2.48 | 0.06 |
| 2 | 3 | Punjab | Ludhiana | Locality_167 | Apartment | 2 | 3642 | 183.79 | 0.05 | 1997 | Semi-furnished | 19 | 27 | 28 | 9 | 8 | Low | Yes | No | Clubhouse, Pool, Playground, Gym | South | Broker | Ready_to_Move | 1 | 1 | 0 | 0 | 2.17 | 0.06 |
| 3 | 4 | Rajasthan | Jodhpur | Locality_393 | Independent House | 2 | 2741 | 300.29 | 0.11 | 1991 | Furnished | 21 | 26 | 34 | 5 | 7 | High | Yes | Yes | Playground, Clubhouse, Gym, Pool, Garden | North | Builder | Ready_to_Move | 2 | 1 | 2 | 0 | 3.72 | 0.06 |
| 4 | 5 | Rajasthan | Jaipur | Locality_466 | Villa | 4 | 4823 | 182.9 | 0.04 | 2002 | Semi-furnished | 3 | 2 | 23 | 4 | 9 | Low | No | Yes | Playground, Garden, Gym, Pool, Clubhouse | East | Builder | Ready_to_Move | 1 | 1 | 0 | 0 | 2.39 | 0.06 |
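
The confirmation read above only displays the first rows. A slightly stricter round-trip check could compare shape and column names after reloading, as sketched below. `save_and_verify` is a hypothetical helper, and the example writes to a temporary directory rather than to `../data/processed`.

```python
import tempfile
from pathlib import Path

import pandas as pd


def save_and_verify(df: pd.DataFrame, path) -> pd.DataFrame:
    """Write df to CSV, reload it, and check that shape and columns match."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    df.to_csv(path, index=False)

    reloaded = pd.read_csv(path)
    if reloaded.shape != df.shape or list(reloaded.columns) != list(df.columns):
        raise ValueError("Reloaded CSV does not match the DataFrame that was saved.")
    return reloaded


# Toy frame (illustrative values only), saved to a throwaway directory.
toy = pd.DataFrame({"ID": [1, 2], "Investment_Score": [3.72, 2.48]})
with tempfile.TemporaryDirectory() as tmp_dir:
    reloaded = save_and_verify(toy, Path(tmp_dir) / "processed" / "toy_features.csv")

print(reloaded.shape)
```

The check compares structure only; float values can differ slightly in the last digits after a CSV round trip, so an exact value comparison is deliberately left out.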


## 9. Summary & Next Steps

### Summary

- Used the same dataset finalized in the EDA notebook.
- No additional cleaning was required:
  - No missing values
  - No duplicate rows
- Performed **feature engineering only**, strictly based on EDA insights.
- Engineered safe, model-ready features (e.g. encodings, derived numeric features).
- **No target variables were created or inspected** in this notebook.
- Saved the final **feature-only dataset** to:
  - `../data/processed/housing_with_features.csv`

### Next Steps

- Prepare the target variables (`Future_Price_5Y`, `Good_Investment`) in the downstream
  modeling notebooks, using the saved feature-only dataset as input.


