# 2. Feature Engineering

This notebook performs feature engineering on the raw housing data: encoding categorical variables, log-transforming skewed features, removing outliers, and scaling the numerical features. The goal is to prepare the data for model training.

In [None]:
import os
import pandas as pd
import numpy as np

## Data Loading

In [None]:
input_data_dir = os.path.join("..", "data", "raw")
df = pd.read_csv(os.path.join(input_data_dir, "Housing.csv"))
df.head()

## Data Preprocessing

### Handling Binary Categorical Features

In [None]:
binary_columns = [
    'mainroad', 'guestroom', 'basement', 'hotwaterheating', 'airconditioning', 'prefarea'
]

# Convert binary categorical features to numeric
for col in binary_columns:
    df[col] = df[col].map({'yes': 1, 'no': 0})

print("DataFrame after converting binary columns:")
df.head()
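
One caveat with `Series.map` is that any value missing from the mapping dictionary silently becomes `NaN`. The sketch below shows a more defensive version on a hypothetical toy frame (`toy`, not the housing data): it normalizes case and whitespace first, then counts unmapped values. Whether `Housing.csv` actually contains such irregular values is not known here; the check is simply cheap to run.

```python
import pandas as pd

# Hypothetical toy data; the stray capital and trailing space are deliberate.
toy = pd.DataFrame({"mainroad": ["yes", "No", "yes ", "no"]})

# Normalize the strings before mapping so "No" and "yes " are still recognized.
cleaned = toy["mainroad"].str.strip().str.lower()
toy["mainroad"] = cleaned.map({"yes": 1, "no": 0})

# Any value outside the mapping would show up as NaN here.
n_unmapped = int(toy["mainroad"].isna().sum())
print(f"Unmapped values: {n_unmapped}")
```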

### Handling Multi-Level Categorical Features

In [None]:
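# One-hot encode furnishingstatus; drop_first=True drops the first level (in sorted order)
# as the baseline, and dtype=int keeps the dummies as 0/1 like the binary columns above
dummies = pd.get_dummies(df['furnishingstatus'], drop_first=True, dtype=int)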

df = pd.concat([df, dummies], axis=1)
df = df.drop("furnishingstatus", axis=1)

print("DataFrame after one-hot encoding:")
df.head()
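
To make the effect of `drop_first=True` concrete, here is a small sketch on a hypothetical three-level column (`status` in a toy frame, not the real data). It compares the full encoding with the reduced one. The dropped level becomes the implicit baseline: a row with zeros in all remaining dummy columns belongs to it.

```python
import pandas as pd

# Hypothetical three-level column, for illustration only.
toy = pd.DataFrame(
    {"status": ["furnished", "semi-furnished", "unfurnished", "furnished"]}
)

full = pd.get_dummies(toy["status"])                      # one column per level
reduced = pd.get_dummies(toy["status"], drop_first=True)  # first level dropped

print(list(full.columns))
print(list(reduced.columns))

# Rows where every remaining dummy is 0 belong to the dropped baseline level.
is_baseline = reduced.sum(axis=1) == 0
print(is_baseline.tolist())
```

Dropping one level avoids perfectly collinear dummy columns, which matters mainly for linear models with an intercept.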

### Handling Skewness in Numerical Features

In [None]:
# Log-transform the right-skewed price and area columns to reduce their skewness.
# Note: the target `price` is on a log scale from this point on.
df['price'] = np.log(df['price'])
df['area'] = np.log(df['area'])
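
One way to check whether a log transform helps is to compare the sample skewness before and after. The sketch below does this on synthetic log-normal data (an assumption made for illustration, not the housing prices), using `Series.skew`.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed data (log-normal), not the housing prices.
rng = np.random.default_rng(0)
values = pd.Series(rng.lognormal(mean=10.0, sigma=0.8, size=1000))

skew_before = values.skew()
skew_after = np.log(values).skew()
print(f"Skewness before: {skew_before:.2f}, after: {skew_after:.2f}")
```

The same comparison can be run on `df['price']` and `df['area']` before they are overwritten. Predictions made on the log scale can be mapped back to the original units with `np.exp`.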


### A Note on Data Leakage

In the original notebook, a feature called `price_per_sqft` was created by dividing `price` (the target variable) by `area`. This is a form of **data leakage**: the feature is computed from the target, so it is not available at prediction time, and a model can recover `price` simply by multiplying it by `area`. Such a model can score unrealistically well on the test set and then fail in the real world. We have removed this feature to prevent the leak.
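
The problem can be illustrated with a small sketch on synthetic data (the numbers below are invented for the example). Even when `price` is only loosely related to `area`, the leaky feature lets us reconstruct the target exactly, with no model involved.

```python
import numpy as np
import pandas as pd

# Synthetic data: price depends on area plus substantial noise.
rng = np.random.default_rng(0)
area = rng.uniform(1000, 5000, size=200)
price = 100 * area + rng.normal(0, 50_000, size=200)
toy = pd.DataFrame({"area": area, "price": price})

# The leaky feature is computed from the target itself.
toy["price_per_sqft"] = toy["price"] / toy["area"]

# Multiplying the leaky feature by area reproduces the target
# (up to floating-point error), so nothing has to be learned.
reconstructed = toy["price_per_sqft"] * toy["area"]
max_error = (reconstructed - toy["price"]).abs().max()
print(f"Largest reconstruction error: {max_error:.6f}")
```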

### Handling Outliers

In [None]:
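# Remove outliers at or above the 99th percentile of area.
# area is already log-transformed; log is monotonic, so the filter is
# effectively the same as on the raw values.
q99 = df['area'].quantile(0.99)
print(f"Original number of houses: {len(df)}")

# .copy() gives an independent frame, so the column assignment in the scaling step
# does not trigger pandas' SettingWithCopyWarning
df = df[df['area'] < q99].copy()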
print(f"Number of houses after removing outliers: {len(df)}")
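
Because `area` has already been log-transformed, the threshold `q99` is in log units. The sketch below uses synthetic areas (not the housing data) to show how the cutoff might be reported in the original units via `np.exp`.

```python
import numpy as np
import pandas as pd

# Synthetic areas with one extreme value; not the housing data.
toy = pd.DataFrame({"area": [1500.0, 2000.0, 2500.0, 3000.0, 3500.0, 50000.0]})
toy["log_area"] = np.log(toy["area"])

# Threshold computed on the log scale, as in the cell above.
q99_log = toy["log_area"].quantile(0.99)
kept = toy[toy["log_area"] < q99_log]

# np.exp maps the threshold back to the original units for reporting.
print(f"Cutoff in original units: {np.exp(q99_log):.0f}")
print(f"Rows kept: {len(kept)} of {len(toy)}")
```

The strict `<` comparison means that any row exactly at the threshold is dropped as well.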

## Feature Scaling

In [None]:
from sklearn.preprocessing import StandardScaler

# Scale the numerical features (excluding the target variable 'price')
numeric_vars_X = ['area', 'bedrooms', 'bathrooms', 'stories', 'parking']

scaler = StandardScaler()

df[numeric_vars_X] = scaler.fit_transform(df[numeric_vars_X])

print("DataFrame after scaling:")
df.head()
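
The cell above fits the scaler on every row. If a train/test split is made in a later notebook (an assumption; the split is not shown here), the test rows will already have influenced the scaling mean and standard deviation, which is a mild form of the leakage discussed earlier. A minimal sketch of the usual alternative, on synthetic data with hypothetical values, splits first and fits the scaler on the training rows only:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic feature matrix; column names mirror two of the notebook's features.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "area": rng.normal(8.0, 0.5, size=100),
    "bedrooms": rng.integers(1, 6, size=100).astype(float),
})

# Split first, so the held-out rows play no part in the scaling statistics.
train, test = train_test_split(toy, test_size=0.2, random_state=42)
train, test = train.copy(), test.copy()

cols = ["area", "bedrooms"]
scaler = StandardScaler()
train[cols] = scaler.fit_transform(train[cols])  # learn mean/std on train only
test[cols] = scaler.transform(test[cols])        # reuse the train statistics

print(train[cols].mean().round(3).to_dict())
```

With this arrangement the fitted `scaler` would also need to be saved alongside the data so the same transformation can be applied at prediction time.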

## Saving the Processed Data

In [None]:
output_data_dir = os.path.join("..", "data", "interim")
os.makedirs(output_data_dir, exist_ok=True)  # create the folder if it does not exist yet
df.to_csv(os.path.join(output_data_dir, "feature_engineered.csv"), index=False)