
# Feature Engineering & Data Preprocessing (With Visuals + Model Impact)

This notebook extends the preprocessing notebook by:
1. Adding **visualizations** for key preprocessing steps  
2. Showing **model performance before vs after preprocessing**  
3. Keeping the notebook **trainer & interview ready**  

A **separate theory document** accompanies this notebook.



## Dataset: California Housing Dataset
Target: `MedHouseVal` (median house value, in units of $100,000)


In [None]:

from sklearn.datasets import fetch_california_housing
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

data = fetch_california_housing(as_frame=True)
df = data.frame.copy()
df.head()



## 1. Feature Distribution (Before Preprocessing)

Skewness and scale inform two choices:
- Whether to apply a log or power transform
- Which scaling strategy to use
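
As a rough illustration of how skewness can guide the choice of transform, the sketch below builds a synthetic log-normal feature (an assumption, standing in for a right-skewed column such as `Population`) and compares its sample skewness before and after `np.log1p`. The variable names are illustrative only.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Synthetic right-skewed feature (assumption: stands in for a column like Population)
feature = pd.Series(rng.lognormal(mean=7.0, sigma=0.8, size=5000), name="synthetic_feature")

skew_raw = feature.skew()            # sample skewness of the raw values
skew_log = np.log1p(feature).skew()  # sample skewness after log(1 + x)

print(f"skew raw: {skew_raw:.2f}, skew after log1p: {skew_log:.2f}")
```

A skewness far above zero suggests a long right tail, which is the usual cue to try a log or power transform before scaling.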


In [None]:

plt.figure(figsize=(6,4))
sns.histplot(df["Population"], kde=True)
plt.title("Population Distribution (Raw)")
plt.show()



## 2. Missing Value Imputation – Visual Impact

We introduce missing values artificially to visualize imputation effects.
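
To make the effect of median imputation concrete, here is a minimal sketch on a synthetic skewed column (an assumption, not the housing data). It uses the same every-15th-row masking pattern and checks two things: the fill value equals the median of the observed values, and the spread shrinks because every filled row lands on a single value.

```python
import numpy as np
from sklearn.impute import SimpleImputer

rng = np.random.default_rng(0)
values = rng.lognormal(mean=7.0, sigma=0.8, size=3000)  # synthetic skewed column (assumption)

with_nans = values.copy()
with_nans[::15] = np.nan  # mask every 15th row, as in the notebook

imputer = SimpleImputer(strategy="median")
imputed = imputer.fit_transform(with_nans.reshape(-1, 1)).ravel()

observed_median = np.nanmedian(with_nans)  # median of the non-missing values
n_filled = int(np.isnan(with_nans).sum())  # number of rows that were imputed

print("fill value:", imputer.statistics_[0], "observed median:", observed_median)
print("std before:", np.nanstd(with_nans), "std after:", imputed.std())
```

The drop in standard deviation is the numeric counterpart of the sharper peak a KDE plot shows around the median after imputation.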


In [None]:

from sklearn.impute import SimpleImputer

df_missing = df.copy()
df_missing.iloc[::15, df.columns.get_loc("Population")] = np.nan

imputer = SimpleImputer(strategy="median")
df_imputed = pd.DataFrame(imputer.fit_transform(df_missing), columns=df.columns)

plt.figure(figsize=(6,4))
sns.kdeplot(df_missing["Population"], label="Before imputation (NaNs dropped)")
sns.kdeplot(df_imputed["Population"], label="After Imputation")
plt.legend()
plt.show()



## 3. Scaling Effect Visualization

Standardization (zero mean, unit variance) vs. min-max normalization (rescaling to the range [0, 1])
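
The two scalers reduce to simple formulas. The sketch below, using a small toy array (an assumption), reproduces both by hand and compares them with scikit-learn's output. Note that `StandardScaler` uses the population standard deviation (`ddof=0`).

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

x = np.array([[100.0], [250.0], [400.0], [1000.0], [3250.0]])  # toy values (assumption)

# Standardization: subtract the mean, divide by the population std (ddof=0)
z_manual = (x - x.mean()) / x.std()

# Min-max normalization: map the minimum to 0 and the maximum to 1
mm_manual = (x - x.min()) / (x.max() - x.min())

z_sklearn = StandardScaler().fit_transform(x)
mm_sklearn = MinMaxScaler().fit_transform(x)

print(np.allclose(z_manual, z_sklearn), np.allclose(mm_manual, mm_sklearn))
```

Both transforms are affine, so they change the location and scale of a feature but not the shape of its distribution.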


In [None]:

from sklearn.preprocessing import StandardScaler, MinMaxScaler

std = StandardScaler()
mm = MinMaxScaler()

pop_std = std.fit_transform(df_imputed[["Population"]])
pop_mm = mm.fit_transform(df_imputed[["Population"]])

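# Separate axes: the two outputs live on very different ranges
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
sns.kdeplot(pop_std.ravel(), ax=axes[0])
axes[0].set_title("Standardized (mean 0, std 1)")
sns.kdeplot(pop_mm.ravel(), ax=axes[1])
axes[1].set_title("Min-max normalized (range [0, 1])")
plt.tight_layout()
plt.show()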



## 4. Power Transformation Visualization
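
Unlike scaling, a power transform changes the shape of a distribution. As a hedged sketch, the block below fits a Yeo-Johnson transform to a synthetic right-skewed feature (an assumption, not the housing data) and reports the fitted lambda along with the skewness before and after.

```python
import numpy as np
from scipy.stats import skew
from sklearn.preprocessing import PowerTransformer

rng = np.random.default_rng(1)
x = rng.lognormal(mean=2.0, sigma=0.8, size=5000).reshape(-1, 1)  # synthetic skewed feature (assumption)

pt = PowerTransformer(method="yeo-johnson")  # standardize=True by default
x_t = pt.fit_transform(x)

skew_before = skew(x.ravel())
skew_after = skew(x_t.ravel())

print("fitted lambda:", pt.lambdas_[0])
print("skew before:", skew_before, "skew after:", skew_after)
```

Because `standardize=True` is the default, the output is also rescaled to zero mean and unit variance, so no separate `StandardScaler` step is needed for this column.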


In [None]:

from sklearn.preprocessing import PowerTransformer

pt = PowerTransformer(method="yeo-johnson")
pop_power = pt.fit_transform(df_imputed[["Population"]])

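# Separate axes: raw values span thousands, transformed values span only a few units
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
sns.kdeplot(df_imputed["Population"], ax=axes[0])
axes[0].set_title("Original")
sns.kdeplot(pop_power.ravel(), ax=axes[1])
axes[1].set_title("Power Transformed (Yeo-Johnson)")
plt.tight_layout()
plt.show()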



## 5. Model Performance: Before vs After Preprocessing

We compare:
- Raw data
- Preprocessed data (median-imputed `Population`, standardized features)

Metric: **R² Score**
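
For reference, R² is one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below computes it by hand on toy values (assumptions) and compares the result with `r2_score`.

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([3.0, 1.5, 2.0, 4.5, 5.0])  # toy targets (assumption)
y_pred = np.array([2.8, 1.9, 2.4, 4.0, 4.6])  # toy predictions (assumption)

ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares around the mean
r2_manual = 1.0 - ss_res / ss_tot

print(r2_manual, r2_score(y_true, y_pred))
```

A value of 1 means perfect predictions, 0 means no better than always predicting the mean, and negative values are possible for models that do worse than that baseline.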


In [None]:

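from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.pipeline import make_pipeline

X = df.drop(columns=["MedHouseVal"])  # fetch_california_housing names the target MedHouseVal
y = df["MedHouseVal"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Baseline: raw features
lr_raw = LinearRegression()
lr_raw.fit(X_train, y_train)
r2_raw = r2_score(y_test, lr_raw.predict(X_test))

# Preprocessed: features with the artificial NaNs from section 2, imputed and scaled.
# Same random_state, so the rows in each split match the baseline.
X_missing = df_missing.drop(columns=["MedHouseVal"])
X_train_s, X_test_s, y_train_s, y_test_s = train_test_split(X_missing, y, test_size=0.2, random_state=42)

# The pipeline fits the imputer and scaler on the training split only (no test leakage)
lr_scaled = make_pipeline(SimpleImputer(strategy="median"), StandardScaler(), LinearRegression())
lr_scaled.fit(X_train_s, y_train_s)
r2_scaled = r2_score(y_test_s, lr_scaled.predict(X_test_s))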

r2_raw, r2_scaled



### Interpretation
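- With ordinary least squares, standardization alone leaves predictions and R² unchanged, so any gap between `r2_raw` and `r2_scaled` comes from the imputed `Population` values
- Preprocessing mainly improves numerical stability (a better-conditioned feature matrix)
- Gradient-based models (e.g., SGD-trained linear models, neural networks) benefit the most, because features on similar scales help the optimizer converge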
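
The claim that gradient-based models benefit the most can be checked on synthetic data. The sketch below is an illustration under stated assumptions, not the notebook's own experiment: it builds two features on very different scales, fits a closed-form `LinearRegression`, and runs a hand-written batch gradient descent (the hypothetical helper `gd_r2`) with a fixed budget of 200 steps, on raw and on standardized inputs.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 2000
# Synthetic features on very different scales (assumption: not the housing data)
X_raw = np.column_stack([rng.normal(0, 1, n), rng.normal(0, 1000, n)])
y = 3.0 * X_raw[:, 0] + 0.002 * X_raw[:, 1] + rng.normal(0, 0.1, n)
X_std = StandardScaler().fit_transform(X_raw)


def ols_r2(X, y):
    """Training R² of a closed-form least-squares fit."""
    model = LinearRegression().fit(X, y)
    return r2_score(y, model.predict(X))


def gd_r2(X, y, n_iter=200):
    """Training R² after n_iter steps of batch gradient descent (hypothetical helper)."""
    Xc = X - X.mean(axis=0)  # centering removes the need for an intercept term
    yc = y - y.mean()
    hessian = Xc.T @ Xc / len(yc)  # Hessian of the loss 1/(2n) * ||Xw - y||^2
    step = 1.0 / np.linalg.eigvalsh(hessian).max()  # largest stable fixed step size
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        w -= step * (Xc.T @ (Xc @ w - yc)) / len(yc)
    return r2_score(yc, Xc @ w)


r2_ols_raw, r2_ols_std = ols_r2(X_raw, y), ols_r2(X_std, y)
r2_gd_raw, r2_gd_std = gd_r2(X_raw, y), gd_r2(X_std, y)

print("OLS raw vs standardized:", r2_ols_raw, r2_ols_std)
print("GD  raw vs standardized:", r2_gd_raw, r2_gd_std)
```

In this setup the closed-form fit should score the same either way, while gradient descent on the raw features is held back by the poorly conditioned loss surface and recovers far less variance within the same number of steps.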



## Final Takeaways

✔ Visual intuition for preprocessing  
✔ Measured impact on model performance (R² before vs after)  
✔ End-to-end real-world workflow  

Use this notebook for:
- Teaching
- Interview preparation
- ML foundations revision
