
# Complete Feature Engineering & Data Preprocessing (End-to-End)

This notebook is a **trainer-level, exhaustive reference** for **data preprocessing and feature engineering**
using a **real-world dataset**.

You will learn:
- What each technique does
- When to use it
- Why it matters for model performance



## Dataset: California Housing (Real-Life)

Target: `MedHouseVal` (median house value, in units of $100,000)

Why this dataset?
- Numerical-heavy (perfect for preprocessing)
- Realistic distributions
- Industry-standard dataset


In [None]:

from sklearn.datasets import fetch_california_housing
import pandas as pd
import numpy as np

data = fetch_california_housing(as_frame=True)
df = data.frame.copy()
df.head()



## 1. Basic EDA and Descriptive Statistics

Before preprocessing, always inspect:
- Distribution
- Scale differences
- Skewness

In [None]:

df.describe()
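
`describe()` shows scale differences but not skewness directly. As a minimal sketch of checking both, the snippet below uses a small synthetic frame so it runs without downloading anything; the column names `income_like` and `age_like` are hypothetical stand-ins, not columns from the dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical toy data: one right-skewed column, one roughly symmetric column
toy = pd.DataFrame({
    "income_like": rng.lognormal(mean=1.0, sigma=0.8, size=2000),
    "age_like": rng.uniform(1, 52, size=2000),
})

# Range shows scale differences; skew shows asymmetry (0 means symmetric)
summary = pd.DataFrame({
    "min": toy.min(),
    "max": toy.max(),
    "skew": toy.skew(),
})
print(summary)
```

On the real data, the same `summary` could be built from `df` instead of `toy`. Columns with a large positive skew are candidates for the power transformation covered later.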



## 2. Missing Value Imputation

### Why?
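Most scikit-learn estimators **raise an error on NaN values**; only a few (for example, the histogram-based gradient boosting models) handle them natively.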

### Techniques Covered
- Mean
- Median
- Mode
- KNN Imputation
- Iterative Imputation


In [None]:

from sklearn.impute import SimpleImputer, KNNImputer
# IterativeImputer is experimental: this enabling import must run before importing it
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Introduce artificial missing values
df_missing = df.copy()
df_missing.iloc[::10, 0] = np.nan

mean_imputer = SimpleImputer(strategy="mean")
median_imputer = SimpleImputer(strategy="median")
mode_imputer = SimpleImputer(strategy="most_frequent")

knn_imputer = KNNImputer(n_neighbors=5)
iter_imputer = IterativeImputer(random_state=42)

df_mean = pd.DataFrame(mean_imputer.fit_transform(df_missing), columns=df.columns)
df_median = pd.DataFrame(median_imputer.fit_transform(df_missing), columns=df.columns)
df_mode = pd.DataFrame(mode_imputer.fit_transform(df_missing), columns=df.columns)
df_knn = pd.DataFrame(knn_imputer.fit_transform(df_missing), columns=df.columns)
df_iter = pd.DataFrame(iter_imputer.fit_transform(df_missing), columns=df.columns)

df_mean.isnull().sum().head()
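
To make the difference between mean and median imputation concrete, here is a small sketch on a hypothetical one-column array with a single extreme value. The numbers are made up for illustration and are not taken from the dataset.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical column: three small values, one extreme value, one missing value
X = np.array([[1.0], [2.0], [3.0], [100.0], [np.nan]])

mean_filled = SimpleImputer(strategy="mean").fit_transform(X)
median_filled = SimpleImputer(strategy="median").fit_transform(X)

print(mean_filled[-1, 0])    # mean of 1, 2, 3, 100 -> 26.5
print(median_filled[-1, 0])  # median of 1, 2, 3, 100 -> 2.5
```

The mean is pulled toward the extreme value, while the median stays close to the typical values. That is why the median is often the safer choice for skewed columns.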



## 3. Feature Scaling

### Standardization vs Normalization

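**Standardization (Z-score):**
- Rescales each feature to mean = 0, std = 1
- Common default for gradient-descent-based models (linear/logistic regression, neural networks) and for distance-based models (KNN, SVM, k-means)

**Normalization (MinMax):**
- Rescales each feature to the range [0, 1]
- Useful when a bounded range is required; sensitive to outliers because the minimum and maximum set the scale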


In [None]:

from sklearn.preprocessing import StandardScaler, MinMaxScaler

std_scaler = StandardScaler()
minmax_scaler = MinMaxScaler()

df_std = pd.DataFrame(std_scaler.fit_transform(df_mean), columns=df.columns)
df_norm = pd.DataFrame(minmax_scaler.fit_transform(df_mean), columns=df.columns)

df_std.head()
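
The two scalers apply simple formulas, $z = (x - \mu) / \sigma$ and $x' = (x - x_{min}) / (x_{max} - x_{min})$. As a sketch, the snippet below recomputes both by hand on a hypothetical five-value column and compares the results with scikit-learn.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical single feature with one larger value
x = np.array([[1.0], [2.0], [3.0], [4.0], [10.0]])

# Manual standardization; StandardScaler uses the population std (ddof=0)
z_manual = (x - x.mean(axis=0)) / x.std(axis=0)

# Manual min-max normalization
mm_manual = (x - x.min(axis=0)) / (x.max(axis=0) - x.min(axis=0))

z_sklearn = StandardScaler().fit_transform(x)
mm_sklearn = MinMaxScaler().fit_transform(x)

print(np.allclose(z_manual, z_sklearn), np.allclose(mm_manual, mm_sklearn))
```

In a modelling workflow, the scaler would normally be fitted on the training split only and then applied to the test split, so that test statistics do not leak into training.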



## 4. Power Transformation

### Why?
- Handles skewed distributions
- Makes data more Gaussian

Techniques:
- Box-Cox (positive only)
- Yeo-Johnson (allows zero/negative)


In [None]:

from sklearn.preprocessing import PowerTransformer

pt = PowerTransformer(method="yeo-johnson")
df_power = pd.DataFrame(pt.fit_transform(df_mean), columns=df.columns)
df_power.head()
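
As a rough illustration of both points above, the sketch below measures skewness before and after Yeo-Johnson on synthetic log-normal data (an assumption chosen for its strong right skew). It then shows that Box-Cox rejects input containing a zero.

```python
import numpy as np
from scipy.stats import skew
from sklearn.preprocessing import PowerTransformer

rng = np.random.default_rng(42)

# Synthetic, strongly right-skewed feature (not from the housing data)
x = rng.lognormal(mean=0.0, sigma=1.0, size=(5000, 1))

x_yj = PowerTransformer(method="yeo-johnson").fit_transform(x)

skew_before = float(skew(x[:, 0]))
skew_after = float(skew(x_yj[:, 0]))
print(skew_before, skew_after)

# Box-Cox needs strictly positive values, so a single zero makes fit() fail
x_with_zero = np.vstack([x, [[0.0]]])
try:
    PowerTransformer(method="box-cox").fit(x_with_zero)
    boxcox_failed = False
except ValueError:
    boxcox_failed = True
print(boxcox_failed)
```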



## 5. Encoding Categorical Data

We simulate categorical data for demonstration.

Techniques:
- Ordinal Encoding
- One-Hot Encoding


In [None]:

from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder

df_cat = df_mean.copy()
df_cat["Income_Category"] = pd.cut(
    df_cat["MedInc"],
    bins=[0, 2, 4, 6, 8, np.inf],  # open-ended top bin: MedInc slightly exceeds 15, which would otherwise become NaN
    labels=["Low","Medium","High","Very High","Ultra"]
)

ordinal_enc = OrdinalEncoder(categories=[["Low","Medium","High","Very High","Ultra"]])
df_cat["Income_Ordinal"] = ordinal_enc.fit_transform(df_cat[["Income_Category"]])

ohe = OneHotEncoder(sparse_output=False)  # scikit-learn >= 1.2; older releases used sparse=False
ohe_features = ohe.fit_transform(df_cat[["Income_Category"]])
ohe_df = pd.DataFrame(ohe_features, columns=ohe.get_feature_names_out())

df_cat[["Income_Category","Income_Ordinal"]].head()
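
The sketch below shows what the two encoders produce on a tiny hypothetical column, and one way to deal with a category that was not seen during fitting. The three-level list and the unseen value are assumptions made for the example.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

levels = ["Low", "Medium", "High"]  # explicit order for the ordinal codes
train = np.array([["Low"], ["High"], ["Medium"], ["Low"]], dtype=object)

# Ordinal: one integer-valued column that preserves the given order
ordinal = OrdinalEncoder(categories=[levels])
ordinal_codes = ordinal.fit_transform(train).ravel()
print(ordinal_codes)

# One-hot: one column per level; unknown categories become an all-zero row
onehot = OneHotEncoder(categories=[levels], handle_unknown="ignore")
onehot.fit(train)
unseen = np.array([["Medium"], ["Ultra"]], dtype=object)
onehot_rows = onehot.transform(unseen).toarray()
print(onehot_rows)
```

Ordinal codes imply a distance between levels, which suits ordered categories such as income bands. One-hot encoding makes no such assumption, at the cost of one column per level.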



## 6. ColumnTransformer & FunctionTransformer

### Why?
- Apply different transformations to different columns
- Apply custom logic inside sklearn workflow


In [None]:

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import FunctionTransformer

def log_transform(x):
    return np.log1p(x)

log_transformer = FunctionTransformer(log_transform)

ct = ColumnTransformer([
    ("log_population", log_transformer, ["Population"]),
    ("scale_income", StandardScaler(), ["MedInc"])
])

ct.fit_transform(df_mean)
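
One detail worth seeing in action: by default `ColumnTransformer` drops every column that is not listed (`remainder="drop"`). The sketch below uses a hypothetical three-row frame to compare the default with `remainder="passthrough"`.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import FunctionTransformer, StandardScaler

# Hypothetical toy frame reusing three of the dataset's column names
toy = pd.DataFrame({
    "Population": [100.0, 1000.0, 10000.0],
    "MedInc": [2.0, 4.0, 6.0],
    "HouseAge": [10.0, 20.0, 30.0],
})

steps = [
    ("log_population", FunctionTransformer(np.log1p), ["Population"]),
    ("scale_income", StandardScaler(), ["MedInc"]),
]

# Default: HouseAge is dropped because no transformer lists it
dropped = ColumnTransformer(steps).fit_transform(toy)

# Passthrough: HouseAge is appended unchanged after the transformed columns
kept = ColumnTransformer(steps, remainder="passthrough").fit_transform(toy)

print(dropped.shape, kept.shape)
```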



## 7. Binning & Binarization

### Binning
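- Converts continuous → discrete
- Can help linear models capture non-linear effects and aids interpretability; tree models already split on thresholds, so they usually gain little

### Binarization
- Converts a feature to 0/1 based on a threshold (values strictly greater than the threshold become 1)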


In [None]:

from sklearn.preprocessing import KBinsDiscretizer, Binarizer

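# Work on a copy so the derived columns do not leak into later steps that reuse df_mean
df_binned = df_mean.copy()

# Quantile strategy: 4 bins with roughly equal numbers of rows
binning = KBinsDiscretizer(n_bins=4, encode="ordinal", strategy="quantile")
df_binned["HouseAge_Binned"] = binning.fit_transform(df_binned[["HouseAge"]]).ravel()

# 1 where MedInc > 5, else 0
binarizer = Binarizer(threshold=5)
df_binned["Income_Binary"] = binarizer.fit_transform(df_binned[["MedInc"]]).ravel()

df_binned[["HouseAge","HouseAge_Binned","Income_Binary"]].head()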
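
To check what the two transformers actually do, here is a small sketch on hypothetical values: the integers 1 to 100 for quantile binning, and three values around the threshold for binarization.

```python
import numpy as np
from sklearn.preprocessing import Binarizer, KBinsDiscretizer

# Hypothetical feature: the integers 1..100
values = np.arange(1, 101, dtype=float).reshape(-1, 1)

# Quantile binning aims for an equal number of samples per bin
binner = KBinsDiscretizer(n_bins=4, encode="ordinal", strategy="quantile")
bins = binner.fit_transform(values).ravel().astype(int)
counts = np.bincount(bins)
print(counts)

# Binarizer maps values strictly above the threshold to 1, everything else to 0
flags = Binarizer(threshold=5).fit_transform(np.array([[4.0], [5.0], [6.0]])).ravel()
print(flags)
```

Note that a value exactly equal to the threshold maps to 0. The fitted bin boundaries are available in `binner.bin_edges_`.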



## 8. Outlier Detection & Removal

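### Z-score
- Assumes a roughly Gaussian distribution
- The mean and std are themselves pulled by extreme values

### IQR
- Based on quartiles, so it makes no normality assumption and is less affected by extreme values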


In [None]:

from scipy.stats import zscore

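# Use only the original columns: derived 0/1 or binned columns can have IQR = 0,
# which would flag every non-zero row as an outlier
features = df_mean[df.columns]

z_scores = np.abs(zscore(features))
df_z = df_mean[(z_scores < 3).all(axis=1)]

Q1 = features.quantile(0.25)
Q3 = features.quantile(0.75)
IQR = Q3 - Q1

# Keep rows where no column falls outside the 1.5 * IQR fences
outside = (features < (Q1 - 1.5 * IQR)) | (features > (Q3 + 1.5 * IQR))
df_iqr = df_mean[~outside.any(axis=1)]

(df_mean.shape, df_z.shape, df_iqr.shape)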
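
The two rules can disagree sharply on small or heavy-tailed samples. The sketch below uses ten hypothetical values, nine near 11 and one at 200, to show how an extreme value can inflate the std enough to hide itself from the z-score rule while the IQR rule still flags it.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: nine typical values and one extreme value
s = pd.Series([10, 12, 11, 13, 12, 11, 10, 12, 11, 200], dtype=float)

# Z-score rule (population std, matching scipy.stats.zscore's default ddof=0)
z = (s - s.mean()) / s.std(ddof=0)
z_outliers = s[np.abs(z) >= 3]

# IQR rule with the usual 1.5 * IQR fences
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
iqr_outliers = s[(s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)]

print(len(z_outliers), iqr_outliers.tolist())
```

Here the value 200 has a z-score just under 3, so the z-score rule flags nothing. The quartiles (11 and 12) ignore the extreme value, so the IQR fences catch it.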



## Final Summary

### Preprocessing Techniques Covered
✔ Mean / Median / Mode Imputation  
✔ KNN & Iterative Imputation  
✔ Standardization & Normalization  
✔ Power Transformation  
✔ Ordinal & One-Hot Encoding  
✔ ColumnTransformer & FunctionTransformer  
✔ Binning & Binarization  
✔ Outlier removal (Z-score & IQR)

This notebook reflects **real-world ML preprocessing workflows** used in industry.
