
# Feature Engineering

Feature engineering is the process of transforming raw data into meaningful input variables (features) for machine learning models. It is a crucial step that can significantly impact model performance and interpretability.

It combines **domain knowledge**, **data intuition**, and **mathematical transformations** to create new features that help models learn better.

## Common Feature Engineering Techniques

-   **Polynomial features** (e.g., $x^2$, $xy$)
-   **Log/exp/sqrt transformations**
-   **Ratios** and **derived metrics** (e.g., $\text{BMI} = \text{weight} / \text{height}^2$)
-   **Discretization/Binning** (e.g., age $\Rightarrow$ age groups)
-   **Time-based features** (e.g., time since last event, weekday vs. weekend)
-   **Interaction features** (e.g., combining `income` $\times$ `education level`)
-   **Domain-specific rules** (e.g., combining gene expression scores)
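
Several of the techniques listed above can be sketched in a few lines of pandas. The toy DataFrame below is hypothetical: its column names, values, and bin edges are made up purely for illustration and do not come from any real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; names and values are assumptions for illustration only.
toy = pd.DataFrame({
    "age": [23, 37, 52, 68],
    "income": [28_000, 54_000, 91_000, 40_000],
    "education_years": [12, 16, 18, 10],
    "weight_kg": [70.0, 82.0, 95.0, 60.0],
    "height_m": [1.75, 1.80, 1.70, 1.60],
    "last_visit": pd.to_datetime(
        ["2024-01-06", "2024-01-08", "2024-01-13", "2024-01-17"]
    ),
})

# Polynomial feature: square of a numeric column
toy["age_sq"] = toy["age"] ** 2

# Log transformation: log1p computes log(1 + x), so it is defined at x = 0
toy["log_income"] = np.log1p(toy["income"])

# Ratio / derived metric: BMI = weight / height^2
toy["bmi"] = toy["weight_kg"] / toy["height_m"] ** 2

# Discretization: map age to labelled groups (bin edges are arbitrary here;
# pd.cut uses right-closed intervals by default, e.g. (30, 50])
toy["age_group"] = pd.cut(
    toy["age"],
    bins=[0, 30, 50, 65, 120],
    labels=["<=30", "31-50", "51-65", "66+"],
)

# Time-based feature: weekday vs. weekend (dayofweek: Monday=0 ... Sunday=6)
toy["is_weekend"] = toy["last_visit"].dt.dayofweek >= 5

# Interaction feature: product of two columns
toy["income_x_education"] = toy["income"] * toy["education_years"]

print(toy[["age_group", "bmi", "is_weekend", "income_x_education"]])
```

Which of these transformations is useful depends on the data and the model; a linear model, for instance, cannot represent a squared or interaction term unless it is supplied as a feature.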

Good feature engineering can:

-   Improve model accuracy
-   Improve interpretability
-   Reduce model complexity

## Practical Demonstration

In [None]:
import pandas as pd
from sklearn.datasets import fetch_california_housing
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures
import numpy as np

# Load dataset
data = fetch_california_housing(as_frame=True)
df = data.frame

# Example: create new features
# AveRooms is rooms per household and AveOccup is household members per household,
# so their ratio gives rooms per person
df['Rooms_per_person'] = df['AveRooms'] / df['AveOccup']

# Polynomial features for MedInc
poly = PolynomialFeatures(degree=2, include_bias=False)
poly_features = poly.fit_transform(df[['MedInc']])
poly_feature_names = poly.get_feature_names_out(['MedInc'])  # ['MedInc', 'MedInc^2']
poly_df = pd.DataFrame(poly_features, columns=poly_feature_names, index=df.index)

# Combine into a single dataset; drop the degree-1 term, which would
# otherwise duplicate the existing MedInc column
df = pd.concat([df, poly_df.drop(columns=['MedInc'])], axis=1)
print(df[['MedInc', 'MedInc^2', 'Rooms_per_person']].head())

Next, train a model with the new features:

In [None]:
# Prepare data for training
X = df.drop(columns=[data.target.name])
y = df[data.target.name]

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a linear regression model
model = LinearRegression()
model.fit(X_train, y_train)

# Evaluate the model
print("Test R²:", model.score(X_test, y_test))
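
A single test score says little about whether the engineered features helped. One way to check is to fit the same model on the same split with and without them. The sketch below does this on synthetic data (not the housing dataset); the helper `compare_feature_sets` and all variable names are hypothetical and introduced only for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split


def compare_feature_sets(X_base, X_engineered, y, seed=42):
    """Return (baseline R^2, engineered R^2) for a linear model.

    Both feature sets are split with the same random_state, so the same
    rows end up in the test set each time.
    """
    scores = []
    for X in (X_base, X_engineered):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=0.2, random_state=seed
        )
        scores.append(LinearRegression().fit(X_tr, y_tr).score(X_te, y_te))
    return tuple(scores)


# Synthetic stand-in data (an assumption): the target depends on x squared,
# which a linear model cannot capture from x alone.
rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=500)
y_demo = x ** 2 + rng.normal(scale=0.1, size=500)

X_base = pd.DataFrame({"x": x})
X_eng = X_base.assign(x_sq=X_base["x"] ** 2)  # add the engineered feature

base_r2, eng_r2 = compare_feature_sets(X_base, X_eng, y_demo)
print(f"baseline R²: {base_r2:.3f}, engineered R²: {eng_r2:.3f}")
```

With the housing data, `X_engineered` would be the `X` built above and `X_base` the same frame without the added columns. The size of any improvement there is an empirical question and may be small.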

## Hands-on Exercises

Add the following features to the California housing dataset:

-   `Bedrooms_ratio = AveBedrms / AveRooms`
-   `Log_income = log(MedInc + 1)`
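
One possible approach to the exercise is sketched below. It assumes `log` means the natural logarithm, and it wraps the two features in a hypothetical helper, `add_exercise_features`, demonstrated on a small made-up frame that reuses the dataset's column names.

```python
import numpy as np
import pandas as pd


def add_exercise_features(frame):
    """Return a copy of `frame` with Bedrooms_ratio and Log_income added."""
    out = frame.copy()
    out["Bedrooms_ratio"] = out["AveBedrms"] / out["AveRooms"]
    # np.log1p(x) computes the natural log of (1 + x)
    out["Log_income"] = np.log1p(out["MedInc"])
    return out


# Made-up sample rows (an assumption), using the dataset's column names
sample = pd.DataFrame({
    "AveBedrms": [1.0, 1.2],
    "AveRooms": [5.0, 6.0],
    "MedInc": [0.0, 3.5],
})
result = add_exercise_features(sample)
print(result)
```

The same helper can be applied to the housing frame, for example `df = add_exercise_features(df)`, before the train/test split.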