# 🧼 Handling Missing Values and Encoding Categorical Variables

This notebook focuses on robust preprocessing using `SimpleImputer` and encoding with `OneHotEncoder`. All steps will feed into a scikit-learn `Pipeline` or `ColumnTransformer` for scalable modeling.

## 📊 Data Overview

In [None]:
import pandas as pd
import numpy as np

# Load dataset
df = pd.read_csv("../data/housing.csv")

# Quick shape check
print(f"Dataset shape: {df.shape}")
df.head()

In [None]:
# Define target and feature matrix
y = df['Price']
X = df.drop('Price', axis=1)

In [None]:
# Separate column types
numeric_cols = X.select_dtypes(include=['int64', 'float64']).columns.tolist()
categorical_cols = X.select_dtypes(include=['object']).columns.tolist()

print(f"Numeric columns: {numeric_cols}")
print(f"Categorical columns: {categorical_cols}")

## 🔍 Missing Data Handling

In [None]:
# Count missing values
print("Missing values in numeric features:")
display(X[numeric_cols].isnull().sum().sort_values(ascending=False))

print("Missing values in categorical features:")
display(X[categorical_cols].isnull().sum().sort_values(ascending=False))

## 🏗️ Pipeline Construction

In [None]:
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

In [None]:
# Median for numerics, most_frequent for categoricals
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median'))
])

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OneHotEncoder(handle_unknown='ignore', sparse_output=False))
])

In [None]:
# Combine numerical and categorical transformers
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_cols),
        ('cat', categorical_transformer, categorical_cols)
    ])

In [None]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Full pipeline
model_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('model', RandomForestRegressor(n_estimators=50, random_state=42))
])

# Fit model
model_pipeline.fit(X_train, y_train)

## 🧪 Model Evaluation

In [None]:
# Score model on validation data
from sklearn.metrics import mean_absolute_error

preds = model_pipeline.predict(X_test)
mae = mean_absolute_error(y_test, preds)

print(f"Validation MAE with pipeline: ${mae:,.0f} AUD")

## Summary of Preprocessing Strategy

- Imputed numerical columns using `median` strategy.
- Imputed categorical columns using `most_frequent` strategy.
- Encoded categoricals using `OneHotEncoder(handle_unknown='ignore')`.
- Combined preprocessing and model into a single scikit-learn `Pipeline`.
- Achieved MAE of $162,798 AUD on hold-out validation data.

### Next Steps
- Consider handling high-cardinality features like `Suburb` with frequency or target encoding.
- Integrate cross-validation for more robust evaluation (Day 3).