# Custom Ratio Transformation Function 

### Ratio Transformer
* This function takes a 2D array with two columns and returns their element-wise ratio: column 0 divided by column 1.
It is used to compute features such as `bedrooms / rooms` and `rooms / households`.

In [1]:
def column_ratio(X):
    # Takes a 2D array and returns the element-wise ratio of column 0 over column 1
    return X[:, [0]] / X[:, [1]]
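
As a quick check of the shape behaviour, the sketch below applies `column_ratio` to a small made-up array (the values and the name `X_demo` are illustrative only). Indexing with the lists `[0]` and `[1]` keeps both operands two-dimensional, so the result is a single column rather than a flat vector.

```python
import numpy as np

def column_ratio(X):
    # Same function as above: column 0 divided by column 1, kept as a 2D column
    return X[:, [0]] / X[:, [1]]

# Illustrative values only: each row is (numerator, denominator)
X_demo = np.array([[2.0, 4.0],
                   [3.0, 6.0],
                   [1.0, 4.0]])

ratios = column_ratio(X_demo)
print(ratios.shape)             # (3, 1)
print(ratios.ravel().tolist())  # [0.5, 0.5, 0.25]
```

Note that the function indexes with NumPy syntax, so it expects an array rather than a DataFrame. Inside the pipeline this holds because the imputer that precedes it outputs a NumPy array by default.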

### Ratio Feature Name Function
* Provides a name ("ratio") for the output feature of the `FunctionTransformer`.

In [None]:
def ratio_name(function_transformer, feature_names_in):
    # Sets a custom output name for the ratio feature
    return ["ratio"]

---
# Pipeline for Ratio Features

### Ratio Pipeline
Applies the following steps:
1. Impute missing values with the median.
2. Compute the ratio.
3. Standardize the result.

In [None]:
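from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler

def ratio_pipeline():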
    return make_pipeline(
        SimpleImputer(strategy="median"), # Handle missing values
        FunctionTransformer(column_ratio, feature_names_out=ratio_name), # Compute the ratio
        StandardScaler() # Standardize the result
    )

---
# Pipeline for Log-Transformed Features

### Log Pipeline
Applies the following to skewed numerical features:
1. Impute missing values with the median.
2. Apply log transformation to reduce skew.
3. Standardize the result.

In [None]:
import numpy as np

log_pipeline = make_pipeline(
    SimpleImputer(strategy="median"), # Handle missing values
    FunctionTransformer(np.log, feature_names_out="one-to-one"), # Apply log transform
    StandardScaler() # Standardize the result
)

---
# Cluster Similarity Transformer

### ClusterSimilarity Transformer
Custom transformer that:
* Uses KMeans to find 10 geographic clusters.
* Applies a Gaussian RBF kernel to measure how close each row is to each cluster center.
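
The class definition itself is not shown in this section. The following is a minimal sketch of a transformer that matches the description and the constructor arguments used below; it is an assumption about how such a class could look, not necessarily the implementation used here. The attribute name `kmeans_`, the `n_init=10` setting, and the output feature names are choices made for this sketch.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import rbf_kernel

class ClusterSimilarity(TransformerMixin, BaseEstimator):
    def __init__(self, n_clusters=10, gamma=1.0, random_state=None):
        self.n_clusters = n_clusters
        self.gamma = gamma
        self.random_state = random_state

    def fit(self, X, y=None, sample_weight=None):
        # Learn the cluster centers from the training rows
        self.kmeans_ = KMeans(n_clusters=self.n_clusters, n_init=10,
                              random_state=self.random_state)
        self.kmeans_.fit(X, sample_weight=sample_weight)
        return self

    def transform(self, X):
        # One column per cluster: exp(-gamma * squared distance to its center)
        return rbf_kernel(X, self.kmeans_.cluster_centers_, gamma=self.gamma)

    def get_feature_names_out(self, input_features=None):
        return [f"Cluster {i} similarity" for i in range(self.n_clusters)]

# Illustrative check on random 2D points (not real coordinates)
rng = np.random.default_rng(42)
points = rng.uniform(size=(50, 2))
sims = ClusterSimilarity(n_clusters=3, gamma=1.0, random_state=42).fit_transform(points)
print(sims.shape)  # (50, 3)
```

Each output value lies between 0 and 1, and equals 1 only when a row coincides with a cluster center.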


In [None]:
cluster_simil = ClusterSimilarity(
    n_clusters=10, # Find 10 clusters using KMeans
    gamma=1.0, # RBF kernel coefficient (larger values make similarity fall off faster with distance)
    random_state=42 # Ensures reproducibility
)

---
# Default Numeric Pipeline for Unprocessed Columns

### Default Numeric Pipeline
Handles numeric columns not covered by earlier transformers.
1. Impute with median.
2. Standardize.


In [None]:
default_num_pipeline = make_pipeline(
    SimpleImputer(strategy="median"), # Handle missing values
    StandardScaler() # Standardize features
)

---
# Categorical Pipeline

### Categorical Pipeline
1. Impute missing categories using the most frequent value.
2. One-hot encode the categorical column.

In [None]:
from sklearn.preprocessing import OneHotEncoder

cat_pipeline = make_pipeline(
    SimpleImputer(strategy="most_frequent"),
    OneHotEncoder(handle_unknown="ignore")
)
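
The behaviour of both steps, including `handle_unknown="ignore"`, can be checked on a single made-up column. The category labels and array names below are illustrative assumptions.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

cat_pipeline = make_pipeline(
    SimpleImputer(strategy="most_frequent"),
    OneHotEncoder(handle_unknown="ignore"),
)

# Illustrative categorical column with one missing entry
train = np.array([["INLAND"], ["NEAR BAY"], ["INLAND"], [np.nan]], dtype=object)

# The encoder returns a sparse matrix by default; convert it for inspection
encoded = cat_pipeline.fit_transform(train).toarray()
print(encoded.shape)  # (4, 2): one column per category seen during fit

# A category that was not seen during fit is encoded as a row of zeros
unseen = cat_pipeline.transform(np.array([["ISLAND"]], dtype=object)).toarray()
print(unseen.tolist())  # [[0.0, 0.0]]
```

The missing entry is filled with `"INLAND"`, the most frequent value, before encoding. Ignoring unknown categories avoids an error at prediction time, at the cost of representing every unseen category identically.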

---
# Final ColumnTransformer to Apply All Steps

### Final Preprocessing Pipeline
Combines all previous pipelines and applies them to their corresponding columns.

Transformations:
- Bedrooms Ratio: total_bedrooms / total_rooms
- Rooms per House: total_rooms / households
- People per House: population / households
- Log Transform: skewed features
- Cluster Similarity: latitude + longitude
- Categorical Pipeline: ocean_proximity
- Remainder: housing_median_age (default processing)

In [None]:
from sklearn.compose import ColumnTransformer, make_column_selector

preprocessing = ColumnTransformer([
    # Compute bedrooms/rooms ratio and scale
    ("bedrooms", ratio_pipeline(), ["total_bedrooms", "total_rooms"]),

    # Compute rooms/households ratio and scale
    ("rooms_per_house", ratio_pipeline(), ["total_rooms", "households"]),

    # Compute population/households ratio and scale
    ("people_per_house", ratio_pipeline(), ["population", "households"]),

    # Apply log transform + scale to skewed features
    ("log", log_pipeline, [
        "total_bedrooms", "total_rooms", "population",
        "households", "median_income"
    ]),

    # Generate 10 cluster similarity features from latitude and longitude
    ("geo", cluster_simil, ["latitude", "longitude"]),

    # Handle categorical columns: impute + one-hot encode
    ("cat", cat_pipeline, make_column_selector(dtype_include=object)),

], remainder=default_num_pipeline)  # Apply default pipeline to remaining feature: housing_median_age