# Custom transformers

## Previous steps

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

housing = pd.read_csv("./data/housing.csv") 

train_set, test_set = train_test_split(
    housing,
    test_size=0.2,
    # Stratify on income categories so both sets share the same income distribution
    stratify=pd.cut(housing["median_income"], bins=[0., 1.5, 3.0, 4.5, 6., np.inf], labels=[1, 2, 3, 4, 5]),
    random_state=42,
)

X_train = train_set.drop("median_house_value", axis=1) # Remove the dependent variable column
y_train = train_set["median_house_value"].copy() # Save the dependent variable (labels)

## Creating custom transformers

For transformations that don't require training, you can write a function that takes a NumPy array and returns the transformed array, then wrap it in `FunctionTransformer`. The resulting object behaves like the transformers built into `sklearn` and can be used in its *pipelines*. For example, for a logarithmic transformation:

In [None]:
from sklearn.preprocessing import FunctionTransformer

log_transformer = FunctionTransformer(np.log, inverse_func=np.exp)
log_pop = log_transformer.transform(X_train[["population"]])
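Because `inverse_func` was supplied, the transformer can also undo the transformation through `inverse_transform`. The sketch below checks the round trip on a small array of made-up population counts (an assumption, standing in for `X_train[["population"]]` so that the cell runs without the dataset):

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer

# Hypothetical population counts standing in for X_train[["population"]]
population = np.array([[322.0], [2401.0], [496.0], [558.0]])

log_transformer = FunctionTransformer(np.log, inverse_func=np.exp)
log_pop = log_transformer.transform(population)

# inverse_transform applies inverse_func (np.exp), undoing the logarithm
recovered = log_transformer.inverse_transform(log_pop)
print(np.allclose(recovered, population))
```

This is useful when a model is trained on log-scaled values and its outputs need to be mapped back to the original scale.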

A `FunctionTransformer` can also combine *features*, for example by computing the ratio of two columns:

In [None]:
def column_ratio(X):  # Computes the ratio of the first column to the second (keeps a 2D shape)
    return X[:, [0]] / X[:, [1]]

ratio_transformer = FunctionTransformer(column_ratio)
    
X_train["rooms_per_household"] = ratio_transformer.fit_transform(X_train[['total_rooms', 'households']].values)
X_train[['rooms_per_household', 'total_rooms', 'households']].head()

The same transformer can be defined more compactly with a lambda (an anonymous function):

In [None]:
ratio_transformer = FunctionTransformer(lambda X: X[:, [0]] / X[:, [1]])
X_train["rooms_per_household"] = ratio_transformer.transform(X_train[['total_rooms', 'households']].values)
X_train[['rooms_per_household', 'total_rooms', 'households']].head()
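Since a `FunctionTransformer` exposes `fit` and `transform`, it can be chained with other steps in a pipeline. The following is a minimal sketch rather than the notebook's actual pipeline: it feeds the ratio into a `StandardScaler`, using made-up values for `total_rooms` and `households` (an assumption, so the cell runs on its own).

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler

def column_ratio(X):
    # Ratio of the first column to the second, kept as a 2D column
    return X[:, [0]] / X[:, [1]]

# Hypothetical [total_rooms, households] rows standing in for the housing data
rooms_households = np.array([
    [880.0, 126.0],
    [7099.0, 1138.0],
    [1467.0, 177.0],
    [1274.0, 219.0],
])

# Step 1 computes the ratio, step 2 standardizes it
ratio_pipeline = make_pipeline(FunctionTransformer(column_ratio), StandardScaler())
scaled_ratio = ratio_pipeline.fit_transform(rooms_households)
print(scaled_ratio.shape)
```

The pipeline is fitted and applied with a single call, and the same fitted object can later be reused on the test set with `transform`.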

When a transformation requires training, you can write a class with a `fit` method that learns the necessary parameters and a `transform` method that applies the transformation. The class should inherit from `BaseEstimator` (which provides the `get_params` and `set_params` methods used to inspect and adjust the transformer's hyperparameters) and from `TransformerMixin` (which provides the `fit_transform` method).

For example, the following transformer behaves much like `StandardScaler`:

In [None]:
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.utils.validation import check_array, check_is_fitted

class StandardScalerClone(TransformerMixin, BaseEstimator):  # mixins first, BaseEstimator last
    def __init__(self, with_mean=True):  # no *args or **kwargs!
        self.with_mean = with_mean

    def fit(self, X, y=None):  # y is required even though we don't use it
        X = check_array(X)  # checks that X is an array with finite float values
        self.mean_ = X.mean(axis=0)
        self.scale_ = X.std(axis=0)
        self.n_features_in_ = X.shape[1]  # every estimator stores this in fit()
        return self  # always return self!

    def transform(self, X):
        check_is_fitted(self)  # looks for learned attributes (with trailing _)
        X = check_array(X)
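        if X.shape[1] != self.n_features_in_:  # explicit check; asserts are skipped under python -O
            raise ValueError("X has a different number of features than seen in fit()")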
        if self.with_mean:
            X = X - self.mean_
        return X / self.scale_

In [None]:
# Example of using a custom transformer
scaler = StandardScalerClone()
scaler.fit(X_train[["total_rooms"]])
scaler.transform(X_train[["total_rooms"]])
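One way to check such a transformer is to compare its output with scikit-learn's own `StandardScaler`, and to try the methods inherited from the base classes. The sketch below restates the class in condensed form so that it runs on its own, lists the mixin first as scikit-learn's own estimators do, and uses made-up `total_rooms` values (an assumption, standing in for `X_train[["total_rooms"]]`).

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import StandardScaler
from sklearn.utils.validation import check_array, check_is_fitted

class StandardScalerClone(TransformerMixin, BaseEstimator):
    def __init__(self, with_mean=True):
        self.with_mean = with_mean

    def fit(self, X, y=None):
        X = check_array(X)
        self.mean_ = X.mean(axis=0)
        self.scale_ = X.std(axis=0)  # population standard deviation (ddof=0)
        self.n_features_in_ = X.shape[1]
        return self

    def transform(self, X):
        check_is_fitted(self)
        X = check_array(X)
        if self.with_mean:
            X = X - self.mean_
        return X / self.scale_

# Hypothetical values standing in for X_train[["total_rooms"]]
rooms = np.array([[880.0], [7099.0], [1467.0], [1274.0], [1627.0]])

# fit_transform is inherited from TransformerMixin
clone_scaled = StandardScalerClone().fit_transform(rooms)
reference_scaled = StandardScaler().fit_transform(rooms)
print(np.allclose(clone_scaled, reference_scaled))

# get_params and set_params are inherited from BaseEstimator
scaler = StandardScalerClone()
params = scaler.get_params()
scaler.set_params(with_mean=False)
uncentered = scaler.fit_transform(rooms)  # scaled but no longer centered
```

Both scalers divide by the population standard deviation, so their outputs should agree. With `with_mean=False`, the clone only rescales the data.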

## Next steps

This notebook introduced two ways to build custom transformers: wrapping a function in `FunctionTransformer` and writing a class with `fit` and `transform` methods. The preprocessing pipeline is completed in the next notebook:

- [e2e060 - Spatial Clustering](e2e060_spatial_clustering.ipynb): `ClusterSimilarity` transformer for geospatial features, using K-means and an RBF kernel

The complete preprocessing pipeline is consolidated in [`utils/housing_preprocessing.py`](utils/housing_preprocessing.py) for reuse across model training notebooks.