# Transform Data

* In this notebook we'll create custom transformers and pipelines that turn the raw `training` data into features ready for ML training.
* We'll reuse the same pipeline to transform the data at prediction time.

## Import Libraries

In [39]:
## import the necessary libraries
from pathlib import Path
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import plotly.express as px
from sklearn.model_selection import train_test_split, StratifiedShuffleSplit
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, MinMaxScaler, StandardScaler
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import FunctionTransformer
from sklearn.compose import ColumnTransformer



## Load Training Data

In [40]:
processed_data_path = Path("..", "data", "processed", "housing")

In [41]:
## read data
data = pd.read_csv(Path(processed_data_path, "train_set.csv"))

In [42]:
data.head()

Unnamed: 0.1,Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,ocean_proximity,income_categories,population_categories
0,13096,-122.42,37.8,52.0,3321.0,1115.0,1576.0,1034.0,2.0987,458300.0,NEAR BAY,2,1
1,14973,-118.38,34.14,40.0,1965.0,354.0,666.0,357.0,6.0876,483800.0,<1H OCEAN,5,1
2,3785,-121.98,38.36,33.0,1083.0,217.0,562.0,203.0,2.433,101700.0,INLAND,2,1
3,14689,-117.11,33.75,17.0,4174.0,851.0,1845.0,780.0,2.2618,96100.0,INLAND,2,1
4,20507,-118.15,33.77,36.0,4366.0,1211.0,1912.0,1172.0,3.5292,361800.0,NEAR OCEAN,3,1


General Notes:
* We can either create a simple `FunctionTransformer`
* Or a custom class transformer if we need to fit the data.
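To make the difference concrete, here is a minimal sketch of both options. The names `to_thousands` and `MedianFiller` and the toy arrays are made up for illustration; they are not part of this notebook's pipeline.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import FunctionTransformer


def to_thousands(X):
    # Stateless: nothing is learned from the data.
    return X / 1000


# A FunctionTransformer is enough when there is nothing to fit.
thousands_transformer = FunctionTransformer(to_thousands)


class MedianFiller(BaseEstimator, TransformerMixin):  # hypothetical name
    """Stateful: learns column medians in fit() and reuses them in transform()."""

    def fit(self, X, y=None):
        self.medians_ = np.nanmedian(np.asarray(X, dtype=float), axis=0)
        return self

    def transform(self, X):
        X = np.asarray(X, dtype=float).copy()
        rows, cols = np.where(np.isnan(X))
        X[rows, cols] = self.medians_[cols]
        return X


# Toy data (assumed values)
train = np.array([[1.0, 10.0], [np.nan, 20.0], [3.0, np.nan]])
filler = MedianFiller().fit(train)

# Missing values in new data are filled with the medians learned from `train`.
filled = filler.transform(np.array([[np.nan, np.nan]]))
scaled = thousands_transformer.fit_transform(np.array([[2000.0]]))
```

The class version is only needed when `transform` depends on something computed during `fit`.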

We need the following data transformations (in this order):
* Fill in missing values
    * Custom Class Transformer, because we need to find the median on fit and impute on transform.
* Convert `ocean_proximity` to one-hot encoding
    * Custom Class Transformer, because we'll need to fit/transform using `OneHotEncoder`
* Feature engineering: `rooms_per_house`, `bedroom_ratio` and `people_per_house`
    * Function Transformer: simply find the ratio between two columns
* Add cluster similarity features, with a hyperparameter to control `gamma`
    * We can either reuse the same modes to calculate the similarity, in which case a simple Function Transformer is enough
    * Or we can fit and transform using Gaussian KDE every time.
* Drop outliers
    * Custom Class Transformer, since we'll need to fit the data using `IsolationForest`
    * We'll also need a boolean hyperparameter to control whether outliers are dropped.
* Transform heavy-tailed features using the logarithm
    * Simple Function Transformer using `np.log`, with `np.exp` as the inverse transformation.
* Scale all numeric features.
    * Custom Class Transformer, since we need to fit and transform
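As a small illustration of the logarithm step in the list above, the sketch below wraps `np.log` in a `FunctionTransformer` and passes `np.exp` as `inverse_func`. The sample values are made up, and this is only one possible way to set it up.

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer

# np.log as the forward transformation, np.exp as its inverse.
log_transformer = FunctionTransformer(np.log, inverse_func=np.exp)

# Toy heavy-tailed column (assumed values, all strictly positive)
population = np.array([[322.0], [2401.0], [496.0]])

logged = log_transformer.fit_transform(population)
restored = log_transformer.inverse_transform(logged)
```

Note that `np.log` is only defined for strictly positive values, so a column containing zeros would need a different treatment (for example `np.log1p`).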

## Split Features & Labels

In [43]:
## before we create the pipeline, let's split the training data into features and labels
df_features = data.drop("median_house_value", axis=1)
df_labels = data["median_house_value"].copy()

## Transformer To Fill Missing Values

* After researching a bit, I realized we can use `SimpleImputer` directly in the pipeline without needing to create a class.
* So we're skipping this step.
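For reference, a minimal sketch of what using `SimpleImputer` directly in a pipeline could look like for numeric columns. The `StandardScaler` step and the toy values are assumptions for illustration, not the final pipeline of this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# The imputer learns the median on fit and fills missing values on transform.
num_pipeline = make_pipeline(SimpleImputer(strategy="median"), StandardScaler())

# Toy column with one missing value (assumed values)
toy = pd.DataFrame({"total_bedrooms": [100.0, np.nan, 300.0, 200.0]})

num_pipeline.fit(toy)
learned_median = num_pipeline.named_steps["simpleimputer"].statistics_[0]
transformed = num_pipeline.transform(toy)
```

`make_pipeline` names each step after its lowercased class name, which is why the imputer is available as `named_steps["simpleimputer"]`.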

## Handling Categorical Values

* This is similar to the missing values transformer: no custom class is needed.
* We can use `OneHotEncoder` and `SimpleImputer` directly in the pipeline.

In [44]:
cat_pipeline = Pipeline([
    ("impute categories", SimpleImputer(strategy="most_frequent")),
    ("encode categories", OneHotEncoder(sparse_output=False, handle_unknown="ignore"))
])

In [45]:
df_features["ocean_proximity"]

0          NEAR BAY
1         <1H OCEAN
2            INLAND
3            INLAND
4        NEAR OCEAN
            ...    
16507     <1H OCEAN
16508        INLAND
16509    NEAR OCEAN
16510     <1H OCEAN
16511    NEAR OCEAN
Name: ocean_proximity, Length: 16512, dtype: object

In [46]:
## testing to see if pipelines worked
cat_pipeline.fit_transform(df_features.select_dtypes(include=object))

array([[0., 0., 0., 1., 0.],
       [1., 0., 0., 0., 0.],
       [0., 1., 0., 0., 0.],
       ...,
       [0., 0., 0., 0., 1.],
       [1., 0., 0., 0., 0.],
       [0., 0., 0., 0., 1.]])

In [47]:
cat_pipeline.get_feature_names_out()

array(['ocean_proximity_<1H OCEAN', 'ocean_proximity_INLAND',
       'ocean_proximity_ISLAND', 'ocean_proximity_NEAR BAY',
       'ocean_proximity_NEAR OCEAN'], dtype=object)

In [48]:
pd.DataFrame(cat_pipeline.fit_transform(df_features.select_dtypes(include=object)), columns=cat_pipeline.get_feature_names_out())

Unnamed: 0,ocean_proximity_<1H OCEAN,ocean_proximity_INLAND,ocean_proximity_ISLAND,ocean_proximity_NEAR BAY,ocean_proximity_NEAR OCEAN
0,0.0,0.0,0.0,1.0,0.0
1,1.0,0.0,0.0,0.0,0.0
2,0.0,1.0,0.0,0.0,0.0
3,0.0,1.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,1.0
...,...,...,...,...,...
16507,1.0,0.0,0.0,0.0,0.0
16508,0.0,1.0,0.0,0.0,0.0
16509,0.0,0.0,0.0,0.0,1.0
16510,1.0,0.0,0.0,0.0,0.0


## Feature Engineering

* For feature engineering we'll need to write a custom FunctionTransformer to find the ratio

In [49]:
## sample transformer to test the input/output
def hello_transformer(X):
    return X.iloc[:,[0]] / X.iloc[:,[1]]

def ratio_name(function_transformer, feature_names_in):
    return ["ratio"]  # feature names out

my_transformer = FunctionTransformer(hello_transformer, feature_names_out=ratio_name)


In [50]:
def column_ratio(X):
    X = X.to_numpy()  # Convert DataFrame to NumPy array
    return X[:, [0]] / X[:, [1]]

def ratio_name(function_transformer, feature_names_in):
    print("Feature names in:", feature_names_in)
    return ["ratio"]  # feature names out

def ratio_pipeline():
    return make_pipeline(
        # SimpleImputer(strategy="median"),
        FunctionTransformer(column_ratio, feature_names_out=ratio_name),
        # StandardScaler()
        )

Lesson Learnt:
* `column_ratio` function transformer only works because 

* So we are getting the whole DataFrame in X. Let's check how `ColumnTransformer` works, since that's what we need to use for our pipeline.

In [51]:
temp_transformer = ColumnTransformer([
    ("hello", ratio_pipeline(), ["total_rooms", "households"])
])

In [52]:
transformed_data = temp_transformer.fit_transform(df_features)

In [53]:
temp_transformer

In [54]:
temp_transformer.get_feature_names_out()

Feature names in: ['total_rooms' 'households']


array(['hello__ratio'], dtype=object)

In [55]:
df_features.isna().sum()

Unnamed: 0                 0
longitude                  0
latitude                   0
housing_median_age         0
total_rooms                0
total_bedrooms           168
population                 0
households                 0
median_income              0
ocean_proximity            0
income_categories          0
population_categories      0
dtype: int64