# Transform Data

* In this notebook we'll create custom transformers and pipelines that turn the raw `training` data into features ready for ML training.
* We'll reuse the same pipeline to transform the data at prediction time as well.

## Import Libraries

In [24]:
## import the necessary libraries
from pathlib import Path
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import plotly.express as px
from sklearn.model_selection import train_test_split, StratifiedShuffleSplit
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, MinMaxScaler, StandardScaler
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import FunctionTransformer
from sklearn.compose import ColumnTransformer



## Load Training Data

In [25]:
processed_data_path = Path("..", "data", "processed", "housing")

In [26]:
## read data
data = pd.read_csv(Path(processed_data_path, "train_set.csv"))

General Notes:
* We can either create a simple `FunctionTransformer` when nothing has to be learnt from the data,
* Or a Custom Class Transformer if we need to fit the data first.
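
To make the two options concrete, here is a minimal sketch on toy data. The class name `MedianFiller` and the toy columns are hypothetical and only illustrate the fit/transform split; they are not the transformers this notebook ends up using.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import FunctionTransformer

# Stateless case: nothing is learnt from the data, so a FunctionTransformer is enough.
log_transformer = FunctionTransformer(np.log1p)
logged = log_transformer.fit_transform(pd.DataFrame({"a": [0.0, 1.0]}))


# Stateful case: the medians are learnt in fit() and reused in transform().
class MedianFiller(BaseEstimator, TransformerMixin):  # hypothetical name
    def fit(self, X, y=None):
        X = pd.DataFrame(X)
        self.medians_ = X.median()  # learnt only from the data passed to fit()
        return self

    def transform(self, X):
        X = pd.DataFrame(X)
        return X.fillna(self.medians_)  # fill each column with its learnt median


toy = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [10.0, 20.0, np.nan]})
filled = MedianFiller().fit(toy).transform(toy)
```

The class version matters at prediction time: `transform` reuses the medians stored during `fit` instead of recomputing them on new data.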

We need the following data transformations (in this order):
* Fill in missing values
    * Custom Class Transformer, because we need to find the median on fit and impute on transform.
* Convert `ocean_proximity` to one-hot encoding
    * Custom Class Transformer, because we'll need to fit/transform using `OneHotEncoder`.
* Feature engineering: `rooms_per_house`, `bedroom_ratio` and `people_per_house`
    * Function Transformer: simply find the ratio between two columns.
* Add cluster similarity features, with a hyperparameter to control `gamma`
    * We can either reuse the same modes to calculate the similarity, in which case a simple Function Transformer is enough,
    * Or we can fit and transform using a Gaussian KDE every time.
* Drop outliers
    * Custom Class Transformer, since we'll need to fit the data using `IsolationForest`.
    * We'll also need a boolean hyperparameter to control whether outliers are dropped.
* Transform heavy-tailed features using the logarithm
    * Simple Function Transformer using `np.log`, with `np.exp` as the inverse transformation.
* Scale all numeric features
    * Custom Class Transformer, since we need to fit and transform.
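
Two of the stateless items above can be sketched with `FunctionTransformer` alone. The reference value `35.0` and the `gamma` of `0.1` are assumptions chosen for illustration, not values taken from the data.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.preprocessing import FunctionTransformer

# Log transform with np.exp registered as the inverse transformation.
log_transformer = FunctionTransformer(np.log, inverse_func=np.exp)
values = np.array([[1.0], [10.0], [100.0]])
logged = log_transformer.fit_transform(values)
restored = log_transformer.inverse_transform(logged)

# Similarity to a fixed reference point. gamma is passed through kw_args,
# so it can be tuned like any other hyperparameter.
reference = np.array([[35.0]])  # hypothetical mode, e.g. of housing_median_age
similarity_transformer = FunctionTransformer(
    rbf_kernel, kw_args={"Y": reference, "gamma": 0.1}
)
ages = np.array([[35.0], [25.0], [52.0]])
similarity = similarity_transformer.fit_transform(ages)  # one column per reference point
```

A larger `gamma` makes the similarity fall off faster as a value moves away from the reference point.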

## Split Features & Labels

In [27]:
## before we create the pipeline, let's split the training data into features and labels
df_features = data.drop("median_house_value", axis=1)
df_labels = data["median_house_value"].copy()

## Custom Transformers & Pipelines

* Before we start writing transformers, let's write some sample code to understand how transformers and pipelines work. This section is just for learning purposes.
* I might remove it from the notebook in the future.

In [28]:
## sample transformer to test the input/output

## a simple transformer function
def hello_transformer(X):
    ## transformers always need to return a 2D array (numpy array, scipy sparse array, dataframe)
    ## create column name from first two columns
    column_lst = X.iloc[:,[0,1]].columns
    col_name = f"{column_lst[0]}_to_{column_lst[1]}_ratio"
    result = pd.DataFrame(X.iloc[:,0] / X.iloc[:,1])
    result.columns = [col_name]
    return result

# a callable function for output names
def ratio_name(function_transformer, feature_names_in):
    col_name = f"{feature_names_in[0]}_to_{feature_names_in[1]}_ratio"
    return [col_name]  # feature names out

## using FunctionTransformer to create a transformer out of our function
my_transformer = FunctionTransformer(hello_transformer, feature_names_out=ratio_name)

## creating a column transformer
temp_transformer = ColumnTransformer([
    ("hello", my_transformer, ["total_rooms", "households"])
])

## fit transform to df_features
temp_transformer.fit_transform(df_features)

array([[3.21179884],
       [5.50420168],
       [5.33497537],
       ...,
       [5.15789474],
       [4.51193317],
       [2.03301887]])

In [29]:
temp_transformer.get_feature_names_out()

array(['hello__total_rooms_to_households_ratio'], dtype=object)

Lessons Learnt:
* Transformers always need to return a 2D array (numpy array, scipy sparse array, DataFrame). For now, as a rule, we'll convert the return data of every custom transformer to a DataFrame.


* I wonder: if one transformer returns fewer columns, does the next transformer in the sequence receive fewer columns?
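
One way to probe this question is to record which columns each step actually receives. The sketch below uses toy data and hypothetical helper names (`keep_first_column`, `make_recorder`). It relies on the usual scikit-learn behaviour: `Pipeline` steps run in sequence, while each `ColumnTransformer` entry gets its own slice of the original input.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer

toy = pd.DataFrame(
    {"rooms": [6.0, 8.0], "households": [2.0, 4.0], "population": [5.0, 9.0]}
)
seen = {}  # label -> column names received by that transformer


def keep_first_column(X):
    return X.iloc[:, [0]]  # returns fewer columns than it receives


def make_recorder(label):
    def recorder(X):
        seen[label] = list(X.columns)
        return X
    return recorder


# Pipeline: the second step receives the output of the first step.
sequential = Pipeline([
    ("first", FunctionTransformer(keep_first_column)),
    ("second", FunctionTransformer(make_recorder("pipeline_second"))),
])
sequential.fit_transform(toy)

# ColumnTransformer: every entry is fed from the original input, independently.
parallel = ColumnTransformer([
    ("a", FunctionTransformer(keep_first_column), ["rooms", "households"]),
    ("b", FunctionTransformer(make_recorder("ct_b")), ["households", "population"]),
])
combined = parallel.fit_transform(toy)
```

So inside a `Pipeline` the answer is yes, while inside a single `ColumnTransformer` the entries do not affect each other's inputs.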

In [30]:
def hello_again_transformer(X):
    ## transformers always need to return a 2D array (numpy array, scipy sparse array, dataframe)
    result = pd.DataFrame(X.iloc[:,0] * 2)
    result.columns = ["double"]
    return result

# a callable function for output names
def again_ratio_name(function_transformer, feature_names_in):
    print(f"again_ratio_name_{function_transformer}")
    return ["double"]  # feature names out

## using FunctionTransformer to create a transformer out of our function
my_another_transformer = FunctionTransformer(hello_again_transformer, feature_names_out=again_ratio_name)

# base_estimator = FunctionTransformer(lambda X: X, feature_names_out="one-to-one")

## creating a column transformer
temp_transformer_2 = ColumnTransformer([
    ("room_to_house", my_transformer, ["total_rooms", "households"]),
    ("room_to_bedroom", my_transformer, ["total_rooms", "total_bedrooms"]),
    # ("hello_again", my_another_transformer, ["total_rooms","total_bedrooms"])
], remainder="passthrough", verbose_feature_names_out=False)

temp_transformer_2.set_output(transform="pandas")
## fit transform to df_features
temp_transformer_2.fit_transform(df_features)

Unnamed: 0,total_rooms_to_households_ratio,total_rooms_to_total_bedrooms_ratio,longitude,latitude,housing_median_age,population,median_income,ocean_proximity,income_categories,population_categories
0,3.211799,2.978475,-122.42,37.80,52.0,1576.0,2.0987,NEAR BAY,2,1
1,5.504202,5.550847,-118.38,34.14,40.0,666.0,6.0876,<1H OCEAN,5,1
2,5.334975,4.990783,-121.98,38.36,33.0,562.0,2.4330,INLAND,2,1
3,5.351282,4.904818,-117.11,33.75,17.0,1845.0,2.2618,INLAND,2,1
4,3.725256,3.605285,-118.15,33.77,36.0,1912.0,3.5292,NEAR OCEAN,3,1
...,...,...,...,...,...,...,...,...,...,...
16507,4.277247,3.747069,-118.40,33.86,41.0,938.0,4.7105,<1H OCEAN,4,1
16508,5.535714,4.974662,-119.31,36.32,23.0,1419.0,2.5733,INLAND,2,1
16509,5.157895,5.058065,-117.06,32.59,13.0,2814.0,4.0616,NEAR OCEAN,3,1
16510,4.511933,4.331042,-118.40,34.06,37.0,1725.0,4.1455,<1H OCEAN,3,1


In [31]:
df_features

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,ocean_proximity,income_categories,population_categories
0,-122.42,37.80,52.0,3321.0,1115.0,1576.0,1034.0,2.0987,NEAR BAY,2,1
1,-118.38,34.14,40.0,1965.0,354.0,666.0,357.0,6.0876,<1H OCEAN,5,1
2,-121.98,38.36,33.0,1083.0,217.0,562.0,203.0,2.4330,INLAND,2,1
3,-117.11,33.75,17.0,4174.0,851.0,1845.0,780.0,2.2618,INLAND,2,1
4,-118.15,33.77,36.0,4366.0,1211.0,1912.0,1172.0,3.5292,NEAR OCEAN,3,1
...,...,...,...,...,...,...,...,...,...,...,...
16507,-118.40,33.86,41.0,2237.0,597.0,938.0,523.0,4.7105,<1H OCEAN,4,1
16508,-119.31,36.32,23.0,2945.0,592.0,1419.0,532.0,2.5733,INLAND,2,1
16509,-117.06,32.59,13.0,3920.0,775.0,2814.0,760.0,4.0616,NEAR OCEAN,3,1
16510,-118.40,34.06,37.0,3781.0,873.0,1725.0,838.0,4.1455,<1H OCEAN,3,1


In [32]:
transformed_data = pd.DataFrame(temp_transformer_2.fit_transform(df_features), columns=temp_transformer_2.get_feature_names_out())

Lessons Learnt:
* We'll have to set the `columns` attribute of the DataFrame returned by our transformers to make sure the pipelines work as expected.

In [33]:
## inspecting the DataFrame rebuilt from the `fit_transform` output and the feature names
transformed_data

Unnamed: 0,total_rooms_to_households_ratio,total_rooms_to_total_bedrooms_ratio,longitude,latitude,housing_median_age,population,median_income,ocean_proximity,income_categories,population_categories
0,3.211799,2.978475,-122.42,37.80,52.0,1576.0,2.0987,NEAR BAY,2,1
1,5.504202,5.550847,-118.38,34.14,40.0,666.0,6.0876,<1H OCEAN,5,1
2,5.334975,4.990783,-121.98,38.36,33.0,562.0,2.4330,INLAND,2,1
3,5.351282,4.904818,-117.11,33.75,17.0,1845.0,2.2618,INLAND,2,1
4,3.725256,3.605285,-118.15,33.77,36.0,1912.0,3.5292,NEAR OCEAN,3,1
...,...,...,...,...,...,...,...,...,...,...
16507,4.277247,3.747069,-118.40,33.86,41.0,938.0,4.7105,<1H OCEAN,4,1
16508,5.535714,4.974662,-119.31,36.32,23.0,1419.0,2.5733,INLAND,2,1
16509,5.157895,5.058065,-117.06,32.59,13.0,2814.0,4.0616,NEAR OCEAN,3,1
16510,4.511933,4.331042,-118.40,34.06,37.0,1725.0,4.1455,<1H OCEAN,3,1


## Transformer To Fill Missing Values

* After researching a bit, I realized we can use `SimpleImputer` directly in the pipeline without needing to create a class.
* So we're skipping this step.
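
A minimal sketch of that idea on toy data: the imputer learns the median on `fit` and reuses it on `transform`, which is the behaviour a custom class would have had to provide. The step names and the toy values are placeholders.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

train = pd.DataFrame({"total_bedrooms": [100.0, np.nan, 300.0, 500.0]})
new = pd.DataFrame({"total_bedrooms": [np.nan]})

num_pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="median")),  # median is learnt from train only
    ("scale", StandardScaler()),
])
num_pipeline.fit(train)

learnt_median = num_pipeline.named_steps["impute"].statistics_[0]
transformed_new = num_pipeline.transform(new)  # missing value filled, then scaled
```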

## Handling Categorical Values

* This is similar to the missing-values transformer.
* We can use `OneHotEncoder` and `SimpleImputer` directly in the pipeline.

In [34]:
cat_pipeline = Pipeline([
    ("impute categories", SimpleImputer(strategy="most_frequent")),
    ("encode categories", OneHotEncoder(sparse_output=False, handle_unknown="ignore"))
])

In [35]:
df_features["ocean_proximity"]

0          NEAR BAY
1         <1H OCEAN
2            INLAND
3            INLAND
4        NEAR OCEAN
            ...    
16507     <1H OCEAN
16508        INLAND
16509    NEAR OCEAN
16510     <1H OCEAN
16511    NEAR OCEAN
Name: ocean_proximity, Length: 16512, dtype: object

In [36]:
## testing to see if pipelines worked
cat_pipeline.fit_transform(df_features.select_dtypes(include=object))

array([[0., 0., 0., 1., 0.],
       [1., 0., 0., 0., 0.],
       [0., 1., 0., 0., 0.],
       ...,
       [0., 0., 0., 0., 1.],
       [1., 0., 0., 0., 0.],
       [0., 0., 0., 0., 1.]])

In [37]:
cat_pipeline.get_feature_names_out()

array(['ocean_proximity_<1H OCEAN', 'ocean_proximity_INLAND',
       'ocean_proximity_ISLAND', 'ocean_proximity_NEAR BAY',
       'ocean_proximity_NEAR OCEAN'], dtype=object)

In [38]:
pd.DataFrame(cat_pipeline.fit_transform(df_features.select_dtypes(include=object)), columns=cat_pipeline.get_feature_names_out())

Unnamed: 0,ocean_proximity_<1H OCEAN,ocean_proximity_INLAND,ocean_proximity_ISLAND,ocean_proximity_NEAR BAY,ocean_proximity_NEAR OCEAN
0,0.0,0.0,0.0,1.0,0.0
1,1.0,0.0,0.0,0.0,0.0
2,0.0,1.0,0.0,0.0,0.0
3,0.0,1.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,1.0
...,...,...,...,...,...
16507,1.0,0.0,0.0,0.0,0.0
16508,0.0,1.0,0.0,0.0,0.0
16509,0.0,0.0,0.0,0.0,1.0
16510,1.0,0.0,0.0,0.0,0.0
