# Transform Data

* In this notebook we'll create custom transformers and pipelines that turn the raw `training` data into transformed data ready for ML training.
* We'll reuse the same pipeline to transform the data at prediction time.

## Import Libraries

In [28]:
## import the necessary libraries
from pathlib import Path
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import plotly.express as px
from sklearn.model_selection import train_test_split, StratifiedShuffleSplit
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, MinMaxScaler, StandardScaler
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import FunctionTransformer
from sklearn.compose import ColumnTransformer



## Load Training Data

In [29]:
processed_data_path = Path("..", "data", "processed", "housing")

In [30]:
## read data
data = pd.read_csv(Path(processed_data_path, "train_set.csv"))

General Notes:
* We can either create a simple `FunctionTransformer`
* Or a Custom Class Transformer if we need to fit the data.

We need the following data transformations (in this order):
* Fill in missing values
    * Custom Class Transformer, because we need to learn the median on fit and impute on transform.
* Convert `ocean_proximity` to one-hot encoding
    * Custom Class Transformer, because we'll need to fit/transform using `OneHotEncoder`.
* Feature engineering: `rooms_per_house`, `bedroom_ratio` and `people_per_house`
    * Function Transformer, since each feature is simply the ratio of two columns.
* Add cluster similarity features, with a hyperparameter to control `gamma`
    * We can either reuse the same modes to calculate the similarity, in which case a simple Function Transformer is enough,
    * or we can fit and transform using Gaussian KDE every time.
* Drop outliers
    * Custom Class Transformer, since we'll need to fit the data using `IsolationForest`.
    * We'll also need a boolean hyperparameter to control whether outliers are dropped.
* Transform heavy-tailed features using the logarithm
    * Simple Function Transformer that applies `np.log`, with `np.exp` as the inverse transformation.
* Scale all numeric features
    * Custom Class Transformer, since we need to fit and transform.
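
As a rough sketch of the two Function Transformer items in the list above, the log transform and a similarity feature could look like the following. The toy values, the reference point `35.0`, and `gamma=0.1` are placeholders for illustration, not values from this notebook, and `rbf_kernel` is only one possible similarity measure.

```python
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.preprocessing import FunctionTransformer

# Toy data: the column names mirror the housing set, the values are made up.
toy = pd.DataFrame({
    "population": [100.0, 1000.0, 10000.0],
    "housing_median_age": [10.0, 35.0, 52.0],
})

# Log transform with np.exp registered as the inverse transformation.
log_transformer = FunctionTransformer(np.log, inverse_func=np.exp)
logged = log_transformer.fit_transform(toy[["population"]])
restored = log_transformer.inverse_transform(logged)

# Similarity to a single reference point (hypothetical: 35.0, gamma=0.1).
similarity_transformer = FunctionTransformer(
    rbf_kernel, kw_args={"Y": [[35.0]], "gamma": 0.1}
)
age_similarity = similarity_transformer.fit_transform(toy[["housing_median_age"]])
```

Passing `gamma` through `kw_args` keeps it visible as a parameter that could be tuned later.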

## Split Features & Labels

In [31]:
## before we create the pipeline, let's split the training data into features and labels
df_features = data.drop("median_house_value", axis=1)
df_labels = data["median_house_value"].copy()

## Custom Transformers & Pipelines

* Before we start writing transformers, we need some sample code to understand how transformers and pipelines work. This section is just for learning purposes.
* I might remove it from the notebook in the future.

### Function Transformer

In [32]:
## sample transformer to test the input/output

## a simple transformer function
def hello_transformer(X):
    ## transformers always need to return a 2D array (numpy array, scipy sparse array, dataframe)
    ## create column name from first two columns
    column_lst = X.iloc[:,[0,1]].columns
    col_name = f"{column_lst[0]}_to_{column_lst[1]}_ratio"
    result = pd.DataFrame(X.iloc[:,0] / X.iloc[:,1])
    result.columns = [col_name]
    return result

# a callable function for output names
def ratio_name(function_transformer, feature_names_in):
    col_name = f"{feature_names_in[0]}_to_{feature_names_in[1]}_ratio"
    return [col_name]  # feature names out

## using FunctionTransformer to create a transformer out of our function
my_transformer = FunctionTransformer(hello_transformer, feature_names_out=ratio_name)

## creating a column transformer
temp_transformer = ColumnTransformer([
    ("hello", my_transformer, ["total_rooms", "households"])
])

## fit transform to df_features
temp_transformer.fit_transform(df_features)

array([[3.21179884],
       [5.50420168],
       [5.33497537],
       ...,
       [5.15789474],
       [4.51193317],
       [2.03301887]])

In [33]:
temp_transformer.get_feature_names_out()

array(['hello__total_rooms_to_households_ratio'], dtype=object)

Lessons Learnt:
* Transformers always need to return a 2D array (NumPy array, SciPy sparse matrix, or DataFrame). For now, as a rule, we'll have every custom transformer return a DataFrame.
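
To make the 2D rule concrete, here is a small sketch on made-up data. Dividing two columns gives a 1D `Series`, so the function wraps it in a single-column DataFrame before returning. The helper name `ratio_frame` and the column name `ratio` are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import FunctionTransformer

# Toy data with made-up values.
toy = pd.DataFrame({"total_rooms": [10.0, 20.0], "households": [2.0, 5.0]})

# Dividing two columns yields a 1D Series ...
ratio_series = toy["total_rooms"] / toy["households"]

# ... so wrap it in a single-column DataFrame before returning it.
def ratio_frame(X):
    ratio = X.iloc[:, 0] / X.iloc[:, 1]
    return ratio.to_frame(name="ratio")

ratio_2d = FunctionTransformer(ratio_frame).fit_transform(toy)
```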


* If one transformer returns fewer columns, does the next transformer in the sequence receive fewer columns too?

In [34]:
def hello_again_transformer(X):
    ## transformers always need to return a 2D array (numpy array, scipy sparse array, dataframe)
    result = pd.DataFrame(X.iloc[:,0] * 2)
    result.columns = ["double"]
    return result

# a callable function for output names
def again_ratio_name(function_transformer, feature_names_in):
    return ["double"]  # feature names out

## using FunctionTransformer to create a transformer out of our function
my_another_transformer = FunctionTransformer(hello_again_transformer, feature_names_out=again_ratio_name)

# base_estimator = FunctionTransformer(lambda X: X, feature_names_out="one-to-one")

## creating a column transformer
temp_transformer_2 = ColumnTransformer([
    ("room_to_house", my_transformer, ["total_rooms", "households"]),
    ("room_to_bedroom", my_transformer, ["total_rooms", "total_bedrooms"]),
    # ("hello_again", my_another_transformer, ["total_rooms","total_bedrooms"])
], remainder="passthrough", verbose_feature_names_out=False)

## set output to pandas if we want output to be a pandas dataframe.
temp_transformer_2.set_output(transform="pandas")
## fit transform to df_features
temp_transformer_2.fit_transform(df_features)

Unnamed: 0,total_rooms_to_households_ratio,total_rooms_to_total_bedrooms_ratio,longitude,latitude,housing_median_age,population,median_income,ocean_proximity,income_categories,population_categories
0,3.211799,2.978475,-122.42,37.80,52.0,1576.0,2.0987,NEAR BAY,2,1
1,5.504202,5.550847,-118.38,34.14,40.0,666.0,6.0876,<1H OCEAN,5,1
2,5.334975,4.990783,-121.98,38.36,33.0,562.0,2.4330,INLAND,2,1
3,5.351282,4.904818,-117.11,33.75,17.0,1845.0,2.2618,INLAND,2,1
4,3.725256,3.605285,-118.15,33.77,36.0,1912.0,3.5292,NEAR OCEAN,3,1
...,...,...,...,...,...,...,...,...,...,...
16507,4.277247,3.747069,-118.40,33.86,41.0,938.0,4.7105,<1H OCEAN,4,1
16508,5.535714,4.974662,-119.31,36.32,23.0,1419.0,2.5733,INLAND,2,1
16509,5.157895,5.058065,-117.06,32.59,13.0,2814.0,4.0616,NEAR OCEAN,3,1
16510,4.511933,4.331042,-118.40,34.06,37.0,1725.0,4.1455,<1H OCEAN,3,1


In [35]:
## interesting: even though the output is already a DataFrame, pd.DataFrame doesn't throw an error
transformed_data = pd.DataFrame(temp_transformer_2.fit_transform(df_features), columns=temp_transformer_2.get_feature_names_out())
transformed_data

Unnamed: 0,total_rooms_to_households_ratio,total_rooms_to_total_bedrooms_ratio,longitude,latitude,housing_median_age,population,median_income,ocean_proximity,income_categories,population_categories
0,3.211799,2.978475,-122.42,37.80,52.0,1576.0,2.0987,NEAR BAY,2,1
1,5.504202,5.550847,-118.38,34.14,40.0,666.0,6.0876,<1H OCEAN,5,1
2,5.334975,4.990783,-121.98,38.36,33.0,562.0,2.4330,INLAND,2,1
3,5.351282,4.904818,-117.11,33.75,17.0,1845.0,2.2618,INLAND,2,1
4,3.725256,3.605285,-118.15,33.77,36.0,1912.0,3.5292,NEAR OCEAN,3,1
...,...,...,...,...,...,...,...,...,...,...
16507,4.277247,3.747069,-118.40,33.86,41.0,938.0,4.7105,<1H OCEAN,4,1
16508,5.535714,4.974662,-119.31,36.32,23.0,1419.0,2.5733,INLAND,2,1
16509,5.157895,5.058065,-117.06,32.59,13.0,2814.0,4.0616,NEAR OCEAN,3,1
16510,4.511933,4.331042,-118.40,34.06,37.0,1725.0,4.1455,<1H OCEAN,3,1


Lessons Learnt:
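* Our transformers must name their output columns (here via `feature_names_out`) so that `get_feature_names_out()` keeps working through the pipeline.
* Setting `verbose_feature_names_out=False` drops the `<name>__` and `remainder__` prefixes, but then every output column name has to be unique, which is why we build the ratio column names dynamically.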

### Custom Class Transformer

* Let's create a simple transformer that calculates each column's median, doubles it, and multiplies every row by the result.

In [69]:
from sklearn.base import BaseEstimator,TransformerMixin
from sklearn.utils.validation import check_array, check_is_fitted
from sklearn.pipeline import Pipeline

class SimpleMedianScaler(BaseEstimator, TransformerMixin):
    ## A simple constructor with no hyperparams
    def __init__(self):
        pass

    ## implement fit method        
    def fit(self, X, y=None):
        ## validate if X is finite numeric array        
        ## check_array always returns np array and not a dataframe
        checked_X = check_array(X)
        
        self.n_features_in_ = checked_X.shape[1]
        self.feature_names_in_ = X.columns
                
        self.median_ = np.median(checked_X, axis=0)
        # print(np.median(X, axis=0))

        return self
    
    def transform(self, X):
        ## check if it's fitted
        check_is_fitted(self, ['median_', 'n_features_in_', 'feature_names_in_'])
        ## check if it's a valid finite numeric array
        X = check_array(X)
        ## check if number of features match
        assert self.n_features_in_ == X.shape[1]
        return pd.DataFrame(X*self.median_*2, columns=self.feature_names_in_)
    
    def get_feature_names_out(self, names=None):
        # return [f"{feature}_scaled" for feature in self.feature_names_in_]
        return self.feature_names_in_

In [70]:
## using the transformer as it is
df = pd.DataFrame({'a': [1, 2, 3], 'b': [10, 30, 40]})
smi = SimpleMedianScaler()
smi_data = smi.fit_transform(df)
smi_data

Unnamed: 0,a,b
0,4.0,600.0
1,8.0,1800.0
2,12.0,2400.0


Lessons Learnt:
* `check_array` converts a DataFrame to an ndarray, so pandas DataFrame methods won't work on its result.
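
A minimal sketch of that behaviour, using the same toy values as the cell above: the column names have to be captured before validation if we want to rebuild a DataFrame afterwards. The variable names are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.utils.validation import check_array

toy = pd.DataFrame({"a": [1, 2, 3], "b": [10, 30, 40]})

# Capture the column names before validation, because check_array drops them.
column_names = list(toy.columns)
checked = check_array(toy)

# Rebuild a DataFrame when pandas methods are needed again.
rebuilt = pd.DataFrame(checked, columns=column_names, index=toy.index)
```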


* Let's use this in our `ColumnTransformer`.

In [71]:
## creating a column transformer
temp_transformer_3 = ColumnTransformer([
    ("simple_median_imputer", SimpleMedianScaler(), ["total_rooms","households"]),
    ("room_to_house", my_transformer, ["total_rooms", "households"]),
    ("room_to_bedroom", my_transformer, ["total_rooms", "total_bedrooms"]),
    ("hello_again", my_another_transformer, ["total_rooms","total_bedrooms"]),
    # ("simple_median_imputer", SimpleMedianScaler(), ["total_rooms","households"])
], remainder="passthrough", verbose_feature_names_out=False)

## set output to pandas if we want output to be a pandas dataframe.
temp_transformer_3.set_output(transform="pandas")
## fit transform to df_features
temp_transformer_3.fit_transform(df_features)

Unnamed: 0,total_rooms,households,total_rooms_to_households_ratio,total_rooms_to_total_bedrooms_ratio,double,longitude,latitude,housing_median_age,population,median_income,ocean_proximity,income_categories,population_categories
0,14114250.0,843744.0,3.211799,2.978475,6642.0,-122.42,37.80,52.0,1576.0,2.0987,NEAR BAY,2,1
1,8351250.0,291312.0,5.504202,5.550847,3930.0,-118.38,34.14,40.0,666.0,6.0876,<1H OCEAN,5,1
2,4602750.0,165648.0,5.334975,4.990783,2166.0,-121.98,38.36,33.0,562.0,2.4330,INLAND,2,1
3,17739500.0,636480.0,5.351282,4.904818,8348.0,-117.11,33.75,17.0,1845.0,2.2618,INLAND,2,1
4,18555500.0,956352.0,3.725256,3.605285,8732.0,-118.15,33.77,36.0,1912.0,3.5292,NEAR OCEAN,3,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...
16507,9507250.0,426768.0,4.277247,3.747069,4474.0,-118.40,33.86,41.0,938.0,4.7105,<1H OCEAN,4,1
16508,12516250.0,434112.0,5.535714,4.974662,5890.0,-119.31,36.32,23.0,1419.0,2.5733,INLAND,2,1
16509,16660000.0,620160.0,5.157895,5.058065,7840.0,-117.06,32.59,13.0,2814.0,4.0616,NEAR OCEAN,3,1
16510,16069250.0,683808.0,4.511933,4.331042,7562.0,-118.40,34.06,37.0,1725.0,4.1455,<1H OCEAN,3,1


Lessons Learnt:
* The column transformer returns the columns under the names each transformer reports; if a transformer renames its output, the original columns are not kept.
* Each transformer in a column transformer gets a pre-transformation copy of the DataFrame. One transformer's changes are not visible to another transformer. (For that we'll need pipelines.)

* Let's try creating a `Pipeline` that first scales the data with our `SimpleMedianScaler` and then finds the column ratios.

In [72]:
df = pd.DataFrame({'a': [1, 2, 3], 'b': [10, 30, 40], 'c': [10, 10, 10]})

# first column transformer to test
temp_transformer_3 = ColumnTransformer([
    ("simple_median_scaler", SimpleMedianScaler(),
     ["a", "b"]),
    ("a_to_b", my_transformer, ["a", "b"]),
    ("a_to_c", my_transformer, ["a", "c"]),
    ## note: this step is named "b_to_c" but selects "b" twice, so it yields b_to_b_ratio (all 1.0)
    ("b_to_c", my_transformer, ["b", "b"]),
], remainder="passthrough", verbose_feature_names_out=False)

# set output to pandas if we want output to be a pandas dataframe.
temp_transformer_3.set_output(transform="pandas")
# fit transform to df_features
transformed_df = temp_transformer_3.fit_transform(df)
transformed_df


Unnamed: 0,a,b,a_to_b_ratio,a_to_c_ratio,b_to_b_ratio
0,4.0,600.0,0.1,0.1,1.0
1,8.0,1800.0,0.066667,0.2,1.0
2,12.0,2400.0,0.075,0.3,1.0


In [74]:
# creating a pipeline with the steps in the same order
## this pipeline scales the data and then finds the ratio
temp_pipe_line = Pipeline([
    ("simple_median_scaler", SimpleMedianScaler()),
    ("a_to_b", my_transformer)
])

transformed_df = temp_pipe_line.fit_transform(df)
transformed_df

Unnamed: 0,a_to_b_ratio
0,0.006667
1,0.004444
2,0.005


In [75]:
temp_pipe_line.get_feature_names_out()

array(['a_to_b_ratio'], dtype=object)

Lessons Learnt:
* Pipelines are what we want when each transformation should feed into the next.
* We can then use pipelines inside column transformers if we want to apply them to certain columns only and get back a fully transformed dataset.
* Pipelines and column transformers can be very powerful tools, but implementing them takes some planning. For example, we can agree on the best practice that they always return a DataFrame, so they work consistently everywhere.

## Transformer To Fill Missing Values

* After researching a bit, I realized we can use `SimpleImputer` directly in the pipeline without needing to create a class.
* So we're skipping this step.
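
For reference, this is roughly what `SimpleImputer` gives us out of the box: it learns the median on `fit` and reuses it on `transform`, which is the behaviour we wanted from a custom class. The values below are made up.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy training and prediction data with made-up values.
train = pd.DataFrame({"total_bedrooms": [100.0, np.nan, 300.0, 500.0]})
new_data = pd.DataFrame({"total_bedrooms": [np.nan, 50.0]})

imputer = SimpleImputer(strategy="median")
imputer.fit(train)                       # learns the median of the training column
learned_median = imputer.statistics_[0]  # 300.0 for this toy data

# Missing values in new data are filled with the training median.
filled = imputer.transform(new_data)
```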

## Handling Categorical Values

* This is similar to the missing-values transformer.
* We can use `OneHotEncoder` and `SimpleImputer` directly in the pipeline.

In [39]:
cat_pipeline = Pipeline([
    ("impute categories", SimpleImputer(strategy="most_frequent")),
    ("encode categories", OneHotEncoder(sparse_output=False, handle_unknown="ignore"))
])
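
A small sketch of what `handle_unknown="ignore"` does at prediction time, assuming a toy column and a made-up category (`LAKESIDE`) that never appears during fitting. The encoder's default sparse output is densified here only for inspection.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

train = pd.DataFrame({"ocean_proximity": ["INLAND", "NEAR BAY", np.nan, "INLAND"]})
# "LAKESIDE" is a hypothetical category that is absent from the training data.
unseen = pd.DataFrame({"ocean_proximity": ["LAKESIDE", "NEAR BAY"]})

toy_cat_pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("encode", OneHotEncoder(handle_unknown="ignore")),
])

# The missing training value is imputed as "INLAND", the most frequent category.
train_encoded = toy_cat_pipeline.fit_transform(train).toarray()

# An unknown category is encoded as an all-zero row instead of raising an error.
unseen_encoded = toy_cat_pipeline.transform(unseen).toarray()
```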

In [40]:
df_features["ocean_proximity"]

0          NEAR BAY
1         <1H OCEAN
2            INLAND
3            INLAND
4        NEAR OCEAN
            ...    
16507     <1H OCEAN
16508        INLAND
16509    NEAR OCEAN
16510     <1H OCEAN
16511    NEAR OCEAN
Name: ocean_proximity, Length: 16512, dtype: object

In [41]:
## testing to see if pipelines worked
cat_pipeline.fit_transform(df_features.select_dtypes(include=object))

array([[0., 0., 0., 1., 0.],
       [1., 0., 0., 0., 0.],
       [0., 1., 0., 0., 0.],
       ...,
       [0., 0., 0., 0., 1.],
       [1., 0., 0., 0., 0.],
       [0., 0., 0., 0., 1.]])

In [42]:
cat_pipeline.get_feature_names_out()

array(['ocean_proximity_<1H OCEAN', 'ocean_proximity_INLAND',
       'ocean_proximity_ISLAND', 'ocean_proximity_NEAR BAY',
       'ocean_proximity_NEAR OCEAN'], dtype=object)

In [43]:
pd.DataFrame(cat_pipeline.fit_transform(df_features.select_dtypes(include=object)), columns=cat_pipeline.get_feature_names_out())

Unnamed: 0,ocean_proximity_<1H OCEAN,ocean_proximity_INLAND,ocean_proximity_ISLAND,ocean_proximity_NEAR BAY,ocean_proximity_NEAR OCEAN
0,0.0,0.0,0.0,1.0,0.0
1,1.0,0.0,0.0,0.0,0.0
2,0.0,1.0,0.0,0.0,0.0
3,0.0,1.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,1.0
...,...,...,...,...,...
16507,1.0,0.0,0.0,0.0,0.0
16508,0.0,1.0,0.0,0.0,0.0
16509,0.0,0.0,0.0,0.0,1.0
16510,1.0,0.0,0.0,0.0,0.0
