# Understanding Transformers
* This is a practice notebook for getting a better understanding of transformers, column transformers, and pipelines in `scikit-learn`.

In [1]:
from pathlib import Path
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import plotly.express as px
from sklearn.model_selection import train_test_split, StratifiedShuffleSplit
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, MinMaxScaler, StandardScaler
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.pipeline import Pipeline,make_pipeline
from sklearn.preprocessing import FunctionTransformer
from sklearn.compose import ColumnTransformer

## Creating Mock Data

In [2]:
df = pd.DataFrame({'a': [1, 2, 3], 'b': [10, 30, 40], 'c': [10, 10, 10]})

### Function Transformer

In [8]:
## sample transformer to test the input/output

## a simple transformer function
def hello_transformer(X):
    ## transformers always need to return a 2D array (numpy array, scipy sparse array, dataframe)
    ## create column name from first two columns
    column_lst = X.iloc[:,[0,1]].columns
    col_name = f"{column_lst[0]}_to_{column_lst[1]}_ratio"
    ## here we are not checking if columns are numbers or not
    ## we are assuming that they are numbers and not 0
    result = pd.DataFrame(X.iloc[:,0] / X.iloc[:,1])
    result.columns = [col_name]
    return result

# a callable function for output names
def ratio_name(function_transformer, feature_names_in):
    col_name = f"{feature_names_in[0]}_to_{feature_names_in[1]}_ratio"
    return [col_name]  # feature names out

## using FunctionTransformer to create a transformer out of our function
my_transformer = FunctionTransformer(hello_transformer, feature_names_out=ratio_name)

## creating a column transformer
temp_transformer = ColumnTransformer([
    ("hello", my_transformer, ["a", "b"])
])

In [9]:
## fit and transform df
temp_transformer.fit_transform(df)

array([[0.1       ],
       [0.06666667],
       [0.075     ]])

In [10]:
## get transformed column names
temp_transformer.get_feature_names_out()

array(['hello__a_to_b_ratio'], dtype=object)

In [11]:
## create dataframe from transformed data
pd.DataFrame(temp_transformer.fit_transform(df), columns=temp_transformer.get_feature_names_out())

Unnamed: 0,hello__a_to_b_ratio
0,0.1
1,0.066667
2,0.075


Lessons Learnt:
* Transformers used in a `ColumnTransformer` always need to return a 2D array-like (NumPy array, SciPy sparse matrix, or DataFrame). As a rule for now, every custom transformer in this notebook returns a DataFrame.
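
To make the 2D requirement concrete, here is a minimal sketch. The helper names `ratio_1d` and `ratio_2d` are hypothetical and not part of the notebook above. It assumes a recent scikit-learn release, where `ColumnTransformer` raises a `ValueError` when a transformer returns a 1D pandas Series.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import FunctionTransformer

df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 30, 40]})

def ratio_1d(X):
    # Hypothetical helper: returns a pandas Series, which is 1D
    return X.iloc[:, 0] / X.iloc[:, 1]

def ratio_2d(X):
    # Same computation, wrapped into a single-column DataFrame (2D)
    return (X.iloc[:, 0] / X.iloc[:, 1]).to_frame("ratio")

try:
    ColumnTransformer(
        [("bad", FunctionTransformer(ratio_1d), ["a", "b"])]
    ).fit_transform(df)
    raised = False
except ValueError as err:
    # ColumnTransformer validates that every transformer output is 2D
    raised = True
    print(f"1D output rejected: {err}")

out = ColumnTransformer(
    [("ok", FunctionTransformer(ratio_2d), ["a", "b"])]
).fit_transform(df)
print(out.shape)  # (3, 1)
```

Calling `.to_frame()` on the Series (or wrapping it in `pd.DataFrame`, as the cell above does) is enough to satisfy the check.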


* Open question: if one transformer returns fewer columns than it received, does the next transformer in the list also get fewer columns?

In [15]:
def hello_again_transformer(X):
    ## transformers always need to return a 2D array (numpy array, scipy sparse array, dataframe)
    result = pd.DataFrame(X.iloc[:,0] * 2)
    result.columns = ["double"]
    return result

# a callable function for output names
def again_ratio_name(function_transformer, feature_names_in):
    return ["double"]  # feature names out

## using FunctionTransformer to create a transformer out of our function
my_another_transformer = FunctionTransformer(hello_again_transformer, feature_names_out=again_ratio_name)

# base_estimator = FunctionTransformer(lambda X: X, feature_names_out="one-to-one")

## creating a column transformer
temp_transformer_2 = ColumnTransformer([
    ("room_to_house", my_transformer, ["a", "b"]),
    ("room_to_bedroom", my_transformer, ["b", "c"]),
    ("hello_again", my_another_transformer, ["a","b"])
], remainder="passthrough", verbose_feature_names_out=False)

## set output to pandas if we want output to be a pandas dataframe.
temp_transformer_2.set_output(transform="pandas")
## fit and transform df
temp_transformer_2.fit_transform(df)

Unnamed: 0,a_to_b_ratio,b_to_c_ratio,double
0,0.1,1.0,2
1,0.066667,3.0,4
2,0.075,4.0,6


In [16]:
## interesting: even though the output is already a DataFrame, wrapping it in pd.DataFrame doesn't throw an error
transformed_data = pd.DataFrame(temp_transformer_2.fit_transform(df), columns=temp_transformer_2.get_feature_names_out())
transformed_data

Unnamed: 0,a_to_b_ratio,b_to_c_ratio,double
0,0.1,1.0,2
1,0.066667,3.0,4
2,0.075,4.0,6


Lessons Learnt:
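* Our transformer functions have to set the `columns` of the DataFrame they return, so that the output names match what the `feature_names_out` callable reports and the pipelines work as expected.
* With `verbose_feature_names_out=False` the step-name prefixes (including `remainder__`) are dropped, so output column names must then be generated dynamically to stay unique.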

### Custom Class Transformer

* Let's create a simple transformer that calculates each column's median, doubles it, and multiplies every row by the result.

In [17]:
from sklearn.base import BaseEstimator,TransformerMixin
from sklearn.utils.validation import check_array, check_is_fitted
from sklearn.pipeline import Pipeline

class SimpleMedianScaler(BaseEstimator, TransformerMixin):
    ## A simple constructor with no hyperparams
    def __init__(self):
        pass

    ## implement fit method        
    def fit(self, X, y=None):
        ## validate if X is finite numeric array        
        ## check_array always returns np array and not a dataframe
        checked_X = check_array(X)
        
        self.n_features_in_ = checked_X.shape[1]
        ## X.columns assumes X is a DataFrame; a plain ndarray would fail here
        self.feature_names_in_ = X.columns
                
        self.median_ = np.median(checked_X, axis=0)
        # print(np.median(X, axis=0))

        return self
    
    def transform(self, X):
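        ## check if it's fitted
        check_is_fitted(self, ['median_', 'n_features_in_', 'feature_names_in_'])
        ## check if it's a valid finite numeric array
        X = check_array(X)
        ## check if the number of features matches what fit saw
        ## (raise instead of assert, since asserts are skipped under `python -O`)
        if X.shape[1] != self.n_features_in_:
            raise ValueError(f"Expected {self.n_features_in_} features, got {X.shape[1]}")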
        return pd.DataFrame(X*self.median_*2, columns=self.feature_names_in_)
    
    def get_feature_names_out(self, names=None):
        # return [f"{feature}_scaled" for feature in self.feature_names_in_]
        return self.feature_names_in_

In [18]:
## using the transformer as it is
smi = SimpleMedianScaler()
smi_data = smi.fit_transform(df)
smi_data

Unnamed: 0,a,b,c
0,4.0,600.0,200.0
1,8.0,1800.0,200.0
2,12.0,2400.0,200.0


Lessons Learnt:
* `check_array` converts a DataFrame to an `ndarray`, so pandas DataFrame methods (and the column names) are not available on its result.
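
A quick sketch of what that means in practice: the column names have to be captured before validation if they are needed later. The variable names here are only illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.utils.validation import check_array

df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 30, 40], "c": [10, 10, 10]})

# Capture the names first: the validated array no longer carries them
names = list(df.columns)
checked = check_array(df)

print(type(checked).__name__)       # ndarray
print(hasattr(checked, "columns"))  # False

# Rebuild a DataFrame from the validated values when labels are needed again
restored = pd.DataFrame(checked, columns=names, index=df.index)
print(list(restored.columns))       # ['a', 'b', 'c']
```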


* Let's use this in our `ColumnTransformer`.

In [20]:
## creating a column transformer
temp_transformer_3 = ColumnTransformer([
    ("simple_median_scaler", SimpleMedianScaler(), ["a","b","c"]),
    ("room_to_house", my_transformer, ["a", "b"]),
    ("room_to_bedroom", my_transformer, ["b", "c"]),
    ("hello_again", my_another_transformer, ["a","c"]),
    # ("simple_median_imputer", SimpleMedianScaler(), ["total_rooms","households"])
], remainder="passthrough", verbose_feature_names_out=False)

## set output to pandas if we want output to be a pandas dataframe.
temp_transformer_3.set_output(transform="pandas")
## fit and transform df
temp_transformer_3.fit_transform(df)

Unnamed: 0,a,b,c,a_to_b_ratio,b_to_c_ratio,double
0,4.0,600.0,200.0,0.1,1.0,2
1,8.0,1800.0,200.0,0.066667,3.0,4
2,12.0,2400.0,200.0,0.075,4.0,6


Lessons Learnt:
* A `ColumnTransformer` returns the columns its transformers output, under the names they report. It doesn't keep the original columns if a transformer renames them in its return value.
* Each transformer in a `ColumnTransformer` gets a copy of the original, untransformed columns. One transformer's changes are not reflected in another transformer. (For that we'll need pipelines.)
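
The difference can be sketched with two toy functions, `times_ten` and `plus_one`. Both are hypothetical and are used here only to make the data flow visible. They are applied to the same column, once side by side and once chained.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer

df = pd.DataFrame({"a": [1, 2, 3]})

def times_ten(X):
    return X * 10

def plus_one(X):
    return X + 1

# Side by side: both steps receive the original column "a"
side_by_side = ColumnTransformer([
    ("times_ten", FunctionTransformer(times_ten), ["a"]),
    ("plus_one", FunctionTransformer(plus_one), ["a"]),
])
parallel_out = np.asarray(side_by_side.fit_transform(df))
print(parallel_out.tolist())  # [[10, 2], [20, 3], [30, 4]]

# Chained: plus_one receives the output of times_ten
chained = Pipeline([
    ("times_ten", FunctionTransformer(times_ten)),
    ("plus_one", FunctionTransformer(plus_one)),
])
chained_out = np.asarray(chained.fit_transform(df)).ravel()
print(chained_out.tolist())   # [11, 21, 31]
```

In the side-by-side case the second column is `a + 1`, not `a * 10 + 1`, which shows that `plus_one` never saw the scaled values.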

* Let's try creating a `Pipeline` that first scales the data with our `SimpleMedianScaler` and then computes column ratios.

In [21]:
# first, a column transformer to compare against
temp_transformer_3 = ColumnTransformer([
    ("simple_median_scaler", SimpleMedianScaler(),
     ["a", "b"]),
    ("a_to_b", my_transformer, ["a", "b"]),
    ("a_to_c", my_transformer, ["a", "c"]),
    ("b_to_c", my_transformer, ["b", "c"]),
], remainder="passthrough", verbose_feature_names_out=False)

# set output to pandas if we want output to be a pandas dataframe.
temp_transformer_3.set_output(transform="pandas")
# fit and transform df
transformed_df = temp_transformer_3.fit_transform(df)
transformed_df


Unnamed: 0,a,b,a_to_b_ratio,a_to_c_ratio,b_to_c_ratio
0,4.0,600.0,0.1,0.1,1.0
1,8.0,1800.0,0.066667,0.2,3.0
2,12.0,2400.0,0.075,0.3,4.0


In [22]:
# creating a pipeline with the steps in the same order
## this pipeline scales the data and then finds the ratio
temp_pipe_line = Pipeline([
    ("simple_median_scaler", SimpleMedianScaler()),
    ("a_to_b", my_transformer)
])

transformed_df = temp_pipe_line.fit_transform(df)
transformed_df

Unnamed: 0,a_to_b_ratio
0,0.006667
1,0.004444
2,0.005


In [23]:
temp_pipe_line.get_feature_names_out()

array(['a_to_b_ratio'], dtype=object)

Lessons Learnt:
* Pipelines are what we want when each transformation should build on the output of the previous one.
* We can then use pipelines inside a `ColumnTransformer` if we want to apply them to certain columns only and still get back a fully transformed dataset.
* Pipelines and column transformers can be very powerful tools, but implementing them takes some planning and thinking, e.g.:
    * We can agree, as a best practice, that transformers always return a DataFrame, so they work consistently everywhere.
    * We need to think about where to add a pipeline and where to add a column transformer.
    * We need to make sure that any columns that are not transformed but are necessary for model training are retained in the DataFrame.