# Scratch Pad
* This notebook is for quickly experimenting with different code implementations before we add them to notebooks or production scripts.
* In short, it's a scratch pad for understanding how certain APIs or pieces of code behave.

In [4]:
%pip install scikit-learn
%pip install pandas
%pip install numpy

Note: you may need to restart the kernel to use updated packages.


In [23]:
import os
import sys
import pandas as pd
import numpy as np
from sklearn.compose import ColumnTransformer
from pathlib import Path

# Build an absolute path from this notebook's parent directory
module_path = os.path.abspath(os.path.join('..'))

# Add to sys.path if not already present
if module_path not in sys.path:
    sys.path.append(module_path)
    
from src.utils import categorical_preprocessing, numerical_preprocessing, custom_encoder

In [24]:
categories = [['Male','Female'],[1,2,3]]

ce = custom_encoder.CustomEncoder(encoding="ordinal", categories=categories)
X = [['Male', 1.0], ['Female', 3.0], ['Female', 2.0]]

temp = ce.fit_transform(X)

In [25]:
temp

array([[0., 0.],
       [1., 2.],
       [1., 1.]])

In [26]:
X

[['Male', 1.0], ['Female', 3.0], ['Female', 2.0]]
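
For comparison, the same encoding can be reproduced with scikit-learn's built-in OrdinalEncoder. This is only a sketch: it assumes that CustomEncoder with encoding="ordinal" behaves like OrdinalEncoder, which this notebook does not confirm.

```python
from sklearn.preprocessing import OrdinalEncoder

categories = [['Male', 'Female'], [1, 2, 3]]
X = [['Male', 1.0], ['Female', 3.0], ['Female', 2.0]]

# Each value is replaced by its position in the matching category list.
encoder = OrdinalEncoder(categories=categories)
encoded = encoder.fit_transform(X)
print(encoded)

# fit_transform returns a new array; the input list is left as it was.
print(X)
```

If the two encoders agree, the printed array should match the value of temp shown above.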

## Mutation Experiment
* We are trying to understand how mutating the input inside one transformer affects the other transformers in a pipeline.

In [27]:
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.base import BaseEstimator, TransformerMixin

In [81]:
from sklearn.utils.validation import check_is_fitted  # public import location

# Sample DataFrame


X_df = pd.DataFrame({
    "col1": [1, 4, 7],
    "col2": [2, 5, 8],
    "col3": [3, 6, 9]
})


class PandasMutatingTransformer(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        self.feature_names_in_ = np.array(X.columns, dtype=object)
        return self

    def transform(self, X, y=None):
        print("Before mutation:")
        print(X)
        # WARNING: Mutating original DataFrame
        # X.drop(columns=["col2"], inplace=True)
        X["col2"] = ["a","b","c"]
        print("\nAfter mutation:")
        print(X)
        return X

class PandasFailingTransformer(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        print("Fitting Pandas Failing Transformer")
        print(X)
        return self

    def transform(self, X, y=None):
        # Access col2, which the previous step overwrote (it was not dropped)
        print("\nTrying to access col2:", X["col2"])
        return X

In [82]:

pipeline = Pipeline([
    ("mutator", PandasMutatingTransformer()),
    ("failing", PandasFailingTransformer())
])

temp = pipeline.fit_transform(X_df)
temp

Before mutation:
   col1  col2  col3
0     1     2     3
1     4     5     6
2     7     8     9

After mutation:
   col1 col2  col3
0     1    a     3
1     4    b     6
2     7    c     9
Fitting Pandas Failing Transformer
   col1 col2  col3
0     1    a     3
1     4    b     6
2     7    c     9

Trying to access col2: 0    a
1    b
2    c
Name: col2, dtype: object


   col1 col2  col3
0     1    a     3
1     4    b     6
2     7    c     9


In [83]:
temp

   col1 col2  col3
0     1    a     3
1     4    b     6
2     7    c     9


In [84]:
X_df

   col1 col2  col3
0     1    a     3
1     4    b     6
2     7    c     9


In [88]:

X_df = pd.DataFrame({
    "col1": [1, 4, 7],
    "col2": [2, 5, 8],
    "col3": [3, 6, 9]
})

class PandasAnotherMutatingTransformer(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        self.feature_names_in_ = np.array(X.columns, dtype=object)
        return self

    def transform(self, X, y=None):
        print("Before mutation:")
        print(X)
        # WARNING: Mutating original DataFrame
        # X.drop(columns=["col1"], inplace=True)
        X["col2"] = ["a","b","c"]
        print("\nAfter mutation:")
        print(X)
        return X

class PandasFailingTransformer(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        print("Fitting Pandas Failing Transformer")
        print(X)
        return self

    def transform(self, X, y=None):
        # Access col2 to see whether t1's overwrite is visible here
        print("\nTrying to access col2:", X["col2"])
        return X

column_pipeline = ColumnTransformer([
    ("t1", PandasAnotherMutatingTransformer(), ["col1", "col2"]),
    ("t2", PandasFailingTransformer(), ["col2"])
])

temp = column_pipeline.fit_transform(X_df)

Fitting Pandas Failing Transformer
   col2
0     2
1     5
2     8

Trying to access col2: 0    2
1    5
2    8
Name: col2, dtype: int64


In [89]:
X_df

   col1  col2  col3
0     1     2     3
1     4     5     6
2     7     8     9
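
One way to check why the original DataFrame survives here but not in the Pipeline run is to record whether each transformer receives the very same object that was passed in. The sketch below is an illustration only: IdentityProbe and the received dictionary are hypothetical names, not part of this project or of scikit-learn.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

X_df = pd.DataFrame({
    "col1": [1, 4, 7],
    "col2": [2, 5, 8],
    "col3": [3, 6, 9]
})

# Hypothetical helper: maps a label to "did transform() get X_df itself?"
received = {}


class IdentityProbe(BaseEstimator, TransformerMixin):
    def __init__(self, label="probe"):
        self.label = label

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        # `is` compares object identity, not contents
        received[self.label] = X is X_df
        return X


Pipeline([("probe", IdentityProbe("pipeline"))]).fit_transform(X_df)
ColumnTransformer([
    ("probe", IdentityProbe("column_transformer"), ["col1", "col2"])
]).fit_transform(X_df)

print(received)
```

In the Pipeline the probe should see the caller's DataFrame itself, while in the ColumnTransformer it should see a separate object holding the selected columns.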


Observations:
* In a Pipeline, the same DataFrame object is passed from one transformer to the next. Updating a column in place, without copying first, means every later transformer sees the updated column, and the caller's original DataFrame (X_df above) is changed as well.
* In a ColumnTransformer, each transformer receives its own selection of the requested columns, so in this experiment updating that selection did not affect the original DataFrame or the other transformers.
* Based on this, some best practices could be:
    * When editing the data, always work on a copy of the DataFrame to avoid mutating the caller's dataset.
    * Add transformed data as additional columns instead of overwriting existing columns.
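
The two practices above can be sketched as follows. CopyingTransformer and the col2_str column are hypothetical names chosen for illustration; this is a minimal example of the idea, not the project's implementation.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline


class CopyingTransformer(BaseEstimator, TransformerMixin):
    """Hypothetical transformer that never edits its input in place."""

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        # Work on a copy so the caller's DataFrame stays untouched
        X_out = X.copy()
        # Add the derived values as a new column instead of overwriting col2
        X_out["col2_str"] = X_out["col2"].astype(str)
        return X_out


X_df = pd.DataFrame({
    "col1": [1, 4, 7],
    "col2": [2, 5, 8],
    "col3": [3, 6, 9]
})

pipeline = Pipeline([("copying", CopyingTransformer())])
result = pipeline.fit_transform(X_df)

print(result)
print(X_df)  # still has only col1, col2, col3 with the original values
```

The copy costs extra memory on large datasets, so this trades some efficiency for predictable behaviour.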

Questions:
* Is there a pattern or practice for returning only the transformed columns versus returning the original and transformed columns together?
* How can we make transformers robust so that they work as expected in both a Pipeline and a ColumnTransformer?
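
One possible answer to the first question, offered as a sketch rather than an established rule: let each transformer return only the columns it was given, and let ColumnTransformer carry the untouched columns through with remainder="passthrough". DoublingTransformer is a hypothetical example class.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.compose import ColumnTransformer


class DoublingTransformer(BaseEstimator, TransformerMixin):
    """Hypothetical transformer that returns only the columns it was given."""

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        # Arithmetic on a DataFrame builds a new object, so X is not mutated
        return X * 2


X_df = pd.DataFrame({
    "col1": [1, 4, 7],
    "col2": [2, 5, 8],
    "col3": [3, 6, 9]
})

# col1 is transformed; col2 and col3 are appended unchanged
ct = ColumnTransformer(
    [("double", DoublingTransformer(), ["col1"])],
    remainder="passthrough"
)
combined = ct.fit_transform(X_df)

print(combined)
print(X_df)  # original values are preserved
```

With the default output settings the combined result is a NumPy array, so column names are not carried over in this sketch.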