# Handmade Standardizer

🧑🏻‍🏫 In this challenge, we are going to create our *own* StandardScaler. Are you wondering why? Glad you asked!

🎯 The goals of this exercise are to:
- understand `stateless transformers` vs. `stateful transformers`
- manipulate `FeatureUnion`

## (1) 📚 Stateless Transformer vs. Stateful Transformer

🔢 Consider the following training set and the following test set...

In [1]:
import numpy as np
import pandas as pd

X_train = pd.DataFrame({
    'A': {0: 1, 1: 2, 2: 3},
    'B': {0: 4, 1: 5, 2: 6},
    'C': {0: 7, 1: 8, 2: 9}})

print("This is the training dataset:")
display(X_train)

print("This is the test dataset:")
X_test = pd.DataFrame({
    'A': {0: 1, 1: 2, 2: 3},
    'B': {0: 2, 1: 3, 2: 4},
    'C': {0: 3, 1: 4, 2: 10}})
display(X_test)

This is the training dataset:


| | A | B | C |
|---|---|---|---|
| 0 | 1 | 4 | 7 |
| 1 | 2 | 5 | 8 |
| 2 | 3 | 6 | 9 |


This is the test dataset:


| | A | B | C |
|---|---|---|---|
| 0 | 1 | 2 | 3 |
| 1 | 2 | 3 | 4 |
| 2 | 3 | 4 | 10 |


🛠 ...and the following union which:
- scales the features
- creates a new feature which is the average of the other (unscaled) features

In [2]:
from sklearn import set_config; set_config(display='diagram')
from sklearn.preprocessing import FunctionTransformer, StandardScaler
from sklearn.pipeline import make_pipeline, make_union

standard_scaler = StandardScaler()
feature_averager = FunctionTransformer(lambda df: pd.DataFrame(1/3 * (df["A"] + df["B"] + df["C"])))
pipeline = make_union(standard_scaler, feature_averager)
pipeline

▶️ Let's:
- fit the pipeline to the training set
- and transform both the training set and the test set

In [3]:
pipeline.fit(X_train)

In [4]:
X_train_transformed = pd.DataFrame(pipeline.transform(X_train))
X_train_transformed

| | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| 0 | -1.224745 | -1.224745 | -1.224745 | 4.0 |
| 1 | 0.0 | 0.0 | 0.0 | 5.0 |
| 2 | 1.224745 | 1.224745 | 1.224745 | 6.0 |


In [5]:
X_test_transformed = pd.DataFrame(pipeline.transform(X_test))
X_test_transformed

| | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| 0 | -1.224745 | -3.674235 | -6.123724 | 2.0 |
| 1 | 0.0 | -2.44949 | -4.898979 | 3.0 |
| 2 | 1.224745 | -1.224745 | 2.44949 | 5.666667 |


👨🏻‍🏫 Notice how the `StandardScaler` and the `FunctionTransformer` are fundamentally different ❗️

When we fitted the pipeline and used it to transform the training set and test set:

* **`FunctionTransformer (feature_averager)`**:
    * did _not_ "learn" anything during the *.fit()*
    * just performed a **stateless transformation**: $ \large (X_1, X_2, X_3) \rightarrow \frac{(X_1 + X_2 + X_3)}{3}$


* **`StandardScaler`**:
    * "learned" $\mu_{\color{blue}{train}}$ and $\sigma_{\color{blue}{train}}$ during the *.fit()*
    * performed a **stateful transformation** using these learned values on both the train set and the test set:
        * $ \large X_{\color{blue}{train-scaled}} =  \frac{X_{\color{blue}{train}} -\mu_{\color{blue}{train}}}{\sigma_{\color{blue}{train}}}$
        * $ \large X_{\color{red}{test-scaled}} =  \frac{X_{\color{red}{test}} -\mu_{\color{blue}{train}}}{\sigma_{\color{blue}{train}}}$
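
The numbers in the tables above can be reproduced by hand. The following is a minimal NumPy sketch, not the scikit-learn implementation; the variable names (`mu_train`, `sigma_train`, `test_avg`, ...) are illustrative and the datasets are re-declared as plain arrays so the snippet runs on its own.

```python
import numpy as np

# Same values as X_train and X_test, as plain arrays
X_train = np.array([[1, 4, 7], [2, 5, 8], [3, 6, 9]], dtype=float)
X_test = np.array([[1, 2, 3], [2, 3, 4], [3, 4, 10]], dtype=float)

# Stateful part: the statistics are learned once, from the training set only
mu_train = X_train.mean(axis=0)
sigma_train = X_train.std(axis=0)  # NumPy default is ddof=0 (population std)

# Both sets are scaled with the *training* statistics
X_train_scaled = (X_train - mu_train) / sigma_train
X_test_scaled = (X_test - mu_train) / sigma_train

# Stateless part: the row-wise average needs nothing from the training set
train_avg = X_train.mean(axis=1)
test_avg = X_test.mean(axis=1)

print(np.round(X_test_scaled, 6))
print(test_avg)
```

The test set is not centered after scaling, because its own mean and standard deviation were never used.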

## (2) 💻 Create your own stateful transformer

🤔 What if we want to code our own **stateful custom transformer**?

💪 We can write our own class!

### (2.1) 💻 Custom Standardizer

❓ **Questions: Coding your own class** ❓

1. Code your own class `CustomStandardizer`
    * It should behave like the `StandardScaler` from Scikit-Learn, which means having:
        * a `.fit()` method which computes ("learns") $\mu_{\color{blue}{train}}$ and $\sigma_{\color{blue}{train}}$
        * and a `.transform()` method.


2. Fit it on `X_train` 

3. Transform both `X_train` and `X_test` 

4. Compare your `CustomStandardizer` with the `StandardScaler` from Scikit-Learn to make sure you got it right!
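
Why does `fit()` have to return `self`? A rough sketch of the idea behind `fit_transform()` is shown below. `FitTransformMixin` and `MeanCenterer` are hypothetical names made up for this illustration; the real `TransformerMixin` handles more cases, so treat this as a simplified stand-in rather than its actual code.

```python
class FitTransformMixin:
    """Simplified stand-in for the fit_transform() a mixin can provide."""

    def fit_transform(self, X, y=None):
        # Works only because fit() returns self, which lets us chain transform()
        return self.fit(X, y).transform(X)


class MeanCenterer(FitTransformMixin):
    """Toy stateful transformer working on a plain list of numbers."""

    def fit(self, X, y=None):
        # State learned from the data passed to fit()
        self.mean_ = sum(X) / len(X)
        return self

    def transform(self, X):
        # Reuses the stored state; nothing is re-learned here
        return [x - self.mean_ for x in X]


centerer = MeanCenterer()
centered = centerer.fit_transform([1, 2, 3])
print(centered)  # [-1.0, 0.0, 1.0]
```

If `fit()` returned `None` instead, the chained `.transform(X)` call would raise an `AttributeError`.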

In [6]:
#########################################
# 1 - Code the CustomStandardizer Class #
#########################################

# TransformerMixin inheritance is used to create fit_transform() method from fit() and transform()
from sklearn.base import TransformerMixin, BaseEstimator
import numpy as np


class CustomStandardizer(TransformerMixin, BaseEstimator):
    
    def __init__(self):
        pass
    
    def fit(self, X, y=None):
        '''
        Stores what needs to be stored as instance attributes.
        Returns "self" to allow chaining fit and transform.
        '''
        self.mean = X.mean()
        # ddof=0: population standard deviation, as used by StandardScaler
        self.std = X.std(ddof=0)
        return self
    
    def transform(self, X, y=None):
        # Standardize with the statistics learned during fit
        return (X - self.mean) / self.std
        
    def inverse_transform(self, X, y=None):
        # Undo the standardization
        return self.std * X + self.mean
    

In [7]:
#########################################
# 2 - Fit the CustomStandardizer Class  #
#########################################

custom_standardizer = CustomStandardizer()
custom_standardizer.fit(X_train)

In [8]:
#########################################
# 3 - Transform                         #
#########################################

train = custom_standardizer.transform(X_train)
test = custom_standardizer.transform(X_test)


🧪 **Test your code**

In [9]:
from nbresult import ChallengeResult

tmp = CustomStandardizer()
tmp_train = np.array(tmp.fit_transform(X_train))
tmp_test = np.array(tmp.transform(X_test))

result = ChallengeResult('standardizer', 
                         X_train_transformed=tmp_train,
                         X_test_transformed=tmp_test
)

result.write()
print(result.check())

platform linux -- Python 3.8.12, pytest-6.2.5, py-1.11.0, pluggy-1.0.0 -- /home/bitazaratustra/.pyenv/versions/lewagon/bin/python3
cachedir: .pytest_cache
rootdir: /home/bitazaratustra/code/bitazaratustra/data-challenges/05-ML/08-Workflow/05-Hand-Made-Standardizer
plugins: anyio-3.4.0
collecting ... collected 1 item

tests/test_standardizer.py::TestStandardizer::test_solution PASSED       [100%]



💯 You can commit your code:

git add tests/standardizer.pickle

git commit -m 'Completed standardizer step'

git push origin master


<details>
<summary>💡 <i>Hints</i> (🧪 if the tests  above fail only by a small margin) </summary>

* Be careful: `np.std()` and the pandas `.std()` method do not use the same default `ddof` (0 for NumPy, 1 for pandas), so they return slightly different values!
    
* This [Stackoverflow post](https://stackoverflow.com/questions/44220290/sklearn-standardscaler-result-different-to-manual-result) might help 😉
      
</details>
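
The difference mentioned in the hint comes from the default `ddof` (delta degrees of freedom). Here is a small sketch on a toy list of values (the variable names are illustrative):

```python
import numpy as np
import pandas as pd

values = [1, 2, 3]

# NumPy divides by N by default (ddof=0, population standard deviation)
numpy_std = np.std(values)

# pandas divides by N - 1 by default (ddof=1, sample standard deviation)
pandas_std = pd.Series(values).std()

# Passing ddof=0 to pandas gives the same result as NumPy's default
pandas_std_ddof0 = pd.Series(values).std(ddof=0)

print(numpy_std, pandas_std, pandas_std_ddof0)
```

`StandardScaler` relies on the population standard deviation, which is why the custom class calls `X.std(ddof=0)` on the DataFrame.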

### (2.2) 💻 Inverse Transform

❓ **Question (Inverse Transform)** ❓

_StandardScaler_ from Scikit Learn has a [`.inverse_transform()`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html#sklearn.preprocessing.StandardScaler.inverse_transform) method that helps you revert back to the unscaled dataset.

1. Go back to your `CustomStandardizer` class and implement your own `.inverse_transform()` method.

2. Try it on your scaled training set and your scaled test set.

In [11]:
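# YOUR CODE HERE
# Compute the inverse transform before printing it
X_train_detransformed = custom_standardizer.inverse_transform(train)
print(X_train_detransformed)
print(train)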

     A    B    C
0  1.0  4.0  7.0
1  2.0  5.0  8.0
2  3.0  6.0  9.0
          A         B         C
0 -1.224745 -1.224745 -1.224745
1  0.000000  0.000000  0.000000
2  1.224745  1.224745  1.224745


In [12]:
# Try your inverse transform and run the test down below to make sure you coded it correctly
X_train_detransformed = custom_standardizer.inverse_transform(train)
X_test_detransformed = custom_standardizer.inverse_transform(test)

🧪 **Test your code**

In [13]:
assert np.allclose(X_train_detransformed, X_train)
assert np.allclose(X_test_detransformed, X_test)

### (2.3) 💻 Complete custom pipeline!

💪 We've managed to replicate Scikit-Learn's `StandardScaler`.

🌶 Let's spice it up!

❓ **Question: improve the previous `CustomStandardizer` custom transformer with a shrinking factor** ❓


The `CustomStandardizer(shrink_factor = 1)` class should take one additional argument, `shrink_factor`, to perform a stronger scaling: after the usual standardization with $\mu_{\color{blue}{train}}$ and $\sigma_{\color{blue}{train}}$, the result is divided by the shrink factor 👇:
- $ \large X_{\color{blue}{train-scaled}} =  (\frac{X_{\color{blue}{train}} -\mu_{\color{blue}{train}}}{\sigma_{\color{blue}{train}}}) \times \frac{1}{shrinkfactor}$
- $ \large X_{\color{red}{test-scaled}} =  (\frac{X_{\color{red}{test}} -\mu_{\color{blue}{train}}}{\sigma_{\color{blue}{train}}}) \times \frac{1}{shrinkfactor}$
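
Inverting these formulas gives $X = X_{scaled} \times \sigma_{train} \times shrinkfactor + \mu_{train}$. As a quick numerical check, here is a minimal NumPy sketch of the round trip with a shrink factor of 2. It works on plain arrays and is not the class requested by the exercise; the variable names are illustrative.

```python
import numpy as np

X_train = np.array([[1, 4, 7], [2, 5, 8], [3, 6, 9]], dtype=float)
shrink_factor = 2

# Statistics learned from the training set
mu = X_train.mean(axis=0)
sigma = X_train.std(axis=0)  # ddof=0

# Forward: standardize, then divide by the shrink factor
scaled = (X_train - mu) / sigma / shrink_factor

# Backward: multiply back by sigma and the shrink factor, then add the mean
recovered = scaled * sigma * shrink_factor + mu

print(np.round(scaled, 6))
print(recovered)
```

With a shrink factor of 2, every standardized value is simply halved, for example -1.224745 becomes -0.612372.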


In [14]:
###################################
# Custom Standardizer             #
###################################



class CustomStandardizer(TransformerMixin, BaseEstimator):
    
    def __init__(self, shrink_factor = 1):
        self.shrink_factor = shrink_factor
    
    def fit(self, X, y=None):
        '''
        Stores what needs to be stored as instance attributes. 
        Returns "self" to allow chaining fit and transform.
        '''
        self.mean = X.mean()
        self.std = X.std(ddof=0)
        return self
    
    def transform(self, X, y=None):
        # Standardize, then divide by the shrink factor
        return ((X - self.mean) / self.std) / self.shrink_factor
    
    def inverse_transform(self, X, y=None):
        # Undo the shrinking and the standardization
        return self.std * X * self.shrink_factor + self.mean

🧪 **Test your new `CustomStandardizer`** custom transformer with `shrink_factor = 2`: fit it on `X_train`, transform both `X_train` and `X_test`, and store the transformed dataframes in `X_train_transformed` and `X_test_transformed`.

In [15]:
# YOUR CODE HERE
custom_scaler = CustomStandardizer(shrink_factor=2).fit(X_train)
X_train_transformed = custom_scaler.transform(X_train)

X_test_transformed = custom_scaler.transform(X_test)

🧪 **Test your code**

In [16]:
from nbresult import ChallengeResult

tmp = CustomStandardizer(shrink_factor=2).fit(X_train)
tmp_train = np.array(tmp.fit_transform(X_train))
tmp_test = np.array(tmp.transform(X_test))

result = ChallengeResult('new_standardizer', 
                         X_train_transformed=tmp_train,
                         X_test_transformed=tmp_test
)

result.write()
print(result.check())

platform linux -- Python 3.8.12, pytest-6.2.5, py-1.11.0, pluggy-1.0.0 -- /home/bitazaratustra/.pyenv/versions/lewagon/bin/python3
cachedir: .pytest_cache
rootdir: /home/bitazaratustra/code/bitazaratustra/data-challenges/05-ML/08-Workflow/05-Hand-Made-Standardizer
plugins: anyio-3.4.0
collecting ... collected 1 item

tests/test_new_standardizer.py::TestNewStandardizer::test_solution PASSED [100%]



💯 You can commit your code:

git add tests/new_standardizer.pickle

git commit -m 'Completed new_standardizer step'

git push origin master


In [17]:
# Run the following cells to ensure you got the right transformations
truth_train = np.array([
    [-0.612372, -0.612372, -0.612372],
    [0.000000, 0.000000, 0.000000],
    [0.612372, 0.612372, 0.612372]
])
truth_test = np.array([
    [-0.612372, -1.837117, -3.061862],
    [0.000000, -1.224745, -2.449490],
    [0.612372, -0.612372, 1.224745]
])

In [18]:
#Asserts
np.allclose(X_train_transformed, truth_train)

True

In [19]:
#Assert - Test
np.allclose(X_test_transformed, truth_test)

True

❓ **Question: "tweak" the previous `FeatureAverager` custom transformer** ❓

This modified `FeatureAverager()` class:
- still computes the average of the three different features...
- ...and now divides the result by the maximum value for each row 
    - _Note: don't try to interpret this operation, let's just be creative and practice our skills coding a custom class :)_

$$(X_1, X_2, X_3) \rightarrow \frac{1/3 \times (X_1 + X_2 + X_3)}{max(X_1, X_2, X_3)}$$
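
To get a feel for the expected values, the formula can be evaluated by hand on the training set. This is a small NumPy sketch on a plain array (illustrative variable names), not the class requested by the exercise:

```python
import numpy as np

X_train = np.array([[1, 4, 7], [2, 5, 8], [3, 6, 9]], dtype=float)

# Row-wise average of the three features
row_mean = X_train.mean(axis=1)

# Row-wise maximum of the three features
row_max = X_train.max(axis=1)

# Average divided by the maximum, one value per row
ratio = row_mean / row_max

print(ratio)
```

For the first row, the average is 4 and the maximum is 7, so the new feature is 4/7 ≈ 0.571. Nothing is learned from the data here: the transformation is stateless.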


In [20]:
###################################
# Feature Averager                #
###################################

class FeatureAverager(TransformerMixin, BaseEstimator):
    
    def __init__(self):
        pass
    
    def fit(self, X, y=None):
        '''
        Stores what needs to be stored as instance attributes. 
        Returns "self" to allow chaining fit and transform.
        '''
        return self
    
    def transform(self, X, y=None):
        # Row-wise sum of the three features
        features_sum = X['A'] + X['B'] + X['C']
        # Row-wise maximum across the columns
        max_factor = X.max(axis="columns")
        feature_averager = (1/3 * features_sum) / max_factor
        return pd.DataFrame(feature_averager)
    

🧪 **Test your `FeatureAverager` custom transformer** by fitting on `X_train` and transforming both `X_train` and `X_test`

In [21]:
custom_feature_averager = FeatureAverager().fit(X_train)

X_train_transformed = custom_feature_averager.transform(X_train)
X_test_transformed = custom_feature_averager.transform(X_test)

In [22]:
from nbresult import ChallengeResult

tmp = FeatureAverager()
tmp_train = np.array(tmp.fit_transform(X_train))
tmp_test = np.array(tmp.transform(X_test))

result = ChallengeResult('feature_averager', 
                         X_train_transformed=tmp_train,
                         X_test_transformed=tmp_test
)

result.write()
print(result.check())

platform linux -- Python 3.8.12, pytest-6.2.5, py-1.11.0, pluggy-1.0.0 -- /home/bitazaratustra/.pyenv/versions/lewagon/bin/python3
cachedir: .pytest_cache
rootdir: /home/bitazaratustra/code/bitazaratustra/data-challenges/05-ML/08-Workflow/05-Hand-Made-Standardizer
plugins: anyio-3.4.0
collecting ... collected 1 item

tests/test_feature_averager.py::TestFeatureAverager::test_solution PASSED [100%]



💯 You can commit your code:

git add tests/feature_averager.pickle

git commit -m 'Completed feature_averager step'

git push origin master


❓ **Question (Feature Union)** ❓

1. Use both `CustomStandardizer` and `FeatureAverager` to create a `FeatureUnion` pipeline.

2. Using `shrink_factor = 3` for the `CustomStandardizer`, fit the pipeline to `X_train` and transform both `X_train` and `X_test`

3. Make sure you pass the final test of this challenge.
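
Conceptually, a `FeatureUnion` applies each transformer to the same input and concatenates the results side by side. The sketch below mimics this with `np.hstack` on plain arrays. It is an illustration of the expected output shape and values under that assumption, not a substitute for `make_union`, and the variable names are illustrative.

```python
import numpy as np

X_train = np.array([[1, 4, 7], [2, 5, 8], [3, 6, 9]], dtype=float)
shrink_factor = 3

# Branch 1: standardization with a shrink factor (3 columns)
mu = X_train.mean(axis=0)
sigma = X_train.std(axis=0)  # ddof=0
scaled = (X_train - mu) / sigma / shrink_factor

# Branch 2: row-wise average divided by row-wise maximum (1 column)
averaged = (X_train.mean(axis=1) / X_train.max(axis=1)).reshape(-1, 1)

# The union stacks the outputs of both branches horizontally
union = np.hstack([scaled, averaged])

print(union.shape)  # (3, 4)
print(np.round(union, 6))
```

The transformed training set should therefore have four columns: three shrunk, standardized features followed by the new averaged feature.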

In [23]:
#####################
# 1 - Feature Union #
#####################

custom_standardizer = CustomStandardizer(shrink_factor=3)
custom_feature_averager = FeatureAverager()

pipeline = make_union(custom_standardizer, custom_feature_averager)
pipeline

In [24]:
#########################
# 2 - Fit and Transform #
#########################

pipeline.fit(X_train)

X_train_transformed = pd.DataFrame(pipeline.transform(X_train))
X_test_transformed = pd.DataFrame(pipeline.transform(X_test))

In [25]:
from nbresult import ChallengeResult

tmp = pipeline
tmp_train = np.array(tmp.fit_transform(X_train))
tmp_test = np.array(tmp.transform(X_test))



result = ChallengeResult('feature_union_custom_transformers', 
                         X_train_transformed=tmp_train,
                         X_test_transformed=tmp_test
)

result.write()
print(result.check())

platform linux -- Python 3.8.12, pytest-6.2.5, py-1.11.0, pluggy-1.0.0 -- /home/bitazaratustra/.pyenv/versions/lewagon/bin/python3
cachedir: .pytest_cache
rootdir: /home/bitazaratustra/code/bitazaratustra/data-challenges/05-ML/08-Workflow/05-Hand-Made-Standardizer
plugins: anyio-3.4.0
collecting ... collected 1 item

tests/test_feature_union_custom_transformers.py::TestFeatureUnionCustomTransformers::test_solution PASSED [100%]



💯 You can commit your code:

git add tests/feature_union_custom_transformers.pickle

git commit -m 'Completed feature_union_custom_transformers step'

git push origin master


🏁 Congratulations! You discovered how to create your own Transformer!

💾 Don't forget to `git add/commit/push` your notebook...

🚀 ... and move on to the next challenge!