<font color=#CC3D3D>
## Custom transformers

`FunctionTransformer` lets you build a transformer from an arbitrary function.  

For example, to build a transformer that applies a log transform inside a pipeline, you can do the following.

In [1]:
import numpy as np
import pandas as pd
from sklearn.preprocessing import FunctionTransformer

transformer = FunctionTransformer(np.log1p, validate=False)
X = np.array([[0, 1], [2, 3]])
transformer.fit_transform(X)

array([[0.        , 0.69314718],
       [1.09861229, 1.38629436]])
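
If the transform needs to be undone later, `FunctionTransformer` also accepts an `inverse_func` argument. The following is a minimal sketch, assuming `np.expm1` is the intended inverse of `np.log1p`; the variable names (`log_transformer`, `X_small`, `X_log`, `X_back`) are illustrative and do not come from the original notebook.

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer

# Forward transform: log1p; inverse transform: expm1 (assumed pairing).
log_transformer = FunctionTransformer(np.log1p, inverse_func=np.expm1, validate=False)

X_small = np.array([[0.0, 1.0], [2.0, 3.0]])
X_log = log_transformer.fit_transform(X_small)      # apply log1p element-wise
X_back = log_transformer.inverse_transform(X_log)   # apply expm1 to recover the input

# The round trip should reproduce the original values up to floating-point error.
print(np.allclose(X_back, X_small))
```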

The following code builds a `pipeline` made up of `PCA` and a `FunctionTransformer`.

In [2]:
from sklearn.pipeline import make_pipeline
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, n_features=20, n_informative=2, n_redundant=0, 
                               n_classes=2, n_clusters_per_class=1, class_sep=0.8, 
                               weights=[0.6, 0.4], random_state=0)

"""
 Create a pipeline with PCA and the column selector and use it to
 transform the dataset.
"""
pipeline = make_pipeline(
    PCA(n_components=5), FunctionTransformer(lambda x: x[:, 1:], validate=False),
)

X_train, X_test, y_train, y_test = train_test_split(X, y)
X_train_transformed = pipeline.fit(X_train, y_train).transform(X_train)
X_test_transformed = pipeline.transform(X_test)

X_train.shape, X_train_transformed.shape

((750, 20), (750, 4))
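
The lambda above can also be written as a named function, with its arguments passed through the `kw_args` parameter of `FunctionTransformer`. The sketch below assumes a hypothetical helper `drop_leading_columns` and random demo data; neither is part of the original notebook. A named module-level function is usually easier to reuse and, unlike a lambda, can be serialized with the standard `pickle` module.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer

def drop_leading_columns(X, n_drop=1):
    """Hypothetical helper: drop the first `n_drop` columns of a 2-D array."""
    return X[:, n_drop:]

# Random demo data standing in for a real dataset (assumption for illustration).
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(100, 20))

named_pipeline = make_pipeline(
    PCA(n_components=5),
    # kw_args forwards keyword arguments to the wrapped function.
    FunctionTransformer(drop_leading_columns, kw_args={"n_drop": 2}, validate=False),
)

X_demo_transformed = named_pipeline.fit_transform(X_demo)
print(X_demo_transformed.shape)  # 5 PCA components minus the 2 dropped columns
```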

#### Exercises
- Build a custom transformer that does the same thing as `pd.get_dummies()`.

In [3]:
df = pd.DataFrame({'id': range(5), 'f1': ['a', 'b', 'a', 'c', 'b']})
pd.get_dummies(df)

|   | id | f1_a | f1_b | f1_c |
|---|----|------|------|------|
| 0 | 0  | 1    | 0    | 0    |
| 1 | 1  | 0    | 1    | 0    |
| 2 | 2  | 1    | 0    | 0    |
| 3 | 3  | 0    | 0    | 1    |
| 4 | 4  | 0    | 1    | 0    |


In [4]:
dummy_trans = FunctionTransformer(lambda x: pd.get_dummies(x).values, validate=False)
dummy_trans.transform(df)

array([[0, 1, 0, 0],
       [1, 0, 1, 0],
       [2, 1, 0, 0],
       [3, 0, 0, 1],
       [4, 0, 1, 0]])

- Write a new `class` named `DataFrameEncoder` that does the same thing as the custom transformer above.

In [5]:
# define a custom transformer

from sklearn.base import BaseEstimator, TransformerMixin

class DataFrameEncoder(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return pd.get_dummies(X).values

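# apply pipeline (left commented out; it needs
# `from sklearn.pipeline import Pipeline` and
# `from sklearn.preprocessing import MinMaxScaler`)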
#pipe = Pipeline([
#    ("encoder", DataFrameEncoder()),
#    ("scaler", MinMaxScaler()),
#])
#pipe.fit_transform(df)

DataFrameEncoder().fit_transform(df)

array([[0, 1, 0, 0],
       [1, 0, 1, 0],
       [2, 1, 0, 0],
       [3, 0, 0, 1],
       [4, 0, 1, 0]])
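
One caveat with calling `pd.get_dummies` directly inside `transform` is that the output columns depend on which categories appear in the data being transformed, so new data can yield a different number of columns than the training data. The following is a minimal sketch of one possible workaround, assuming a hypothetical class `ConsistentDummyEncoder` that records the training columns in `fit` and reindexes to them in `transform`. It is not part of the original exercise, and the `dtype=int` argument is added here only to keep the output as integers.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class ConsistentDummyEncoder(BaseEstimator, TransformerMixin):
    """Hypothetical variant of DataFrameEncoder that keeps columns stable."""

    def fit(self, X, y=None):
        # Remember the dummy columns produced by the training data.
        self.columns_ = pd.get_dummies(X, dtype=int).columns
        return self

    def transform(self, X):
        dummies = pd.get_dummies(X, dtype=int)
        # Add missing training columns as 0 and drop categories unseen in fit.
        return dummies.reindex(columns=self.columns_, fill_value=0).values

train_df = pd.DataFrame({'id': range(5), 'f1': ['a', 'b', 'a', 'c', 'b']})
new_df = pd.DataFrame({'id': [5, 6], 'f1': ['a', 'd']})  # 'd' was not seen in fit

encoder = ConsistentDummyEncoder().fit(train_df)
encoded_new = encoder.transform(new_df)
print(encoded_new.shape)  # same number of columns as the training encoding
```

With this design, a category unseen during `fit` (such as `'d'` here) is encoded as all zeros rather than adding a new column.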

<font color=#CC3D3D>
## End