# Building a Custom Transformer

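- A custom transformer can express finer-grained processing than FunctionTransformer, such as learning statistics in fit() and reusing them in transform().

- Inherit from BaseEstimator and TransformerMixin
    
    - BaseEstimator
        - Provides get_params() and set_params(), which read and update the estimator's parameters.
    
    - TransformerMixin
        - Provides fit_transform(), which calls fit() and then transform().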
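
As a minimal sketch of what the two base classes provide, the toy transformer below defines only `fit()` and `transform()`, yet still exposes `get_params()`, `set_params()`, and `fit_transform()`. The class `ScaleByFactor` and its `factor` parameter are hypothetical names used for illustration.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class ScaleByFactor(BaseEstimator, TransformerMixin):
    """Hypothetical transformer that multiplies every value by `factor`."""

    def __init__(self, factor=1.0):
        # Constructor arguments are stored unmodified so get_params() can find them
        self.factor = factor

    def fit(self, X, y=None):
        # Stateless: nothing to learn
        return self

    def transform(self, X):
        return np.asarray(X) * self.factor


scaler = ScaleByFactor(factor=2.0)
params = scaler.get_params()      # inherited from BaseEstimator
scaler.set_params(factor=3.0)     # inherited from BaseEstimator
result = scaler.fit_transform(np.array([[1.0], [2.0]]))  # inherited from TransformerMixin
```

Because `get_params()` inspects the signature of `__init__`, every constructor argument should be stored on `self` under the same name.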

In [3]:
import numpy as np
import pandas as pd
from sklearn.datasets import fetch_openml
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import PowerTransformer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import xgboost as xgb

In [4]:
# Titanic dataset
dataset = fetch_openml(data_id=40945, parser='auto')
df = dataset['frame']
display(df.head(3))

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
0,1,1,"Allen, Miss. Elisabeth Walton",female,29.0,0,0,24160,211.3375,B5,S,2.0,,"St Louis, MO"
1,1,1,"Allison, Master. Hudson Trevor",male,0.9167,1,2,113781,151.55,C22 C26,S,11.0,,"Montreal, PQ / Chesterville, ON"
2,1,0,"Allison, Miss. Helen Loraine",female,2.0,1,2,113781,151.55,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"


## ① Transformer for feature selection

In [6]:
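class ColumnFilterTransformer(BaseEstimator, TransformerMixin):
    """Keep only the given columns of a DataFrame."""
    def __init__(self, columns=None):
        # None instead of a mutable [] default; the value is stored unmodified
        self.columns = columns
    
    def fit(self, X, y=None):
        # Stateless: nothing to learn
        return self
    
    def transform(self, X):
        # With the default (None), pass X through unchanged
        if self.columns is None:
            return X
        return X[self.columns]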

In [8]:
# Columns to keep
columns_to_keep = ['age','fare','sibsp','parch']

transformer = ColumnFilterTransformer(columns=columns_to_keep)
filtered_df = transformer.transform(df)
display(filtered_df.head(3))

Unnamed: 0,age,fare,sibsp,parch
0,29.0,211.3375,0,0
1,0.9167,151.55,1,2
2,2.0,151.55,1,2
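
For comparison, a stateless selection like this can also be written with `FunctionTransformer`. The sketch below uses a hypothetical helper `select_columns` and a small made-up frame. The custom class becomes necessary once the transformer has to learn something in `fit()`.

```python
import pandas as pd
from sklearn.preprocessing import FunctionTransformer


def select_columns(X, columns):
    """Hypothetical helper: return only the requested columns."""
    return X[columns]


# Toy data for illustration only
toy = pd.DataFrame({"age": [29.0, 2.0], "fare": [211.3, 151.5], "name": ["A", "B"]})

# kw_args forwards extra keyword arguments to the wrapped function
selector = FunctionTransformer(select_columns, kw_args={"columns": ["age", "fare"]})
selected = selector.fit_transform(toy)
```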


In [13]:
filtered_df.isnull().sum()

age      263
fare       1
sibsp      0
parch      0
dtype: int64

## ② Transformer for missing-value imputation

In [10]:
class CustomMedianImputer(BaseEstimator, TransformerMixin):
    """Fill missing values with per-column medians learned in fit()."""
    def fit(self, X, y=None):
        # Learn the median of each column from the data passed to fit
        self.medians_ = X.median()
        return self
    
    def transform(self, X, y=None):
        # fillna with a Series fills each column with its own median and
        # returns a new DataFrame, avoiding the chained inplace fillna
        # that pandas warns about
        return X.fillna(self.medians_)

In [12]:
imputer = CustomMedianImputer()
imputer.fit(filtered_df)
imputed_data = imputer.transform(filtered_df)
display(imputed_data.head(3))


Unnamed: 0,age,fare,sibsp,parch
0,29.0,211.3375,0,0
1,0.9167,151.55,1,2
2,2.0,151.55,1,2


In [14]:
imputed_data.isnull().sum()

age      0
fare     0
sibsp    0
parch    0
dtype: int64
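
For reference, scikit-learn ships a built-in equivalent, `SimpleImputer(strategy="median")`. The sketch below runs it on a small made-up frame, not the Titanic data, to show that the medians are learned in `fit` and stored in `statistics_`.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy data for illustration only
toy = pd.DataFrame({
    "age": [20.0, np.nan, 40.0],
    "fare": [10.0, 30.0, np.nan],
})

builtin_imputer = SimpleImputer(strategy="median")
filled = builtin_imputer.fit_transform(toy)   # NumPy array by default
learned_medians = builtin_imputer.statistics_  # one median per column
```

Writing the imputer by hand, as in this section, is mainly useful when the output should stay a DataFrame with its column names or when the fill rule is more specialized.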

## ③ Box-Cox transformation

In [15]:
class CustomBoxCoxTransformer(BaseEstimator, TransformerMixin):
    """Apply a per-column power transform after shifting values by +1."""
    def fit(self, X, y=None):
        # Rebuild the per-column estimators on every fit so stale columns are dropped
        self.estimators_ = {}
        for column in X.columns:
            # Note: PowerTransformer() defaults to method='yeo-johnson';
            # pass method='box-cox' for a strict Box-Cox transform
            estimator = PowerTransformer()
            shifted = (X[column] + 1).to_numpy().reshape(-1, 1)
            self.estimators_[column] = estimator.fit(shifted)
        return self
    
    def transform(self, X):
        X_copy = X.copy()
        for column in X_copy.columns:
            shifted = (X_copy[column] + 1).to_numpy().reshape(-1, 1)
            X_copy[column] = self.estimators_[column].transform(shifted).ravel()
        return X_copy
    
    def inverse_transform(self, X):
        X_copy = X.copy()
        for column in X_copy.columns:
            values = X_copy[column].to_numpy().reshape(-1, 1)
            # Undo the power transform, then undo the +1 shift
            X_copy[column] = self.estimators_[column].inverse_transform(values).ravel() - 1
        return X_copy

In [16]:
boxcox_trans = CustomBoxCoxTransformer()
boxcox_trans.fit(imputed_data)
transformed_data = boxcox_trans.transform(imputed_data)
display(transformed_data.head(3))

Unnamed: 0,age,fare,sibsp,parch
0,0.012524,2.106427,-0.681878,-0.553158
1,-2.583499,1.89359,1.361687,1.884514
2,-2.444805,1.89359,1.361687,1.884514
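
One detail worth checking: `PowerTransformer()` uses `method='yeo-johnson'` unless told otherwise. The sketch below, on synthetic data rather than the Titanic columns, shows how a strict Box-Cox transform could be requested, why the +1 shift matters (Box-Cox requires strictly positive input), and how `inverse_transform` round-trips the values.

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

# Synthetic right-skewed, non-negative data (illustration only)
rng = np.random.default_rng(0)
values = rng.exponential(scale=10.0, size=(200, 1))

default_method = PowerTransformer().method  # the library default

# Box-Cox needs strictly positive input, hence the +1 shift
shifted = values + 1.0
box_cox = PowerTransformer(method="box-cox")
transformed = box_cox.fit_transform(shifted)

# Undo the power transform, then undo the shift
restored = box_cox.inverse_transform(transformed) - 1.0
```

With the default `standardize=True`, the transformed column is also rescaled to zero mean and unit variance.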


## Building the pipeline

In [18]:
X = df.drop('survived', axis=1)
y = df['survived'].astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.3,
    random_state=123)

In [19]:
# Columns to keep
columns_to_keep = ['age','fare','sibsp','parch']

titanic_pipeline = Pipeline(
    steps=[
        ("filter", ColumnFilterTransformer(columns_to_keep)),
        ("imputer", CustomMedianImputer()),        
        ("boxcoxtrans", CustomBoxCoxTransformer()),
        ("estimator", xgb.XGBClassifier()),
    ]
)
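
The step names passed to `Pipeline` become prefixes for parameter names in the form `<step>__<parameter>`. As a small sketch using built-in steps (the names `imputer` and `power` are arbitrary choices for this example), `get_params()` lists every name that `set_params()` accepts:

```python
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PowerTransformer

pipe = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="median")),
        ("power", PowerTransformer()),
    ]
)

# Nested parameters appear as "<step name>__<parameter name>"
param_names = sorted(pipe.get_params(deep=True))

# Update a nested parameter through the pipeline
pipe.set_params(imputer__strategy="mean")
```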

In [20]:
# Fit the pipeline
titanic_pipeline.fit(X_train, y_train)


In [21]:
pred_y = titanic_pipeline.predict(X_test)

accuracy_score(y_test, pred_y) 


0.6870229007633588

In [23]:
# Columns to keep
columns_to_keep = ['age','fare']

# Set new parameters on the pipeline (format: <step name>__<parameter name>)
params = {'filter__columns': columns_to_keep}
titanic_pipeline.set_params(**params)
titanic_pipeline.fit(X_train, y_train)

pred_y = titanic_pipeline.predict(X_test)
accuracy_score(y_test, pred_y)


0.6870229007633588
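
Because the column list is an ordinary parameter, the same `filter__columns` name can be handed to `GridSearchCV` instead of calling `set_params` by hand. The following is a self-contained sketch under stated assumptions: synthetic data, a `LogisticRegression` in place of XGBoost, and a compact copy of the column filter named `ColumnFilter`. It is not the notebook's actual experiment.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline


class ColumnFilter(BaseEstimator, TransformerMixin):
    """Compact copy of the column filter, repeated so the example runs alone."""

    def __init__(self, columns=None):
        self.columns = columns

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X[self.columns]


# Synthetic data: only column "a" carries signal, "b" is noise
rng = np.random.default_rng(0)
X_toy = pd.DataFrame({"a": rng.normal(size=100), "b": rng.normal(size=100)})
y_toy = (X_toy["a"] > 0).astype(int)

pipe = Pipeline([("filter", ColumnFilter()), ("clf", LogisticRegression())])

# Each candidate column list is tried with 3-fold cross-validation
search = GridSearchCV(
    pipe,
    param_grid={"filter__columns": [["a"], ["b"], ["a", "b"]]},
    cv=3,
)
search.fit(X_toy, y_toy)
best_columns = search.best_params_["filter__columns"]
```

This works only because the transformer stores `columns` unmodified in `__init__`, which is what lets scikit-learn clone the pipeline for each candidate.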