<a href="https://colab.research.google.com/github/anko191/Python_Kaggle/blob/master/Pipeline/Pipeline_preprocessing_and_modeling.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Pipeline
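A pipeline is a simple way to keep your data preprocessing and modeling code organized.<br>Specifically, a pipeline <b>bundles the preprocessing and modeling steps,</b><br>so you can use the whole bundle as if it were a single step.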

## Many benefits
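* Cleaner Code
    * You don't need to manually keep track of your training and validation data at each preprocessing step.
* Fewer Bugs
    * You are less likely to forget a preprocessing step or apply a step incorrectly.
* Easier to Productionize
    * Moving a model from a prototype to something deployable is hard, and pipelines help with that transition.
* More Options for Model Validation
    * Pipelines open up more ways to validate a model. The next tutorial walks through a cross-validation example.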
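As a rough preview of the validation point above, here is a minimal sketch of passing a pipeline to `cross_val_score`. The data is synthetic and the names `X_demo`, `y_demo`, and `demo_pipeline` are made up for illustration; this is not the tutorial's own cross-validation code.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Synthetic regression data (assumption: stands in for a real dataset).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = 3 * X_demo[:, 0] + rng.normal(scale=0.1, size=200)

# Knock out roughly 10% of the entries so the imputer has work to do.
X_demo[rng.random(X_demo.shape) < 0.1] = np.nan

demo_pipeline = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="mean")),
    ("model", RandomForestRegressor(n_estimators=20, random_state=0)),
])

# scikit-learn returns negative MAE, so flip the sign.
# The imputer is re-fitted on the training part of every fold.
scores = -cross_val_score(demo_pipeline, X_demo, y_demo,
                          cv=5, scoring="neg_mean_absolute_error")
print("MAE per fold:", scores)
```

Because the imputer lives inside the pipeline, each fold fits it only on that fold's training rows, which is hard to get right when preprocessing is done by hand.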

## As an example, let's use the Melbourne Housing dataset

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split

In [2]:
data = pd.read_csv('/content/melb_data.csv')

In [3]:
data.head()

Unnamed: 0,Suburb,Address,Rooms,Type,Price,Method,SellerG,Date,Distance,Postcode,Bedroom2,Bathroom,Car,Landsize,BuildingArea,YearBuilt,CouncilArea,Lattitude,Longtitude,Regionname,Propertycount
0,Abbotsford,85 Turner St,2,h,1480000.0,S,Biggin,3/12/2016,2.5,3067.0,2.0,1.0,1.0,202.0,,,Yarra,-37.7996,144.9984,Northern Metropolitan,4019.0
1,Abbotsford,25 Bloomburg St,2,h,1035000.0,S,Biggin,4/02/2016,2.5,3067.0,2.0,1.0,0.0,156.0,79.0,1900.0,Yarra,-37.8079,144.9934,Northern Metropolitan,4019.0
2,Abbotsford,5 Charles St,3,h,1465000.0,SP,Biggin,4/03/2017,2.5,3067.0,3.0,2.0,0.0,134.0,150.0,1900.0,Yarra,-37.8093,144.9944,Northern Metropolitan,4019.0
3,Abbotsford,40 Federation La,3,h,850000.0,PI,Biggin,4/03/2017,2.5,3067.0,3.0,2.0,1.0,94.0,,,Yarra,-37.7969,144.9969,Northern Metropolitan,4019.0
4,Abbotsford,55a Park St,4,h,1600000.0,VB,Nelson,4/06/2016,2.5,3067.0,3.0,1.0,2.0,120.0,142.0,2014.0,Yarra,-37.8072,144.9941,Northern Metropolitan,4019.0


In [4]:
# Predict Price
y = data.Price
X = data.drop(['Price'], axis = 1)

In [5]:
X_train_full, X_valid_full, y_train, y_valid = train_test_split(X, y, train_size = 0.8, test_size = 0.2, random_state = 0)

In [6]:
categorical_cols = [cname for cname in X_train_full.columns if X_train_full[cname].nunique() < 10 and X_train_full[cname].dtype == 'object']

In [7]:
categorical_cols

['Type', 'Method', 'Regionname']

In [13]:
X_train_full['Type'].nunique() # number of unique values in the column

3

In [14]:
numerical_cols = [cname for cname in X_train_full.columns if X_train_full[cname].dtype in ['int64', 'float64']]

In [15]:
# Keep selected columns only
my_cols = categorical_cols + numerical_cols
X_train = X_train_full[my_cols].copy()
X_valid = X_valid_full[my_cols].copy()
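The column selection above can be tried on a small frame to see which columns each rule keeps. This is a sketch only: the frame `toy` and its values are made up, and the object dtype is set explicitly so the `dtype == 'object'` check behaves the same across pandas versions.

```python
import pandas as pd

# Toy frame (hypothetical values) mimicking the dataset's column types.
toy = pd.DataFrame({
    # Low-cardinality text column: 2 unique values.
    "Type": pd.Series(["h", "u"] * 6, dtype=object),
    # High-cardinality text column: 12 unique values.
    "Address": pd.Series([f"{i} Example St" for i in range(12)], dtype=object),
    "Rooms": pd.Series(range(12), dtype="int64"),
    "Landsize": pd.Series([100.0 + i for i in range(12)], dtype="float64"),
})

# Same rules as above: text columns with fewer than 10 unique values...
toy_categorical = [c for c in toy.columns
                   if toy[c].nunique() < 10 and toy[c].dtype == "object"]
# ...and every int64/float64 column.
toy_numerical = [c for c in toy.columns
                 if toy[c].dtype in ["int64", "float64"]]

print(toy_categorical)  # ['Type']
print(toy_numerical)    # ['Rooms', 'Landsize']
```

`Address` is dropped by both rules. This mirrors what happens to high-cardinality columns such as `Suburb` or `Address` in the real data, which would otherwise produce a very wide one-hot encoding.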

# Pipeline Three Steps
## Step1 : Define Preprocessing Steps
* <b>sklearn.compose.ColumnTransformer</b> bundles different preprocessing steps together.
* <b>sklearn.impute.SimpleImputer</b> fills in missing values; here it is used for the numerical data.
    * With strategy="constant", missing values are replaced with fill_value. This works with both strings and numeric data.
    * fill_value: string or numerical value, default=None. When strategy == "constant", fill_value replaces all occurrences of missing_values.
    * If left at the default, fill_value is 0 for numerical data and "missing_value" for strings or object data types.
    * The mean, median, and most_frequent strategies are also available.
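The default fill values described above can be checked on two tiny arrays. This is a minimal sketch; the arrays `num` and `txt` are made up for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# One numeric column and one string (object) column, each with a missing entry.
num = np.array([[1.0], [np.nan], [3.0]])
txt = np.array([["h"], [np.nan], ["u"]], dtype=object)

# strategy="constant" with the default fill_value.
num_filled = SimpleImputer(strategy="constant").fit_transform(num)
txt_filled = SimpleImputer(strategy="constant").fit_transform(txt)

# For comparison: the median of the observed values 1.0 and 3.0.
num_median = SimpleImputer(strategy="median").fit_transform(num)

print(num_filled.ravel())  # missing entry becomes 0.0
print(txt_filled.ravel())  # missing entry becomes 'missing_value'
print(num_median.ravel())  # missing entry becomes 2.0
```

Filling numeric gaps with 0 is the simplest choice, not necessarily the best one. Whether a mean or median fill scores better depends on the data, so it is worth comparing on the validation set.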

In [16]:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

# Preprocessing for numerical data: fill missing values with a constant (0 by default)
numerical_transformer = SimpleImputer(strategy = 'constant')


### The pipeline starts here

In [17]:
# Preprocessing for categorical data: impute the most frequent value, then one-hot encode
categorical_transformer = Pipeline(steps = [
        ('imputer', SimpleImputer(strategy = 'most_frequent')),
        ('onehot', OneHotEncoder(handle_unknown = 'ignore'))
])
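To see why `handle_unknown = 'ignore'` matters, here is a small sketch with made-up category values. A category that appears only at prediction time is encoded as a row of zeros instead of raising an error.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical training and validation columns; 'x' never appears in training.
train_col = pd.DataFrame({"Type": pd.Series(["h", "u", "t"], dtype=object)})
valid_col = pd.DataFrame({"Type": pd.Series(["h", "x"], dtype=object)})

encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train_col)

# The encoder returns a sparse matrix by default; convert it for display.
encoded = encoder.transform(valid_col).toarray()

print(encoder.categories_[0])  # learned categories, sorted: 'h', 't', 'u'
print(encoded)                 # 'h' -> [1, 0, 0]; unseen 'x' -> [0, 0, 0]
```

With the default `handle_unknown='error'`, the `transform` call on `valid_col` would fail, which is a common surprise when the validation split contains a rare category.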

### Bundle them together!

In [19]:
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_cols),
        ('cat', categorical_transformer, categorical_cols)
    ]
)
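The bundled preprocessor can be run on its own to inspect what it produces. The sketch below rebuilds the same structure on a four-row frame with made-up values; `toy` and `toy_preprocessor` are illustrative names, not part of the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical data: one numeric and one categorical column, each with a gap.
toy = pd.DataFrame({
    "Rooms": [2.0, np.nan, 3.0, 3.0],
    "Type": pd.Series(["h", np.nan, "u", "h"], dtype=object),
})

toy_preprocessor = ColumnTransformer(transformers=[
    ("num", SimpleImputer(strategy="constant"), ["Rooms"]),
    ("cat", Pipeline(steps=[
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]), ["Type"]),
])

out = toy_preprocessor.fit_transform(toy)
# ColumnTransformer may return a sparse matrix; densify for printing.
if hasattr(out, "toarray"):
    out = out.toarray()

# Columns: Rooms, Type == 'h', Type == 'u'
print(out)
```

The output columns follow the order of the `transformers` list: the numerical columns first, then the one-hot columns. In the second row the missing `Rooms` becomes 0 and the missing `Type` is filled with the most frequent value, `'h'`.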

## Step2 : Define the Model
* Here we use sklearn.ensemble.RandomForestRegressor.

In [20]:
from sklearn.ensemble import RandomForestRegressor
model = RandomForestRegressor(n_estimators = 100, random_state = 0)

## Step3 : Create and Evaluate the Pipeline
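* A few things to note:
    * With the pipeline, preprocessing the training data and fitting the model take **a single line** of code.
    * When the unprocessed features in X_valid are passed to predict(), the pipeline preprocesses them automatically before generating predictions.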

In [21]:
from sklearn.metrics import mean_absolute_error

# Bundle the preprocessing and modeling code in a pipeline
my_pipeline = Pipeline(steps = [('preprocessor', preprocessor),
                                ('model', model)])

# Preprocess the training data and fit the model
my_pipeline.fit(X_train, y_train)

# Preprocess the validation data and get predictions
preds = my_pipeline.predict(X_valid)

# Evaluate the predictions with mean absolute error
score = mean_absolute_error(y_valid, preds)
print('MAE', score)

MAE 160679.18917034855
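For comparison, the sketch below spells out what the pipeline does behind `fit` and `predict`, using synthetic numeric data and a single imputation step. All names here (`X_tr`, `X_va`, `pipe`, and so on) are made up for illustration, and the setup is deliberately simpler than the housing example.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline

# Synthetic data with missing values (assumption: stands in for real features).
rng = np.random.default_rng(0)
X_all = rng.normal(size=(120, 3))
y_all = 2 * X_all[:, 0] + rng.normal(scale=0.1, size=120)
X_all[rng.random(X_all.shape) < 0.1] = np.nan
X_tr, X_va, y_tr = X_all[:100], X_all[100:], y_all[:100]

# With a pipeline: one fit call, one predict call.
pipe = Pipeline(steps=[
    ("preprocessor", SimpleImputer(strategy="constant")),
    ("model", RandomForestRegressor(n_estimators=10, random_state=0)),
])
pipe.fit(X_tr, y_tr)
pipe_preds = pipe.predict(X_va)

# By hand: fit the imputer on the training data only, then reuse it.
imputer = SimpleImputer(strategy="constant")
forest = RandomForestRegressor(n_estimators=10, random_state=0)
forest.fit(imputer.fit_transform(X_tr), y_tr)
manual_preds = forest.predict(imputer.transform(X_va))

print(np.allclose(pipe_preds, manual_preds))
```

Both routes give the same predictions. The manual version, though, relies on remembering to call `transform` rather than `fit_transform` on the validation data, which is exactly the kind of mistake the pipeline rules out.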
