# Section 1-4 - Building Pipelines

GridSearchCV evaluates each parameter combination with cross-validation, so every fit sees only part of the training data. When we filled the NA values with the mean, however, that mean was computed over the entire dataset.

Our approach was therefore inconsistent: the grid search worked on subsets of the data, while the missing values were filled using the full set. Building a pipeline and doing the imputation inside it avoids this inconsistency.
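To make the inconsistency concrete, here is a minimal sketch with made-up ages (not the Titanic data). It compares the mean over the full column with the mean that each cross-validation training split would compute on its own. The four contiguous folds are an assumption chosen for illustration.

```python
import numpy as np

# Made-up "Age" column with missing values (hypothetical data).
ages = np.array([22.0, np.nan, 38.0, 26.0, np.nan, 54.0, 2.0, 27.0])

# Mean over the whole column: what a one-off fill with the mean uses.
global_mean = np.nanmean(ages)

# Split the row indices into 4 contiguous folds and compute the mean
# that each training split (everything except the held-out fold) would see.
folds = np.array_split(np.arange(len(ages)), 4)
fold_means = []
for held_out in folds:
    train_mask = np.ones(len(ages), dtype=bool)
    train_mask[held_out] = False
    fold_means.append(np.nanmean(ages[train_mask]))

print(global_mean)
print(fold_means)
```

The per-fold means differ from the global one, which is exactly the information from held-out rows that leaks in when the fill value is computed before the split.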

## Pandas - Extracting data

In [6]:
import numpy as np
import pandas as pd

df = pd.read_csv('../data/train.csv')

## Pandas - Cleaning data

This time we leave the NA values in the Age column as they are.

In [10]:
df = df.drop(['Name', 'Ticket', 'Cabin'], axis=1)

age_mean = df['Age'].mean()

# Series.mode() skips NaN, so this is the most frequent port of embarkation
mode_embarked = df['Embarked'].mode()[0]
df['Embarked'] = df['Embarked'].fillna(mode_embarked)

df['Gender'] = df['Sex'].map({'female': 0, 'male': 1}).astype(int)

df = pd.concat([df, pd.get_dummies(df['Embarked'], prefix='Embarked')], axis=1)

df = df.drop(['Sex', 'Embarked'], axis=1)

cols = df.columns.tolist()
cols = [cols[1]] + cols[0:1] + cols[2:]

df = df[cols]
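The cleaning steps above can be traced on a tiny frame. This is a sketch on four made-up rows that reuse the Titanic column names. The frame `toy` and its values are hypothetical.

```python
import pandas as pd

# Hypothetical four-row frame with the same column names as the Titanic data.
toy = pd.DataFrame({
    'PassengerId': [1, 2, 3, 4],
    'Survived': [0, 1, 1, 0],
    'Sex': ['male', 'female', 'female', 'male'],
    'Embarked': ['S', None, 'C', 'S'],
})

# Fill the missing port with the most frequent one ('S' here).
toy['Embarked'] = toy['Embarked'].fillna(toy['Embarked'].mode()[0])

# Encode Sex as 0/1 and Embarked as one indicator column per port.
toy['Gender'] = toy['Sex'].map({'female': 0, 'male': 1}).astype(int)
toy = pd.concat([toy, pd.get_dummies(toy['Embarked'], prefix='Embarked')], axis=1)
toy = toy.drop(['Sex', 'Embarked'], axis=1)

# Move the label column (Survived) to the front.
cols = toy.columns.tolist()
cols = [cols[1]] + cols[0:1] + cols[2:]
toy = toy[cols]

print(toy.columns.tolist())
```

Putting `Survived` first is what later allows the label to be selected with column index 0 of `df.values`.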



Replace the NA values in the Age column with -1, a negative marker. We do this because a bug prevents us from using the missing value marker.

In [12]:
df = df.fillna(-1)
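A sentinel such as -1 is an ordinary number as far as pandas is concerned, so it takes part in any statistic computed before it is imputed. The following sketch, on made-up ages, shows the effect on the mean.

```python
import numpy as np
import pandas as pd

# Made-up ages; NaN marks a missing value.
age = pd.Series([20.0, np.nan, 40.0, np.nan])

# Replace NaN with the -1 sentinel.
marked = age.fillna(-1)
print(marked.tolist())

# mean() skips NaN, but it does not skip the sentinel.
print(age.mean())
print(marked.mean())

# Excluding the sentinel recovers the mean of the observed ages.
observed_mean = marked[marked != -1].mean()
print(observed_mean)
```

This is why summary statistics of a column that still holds the marker should be read with care until the marker has been replaced.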

Now let's take a look at the dataset.

In [13]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 11 columns):
Survived       891 non-null int64
PassengerId    891 non-null int64
Pclass         891 non-null int64
Age            891 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Fare           891 non-null float64
Gender         891 non-null int32
Embarked_C     891 non-null uint8
Embarked_Q     891 non-null uint8
Embarked_S     891 non-null uint8
dtypes: float64(2), int32(1), int64(5), uint8(3)
memory usage: 54.9 KB


In [14]:
train_data = df.values

## Scikit-learn - Training the model

We build a pipeline that first fills the Age column with the mean of whichever portion of the training data is being considered, and then evaluates the performance of the tuning parameters.

In [30]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import Imputer
from sklearn.ensemble import RandomForestClassifier
from sklearn.grid_search import GridSearchCV

imputer = Imputer(strategy='mean', missing_values=-1)

classifier = RandomForestClassifier(n_estimators=100)

pipeline = Pipeline([
    ('imp', imputer),
    ('clf', classifier),
])
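`Imputer` and the `sklearn.grid_search` module were removed in later scikit-learn releases, where `sklearn.impute.SimpleImputer` and `sklearn.model_selection` take their place. The sketch below builds the same two-step pipeline with `SimpleImputer` swapped in. The toy matrix and the small forest are assumptions made to keep it quick to run.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline

# Toy feature matrix: column 0 plays the role of Age, with -1 as the missing marker.
X = np.array([[20.0, 1.0], [-1.0, 0.0], [40.0, 1.0], [30.0, 0.0]])
y = np.array([0, 1, 1, 0])

pipeline = Pipeline([
    ('imp', SimpleImputer(strategy='mean', missing_values=-1)),
    ('clf', RandomForestClassifier(n_estimators=10, random_state=0)),
])
pipeline.fit(X, y)

# The fill values are learned only from the rows passed to fit().
print(pipeline.named_steps['imp'].statistics_)
```

Because the imputer is fitted inside the pipeline, a cross-validated search refits it on each training split instead of reusing a mean computed over all rows.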

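Note the slight change in syntax inside the parameter grid: each key is the name of the pipeline step, followed by two underscores and the parameter name.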

In [31]:
parameter_grid = {
    'clf__max_features': [0.5, 1],
    'clf__max_depth': [5, None],
}
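The `clf__` prefix is how a pipeline addresses the parameters of its steps. As a small sketch (again with `SimpleImputer` swapped in for `Imputer`, which is an assumption for newer scikit-learn versions), the nested names can be listed with `get_params` and set with `set_params`:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ('imp', SimpleImputer(strategy='mean', missing_values=-1)),
    ('clf', RandomForestClassifier(n_estimators=100)),
])

# Every parameter of every step is exposed as '<step name>__<parameter>'.
params = pipeline.get_params()
print('clf__max_depth' in params)
print('imp__strategy' in params)

# Setting a nested parameter updates the underlying step.
pipeline.set_params(clf__max_depth=5, clf__max_features=0.5)
print(pipeline.named_steps['clf'].max_depth)
```

A key without the step prefix, such as plain `max_depth`, would not match any pipeline parameter, which is why the grid has to be rewritten when the classifier is wrapped.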

Run GridSearchCV as before, but with the pipeline in place of the classifier.

In [32]:
grid_search = GridSearchCV(pipeline, parameter_grid, cv=5, verbose=3)

In [35]:
grid_search.fit(train_data[0:,1:], train_data[0:,0])

Fitting 5 folds for each of 4 candidates, totalling 20 fits
[CV] clf__max_depth=5, clf__max_features=0.5 .........................
[CV]  clf__max_depth=5, clf__max_features=0.5, score=0.748603 -   0.0s
[CV] clf__max_depth=5, clf__max_features=0.5 .........................


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    0.2s remaining:    0.0s


[CV]  clf__max_depth=5, clf__max_features=0.5, score=0.821229 -   0.0s
[CV] clf__max_depth=5, clf__max_features=0.5 .........................
[CV]  clf__max_depth=5, clf__max_features=0.5, score=0.825843 -   0.0s
[CV] clf__max_depth=5, clf__max_features=0.5 .........................
[CV]  clf__max_depth=5, clf__max_features=0.5, score=0.780899 -   0.0s
[CV] clf__max_depth=5, clf__max_features=0.5 .........................
[CV]  clf__max_depth=5, clf__max_features=0.5, score=0.836158 -   0.0s
[CV] clf__max_depth=5, clf__max_features=1 ...........................
[CV] .. clf__max_depth=5, clf__max_features=1, score=0.648045 -   0.0s
[CV] clf__max_depth=5, clf__max_features=1 ...........................
[CV] .. clf__max_depth=5, clf__max_features=1, score=0.832402 -   0.0s
[CV] clf__max_depth=5, clf__max_features=1 ...........................
[CV] .. clf__max_depth=5, clf__max_features=1, score=0.842697 -   0.0s
[CV] clf__max_depth=5, clf__max_features=1 ...........................
[CV] .

[Parallel(n_jobs=1)]: Done  20 out of  20 | elapsed:    2.6s finished


GridSearchCV(cv=5, error_score='raise',
       estimator=Pipeline(steps=[('imp', Imputer(axis=0, copy=True, missing_values=-1, strategy='mean', verbose=0)), ('clf', RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_split=1e-07...ators=100, n_jobs=1, oob_score=False, random_state=None,
            verbose=0, warm_start=False))]),
       fit_params={}, iid=True, n_jobs=1,
       param_grid={'clf__max_features': [0.5, 1], 'clf__max_depth': [5, None]},
       pre_dispatch='2*n_jobs', refit=True, scoring=None, verbose=3)

In [39]:
sorted(grid_search.grid_scores_, key=lambda x: x.mean_validation_score)
grid_search.best_score_, grid_search.best_params_


(0.8047138047138047, {'clf__max_depth': None, 'clf__max_features': 1})
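In later scikit-learn releases `grid_scores_` is no longer available and the per-candidate scores are stored in `cv_results_`. The following sketch shows one way to list them from worst to best. It runs on synthetic data from `make_classification` rather than the Titanic set, and the small forest and `cv=3` are assumptions chosen to keep it fast.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Synthetic stand-in data, not the Titanic set.
X, y = make_classification(n_samples=80, n_features=5, random_state=0)

pipeline = Pipeline([
    ('imp', SimpleImputer(strategy='mean')),
    ('clf', RandomForestClassifier(n_estimators=10, random_state=0)),
])
parameter_grid = {
    'clf__max_features': [0.5, 1],
    'clf__max_depth': [5, None],
}
grid_search = GridSearchCV(pipeline, parameter_grid, cv=3)
grid_search.fit(X, y)

# One row per candidate, ordered from worst to best mean validation score.
results = pd.DataFrame(grid_search.cv_results_).sort_values('mean_test_score')
print(results[['params', 'mean_test_score']])
print(grid_search.best_score_, grid_search.best_params_)
```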

Replace the -1 values in the Age column (here -1 stands in for NaN) with the value we want, the mean age, and train the model.

In [41]:
df['Age'].describe()

count    891.000000
mean      23.600640
std       17.867496
min       -1.000000
25%        6.000000
50%       24.000000
75%       35.000000
max       80.000000
Name: Age, dtype: float64

In [42]:
df['Age'] = df['Age'].map(lambda x: age_mean if x == -1 else x)

In [43]:
df['Age'].describe()

count    891.000000
mean      29.699118
std       13.002015
min        0.420000
25%       22.000000
50%       29.699118
75%       35.000000
max       80.000000
Name: Age, dtype: float64

In [44]:
train_data = df.values

In [47]:
model = RandomForestClassifier(n_estimators=100, max_features=0.5, max_depth=5)
model = model.fit(train_data[0:, 2:], train_data[0:, 0])

## Scikit-learn - Making predictions

In [48]:
df_test = pd.read_csv('../data/test.csv')

df_test = df_test.drop(['Name', 'Ticket', 'Cabin'], axis=1)

Fill the NA values of Age in the test data with the mean age computed from the training data.

In [49]:
df_test['Age'] = df_test['Age'].fillna(age_mean)

In [53]:
# Mean fare per passenger class, as a Series indexed by Pclass
fare_means = df.groupby('Pclass')['Fare'].mean()
df_test['Fare'] = df_test[['Fare', 'Pclass']].apply(lambda x: fare_means.loc[int(x['Pclass'])] if pd.isnull(x['Fare']) else x['Fare'], axis=1)

df_test['Gender'] = df_test['Sex'].map({'female': 0, 'male': 1}).astype(int)
df_test = pd.concat([df_test, pd.get_dummies(df_test['Embarked'], prefix='Embarked')],
                   axis=1)

df_test = df_test.drop(['Sex', 'Embarked'], axis=1)

test_data = df_test.values

output = model.predict(test_data[0:, 1:])
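The class-wise Fare fill can also be written without a row-wise `apply`. This is a sketch on made-up frames. The names `train` and `test` and their values are hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up train and test frames with the same column names.
train = pd.DataFrame({'Pclass': [1, 1, 3, 3], 'Fare': [80.0, 60.0, 8.0, 10.0]})
test = pd.DataFrame({'Pclass': [3, 1, 3], 'Fare': [7.5, np.nan, np.nan]})

# Mean fare per class, computed on the training frame only.
fare_means = train.groupby('Pclass')['Fare'].mean()

# Look up each row's class mean, then use it only where Fare is missing.
test['Fare'] = test['Fare'].fillna(test['Pclass'].map(fare_means))
print(test['Fare'].tolist())
```

Mapping `Pclass` through the per-class means produces a candidate fill value for every row, and `fillna` applies it only to the rows where Fare is missing.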

## Pandas - Preparing for submission

In [54]:
# Cast PassengerId to int so the CSV holds integer ids rather than floats
result = np.c_[test_data[:, 0].astype(int), output.astype(int)]

df_result = pd.DataFrame(result, columns=['PassengerId', 'Survived'])
df_result.to_csv('../results/titanic_test_1-4.csv', index=False)
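Before uploading, it can help to read the file back and check its header and dtypes. The sketch below does this on made-up ids and predictions, writing to a temporary path. It assumes the expected header is `PassengerId,Survived`.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Made-up ids and predictions standing in for test_data[:, 0] and output.
passenger_ids = np.array([892.0, 893.0, 894.0])
output = np.array([0.0, 1.0, 0.0])

result = np.c_[passenger_ids.astype(int), output.astype(int)]
df_result = pd.DataFrame(result, columns=['PassengerId', 'Survived'])

# Write to a temporary file (hypothetical path) and read it back.
path = os.path.join(tempfile.mkdtemp(), 'submission_check.csv')
df_result.to_csv(path, index=False)
check = pd.read_csv(path)

print(check.columns.tolist())
print(check.dtypes.tolist())
```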