# Section 1-3 - Parameter Tuning

In the previous sections, we used Scikit-learn as a black box. In this section we tune the model's parameters to improve its accuracy.

# Pandas - Extracting data

In [1]:
import pandas as pd
import numpy as np

df = pd.read_csv('../data/train.csv')

# Pandas - Cleaning data

In [2]:
df = df.drop(['Name', 'Ticket', 'Cabin'], axis=1)

age_mean = df['Age'].mean()
df['Age'] = df['Age'].fillna(age_mean)

# Most frequent port of embarkation; Series.mode() ignores missing values
mode_embarked = df['Embarked'].mode()[0]
df['Embarked'] = df['Embarked'].fillna(mode_embarked)

df['Gender'] = df['Sex'].map({'female': 0, 'male': 1}).astype(int)

df = pd.concat([df, pd.get_dummies(df['Embarked'], prefix='Embarked')], axis=1)

df = df.drop(['Sex', 'Embarked'], axis=1)

cols = df.columns.tolist()
cols = [cols[1]] + cols[0:1] + cols[2:]

df = df[cols]

train_data = df.values



# Scikit-learn - Training the model

The Random Forest Classifier documentation describes the model's input parameters in detail. These include the number of trees and how far each tree is allowed to branch.

http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
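As a rough illustration of those two parameters, the sketch below fits a small forest and checks how many trees it holds and how deep they grew. The synthetic data from `make_classification` and the specific values (20 trees, depth 3) are assumptions chosen for the example, not settings from this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (an assumption; the notebook itself uses the Titanic data).
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# n_estimators sets the number of trees; max_depth caps how deep each tree may grow.
forest = RandomForestClassifier(n_estimators=20, max_depth=3, random_state=0)
forest.fit(X, y)

n_trees = len(forest.estimators_)
deepest = max(tree.get_depth() for tree in forest.estimators_)
print(n_trees, deepest)
```

No tree can exceed the depth cap, so `deepest` is at most 3 here.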

GridSearchCV lets us test a specified range of input parameters, evaluating each combination of values with cross-validation. Here we vary the number of features considered at each split (max_features: 50% or 100% of the features) and the maximum depth of each tree (max_depth: 5 levels or no limit).
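To make the size of the search concrete, the following sketch enumerates the candidate combinations with `ParameterGrid` from `sklearn.model_selection`. It assumes a recent scikit-learn release; the variable names `candidates`, `n_folds` and `total_fits` are illustrative only.

```python
from sklearn.model_selection import ParameterGrid

parameter_grid = {
    'max_features': [0.5, 1.0],
    'max_depth': [5, None],
}

# Every combination of the listed values becomes one candidate model.
candidates = list(ParameterGrid(parameter_grid))

# Each candidate is fitted once per cross-validation fold.
n_folds = 5
total_fits = len(candidates) * n_folds

for params in candidates:
    print(params)
print(total_fits)
```

Two values for each of two parameters give 4 candidates, and 5 folds each give 20 fits, which agrees with the "totalling 20 fits" line in the fit log.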

In [3]:
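from sklearn.ensemble import RandomForestClassifier
# GridSearchCV lives in sklearn.model_selection (sklearn.grid_search was removed in scikit-learn 0.20)
from sklearn.model_selection import GridSearchCV

parameter_grid = {
    'max_features': [0.5, 1.],
    'max_depth': [5, None]  # an integer depth, or None for no limit
}

grid_search = GridSearchCV(RandomForestClassifier(n_estimators=100), parameter_grid,
                           cv=5, verbose=3)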



In [4]:
grid_search.fit(train_data[:, 2:], train_data[0:, 0])

Fitting 5 folds for each of 4 candidates, totalling 20 fits
[CV] max_depth=5.0, max_features=0.5 .................................
[CV] ........ max_depth=5.0, max_features=0.5, score=0.815642 -   0.0s
[CV] max_depth=5.0, max_features=0.5 .................................


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    0.2s remaining:    0.0s


[CV] ........ max_depth=5.0, max_features=0.5, score=0.826816 -   0.0s
[CV] max_depth=5.0, max_features=0.5 .................................
[CV] ........ max_depth=5.0, max_features=0.5, score=0.820225 -   0.0s
[CV] max_depth=5.0, max_features=0.5 .................................
[CV] ........ max_depth=5.0, max_features=0.5, score=0.792135 -   0.0s
[CV] max_depth=5.0, max_features=0.5 .................................
[CV] ........ max_depth=5.0, max_features=0.5, score=0.853107 -   0.0s
[CV] max_depth=5.0, max_features=1.0 .................................
[CV] ........ max_depth=5.0, max_features=1.0, score=0.798883 -   0.0s
[CV] max_depth=5.0, max_features=1.0 .................................
[CV] ........ max_depth=5.0, max_features=1.0, score=0.821229 -   0.0s
[CV] max_depth=5.0, max_features=1.0 .................................
[CV] ........ max_depth=5.0, max_features=1.0, score=0.820225 -   0.0s
[CV] max_depth=5.0, max_features=1.0 .................................
[CV] .

[Parallel(n_jobs=1)]: Done  20 out of  20 | elapsed:    2.8s finished


GridSearchCV(cv=5, error_score='raise',
       estimator=RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=100, n_jobs=1, oob_score=False, random_state=None,
            verbose=0, warm_start=False),
       fit_params={}, iid=True, n_jobs=1,
       param_grid={'max_features': [0.5, 1.0], 'max_depth': [5.0, None]},
       pre_dispatch='2*n_jobs', refit=True, scoring=None, verbose=3)

Now let's look at the results.

In [5]:
grid_search.grid_scores_

[mean: 0.82155, std: 0.01962, params: {'max_depth': 5.0, 'max_features': 0.5},
 mean: 0.81706, std: 0.01819, params: {'max_depth': 5.0, 'max_features': 1.0},
 mean: 0.81257, std: 0.03725, params: {'max_depth': None, 'max_features': 0.5},
 mean: 0.81369, std: 0.02610, params: {'max_depth': None, 'max_features': 1.0}]
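The `grid_scores_` attribute used above was removed in scikit-learn 0.20; in later releases the same per-candidate means and standard deviations are available through `cv_results_`. The sketch below shows one way to read them, assuming a recent scikit-learn and synthetic stand-in data rather than the Titanic set, so its scores will not match the ones recorded here.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (an assumption for this sketch).
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

search = GridSearchCV(
    RandomForestClassifier(n_estimators=20, random_state=0),
    {'max_features': [0.5, 1.0], 'max_depth': [5, None]},
    cv=5,
)
search.fit(X, y)

# cv_results_ is a dict of arrays; a DataFrame makes it easy to inspect and sort.
results = pd.DataFrame(search.cv_results_)
summary = results[['params', 'mean_test_score', 'std_test_score', 'rank_test_score']]
print(summary.sort_values('rank_test_score'))
```

Sorting by `rank_test_score` puts the best-scoring combination in the first row.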

Sort the results and pick the best-performing parameters for tuning.

In [6]:
sorted(grid_search.grid_scores_, key=lambda x: x.mean_validation_score)
grid_search.best_score_
grid_search.best_params_

{'max_depth': 5.0, 'max_features': 0.5}

Use these parameters to train the tuned model.

In [7]:
model = RandomForestClassifier(n_estimators=100, max_features=0.5, max_depth=5)
model = model.fit(train_data[0:, 2:], train_data[0:, 0])
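The GridSearchCV output above shows `refit=True`, which means the search object already holds a model retrained on the full training data with the best parameters. As an alternative to rebuilding the classifier by hand, that model can be read from `best_estimator_`. The sketch below assumes a recent scikit-learn and synthetic stand-in data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (an assumption for this sketch).
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

search = GridSearchCV(
    RandomForestClassifier(n_estimators=20, random_state=0),
    {'max_features': [0.5, 1.0], 'max_depth': [5, None]},
    cv=5,
    refit=True,  # refit the best candidate on all of X, y after the search
)
search.fit(X, y)

# best_estimator_ is the refitted model and can predict directly.
best_model = search.best_estimator_
predictions = best_model.predict(X[:5])
print(search.best_params_, predictions)
```

Building the model explicitly, as the cell above does, makes the chosen values visible in the code; using `best_estimator_` avoids copying them by hand.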

# Scikit-learn - Making predictions

In [9]:
df_test = pd.read_csv('../data/test.csv')

df_test = df_test.drop(['Name', 'Ticket', 'Cabin'], axis=1)

df_test['Age'] = df_test['Age'].fillna(age_mean)

# Mean training-set fare per passenger class, used to fill missing test fares
fare_means = df.groupby('Pclass')['Fare'].mean()
df_test['Fare'] = df_test['Fare'].fillna(df_test['Pclass'].map(fare_means))

df_test['Gender'] = df_test['Sex'].map({'female': 0, 'male': 1}).astype(int)
df_test = pd.concat([df_test, pd.get_dummies(df_test['Embarked'], prefix='Embarked')],
                axis=1)

df_test = df_test.drop(['Sex', 'Embarked'], axis=1)

test_data = df_test.values

output = model.predict(test_data[:,1:])

# Pandas - Preparing for submission

In [10]:
result = np.c_[test_data[:, 0].astype(int), output.astype(int)]

# Kaggle's Titanic submission format expects these exact column names
df_result = pd.DataFrame(result, columns=['PassengerId', 'Survived'])
df_result.to_csv('../results/titanic_test_1-3.csv', index=False)