# Section 1-4 - Building Pipelines

GridSearchCV reviews the performance of a set range of parameters on a cross-validation basis. This means only a portion of the training data is reviewed at any one time. When filling in the NA values with the mean value, however, we considered the whole set of training data.

Hence we took an inconsistent approach in reviewing only a portion of the data when running GridSearchCV, but the full set of data when filling in missing values. We can avoid this inconsistency by building pipelines and making imputations.

## Pandas - Extracting data

In [105]:
import pandas as pd
import numpy as np

df = pd.read_csv('../data/train.csv')
df.head(2)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38,1,0,PC 17599,71.2833,C85,C


## Pandas - Cleaning data

In [106]:
df[df['Embarked'].isnull()]
df['Embarked'].dropna().mode().values

array(['S'], dtype=object)

Doesn't work because no Prefix column

In [95]:
age_means_prefix = df.pivot_table('Age', index='Prefix', aggfunc='mean')

df[ (df['Name'].str.contains('Mr.') | df['Name'].str.contains('Miss') | df['Name'].str.contains('Ms')) 
   & df['Age'].isnull()].head()

#fill age based on Prefix
df['Age'] = df[['Age', 'Prefix']].apply(lambda x:
                            age_means_prefix[x['Prefix']] if pd.isnull(x['Age'])
                            else x['Age'], axis=1)


In [107]:
df = df.drop(['Name', 'Ticket', 'Cabin'], axis=1)

age_mean = df['Age'].mean()

from scipy.stats import mode

# Replace missing values with most common port
df['Embarked'].isnull= df['Embarked'].dropna().mode().values

#mode_embarked = mode(df['Embarked'])[0][0]
#df['Embarked'] = df['Embarked'].fillna(mode_embarked)

df['Gender'] = df['Sex'].map({'female': 0, 'male': 1}).astype(int)

pd.get_dummies(df['Embarked'], prefix='Embarked').head(10)
df = pd.concat([df, pd.get_dummies(df['Embarked'], prefix='Embarked')], axis=1)

df = df.drop(['Sex', 'Embarked'], axis=1)

cols = df.columns.tolist()
cols = [cols[1]] + cols[0:1] + cols[2:]

df = df[cols]
df.head(2)

Unnamed: 0,Survived,PassengerId,Pclass,Age,SibSp,Parch,Fare,Gender,Embarked_C,Embarked_Q,Embarked_S
0,0,1,3,22,1,0,7.25,1,0,0,1
1,1,2,1,38,1,0,71.2833,0,1,0,0


We replace the NA values in the column Age with a negative value marker -1, as the following bug disallows us from using a missing value marker:

https://github.com/scikit-learn/scikit-learn/issues/3044

In [108]:
df = df.fillna(-1)

We then review our dataset.

In [109]:
df.info()
df.describe()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 891 entries, 0 to 890
Data columns (total 11 columns):
Survived       891 non-null int64
PassengerId    891 non-null int64
Pclass         891 non-null int64
Age            891 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Fare           891 non-null float64
Gender         891 non-null int32
Embarked_C     891 non-null float64
Embarked_Q     891 non-null float64
Embarked_S     891 non-null float64
dtypes: float64(5), int32(1), int64(5)
memory usage: 80.1 KB


Unnamed: 0,Survived,PassengerId,Pclass,Age,SibSp,Parch,Fare,Gender,Embarked_C,Embarked_Q,Embarked_S
count,891.0,891.0,891.0,891.0,891.0,891.0,891.0,891.0,891.0,891.0,891.0
mean,0.383838,446.0,2.308642,23.60064,0.523008,0.381594,32.204208,0.647587,0.188552,0.08642,0.722783
std,0.486592,257.353842,0.836071,17.867496,1.102743,0.806057,49.693429,0.47799,0.391372,0.281141,0.447876
min,0.0,1.0,1.0,-1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.0,223.5,2.0,6.0,0.0,0.0,7.9104,0.0,0.0,0.0,0.0
50%,0.0,446.0,3.0,24.0,0.0,0.0,14.4542,1.0,0.0,0.0,1.0
75%,1.0,668.5,3.0,35.0,1.0,0.0,31.0,1.0,0.0,0.0,1.0
max,1.0,891.0,3.0,80.0,8.0,6.0,512.3292,1.0,1.0,1.0,1.0


In [110]:
train_data = df.values
train_data

array([[   0.,    1.,    3., ...,    0.,    0.,    1.],
       [   1.,    2.,    1., ...,    1.,    0.,    0.],
       [   1.,    3.,    3., ...,    0.,    0.,    1.],
       ..., 
       [   0.,  889.,    3., ...,    0.,    0.,    1.],
       [   1.,  890.,    1., ...,    1.,    0.,    0.],
       [   0.,  891.,    3., ...,    0.,    1.,    0.]])

## Scikit-learn - Training the model

We now build a pipeline to enable us to first impute the mean value of the column Age on the portion of the training data we are considering, and second, assess the performance of our tuning parameters.

In [111]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import Imputer
from sklearn.ensemble import RandomForestClassifier
from sklearn.grid_search import GridSearchCV

imputer = Imputer(strategy='mean', missing_values=-1)

classifier = RandomForestClassifier(n_estimators=100)

pipeline = Pipeline([
    ('imp', imputer),
    ('clf', classifier),
])

We note the slight change made to the syntax inside our parameter grid.

In [112]:
parameter_grid = {
    'clf__max_features': [0.5, 1],
    'clf__max_depth': [5, None],
}

We now run GridSearchCV as before but replacing the classifier with our pipeline.

In [113]:
grid_search = GridSearchCV(pipeline, parameter_grid, cv=5, verbose=3)

In [114]:
train_data[0:,2:]

array([[  3.,  22.,   1., ...,   0.,   0.,   1.],
       [  1.,  38.,   1., ...,   1.,   0.,   0.],
       [  3.,  26.,   0., ...,   0.,   0.,   1.],
       ..., 
       [  3.,  -1.,   1., ...,   0.,   0.,   1.],
       [  1.,  26.,   0., ...,   1.,   0.,   0.],
       [  3.,  32.,   0., ...,   0.,   1.,   0.]])

In [115]:
grid_search.fit(train_data[0:,2:], train_data[0:,0])

Fitting 5 folds for each of 4 candidates, totalling 20 fits
[CV] clf__max_depth=5, clf__max_features=0.5 .........................
[CV]  clf__max_depth=5, clf__max_features=0.5, score=0.804469 -   0.0s
[CV] clf__max_depth=5, clf__max_features=0.5 .........................
[CV]  clf__max_depth=5, clf__max_features=0.5, score=0.826816 -   0.0s
[CV] clf__max_depth=5, clf__max_features=0.5 .........................
[CV]  clf__max_depth=5, clf__max_features=0.5, score=0.837079 -   0.0s
[CV] clf__max_depth=5, clf__max_features=0.5 .........................
[CV]  clf__max_depth=5, clf__max_features=0.5, score=0.786517 -   0.0s
[CV] clf__max_depth=5, clf__max_features=0.5 .........................
[CV]  clf__max_depth=5, clf__max_features=0.5, score=0.858757 -   0.0s
[CV] clf__max_depth=5, clf__max_features=1 ...........................
[CV] .. clf__max_depth=5, clf__max_features=1, score=0.776536 -   0.0s
[CV] clf__max_depth=5, clf__max_features=1 ...........................
[CV] .. clf__max_

[Parallel(n_jobs=1)]: Done   1 jobs       | elapsed:    0.0s
[Parallel(n_jobs=1)]: Done  20 out of  20 | elapsed:    2.4s finished





GridSearchCV(cv=5,
       estimator=Pipeline(steps=[('imp', Imputer(axis=0, copy=True, missing_values=-1, strategy='mean', verbose=0)), ('clf', RandomForestClassifier(bootstrap=True, compute_importances=None,
            criterion='gini', max_depth=None, max_features='auto',
            max_leaf_nodes=None, min_density=None, min_samples_leaf=1,
            min_samples_split=2, n_estimators=100, n_jobs=1,
            oob_score=False, random_state=None, verbose=0))]),
       fit_params={}, iid=True, loss_func=None, n_jobs=1,
       param_grid={'clf__max_depth': [5, None], 'clf__max_features': [0.5, 1]},
       pre_dispatch='2*n_jobs', refit=True, score_func=None, scoring=None,
       verbose=3)

In [116]:
sorted(grid_search.grid_scores_, key=lambda x: x.mean_validation_score)
grid_search.best_score_
grid_search.best_params_

{'clf__max_depth': 5, 'clf__max_features': 0.5}

Now that we've determined the desired values for our tuning parameters, we can fill in the -1 values in the column Age with the mean and train our model.

In [117]:
df['Age'].describe()

count    891.000000
mean      23.600640
std       17.867496
min       -1.000000
25%        6.000000
50%       24.000000
75%       35.000000
max       80.000000
Name: Age, dtype: float64

In [118]:
df['Age'] = df['Age'].map(lambda x: age_mean if x == -1 else x)

In [119]:
df['Age'].describe()

count    891.000000
mean      29.699118
std       13.002015
min        0.420000
25%       22.000000
50%       29.699118
75%       35.000000
max       80.000000
Name: Age, dtype: float64

In [120]:
train_data = df.values

In [121]:
model = RandomForestClassifier(n_estimators = 100, max_features=0.5, max_depth=5)
model = model.fit(train_data[0:,2:],train_data[0:,0])

## Scikit-learn - Making predictions

In [122]:
df_test = pd.read_csv('../data/test.csv')

df_test = df_test.drop(['Name', 'Ticket', 'Cabin'], axis=1)

We can fill in the NA values in test data with the mean, since there is no analogous problem of snooping.

In [123]:
df_test['Age'] = df_test['Age'].fillna(age_mean)

In [124]:
fare_means = df.pivot_table('Fare', index='Pclass', aggfunc='mean')
df_test['Fare'] = df_test[['Fare', 'Pclass']].apply(lambda x:
                            fare_means[x['Pclass']] if pd.isnull(x['Fare'])
                            else x['Fare'], axis=1)

df_test['Gender'] = df_test['Sex'].map({'female': 0, 'male': 1}).astype(int)
df_test = pd.concat([df_test, pd.get_dummies(df_test['Embarked'], prefix='Embarked')],
                axis=1)

df_test = df_test.drop(['Sex', 'Embarked'], axis=1)

test_data = df_test.values

output = model.predict(test_data[:,1:])



## Pandas - Preparing for submission

In [125]:
result = np.c_[test_data[:,0].astype(int), output.astype(int)]

df_result = pd.DataFrame(result[:,0:2], columns=['PassengerId', 'Survived'])
df_result.to_csv('../results/titanic_1-4.csv', index=False)