# Exercise 06

## Data preparation and model evaluation exercise with Titanic data




We'll be working with a dataset from Kaggle's Titanic competition: [data](https://github.com/justmarkham/DAT8/blob/master/data/titanic.csv), [data dictionary](https://www.kaggle.com/c/titanic/data)

**Goal**: Predict survival based on passenger characteristics

The sinking of the RMS Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1,502 of the 2,224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships.

One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper class.

In this challenge, we ask you to complete the analysis of what sorts of people were likely to survive. In particular, we ask you to apply the tools of machine learning to predict which passengers survived the tragedy.


Read the data into Pandas

In [9]:
import pandas as pd
url = 'https://raw.githubusercontent.com/justmarkham/DAT8/master/data/titanic.csv'
titanic = pd.read_csv(url, index_col='PassengerId')
titanic.head()

| PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 5 | 0 | 3 | Allen, Mr. William Henry | male | 35 | 0 | 0 | 373450 | 8.05 | NaN | S |


# Exercise 6.1 

Impute the missing values of Age and Embarked

In [10]:
# check for missing values
titanic.isnull().sum()

Survived      0
Pclass        0
Name          0
Sex           0
Age         177
SibSp         0
Parch         0
Ticket        0
Fare          0
Cabin       687
Embarked      2
dtype: int64

In [18]:
# fill missing values for Age with the median age
titanic['Age'] = titanic['Age'].fillna(titanic['Age'].median())

In [28]:
# fill missing values for Embarked with the mode
titanic['Embarked'] = titanic['Embarked'].fillna(titanic['Embarked'].mode()[0])

In [30]:
titanic.isnull().sum()

Survived      0
Pclass        0
Name          0
Sex           0
Age           0
SibSp         0
Parch         0
Ticket        0
Fare          0
Cabin       687
Embarked      0
dtype: int64
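
The two imputation steps above can be tried on a small frame first. The sketch below uses made-up values (the `toy` DataFrame is hypothetical, not Titanic data) to show that `median()` skips missing entries and that `mode()` returns a Series, so its first element has to be picked explicitly.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with gaps in a numeric and a categorical column
toy = pd.DataFrame({
    'Age': [22.0, np.nan, 26.0, 35.0, np.nan],
    'Embarked': ['S', 'C', None, 'S', 'Q'],
})

# median() ignores NaN by default, so this is the median of [22, 26, 35]
age_median = toy['Age'].median()
toy['Age'] = toy['Age'].fillna(age_median)

# mode() returns a Series because ties are possible; take the first entry
embarked_mode = toy['Embarked'].mode()[0]
toy['Embarked'] = toy['Embarked'].fillna(embarked_mode)

print(age_median, embarked_mode)
print(toy.isnull().sum())
```

Assigning the result of `fillna` back to the column, rather than calling it with `inplace=True` on an attribute, avoids relying on in-place modification of an intermediate object.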

# Exercise 6.2

Convert Sex and Embarked to categorical features

In [31]:
# encode Sex_Female feature
titanic['Sex_Female'] = titanic.Sex.map({'male':0, 'female':1})

In [32]:
# create a DataFrame of 0/1 dummy variables for Embarked,
# dropping the first level (Embarked_C) so it acts as the baseline
embarked_dummies = pd.get_dummies(titanic.Embarked, prefix='Embarked', drop_first=True, dtype=int)

# concatenate the original DataFrame and the dummy DataFrame
titanic = pd.concat([titanic, embarked_dummies], axis=1)

In [36]:
titanic.head()

| PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | Sex_Female | Embarked_Q | Embarked_S |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22 | 1 | 0 | A/5 21171 | 7.25 | NaN | S | 0 | 0 | 1 |
| 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38 | 1 | 0 | PC 17599 | 71.2833 | C85 | C | 1 | 0 | 0 |
| 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S | 1 | 0 | 1 |
| 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35 | 1 | 0 | 113803 | 53.1 | C123 | S | 1 | 0 | 1 |
| 5 | 0 | 3 | Allen, Mr. William Henry | male | 35 | 0 | 0 | 373450 | 8.05 | NaN | S | 0 | 0 | 1 |
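
Dropping one dummy column loses no information: a row with zeros in every remaining column must belong to the dropped level. The sketch below illustrates this on a hypothetical four-value `ports` Series rather than the real Embarked column.

```python
import pandas as pd

# Hypothetical port codes, one per passenger
ports = pd.Series(['S', 'C', 'Q', 'S'], name='Embarked')

# Full one-hot encoding: exactly one 1 per row
full = pd.get_dummies(ports, prefix='Embarked', dtype=int)

# Drop the first level (alphabetically 'C'), which becomes the baseline
reduced = full.drop(columns=full.columns[0])

print(full)
print(reduced)

# Rows that are all zero in `reduced` are the baseline level 'C'
baseline_rows = reduced.index[reduced.sum(axis=1) == 0].tolist()
print(baseline_rows)
```

Keeping all three columns would make them sum to one in every row, which is redundant with the intercept in a logistic regression.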


# Exercise 6.3 (2 points)

From the set of features ['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked']

Note: use the categorical features created above for Sex and Embarked.

Select the features that maximize the **accuracy** of the model using K-Fold cross-validation

In [None]:
y = titanic['Survived']

In [38]:
# candidate features (SibSp and Fare from the list above are not included here)
features = ['Pclass', 'Age', 'Parch', 'Sex_Female', 'Embarked_Q', 'Embarked_S']

In [46]:
import itertools

possible_models = []
for i in range(1,len(features)+1):
    possible_models.extend(list(itertools.combinations(features,i)))

possible_models

[('Pclass',),
 ('Age',),
 ('Parch',),
 ('Sex_Female',),
 ('Embarked_Q',),
 ('Embarked_S',),
 ('Pclass', 'Age'),
 ('Pclass', 'Parch'),
 ('Pclass', 'Sex_Female'),
 ('Pclass', 'Embarked_Q'),
 ('Pclass', 'Embarked_S'),
 ('Age', 'Parch'),
 ('Age', 'Sex_Female'),
 ('Age', 'Embarked_Q'),
 ('Age', 'Embarked_S'),
 ('Parch', 'Sex_Female'),
 ('Parch', 'Embarked_Q'),
 ('Parch', 'Embarked_S'),
 ('Sex_Female', 'Embarked_Q'),
 ('Sex_Female', 'Embarked_S'),
 ('Embarked_Q', 'Embarked_S'),
 ('Pclass', 'Age', 'Parch'),
 ('Pclass', 'Age', 'Sex_Female'),
 ('Pclass', 'Age', 'Embarked_Q'),
 ('Pclass', 'Age', 'Embarked_S'),
 ('Pclass', 'Parch', 'Sex_Female'),
 ('Pclass', 'Parch', 'Embarked_Q'),
 ('Pclass', 'Parch', 'Embarked_S'),
 ('Pclass', 'Sex_Female', 'Embarked_Q'),
 ('Pclass', 'Sex_Female', 'Embarked_S'),
 ('Pclass', 'Embarked_Q', 'Embarked_S'),
 ('Age', 'Parch', 'Sex_Female'),
 ('Age', 'Parch', 'Embarked_Q'),
 ('Age', 'Parch', 'Embarked_S'),
 ('Age', 'Sex_Female', 'Embarked_Q'),
 ('Age', 'Sex_Female', '
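
The enumeration above produces every non-empty subset of the six candidate features, which is 2^6 - 1 = 63 models. As a quick sanity check, the sketch below rebuilds the list from the same feature names and counts the subsets by size (the variable names `subsets` and `sizes` are just for illustration).

```python
import itertools
from math import comb

features = ['Pclass', 'Age', 'Parch', 'Sex_Female', 'Embarked_Q', 'Embarked_S']
n = len(features)

# Every non-empty combination, from single features up to the full set
subsets = [combo
           for k in range(1, n + 1)
           for combo in itertools.combinations(features, k)]

# Number of subsets of each size k is "n choose k"
sizes = [comb(n, k) for k in range(1, n + 1)]

print(len(subsets), sizes, 2 ** n - 1)
```

The count doubles with each added feature, so exhaustive search of this kind is only practical for small feature lists.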

In [59]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# a very large C makes the regularization penalty negligible
logreg = LogisticRegression(C=1e9)

# one row per candidate feature subset
results = pd.DataFrame(index=possible_models, columns=['accuracy'])
for model in possible_models:
    X = titanic[list(model)]
    # mean accuracy over 10 cross-validation folds
    results.loc[model, 'accuracy'] = cross_val_score(logreg, X, y, cv=10, scoring='accuracy').mean()
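
One detail worth knowing about `cross_val_score`: when the estimator is a classifier and `cv` is an integer, scikit-learn uses stratified folds, so each fold keeps roughly the same class balance. The sketch below checks this on synthetic data from `make_classification` (the `X_demo` and `y_demo` names and the data are assumptions for illustration, not the Titanic set).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic binary classification problem (not Titanic data)
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

clf = LogisticRegression(max_iter=1000)

# cv=10 with a classifier ...
scores_int = cross_val_score(clf, X_demo, y_demo, cv=10, scoring='accuracy')

# ... matches an explicit, unshuffled StratifiedKFold with 10 splits
scores_skf = cross_val_score(clf, X_demo, y_demo,
                             cv=StratifiedKFold(n_splits=10), scoring='accuracy')

same_folds = np.allclose(scores_int, scores_skf)
print(len(scores_int), same_folds)
```

This matters when comparing against a hand-written loop over plain `KFold`, which does not stratify and can therefore give slightly different fold compositions.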

In [60]:
results.sort_values('accuracy', ascending=False).head()

| Feature subset | accuracy |
|---|---|
| (Pclass, Age, Parch, Sex_Female, Embarked_Q, Embarked_S) | 0.793541 |
| (Pclass, Age, Parch, Sex_Female, Embarked_S) | 0.792443 |
| (Pclass, Age, Sex_Female) | 0.792379 |
| (Pclass, Age, Sex_Female, Embarked_Q, Embarked_S) | 0.791268 |
| (Pclass, Age, Parch, Sex_Female, Embarked_Q) | 0.790145 |


# Bonus Exercise 6.4 (3 points)

Now, which set of features is best when selected by AUC?

In [63]:
from sklearn import metrics
from sklearn.model_selection import KFold
import numpy as np

results['AUC'] = 0.0

# unshuffled 10-fold splitter; the same folds are used for every feature subset
kf = KFold(n_splits=10)

for model in possible_models:
    X = titanic[list(model)]

    res = []

    for train_index, test_index in kf.split(X):
        X_train, X_test = X.iloc[train_index], X.iloc[test_index]
        y_train, y_test = y.iloc[train_index], y.iloc[test_index]

        logreg = LogisticRegression(C=1e9)
        logreg.fit(X_train, y_train)

        # score the held-out fold on the predicted probability of survival
        y_pred_prob = logreg.predict_proba(X_test)[:, 1]
        res.append(metrics.roc_auc_score(y_test, y_pred_prob))

    # mean AUC across the 10 folds
    results.loc[model, 'AUC'] = np.mean(res)
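
Accuracy and AUC can rank feature sets differently because they measure different things: accuracy depends on a fixed decision threshold (0.5 by default), while AUC depends only on how the predicted probabilities order the positives relative to the negatives. The sketch below uses four made-up predictions (the `y_true` and `probs` arrays are hypothetical) to show that rescaling the scores changes accuracy but leaves AUC untouched.

```python
import numpy as np
from sklearn import metrics

# Hypothetical labels and predicted probabilities
y_true = np.array([0, 0, 1, 1])
probs = np.array([0.1, 0.4, 0.35, 0.8])

# AUC: fraction of (positive, negative) pairs ranked correctly, here 3 of 4
auc = metrics.roc_auc_score(y_true, probs)
# Accuracy at the 0.5 threshold: predictions are [0, 0, 0, 1]
acc = metrics.accuracy_score(y_true, (probs >= 0.5).astype(int))

# Shrinking every score keeps the ordering but pushes all of them below 0.5
shrunk = probs / 10
auc_shrunk = metrics.roc_auc_score(y_true, shrunk)
acc_shrunk = metrics.accuracy_score(y_true, (shrunk >= 0.5).astype(int))

print(auc, acc)
print(auc_shrunk, acc_shrunk)
```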

In [64]:
results.sort_values('AUC', ascending=False).head()

| Feature subset | accuracy | AUC |
|---|---|---|
| (Pclass, Age, Sex_Female, Embarked_S) | 0.786774 | 0.846793 |
| (Pclass, Age, Parch, Sex_Female, Embarked_S) | 0.792443 | 0.846359 |
| (Pclass, Age, Sex_Female, Embarked_Q, Embarked_S) | 0.791268 | 0.846294 |
| (Pclass, Age, Parch, Sex_Female, Embarked_Q, Embarked_S) | 0.793541 | 0.84513 |
| (Pclass, Age, Parch, Sex_Female) | 0.790119 | 0.843077 |
