# Exercise 06

## Data preparation and model evaluation exercise with Titanic data




We'll be working with a dataset from Kaggle's Titanic competition: [data](https://github.com/justmarkham/DAT8/blob/master/data/titanic.csv), [data dictionary](https://www.kaggle.com/c/titanic/data)

**Goal**: Predict survival based on passenger characteristics

The sinking of the RMS Titanic is one of the most infamous shipwrecks in history.  On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships.

One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper class.

In this challenge, we ask you to complete the analysis of what sorts of people were likely to survive. In particular, we ask you to apply the tools of machine learning to predict which passengers survived the tragedy.


Read the data into Pandas

In [50]:
import pandas as pd
url = 'https://raw.githubusercontent.com/justmarkham/DAT8/master/data/titanic.csv'
titanic = pd.read_csv(url, index_col='PassengerId')
titanic.head()

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Braund, Mr. Owen Harris",male,22,1,0,A/5 21171,7.25,,S
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38,1,0,PC 17599,71.2833,C85,C
3,1,3,"Heikkinen, Miss. Laina",female,26,0,0,STON/O2. 3101282,7.925,,S
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35,1,0,113803,53.1,C123,S
5,0,3,"Allen, Mr. William Henry",male,35,0,0,373450,8.05,,S


# Exercise 6.1 

Impute the missing values of Age and Embarked

In [51]:
# check for missing values
titanic.isnull().sum()

Survived      0
Pclass        0
Name          0
Sex           0
Age         177
SibSp         0
Parch         0
Ticket        0
Fare          0
Cabin       687
Embarked      2
dtype: int64

In [52]:
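# fill missing ages with the median; assign the result back, since
# fillna(inplace=True) on a column selection may not update the DataFrame in recent pandas versions
titanic['Age'] = titanic['Age'].fillna(titanic['Age'].median())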

In [53]:
titanic.Embarked.mode()

0    S
dtype: object

In [54]:
# fill missing ports with the most frequent value ('S', found above)
titanic['Embarked'] = titanic['Embarked'].fillna('S')

In [55]:
# check for missing values
titanic.isnull().sum()

Survived      0
Pclass        0
Name          0
Sex           0
Age           0
SibSp         0
Parch         0
Ticket        0
Fare          0
Cabin       687
Embarked      0
dtype: int64
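
The same imputation pattern can be sketched on a small made-up frame. The `toy` DataFrame and its values are assumptions for illustration only, not rows from the Titanic file: the median fills the numeric column and the most frequent value fills the categorical one.

```python
import pandas as pd

# toy data (hypothetical values, not taken from the Titanic file)
toy = pd.DataFrame({
    'Age': [22.0, None, 38.0, 26.0],
    'Embarked': ['S', 'C', None, 'S'],
})

# numeric column: replace missing values with the median of the observed values
toy['Age'] = toy['Age'].fillna(toy['Age'].median())

# categorical column: replace missing values with the most frequent value
toy['Embarked'] = toy['Embarked'].fillna(toy['Embarked'].mode()[0])

print(toy)
```

The median is less sensitive to extreme ages than the mean, which is one common reason to prefer it here.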

# Exercise 6.2

Convert Sex and Embarked to categorical features

In [56]:
titanic['Sex1'] = titanic.Sex.map({'male':0, 'female':1})
titanic.head()

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Sex1
1,0,3,"Braund, Mr. Owen Harris",male,22,1,0,A/5 21171,7.25,,S,0
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38,1,0,PC 17599,71.2833,C85,C,1
3,1,3,"Heikkinen, Miss. Laina",female,26,0,0,STON/O2. 3101282,7.925,,S,1
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35,1,0,113803,53.1,C123,S,1
5,0,3,"Allen, Mr. William Henry",male,35,0,0,373450,8.05,,S,0


In [57]:
# create a DataFrame of dummy variables for Embarked
embarked_dummies = pd.get_dummies(titanic.Embarked, prefix='Embarked')
embarked_dummies.drop(embarked_dummies.columns[0], axis=1, inplace=True)

# concatenate the original DataFrame and the dummy DataFrame
titanic = pd.concat([titanic, embarked_dummies], axis=1)
titanic.head(1)

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Sex1,Embarked_Q,Embarked_S
1,0,3,"Braund, Mr. Owen Harris",male,22,1,0,A/5 21171,7.25,,S,0,0,1
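
Dropping the first dummy column makes that level (here `C`) the baseline against which the others are measured. A minimal sketch of the same encoding on a made-up column follows; the `ports` Series is hypothetical, and `drop_first=True` and `dtype=int` are used as a shorter alternative to dropping the column by hand.

```python
import pandas as pd

# toy column (hypothetical values)
ports = pd.Series(['S', 'C', 'Q', 'S'], name='Embarked')

# drop_first=True removes the first level ('C'), which becomes the baseline;
# dtype=int yields 0/1 columns instead of booleans
dummies = pd.get_dummies(ports, prefix='Embarked', drop_first=True, dtype=int)

print(dummies)
```

A row with 0 in both `Embarked_Q` and `Embarked_S` therefore stands for `C`.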


# Exercise 6.3 (2 points)

From the set of features ['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked']

*Note: use the categorical features created above for Sex and Embarked.*

Select the features that maximize the **accuracy** of the model using K-Fold cross-validation

In [9]:
y = titanic['Survived']

In [10]:
features = ['a', 'b', 'c', 's']  # Replace

In [11]:
import itertools

possible_models = []
for i in range(1,len(features)+1):
    possible_models.extend(list(itertools.combinations(features,i)))

possible_models

[('a',),
 ('b',),
 ('c',),
 ('s',),
 ('a', 'b'),
 ('a', 'c'),
 ('a', 's'),
 ('b', 'c'),
 ('b', 's'),
 ('c', 's'),
 ('a', 'b', 'c'),
 ('a', 'b', 's'),
 ('a', 'c', 's'),
 ('b', 'c', 's'),
 ('a', 'b', 'c', 's')]

In [None]:
X = titanic[list(possible_models[0])] # Example, do not use

In [68]:
features = ['Pclass','Sex1', 'Age', 'SibSp',  'Parch', 'Fare', 'Embarked_Q','Embarked_S']  # Replace

In [35]:
len(features)

8

In [69]:
import itertools

possible_models = []
for i in range(1,len(features)+1):
    possible_models.extend(list(itertools.combinations(features,i)))
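
For a sense of scale, the exhaustive search covers every non-empty subset of the 8 features, that is 2^8 - 1 = 255 candidate models. The short check below counts them per subset size and confirms the total against an explicit enumeration.

```python
import itertools
from math import comb

features = ['Pclass', 'Sex1', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked_Q', 'Embarked_S']

# number of subsets of each size k, for k = 1..n
counts = [comb(len(features), k) for k in range(1, len(features) + 1)]
n_models = sum(counts)

# cross-check against the explicit enumeration
enumerated = sum(1 for k in range(1, len(features) + 1)
                 for _ in itertools.combinations(features, k))

print(counts)    # subsets per size
print(n_models)  # total number of candidate models
```

With 10-fold cross-validation each candidate is fitted 10 times, so the number of subsets doubles with every feature added and this approach is only practical for small feature sets.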

In [70]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score  # sklearn.cross_validation was removed in scikit-learn 0.20

y = titanic.Survived

# mean 10-fold cross-validated accuracy for every feature subset
a = pd.DataFrame(index=possible_models, columns=['accuracy'])
for i in range(len(possible_models)):
    X = titanic[list(possible_models[i])]
    # a very large C effectively disables regularization
    logreg = LogisticRegression(C=1e9)
    a.iloc[i] = cross_val_score(logreg, X, y, cv=10, scoring='accuracy').mean()

In [71]:
a.head()

Feature set,accuracy
"(Pclass,)",0.67927
"(Sex1,)",0.786698
"(Age,)",0.61617
"(SibSp,)",0.61617
"(Parch,)",0.60833


In [40]:
a.sort_values('accuracy',ascending=False).head()

Feature set,accuracy
"(Pclass, Sex1, Age, SibSp, Embarked_S)",0.801369
"(Pclass, Sex1, SibSp, Embarked_Q)",0.800194
"(Pclass, Sex1, SibSp)",0.800194
"(Pclass, Sex1, SibSp, Parch)",0.799083
"(Pclass, Sex1, SibSp, Parch, Embarked_Q)",0.799083


# Bonus Exercise 6.4 (3 points)

Now, which set of features is best when selected by AUC?
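
As a reminder of what AUC measures: it is the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one. The sketch below computes it by brute force over all pairs; `pairwise_auc` is a hypothetical helper written for illustration, and the labels and scores are made up.

```python
def pairwise_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    pos = [s for s, y in zip(scores, y_true) if y == 1]
    neg = [s for s, y in zip(scores, y_true) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# made-up labels and predicted probabilities
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

print(pairwise_auc(y_true, scores))  # 3 of the 4 pairs are ranked correctly -> 0.75
```

Unlike accuracy, this depends only on how the predicted probabilities rank the passengers, not on a 0.5 cutoff.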

In [58]:
features = ['Pclass','Sex1', 'Age', 'SibSp',  'Parch', 'Fare', 'Embarked_Q','Embarked_S']  # Replace

In [59]:
import itertools

possible_models = []
for i in range(1,len(features)+1):
    possible_models.extend(list(itertools.combinations(features,i)))

In [60]:
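from sklearn import metrics
from sklearn.linear_model import LogisticRegression

y = titanic.Survived

# note: the column keeps the name 'accuracy' but holds the ROC AUC,
# computed on the same data the model was fitted on (no cross-validation)
a = pd.DataFrame(index=possible_models, columns=['accuracy'])
for i in range(len(possible_models)):
    X = titanic[list(possible_models[i])]
    logreg = LogisticRegression(C=1e9)
    logreg.fit(X, y)

    # predicted probability of the positive class (Survived = 1)
    y_pred_prob = logreg.predict_proba(X)[:, 1]
    a.iloc[i] = metrics.roc_auc_score(y, y_pred_prob)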

In [66]:
a.head()

Feature set,accuracy
"(Pclass,)",0.681417
"(Sex1,)",0.766873
"(Age,)",0.521834
"(SibSp,)",0.456838
"(Parch,)",0.561217


In [67]:
a.sort_values('accuracy',ascending=False).head()

Feature set,accuracy
"(Pclass, Sex1, Age, SibSp, Parch, Embarked_Q, Embarked_S)",0.857444
"(Pclass, Sex1, Age, SibSp, Fare, Embarked_Q, Embarked_S)",0.857311
"(Pclass, Sex1, Age, SibSp, Fare, Embarked_S)",0.857284
"(Pclass, Sex1, Age, SibSp, Parch, Fare, Embarked_Q, Embarked_S)",0.857117
"(Pclass, Sex1, Age, SibSp, Embarked_Q, Embarked_S)",0.857005

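
The AUC values above are computed on the same rows used for fitting, so they tend to favour larger feature sets. One possible alternative, sketched here on synthetic stand-in data (the `X_demo` and `y_demo` arrays and the `max_iter` setting are assumptions, not part of the exercise), is to let `cross_val_score` evaluate the AUC on held-out folds with `scoring='roc_auc'`.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# synthetic stand-in data (an assumption; swap in titanic[features] and titanic.Survived)
X_demo, y_demo = make_classification(n_samples=200, n_features=4, random_state=0)

logreg = LogisticRegression(max_iter=1000)

# scoring='roc_auc' computes the AUC on each held-out fold
cv_auc = cross_val_score(logreg, X_demo, y_demo, cv=10, scoring='roc_auc')

print(cv_auc.mean())
```

Plugging this into the subset loop would make the AUC ranking directly comparable to the cross-validated accuracy ranking from Exercise 6.3.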