# Exercise 06

## Data preparation and model evaluation exercise with Titanic data




We'll be working with a dataset from Kaggle's Titanic competition: [data](https://github.com/justmarkham/DAT8/blob/master/data/titanic.csv), [data dictionary](https://www.kaggle.com/c/titanic/data)

**Goal**: Predict survival based on passenger characteristics

The sinking of the RMS Titanic is one of the most infamous shipwrecks in history.  On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships.

One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class.

In this challenge, we ask you to complete the analysis of what sorts of people were likely to survive. In particular, we ask you to apply the tools of machine learning to predict which passengers survived the tragedy.


Read the data into Pandas

In [55]:
import pandas as pd
url = 'https://raw.githubusercontent.com/justmarkham/DAT8/master/data/titanic.csv'
titanic = pd.read_csv(url, index_col='PassengerId')
titanic.head()

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Braund, Mr. Owen Harris",male,22,1,0,A/5 21171,7.25,,S
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38,1,0,PC 17599,71.2833,C85,C
3,1,3,"Heikkinen, Miss. Laina",female,26,0,0,STON/O2. 3101282,7.925,,S
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35,1,0,113803,53.1,C123,S
5,0,3,"Allen, Mr. William Henry",male,35,0,0,373450,8.05,,S


# Exercise 6.1 

Impute the missing values of Age and Embarked

In [56]:
titanic.isnull().sum()

Survived      0
Pclass        0
Name          0
Sex           0
Age         177
SibSp         0
Parch         0
Ticket        0
Fare          0
Cabin       687
Embarked      2
dtype: int64

In [57]:
# Impute the missing values of the Age variable
titanic.Age.mean()


29.69911764705882

In [58]:
# Assign the result back instead of calling fillna(inplace=True) on a selected column
titanic['Age'] = titanic['Age'].fillna(titanic['Age'].mean())

In [59]:
# Impute the missing values of the Embarked variable
titanic.Embarked.mode()

0    S
dtype: object

In [60]:
titanic['Embarked'] = titanic['Embarked'].fillna("S")
titanic.isnull().sum()

Survived      0
Pclass        0
Name          0
Sex           0
Age           0
SibSp         0
Parch         0
Ticket        0
Fare          0
Cabin       687
Embarked      0
dtype: int64
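The imputation steps above can be checked on a small frame where the expected result is easy to verify by hand. This is a minimal sketch; the `toy` frame and its values are made up for illustration and are not part of the Titanic data.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the two Titanic columns (values are made up)
toy = pd.DataFrame({
    'Age': [22.0, np.nan, 26.0, 36.0],
    'Embarked': ['S', 'C', None, 'S'],
})

# Numeric column: fill gaps with the column mean (NaN is ignored by mean())
toy['Age'] = toy['Age'].fillna(toy['Age'].mean())

# Categorical column: fill gaps with the most frequent value
# (mode() returns a Series, so take its first entry)
toy['Embarked'] = toy['Embarked'].fillna(toy['Embarked'].mode()[0])

print(toy)
print(toy.isnull().sum())
```

Taking the fill value from `mode()[0]` rather than hard-coding `"S"` keeps the step valid if the data changes.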

# Exercise 6.2

Convert Sex and Embarked to categorical features

In [61]:
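# Encode Sex and Embarked as numeric features
titanic['Sex_Female'] = titanic.Sex.map({'male':0, 'female':1})
# Note: 'Q' is missing from this mapping, so passengers who embarked at 'Q' get NaN in Embarked_New
titanic['Embarked_New'] = titanic.Embarked.map({'S':0, 'C':1})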
titanic.head()

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Sex_Female,Embarked_New
1,0,3,"Braund, Mr. Owen Harris",male,22,1,0,A/5 21171,7.25,,S,0,0
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38,1,0,PC 17599,71.2833,C85,C,1,1
3,1,3,"Heikkinen, Miss. Laina",female,26,0,0,STON/O2. 3101282,7.925,,S,1,0
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35,1,0,113803,53.1,C123,S,1,0
5,0,3,"Allen, Mr. William Henry",male,35,0,0,373450,8.05,,S,0,0
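`Series.map` returns NaN for any value that is absent from the dictionary, so a mapping has to cover every category in the column. The sketch below shows that behaviour on a toy column and one possible alternative, one-hot encoding with `pd.get_dummies`. The `embarked` series is illustrative only, and one-hot encoding is a swapped-in technique, not the approach used in this notebook.

```python
import pandas as pd

# Toy column containing all three ports (illustrative values only)
embarked = pd.Series(['S', 'C', 'Q', 'S'], name='Embarked')

# A mapping that omits 'Q' silently produces NaN for that row
partial = embarked.map({'S': 0, 'C': 1})
print(partial.isnull().sum())

# One-hot encoding covers every category that appears in the column;
# drop_first=True removes one redundant indicator column
dummies = pd.get_dummies(embarked, prefix='Embarked', drop_first=True)
print(dummies.columns.tolist())
```

One-hot columns also avoid implying an order between ports, which a single integer code such as 0, 1, 2 would suggest to a linear model.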


# Exercise 6.3 (2 points)

From the set of features ['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked']

*Note: use the created categorical features for Sex and Embarked.

Select the features that maximize the **accuracy** of the model using K-Fold cross-validation.

In [62]:
y = titanic['Survived']

In [63]:
features = ['Pclass', 'Sex_Female','SibSp', 'Parch','Fare','Embarked_New'] 

In [64]:
import itertools

possible_models = []
for i in range(1,len(features)+1):
    possible_models.extend(list(itertools.combinations(features,i)))

possible_models

[('Pclass',),
 ('Sex_Female',),
 ('SibSp',),
 ('Parch',),
 ('Fare',),
 ('Embarked_New',),
 ('Pclass', 'Sex_Female'),
 ('Pclass', 'SibSp'),
 ('Pclass', 'Parch'),
 ('Pclass', 'Fare'),
 ('Pclass', 'Embarked_New'),
 ('Sex_Female', 'SibSp'),
 ('Sex_Female', 'Parch'),
 ('Sex_Female', 'Fare'),
 ('Sex_Female', 'Embarked_New'),
 ('SibSp', 'Parch'),
 ('SibSp', 'Fare'),
 ('SibSp', 'Embarked_New'),
 ('Parch', 'Fare'),
 ('Parch', 'Embarked_New'),
 ('Fare', 'Embarked_New'),
 ('Pclass', 'Sex_Female', 'SibSp'),
 ('Pclass', 'Sex_Female', 'Parch'),
 ('Pclass', 'Sex_Female', 'Fare'),
 ('Pclass', 'Sex_Female', 'Embarked_New'),
 ('Pclass', 'SibSp', 'Parch'),
 ('Pclass', 'SibSp', 'Fare'),
 ('Pclass', 'SibSp', 'Embarked_New'),
 ('Pclass', 'Parch', 'Fare'),
 ('Pclass', 'Parch', 'Embarked_New'),
 ('Pclass', 'Fare', 'Embarked_New'),
 ('Sex_Female', 'SibSp', 'Parch'),
 ('Sex_Female', 'SibSp', 'Fare'),
 ('Sex_Female', 'SibSp', 'Embarked_New'),
 ('Sex_Female', 'Parch', 'Fare'),
 ('Sex_Female', 'Parch', 'Embarked_N
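The loop above enumerates every non-empty subset of the feature list. As a quick sanity check, the sketch below rebuilds the list and confirms how many subsets there are and which one comes last. The variable name `subsets` is introduced here for illustration only.

```python
import itertools

features = ['Pclass', 'Sex_Female', 'SibSp', 'Parch', 'Fare', 'Embarked_New']

# Build every non-empty subset, smallest subsets first
subsets = []
for size in range(1, len(features) + 1):
    subsets.extend(itertools.combinations(features, size))

# n features yield 2**n - 1 non-empty subsets
print(len(subsets), 2 ** len(features) - 1)

# The final entry is the only subset of size n, i.e. every feature
print(subsets[-1] == tuple(features))
```

With six features there are 63 subsets, so index 62 is the last one and contains all six features.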

In [68]:
X = titanic[list(possible_models[62])] 
y = titanic.Survived

# train/test split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)


In [69]:
# train a logistic regression model
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression(C=1e9)
logreg.fit(X_train, y_train)

# make predictions for testing set
y_pred_class = logreg.predict(X_test)

# calculate testing accuracy
from sklearn import metrics
print(metrics.accuracy_score(y_test, y_pred_class))

ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
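The run above stops on NaN values in `X`, so no accuracy is reported, and a single train/test split is not yet the K-Fold procedure the exercise asks for. The following is a minimal sketch of how the subset search could be scored with K-Fold cross-validation. It runs on synthetic data: the `data` frame, its values, the 5-fold setting, and the default regularization are all assumptions made for illustration.

```python
import itertools

import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for the prepared Titanic frame (hypothetical values)
rng = np.random.default_rng(0)
n = 200
data = pd.DataFrame({
    'Pclass': rng.integers(1, 4, n),
    'Sex_Female': rng.integers(0, 2, n),
    'Fare': rng.uniform(5, 100, n),
})
# The label follows Sex_Female, with roughly 10% of rows flipped as noise
flip = rng.random(n) < 0.1
data['Survived'] = np.where(flip, 1 - data['Sex_Female'], data['Sex_Female'])

features = ['Pclass', 'Sex_Female', 'Fare']
y = data['Survived']
kf = KFold(n_splits=5, shuffle=True, random_state=1)

results = {}
for size in range(1, len(features) + 1):
    for subset in itertools.combinations(features, size):
        X = data[list(subset)]
        # The estimator rejects NaN, so fail early with a clear message
        assert not X.isnull().values.any(), f'NaN in {subset}'
        model = LogisticRegression(max_iter=1000)
        scores = cross_val_score(model, X, y, cv=kf, scoring='accuracy')
        results[subset] = scores.mean()

# Pick the subset with the highest mean accuracy across folds
best = max(results, key=results.get)
print(best, round(results[best], 3))
```

Applied to the notebook, `data` would be replaced by `titanic` once every selected column is free of missing values.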

# Bonus Exercise 6.4 (3 points)

Now, which set of features is best when selected by AUC?
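One way to approach this is to keep a cross-validated search and change only the scoring metric. The sketch below uses `cross_val_score` with `scoring='roc_auc'` on synthetic data. The column names `informative` and `noise`, the stratified 5-fold setting, and the data itself are assumptions made for illustration, not results from the Titanic set.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic data: one informative column and one pure-noise column
rng = np.random.default_rng(1)
n = 300
informative = rng.normal(size=n)
noise_col = rng.normal(size=n)
y = (informative + 0.5 * rng.normal(size=n) > 0).astype(int)

# Candidate feature sets, keyed by the names of the columns they contain
candidates = {
    ('informative',): informative.reshape(-1, 1),
    ('noise',): noise_col.reshape(-1, 1),
    ('informative', 'noise'): np.column_stack([informative, noise_col]),
}

# Stratified folds keep both classes present in every fold, which AUC requires
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
auc = {}
for name, X in candidates.items():
    model = LogisticRegression(max_iter=1000)
    # 'roc_auc' evaluates continuous model scores rather than hard class labels
    auc[name] = cross_val_score(model, X, y, cv=cv, scoring='roc_auc').mean()

# Pick the candidate with the highest mean AUC across folds
best = max(auc, key=auc.get)
print(best, round(auc[best], 3))
```

Because AUC measures how well the model ranks positives above negatives, the subset it favours can differ from the one chosen by accuracy.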