# DAT 19: Homework 2 Assignment

## Instructions

For Homework 2, we will build on the work we did with the Titanic dataset in Homework 1. In this assignment, we will build a logistic regression model to predict passenger survival.

Please do all your analysis to answer the questions below in this Jupyter notebook. Show your work.

**Please submit your completed notebook by 6:00PM on Monday, January 11.**

## About the Data

```
VARIABLE DESCRIPTIONS:
survival        Survival
                (0 = No; 1 = Yes)
pclass          Passenger Class
                (1 = 1st; 2 = 2nd; 3 = 3rd)
name            Name
sex             Sex
age             Age
sibsp           Number of Siblings/Spouses Aboard
parch           Number of Parents/Children Aboard
ticket          Ticket Number
fare            Passenger Fare
cabin           Cabin
embarked        Port of Embarkation
                (C = Cherbourg; Q = Queenstown; S = Southampton)

SPECIAL NOTES:
Pclass is a proxy for socio-economic status (SES)
 1st ~ Upper; 2nd ~ Middle; 3rd ~ Lower

Age is in Years; Fractional if Age less than One (1)
 If the Age is Estimated, it is in the form xx.5

With respect to the family relation variables (i.e. sibsp and parch)
some relations were ignored.  The following are the definitions used
for sibsp and parch.

Sibling:  Brother, Sister, Stepbrother, or Stepsister of Passenger Aboard Titanic
Spouse:   Husband or Wife of Passenger Aboard Titanic (Mistresses and Fiances Ignored)
Parent:   Mother or Father of Passenger Aboard Titanic
Child:    Son, Daughter, Stepson, or Stepdaughter of Passenger Aboard Titanic

Other family relatives excluded from this study include cousins,
nephews/nieces, aunts/uncles, and in-laws.  Some children travelled
only with a nanny, therefore parch=0 for them.  As well, some
travelled with very close friends or neighbors in a village, however,
the definitions do not support such relations.
```

## Homework Assignment

**1) Create a logistic regression model on the Titanic dataset to predict the survival of passengers. Show your model output. Include coefficient values.**

In [475]:
# Step 1 => Import necessary packages and modules

import numpy as np
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder
from sklearn.feature_extraction import DictVectorizer

In [476]:
# Step 2 => Read in the dataset
df = pd.read_csv("titanic.csv", index_col=0)

# Step 3 => print .head() to review data
df.head()

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Braund, Mr. Owen Harris",male,22,1,0,A/5 21171,7.25,,S
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38,1,0,PC 17599,71.2833,C85,C
3,1,3,"Heikkinen, Miss. Laina",female,26,0,0,STON/O2. 3101282,7.925,,S
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35,1,0,113803,53.1,C123,S
5,0,3,"Allen, Mr. William Henry",male,35,0,0,373450,8.05,,S


In [477]:
# Step 4 => set features & target (class labels)
features = df.drop('Survived',axis=1)
target = df.Survived

In [478]:
# Step 5 => review datatypes of features before going forward
features.dtypes

Pclass        int64
Name         object
Sex          object
Age         float64
SibSp         int64
Parch         int64
Ticket       object
Fare        float64
Cabin        object
Embarked     object
dtype: object

In [479]:
# Step 6 => The classifier needs a numeric matrix as input.
# Several features have dtype "object" (strings), so we convert each row to a dict
# and let DictVectorizer encode the string values as numeric indicator columns.

classes = df['Survived']
features = df.drop('Survived',axis=1)
features = features.fillna(0)

D = features.to_dict('records')
v = DictVectorizer(sparse=False)
X = v.fit_transform(D)
X

array([[ 22.,   0.,   0., ...,   0.,   0.,   0.],
       [ 38.,   0.,   0., ...,   0.,   0.,   0.],
       [ 26.,   0.,   0., ...,   0.,   0.,   0.],
       ..., 
       [  0.,   0.,   0., ...,   0.,   0.,   0.],
       [ 26.,   0.,   0., ...,   0.,   0.,   0.],
       [ 32.,   0.,   0., ...,   0.,   0.,   0.]])
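
To make the encoding in Step 6 concrete, here is a minimal plain-Python sketch of the idea behind `DictVectorizer`: numeric values keep a single column, while each distinct string value becomes its own 0/1 column. The helper `one_hot_records` and the toy records are hypothetical, and this is an illustration of the concept rather than scikit-learn's implementation.

```python
def one_hot_records(records):
    """Encode a list of dicts into column names plus a numeric matrix.

    Numeric values keep one column per key; each distinct string value
    gets its own 'key=value' indicator column.
    """
    columns = set()
    for rec in records:
        for key, value in rec.items():
            columns.add("%s=%s" % (key, value) if isinstance(value, str) else key)
    columns = sorted(columns)
    index = {name: i for i, name in enumerate(columns)}

    matrix = []
    for rec in records:
        row = [0.0] * len(columns)
        for key, value in rec.items():
            if isinstance(value, str):
                row[index["%s=%s" % (key, value)]] = 1.0
            else:
                row[index[key]] = float(value)
        matrix.append(row)
    return columns, matrix


# Hypothetical toy records shaped like the Titanic rows above.
toy = [
    {"Age": 22.0, "Sex": "male", "Name": "Braund, Mr. Owen Harris"},
    {"Age": 38.0, "Sex": "female", "Name": "Cumings, Mrs. John Bradley"},
    {"Age": 26.0, "Sex": "female", "Name": "Heikkinen, Miss. Laina"},
]
columns, matrix = one_hot_records(toy)
print(columns)
print(matrix[0])
```

In this toy case three records with three fields produce six columns, because every distinct name gets its own indicator. A free-text field such as `Name` therefore contributes roughly one column per row, which is why the matrix above is mostly zeros.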

In [542]:
# Step 7 => Create a logistic regression model on the Titanic dataset to predict the survival of passengers.
clf = LogisticRegression().fit(X, classes)

In [481]:
# Step 8 => Show your model output. 
score = cross_val_score(clf, X, classes)
print(score)

[ 0.79124579  0.8013468   0.80808081]


In [482]:
# Step 9 => Coefficient values
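print(clf.coef_)
print(clf.coef_.ravel(order='F'))

# NOTE: clf.coef_ has one entry per column of X (the DictVectorizer output), ordered by
# the vectorizer's feature names. zip() stops at the shorter input, so pairing with
# features.columns labels only the first nine coefficients, and only by position;
# the names below are not the true column names for those coefficients.
coeffs = pd.DataFrame(list(zip(features.columns[:-1], clf.coef_.ravel())), columns=['features', 'coeff'])
coeffs.head()
print(coeffs)

coeffs['abs'] = np.absolute(coeffs.coeff.values)
coeffs.sort_values('abs', ascending=False)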

[[-0.01442997  0.         -0.26422443 ..., -0.18502362 -0.14023648
  -0.07166333]]
[-0.01442997  0.         -0.26422443 ..., -0.18502362 -0.14023648
 -0.07166333]
  features     coeff
0   Pclass -0.014430
1     Name  0.000000
2      Sex -0.264224
3      Age -0.262967
4    SibSp  0.079595
5    Parch -0.271555
6   Ticket  0.366404
7     Fare  0.472270
8    Cabin -0.230219




,features,coeff,abs
7,Fare,0.47227,0.47227
6,Ticket,0.366404,0.366404
5,Parch,-0.271555,0.271555
2,Sex,-0.264224,0.264224
3,Age,-0.262967,0.262967
8,Cabin,-0.230219,0.230219
4,SibSp,0.079595,0.079595
0,Pclass,-0.01443,0.01443
1,Name,0.0,0.0
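
Because a logistic regression has one coefficient per column of the encoded matrix, a ranking is only meaningful when each coefficient is paired with the name of the encoded column it belongs to. The sketch below shows one way to rank aligned name/coefficient pairs by magnitude and to report `exp(coef)`, the multiplicative change in the odds of survival for a one-unit increase in that column. The helper `rank_coefficients`, the column names, and the values are all hypothetical and are not taken from the model above.

```python
import math


def rank_coefficients(names, coefs):
    """Pair each encoded column name with its coefficient, sorted by magnitude.

    Each row is (name, coefficient, odds ratio), where the odds ratio
    is exp(coefficient).
    """
    if len(names) != len(coefs):
        raise ValueError("names and coefs must have the same length")
    rows = [(name, coef, math.exp(coef)) for name, coef in zip(names, coefs)]
    return sorted(rows, key=lambda row: abs(row[1]), reverse=True)


# Hypothetical encoded column names and coefficients, for illustration only.
names = ["Age", "Fare", "Sex=female", "Sex=male"]
coefs = [-0.01, 0.002, 1.2, -1.3]
for name, coef, odds_ratio in rank_coefficients(names, coefs):
    print("%-12s coef=%+.3f odds ratio=%.3f" % (name, coef, odds_ratio))
```

The length check guards against the silent truncation that `zip` performs when the two lists differ in length.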


**2) Which features are predictive for this logistic regression? Explain your thinking. Do not simply cite model statistics.**

In [483]:
# Step 10 => Domain Expertise
# Approaches to evaluating feature importance:
## (1) DOMAIN EXPERTISE_a (aka, HW1)
### 18.89% of men on the Titanic survived
### 74.20% of women on the Titanic survived
### Based on previous analysis, there's a strong relationship between 'Sex' and survival rate
### Therefore, it seems 'Sex' is clearly a strong predictive feature for this logistic regression
### To examine this fact, I will retrain the model and exclude 'Sex'

# Create features and target
features2 = features.drop('Sex',axis=1)
target2 = df.Survived

# Fill NULL values with 0 so that the Logistic Regression classifier can read it
features2a = features2.fillna(0)

# Convert each row to a dict and let a fresh vectorizer encode it as a numeric matrix
D2 = features2a.to_dict('records')
v2 = DictVectorizer(sparse=False)
X2 = v2.fit_transform(D2)

## Rebuilding, training, and evaluating a model with fewer features
# Show your model output.
score = cross_val_score(clf, X2, target2)
print('Scores of the LR model when "Sex" is removed from the feature set: ' + str(score))

## DOMAIN EXPERTISE_b (aka, HW1)
### 63% of first-class passengers survived
### 47% of second-class passengers survived
### 24% of third-class passengers survived
### Based on previous analysis, there's a strong relationship between 'Pclass' and survival rate
### Therefore, it seems 'Pclass' is clearly a strong predictive feature for this logistic regression
### To examine this fact, I will retrain the model and exclude 'Pclass'

# Create features and target
features3 = features2.drop('Pclass',axis=1)
target3 = df.Survived

# Fill NULL values with 0 so that the Logistic Regression classifier can read it
features3a = features3.fillna(0)

# Convert each row to a dict and let a fresh vectorizer encode it as a numeric matrix
D3 = features3a.to_dict('records')
v3 = DictVectorizer(sparse=False)
X3 = v3.fit_transform(D3)

## Rebuilding, training, and evaluating a model with fewer features
# Show your model output.
score = cross_val_score(clf, X3, target3)
print('Scores of the LR model when "Sex + Pclass" are removed from the feature set: ' + str(score))

Scores of the LR model when "Sex" is removed from the feature set: [ 0.65656566  0.70707071  0.76094276]
Scores of the LR model when "Sex + Pclass" are removed from the feature set: [ 0.66666667  0.68686869  0.72053872]
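
The per-fold scores are easier to compare as means. The short sketch below averages the fold scores printed above and reports the drop relative to the full feature set; the helper `mean` and the dictionary layout are only illustrative.

```python
def mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / float(len(values))


# Fold scores copied from the outputs above.
scores = {
    "all features": [0.79124579, 0.8013468, 0.80808081],
    "without Sex": [0.65656566, 0.70707071, 0.76094276],
    "without Sex and Pclass": [0.66666667, 0.68686869, 0.72053872],
}
baseline = mean(scores["all features"])
for label in ["all features", "without Sex", "without Sex and Pclass"]:
    avg = mean(scores[label])
    print("%-24s mean=%.4f drop=%.4f" % (label, avg, baseline - avg))
```

Removing `Sex` lowers the mean score by about 0.09, and also removing `Pclass` lowers it by about 0.02 more. With only three folds the spread between folds is wide, so the second difference in particular should be read with caution.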


In [484]:
# Step 11a => Feature Normalization (X.mean() and X.std() without axis=0 use a single mean and std for the whole matrix)
features_norm = (X - X.mean())/X.std()

In [485]:
# Step 11b => Feature Normalization

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
features_norm_2 = scaler.fit_transform(X)

In [486]:
# Step 12 => Run Logistic Regression with normalized features

from sklearn.linear_model import LogisticRegression
clf = LogisticRegression(C=1)

cross_val_score(clf,features_norm, target,cv=3 ).mean()

0.78900112233445563
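
The two normalization steps above are not equivalent. In NumPy, calling `mean()` and `std()` without an `axis` argument reduces over the whole matrix, whereas `StandardScaler` standardizes each column separately; Step 12 scores the globally normalized `features_norm`, not `features_norm_2`. As a plain-Python sketch of what per-column standardization means (the helper `standardize_columns` and the toy matrix are hypothetical):

```python
import math


def standardize_columns(rows):
    """Z-score each column separately: (x - column mean) / column std."""
    n = float(len(rows))
    n_cols = len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(n_cols)]
    stds = [
        math.sqrt(sum((row[j] - means[j]) ** 2 for row in rows) / n)
        for j in range(n_cols)
    ]
    # Leave constant columns at zero instead of dividing by zero.
    return [
        [(row[j] - means[j]) / stds[j] if stds[j] > 0 else 0.0 for j in range(n_cols)]
        for row in rows
    ]


# Hypothetical two-column matrix: an age-like column and a 0/1 indicator.
toy = [[20.0, 0.0], [30.0, 1.0], [40.0, 1.0], [50.0, 0.0]]
scaled = standardize_columns(toy)
print(scaled)
```

After this transformation every column has mean 0 and unit standard deviation, so a large-valued column such as age no longer dominates a 0/1 indicator purely because of its scale.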

**3) Implement cross-validation for your logistic regression model. Select the number of folds. Explain your choice.**

In [487]:
print(cross_val_score(clf, X, classes, cv=2).mean())
print(cross_val_score(clf, X, classes, cv=3).mean())
print(cross_val_score(clf, X, classes, cv=5).mean())
print(cross_val_score(clf, X, classes, cv=7).mean())
print(cross_val_score(clf, X, classes, cv=9).mean())
print(cross_val_score(clf, X, classes, cv=11).mean())

0.785642666398
0.800224466891
0.800255161117
0.803603651106
0.803591470258
0.809184322904
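
One way to reason about the number of folds is to look at how many rows each choice leaves for validation and for training. The sketch below computes fold sizes for the values of `cv` tried above. It assumes 891 training rows (implied by the test set starting at PassengerId 892) and ignores stratification; the helper `fold_sizes` is hypothetical.

```python
def fold_sizes(n_samples, n_folds):
    """Sizes of the validation folds when n_samples rows are split into n_folds parts.

    The first (n_samples % n_folds) folds get one extra row.
    """
    base, extra = divmod(n_samples, n_folds)
    return [base + 1 if i < extra else base for i in range(n_folds)]


n_rows = 891  # assumed training-set size (test PassengerIds start at 892)
for k in [2, 3, 5, 7, 9, 11]:
    sizes = fold_sizes(n_rows, k)
    print("cv=%2d validation rows per fold: %d to %d, training rows: about %d"
          % (k, min(sizes), max(sizes), n_rows - max(sizes)))
```

More folds mean each model trains on more data but is validated on fewer rows, and the procedure fits more models. In the results above the mean score rises only from about 0.800 at `cv=3` to about 0.809 at `cv=11`, so the gain from additional folds is small on this dataset.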


**4) In the hw-assignments directory on the class github repo, there is a file called titanic-test.csv. What does your logistic regression model predict for these previously unseen (i.e. out of sample) passengers?**

In [488]:
# Step 13 => Import Out of Sample dataset and view it
df2 = pd.read_csv("titanic-test.csv", index_col=0)
df2.head()

# The dataset doesn't include class labels!

PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q
893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0,,S
894,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q
895,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S
896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,,S


In [489]:
# Step 14 => set new features
features_new = df2
columns = df2.columns

In [490]:
# Step 15 => in order for the classifier to work, 
# we need to pass it numbers. because some of these features are of the type "object", 
# we need to change their datatypes to be readable for the matrix we'll pass to the classifier

features_new_nan = features_new.fillna(0)

D_new = features_new_nan.to_dict('records')
v_new = DictVectorizer(sparse=False)
X_new = v_new.fit_transform(D_new)
X_new

array([[ 34.5,   0. ,   0. , ...,   0. ,   0. ,   0. ],
       [ 47. ,   0. ,   0. , ...,   0. ,   0. ,   0. ],
       [ 62. ,   0. ,   0. , ...,   0. ,   0. ,   0. ],
       ..., 
       [ 38.5,   0. ,   0. , ...,   0. ,   0. ,   0. ],
       [  0. ,   0. ,   0. , ...,   0. ,   0. ,   0. ],
       [  0. ,   0. ,   0. , ...,   0. ,   0. ,   0. ]])
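
Fitting a vectorizer on the test records builds its columns from the test data, so column *j* of `X_new` need not mean the same thing as column *j* of the training matrix `X`. The usual pattern is to learn the columns from the training records once and reuse them for the test records; scikit-learn's `DictVectorizer` exposes this split as `fit` and `transform`. A plain-Python sketch of the idea, with hypothetical helper names and toy records:

```python
def column_name(key, value):
    """String values get a 'key=value' indicator column; numbers keep the key."""
    return "%s=%s" % (key, value) if isinstance(value, str) else key


def fit_columns(records):
    """Learn the ordered list of columns from the training records."""
    return sorted({column_name(k, v) for rec in records for k, v in rec.items()})


def transform(records, columns):
    """Encode records against previously learned columns; unseen values are dropped."""
    index = {name: i for i, name in enumerate(columns)}
    matrix = []
    for rec in records:
        row = [0.0] * len(columns)
        for key, value in rec.items():
            j = index.get(column_name(key, value))
            if j is not None:
                row[j] = 1.0 if isinstance(value, str) else float(value)
        matrix.append(row)
    return matrix


# Hypothetical toy records.
train = [{"Age": 22.0, "Embarked": "S"}, {"Age": 38.0, "Embarked": "C"}]
test = [{"Age": 34.5, "Embarked": "Q"}, {"Age": 47.0, "Embarked": "S"}]

columns = fit_columns(train)
X_train_toy = transform(train, columns)
X_test_toy = transform(test, columns)
print(columns)
print(X_test_toy)
```

Both matrices end up with the same three columns in the same order, and the port `Q`, which never appears in the toy training data, is simply left out. A model trained on the first matrix can then be applied to the second.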

In [516]:
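# Step 15 => Evaluate effect of classifier on Out of Sample data
# NOTE: titanic-test.csv has no labels. classes[:418] holds the labels of the first 418
# training passengers, not of these passengers, so the scores below do not measure
# accuracy on the test set.
clf_test = clf.fit(X_new, classes[:418])
score_test = cross_val_score(clf_test, X_new, classes[:418])
print(score_test)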

[ 0.61428571  0.5971223   0.58992806]


In [517]:
# Step 16 => Results above show low cross-validated accuracy (about 0.60) at predicting survival

In [518]:
# Step 17 => Feature Normalization of test_features
features_norm_test = (X_new - X_new.mean())/X_new.std()

In [519]:
# Step 18 => Feature Normalization part II

scaler_test = StandardScaler()
features_norm_test2 = scaler_test.fit_transform(X_new)

In [520]:
# Step 19 => Run Logistic Regression with normalized test features

cross_val_score(clf,features_norm_test2,classes[:418],cv=3 ).mean()

0.45697156560465912

In [521]:
# Step 20 => Based on the output above, normalizing the data did not help

In [522]:
# Step 21 => apply cross-fold validation
print(cross_val_score(clf, X_new, classes[:418], cv=2).mean())
print(cross_val_score(clf, X_new, classes[:418], cv=3).mean())
print(cross_val_score(clf, X_new, classes[:418], cv=5).mean())
print(cross_val_score(clf, X_new, classes[:418], cv=7).mean())
print(cross_val_score(clf, X_new, classes[:418], cv=9).mean())
print(cross_val_score(clf, X_new, classes[:418], cv=11).mean())

0.610004578755
0.600445357999
0.602868617326
0.610184045833
0.610042570322
0.600375679323


In [561]:
# Predicted class probabilities.
# predict_proba returns, per passenger, the probability of class 0 (did not survive)
# and class 1 (survived); the values printed are the mean predicted probabilities.
print('According to my classifier ' + str(clf_test.predict_proba(X_new)[:,0].mean()*100) + '%' + ' of passengers in test data did not survive.')
print('According to my classifier ' + str(clf_test.predict_proba(X_new)[:,1].mean()*100) + '%' + ' of passengers in test data survived.')


According to my classifier 60.9133363212% of passengers in test data did not survive.
According to my classifier 39.0866636788% of passengers in test data survived.
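
The mean predicted probability is not necessarily the same as the share of passengers the model classifies as survivors. The sketch below computes both from a list of survival probabilities. The helper `summarize_predictions`, the 0.5 threshold, and the toy probabilities are assumptions for illustration and do not come from the model above.

```python
def summarize_predictions(survival_probs, threshold=0.5):
    """Return (mean predicted probability, share of rows predicted to survive)."""
    n = float(len(survival_probs))
    mean_prob = sum(survival_probs) / n
    share_predicted = sum(1 for p in survival_probs if p >= threshold) / n
    return mean_prob, share_predicted


# Hypothetical probabilities of survival for five passengers.
probs = [0.10, 0.20, 0.45, 0.45, 0.90]
mean_prob, share_predicted = summarize_predictions(probs)
print("mean predicted probability of survival: %.2f" % mean_prob)
print("share of passengers predicted to survive: %.2f" % share_predicted)
```

In this toy case the two numbers differ noticeably, because several passengers sit just below the threshold. To report per-passenger predictions rather than an average, the class labels (for example from the classifier's `predict` method) would be the quantity to count.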
