# DAT 19: Homework 2 Assignment

## Instructions

For Homework 2, we will build on the work we did with the Titanic dataset in Homework 1. In this assignment, we will build a logistic regression model to predict passenger survival.

Please do all your analysis to answer the questions below in this Jupyter notebook. Show your work.

**Please submit your completed notebook by 6:00PM on Monday, January 11.**

## About the Data

```
VARIABLE DESCRIPTIONS:
survival        Survival
                (0 = No; 1 = Yes)
pclass          Passenger Class
                (1 = 1st; 2 = 2nd; 3 = 3rd)
name            Name
sex             Sex
age             Age
sibsp           Number of Siblings/Spouses Aboard
parch           Number of Parents/Children Aboard
ticket          Ticket Number
fare            Passenger Fare
cabin           Cabin
embarked        Port of Embarkation
                (C = Cherbourg; Q = Queenstown; S = Southampton)

SPECIAL NOTES:
Pclass is a proxy for socio-economic status (SES)
 1st ~ Upper; 2nd ~ Middle; 3rd ~ Lower

Age is in Years; Fractional if Age less than One (1)
 If the Age is Estimated, it is in the form xx.5

With respect to the family relation variables (i.e. sibsp and parch)
some relations were ignored.  The following are the definitions used
for sibsp and parch.

Sibling:  Brother, Sister, Stepbrother, or Stepsister of Passenger Aboard Titanic
Spouse:   Husband or Wife of Passenger Aboard Titanic (Mistresses and Fiances Ignored)
Parent:   Mother or Father of Passenger Aboard Titanic
Child:    Son, Daughter, Stepson, or Stepdaughter of Passenger Aboard Titanic

Other family relatives excluded from this study include cousins,
nephews/nieces, aunts/uncles, and in-laws.  Some children travelled
only with a nanny, therefore parch=0 for them.  As well, some
travelled with very close friends or neighbors in a village, however,
the definitions do not support such relations.
```

## Homework Assignment

**1) Create a logistic regression model on the Titanic dataset to predict the survival of passengers. Show your model output. Include coefficient values.**

In [131]:
import pandas as pd
import numpy as np
from sklearn.cross_validation import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

#Cleaning up the data
titanic = pd.read_csv("titanic.csv", header=0)


#Changing gender to binary 0 - 1
titanic.Sex.replace(['female', 'male'],[0,1],inplace=True)

#Replacing missing ages with mean of age
titanic.Age = titanic.Age.fillna(titanic.Age.mean())

#Dropping the columns I will not use in my model, based on Homework 1 
titanic_set = titanic.drop(['PassengerId', 'Name', 'Ticket', 'Fare', 'Cabin', 'Embarked'], axis=1) 



In [132]:
#Logistic Regression Model

target = titanic_set.Survived
features = titanic_set.drop('Survived',axis=1)
new_features = StandardScaler().fit_transform(features)

model_lr = LogisticRegression(C=1).fit(new_features, target)

predictions = model_lr.predict(new_features)
train_predictions = pd.DataFrame({
        "PassengerId": titanic["PassengerId"],
        "Survived": predictions
    })

train_survival_rate = train_predictions["Survived"].mean() * 100
actual_train_survival_rate = titanic["Survived"].mean() * 100

print(train_predictions.head(10))
print "Actual Survival Rate (Training Set): " + str(actual_train_survival_rate)
print "Survival Rate (Training Set): " + str(train_survival_rate)
print "Coefficient Values: ", + model_lr.coef_

   PassengerId  Survived
0            1         0
1            2         1
2            3         1
3            4         1
4            5         0
5            6         0
6            7         0
7            8         0
8            9         1
9           10         1
Actual Survival Rate (Training Set): 38.3838383838
Survival Rate (Training Set): 36.9248035915
Coefficient Values:  [[-0.96397009 -1.3062986  -0.50920669 -0.35877595 -0.06369496]]


**2) Which features are predictive for this logistic regression? Explain your thinking. Do not simply cite model statistics.**

Gender (sex) and passenger class are the clearest predictive features for this model, followed by age. 
Gender seems to have the most significant relationship to survival-- in this case, a being a man on the Titanic (represented by 1 in the training data) was far less likely to survive, indicated by the inverse relationship of sex to survival. If you flip the genders (1 for female, 0 for male), the relationship flips-- being female increases the likelihood that a person will survive. Passenger class also has an inverse relationship to survival-- probability of survival decreases the higher the **value** of a person's passenger class (therefore, the lower their socioeconomic status).

**3) Implement cross-validation for your logistic regression model. Select the number of folds. Explain your choice.**

In [133]:
cross_val_score(model_lr,features,target,cv=10).mean()

0.7968485415957326

Honestly, I read some Q&A on StackOverflow and Quora, and I'm still not entirely sure if 10 is the correct number of folds or if I should have picked something different. I need to dig into this further, but for the homework, 10 folds seems to be a "safe" choice.

**4) In the hw-assignments director on the class github repo, there is a file called titanic-test.csv. What does your logistic regression model predict for these previously unseen (i.e. out of sample) passengers?**

In [134]:
titanic_test = pd.read_csv("titanic-test.csv", header=0)

#Clean up titanic_test the same way
titanic_test.Sex.replace(['female', 'male'],[0,1],inplace=True)
titanic_test.Age = titanic_test.Age.fillna(titanic.Age.mean())

titanic_test_set = titanic_test.drop(['PassengerId', 'Name', 'Ticket', 'Fare', 'Cabin', 'Embarked'], axis=1)
titanic_test_set_features = StandardScaler().fit_transform(titanic_test_set)


model_lr = LogisticRegression(C=1).fit(new_features, target)

test_predictions = model_lr.predict(titanic_test_set_features)
test_predictions_viz = pd.DataFrame({
        "PassengerId": titanic["PassengerId"],
        "Survived": predictions
    })

survival_rate_test = test_predictions_viz["Survived"].mean() * 100

print(test_predictions_viz.head(10))
print "Survival Rate (Test): " + str(survival_rate_test)

   PassengerId  Survived
0            1         0
1            2         1
2            3         1
3            4         1
4            5         0
5            6         0
6            7         0
7            8         0
8            9         1
9           10         1
Survival Rate (Test): 36.9248035915
