# DAT 19: Homework 2 Assignment

## Instructions

For Homework 2, we will build on the work we did with the Titanic dataset in Homework 1. In this assignment, we will build a logistic regression model to predict passenger survival.

Please do all your analysis to answer the questions below in this Jupyter notebook. Show your work.

**Please submit your completed notebook by 6:00PM on Monday, January 11.**

## About the Data

```
VARIABLE DESCRIPTIONS:
survival        Survival
                (0 = No; 1 = Yes)
pclass          Passenger Class
                (1 = 1st; 2 = 2nd; 3 = 3rd)
name            Name
sex             Sex
age             Age
sibsp           Number of Siblings/Spouses Aboard
parch           Number of Parents/Children Aboard
ticket          Ticket Number
fare            Passenger Fare
cabin           Cabin
embarked        Port of Embarkation
                (C = Cherbourg; Q = Queenstown; S = Southampton)

SPECIAL NOTES:
Pclass is a proxy for socio-economic status (SES)
 1st ~ Upper; 2nd ~ Middle; 3rd ~ Lower

Age is in Years; Fractional if Age less than One (1)
 If the Age is Estimated, it is in the form xx.5

With respect to the family relation variables (i.e. sibsp and parch)
some relations were ignored.  The following are the definitions used
for sibsp and parch.

Sibling:  Brother, Sister, Stepbrother, or Stepsister of Passenger Aboard Titanic
Spouse:   Husband or Wife of Passenger Aboard Titanic (Mistresses and Fiances Ignored)
Parent:   Mother or Father of Passenger Aboard Titanic
Child:    Son, Daughter, Stepson, or Stepdaughter of Passenger Aboard Titanic

Other family relatives excluded from this study include cousins,
nephews/nieces, aunts/uncles, and in-laws.  Some children travelled
only with a nanny, therefore parch=0 for them.  As well, some
travelled with very close friends or neighbors in a village, however,
the definitions do not support such relations.
```

## Getting Started

**Load libraries and dataset**

In [None]:
# import libraries
import pandas as pd
import numpy as np

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn import preprocessing
from sklearn.preprocessing import StandardScaler

import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
# bring dataset in as a pandas dataframe
df = pd.read_csv('https://raw.githubusercontent.com/colby-schrauth/DAT_SF_19/master/data/titanic.csv',
                 header=0, sep=',', index_col=False)

test_df = pd.read_csv('https://raw.githubusercontent.com/colby-schrauth/DAT_SF_19/master/hw-assignments/titanic-test.csv',
                      header=0, sep=',', index_col=False)

In [None]:
# print the first 5 rows to make sure I've imported the dataset properly
df.head(5)

In [None]:
# print the last 5 rows to make sure I've imported the dataset properly  
df.tail(5)  

In [None]:
# scan the attributes of the dataframe to obtain an initial understanding of what we're working with
df.info()
df.describe()

**Data Preparation**

In [None]:
# create a reference for the column names for easy recall
print(list(df.columns.values))

# convert 'Sex' column to binary, and place values in two new columns: 'Male', 'Female'
df['Male'] = df['Sex'].map( {'female': 0, 'male': 1} ).astype(int)
df['Female'] = df['Sex'].map( {'female': 1, 'male': 0} ).astype(int)

# map the test set's own 'Sex' column (not the training set's)
test_df['Male'] = test_df['Sex'].map( {'female': 0, 'male': 1} ).astype(int)
test_df['Female'] = test_df['Sex'].map( {'female': 1, 'male': 0} ).astype(int)

In [None]:
# check our work - print the first 5 rows to make sure I've parsed correctly
df.head(5)

In [None]:
# check our work - make sure every row has been accounted for
print(df.Sex.value_counts())
print(df.Male.value_counts())
print(df.Female.value_counts())

In [None]:
# identify the unique values in 'Embarked' and their associated frequencies
df.Embarked.value_counts()

In [None]:
# fill in Embarked nulls with most common embarkment
df['Embarked'].fillna('S', inplace=True)
test_df['Embarked'].fillna('S', inplace=True)

# check to make sure there are no more null values
df.info()

In [None]:
# break apart 'Embarked' column into binary pieces, and place them in three newly created columns ('EmbarkedS', 'EmbarkedC', 'EmbarkedQ')
df['EmbarkedS'] = df['Embarked'].map( {'S': 1, 'C': 0, 'Q': 0}).astype(int)
df['EmbarkedC'] = df['Embarked'].map( {'S': 0, 'C': 1, 'Q': 0}).astype(int)
df['EmbarkedQ'] = df['Embarked'].map( {'S': 0, 'C': 0, 'Q': 1}).astype(int)

# map the test set's own 'Embarked' column (not the training set's)
test_df['EmbarkedS'] = test_df['Embarked'].map( {'S': 1, 'C': 0, 'Q': 0}).astype(int)
test_df['EmbarkedC'] = test_df['Embarked'].map( {'S': 0, 'C': 1, 'Q': 0}).astype(int)
test_df['EmbarkedQ'] = test_df['Embarked'].map( {'S': 0, 'C': 0, 'Q': 1}).astype(int)
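
The three indicator columns above can also be produced in one step. This is a minimal sketch on a made-up four-row frame (`toy` is hypothetical, not the Titanic data), using `pd.get_dummies`. One caveat: if a port never appears in a given dataset, `get_dummies` will not create a column for it, so the explicit maps above are safer when the training and test sets must end up with identical columns.

```python
import pandas as pd

# made-up stand-in for the 'Embarked' column (hypothetical data)
toy = pd.DataFrame({'Embarked': ['S', 'C', 'Q', 'S']})

# one indicator column per port; prefix_sep='' gives names like 'EmbarkedS'
dummies = pd.get_dummies(toy['Embarked'], prefix='Embarked', prefix_sep='').astype(int)
toy = pd.concat([toy, dummies], axis=1)

print(toy)
```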

In [None]:
# check the new columns
df.head(10)

In [None]:
# fill in missing age values with the median for each gender, in each class
# start by creating an empty numpy matrix
median_ages = np.zeros((2,3))
median_ages

In [None]:
# fill the median_ages matrix with the value
for i in range(0, 2):
    for j in range(0, 3):
        median_ages[i,j] = df[(df['Male'] == i) & \
                              (df['Pclass'] == j+1)]['Age'].dropna().median()
        
median_ages

In [None]:
# create a new column titled 'AgeFill', which is equal to 'Age'
df['AgeFill'] = df['Age']
test_df['AgeFill'] = test_df['Age']

# check my work
df.head(5)

In [None]:
# pull back a handful of entries where the 'Age' value is null
df[ df['Age'].isnull() ][['Male', 'Female', 'Pclass','Age','AgeFill']].head(10)
test_df[ test_df['Age'].isnull() ][['Male', 'Female', 'Pclass','Age','AgeFill']].head(10)

In [None]:
# fill the 'AgeFill' column with the values discovered, and stored in the median_ages matrix
for i in range(0, 2):
    for j in range(0, 3):
        df.loc[ (df.Age.isnull()) & (df.Male == i) & (df.Pclass == j+1),\
                'AgeFill'] = median_ages[i,j]
        
for i in range(0, 2):
    for j in range(0, 3):
        test_df.loc[ (test_df.Age.isnull()) & (test_df.Male == i) & (test_df.Pclass == j+1),\
                'AgeFill'] = median_ages[i,j]
        
# check my work
df[ df['Age'].isnull() ][['Male', 'Female', 'Pclass','Age','AgeFill']].head(10)
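
For the training set alone, the same group-median fill can be written without the nested loops. This is a sketch on made-up rows (`toy_train` is hypothetical), using `groupby(...).transform('median')`. It does not cover the test set, where the notebook reuses the training medians; that step would still need a lookup such as the `median_ages` matrix.

```python
import numpy as np
import pandas as pd

# made-up rows standing in for the training data (hypothetical values)
toy_train = pd.DataFrame({
    'Male':   [1, 1, 1, 0, 0, 0],
    'Pclass': [1, 1, 1, 3, 3, 3],
    'Age':    [40.0, 50.0, np.nan, 20.0, 30.0, np.nan],
})

# median age of each (Male, Pclass) group, aligned row by row; NaN ages are skipped
group_median = toy_train.groupby(['Male', 'Pclass'])['Age'].transform('median')

# keep known ages, fill missing ones with the median of the matching group
toy_train['AgeFill'] = toy_train['Age'].fillna(group_median)

print(toy_train)
```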

In [None]:
# create a new column 'AgeIsNull', which reminds us of whether or not the original 'Age' column was null
df['AgeIsNull'] = pd.isnull(df.Age).astype(int)
test_df['AgeIsNull'] = pd.isnull(test_df.Age).astype(int)

# check my work
df.head(5)

**Feature Engineering**

In [None]:
# create new feature called 'FamilySize' equal to # of siblings/spouses + # of parents/children
df['FamilySize'] = df['SibSp'] + df['Parch']
test_df['FamilySize'] = test_df['SibSp'] + test_df['Parch']

In [None]:
# create a new feature called 'Age*Class' equal to AgeFill * Pclass
df['Age*Class'] = df.AgeFill * df.Pclass
test_df['Age*Class'] = test_df.AgeFill * test_df.Pclass

# check my work
df.head(5)

**Machine Learning Preparation** 

In [None]:
# check for object data types, as these columns will be eliminated
df.dtypes
df.dtypes[df.dtypes.map(lambda x: x=='object')]

test_df.dtypes
test_df.dtypes[test_df.dtypes.map(lambda x: x=='object')]

In [None]:
# drop all columns that we will not use
df = df.drop(['PassengerId', 'Name', 'Sex', 'Ticket', 'Cabin', 'Embarked', 'Age'], axis=1)
test_df = test_df.drop(['Name', 'Sex', 'Ticket', 'Cabin', 'Embarked', 'Age'], axis=1)

# check my work
test_df.info()

In [None]:
# convert back to a numpy array
train_data = df.values
train_data

**1) Create a logistic regression model on the Titanic dataset to predict the survival of passengers. Show your model output. Include coefficient values.**

Please see code below

**2) Which features are predictive for this logistic regression? Explain your thinking. Do not simply cite model statistics.**

I believe that gender, age, and passenger class will have the highest predictive power. This expectation comes from domain knowledge about the Titanic story and about which passengers were most likely to survive.
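
One way to check this intuition against the data, before looking at any model output, is to compare survival rates across groups. This is a minimal sketch on made-up rows (`toy` and its values are hypothetical, not results from the Titanic data). The same `groupby` call could be run on the training frame before the `Sex` column is dropped.

```python
import pandas as pd

# made-up passengers, for illustration only (hypothetical values)
toy = pd.DataFrame({
    'Sex':      ['female', 'female', 'male', 'male'],
    'Pclass':   [1, 3, 1, 3],
    'Survived': [1, 1, 0, 1],
})

# the mean of a 0/1 column is the share of survivors in each group
rate_by_sex = toy.groupby('Sex')['Survived'].mean()
rate_by_class = toy.groupby('Pclass')['Survived'].mean()

print(rate_by_sex)
print(rate_by_class)
```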

**3) Implement cross-validation for your logistic regression model. Select the number of folds. Explain your choice.**

I selected five folds. With five folds, each round trains on 80% of the data and validates on the remaining 20%, which is the split I was looking for.
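
The 80-20 arithmetic can be seen on a toy example. This sketch assumes a made-up 100-row matrix (`X_toy` is hypothetical) and uses `KFold` directly. When `cross_val_score` receives an integer `cv` and a classifier, scikit-learn uses stratified folds instead, but the fold sizes follow the same proportions.

```python
import numpy as np
from sklearn.model_selection import KFold

# toy stand-in for the feature matrix: 100 rows, 2 columns (hypothetical data)
X_toy = np.arange(200).reshape(100, 2)

kfold = KFold(n_splits=5, shuffle=True, random_state=0)

# record (training rows, validation rows) for each of the five rounds
split_sizes = [(len(train_idx), len(test_idx))
               for train_idx, test_idx in kfold.split(X_toy)]

print(split_sizes)
```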

In [None]:
# instantiate the logistic regression model
model_lr = LogisticRegression(C=1)

In [None]:
# split features from target
features = df.drop('Survived',axis=1)
target = df.Survived

In [None]:
# run the model, and get an average score for accuracy
cross_val_score(model_lr,features,target,cv=5).mean()

In [None]:
# test for different C values using a for loop
c_range = range(1,31)
c_scores = []

# features and target do not change between iterations, so split them once
features = df.drop('Survived',axis=1)
target = df.Survived

for c in c_range:
    model_lr = LogisticRegression(C=c)
    c_scores.append(cross_val_score(model_lr,features,target,cv=5).mean())

print(c_scores)

In [None]:
# plot the value of C for Logistic Regression (x-axis) versus the cross-validated accuracy (y-axis)
plt.plot(c_range, c_scores)
plt.xlabel('Value of C for Logistic Regression')
plt.ylabel('Cross-Validated Accuracy')

In [None]:
# re-instantiate the model with an updated value for C
model_lr = LogisticRegression(C=20).fit(features, target)

# run the model, and get an average score for accuracy (same five folds as above)
cross_val_score(model_lr,features,target,cv=5).mean()

In [None]:
# get the fitted coefficient (log-odds scale) for each feature from the logistic regression
coeff_df = pd.DataFrame(features.columns)
coeff_df.columns = ['Features']
coeff_df["Coefficient Estimate"] = pd.Series(model_lr.coef_[0])

# preview
coeff_df
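
Logistic regression coefficients are on the log-odds scale, so exponentiating them gives odds ratios, which are often easier to read. This is a sketch with made-up coefficient values (`toy_coeff` is hypothetical and not this notebook's output): a value below 1 means the feature lowers the odds of survival, and a value of exactly 1 means no effect.

```python
import numpy as np
import pandas as pd

# hypothetical coefficient values, for illustration only
toy_coeff = pd.DataFrame({
    'Features': ['Male', 'Pclass'],
    'Coefficient Estimate': [-1.0, 0.0],
})

# exp(coefficient) = multiplicative change in the odds per one-unit increase
toy_coeff['Odds Ratio'] = np.exp(toy_coeff['Coefficient Estimate'])

print(toy_coeff)
```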

In [None]:
# Normalize our feature set
n_features = StandardScaler().fit_transform(features)
n_features = pd.DataFrame(n_features)
n_features

In [None]:
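# refit on the standardized features so the coefficients are on a comparable scale,
# then get the coefficient for each feature
model_lr = LogisticRegression(C=20).fit(n_features.values, target)
coeff_df = pd.DataFrame(features.columns)
coeff_df.columns = ['Features']
coeff_df["Coefficient Estimate"] = pd.Series(model_lr.coef_[0])

# preview
coeff_df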

In [None]:
# fill in missing 'Fare' value within the Test dataset
test_df["Fare"].fillna(test_df["Fare"].median(), inplace=True)

# check my work
test_df[ test_df['Fare'].isnull() ][['Male', 'Female', 'Pclass','Fare']].head(10)

**4) In the hw-assignments directory on the class GitHub repo, there is a file called titanic-test.csv. What does your logistic regression model predict for these previously unseen (i.e. out of sample) passengers?**

This returned a score of 0.5837, which is lower than scores I've gotten on earlier submissions to this project, so I suspect I did something wrong somewhere - haha!
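
A common cause of a low out-of-sample score is a mismatch between the preprocessing applied at training time and at prediction time. This is a minimal sketch on made-up numbers (`train_toy` and `test_toy` are hypothetical). It shows that a `StandardScaler` fit on the training features and reused on the test features gives different values than a second scaler fit on the test features alone.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# made-up single-feature data (hypothetical values)
train_toy = np.array([[0.0], [10.0]])   # mean 5, standard deviation 5
test_toy = np.array([[5.0], [15.0]])    # mean 10, standard deviation 5

# consistent: learn mean/std from the training data, apply them to the test data
scaler = StandardScaler().fit(train_toy)
test_scaled = scaler.transform(test_toy)

# inconsistent: the test data is rescaled with its own mean/std
test_refit = StandardScaler().fit_transform(test_toy)

print(test_scaled.ravel())
print(test_refit.ravel())
```

A model only sees the numbers it is given, so it should be trained and evaluated on features that went through the same transformation.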

In [None]:
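# standardize with a scaler fit on the training features, refit the model on the
# standardized training data, and predict the outcome for the test dataset
X_test = test_df.drop("PassengerId",axis=1)[features.columns]
scaler = StandardScaler().fit(features)
model_lr = LogisticRegression(C=20).fit(scaler.transform(features), target)
X_test = scaler.transform(X_test)
Y_pred = model_lr.predict(X_test)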

submission = pd.DataFrame({
        "PassengerId": test_df["PassengerId"],
        "Survived": Y_pred
    })
submission.to_csv('titanic.csv', index=False)