# Titanic - Machine Learning from Disaster
Overview from Kaggle:  
The data has been split into two groups:

1. training set (train.csv)
2. test set (test.csv)

The training set should be used to build your machine learning models. For the training set, we provide the outcome (also known as the “ground truth”) for each passenger. Your model will be based on “features” like passengers’ gender and class. You can also use feature engineering to create new features.

The test set should be used to see how well your model performs on unseen data. For the test set, we do not provide the ground truth for each passenger. It is your job to predict these outcomes. For each passenger in the test set, use the model you trained to predict whether or not they survived the sinking of the Titanic.

We also include gender_submission.csv, a set of predictions that assume all and only female passengers survive, as an example of what a submission file should look like.
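As a rough illustration of that baseline, the sketch below builds an "only female passengers survive" prediction on a tiny hand-made frame. The three rows and the `toy_test` and `baseline` names are made up for the example; they are not taken from test.csv.

```python
import pandas as pd

# Toy stand-in for a few rows of test.csv (illustrative values only).
toy_test = pd.DataFrame({
    "PassengerId": [892, 893, 894],
    "Sex": ["male", "female", "male"],
})

# Predict 1 (survived) for female passengers and 0 for everyone else.
baseline = pd.DataFrame({
    "PassengerId": toy_test["PassengerId"],
    "Survived": (toy_test["Sex"] == "female").astype(int),
})

# A submission file would then be written with baseline.to_csv(path, index=False).
print(baseline)
```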

| Variable | Definition                                  | Key                                            |
| -------- | ------------------------------------------- | ---------------------------------------------- |
| Survived | Survival                                    | 0 = No, 1 = Yes                                |
| Pclass   | Ticket class                                | 1 = 1st, 2 = 2nd, 3 = 3rd                      |
| Sex      | Sex                                         |                                                |
| Age      | Age in years                                |                                                |
| SibSp    | \# of siblings / spouses aboard the Titanic |                                                |
| Parch    | \# of parents / children aboard the Titanic |                                                |
| Ticket   | Ticket number                               |                                                |
| Fare     | Passenger fare                              |                                                |
| Cabin    | Cabin number                                |                                                |
| Embarked | Port of Embarkation                         | C = Cherbourg, Q = Queenstown, S = Southampton |

Variable Notes

pclass: A proxy for socio-economic status (SES)  
1st = Upper  
2nd = Middle  
3rd = Lower

age: Age is fractional if less than 1. If the age is estimated, it is in the form of xx.5  

sibsp: The dataset defines family relations in this way...  
Sibling = brother, sister, stepbrother, stepsister  
Spouse = husband, wife (mistresses and fiancés were ignored)  

parch: The dataset defines family relations in this way...  
Parent = mother, father  
Child = daughter, son, stepdaughter, stepson  
Some children travelled only with a nanny, therefore parch=0 for them.

In [354]:
import pandas as pd
import numpy as np
pd.read_csv('./train.csv')

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C


### Formatting the data
I will test on a held-out subset of the training data before finalizing the model. The original test dataset is called submit. The finalized model will use the full training set.

In [355]:
train_full = pd.read_csv('./train.csv')
submit = pd.read_csv('./test.csv')

submit['submit'] = True
submit['Survived'] = -1
train_full['submit'] = False
data = pd.concat([submit, train_full], copy=True)
del submit, train_full

data['Survived'] = data['Survived'].astype(int)
data['Embarked'] = data['Embarked'].map({'S':0, 'C':1, 'Q':2})
data['Sex'] = data['Sex'].map( {'male':1, 'female':0} )

data.dtypes

PassengerId      int64
Pclass           int64
Name            object
Sex              int64
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked       float64
submit            bool
Survived         int32
dtype: object
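Embarked comes out as float64 rather than int64 because `map` leaves missing values as NaN, and NaN cannot be stored in a plain integer column. A minimal sketch on a made-up series (the `ports`, `codes`, and `filled` names are only for illustration):

```python
import pandas as pd

# One missing port among four passengers.
ports = pd.Series(["S", "C", None, "Q"])

# Values without a dictionary entry (here the missing one) become NaN,
# which forces the result to float64.
codes = ports.map({"S": 0, "C": 1, "Q": 2})
print(codes.dtype)

# Once the gap is filled, the column can be cast back to integers.
filled = codes.fillna(0).astype(int)
print(filled.tolist())
```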

### Preprocessing

In [356]:
data.isnull().sum(axis=0)

PassengerId       0
Pclass            0
Name              0
Sex               0
Age             263
SibSp             0
Parch             0
Ticket            0
Fare              1
Cabin          1014
Embarked          2
submit            0
Survived          0
dtype: int64

I will set the single missing Fare to the mean fare and the two missing Embarked values to the most common port, Southampton (encoded as 0).

In [357]:
# mean() skips NaN; assigning the result back avoids in-place fills on a column selection
data['Fare'] = data['Fare'].fillna(data['Fare'].mean())
# 0 is the code for Southampton
data['Embarked'] = data['Embarked'].fillna(0)
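The same mean and most-common-value imputation can be sketched on a toy frame, where the fill values are computed rather than hard-coded. The numbers and the `toy`, `fare_mean`, and `port_mode` names are invented for the example.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Fare": [7.25, 71.28, np.nan, 8.05],
    "Embarked": [0.0, 1.0, 0.0, np.nan],
})

# Series.mean() ignores NaN by default.
fare_mean = toy["Fare"].mean()
# mode() returns a Series of the most frequent values; take the first one.
port_mode = toy["Embarked"].mode().iloc[0]

toy["Fare"] = toy["Fare"].fillna(fare_mean)
toy["Embarked"] = toy["Embarked"].fillna(port_mode)
print(toy)
```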

Many of the Age values are missing (263 in total). I will use a linear model to predict the age for each missing value. Like the estimated ages in the original dataset, predicted ages will take the form xx.5.

In [358]:
import sklearn.linear_model as lm

age_train_x = data.drop(['Name', 'Ticket', 'Cabin', 'submit'], axis=1).dropna().drop('Age', axis=1)
age_train_y = data['Age'].dropna()

age_mod = lm.LinearRegression()
age_mod.fit(age_train_x, age_train_y)

age_na = data[data['Age'].isna()].copy()
age_na_x = age_na.drop(['Name', 'Ticket', 'Cabin', 'submit', 'Age'], axis=1)

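# round each prediction to the nearest value ending in .5
age_na['Age'] = (age_mod.predict(age_na_x) + 0.5).round() - 0.5
# replace negative predictions; .loc limits the assignment to the Age column
# (age_na[mask] = 0.5 would overwrite every column in those rows)
age_na.loc[age_na['Age'] < 0, 'Age'] = 0.5
data.loc[data['Age'].isna(), 'Age'] = age_na['Age'].to_numpy()

In [359]:
data.isnull().sum(axis=0)

PassengerId       0
Pclass            0
Name              0
Sex               0
Age               0
SibSp             0
Parch             0
Ticket            0
Fare              0
Cabin          1014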
Embarked          0
submit            0
Survived          0
dtype: int64
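The rounding step is easy to check in isolation. The sketch below applies the shift, round, shift-back trick to a few made-up predictions and then replaces negative results, restricting the assignment to the Age column so that no other column is touched. The `preds`, `rounded`, and `toy` names are hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up model outputs, including one negative prediction.
preds = np.array([-3.2, 0.2, 27.4, 27.6, 40.9])

# Shift by 0.5, round to the nearest integer, shift back:
# every result ends in .5.
rounded = np.round(preds + 0.5) - 0.5

toy = pd.DataFrame({"Name": list("abcde"), "Age": rounded})

# Select rows AND the Age column. A bare toy[mask] = 0.5 would set
# every column in the matching rows, Name included.
toy.loc[toy["Age"] < 0, "Age"] = 0.5
print(toy)
```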

#### Subsetting The Data

In [360]:
from sklearn.model_selection import train_test_split
train_full = data[data['submit'] == False].drop('submit', axis=1)
train, test = train_test_split(train_full)
submit = data[data['submit'] == True].drop(['submit', 'Survived'], axis=1)

In [361]:
train_x = train.drop(['Name', 'Ticket', 'Cabin', 'Survived'], axis=1)
train_y = train['Survived']

test_x = test.drop(['Name', 'Ticket', 'Cabin', 'Survived'], axis=1)
test_y = test['Survived']
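By default `train_test_split` holds out 25% of the rows and reshuffles on every run, so scores on the held-out part vary between runs. A possible variation, shown here on a toy frame rather than the Titanic data, fixes `random_state` and stratifies on the target so both parts keep the same class balance. The frame and the `toy_train` and `toy_test` names are assumptions for the example.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

toy = pd.DataFrame({
    "Pclass": [1, 3, 2, 3, 1, 3, 2, 1],
    "Survived": [1, 0, 1, 0, 1, 0, 0, 1],
})

# random_state makes the shuffle repeatable; stratify keeps the
# proportion of survivors the same in both parts.
toy_train, toy_test = train_test_split(
    toy, test_size=0.25, random_state=0, stratify=toy["Survived"]
)
print(len(toy_train), len(toy_test))
```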

## Modeling  
### Logistic Regression

In [362]:
mod_log = lm.LogisticRegression(max_iter=1000)
mod_log.fit(train_x, train_y)
mod_log.score(test_x, test_y)

0.7702702702702703
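For a classifier, `score` returns accuracy, the fraction of held-out passengers classified correctly. Since that number depends on a single random split, a k-fold estimate can be steadier. The sketch below uses synthetic data, not the Titanic features, and `X`, `y`, and `scores` are placeholder names.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic binary problem: the label depends on the first feature plus noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

model = LogisticRegression(max_iter=1000)

# One accuracy value per fold; the mean summarizes them and the
# standard deviation shows how much a single split can vary.
scores = cross_val_score(model, X, y, cv=5)
print(scores.mean(), scores.std())
```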
