# Titanic - Machine Learning from Disaster
Overview from Kaggle:  
The data has been split into two groups:

1. training set (train.csv)
2. test set (test.csv)

The training set should be used to build your machine learning models. For the training set, we provide the outcome (also known as the “ground truth”) for each passenger. Your model will be based on “features” like passengers’ gender and class. You can also use feature engineering to create new features.

The test set should be used to see how well your model performs on unseen data. For the test set, we do not provide the ground truth for each passenger. It is your job to predict these outcomes. For each passenger in the test set, use the model you trained to predict whether or not they survived the sinking of the Titanic.

We also include gender_submission.csv, a set of predictions that assume all and only female passengers survive, as an example of what a submission file should look like.
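
The rule behind gender_submission.csv can be sketched in a few lines. This is only an illustration: the helper name `gender_baseline` and the three-row sample frame are made up, and the code assumes the raw `Sex` strings (`male` / `female`) rather than any numeric encoding applied later.

```python
import pandas as pd

def gender_baseline(passengers: pd.DataFrame) -> pd.DataFrame:
    """Predict survival for every female passenger and death for every male."""
    return pd.DataFrame({
        "PassengerId": passengers["PassengerId"],
        "Survived": (passengers["Sex"] == "female").astype(int),
    })

# Tiny made-up frame standing in for test.csv (illustrative values only).
sample = pd.DataFrame({
    "PassengerId": [892, 893, 894],
    "Sex": ["male", "female", "male"],
})
baseline = gender_baseline(sample)
print(baseline)
```

A baseline like this gives a reference score that any trained model should be expected to beat.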

| Variable | Definition                                  | Key                                            |
| -------- | ------------------------------------------- | ---------------------------------------------- |
| Survived | Survival                                    | 0 = No, 1 = Yes                                |
| Pclass   | Ticket class                                | 1 = 1st, 2 = 2nd, 3 = 3rd                      |
| Sex      | Sex                                         |                                                |
| Age      | Age in years                                |                                                |
| SibSp    | \# of siblings / spouses aboard the Titanic |                                                |
| Parch    | \# of parents / children aboard the Titanic |                                                |
| Ticket   | Ticket number                               |                                                |
| Fare     | Passenger fare                              |                                                |
| Cabin    | Cabin number                                |                                                |
| Embarked | Port of Embarkation                         | C = Cherbourg, Q = Queenstown, S = Southampton |

Variable Notes

pclass: A proxy for socio-economic status (SES)  
1st = Upper  
2nd = Middle  
3rd = Lower

age: Age is fractional if less than 1. If the age is estimated, it is in the form xx.5.  
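
As a rough sketch of how the age convention could be turned into flags, the snippet below marks infants and estimated ages. The sample values are invented, and treating only ages of 1 or more as candidates for "estimated" is an assumption, since an infant age of 0.5 is ambiguous under the stated rule.

```python
import pandas as pd

# Made-up ages covering each case described in the note.
ages = pd.Series([0.42, 0.75, 22.0, 28.5, 40.5, 35.0])

# Fractional ages below 1 are infants recorded in fractions of a year.
is_infant = ages < 1

# Ages of the form xx.5 (for passengers aged 1 or more) are treated as estimates.
is_estimated = (ages >= 1) & (ages % 1 == 0.5)

print(pd.DataFrame({"Age": ages, "infant": is_infant, "estimated": is_estimated}))
```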

sibsp: The dataset defines family relations in this way...  
Sibling = brother, sister, stepbrother, stepsister  
Spouse = husband, wife (mistresses and fiancés were ignored)  

parch: The dataset defines family relations in this way...  
Parent = mother, father  
Child = daughter, son, stepdaughter, stepson  
Some children travelled only with a nanny, therefore parch=0 for them.

In [None]:
import pandas as pd
import numpy as np
pd.read_csv('./train.csv')

### Formatting the data
I will use a subset of the training data to test models before finalizing one. The original testing dataset will be called submit. The finalized model will use the full training set.

In [None]:
train_full = pd.read_csv('./train.csv')
submit = pd.read_csv('./test.csv')

submit['submit'] = True
submit['Survived'] = -1
train_full['submit'] = False
# ignore_index gives every row a unique label; both files otherwise start at 0
data = pd.concat([submit, train_full], ignore_index=True)
del submit, train_full

data['Survived'] = data['Survived'].astype(int)
data['Embarked'] = data['Embarked'].map({'S':0, 'C':1, 'Q':2})
data['Sex'] = data['Sex'].map( {'male':1, 'female':0} )

data.dtypes

### PreProcessing

In [None]:
data.isnull().sum(axis=0)

I will set the single missing Fare to the average fare and the 2 missing Embarked values to the most common port, Southampton (encoded as 0).

In [None]:
# Assign the result back; inplace fillna on a selected column may act on a copy
data['Fare'] = data['Fare'].fillna(data['Fare'].mean())
data['Embarked'] = data['Embarked'].fillna(0)
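
The fill value for Embarked above is hard-coded as 0. A possible alternative, sketched here on a made-up frame named `toy`, is to derive both fill values from the data so they stay correct if the encoding changes. This is only an illustration, not the approach used in this notebook.

```python
import numpy as np
import pandas as pd

# Made-up rows standing in for the combined data frame.
toy = pd.DataFrame({
    "Fare": [7.25, 71.28, np.nan, 8.05],
    "Embarked": [0.0, 1.0, 0.0, np.nan],
})

fare_fill = toy["Fare"].mean()                  # NaN values are skipped by default
embarked_fill = toy["Embarked"].mode().iloc[0]  # most frequent encoded port

# A dict maps each column to its own fill value.
toy = toy.fillna({"Fare": fare_fill, "Embarked": embarked_fill})
print(toy)
```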

Many of the Age values are missing. I will use a linear model to predict age for each of the missing values. Like the original dataset, predicted ages will be of the format XX.5.

In [None]:
import sklearn.linear_model as lm

age_train_x = data.drop(['Name', 'Ticket', 'Cabin', 'submit'], axis=1).dropna().drop('Age', axis=1)
age_train_y = data['Age'].dropna()

age_mod = lm.LinearRegression()
age_mod.fit(age_train_x, age_train_y)

age_na = data[data['Age'].isna()].copy()
age_na_x = age_na.drop(['Name', 'Ticket', 'Cabin', 'submit', 'Age'], axis=1)

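# Snap predictions to the x.5 format: shift by 0.5, round, then shift back
age_na['Age'] = np.round(age_mod.predict(age_na_x) + 0.5) - 0.5
# .loc avoids chained assignment, which would only modify a temporary copy
age_na.loc[age_na['Age'] < 0, 'Age'] = 0.5
# Assign by position so the result does not depend on index alignment
data.loc[data['Age'].isna(), 'Age'] = age_na['Age'].to_numpy()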
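
The rounding step can be checked in isolation. The helper below, with the hypothetical name `to_half_year`, is a small sketch of the same arithmetic applied to invented predictions, including the clamp for negative values that a linear model can produce.

```python
import numpy as np

def to_half_year(predicted_ages):
    """Snap each non-integer prediction to the x.5 value within its one-year interval."""
    snapped = np.round(np.asarray(predicted_ages, dtype=float) + 0.5) - 0.5
    # A linear model can predict negative ages; clamp those to 0.5.
    return np.where(snapped < 0, 0.5, snapped)

# Invented predictions, including two that come out negative.
print(to_half_year([30.2, 30.9, 4.1, -0.3, -3.7]))
```

Predictions that land exactly on an integer are rounded by NumPy's round-half-to-even rule, so they may move to either neighbouring x.5 value.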

In [None]:
data.isnull().sum(axis=0)

#### Adding Some More Variables

In [None]:
import re
data['Prefix'] = data['Name'].apply(lambda s: s.split(', ')[1].split('. ')[0]).map(
    {'Mr':0,
    'Miss':1,
    'Mrs':2,
    'Master':3,
    'Rev':4,
    'Dr':5,
    'Col':6,
    'Ms':7,
    'Major':8,
    'Mlle':9,
    'Sir':9,
    'the Countess':9,
    'Capt':9,
    'Don':9,
    'Lady':9,
    'Mme':9,
    'Dona':9,
    'Jonkheer':9})
data['Prefix'].dtype
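
Any title missing from the mapping becomes NaN after `.map`, so it can be worth listing unmapped titles before modeling. The sketch below uses a regular expression as an alternative way to pull out the title; the three names follow the dataset's name format, and the two-entry `title_codes` mapping is deliberately incomplete and purely illustrative.

```python
import pandas as pd

names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)",
])

# Capture everything between the comma and the first following period.
titles = names.str.extract(r",\s*([^.]+)\.", expand=False)

# Hypothetical partial mapping, used only to demonstrate the NaN check.
title_codes = {"Mr": 0, "Miss": 1}
coded = titles.map(title_codes)

# Titles that the mapping does not cover show up as NaN in `coded`.
unmapped = sorted(titles[coded.isna()].unique())
print(titles.tolist())
print(unmapped)
```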

#### Subsetting The Data

In [None]:
from sklearn.model_selection import train_test_split
train_full = data[~data['submit']].drop('submit', axis=1)
train, test = train_test_split(train_full)
submit = data[data['submit']].drop('submit', axis=1)

In [None]:
train_x = train.drop(['PassengerId', 'Name', 'Ticket', 'Cabin', 'Survived'], axis=1)
train_y = train['Survived']

train_full_x = train_full.drop(['PassengerId', 'Name', 'Ticket', 'Cabin', 'Survived'], axis=1)
train_full_y = train_full['Survived']

submit_x = submit.drop(['PassengerId', 'Name', 'Ticket', 'Cabin', 'Survived'], axis=1)
predictions = submit[['PassengerId', 'Survived']].copy()

test_x = test.drop(['PassengerId', 'Name', 'Ticket', 'Cabin', 'Survived'], axis=1)
test_y = test['Survived']
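
The same list of columns is dropped four times above. One way to reduce the repetition, sketched here with a hypothetical helper `split_xy` and a made-up two-row frame, is to keep the non-feature columns in a single constant.

```python
import pandas as pd

# Identifier, free-text and target columns that are not used as features.
NON_FEATURES = ["PassengerId", "Name", "Ticket", "Cabin", "Survived"]

def split_xy(frame: pd.DataFrame, target: str = "Survived"):
    """Return (features, target) with the non-feature columns removed."""
    x = frame.drop(columns=NON_FEATURES, errors="ignore")
    y = frame[target]
    return x, y

# Made-up rows with the same column roles as the notebook's data.
toy = pd.DataFrame({
    "PassengerId": [1, 2], "Name": ["a", "b"], "Ticket": ["t1", "t2"],
    "Cabin": [None, "C85"], "Survived": [0, 1],
    "Pclass": [3, 1], "Age": [22.0, 38.0],
})
toy_x, toy_y = split_xy(toy)
print(list(toy_x.columns))
```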

## Modeling  


### Support Vector Machine

In [None]:
from sklearn import svm
from sklearn.metrics import classification_report

# SVC() defaults to an RBF kernel; pass kernel='linear' for a linear SVM
mod_svc = svm.SVC()
mod_svc.fit(train_x, train_y)

print(classification_report(test_y, mod_svc.predict(test_x)))
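
Support vector machines are sensitive to feature scale, and columns such as Fare span a much wider range than Sex or Pclass. The sketch below shows one common way to handle this, a pipeline that standardizes features before fitting. It runs on synthetic data from `make_classification`, not on the Titanic features, so the score it prints says nothing about this dataset.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data; one feature is inflated to mimic a wide-range column.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X[:, 0] *= 1000
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The scaler is fitted on the training split only, then applied to the test split.
scaled_svc = make_pipeline(StandardScaler(), SVC())
scaled_svc.fit(X_train, y_train)

score = scaled_svc.score(X_test, y_test)
print(score)
```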

### Logistic Regression

In [None]:
mod_log = lm.LogisticRegression(max_iter=1000)
mod_log.fit(train_x, train_y)

print(classification_report(test_y, mod_log.predict(test_x)))

### Passive Aggressive

In [None]:
mod_pag = lm.PassiveAggressiveClassifier()
mod_pag.fit(train_x, train_y)

print(classification_report(test_y, mod_pag.predict(test_x)))

### Decision Tree

In [None]:
from sklearn import tree

mod_dec = tree.DecisionTreeClassifier()
mod_dec.fit(train_x, train_y)

print(classification_report(test_y, mod_dec.predict(test_x)))

### Naive Bayes

In [None]:
from sklearn import naive_bayes

mod_gnb = naive_bayes.GaussianNB()
mod_gnb.fit(train_x, train_y)

print(classification_report(test_y, mod_gnb.predict(test_x)))

### Random Forest

In [None]:
from sklearn import ensemble

mod_rfo = ensemble.RandomForestClassifier(n_estimators=1000)
mod_rfo.fit(train_x, train_y)

print(classification_report(test_y, mod_rfo.predict(test_x)))

### Gradient Boosting

In [None]:
mod_gbc = ensemble.GradientBoostingClassifier(n_estimators=500)
mod_gbc.fit(train_x, train_y)
print(classification_report(test_y, mod_gbc.predict(test_x)))

### Ada Boost

In [None]:
mod_ada = ensemble.AdaBoostClassifier(n_estimators=50)
mod_ada.fit(train_x, train_y)
print(classification_report(test_y, mod_ada.predict(test_x)))
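
Every model above follows the same fit-then-report pattern, so the comparison could also be collected into one table. The following is a minimal sketch on synthetic data; the `candidates` dictionary, the reduced `n_estimators` values and the choice of accuracy and F1 as summary metrics are all illustrative assumptions.

```python
import pandas as pd
from sklearn import ensemble, linear_model, naive_bayes, tree
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for train_x / train_y / test_x / test_y.
X, y = make_classification(n_samples=400, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

candidates = {
    "logistic": linear_model.LogisticRegression(max_iter=1000),
    "tree": tree.DecisionTreeClassifier(random_state=0),
    "naive_bayes": naive_bayes.GaussianNB(),
    "forest": ensemble.RandomForestClassifier(n_estimators=100, random_state=0),
    "boosting": ensemble.GradientBoostingClassifier(n_estimators=100, random_state=0),
}

rows = []
for name, model in candidates.items():
    # Fit on the training split and score on the held-out split.
    predicted = model.fit(X_train, y_train).predict(X_test)
    rows.append({
        "model": name,
        "accuracy": accuracy_score(y_test, predicted),
        "f1": f1_score(y_test, predicted),
    })

# One row per model, best accuracy first.
comparison = pd.DataFrame(rows).sort_values("accuracy", ascending=False)
print(comparison)
```

Because the split is random, differences of a point or two between models on a single split should be read with caution.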

### Tuning the Best Model
The best model appears to be gradient boosting.

In [None]:
from sklearn import model_selection as ms

parameters = {
    "learning_rate": [0.01, 0.025, 0.05],
    "n_estimators": list(range(400, 625, 25)),
}


gbc = ensemble.GradientBoostingClassifier()
gscv_gbc = ms.GridSearchCV(gbc, parameters, cv=5, n_jobs=-1, verbose=4)

In [None]:
gscv_gbc.fit(train_x, train_y)
print(classification_report(test_y, gscv_gbc.predict(test_x)))

In [None]:
results = pd.DataFrame.from_dict(gscv_gbc.cv_results_)
results
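
The full `cv_results_` table is wide. A shorter view can be built from `best_params_`, `best_score_` and a few selected columns, as in this sketch. It uses synthetic data and a deliberately small grid (`small_grid`, an assumption made for speed), not the grid searched above.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data so the example runs quickly.
X, y = make_classification(n_samples=200, n_features=6, random_state=0)

small_grid = {"learning_rate": [0.05, 0.1], "n_estimators": [20, 40]}
search = GridSearchCV(GradientBoostingClassifier(random_state=0), small_grid, cv=3)
search.fit(X, y)

# Keep only the columns needed to compare parameter combinations.
columns = [
    "param_learning_rate", "param_n_estimators",
    "mean_test_score", "std_test_score", "rank_test_score",
]
summary = pd.DataFrame(search.cv_results_)[columns].sort_values("rank_test_score")

print(search.best_params_, round(search.best_score_, 3))
print(summary)
```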

In [None]:
gscv_gbc = ms.GridSearchCV(gbc, parameters, cv=10, n_jobs=-1, verbose=4)
gscv_gbc.fit(train_full_x, train_full_y)
predictions['Survived'] = gscv_gbc.predict(submit_x)

### Outputting Final Predictions

In [None]:
predictions.to_csv('./final_submission.csv', index=False)
# Display the frame last so the notebook renders it as the cell output
predictions
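
Since Survived was filled with -1 as a placeholder earlier, a quick sanity check before uploading can catch leftover placeholders. The helper `submission_problems` below is a hypothetical sketch run on a made-up three-row frame; the expected row count is passed in rather than assumed.

```python
import pandas as pd

def submission_problems(frame: pd.DataFrame, expected_rows: int) -> list:
    """Return a list of human-readable problems; an empty list means the frame looks fine."""
    problems = []
    if list(frame.columns) != ["PassengerId", "Survived"]:
        problems.append("columns must be exactly PassengerId, Survived")
        return problems  # the remaining checks need both columns
    if len(frame) != expected_rows:
        problems.append(f"expected {expected_rows} rows, found {len(frame)}")
    if not frame["PassengerId"].is_unique:
        problems.append("duplicate PassengerId values")
    if not frame["Survived"].isin([0, 1]).all():
        problems.append("Survived must contain only 0 or 1")
    return problems

# Made-up frames: one valid, one with a leftover -1 placeholder.
good = pd.DataFrame({"PassengerId": [892, 893, 894], "Survived": [0, 1, 0]})
bad = good.assign(Survived=[-1, 1, 0])

print(submission_problems(good, expected_rows=3))
print(submission_problems(bad, expected_rows=3))
```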