# Data Dictionary

| Variable | Definition | Key |
|---|---|---|
| survival | Survival | 0 = No, 1 = Yes |
| pclass | Ticket class | 1 = 1st, 2 = 2nd, 3 = 3rd |
| sex | Sex | |
| age | Age in years | |
| sibsp | # of siblings / spouses aboard the Titanic | |
| parch | # of parents / children aboard the Titanic | |
| ticket | Ticket number | |
| fare | Passenger fare | |
| cabin | Cabin number | |
| embarked | Port of Embarkation | C = Cherbourg, Q = Queenstown, S = Southampton |

## Variable Notes
pclass: A proxy for socio-economic status (SES)
- 1st = Upper
- 2nd = Middle
- 3rd = Lower

age: Age is fractional if less than 1. If the age is estimated, it is in the form xx.5.

sibsp: The dataset defines family relations as follows:
- Sibling = brother, sister, stepbrother, stepsister
- Spouse = husband, wife (mistresses and fiancés were ignored)

parch: The dataset defines family relations as follows:
- Parent = mother, father
- Child = daughter, son, stepdaughter, stepson

Some children travelled only with a nanny, therefore parch=0 for them.

# Preliminary Stuff


In [None]:
# Import some libraries we'll use, and set matplotlib to be inline
import matplotlib.pyplot as plt
%matplotlib inline

import numpy as np
import pandas as pd

In [None]:
# Now load the data into a pandas DataFrame
df = pd.read_csv('data/train.csv')

In [None]:
# In Jupyter, we can simply print a dataframe just by putting it as the last block of code in a cell
df

From quickly browsing our dataset, we can see that the Cabin column has a lot of NaN (null) values and that the Ticket values are formatted pretty inconsistently, so we'll drop both columns, which is easy to do with pandas.
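Rather than eyeballing the printed frame, the null values can also be counted per column. The snippet below is a minimal sketch on a made-up miniature frame (the `toy` name and its values are assumptions for illustration, not rows from the real file); the same two calls should work on `df`.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the Titanic frame; values are made up for illustration
toy = pd.DataFrame({
    'Age': [22.0, np.nan, 26.0, 35.0],
    'Ticket': ['A/5 21171', 'PC 17599', 'STON/O2. 3101282', '113803'],
    'Cabin': [np.nan, 'C85', np.nan, np.nan],
})

# Number of missing values in each column
null_counts = toy.isnull().sum()
print(null_counts)

# Fraction of missing values in each column, largest first
null_fraction = toy.isnull().mean().sort_values(ascending=False)
print(null_fraction)
```

A column where most entries are missing, like Cabin here, is a natural candidate to drop.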

## Bash Commands!

In [None]:
!ls

In [None]:
!pip install seaborn

# Pre-Processing

In [None]:
# Drop specific features from our dataframe
df = df.drop(['Ticket','Cabin', 'PassengerId'], axis=1)

# Remove any rows that still contain NaN values
df = df.dropna()

In [None]:
# Now let's transform the Sex field into a binary field called Male
df['Male'] = np.where(df['Sex']=='male', 1, 0)

# Now let's drop the Sex field altogether
# We can also drop Name, since we won't use it as a feature
df = df.drop(['Sex', 'Name'], axis=1)

df

In [None]:
# Let's one-hot encode the Embarked column
df = pd.concat([df, pd.get_dummies(df['Embarked'])], axis=1)

# Now drop the original embarked
df = df.drop(['Embarked'], axis=1)
df
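To see what the one-hot encoding step does in isolation, here is a small sketch on a made-up column (the `toy` frame is an assumption for illustration). It passes `dtype=int` so the result shows 0/1 rather than booleans, which is optional.

```python
import pandas as pd

# Hypothetical port codes, for illustration only
toy = pd.DataFrame({'Embarked': ['S', 'C', 'Q', 'S']})

# One column per distinct value; each row has a 1 in exactly one of them
dummies = pd.get_dummies(toy['Embarked'], dtype=int)
print(dummies)
```

Each row ends up with a single 1 in the column matching its original port code, which is why the Embarked column itself can be dropped afterwards.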

# Analysis

In [None]:
df.describe()

In [None]:
df.head()

In [None]:
df.hist(figsize=(25,25))

In [None]:
import seaborn as sns
sns.set_style('whitegrid')

In [None]:
# peaks for survived/not survived passengers by their age
facet = sns.FacetGrid(df, hue="Survived",aspect=4)
facet.map(sns.kdeplot, 'Age', fill=True)  # 'shade' is deprecated in newer seaborn; 'fill' replaces it
facet.set(xlim=(0, df['Age'].max()))
facet.add_legend()

# average survived passengers by age
fig, axis1 = plt.subplots(1,1,figsize=(20,4))
average_age = df[["Age", "Survived"]].groupby(['Age'],as_index=False).mean()
sns.barplot(x='Age', y='Survived', data=average_age)

# Training

Now it's time to split the data into a training set and a test set. Lucky for us, sklearn has a great built-in split function. We need to make sure to set the random state so the split of data is the same every time!

In [None]:
from sklearn.model_selection import train_test_split

# Our y will be the Survived column, so we'll pull it out and then drop it from our X set
y = df['Survived']
X = df.drop(['Survived'], axis=1)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
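The effect of fixing the random state can be checked directly. This is a minimal sketch on made-up arrays (`X_toy` and `y_toy` are assumptions, not the Titanic data): two calls with the same `random_state` should return identical splits.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 10 samples with 2 features each
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.arange(10)

# Split twice with the same seed
split_a = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)
split_b = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)

# Compare X_train, X_test, y_train, y_test pairwise
same = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print(same)

# With test_size=0.2, 2 of the 10 samples are held out for testing
print(len(split_a[0]), len(split_a[1]))
```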

## Linear Regression
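The first algorithm we'll try out is called Linear Regression. It serves as a great baseline. Since it predicts continuous values rather than class labels, we round its output to get a 0/1 prediction.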

In [None]:
# First we'll try out the basic LR algorithm
from sklearn.linear_model import LinearRegression
from sklearn.metrics import accuracy_score

lr_clf = LinearRegression()

# First we'll train the model on our data.
lr_clf.fit(X_train, y_train)

preds = lr_clf.predict(X_test)

# LR predicts continuous values, so we round them and clip to the valid labels 0 and 1
preds = np.clip(np.rint(preds), 0, 1).astype(int)

In [None]:
accuracy_score(y_test, preds)
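One caveat with rounding a linear regression: its output is unbounded, so a rounded prediction is not guaranteed to be 0 or 1. The sketch below uses a made-up one-feature dataset (all names and values here are assumptions for illustration) to show how this can happen and how clipping keeps the labels valid.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical training data: one feature, binary target
X_toy = np.array([[0.0], [1.0], [2.0], [3.0]])
y_toy = np.array([0, 0, 1, 1])

toy_lr = LinearRegression().fit(X_toy, y_toy)

# Predict for inputs far outside the training range
raw = toy_lr.predict(np.array([[-5.0], [1.0], [10.0]]))
print(raw)

# Plain rounding can give values other than 0 and 1
rounded = np.rint(raw)
print(rounded)

# Clipping after rounding keeps every prediction a valid label
labels = np.clip(rounded, 0, 1).astype(int)
print(labels)
```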

## Random Forests
A powerful, simple approach!

In [None]:
from sklearn.ensemble import RandomForestClassifier

rfc = RandomForestClassifier(n_jobs=-1,max_features= 'sqrt' , n_estimators=40, oob_score = True, random_state=42) 

rfc.fit(X_train, y_train)

preds = rfc.predict(X_test)



In [None]:
accuracy_score(y_test, preds)

In [None]:
# You can also view the weighted importances of the columns based on a classifier
importances = rfc.feature_importances_

for column, importance in zip(X_train.columns, importances):
    print(column, '\t', importance)

From this, we can see that the 3 most important features are the passenger's gender, fare, and age.
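Reading the ranking off an unsorted printout is error-prone, so it can help to sort the importances. This is a minimal sketch on a synthetic dataset (the `f0`..`f3` column names and the `toy_rfc` model are assumptions for illustration); the same `pd.Series(...).sort_values(...)` pattern should apply to the forest trained above.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the training data, with hypothetical column names
X_arr, y_toy = make_classification(
    n_samples=200, n_features=4, n_informative=2, n_redundant=0, random_state=42)
X_toy = pd.DataFrame(X_arr, columns=['f0', 'f1', 'f2', 'f3'])

toy_rfc = RandomForestClassifier(n_estimators=40, random_state=42)
toy_rfc.fit(X_toy, y_toy)

# Pair each importance with its column name, then sort from most to least important
ranked = pd.Series(toy_rfc.feature_importances_, index=X_toy.columns)
ranked = ranked.sort_values(ascending=False)
print(ranked)
```

The importances of a fitted forest sum to 1, so each value can be read as a share of the total.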

### Hyperparameter Tuning
Now, instead of just using the default parameters, we'll set up a grid search over multiple parameter combinations to train on, then view the top ones.

In [None]:
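# GridSearchCV lives in sklearn.model_selection (the old sklearn.grid_search module was removed)
from sklearn.model_selection import GridSearchCV
rfc = RandomForestClassifier(n_jobs=-1, max_features='sqrt', n_estimators=40, oob_score=True, random_state=42)

param_grid = {
    'n_estimators': [20, 25, 30, 40],
    # 'auto' (an alias of 'sqrt' for classifiers) was removed in recent scikit-learn; None uses all features
    'max_features': ['sqrt', 'log2', None]
}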

CV_rfc = GridSearchCV(estimator=rfc, param_grid=param_grid, cv = 5)

CV_rfc.fit(X_train, y_train)

In [None]:
print(CV_rfc.best_estimator_)

In [None]:
accuracy_score(y_test, CV_rfc.predict(X_test))
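To view the top combinations and not only the single best estimator, the fitted search object exposes a `cv_results_` dictionary. The sketch below runs a smaller grid on a synthetic dataset (the `search`, `X_toy`, and `y_toy` names and the reduced grid are assumptions for illustration) and ranks the combinations by mean cross-validated score.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the training data
X_toy, y_toy = make_classification(n_samples=200, n_features=6, random_state=42)

search = GridSearchCV(
    estimator=RandomForestClassifier(random_state=42),
    param_grid={'n_estimators': [20, 40], 'max_features': ['sqrt', 'log2']},
    cv=5,
)
search.fit(X_toy, y_toy)

# One row per parameter combination; rank 1 is the best mean CV score
results = pd.DataFrame(search.cv_results_)
top = results.sort_values('rank_test_score')[
    ['params', 'mean_test_score', 'std_test_score']]
print(top.head(3))
```

Looking at `std_test_score` alongside the mean gives a rough sense of whether the gap between the top combinations is meaningful.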

## BONUS! Would you survive?

In [None]:
df.columns

In [None]:
you = {
    'Pclass': 1, # ticket class (1, 2, or 3)
    'Age': 23,  # your age
    'SibSp': 1, # num siblings / spouses aboard
    'Parch': 2, # num parents / children aboard
    'Fare': 10, # how much you paid
    'Male': 1, # 1 if male, 0 otherwise
    'C': 0, # port embarked from, one of these should be 1 and the others 0
    'Q': 1,
    'S': 0
}
# Build a one-row DataFrame with the columns in the same order the model was trained on
you = pd.DataFrame([you])[X_train.columns]

In [None]:
# Taking our tuned random forest and predicting whether you're gonna make it.
# We use CV_rfc here because rfc was re-created for the grid search and never fitted.
if CV_rfc.predict(you)[0] == 0:
    print("Sorry, you wouldn't make it :'''''')")
else:
    print("You're gonna make it!! :')")