# Titanic with Pandas, Scikit-Learn and TensorFlow

* **Survived**: Survival (0 = no; 1 = yes)
* **Pclass**: Passenger class (1 = first; 2 = second; 3 = third)
* **Name**
* **Sex**
* **Age**
* **SibSp**: Number of siblings/spouses aboard
* **Parch**: Number of parents/children aboard
* **Ticket**: Ticket number
* **Fare**: Passenger fare
* **Cabin**
* **Embarked**: Port of embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)

## Data Exploration

First, import some useful modules:

In [87]:
%matplotlib notebook
import pandas as pd
# import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns  # use seaborn, see https://stanford.edu/~mwaskom/software/seaborn/
import re
import pylab
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in newer scikit-learn releases
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.metrics import accuracy_score
from sklearn import tree
from sklearn.naive_bayes import GaussianNB
import tensorflow as tf
import tensorflow.contrib.learn.python.learn as learn
import random

Load the training data set into a pandas data frame, then check how many records it contains and what the data looks like:

In [88]:
training_df = pd.read_csv("train.csv")
training_df.shape

(891, 12)

Show first 5 rows of the data frame:

In [89]:
training_df.head(5)

|   | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


Which of these variables could be important for predicting survival? We can safely get rid of the passenger ID (**PassengerId**), as it does nothing except give each passenger a unique identifier. Next, the **Name** variable. It's unlikely to be meaningful in terms of the chances of drowning, so we can get rid of it. Before we do, however, let us extract the titles and put them in a new column. We will use them later on.
The ticket numbers (**Ticket**) are of questionable use too, unless we feel that some numbers could have been unlucky or something of that sort. For now, we choose to ignore this variable. **Cabin** could be important if we had a map of the ship or some interpretation of what the first letter of the cabin name means (perhaps a particular location on the ship from which it was easier to reach a boat). However, we don't have that information, so we can just ignore the cabin variable.

In [90]:
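# take the text between the comma and the first period; note that the result keeps a leading space, e.g. " Mr"
training_df["Title"]=training_df["Name"].apply(lambda _: _.split(",")[1].split(".")[0])
tdf = training_df.drop(["PassengerId","Name","Ticket","Cabin"],axis=1)  # call a new smaller data frame tdf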
tdf.head(5)

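|   | Survived | Pclass | Sex | Age | SibSp | Parch | Fare | Embarked | Title |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.25 | S | Mr |
| 1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C | Mrs |
| 2 | 1 | 3 | female | 26.0 | 0 | 0 | 7.925 | S | Miss |
| 3 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1 | S | Mrs |
| 4 | 0 | 3 | male | 35.0 | 0 | 0 | 8.05 | S | Mr |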
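
One caveat about the title extraction above: splitting on `","` leaves the space that follows the comma, so the extracted value is `" Mr"` rather than `"Mr"`. The rendered table hides this. A minimal sketch of a variant that strips the whitespace is shown below; the helper name `extract_title` is hypothetical and assumes names of the form `Surname, Title. Given names`.

```python
def extract_title(name: str) -> str:
    """Return the title from a name formatted as 'Surname, Title. Given names'."""
    # Take the part after the first comma, then the part before the first
    # period, and strip surrounding whitespace so that " Mr" becomes "Mr".
    return name.split(",")[1].split(".")[0].strip()


names = [
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Futrelle, Mrs. Jacques Heath (Lily May Peel)",
]

titles = [extract_title(n) for n in names]
unstripped = [n.split(",")[1].split(".")[0] for n in names]

print(titles)      # ['Mr', 'Miss', 'Mrs']
print(unstripped)  # [' Mr', ' Miss', ' Mrs']
```

In the notebook, such a helper could be applied with `training_df["Name"].apply(extract_title)`, so that later comparisons against `"Mr"`, `"Mrs"` or `"Miss"` match as intended.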


### A few plots to see what's going on

In [105]:
plt.figure()
sns.stripplot(x="Embarked", y="Age", data=tdf, jitter=True);

So, what's interesting about this? 
* There were only a few people over 60 on the Titanic
* Not many children either
* Not many people embarked in Queenstown

In [106]:
sns.FacetGrid(training_df, col="Sex").map(plt.hist, "Age")

<seaborn.axisgrid.FacetGrid at 0x11329e4a8>

Apparently, there were more males than females on board, and most of the males were aged 20 to 40.

In [107]:
plt.figure()
sns.stripplot(x="Embarked", y="Fare",data=training_df, jitter=True);

We see that the passengers who embarked in Queenstown bought the cheapest tickets, except for two people who paid nearly 100 pounds each.

## Missing Values

What should we do about the missing values? Let's check how many of these we've got.

In [94]:
tdf.count()  # this returns the number of non-NA/null observations in each column

Survived    891
Pclass      891
Sex         891
Age         714
SibSp       891
Parch       891
Fare        891
Embarked    889
Title       891
dtype: int64
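
The complementary view, counting the missing values directly instead of the present ones, can be sketched with `isnull().sum()`. The toy frame below uses made-up values purely for illustration and is not the Titanic data.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) with gaps in Age and Embarked.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Embarked": ["S", "C", None, "S"],
    "Fare": [7.25, 71.28, 7.93, 8.05],
})

missing = toy.isnull().sum()  # number of NA/null values per column
present = toy.count()         # number of non-NA/null values per column

print(missing)
```

For every column, `missing + present` equals the number of rows, so either view carries the same information.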

All looks good except **Age**, which is missing for 177 of the 891 records, and **Embarked**, which is missing for 2. It's not obvious what we should do with **Embarked**, or whether it's important at all. However, we do need to deal with **Age**. We could set all missing ages to 0, ignore all records where the age is not available, or approximate the age from **Title**. How often do we see records with no age but some title present?

In [95]:
len(tdf[tdf["Age"].isnull() & tdf["Title"].notnull()].index)

177

Note that if we manage to approximate **Age** from **Title**, we will have 714+177=891 records with **Age**, i.e. we will fill in all the ages this way. Let's calculate the average age for each **Title**.

In [108]:
aver_age_miss = round(tdf[~tdf["Age"].isnull()  & (tdf["Title"].isin(["Miss","Ms"]))]["Age"].mean(),1)
print("average age for a Miss is {}".format(aver_age_miss))

average age for a Miss is nan


In [97]:
aver_age_mrs = round(tdf[~tdf["Age"].isnull()  & (tdf["Title"].isin(["Mrs"]))]["Age"].mean(),1)
print("average age for a Mrs is {}".format(aver_age_mrs))

average age for a Mrs is nan


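We get nan here, but not because the ages are missing: the title extraction leaves a leading space in each title (" Mrs" rather than "Mrs"), so `isin(["Mrs"])` matches no rows and the mean of an empty selection is nan. Stripping the whitespace when extracting the titles would fix this. How about Mr?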

In [98]:
aver_age_mr = round(tdf[~tdf["Age"].isnull()  & (tdf["Title"].isin(["Mr"]))]["Age"].mean(),1)
print("average age for a Mr is {}".format(aver_age_mr))

average age for a Mr is nan
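
The per-title averages computed cell by cell above can also be obtained in one pass with `groupby`. The sketch below runs on a toy frame with made-up ages and assumes the titles have already been stripped of surrounding whitespace; it is an illustration, not the notebook's result.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values); titles are assumed to be whitespace-free.
toy = pd.DataFrame({
    "Title": ["Mr", "Mr", "Mrs", "Miss", "Ms", "Mr"],
    "Age": [22.0, 35.0, 38.0, 26.0, 28.0, np.nan],
})

# Treat "Ms" as "Miss", mirroring the grouping used in the cells above.
toy["Title"] = toy["Title"].replace({"Ms": "Miss"})

# mean() skips NaN by default, so rows without an age do not affect the result.
mean_age_by_title = toy.groupby("Title")["Age"].mean().round(1)

print(mean_age_by_title)
```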


Alright. Let us insert the substitutes for the missing **Age** values. We will assign the average age for Mr to Mrs too.

In [99]:
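miss_idx = tdf["Title"].isin(["Miss","Ms"])  # rows where we have Miss or Ms
# fillna returns a new Series, so assign the result back to the data frame
tdf.loc[miss_idx,"Age"] = tdf.loc[miss_idx,"Age"].fillna(aver_age_miss)
mrs_idx = tdf["Title"].isin(["Mr","Mrs"])  # rows where we have Mr or Mrs
tdf.loc[mrs_idx,"Age"] = tdf.loc[mrs_idx,"Age"].fillna(aver_age_mr)
tdf.count()
tdf.head(5)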

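|   | Survived | Pclass | Sex | Age | SibSp | Parch | Fare | Embarked | Title |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.25 | S | Mr |
| 1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C | Mrs |
| 2 | 1 | 3 | female | 26.0 | 0 | 0 | 7.925 | S | Miss |
| 3 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1 | S | Mrs |
| 4 | 0 | 3 | male | 35.0 | 0 | 0 | 8.05 | S | Mr |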
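
Title-based imputation can also be written without building one index per title. The sketch below is a variation, not the approach of the cell above: it fills each missing age with the mean age of the passenger's own title, using `groupby(...).transform("mean")` on a toy frame with made-up values.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values); titles are assumed to be whitespace-free.
toy = pd.DataFrame({
    "Title": ["Mr", "Mr", "Mr", "Miss", "Miss"],
    "Age": [20.0, 40.0, np.nan, 10.0, np.nan],
})

# For every row, the mean age of that row's title group (NaN values are skipped).
group_mean = toy.groupby("Title")["Age"].transform("mean")

# fillna returns a new Series, so the result is assigned back to the column.
toy["Age"] = toy["Age"].fillna(group_mean)

print(toy)
```

A title whose ages are all missing would still be left as NaN under this scheme, so a fallback such as the overall mean may be needed on real data.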


## Prediction

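Split the training data into training and testing datasets. We will only use Age, SibSp (siblings), Fare and Pclass (class) for prediction. Note that the features are taken from `training_df`, with any missing values replaced by 0.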

In [100]:
X = training_df[['Age', 'SibSp', 'Fare','Pclass']].fillna(0)  # variables
y = training_df['Survived']  # outcome

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
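
To see what `test_size=0.2` means for a data set of this size, the sketch below splits placeholder arrays with the same number of rows (891) and four feature columns. The arrays are synthetic stand-ins, not the Titanic features.

```python
import numpy as np
from sklearn.model_selection import train_test_split

n_rows = 891  # same number of records as train.csv

# Placeholder features and labels (synthetic, for shape checking only).
X = np.arange(n_rows * 4).reshape(n_rows, 4)
y = np.arange(n_rows) % 2

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(X_train.shape, X_test.shape)  # (712, 4) (179, 4)
```

With 179 test rows, each individual prediction is worth 1/179 (about 0.0056) of accuracy, which is worth keeping in mind when comparing scores that differ by only a percentage point or two.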


### Logistic Regression

In [101]:
lr = LogisticRegression()
lr.fit(X_train, y_train)

print(accuracy_score(y_test, lr.predict(X_test)))

0.709497206704


### Decision Tree

In [102]:
clf = tree.DecisionTreeClassifier()
clf.fit(X_train, y_train)
print(accuracy_score(y_test, clf.predict(X_test)))



0.72625698324


### Naive Bayes

In [103]:
nbc = GaussianNB()
nbc.fit(X_train, y_train)
print(accuracy_score(y_test, nbc.predict(X_test)))

0.698324022346
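
The three scikit-learn models above share the same `fit`/`predict` pattern, so the comparison can be sketched as a single loop. The data below comes from `make_classification` and is a synthetic stand-in, so the resulting scores say nothing about the Titanic data. The `max_iter=1000` setting is an assumption added to avoid convergence warnings.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (NOT the Titanic data): 4 numeric features, binary target.
X, y = make_classification(
    n_samples=500, n_features=4, n_informative=3, n_redundant=1, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "decision tree": DecisionTreeClassifier(random_state=42),
    "naive Bayes": GaussianNB(),
}

# Fit each model on the training part and score it on the held-out part.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = accuracy_score(y_test, model.predict(X_test))

for name, score in scores.items():
    print("{}: {:.3f}".format(name, score))
```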


### TensorFlow

In [104]:
tfc = learn.LinearClassifier(n_classes=2, feature_columns=["Age", "SibSp", "Fare", "Pclass"], 
                                    optimizer=tf.train.GradientDescentOptimizer(learning_rate=0.02))
tfc.fit(X_train, y_train, batch_size=256, steps=500)

print(accuracy_score(y_test, tfc.predict(X_test)))



AttributeError: 'str' object has no attribute 'key'

The call fails, most likely because `feature_columns` is given plain strings whereas the estimator expects feature column objects (which carry a `key` attribute). As a result, no TensorFlow accuracy is available for comparison with the models above.