Notebook for
https://www.kaggle.com/c/titanic

Competition Description
===================

The sinking of the RMS Titanic is one of the most infamous shipwrecks in history.  On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships.

One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class.

In this challenge, we ask you to complete the analysis of what sorts of people were likely to survive. In particular, we ask you to apply the tools of machine learning to predict which passengers survived the tragedy.

Goal
====

It is your job to predict if a passenger survived the sinking of the Titanic or not.
For each PassengerId in the test set, you must predict a 0 or 1 value for the Survived variable.

Metric
======

Your score is the percentage of passengers you correctly predict. This is known simply as "accuracy".
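
As a minimal sketch of the metric described above, accuracy is the number of matching labels divided by the total number of labels. The function name `accuracy` and the five labels below are illustrative assumptions, not part of the competition material.

```python
def accuracy(predicted, actual):
    """Fraction of positions where the predicted label equals the true label."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    correct = sum(1 for p, a in zip(predicted, actual) if p == a)
    return correct / len(actual)


# Hypothetical labels for five passengers (1 = survived, 0 = deceased)
actual = [0, 1, 1, 1, 0]
predicted = [0, 1, 0, 1, 0]

print(accuracy(predicted, actual))  # 4 of 5 correct -> 0.8
```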

Submission File Format
======================

You should submit a csv file with exactly 418 entries plus a header row. Your submission will show an error if you have extra columns (beyond PassengerId and Survived) or rows.

The file should have exactly 2 columns:

* PassengerId (sorted in any order)
* Survived (contains your binary predictions: 1 for survived, 0 for deceased)
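
The format rules above can be checked locally before uploading. The following is only a sketch: the helper name `check_submission` is hypothetical, and it assumes the header columns appear in the order PassengerId, Survived.

```python
import csv
import io


def check_submission(csv_text, expected_rows=418):
    """Return a list of format problems; an empty list means none were found."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    if not rows:
        return ["file is empty"]
    problems = []
    header, body = rows[0], rows[1:]
    if header != ["PassengerId", "Survived"]:
        problems.append("header must be exactly PassengerId,Survived")
    if len(body) != expected_rows:
        problems.append("expected {} rows, found {}".format(expected_rows, len(body)))
    for line_number, row in enumerate(body, start=2):
        if len(row) != 2 or row[1] not in ("0", "1"):
            problems.append("line {}: expected an id and a 0/1 prediction".format(line_number))
    return problems


# Tiny hypothetical file with two rows, checked against expected_rows=2
sample = "PassengerId,Survived\n892,0\n893,1\n"
print(check_submission(sample, expected_rows=2))  # []
```

For a real submission the default `expected_rows=418` would apply, with the text read from the generated csv file.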


In [1]:
from collections import namedtuple

import numpy as np
import pandas as pd
import tensorflow as tf

In [2]:
titanic_dataset = pd.read_csv('train.csv')
titanic_dataset.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


Prepare the data
========

* One-hot encode the Sex and Embarked variables
* Remove the features that are not needed

In [3]:
dummies = pd.get_dummies(titanic_dataset[['Sex', 'Embarked']])
titanic_dataset = pd.concat([titanic_dataset, dummies], axis=1)
titanic_dataset = titanic_dataset.drop(
    ['PassengerId', 'Name', 'Sex', 'Ticket', 'Cabin', 'Embarked'], axis=1)

titanic_dataset.head()

Unnamed: 0,Survived,Pclass,Age,SibSp,Parch,Fare,Sex_female,Sex_male,Embarked_C,Embarked_Q,Embarked_S
0,0,3,22.0,1,0,7.25,0,1,0,0,1
1,1,1,38.0,1,0,71.2833,1,0,1,0,0
2,1,3,26.0,0,0,7.925,1,0,0,0,1
3,1,1,35.0,1,0,53.1,1,0,0,0,1
4,0,3,35.0,0,0,8.05,0,1,0,0,1


Convert dataset to numpy array
==============

* Move the target labels to the last column
* Convert the dataset to a NumPy array

In [4]:
titanic_dataset = titanic_dataset[[c for c in sorted(titanic_dataset) if c != 'Survived'] + ['Survived']]
titanic_dataset.head()

Unnamed: 0,Age,Embarked_C,Embarked_Q,Embarked_S,Fare,Parch,Pclass,Sex_female,Sex_male,SibSp,Survived
0,22.0,0,0,1,7.25,0,3,0,1,1,0
1,38.0,1,0,0,71.2833,0,1,1,0,1,1
2,26.0,0,0,1,7.925,0,3,1,0,0,1
3,35.0,0,0,1,53.1,0,1,1,0,1,1
4,35.0,0,0,1,8.05,0,3,0,1,0,0


In [5]:
columns_max_min = {}

# Convert to numpy array
titanic_np = np.array(titanic_dataset)

# Replace NaN values with -1 (the rows are kept, not removed)
titanic_np[np.isnan(titanic_np)] = -1
rows, columns = titanic_np.shape

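# Scale each feature column; the last column holds the labels and is skipped
for i in range(columns - 1):
    column = titanic_np[:, i]
    _min, _max = column.min(), column.max()

    # Store max and min values so the test set can be scaled the same way
    columns_max_min[i] = {'max': _max, 'min': _min}

    # Note: this divides by max, not by (max - min) as standard min-max scaling does
    titanic_np[:, i] = (column - _min) / _max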
    
titanic_np.shape

(891, 11)
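
The cell above subtracts the column minimum and then divides by the column maximum. Conventional min-max scaling divides by the range (max - min) instead, so that training values land in [0, 1]. The sketch below shows that variant for comparison. It is not the notebook's method, and the helper names `fit_min_max` and `apply_min_max` as well as the sample values are assumptions made for illustration.

```python
import numpy as np


def fit_min_max(features):
    """Return per-column (min, max) computed on the training features."""
    return features.min(axis=0), features.max(axis=0)


def apply_min_max(features, col_min, col_max):
    """Scale columns using statistics from fit_min_max."""
    span = col_max - col_min
    # Avoid division by zero for constant columns
    span = np.where(span == 0, 1.0, span)
    return (features - col_min) / span


# Hypothetical Age and Fare values; NaN stands for a missing age
train = np.array([[22.0, 7.25], [38.0, 71.2833], [np.nan, 8.05]])
train[np.isnan(train)] = -1  # same missing-value placeholder as the notebook

col_min, col_max = fit_min_max(train)
scaled = apply_min_max(train, col_min, col_max)
print(scaled.min(axis=0), scaled.max(axis=0))  # [0. 0.] [1. 1.]
```

Test data scaled with the training statistics can still fall slightly outside [0, 1] when it contains values beyond the training range.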

Prepare the model
=========

In [6]:
def get_placeholders():
    """
    Return a tuple with the model placeholders.

    Returns (inputs, labels, keep_prob) =>
        inputs: the input batches
        labels: the true outputs for the given inputs
        keep_prob: the probability of keeping each unit's output during dropout
    """
    inputs = tf.placeholder(tf.float32, shape=(None, columns - 1), name='inputs')
    labels = tf.placeholder(tf.int32, shape=(None, 1), name='labels')
    keep_prob = tf.placeholder(tf.float32, name='keep_prob')
    
    return inputs, labels, keep_prob

def fully_connected(input_tensor, output_dim, activation, name):
    """
    Return a fully connected layer.
        input_tensor => The input for this layer.
        output_dim => Number of units in this layer.
        activation => Activation function for this layer, or None.
        name => Name of the layer.

        Returns a tensor representing the fully connected layer output.
    """
    input_dim = input_tensor.get_shape()[-1].value
    with tf.variable_scope(name):
        w = tf.Variable(tf.truncated_normal(shape=(input_dim, output_dim),
                                            stddev=1 / np.sqrt(input_dim)),
                        name='weights')
        b = tf.Variable(tf.zeros(output_dim), name='bias')

        # Affine transform: input_tensor @ w + b
        output = tf.nn.bias_add(tf.matmul(input_tensor, w), b, name='logits')

        tf.summary.histogram('{}_weights'.format(name), w)
        tf.summary.histogram('{}_bias'.format(name), b)
        
        if activation is not None:
            return activation(output, name='output')
        
        return output
    
    
def build_model(reuse=False):
    """
    Build a NN model based only on fully connected layers.

    Returns a namedtuple with the following keys:
        inputs => placeholder representing input batches
        labels => placeholder holding expected values
        keep_prob => Probability of keeping the output of each hidden unit
        output => The model output
    """
    with tf.variable_scope('model', reuse=reuse):
        inputs, labels, keep_prob = get_placeholders()

        layer1 = fully_connected(inputs, 128, tf.nn.relu, 'layer1')
        layer1 = tf.nn.dropout(layer1, keep_prob)

        layer2 = fully_connected(layer1, 64, tf.nn.relu, 'layer2')
        layer2 = tf.nn.dropout(layer2, keep_prob)

        layer3 = fully_connected(layer2, 32, tf.nn.relu, 'layer3')
        layer3 = tf.nn.dropout(layer3, keep_prob)

        output = fully_connected(layer3, 1, tf.nn.sigmoid, 'output')

        Model = namedtuple('Model', ['inputs', 'labels', 'keep_prob', 'output'])

        return Model(inputs=inputs, labels=labels, keep_prob=keep_prob, output=output)


def get_loss(output, labels):
    """
    Return the MSE between the model output and the expected labels.

    output => Model output
    labels => Expected values.
    """
    with tf.variable_scope('loss'):
        loss = tf.losses.mean_squared_error(labels=labels, predictions=output)
        tf.summary.scalar('loss', loss)
        return loss


def get_optimizer(loss, learning_rate):
    """
    Return an Adam training op that minimizes the loss.
    
    loss => tensor representing a cost function of a model
    learning_rate => A tensor or a float holding learning_rate hyperparameter.
    """
    with tf.variable_scope('optimizer'):
        with tf.control_dependencies(tf.get_collection(tf.GraphKeys.UPDATE_OPS)):
            return tf.train.AdamOptimizer(learning_rate).minimize(loss)


def get_accuracy(output, labels):
    """
    Return the accuracy given outputs and labels
    """
    with tf.variable_scope('accuracy'):
        acc = tf.reduce_mean(tf.cast(tf.equal(tf.round(output), tf.cast(labels, tf.float32)), tf.float32))
        tf.summary.scalar('accuracy', acc)
        return acc
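
The `keep_prob` placeholder controls dropout in the hidden layers. In TensorFlow 1.x, `tf.nn.dropout(x, keep_prob)` keeps each element with probability `keep_prob` and scales the kept elements by `1 / keep_prob`, so the expected activation is unchanged. The NumPy sketch below imitates that behaviour. The function `dropout` and the all-ones input are illustrative assumptions, not code from the model.

```python
import numpy as np


def dropout(activations, keep_prob, rng):
    """Zero each unit with probability 1 - keep_prob and rescale the rest."""
    mask = rng.random_sample(activations.shape) < keep_prob
    return activations * mask / keep_prob


rng = np.random.RandomState(0)
activations = np.ones((1000, 128))

dropped = dropout(activations, keep_prob=0.8, rng=rng)
print(np.unique(dropped))        # only 0.0 and 1 / 0.8 = 1.25 appear
print(round(dropped.mean(), 2))  # close to 1.0, the mean before dropout

# keep_prob = 1 (used for validation and prediction) leaves the input unchanged
assert np.array_equal(dropout(activations, 1.0, rng), activations)
```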

Get batches
======

In [7]:
def split_validation(dataset, validation_split = 0.05):
    """
    Split into training and validation sets.
        dataset => dataset to split
        validation_split => Fraction of the rows, taken from the end, used for validation

    Returns a namedtuple with the following keys:
        training_x => Inputs used for training
        training_y => Expected outputs for training_x
        validation_x => Inputs not used for training
        validation_y => Expected outputs for validation_x
    """
    validation_start = int(len(dataset) * validation_split)
    training_x, training_y = dataset[:-validation_start, :-1], dataset[:-validation_start, -1]
    validation_x, validation_y = dataset[-validation_start:, :-1], dataset[-validation_start:, -1]
    
    Batch = namedtuple('Batch', ['training_x', 'training_y', 'validation_x', 'validation_y'])
    
    return Batch(training_x=training_x, training_y=training_y, validation_x=validation_x, validation_y=validation_y)


def get_batches(x, y, batch_size):
    """
    Generator which yields one batch at a time.

    x => Inputs
    y => Expected outputs (labels)
    batch_size => Number of elements in a batch during the training phase

    Yields (x_batch, y_batch) tuples with at most batch_size rows each;
    y_batch is reshaped to a column vector.
    """

    assert len(x) == len(y)
    
    for i in range(0, len(x), batch_size):
        yield x[i:i+batch_size], y[i:i+batch_size].reshape(-1, 1)
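
One edge case in `split_validation` is worth noting: when `int(len(dataset) * validation_split)` evaluates to 0, the slice `dataset[:-0]` is empty, because `-0` equals `0`. With 891 rows and a split of 0.05 this does not occur. The sketch below shows one possible way to guard against it on small datasets by slicing from a positive index. The name `split_rows` and the ten-row array are hypothetical.

```python
import numpy as np


def split_rows(dataset, validation_split=0.05):
    """Split rows into (training, validation), keeping at least one validation row."""
    n_validation = max(1, int(len(dataset) * validation_split))
    split_at = len(dataset) - n_validation
    return dataset[:split_at], dataset[split_at:]


# Ten hypothetical rows: int(10 * 0.05) is 0, and data[:-0] would be empty
data = np.arange(20).reshape(10, 2)
print(len(data[:-0]))  # 0

training, validation = split_rows(data, validation_split=0.05)
print(len(training), len(validation))  # 9 1
```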

In [8]:
# Hyperparameters
learning_rate = 0.001
batch_size = 16
epochs = 20
keep_prob = 0.8

print_every_steps=100

In [9]:
# Reset the default graph
tf.reset_default_graph()

# Get model, loss and optimizer
model = build_model()
loss = get_loss(model.output, model.labels)
optimizer = get_optimizer(loss, learning_rate)
accuracy = get_accuracy(model.output, model.labels)

# Get variable summarizer
merged = tf.summary.merge_all()

# Split into training and validation sets
dataset = split_validation(titanic_np, validation_split=0.05)

In [10]:
steps = 0
validation_steps = 0
saver = tf.train.Saver()
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    train_writer = tf.summary.FileWriter('logs', sess.graph)
    
    training_loss = []
    training_accuracy = []

    for epoch in range(epochs):
        for batch_x, batch_y in get_batches(dataset.training_x, dataset.training_y, batch_size):
            # Run training step
            feed_dict = {
                model.inputs: batch_x,
                model.labels: batch_y,
                model.keep_prob: keep_prob}
            summary, loss_val, acc_val, _ = sess.run([merged, loss, accuracy, optimizer], feed_dict=feed_dict)
            training_loss.append(loss_val)
            training_accuracy.append(acc_val)

            # Write tensorflow summary for tensorboard
            train_writer.add_summary(summary, steps)
            steps += 1

            if steps % print_every_steps == 0:
                validation_loss = []
                validation_accuracy = []

                for val_x, val_y in get_batches(dataset.validation_x, dataset.validation_y, batch_size):
                    # Run a validation step
                    feed_dict = {
                        model.inputs: val_x,
                        model.labels: val_y,
                        model.keep_prob: 1}
                    summary, loss_val, acc_val = sess.run([merged, loss, accuracy], feed_dict=feed_dict)
                    validation_loss.append(loss_val)
                    validation_accuracy.append(acc_val)

                    validation_steps += 1

                print('Epoch {:2}/{:02}. train. loss: {:01.4f} train. acc: {:01.4f} val loss: {:01.4f} val. acc: {:01.4f}'.format(
                    epoch, epochs,
                    np.mean(training_loss), np.mean(training_accuracy),
                    np.mean(validation_loss), np.mean(validation_accuracy)))
    saver.save(sess, 'model/titanic')

Epoch  1/20. train. loss: 0.1853 train. acc: 0.7462 val loss: 0.1284 val. acc: 0.8056
Epoch  3/20. train. loss: 0.1689 train. acc: 0.7733 val loss: 0.1127 val. acc: 0.8333
Epoch  5/20. train. loss: 0.1621 train. acc: 0.7830 val loss: 0.1064 val. acc: 0.8611
Epoch  7/20. train. loss: 0.1584 train. acc: 0.7881 val loss: 0.1035 val. acc: 0.8681
Epoch  9/20. train. loss: 0.1553 train. acc: 0.7917 val loss: 0.1022 val. acc: 0.8681
Epoch 11/20. train. loss: 0.1529 train. acc: 0.7941 val loss: 0.1011 val. acc: 0.8681
Epoch 13/20. train. loss: 0.1508 train. acc: 0.7975 val loss: 0.1056 val. acc: 0.8333
Epoch 15/20. train. loss: 0.1495 train. acc: 0.7990 val loss: 0.1049 val. acc: 0.8889
Epoch 16/20. train. loss: 0.1482 train. acc: 0.8003 val loss: 0.1024 val. acc: 0.8681
Epoch 18/20. train. loss: 0.1474 train. acc: 0.8017 val loss: 0.1016 val. acc: 0.8681


Prepare test data
========

In [11]:
titanic_test = pd.read_csv('test.csv')

dummies = pd.get_dummies(titanic_test[['Sex', 'Embarked']])
titanic_test = pd.concat([titanic_test, dummies], axis=1)
titanic_test = titanic_test.drop(
    ['Name', 'Sex', 'Ticket', 'Cabin', 'Embarked'], axis=1)

titanic_test = titanic_test[[c for c in sorted(titanic_test) if c != 'PassengerId'] + ['PassengerId']]

# Convert to numpy array
titanic_test_np = np.array(titanic_test)

# Remove NaN values
titanic_test_np[np.isnan(titanic_test_np)] = -1
rows, columns = titanic_test_np.shape

# Normalize each column
for i in range(columns - 1):
    column = titanic_test_np[:, i]
    _min, _max = columns_max_min[i]['min'], columns_max_min[i]['max']
    titanic_test_np[:, i] = (column - _min) / _max

titanic_test_np.shape

(418, 11)

Predict
===

In [12]:
with open('submission.csv', 'w') as fd:
    fd.write('PassengerId,Survived\n')
    with tf.Session() as sess:
        saver.restore(sess, 'model/titanic')
        pid, x = titanic_test_np[:, -1], titanic_test_np[:, :-1]
        for i in range(len(titanic_test_np)):

            feed_dict = {
                model.inputs: x[i].reshape(1, -1),
                model.keep_prob: 1}

            out_val = sess.run(model.output, feed_dict=feed_dict)
            fd.write('{},{}\n'.format(int(pid[i]), int(np.round(out_val)[0][0])))

INFO:tensorflow:Restoring parameters from model/titanic
