# Using a TensorFlow DNNClassifier to classify the Titanic dataset

My focus here is simply to show a basic approach to building a deep neural network classifier with Google's open-source TensorFlow library.

The TensorFlow team developed the Estimator API to make the library more accessible to the everyday developer. This high-level API provides a common interface to train(...) models, evaluate(...) them, and predict(...) outcomes for unseen cases. It is similar to (and influenced by) the popular scikit-learn library, which likewise exposes one common interface across many algorithms.
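
To make the idea of a common interface concrete, here is a minimal pure-Python sketch. `MajorityClassEstimator` and the toy data are hypothetical and are not part of TensorFlow or scikit-learn; the class only mirrors the train/evaluate/predict shape described above.

```python
from collections import Counter


class MajorityClassEstimator:
    """Hypothetical toy estimator exposing train/evaluate/predict."""

    def __init__(self):
        self.majority_ = None

    def train(self, features, labels):
        # "Training" just remembers the most frequent label.
        self.majority_ = Counter(labels).most_common(1)[0][0]
        return self  # returning self allows call chaining

    def predict(self, features):
        # Predict the remembered label for every row.
        return [self.majority_ for _ in features]

    def evaluate(self, features, labels):
        # Compare predictions with the true labels and report accuracy.
        predictions = self.predict(features)
        correct = sum(p == y for p, y in zip(predictions, labels))
        return {"accuracy": correct / len(labels)}


# Made-up rows and labels, only for illustration.
toy_features = [[3, 1], [1, 0], [3, 0], [1, 0], [3, 1]]
toy_labels = [0, 1, 1, 1, 0]

model = MajorityClassEstimator().train(toy_features, toy_labels)
toy_metrics = model.evaluate(toy_features, toy_labels)
print(toy_metrics)
```

Any model that offers the same three methods can be swapped in without changing the calling code, which is the point of a shared interface.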

### Load the data after feature engineering and cleaning

In [1]:
import pandas as pd
import numpy as np

I did the feature engineering and cleaning in a separate step. For more details, see [Titanic Best Working Classifier](https://www.kaggle.com/sinakhorami/titanic-best-working-classifier) by Sina.

In [2]:
train = pd.read_csv('./data/train-ready.csv')
test = pd.read_csv('./data/test-ready.csv')

In [3]:
train.head(5)

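|   | Survived | Pclass | Sex | Age | Fare | Embarked | IsAlone | Title |
|---|----------|--------|-----|-----|------|----------|---------|-------|
| 0 | 0 | 3 | 1 | 1 | 0 | 0 | 0 | 1 |
| 1 | 1 | 1 | 0 | 2 | 3 | 1 | 0 | 3 |
| 2 | 1 | 3 | 0 | 1 | 1 | 0 | 1 | 2 |
| 3 | 1 | 1 | 0 | 2 | 3 | 0 | 0 | 3 |
| 4 | 0 | 3 | 1 | 2 | 1 | 0 | 1 | 1 |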


In [4]:
len(train)

891

### DNNClassifier using TensorFlow: a basic approach

Helper functions

In [5]:
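def train_input_fn(features, labels, batch_size):
    """An input function for training"""
    # Convert the DataFrame and labels into a Dataset of (feature dict, label) pairs.
    dataset = tf.data.Dataset.from_tensor_slices((dict(features), labels))
    # Shuffle, repeat indefinitely, and batch. A buffer of 10 only shuffles
    # locally; a buffer as large as the training set shuffles more thoroughly.
    dataset = dataset.shuffle(10).repeat().batch(batch_size)
    return dataset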
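
The effect of the shuffle buffer size can be illustrated without TensorFlow. The sketch below is a simplified pure-Python approximation of a fixed-size shuffle buffer, not TensorFlow's implementation; `buffered_shuffle` is a hypothetical helper written only for this illustration.

```python
import random


def buffered_shuffle(items, buffer_size, rng):
    """Yield items in an order produced by a fixed-size shuffle buffer."""
    buffer = []
    for item in items:
        if len(buffer) < buffer_size:
            buffer.append(item)  # still filling the buffer
            continue
        slot = rng.randrange(buffer_size)
        yield buffer[slot]       # emit a random buffered element
        buffer[slot] = item      # refill the slot with the new element
    rng.shuffle(buffer)          # drain what is left at the end
    yield from buffer


rng = random.Random(0)
shuffled = list(buffered_shuffle(range(100), buffer_size=10, rng=rng))

# The t-th output can only come from the first t + buffer_size inputs,
# so with a small buffer no element is emitted much earlier than its
# original position.
max_forward_jump = max(value - position for position, value in enumerate(shuffled))
print(max_forward_jump)
```

With a buffer of 10, no element can appear more than 9 positions earlier than where it started, so rows that are far apart in the file never swap order freely. This is less of a concern if the rows are already in random order.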

In [6]:
def eval_input_fn(features, labels, batch_size):
    """An input function for evaluation or prediction"""
    features = dict(features)
    if labels is None:
        inputs = features
    else:
        inputs = (features, labels)

    dataset = tf.data.Dataset.from_tensor_slices(inputs)
    
    assert batch_size is not None, "batch_size must not be None"
    dataset = dataset.batch(batch_size)

    return dataset

#### Split the data (using scikit-learn)

In [7]:
from sklearn.model_selection import train_test_split  

In [8]:
y = train.pop('Survived')
X = train

In [9]:
# Hold out 20% of the rows for evaluation
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.2, random_state=23)
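
As a quick sanity check on the split sizes, the arithmetic can be worked out by hand. This sketch assumes the held-out count is rounded up and the remainder is used for training; the variable names are illustrative only.

```python
import math

n_rows = 891       # len(train) reported earlier
test_size = 0.2

# Assumption: the held-out count is rounded up, the rest goes to training.
n_eval = math.ceil(test_size * n_rows)
n_train = n_rows - n_eval

print(n_train, n_eval)  # 712 179
```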

#### Create the model

In [10]:
import tensorflow as tf

In [11]:
feature_columns = []

for key in X_train.keys():
    feature_columns.append(tf.feature_column.numeric_column(key=key))

In [12]:
feature_columns

[_NumericColumn(key='Pclass', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None),
 _NumericColumn(key='Sex', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None),
 _NumericColumn(key='Age', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None),
 _NumericColumn(key='Fare', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None),
 _NumericColumn(key='Embarked', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None),
 _NumericColumn(key='IsAlone', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None),
 _NumericColumn(key='Title', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None)]

The network has two hidden layers of 10 nodes each, and the model must choose between 2 classes.

In [13]:
classifier = tf.estimator.DNNClassifier(
    feature_columns=feature_columns,
    hidden_units=[10, 10],
    n_classes=2)

INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_model_dir': '/var/folders/y7/1vddc8q51zq_q_dc0gw581480000gn/T/tmpmyzztixq', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': None, '_save_checkpoints_secs': 600, '_session_config': None, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 100, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x1033b92b0>, '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}
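
To get a feel for how small this network is, the number of trainable parameters can be counted by hand. The sketch below assumes plain fully connected layers and a single output logit for the two-class case; both are assumptions about the model's internals, and `dense_param_count` is a hypothetical helper.

```python
def dense_param_count(layer_sizes):
    """Count weights and biases of a fully connected network."""
    total = 0
    for fan_in, fan_out in zip(layer_sizes, layer_sizes[1:]):
        total += fan_in * fan_out + fan_out  # weight matrix + bias vector
    return total


n_features = 7            # Pclass, Sex, Age, Fare, Embarked, IsAlone, Title
hidden_units = [10, 10]   # same value passed to the classifier
n_logits = 1              # assumption: one logit for a two-class problem

layer_sizes = [n_features] + hidden_units + [n_logits]
n_params = dense_param_count(layer_sizes)
print(n_params)  # 201
```

Under these assumptions the layers contribute 80, 110, and 11 parameters respectively.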


#### Train the Model

In [31]:
batch_size = 100
train_steps = 1000

classifier.train(
    input_fn=lambda: train_input_fn(X_train, y_train, batch_size),
    steps=train_steps)

INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from /var/folders/y7/1vddc8q51zq_q_dc0gw581480000gn/T/tmpmyzztixq/model.ckpt-1100
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Saving checkpoints for 1101 into /var/folders/y7/1vddc8q51zq_q_dc0gw581480000gn/T/tmpmyzztixq/model.ckpt.
INFO:tensorflow:loss = 32.845863, step = 1101
INFO:tensorflow:global_step/sec: 454.841
INFO:tensorflow:loss = 34.02691, step = 1201 (0.221 sec)
INFO:tensorflow:global_step/sec: 651.907
INFO:tensorflow:loss = 33.209526, step = 1301 (0.154 sec)
INFO:tensorflow:global_step/sec: 653.724
INFO:tensorflow:loss = 36.49285, step = 1401 (0.153 sec)
INFO:tensorflow:global_step/sec: 615.79
INFO:tensorflow:loss = 37.605873, step = 1501 (0.162 sec)
INFO:tensorflow:global_step/sec: 635.299
INFO:tensorflow:loss = 38.736504, step = 1

<tensorflow.python.estimator.canned.dnn.DNNClassifier at 0x111fc2fd0>

#### Evaluate the model

In [32]:
eval_result = classifier.evaluate(
    input_fn=lambda: eval_input_fn(X_tmp, y_tmp, batch_size)
)

INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Starting evaluation at 2018-03-19-02:00:39
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from /var/folders/y7/1vddc8q51zq_q_dc0gw581480000gn/T/tmpmyzztixq/model.ckpt-2100
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Finished evaluation at 2018-03-19-02:00:39
INFO:tensorflow:Saving dict for global step 2100: accuracy = 0.7932961, accuracy_baseline = 0.6424581, auc = 0.8495244, auc_precision_recall = 0.82188904, average_loss = 0.4385036, global_step = 2100, label/mean = 0.3575419, loss = 39.24607, prediction/mean = 0.35008296
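
Some of the reported metrics can be cross-checked with simple arithmetic. The sketch below assumes the evaluation split holds 179 rows (20% of 891, rounded up), that `accuracy_baseline` is the share of the majority class, and that `loss` is the mean of the per-batch summed losses. The logged numbers are consistent with these assumptions, but they are an interpretation rather than something the log states.

```python
n_eval = 179              # assumption: 20% of 891 rows, rounded up
batch_size = 100
label_mean = 0.3575419    # "label/mean" from the log above
average_loss = 0.4385036  # "average_loss" from the log above

# Share of the majority class: what a constant predictor would score.
accuracy_baseline = max(label_mean, 1.0 - label_mean)

# Number of evaluation batches: 100 rows, then the remaining 79.
n_batches = -(-n_eval // batch_size)  # ceiling division

# Mean of the per-batch summed losses.
loss_per_batch = average_loss * n_eval / n_batches

print(round(accuracy_baseline, 7), round(loss_per_batch, 5))
```

Read this way, the model's accuracy of about 0.79 should be compared with a baseline of about 0.64 obtained by always predicting "not survived".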


#### Generate predictions from the model

In [33]:
predictions = classifier.predict(
    input_fn=lambda: eval_input_fn(test, labels=None, batch_size=batch_size))

In [34]:
results = list(predictions)

def describe_prediction(res, j):
    """Describe the j-th prediction as a readable string."""
    class_id = res[j]['class_ids'][0]
    # Index into `res`, the argument, rather than a global variable.
    probability = int(res[j]['probabilities'][class_id] * 100)

    if int(class_id) == 0:
        return '%s%% probability to %s' % (probability, 'Not survive')
    else:
        return '%s%% probability to %s' % (probability, 'Survive!')

print('Predictions for the first 10 records of the test dataset:')

for i in range(10):
    print(describe_prediction(results, i))

INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from /var/folders/y7/1vddc8q51zq_q_dc0gw581480000gn/T/tmpmyzztixq/model.ckpt-2100
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
Predictions for the first 10 records of the test dataset:
94% probability to Not survive
52% probability to Not survive
90% probability to Not survive
87% probability to Not survive
55% probability to Survive!
86% probability to Not survive
80% probability to Survive!
90% probability to Not survive
81% probability to Survive!
94% probability to Not survive


#### Generate the CSV to submit

In [35]:
len(results)

418

In [36]:
train.tail(1)

|     | Pclass | Sex | Age | Fare | Embarked | IsAlone | Title |
|-----|--------|-----|-----|------|----------|---------|-------|
| 890 | 3 | 1 | 1 | 0 | 2 | 1 | 1 |


In [37]:
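# The training set ends at PassengerId 891, so the test set starts at 892.
passengers = {}
for passenger_id, prediction in enumerate(results, start=892):
    passengers[passenger_id] = int(prediction['class_ids'][0])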

In [38]:
import csv

csvfile = './submissions.csv'
# newline='' lets the csv module control line endings (avoids blank rows on Windows).
with open(csvfile, 'w', newline='') as f:
    outcsv = csv.writer(f, delimiter=',')
    header = ['PassengerId', 'Survived']
    outcsv.writerow(header)
    for k, v in passengers.items():
        outcsv.writerow([k, v])
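
The same submission format can be produced in a single pass, without the intermediate dictionary. This is a minimal sketch; `write_submission` and its `first_passenger_id` parameter are hypothetical names, and it writes to an in-memory buffer so that it runs without touching the file system.

```python
import csv
import io


def write_submission(class_ids, out, first_passenger_id=892):
    """Write PassengerId/Survived rows for consecutive passenger IDs."""
    writer = csv.writer(out)
    writer.writerow(["PassengerId", "Survived"])
    for passenger_id, class_id in enumerate(class_ids, start=first_passenger_id):
        writer.writerow([passenger_id, int(class_id)])


buffer = io.StringIO()
# The first five predicted classes reported in this notebook.
write_submission([0, 0, 0, 0, 1], buffer)
submission_text = buffer.getvalue()
print(submission_text)
```

Passing an open file object instead of the `io.StringIO` buffer would write the same rows to disk.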

In [39]:
submissions = pd.read_csv(csvfile)
submissions.head(5)

|   | PassengerId | Survived |
|---|-------------|----------|
| 0 | 892 | 0 |
| 1 | 893 | 0 |
| 2 | 894 | 0 |
| 3 | 895 | 0 |
| 4 | 896 | 1 |
