# Titanic Classification Exercise - First Effort

### Motivation

This is a practical exercise based on what I learned from the <a href="https://www.udemy.com/complete-guide-to-tensorflow-for-deep-learning-with-python/">Complete Guide to TensorFlow for Deep Learning in Python</a> course on Udemy.

It is also the culmination of my 10% Personal Goal at work for 2019.  My goal was to complete the course and then apply what I learned to a machine learning problem, preferrably before 7/17/19 in order to get my "5" rating.

This file is complete and was submitted to the Titanic contest on 5/24/19.  It resulted in a 0.65550 Public Score.

I'm OK with the result as a first effort using TensorFlow.  I have a lot to learn, particularly about feature engineering for TensorFlow.  My top score is 0.75660 based on a Scikit-learn LinearDiscriminantAnalysis model and better feature engineering.

# Do Not use this file again.

Lesson learned - once I have submitted an entry to Kaggle, don't change the file.

### This works with the *tfdeeplearning* conda environment on my laptop.

# TODO (in a copy of this file)

<ol>
<li><b><i>Done 5/24/19 </i></b>make a copy and clean it up; store on Github
<li>how to do cross_validation in TF?
<li>work with categorical data: Pclass and Sex
<li>try scikit-learn models instead
</ol>

### THE DATA

We'll be working with some Titanic Training Data, we'll be trying to use various features of an individual to predict whether they **survive** the sinking.

Data is from https://www.kaggle.com/c/titanic/data

Here is some information about the data:

<table>
<thead>
<tr>
<th>Column Name</th>
<th>Type</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>PassengerId</td>
<td>Continuous</td>
<td>ID of the passenger.  Artificial, **not used in the model.**</td>
</tr>
<tr>
<td>Survived</td>
<td>Categorical</td>
<td>**The label.**</td>
</tr>
<tr>
<td>Pclass</td>
<td>Categorical</td>
<td>The ticket class.  1=1st, 2=2nd, 3=3rd.</td>
</tr>
<tr>
<td>Name</td>
<td>Categorical</td>
<td>Passenger name.  **not used in the model**</td>
</tr>
<tr>
<td>Sex</td>
<td>Categorical</td>
<td>Sex</td>
</tr>
<tr>
<td>Age</td>
<td>Continuous</td>
<td>Age in years.</td>
</tr>
<tr>
<td>SibSp</td>
<td>Categorical</td>
<td># of siblings / spouses aboard the Titanic  (could be continuous?)</td>
</tr>
<tr>
<td>ParCh</td>
<td>Categorical</td>
<td># of parents / children aboard the Titanic  (could be continuous?)</td>
</tr>
<tr>
<td>Ticket</td>
<td>Categorical</td>
<td>ticket number.  **not used in model**</td>
</tr>
<tr>
<td>Fare</td>
<td>Continuous</td>
<td>The price of the ticket.</td>
</tr>
<tr>
<td>Cabin</td>
<td>Categorical</td>
<td>cabin number</td>
</tr>
<tr>
<td>Embarked</td>
<td>Categorical</td>
<td>Port of embarcation: C = Cherbourg, Q = Queenstown, S = Southampton</td>
</tr>
</tbody>
</table>

** Read in the titanic_train.csv data with pandas**

In [60]:
import pandas as pd
import numpy as np

In [61]:
titanic = pd.read_csv('data/train.csv')

In [62]:
titanic.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [63]:
titanic_only = titanic.drop('PassengerId', axis=1) \
                      .drop('Name', axis=1) \
                      .drop('Ticket', axis=1)

In [64]:
titanic_only[['Pclass', 'Age', 'SibSp', 'Parch']].corrwith(titanic_only['Survived'])

Pclass   -0.338481
Age      -0.077221
SibSp    -0.035322
Parch     0.081629
dtype: float64

In [65]:
# detect NaN values: Age, Cabin, Embarked
titanic_only.Age.isnull().values.any()

True

In [66]:
# fix NaN values
values = {'Age': 35, 'Cabin': 'unk', 'Embarked': 'S'}
titanic_only = titanic_only.fillna(value=values)

In [67]:
# spot check NaN values to verify fix: Age, Cabin, Embarked
titanic_only.Age.isnull().values.any()

False

### Perform a Train Test Split on the Data

This is left here for reference.  Kaggle provides test data so there is no need to split the training file.

In [68]:
from sklearn.model_selection import train_test_split

In [69]:
X = titanic_only.drop('Survived',axis=1)
y = titanic_only['Survived']

In [70]:
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.20, random_state=101)

### Create the Feature Columns for tf.esitmator

** Take note of categorical vs continuous values! **

In [71]:
titanic_only.columns

Index(['Survived', 'Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Cabin',
       'Embarked'],
      dtype='object')

** Import Tensorflow **

In [72]:
import tensorflow as tf

** Create the tf.feature_columns for the categorical values. Use vocabulary lists or just use hash buckets. **

In [73]:
# categorical feature columns: 'Pclass', 'Sex', 'SibSp', 'Parch', 
#'Cabin', 'Embarked'
#pclass = tf.feature_column.categorical_column_with_vocabulary_list('Pclass', ['1','2','3'])
sex = tf.feature_column.categorical_column_with_vocabulary_list('Sex', ['male','female'])
sibSp = tf.feature_column.categorical_column_with_hash_bucket('SibSp', hash_bucket_size=10)
cabin = tf.feature_column.categorical_column_with_hash_bucket('Cabin', hash_bucket_size=10)
embarked = tf.feature_column.categorical_column_with_vocabulary_list('Embarked', ['S', 'C', 'Q'])


Make Embedding Columns ref https://www.tensorflow.org/guide/feature_columns#indicator_and_embedding_columns

Given
categorical_column = ... # Create any categorical column

Represent the categorical column as an embedding column.
This means creating an embedding vector lookup table with one element for each category.

embedding_column = tf.feature_column.embedding_column(
    categorical_column=categorical_column,
    dimension=embedding_dimensions)
    

In [74]:
sex_ec = tf.feature_column.embedding_column(categorical_column=sex,dimension=1)
#sibSp_ec = tf.feature_column.embedding_column(categorical_column=sibSp,dimension=1)
#cabin_ec = tf.feature_column.embedding_column(categorical_column=cabin,dimension=4)

** Create the continuous feature_columns for the continuous values using numeric_column **

In [75]:
# continuous feature columns: age
pclass = tf.feature_column.numeric_column('Pclass')
age = tf.feature_column.numeric_column('Age')
fare = tf.feature_column.numeric_column('Fare')
parch = tf.feature_column.numeric_column('Parch')
sibSpN = tf.feature_column.numeric_column('SibSp')


** Put all these variables into a single list with the variable name feat_cols **

In [76]:
#feat_cols = [pclass, sex, sibSp, parch, cabin, embarked, age, fare]
#feat_cols = [pclass, parch, age, fare, sex_ec, sibSpN]
# this is the one that works - only numeric feature columns at this point
feat_cols = [pclass, parch, age, fare, sibSpN]
#feat_cols = [pclass, parch, age, fare]
#feat_cols = [fare]

### Create Input Function

** Batch_size (auth used 30) is up to you. But do make sure to shuffle!**

In [83]:
input_func = tf.estimator.inputs.pandas_input_fn(x=X,y=y,batch_size=30,num_epochs=1000,shuffle=True)

Unnamed: 0,Pclass,Sex,Age,SibSp,Parch,Fare,Cabin,Embarked
0,3,male,22.0,1,0,7.25,unk,S
1,1,female,38.0,1,0,71.2833,C85,C
2,3,female,26.0,0,0,7.925,unk,S
3,1,female,35.0,1,0,53.1,C123,S
4,3,male,35.0,0,0,8.05,unk,S


#### Create your model with tf.estimator

**Create a LinearClassifier.(If you want to use a DNNClassifier, keep in mind you'll need to create embedded columns out of the cateogrical feature that use strings, check out the previous lecture on this for more info.)**

In [78]:
model = tf.estimator.DNNClassifier(hidden_units=[5, 4],feature_columns=feat_cols,n_classes=2)

INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_master': '', '_service': None, '_tf_random_seed': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x11083f128>, '_device_fn': None, '_experimental_distribute': None, '_save_summary_steps': 100, '_eval_distribute': None, '_global_id_in_cluster': 0, '_is_chief': True, '_protocol': None, '_session_config': allow_soft_placement: true
graph_options {
  rewrite_options {
    meta_optimizer_iterations: ONE
  }
}
, '_log_step_count_steps': 100, '_save_checkpoints_steps': None, '_task_id': 0, '_keep_checkpoint_every_n_hours': 10000, '_train_distribute': None, '_evaluation_master': '', '_task_type': 'worker', '_keep_checkpoint_max': 5, '_num_worker_replicas': 1, '_save_checkpoints_secs': 600, '_model_dir': '/var/folders/1g/92h4zhdx7bd2999w69pr2b9h0000gn/T/tmpol10it7o', '_num_ps_replicas': 0}


** Train your model on the data, for at least 5000 steps. **

In [79]:
model.train(input_fn=input_func,steps=5000)

INFO:tensorflow:Calling model_fn.
Instructions for updating:
The old _FeatureColumn APIs are being deprecated. Please use the new FeatureColumn APIs instead.
Instructions for updating:
Use tf.cast instead.
Instructions for updating:
Use tf.cast instead.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Saving checkpoints for 0 into /var/folders/1g/92h4zhdx7bd2999w69pr2b9h0000gn/T/tmpol10it7o/model.ckpt.
INFO:tensorflow:loss = 63.838085, step = 1
INFO:tensorflow:global_step/sec: 160.138
INFO:tensorflow:loss = 18.7236, step = 101 (0.631 sec)
INFO:tensorflow:global_step/sec: 258.013
INFO:tensorflow:loss = 19.290203, step = 201 (0.385 sec)
INFO:tensorflow:global_step/sec: 256.259
INFO:tensorflow:loss = 21.02531, step = 301 (0.392 sec)
INFO:tensorflow:global_step/sec: 238.999
INFO:tensorflow:loss = 21.047491, step = 401 (0.4

<tensorflow_estimator.python.estimator.canned.dnn.DNNClassifier at 0x13845b048>

### Load test data from Kaggle.  use this instead of the train_test_split portion of the training data.

In [86]:
X_test = pd.read_csv('data/test.csv') \
                      .drop('Name', axis=1) \
                      .drop('Ticket', axis=1) \
                      .drop('PassengerId', axis=1) 
X_test.head()


Unnamed: 0,Pclass,Sex,Age,SibSp,Parch,Fare,Cabin,Embarked
0,3,male,34.5,0,0,7.8292,,Q
1,3,female,47.0,1,0,7.0,,S
2,2,male,62.0,0,0,9.6875,,Q
3,3,male,27.0,0,0,8.6625,,S
4,3,female,22.0,1,1,12.2875,,S


### Fix nulls

Fill missing data values with medians.  Assuming the upper class passengers skew the means of both Fare and Age.

In [99]:
values = {'Age': 27, 'Fare': 14.45} #'Cabin': 'unk', 'Embarked': 'S'}
X_test = titanic_only.fillna(value=values)

### Evaluation

** Create a prediction input function. Remember to only supply X_test data and keep shuffle=False. **

In [100]:
pred_fn = tf.estimator.inputs.pandas_input_fn(x=X_test,batch_size=len(X_test),shuffle=False)

** Use model.predict() and pass in your input function. This will produce a generator of predictions, which you can then transform into a list, with list() **

In [101]:
predictions = list(model.predict(input_fn=pred_fn))

INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from /var/folders/1g/92h4zhdx7bd2999w69pr2b9h0000gn/T/tmpol10it7o/model.ckpt-5000
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.


** Each item in your list will look like this: **

In [102]:
predictions[0]

{'class_ids': array([0]),
 'classes': array([b'0'], dtype=object),
 'logistic': array([0.15570652], dtype=float32),
 'logits': array([-1.6905273], dtype=float32),
 'probabilities': array([0.8442935 , 0.15570651], dtype=float32)}

** Create a list of only the class_ids key values from the prediction list of dictionaries, these are the predictions you will use to compare against the real y_test values. **

In [103]:
final_preds = []
for pred in predictions:
    final_preds.append(pred['class_ids'][0])

In [104]:
final_preds[:10]

[0, 1, 1, 1, 0, 0, 0, 0, 1, 1]

** Import classification_report from sklearn.metrics (*BillN: google if needed or LU in solution*) and then see if you can figure out how to use it to easily get a full report of your model's performance on the test data. **

In [105]:
from sklearn.metrics import classification_report

Precision and Recall definitions: https://en.wikipedia.org/wiki/Precision_and_recall

In [106]:
# this step works if we have a y_test from a train test split.  
# since we don't have that, we don't run this

# print(classification_report(y_test,final_preds))

# Save the Model

see http://shzhangji.com/blog/2018/05/14/serve-tensorflow-estimator-with-savedmodel/

In [107]:
feat_cols

[NumericColumn(key='Pclass', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None),
 NumericColumn(key='Parch', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None),
 NumericColumn(key='Age', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None),
 NumericColumn(key='Fare', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None),
 EmbeddingColumn(categorical_column=VocabularyListCategoricalColumn(key='Sex', vocabulary_list=('male', 'female'), dtype=tf.string, default_value=-1, num_oov_buckets=0), dimension=1, combiner='mean', initializer=<tensorflow.python.ops.init_ops.TruncatedNormal object at 0x13659f198>, ckpt_to_load_from=None, tensor_name_in_ckpt=None, max_norm=None, trainable=True),
 NumericColumn(key='SibSp', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None)]

In [108]:
feature_spec = tf.feature_column.make_parse_example_spec(feat_cols)

Instructions for updating:
The old _FeatureColumn APIs are being deprecated. Please use the new FeatureColumn APIs instead.
Instructions for updating:
The old _FeatureColumn APIs are being deprecated. Please use the new FeatureColumn APIs instead.


In [109]:
# Build receiver function, and export.
serving_input_receiver_fn = tf.estimator.export.build_parsing_serving_input_receiver_fn(feature_spec)
export_dir = model.export_savedmodel('export', serving_input_receiver_fn)

export_dir
#model.export_saved_model(export_dir_base=".", serving_input_receiver_fn=input_func)

INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Signatures INCLUDED in export for Classify: ['serving_default', 'classification']
INFO:tensorflow:Signatures INCLUDED in export for Regress: ['regression']
INFO:tensorflow:Signatures INCLUDED in export for Predict: ['predict']
INFO:tensorflow:Signatures INCLUDED in export for Eval: None
INFO:tensorflow:Signatures INCLUDED in export for Train: None
INFO:tensorflow:Restoring parameters from /var/folders/1g/92h4zhdx7bd2999w69pr2b9h0000gn/T/tmpol10it7o/model.ckpt-5000
INFO:tensorflow:Assets added to graph.
INFO:tensorflow:No assets to write.
INFO:tensorflow:SavedModel written to: export/temp-b'1558728262'/saved_model.pb


b'export/1558728262'

# Load the saved model and use it for prediction

Inspect the saved model from the command line using

saved_model_cli show --dir export/1553868795 --tag_set serve --signature_def serving_default

In [110]:
predict_fn = tf.contrib.predictor.from_saved_model(export_dir)

INFO:tensorflow:Restoring parameters from export/1558728262/variables/variables


In [111]:
# Test inputs represented by Pandas DataFrame.

# Not used.

# This is shown here as an alternate technique.

inputs = pd.DataFrame({
    'Pclass': 3,
    'Parch': 4,
    'Age': 20,
    'Fare': 100,
    'SibSp': 3,}, 
    index=[0]
)
inputs.dtypes

Age        int64
Fare       int64
Parch      int64
Pclass     int64
Sex       object
SibSp      int64
dtype: object

In [130]:
# https://www.datacamp.com/community/tutorials/pandas-tutorial-dataframe-python
TAB = np.array([[''      ,'PassengerId','Pclass','Parch','Age','Fare','SibSp'],
                 ['Row1' ,   123       ,   3    ,   4   ,  60 ,  30  ,   3   ],     # notional deceased
                 ['Row2' ,   124       ,   1    ,   2   ,  20 , 300  ,   0   ]])   # notional survived

data = TAB[1:,1:]
index = TAB[1:,0]
columns = TAB[0,1:]

inputs = pd.DataFrame(
    data=data,
    index=index,
    columns=columns,
    dtype='int64'
)

inputs

Unnamed: 0,PassengerId,Pclass,Parch,Age,Fare,SibSp
Row1,123,3,4,60,30,3
Row2,124,1,2,20,300,0


In [131]:
# Convert input data into serialized Example strings.
examples = []
for index, row in inputs.iterrows():
    feature = {}
    for col, value in row.iteritems():
        feature[col] = tf.train.Feature(float_list=tf.train.FloatList(value=[value]))
    example = tf.train.Example(
        features=tf.train.Features(
            feature=feature
        )
    )
    examples.append(example.SerializeToString())

In [132]:
# Make predictions.
predictions = predict_fn({'inputs': examples})
predictions  # looks like a survivor

{'classes': array([[b'0', b'1'],
        [b'0', b'1']], dtype=object),
 'scores': array([[0.8442935 , 0.15570651],
        [0.22101201, 0.778988  ]], dtype=float32)}

### Get this down to a single classification

In [133]:
y_preds = [int(x[1].round(0)) for x in predictions['scores']]
y_preds

[0, 1]

### Put this back together with the PassengerIds

In [134]:
passengerids = inputs['PassengerId'].values
passengerids
mysub = pd.DataFrame({'PassengerId': passengerids, 'Survived': y_preds})
mysub.head()

Unnamed: 0,PassengerId,Survived
0,123,0
1,124,1


### Kaggle: Load the Test Data, Predict, and Assemble the submission file

In [123]:
X_test = pd.read_csv('data/test.csv')
X_test.head()
#inputs = X_test.drop('PassengerId', axis=1) \
inputs = X_test.drop('Name', axis=1) \
                      .drop('Ticket', axis=1) \
                      .drop('Cabin', axis=1) \
                      .drop('Sex', axis=1) \
                      .drop('Embarked', axis=1)

values = {'Age': 27, 'Fare': 14.45} #'Cabin': 'unk', 'Embarked': 'S'}
X_test = inputs.fillna(value=values)

X_test.head()
X_test.dtypes

PassengerId      int64
Pclass           int64
Age            float64
SibSp            int64
Parch            int64
Fare           float64
dtype: object

In [124]:
X.dtypes

Pclass        int64
Sex          object
Age         float64
SibSp         int64
Parch         int64
Fare        float64
Cabin        object
Embarked     object
dtype: object

In [125]:
# Convert input data into serialized Example strings.
def makeExamplesFromRaw(inputs):
    examples = []
    for index, row in inputs.iterrows():
        feature = {}
        for col, value in row.iteritems():
            feature[col] = tf.train.Feature(float_list=tf.train.FloatList(value=[value]))
        example = tf.train.Example(
            features=tf.train.Features(
                feature=feature
            )
        )
        examples.append(example.SerializeToString())
    return examples

examples = makeExamplesFromRaw(X_test)

In [126]:
# Make predictions.
predictions = predict_fn({'inputs': examples})
predictions  # looks like a survivor
y_preds = [int(x[1].round(0)) for x in predictions['scores']]
np.sum(y_preds)

20

In [127]:
passengerids = inputs['PassengerId'].values
passengerids
mysub = pd.DataFrame({'PassengerId': passengerids, 'Survived': y_preds})
mysub.head()

Unnamed: 0,PassengerId,Survived
0,892,0
1,893,0
2,894,0
3,895,0
4,896,0


In [129]:
mysub.to_csv('Submissions/190521/first_cleanB.csv', index=False)

### Example Submission file

In [150]:
pd.read_csv('data/gender_submission.csv').head()

Unnamed: 0,PassengerId,Survived
0,892,0
1,893,1
2,894,0
3,895,0
4,896,1


In [166]:
pd.read_csv('Submissions/190521/first_cleanB.csv').head()

Unnamed: 0,PassengerId,Survived
0,892,0
1,893,0
2,894,0
3,895,0
4,896,0
