# Kaggle with TensorFlow

TensorFlow is a numerical library from Google that is popular for machine learning. Many people think that starting with TensorFlow is very difficult. However, you can build models quickly with the Estimator API, a high-level interface for building TensorFlow models. This API is useful for the following:

<ul>

<li> Training a model </li>
<li> Evaluating a model </li>
<li> Creating a series of predictions with that trained model </li>

</ul>

<p>This notebook will cover how to use the Estimator API to solve the <a href ="https://www.kaggle.com/c/titanic"> Titanic Kaggle Challenge. </a></p>

The Titanic Challenge is one of the most common challenges for Machine Learning beginners. The goal of this challenge is to predict who survived the Titanic. 

First I am going to train my model on the training data provided on the <a href="https://www.kaggle.com/c/titanic/data">Titanic data page</a>.

Then I test my model using the test data on the same webpage. 

The goal of this notebook is to show how you can use the Estimator API from Tensorflow to solve machine learning problems. Since I do not come from a technical background I wanted to show how easy it is to use Tensorflow with a bit of practice. 

And a shout out to Jose Portilla for his great course on TensorFlow: <a href="https://www.udemy.com/complete-guide-to-tensorflow-for-deep-learning-with-python">Jose Portilla Complete Guide to Tensorflow for Deeplearning with Python</a>. This course really helped me understand TensorFlow.

<h2> Importing libraries </h2>

First I import the necessary libraries: Pandas and NumPy for working with the data and building the training dataset, and then TensorFlow itself.

In [172]:
import pandas as pd
import numpy as np
import tensorflow as tf

<h2>Import the dataset</h2>
Using Pandas, we will call the read_csv function to load the dataset from a CSV file. Then we will call the head() function to take a look at the first few rows.

In [173]:
# Import data
df = pd.read_csv("train.csv")

In [174]:
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


# Removing unnecessary columns

Some columns are not really necessary for our models. Remember we are trying to predict who survived so we want to find indicators that help us understand the relationship between our dependent and independent variables. 

The PassengerId, Name and Ticket columns do not really provide any value to my model, so I will remove them.

<ul>
<li>PassengerId only identifies the passenger, so it is not a useful feature for our model.</li>
<li>The Name column is similar to PassengerId. It is used more for identification than as an indicator of survival.</li>
<li>Ticket is the ticket number, which again is not a useful indicator of survival.</li>
</ul>

In [175]:
# No name, passenger ID or ticket column
df1 = df.drop(["PassengerId","Name","Ticket"],axis=1)

In [176]:
df1.head()

Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Cabin,Embarked
0,0,3,male,22.0,1,0,7.25,,S
1,1,1,female,38.0,1,0,71.2833,C85,C
2,1,3,female,26.0,0,0,7.925,,S
3,1,1,female,35.0,1,0,53.1,C123,S
4,0,3,male,35.0,0,0,8.05,,S


# Finding null values and replacing them. 

I want to keep rows with null values, but I cannot have null values in my data. As shown below, Age, Cabin and Embarked contain null values.

I will replace all null values for Age with the average age.

I decided to drop Cabin, and for Embarked I will replace the null values with a randomly chosen port.

First I will use a function to identify which columns have null values.

In [177]:
df1.isnull().any()

Survived    False
Pclass      False
Sex         False
Age          True
SibSp       False
Parch       False
Fare        False
Cabin        True
Embarked     True
dtype: bool

In [178]:
## I decided to remove cabin as there are too many different styles of cabin
df1 = df1.drop("Cabin",axis=1)
df1.head()

Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked
0,0,3,male,22.0,1,0,7.25,S
1,1,1,female,38.0,1,0,71.2833,C
2,1,3,female,26.0,0,0,7.925,S
3,1,1,female,35.0,1,0,53.1,S
4,0,3,male,35.0,0,0,8.05,S


# Replacing null values with average age.

Below I take the average age, replace the null values with it, and then check again to make sure Age no longer has null values.

In [179]:
# Assign the result back instead of using inplace=True on a column selection
df1["Age"] = df1["Age"].fillna(df1["Age"].mean())

In [180]:
df1.isnull().any()

Survived    False
Pclass      False
Sex         False
Age         False
SibSp       False
Parch       False
Fare        False
Embarked     True
dtype: bool

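Below I list the unique values of the Embarked column. Then I use random.choice to pick one of those values and replace the NaN entries with it. Note that random.choice is called only once, so every missing value receives the same randomly chosen port.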

In [181]:
df1["Embarked"].unique()

array(['S', 'C', 'Q', nan], dtype=object)

In [182]:
import random

In [183]:
embarked_cabin_list = ["S","C","Q"]
# random.choice runs once, so all missing rows get the same port
df1["Embarked"] = df1["Embarked"].fillna(random.choice(embarked_cabin_list))

In [184]:
df1.isnull().any()

Survived    False
Pclass      False
Sex         False
Age         False
SibSp       False
Parch       False
Fare        False
Embarked    False
dtype: bool
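
Because `random.choice` runs once in the cell above, every missing Embarked value ends up with the same port. If a separate draw per row is wanted, a minimal sketch could look like the following. The small Series and the seed are made up for illustration and are not taken from the Titanic data.

```python
import numpy as np
import pandas as pd

ports = ["S", "C", "Q"]

# Toy column with two missing entries (illustrative values only)
embarked = pd.Series(["S", None, "C", None, "Q"])

# Seeded generator so the fill is reproducible
rng = np.random.default_rng(0)

# Draw one port per missing row instead of one port for all rows
missing = embarked.isnull()
embarked.loc[missing] = rng.choice(ports, size=int(missing.sum()))

print(embarked.isnull().any())
```

Another common option is to fill with the most frequent port, which keeps the result deterministic.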

# Using TensorFlow

Now that we have cleaned up our dataset, we can use TensorFlow to create a series of predictions.

In [185]:
df1.columns

Index(['Survived', 'Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare',
       'Embarked'],
      dtype='object')

# Creating continuous feature columns
Below we create a variable for each of our numerical columns so that we can place them in a Python list later.
This tells TensorFlow which columns are numerical and which are categorical.

In [186]:
# Creating continuous feature columns
SibSp = tf.feature_column.numeric_column("SibSp")
Parch = tf.feature_column.numeric_column("Parch")
Fare = tf.feature_column.numeric_column("Fare")
Age = tf.feature_column.numeric_column("Age")
Pclass = tf.feature_column.numeric_column("Pclass")

# Creating categorical columns
Here we do the same as above, but for our categorical values. Notice how I use a hash bucket to map each category to an ID.

This is a good way to avoid manually writing out every possible categorical value.

In [187]:
Sex = tf.feature_column.categorical_column_with_hash_bucket("Sex", hash_bucket_size=1000)
Embarked = tf.feature_column.categorical_column_with_hash_bucket("Embarked", hash_bucket_size=1000)
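
To give a feel for what a hash bucket does, here is a minimal pure-Python sketch. It uses MD5 as a stand-in hash, which is an assumption for illustration only; TensorFlow uses its own hash function internally, so the bucket numbers will not match. The helper name `bucket_for` is hypothetical.

```python
import hashlib

def bucket_for(value, hash_bucket_size=1000):
    """Map a string to a bucket index (MD5 is a stand-in, not TensorFlow's hash)."""
    digest = hashlib.md5(value.encode("utf-8")).hexdigest()
    return int(digest, 16) % hash_bucket_size

# Every distinct string lands in some bucket between 0 and 999
buckets = {v: bucket_for(v) for v in ["male", "female", "S", "C", "Q"]}
print(buckets)
```

The same string always lands in the same bucket, but two different strings can collide. For a column with only a few known values, `tf.feature_column.categorical_column_with_vocabulary_list` is an alternative that avoids collisions.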

In [188]:
feat_cols = [SibSp,Parch,Fare,Age,Pclass,Sex,Embarked]
print(feat_cols)

[_NumericColumn(key='SibSp', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None), _NumericColumn(key='Parch', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None), _NumericColumn(key='Fare', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None), _NumericColumn(key='Age', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None), _NumericColumn(key='Pclass', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None), _HashedCategoricalColumn(key='Sex', hash_bucket_size=1000, dtype=tf.string), _HashedCategoricalColumn(key='Embarked', hash_bucket_size=1000, dtype=tf.string)]


# Preparing my feature columns (independent variables) and my labels (dependent variable)
x_data (independent) will represent my features. This is why I drop the "Survived" column, which is what I am trying to predict.

In [189]:
x_data = df1.drop("Survived",axis=1)

In [190]:
x_data.head()

Unnamed: 0,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked
0,3,male,22.0,1,0,7.25,S
1,1,female,38.0,1,0,71.2833,C
2,3,female,26.0,0,0,7.925,S
3,1,female,35.0,1,0,53.1,S
4,3,male,35.0,0,0,8.05,S


The "labels" variable (dependent) tells me who survived and who didn't. Then I make sure that "x_data" and "labels" have the same length. If they didn't, I would run into an error later on.

In [191]:
# Create labels
labels = df1["Survived"]

In [192]:
print("Length of labels is {0} and length of variables is {1}".format(len(labels),len(x_data)))

Length of labels is 891 and length of variables is 891


# Creating a training dataset and testing dataset

Here I create a training and a testing dataset from my data. This testing dataset is not the one we will submit to Kaggle; it is the one we will use to evaluate the accuracy of our model.

I use scikit-learn's train_test_split to break up my data. This is a quick and easy way to get the data ready to feed to the model.

In [193]:
from sklearn.model_selection import train_test_split

In [194]:
X_train, X_test, y_train, y_test = train_test_split(x_data,labels,test_size=0.33, random_state=101)
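
As a quick check on what `test_size=0.33` means for 891 rows, the arithmetic can be sketched as below. It assumes scikit-learn rounds a fractional test size up to the next whole row, which is consistent with the 295 predictions that appear later in this notebook.

```python
import math

n_rows = 891        # rows in the cleaned training data
test_size = 0.33    # fraction passed to train_test_split

# Assumed rounding: the test share is rounded up to a whole number of rows
n_test = math.ceil(n_rows * test_size)
n_train = n_rows - n_test

print(n_train, n_test)
```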

# Creating a function to train the model

TensorFlow lets you feed Pandas dataframes into models using the "tf.estimator.inputs.pandas_input_fn" function. This is a quick and easy way to pass our training data.

Below I use a LinearClassifier for my model. I picked this one just to show how easy it is to create a model; it is very possible that there are better options.

In [195]:
input_func = tf.estimator.inputs.pandas_input_fn(x=X_train,y=y_train,batch_size=10,num_epochs=1000,shuffle=True)

In [196]:
model = tf.estimator.LinearClassifier(feature_columns=feat_cols,n_classes=2)

INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_save_checkpoints_secs': 600, '_save_checkpoints_steps': None, '_log_step_count_steps': 100, '_keep_checkpoint_every_n_hours': 10000, '_save_summary_steps': 100, '_tf_random_seed': 1, '_keep_checkpoint_max': 5, '_model_dir': '/var/folders/pm/_fh4rrbx7tggds75lnhcl7wm0000gn/T/tmpf5ytjyl6', '_session_config': None}


In [197]:
model.train(input_fn=input_func,steps=1000)

INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Saving checkpoints for 1 into /var/folders/pm/_fh4rrbx7tggds75lnhcl7wm0000gn/T/tmpf5ytjyl6/model.ckpt.
INFO:tensorflow:step = 1, loss = 6.93147
INFO:tensorflow:global_step/sec: 246.763
INFO:tensorflow:step = 101, loss = 6.65586 (0.406 sec)
INFO:tensorflow:global_step/sec: 267.877
INFO:tensorflow:step = 201, loss = 5.56881 (0.374 sec)
INFO:tensorflow:global_step/sec: 265.519
INFO:tensorflow:step = 301, loss = 3.49919 (0.377 sec)
INFO:tensorflow:global_step/sec: 249.563
INFO:tensorflow:step = 401, loss = 3.36833 (0.401 sec)
INFO:tensorflow:global_step/sec: 241.594
INFO:tensorflow:step = 501, loss = 2.20794 (0.414 sec)
INFO:tensorflow:global_step/sec: 235.759
INFO:tensorflow:step = 601, loss = 7.89262 (0.424 sec)
INFO:tensorflow:global_step/sec: 200.779
INFO:tensorflow:step = 701, loss = 6.22134 (0.500 sec)
INFO:tensorflow:global_step/sec: 220.735
INFO:tensorflow:step = 801, loss = 7.7144 (0.452 sec)
INFO:tensorflow:global_step/s

<tensorflow.python.estimator.canned.linear.LinearClassifier at 0x113af2cc0>
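
The interaction between `batch_size`, `num_epochs` and `steps` can be worked out by hand. The sketch below assumes 596 training rows (891 minus the 295 test rows) and that the logged loss is the log loss summed over one batch. Under those assumptions, 1000 steps cover far fewer than the 1000 epochs the input function allows, and the first logged loss of 6.93147 is what a model with all-zero weights would produce.

```python
import math

train_rows = 891 - 295   # rows left for training after the 33% split
batch_size = 10
steps = 1000

# Training stops after `steps` batches, long before num_epochs=1000 is reached
examples_seen = steps * batch_size
epochs_seen = examples_seen / train_rows

# With zero weights the model predicts 0.5 for every row, so the per-row
# log loss is ln(2); summed over a batch of 10 that is about 6.93
initial_batch_loss = batch_size * math.log(2)

print(round(epochs_seen, 1), round(initial_batch_loss, 5))
```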

# Model Evaluation

TensorFlow provides a way to evaluate your model and see its accuracy on the held-out data. We get about 77% accuracy, which can obviously be improved in a number of different ways.

However, the goal is just to show how easy TensorFlow is to use. Feel free to experiment with different algorithms or feature engineering to get a higher level of accuracy.

In [198]:
eval_input_func = tf.estimator.inputs.pandas_input_fn(
      x=X_test,
      y=y_test,
      batch_size=10,
      num_epochs=1,
      shuffle=False)

In [199]:
results = model.evaluate(eval_input_func)

INFO:tensorflow:Starting evaluation at 2018-04-08-18:34:40
INFO:tensorflow:Restoring parameters from /var/folders/pm/_fh4rrbx7tggds75lnhcl7wm0000gn/T/tmpf5ytjyl6/model.ckpt-1000
INFO:tensorflow:Finished evaluation at 2018-04-08-18:34:41
INFO:tensorflow:Saving dict for global step 1000: accuracy = 0.772881, accuracy_baseline = 0.572881, auc = 0.842233, auc_precision_recall = 0.841324, average_loss = 0.478636, global_step = 1000, label/mean = 0.427119, loss = 4.70659, prediction/mean = 0.480112


In [200]:
results

{'accuracy': 0.77288133,
 'accuracy_baseline': 0.57288134,
 'auc': 0.84223253,
 'auc_precision_recall': 0.84132445,
 'average_loss': 0.47863579,
 'global_step': 1000,
 'label/mean': 0.42711863,
 'loss': 4.7065854,
 'prediction/mean': 0.48011205}
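
Two of the numbers in this dictionary can be reproduced from the others, which helps when reading it. The sketch below assumes that `accuracy_baseline` is the accuracy of always predicting the majority class, and that `loss` is the summed log loss averaged over evaluation batches. Both assumptions agree with the values printed above.

```python
import math

label_mean = 0.42711863     # share of survivors in the evaluation split
average_loss = 0.47863579   # mean log loss per row
n_rows = 295
batch_size = 10

# Always predicting "did not survive" is right for the majority class
accuracy_baseline = max(label_mean, 1 - label_mean)

# 295 rows in batches of 10 gives 30 batches (the last one has 5 rows)
n_batches = math.ceil(n_rows / batch_size)
loss_per_batch = average_loss * n_rows / n_batches

print(round(accuracy_baseline, 4), round(loss_per_batch, 4))
```

So the model's 77% accuracy should be judged against a baseline of about 57%, not against 50%.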

# Creating predictions

Now we will use TensorFlow to predict our results. Again, these predictions are based on the held-out split of the training data and not on the actual data we need to predict and submit to Kaggle.

In [201]:
pred_input_func = tf.estimator.inputs.pandas_input_fn(
      x=X_test,
      batch_size=10,
      num_epochs=1,
      shuffle=False)

In [202]:
# Predictions is a generator! 
predictions = model.predict(pred_input_func)

In [203]:
list_pred = list(predictions)

INFO:tensorflow:Restoring parameters from /var/folders/pm/_fh4rrbx7tggds75lnhcl7wm0000gn/T/tmpf5ytjyl6/model.ckpt-1000


In [204]:
len(list_pred)

295

In [205]:
final_preds = []
for pred in list_pred:
    final_preds.append(pred['class_ids'][0])

In [206]:
len(final_preds)

295
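
The loop above pulls the predicted class out of each prediction dictionary. A list comprehension does the same thing in one line. The sketch below runs on mock dictionaries whose numbers are made up; it assumes each prediction has a `class_ids` entry holding a single class, as used in the loop above.

```python
# Mock predictions shaped like the dictionaries yielded by model.predict;
# the values are illustrative only
mock_predictions = [
    {"class_ids": [0], "probabilities": [0.8, 0.2]},
    {"class_ids": [1], "probabilities": [0.3, 0.7]},
    {"class_ids": [1], "probabilities": [0.4, 0.6]},
]

# Take the single predicted class from each dictionary
mock_final_preds = [int(pred["class_ids"][0]) for pred in mock_predictions]

print(mock_final_preds)
```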

# Comparing my predictions vs the real results

At the end I compare my "y_test" data, which holds the actual results, with my predictions. We can see that the precision is about 80% for passengers who did not survive and 74% for those who did, or about 77% on average.

In [207]:
from sklearn.metrics import classification_report

In [208]:
print(classification_report(y_test,final_preds))

             precision    recall  f1-score   support

          0       0.80      0.81      0.80       169
          1       0.74      0.72      0.73       126

avg / total       0.77      0.77      0.77       295
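
The "avg / total" row can be checked by hand. The sketch below assumes it is a support-weighted average of the per-class scores, and uses the rounded values from the report above, so small rounding differences are expected.

```python
# Per-class values copied from the classification report above
support = {0: 169, 1: 126}
precision = {0: 0.80, 1: 0.74}
recall = {0: 0.81, 1: 0.72}

total = sum(support.values())

# Weight each class score by the number of rows in that class
weighted_precision = sum(precision[c] * support[c] for c in support) / total
weighted_recall = sum(recall[c] * support[c] for c in support) / total

print(round(weighted_precision, 2), round(weighted_recall, 2))
```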



# Using Real Data 

Now that the model has been trained, we are going to load the real test data. The predictions for this data will then be submitted to Kaggle.

This data can be found on the following page: <a href="https://www.kaggle.com/c/titanic/data">Test data</a>

I downloaded the data onto my computer and then read it from there.

In [209]:
real_df = pd.read_csv("test.csv")

In [210]:
real_df.head()

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q
1,893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0,,S
2,894,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q
3,895,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S
4,896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,,S


# Prepare the data

We will go through the same steps as above to remove any null values and any columns that we do not need.

In [211]:
predict_df = real_df.drop(["PassengerId","Name","Ticket","Cabin"],axis=1)

In [212]:
predict_df.head()

Unnamed: 0,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked
0,3,male,34.5,0,0,7.8292,Q
1,3,female,47.0,1,0,7.0,S
2,2,male,62.0,0,0,9.6875,Q
3,3,male,27.0,0,0,8.6625,S
4,3,female,22.0,1,1,12.2875,S


In [213]:
# Assign the result back instead of using inplace=True on a column selection
predict_df["Age"] = predict_df["Age"].fillna(predict_df["Age"].mean())

In [214]:
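predict_embarked_cabin_list = ["S","C","Q"]
# Use the list defined in this cell rather than the one from the training section
predict_df["Embarked"] = predict_df["Embarked"].fillna(random.choice(predict_embarked_cabin_list))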

Notice how below "Fare" has null values, which was not the case with our training dataset. This means we need to handle these null values too. This can be done the same way we dealt with the Age data.

In [215]:
predict_df.isnull().any()

Pclass      False
Sex         False
Age         False
SibSp       False
Parch       False
Fare         True
Embarked    False
dtype: bool

In [216]:
# Fill missing fares with the average fare, as was done for Age
predict_df["Fare"] = predict_df["Fare"].fillna(predict_df["Fare"].mean())

In [217]:
predict_df.isnull().any()

Pclass      False
Sex         False
Age         False
SibSp       False
Parch       False
Fare        False
Embarked    False
dtype: bool
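
Since the same cleaning steps are applied to both the training and the test data, they could be collected into one function. The sketch below is one possible way to do that; the helper name `prepare_features`, the seed and the three-row frame are all made up for illustration. As in the cells above, one random port is chosen for all missing Embarked values.

```python
import random

import pandas as pd

def prepare_features(df, seed=0):
    """Hypothetical helper: apply the notebook's cleaning steps to a raw frame."""
    out = df.drop(columns=["PassengerId", "Name", "Ticket", "Cabin"])
    # Fill numeric gaps with the column mean
    for col in ["Age", "Fare"]:
        out[col] = out[col].fillna(out[col].mean())
    # Pick one port at random and use it for every missing Embarked value
    rng = random.Random(seed)
    out["Embarked"] = out["Embarked"].fillna(rng.choice(["S", "C", "Q"]))
    return out

# Tiny illustrative frame with the same columns as test.csv
raw = pd.DataFrame({
    "PassengerId": [1, 2, 3],
    "Pclass": [3, 1, 3],
    "Name": ["a", "b", "c"],
    "Sex": ["male", "female", "female"],
    "Age": [22.0, None, 26.0],
    "SibSp": [1, 1, 0],
    "Parch": [0, 0, 0],
    "Ticket": ["x", "y", "z"],
    "Fare": [7.25, 71.2833, None],
    "Cabin": [None, "C85", None],
    "Embarked": ["S", None, "Q"],
})

clean = prepare_features(raw)
print(clean.isnull().any().any())
```

A design choice worth considering is filling the test data with the means computed on the training data, so that both sets are treated identically; the cells above use each dataset's own mean instead.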

# Create prediction function with real data

Below we repeat the same steps of creating a prediction function and building a list of predictions. This time I will combine the predictions with PassengerId for the submission.

In [218]:
real_pred_input_func = tf.estimator.inputs.pandas_input_fn(
      x=predict_df,
      batch_size=10,
      num_epochs=1,
      shuffle=False)

In [219]:
real_predictions = model.predict(real_pred_input_func)

In [220]:
real_predictions

<generator object Estimator.predict at 0x112a297d8>

In [221]:
real_list_pred = list(real_predictions)

INFO:tensorflow:Restoring parameters from /var/folders/pm/_fh4rrbx7tggds75lnhcl7wm0000gn/T/tmpf5ytjyl6/model.ckpt-1000


In [222]:
len(real_list_pred)

418

In [223]:
real_preds = []
for pred in real_list_pred:
    real_preds.append(pred['class_ids'][0])

# Prepare submission CSV

In order to submit the results I need to save them to a CSV file. The first thing I need to do is create a Series from my predictions. This makes it easy to add the results to a Pandas dataframe.

After I convert the predictions into a Series, I create a dataframe with two columns, "PassengerId" and "Survived". Then I write it to a CSV file that I submit to Kaggle.

In [224]:
#convert list of predictions into series
series_predictions = pd.Series(real_preds)

In [226]:
# Build the submission dataframe from the passenger IDs and the predictions
submission_df = pd.DataFrame({"PassengerId": real_df["PassengerId"], "Survived": series_predictions})

In [227]:
# index=False keeps the file to the two columns PassengerId and Survived
submission_df.to_csv("kaggle_submission.csv", index=False)
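
Before uploading, it can be worth checking that the file contains only the two expected columns. The sketch below writes a small made-up frame to an in-memory buffer with `index=False` and inspects the result; the passenger IDs and predictions are illustrative, not model output.

```python
import io

import pandas as pd

# Illustrative IDs and predictions only
passenger_ids = pd.Series([892, 893, 894])
preds = [0, 1, 0]

submission = pd.DataFrame({"PassengerId": passenger_ids, "Survived": preds})

# Write to memory instead of disk so the output can be inspected directly
buffer = io.StringIO()
submission.to_csv(buffer, index=False)
csv_text = buffer.getvalue()

header = csv_text.splitlines()[0]
print(header)
```

Without `index=False`, Pandas adds the row index as an extra unnamed first column.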