In [2]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here are several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory

import os
print(os.listdir("../input"))
import tensorflow as tf

# Any results you write to the current directory are saved as output.

['test.csv', 'train.csv', 'gender_submission.csv']


# Read Data

Here's what we'll do 
- Read the csv files (train and test) into DataFrames
- Print out a small summary of the data
- Combine them into one dataset if we require later on
- Find out how many examples of each class exist in the training data (check if skewed or not)
- Find out how many features have null values
- Fix null values for numerical features
- Fix null values with some values for categorical features

## Read the csv files into DataFrames

In [3]:
# Read the csv files into DataFrames
train_df = pd.read_csv("../input/train.csv")
test_df = pd.read_csv("../input/test.csv")
print("train_df shape : ",train_df.shape)
print("test_df shape : ",test_df.shape)

train_df shape :  (891, 12)
test_df shape :  (418, 11)


## Print out summary of data

In [4]:
# Print small summary
print("A look at training data:")
train_df.head()

A look at training data:


Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [5]:
print("A look at testing data:")
test_df.head()

A look at testing data:


Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q
1,893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0,,S
2,894,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q
3,895,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S
4,896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,,S


> ***Obvious observation - 'Survived' column is missing in test_df***

## Find out how many examples of each class in training data

In [6]:
train_df.groupby('Survived')['PassengerId'].count()

Survived
0    549
1    342
Name: PassengerId, dtype: int64

**Observations** : 
1. 549+342 = 891. So no training example is missing its class label
2. The classes are only mildly imbalanced (549 vs 342, roughly 62% / 38%), so it's not such a skewed dataset
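
The same check can be read off as proportions. A minimal sketch, assuming a toy Series rebuilt from the counts printed above (the name `survived` is illustrative and not part of the notebook):

```python
import pandas as pd

# Toy stand-in for train_df['Survived'], rebuilt from the counts printed above.
survived = pd.Series([0] * 549 + [1] * 342, name="Survived")

# normalize=True returns fractions of the total instead of raw counts.
class_share = survived.value_counts(normalize=True).sort_index()
print(class_share)
```

The larger share is also a handy baseline: a model that always predicts 0 would already match that fraction of the training labels.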

## How many features have null values

In [7]:
# any() returns whether any element along an axis is True. Applied to isnull(), it tells us, for each column, whether that column has at least one NaN value.
train_df.isnull().any()

PassengerId    False
Survived       False
Pclass         False
Name           False
Sex            False
Age             True
SibSp          False
Parch          False
Ticket         False
Fare           False
Cabin           True
Embarked        True
dtype: bool

**Age**, **Cabin** and **Embarked** are the only columns with NaN values. We need to fix them.

In [8]:
# How many NaN values of Age in train_df?
train_df['Age'].isnull().sum()

177

In [9]:
# For Cabin
train_df['Cabin'].isnull().sum()

687

In [10]:
# For Embarked
train_df['Embarked'].isnull().sum()

2

## Fixing null / NaN values for each column one by one

### For embarked

In [11]:
train_df.groupby('Embarked')['PassengerId'].count()

Embarked
C    168
Q     77
S    644
Name: PassengerId, dtype: int64

We observed earlier that only 2 entries have NaN for Embarked. And here, we see there are only 3 possible values of Embarked - C, Q and S, of which S is by far the most common. So, let's just assign the missing ones to S.
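
Instead of hard-coding 'S', the most common value can be looked up. A small sketch on a toy column (the data and variable names are made up for illustration):

```python
import pandas as pd

# Toy column with the same three ports and two missing entries.
embarked = pd.Series(["S", "C", "S", None, "Q", "S", None], name="Embarked")

# mode() skips NaN by default and returns a Series, so take its first entry.
most_common = embarked.mode()[0]
embarked_filled = embarked.fillna(most_common)
print(most_common, embarked_filled.isnull().sum())
```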

In [12]:
train_df['Embarked'] = train_df['Embarked'].fillna('S')

Now, let's check again....

In [13]:
train_df.groupby('Embarked')['PassengerId'].count()

Embarked
C    168
Q     77
S    646
Name: PassengerId, dtype: int64

Perfect.

### For Age

In [14]:
train_df.groupby('Age')['PassengerId'].count()

Age
0.42      1
0.67      1
0.75      2
0.83      2
0.92      1
1.00      7
2.00     10
3.00      6
4.00     10
5.00      4
6.00      3
7.00      3
8.00      4
9.00      8
10.00     2
11.00     4
12.00     1
13.00     2
14.00     6
14.50     1
15.00     5
16.00    17
17.00    13
18.00    26
19.00    25
20.00    15
20.50     1
21.00    24
22.00    27
23.00    15
         ..
44.00     9
45.00    12
45.50     2
46.00     3
47.00     9
48.00     9
49.00     6
50.00    10
51.00     7
52.00     6
53.00     1
54.00     8
55.00     2
55.50     1
56.00     4
57.00     2
58.00     5
59.00     2
60.00     4
61.00     3
62.00     4
63.00     2
64.00     2
65.00     3
66.00     1
70.00     2
70.50     1
71.00     2
74.00     1
80.00     1
Name: PassengerId, Length: 88, dtype: int64

So, the first thing to note is that Age can have decimals! So, it's more of a continuous variable than a discrete one.
I think it makes sense to fill the missing ones with the mean.

In [15]:
train_df['Age'].mean()

29.69911764705882

In [16]:
train_df['Age'] = train_df['Age'].fillna(train_df['Age'].mean())

Now, let's check how many missing values remain.

In [17]:
train_df['Age'].isnull().sum()

0

Perfect.

### For Cabin

In [18]:
train_df.groupby('Cabin')['PassengerId'].count()

Cabin
A10      1
A14      1
A16      1
A19      1
A20      1
A23      1
A24      1
A26      1
A31      1
A32      1
A34      1
A36      1
A5       1
A6       1
A7       1
B101     1
B102     1
B18      2
B19      1
B20      2
B22      2
B28      2
B3       1
B30      1
B35      2
B37      1
B38      1
B39      1
B4       1
B41      1
        ..
E12      1
E121     2
E17      1
E24      2
E25      2
E31      1
E33      2
E34      1
E36      1
E38      1
E40      1
E44      2
E46      1
E49      1
E50      1
E58      1
E63      1
E67      2
E68      1
E77      1
E8       2
F E69    1
F G63    1
F G73    2
F2       3
F33      3
F38      1
F4       2
G6       4
T        1
Name: PassengerId, Length: 147, dtype: int64

Okay, so:
- Values can be alphanumeric
- 147 different values exist for Cabin
- No single value is far more common than the others
- A lot of values are actually missing - 687!

So, let's do one thing - add a new 'Cabin' value, 'UNKNOWN', and fill the missing entries with that

In [19]:
train_df['Cabin'] = train_df['Cabin'].fillna('UNKNOWN')

Check how many NaN now

In [20]:
train_df['Cabin'].isnull().sum()

0

Perfect.

### All NaN values fixed

In [21]:
# any() returns whether any element along an axis is True. Applied to isnull(), it tells us, for each column, whether that column has at least one NaN value.
train_df.isnull().any()

PassengerId    False
Survived       False
Pclass         False
Name           False
Sex            False
Age            False
SibSp          False
Parch          False
Ticket         False
Fare           False
Cabin          False
Embarked       False
dtype: bool

## Helper Methods we learnt from above

We'll use these for testing dataset, and maybe in future as well.

In [22]:
def get_num_of_NaN_rows(df):
    return df.isnull().sum()

def fill_NaN_values_for_numerical_column(df, colname):
    df[colname] = df[colname].fillna(df[colname].mean())
    return df

def fill_NaN_values_for_categorical_column(df, colname, value):
    df[colname] = df[colname].fillna(value)
    return df

In [23]:
# Let's test them on test data (which still might have missing rows!)
num_of_NaN_rows_of_test_set = get_num_of_NaN_rows(test_df)
print("num_of_NaN_rows_of_test_set : ",num_of_NaN_rows_of_test_set)

num_of_NaN_rows_of_test_set :  PassengerId      0
Pclass           0
Name             0
Sex              0
Age             86
SibSp            0
Parch            0
Ticket           0
Fare             1
Cabin          327
Embarked         0
dtype: int64


One chapter done. 

# Preprocessing Data

- Convert Categorical values to numerical ones
- Divide train_df into train_df_X and train_df_y
- One hot values

### Convert Categorical values to numerical ones

**1. Find which columns are categorical**

Ref : https://stackoverflow.com/questions/29803093/check-which-columns-in-dataframe-are-categorical/29803290#29803290

In [24]:
all_cols = train_df.columns

In [25]:
numeric_cols = train_df._get_numeric_data().columns

In [26]:
categorical_cols = set(all_cols) - set(numeric_cols)
categorical_cols

{'Cabin', 'Embarked', 'Name', 'Sex', 'Ticket'}

In [27]:
# Let's make a helper method from this now.
def find_categorical_columns(df):
    all_cols = df.columns
    numeric_cols = df._get_numeric_data().columns
    return set(all_cols) - set(numeric_cols)
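
`_get_numeric_data()` is a private pandas method. A sketch of the same idea with the public `select_dtypes` call; the function name `find_categorical_columns_public` and the toy frame are assumptions for illustration:

```python
import pandas as pd

def find_categorical_columns_public(df):
    """Return the names of all non-numeric columns (hypothetical variant)."""
    return set(df.select_dtypes(exclude="number").columns)

toy = pd.DataFrame({
    "Age": [22.0, 38.0],
    "Pclass": [3, 1],
    "Sex": ["male", "female"],
    "Cabin": ["UNKNOWN", "C85"],
})
print(find_categorical_columns_public(toy))
```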

**2. Convert to numerical ones using get_dummies of Pandas**

Ref : http://fastml.com/converting-categorical-data-into-numbers-with-pandas-and-scikit-learn/

In [28]:
# First, let's back up train_df as it stands now.
# .copy() matters here: plain assignment would only create a second name for the same DataFrame.
train_df_backup_filledna_still_having_categorical_data = train_df.copy()
train_df_backup_filledna_still_having_categorical_data.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,UNKNOWN,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,UNKNOWN,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,UNKNOWN,S


In [29]:
# Now, let's convert it.
train_df_dummies = pd.get_dummies(train_df, columns=categorical_cols)
train_df_dummies.shape

(891, 1732)

In [30]:
# However, the backup's shape is unchanged
train_df_backup_filledna_still_having_categorical_data.shape

(891, 12)

In [31]:
# Let's check out data once
train_df_dummies.head()

Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare,Sex_female,Sex_male,Ticket_110152,...,"Name_Yrois, Miss. Henriette (""Mrs Harbeck"")","Name_Zabour, Miss. Hileni","Name_Zabour, Miss. Thamine","Name_Zimmerman, Mr. Leo","Name_de Messemaeker, Mrs. Guillaume Joseph (Emma)","Name_de Mulder, Mr. Theodore","Name_de Pelsmaeker, Mr. Alfons","Name_del Carlo, Mr. Sebastiano","Name_van Billiard, Mr. Austin Blyler","Name_van Melkebeke, Mr. Philemon"
0,1,0,3,22.0,1,0,7.25,0,1,0,...,0,0,0,0,0,0,0,0,0,0
1,2,1,1,38.0,1,0,71.2833,1,0,0,...,0,0,0,0,0,0,0,0,0,0
2,3,1,3,26.0,0,0,7.925,1,0,0,...,0,0,0,0,0,0,0,0,0,0
3,4,1,1,35.0,1,0,53.1,1,0,0,...,0,0,0,0,0,0,0,0,0,0
4,5,0,3,35.0,0,0,8.05,0,1,0,...,0,0,0,0,0,0,0,0,0,0


In [32]:
train_df.shape

(891, 12)
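
Where does a width like 1732 come from? `get_dummies` creates one column per distinct value of each encoded column and leaves the rest untouched. A toy sketch of that arithmetic (the frame below is made up):

```python
import pandas as pd

toy = pd.DataFrame({
    "Fare": [7.25, 71.28, 7.93, 53.10],
    "Sex": ["male", "female", "female", "female"],
    "Name": ["Braund", "Cumings", "Heikkinen", "Futrelle"],
})
categorical = ["Sex", "Name"]

dummies = pd.get_dummies(toy, columns=categorical)

# Untouched columns plus one new column per distinct categorical value.
expected_width = (toy.shape[1] - len(categorical)) + sum(
    toy[c].nunique() for c in categorical
)
print(dummies.shape[1], expected_width)
```

Columns such as Name, which is close to unique per passenger, therefore add close to one column per row.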

### Another way to convert Categorical columns data into numerical is assigning them integers
Ref : https://stackoverflow.com/questions/42215354/pandas-get-mapping-of-categories-to-integer-value

In [33]:
# 2nd way to convert is having integers represent different values of each categorical column
train_df_numerical = train_df.copy()
for col in categorical_cols:
    train_df_numerical[col] = train_df_numerical[col].astype('category')
    train_df_numerical[col] = train_df_numerical[col].cat.codes
train_df_numerical.shape

(891, 12)

In [34]:
train_df_numerical.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,108,1,22.0,1,0,523,7.25,147,2
1,2,1,1,190,0,38.0,1,0,596,71.2833,81,0
2,3,1,3,353,0,26.0,0,0,669,7.925,147,2
3,4,1,1,272,0,35.0,1,0,49,53.1,55,2
4,5,0,3,15,1,35.0,0,0,472,8.05,147,2


In [35]:
# Let's make helper function here also
def convert_categorical_column_to_integer_values(df):
    df_numerical = df.copy()
    for col in find_categorical_columns(df):
        df_numerical[col] = df_numerical[col].astype('category')
        df_numerical[col] = df_numerical[col].cat.codes
    return df_numerical

*Perfect*.

Now, we have all of these available for our use : 

* **train_df**                    : original training dataset   (891,12)
* **train_df_dummies**  : training dataset with dummies (891, 1732)
* **train_df_numerical** : training dataset with integers for categorical attributes (891,12) 

# Running a model in Tensorflow

This will again involve a set of steps
- Get data converted to numpy arrays so tensorflow can read them
- Write tensorflow model
- Run a session of tensorflow model and check accuracy on training data set

Try the above for both train_df_dummies and train_df_numerical

In [36]:
# import tensorflow stuff...
import tensorflow as tf

In [37]:
# Dividing data between X and Y
# Ref : https://stackoverflow.com/questions/29763620/how-to-select-all-columns-except-one-column-in-pandas

train_df_dummies_Y = train_df_dummies['Survived']
# Don't worry. drop does not change the existing dataframe unless inplace=True is passed.
train_df_dummies_X = train_df_dummies.drop('Survived', axis=1)

train_df_numerical_X = train_df_numerical.drop('Survived', axis=1)
train_df_numerical_Y = train_df_numerical['Survived']

print("train_df_numerical_X shape : ",train_df_numerical_X.shape)
print("train_df_numerical_Y shape : ",train_df_numerical_Y.shape)
print("train_df_dummies_X shape : ",train_df_dummies_X.shape)
print("train_df_dummies_Y shape : ",train_df_dummies_Y.shape)

train_df_numerical_X shape :  (891, 11)
train_df_numerical_Y shape :  (891,)
train_df_dummies_X shape :  (891, 1731)
train_df_dummies_Y shape :  (891,)


### Converting to numpy arrays so tensorflow variables can pick it up

In [38]:
# .values returns the underlying NumPy array (as_matrix() was removed in pandas 1.0)
trainX_num = train_df_numerical_X.values
trainY_num = train_df_numerical_Y.values

trainX_dummies = train_df_dummies_X.values
trainY_dummies = train_df_dummies_Y.values

print("trainX_num.shape = ",trainX_num.shape)
print("trainY_num.shape = ",trainY_num.shape)
print("trainX_dummies.shape = ",trainX_dummies.shape)
print("trainY_dummies.shape = ",trainY_dummies.shape)

trainX_num.shape =  (891, 11)
trainY_num.shape =  (891,)
trainX_dummies.shape =  (891, 1731)
trainY_dummies.shape =  (891,)


In [39]:
# Reshaping the rank 1 arrays formed to proper 2 dimensions
trainY_num = trainY_num[:,np.newaxis]
trainY_dummies = trainY_dummies[:,np.newaxis]

print("trainX_num.shape = ",trainX_num.shape)
print("trainY_num.shape = ",trainY_num.shape)
print("trainX_dummies.shape = ",trainX_dummies.shape)
print("trainY_dummies.shape = ",trainY_dummies.shape)

trainX_num.shape =  (891, 11)
trainY_num.shape =  (891, 1)
trainX_dummies.shape =  (891, 1731)
trainY_dummies.shape =  (891, 1)


### Tensorflow Model

Now, let's build our model.
We could use an existing DNN classifier, but instead we're going to write the calculations ourselves.
It has 2 layers, hence W1, b1, W2, b2 as parameters: the weights and biases of layer 1 and layer 2 respectively.

We'll use ReLU as the activation function for the first layer. Why? It is cheap to compute and, unlike sigmoid or tanh, does not saturate for positive inputs, which usually makes hidden layers easier to train.
And sigmoid for the 2nd layer: since this is binary classification, sigmoid squashes the output into (0, 1), so it can be read as the probability of survival.

In [40]:
### Tensorflow model
def model(learning_rate, X_arg, Y_arg, num_of_epochs):
    # 1. Placeholders to hold data
    X = tf.placeholder(tf.float32, [11,None])
    Y = tf.placeholder(tf.float32, [1, None])

    # 2. Model. 2 layers NN. So, W1, b1, W2, b2.
    # This is basically coding the forward propagation formulas
    W1 = tf.Variable(tf.random_normal((20,11)))
    b1 = tf.Variable(tf.zeros((20,1)))
    Z1 = tf.matmul(W1,X) + b1             # Pre-activation (linear output) of layer 1
    A1 = tf.nn.relu(Z1)

    W2 = tf.Variable(tf.random_normal((1, 20)))
    b2 = tf.Variable(tf.zeros((1,1)))
    Z2 = tf.matmul(W2,A1) + b2
    A2 = tf.nn.sigmoid(Z2)

    # 3. Calculate cost
    cost = tf.nn.sigmoid_cross_entropy_with_logits(logits=Z2, labels=Y)
    cost_mean = tf.reduce_mean(cost)

    # 4. Optimizer (Gradient Descent / AdamOptimizer ) - Using this line, tensorflow automatically does backpropagation
    optimizer = tf.train.AdamOptimizer(learning_rate).minimize(cost_mean)
    
    # 5. Initialize variables
    session = tf.Session()
    tf.set_random_seed(1)
    init = tf.global_variables_initializer()
    session.run(init)
    
    # 6. Actual loop where learning happens
    for i in range(num_of_epochs):
        _, cost_mean_val = session.run([optimizer, cost_mean], feed_dict={X:X_arg, Y:Y_arg})
        if i % 100 == 0:
            print("i : ",i,", cost : ",cost_mean_val)
            
    return session.run([W1,b1,W2,b2,A2,Y],feed_dict={X:X_arg, Y:Y_arg})
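
The cost in step 3 is given the logits `Z2` rather than `A2` because `tf.nn.sigmoid_cross_entropy_with_logits` applies the sigmoid itself. Its documentation gives the numerically stable form `max(z, 0) - z*y + log(1 + exp(-|z|))`. A NumPy sketch of that formula, checked against the naive definition (the function name `sigmoid_cross_entropy` is illustrative):

```python
import numpy as np

def sigmoid_cross_entropy(logits, labels):
    """Stable elementwise cross-entropy: max(z, 0) - z*y + log(1 + exp(-|z|))."""
    z = np.asarray(logits, dtype=np.float64)
    y = np.asarray(labels, dtype=np.float64)
    return np.maximum(z, 0) - z * y + np.log1p(np.exp(-np.abs(z)))

z = np.array([-2.0, 0.0, 3.0])
y = np.array([0.0, 1.0, 1.0])

# Naive version for comparison; fine for small |z| but breaks down for large ones.
a = 1.0 / (1.0 + np.exp(-z))
naive = -(y * np.log(a) + (1 - y) * np.log(1 - a))

stable = sigmoid_cross_entropy(z, y)
print(stable, naive)
```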

In [41]:
W1_tr,b1_tr,W2_tr,b2_tr,A2,Y = model(0.01, trainX_num.T, trainY_num.T, 3000)

i :  0 , cost :  3030.03
i :  100 , cost :  91.9023
i :  200 , cost :  40.3933
i :  300 , cost :  7.13846
i :  400 , cost :  12.6343
i :  500 , cost :  5.83331
i :  600 , cost :  2.08085
i :  700 , cost :  1.58742
i :  800 , cost :  5.6428
i :  900 , cost :  3.63964
i :  1000 , cost :  2.19955
i :  1100 , cost :  2.23141
i :  1200 , cost :  1.67917
i :  1300 , cost :  1.24362
i :  1400 , cost :  1.01248
i :  1500 , cost :  2.31802
i :  1600 , cost :  1.34666
i :  1700 , cost :  1.44293
i :  1800 , cost :  1.17003
i :  1900 , cost :  0.926903
i :  2000 , cost :  6.90624
i :  2100 , cost :  1.24673
i :  2200 , cost :  3.33068
i :  2300 , cost :  2.26824
i :  2400 , cost :  1.08037
i :  2500 , cost :  0.876024
i :  2600 , cost :  22.3073
i :  2700 , cost :  1.11995
i :  2800 , cost :  2.93875
i :  2900 , cost :  1.62486
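
The first cost is in the thousands and the curve keeps spiking. One plausible cause, which this notebook does not investigate, is that the inputs are unscaled: PassengerId and the Ticket codes run into the hundreds while Sex is 0 or 1. A minimal sketch of z-score standardization, a common remedy that is not used anywhere in this notebook (the helper name `standardize` is hypothetical):

```python
import numpy as np

def standardize(X, mean=None, std=None):
    """Scale each column to zero mean and unit variance.

    Pass the training mean/std back in when transforming test data.
    """
    X = np.asarray(X, dtype=np.float64)
    if mean is None:
        mean = X.mean(axis=0)
    if std is None:
        std = X.std(axis=0)
    std = np.where(std == 0, 1.0, std)  # avoid dividing by zero for constant columns
    return (X - mean) / std, mean, std

# Rows are examples, columns are features (made-up values).
X_toy = np.array([[1.0, 22.0, 7.25], [2.0, 38.0, 71.28], [3.0, 26.0, 7.93]])
X_scaled, mu, sigma = standardize(X_toy)
print(X_scaled.mean(axis=0), X_scaled.std(axis=0))
```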


In [42]:
# Validating that our formulas were correct by checking the shape of the output prediction
A2.shape

(1, 891)

In [43]:
Y.shape

(1, 891)

In [44]:
# Let's see the predictions variable
A2[:,0:5]

array([[ 0.        ,  1.        ,  1.        ,  0.92490357,  0.        ]], dtype=float32)

**As we see, our predictions array isn't all 0s or 1s. So, we must convert it to 0s / 1s.**

In [45]:
A2_bool = A2 > 0.5
Y_prediction_training = A2_bool.astype(int)
Y_int = Y.astype(int)

In [46]:
Y_int

array([[0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1,
        1, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1,
        1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1,
        1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 0,
        1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0,
        0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0,
        0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0,
        1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0,
        1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1,
        0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0,
        0, 0, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0,
        1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 

In [47]:
Y_prediction_training

array([[0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 1, 0, 0,
        1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1,
        1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0,
        1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0,
        1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1,
        0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0,
        1, 1, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0,
        0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0,
        0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0,
        1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 1,
        0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0,
        1, 0, 0, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 1, 1, 1, 0, 1, 0, 1, 

In [48]:
accuracy = (Y_prediction_training == Y_int).mean()
accuracy

0.80246913580246915

### Awesome

80.25% accuracy on the training dataset isn't bad, and that's with just 3000 epochs!

Others have reported close to 85% with 40000 epochs, so this is good enough for now.


In [49]:
# First, let's list our helper functions we could make from logic used above.
def convert_sigmoid_output_to_boolean_array(array, threshold):
    array = array > threshold
    return array

def convert_boolean_array_to_binary_array(array):
    array_binary = array.astype(int)
    return array_binary

**Let's now try with the dummies data.**

This is the time to generalize the model we wrote above so that it takes more arguments and isn't tied to the shapes of our X or Y.
Also, let's now print the training accuracy inside the model itself, along with the cost, at every 100th epoch!

In [50]:
### Tensorflow model
def model_generic(learning_rate, X_arg, Y_arg, num_of_epochs, hidden_units, threshold):
    # 1. Placeholders to hold data
    X = tf.placeholder(tf.float32, [X_arg.shape[0],None])
    Y = tf.placeholder(tf.float32, [1, None])

    # 2. Model. 2 layers NN. So, W1, b1, W2, b2.
    # This is basically coding the forward propagation formulas
    W1 = tf.Variable(tf.random_normal((hidden_units,X_arg.shape[0])))
    b1 = tf.Variable(tf.zeros((hidden_units,1)))
    Z1 = tf.matmul(W1,X) + b1
    A1 = tf.nn.relu(Z1)

    W2 = tf.Variable(tf.random_normal((1, hidden_units)))
    b2 = tf.Variable(tf.zeros((1,1)))
    Z2 = tf.matmul(W2,A1) + b2
    A2 = tf.nn.sigmoid(Z2)

    # 3. Calculate cost
    cost = tf.nn.sigmoid_cross_entropy_with_logits(logits=Z2, labels=Y)
    cost_mean = tf.reduce_mean(cost)

    # 4. Optimizer (Gradient Descent / AdamOptimizer ) - Using this line, tensorflow automatically does backpropagation
    optimizer = tf.train.AdamOptimizer(learning_rate).minimize(cost_mean)
    
    # 5. Accuracy methods
    predicted_class = tf.greater(A2,threshold)
    prediction_arr = tf.equal(predicted_class, tf.equal(Y,1.0))
    accuracy = tf.reduce_mean(tf.cast(prediction_arr, tf.float32))
    
    # 6. Initialize variables
    session = tf.Session()
    tf.set_random_seed(1)
    init = tf.global_variables_initializer()
    session.run(init)
    
    # 7. Actual loop where learning happens
    for i in range(num_of_epochs):
        _, cost_mean_val, accuracy_val = session.run([optimizer, cost_mean, accuracy], feed_dict={X:X_arg, Y:Y_arg})
        if i % 100 == 0:
            print("i:",i,", cost : ",cost_mean_val,", training accuracy : ",accuracy_val)
            
    return session.run([W1,b1,W2,b2,A2,Y,accuracy],feed_dict={X:X_arg, Y:Y_arg})

In [51]:
W1_dum,b1_dum,W2_dum,b2_dum,A2_dummies,Y_dummies,training_accuracy_val = model_generic(0.005, trainX_num.T, trainY_num.T, 3000, 100,0.5)

i: 0 , cost :  2395.99 , training accuracy :  0.616162
i: 100 , cost :  67.3294 , training accuracy :  0.61055
i: 200 , cost :  20.3746 , training accuracy :  0.676768
i: 300 , cost :  12.0113 , training accuracy :  0.735129
i: 400 , cost :  8.66318 , training accuracy :  0.762065
i: 500 , cost :  6.96889 , training accuracy :  0.780022
i: 600 , cost :  6.11968 , training accuracy :  0.790123
i: 700 , cost :  5.93834 , training accuracy :  0.805836
i: 800 , cost :  7.09067 , training accuracy :  0.790123
i: 900 , cost :  5.11551 , training accuracy :  0.801347
i: 1000 , cost :  4.36806 , training accuracy :  0.819304
i: 1100 , cost :  6.33762 , training accuracy :  0.799102
i: 1200 , cost :  4.45729 , training accuracy :  0.813693
i: 1300 , cost :  4.10381 , training accuracy :  0.82716
i: 1400 , cost :  3.39415 , training accuracy :  0.824916
i: 1500 , cost :  5.11004 , training accuracy :  0.803591
i: 1600 , cost :  5.91911 , training accuracy :  0.794613
i: 1700 , cost :  9.30856 , 

In [52]:
training_accuracy_val

0.83501685

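Note that the call above was fed trainX_num, not trainX_dummies, so this run is the generic model (100 hidden units) on the integer-encoded data. Accuracy goes up and down during training and ends at about 83.50% after 3000 epochs. That's good enough!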

# Prediction on Test Data

Let's use only the numerical (integer-encoded) data now.
- Converting test data in the same form
- Pass it through the network to get the value of A2
- Concatenate this with the data and write that into csv
- Submit the csv

In [53]:
test_df

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q
1,893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0000,,S
2,894,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q
3,895,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S
4,896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,,S
5,897,3,"Svensson, Mr. Johan Cervin",male,14.0,0,0,7538,9.2250,,S
6,898,3,"Connolly, Miss. Kate",female,30.0,0,0,330972,7.6292,,Q
7,899,2,"Caldwell, Mr. Albert Francis",male,26.0,1,1,248738,29.0000,,S
8,900,3,"Abrahim, Mrs. Joseph (Sophie Halaut Easu)",female,18.0,0,0,2657,7.2292,,C
9,901,3,"Davies, Mr. John Samuel",male,21.0,2,0,A/4 48871,24.1500,,S


In [54]:
test_df.isnull().any()

PassengerId    False
Pclass         False
Name           False
Sex            False
Age             True
SibSp          False
Parch          False
Ticket         False
Fare            True
Cabin           True
Embarked       False
dtype: bool

In [55]:
test_df['Age'] = test_df['Age'].fillna(test_df['Age'].mean())
test_df['Fare'] = test_df['Fare'].fillna(test_df['Fare'].mean())
test_df['Cabin'] = test_df['Cabin'].fillna('UNKNOWN')

In [56]:
test_df.isnull().any()

PassengerId    False
Pclass         False
Name           False
Sex            False
Age            False
SibSp          False
Parch          False
Ticket         False
Fare           False
Cabin          False
Embarked       False
dtype: bool

In [57]:
# Converting to numerical data
test_df_numerical = test_df.copy()
for col in categorical_cols:
    test_df_numerical[col] = test_df_numerical[col].astype('category')
    test_df_numerical[col] = test_df_numerical[col].cat.codes
test_df_numerical.shape

(418, 11)

In [58]:
test_df_numerical.head()

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,892,3,206,1,34.5,0,0,152,7.8292,76,1
1,893,3,403,0,47.0,1,0,221,7.0,76,2
2,894,2,269,1,62.0,0,0,73,9.6875,76,1
3,895,3,408,1,27.0,0,0,147,8.6625,76,2
4,896,3,178,0,22.0,1,1,138,12.2875,76,2
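
One caveat with the loop above: `cat.codes` numbers the categories of whichever frame it is given, so the same value can receive different integers in train and test. The printed heads show this: the 'UNKNOWN' Cabin is 147 in train_df_numerical but 76 in test_df_numerical. A sketch of one way to share the mapping, assuming a hypothetical helper `encode_with_shared_categories` and toy data:

```python
import pandas as pd

def encode_with_shared_categories(train_col, test_col):
    """Integer-encode two columns using one category list built from both."""
    categories = sorted(set(train_col.dropna()) | set(test_col.dropna()))
    dtype = pd.CategoricalDtype(categories=categories)
    return train_col.astype(dtype).cat.codes, test_col.astype(dtype).cat.codes

train_cabin = pd.Series(["UNKNOWN", "C85", "UNKNOWN", "C123"])
test_cabin = pd.Series(["UNKNOWN", "B45", "UNKNOWN"])

train_codes, test_codes = encode_with_shared_categories(train_cabin, test_cabin)
print(train_codes.tolist(), test_codes.tolist())
```

If the category list were built from the training column alone, test values never seen in training would come out as -1.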


In [59]:
# Ref : https://stackoverflow.com/questions/32109319/how-to-implement-the-relu-function-in-numpy
# Ref : https://stackoverflow.com/questions/3985619/how-to-calculate-a-logistic-sigmoid-function-in-python
def predict(W1,b1,W2,b2,X):
    # Same forward pass as the TensorFlow model, in NumPy. X has shape (features, examples).
    Z1 = np.dot(W1,X) + b1
    A1 = np.maximum(Z1, 0)            # ReLU
    
    Z2 = np.dot(W2,A1) + b2
    A2 = 1 / (1 + np.exp(-Z2))        # sigmoid
    return A2

In [60]:
# Let's predict
X_test = test_df_numerical.values
X_test.shape

(418, 11)

In [61]:
W1_tr.shape

(20, 11)

In [62]:
W2_tr.shape

(1, 20)

In [63]:
final_prediction = predict(W1_tr,b1_tr,W2_tr,b2_tr,X_test.T)

In [64]:
final_prediction_int = final_prediction > 0.5
final_prediction_int = final_prediction_int.astype(int)
final_prediction_int.shape

(1, 418)

In [65]:
final_survived_df = pd.DataFrame(data=final_prediction_int.T, columns=['Survived'])
final_survived_df

Unnamed: 0,Survived
0,0
1,1
2,0
3,0
4,1
5,0
6,1
7,0
8,0
9,0


In [66]:
test_df['PassengerId']

0       892
1       893
2       894
3       895
4       896
5       897
6       898
7       899
8       900
9       901
10      902
11      903
12      904
13      905
14      906
15      907
16      908
17      909
18      910
19      911
20      912
21      913
22      914
23      915
24      916
25      917
26      918
27      919
28      920
29      921
       ... 
388    1280
389    1281
390    1282
391    1283
392    1284
393    1285
394    1286
395    1287
396    1288
397    1289
398    1290
399    1291
400    1292
401    1293
402    1294
403    1295
404    1296
405    1297
406    1298
407    1299
408    1300
409    1301
410    1302
411    1303
412    1304
413    1305
414    1306
415    1307
416    1308
417    1309
Name: PassengerId, Length: 418, dtype: int64

In [67]:
final_df = pd.concat([test_df['PassengerId'], final_survived_df], axis=1)
final_df

Unnamed: 0,PassengerId,Survived
0,892,0
1,893,1
2,894,0
3,895,0
4,896,1
5,897,0
6,898,1
7,899,0
8,900,0
9,901,0


In [68]:
# Exporting to a csv file
final_df.to_csv("output-prediction.csv", index=False)

# One function to sum the whole notebook

Now that we've reached here, we'd want to run the same notebook with different hyperparameter values, to see how well our output csv file does on the leaderboard and whether we can improve our position. 
For this, I've reused the helper functions we kept writing above and made one method that does everything: loading data, fixing null values, training the model, then predicting and writing a csv file. 
Ultimately, you could just call this method with a range of hyperparameters, and let it do its magic. I'm gonna do the same on a 

In [71]:
# helper exercise which does the whole thing for any training dataframe given 
def execute_steps_for_titanic(columns_to_use, output_file_name, learning_rate=0.01, num_of_epochs=3000, hidden_units=50, threshold_for_output=0.5):
    # read data
    training_df_orig = pd.read_csv("../input/train.csv")
    testing_df_orig = pd.read_csv("../input/test.csv")
    # get X and Y separated
    train_df_Y = training_df_orig['Survived']
    train_df_X = training_df_orig[columns_to_use]
    test_df_X = testing_df_orig[columns_to_use]
    # fix missing data
    categorical_columns = find_categorical_columns(train_df_X)
    replace_values_dict = {'Embarked':'S', 'Cabin':'UNKNOWN'}
    for col in columns_to_use:
        num_of_NaN_rows = get_num_of_NaN_rows(train_df_X)[col]
        num_of_NaN_rows_test = get_num_of_NaN_rows(test_df_X)[col]
        if(num_of_NaN_rows > 0):
            print("Filling NaN values for column:",col)
            if col not in categorical_columns:
                train_df_X[col] = train_df_X[col].fillna(train_df_X[col].mean())
            else:
                train_df_X[col] = train_df_X[col].fillna(replace_values_dict[col])
        if(num_of_NaN_rows_test > 0):
            print("Filling NaN values for column:",col," in test data")
            if col not in categorical_columns:
                test_df_X[col] = test_df_X[col].fillna(test_df_X[col].mean())
            else:
                test_df_X[col] = test_df_X[col].fillna(replace_values_dict[col])
    print("Fixed NaN values in training and testing data.")
    # convert categorical to numerical data
    train_df_X_num = convert_categorical_column_to_integer_values(train_df_X)
    test_df_X_num = convert_categorical_column_to_integer_values(test_df_X)
    # Get numpy arrays for this data
    train_X = train_df_X_num.values
    test_X = test_df_X_num.values
    train_Y = train_df_Y.values
    # fix rank-1 array created
    train_Y = train_Y[:,np.newaxis]
    # call model and get values 
    W1,b1,W2,b2,A2,Y,final_tr_accuracy = model_generic(learning_rate, train_X.T, train_Y.T, num_of_epochs, hidden_units, threshold_for_output)
    print("Final training accuracy : ",final_tr_accuracy)
    # get prediction and save it to output file
    prediction = predict(W1,b1,W2,b2,test_X.T)
    # if prediction value > threshold, then set as True, else as False
    prediction = prediction > threshold_for_output
    # Convert the True/False array to a 0 , 1 array
    prediction = prediction.astype(int)
    # Convert back to dataframe and give the column name as 'Survived'
    prediction_df = pd.DataFrame(data=prediction.T, columns=['Survived'])
    # Make a final data frame of the required output and output to csv
    final_df = pd.concat([testing_df_orig['PassengerId'], prediction_df], axis=1)
    final_file_name = output_file_name+"_tr_acc_"+"{0:.2f}".format(final_tr_accuracy)+"_prediction.csv"
    final_df.to_csv(final_file_name, index=False)
    print("Done.")
    return final_file_name, final_tr_accuracy
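
The last steps of the function above (threshold the network output, cast to 0/1, attach `PassengerId`) can be tried in isolation. Below is a minimal sketch, not the notebook's actual data: `probs` and `passenger_ids` are made-up stand-ins, and the `(1, m)` shape of `probs` is an assumption based on the `.T` applied to the prediction in the original code.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins: a (1, m) row of predicted probabilities and matching IDs.
probs = np.array([[0.12, 0.73, 0.50, 0.91]])
passenger_ids = pd.Series([892, 893, 894, 895], name="PassengerId")

threshold_for_output = 0.5
# Strictly greater than the threshold -> 1, otherwise 0 (so exactly 0.50 maps to 0).
labels = (probs > threshold_for_output).astype(int)

# Transpose to (m, 1) so that each passenger gets one row in the 'Survived' column.
prediction_df = pd.DataFrame(data=labels.T, columns=["Survived"])

# concat with axis=1 aligns rows on the index; both objects use 0..m-1 here.
final_df = pd.concat([passenger_ids, prediction_df], axis=1)
print(final_df)
```

Because `pd.concat(..., axis=1)` aligns on the index, the ID column needs the same 0..m-1 index as the prediction frame. If the test frame had been filtered or reordered earlier, calling `reset_index(drop=True)` on it first would be a reasonable precaution.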

In [70]:
# Let's try a single run first.
# Until now, Name and PassengerId were included as features. In real life, however,
# they should not affect whether a person survives, so let's train without them.
columns_to_use = ['Pclass','Sex','Age','SibSp','Parch','Ticket','Fare','Cabin','Embarked']
execute_steps_for_titanic(columns_to_use, "bhavul", learning_rate=0.005, num_of_epochs=5000, hidden_units=30, threshold_for_output=0.5)

Filling NaN values for column: Age
Filling NaN values for column: Age  in test data


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy


Filling NaN values for column: Fare  in test data
Filling NaN values for column: Cabin
Filling NaN values for column: Cabin  in test data


Filling NaN values for column: Embarked
Fixed NaN values in training and testing data.
i: 0 , cost :  622.927 , training accuracy :  0.615039
i: 100 , cost :  32.1764 , training accuracy :  0.638608
i: 200 , cost :  5.37419 , training accuracy :  0.676768
i: 300 , cost :  3.16623 , training accuracy :  0.721661
i: 400 , cost :  2.37026 , training accuracy :  0.732884
i: 500 , cost :  2.64006 , training accuracy :  0.705948
i: 600 , cost :  2.30937 , training accuracy :  0.707071
i: 700 , cost :  1.52118 , training accuracy :  0.780022
i: 800 , cost :  1.27412 , training accuracy :  0.782267
i: 900 , cost :  1.78044 , training accuracy :  0.739618
i: 1000 , cost :  1.10221 , training accuracy :  0.79798
i: 1100 , cost :  1.08912 , training accuracy :  0.796857
i: 1200 , cost :  1.99274 , training accuracy :  0.719416
i: 1300 , cost :  1.3371 , training accuracy :  0.762065
i: 1400 , cost :  1.36572 , training accuracy :  0.749719
i: 1500 , cost :  0.930596 , training accuracy :  0.79910
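
The "A value is trying to be set on a copy of a slice" message in the output above is pandas' `SettingWithCopyWarning`. It typically shows up when a frame obtained by selecting columns from another frame is assigned to later. The creation of `train_df_X` is not shown in this part of the notebook, so the sketch below assumes it is built by plain column selection; under that assumption an explicit `.copy()` is one way to avoid the warning. The toy frames and their values are made up. Note that, unlike the function above, this sketch fills the test data with statistics computed from the training data, and uses the column mode for categorical columns instead of `replace_values_dict`.

```python
import numpy as np
import pandas as pd

# Toy frames standing in for the Titanic data (values are made up).
train_df = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0],
    "Embarked": ["S", None, "S"],
    "Name": ["a", "b", "c"],
})
test_df = pd.DataFrame({
    "Age": [np.nan, 40.0],
    "Embarked": ["C", None],
    "Name": ["d", "e"],
})

columns_to_use = ["Age", "Embarked"]
categorical_columns = ["Embarked"]

# .copy() yields independent frames, so later assignments do not target a slice.
train_df_X = train_df[columns_to_use].copy()
test_df_X = test_df[columns_to_use].copy()

for col in columns_to_use:
    if col in categorical_columns:
        fill_value = train_df_X[col].mode()[0]  # most frequent training value
    else:
        fill_value = train_df_X[col].mean()     # training mean (NaN is skipped)
    # The same training-derived value is used for both frames.
    train_df_X[col] = train_df_X[col].fillna(fill_value)
    test_df_X[col] = test_df_X[col].fillna(fill_value)

print(train_df_X)
print(test_df_X)
```

Reusing the training statistics for the test set keeps the two sets on the same footing. Whether that changes the results noticeably on this dataset has not been checked here.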

In [73]:
from hyperdash import monitor_cell

In [74]:
%%monitor_cell "Titanic all variations"

columns_to_use = ['Pclass','Sex','Age','SibSp','Parch','Ticket','Fare','Cabin','Embarked']
learning_rates = [0.001, 0.002, 0.005, 0.01, 0.02, 0.05]
num_of_epochs_arr = [1000, 5000, 10000, 30000]
hidden_units_arr = [3, 10, 15, 50, 100]

for learning_rate in learning_rates:
    for num_of_epochs in num_of_epochs_arr:
        for hidden_units in hidden_units_arr:
            filename, accuracy_val = execute_steps_for_titanic(
                columns_to_use, "bhavul",
                learning_rate=learning_rate, num_of_epochs=num_of_epochs,
                hidden_units=hidden_units, threshold_for_output=0.5)
            print("\n","="*50)
            print("[lr:",learning_rate,"][epoch:",num_of_epochs,"][hidden:",hidden_units,"][file:",filename,"] ACCURACY : ",accuracy_val)
            print("="*50,"\n")
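
The three nested loops above walk a full grid of hyperparameters. A flatter way to write the same sweep, and to keep the results for comparison afterwards, might look like the sketch below. `run_experiment` is a hypothetical stand-in for `execute_steps_for_titanic` that returns a dummy score so the sketch runs on its own; it does not train anything.

```python
from itertools import product

learning_rates = [0.001, 0.002, 0.005, 0.01, 0.02, 0.05]
num_of_epochs_arr = [1000, 5000, 10000, 30000]
hidden_units_arr = [3, 10, 15, 50, 100]


def run_experiment(learning_rate, num_of_epochs, hidden_units):
    """Hypothetical stand-in for execute_steps_for_titanic; returns a dummy score."""
    filename = "dummy_lr{}_ep{}_h{}.csv".format(learning_rate, num_of_epochs, hidden_units)
    return filename, 0.0


results = []
# product() iterates in the same order as the nested loops: the last list varies fastest.
for lr, epochs, hidden in product(learning_rates, num_of_epochs_arr, hidden_units_arr):
    filename, accuracy = run_experiment(lr, epochs, hidden)
    results.append({
        "lr": lr,
        "epochs": epochs,
        "hidden": hidden,
        "file": filename,
        "accuracy": accuracy,
    })

# With real scores, this picks the run with the highest recorded accuracy.
best = max(results, key=lambda r: r["accuracy"])
print(len(results), "runs")
```

The grid has 6 × 4 × 5 = 120 combinations, and each one is a full training run. The value returned by `execute_steps_for_titanic` is the training accuracy, so the best entry in such a list is the best fit to the training data, not necessarily the best performer on the test set.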

Filling NaN values for column: Age
Filling NaN values for column: Age  in test data
Filling NaN values for column: Fare  in test data
Filling NaN values for column: Cabin
Filling NaN values for column: Cabin  in test data
Filling NaN values for column: Embarked
Fixed NaN values in training and testing data.
i: 0 , cost :  118.938 , training accuracy :  0.616162


i: 100 , cost :  86.9805 , training accuracy :  0.615039
i: 200 , cost :  59.4641 , training accuracy :  0.631874
i: 300 , cost :  38.533 , training accuracy :  0.662177
i: 400 , cost :  21.2574 , training accuracy :  0.659933
i: 500 , cost :  13.3527 , training accuracy :  0.620651
i: 600 , cost :  9.92742 , training accuracy :  0.620651
i: 700 , cost :  7.47914 , training accuracy :  0.620651
i: 800 , cost :  5.63277 , training accuracy :  0.619529
i: 900 , cost :  4.2046 , training accuracy :  0.615039
Final training accuracy :  0.612795
Done.

[lr: 0.001 ][epoch: 1000 ][hidden: 3 ][file: bhavul_tr_acc_0.61_prediction.csv ] ACCURACY :  0.612795

Filling NaN values for column: Age
Filling NaN values for column: Age  in test data
Filling NaN values for column: Fare  in test data
Filling NaN values for column: Cabin
Filling NaN values for column: Cabin  in test data
Filling NaN values for column: Embarked
Fixed NaN values in training and testing data.
i: 0 , cost :  324.82 , training a

Filling NaN values for column: Fare  in test data
Filling NaN values for column: Cabin
Filling NaN values for column: Cabin  in test data
Filling NaN values for column: Embarked
Fixed NaN values in training and testing data.
i: 0 , cost :  710.496 , training accuracy :  0.37486
i: 100 , cost :  62.9678 , training accuracy :  0.59596
i: 200 , cost :  18.1675 , training accuracy :  0.652076
i: 300 , cost :  1.61923 , training accuracy :  0.689113
i: 400 , cost :  1.12122 , training accuracy :  0.753086
i: 500 , cost :  0.877217 , training accuracy :  0.75982
i: 600 , cost :  0.721939 , training accuracy :  0.771044
i: 700 , cost :  0.646359 , training accuracy :  0.7789
i: 800 , cost :  0.601009 , training accuracy :  0.784512
i: 900 , cost :  0.571206 , training accuracy :  0.790123
i: 1000 , cost :  0.553248 , training accuracy :  0.781145
i: 1100 , cost :  0.535586 , training accuracy :  0.765432
i: 1200 , cost :  0.510556 , training accuracy :  0.769921
i: 1300 , cost :  0.511724 , t