<a href="https://colab.research.google.com/github/deborahtrez/machine-learning/blob/master/Training_and_Testing_Data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# The PyPI package is named scikit-learn; the "sklearn" name is a deprecated placeholder.
!pip install scikit-learn

import os
import sys

import numpy as np
import pandas as pd  # for loading and manipulating tabular data, e.g. dropping columns
import matplotlib.pyplot as plt  # for plotting graphs and charts
from IPython.display import clear_output  # for clearing the cell output
from six.moves import urllib

import tensorflow.compat.v2.feature_column as fc
import tensorflow as tf



In [None]:
# Load dataset.
dftrain = pd.read_csv('https://storage.googleapis.com/tf-datasets/titanic/train.csv') #this dataset is for training the model
dfeval = pd.read_csv('https://storage.googleapis.com/tf-datasets/titanic/eval.csv') #this dataset is for testing the model

y_train = dftrain.pop('survived') 
y_eval = dfeval.pop('survived')

In [None]:
CATEGORICAL_COLUMNS = ['sex', 'n_siblings_spouses', 'parch', 'class', 'deck', 'embark_town', 'alone']  # columns treated as categories, including the integer counts n_siblings_spouses and parch
# Categorical data is usually encoded as integers so that it is easy to pass to a model; for example, 'male' becomes 0 and 'female' becomes 1.
# TensorFlow builds this encoding for us from a vocabulary list.

NUMERIC_COLUMNS = ['age', 'fare']

feature_columns = []
for feature_name in CATEGORICAL_COLUMNS:  # build one vocabulary-based column per categorical feature
  vocabulary = dftrain[feature_name].unique()  # array of all the unique values in this column
  feature_columns.append(tf.feature_column.categorical_column_with_vocabulary_list(feature_name, vocabulary))

for feature_name in NUMERIC_COLUMNS:
  feature_columns.append(tf.feature_column.numeric_column(feature_name, dtype=tf.float32))
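
To make the vocabulary encoding concrete, here is a minimal pure-Python sketch of the idea. It is not TensorFlow's implementation; `build_vocabulary` and `encode` are hypothetical helper names, and the toy column is made up. Values missing from the vocabulary map to -1, mirroring the `default_value=-1` shown in the printed feature columns.

```python
def build_vocabulary(values):
    """Return the unique values in first-seen order."""
    seen = {}
    for value in values:
        seen.setdefault(value, len(seen))
    return list(seen)


def encode(values, vocabulary, default_value=-1):
    """Map each value to its position in the vocabulary (default_value if absent)."""
    index = {value: i for i, value in enumerate(vocabulary)}
    return [index.get(value, default_value) for value in values]


# Toy column, assumed for illustration only.
sex_column = ['male', 'female', 'female', 'male']
vocab = build_vocabulary(sex_column)
encoded = encode(sex_column + ['other'], vocab)  # 'other' is out of vocabulary

print(vocab)    # ['male', 'female']
print(encoded)  # [0, 1, 1, 0, -1]
```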

In [None]:
# Let's see what unique() returns for a few of these columns
dftrain["sex"].unique()

array(['male', 'female'], dtype=object)

In [None]:
dftrain["embark_town"].unique()

array(['Southampton', 'Cherbourg', 'Queenstown', 'unknown'], dtype=object)

In [None]:
dftrain["n_siblings_spouses"].unique()

array([1, 0, 3, 4, 2, 5, 8])

In [None]:
print(feature_columns)

[VocabularyListCategoricalColumn(key='sex', vocabulary_list=('male', 'female'), dtype=tf.string, default_value=-1, num_oov_buckets=0), VocabularyListCategoricalColumn(key='n_siblings_spouses', vocabulary_list=(1, 0, 3, 4, 2, 5, 8), dtype=tf.int64, default_value=-1, num_oov_buckets=0), VocabularyListCategoricalColumn(key='parch', vocabulary_list=(0, 1, 2, 5, 3, 4), dtype=tf.int64, default_value=-1, num_oov_buckets=0), VocabularyListCategoricalColumn(key='class', vocabulary_list=('Third', 'First', 'Second'), dtype=tf.string, default_value=-1, num_oov_buckets=0), VocabularyListCategoricalColumn(key='deck', vocabulary_list=('unknown', 'C', 'G', 'A', 'B', 'D', 'F', 'E'), dtype=tf.string, default_value=-1, num_oov_buckets=0), VocabularyListCategoricalColumn(key='embark_town', vocabulary_list=('Southampton', 'Cherbourg', 'Queenstown', 'unknown'), dtype=tf.string, default_value=-1, num_oov_buckets=0), VocabularyListCategoricalColumn(key='alone', vocabulary_list=('n', 'y'), dtype=tf.string, def

THE TRAINING PROCESS

When training the model, the input data is split into batches of 32 rows and fed to the model over several epochs. An epoch is one full pass over the training data, so the number of epochs is how many times the model sees the same data. You feed the data into the model again and again, in a different order each time, but be careful not to do it too often: the model can end up memorizing the training points, so that it gives accurate results on them but poor predictions on new data (overfitting).

The input function below, `make_input_fn`, defines how the data is broken into batches and epochs before it is fed to the model. It encodes the data as a `tf.data.Dataset` object. The model works with this object rather than with the data in its raw form.
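
As a rough check on how batches and epochs combine, the sketch below counts training steps. It assumes the training set has 627 rows, a figure that is not printed anywhere in this notebook; the helper names are hypothetical. Under that assumption the total comes to 200 steps, which is consistent with the `model.ckpt-200` checkpoint name in the prediction log further down.

```python
import math


def steps_per_epoch(num_rows, batch_size):
    """Number of batches needed to cover the data once (the last batch may be smaller)."""
    return math.ceil(num_rows / batch_size)


def total_steps(num_rows, batch_size, num_epochs):
    """Number of batches seen over all epochs."""
    return steps_per_epoch(num_rows, batch_size) * num_epochs


NUM_ROWS = 627  # assumed size of the training set, not shown in the notebook

print(steps_per_epoch(NUM_ROWS, 32))   # 20
print(total_steps(NUM_ROWS, 32, 10))   # 200
```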

In [None]:
def make_input_fn(data_df, label_df, num_epochs=10, shuffle=True, batch_size=32):
  def input_function():
    ds = tf.data.Dataset.from_tensor_slices((dict(data_df), label_df))  # create a tf.data.Dataset from the features and their labels
    if shuffle:
      ds = ds.shuffle(1000)  # shuffle the rows using a buffer of 1000 elements
    ds = ds.batch(batch_size).repeat(num_epochs)  # split the data into batches of batch_size rows, then
                                                  # repeat the whole dataset num_epochs times
    return ds  # return the batched, repeated dataset
  return input_function

train_input_fn = make_input_fn(dftrain, y_train)
eval_input_fn = make_input_fn(dfeval, y_eval, num_epochs=1, shuffle=False)  # no shuffling and a single epoch here: we are only evaluating the model, not training it
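
For intuition, the shuffle, batch, and repeat steps can be imitated in plain Python. This is only a sketch of the behaviour, not how `tf.data` works internally; `iterate_batches`, its `seed` parameter, and the toy data are all assumptions made for illustration.

```python
import random


def iterate_batches(rows, labels, num_epochs=10, shuffle=True, batch_size=32, seed=0):
    """Yield (rows, labels) batches, covering the full data once per epoch."""
    rng = random.Random(seed)
    order = list(range(len(rows)))
    for _ in range(num_epochs):
        if shuffle:
            rng.shuffle(order)  # a different order for each epoch
        for start in range(0, len(order), batch_size):
            idx = order[start:start + batch_size]
            yield [rows[i] for i in idx], [labels[i] for i in idx]


# Toy data: 10 rows with alternating labels.
rows = list(range(10))
labels = [r % 2 for r in rows]

batches = list(iterate_batches(rows, labels, num_epochs=2, shuffle=False, batch_size=4))
print(len(batches))  # 6: three batches (4 + 4 + 2 rows) per epoch, two epochs
print(batches[2])    # ([8, 9], [0, 1]): the smaller final batch of the first epoch
```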

With the feature columns and input functions defined, we can train the model. Training a model is just a single command using the `tf.estimator` API:

In [None]:
linear_est = tf.estimator.LinearClassifier(feature_columns=feature_columns)

linear_est.train(train_input_fn)  # pull batches from the input function and train the model
result = linear_est.evaluate(eval_input_fn)  # after training, evaluate the model on the held-out data

clear_output()
print(result['accuracy'])

0.7537879


USE THE MODEL TO MAKE PREDICTIONS

To make predictions, we pass the eval input function to `predict`, which yields one prediction per passenger in the evaluation set.

In [None]:
result = list(linear_est.predict(eval_input_fn))
print(result)  # a list with one dictionary per passenger; the key we are interested in is 'probabilities'

#Let us look at just ONE prediction
print(result[0])

#We want 'probabilities'
print(result[0]['probabilities'])

INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from /tmp/tmp89ys2hy9/model.ckpt-200
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
[{'logits': array([-2.1575084], dtype=float32), 'logistic': array([0.10363168], dtype=float32), 'probabilities': array([0.89636827, 0.10363167], dtype=float32), 'class_ids': array([0]), 'classes': array([b'0'], dtype=object), 'all_class_ids': array([0, 1], dtype=int32), 'all_classes': array([b'0', b'1'], dtype=object)}, {'logits': array([0.04180787], dtype=float32), 'logistic': array([0.5104504], dtype=float32), 'probabilities': array([0.48954958, 0.5104505 ], dtype=float32), 'class_ids': array([1]), 'classes': array([b'1'], dtype=object), 'all_class_ids': array([0, 1], dtype=int32), 'all_classes': array([b'0', b'1'], dtype=object)}, {'logits': array([1.5846708], dtype=float32), 'logistic': array([0.829865], dtype=float32), 'probabilities': array([0.1701349

In [None]:
# So what is the probability of survival? Index 1 is "survived" (1) and index 0 is "did not survive" (0)
print(result[0]['probabilities'][1])

0.10363167
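
The fields in each prediction dictionary are related to one another. As a sketch, assuming the usual logistic (sigmoid) link for a binary linear classifier, the `logistic` value is the sigmoid of `logits`, and `probabilities` is the pair (1 - p, p). The snippet below recomputes the first passenger's numbers from the printed logit; `sigmoid` is a helper defined here, not a TensorFlow call.

```python
import math


def sigmoid(logit):
    """Map a real-valued logit to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-logit))


logit = -2.1575084  # the 'logits' value printed for the first passenger
p_survived = sigmoid(logit)
probabilities = [1.0 - p_survived, p_survived]  # [did not survive, survived]
class_id = max(range(2), key=lambda i: probabilities[i])  # index of the larger probability

print(round(p_survived, 4))  # 0.1036
print(class_id)              # 0
```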


The first person's chance of survival

In [28]:
print(dfeval.loc[0])  # the passenger's features
print(y_eval.loc[0])  # the actual value (remember, this column was popped off at the beginning so we could predict it)
print(result[0]['probabilities'][1])  # the predicted probability of survival

sex                          male
age                            35
n_siblings_spouses              0
parch                           0
fare                         8.05
class                       Third
deck                      unknown
embark_town           Southampton
alone                           y
Name: 0, dtype: object
0
0.10363167


The second person's chances of survival...

In [27]:
print(dfeval.loc[1])
print(y_eval.loc[1])
print(result[1]['probabilities'][1])

sex                          male
age                            54
n_siblings_spouses              0
parch                           0
fare                      51.8625
class                       First
deck                            E
embark_town           Southampton
alone                           y
Name: 1, dtype: object
0
0.5104505


The third person's chances of survival

In [26]:
print(dfeval.loc[2])
print(y_eval.loc[2])
print(result[2]['probabilities'][1])

sex                        female
age                            58
n_siblings_spouses              0
parch                           0
fare                        26.55
class                       First
deck                            C
embark_town           Southampton
alone                           y
Name: 2, dtype: object
1
0.829865


In [25]:
print(dfeval.loc[3])
print(y_eval.loc[3])
print(result[3]['probabilities'][1])

sex                        female
age                            55
n_siblings_spouses              0
parch                           0
fare                           16
class                      Second
deck                      unknown
embark_town           Southampton
alone                           y
Name: 3, dtype: object
1
0.7483945


In [29]:
print(dfeval.loc[100])
print(y_eval.loc[100])
print(result[100]['probabilities'][1])

sex                          male
age                            30
n_siblings_spouses              0
parch                           0
fare                         7.25
class                       Third
deck                      unknown
embark_town           Southampton
alone                           y
Name: 100, dtype: object
0
0.11186324
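
To tie these individual predictions back to the accuracy figure, the sketch below scores just the five passengers printed above, using their predicted probabilities and actual labels. It assumes a decision threshold of 0.5; the variable names are illustrative. Five rows are far too few to estimate accuracy reliably, so treat this only as a worked example of the calculation.

```python
# Predicted survival probabilities and actual labels for passengers 0, 1, 2, 3 and 100,
# copied from the outputs above.
predicted = [0.10363167, 0.5104505, 0.829865, 0.7483945, 0.11186324]
actual = [0, 0, 1, 1, 0]

THRESHOLD = 0.5  # assumed decision threshold

predicted_classes = [1 if p > THRESHOLD else 0 for p in predicted]
correct = sum(1 for guess, truth in zip(predicted_classes, actual) if guess == truth)
accuracy = correct / len(actual)

print(predicted_classes)  # [0, 1, 1, 1, 0]
print(accuracy)           # 0.8
```

The one miss is the second passenger, whose predicted probability of 0.51 sits just above the threshold even though he did not survive.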
