load_data.ipynb
  is from https://www.tensorflow.org/tutorials/load_data/csv
  


## Purpose
* This tutorial provides an example of ***how to load CSV data from a file into a tf.data.Dataset***. 
* The data used in this tutorial are taken from the ***Titanic passenger list***.
* The model will ***predict the likelihood a passenger survived based on characteristics*** like age, gender, ticket class, and whether the person was traveling alone.

## Contents
* Setup
* Load data
* Data preprocessing
  * Continuous data
  * Categorical data
  * Combined preprocessing layer
* Build the model
* Train, evaluate, and predict


## Setup
The Titanic passenger list has been collected, cleaned, and split into two CSV files: train.csv and eval.csv. Both files are downloaded with tf.keras.utils.get_file.

In [2]:
from __future__ import absolute_import, division, print_function, unicode_literals
import functools

import numpy as np
import tensorflow as tf

TRAIN_DATA_URL = "https://storage.googleapis.com/tf-datasets/titanic/train.csv"
TEST_DATA_URL = "https://storage.googleapis.com/tf-datasets/titanic/eval.csv"

train_file_path = tf.keras.utils.get_file("train.csv", TRAIN_DATA_URL)
test_file_path = tf.keras.utils.get_file("eval.csv", TEST_DATA_URL)

Downloading data from https://storage.googleapis.com/tf-datasets/titanic/train.csv
Downloading data from https://storage.googleapis.com/tf-datasets/titanic/eval.csv


The downloaded files are stored in the following paths.

In [5]:
train_file_path

'/home/aimldl/.keras/datasets/train.csv'

In [6]:
test_file_path

'/home/aimldl/.keras/datasets/eval.csv'
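
As a rough sketch of where these paths come from: get_file appears to cache downloads under a .keras/datasets folder in the user's home directory unless told otherwise. The default location is an assumption here (it can be overridden through get_file's arguments), and default_keras_cache_path is a hypothetical helper that only rebuilds the path without calling TensorFlow.

```python
from pathlib import Path


def default_keras_cache_path(fname):
    """Hypothetical helper: guess where get_file stores `fname` by default."""
    # Assumption: the default cache directory is ~/.keras and the
    # default subdirectory is "datasets".
    return Path.home() / ".keras" / "datasets" / fname


expected_train_path = default_keras_cache_path("train.csv")
print(expected_train_path)
```

Comparing this guess with train_file_path is a quick way to check whether the file landed in the default cache.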

The following line is optional. It limits NumPy output to three decimal places and suppresses scientific notation, which makes the printed values easier to read.

In [29]:
# Make numpy values easier to read.
np.set_printoptions(precision=3, suppress=True)
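
To see what these two options change, here is a small sketch with made-up values (they are not taken from the dataset). It uses the np.printoptions context manager so the settings apply only inside the with block, leaving the global configuration alone.

```python
import numpy as np

# Illustrative values only; the very small number pushes NumPy's
# default formatting into scientific notation.
values = np.array([7.775, 7.8792, 0.00001234, 26.55])

default_text = np.array2string(values)

# Same options as above, but applied temporarily.
with np.printoptions(precision=3, suppress=True):
    readable_text = np.array2string(values)

print("default :", default_text)
print("readable:", readable_text)
```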

## Load data
TensorFlow's tutorial uses the Linux command head to take a quick look at the train.csv file.

In [8]:
!head {train_file_path}

survived,sex,age,n_siblings_spouses,parch,fare,class,deck,embark_town,alone
0,male,22.0,1,0,7.25,Third,unknown,Southampton,n
1,female,38.0,1,0,71.2833,First,C,Cherbourg,n
1,female,26.0,0,0,7.925,Third,unknown,Southampton,y
1,female,35.0,1,0,53.1,First,C,Southampton,n
0,male,28.0,0,0,8.4583,Third,unknown,Queenstown,y
0,male,2.0,3,1,21.075,Third,unknown,Southampton,n
1,female,27.0,0,2,11.1333,Third,unknown,Southampton,n
1,female,14.0,1,0,30.0708,Second,unknown,Cherbourg,n
1,female,4.0,1,1,16.7,Third,G,Southampton,n
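
On a system without the head command, the same peek can be done in plain Python. The sketch below is one possible approach, not part of the original tutorial: head_lines is a hypothetical helper, and a small temporary file with rows copied from the output above stands in for train.csv so the example runs on its own.

```python
import itertools
import os
import tempfile


def head_lines(path, n=10):
    """Hypothetical helper: return the first n lines of a text file."""
    with open(path, "r", encoding="utf-8") as f:
        # islice stops after n lines, so large files are not read fully.
        return [line.rstrip("\n") for line in itertools.islice(f, n)]


# Stand-in file; in the notebook, pass train_file_path instead.
sample = (
    "survived,sex,age,n_siblings_spouses,parch,fare,class,deck,embark_town,alone\n"
    "0,male,22.0,1,0,7.25,Third,unknown,Southampton,n\n"
    "1,female,38.0,1,0,71.2833,First,C,Cherbourg,n\n"
)
with tempfile.NamedTemporaryFile(
    "w", suffix=".csv", delete=False, encoding="utf-8"
) as tmp:
    tmp.write(sample)
    sample_path = tmp.name

first_two = head_lines(sample_path, n=2)
for line in first_two:
    print(line)

os.remove(sample_path)  # clean up the stand-in file
```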


Inspecting the raw file works, but it is hard to read. Pandas shows the same data as a well-formatted table.

In [15]:
import pandas as pd

train_csv_df = pd.read_csv( train_file_path )
train_csv_df.head(10)

Unnamed: 0,survived,sex,age,n_siblings_spouses,parch,fare,class,deck,embark_town,alone
0,0,male,22.0,1,0,7.25,Third,unknown,Southampton,n
1,1,female,38.0,1,0,71.2833,First,C,Cherbourg,n
2,1,female,26.0,0,0,7.925,Third,unknown,Southampton,y
3,1,female,35.0,1,0,53.1,First,C,Southampton,n
4,0,male,28.0,0,0,8.4583,Third,unknown,Queenstown,y
5,0,male,2.0,3,1,21.075,Third,unknown,Southampton,n
6,1,female,27.0,0,2,11.1333,Third,unknown,Southampton,n
7,1,female,14.0,1,0,30.0708,Second,unknown,Cherbourg,n
8,1,female,4.0,1,1,16.7,Third,G,Southampton,n
9,0,male,20.0,0,0,8.05,Third,unknown,Southampton,y


In [17]:
train_csv_df.tail()

Unnamed: 0,survived,sex,age,n_siblings_spouses,parch,fare,class,deck,embark_town,alone
622,0,male,28.0,0,0,10.5,Second,unknown,Southampton,y
623,0,male,25.0,0,0,7.05,Third,unknown,Southampton,y
624,1,female,19.0,0,0,30.0,First,B,Southampton,y
625,0,female,28.0,1,2,23.45,Third,unknown,Southampton,n
626,0,male,32.0,0,0,7.75,Third,unknown,Queenstown,y


shape returns the table size as (rows, columns), so the training file lists 627 passengers.

In [14]:
train_csv_df.shape

(627, 10)

For the sake of completeness, let's take a look at eval.csv as well. The very first column in the output is the row index that Pandas adds automatically. It is not a column of the CSV file, which is why shape reports 10 columns.

In [20]:
eval_csv_df = pd.read_csv( test_file_path )
eval_csv_df.head()

Unnamed: 0,survived,sex,age,n_siblings_spouses,parch,fare,class,deck,embark_town,alone
0,0,male,35.0,0,0,8.05,Third,unknown,Southampton,y
1,0,male,54.0,0,0,51.8625,First,E,Southampton,y
2,1,female,58.0,0,0,26.55,First,C,Southampton,y
3,1,female,55.0,0,0,16.0,Second,unknown,Southampton,y
4,1,male,34.0,0,0,13.0,Second,D,Southampton,y


In [21]:
eval_csv_df.shape

(264, 10)

The evaluation file covers 264 passengers. The model will therefore be trained on data from 627 passengers and evaluated on data from 264 passengers, 891 passengers in total. That is roughly 70% for training and 30% for evaluation. The ratio varies from dataset to dataset, but 70/30 is a reasonable split between training and evaluation data.
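
The split percentages follow directly from the two shape outputs. A quick check, using only the row counts reported above:

```python
n_train = 627  # rows in train.csv, from train_csv_df.shape
n_eval = 264   # rows in eval.csv, from eval_csv_df.shape

total = n_train + n_eval
train_share = n_train / total
eval_share = n_eval / total

print(f"total={total}, train={train_share:.1%}, eval={eval_share:.1%}")
# prints: total=891, train=70.4%, eval=29.6%
```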

Back in the tutorial, the function get_dataset returns a dataset read from the file given by file_path. Here, train.csv and eval.csv are loaded into raw_train_data and raw_test_data, respectively.

LABEL_COLUMN is the value that get_dataset passes to make_csv_dataset as label_name. Of the ten columns, 'survived' is the one the model predicts. The values in the 'survived' column are either 0 or 1, so LABELS is [0, 1].

In [22]:
LABEL_COLUMN = 'survived'
LABELS = [0, 1]
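
To make the role of the label column concrete, here is a minimal sketch in plain Python that separates 'survived' from the other nine columns. It uses the standard csv module instead of TensorFlow, and the two sample rows are copied from the head output shown earlier.

```python
import csv
import io

LABEL_COLUMN = 'survived'

# Two rows from train.csv, kept inline so the sketch is self-contained.
sample_csv = io.StringIO(
    "survived,sex,age,n_siblings_spouses,parch,fare,class,deck,embark_town,alone\n"
    "0,male,22.0,1,0,7.25,Third,unknown,Southampton,n\n"
    "1,female,38.0,1,0,71.2833,First,C,Cherbourg,n\n"
)

features = []
labels = []
for row in csv.DictReader(sample_csv):
    # Remove the label from the row; what remains are the features.
    labels.append(int(row.pop(LABEL_COLUMN)))
    features.append(row)

print(labels)
print(list(features[0].keys()))
```

After the split, each feature row holds nine columns and each label is 0 or 1.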

In [23]:
def get_dataset(file_path, **kwargs):
  dataset = tf.data.experimental.make_csv_dataset(
      file_path,
      batch_size=5, # Artificially small to make examples easier to show.
      label_name=LABEL_COLUMN,
      na_value="?",
      num_epochs=1,
      ignore_errors=True, 
      **kwargs)
  return dataset

In [24]:
raw_train_data = get_dataset(train_file_path)
raw_test_data = get_dataset(test_file_path)

Instructions for updating:
Use `tf.data.Dataset.interleave(map_func, cycle_length, block_length, num_parallel_calls=tf.data.experimental.AUTOTUNE)` instead. If sloppy execution is desired, use `tf.data.Options.experimental_determinstic`.


(Optional) That's enough information to follow this part. For the sake of completeness, get_dataset is explained further below.

get_dataset is a wrapper around [tf.data.experimental.make_csv_dataset](https://www.tensorflow.org/api_docs/python/tf/data/experimental/make_csv_dataset) which reads a CSV file into a dataset. 

> A dataset, where each element is a (features, labels) tuple that corresponds to a batch of batch_size CSV rows.

The type of this dataset is tensorflow.python.data.ops.dataset_ops.PrefetchDataset. The likely reason is that the last transformation make_csv_dataset applies is a prefetch.

TODO: Explain a little more about this.

In [26]:
type( raw_train_data )

tensorflow.python.data.ops.dataset_ops.PrefetchDataset

In [27]:
type( raw_test_data )

tensorflow.python.data.ops.dataset_ops.PrefetchDataset

The dataset settings, such as batch_size, num_epochs, na_value, and ignore_errors, are specified in get_dataset.

```
- batch_size: An int representing the number of records to combine in a single batch.
- num_epochs: An int specifying the number of times this dataset is repeated. If None, cycles through the dataset forever.
- na_value: Additional string to recognize as NA/NaN (Not a Number).
- ignore_errors: (Optional.) If True, ignores errors with CSV file parsing, such as malformed data or empty lines, and moves on to the next valid CSV record. Otherwise, the dataset raises an error and stops processing when encountering any invalid records. Defaults to False.
```

For more information, refer to [tf.data.experimental.make_csv_dataset](https://www.tensorflow.org/api_docs/python/tf/data/experimental/make_csv_dataset).
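
As a back-of-the-envelope sketch of how batch_size and num_epochs interact, the number of batches in raw_train_data can be estimated from the row count. This assumes that no rows are skipped by ignore_errors and that the final, smaller batch is kept rather than dropped.

```python
import math

n_rows = 627     # rows in train.csv, from train_csv_df.shape
batch_size = 5   # as passed in get_dataset
num_epochs = 1   # one pass over the file

# Assumption: the last partial batch is kept, so round up.
n_batches = math.ceil(n_rows * num_epochs / batch_size)
last_batch_size = n_rows % batch_size or batch_size

print(n_batches, last_batch_size)
# prints: 126 2
```

Under these assumptions, one epoch yields 125 full batches of 5 rows followed by one batch of 2 rows.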

The helper show_batch prints one batch of a dataset.

In [30]:
def show_batch(dataset):
  for batch, label in dataset.take(1):
    for key, value in batch.items():
      print("{:20s}: {}".format(key,value.numpy()))

The tutorial explains:
> Each item in the dataset is a batch, represented as a tuple of (many examples, many labels). The data from the examples is organized in column-based tensors (rather than row-based tensors), each with as many elements as the batch size (5 in this case).

Recall that each item of a dataset is a (features, labels) tuple. In show_batch, batch holds the features as a dictionary with one entry per feature column, nine in total because the label column 'survived' has been split off. Each entry consists of a key and a value: key is the column name and value is a tensor of 5 values, because batch_size=5. The second for loop prints every entry in turn.
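
The column-based layout described above can be imitated in plain Python. The sketch below is only an illustration: to_column_batch is a hypothetical helper, lists stand in for tensors, and the five rows use a subset of columns copied from the start of train.csv.

```python
def to_column_batch(rows, label_name):
    """Hypothetical helper: turn row dicts into a (features, labels) tuple."""
    features = {}
    labels = []
    for row in rows:
        for name, value in row.items():
            if name == label_name:
                labels.append(value)
            else:
                # One list per column, one element per row in the batch.
                features.setdefault(name, []).append(value)
    return features, labels


# First five rows of train.csv, reduced to a few columns for brevity.
rows = [
    {"survived": 0, "sex": "male", "age": 22.0, "class": "Third"},
    {"survived": 1, "sex": "female", "age": 38.0, "class": "First"},
    {"survived": 1, "sex": "female", "age": 26.0, "class": "Third"},
    {"survived": 1, "sex": "female", "age": 35.0, "class": "First"},
    {"survived": 0, "sex": "male", "age": 28.0, "class": "Third"},
]

features, labels = to_column_batch(rows, label_name="survived")
for name, values in features.items():
    print("{:20s}: {}".format(name, values))
print("{:20s}: {}".format("labels", labels))
```

Each key maps to five values, one per row, which mirrors how every column of a batch holds batch_size elements.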

In [31]:
show_batch( raw_train_data )

sex                 : [b'male' b'female' b'male' b'female' b'male']
age                 : [28. 28. 17. 21. 65.]
n_siblings_spouses  : [0 0 0 0 0]
parch               : [0 0 0 0 0]
fare                : [ 7.775  7.879  8.663 10.5   26.55 ]
class               : [b'Third' b'Third' b'Third' b'Second' b'First']
deck                : [b'unknown' b'unknown' b'unknown' b'unknown' b'E']
embark_town         : [b'Southampton' b'Queenstown' b'Southampton' b'Southampton' b'Southampton']
alone               : [b'y' b'y' b'y' b'y' b'y']


In [None]:
def custom_show_batch(dataset):
  # Same as show_batch, with more descriptive variable names.
  for features, labels in dataset.take(1):
    for column_name, value in features.items():
      print("{:20s}: {}".format(column_name, value.numpy()))

In [None]:
custom_show_batch( raw_train_data )

In [32]:
train_csv_df.head()

Unnamed: 0,survived,sex,age,n_siblings_spouses,parch,fare,class,deck,embark_town,alone
0,0,male,22.0,1,0,7.25,Third,unknown,Southampton,n
1,1,female,38.0,1,0,71.2833,First,C,Cherbourg,n
2,1,female,26.0,0,0,7.925,Third,unknown,Southampton,y
3,1,female,35.0,1,0,53.1,First,C,Southampton,n
4,0,male,28.0,0,0,8.4583,Third,unknown,Queenstown,y
