In [17]:
import functools
import numpy as np
import os
import tensorflow as tf
print("TensorFlow version: ",tf.version.VERSION)

TensorFlow version:  2.2.0


In [18]:
# Define data path
cwd = os.getcwd()
train_file_path = os.path.join(cwd, "train.csv")
test_file_path = os.path.join(cwd, "test.csv")

In [None]:
# Make numpy values easier to read.
np.set_printoptions(precision=3, suppress=True)

Inspect CSV file

In [51]:
with open(test_file_path) as f:
    for _ in range(3):  # first 3 lines
        print(f.readline())

PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked

892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q

893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47,1,0,363272,7,,S



Specify label column:

In [52]:
LABEL_COLUMN = 'Survived'
LABELS = [0, 1]

Read the CSV data from the file and create a dataset with `make_csv_dataset`.

In [53]:
def get_dataset(file_path, **kwargs):
    """Load a CSV file into a batched dataset."""
    dataset = tf.data.experimental.make_csv_dataset(
        file_pattern=file_path,
        batch_size=5,
        label_name=LABEL_COLUMN,
        na_value="?",
        num_epochs=1,
        **kwargs
        )

    return dataset

In [54]:
raw_train_data = get_dataset(train_file_path)

In [65]:
def show_batch(dataset):
    # take one batch of 5 and print it
    for batch, label in dataset.take(1):
        for key, value in batch.items():
            print("{:20s}: {}".format(key,value.numpy()))
        print("Label:", label)

Each item in the dataset is a batch, represented as a tuple of (*many examples*, *many labels*). The data from the examples is organized in column-based tensors (rather than row-based tensors), each with as many elements as the batch size (5 in this case).

In [66]:
show_batch(raw_train_data)

PassengerId         : [143 544 853 741 349]
Pclass              : [3 2 3 1 3]
Name                : [b'Hakkarainen, Mrs. Pekka Pietari (Elin Matilda Dolck)'
 b'Beane, Mr. Edward' b'Boulos, Miss. Nourelain'
 b'Hawksford, Mr. Walter James' b'Coutts, Master. William Loch "William"']
Sex                 : [b'female' b'male' b'female' b'male' b'male']
Age                 : [24. 32.  9.  0.  3.]
SibSp               : [1 1 1 0 1]
Parch               : [0 0 1 0 1]
Ticket              : [b'STON/O2. 3101279' b'2908' b'2678' b'16988' b'C.A. 37671']
Fare                : [15.85   26.     15.2458 30.     15.9   ]
Cabin               : [b'' b'' b'' b'D45' b'']
Embarked            : [b'S' b'S' b'C' b'S' b'S']
Label: tf.Tensor([1 1 0 1 1], shape=(5,), dtype=int32)
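To make the column-based layout concrete, here is a minimal pure-Python sketch of the same idea, with no TensorFlow involved. The helper name `rows_to_columns` is hypothetical, and the three rows reuse a few values from the batch printed above; it only illustrates how per-example rows turn into one sequence per column.

```python
# Three examples written row by row (values copied from the batch above).
rows = [
    {"Pclass": 3, "Sex": "female", "Age": 24.0},
    {"Pclass": 2, "Sex": "male", "Age": 32.0},
    {"Pclass": 3, "Sex": "female", "Age": 9.0},
]


def rows_to_columns(rows):
    """Turn a list of per-example dicts into one dict of per-column lists.

    Hypothetical helper for illustration; make_csv_dataset does the
    equivalent internally and returns tensors instead of lists.
    """
    columns = {}
    for row in rows:
        for key, value in row.items():
            columns.setdefault(key, []).append(value)
    return columns


column_batch = rows_to_columns(rows)
for key, values in column_batch.items():
    print("{:20s}: {}".format(key, values))
```

Each key maps to a sequence with one entry per example, which is the shape `show_batch` prints for every feature.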


The columns in the CSV are named, and the dataset constructor picks these names up automatically from the first line. If the file you are working with does not contain the column names in the first line, pass them as a list of strings to the `column_names` argument of the `make_csv_dataset` function, and set `header=False` so that the first line is read as data rather than skipped.
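As a rough sketch of the headerless case, the snippet below writes a tiny CSV without a header row to a temporary directory and reads it back with explicit `column_names`. The file name, the column list, and the three data rows are made up for illustration and are not part of the Titanic files used in this notebook.

```python
import os
import tempfile

import tensorflow as tf

# Assumed column names for the illustrative headerless file.
COLUMN_NAMES = ["Survived", "Pclass", "Sex", "Age"]

# Write a small CSV that has no header line (illustrative rows only).
tmp_dir = tempfile.mkdtemp()
headerless_path = os.path.join(tmp_dir, "headerless.csv")
with open(headerless_path, "w") as f:
    f.write("0,3,male,22\n1,1,female,38\n1,3,female,26\n")

headerless_dataset = tf.data.experimental.make_csv_dataset(
    file_pattern=headerless_path,
    batch_size=3,
    column_names=COLUMN_NAMES,  # names supplied by hand
    header=False,               # first line is data, not a header
    label_name="Survived",
    num_epochs=1,
    shuffle=False,              # keep file order so the output is predictable
)

# Pull the single batch and print it column by column.
features, labels = next(iter(headerless_dataset))
for key, value in features.items():
    print("{:20s}: {}".format(key, value.numpy()))
print("Label:", labels.numpy())
```

Without `header=False`, the first data row would be treated as a header line and dropped, since `header` defaults to `True`.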

The main dataset in this example uses all the available columns. If you need to omit some columns from the dataset, create a list of just the columns you plan to use, including the label column, and pass it to the optional `select_columns` argument of `make_csv_dataset`.

In [69]:
SELECT_COLUMNS = ['Survived', 'Age', 'SibSp', 'Pclass', 'Cabin', 'Sex']

temp_dataset = get_dataset(train_file_path, select_columns=SELECT_COLUMNS)

show_batch(temp_dataset)

Pclass              : [3 1 3 3 1]
Sex                 : [b'male' b'male' b'female' b'female' b'male']
Age                 : [ 1. 26. 29.  2. 40.]
SibSp               : [5 0 1 3 0]
Cabin               : [b'' b'C148' b'G6' b'' b'']
Label: tf.Tensor([0 1 0 0 0], shape=(5,), dtype=int32)
