# **Load CSV and NumPy File Types in TensorFlow 2.0**

**Learning Objectives**
1. Load a CSV file into a `tf.data.Dataset`
2. Load NumPy data

## **Introduction**

This notebook loads CSV data from a file into a `tf.data.Dataset`, and then loads NumPy data into a `tf.data.Dataset` as well.

## **Load necessary libraries**

In [16]:
import functools

import numpy as np
import tensorflow as tf

print("TensorFlow version: {}".format(tf.__version__))

TensorFlow version: 2.4.1


Data can be loaded from a URL using `tf.keras.utils.get_file()`, which downloads the file and returns its local path.

In [17]:
TRAIN_DATA_URL = "https://storage.googleapis.com/tf-datasets/titanic/train.csv"
TEST_DATA_URL = "https://storage.googleapis.com/tf-datasets/titanic/eval.csv"

# Download a file from URL if it is not already in cache using `tf.keras.utils.get_file()`
train_file_path = tf.keras.utils.get_file("train.csv", TRAIN_DATA_URL)
test_file_path = tf.keras.utils.get_file("test.csv", TEST_DATA_URL)

In [18]:
# Make NumPy values easier to read
np.set_printoptions(precision=3, suppress=True)
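
To see what these print options change, here is a minimal sketch that formats the same array with NumPy's default settings and with the settings used above. The array values are illustrative only (the last one is made up to trigger scientific notation), and `np.printoptions` is used as a context manager so that the global settings stay untouched.

```python
import numpy as np

# Illustrative values only; the last one is chosen to force scientific notation.
fares = np.array([7.25, 71.2833, 0.00001234])

# NumPy's default settings: 8 digits of precision, no suppression.
with np.printoptions(precision=8, suppress=False):
    default_text = np.array2string(fares)

# The settings used in this notebook: 3 digits, fixed-point notation.
with np.printoptions(precision=3, suppress=True):
    readable_text = np.array2string(fares)

print("default :", default_text)
print("readable:", readable_text)
```

With `suppress=True`, small values are printed in fixed-point notation (rounded to the requested precision) instead of scientific notation.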

## **Load data**

This section provides an example of how to load CSV data from a file into a `tf.data.Dataset`. The data used in this tutorial are taken from the Titanic passenger list. The model will predict the likelihood that a passenger survived based on characteristics like age, gender, ticket class, and whether the person was travelling alone.

To start, let's look at the top of the CSV file to see how it is formatted.

In [19]:
# The `head` shell command prints the first lines of a file (10 by default)
!head {train_file_path}

survived,sex,age,n_siblings_spouses,parch,fare,class,deck,embark_town,alone
0,male,22.0,1,0,7.25,Third,unknown,Southampton,n
1,female,38.0,1,0,71.2833,First,C,Cherbourg,n
1,female,26.0,0,0,7.925,Third,unknown,Southampton,y
1,female,35.0,1,0,53.1,First,C,Southampton,n
0,male,28.0,0,0,8.4583,Third,unknown,Queenstown,y
0,male,2.0,3,1,21.075,Third,unknown,Southampton,n
1,female,27.0,0,2,11.1333,Third,unknown,Southampton,n
1,female,14.0,1,0,30.0708,Second,unknown,Cherbourg,n
1,female,4.0,1,1,16.7,Third,G,Southampton,n
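
The `!head` shell escape only works in IPython or Jupyter on systems that provide a `head` command. As a minimal sketch of the same inspection step in plain Python, the hypothetical `head()` helper below returns the first lines of a text file. The small sample file is written to a temporary directory purely for illustration; in this notebook, `train_file_path` would be passed instead.

```python
import itertools
import os
import tempfile


def head(file_path, n=10):
    """Return the first n lines of a text file, without trailing newlines."""
    with open(file_path, "r", encoding="utf-8") as f:
        return [line.rstrip("\n") for line in itertools.islice(f, n)]


# Hypothetical sample standing in for train.csv (three columns only).
sample_text = "survived,sex,age\n0,male,22.0\n1,female,38.0\n1,female,26.0\n"

with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "sample.csv")
    with open(sample_path, "w", encoding="utf-8") as f:
        f.write(sample_text)
    first_lines = head(sample_path, n=2)

for line in first_lines:
    print(line)
```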


We could load this file with Pandas and pass the resulting NumPy arrays to TensorFlow. If we need to scale up to a large set of files, or need a loader that integrates with TensorFlow and `tf.data`, we can use the `tf.data.experimental.make_csv_dataset()` function.
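
For comparison, the Pandas route could look roughly like the sketch below. It assumes that Pandas is installed and that the data fits in memory. The inline CSV text is a hypothetical four-column excerpt used for illustration, not the full Titanic file.

```python
import io

import pandas as pd
import tensorflow as tf

# Hypothetical excerpt of the Titanic data, kept inline for illustration.
CSV_TEXT = """survived,sex,age,fare
0,male,22.0,7.25
1,female,38.0,71.2833
1,female,26.0,7.925
"""

df = pd.read_csv(io.StringIO(CSV_TEXT))

# Separate the label column from the feature columns.
labels = df.pop("survived").to_numpy()

# Convert each remaining column to a NumPy array keyed by column name.
features = {name: df[name].to_numpy() for name in df.columns}

# Each dataset element is one row: (dict of scalar features, scalar label).
dataset = tf.data.Dataset.from_tensor_slices((features, labels))

first_features, first_label = next(iter(dataset))
print(first_features["sex"].numpy(), first_label.numpy())
```

The rest of this section uses `tf.data.experimental.make_csv_dataset()` instead, which reads from disk as the data set is iterated rather than loading the whole file first.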

The only column we need to identify explicitly is the one with the value that the model is intended to predict.

In [20]:
LABEL_COLUMN = "survived"
LABELS = [0, 1]

Now let's read the CSV data from the file and create a data set.

In [22]:
# get_dataset() reads a CSV file into a batched `tf.data.Dataset` of (features, label) pairs
def get_dataset(file_path, **kwargs):
    # Use `tf.data.experimental.make_csv_dataset()` to read CSV files into a data set
    dataset = tf.data.experimental.make_csv_dataset(
        file_path,
        batch_size=5, # Artificially small to make examples easier to display
        label_name=LABEL_COLUMN,
        na_value="?",
        num_epochs=1,
        ignore_errors=True,
        **kwargs)
    return dataset

raw_train_data = get_dataset(train_file_path)
raw_test_data = get_dataset(test_file_path)

In [25]:
def show_batch(dataset):
    for batch, label in dataset.take(1):
        for key, value in batch.items():
            print("{:20s}: {}".format(key, value.numpy()))

Each item in the data set is a **batch**, represented as a tuple of `(examples, labels)`. The data from the examples is organised in column-based tensors (rather than row-based tensors), each with as many elements as the `batch_size`.

In [30]:
show_batch(raw_train_data)

sex                 : [b'male' b'female' b'male' b'female' b'male']
age                 : [18. 27. 31. 16. 28.]
n_siblings_spouses  : [0 0 1 0 0]
parch               : [0 2 1 0 1]
fare                : [11.5   11.133 37.004  7.75  33.   ]
class               : [b'Second' b'Third' b'Second' b'Third' b'Second']
deck                : [b'unknown' b'unknown' b'unknown' b'unknown' b'unknown']
embark_town         : [b'Southampton' b'Southampton' b'Cherbourg' b'Queenstown' b'Southampton']
alone               : [b'y' b'n' b'n' b'y' b'n']
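
The column-based layout described above can also be checked by looking at tensor shapes rather than values. The sketch below is a minimal, self-contained illustration. It writes a hypothetical four-row CSV to a temporary directory and uses `batch_size=2` with `shuffle=False` so that the result is deterministic. These settings are assumptions made for the example and differ from the `get_dataset()` call above.

```python
import os
import tempfile

import tensorflow as tf

# Hypothetical four-row sample with the label column first.
CSV_TEXT = (
    "survived,sex,age,fare\n"
    "0,male,22.0,7.25\n"
    "1,female,38.0,71.2833\n"
    "1,female,26.0,7.925\n"
    "0,male,35.0,8.05\n"
)

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "sample.csv")
    with open(csv_path, "w", encoding="utf-8") as f:
        f.write(CSV_TEXT)

    dataset = tf.data.experimental.make_csv_dataset(
        csv_path,
        batch_size=2,
        label_name="survived",
        num_epochs=1,
        shuffle=False,  # keep file order so the first batch is predictable
    )
    # Read the first batch while the temporary file still exists.
    features, labels = next(iter(dataset))

# One tensor per column, each holding `batch_size` values.
for name, column in features.items():
    print("{:6s} shape={} dtype={}".format(name, tuple(column.shape), column.dtype.name))
print("labels shape={}".format(tuple(labels.shape)))
```

Each feature column arrives as its own tensor of length `batch_size`, and the labels form a separate tensor of the same length.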


As we can see, the columns in the CSV are named, and the data set constructor picks these names up automatically. If the file we are working with does not contain the column names in its first line, we pass them as a list of `str` to the `column_names` argument of the `tf.data.experimental.make_csv_dataset()` function, together with `header=False` so that the first row is read as data.
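
As a minimal sketch of the headerless case described above, the example below writes a hypothetical CSV without a header row and supplies the names through `column_names`. It also passes `header=False`, because `make_csv_dataset()` otherwise treats the first line as a header and skips it. The file contents and the `COLUMN_NAMES` list are assumptions made for illustration.

```python
import os
import tempfile

import tensorflow as tf

# Hypothetical column names for a file that has no header row.
COLUMN_NAMES = ["survived", "sex", "age", "fare"]

# Hypothetical data rows only; note there is no header line.
CSV_TEXT = (
    "0,male,22.0,7.25\n"
    "1,female,38.0,71.2833\n"
    "1,female,26.0,7.925\n"
    "0,male,35.0,8.05\n"
)

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "no_header.csv")
    with open(csv_path, "w", encoding="utf-8") as f:
        f.write(CSV_TEXT)

    headerless_data = tf.data.experimental.make_csv_dataset(
        csv_path,
        batch_size=2,
        column_names=COLUMN_NAMES,
        header=False,  # the first line is data, not column names
        label_name="survived",
        num_epochs=1,
        shuffle=False,  # keep file order so the first batch is predictable
    )
    # Read the first batch while the temporary file still exists.
    batch, label = next(iter(headerless_data))

for key, value in batch.items():
    print("{:6s}: {}".format(key, value.numpy()))
print("label : {}".format(label.numpy()))
```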