# **Load CSV and NumPy File Types in TensorFlow 2.0**

**Learning Objectives**
1. Load a CSV file into a `tf.data.Dataset`
2. Load NumPy data

## **Introduction**

We load CSV data from a file into a `tf.data.Dataset`. We also load NumPy data to a `tf.data.Dataset`.

## **Load necessary libraries**

In [16]:
import functools

import numpy as np
import tensorflow as tf

print("TensorFlow version: {}".format(tf.__version__))

TensorFlow version: 2.4.1


Data can be loaded from a URL using `tf.keras.utils.get_file()`.

In [17]:
TRAIN_DATA_URL = "https://storage.googleapis.com/tf-datasets/titanic/train.csv"
TEST_DATA_URL = "https://storage.googleapis.com/tf-datasets/titanic/eval.csv"

# Download a file from URL if it is not already in cache using `tf.keras.utils.get_file()`
train_file_path = tf.keras.utils.get_file("train.csv", TRAIN_DATA_URL)
test_file_path = tf.keras.utils.get_file("test.csv", TEST_DATA_URL)

In [18]:
# Make NumPy values easier to read
np.set_printoptions(precision=3, suppress=True)

## **Load data**

This section provides an example of how to load CSV data from a file into a `tf.data.Dataset`. The data used in this tutorial are taken from the Titanic passenger list. The model will predict the likelihood that a passenger survived based on characteristics like age, gender, ticket class, and whether the person was travelling alone.

To start, let's look at the top of the CSV file to see how it is formatted.

In [19]:
# The shell `head` command prints the first 10 lines of a file
!head {train_file_path}

survived,sex,age,n_siblings_spouses,parch,fare,class,deck,embark_town,alone
0,male,22.0,1,0,7.25,Third,unknown,Southampton,n
1,female,38.0,1,0,71.2833,First,C,Cherbourg,n
1,female,26.0,0,0,7.925,Third,unknown,Southampton,y
1,female,35.0,1,0,53.1,First,C,Southampton,n
0,male,28.0,0,0,8.4583,Third,unknown,Queenstown,y
0,male,2.0,3,1,21.075,Third,unknown,Southampton,n
1,female,27.0,0,2,11.1333,Third,unknown,Southampton,n
1,female,14.0,1,0,30.0708,Second,unknown,Cherbourg,n
1,female,4.0,1,1,16.7,Third,G,Southampton,n


We can load this using Pandas and pass the NumPy arrays to TensorFlow. If we need to scale up to a large set of files, or need a loader that integrates with TensorFlow and `tf.data`, we can use the `tf.data.experimental.make_csv_dataset()` function instead.

The only column we need to identify explicitly is the one with the value that the model is intended to predict.

In [20]:
LABEL_COLUMN = "survived"
LABELS = [0, 1]

Now let's read the CSV data from the file and create a data set.

In [22]:
# `get_dataset()` reads a CSV file into a batched `tf.data.Dataset` of (features, label) pairs
def get_dataset(file_path, **kwargs):
    # Use `tf.data.experimental.make_csv_dataset()` to read CSV files into a data set
    dataset = tf.data.experimental.make_csv_dataset(
        file_path,
        batch_size=5, # Artificially small to make examples easier to display
        label_name=LABEL_COLUMN,
        na_value="?",
        num_epochs=1,
        ignore_errors=True,
        **kwargs)
    return dataset

raw_train_data = get_dataset(train_file_path)
raw_test_data = get_dataset(test_file_path)

In [25]:
def show_batch(dataset):
    for batch, label in dataset.take(1):
        for key, value in batch.items():
            print("{:20s}: {}".format(key, value.numpy()))

Each item in the data set is a **batch**, represented as a tuple of `(examples, labels)`. The data from the examples is organised in column-based tensors (rather than row-based tensors), each with as many elements as the `batch_size`.

In [30]:
show_batch(raw_train_data)

sex                 : [b'male' b'female' b'male' b'female' b'male']
age                 : [18. 27. 31. 16. 28.]
n_siblings_spouses  : [0 0 1 0 0]
parch               : [0 2 1 0 1]
fare                : [11.5   11.133 37.004  7.75  33.   ]
class               : [b'Second' b'Third' b'Second' b'Third' b'Second']
deck                : [b'unknown' b'unknown' b'unknown' b'unknown' b'unknown']
embark_town         : [b'Southampton' b'Southampton' b'Cherbourg' b'Queenstown' b'Southampton']
alone               : [b'y' b'n' b'n' b'y' b'n']
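
To make the column-based layout concrete, the following sketch regroups a few rows of the file into a dictionary of columns plus a separate list of labels. It uses only Python's standard `csv` module rather than `tf.data`, and the helper name `rows_to_batch` is hypothetical. Unlike `make_csv_dataset()`, it does not infer types, so the feature values stay as strings.

```python
import csv
import io

# A few rows copied from the top of the Titanic training file shown earlier
SAMPLE_CSV = """survived,sex,age,n_siblings_spouses,parch,fare,class,deck,embark_town,alone
0,male,22.0,1,0,7.25,Third,unknown,Southampton,n
1,female,38.0,1,0,71.2833,First,C,Cherbourg,n
1,female,26.0,0,0,7.925,Third,unknown,Southampton,y
"""


def rows_to_batch(csv_text, label_name):
    """Hypothetical helper: regroup row-based records into (features, labels)."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    labels = [int(row[label_name]) for row in rows]
    # One list per column, each with as many elements as there are rows
    features = {
        name: [row[name] for row in rows]
        for name in rows[0]
        if name != label_name
    }
    return features, labels


features, labels = rows_to_batch(SAMPLE_CSV, "survived")
for key, value in features.items():
    print("{:20s}: {}".format(key, value))
print("labels:", labels)
```

Each dictionary entry holds one value per row, which mirrors how every tensor in a batch has `batch_size` elements.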


As we can see, the columns in the CSV are named. The data set constructor picks these names up automatically. If the file we are working with does not contain the column names in the first line, we can pass them as a list of `str` to the `column_names` argument of the `tf.data.experimental.make_csv_dataset()` function.

In [34]:
CSV_COLUMNS = ["survived", "sex", "age", "n_siblings_spouses", "parch", 
               "fare", "class", "deck", "embark_town", "alone"]

# Pass column names as a list of `str` to the column_names argument
temp_dataset = get_dataset(train_file_path, column_names=CSV_COLUMNS)

show_batch(temp_dataset)

sex                 : [b'male' b'male' b'male' b'male' b'male']
age                 : [20. 28. 19. 26. 22.]
n_siblings_spouses  : [0 0 0 0 0]
parch               : [0 0 0 0 0]
fare                : [ 8.05  15.5    7.65   7.896  9.35 ]
class               : [b'Third' b'Third' b'Third' b'Third' b'Third']
deck                : [b'unknown' b'unknown' b'F' b'unknown' b'unknown']
embark_town         : [b'Southampton' b'Queenstown' b'Southampton' b'Southampton' b'Southampton']
alone               : [b'y' b'y' b'y' b'y' b'y']


This example is going to use all the available columns. If we would like to omit some columns from the data set, we can create a list of just the columns we plan to use and pass it to the (optional) `select_columns` argument of the constructor.

In [35]:
# If we need to omit some columns from the data set, we shall create a list of 
# just the columns we plan to use, and pass it into the (optional) `select_columns` argument of the constructor
SELECT_COLUMNS = ["survived", "age", "n_siblings_spouses", "class", "deck", "alone"]

temp_dataset = get_dataset(train_file_path, select_columns=SELECT_COLUMNS)

show_batch(temp_dataset)

age                 : [35. 25. 32. 62. 51.]
n_siblings_spouses  : [0 0 0 0 0]
class               : [b'Third' b'Third' b'Third' b'Second' b'Third']
deck                : [b'unknown' b'unknown' b'unknown' b'unknown' b'unknown']
alone               : [b'y' b'y' b'y' b'y' b'y']


## **Data preprocessing**

A CSV file can contain a variety of data types. Typically we want to convert those mixed types to a fixed-length vector before feeding the data into our model.

TensorFlow has a built-in system for describing common input conversions: `tf.feature_column`.

We can preprocess data using any tool we like (like [nltk](https://www.nltk.org/) or [sklearn](https://scikit-learn.org/stable/)), and just pass the processed output to TensorFlow.

The primary advantage of doing preprocessing inside the model is that when we export the model it includes the preprocessing. This way we can pass the raw data directly to the model.

### **Continuous data**

If the data is already in an appropriate numeric format, we can pack the data into a vector before passing it off to the model:

In [36]:
SELECT_COLUMNS = ["survived", "age", "n_siblings_spouses", "parch", "fare"]
DEFAULTS = [0, 0.0, 0.0, 0.0, 0.0]
temp_dataset = get_dataset(
    train_file_path,
    select_columns=SELECT_COLUMNS,
    column_defaults=DEFAULTS)
show_batch(temp_dataset)

age                 : [28. 25. 28. 24. 28.]
n_siblings_spouses  : [0. 0. 0. 0. 0.]
parch               : [0. 0. 0. 0. 0.]
fare                : [7.896 7.65  7.75  8.05  7.725]


In [37]:
example_batch, label_batch = next(iter(temp_dataset))

Here's a simple function that packs all the columns together:

In [38]:
# `pack()` function will pack together all the columns
def pack(features, label):
    # `tf.stack()` stacks a list of rank-R tensors into one rank-(R+1) tensor
    return tf.stack(list(features.values()), axis=-1), label

Apply this to each element of the data set:

In [39]:
packed_dataset = temp_dataset.map(pack)

for features, labels in packed_dataset.take(1):
    print(features.numpy())
    print()
    print(labels.numpy())

[[ 34.      1.      1.     32.5  ]
 [ 21.      0.      0.      7.75 ]
 [ 15.      0.      1.    211.337]
 [ 28.      0.      0.      7.733]
 [ 28.      0.      0.     12.35 ]]

[1 0 1 1 1]
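
The effect of stacking along the last axis can be checked without TensorFlow, since `np.stack` follows the same `axis` convention as `tf.stack`. The sketch below is a NumPy stand-in, not the `tf.data` pipeline itself; the two-row batch is built by hand from the first two rows of the output above.

```python
import numpy as np

# Column-based batch, shaped like the dictionaries the data set yields
features = {
    "age": np.array([34.0, 21.0]),
    "n_siblings_spouses": np.array([1.0, 0.0]),
    "parch": np.array([1.0, 0.0]),
    "fare": np.array([32.5, 7.75]),
}

# Stacking along the last axis turns four length-2 columns into a (2, 4) matrix,
# so each row holds all the numeric features of one passenger
packed = np.stack(list(features.values()), axis=-1)
print(packed.shape)
print(packed)
```

The column order of the packed matrix follows the insertion order of the dictionary, so the feature order has to stay fixed between training and inference.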


If we have mixed data types we may want to separate out these simple-numeric fields. The `tf.feature_column` API can handle them, but this incurs some overhead and should be avoided unless really necessary. Let's switch back to the mixed data set:

In [40]:
show_batch(raw_train_data)

sex                 : [b'male' b'male' b'female' b'male' b'male']
age                 : [29. 47. 41. 28. 48.]
n_siblings_spouses  : [1 0 0 0 0]
parch               : [0 0 5 0 0]
fare                : [21.    15.    39.688  7.75   7.854]
class               : [b'Second' b'Second' b'Third' b'Third' b'Third']
deck                : [b'unknown' b'unknown' b'unknown' b'unknown' b'unknown']
embark_town         : [b'Southampton' b'Southampton' b'Southampton' b'Queenstown' b'Southampton']
alone               : [b'n' b'y' b'n' b'y' b'y']


In [41]:
example_batch, labels_batch = next(iter(raw_train_data))

So we define a more general preprocessor that selects a list of numeric features and packs them into a single column:

In [42]:
class PackNumericFeatures(object):
    def __init__(self, names):
        self.names = names
        
    def __call__(self, features, labels):
        numeric_features = [features.pop(name) for name in self.names]
        numeric_features = [tf.cast(feat, tf.float32) for feat in numeric_features]
        numeric_features = tf.stack(numeric_features, axis=1)
        features["numeric"] = numeric_features
        
        return features, labels

In [47]:
NUMERIC_FEATURES = ["age", "n_siblings_spouses", "parch", "fare"]

packed_train_data = raw_train_data.map(
    PackNumericFeatures(NUMERIC_FEATURES))

packed_test_data = raw_test_data.map(
    PackNumericFeatures(NUMERIC_FEATURES))

In [48]:
show_batch(packed_train_data)

sex                 : [b'male' b'female' b'male' b'male' b'male']
class               : [b'Third' b'Second' b'Third' b'Third' b'First']
deck                : [b'unknown' b'E' b'unknown' b'unknown' b'D']
embark_town         : [b'Cherbourg' b'Queenstown' b'Southampton' b'Southampton' b'Cherbourg']
alone               : [b'y' b'y' b'y' b'y' b'n']
numeric             : [[29.     0.     0.     7.896]
 [28.     0.     0.    12.35 ]
 [28.     0.     0.     7.896]
 [19.     0.     0.     7.896]
 [23.     0.     1.    63.358]]


In [49]:
example_batch, labels_batch = next(iter(packed_train_data))

**Data Normalisation**

Continuous data should **always be normalised**.

In [51]:
# Pandas is used for data manipulation and analysis
import pandas as pd
# Pandas module `read_csv()` function reads the CSV file into a DataFrame object
desc = pd.read_csv(train_file_path)[NUMERIC_FEATURES].describe()
desc

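|       | age       | n_siblings_spouses | parch    | fare      |
|-------|-----------|--------------------|----------|-----------|
| count | 627.0     | 627.0              | 627.0    | 627.0     |
| mean  | 29.631308 | 0.545455           | 0.379585 | 34.385399 |
| std   | 12.511818 | 1.15109            | 0.792999 | 54.59773  |
| min   | 0.75      | 0.0                | 0.0      | 0.0       |
| 25%   | 23.0      | 0.0                | 0.0      | 7.8958    |
| 50%   | 28.0      | 0.0                | 0.0      | 15.0458   |
| 75%   | 35.0      | 1.0                | 0.0      | 31.3875   |
| max   | 80.0      | 8.0                | 5.0      | 512.3292  |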


In [52]:
MEAN = np.array(desc.T["mean"])
STD = np.array(desc.T["std"])

In [53]:
def normalize_numeric_data(data, mean, std):
    return (data - mean) / std

In [54]:
print(MEAN, STD)

[29.631  0.545  0.38  34.385] [12.512  1.151  0.793 54.598]


Now let's create a numeric column. The `tf.feature_column.numeric_column()` API accepts a `normalizer_fn` argument, which will be run on each batch.

Bind the `MEAN` and `STD` variables to the `normalizer_fn` using `functools.partial`.

In [55]:
# Bind the `MEAN` and `STD` variables to the `normalizer_fn` using `functools.partial`
normalizer = functools.partial(normalize_numeric_data, mean=MEAN, std=STD)

# `tf.feature_column.numeric_column()` represents real-valued or numerical features
numeric_column = tf.feature_column.numeric_column(
    "numeric", normalizer_fn=normalizer, shape=[len(NUMERIC_FEATURES)])
numeric_columns = [numeric_column]
# See what we just created
numeric_column

NumericColumn(key='numeric', shape=(4,), default_value=None, dtype=tf.float32, normalizer_fn=functools.partial(<function normalize_numeric_data at 0x7fec312f7a60>, mean=array([29.631,  0.545,  0.38 , 34.385]), std=array([12.512,  1.151,  0.793, 54.598])))
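
A `normalizer_fn` is called with a single argument, the input tensor, so the extra `mean` and `std` parameters have to be fixed beforehand. The following standalone sketch shows what `functools.partial` does in that situation. The scalar function `normalize` and the name `normalize_age` are illustrative only, with the age statistics taken from the rounded values printed earlier.

```python
import functools


def normalize(value, mean, std):
    return (value - mean) / std


# Fix `mean` and `std` up front; the result is a callable that takes only `value`
normalize_age = functools.partial(normalize, mean=29.631, std=12.512)

print(normalize_age(29.631))   # the mean itself maps to zero
print(normalize_age.keywords)  # the bound keyword arguments
```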

When we train the model, we shall include this feature column to select and center this block of numeric data:

In [56]:
example_batch["numeric"]

<tf.Tensor: shape=(5, 4), dtype=float32, numpy=
array([[33.   ,  0.   ,  0.   ,  5.   ],
       [19.   ,  0.   ,  0.   ,  6.75 ],
       [35.   ,  0.   ,  0.   , 26.288],
       [17.   ,  4.   ,  2.   ,  7.925],
       [22.   ,  0.   ,  0.   , 10.517]], dtype=float32)>

In [57]:
# `tf.keras.layers.DenseFeatures()` produces a dense Tensor based on given `feature_columns`
numeric_layer = tf.keras.layers.DenseFeatures(numeric_columns)
numeric_layer(example_batch).numpy()

array([[ 0.269, -0.474, -0.479, -0.538],
       [-0.85 , -0.474, -0.479, -0.506],
       [ 0.429, -0.474, -0.479, -0.148],
       [-1.01 ,  3.001,  2.043, -0.485],
       [-0.61 , -0.474, -0.479, -0.437]], dtype=float32)
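
The values returned by the layer can be reproduced by hand. The sketch below applies the same formula with NumPy to the first and fourth rows of `example_batch["numeric"]` shown above. Because it uses the rounded `MEAN` and `STD` values as printed, the results should agree with the layer output only to about three decimal places.

```python
import numpy as np

# Rounded statistics as printed earlier
MEAN = np.array([29.631, 0.545, 0.38, 34.385])
STD = np.array([12.512, 1.151, 0.793, 54.598])


def normalize_numeric_data(data, mean, std):
    return (data - mean) / std


# Two rows of the example batch; columns are age, n_siblings_spouses, parch, fare
batch = np.array([[33.0, 0.0, 0.0, 5.0],
                  [17.0, 4.0, 2.0, 7.925]])

# `MEAN` and `STD` have shape (4,), so they broadcast across the rows of the (2, 4) batch
normalized = normalize_numeric_data(batch, MEAN, STD)
print(np.round(normalized, 3))
```

Broadcasting is what lets one set of per-column statistics normalise a batch of any size.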

The mean-based normalisation used here requires knowing the mean and standard deviation of each column ahead of time.

### **Categorical data**

Some of the columns in the CSV data are categorical columns. That is, the content should be one of a limited set of options.

Let's use the `tf.feature_column` API to create a collection with a `tf.feature_column.indicator_column` for each categorical column.

In [58]:
CATEGORIES = {
    "sex": ["male", "female"],
    "class": ["First", "Second", "Third"],
    "deck": ["A", "B", "C", "D", "E", "F", "G", "H", "I", "J"],
    "embark_town": ["Cherbourg", "Southampton", "Queenstown"],
    "alone": ["y", "n"]
}

In [59]:
categorical_columns = []
for feature, vocab in CATEGORIES.items():
    # Use the `tf.feature_column` API to create a collection with a `tf.feature_column.indicator_column`
    # for each categorical column
    cat_col = tf.feature_column.categorical_column_with_vocabulary_list(
        key=feature, vocabulary_list=vocab
    )
    categorical_columns.append(tf.feature_column.indicator_column(cat_col))

In [60]:
# See what we've just created
categorical_columns

[IndicatorColumn(categorical_column=VocabularyListCategoricalColumn(key='sex', vocabulary_list=('male', 'female'), dtype=tf.string, default_value=-1, num_oov_buckets=0)),
 IndicatorColumn(categorical_column=VocabularyListCategoricalColumn(key='class', vocabulary_list=('First', 'Second', 'Third'), dtype=tf.string, default_value=-1, num_oov_buckets=0)),
 IndicatorColumn(categorical_column=VocabularyListCategoricalColumn(key='deck', vocabulary_list=('A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J'), dtype=tf.string, default_value=-1, num_oov_buckets=0)),
 IndicatorColumn(categorical_column=VocabularyListCategoricalColumn(key='embark_town', vocabulary_list=('Cherbourg', 'Southampton', 'Queenstown'), dtype=tf.string, default_value=-1, num_oov_buckets=0)),
 IndicatorColumn(categorical_column=VocabularyListCategoricalColumn(key='alone', vocabulary_list=('y', 'n'), dtype=tf.string, default_value=-1, num_oov_buckets=0))]

In [61]:
# `tf.keras.layers.DenseFeatures()` produces a dense Tensor based on given feature_columns.
categorical_layer = tf.keras.layers.DenseFeatures(categorical_columns)
print(categorical_layer(example_batch).numpy()[0])

[1. 0. 1. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 1. 0.]
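
An indicator column is a one-hot encoding over the vocabulary. As a rough pure-Python sketch of that idea (the helper `one_hot` is hypothetical and not part of the `tf.feature_column` API), a value that is missing from the vocabulary, such as `"unknown"` for `deck`, becomes an all-zero vector, which matches the behaviour of the columns above with `default_value=-1` and `num_oov_buckets=0`.

```python
CATEGORIES = {
    "sex": ["male", "female"],
    "class": ["First", "Second", "Third"],
    "deck": ["A", "B", "C", "D", "E", "F", "G", "H", "I", "J"],
    "embark_town": ["Cherbourg", "Southampton", "Queenstown"],
    "alone": ["y", "n"],
}


def one_hot(value, vocabulary):
    """Hypothetical helper: one-hot list for `value`, all zeros if out of vocabulary."""
    return [1.0 if value == item else 0.0 for item in vocabulary]


print(one_hot("C", CATEGORIES["deck"]))
print(one_hot("unknown", CATEGORIES["deck"]))

# Total width of the concatenated encoding: one slot per vocabulary entry
total_width = sum(len(vocab) for vocab in CATEGORIES.values())
print(total_width)
```

The total width equals the sum of the vocabulary sizes (2 + 3 + 10 + 3 + 2), which accounts for the 20 entries in the row printed above.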


This will be part of a data processing input layer when we build the model.

### **Combined preprocessing layer**

Let's add the two `feature_column` collections and pass them to a `tf.keras.layers.DenseFeatures()` to create an input layer that will extract and preprocess both input types:

In [62]:
# Add the two `feature_column` collections
# Pass them to a `tf.keras.layers.DenseFeatures()` to create an input layer
preprocessing_layer = tf.keras.layers.DenseFeatures(categorical_columns+numeric_columns)

In [63]:
print(preprocessing_layer(example_batch).numpy()[0])

[ 1.     0.     1.     0.     0.     0.     1.     0.     0.     0.
  0.     0.     0.     0.     0.     0.     1.     0.     0.269 -0.474
 -0.479 -0.538  1.     0.   ]


### **Next Step**

A next step would be to build a `tf.keras.Sequential` neural network model, starting with the `preprocessing_layer` defined above.

## **Load NumPy data**

### **Load necessary libraries**