<a id='top'></a><a name='top'></a>
# Chapter 13 – Loading and Preprocessing Data with TensorFlow

[Version 3](https://github.com/ageron/handson-ml3/blob/main/13_loading_and_preprocessing_data.ipynb)

The original notebook is split into two separate notebooks, due to length:

1. tf.data
2. TFRecords

<table align="left">
  <td>
    <a href="https://colab.research.google.com/github/ageron/handson-ml2/blob/master/13_loading_and_preprocessing_data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>
  </td>
</table>

1. [Setup](#setup)<a name="setup_top"></a>
2. [tf.data API: Introduction](#2.0)<a name="1.0_top"></a>
    * [2.1 Creating tf.data.Dataset](#2.1)
        - [2.1.1 Dataset methods and immutable datasets](#2.1.1)
    * [2.2 Simple transformations](#2.2)
    * [2.3 More complicated chaining transformations](#2.3)
    * [2.4 Shuffling the Data](#2.4)
3.  [tf.data API: End-to-end example](#3.0)
    * [3.1 Load and split dataset to multiple CSV files](#3.1)
    * [3.2 Building an Input Pipeline](#3.2)
    * [3.3 Interleaving lines from multiple files](#3.3)
    * [3.4 Preprocessing the Data](#3.4)
    * [3.5 Putting everything together (w/o Prefetching)](#3.5)
    * [3.6 Putting everything together (with Prefetching)](#3.6)
    * [3.7 Using the Dataset with tf.keras](#3.7)
    * [3.8 Custom training loop](#3.8)
    * [3.9 Creating a TF Function to perform training loop](#3.9)

---
<a id='setup'></a><a name='setup'></a>
# 1. Setup
<a href="#top">[back to top]</a>

In [1]:
from tensorflow import keras
import numpy as np
from pathlib import Path
import os
import pprint
import matplotlib.pyplot as plt
import sys
import sklearn
import tensorflow as tf

# global seed
tf.random.set_seed(42)

pp = pprint.PrettyPrinter(indent=4)

IS_COLAB = "google.colab" in sys.modules
IS_KAGGLE = "kaggle_secrets" in sys.modules

# Installing tensorflow_transform takes a long time on COLAB,
# should wait until we actually need it, which is later.
# if IS_COLAB or IS_KAGGLE:
#     !pip install tensorflow_transform -q

def HR():
    print("-"*40)
    
print("Loaded libraries..")

Loaded libraries..


In [2]:
DATA_ROOT = 'data_chp13'

---
<a id='2.0'></a><a name='2.0'></a>
# 2. tf.data API: Introduction
<a href="#top">[back to top]</a>

<a id='2.1'></a><a name='2.1'></a>
## 2.1 Creating tf.data.Dataset
<a href="#top">[back to top]</a>

The easiest way to create a `tf.data.Dataset` from in-memory data is via `tf.data.Dataset.from_tensor_slices()`, which slices its input along the first dimension and yields one element per slice.
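
To make the slicing behaviour concrete, here is a minimal pure-Python sketch of the idea. It walks a nested structure of dicts, tuples, and lists and produces one element per index along the first dimension. The helper name `toy_tensor_slices` is hypothetical and not part of TensorFlow; the real method returns a lazy dataset of tensors rather than a list.

```python
def toy_tensor_slices(structure):
    """Slice a nested dict/tuple/list structure along its first dimension."""

    def first_dim(s):
        # Descend into the structure until we reach a leaf list, then measure it.
        if isinstance(s, dict):
            return first_dim(next(iter(s.values())))
        if isinstance(s, tuple):
            return first_dim(s[0])
        return len(s)

    def take(s, i):
        # Rebuild the same nesting, keeping only index i of every leaf list.
        if isinstance(s, dict):
            return {key: take(value, i) for key, value in s.items()}
        if isinstance(s, tuple):
            return tuple(take(value, i) for value in s)
        return s[i]

    return [take(structure, i) for i in range(first_dim(structure))]


X_nested = {"a": ([1, 2, 3], [4, 5, 6]), "b": [7, 8, 9]}
slices = toy_tensor_slices(X_nested)
for element in slices:
    print(element)
```

Each element keeps the nesting of the input (a dict holding a tuple and a scalar), which mirrors the structure of the elements printed by the nested example in the next cell.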

In [3]:
def listing_2_1():
    
    # Creates a sequence of numbers, equivalent to np.arange
    X = tf.range(10)

    dataset1 = tf.data.Dataset.from_tensor_slices(X)
    for item in dataset1:
        print(item)
    
    HR()
    
    # Alternative using tf.data.Dataset.range
    dataset2 = tf.data.Dataset.range(10)
    for item in dataset2:
        print(item)
    
    HR()
    
    X_nested = {"a": ([1,2,3], [4,5,6]), "b": [7,8,9]}
    dataset = tf.data.Dataset.from_tensor_slices(X_nested)
    for item in dataset:
        print(item)
    
listing_2_1()

tf.Tensor(0, shape=(), dtype=int32)
tf.Tensor(1, shape=(), dtype=int32)
tf.Tensor(2, shape=(), dtype=int32)
tf.Tensor(3, shape=(), dtype=int32)
tf.Tensor(4, shape=(), dtype=int32)
tf.Tensor(5, shape=(), dtype=int32)
tf.Tensor(6, shape=(), dtype=int32)
tf.Tensor(7, shape=(), dtype=int32)
tf.Tensor(8, shape=(), dtype=int32)
tf.Tensor(9, shape=(), dtype=int32)
----------------------------------------
tf.Tensor(0, shape=(), dtype=int64)
tf.Tensor(1, shape=(), dtype=int64)
tf.Tensor(2, shape=(), dtype=int64)
tf.Tensor(3, shape=(), dtype=int64)
tf.Tensor(4, shape=(), dtype=int64)
tf.Tensor(5, shape=(), dtype=int64)
tf.Tensor(6, shape=(), dtype=int64)
tf.Tensor(7, shape=(), dtype=int64)
tf.Tensor(8, shape=(), dtype=int64)
tf.Tensor(9, shape=(), dtype=int64)
----------------------------------------
{'a': (<tf.Tensor: shape=(), dtype=int32, numpy=1>, <tf.Tensor: shape=(), dtype=int32, numpy=4>), 'b': <tf.Tensor: shape=(), dtype=int32, numpy=7>}
{'a': (<tf.Tensor: shape=(), dtype=int32, numpy=2>



<a id='2.1.1'></a><a name='2.1.1'></a>
### 2.1.1 Dataset methods and immutable datasets
<a href="#top">[back to top]</a>

Key points:

* All tensors are immutable (similar to Python numbers and strings): 
you can never update the contents of a tensor, only create a new one
and, if you wish, rebind the same variable name to it.
The value at the original memory address is not changed.
* Dataset methods likewise do not modify datasets; they create new ones. We therefore have to keep a reference to each new dataset (e.g., with `dataset = ...`), or else the result is simply discarded.
* For convenience with these immutable datasets, we can *reuse* variable names (e.g., `dataset`): 
    - In programming languages like Elixir, this is referred to as rebinding, and it is used purely as a convenience. 
    - In F#, we similarly rebind an identifier rather than change the value at that memory location. This shadows the original name (internally, F# treats it as a new name).

In [4]:
def listing2_1_1():

    # Python names are references to objects; in CPython, id() returns the object's memory address
    a = 1
    print(f"{(id(a))}: memory address of variable a")
    
    b = 2
    print(f"{(id(b))}: memory address of variable b")
    
    print("var a and b point to different memory addresses.\n")
        
    # Rebinding a name in Python
    # Python variables are names bound to objects; assignment copies the reference, not the value.
    # Here, we make both a and b refer to the same object.
    a = b
    print(f"{(id(a))}: memory address of variable a")
    
    try:
        assert id(a) == id(b)
    except Exception as e:
        print(f"Error: {repr(e)}")
    else:
        print("var a and b now point to the same address.")
    
    HR()
    
    tf.random.set_seed(42)
    
    dataset = tf.data.Dataset.range(10).repeat(3)
    print(f"{(id(dataset))}: memory address of first variable dataset")

    dataset = dataset.shuffle(buffer_size=3, seed=42).batch(7)
    print(f"{(id(dataset))}: memory address of second variable dataset")
    
    HR()
    
    for item in dataset:
        print(item)
        
listing2_1_1()

4308535824: memory address of variable a
4308535856: memory address of variable b
var a and b point to different memory addresses.

4308535856: memory address of variable a
var a and b now point to the same address.
----------------------------------------
4944869504: memory address of first variable dataset
4944869360: memory address of second variable dataset
----------------------------------------
tf.Tensor([1 3 0 4 2 5 6], shape=(7,), dtype=int64)
tf.Tensor([8 7 1 0 3 2 5], shape=(7,), dtype=int64)
tf.Tensor([4 6 9 8 9 7 0], shape=(7,), dtype=int64)
tf.Tensor([3 1 4 5 2 8 7], shape=(7,), dtype=int64)
tf.Tensor([6 9], shape=(2,), dtype=int64)


<a id='2.2'></a><a name='2.2'></a>
## 2.2 Simple transformations
<a href="#top">[back to top]</a>

Once we have a dataset, we can apply all sorts of transformations to it by calling its transformation methods. Each method returns a new dataset, so you can chain transformations.
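
As a rough illustration of why chaining works, the sketch below uses a hypothetical `ToyDataset` class (an assumption for this example, not a TensorFlow type) whose methods always return a new object. It also shows that a transformation whose result is not kept has no effect on the original.

```python
class ToyDataset:
    """Hypothetical stand-in for a dataset: every method returns a new object."""

    def __init__(self, items):
        self._items = tuple(items)  # stored as a tuple, so it cannot be edited in place

    def map(self, fn):
        return ToyDataset(fn(x) for x in self._items)

    def filter(self, predicate):
        return ToyDataset(x for x in self._items if predicate(x))

    def __iter__(self):
        return iter(self._items)


original = ToyDataset(range(10))
original.map(lambda x: x * 2)  # result discarded: `original` is unchanged

# Because each call returns a new ToyDataset, calls can be chained.
doubled_big = original.map(lambda x: x * 2).filter(lambda x: x > 10)

print(list(original))
print(list(doubled_big))
```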

In [5]:
def listing2_2():

    # Create initial dataset
    dataset = tf.data.Dataset.from_tensor_slices(tf.range(10))
    print("1. Original dataset, range [0..9]:\n")
    for item in dataset:
        print(item)
    HR()
    
    dataset2 = dataset.map(lambda x: x * 2)
    print("2. map and lambda (x * 2):\n")
    for item in dataset2:
        print(item)
    HR()
    
    dataset3 = dataset.filter(lambda x: x > 3)
    print("3. filter and lambda (x > 3):\n")
    for item in dataset3:
        print(item)
    HR()
            
    dataset4 = dataset.filter(lambda x: tf.reduce_sum(x) > 3)
    print("4. filter, lambda, reduce (tf.reduce_sum(x) > 3):\n")
    for item in dataset4:
        print(item)
        
listing2_2()

1. Original dataset, range [0..9]:

tf.Tensor(0, shape=(), dtype=int32)
tf.Tensor(1, shape=(), dtype=int32)
tf.Tensor(2, shape=(), dtype=int32)
tf.Tensor(3, shape=(), dtype=int32)
tf.Tensor(4, shape=(), dtype=int32)
tf.Tensor(5, shape=(), dtype=int32)
tf.Tensor(6, shape=(), dtype=int32)
tf.Tensor(7, shape=(), dtype=int32)
tf.Tensor(8, shape=(), dtype=int32)
tf.Tensor(9, shape=(), dtype=int32)
----------------------------------------
2. map and lambda (x * 2):

tf.Tensor(0, shape=(), dtype=int32)
tf.Tensor(2, shape=(), dtype=int32)
tf.Tensor(4, shape=(), dtype=int32)
tf.Tensor(6, shape=(), dtype=int32)
tf.Tensor(8, shape=(), dtype=int32)
tf.Tensor(10, shape=(), dtype=int32)
tf.Tensor(12, shape=(), dtype=int32)
tf.Tensor(14, shape=(), dtype=int32)
tf.Tensor(16, shape=(), dtype=int32)
tf.Tensor(18, shape=(), dtype=int32)
----------------------------------------
3. filter and lambda (x > 3):

tf.Tensor(4, shape=(), dtype=int32)
tf.Tensor(5, shape=(), dtype=int32)
tf.Tensor(6, shape=(), dt

<a id='2.3'></a><a name='2.3'></a>
## 2.3 More complicated chaining transformations
<a href="#top">[back to top]</a>

Here we reuse the same `dataset` variable throughout, so the successive calls amount to one long chain of transformations.

**Note**: 

We are not mutating the dataset: each transformation returns a new dataset object, and we simply rebind the name `dataset` to it.


In [6]:
def listing2_3():

    # Create initial dataset
    dataset = (
        tf.data.Dataset.from_tensor_slices(tf.range(10))
        .repeat(3)
        .batch(7)
    )
    print("1. Original dataset, range [0..9], repeat(3), batch(7):\n")
    for item in dataset:
        print(item)
    HR()
        
        
    dataset = dataset.map(lambda x: x * 2)
    print("2. map and lambda (x * 2):\n")
    for item in dataset:
        print(item)
    HR()
        
    dataset = dataset.filter(lambda x: tf.reduce_sum(x) > 50)
    print("3. filter, lambda, reduce (tf.reduce_sum(x) > 50):\n")
    for item in dataset:
        print(item)
    HR()
    
    print("dataset.take(2):\n")
    for item in dataset.take(2):
        print(item)
        
listing2_3()

1. Original dataset, range [0..9], repeat(3), batch(7):

tf.Tensor([0 1 2 3 4 5 6], shape=(7,), dtype=int32)
tf.Tensor([7 8 9 0 1 2 3], shape=(7,), dtype=int32)
tf.Tensor([4 5 6 7 8 9 0], shape=(7,), dtype=int32)
tf.Tensor([1 2 3 4 5 6 7], shape=(7,), dtype=int32)
tf.Tensor([8 9], shape=(2,), dtype=int32)
----------------------------------------
2. map and lambda (x * 2):

tf.Tensor([ 0  2  4  6  8 10 12], shape=(7,), dtype=int32)
tf.Tensor([14 16 18  0  2  4  6], shape=(7,), dtype=int32)
tf.Tensor([ 8 10 12 14 16 18  0], shape=(7,), dtype=int32)
tf.Tensor([ 2  4  6  8 10 12 14], shape=(7,), dtype=int32)
tf.Tensor([16 18], shape=(2,), dtype=int32)
----------------------------------------
3. filter, lambda, reduce (tf.reduce_sum(x) > 50):

tf.Tensor([14 16 18  0  2  4  6], shape=(7,), dtype=int32)
tf.Tensor([ 8 10 12 14 16 18  0], shape=(7,), dtype=int32)
tf.Tensor([ 2  4  6  8 10 12 14], shape=(7,), dtype=int32)
----------------------------------------
dataset.take(2):

tf.Tensor([14 

In [7]:
def listing2_3b():

    dataset = (
        tf.data.Dataset.from_tensor_slices(tf.range(10))
        .repeat(3)
        .batch(7)
        .map(lambda x: x * 2)
        .filter(lambda x: tf.reduce_sum(x) > 50)
    )

    for item in dataset.take(2):
        print(item)
    
listing2_3b()

tf.Tensor([14 16 18  0  2  4  6], shape=(7,), dtype=int32)
tf.Tensor([ 8 10 12 14 16 18  0], shape=(7,), dtype=int32)


<a id='2.4'></a><a name='2.4'></a>
## 2.4 Shuffling the Data
<a href="#top">[back to top]</a>

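Gradient descent works best when the instances in the training set are independent and identically distributed (i.i.d.). The `shuffle()` method approximates this: it fills a buffer of `buffer_size` elements, then repeatedly draws one element at random from the buffer and replaces it with the next element from the source dataset. We can chain it with the other dataset methods to shuffle the instances as appropriate.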
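
The buffer-based shuffling idea can be sketched in plain Python. This is a simplified illustration under stated assumptions, not TensorFlow's implementation: `toy_buffered_shuffle` is a hypothetical helper, and its random choices will not reproduce the orders printed by `tf.data` for the same seed.

```python
import random


def toy_buffered_shuffle(items, buffer_size, seed=None):
    """Yield items in a locally shuffled order using a fixed-size buffer."""
    rng = random.Random(seed)
    buffer = []
    for item in items:
        if len(buffer) < buffer_size:
            buffer.append(item)  # still filling the buffer
            continue
        # Buffer is full: emit a random element and put the new item in its slot.
        index = rng.randrange(buffer_size)
        yield buffer[index]
        buffer[index] = item
    # Source exhausted: emit whatever is left, in random order.
    rng.shuffle(buffer)
    yield from buffer


shuffled = list(toy_buffered_shuffle(range(20), buffer_size=4, seed=42))
print(shuffled)
```

With a small buffer, an element can only move forward by at most `buffer_size - 1` positions, so the result is only locally shuffled. This is why the buffer should be large relative to the dataset when a thorough shuffle matters.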

In [8]:
def listing2_4():

    dataset = (tf.data.Dataset
               .range(10)
               .repeat(2)
               .batch(7)
              )
    print("No shuffling:\n")
    for item in dataset:
        print(item)
    HR()
    
    dataset = (tf.data.Dataset
               .range(10)
               .repeat(2)
               .shuffle(buffer_size=4, seed=42)
               .batch(7)
              )
    print("With shuffling:\n")
    for item in dataset:
        print(item)
    
listing2_4()

No shuffling:

tf.Tensor([0 1 2 3 4 5 6], shape=(7,), dtype=int64)
tf.Tensor([7 8 9 0 1 2 3], shape=(7,), dtype=int64)
tf.Tensor([4 5 6 7 8 9], shape=(6,), dtype=int64)
----------------------------------------
With shuffling:

tf.Tensor([3 0 1 6 2 5 7], shape=(7,), dtype=int64)
tf.Tensor([8 4 1 9 4 2 3], shape=(7,), dtype=int64)
tf.Tensor([7 5 0 8 9 6], shape=(6,), dtype=int64)


---
<a id='3.0'></a><a name='3.0'></a>
# 3. tf.data API: End-to-end example
<a href="#top">[back to top]</a>

The tasks to explore here are:

* Interleaving lines from multiple files
* Build an input pipeline
* Preprocess the data
* Prefetching the data

<a id='3.1'></a><a name='3.1'></a>
## 3.1 Load and split dataset to multiple CSV files
<a href="#top">[back to top]</a>

Fetch, split and normalize the [California housing dataset](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_california_housing.html). 

---

**Note**:

[Example for reshape API:](https://stackoverflow.com/questions/18691084/what-does-1-mean-in-numpy-reshape)

<sup>
    
```python
import numpy as np
x = np.array([[2,3,4], [5,6,7]]) 

# Flatten any shape to 1-D
x = np.reshape(x, (-1)) # Flattened -> (6,)

# When you don't care about the number of rows and just want to fix the number of columns
x = np.reshape(x, (-1, 1)) # 1 column  -> (6, 1)
x = np.reshape(x, (-1, 2)) # 2 columns -> (3, 2)
x = np.reshape(x, (-1, 3)) # 3 columns -> (2, 3)

# When you don't care about the number of columns and just want to fix the number of rows
x = np.reshape(x, (1, -1)) # 1 row  -> (1, 6)
x = np.reshape(x, (2, -1)) # 2 rows -> (2, 3)
x = np.reshape(x, (3, -1)) # 3 rows -> (3, 2)
```
    
</sup>


In [9]:
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

housing = fetch_california_housing()

X_train_full, X_test, y_train_full, y_test = train_test_split(
    housing.data,
    housing.target.reshape(-1, 1),
    random_state=42
)

X_train, X_valid, y_train, y_valid = train_test_split(
    X_train_full,
    y_train_full,
    random_state=42
)

print(type(X_train))

# scaler = StandardScaler()
# scaler.fit(X_train)
# X_mean = scaler.mean_
# X_std = scaler.scale_

<class 'numpy.ndarray'>


For a very large dataset that does not fit in memory, you will typically want to split it into many files first, then have TensorFlow read these files in parallel. To demonstrate this, start by splitting the housing dataset and saving it to multiple CSV files (20 for the training set, 10 each for the validation and test sets).
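
As a small worked example of how the rows get distributed across files, the sketch below reproduces the chunk sizes that `np.array_split` produces when the row count is not evenly divisible: the first `n_rows % n_parts` chunks receive one extra row. The helper name `split_sizes` is hypothetical, introduced only for this illustration.

```python
def split_sizes(n_rows, n_parts):
    """Return the chunk sizes np.array_split would use for n_rows rows."""
    base, extra = divmod(n_rows, n_parts)
    # The first `extra` chunks get one additional row each.
    return [base + 1 if i < extra else base for i in range(n_parts)]


# 11,610 training rows split into 20 files, as in the cell below.
sizes = split_sizes(11_610, 20)
print(sizes)
```

For 11,610 rows and 20 parts this gives ten chunks of 581 rows followed by ten chunks of 580 rows.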

---

**Notes**:

`pathlib.Path()` provides more functionality than `os.path.join`.

`os.path.join` returns a plain string, while `pathlib.Path()` returns a path object (a `pathlib.PosixPath` on Linux and macOS). `pathlib.Path` also offers many built-in attributes and methods for file and directory handling, such as `mkdir`:


<sup> 
    
```
absolute, anchor, as_posix, as_uri, chmod, cwd, drive, exists, expanduser, glob, group, home, is_absolute, is_block_device, is_char_device, is_dir, is_fifo, is_file, is_mount, is_reserved, is_socket, is_symlink, iterdir, joinpath, lchmod, link_to, lstat, match, mkdir, name, open, owner, parent, parents, parts, read_bytes, read_text, relative_to, rename, replace, resolve, rglob, rmdir, root, samefile, stat, stem, suffix, suffixes, symlink_to, touch, unlink, with_name, with_suffix, write_bytes, write_text
```
    
</sup>
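
A quick way to see the difference between the two approaches is to build the same path both ways and inspect the results. The directory names below match the ones used in this notebook; nothing is created on disk.

```python
import os
from pathlib import Path

joined = os.path.join("data_chp13", "datasets", "housing")
path = Path("data_chp13") / "datasets" / "housing"

print(type(joined).__name__)  # a plain string
print(type(path).__name__)    # PosixPath or WindowsPath, depending on the OS

# Path objects expose their components as attributes.
print(path.name, path.parent.name, path.parts)

# os.fspath() converts the Path back to the equivalent string.
print(os.fspath(path) == joined)
```

Because functions such as `open()` accept path objects directly, converting to a string is usually only needed when an API insists on one.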

In [10]:
def save_to_csv_files(data, name_prefix, header=None, n_parts=10):
    print(f"n_parts: {n_parts}")
    
    # create pathlib.PosixPath
    housing_dir = Path() / DATA_ROOT / "datasets" / "housing"
    housing_dir.mkdir(parents=True, exist_ok=True)
    filename_format = "my_{}_{:02d}.csv"

    filepaths = []
    m = len(data)
    chunks = np.array_split(np.arange(m), n_parts)
    print(f"m: {m:,}")
    print(f"np.arange(m): {np.arange(m)}")
    print(f"len chunks: {len(chunks)}")
    print(f"len chunks[0]: {len(chunks[0])}")
    print(f"len chunks[1]: {len(chunks[1])}")
    HR()
    
    for file_idx, row_indices in enumerate(chunks):
        part_csv = housing_dir / filename_format.format(name_prefix, file_idx)
        filepaths.append(str(part_csv))
        print(f"len {len(row_indices)}: {part_csv}")
        
        with open(part_csv, "w") as f:
            if header is not None:
                f.write(header)
                f.write("\n")
                
            for row_idx in row_indices:
                f.write(",".join([repr(col) for col in data[row_idx]]))
                f.write("\n")
    return filepaths


# np.c_ stacks the feature matrix and the target column side by side, giving one row per instance
train_data = np.c_[X_train, y_train]
valid_data = np.c_[X_valid, y_valid]
test_data = np.c_[X_test, y_test]
header_cols = housing.feature_names + ["MedianHouseValue"]
header = ",".join(header_cols)

# Create list of file paths for training, validation, testing datasets
train_filepaths = save_to_csv_files(train_data, "train", header, n_parts=20)
valid_filepaths = save_to_csv_files(valid_data, "valid", header, n_parts=10)
test_filepaths = save_to_csv_files(test_data, "test", header, n_parts=10)

n_parts: 20
m: 11,610
np.arange(m): [    0     1     2 ... 11607 11608 11609]
len chunks: 20
len chunks[0]: 581
len chunks[1]: 581
----------------------------------------
len 581: data_chp13/datasets/housing/my_train_00.csv
len 581: data_chp13/datasets/housing/my_train_01.csv
len 581: data_chp13/datasets/housing/my_train_02.csv
len 581: data_chp13/datasets/housing/my_train_03.csv
len 581: data_chp13/datasets/housing/my_train_04.csv
len 581: data_chp13/datasets/housing/my_train_05.csv
len 581: data_chp13/datasets/housing/my_train_06.csv
len 581: data_chp13/datasets/housing/my_train_07.csv
len 581: data_chp13/datasets/housing/my_train_08.csv
len 581: data_chp13/datasets/housing/my_train_09.csv
len 580: data_chp13/datasets/housing/my_train_10.csv
len 580: data_chp13/datasets/housing/my_train_11.csv
len 580: data_chp13/datasets/housing/my_train_12.csv
len 580: data_chp13/datasets/housing/my_train_13.csv
len 580: data_chp13/datasets/housing/my_train_14.csv
len 580: data_chp13/datasets/hous

In [11]:
# This should be an extended note!
# numpy.c_
# Indexing routine
# Translates slice objects to concatenation along the second axis.
# This is short-hand for np.r_['-1,2,0', index expression], which is useful because of its common occurrence. In particular, arrays will be stacked along their last axis after being upgraded to at least 2-D with 1’s post-pended to the shape (column vectors made out of 1-D arrays).
# https://numpy.org/doc/stable/reference/generated/numpy.c_.html
# https://numpy.org/doc/stable/reference/generated/numpy.r_.html

test1 = np.c_[np.array([1,2,3])]
print(test1)
HR()

test2 = np.c_[np.array([1,2,3]), np.array([4,5,6])]
print(test2)
HR()

test3 = np.c_[np.array([[1,2,3]]), 0, 0, np.array([[4,5,6]])]
print(test3)

[[1]
 [2]
 [3]]
----------------------------------------
[[1 4]
 [2 5]
 [3 6]]
----------------------------------------
[[1 2 3 0 0 4 5 6]]


In [12]:
# Check the first few lines of these CSV files
import pandas as pd

pd.read_csv(train_filepaths[0]).head(3).T

Unnamed: 0,0,1,2
MedInc,3.5214,5.3275,3.1
HouseAge,15.0,5.0,29.0
AveRooms,3.049945,6.49006,7.542373
AveBedrms,1.106548,0.991054,1.591525
Population,1447.0,3464.0,1328.0
AveOccup,1.605993,3.44334,2.250847
Latitude,37.63,33.69,38.44
Longitude,-122.43,-117.39,-122.98
MedianHouseValue,1.442,1.687,1.621


In [13]:
# In text mode
with open(train_filepaths[0]) as f:
    for i in range(5):
        print(f.readline(), end="")

MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude,MedianHouseValue
3.5214,15.0,3.0499445061043287,1.106548279689234,1447.0,1.6059933407325193,37.63,-122.43,1.442
5.3275,5.0,6.490059642147117,0.9910536779324056,3464.0,3.4433399602385686,33.69,-117.39,1.687
3.1,29.0,7.5423728813559325,1.5915254237288134,1328.0,2.2508474576271187,38.44,-122.98,1.621
7.1736,12.0,6.289002557544757,0.9974424552429667,1054.0,2.6956521739130435,33.55,-117.7,2.621


In [14]:
# Alternative
print("".join(open(train_filepaths[0]).readlines()[:5]))

MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude,MedianHouseValue
3.5214,15.0,3.0499445061043287,1.106548279689234,1447.0,1.6059933407325193,37.63,-122.43,1.442
5.3275,5.0,6.490059642147117,0.9910536779324056,3464.0,3.4433399602385686,33.69,-117.39,1.687
3.1,29.0,7.5423728813559325,1.5915254237288134,1328.0,2.2508474576271187,38.44,-122.98,1.621
7.1736,12.0,6.289002557544757,0.9974424552429667,1054.0,2.6956521739130435,33.55,-117.7,2.621



In [15]:
train_filepaths

['data_chp13/datasets/housing/my_train_00.csv',
 'data_chp13/datasets/housing/my_train_01.csv',
 'data_chp13/datasets/housing/my_train_02.csv',
 'data_chp13/datasets/housing/my_train_03.csv',
 'data_chp13/datasets/housing/my_train_04.csv',
 'data_chp13/datasets/housing/my_train_05.csv',
 'data_chp13/datasets/housing/my_train_06.csv',
 'data_chp13/datasets/housing/my_train_07.csv',
 'data_chp13/datasets/housing/my_train_08.csv',
 'data_chp13/datasets/housing/my_train_09.csv',
 'data_chp13/datasets/housing/my_train_10.csv',
 'data_chp13/datasets/housing/my_train_11.csv',
 'data_chp13/datasets/housing/my_train_12.csv',
 'data_chp13/datasets/housing/my_train_13.csv',
 'data_chp13/datasets/housing/my_train_14.csv',
 'data_chp13/datasets/housing/my_train_15.csv',
 'data_chp13/datasets/housing/my_train_16.csv',
 'data_chp13/datasets/housing/my_train_17.csv',
 'data_chp13/datasets/housing/my_train_18.csv',
 'data_chp13/datasets/housing/my_train_19.csv']

<a id='3.2'></a><a name='3.2'></a>
## 3.2 Building an Input Pipeline
<a href="#top">[back to top]</a>

**API Note:**

[tf.data.Dataset.list_files](https://www.tensorflow.org/api_docs/python/tf/data/Dataset#list_files)

```python
@staticmethod
list_files(
    file_pattern, shuffle=None, seed=None, name=None
)
```

A dataset of all files matching one or more glob patterns.

The file_pattern argument should be a small number of glob patterns. If your filenames have already been globbed, use Dataset.from_tensor_slices(filenames) instead, as re-globbing every filename with list_files may result in poor performance with remote storage systems.
Note: The default behavior of this method is to return filenames in a non-deterministic random shuffled order. Pass a seed or shuffle=False to get results in a deterministic order.


In [16]:
# Create a dataset containing these file paths via tf.data.Dataset.list_files.
# By default, list_files() returns the file paths in shuffled order;
# passing a seed makes that order reproducible.

filepath_dataset = tf.data.Dataset.list_files(
    train_filepaths, 
    seed=42
)

print(filepath_dataset)

HR()

for filepath in filepath_dataset:
    print(filepath)

<ShuffleDataset element_spec=TensorSpec(shape=(), dtype=tf.string, name=None)>
----------------------------------------
tf.Tensor(b'data_chp13/datasets/housing/my_train_15.csv', shape=(), dtype=string)
tf.Tensor(b'data_chp13/datasets/housing/my_train_08.csv', shape=(), dtype=string)
tf.Tensor(b'data_chp13/datasets/housing/my_train_03.csv', shape=(), dtype=string)
tf.Tensor(b'data_chp13/datasets/housing/my_train_01.csv', shape=(), dtype=string)
tf.Tensor(b'data_chp13/datasets/housing/my_train_10.csv', shape=(), dtype=string)
tf.Tensor(b'data_chp13/datasets/housing/my_train_05.csv', shape=(), dtype=string)
tf.Tensor(b'data_chp13/datasets/housing/my_train_19.csv', shape=(), dtype=string)
tf.Tensor(b'data_chp13/datasets/housing/my_train_16.csv', shape=(), dtype=string)
tf.Tensor(b'data_chp13/datasets/housing/my_train_02.csv', shape=(), dtype=string)
tf.Tensor(b'data_chp13/datasets/housing/my_train_09.csv', shape=(), dtype=string)
tf.Tensor(b'data_chp13/datasets/housing/my_train_00.csv', sh

<a id='3.3'></a><a name='3.3'></a>
## 3.3 Interleaving lines from multiple files
<a href="#top">[back to top]</a>

Use `.interleave()` to read from five files at a time 
and interleave their lines, skipping the first line (the header row) of each file.
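
The round-robin behaviour can be sketched in plain Python. This is a simplified, sequential illustration with a hypothetical helper `toy_interleave`; the real `interleave()` can also read in parallel, and its exact handling of exhausted inputs may differ from this sketch.

```python
from itertools import islice

_NO_MORE = object()  # sentinel: no further sources to open


def toy_interleave(sources, cycle_length):
    """Pull one item at a time from up to `cycle_length` open sources."""
    source_iter = iter(sources)
    active = [iter(s) for s in islice(source_iter, cycle_length)]
    while active:
        still_active = []
        for reader in active:
            try:
                yield next(reader)
                still_active.append(reader)
            except StopIteration:
                # This source is exhausted: open the next one, if any remain.
                replacement = next(source_iter, _NO_MORE)
                if replacement is not _NO_MORE:
                    still_active.append(iter(replacement))
        active = still_active


# Three toy "files", each starting with a header line that we skip.
files = [
    ["header", "a1", "a2"],
    ["header", "b1"],
    ["header", "c1", "c2"],
]
lines = list(toy_interleave((f[1:] for f in files), cycle_length=2))
print(lines)
```

Only two sources are open at any time here; the third is opened once the second runs out, which is the role `cycle_length` plays in the notebook cell below.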

In [17]:
n_readers = 5

dataset = filepath_dataset.interleave(
    lambda filepath: tf.data.TextLineDataset(filepath).skip(1),
    cycle_length=n_readers,
    # Determine num of threads dynamically based on available CPU
    num_parallel_calls=tf.data.AUTOTUNE 
)

print(dataset)
HR()

for line in dataset.take(5):
    print(line.numpy())

<ParallelInterleaveDataset element_spec=TensorSpec(shape=(), dtype=tf.string, name=None)>
----------------------------------------
b'4.6477,38.0,5.03728813559322,0.911864406779661,745.0,2.5254237288135593,32.64,-117.07,1.504'
b'8.72,44.0,6.163179916317992,1.0460251046025104,668.0,2.794979079497908,34.2,-118.18,4.159'
b'3.8456,35.0,5.461346633416459,0.9576059850374065,1154.0,2.8778054862842892,37.96,-122.05,1.598'
b'3.3456,37.0,4.514084507042254,0.9084507042253521,458.0,3.2253521126760565,36.67,-121.7,2.526'
b'3.6875,44.0,4.524475524475524,0.993006993006993,457.0,3.195804195804196,34.04,-118.15,1.625'


In [18]:
# Note that the fourth field is parsed as a string, because its default value is a string
record_defaults = [
    0, 
    np.nan, 
    tf.constant(np.nan, dtype=tf.float64),
    "Hello",
    tf.constant([])
]

parsed_fields = tf.io.decode_csv('1,2,3,4,5', record_defaults)
parsed_fields

[<tf.Tensor: shape=(), dtype=int32, numpy=1>,
 <tf.Tensor: shape=(), dtype=float32, numpy=2.0>,
 <tf.Tensor: shape=(), dtype=float64, numpy=3.0>,
 <tf.Tensor: shape=(), dtype=string, numpy=b'4'>,
 <tf.Tensor: shape=(), dtype=float32, numpy=5.0>]

In [19]:
# All missing fields are replaced with their default value, when provided
parsed_fields = tf.io.decode_csv(',,,,5', record_defaults)
parsed_fields

[<tf.Tensor: shape=(), dtype=int32, numpy=0>,
 <tf.Tensor: shape=(), dtype=float32, numpy=nan>,
 <tf.Tensor: shape=(), dtype=float64, numpy=nan>,
 <tf.Tensor: shape=(), dtype=string, numpy=b'Hello'>,
 <tf.Tensor: shape=(), dtype=float32, numpy=5.0>]

In [20]:
# The 5th field is compulsory (since we provided tf.constant([]) as its
# default value), so we get an exception if we do not provide it.
try:
    parsed_fields = tf.io.decode_csv(',,,,', record_defaults)
except tf.errors.InvalidArgumentError as e:
    print(f"Error: {e}")

Error: Field 4 is required but missing in record 0! [Op:DecodeCSV]


In [21]:
# The number of fields should match exactly the number of fields in the record_defaults
try:
    parsed_fields = tf.io.decode_csv('1,2,3,4,5,6,7', record_defaults)
except tf.errors.InvalidArgumentError as e:
    print(f"Error: {e}")

Error: Expect 5 fields but have 7 in record 0 [Op:DecodeCSV]
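
The rules explored in the cells above (the default decides the field type, empty fields fall back to the default, and some fields can be compulsory) can be mimicked in plain Python. The sketch below is an illustration only: `toy_decode_csv` and the `Required` marker are hypothetical names, and the parsing is far simpler than `tf.io.decode_csv` (no quoting, no tensors).

```python
class Required:
    """Marks a field as compulsory and records the type it should be parsed as."""

    def __init__(self, dtype):
        self.dtype = dtype


def toy_decode_csv(line, record_defaults):
    fields = line.split(",")
    if len(fields) != len(record_defaults):
        raise ValueError(
            f"Expect {len(record_defaults)} fields but have {len(fields)}"
        )
    parsed = []
    for position, (text, default) in enumerate(zip(fields, record_defaults)):
        required = isinstance(default, Required)
        # The default value (or the Required marker) determines the field type.
        dtype = default.dtype if required else type(default)
        if text == "":
            if required:
                raise ValueError(f"Field {position} is required but missing")
            parsed.append(default)
        else:
            parsed.append(dtype(text))
    return parsed


defaults = [0, 0.0, 0.0, "Hello", Required(float)]
print(toy_decode_csv("1,2,3,4,5", defaults))
print(toy_decode_csv(",,,,5", defaults))
```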


<a id='3.4'></a><a name='3.4'></a>
## 3.4 Preprocessing the Data
<a href="#top">[back to top]</a>

Compute the mean and standard deviation of each feature.
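
For reference, standardizing a feature means subtracting its mean and dividing by its standard deviation. The sketch below does this for a single column in plain Python, using the population standard deviation (which is what `StandardScaler` uses). The helper name `standardize` is hypothetical, and the four `HouseAge` values are taken from the first rows of the CSV file printed earlier.

```python
import statistics


def standardize(column):
    """Return the standardized column together with its mean and std."""
    mean = statistics.fmean(column)
    std = statistics.pstdev(column)  # population standard deviation (ddof=0)
    return [(value - mean) / std for value in column], mean, std


house_age = [15.0, 5.0, 29.0, 12.0]
scaled, mean, std = standardize(house_age)
print(mean, std)
print(scaled)
```

After the transformation the column has zero mean and unit standard deviation, which is the same operation `preprocess()` applies to every feature with `(x - X_mean) / X_std`.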

In [22]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaler.fit(X_train)

StandardScaler()

In [23]:
X_mean, X_std = scaler.mean_, scaler.scale_
n_inputs = 8 # X_train.shape[-1]

def parse_csv_line(line):
    defs = [0.] * n_inputs + [tf.constant([], dtype=tf.float32)]
    fields = tf.io.decode_csv(line, record_defaults=defs)
    return tf.stack(fields[:-1]), tf.stack(fields[-1:])

@tf.function
def preprocess(line):
    x, y = parse_csv_line(line)
    return (x - X_mean) / X_std, y

result = preprocess(b'4.2083,44.0,5.3232,0.9171,846.0,2.3370,37.47,-122.2,2.782')
print(result)

(<tf.Tensor: shape=(8,), dtype=float32, numpy=
array([ 0.16579157,  1.216324  , -0.05204565, -0.39215982, -0.5277444 ,
       -0.2633488 ,  0.8543046 , -1.3072058 ], dtype=float32)>, <tf.Tensor: shape=(1,), dtype=float32, numpy=array([2.782], dtype=float32)>)


In [24]:
# n_inputs = 8 # X_train.shape[-1]

# @tf.function
# def preprocess(line):
#     defs = [0.] * n_inputs + [tf.constant([], dtype=tf.float32)]
#     fields = tf.io.decode_csv(line, record_defaults=defs)
#     x = tf.stack(fields[:-1])
#     y = tf.stack(fields[-1:])
#     return (x - X_mean) / X_std, y

# result = preprocess(b'4.2083,44.0,5.3232,0.9171,846.0,2.3370,37.47,-122.2,2.782')
# print(result)

<a id='3.5'></a><a name='3.5'></a>
## 3.5 Putting everything together (w/o Prefetching)
<a href="#top">[back to top]</a>

In [25]:
def csv_reader_dataset_no_prefetch(
    filepaths,
    n_readers=5,
    n_read_threads=None,
    n_parse_threads=5,
    shuffle_buffer_size=10_000,
    seed=42,
    batch_size=32
    ):
    dataset = tf.data.Dataset.list_files(filepaths, seed=seed)
    dataset = dataset.interleave(
        lambda filepath: tf.data.TextLineDataset(filepath).skip(1),
        cycle_length=n_readers, 
        num_parallel_calls=n_read_threads
    )
    dataset = dataset.map(preprocess, num_parallel_calls=n_parse_threads)
    dataset = dataset.shuffle(shuffle_buffer_size, seed=seed)
    dataset = dataset.batch(batch_size)
    return dataset
    
    # dataset = (
    #     dataset.map(preprocess, num_parallel_calls=n_parse_threads)
    #     .shuffle(shuffle_buffer_size, seed=seed)
    #     .batch(batch_size)
    # )
    # return dataset

In [26]:
example_set = csv_reader_dataset_no_prefetch(train_filepaths, batch_size=3)
for X_batch, y_batch in example_set.take(2):
    print("X =", X_batch)
    print("y =", y_batch)
    HR()

X = tf.Tensor(
[[-1.2345318   0.1879177  -0.18384208  0.19340092 -0.4273575   0.49201018
   1.0838584  -1.3871703 ]
 [-1.3836461  -0.7613805  -0.3076956  -0.07978077 -0.05045014  0.32237166
   0.50294524 -0.1027696 ]
 [-0.41767654 -0.91959685 -0.5876468  -0.01253252  2.441884   -0.30059808
  -0.68699217  0.521939  ]], shape=(3, 8), dtype=float32)
y = tf.Tensor(
[[0.804]
 [0.53 ]
 [1.745]], shape=(3, 1), dtype=float32)
----------------------------------------
X = tf.Tensor(
[[-0.58831733  0.02970133 -0.70486885  0.16348003  0.8174406  -0.29916376
  -0.70573175  0.6568782 ]
 [-1.3526396  -1.868895   -0.84703934 -0.0277291   0.58563805 -0.10333684
  -1.3756571   1.2116159 ]
 [-0.16590534  1.8491895  -0.24013318 -0.0694841  -0.141711   -0.41202638
   0.994848   -1.4321475 ]], shape=(3, 8), dtype=float32)
y = tf.Tensor(
[[2.045  ]
 [3.25   ]
 [5.00001]], shape=(3, 1), dtype=float32)
----------------------------------------


<a id='3.6'></a><a name='3.6'></a>
## 3.6 Putting everything together (with Prefetching)
<a href="#top">[back to top]</a>
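
Prefetching overlaps data preparation with consumption: while the training step works on one batch, the pipeline is already preparing the next. The following is a minimal pure-Python sketch of that idea using a background thread and a bounded queue. `toy_prefetch` is a hypothetical helper written for this illustration; the actual `tf.data` implementation differs.

```python
import queue
import threading


def toy_prefetch(iterable, buffer_size=1):
    """Yield items from `iterable`, prepared ahead of time by a background thread."""
    buffer = queue.Queue(maxsize=buffer_size)
    end_marker = object()

    def producer():
        try:
            for item in iterable:
                buffer.put(item)  # blocks while the buffer is full
        finally:
            buffer.put(end_marker)  # always signal the end to the consumer

    threading.Thread(target=producer, daemon=True).start()

    while True:
        item = buffer.get()  # blocks until the producer has an item ready
        if item is end_marker:
            return
        yield item


batches = list(toy_prefetch(range(5), buffer_size=1))
print(batches)
```

The order of the items is unchanged; only the timing differs, because up to `buffer_size` items are ready before the consumer asks for them.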

In [27]:
def csv_reader_dataset(
    filepaths,
    n_readers=5,
    n_read_threads=None,
    n_parse_threads=5,
    shuffle_buffer_size=10_000,
    seed=42,
    batch_size=32
    ):
    dataset = tf.data.Dataset.list_files(filepaths, seed=seed)
    dataset = dataset.interleave(
        lambda filepath: tf.data.TextLineDataset(filepath).skip(1),
        cycle_length=n_readers, 
        num_parallel_calls=n_read_threads
    )
    dataset = dataset.map(preprocess, num_parallel_calls=n_parse_threads)
    dataset = dataset.shuffle(shuffle_buffer_size, seed=seed)
    dataset = dataset.batch(batch_size)
    dataset = dataset.prefetch(1)
    return dataset

In [28]:
# Show the first couple of batches produced by the dataset
example_set = csv_reader_dataset(train_filepaths, batch_size=3)
for X_batch, y_batch in example_set.take(2):
    print("X =", X_batch)
    print("y =", y_batch)
    HR()

X = tf.Tensor(
[[-1.2345318   0.1879177  -0.18384208  0.19340092 -0.4273575   0.49201018
   1.0838584  -1.3871703 ]
 [-1.3836461  -0.7613805  -0.3076956  -0.07978077 -0.05045014  0.32237166
   0.50294524 -0.1027696 ]
 [-0.41767654 -0.91959685 -0.5876468  -0.01253252  2.441884   -0.30059808
  -0.68699217  0.521939  ]], shape=(3, 8), dtype=float32)
y = tf.Tensor(
[[0.804]
 [0.53 ]
 [1.745]], shape=(3, 1), dtype=float32)
----------------------------------------
X = tf.Tensor(
[[-0.58831733  0.02970133 -0.70486885  0.16348003  0.8174406  -0.29916376
  -0.70573175  0.6568782 ]
 [-1.3526396  -1.868895   -0.84703934 -0.0277291   0.58563805 -0.10333684
  -1.3756571   1.2116159 ]
 [-0.16590534  1.8491895  -0.24013318 -0.0694841  -0.141711   -0.41202638
   0.994848   -1.4321475 ]], shape=(3, 8), dtype=float32)
y = tf.Tensor(
[[2.045  ]
 [3.25   ]
 [5.00001]], shape=(3, 1), dtype=float32)
----------------------------------------
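
The function above hard-codes `prefetch(1)` and a fixed number of parse threads. As a hedged alternative, tf.data can choose these values at runtime when given `tf.data.AUTOTUNE`. The sketch below uses a toy `Dataset.range(10)` pipeline and a doubling `map` (both assumptions for illustration, not the housing CSV files) to show that prefetching changes when batches are prepared, not what they contain.

```python
import tensorflow as tf

# Toy pipeline (assumed data): ten integers, doubled, in batches of four
dataset = tf.data.Dataset.range(10)
dataset = dataset.map(lambda x: x * 2, num_parallel_calls=tf.data.AUTOTUNE)
dataset = dataset.batch(4)
# Let tf.data decide how many batches to prepare ahead of the consumer
dataset = dataset.prefetch(tf.data.AUTOTUNE)

batches = [batch.numpy().tolist() for batch in dataset]
print(batches)  # [[0, 2, 4, 6], [8, 10, 12, 14], [16, 18]]
```

Whether `AUTOTUNE` beats a fixed buffer of 1 depends on the hardware and the cost of preprocessing, so it is worth timing both settings on the real pipeline.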


In [29]:
# Short description of each method
dataset_cls = tf.data.Dataset  # avoid shadowing the built-in input()
print(f"{dataset_cls.__name__} methods:")
HR()
# m: name of a public method or attribute
for m in dir(dataset_cls):
    if not (m.startswith("_") or m.endswith("_")):
        func = getattr(dataset_cls, m)
        # Skip members without a docstring (func.__doc__ may be None)
        if func.__doc__:
            print("● {:22s}\t{}".format(m + "()", func.__doc__.split("\n")[0]))

DatasetV2 methods:
----------------------------------------
● apply()               	Applies a transformation function to this dataset.
● as_numpy_iterator()   	Returns an iterator which converts all elements of the dataset to numpy.
● batch()               	Combines consecutive elements of this dataset into batches.
● bucket_by_sequence_length()	A transformation that buckets elements in a `Dataset` by length.
● cache()               	Caches the elements in this dataset.
● cardinality()         	Returns the cardinality of the dataset, if known.
● choose_from_datasets()	Creates a dataset that deterministically chooses elements from `datasets`.
● concatenate()         	Creates a `Dataset` by concatenating the given dataset with this dataset.
● element_spec()        	The type specification of an element of this dataset.
● enumerate()           	Enumerates the elements of this dataset.
● filter()              	Filters this dataset according to `predicate`.
● flat_map()            	Maps `ma

<a id='3.7'></a><a name='3.7'></a>
## 3.7 Using the Dataset with tf.keras
<a href="#top">[back to top]</a>

In [30]:
train_set = csv_reader_dataset(train_filepaths)
valid_set = csv_reader_dataset(valid_filepaths)
test_set = csv_reader_dataset(test_filepaths)

print(train_set.__dict__)
HR()
print(valid_set.__dict__)
HR()
print(test_set.__dict__)

{'_input_dataset': <BatchDataset element_spec=(TensorSpec(shape=(None, 8), dtype=tf.float32, name=None), TensorSpec(shape=(None, 1), dtype=tf.float32, name=None))>, '_buffer_size': <tf.Tensor: shape=(), dtype=int64, numpy=1>, '_metadata': , '_variant_tensor_attr': <tf.Tensor: shape=(), dtype=variant, value=<PrefetchDatasetOp::Dataset>>, '_graph_attr': <tensorflow.python.framework.ops.Graph object at 0x122b2a6a0>, '_options_attr': <tensorflow.python.data.ops.options.Options object at 0x1272bdf10>}
----------------------------------------
{'_input_dataset': <BatchDataset element_spec=(TensorSpec(shape=(None, 8), dtype=tf.float32, name=None), TensorSpec(shape=(None, 1), dtype=tf.float32, name=None))>, '_buffer_size': <tf.Tensor: shape=(), dtype=int64, numpy=1>, '_metadata': , '_variant_tensor_attr': <tf.Tensor: shape=(), dtype=variant, value=<PrefetchDatasetOp::Dataset>>, '_graph_attr': <tensorflow.python.framework.ops.Graph object at 0x122b2a6a0>, '_options_attr': <tensorflow.python.data

In [31]:
# Clear Keras global state for Functional model-building API
keras.backend.clear_session()

# For reproducibility
np.random.seed(42)
tf.random.set_seed(42)


model = keras.models.Sequential([
    keras.layers.Dense(30, activation="relu", kernel_initializer="he_normal", input_shape=X_train.shape[1:]),
    keras.layers.Dense(1),
])

model.compile(loss="mse", optimizer="sgd")

# batch_size = 32

hist = model.fit(
    train_set,
    #steps_per_epoch=len(X_train) // batch_size,
    validation_data=valid_set,
    epochs=5,
)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


In [32]:
test_mse = model.evaluate(test_set)
new_set = test_set.take(3)  # pretend these 3 batches are new data
y_pred = model.predict(new_set)  # predict() also accepts a NumPy array

print(f"Evaluate result: {test_mse}")

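Note that `take(3)` on a batched dataset returns three batches, not three individual instances. The sketch below illustrates the difference on a toy dataset (100 integers in batches of 32, sizes assumed purely for illustration) and shows one possible way to keep exactly three instances.

```python
import tensorflow as tf

# Toy stand-in for a batched test set: 100 instances in batches of 32 (assumed sizes)
dataset = tf.data.Dataset.range(100).batch(32)

# take(3) keeps the first three batches, i.e. 96 instances here
n_batches = sum(1 for _ in dataset.take(3))
n_instances = sum(int(batch.shape[0]) for batch in dataset.take(3))

# To keep exactly three instances, unbatch first, then re-batch
three_instances = dataset.unbatch().take(3).batch(3)
values = [batch.numpy().tolist() for batch in three_instances]

print(n_batches, n_instances, values)  # 3 96 [[0, 1, 2]]
```
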

<a id='3.8'></a><a name='3.8'></a>
## 3.8 Custom training loop
<a href="#top">[back to top]</a>

In [33]:
# Define the optimizer and loss function for training
optimizer = tf.keras.optimizers.SGD(learning_rate=0.01)
loss_fn = keras.losses.mean_squared_error

n_epochs = 5

for epoch in range(n_epochs):
    # '\r' returns to the start of the line, so each epoch overwrites the previous one
    print("\rEpoch {}/{}".format(epoch + 1, n_epochs), end="")
    for X_batch, y_batch in train_set:
        # Do one gradient step
        with tf.GradientTape() as tape:
            y_pred = model(X_batch)
            main_loss = tf.reduce_mean(loss_fn(y_batch, y_pred))
            loss = tf.add_n([main_loss] + model.losses)
        gradients = tape.gradient(loss, model.trainable_variables)
        optimizer.apply_gradients(zip(gradients, model.trainable_variables))


Epoch 5/5
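
The loop above only prints the epoch counter, so it gives no indication of whether training is converging. A minimal, self-contained sketch of the same `GradientTape` pattern with a per-epoch mean loss is given here. It assumes synthetic data (`y = 3x + 2`), two plain `tf.Variable`s instead of the notebook's Keras model, and a hand-written SGD update with a learning rate of 0.1. All of these are stand-ins chosen so the example runs on its own.

```python
import tensorflow as tf

tf.random.set_seed(42)

# Synthetic regression data (assumption): y = 3x + 2, no noise
X = tf.random.normal((256, 1))
y = 3.0 * X + 2.0
train_set = tf.data.Dataset.from_tensor_slices((X, y)).batch(32)

# Parameters of a one-feature linear model
w = tf.Variable(tf.zeros((1, 1)))
b = tf.Variable(tf.zeros((1,)))
learning_rate = 0.1
n_epochs = 5

epoch_losses = []
for epoch in range(n_epochs):
    total_loss, n_batches = 0.0, 0
    for X_batch, y_batch in train_set:
        # Record the forward pass so gradients can be computed
        with tf.GradientTape() as tape:
            y_pred = tf.matmul(X_batch, w) + b
            loss = tf.reduce_mean(tf.square(y_batch - y_pred))
        grad_w, grad_b = tape.gradient(loss, [w, b])
        # Plain SGD update
        w.assign_sub(learning_rate * grad_w)
        b.assign_sub(learning_rate * grad_b)
        total_loss += float(loss)
        n_batches += 1
    epoch_losses.append(total_loss / n_batches)
    print(f"Epoch {epoch + 1}/{n_epochs} - mean loss: {epoch_losses[-1]:.4f}")
```

The same bookkeeping (accumulate the batch losses, report their mean at the end of each epoch) can be dropped into the Keras-based loop without changing the gradient step.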

<a id='3.9'></a><a name='3.9'></a>
## 3.9 Creating a TF Function to perform training loop
<a href="#top">[back to top]</a>

In [34]:
# Clear Keras global state for Functional model-building API
keras.backend.clear_session()

np.random.seed(42)
tf.random.set_seed(42)

# Create a TF Function that performs one full training epoch
@tf.function
def train_one_epoch(model, optimizer, loss_fn, train_set):
    for X_batch, y_batch in train_set:
        with tf.GradientTape() as tape:
            y_pred = model(X_batch)
            main_loss = tf.reduce_mean(loss_fn(y_batch, y_pred))
            loss = tf.add_n([main_loss] + model.losses)
        gradients = tape.gradient(loss, model.trainable_variables)
        optimizer.apply_gradients(zip(gradients, model.trainable_variables))

optimizer = tf.keras.optimizers.SGD(learning_rate=0.01)
loss_fn = tf.keras.losses.mean_squared_error

print(f"n_epochs: {n_epochs}")
HR()

for epoch in range(n_epochs):
    # '\r' is the carriage return
    # Use this to overwrite text, and stay on the same line.
    print("\rEpoch {}/{}".format(epoch + 1, n_epochs), end="")
    train_one_epoch(model, optimizer, loss_fn, train_set)

print()
print("Done training..")

n_epochs: 5
----------------------------------------
Epoch 5/5
Done training..
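
Because the whole epoch runs inside a `@tf.function`, the Python body is executed only while TensorFlow traces it into a graph. Later calls with a compatible dataset reuse that graph. The sketch below illustrates this on a toy function that sums a small dataset. The names `sum_dataset` and `trace_log` are hypothetical and not part of the notebook.

```python
import tensorflow as tf

trace_log = []  # Python-side record of how often the function body is traced


@tf.function
def sum_dataset(dataset):
    # Python side effect: runs during tracing, not on every graph execution
    trace_log.append("traced")
    total = tf.constant(0, dtype=tf.int64)
    # AutoGraph converts this loop over the dataset into graph operations
    for batch in dataset:
        total += tf.reduce_sum(batch)
    return total


dataset = tf.data.Dataset.range(10).batch(4)
first = int(sum_dataset(dataset))
second = int(sum_dataset(dataset))  # reuses the graph traced by the first call

print(first, second, len(trace_log))
```

This is also why the `print` call with the epoch counter stays outside `train_one_epoch`: a Python `print` placed inside the traced function would only fire during tracing.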
