## The Data API

The main objects used in the Data API are ```datasets```. Usually you will use datasets that gradually read data from disk, but let's start with an example that creates a dataset entirely in RAM.

In [1]:
import tensorflow as tf
from tensorflow import keras

In [2]:
X = tf.range(10)
dataset = tf.data.Dataset.from_tensor_slices(X)
dataset

<TensorSliceDataset shapes: (), types: tf.int32>

```from_tensor_slices()``` takes a tensor and creates a Dataset whose elements are all slices of X (along the first dimension). This dataset contains 10 items: tensors 0, 1, 2, ..., 9.

In [3]:
for item in dataset:
    print(item)

tf.Tensor(0, shape=(), dtype=int32)
tf.Tensor(1, shape=(), dtype=int32)
tf.Tensor(2, shape=(), dtype=int32)
tf.Tensor(3, shape=(), dtype=int32)
tf.Tensor(4, shape=(), dtype=int32)
tf.Tensor(5, shape=(), dtype=int32)
tf.Tensor(6, shape=(), dtype=int32)
tf.Tensor(7, shape=(), dtype=int32)
tf.Tensor(8, shape=(), dtype=int32)
tf.Tensor(9, shape=(), dtype=int32)


## Chaining Transformations

Dataset methods return a new Dataset, so we can chain transformations like so:

In [4]:
dataset = dataset.repeat(3).batch(7)
for item in dataset:
    print(item)

tf.Tensor([0 1 2 3 4 5 6], shape=(7,), dtype=int32)
tf.Tensor([7 8 9 0 1 2 3], shape=(7,), dtype=int32)
tf.Tensor([4 5 6 7 8 9 0], shape=(7,), dtype=int32)
tf.Tensor([1 2 3 4 5 6 7], shape=(7,), dtype=int32)
tf.Tensor([8 9], shape=(2,), dtype=int32)


Note that ```repeat()``` returns a new dataset that repeats the items of the original dataset three times, but it does not copy the data in memory three times!

```batch()``` put the 2 remaining items in a final, smaller tensor. To keep every batch the same size, pass ```drop_remainder=True```, which drops that last incomplete batch.
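To make the semantics of repeating and batching concrete, here is a minimal plain-Python sketch. The helper names ```repeat_items``` and ```batch``` are made up for illustration; this mimics the behaviour described above with lazy generators and is not how tf.data is implemented.

```python
from itertools import chain, islice, repeat

def repeat_items(items, count):
    """Lazily yield the items `count` times; nothing is copied up front."""
    return chain.from_iterable(repeat(items, count))

def batch(iterable, batch_size, drop_remainder=False):
    """Group consecutive items into lists of `batch_size` items."""
    iterator = iter(iterable)
    while True:
        chunk = list(islice(iterator, batch_size))
        if not chunk:
            return
        if len(chunk) < batch_size and drop_remainder:
            return  # discard the final incomplete batch
        yield chunk

batches = list(batch(repeat_items(range(10), 3), 7))
full_only = list(batch(repeat_items(range(10), 3), 7, drop_remainder=True))
print(batches[-1])     # [8, 9]
print(len(full_only))  # 4
```

The first call reproduces the five batches shown above, including the short last one; with ```drop_remainder=True``` only the four full batches remain.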

We can also apply transformations to each item with the ```map``` method. Sometimes we will perform expensive computations with this method, such as rotating or reshaping an image. To spawn multiple threads to speed things up, set the ```num_parallel_calls``` argument. Note that the function you pass to ```map``` must be convertible to a TF function.

In [5]:
for item in dataset.map(lambda x: x**2):
    print(item)

Cause: could not parse the source code:

for item in dataset.map(lambda x: x**2):

This error may be avoided by creating the lambda in a standalone statement.

tf.Tensor([ 0  1  4  9 16 25 36], shape=(7,), dtype=int32)
tf.Tensor([49 64 81  0  1  4  9], shape=(7,), dtype=int32)
tf.Tensor([16 25 36 49 64 81  0], shape=(7,), dtype=int32)
tf.Tensor([ 1  4  9 16 25 36 49], shape=(7,), dtype=int32)
tf.Tensor([64 81], shape=(2,), dtype=int32)


The ```apply()``` method applies a transformation to the dataset as a whole. Here we pass it the ```unbatch``` function, so each item in the new dataset is a single integer instead of a batch of seven integers.

In [6]:
dataset = dataset.apply(tf.data.Dataset.unbatch)
for item in dataset:
    print(item)

tf.Tensor(0, shape=(), dtype=int32)
tf.Tensor(1, shape=(), dtype=int32)
tf.Tensor(2, shape=(), dtype=int32)
tf.Tensor(3, shape=(), dtype=int32)
tf.Tensor(4, shape=(), dtype=int32)
tf.Tensor(5, shape=(), dtype=int32)
tf.Tensor(6, shape=(), dtype=int32)
tf.Tensor(7, shape=(), dtype=int32)
tf.Tensor(8, shape=(), dtype=int32)
tf.Tensor(9, shape=(), dtype=int32)
tf.Tensor(0, shape=(), dtype=int32)
tf.Tensor(1, shape=(), dtype=int32)
tf.Tensor(2, shape=(), dtype=int32)
tf.Tensor(3, shape=(), dtype=int32)
tf.Tensor(4, shape=(), dtype=int32)
tf.Tensor(5, shape=(), dtype=int32)
tf.Tensor(6, shape=(), dtype=int32)
tf.Tensor(7, shape=(), dtype=int32)
tf.Tensor(8, shape=(), dtype=int32)
tf.Tensor(9, shape=(), dtype=int32)
tf.Tensor(0, shape=(), dtype=int32)
tf.Tensor(1, shape=(), dtype=int32)
tf.Tensor(2, shape=(), dtype=int32)
tf.Tensor(3, shape=(), dtype=int32)
tf.Tensor(4, shape=(), dtype=int32)
tf.Tensor(5, shape=(), dtype=int32)
tf.Tensor(6, shape=(), dtype=int32)
tf.Tensor(7, shape=(), dtype=int32)
tf.Tensor(8, shape=(), dtype=int32)
tf.Tensor(9, shape=(), dtype=int32)

We can also filter data with ```filter```

In [7]:
dataset = dataset.filter(lambda x: x < 10)

And to look at just a few items from the dataset, use ```take```

In [8]:
for item in dataset.take(3):
    print(item)

tf.Tensor(0, shape=(), dtype=int32)
tf.Tensor(1, shape=(), dtype=int32)
tf.Tensor(2, shape=(), dtype=int32)


## Shuffling the data

The ```shuffle()``` method creates a new dataset that starts by filling a buffer with the first items of the source dataset. Whenever it is asked for an item, it pulls one out of the buffer at random and replaces it with a fresh one from the source dataset, until it has iterated through the entire source dataset. From then on it keeps pulling items out of the buffer at random until the buffer is empty.

You must specify the buffer size, and it is important to make it large enough, or else shuffling will not be very effective.

In [9]:
dataset = tf.data.Dataset.range(10).repeat(3)
dataset = dataset.shuffle(buffer_size=5, seed=42).batch(7)
for item in dataset:
    print(item)

tf.Tensor([0 2 3 6 7 9 4], shape=(7,), dtype=int64)
tf.Tensor([5 0 1 1 8 6 5], shape=(7,), dtype=int64)
tf.Tensor([4 8 7 1 2 3 0], shape=(7,), dtype=int64)
tf.Tensor([5 4 2 7 8 9 9], shape=(7,), dtype=int64)
tf.Tensor([3 6], shape=(2,), dtype=int64)
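The buffer mechanism described above can be sketched in plain Python. ```shuffle_with_buffer``` is a hypothetical helper written for these notes, and its random order will not match TensorFlow's for the same seed; it only illustrates why a small buffer limits how far an item can move.

```python
import random

def shuffle_with_buffer(source, buffer_size, seed=None):
    """Yield the items of `source` in a buffer-limited random order (buffer_size >= 1)."""
    rng = random.Random(seed)
    iterator = iter(source)
    buffer = []
    # Fill the buffer with the first items of the source.
    for item in iterator:
        buffer.append(item)
        if len(buffer) == buffer_size:
            break
    # Emit a random buffered item and replace it with the next source item.
    for item in iterator:
        idx = rng.randrange(len(buffer))
        yield buffer[idx]
        buffer[idx] = item
    # The source is exhausted: drain the buffer in random order.
    while buffer:
        yield buffer.pop(rng.randrange(len(buffer)))

shuffled = list(shuffle_with_buffer(range(30), buffer_size=5, seed=42))
print(shuffled)
```

With a buffer of 5, the item emitted at position k can never come from further than 4 places ahead in the source, which is why a small buffer on a large dataset only shuffles locally. With ```buffer_size=1``` the order is unchanged.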


For large datasets that do not fit in memory, this simple shuffling-buffer approach may not be sufficient, since the buffer will be small compared to the size of the dataset.

One solution is to shuffle the source data itself; for example, on Linux you can use ```shuf``` to shuffle text files.
Even if the source data is shuffled, you might want to shuffle it some more, or else the same order will be repeated at each epoch and the model might end up biased. A common approach is to split the source data into multiple files, then read them in a random order during training. However, instances located in the same file will still end up close to each other. To avoid this, you can pick multiple files randomly and read them simultaneously, interleaving their records.
Then on top of that we can add a shuffling buffer with ```shuffle```.

The best part about this: the Data API makes it easy for you to do all this.

### Interleaving lines from multiple files

Let's start by loading the California Housing dataset and splitting it into a training set, a validation set and a test set (```train_test_split``` shuffles the data by default). Finally, we split each set into many CSV files.

In [10]:
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

housing = fetch_california_housing()
X_train_full, X_test, y_train_full, y_test = train_test_split(
    housing.data, housing.target.reshape(-1, 1), random_state=42)
X_train, X_valid, y_train, y_valid = train_test_split(
    X_train_full, y_train_full, random_state=42)

scaler = StandardScaler()
scaler.fit(X_train)
X_mean = scaler.mean_
X_std = scaler.scale_

In [11]:
import os
import numpy as np  # needed by save_to_multiple_csv_files below

def save_to_multiple_csv_files(data, name_prefix, header=None, n_parts=10):
    housing_dir = os.path.join("datasets", "housing")
    os.makedirs(housing_dir, exist_ok=True)
    path_format = os.path.join(housing_dir, "my_{}_{:02d}.csv")

    filepaths = []
    m = len(data)
    for file_idx, row_indices in enumerate(np.array_split(np.arange(m), n_parts)):
        part_csv = path_format.format(name_prefix, file_idx)
        filepaths.append(part_csv)
        with open(part_csv, "wt", encoding="utf-8") as f:
            if header is not None:
                f.write(header)
                f.write("\n")
            for row_idx in row_indices:
                f.write(",".join([repr(col) for col in data[row_idx]]))
                f.write("\n")
    return filepaths

In [12]:
import numpy as np

train_data = np.c_[X_train, y_train]
valid_data = np.c_[X_valid, y_valid]
test_data = np.c_[X_test, y_test]
header_cols = housing.feature_names + ['MedianHouseValue']
header = ",".join(header_cols)

train_filepaths = save_to_multiple_csv_files(train_data, 'train', header, n_parts=20)
valid_filepaths = save_to_multiple_csv_files(valid_data, 'valid', header, n_parts=10)
test_filepaths = save_to_multiple_csv_files(test_data, 'test', header, n_parts=10)

In [13]:
import pandas as pd
pd.read_csv(train_filepaths[0]).head()

Unnamed: 0,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude,MedianHouseValue
0,3.5214,15.0,3.049945,1.106548,1447.0,1.605993,37.63,-122.43,1.442
1,5.3275,5.0,6.49006,0.991054,3464.0,3.44334,33.69,-117.39,1.687
2,3.1,29.0,7.542373,1.591525,1328.0,2.250847,38.44,-122.98,1.621
3,7.1736,12.0,6.289003,0.997442,1054.0,2.695652,33.55,-117.7,2.621
4,2.0549,13.0,5.312457,1.085092,3297.0,2.244384,33.93,-116.93,0.956


Now let's create a dataset containing only these file paths. By default, ```list_files``` returns a dataset that shuffles the file paths. You can set ```shuffle=False``` if you don't want that.

In [14]:
filepath_dataset = tf.data.Dataset.list_files(train_filepaths, seed=42)

Next we use the ```interleave``` method to read from five files at a time and interleave their lines

In [15]:
n_readers = 5
dataset = filepath_dataset.interleave(
    lambda filepath: tf.data.TextLineDataset(filepath).skip(1),
    cycle_length=n_readers)

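The ```interleave()``` method pulls five file paths from ```filepath_dataset``` and, for each one, calls the function we gave it (the lambda) to create a ```TextLineDataset``` that skips the first line (the header row). It then cycles through these five datasets, reading one line at a time from each. As the files run out of lines, it moves on to the next file paths and interleaves them in the same way, and so on until it runs out of file paths.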
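Here is a simplified plain-Python model of that round-robin reading, using in-memory "files" (a dict of made-up names and lines) instead of real CSVs. It processes files in groups of ```cycle_length``` as described above; the exact order in which TensorFlow moves on to the next files may differ, so treat this only as a sketch of the idea.

```python
from itertools import islice

def interleave_lines(files, cycle_length):
    """Round-robin over `cycle_length` files at a time, skipping each header line."""
    paths = iter(files)
    while True:
        group = list(islice(paths, cycle_length))
        if not group:
            return
        # One reader per file; [1:] plays the role of .skip(1).
        readers = [iter(files[path][1:]) for path in group]
        while readers:
            still_open = []
            for reader in readers:
                line = next(reader, None)
                if line is not None:
                    yield line
                    still_open.append(reader)
            readers = still_open

# Hypothetical file contents, for illustration only.
files = {
    "a.csv": ["header", "a1", "a2"],
    "b.csv": ["header", "b1", "b2", "b3"],
    "c.csv": ["header", "c1"],
}
lines = list(interleave_lines(files, cycle_length=2))
print(lines)  # ['a1', 'b1', 'a2', 'b2', 'b3', 'c1']
```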

In [16]:
for line in dataset.take(5):
    print(line.numpy())

b'4.2083,44.0,5.323204419889502,0.9171270718232044,846.0,2.3370165745856353,37.47,-122.2,2.782'
b'4.1812,52.0,5.701388888888889,0.9965277777777778,692.0,2.4027777777777777,33.73,-118.31,3.215'
b'3.6875,44.0,4.524475524475524,0.993006993006993,457.0,3.195804195804196,34.04,-118.15,1.625'
b'3.3456,37.0,4.514084507042254,0.9084507042253521,458.0,3.2253521126760565,36.67,-121.7,2.526'
b'3.5214,15.0,3.0499445061043287,1.106548279689234,1447.0,1.6059933407325193,37.63,-122.43,1.442'


These are the first 5 lines of the dataset, but they are byte strings, so we now need to parse them.

### Preprocessing Data

In [17]:
# Using X_mean, X_std from above...
n_inputs = 8

def preprocess(line):
    defs = [0.] * n_inputs + [tf.constant([], dtype=tf.float32)]
    fields = tf.io.decode_csv(line, record_defaults=defs)
    x = tf.stack(fields[:-1])  # Convert features to 1D tensor array
    y = tf.stack(fields[-1:])  # Convert target to 1D tensor array
    return (x - X_mean) / X_std, y

In [18]:
for line in dataset.take(1):
    print(preprocess(line.numpy()))

(<tf.Tensor: shape=(8,), dtype=float32, numpy=
array([ 0.36618188, -0.998705  ,  0.00781878, -0.00675364, -0.06140145,
        0.0072037 , -0.94465536,  0.9367464 ], dtype=float32)>, <tf.Tensor: shape=(1,), dtype=float32, numpy=array([2.418], dtype=float32)>)
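What ```preprocess()``` does to a single line can be mimicked in plain Python: split the byte string on commas, convert each field to a float, separate the features from the target, and standardize the features. The function name and the toy statistics below are hypothetical (they are not the housing values); in the TF version, the empty-tensor default for the last column also marks the target as a required field.

```python
def preprocess_line(line, means, stds):
    """Parse one CSV line (bytes) into standardized features and a target."""
    fields = [float(field) for field in line.decode("utf-8").split(",")]
    features, target = fields[:-1], fields[-1:]
    scaled = [(x - mean) / std for x, mean, std in zip(features, means, stds)]
    return scaled, target

# Hypothetical statistics for a toy record with two features.
toy_means = [3.0, 30.0]
toy_stds = [2.0, 10.0]
x, y = preprocess_line(b"5.0,20.0,1.5", toy_means, toy_stds)
print(x, y)  # [1.0, -1.0] [1.5]
```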


### Putting everything together / Prefetching

In [19]:
def csv_reader_dataset(filepaths, repeat=1, n_readers=5,
                       n_read_threads=None, shuffle_buffer_size=10000,
                       n_parse_threads=5, batch_size=32):
    dataset = tf.data.Dataset.list_files(filepaths)
    dataset = dataset.interleave(
        lambda filepath: tf.data.TextLineDataset(filepath).skip(1),
        cycle_length=n_readers, num_parallel_calls=n_read_threads)
    dataset = dataset.map(preprocess, num_parallel_calls=n_parse_threads)
    dataset = dataset.shuffle(shuffle_buffer_size).repeat(repeat)
    return dataset.batch(batch_size).prefetch(1)

The last line in the function above uses the ```prefetch``` method. This creates a dataset that tries to be one batch ahead. For example if we are training a model on this dataset, while the training is happening, the dataset will already be working in parallel on getting the next batch ready. This can dramatically improve performance.
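The idea behind prefetching can be sketched with a background thread and a bounded queue from the standard library. This is only an illustration of the producer/consumer pattern, not tf.data's implementation; the ```prefetch``` generator below is a made-up helper and it ignores error handling.

```python
import queue
import threading

def prefetch(iterable, buffer_size=1):
    """Yield items while a background thread prepares the next ones."""
    buffer = queue.Queue(maxsize=buffer_size)
    done = object()  # sentinel marking the end of the stream

    def producer():
        for item in iterable:
            buffer.put(item)  # blocks while the buffer is full
        buffer.put(done)

    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = buffer.get()
        if item is done:
            return
        yield item

batches = list(prefetch(([i, i + 1] for i in range(0, 6, 2)), buffer_size=1))
print(batches)  # [[0, 1], [2, 3], [4, 5]]
```

While the consumer is busy with one batch (training, in our case), the producer thread is already preparing the next one, so the consumer rarely has to wait.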

### Using the Dataset with tf.keras

Let's use the ```csv_reader_dataset``` function above to create a dataset for training.

In [20]:
train_set = csv_reader_dataset(train_filepaths, repeat=None)
valid_set = csv_reader_dataset(valid_filepaths)
test_set = csv_reader_dataset(test_filepaths)

And then build a model. Because the training set was created with ```repeat=None```, it repeats indefinitely, so we pass ```steps_per_epoch``` to tell Keras how many batches make up one epoch.

In [21]:
model = keras.models.Sequential([
    keras.layers.Dense(100, activation='elu', kernel_initializer='he_normal', 
                       input_shape=X_train.shape[1:]),
    keras.layers.Dense(1, activation = 'relu')
])
model.compile(optimizer='nadam', loss='mse')
batch_size=32
model.fit(train_set, epochs=10, validation_data=valid_set, steps_per_epoch=len(X_train)//batch_size)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<tensorflow.python.keras.callbacks.History at 0x7f777599be20>
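As a quick check of the ```steps_per_epoch``` value used above, here is the arithmetic spelled out. It assumes the California Housing dataset has 20,640 rows and that ```train_test_split``` keeps its default ```test_size``` of 0.25, rounding the held-out part up; both are assumptions about the setup, not values printed in this notebook.

```python
import math

n_total = 20640        # assumed number of rows in the California Housing dataset
test_fraction = 0.25   # assumed default test_size of train_test_split

n_test = math.ceil(n_total * test_fraction)
n_train_full = n_total - n_test
n_valid = math.ceil(n_train_full * test_fraction)
n_train = n_train_full - n_valid

batch_size = 32
steps_per_epoch = n_train // batch_size
print(n_train, steps_per_epoch)  # 11610 362
```

The integer division drops the last partial batch of each pass, which is harmless here because the repeated dataset simply carries on into the next pass.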

In [22]:
model.evaluate(test_set, steps=len(X_test)//batch_size)



0.37539780139923096

In [23]:
new_set = test_set.take(3).map(lambda X, y: X)
model.predict(new_set)

array([[3.9838567 ],
       [2.162112  ],
       [0.9305432 ],
       [2.865344  ],
       [0.9184288 ],
       [1.7495688 ],
       [1.753883  ],
       [1.5726929 ],
       [3.2011614 ],
       [1.3680485 ],
       [3.3565168 ],
       [1.5491389 ],
       [2.347571  ],
       [2.6110601 ],
       [1.9164817 ],
       [2.6606073 ],
       [0.8301281 ],
       [0.48761353],
       [2.5339346 ],
       [1.9821576 ],
       [1.1782722 ],
       [2.9220848 ],
       [2.7169647 ],
       [3.6323676 ],
       [2.7950797 ],
       [3.7449813 ],
       [3.7057714 ],
       [1.3309131 ],
       [2.0386727 ],
       [0.88035214],
       [1.4802783 ],
       [3.1646976 ],
       [0.6698898 ],
       [1.2645179 ],
       [2.0718355 ],
       [1.8404658 ],
       [1.6014103 ],
       [1.9512341 ],
       [2.2154174 ],
       [1.3453177 ],
       [0.48185408],
       [2.3247495 ],
       [0.9696155 ],
       [1.926871  ],
       [1.4848202 ],
       [2.8618097 ],
       [2.9666567 ],
       [3.292

## The TFRecord Format

TFRecord is TensorFlow's preferred format for storing large amounts of data and reading it efficiently. We can create a TFRecord file with ```tf.io.TFRecordWriter```

In [24]:
with tf.io.TFRecordWriter('my_data.tfrecord') as f:
    f.write(b'This is the first record')
    f.write(b'And this is the second record')

And then we use ```tf.data.TFRecordDataset``` to read one or more TFRecord files

In [25]:
filepaths = ['my_data.tfrecord']
dataset = tf.data.TFRecordDataset(filepaths)
for item in dataset:
    print(item)

tf.Tensor(b'This is the first record', shape=(), dtype=string)
tf.Tensor(b'And this is the second record', shape=(), dtype=string)


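By default a ```TFRecordDataset``` reads files one by one. We can also set ```num_parallel_reads``` to read several files in parallel and interleave their records, or use ```list_files()``` and ```interleave()``` as we did earlier.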

TFRecords can be compressed, which can be useful especially if they need to be loaded via a network connection.

In [26]:
options = tf.io.TFRecordOptions(compression_type='GZIP')
with tf.io.TFRecordWriter('my_compressed.tfrecord', options) as f:
    f.write(b'This is the first compressed record')
    f.write(b'And this is the second compressed record')

When reading a compressed file we need to specify the compression type

In [27]:
dataset = tf.data.TFRecordDataset(['my_compressed.tfrecord'],
                                  compression_type='GZIP')

### A brief introduction to protocol buffers

Protocol buffers (*protobufs*) are a portable, extensible and efficient binary format developed at Google. They are defined like so

In [28]:
%%writefile person.proto
syntax = "proto3";
message Person {
    string name = 1;
    int32 id = 2;
    repeated string email = 3;
}

Overwriting person.proto


Each Person object may (optionally) have a name of type string, an id of type int32 and zero or more email fields, each of type string. The numbers 1, 2 and 3 are the field identifiers.

Once our definition is ready in a *.proto* file, we can compile it using **protoc**, the protobuf compiler. See book pg 426 for example and explanations

### TensorFlow Protobufs

The main protobuf typically used in a TFRecord file is the Example protobuf, which represents one instance in a dataset. It contains a list of named features, where each feature can either be a list of byte strings, a list of floats or a list of integers. Here is the definition

In [29]:
%%writefile example.proto
syntax = "proto3";
message BytesList { repeated bytes value = 1; }
message FloatList { repeated float value = 1 [packed = true]; }
message Int64List { repeated int64 value = 1 [packed = true]; }
message Feature{
    oneof kind {
        BytesList bytes_list = 1;
        FloatList float_list = 2;
        Int64List int64_list = 3;
    }
};
message Features { map<string, Feature> feature = 1; };
message Example { Features features = 1; };

Overwriting example.proto


```[packed = true]``` is used for repeated numerical fields for a more efficient encoding. A ```Feature``` contains either a BytesList, a FloatList or an Int64List. A ```Features``` object contains a dictionary that maps a feature name to the corresponding feature value. Finally, an ```Example``` contains a Features object.

Here's how you could write a tf.train.Example representing the same person and write it to a TFRecord

In [30]:
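# Aliases via tf.train; a direct "from tensorflow.train import ..." fails in some TensorFlow versions
BytesList, FloatList, Int64List = tf.train.BytesList, tf.train.FloatList, tf.train.Int64List
Feature, Features, Example = tf.train.Feature, tf.train.Features, tf.train.Example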

person_example = Example(
    features = Features(
        feature = {
            "name": Feature(bytes_list=BytesList(value=[b"Alice"])),
            "id": Feature(int64_list=Int64List(value=[123])),
            "emails": Feature(bytes_list=BytesList(value=[b"a@b.com",
                                                          b"c@d.com"])),
        }))

You could also wrap this code inside a small helper function. Now that we have an example protobuf, we can serialize it by calling its ```SerializeToString()``` method.

In [31]:
with tf.io.TFRecordWriter("my_contacts.tfrecord") as f:
    f.write(person_example.SerializeToString())

Typically we would create a conversion script that reads from a current format (e.g. CSV), creates an example protobuf for each instance, serializes them, and saves them to several TFRecord files, ideally shuffling them in the process.

### Loading and parsing examples

To load the serialized Example protobufs we use a ```tf.data.TFRecordDataset``` again and parse each example with ```tf.io.parse_single_example()```. Since this is a TF operation, it can be included in a TF function.

The second argument to the function is a dictionary that maps each feature name to either a ```tf.io.FixedLenFeature``` descriptor, indicating the feature's shape, type and default value, or a ```tf.io.VarLenFeature``` descriptor, indicating only the type.

In [32]:
feature_description = {
    "name": tf.io.FixedLenFeature([], tf.string, default_value=""),
    "id": tf.io.FixedLenFeature([], tf.int64, default_value=0),
    "emails": tf.io.VarLenFeature(tf.string),
}
for serialized_example in tf.data.TFRecordDataset(["my_contacts.tfrecord"]):
    parsed_example = tf.io.parse_single_example(serialized_example,
                                                feature_description)

The fixed-length features are parsed as regular tensors, while the variable-length feature is parsed as a sparse tensor. Note that the key must match the one used when writing the record (```"emails"```); a mismatched key is silently parsed as an empty sparse tensor.

We can convert a sparse tensor to a dense one with ```tf.sparse.to_dense()```, but in this case it is simpler to access its values.

In [33]:
tf.sparse.to_dense(parsed_example['emails'], default_value=b"")

<tf.Tensor: shape=(2,), dtype=string, numpy=array([b'a@b.com', b'c@d.com'], dtype=object)>

In [34]:
parsed_example['emails'].values

<tf.Tensor: shape=(2,), dtype=string, numpy=array([b'a@b.com', b'c@d.com'], dtype=object)>

We could have also parsed the examples in batch using ```tf.io.parse_example()```

In [35]:
dataset = tf.data.TFRecordDataset(['my_contacts.tfrecord']).batch(10)
for serialized_examples in dataset:
    parsed_examples = tf.io.parse_example(serialized_examples, feature_description)

The Example protobuf will probably be sufficient for most cases. It might be a bit cumbersome when dealing with lists of lists. For this case we can use ```SequenceExample```

### Handling Lists of Lists using the SequenceExample Protobuf

Here is the definition of the SequenceExample protobuf

In [36]:
%%writefile sequence_example.protobuf
message FeatureList { repeated Feature feature = 1; };
message FeatureLists { map<string, FeatureList> feature_list = 1; };
message SequenceExample {
    Features context = 1;
    FeatureLists feature_lists = 2;
}

Overwriting sequence_example.protobuf


A SequenceExample has a Features object for the contextual data and a FeatureLists object that contains one or more named FeatureList objects. Each FeatureList contains a list of Feature objects, each of which may be a list of byte strings, a list of 64-bit integers or a list of floats.

To parse it, we use ```tf.io.parse_single_sequence_example``` for a single example, or ```tf.io.parse_sequence_example``` for a batch. If the feature lists contain sequences of varying sizes, you may want to convert them to ragged tensors using ```tf.RaggedTensor.from_sparse()```. See notebook for full code

## Preprocessing the Input Features

This section looks at including a preprocessing layer in a model to handle data type conversion, one-hot encoding, etc.

Here's an example of how to implement a standardization layer using a Lambda layer

In [37]:
means = np.mean(X_train, axis=0, keepdims=True)
stds = np.std(X_train, axis=0, keepdims=True)
eps = keras.backend.epsilon()
model = keras.models.Sequential([
    keras.layers.Lambda(lambda inputs: (inputs - means)/ (stds + eps))
])

To make it neater, we can subclass the Layer class

In [38]:
class Standardization(keras.layers.Layer):
    def adapt(self, data_sample):
        self.means_ = np.mean(data_sample, axis=0, keepdims=True)
        self.stds_ = np.std(data_sample, axis=0, keepdims=True)
    def call(self, inputs):
        return (inputs - self.means_) / (self.stds_ + keras.backend.epsilon())

Before using it, we need to call the adapt method passing a data sample, allowing it to use the appropriate mean and standard deviation for each feature

In [39]:
std_layer = Standardization()
std_layer.adapt(X_train)

model = keras.Sequential()
model.add(std_layer)
... 

Ellipsis
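To see why the standardization layer adds a small epsilon to the standard deviation, here is a plain-Python sketch of the same adapt-then-apply pattern. The function names are hypothetical, and ```EPSILON = 1e-7``` is assumed to match the default returned by ```keras.backend.epsilon()```.

```python
from statistics import fmean, pstdev

EPSILON = 1e-7  # assumed to match the default keras.backend.epsilon()

def adapt(rows):
    """Compute per-column means and (population) standard deviations."""
    columns = list(zip(*rows))
    means = [fmean(col) for col in columns]
    stds = [pstdev(col) for col in columns]
    return means, stds

def standardize(row, means, stds):
    """Scale one row; EPSILON avoids dividing by zero for constant columns."""
    return [(x - m) / (s + EPSILON) for x, m, s in zip(row, means, stds)]

sample = [[1.0, 5.0], [2.0, 5.0], [3.0, 5.0]]  # second column is constant
means, stds = adapt(sample)
scaled = [standardize(row, means, stds) for row in sample]
print(scaled)
```

The second column has a standard deviation of exactly zero, so without the epsilon the division would fail; with it, the constant feature is simply mapped to 0.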

### Encoding categorical features using One-Hot Vectors

Recall the California housing dataset has the ocean proximity feature with five possible values. We need to encode this feature before we feed it to a neural network. We can do this using a lookup table

In [40]:
vocab = ["<1H OCEAN", "INLAND", "NEAR OCEAN", "NEAR BAY", "ISLAND"]
indices = tf.range(len(vocab), dtype=tf.int64)
table_init = tf.lookup.KeyValueTensorInitializer(vocab, indices)
num_oov_buckets = 2
table = tf.lookup.StaticVocabularyTable(table_init, num_oov_buckets)

The number of *out-of-vocabulary (oov)* buckets is the number of extra indices reserved for categories that do not exist in our vocabulary. In this case, when an unknown category is looked up, the table computes a hash of it and uses that hash to assign it to one of the two oov buckets, whose indices are 5 and 6.

Why use oov buckets? If the number of categories is large (zip codes, cities, words, ...) or the dataset keeps changing, getting the full list of categories might not be convenient. One solution is to base the vocabulary on a sample of the data, using oov buckets for the unknowns. If there are not enough oov buckets, there will be collisions: different categories will end up in the same bucket, so the neural net will not be able to distinguish them.
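The bucket assignment can be sketched in plain Python. The sketch below uses SHA-256 as a stand-in hash (an assumption made for illustration; TensorFlow uses its own hash function, so the bucket an unknown category lands in will generally differ), and ```make_lookup``` is a made-up helper.

```python
import hashlib

def make_lookup(vocab, num_oov_buckets):
    """Return a function mapping a category to its index or to an oov bucket."""
    index = {word: i for i, word in enumerate(vocab)}

    def lookup(word):
        if word in index:
            return index[word]
        # Unknown category: hash it and pick one of the extra bucket indices.
        digest = hashlib.sha256(word.encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:8], "big") % num_oov_buckets
        return len(vocab) + bucket

    return lookup

lookup = make_lookup(["<1H OCEAN", "INLAND", "NEAR OCEAN", "NEAR BAY", "ISLAND"], 2)
ids = [lookup(c) for c in ["NEAR BAY", "DESERT", "INLAND", "INLAND"]]
print(ids)
```

Known categories always get their vocabulary index, while any unknown string deterministically lands on index 5 or 6. With only two buckets, many unknown categories necessarily share an index, which is the collision problem mentioned above.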

Let's put this into practice to one-hot encode some samples

In [41]:
categories = tf.constant(["NEAR BAY", "DESERT", "INLAND", "INLAND"])
cat_indices = table.lookup(categories)
cat_indices

<tf.Tensor: shape=(4,), dtype=int64, numpy=array([3, 5, 1, 1])>

In [42]:
cat_one_hot = tf.one_hot(cat_indices, depth=len(vocab) + num_oov_buckets)
cat_one_hot

<tf.Tensor: shape=(4, 7), dtype=float32, numpy=
array([[0., 0., 0., 1., 0., 0., 0.],
       [0., 0., 0., 0., 0., 1., 0.],
       [0., 1., 0., 0., 0., 0., 0.],
       [0., 1., 0., 0., 0., 0., 0.]], dtype=float32)>

Note that the unknown DESERT category was mapped to oov bucket 5, and that we had to tell ```tf.one_hot()``` the total number of indices (vocabulary size plus oov buckets) through the ```depth``` parameter.
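A tiny plain-Python version of one-hot encoding shows why ```depth``` has to cover the oov buckets too. In this sketch (my own helper, not the TF implementation), an index that falls outside the range ends up as an all-zero vector, which I believe mirrors what ```tf.one_hot``` does.

```python
def one_hot(indices, depth):
    """Encode each index as a list of `depth` floats with a single 1.0."""
    return [[1.0 if position == index else 0.0 for position in range(depth)]
            for index in indices]

vectors = one_hot([3, 5, 1, 1], depth=7)
print(vectors[0])  # [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0]

# With depth=5 (vocabulary only), the oov index 5 would be lost.
too_short = one_hot([5], depth=5)
print(too_short[0])  # [0.0, 0.0, 0.0, 0.0, 0.0]
```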

Keras has the ```keras.layers.TextVectorization``` layer to create the lookup table with distinct values for you. You can add it to your model followed by the tf.one_hot() function if you want to convert it to one-hot vectors.

This might not be the best solution; the size of each one-hot vector is the vocabulary length plus the number of oov buckets. If the vocabulary is large, it is more efficient to use *embeddings* instead.

*Tip*: As a rule of thumb, if the number of categories is less than 10, one-hot encoding is the way to go. If it is greater than 50, embeddings are better. In between you can experiment with both and see which works best.

### Encoding Categorical Features using Embeddings

In [47]:
(np.random.random(), np.random.random())

(0.14572394467959282, 0.8999561513490448)

An embedding is a trainable dense vector that represents a category. By default they're initialized randomly, for example the category "NEAR BAY" could be initialized as [0.1339, 0.6078]. This example shows a 2D embedding, but the number of dimensions is a hyperparameter you can tweak. 

As the embeddings are trainable, they gradually improve during training, and Gradient Descent tends to push similar categories closer together. For example, INLAND will probably end up far from the other categories above (NEAR OCEAN, <1H OCEAN, etc.).

[Word2Vec Reference](https://arxiv.org/abs/1310.4546)

We'll implement an embedding manually to understand how they work. Start with an *embedding matrix* containing each category's embedding, initialized randomly. The matrix contains one row per category and oov bucket and one column per embedding dimension

In [48]:
embedding_dim = 2
embed_init = tf.random.uniform([len(vocab) + num_oov_buckets, embedding_dim])
embedding_matrix = tf.Variable(embed_init)

In [57]:
embedding_matrix

<tf.Variable 'Variable:0' shape=(7, 2) dtype=float32, numpy=
array([[0.01089716, 0.93421566],
       [0.39437556, 0.2496841 ],
       [0.77807224, 0.40052128],
       [0.8927587 , 0.35086238],
       [0.506053  , 0.8094338 ],
       [0.387424  , 0.14231634],
       [0.39908934, 0.49101698]], dtype=float32)>

Now we'll encode the same batch of features, but using embeddings instead of one-hot vectors

In [54]:
categories

<tf.Tensor: shape=(4,), dtype=string, numpy=array([b'NEAR BAY', b'DESERT', b'INLAND', b'INLAND'], dtype=object)>

In [55]:
cat_indices = table.lookup(categories)
cat_indices

<tf.Tensor: shape=(4,), dtype=int64, numpy=array([3, 5, 1, 1])>

In [56]:
tf.nn.embedding_lookup(embedding_matrix, cat_indices)

<tf.Tensor: shape=(4, 2), dtype=float32, numpy=
array([[0.8927587 , 0.35086238],
       [0.387424  , 0.14231634],
       [0.39437556, 0.2496841 ],
       [0.39437556, 0.2496841 ]], dtype=float32)>

```tf.nn.embedding_lookup``` looks up rows in the embedding matrix, at the given indices.
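An embedding lookup is just row selection, and it gives the same result as multiplying a one-hot vector by the embedding matrix, only without the wasted multiplications by zero. A small plain-Python sketch with a made-up 3x2 matrix (both helper functions are hypothetical):

```python
def embedding_lookup(matrix, indices):
    """Select one row of `matrix` per index."""
    return [matrix[i] for i in indices]

def one_hot_matmul(matrix, indices):
    """Same result, computed as one_hot(index) @ matrix."""
    rows = []
    for index in indices:
        one_hot = [1.0 if i == index else 0.0 for i in range(len(matrix))]
        rows.append([sum(w * row[d] for w, row in zip(one_hot, matrix))
                     for d in range(len(matrix[0]))])
    return rows

toy_matrix = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]  # hypothetical embeddings
looked_up = embedding_lookup(toy_matrix, [2, 0, 0])
print(looked_up)  # [[0.5, 0.6], [0.1, 0.2], [0.1, 0.2]]
```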

With Keras, we can use ```keras.layers.Embedding``` to handle the embedding matrix. When the layer is created, it initializes the matrix randomly, and when it is called with category indices it returns the corresponding rows

In [60]:
embedding = keras.layers.Embedding(input_dim=len(vocab) + num_oov_buckets, output_dim=embedding_dim)
embedding(cat_indices)

<tf.Tensor: shape=(4, 2), dtype=float32, numpy=
array([[-0.03869525, -0.02267827],
       [ 0.00806278,  0.0328465 ],
       [ 0.00392421,  0.03095348],
       [ 0.00392421,  0.03095348]], dtype=float32)>

We can put everything together and create a Keras model to process categorical features (along with numerical ones) and learn an embedding for each category (and oov bucket)

In [71]:
regular_inputs = keras.layers.Input(shape=[8])
categories = keras.layers.Input(shape=[], dtype=tf.string)

cat_indices = keras.layers.Lambda(lambda cats: table.lookup(cats))(categories)
cat_embed = keras.layers.Embedding(input_dim=len(vocab) + num_oov_buckets, output_dim=2)(cat_indices)
encoded_inputs = keras.layers.concatenate([regular_inputs, cat_embed])

output = keras.layers.Dense(1)(encoded_inputs)

model = keras.models.Model(inputs=[regular_inputs, categories], outputs=[output])
...

Ellipsis

### Keras Preprocessing Layers

We already discussed two examples, ```keras.layers.Normalization``` and ```keras.layers.TextVectorization```, which perform feature standardization and word encoding respectively. In both cases the pattern is the same: create the layer, call its ```adapt()``` method with a data sample, and then use the layer in the model. Other preprocessing layers follow the same pattern.

The ```keras.layers.Discretization``` layer chops continuous data into different bins and encodes each bin as a one-hot vector. For example, we can discretize prices into three categories (low, medium, high) with encodings [1, 0, 0], [0, 1, 0], [0, 0, 1]. This loses a lot of information, but in some cases it can help the model detect patterns that would not be obvious when looking at the continuous values.
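The price example can be sketched in plain Python with the standard ```bisect``` module. The cut points below are hypothetical, and how a value that sits exactly on a boundary is binned is a choice made in this sketch, not a statement about the Keras layer.

```python
from bisect import bisect_right

def discretize(value, boundaries):
    """Return the index of the bin that `value` falls into."""
    return bisect_right(boundaries, value)

def one_hot_bin(value, boundaries):
    """One-hot encode the bin index; there is one more bin than boundaries."""
    bin_index = discretize(value, boundaries)
    return [1 if i == bin_index else 0 for i in range(len(boundaries) + 1)]

price_boundaries = [100.0, 500.0]  # hypothetical cut points: low | medium | high
print(one_hot_bin(50.0, price_boundaries))   # [1, 0, 0]
print(one_hot_bin(250.0, price_boundaries))  # [0, 1, 0]
print(one_hot_bin(900.0, price_boundaries))  # [0, 0, 1]
```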

**Warning:** The preprocessing layers are frozen during training, so their parameters are not affected by Gradient Descent, hence they do not need to be differentiable. This also means you should not use an ```Embedding``` layer in a custom pre-processing layer as their weights will not be trainable. Instead, add it separately to the model.

You can also chain preprocessing steps using the ```PreprocessingStage``` class. For example, you can create a pipeline that normalizes the inputs then discretizes them. After adapting this layer to your data sample, you can use it as a regular layer (but again, only at the start since it contains a nondifferentiable preprocessing layer)

Note: at the time of writing, the ```Discretization``` and ```PreprocessingStage``` layers are still experimental, so I'm leaving this snippet commented out for now

In [75]:
# normalization = keras.layers.Normalization()
# discretization = keras.layers.Discretization([...])
# pipeline = keras.layers.PreprocessingStage([normalization, ...])
# pipeline.adapt(...)

See note about ```TextVectorization``` bag of words and TF-IDF on page 438

## TF Transform

Preprocessing can be computationally expensive, and in such cases handling the data before training rather than on the fly can give a significant speedup. If the dataset is small enough to fit in memory, you can use its ```cache()``` method. If it is larger, tools such as Apache Beam or Spark can help.

This is fine for training, but what if you want to deploy the trained model to a mobile app? You'll need to write some code in the app to take care of preprocessing the data before it is fed to the model. If you also want to deploy to TensorFlow.js so that it runs in a browser, you'll have yet more preprocessing code and a maintenance nightmare: whenever the preprocessing changes, you'll have to update the logic on all of these platforms, which is time consuming and error prone.

One solution is to take the trained model and before deploying it on your app, add extra preprocessing layers to take care of preprocessing on the fly. 

TF Transform was designed so that you only have to define your preprocessing once. It is part of the TensorFlow Extended package, an end-to-end platform for productionizing TF models, and does not come bundled with tensorflow so we have to install it.

Note: getting error on the installation below, might be due to running under WSL?

In [86]:
!pip3 install tensorflow-transform

Collecting tensorflow-transform
  Using cached tensorflow_transform-0.22.0-py3-none-any.whl (326 kB)
Collecting apache-beam[gcp]<3,>=2.20
  Using cached apache_beam-2.23.0-cp38-cp38-manylinux2010_x86_64.whl (9.8 MB)
ERROR: Could not find a version that satisfies the requirement tfx-bsl<0.23,>=0.22 (from tensorflow-transform) (from versions: none)
ERROR: No matching distribution found for tfx-bsl<0.23,>=0.22 (from tensorflow-transform)


With TF Transform, we can define the preprocessing function just once in Python, using the TF Transform functions for scaling, bucketizing, etc.

In [87]:
# import tensorflow_transform as tft

# def preprocess(inputs):
#     median_age = inputs['housing_median_age']
#     ocean_proximity = inputs['ocean_proximity']
#     standardized_age = tft.scale_to_z_score(median_age)
#     ocean_proximity_id =  tft.compute_and_apply_vocabulary(ocean_proximity)
#     return {
#         "standardized_median_age": standardized_age,
#         "ocean_proximity_id": ocean_proximity_id
#     }

See the book for more information.

## TensorFlow Datasets (TFDS) Project

The TFDS project makes it easy to download common datasets, from small ones like MNIST or Fashion MNIST to huge ones like ImageNet. [Link to full list of datasets](https://homl.info/tfds)

In [89]:
!pip3 install tensorflow-datasets



In [90]:
import tensorflow_datasets as tfds

dataset = tfds.load(name='mnist')
mnist_train, mnist_test = dataset['train'], dataset['test']

Downloading and preparing dataset mnist/3.0.1 (download: 11.06 MiB, generated: 21.00 MiB, total: 32.06 MiB) to /home/carlos/tensorflow_datasets/mnist/3.0.1...

Dataset mnist downloaded and prepared to /home/carlos/tensorflow_datasets/mnist/3.0.1. Subsequent calls will reuse this data.


We can then transform it

In [94]:
mnist_train

<PrefetchDataset shapes: {image: (28, 28, 1), label: ()}, types: {image: tf.uint8, label: tf.int64}>

In [97]:
mnist_train = mnist_train.shuffle(10000).batch(32).prefetch(1)
for item in mnist_train:
    images = item['image']
    labels = item['label']
    ... # do things

Each item in the dataset is a dictionary containing features and labels, but Keras expects each item to be a tuple containing two elements (features and labels). Let's transform the dataset

In [98]:
mnist_train = dataset['train'].shuffle(10000).batch(32)  # start from the raw split to avoid batching twice
mnist_train = mnist_train.map(lambda items: (items["image"], items["label"]))
mnist_train = mnist_train.prefetch(1)

We can do this more conveniently by passing ```as_supervised=True``` to the ```load()``` function. You can also specify the batch size with ```batch_size```, and shuffle the input files with ```shuffle_files=True```

In [103]:
dataset = tfds.load(name='mnist', batch_size=32, as_supervised=True)
mnist_train = dataset['train'].prefetch(1)
model = keras.models.Sequential([
    keras.layers.Flatten(),
    keras.layers.Dense(10, activation='softmax')
])
model.compile(loss="sparse_categorical_crossentropy", optimizer="sgd", metrics=['accuracy'])
model.fit(mnist_train, epochs=5)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<tensorflow.python.keras.callbacks.History at 0x7f7767f9c5e0>

# Exercises