<a href="https://colab.research.google.com/github/adhadse/colab_repo/blob/master/homl/Ch%2013%20Loding%20and%20Preprocessing%20Data%20with%20TensorFlow.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Chapter 13: Loading and Preprocessing Data with TensorFlow
This work partially combines text and code from the book [Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow, 2nd Edition](https://www.oreilly.com/library/view/hands-on-machine-learning/9781492032632/). It is intended only as a reference, and readers are encouraged to follow along with a purchased copy of the book.

---
Deep Learning often requires datasets so large that they cannot fit into memory.

TensorFlow Data API: makes it very easy to point to a data source on disk and specify how to transform it. TensorFlow takes care of all the implementation details, such as multithreading, queuing, batching, and prefetching.

The Data API also supports reading from SQL databases and from TensorFlow's TFRecord format, which is an efficient binary format based on Protocol Buffers. Many open source extensions are available to read from various other data sources.

We also need to preprocess this data before feeding it to the ML model.

In this chapter the focus will be on the Data API, the TFRecord format, and how to create a custom preprocessing layer using Keras. At the end we will also take a look at a few related projects:

- *TF Transform (tf.Transform)*
- *TF Datasets (TFDS)*

In [1]:
from tensorflow import keras
import tensorflow as tf
import pandas as pd
import numpy as np

# The Data API
The whole Data API revolves around the concept of a ***Dataset***: <mark>it represents a sequence of data items, which you will usually read from disk.</mark>

For now we create a dataset entirely in RAM, using

**`tf.data.Dataset.from_tensor_slices()`**:

which takes a tensor and creates a `tf.data.Dataset` whose elements are all the slices of `X` (along the first dimension).

In [None]:
X = tf.range(10) # any data tensor
dataset = tf.data.Dataset.from_tensor_slices(X)
dataset

<TensorSliceDataset shapes: (), types: tf.int32>

Iterating over this dataset is also simple:

In [None]:
for item in dataset:
  print(item)

tf.Tensor(0, shape=(), dtype=int32)
tf.Tensor(1, shape=(), dtype=int32)
tf.Tensor(2, shape=(), dtype=int32)
tf.Tensor(3, shape=(), dtype=int32)
tf.Tensor(4, shape=(), dtype=int32)
tf.Tensor(5, shape=(), dtype=int32)
tf.Tensor(6, shape=(), dtype=int32)
tf.Tensor(7, shape=(), dtype=int32)
tf.Tensor(8, shape=(), dtype=int32)
tf.Tensor(9, shape=(), dtype=int32)
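As a small illustration of slicing along the first dimension, the sketch below passes a tuple of two tensors to `from_tensor_slices()`. The `features` and `labels` names and values are made up for this example; each dataset element should then be one `(feature_row, label)` pair.

```python
import tensorflow as tf

# Hypothetical toy data: 3 instances with 2 features each, plus one label per instance.
features = tf.constant([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
labels = tf.constant([0, 1, 0])

# Both tensors are sliced along their first dimension and paired element by element.
pairs = tf.data.Dataset.from_tensor_slices((features, labels))

elements = [(x.numpy().tolist(), int(y)) for x, y in pairs]
for element in elements:
    print(element)
```

This is the usual way to keep inputs and targets aligned when the whole dataset fits in memory.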


## Chaining Transformations
Once we have a dataset, we can apply all sorts of transformations to it and even chain them.

To transform a dataset we need to call its transformation methods, each of which returns a new dataset.

In [None]:
# drop_remainder=True drops the final batch if it has fewer than 7 items,
# so that every batch has exactly the specified shape
dataset = dataset.repeat(3).batch(7, drop_remainder=True)
for item in dataset:
  print(item)

tf.Tensor([0 1 2 3 4 5 6], shape=(7,), dtype=int32)
tf.Tensor([7 8 9 0 1 2 3], shape=(7,), dtype=int32)
tf.Tensor([4 5 6 7 8 9 0], shape=(7,), dtype=int32)
tf.Tensor([1 2 3 4 5 6 7], shape=(7,), dtype=int32)


>🟠 These dataset methods DO NOT modify the dataset; they create new ones.
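A quick way to convince yourself of this is to keep a reference to the original dataset and iterate over it after a transformation. The sketch below uses made-up names (`base`, `doubled`) purely for illustration.

```python
import tensorflow as tf

base = tf.data.Dataset.range(3)
doubled = base.map(lambda x: x * 2)  # returns a new dataset object

# Iterating over `base` afterwards still yields the untransformed items.
base_items = [int(x) for x in base]
doubled_items = [int(x) for x in doubled]
print(base_items, doubled_items)
```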

**`map()`**: applies a transformation to each item.

We can also use a lambda to apply transformations to the dataset using the `map()` method. This is where most of the preprocessing work will happen, so the function passed here must be convertible to a TF Function.

Preprocessing can become computationally intensive, so setting the `num_parallel_calls` argument can speed things up by distributing the workload across multiple threads.

In [None]:
dataset = dataset.map(lambda x: x *2)
for item in dataset:
  print(item)

tf.Tensor([ 0  2  4  6  8 10 12], shape=(7,), dtype=int32)
tf.Tensor([14 16 18  0  2  4  6], shape=(7,), dtype=int32)
tf.Tensor([ 8 10 12 14 16 18  0], shape=(7,), dtype=int32)
tf.Tensor([ 2  4  6  8 10 12 14], shape=(7,), dtype=int32)
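To make the `num_parallel_calls` argument mentioned above concrete, here is a minimal sketch. The function `scale_and_shift` and the value `4` are arbitrary choices for illustration, not a recommendation; the right number of threads depends on the machine and the workload.

```python
import tensorflow as tf

def scale_and_shift(x):
    # Only TensorFlow ops are used here, so the function can be traced into a graph.
    return tf.cast(x, tf.float32) * 0.5 + 1.0

numbers = tf.data.Dataset.range(8)

# Up to 4 items are processed concurrently; by default the output order is preserved.
parallel = numbers.map(scale_and_shift, num_parallel_calls=4)

results = [float(v) for v in parallel]
print(results)
```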


**`apply()`**: Applies transformation to the whole dataset.


In [None]:
dataset = dataset.apply(tf.data.experimental.unbatch())
for item in dataset:
  print(item)

Instructions for updating:
Use `tf.data.Dataset.unbatch()`.
tf.Tensor(0, shape=(), dtype=int32)
tf.Tensor(2, shape=(), dtype=int32)
tf.Tensor(4, shape=(), dtype=int32)
tf.Tensor(6, shape=(), dtype=int32)
tf.Tensor(8, shape=(), dtype=int32)
tf.Tensor(10, shape=(), dtype=int32)
tf.Tensor(12, shape=(), dtype=int32)
tf.Tensor(14, shape=(), dtype=int32)
tf.Tensor(16, shape=(), dtype=int32)
tf.Tensor(18, shape=(), dtype=int32)
tf.Tensor(0, shape=(), dtype=int32)
tf.Tensor(2, shape=(), dtype=int32)
tf.Tensor(4, shape=(), dtype=int32)
tf.Tensor(6, shape=(), dtype=int32)
tf.Tensor(8, shape=(), dtype=int32)
tf.Tensor(10, shape=(), dtype=int32)
tf.Tensor(12, shape=(), dtype=int32)
tf.Tensor(14, shape=(), dtype=int32)
tf.Tensor(16, shape=(), dtype=int32)
tf.Tensor(18, shape=(), dtype=int32)
tf.Tensor(0, shape=(), dtype=int32)
tf.Tensor(2, shape=(), dtype=int32)
tf.Tensor(4, shape=(), dtype=int32)
tf.Tensor(6, shape=(), dtype=int32)
tf.Tensor(8, shape=(), dtype=int32)
tf.Tensor(10, shape=(), dtype=

**`filter()`**: filters the dataset based on a condition.

In [None]:
dataset = dataset.filter(lambda x: x< 3)
for item in dataset:
  print(item)

tf.Tensor(0, shape=(), dtype=int32)
tf.Tensor(2, shape=(), dtype=int32)
tf.Tensor(0, shape=(), dtype=int32)
tf.Tensor(2, shape=(), dtype=int32)
tf.Tensor(0, shape=(), dtype=int32)
tf.Tensor(2, shape=(), dtype=int32)


**`take()`**: lets us look at just a few items from a dataset.

In [None]:
for item in dataset.take(3):
  print(item)

tf.Tensor(0, shape=(), dtype=int32)
tf.Tensor(2, shape=(), dtype=int32)
tf.Tensor(0, shape=(), dtype=int32)


## Shuffling the Data
`shuffle()`

It starts by filling up a buffer with the first items of the source dataset. Then, whenever it is asked for an item, it pulls one out of the buffer at random and replaces it with a fresh one from the source dataset, until it has iterated through the entire source dataset. After that it keeps pulling items out of the buffer at random until the buffer itself is empty.

<mark>We must specify the buffer size and make it large enough, or else the shuffling will not be very effective.</mark>

In [None]:
dataset = tf.data.Dataset.range(10).repeat(3)
dataset = dataset.shuffle(buffer_size=5, seed=36).batch(7)
for item in dataset:
  print(item)

tf.Tensor([1 2 4 5 3 7 6], shape=(7,), dtype=int64)
tf.Tensor([8 9 0 1 2 4 5], shape=(7,), dtype=int64)
tf.Tensor([3 7 6 8 9 0 1], shape=(7,), dtype=int64)
tf.Tensor([2 4 5 3 7 6 8], shape=(7,), dtype=int64)
tf.Tensor([9 0], shape=(2,), dtype=int64)
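The buffer mechanism described above can be sketched in plain Python. This is only an illustration of the idea, not TensorFlow's actual implementation; `buffered_shuffle` is a hypothetical helper written for this note.

```python
import random

def buffered_shuffle(items, buffer_size, seed=None):
    """Shuffle `items` using a fixed-size buffer (illustrative sketch)."""
    rng = random.Random(seed)
    source = iter(items)
    buffer = []
    # 1. Fill the buffer with the first `buffer_size` items.
    for item in source:
        buffer.append(item)
        if len(buffer) == buffer_size:
            break
    shuffled = []
    # 2. Emit a random buffer slot, then refill that slot from the source.
    for item in source:
        slot = rng.randrange(len(buffer))
        shuffled.append(buffer[slot])
        buffer[slot] = item
    # 3. The source is exhausted: drain the buffer in random order.
    while buffer:
        shuffled.append(buffer.pop(rng.randrange(len(buffer))))
    return shuffled

result = buffered_shuffle(range(10), buffer_size=3, seed=0)
print(result)
```

With `buffer_size=1` the order is unchanged, and with a small buffer an item can only move a limited distance toward the front, which is why a buffer that is too small gives a poor shuffle.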


### Interleaving lines from multiple files
Let's suppose we have loaded the California Housing dataset and shuffled it.

Then we split each set into many CSV files and also have `train_filepaths` and `test_filepaths` listing the paths to the respective split files.

In [None]:
X_train, X_test = pd.read_csv("/content/sample_data/california_housing_train.csv"),pd.read_csv("/content/sample_data/california_housing_test.csv")

In [None]:
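import os

X_train.shape
# to_csv() does not create missing directories, so make sure the target exists
os.makedirs("/content/housing/train", exist_ok=True)
train_filepaths = []
# write the 17,000 training rows as 1,700 CSV files of 10 rows each
for index in np.arange(17000).reshape((1700, 10)):
  path = "/content/housing/train/train_{}.csv".format(index[0])
  X_train.iloc[index].to_csv(path, na_rep='NULL', index=False)
  train_filepaths.append(path)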

In [None]:
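import os

X_test.shape
# to_csv() does not create missing directories, so make sure the target exists
os.makedirs("/content/housing/test", exist_ok=True)
test_filepaths = []
# write the 3,000 test rows as 300 CSV files of 10 rows each
for index in np.arange(3000).reshape((300, 10)):
  path = "/content/housing/test/test_{}.csv".format(index[0])
  X_test.iloc[index].to_csv(path, na_rep='NULL', index=False)
  test_filepaths.append(path)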

Now, we'll create a dataset containing only these file paths:

By default, `list_files()` returns a dataset that shuffles the file paths.

In [None]:
filepath_dataset = tf.data.Dataset.list_files(train_filepaths, seed=42)

Next, we can call the `interleave()` method.

In [None]:
n_readers = 5
dataset = filepath_dataset.interleave(
    lambda filepath: tf.data.TextLineDataset(filepath).skip(1),
    cycle_length=n_readers
)

The code above creates a dataset that will pull five file paths from the `filepath_dataset`, and for each one it will call the function we gave (a lambda in our case).

On iteration, this dataset will cycle through these five `TextLineDataset`s, reading one line at a time from each until all of them are out of items, then get the next five file paths and repeat the cycle until all paths are done.

> 🟢 <mark>For interleaving to work best, it is preferable to have files of identical length</mark>; otherwise the ends of the longest files will not be interleaved.



In [None]:
for line in dataset.take(5):
  print(line.numpy())

b'-117.09,32.79,20.0,2183.0,534.0,999.0,496.0,2.8631,169700.0'
b'-121.46,38.55,52.0,2094.0,463.0,1364.0,407.0,1.2235,68500.0'
b'-117.35,34.09,14.0,5983.0,1224.0,3255.0,1150.0,2.5902,111500.0'
b'-118.06,34.07,30.0,2308.0,674.0,3034.0,691.0,2.3929,184400.0'
b'-119.34,36.31,14.0,1635.0,422.0,870.0,399.0,2.7,88900.0'
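To see the cycling behaviour of `interleave()` without touching the disk, the following sketch replaces the CSV files with tiny in-memory datasets. `make_fake_file` is a made-up stand-in for `TextLineDataset`: "file" `k` simply yields the value `10 * k` three times.

```python
import tensorflow as tf

def make_fake_file(file_id):
    # Stand-in for a file reader: yields the same value three times.
    return tf.data.Dataset.from_tensors(file_id * 10).repeat(3)

file_ids = tf.data.Dataset.range(3)

# cycle_length=2: two "files" are open at a time and read alternately.
interleaved = file_ids.interleave(make_fake_file, cycle_length=2)

values = [int(v) for v in interleaved]
print(values)
```

The first two "files" alternate item by item, and the third one is only opened once one of the first two has been exhausted.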


OK, this looks good so far. But these are byte strings, so we still need to parse them and scale the data.


## Preprocessing the Data


In [None]:
# mean and scale of each feature in the training set
X_mean, X_std = tf.constant([X_train[col].mean() for col in X_train.columns[:-1]]), tf.constant([X_train[col].std() for col in X_train.columns[:-1]]) 
n_inputs = 8

def preprocess(line):
  defaults = [0.] * n_inputs + [tf.constant([], dtype=tf.float32)]
  fields = tf.io.decode_csv(line, record_defaults=defaults)
  x = tf.stack(fields[:-1])
  y = tf.stack(fields[-1:])
  return (x - X_mean)/ X_std, y


- The `preprocess()` function accepts one CSV line and starts by parsing it with `tf.io.decode_csv()`, which takes the line to parse and the default value for each column in the CSV file.

  The `defaults` array tells TensorFlow not only the default value of each column but also its type. The last value is an empty array of type `tf.float32`, used as the default for the target column: it means there is no default value, so an exception will be raised if a missing value is encountered.

- Next, we use `tf.stack()` to stack the scalar tensors returned by `decode_csv()` into 1D tensors.
- Finally, we scale the input features by subtracting the feature means and then dividing by the feature standard deviations.
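The role of `record_defaults` described above can be checked on a tiny two-column record. This is a sketch with made-up CSV strings: the first column has a default of `0.0`, while the second uses an empty float tensor and is therefore required.

```python
import tensorflow as tf

# Column 1 is optional (default 0.0); column 2 is a required float (no default).
defaults = [0.0, tf.constant([], dtype=tf.float32)]

complete = tf.io.decode_csv("1.5,2.5", record_defaults=defaults)
missing_optional = tf.io.decode_csv(",2.5", record_defaults=defaults)

complete_values = [float(f) for f in complete]
missing_optional_values = [float(f) for f in missing_optional]
print(complete_values, missing_optional_values)

# A missing value in the required column raises an error.
try:
    tf.io.decode_csv("1.5,", record_defaults=defaults)
    required_missing_raised = False
except tf.errors.InvalidArgumentError:
    required_missing_raised = True
print(required_missing_raised)
```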

In [None]:
preprocess(b'-117.09,32.79,20.0,2183.0,534.0,999.0,496.0,2.8631,169700.0')

(<tf.Tensor: shape=(8,), dtype=float32, numpy=
 array([ 1.2328726 , -1.3265201 , -0.6824022 , -0.21131904, -0.01283709,
        -0.3751125 , -0.01358042, -0.53479785], dtype=float32)>,
 <tf.Tensor: shape=(1,), dtype=float32, numpy=array([169700.], dtype=float32)>)

## Putting everything Together
Let's put everything inside a helper function.

In [None]:
def csv_reader_dataset(filepaths, repeat=1, n_readers=5, 
                       n_read_threads=None, shuffle_buffer_size=10000,
                       n_parse_threads=5, batch_size=32):
  # list (and shuffle) the file paths, then read n_readers files at a time
  dataset = tf.data.Dataset.list_files(filepaths)
  dataset = dataset.interleave(
      lambda filepath: tf.data.TextLineDataset(filepath).skip(1),
      cycle_length=n_readers,
      num_parallel_calls=n_read_threads
  )
  # parse and scale each line, then shuffle, repeat, batch and prefetch
  dataset = dataset.map(preprocess, num_parallel_calls=n_parse_threads)
  dataset = dataset.shuffle(shuffle_buffer_size).repeat(repeat)
  return dataset.batch(batch_size).prefetch(1)

## Prefetching
Prefetching makes sure that while our training algorithm is working on one batch, the dataset is already working in parallel on getting the next batch ready.
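Note that `prefetch()` only changes when batches are prepared, not what they contain. The small sketch below (toy data, made-up variable names) compares a batched dataset with and without `prefetch(1)`.

```python
import tensorflow as tf

batches = tf.data.Dataset.range(6).batch(2)
# Keep one batch ready in the background while the consumer works on the current one.
prefetched = batches.prefetch(1)

plain_batches = [b.numpy().tolist() for b in batches]
prefetched_batches = [b.numpy().tolist() for b in prefetched]
print(plain_batches == prefetched_batches)
```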


## Using the Dataset with tf.keras
Now, we can use the `csv_reader_dataset()` function to create datasets for training and testing.

In [None]:
train_set = csv_reader_dataset(train_filepaths)
test_set = csv_reader_dataset(test_filepaths)

We can now simply build the model and train it using these datasets.

In [None]:
model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=(8,)),
    keras.layers.Dense(100, activation='selu', kernel_initializer='he_normal'),
    keras.layers.Dense(30, activation='selu', kernel_initializer='he_normal'),
    keras.layers.Dense(10, activation='selu', kernel_initializer='he_normal'),
    keras.layers.Dense(1)
])
model.compile(optimizer=keras.optimizers.Adam(clipvalue=1.0),
              loss='mean_squared_error'
              )
model.fit(train_set, epochs=10)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<tensorflow.python.keras.callbacks.History at 0x7f357266e850>

In [None]:
model.evaluate(test_set)



3693629184.0

In [None]:
new_set = test_set.take(3).map(lambda X, y: X) # pretend these 3 batches contain new instances
model.predict(new_set)

array([[215925.94 ],
       [ 82064.44 ],
       [225029.   ],
       [ 71812.34 ],
       [175324.67 ],
       [ 86000.54 ],
       [ 50630.04 ],
       [170799.67 ],
       [146458.31 ],
       [171650.22 ],
       [172896.1  ],
       [ 84453.24 ],
       [157570.78 ],
       [367245.94 ],
       [140307.11 ],
       [162180.53 ],
       [379032.06 ],
       [234662.23 ],
       [117128.08 ],
       [306170.1  ],
       [ 97702.76 ],
       [220105.28 ],
       [207660.23 ],
       [382949.22 ],
       [177859.62 ],
       [117736.375],
       [248225.8  ],
       [235437.73 ],
       [200412.02 ],
       [293587.38 ],
       [135386.12 ],
       [126311.32 ],
       [166202.89 ],
       [297741.2  ],
       [ 80680.92 ],
       [216568.31 ],
       [117326.96 ],
       [ 75087.89 ],
       [147954.94 ],
       [171751.1  ],
       [ 99163.31 ],
       [226181.95 ],
       [164042.94 ],
       [118487.555],
       [112824.055],
       [199234.33 ],
       [180697.38 ],
       [11946

It is even possible to create a TF Function that performs the whole training loop:

In [None]:
@tf.function
def train(model, optimizer, loss_fn, n_epochs):
  train_set = csv_reader_dataset(train_filepaths, repeat=n_epochs)
  for X_batch, y_batch in train_set:
    # GradientTape must be instantiated to record the forward pass
    with tf.GradientTape() as tape:
      y_pred = model(X_batch)
      main_loss = tf.reduce_mean(loss_fn(y_batch, y_pred))
      loss = tf.add_n([main_loss] + model.losses)
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))

CSV files are easy to work with, but they are not very efficient and do not support large or complex data structures (such as images, audio, or video).

In that case, it is preferable to use TFRecords instead.

# The TFRecord Format
The TFRecord format is a binary format consisting of a sequence of binary records of varying sizes. Each record is composed of a length, a CRC checksum to check that the length was not corrupted, then the data, and finally a CRC checksum for the data.



In [None]:
# Creating a TFRecord
with tf.io.TFRecordWriter("my_data.tfrecord") as f:
  f.write(b"This is a simple line of Text")
  f.write(b"And this second line of Text")

**`tf.data.TFRecordDataset`**: To read a TFRecord.

In [None]:
filepaths = ["my_data.tfrecord"]
dataset = tf.data.TFRecordDataset(filepaths)
for item in dataset:
  print(item)

tf.Tensor(b'This is a simple line of Text', shape=(), dtype=string)
tf.Tensor(b'And this second line of Text', shape=(), dtype=string)
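The record layout described at the start of this section (length, checksum, data, checksum) can be inspected by reading a TFRecord file byte by byte. The sketch below assumes the documented on-disk layout: an 8-byte little-endian length, a 4-byte checksum of the length, the payload, and a 4-byte checksum of the payload. It skips the checksums instead of verifying them; `read_raw_records` and the file name are made up for this example.

```python
import os
import struct
import tempfile

import tensorflow as tf

path = os.path.join(tempfile.mkdtemp(), "layout_demo.tfrecord")
payloads = [b"first record", b"second, longer record"]
with tf.io.TFRecordWriter(path) as writer:
    for payload in payloads:
        writer.write(payload)

def read_raw_records(file_path):
    """Read records by hand, skipping (not verifying) the two checksums."""
    records = []
    with open(file_path, "rb") as f:
        while True:
            header = f.read(8)                        # uint64 payload length
            if not header:
                break
            (length,) = struct.unpack("<Q", header)
            f.read(4)                                 # checksum of the length
            records.append(f.read(length))            # the payload itself
            f.read(4)                                 # checksum of the payload
    return records

print(read_raw_records(path))
```

Each record therefore costs 16 bytes of overhead on top of its payload.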


>🟢 You can make `TFRecordDataset` read multiple files in parallel and interleave their records by setting `num_parallel_reads`.
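A minimal sketch of reading several TFRecord files in parallel follows. The argument of `tf.data.TFRecordDataset` that controls this is `num_parallel_reads`; the file names and record contents below are invented for the example.

```python
import os
import tempfile

import tensorflow as tf

tmp_dir = tempfile.mkdtemp()
paths = []
for file_index in range(2):
    path = os.path.join(tmp_dir, "part_{}.tfrecord".format(file_index))
    with tf.io.TFRecordWriter(path) as writer:
        for record_index in range(3):
            text = "file{}-record{}".format(file_index, record_index)
            writer.write(text.encode("utf-8"))
    paths.append(path)

# Read both files concurrently; records from the two files are interleaved.
dataset = tf.data.TFRecordDataset(paths, num_parallel_reads=2)
records = [item.numpy() for item in dataset]
print(records)
```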

## Compressed TFRecord Files
We can compress a TFRecord file by setting the `options` argument like this:

In [None]:
options = tf.io.TFRecordOptions(compression_type="GZIP")
with tf.io.TFRecordWriter("my_compressed.tfrecord", options) as f:
  f.write(b"This is a compressed record")

Specify the compression type when reading a compressed TFRecord.

In [None]:
dataset = tf.data.TFRecordDataset(["my_compressed.tfrecord"], compression_type="GZIP")

## A Brief Introduction to Protocol Buffers
We can use any binary format we like for the records, but TFRecord files usually contain serialized protocol buffers (also called *protobufs*). Protocol Buffers is an efficient binary format developed at Google back in 2001 and open sourced in 2008.

Protobufs are defined using a syntax like this:

In [None]:
# Not to be executed | For illustration purpose
syntax= "proto3";
message Person {
    string name = 1;
    int32 id = 2;
    repeated string email = 3;
}

The definition says that we are going to use version 3 of the protobuf format. Then we specify the message `Person`, which may contain a `name` of type `string`, an `id` of type `int32`, and zero or more `email` fields of type `string`. The numbers assigned to the fields are called field identifiers; they are used in the binary representation of each record.

After this we save the definition to a *.proto* file and compile it using `protoc` (the protobuf compiler) to generate access classes in whichever language we want.

For now, the definitions we will use have already been compiled, and their access classes are part of TensorFlow, so we only have to focus on using the protobuf access classes.

In [None]:
# For illustration purpose only | Don't execute

from person_pb2 import Person # import the generated access class
person = Person(name="AL", id=123, email=["a@b.com"])
print(person)

person.name # display the field
person.name = "Anurag" # modify the field
person.email[0] # repeated fields are accessed like array
person.email.append("c@d.com") # add an email address

# serialize the object to byte string
s = person.SerializeToString() 

# create a new Person
person2 = Person() 
person2.ParseFromString(s) # parse the byte string (27 bytes long)
person == person2 # now they are equal

We could save the serialized Person Object to a TFRecord file, and then use it as a dataset as we have seen before. 

However, the functions used here are not TF operations and hence can not be put inside a TF Function. Fortunately, TensorFlow does include special protobuf definitions for which it provides parsing operations.

## TensorFlow Protobufs
<mark>The main protobuf typically used in a TFRecord file is the `Example` protobuf</mark>, which represents one instance in a dataset. It contains a list of named features, where each feature can either be a list of byte strings, a list of floats, or a list of integers.

The protobuf definition goes like this:

In [None]:
# For illustration purpose | don't execute
syntax = "proto3";
message BytesList { repeated bytes value = 1; }
message FloatList { repeated float value = 1 [packed = true]; }
message Int64List { repeated int64 value = 1 [packed = true]; }

message Feature {
    oneof kind {
        BytesList bytes_list = 1;
        FloatList float_list = 2;
        Int64List int64_list = 3;
    }
};

message Features {map<string, Feature> feature = 1; };
message Example { Features features = 1; };

A bit of explanation might be required.

- `[packed = true]` is used for repeated numerical fields, for a more efficient encoding.
- A `Feature` object contains either a `BytesList`, a `FloatList`, or an `Int64List` message/object.
- A `Features` object contains a dictionary that maps a feature name to the corresponding feature value.
- An `Example` contains only a `Features` object.

Here is how you create an `Example` object using `tf.train.Example` storing the same `Person` as earlier.

In [3]:
from tensorflow.train import BytesList, FloatList, Int64List
from tensorflow.train import Feature, Features, Example

person_example = Example(
    features= Features(
        feature={
            "name": Feature(bytes_list=BytesList(value=[b"Alic"])),
            "id": Feature(int64_list=Int64List(value=[123])),
            "emails": Feature(bytes_list=BytesList(value=[b"a@b.com", 
                                                          b"c@d.com"]))
        }
    )
)

Now we can write the resulting data to TFRecord by serializing it using `SerializeToString()` method.

In [4]:
with tf.io.TFRecordWriter("my_contacts.tfrecord") as f:
  f.write(person_example.SerializeToString())

Now that we have a nice TFRecord file containing a serialized `Example`, let's try to load it.

## Loading and Parsing Example
To load the serialized `Example` protobuf in TFRecord, we will use `tf.data.TFRecordDataset`, and we will parse each `Example` using `tf.io.parse_single_example()`.

These are TF operations and hence can be used inside TF Functions.

**`tf.io.parse_single_example()`**: <mark>Requires two arguments, namely, a string scalar tensor containing the serialized data, and a description of each feature.</mark>

The description is a dictionary whose keys are the feature names and whose values are either `tf.io.FixedLenFeature` objects (which describe the feature's shape, type, and default value) or `tf.io.VarLenFeature` objects (which indicate only the type).

In [5]:
feature_description = {
    "name": tf.io.FixedLenFeature([], tf.string, default_value=""),
    "id": tf.io.FixedLenFeature([], tf.int64, default_value=0),
    "emails": tf.io.VarLenFeature(tf.string)
}

for serialized_example in tf.data.TFRecordDataset(["my_contacts.tfrecord"]):
  parsed_example = tf.io.parse_single_example(serialized_example, 
                                              feature_description)

<mark>The fixed length features are parsed as regular tensors, but the variable-length features are parsed as sparse tensors.</mark>

We can use `tf.sparse.to_dense()` to convert sparse tensor to dense tensor.

In [6]:
tf.sparse.to_dense(parsed_example["emails"], default_value=b"")

<tf.Tensor: shape=(2,), dtype=string, numpy=array([b'a@b.com', b'c@d.com'], dtype=object)>

In [7]:
# But in this case it is just simpler to access its values.
parsed_example["emails"].values

<tf.Tensor: shape=(2,), dtype=string, numpy=array([b'a@b.com', b'c@d.com'], dtype=object)>

**A `BytesList` can contain any binary data we may want, including any serialized object.**

- Use `tf.io.encode_jpeg()` to encode an image in the JPEG format and store the resulting binary data. Later on, when we want to read the image back, we can use `tf.io.decode_jpeg()` or `tf.io.decode_image()`.
- For tensors, use `tf.io.serialize_tensor()` to get a byte string, and after reading the TFRecord, parse it back using `tf.io.parse_tensor()`.
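As an illustration of the second bullet, the sketch below stores a small tensor in a `BytesList` and recovers it after parsing. The feature name `"matrix"` and the tensor values are arbitrary choices for this example.

```python
import tensorflow as tf

tensor = tf.constant([[1.0, 2.0], [3.0, 4.0]])

# Serialize the tensor to a byte string and wrap it in an Example.
serialized_tensor = tf.io.serialize_tensor(tensor)
example = tf.train.Example(features=tf.train.Features(feature={
    "matrix": tf.train.Feature(
        bytes_list=tf.train.BytesList(value=[serialized_tensor.numpy()])),
}))

# Parse the Example back, then turn the byte string into a tensor again.
parsed = tf.io.parse_single_example(
    example.SerializeToString(),
    {"matrix": tf.io.FixedLenFeature([], tf.string)})
restored = tf.io.parse_tensor(parsed["matrix"], out_type=tf.float32)
print(restored)
```

`parse_tensor()` needs the original dtype (`out_type`), so it has to be known or stored alongside the data.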



In [8]:
# Parsing in batch
dataset = tf.data.TFRecordDataset("my_contacts.tfrecord").batch(10)
for serialized_example in dataset:
  parsed_example = tf.io.parse_example(serialized_example,
                                       feature_description)

## Handling Lists of Lists Using the `SequenceExample` Protobuf
The definition of `SequenceExample` Protobuf goes like this:

In [None]:
# For illustration purpose only | Don't Execute
message FeatureList { repeated Feature feature = 1; };
message FeatureLists { map<string, FeatureList> feature_list = 1; };
message SequenceExample {
    Features context = 1;
    FeatureLists feature_lists = 2;
};

Some explanation might be required:
- A `SequenceExample` contains:
  
  -  a `Features` object for contextual data (parsed into a dictionary), and
  -  a `FeatureLists` object that contains one or more named `FeatureList` objects (also parsed into a dictionary).
      - Each `FeatureList` contains a **list of `Feature`** objects, each of which, as you may guess, can be either a `BytesList`, a `FloatList`, or an `Int64List` message/object.

An example helps show where this is useful and what it is for.

Suppose we have an article. We can divide its content and save it in a `SequenceExample` like this:
- The `context` can hold things like the *author*, the *date*, etc.
- One `FeatureList` is named "*`content`*" and another one "*`comment`*".
- Each `Feature` would then be a sentence. Remember, a `FeatureList` is **a list of `Feature`**.

Parsing can be done with the help of 
- `tf.io.parse_single_sequence_example()`: when parsing single example
- `tf.io.parse_sequence_example()`: for batch.

<mark>If the feature lists contain sequences of varying sizes, we might want to convert them to ragged tensors</mark>, using `tf.RaggedTensor.from_sparse()`.


In [None]:
parsed_context, parsed_feature_lists = tf.io.parse_single_sequence_example(
    serialized_sequence_example, 
    context_feature_descriptions,
    sequence_feature_descriptions
)
parsed_content = tf.RaggedTensor.from_sparse(parsed_feature_lists["content"])
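The snippet above relies on variables that are not defined in this notebook. A self-contained sketch could look as follows; the `author_id` context feature, the two sentences, and the helper `words_to_feature` are all made up for illustration.

```python
import tensorflow as tf

def words_to_feature(words):
    # One Feature per sentence: a BytesList holding the words of that sentence.
    return tf.train.Feature(bytes_list=tf.train.BytesList(
        value=[word.encode("utf-8") for word in words]))

content = [["When", "shall", "we", "meet"], ["In", "thunder"]]

sequence_example = tf.train.SequenceExample(
    context=tf.train.Features(feature={
        "author_id": tf.train.Feature(
            int64_list=tf.train.Int64List(value=[123])),
    }),
    feature_lists=tf.train.FeatureLists(feature_list={
        "content": tf.train.FeatureList(
            feature=[words_to_feature(sentence) for sentence in content]),
    }),
)
serialized_sequence_example = sequence_example.SerializeToString()

context_feature_descriptions = {
    "author_id": tf.io.FixedLenFeature([], tf.int64, default_value=0),
}
sequence_feature_descriptions = {
    "content": tf.io.VarLenFeature(tf.string),
}
parsed_context, parsed_feature_lists = tf.io.parse_single_sequence_example(
    serialized_sequence_example,
    context_feature_descriptions,
    sequence_feature_descriptions,
)
# Sentences have different lengths, so convert the sparse tensor to a ragged one.
parsed_content = tf.RaggedTensor.from_sparse(parsed_feature_lists["content"])
print(parsed_content)
```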

The next step is to prepare the data so that we can feed it to a neural network.

# Preprocessing the Input Features
We can preprocess data generally at three points during the whole process:
- Ahead of time when preparing the data, usually when the data can fit in memory, using tools like NumPy or Scikit-Learn.
- On the fly when loading it with the Data API (e.g. using the `map()` method). Suitable for huge amounts of data.
- Or including a preprocessing layer directly in our model.

Let's look at the last option.


For example, here we implement a standardization layer using a `Lambda` layer.

In [None]:
means = np.mean(X_train, axis=0, keepdims=True)
stds = np.std(X_train, axis=0, keepdims=True)  # standard deviation, not the mean
eps = keras.backend.epsilon() # smoothing term to avoid division by zero

model = keras.models.Sequential([
    keras.layers.Lambda(lambda inputs: (inputs - means) / (stds + eps)),
    #... other layers
])

This looks hacky. Instead we might want to write a custom layer.

In [13]:
class Standardization(keras.layers.Layer):
  def adapt(self, data_sample):
    self.means_ = np.mean(data_sample, axis=0, keepdims=True)
    self.stds_ = np.std(data_sample, axis=0, keepdims=True)
  def call(self, inputs):
    return (inputs - self.means_) / (self.stds_ + keras.backend.epsilon())

But before we can use this layer, we need to adapt it to the dataset, i.e. initialize its variables by calling its `adapt()` method and passing it a data sample.

In [12]:
std_layer = Standardization()
std_layer.adapt(data_sample)
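Since `data_sample` is not defined in this notebook, here is a self-contained sketch that repeats the `Standardization` class from above and adapts it to random data. The synthetic `data_sample` (200 rows, 3 features on very different scales) is an assumption made purely for the example.

```python
import numpy as np
import tensorflow as tf
from tensorflow import keras

class Standardization(keras.layers.Layer):
    def adapt(self, data_sample):
        self.means_ = np.mean(data_sample, axis=0, keepdims=True)
        self.stds_ = np.std(data_sample, axis=0, keepdims=True)
    def call(self, inputs):
        return (inputs - self.means_) / (self.stds_ + keras.backend.epsilon())

# Hypothetical data sample: 3 features with very different scales.
rng = np.random.RandomState(42)
data_sample = (rng.rand(200, 3) * np.array([1.0, 10.0, 100.0])).astype(np.float32)

std_layer = Standardization()
std_layer.adapt(data_sample)

standardized = std_layer(tf.constant(data_sample)).numpy()
print(standardized.mean(axis=0), standardized.std(axis=0))
```

After adapting, each feature of the sample should come out with a mean close to 0 and a standard deviation close to 1.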

Keras is expected to provide a standardization layer out of the box soon, as `keras.layers.Normalization`.

## Encoding Categorical Features Using One-Hot Vectors
One-hot encoding is frequently used for encoding categorical features. Here we take the categorical feature *`ocean_proximity`* from the famous California Housing dataset.

**For this, we first need to map each category to its index (0 to 4), which can be done using a lookup table:**

In [2]:
vocab = ["<1H OCEAN", "INLAND", "NEAR OCEAN", "NEAR BAY", "ISLAND"]
indices = tf.range(len(vocab), dtype=tf.int64)

table_init = tf.lookup.KeyValueTensorInitializer(vocab, indices)
num_oov_buckets = 2
table = tf.lookup.StaticVocabularyTable(table_init, num_oov_buckets)

What's written here can be explained like this:
- We defined the *vocabulary*: the list of all possible categories.
- Then we created a tensor containing the corresponding indices.
- After that we created an initializer for the lookup table, passing it the vocabulary and the indices.
- Finally, we created the lookup table, passing it the initializer created in the last step and the number of <mark>***out-of-vocabulary***</mark> (oov) buckets. If we look up a category that does not exist in the vocabulary, the lookup table will compute a hash of this category and use it to assign the unknown category to one of the oov buckets (which in the current example have indices 5 and 6).


**Why oov buckets?**

If the number of categories is large and the dataset itself is large as well, listing every category may be inconvenient. In that case we define the vocabulary based on a data sample and add some oov buckets for the unknown categories we expect to find during training.

Let's try experimenting with this.

In [3]:
categories = tf.constant(["NEAR BAY", "DESERT", "INLAND", "INLAND"])
cat_indices = table.lookup(categories)
cat_indices

<tf.Tensor: shape=(4,), dtype=int64, numpy=array([3, 5, 1, 1])>

In [5]:
cat_one_hot = tf.one_hot(cat_indices, depth=len(vocab)+num_oov_buckets)
cat_one_hot

<tf.Tensor: shape=(4, 7), dtype=float32, numpy=
array([[0., 0., 0., 1., 0., 0., 0.],
       [0., 0., 0., 0., 0., 1., 0.],
       [0., 1., 0., 0., 0., 0., 0.],
       [0., 1., 0., 0., 0., 0., 0.]], dtype=float32)>

All this works well for a fairly small vocabulary ($<$10 categories). But if the vocabulary is large ($>$50 categories), we use ***embeddings*** instead for a more efficient encoding. In between, try both options and see which one works best for your case.

## Encoding Categorical Features Using Embeddings
<mark>An embedding is a trainable dense vector that represents a category.</mark>

So let's say the vector `[0.341, 0.98]` represents a category. <mark>The number of dimensions is a hyperparameter you can tweak.</mark>

Training makes the embeddings better representations of the categories: as the model learns to make better predictions, gradient descent adjusts these vectors (which are initialized randomly). This is called ***representation learning***.

Let's look at how they work by implementing them manually and then by using Keras.

**First off, we need to create an *embedding matrix* containing each category's embedding, initialized randomly.**
- This matrix will be of dimension `(no_category + no_oov_buckets, embedding dimension)`

In [3]:
embedding_dim = 2
num_oov_buckets = 2
embed_init = tf.random.uniform([len(vocab) + num_oov_buckets, embedding_dim])
embedding_matrix = tf.Variable(embed_init)

<mark>As a rule of thumb, embeddings typically have 10 to 300 dimensions</mark>, depending on the task and the vocabulary size.



In [7]:
embedding_matrix

<tf.Variable 'Variable:0' shape=(7, 2) dtype=float32, numpy=
array([[0.54719055, 0.23783255],
       [0.13420415, 0.09093356],
       [0.21342385, 0.9513116 ],
       [0.36454678, 0.91104376],
       [0.89632404, 0.5941464 ],
       [0.48755026, 0.57294965],
       [0.37951005, 0.816712  ]], dtype=float32)>

Now let's create the embeddings

In [8]:
categories = tf.constant(["NEAR BAY", "DESERT", "INLAND", "INLAND"])
cat_indices = table.lookup(categories)
cat_indices

<tf.Tensor: shape=(4,), dtype=int64, numpy=array([3, 5, 1, 1])>

In [9]:
tf.nn.embedding_lookup(embedding_matrix, cat_indices)

<tf.Tensor: shape=(4, 2), dtype=float32, numpy=
array([[0.36454678, 0.91104376],
       [0.48755026, 0.57294965],
       [0.13420415, 0.09093356],
       [0.13420415, 0.09093356]], dtype=float32)>

`tf.nn.embedding_lookup()` simply looks up the rows of the embedding matrix at the given indices.

Keras provides a `keras.layers.Embedding` layer that handles the embedding matrix (trainable by default). Before any training, calling it does the same job as `tf.nn.embedding_lookup()`.

In [10]:
embedding = keras.layers.Embedding(input_dim=len(vocab)+ num_oov_buckets,
                                   output_dim=embedding_dim)
embedding(cat_indices)

<tf.Tensor: shape=(4, 2), dtype=float32, numpy=
array([[-0.00481472,  0.0458963 ],
       [-0.04767811,  0.03206981],
       [-0.01720978,  0.01737824],
       [-0.01720978,  0.01737824]], dtype=float32)>

Putting this layer inside a model, will make it learn embeddings for categorical features.

In [7]:
regular_inputs = keras.layers.Input(shape=(8,))
categories = keras.layers.Input(shape=(), dtype=tf.string)

cat_indices = keras.layers.Lambda(lambda cats: table.lookup(cats))(categories)
# input_dim must cover every index the table can return: 5 categories + 2 oov buckets
cat_embed = keras.layers.Embedding(input_dim=len(vocab) + num_oov_buckets,
                                   output_dim=2)(cat_indices)
encoded_inputs = keras.layers.concatenate([regular_inputs, cat_embed])
outputs = keras.layers.Dense(1)(encoded_inputs)

model = keras.models.Model(inputs=[regular_inputs, categories],
                          outputs=[outputs])

The model presented here takes two inputs: a regular input with 8 numerical features, and a categorical input.

For the embedding to work, we use a `Lambda` layer to look up each category's index, and then on the following line we look up the embeddings for these indices.

Next we concatenate the regular inputs and the embeddings and feed the result to a neural network, which for now is just a single neuron.

---
When the `keras.layers.TextVectorization` layer is available, it can replace the `Lambda` layer, eliminating the need for the lookup table code too. The layer will take care of creating the lookup table by adapting to the data (using the `adapt()` method).

>🔵 One-hot encoding followed by a `Dense` layer (with no activation function and no biases) is equivalent to an `Embedding` layer, since the `Dense` layer's weight matrix acts as the embedding matrix. 
>
> Hence, **it would be wasteful to use more embedding dimensions than the number of units in the layer that follows the `Embedding` layer.**
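
The equivalence in the note above can be checked numerically. The following is a NumPy-only sketch with a random, made-up weight matrix; it does not use the Keras layers themselves.

```python
import numpy as np

rng = np.random.default_rng(42)
num_categories, embedding_dim = 5, 2

# Weight matrix of a Dense layer with no bias and no activation;
# it doubles as the embedding matrix.
weights = rng.normal(size=(num_categories, embedding_dim))

cat_indices = np.array([3, 0, 3])

# Path 1: one-hot encode, then multiply by the weights (the Dense layer).
one_hot = np.eye(num_categories)[cat_indices]
dense_output = one_hot @ weights

# Path 2: pick the rows directly (the Embedding layer).
embedding_output = weights[cat_indices]

print(np.allclose(dense_output, embedding_output))
```

The embedding path skips the multiplication by all the zeros in the one-hot vectors, which is why it is the cheaper of the two.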



## Keras Preprocessing Layers
The TensorFlow community is working on a standard set of Keras preprocessing layers. This new API will include not only layers like `keras.layers.Normalization` and `keras.layers.TextVectorization`, but also layers like `keras.layers.Discretization`, which can <mark>chop continuous data into different bins and encode each bin as a one-hot vector.</mark>
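
To illustrate what "chop into bins, then one-hot encode" means, here is a NumPy-only sketch of the idea. It is not the Keras layer itself, and the bin boundaries and values are made up.

```python
import numpy as np

# Hypothetical boundaries: 3 boundaries define 4 bins.
bin_boundaries = np.array([0.0, 1.0, 2.0])
values = np.array([-1.5, 1.0, 3.4, 0.5])

# Index of the bin each value falls into.
bin_ids = np.digitize(values, bin_boundaries)

# One row per value, with a single 1 in the column of its bin.
one_hot = np.eye(len(bin_boundaries) + 1)[bin_ids]

print(bin_ids)
print(one_hot)
```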

>🟠 The `Discretization` layer will not be differentiable (and it does not need to be, since **preprocessing layers are frozen during training**). The layer should only be used at the start of the model. This also means that,
> 
> **an `Embedding` layer should not be used directly in a custom preprocessing layer** if it needs to be trainable: the `Embedding` layer requires training and, as noted above, preprocessing layers are frozen during training. Add it to the model separately instead.

We will also be able to chain preprocessing operations with the help of `PreprocessingStage`. If the pipeline contains a nondifferentiable preprocessing layer, it can only be used at the start of the model. The pipeline adapts to a data sample and can then be used like a regular layer.
For example:

In [9]:
normalization = keras.layers.Normalization()
discretization = keras.layers.Discretization([...])
pipeline = keras.layers.PreprocessingStage([normalization, discretization])
pipeline.adapt(data_sample)

Object `keras.layers.Normalization` not found.
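
In an environment where `tf.keras.layers.Normalization` and `tf.keras.layers.Discretization` are available (an assumption about the installed TensorFlow release), the two steps can be tried by calling the layers one after the other. This sketch chains them by hand and does not use `PreprocessingStage`. The data sample and bin boundaries are made up.

```python
import numpy as np
import tensorflow as tf

# Hypothetical data sample: 4 instances, 1 feature.
data_sample = np.array([[1.0], [2.0], [3.0], [4.0]], dtype=np.float32)

normalization = tf.keras.layers.Normalization()
normalization.adapt(data_sample)  # learns the feature's mean and variance

# Hypothetical boundaries on the normalized scale: 4 bins.
discretization = tf.keras.layers.Discretization(bin_boundaries=[-1.0, 0.0, 1.0])

normalized = normalization(data_sample)
bins = discretization(normalized)

print(np.asarray(normalized).ravel())
print(np.asarray(bins).ravel())
```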


# TF Transform
If preprocessing is computationally expensive, then handling it ahead of time rather than on the fly might give us better performance.

Also, if the dataset is small enough to fit in RAM, we can use its `cache()` method. But if it is too large, tools like Apache Beam or Spark become necessary.
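
A minimal sketch of `cache()` on a toy dataset follows. The `preprocess` function is a made-up stand-in for an expensive preprocessing step.

```python
import tensorflow as tf

# Stand-in for an expensive preprocessing step.
def preprocess(x):
    return x * x

# cache() comes after map(), so the preprocessed items are what gets cached.
dataset = tf.data.Dataset.range(5).map(preprocess).cache()

first_epoch = list(dataset.as_numpy_iterator())   # runs preprocess, fills the cache
second_epoch = list(dataset.as_numpy_iterator())  # served from the cache

print(first_epoch, second_epoch)
```

Placing `cache()` after the expensive step matters: anything placed after `cache()` is still recomputed at every epoch.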

But this also creates the problem of writing preprocessing code separately for each platform the model is deployed to. This may lead to subtle differences between the preprocessing performed during training and on each deployment platform. It also makes the code harder to maintain and more error-prone.

Another option is to take the model trained on data preprocessed by Apache Beam or Spark and, before deploying it, add preprocessing layers that handle the preprocessing on the fly.

But by far the best approach is probably to use TF Transform, which is part of TensorFlow Extended (TFX), an end-to-end platform for productionizing TensorFlow models. 

You can then define your preprocessing function just once, using TF Transform functions for scaling, bucketizing, and much more, along with any TensorFlow operations you need.

In [None]:
import tensorflow_transform as tft

def preprocess(inputs):
  """
  Pretending we just had two features
  """
  median_age = inputs["housing_median_age"]
  ocean_proximity = inputs["ocean_proximity"]
  standardized_age = tft.scale_to_z_score(median_age)
  ocean_proximity_id = tft.compute_and_apply_vocabulary(ocean_proximity)
  return {
      "standardized_median_age": standardized_age,
      "ocean_proximity_id": ocean_proximity_id
  }

TF Transform will also generate an equivalent TensorFlow Function that we can plug into the model we deploy.
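
The underlying idea is to compute statistics over the whole training set once, then apply them as constants. The sketch below shows that idea in plain Python and NumPy. It is a conceptual illustration only and not the TF Transform API: `analyze`, `make_transform`, the alphabetical vocabulary order, and the `-1` id for unknown categories are all choices made for this sketch.

```python
import numpy as np

def analyze(ages, proximities):
    """One full pass over the training data to compute the statistics."""
    vocab = sorted(set(proximities))
    return {
        "mean": float(np.mean(ages)),
        "std": float(np.std(ages)),
        "vocab": {cat: i for i, cat in enumerate(vocab)},
    }

def make_transform(stats):
    """Return a function that only relies on the precomputed constants."""
    def transform(age, proximity):
        standardized = (age - stats["mean"]) / stats["std"]
        # Categories unseen during analysis get the id -1 in this sketch.
        return standardized, stats["vocab"].get(proximity, -1)
    return transform

# Hypothetical training data.
ages = np.array([10.0, 20.0, 30.0])
proximities = ["INLAND", "NEAR BAY", "INLAND"]

transform = make_transform(analyze(ages, proximities))
z, cat_id = transform(20.0, "INLAND")
print(z, cat_id)
```

Because `transform` depends only on the stored constants, the same function can be applied at training time and at serving time.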

# The TensorFlow Datasets (TFDS) Project
The TensorFlow Datasets project makes it very easy to download common datasets.

TFDS is not bundled with TensorFlow, so install the `tensorflow-datasets` library first (for example, with `pip install tensorflow-datasets`). 

In [14]:
import tensorflow_datasets as tfds

dataset = tfds.load(name="mnist")
mnist_train, mnist_test = dataset["train"], dataset["test"]

Downloading and preparing dataset mnist/3.0.1 (download: 11.06 MiB, generated: 21.00 MiB, total: 32.06 MiB) to /root/tensorflow_datasets/mnist/3.0.1...


local data directory. If you'd instead prefer to read directly from our public
GCS bucket (recommended if you're running on GCP), you can instead pass
`try_gcs=True` to `tfds.load` or set `data_dir=gs://tfds-data/datasets`.




Dataset mnist downloaded and prepared to /root/tensorflow_datasets/mnist/3.0.1. Subsequent calls will reuse this data.


We can then apply any transformation to these datasets.

In [None]:
mnist_train = mnist_train.shuffle(10000).batch(32).prefetch(1)
for item in mnist_train:
  images = item["image"]
  labels = item["label"]
  [...]

Keras expects each item in the dataset to be a tuple containing two elements (the features and the labels). The `load()` function can do this for us when we set `as_supervised=True`.

In [None]:
dataset = tfds.load(name="mnist", batch_size=32, as_supervised=True)
mnist_train = dataset["train"].prefetch(1)
model = keras.models.Sequential([...])
model.compile(loss="sparse_categorical_crossentropy",
              optimizer="sgd")
model.fit(mnist_train, epochs=5)
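
For comparison, the dictionary items can also be converted to tuples by hand with `map()`. The sketch below uses a small in-memory dataset of zeros as a stand-in for MNIST, so nothing is downloaded. The shapes and the `"image"` and `"label"` keys mirror the items shown earlier.

```python
import tensorflow as tf

# Stand-in for the MNIST dictionary dataset: 4 blank 28x28 images with labels.
dict_dataset = tf.data.Dataset.from_tensor_slices({
    "image": tf.zeros([4, 28, 28, 1], dtype=tf.uint8),
    "label": tf.constant([0, 1, 2, 3], dtype=tf.int64),
})

# Turn each {"image": ..., "label": ...} item into an (image, label) tuple.
tuple_dataset = dict_dataset.map(lambda item: (item["image"], item["label"]))
tuple_dataset = tuple_dataset.batch(2).prefetch(1)

for images, labels in tuple_dataset.take(1):
    print(images.shape, labels.numpy())
```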