# Data preparation with TensorFlow

Up until now we have used Pandas to load our dataset, but using TensorFlow for this task is more convenient because it integrates better with the other
components. The whole tf.data API revolves around the concept of a tf.data.Dataset, which represents a sequence of data items. Usually you will use datasets
that gradually read data from disk, but for simplicity let’s create a dataset from a simple data tensor using _tf.data.Dataset.from\_tensor\_slices()_:

In [1]:
import tensorflow as tf

X = tf.range(10)
dataset = tf.data.Dataset.from_tensor_slices(X)

Once we have the dataset we can apply any transformation we want:

In [5]:
dataset = tf.data.Dataset.from_tensor_slices(tf.range(10))
print("Before transformation")
for item in dataset:
    print(item) # The content before transformation

dataset = dataset.repeat(3).batch(7)

print("After transformation")
for item in dataset:
    print(item) # And after transformation

Before transformation
tf.Tensor(0, shape=(), dtype=int32)
tf.Tensor(1, shape=(), dtype=int32)
tf.Tensor(2, shape=(), dtype=int32)
tf.Tensor(3, shape=(), dtype=int32)
tf.Tensor(4, shape=(), dtype=int32)
tf.Tensor(5, shape=(), dtype=int32)
tf.Tensor(6, shape=(), dtype=int32)
tf.Tensor(7, shape=(), dtype=int32)
tf.Tensor(8, shape=(), dtype=int32)
tf.Tensor(9, shape=(), dtype=int32)
After transformation
tf.Tensor([0 1 2 3 4 5 6], shape=(7,), dtype=int32)
tf.Tensor([7 8 9 0 1 2 3], shape=(7,), dtype=int32)
tf.Tensor([4 5 6 7 8 9 0], shape=(7,), dtype=int32)
tf.Tensor([1 2 3 4 5 6 7], shape=(7,), dtype=int32)
tf.Tensor([8 9], shape=(2,), dtype=int32)


In [None]:
# We can also call the map() method to transform each item of the dataset
dataset = dataset.map(lambda x: x**2)
# If the transformation is computationally heavy, the num_parallel_calls argument sets
# the number of threads dedicated to running the transformation function
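
As a minimal sketch of the parallel variant mentioned in the comment above, the snippet below passes `tf.data.AUTOTUNE` as `num_parallel_calls`, which lets TensorFlow choose the level of parallelism itself. The small range dataset and the variable names are illustrative assumptions, not part of the pipeline built in this notebook.

```python
import tensorflow as tf

# Illustrative dataset: the integers 0..5
numbers = tf.data.Dataset.range(6)

# tf.data.AUTOTUNE lets TensorFlow decide how many elements to process in parallel
squared = numbers.map(lambda x: x ** 2, num_parallel_calls=tf.data.AUTOTUNE)

# By default the element order is preserved, even with parallel calls
squared_values = [int(v) for v in squared]
print(squared_values)
```

A fixed integer (for example `num_parallel_calls=4`) works as well; `AUTOTUNE` is simply a convenient default when the right value is not known in advance.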

We can also shuffle the data using the _shuffle()_ method.

In [None]:
dataset = dataset.shuffle(buffer_size=4, seed=42)
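
To make the role of `buffer_size` more concrete, here is a rough pure-Python sketch of the shuffle-buffer idea: fill a buffer with the first items, then repeatedly emit a random item from the buffer and replace it with the next source item. The function `buffered_shuffle` is a hypothetical helper written for illustration only; it is not TensorFlow's implementation and will not reproduce the same order as `shuffle()`.

```python
import random

def buffered_shuffle(items, buffer_size, seed=None):
    """Yield items in pseudo-random order using a fixed-size buffer (illustrative only)."""
    rng = random.Random(seed)
    buffer = []
    for item in items:
        if len(buffer) < buffer_size:
            # Fill the buffer before emitting anything
            buffer.append(item)
            continue
        # Emit a random buffered item and replace it with the incoming one
        idx = rng.randrange(buffer_size)
        yield buffer[idx]
        buffer[idx] = item
    # The source is exhausted: emit what is left in random order
    rng.shuffle(buffer)
    yield from buffer

shuffled = list(buffered_shuffle(range(10), buffer_size=4, seed=42))
unshuffled = list(buffered_shuffle(range(10), buffer_size=1, seed=42))
print(shuffled)
print(unshuffled)
```

With a buffer of 4, the first emitted item can only be one of the first four source items, and a buffer of 1 does not shuffle at all. This is why the buffer should be large relative to the dataset for the shuffling to be effective.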

Note that we can also read and shuffle data spread across multiple files, which is practical when, for example, the training, validation and test sets are
each split into several files.

In [None]:
n_readers = 5
# Here dataset must be a dataset of file paths, e.g. one built with tf.data.Dataset.list_files()
dataset = dataset.interleave(lambda filepath: tf.data.TextLineDataset(filepath).skip(1), cycle_length=n_readers)
# skip(1) skips the header row of each file; ideally the files should be of equal size
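
The following self-contained sketch shows the interleaving step on real files. It writes three tiny CSV files into a temporary directory (the file names, the `value` header and the contents are made up for this example), lists them with `tf.data.Dataset.list_files()`, and interleaves their lines while skipping each header row.

```python
import os
import tempfile

import tensorflow as tf

# Create three small CSV files, each with a header row and two data rows
tmp_dir = tempfile.mkdtemp()
for i in range(3):
    with open(os.path.join(tmp_dir, f"part_{i}.csv"), "w") as f:
        f.write("value\n")
        f.write(f"{i}0\n{i}1\n")

# A dataset of file paths (shuffled by default)
filepath_dataset = tf.data.Dataset.list_files(os.path.join(tmp_dir, "part_*.csv"), seed=42)

# Read the files three at a time, skipping the header row of each
lines_dataset = filepath_dataset.interleave(
    lambda filepath: tf.data.TextLineDataset(filepath).skip(1),
    cycle_length=3)

# The order depends on the shuffling, so sort before inspecting
lines = sorted(line.numpy().decode() for line in lines_dataset)
print(lines)
```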

Now we are going to use the housing dataset to demonstrate the possible preprocessing operations.

In [None]:
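X_mean, X_std = [], [] # Placeholders: fill these with the mean and standard deviation of each feature in the training set (n_inputs values each) before use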
n_inputs = 8

def parse_csv_line(line):
    # Input features default to 0. when missing; the empty float32 tensor makes the target a required field
    defs = [0.] * n_inputs + [tf.constant([], dtype=tf.float32)]
    fields = tf.io.decode_csv(line, record_defaults=defs)
    # All fields except the last are the inputs; the last field is the target
    return tf.stack(fields[:-1]), tf.stack(fields[-1:])

def preprocess(line):
    # Scale the input features by subtracting the feature means and dividing by the
    # feature standard deviations, then return a tuple (scaled features, target)
    x, y = parse_csv_line(line)
    return (x - X_mean) / X_std, y

# Now we can put everything together
def csv_reader_dataset(filepaths, n_readers=5, n_read_threads=None, n_parse_threads=5, shuffle_buffer_size=10_000, seed=42, batch_size=32):
    dataset = tf.data.Dataset.list_files(filepaths, seed=seed)
    dataset = dataset.interleave(lambda filepath: tf.data.TextLineDataset(filepath).skip(1), cycle_length=n_readers, num_parallel_calls=n_read_threads)
    dataset = dataset.map(preprocess, num_parallel_calls=n_parse_threads)
    dataset = dataset.shuffle(shuffle_buffer_size, seed=seed)
    return dataset.batch(batch_size).prefetch(1) # Ending with prefetch(1) prepares the next batch while the current one is being used, which greatly improves speed
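
The cell above leaves `X_mean` and `X_std` empty, so the sketch below shows one way they could be filled in and checks the result on a single CSV line. The random `X_train` array is a stand-in for the real housing features (an assumption made only so the example runs on its own), and the target value `2.5` is arbitrary.

```python
import numpy as np
import tensorflow as tf

n_inputs = 8

# Stand-in training features (the real ones would come from the housing dataset)
rng = np.random.default_rng(42)
X_train = rng.normal(loc=5.0, scale=2.0, size=(100, n_inputs)).astype(np.float32)

# Per-feature statistics, computed on the training set only
X_mean = X_train.mean(axis=0)
X_std = X_train.std(axis=0)

def parse_csv_line(line):
    defs = [0.] * n_inputs + [tf.constant([], dtype=tf.float32)]
    fields = tf.io.decode_csv(line, record_defaults=defs)
    return tf.stack(fields[:-1]), tf.stack(fields[-1:])

def preprocess(line):
    x, y = parse_csv_line(line)
    return (x - X_mean) / X_std, y

# Build one CSV line from the first training row plus an arbitrary target
line = ",".join(str(v) for v in X_train[0]) + ",2.5"
x_scaled, y = preprocess(tf.constant(line))
print(x_scaled.shape, y.shape)
```

The scaled features have shape `(8,)` and the target has shape `(1,)`, which is the per-instance format that `csv_reader_dataset()` then batches.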

## Using the TFRecord to process audio and images

TFRecord is a simple binary file format used by TensorFlow to store large amounts of data as a sequence of binary records of varying sizes. We can easily create a TFRecord file with _tf.io.TFRecordWriter_:

In [None]:
with tf.io.TFRecordWriter("my_data.tfrecord") as f:
    f.write(b"This is the first record")
    f.write(b"And this is the second record")

We can use _tf.data.TFRecordDataset_ to read one or more TFRecord files:

In [None]:
filepaths = ["my_data.tfrecord"] # The file written above; the list may contain several files
dataset = tf.data.TFRecordDataset(filepaths)

We can also compress the TFRecord files:

In [None]:
options = tf.io.TFRecordOptions(compression_type="GZIP")
with tf.io.TFRecordWriter("my_data1", options) as f:
    f.write(b"This sentence will be compressed")
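
A compressed file has to be read back with the matching compression type. The sketch below writes one GZIP-compressed record to a temporary file (the path is made up for this example) and reads it back by passing `compression_type="GZIP"` to `tf.data.TFRecordDataset`.

```python
import os
import tempfile

import tensorflow as tf

# Temporary location for the example file
path = os.path.join(tempfile.mkdtemp(), "my_compressed.tfrecord")

# Write a single GZIP-compressed record
options = tf.io.TFRecordOptions(compression_type="GZIP")
with tf.io.TFRecordWriter(path, options) as f:
    f.write(b"This sentence will be compressed")

# The reader must be told which compression was used
dataset = tf.data.TFRecordDataset([path], compression_type="GZIP")
records = [record.numpy() for record in dataset]
print(records)
```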

Even though each record can use any binary format we want, TFRecord files usually contain serialized protocol buffers (_protobufs_). A protobuf
definition uses the following syntax:

In [None]:
syntax = "proto3";
message Person {
    string name = 1;
    int32 id = 2;
    repeated string email = 3;
}

This definition says that a person has a name, an id and zero or more emails; the numbers next to the fields are the field identifiers. The main protobuf used in
TFRecord files is the _Example_ protobuf, which contains the list of named features of each instance. Let's recreate the person protobuf from earlier using the
_Example_ protobuf:

In [1]:
import tensorflow as tf
from tensorflow.train import BytesList, FloatList, Int64List
from tensorflow.train import Feature, Features, Example

person_example = Example(
    features = Features(
        feature={
            "name": Feature(bytes_list=BytesList(value=[b"Alicia"])),
            "id": Feature(int64_list=Int64List(value=[123])),
            "email": Feature(bytes_list=BytesList(value=[b"a@b.com", b"c@d.com"]))
        }
    )
)

# Now we can use it
with tf.io.TFRecordWriter("my_contacts.tfrecord") as f: # Save five copies of the same person to a contacts file
    for _ in range(5):
        f.write(person_example.SerializeToString())

# And here is how to parse it. This dictionary defines the type and shape of each feature and is
# passed to parse_single_example(); its keys must match the feature names used when writing
feature_description = {
    "name": tf.io.FixedLenFeature([], tf.string, default_value=""),
    "id": tf.io.FixedLenFeature([], tf.int64, default_value=0),
    "email": tf.io.VarLenFeature(tf.string),
}

def parse(serialized_example): # Parsing a single example
    return tf.io.parse_single_example(serialized_example, feature_description)

dataset = tf.data.TFRecordDataset(["my_contacts.tfrecord"]).map(parse)
for parsed_example in dataset:
    print(parsed_example)
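
Fixed-length features are parsed as regular tensors, while variable-length features such as the emails come back as sparse tensors. The sketch below parses one serialized _Example_ in memory (no file is written here, which is a simplification for the example) and reads the email strings from the sparse tensor's `values` attribute.

```python
import tensorflow as tf

# Build and serialize one Example with the same features as above
example = tf.train.Example(features=tf.train.Features(feature={
    "name": tf.train.Feature(bytes_list=tf.train.BytesList(value=[b"Alicia"])),
    "id": tf.train.Feature(int64_list=tf.train.Int64List(value=[123])),
    "email": tf.train.Feature(bytes_list=tf.train.BytesList(value=[b"a@b.com", b"c@d.com"])),
}))
serialized = example.SerializeToString()

feature_description = {
    "name": tf.io.FixedLenFeature([], tf.string, default_value=""),
    "id": tf.io.FixedLenFeature([], tf.int64, default_value=0),
    "email": tf.io.VarLenFeature(tf.string),
}

parsed = tf.io.parse_single_example(serialized, feature_description)

name = parsed["name"].numpy()                     # scalar string tensor
person_id = int(parsed["id"])                     # scalar int64 tensor
emails = parsed["email"].values.numpy().tolist()  # values of the sparse tensor
print(name, person_id, emails)
```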


## Preprocessing with Keras

Keras provides several layer classes that perform preprocessing. For example, the _Normalization_ layer standardizes the input features. Here
is how to use it:

In [None]:
norm_layer = tf.keras.layers.Normalization()
model = tf.keras.models.Sequential([
    norm_layer, tf.keras.layers.Dense(1)
])
model.compile(loss="mse", optimizer=tf.keras.optimizers.SGD(learning_rate=2e-3))
# adapt() is essential: it computes the mean and variance of every feature, which the layer needs to apply scaling
norm_layer.adapt(X_train)
model.fit(X_train, y_train, validation_data=(X_valid, y_valid), epochs=5)
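
To see what `adapt()` makes the layer compute, the sketch below adapts a _Normalization_ layer to a tiny made-up array (`X_demo` is an assumption for this example) and compares its output with a manual standardization in NumPy.

```python
import numpy as np
import tensorflow as tf

# Tiny illustrative dataset: 4 instances, 2 features on different scales
X_demo = np.array([[1., 10.], [2., 20.], [3., 30.], [4., 40.]], dtype=np.float32)

norm_layer = tf.keras.layers.Normalization()
norm_layer.adapt(X_demo)  # learns the mean and variance of each feature

# Output of the layer versus a manual (x - mean) / std computation
X_scaled = np.asarray(norm_layer(X_demo))
X_manual = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)
print(X_scaled)
```

Both features end up with a mean of about 0 and a standard deviation of about 1, regardless of their original scales.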

However, this approach slows training down because the normalization is performed again at each epoch. Since the dataset does not change between epochs, a
better approach is to perform the preprocessing only once on the whole dataset, before training:

In [None]:
# We use normalization as the example here, but it works the same way for any other preprocessing operation
norm_layer = tf.keras.layers.Normalization()
norm_layer.adapt(X_train)
X_train_scaled = norm_layer(X_train)
X_valid_scaled = norm_layer(X_valid)

The problem is that if we deploy a model trained on the scaled data to production, the incoming instances will not be normalized. The solution is to combine
the adapted preprocessing layer and the model we have trained into a new model:

In [None]:
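# Here model is assumed to have been trained on X_train_scaled and to contain no Normalization layer of its own
new_model = tf.keras.models.Sequential([norm_layer, model])
# Now we can use this model for predictions on raw (unscaled) instances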
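
The whole workflow can be sketched end to end as follows. The data is random and the target is a made-up function of the inputs (both are assumptions for illustration); the point is only that the wrapped model accepts raw inputs and gives the same predictions as the inner model applied to manually scaled inputs.

```python
import numpy as np
import tensorflow as tf

# Made-up data: 64 instances, 3 features, target = sum of the features
rng = np.random.default_rng(0)
X_train = rng.normal(10.0, 3.0, size=(64, 3)).astype(np.float32)
y_train = X_train.sum(axis=1, keepdims=True)

# Preprocess once, before training
norm_layer = tf.keras.layers.Normalization()
norm_layer.adapt(X_train)
X_train_scaled = np.asarray(norm_layer(X_train))

# Train a model that expects already-scaled inputs
model = tf.keras.models.Sequential([tf.keras.layers.Dense(1)])
model.compile(loss="mse", optimizer=tf.keras.optimizers.SGD(learning_rate=2e-3))
model.fit(X_train_scaled, y_train, epochs=2, verbose=0)

# Wrap the adapted layer and the trained model for deployment
final_model = tf.keras.models.Sequential([norm_layer, model])

X_new = X_train[:3]  # pretend these are new, unscaled instances
preds_raw = final_model.predict(X_new, verbose=0)
preds_scaled = model.predict(np.asarray(norm_layer(X_new)), verbose=0)
print(preds_raw.ravel())
```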

Let's use another type of preprocessing layer called the _Discretization_ layer. This layer's goal is to transform numerical attributes into categorical
attributes by mapping value ranges to categories.

In [None]:
age = tf.constant([[10.], [93.], [57.], [18.], [37.], [5.]])
disc_layer = tf.keras.layers.Discretization(bin_boundaries=[18., 50.]) # bin_boundaries are the thresholds between bins: below 18, from 18 to below 50, and 50 or above
age_categories = disc_layer(age)
print(age_categories)

We should not pass these category identifiers to our models in this form, since their values cannot be meaningfully compared. Instead we need to encode them,
for example with one-hot encoding (as Scikit-Learn's _OneHotEncoder_ does). Fortunately Keras provides a layer for this:

In [None]:
onehot_layer = tf.keras.layers.CategoryEncoding(num_tokens=3)
onehot_layer(age_categories)
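
Putting the two layers together, the self-contained sketch below discretizes the same ages and then encodes the resulting category identifiers. The variable `age_onehot` is a name introduced for this example.

```python
import numpy as np
import tensorflow as tf

age = tf.constant([[10.], [93.], [57.], [18.], [37.], [5.]])

# Three bins: below 18 -> 0, from 18 to below 50 -> 1, 50 or above -> 2
disc_layer = tf.keras.layers.Discretization(bin_boundaries=[18., 50.])
age_categories = disc_layer(age)  # categories 0, 2, 2, 1, 1, 0

# One column per category; each row has a single 1 in its category's column
onehot_layer = tf.keras.layers.CategoryEncoding(num_tokens=3)
age_onehot = onehot_layer(age_categories)

print(np.asarray(age_categories).ravel())
print(np.asarray(age_onehot))
```

Note that an age of exactly 18 falls into the second bin, because each boundary belongs to the bin above it.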

To map string values to integer category identifiers, we have the _StringLookup_ layer:

In [None]:
cities = ["Auckland", "Paris", "Paris", "San Francisco"]
str_lookup_layer = tf.keras.layers.StringLookup()
str_lookup_layer.adapt(cities)
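
Once adapted, the layer can be called on new strings. In the sketch below, "Montreal" is a made-up city that was not seen during `adapt()`, so it is mapped to the out-of-vocabulary index 0. The most frequent city comes first in the vocabulary; the order of cities with equal counts is not something this example relies on.

```python
import numpy as np
import tensorflow as tf

cities = ["Auckland", "Paris", "Paris", "San Francisco"]
str_lookup_layer = tf.keras.layers.StringLookup()
str_lookup_layer.adapt(cities)

# Index 0 is reserved for unknown strings; known strings are sorted by frequency
vocabulary = str_lookup_layer.get_vocabulary()

# Look up two known cities and one unknown city
city_ids = np.asarray(str_lookup_layer(tf.constant([["Paris"], ["Auckland"], ["Montreal"]])))

print(vocabulary)
print(city_ids.ravel())
```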

We also have a text vectorization layer (for NLP) to work with:

In [2]:
data = ["I", "am", "a block", "of text"]
textV_layer = tf.keras.layers.TextVectorization()
textV_layer.adapt(data)
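
After `adapt()`, calling the layer turns each sentence into a sequence of word indices. The two test sentences below are made up for this sketch. With the default settings, text is lowercased and split on whitespace, index 0 is used for padding and index 1 for unknown words.

```python
import numpy as np
import tensorflow as tf

data = ["I", "am", "a block", "of text"]
textV_layer = tf.keras.layers.TextVectorization()
textV_layer.adapt(data)

# The vocabulary starts with the padding token "" and the unknown token "[UNK]"
vocabulary = textV_layer.get_vocabulary()

# "Hello" was never seen, and the shorter sentence is padded with zeros
token_ids = np.asarray(textV_layer(tf.constant(["I am text", "Hello text"])))

print(vocabulary)
print(token_ids)
```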
