In [None]:
# Python ≥3.5 is required
import sys
assert sys.version_info >= (3, 5)

# Scikit-Learn ≥0.20 is required
import sklearn
assert sklearn.__version__ >= "0.20"

try:
    # %tensorflow_version only exists in Colab.
    %tensorflow_version 2.x
    !pip install -q -U tfx==0.21.2
    print("You can safely ignore the package incompatibility errors.")
except Exception:
    pass

# TensorFlow ≥2.0 is required
import tensorflow as tf
from tensorflow import keras
assert tf.__version__ >= "2.0"

# Common imports
import numpy as np
import os

# to make this notebook's output stable across runs
np.random.seed(42)

# To plot pretty figures
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt
mpl.rc('axes', labelsize=14)
mpl.rc('xtick', labelsize=12)
mpl.rc('ytick', labelsize=12)

# Where to save the figures
PROJECT_ROOT_DIR = "."
CHAPTER_ID = "data"
IMAGES_PATH = os.path.join(PROJECT_ROOT_DIR, "images", CHAPTER_ID)
os.makedirs(IMAGES_PATH, exist_ok=True)

def save_fig(fig_id, tight_layout=True, fig_extension="png", resolution=300):
    path = os.path.join(IMAGES_PATH, fig_id + "." + fig_extension)
    print("Saving figure", fig_id)
    if tight_layout:
        plt.tight_layout()
    plt.savefig(path, format=fig_extension, dpi=resolution)

<h1 style="color:#3A913F;">The Data API </h1> 

In [21]:
X = tf.range(10)
dataset = tf.data.Dataset.from_tensor_slices(X)
dataset

<TensorSliceDataset shapes: (), types: tf.int32>

<h2>Chaining Transformations</h2>

In [None]:
# Dataset methods can be chained; they include list-like features such as map(), apply(), and filter().
dataset = dataset.repeat(3).batch(7)
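
To see what a chain of transformations actually produces, here is a minimal sketch on a tiny in-memory dataset. The doubling and the `x < 10` filter are arbitrary choices made for illustration; they are not part of the original notes.

```python
import tensorflow as tf

dataset = tf.data.Dataset.range(10)            # 0, 1, ..., 9
dataset = dataset.map(lambda x: x * 2)         # double every item
dataset = dataset.filter(lambda x: x < 10)     # keep 0, 2, 4, 6, 8
dataset = dataset.repeat(3).batch(7)           # 15 items, grouped in batches of 7

batches = [batch.numpy().tolist() for batch in dataset]
print(batches)
# [[0, 2, 4, 6, 8, 0, 2], [4, 6, 8, 0, 2, 4, 6], [8]]
```

Each method returns a new dataset, so the original one is never modified in place. The last batch is smaller because 15 is not a multiple of 7.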


<h2>Shuffling the Data</h2>

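You can shuffle the data itself with the dataset's shuffle() method. For a large dataset, first split the data into files of roughly equal size, then shuffle at the file level by interleaving lines from several files at once: tf.data.Dataset.list_files() returns a dataset of file paths, and .interleave(lambda filepath: tf.data.TextLineDataset(filepath).skip(1), cycle_length=n_readers), with n_readers = 5, reads from five files at a time, skipping the first line (the header row) of each file.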
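
The interleaving idea can be tried without any files. The sketch below is only an illustration: three small in-memory ranges stand in for the text files, and the start values 0, 100 and 200 are made up.

```python
import tensorflow as tf

# Three pretend "files", each holding three consecutive numbers.
starts = tf.data.Dataset.from_tensor_slices(
    tf.constant([0, 100, 200], dtype=tf.int64))

# Pull one item at a time from each of the three sources in turn.
interleaved = starts.interleave(
    lambda start: tf.data.Dataset.range(start, start + 3),
    cycle_length=3)
items = [int(item.numpy()) for item in interleaved]
print(items)
# [0, 100, 200, 1, 101, 201, 2, 102, 202]

# shuffle() draws items at random from a buffer that it keeps filled.
shuffled = [int(item.numpy())
            for item in tf.data.Dataset.range(10).shuffle(buffer_size=5, seed=42)]
print(sorted(shuffled) == list(range(10)))  # same items, different order
```

Interleaving alone is deterministic; it mixes the sources, and shuffle() then adds the randomness.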


<h2>Preprocessing the Data</h2>

In [26]:
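# Sketch: X_mean and X_std should hold the per-feature mean and standard
# deviation of the training set; the values below are placeholders.
n_inputs = 8
X_mean = np.zeros(n_inputs, dtype=np.float32)
X_std = np.ones(n_inputs, dtype=np.float32)

def preprocess(line):
    # One float default per input feature; the empty constant makes the target a required field
    defs = [0.] * n_inputs + [tf.constant([], dtype=tf.float32)]
    fields = tf.io.decode_csv(line, record_defaults=defs)
    x = tf.stack(fields[:-1])
    y = tf.stack(fields[-1:])
    return (x - X_mean) / X_std, y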
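
A CSV-parsing function of this kind can be checked on a single hand-written line. In the sketch below, the function name `parse_line`, the sample line and the zero/one statistics are all assumptions for illustration; real statistics would come from the training set.

```python
import numpy as np
import tensorflow as tf

n_inputs = 8
X_mean = np.zeros(n_inputs, dtype=np.float32)  # placeholder statistics
X_std = np.ones(n_inputs, dtype=np.float32)

def parse_line(line):
    # Eight float features with default 0., followed by a required float target
    defs = [0.] * n_inputs + [tf.constant([], dtype=tf.float32)]
    fields = tf.io.decode_csv(line, record_defaults=defs)
    x = tf.stack(fields[:-1])
    y = tf.stack(fields[-1:])
    return (x - X_mean) / X_std, y

x, y = parse_line(b"1,2,3,4,5,6,7,8,9")
print(x.numpy())  # the 8 scaled input features
print(y.numpy())  # the target, as a 1D tensor with one value
```

With these placeholder statistics the features come back unchanged, which makes it easy to confirm that the last field is split off as the target.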

<h2>Putting Everything Together</h2>

If the dataset is small enough to fit in memory, you can significantly speed up training
by using the dataset’s cache() method to cache its content to RAM. You should
generally do this after loading and preprocessing the data, but before shuffling,
repeating, batching, and prefetching.
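
The ordering described above can be sketched on a toy dataset. The squaring step here merely stands in for real loading and preprocessing; it is an assumption for illustration.

```python
import tensorflow as tf

dataset = tf.data.Dataset.range(6)
dataset = dataset.map(lambda x: x * x)   # stand-in for loading + preprocessing
dataset = dataset.cache()                # cache after preprocessing ...
dataset = dataset.shuffle(6, seed=42)    # ... but before shuffling,
dataset = dataset.repeat(2)              # repeating,
dataset = dataset.batch(4).prefetch(1)   # batching, and prefetching

batches = [batch.numpy().tolist() for batch in dataset]
print(len(batches))  # 12 items in batches of 4
```

Caching before shuffle() means the preprocessed items are stored once, while each epoch still gets its own random order.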

In [22]:
# pseudo-code (filepaths and preprocess() are defined elsewhere)
def csv_reader_dataset(filepaths, repeat=1, n_readers=5,
                       n_read_threads=None, shuffle_buffer_size=10000,
                       n_parse_threads=5, batch_size=32):
    dataset = tf.data.Dataset.list_files(filepaths)
    dataset = dataset.interleave(
        lambda filepath: tf.data.TextLineDataset(filepath).skip(1),
        cycle_length=n_readers, num_parallel_calls=n_read_threads)
    dataset = dataset.map(preprocess, num_parallel_calls=n_parse_threads)
    dataset = dataset.shuffle(shuffle_buffer_size).repeat(repeat)
    return dataset.batch(batch_size).prefetch(1)

<h2>Prefetching</h2>

In [None]:
With prefetching, the dataset prepares the next batch (loading and preprocessing it) while the training algorithm works on the current one.
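
The overlap can be pictured with a plain-Python producer/consumer sketch. This is only a conceptual model with a hypothetical `prefetch()` helper, not how tf.data is implemented: a background thread fills a small buffer while the consumer works on the previous item.

```python
import queue
import threading

def prefetch(iterable, buffer_size=1):
    """Yield items from `iterable`, preparing up to `buffer_size` items ahead."""
    buffer = queue.Queue(maxsize=buffer_size)
    sentinel = object()  # marks the end of the stream

    def producer():
        for item in iterable:
            buffer.put(item)   # blocks while the buffer is full
        buffer.put(sentinel)

    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = buffer.get()    # blocks until the next item is ready
        if item is sentinel:
            return
        yield item

# Pretend each list is a preprocessed batch.
batches = list(prefetch(([i, i + 1] for i in range(0, 6, 2)), buffer_size=1))
print(batches)
# [[0, 1], [2, 3], [4, 5]]
```

The buffer size plays the same role as the argument of prefetch(1) in the pipeline: it bounds how far ahead the producer may run.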

<h2>Using the Dataset with tf.keras</h2>

In [None]:
#pseudo-code
train_set = csv_reader_dataset(train_filepaths)
valid_set = csv_reader_dataset(valid_filepaths)
test_set = csv_reader_dataset(test_filepaths)
model = keras.models.Sequential([...])
model.compile([...])
model.fit(train_set, epochs=10, validation_data=valid_set)
model.evaluate(test_set)
new_set = test_set.take(3).map(lambda X, y: X) # take(3) keeps 3 batches; pretend they hold new instances
model.predict(new_set) # a dataset containing new instances
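
For a version that runs end to end without CSV files, here is a minimal sketch on synthetic data. The random inputs, the `make_dataset()` helper and the single Dense layer are assumptions for illustration, not the housing pipeline from the notes.

```python
import numpy as np
import tensorflow as tf
from tensorflow import keras

rng = np.random.default_rng(42)
X = rng.normal(size=(256, 8)).astype(np.float32)
y = X.sum(axis=1, keepdims=True)          # an easy linear target

def make_dataset(X, y, shuffle=False, batch_size=32):
    dataset = tf.data.Dataset.from_tensor_slices((X, y))
    if shuffle:
        dataset = dataset.shuffle(len(X))
    return dataset.batch(batch_size).prefetch(1)

train_set = make_dataset(X[:200], y[:200], shuffle=True)
valid_set = make_dataset(X[200:], y[200:])

model = keras.models.Sequential([
    keras.layers.Input(shape=(8,)),
    keras.layers.Dense(1),
])
model.compile(loss="mse", optimizer="sgd")
# The dataset is already batched, so fit() gets no batch_size argument.
history = model.fit(train_set, epochs=5, validation_data=valid_set, verbose=0)

# Keep one batch and drop the labels to imitate a set of new instances.
new_set = valid_set.take(1).map(lambda X, y: X)
predictions = model.predict(new_set, verbose=0)
print(predictions.shape)  # (32, 1)
```

Note that take() counts dataset elements, which are batches here, so take(1) yields 32 instances rather than one.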

<h1 style="color:#3A913F;">The Data API Summary:</h1> 
The Dataset object offers many methods to preprocess and shuffle your data, and it lets you build custom input pipelines. Its prefetch feature speeds up training by preparing the next batch while the current one is being processed.

<h1 style="color:#3A913F;">The TFRecord Format </h1> 

In [28]:
# One method to create a TFRecord is through the tf.io.TFRecordWriter class
with tf.io.TFRecordWriter("mydata.tfrecord") as f:
    f.write(b"This is the first record")
    f.write(b"And this is the second record")
#then you can read one or more TFRecord files 
filepaths = ["mydata.tfrecord"]
dataset = tf.data.TFRecordDataset(filepaths)
for item in dataset:
    print(item)

tf.Tensor(b'This is the first record', shape=(), dtype=string)
tf.Tensor(b'And this is the second record', shape=(), dtype=string)


<h2>Compressed TFRecord Files</h2>

In [3]:
#You can use TFRecordOptions to compress and specify your compression type when reading a compressed TFRecord file
options = tf.io.TFRecordOptions(compression_type="GZIP")
with tf.io.TFRecordWriter("my_compressed.tfrecord",options) as f:
    f.write(b"a compressed record")
dataset = tf.data.TFRecordDataset(["my_compressed.tfrecord"],compression_type = "GZIP")

<h2>A Brief Introduction to Protocol Buffers</h2>

I went over gRPC's quick start and learned about .proto files: messages, their fields (make them optional), and the tags (field numbers) associated with each field. To add behaviour to protocol buffer classes, the best practice is to wrap them (this is what decorators in Python are for) rather than to inherit from the generated classes, which is not good object-oriented design. The order of operations for gRPC is: write the .proto file, generate the gRPC code (i.e., compile it), update the server, update the client, then run. If I need to learn more about gRPC, read about its core concepts here: https://grpc.io/docs/what-is-grpc/core-concepts/ and go over the basic Python tutorial here: https://grpc.io/docs/languages/python/
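
The wrap-instead-of-inherit advice can be sketched as follows. This is only an illustration: `tf.train.Example` stands in for a class generated from my own .proto file, and the `Contact` class with all of its methods is hypothetical.

```python
import tensorflow as tf

class Contact:
    """Hypothetical application class that wraps a protobuf message
    instead of subclassing the generated class."""

    def __init__(self, proto):
        self._proto = proto  # the generated message stays untouched

    @classmethod
    def create(cls, name, emails):
        feature = {
            "name": tf.train.Feature(bytes_list=tf.train.BytesList(
                value=[name.encode("utf-8")])),
            "emails": tf.train.Feature(bytes_list=tf.train.BytesList(
                value=[email.encode("utf-8") for email in emails])),
        }
        return cls(tf.train.Example(features=tf.train.Features(feature=feature)))

    @classmethod
    def from_bytes(cls, data):
        proto = tf.train.Example()
        proto.ParseFromString(data)
        return cls(proto)

    @property
    def name(self):
        return self._proto.features.feature["name"].bytes_list.value[0].decode("utf-8")

    @property
    def emails(self):
        values = self._proto.features.feature["emails"].bytes_list.value
        return [value.decode("utf-8") for value in values]

    def to_bytes(self):
        return self._proto.SerializeToString()

alice = Contact.create("Alice", ["a@b.com", "c@d.com"])
restored = Contact.from_bytes(alice.to_bytes())
print(restored.name, restored.emails)
```

All application logic lives in the wrapper, so regenerating the protobuf code never overwrites it.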

<h2>TensorFlow Protobufs</h2>

In [None]:
The main protobuf used in TFRecord files is the <em>Example</em> protobuf, which holds the named features (fields) of your message. TensorFlow allows each feature to be a BytesList, a FloatList, or an Int64List.

In [7]:
#from tensorflow.train import BytesList, FloatList, Int64List
#from tensorflow.train import Feature, Features, Example
BytesList = tf.train.BytesList
FloatList = tf.train.FloatList
Int64List = tf.train.Int64List
Feature = tf.train.Feature
Features = tf.train.Features
Example = tf.train.Example

person_example = Example(
    features = Features(
        feature = {
            "name": Feature(bytes_list = BytesList(value=[b"Alice"])),
            "id" : Feature(int64_list = Int64List(value=[123])),
            "emails" : Feature(bytes_list = BytesList(value=[b"a@b.com",
                                                             b"c@d.com"]))
        }))
with tf.io.TFRecordWriter("my_contacts.tfrecord") as f:
    f.write(person_example.SerializeToString())

<h2>Loading and Parsing Examples</h2>

You pass your serialized data to tf.io.parse_single_example() along with a description of each feature. This description is a dictionary mapping each feature name to either a tf.io.FixedLenFeature descriptor indicating the feature's shape, type, and default value, or a tf.io.VarLenFeature descriptor indicating only the type (if the length of the feature's list may vary).

In [11]:
feature_description = {
    "name": tf.io.FixedLenFeature([], tf.string, default_value=""),
    "id": tf.io.FixedLenFeature([], tf.int64, default_value=0),
    "emails": tf.io.VarLenFeature(tf.string)

}
for serialized_example in tf.data.TFRecordDataset(["my_contacts.tfrecord"]):
    parsed_example = tf.io.parse_single_example(serialized_example,feature_description)
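
To inspect what parsing returns, here is a self-contained sketch that builds the same kind of contact in memory, so no TFRecord file is needed. Fixed-length features come back as regular tensors, while variable-length features come back as sparse tensors that can be densified.

```python
import tensorflow as tf

example = tf.train.Example(features=tf.train.Features(feature={
    "name": tf.train.Feature(bytes_list=tf.train.BytesList(value=[b"Alice"])),
    "id": tf.train.Feature(int64_list=tf.train.Int64List(value=[123])),
    "emails": tf.train.Feature(bytes_list=tf.train.BytesList(
        value=[b"a@b.com", b"c@d.com"])),
}))
serialized = example.SerializeToString()

feature_description = {
    "name": tf.io.FixedLenFeature([], tf.string, default_value=""),
    "id": tf.io.FixedLenFeature([], tf.int64, default_value=0),
    "emails": tf.io.VarLenFeature(tf.string),
}

parsed = tf.io.parse_single_example(serialized, feature_description)
emails = tf.sparse.to_dense(parsed["emails"], default_value=b"")
print(parsed["name"].numpy(), parsed["id"].numpy(), emails.numpy())

# tf.io.parse_example() parses a whole batch of serialized examples at once.
batch = tf.io.parse_example([serialized, serialized], feature_description)
print(batch["id"].numpy())
```

Parsing batch by batch is usually the better fit inside a tf.data pipeline, after a call to batch().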

In [None]:
A BytesList can contain any data I want. I just have to serialize the object by encoding it (e.g., tf.io.serialize_tensor() for a tensor, tf.io.encode_jpeg() for an image) and remember to decode it when parsing (tf.io.parse_tensor(), tf.io.decode_jpeg()).
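
A minimal round trip for a tensor stored in a BytesList could look like this. The tensor values and the feature name "my_tensor" are made up for illustration.

```python
import tensorflow as tf

tensor = tf.constant([[0., 1.], [2., 3.], [4., 5.]])

# Encode: tensor -> serialized bytes -> BytesList feature
serialized_tensor = tf.io.serialize_tensor(tensor)
example = tf.train.Example(features=tf.train.Features(feature={
    "my_tensor": tf.train.Feature(bytes_list=tf.train.BytesList(
        value=[serialized_tensor.numpy()])),
}))

# Decode: parse the Example, then turn the bytes back into a tensor
parsed = tf.io.parse_single_example(
    example.SerializeToString(),
    {"my_tensor": tf.io.FixedLenFeature([], tf.string)})
restored = tf.io.parse_tensor(parsed["my_tensor"], out_type=tf.float32)
print(restored.numpy())
```

The dtype passed as out_type must match the one used when serializing, since the bytes alone are not decoded automatically.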

<h2>Handling Lists of Lists Using the SequenceExample Protobuf</h2>

If my data consists of a list of lists, e.g., content = [["When", "shall", "we", "three", "meet", "again", "?"],["In", "thunder", ",", "lightning", ",", "or", "in", "rain", "?"]], refer to the example in this notebook to see how to convert it to the TFRecord format: https://github.com/ageron/handson-ml2/blob/master/13_loading_and_preprocessing_data.ipynb.
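
As a quick reminder of the shape of that solution, here is a condensed sketch for the `content` list above. The "author_id" context feature and the `words_to_feature()` helper are assumptions added for illustration.

```python
import tensorflow as tf

content = [["When", "shall", "we", "three", "meet", "again", "?"],
           ["In", "thunder", ",", "lightning", ",", "or", "in", "rain", "?"]]

def words_to_feature(words):
    # One Feature (a list of byte strings) per sentence
    return tf.train.Feature(bytes_list=tf.train.BytesList(
        value=[word.encode("utf-8") for word in words]))

sequence_example = tf.train.SequenceExample(
    context=tf.train.Features(feature={
        "author_id": tf.train.Feature(int64_list=tf.train.Int64List(value=[123])),
    }),
    feature_lists=tf.train.FeatureLists(feature_list={
        "content": tf.train.FeatureList(
            feature=[words_to_feature(sentence) for sentence in content]),
    }))
serialized = sequence_example.SerializeToString()

context_description = {
    "author_id": tf.io.FixedLenFeature([], tf.int64, default_value=0)}
sequence_description = {"content": tf.io.VarLenFeature(tf.string)}

parsed_context, parsed_lists = tf.io.parse_single_sequence_example(
    serialized, context_description, sequence_description)
# Variable-length lists are parsed as sparse tensors; a ragged tensor
# restores the list-of-lists structure.
parsed_content = tf.RaggedTensor.from_sparse(parsed_lists["content"])
rows = parsed_content.to_list()
print(rows)
```

The context holds features that apply to the whole sequence, while each entry of the feature list holds one inner list.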

<h1 style="color:#3A913F;">The TFRecord Format Summary: </h1> The TFRecord format builds on protocol buffers, so it helps to know about .proto files. Defining your data schema is the basis of the format, and you must understand how to encode and decode your data appropriately.

<h1 style="color:#3A913F;">Preprocessing the Input Features</h1> 

An example of a standardization layer:
```python
means = np.mean(X_train, axis=0, keepdims=True)
stds = np.std(X_train, axis=0, keepdims=True)
eps = keras.backend.epsilon()
model = keras.models.Sequential([
    keras.layers.Lambda(lambda inputs: (inputs - means) / (stds + eps)),
    [...] # other layers
])
```
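
To check that such a layer really standardizes its inputs, it can be applied on its own. In this sketch the synthetic `X_train` (mean 5, standard deviation 3) is an assumption for illustration.

```python
import numpy as np
import tensorflow as tf
from tensorflow import keras

rng = np.random.default_rng(42)
X_train = rng.normal(loc=5.0, scale=3.0, size=(1000, 8)).astype(np.float32)

means = np.mean(X_train, axis=0, keepdims=True)
stds = np.std(X_train, axis=0, keepdims=True)
eps = keras.backend.epsilon()  # avoids division by zero for constant features

standardize = keras.layers.Lambda(lambda inputs: (inputs - means) / (stds + eps))
Z = np.asarray(standardize(tf.constant(X_train)))

print(Z.mean(axis=0).round(3))  # close to 0 for every feature
print(Z.std(axis=0).round(3))   # close to 1 for every feature
```

The statistics are computed once from the training set and then frozen inside the layer, so the same scaling is applied at inference time.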

<h2>Encoding Categorical Features Using the One-Hot Vectors</h2>

In [None]:
vocab = ["<1H OCEAN", "INLAND", "NEAR OCEAN", "NEAR BAY", "ISLAND"]
indices = tf.range(len(vocab), dtype=tf.int64)
table_init = tf.lookup.KeyValueTensorInitializer(vocab, indices)
# if the categories were listed in a text file
# (with one category per line), we would use a TextFileInitializer instead
num_oov_buckets = 2
table = tf.lookup.StaticVocabularyTable(table_init, num_oov_buckets)

# Alternatively, you can use keras.layers.TextVectorization to map categories to indices:
# its adapt() method extracts the vocabulary from a data sample, and its call()
# method converts each category to its index in the vocabulary. You could add this
# layer at the beginning of your model, followed by a Lambda layer that applies the
# tf.one_hot() function, if you want to convert these indices to one-hot vectors.
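
Here is a short sketch of the lookup table in use, followed by one-hot encoding. The sample categories, including the unknown "DESERT", are chosen for illustration.

```python
import tensorflow as tf

vocab = ["<1H OCEAN", "INLAND", "NEAR OCEAN", "NEAR BAY", "ISLAND"]
indices = tf.range(len(vocab), dtype=tf.int64)
table_init = tf.lookup.KeyValueTensorInitializer(vocab, indices)
num_oov_buckets = 2
table = tf.lookup.StaticVocabularyTable(table_init, num_oov_buckets)

categories = tf.constant(["NEAR BAY", "DESERT", "INLAND", "INLAND"])
cat_indices = table.lookup(categories)
# Known categories map to their vocabulary index; "DESERT" is hashed
# into one of the out-of-vocabulary buckets (index 5 or 6).
print(cat_indices.numpy())

# One column per known category plus one per oov bucket.
cat_one_hot = tf.one_hot(cat_indices, depth=len(vocab) + num_oov_buckets)
print(cat_one_hot.numpy())
```

The one-hot depth has to include the oov buckets, otherwise unknown categories would be encoded as all zeros.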

<h2>Encoding Categorical Features Using Embeddings</h2>

In [None]:
regular_inputs = keras.layers.Input(shape=[8])
categories = keras.layers.Input(shape=[], dtype=tf.string)
cat_indices = keras.layers.Lambda(lambda cats: table.lookup(cats))(categories)
# One embedding row per category plus one per out-of-vocabulary bucket
cat_embed = keras.layers.Embedding(input_dim=len(vocab) + num_oov_buckets,
                                   output_dim=2)(cat_indices)
encoded_inputs = keras.layers.concatenate([regular_inputs, cat_embed])
outputs = keras.layers.Dense(1)(encoded_inputs)
model = keras.models.Model(inputs=[regular_inputs, categories],
                           outputs=[outputs])
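
The Embedding layer can also be tried in isolation to see what it returns. In this sketch the index values are made up, and the layer is sized for 5 categories plus 2 out-of-vocabulary buckets.

```python
import numpy as np
import tensorflow as tf
from tensorflow import keras

n_categories, num_oov_buckets = 5, 2
embedding = keras.layers.Embedding(
    input_dim=n_categories + num_oov_buckets,  # one row per possible index
    output_dim=2)                              # 2D embeddings

cat_indices = tf.constant([3, 5, 1, 1])
cat_embed = np.asarray(embedding(cat_indices))
print(cat_embed.shape)  # (4, 2)
print(cat_embed)        # the last two rows are identical (same category)
```

The embedding matrix is initialized randomly and is trained along with the rest of the model.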

<h2>Keras Preprocessing Layers</h2>

In [None]:
It will also be possible to chain multiple preprocessing layers using the PreprocessingStage class.

In [None]:
normalization = keras.layers.Normalization()
discretization = keras.layers.Discretization([...])
pipeline = keras.layers.PreprocessingStage([normalization, discretization])
pipeline.adapt(data_sample)
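
For a runnable variant, the two layers can be adapted and called one after the other. This sketch assumes a TensorFlow release that ships `keras.layers.Normalization` and `keras.layers.Discretization`; the synthetic `data_sample` and the bin boundaries are made up, and the layers are chained by hand rather than through a PreprocessingStage.

```python
import numpy as np
import tensorflow as tf
from tensorflow import keras

rng = np.random.default_rng(42)
data_sample = rng.normal(loc=10.0, scale=2.0, size=(1000, 1)).astype(np.float32)

normalization = keras.layers.Normalization()
normalization.adapt(data_sample)   # learns the mean and variance of the sample

# Four bins: below -1, [-1, 0), [0, 1), and 1 or above
discretization = keras.layers.Discretization(bin_boundaries=[-1.0, 0.0, 1.0])

standardized = np.asarray(normalization(data_sample))
bins = np.asarray(discretization(standardized))
print(standardized.mean().round(3), standardized.std().round(3))
print(np.unique(bins))
```

adapt() is called once on a representative sample, after which the layer can be used like any other layer in a model.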

<h2>TF Transform </h2>

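TF Transform provides interoperability: you define your preprocessing operations once and reuse them both before training and in the deployed model.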

<h2>The TensorFlow Datasets (TFDS) Project</h2>

<h1 style="color:#3A913F;">Preprocessing the Input Features Summary: </h1> 

Read the documentation here https://www.tensorflow.org/api_docs/python/tf/keras/layers/experimental/preprocessing. Something to keep in mind is the choice between preprocessing on the fly and preprocessing your training data beforehand. tensorflow_transform eases interoperability between the different components, e.g., you can define a preprocessing function once, such as
```python
import tensorflow_transform as tft

def preprocess(inputs):  # inputs = a batch of input features
    median_age = inputs["housing_median_age"]
    ocean_proximity = inputs["ocean_proximity"]
    standardized_age = tft.scale_to_z_score(median_age)
    ocean_proximity_id = tft.compute_and_apply_vocabulary(ocean_proximity)
    return {
        "standardized_median_age": standardized_age,
        "ocean_proximity_id": ocean_proximity_id
    }
```

which will provide a TensorFlow function that you can use in your deployed model for on-the-fly preprocessing, because it will have saved the necessary statistics computed by Apache Beam (so there is no need to write separate batch code, such as Apache Spark code). This neutralizes training/serving skew: the preprocessing operations performed before training and the ones performed in your app or in the browser stay consistent.