In [None]:
%matplotlib inline
import matplotlib
import seaborn as sns
sns.set()
matplotlib.rcParams['figure.dpi'] = 144

# Estimators and Datasets
<!-- requirement: pylib/mnist_dataset.py -->

So far we have been dealing with fairly low-level API tools, starting with building neural networks from scratch and then using the layers API.  TensorFlow offers other API tools to interact with its functionality.  One is the `Estimator` and its associated classes, which implement a high-level API reminiscent of `scikit-learn`.  The other is the `Dataset` API, which gives a way of defining data transformations that run before a neural network.  Here we will explore `Estimator`s and `Dataset`s.  These are probably the tools you should be using in production, although for pedagogical reasons we have saved them until now.

We will first import TensorFlow as well as the `mnist` data set generator from before.

In [None]:
import tensorflow as tf
from pylib import mnist_dataset

## TensorFlow `Dataset`

In most machine learning tasks, we make use of data pipelines to bring our data from its source, perform some transformations on it, and then fit a model.  While neural networks do not necessarily require the same feature engineering as standard models, they still require data pipelines to ensure the data is amenable to being fed into the network.  This can be especially important when the data is coming from multiple sources and must be coerced into the shape expected by the particular network architecture the model has been fit with.  TensorFlow offers the [`Dataset`](https://www.tensorflow.org/api_docs/python/tf/data/Dataset) class as a solution to a few of these data problems.

A `Dataset` represents some data as well as transformations that will be performed on the data.  Let's create a `Dataset` from a simple range of integers.

In [None]:
ds = tf.data.Dataset.range(10)
ds

We can look at what type of outputs this `Dataset` contains.

In [None]:
ds.output_classes

We can also look at what shapes this `Dataset` contains.

In [None]:
ds.output_shapes

We see it's just a sequence of numbers: each element is a scalar, which in TensorFlow language has shape `TensorShape([])`.  We can confirm that the elements are numbers by looking at the output types.

In [None]:
ds.output_types

It is all well and good to examine the properties of the `Dataset`, but we do need some way of actually acquiring values from it.  Like TensorFlow tensors, a `Dataset` does not hold its values until asked.  We must tell it to evaluate the operations it needs to acquire these values.  In this case, that means creating a list of numbers, but in general it could be something like reading from a file.

In [None]:
iterator = ds.make_one_shot_iterator()
next_ele = iterator.get_next()
with tf.Session() as sess:
    try:
        while True:
            print(sess.run(next_ele))
    except tf.errors.OutOfRangeError:
        pass
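
The loop above has the same shape as draining an ordinary Python iterator. As a rough analogy only (plain Python, no TensorFlow; the helper name `drain` is made up for illustration), `StopIteration` plays the role of `tf.errors.OutOfRangeError`:

```python
def drain(iterator):
    """Pull values until the iterator is exhausted."""
    values = []
    while True:
        try:
            values.append(next(iterator))
        except StopIteration:  # analogous to tf.errors.OutOfRangeError
            break
    return values


one_shot = iter(range(10))
first_pass = drain(one_shot)
second_pass = drain(one_shot)  # the same iterator cannot be rewound

print(first_pass)
print(second_pass)
```

A second pass over an exhausted iterator yields nothing, which is one reason to build a fresh iterator each time the data set is read.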

In [None]:
def get_all_elements(ds):
    next_ele = ds.make_one_shot_iterator().get_next()
    eles = []
    with tf.Session() as sess:
        try:
            while True:
                eles.append(sess.run(next_ele))
        except tf.errors.OutOfRangeError:
            pass
    return eles

In [None]:
get_all_elements(ds)

In [None]:
squared = ds.map(lambda x: x**2)
get_all_elements(squared)

In [None]:
even = ds.filter(lambda x: x % 2 == 0)
get_all_elements(even)

Do not forget that we are dealing with TensorFlow graph tensors, so `==` tests object identity rather than comparing values.  We will need to use `tf.equal` to get what we want.

In [None]:
even = ds.filter(lambda x: tf.equal(x % 2, 0))
get_all_elements(even)

*Exercise*: Get all numbers in the data set whose squares lie within 5 of $n$, either above or below.

In [None]:
def near_n(ds, n):
    near = ds.filter(lambda x: tf.less_equal(x**2, n+5) & tf.greater_equal(x**2, n-5))
    return get_all_elements(near)

near_n(ds, 6)

For debugging purposes, it might be nice to simulate some of these operations in plain Python. Luckily, we can perform `map`, `filter`, *etc.*, in Python!  It's always a good idea to test out the types of operations we want to perform.

In [None]:
x = list(range(10))
print(list(map(lambda v: v**2, x)))
print(list(filter(lambda v: v % 2 == 0, x)))

Often we will work with rather large data sets and we might want to take only some number of the elements.  We can do this with the `take` operation.  (If at this point you are noticing some real inspiration from `Spark RDD`s then you and the author have the same feelings.)

In [None]:
first = ds.take(5)
get_all_elements(first)

Aside from just taking the elements, we can perform a batching operation which will take the `Dataset` and turn it into groups of some number (in this case we will use 2).  This comes in handy when training neural networks.  Note that it is not guaranteed all the batches will be the same size.

In [None]:
batched = ds.batch(2)
get_all_elements(batched)
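
The remark about unequal batch sizes is easy to check in plain Python. The `batch` helper below is a hypothetical sketch that only mimics the grouping behaviour; it is not how TensorFlow implements batching.

```python
def batch(items, size):
    """Group items into consecutive lists of at most `size` elements."""
    items = list(items)
    return [items[i:i + size] for i in range(0, len(items), size)]


even_split = batch(range(10), 2)
uneven_split = batch(range(10), 3)

print(even_split)
print(uneven_split)  # the last batch holds only the leftover element
```

Ten elements divide evenly into batches of 2, but with batches of 3 the final batch contains a single element.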

We can also shuffle the data.  This is useful to ensure that your training steps are not biased towards a particular area in the data set.  The argument to the shuffle is the buffer size which specifies the number of elements to be shuffled at a time.  It is a compromise between randomness and memory usage. Here we will choose all of the elements.

In [None]:
shuffled = ds.shuffle(10)
get_all_elements(shuffled)
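
The trade-off controlled by the buffer size can be sketched in plain Python. The function below is an illustrative approximation of buffered shuffling (the name `buffered_shuffle` and the `seed` parameter are assumptions for this sketch): fill a buffer, repeatedly emit a random element from it, and refill the freed slot from the stream.

```python
import random
from itertools import islice


def buffered_shuffle(items, buffer_size, seed=0):
    """Shuffle a stream using a fixed-size buffer (buffer_size >= 1)."""
    rng = random.Random(seed)
    stream = iter(items)
    buffer = list(islice(stream, buffer_size))  # fill the buffer first
    out = []
    for item in stream:
        idx = rng.randrange(len(buffer))
        out.append(buffer[idx])  # emit a random buffered element
        buffer[idx] = item       # refill the slot from the stream
    rng.shuffle(buffer)          # drain whatever is left in random order
    out.extend(buffer)
    return out


in_order = buffered_shuffle(range(10), 1)   # buffer of 1: no shuffling at all
local = buffered_shuffle(range(10), 3)      # elements move only a short way forward
full = buffered_shuffle(range(10), 10)      # buffer covers the whole data set

print(in_order)
print(local)
print(full)
```

With a buffer of 1 the order is unchanged, and with a small buffer an element can never appear much earlier than its original position. Only a buffer at least as large as the data set gives a uniform shuffle in this sketch.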

Chaining these two together gives us a data set which is shuffled and batched!

In [None]:
batched_and_shuffled = ds.shuffle(10).batch(3)
get_all_elements(batched_and_shuffled)

*Question:* Does order matter here?

Now, how might we actually use this in a training example?  Let's make a simple graph to take these numbers and perform some operations on them.  The idea is that we can feed the `get_next` operation into the start of our computation graph, which in general can be a neural network.  Here we will subtract 19 and add 20, which is more commonly known as adding one.

In [None]:
repeat_batch = ds.repeat().batch(3)
next_ele = repeat_batch.make_one_shot_iterator().get_next()
x = tf.subtract(next_ele, 19)
y = tf.add(x, 20)
with tf.Session() as sess:
    for i in range(4):
        print(sess.run(y))

Notice we have used the `repeat` method, which repeats the data set. In this case we have not specified an argument, so it will repeat indefinitely, but in general one can specify the number of times to repeat as an argument to the `repeat` method.
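
Because the repetition happens before batching, a batch can straddle the boundary between two passes over the data. A plain-Python sketch of the same pipeline (using `itertools` rather than TensorFlow; the helper name `repeat_then_batch` is made up for illustration):

```python
from itertools import cycle, islice


def repeat_then_batch(items, size, n_batches):
    """Repeat items endlessly, then cut the stream into fixed-size batches."""
    stream = cycle(items)
    return [list(islice(stream, size)) for _ in range(n_batches)]


batches = repeat_then_batch(range(10), 3, 4)
# Same arithmetic as the graph above: subtract 19, then add 20
shifted = [[(v - 19) + 20 for v in b] for b in batches]

print(batches)
print(shifted)
```

The fourth batch mixes the last element of the first pass with the first two elements of the second pass.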

## Building an image pipeline with TFRecords

Another key consideration in machine learning production environments is efficient file storage. When dealing with large amounts of data, this becomes especially important. 

To this end, TensorFlow provides the `.tfrecords` file format, which stores data in binary strings that can be sequentially read from disk. This provides significant increases in reading speed, especially when working on standard hard disk drives as opposed to solid state drives. Moreover, the TFRecords format makes it easy to combine multiple files, and integrates well with TensorFlow `Dataset`s.

Let's build a TFRecords file for an image classification problem, where images are currently stored as `.jpg` files in multiple directories.

In [None]:
import numpy as np
import glob
from PIL import Image

Download the flower images.

In [None]:
import pathlib
data_root = tf.keras.utils.get_file('flower_photos','https://storage.googleapis.com/download.tensorflow.org/example_images/flower_photos.tgz', untar=True)
data_root = pathlib.Path(data_root)
print(data_root)

Examine the directory structure. We see there are five classes in the data set. 

In [None]:
for item in data_root.iterdir():
    print(item)

In [None]:
label_names = sorted(item.name for item in data_root.glob('*/') if item.is_dir())
label_to_index = dict((name, index) for index,name in enumerate(label_names))
print(label_to_index)

Obtain the file path and label for each image.

In [None]:
all_image_paths = list(data_root.glob('*/*'))
all_image_paths = [str(path) for path in all_image_paths]
all_image_labels = [label_to_index[pathlib.Path(path).parent.name]
                    for path in all_image_paths]

### Save the `.tfrecords` file

Here we get to the bulk of this tutorial: formatting data correctly to write to TFRecords. Perhaps surprisingly, we don't turn data into tensors before creating the TFRecords file. Instead, each *unit* (in our case, image and label) will be represented as a `tf.train.Example`, which contains multiple `tf.train.Feature` attributes. 

```
TFRecords
    tf.train.Example
        'image':tf.train.Feature
        'label':tf.train.Feature
```

A `tf.train.Feature` will contain either an `Int64List`, a `FloatList`, or `BytesList`, in accordance with TensorFlow data types. We encode each image with a `FloatList` and each label with an `Int64List`.

In this example, we decode and preprocess images before writing them to the TFRecords file. We load each image and resize it to 32x32 pixels, convert it to a NumPy array with three color channels, divide by 255, then flatten the array and convert it to a standard list of 32 × 32 × 3 = 3072 values.

In [None]:
tfrecord_filename = 'all_images.tfrecords'

writer = tf.python_io.TFRecordWriter(tfrecord_filename)
for image, label in zip(all_image_paths, all_image_labels):
    # Force three channels so every image flattens to 32*32*3 = 3072 values
    img = Image.open(image).convert('RGB')
    img = np.array(img.resize((32, 32))) / 255
    img = img.reshape(-1).tolist()

    feature = {'image': tf.train.Feature(float_list=tf.train.FloatList(value=img)),
               'label': tf.train.Feature(int64_list=tf.train.Int64List(value=[label]))}

    example = tf.train.Example(features=tf.train.Features(feature=feature))

    # Write the serialized example
    writer.write(example.SerializeToString())

writer.close()

### Reading the `.tfrecords` file

Reading from a TFRecords file still requires a bit of work. We first read the file to a `tf.data.TFRecordDataset`, then map a parsing function, which will tell TensorFlow what features to expect from each `tf.train.Example`, and reshape the image array. 

In [None]:
def _parse_function(example_proto):
    features = {"image": tf.FixedLenFeature([3072,], tf.float32, default_value=[0]*3072),
                "label": tf.FixedLenFeature((), tf.int64, default_value=0)}
    parsed_features = tf.parse_single_example(example_proto, features)
    return tf.reshape(parsed_features["image"], [32,32,3]), parsed_features["label"]
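
The parsing function relies on the flattened list and the reshaped tensor agreeing on element order. NumPy's `reshape(-1)` and `tf.reshape` both use row-major order by default, so the round trip is lossless. The following plain-Python sketch (the helper names `flatten` and `unflatten` are made up for illustration) spells out the index arithmetic:

```python
HEIGHT, WIDTH, CHANNELS = 32, 32, 3


def flatten(image):
    """Row-major flattening of a height x width x channels nested list."""
    return [v for row in image for pixel in row for v in pixel]


def unflatten(flat, height, width, channels):
    """Inverse of flatten: rebuild the nested list from a flat list."""
    assert len(flat) == height * width * channels
    return [[flat[(r * width + c) * channels:(r * width + c + 1) * channels]
             for c in range(width)]
            for r in range(height)]


# A synthetic "image" with values in [0, 1]
image = [[[((r + c + k) % 256) / 255 for k in range(CHANNELS)]
          for c in range(WIDTH)]
         for r in range(HEIGHT)]

flat = flatten(image)
restored = unflatten(flat, HEIGHT, WIDTH, CHANNELS)

print(len(flat))
print(restored == image)
```

The flat list has 3072 entries, matching the length declared in `tf.FixedLenFeature`.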

In [None]:
sess = tf.Session()
dataset = tf.data.TFRecordDataset('all_images.tfrecords')
dataset = dataset.map(_parse_function)
dataset = dataset.repeat()
dataset = dataset.shuffle(5000) 
dataset = dataset.batch(50)

iterator = dataset.make_one_shot_iterator()
next_element = iterator.get_next()
image_batch, label_batch = sess.run(next_element)

In [None]:
image_batch.shape

## TensorFlow `Estimator`
Now that we have learned how to work with `Dataset`s, let's dive into the `Estimator` object.  This object allows us either to use predefined estimators (really just graphs) or to define our own graphs in a simple way that is oriented towards machine learning.  Estimators don't require managing all of the objects necessary when using the lower-level API.

In [None]:
def _one_hot(x, y):
    return x, tf.one_hot(y, 10)

def _input_fn(type_, batch_size, name, one_hot):
    data = getattr(mnist_dataset, type_)('/tmp/data')
    ds = data.map(lambda x, y : ({name: x}, y)).shuffle(500)
    if one_hot:
        ds = ds.map(_one_hot)
    if type_ == 'train':
        ds = ds.repeat()
    return ds.batch(batch_size)

def test_input_fn(batch_size, name='pixels', one_hot=False):
    return _input_fn('test', batch_size, name, one_hot)

def train_input_fn(batch_size, name='pixels', one_hot=False):
    return _input_fn('train', batch_size, name, one_hot)


In [None]:
train_input_fn(10)

In [None]:
my_feature_columns = [tf.feature_column.numeric_column(key='pixels', shape=(28*28))]
estimator = tf.estimator.DNNClassifier(
    feature_columns=my_feature_columns,
    dropout=.5,
    hidden_units=[100, 100],
    n_classes=10
)

In [None]:
estimator.train(
    input_fn=lambda:train_input_fn(50),
    steps=2000)

In [None]:
estimator.evaluate(input_fn=lambda:test_input_fn(100))

It is great that we can use the baked in estimators, and often they will be suitable for the tasks at hand.  Yet, we will also need to write our own custom estimators.  Remember `Estimator`s break up the data pipeline step and the model step, so we can use our estimator with varying data pipelines.  This might come in handy if you train a model on the `MNIST` data set, but want to perform inference or further training on a different data set which will need to be transformed in some way to be coerced into the proper shape.

The main part of an `Estimator` is the model function. Think of this as the function which generates the particular computation you want to perform.  From TensorFlow's point of view, there are three things you want your model function to do: *predict*, *train*, and *evaluate*.  The model function should handle the logic behind each one of these steps. Generally speaking, the first part of a model function will set up the calculation (graph) and then branch into different logical steps depending on which one of these three actions the user requests the model to perform.

The function signature of the model function will take a few arguments

* `features` - the input features to the model
* `labels` - the input labels (truth values)
* `mode` - the mode, usually one of `tf.estimator.ModeKeys.PREDICT, tf.estimator.ModeKeys.TRAIN, tf.estimator.ModeKeys.EVAL`
* `params` - extra parameters needed by the function.

The return value of the function will be a `tf.estimator.EstimatorSpec` of which we will use a variety of arguments depending on the mode.
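
The overall shape of a model function (shared setup first, then a branch per mode) can be sketched without TensorFlow. Everything in the block below is hypothetical: the mode strings, the one-weight linear "model", and the plain dictionaries standing in for `tf.estimator.EstimatorSpec` are chosen only to show the control flow.

```python
def toy_model_fn(features, labels, mode, params):
    """Toy model function: shared setup, then one branch per mode."""
    # Shared setup: a one-weight linear model
    scores = [params["weight"] * x for x in features]

    # Prediction needs no labels, so return before computing the loss
    if mode == "predict":
        return {"predictions": scores}

    # The loss is needed for both evaluation and training
    loss = sum((s - y) ** 2 for s, y in zip(scores, labels)) / len(labels)

    if mode == "eval":
        return {"loss": loss, "metrics": {"mse": loss}}
    if mode == "train":
        return {"loss": loss, "train_op": "one optimizer step (placeholder)"}
    raise ValueError("unknown mode: {}".format(mode))


params = {"weight": 2.0}
predicted = toy_model_fn([1.0, 2.0], None, "predict", params)
evaluated = toy_model_fn([1.0, 2.0], [2.0, 5.0], "eval", params)

print(predicted)
print(evaluated)
```

Note that the predict branch returns before the labels are touched, since no labels are available at inference time.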

In [None]:
def model_fn(features, labels, mode, params):
    # Create an input layer from the input features
    in_layer = tf.feature_column.input_layer(features, 
                                             params['feature_columns'])
    # Create dense layers and an output layer with 10 classes
    
    out = tf.keras.Sequential(
        [
            tf.keras.layers.Dense(100, activation=tf.nn.relu),
            tf.keras.layers.Dense(100, activation=tf.nn.relu),
            tf.keras.layers.Dense(10)
        ]
    )(in_layer)
    
    

    # If PREDICT mode, return predictions
    if mode == tf.estimator.ModeKeys.PREDICT:
        predictions= {
            "class_ids" : tf.argmax(out, 1)[:, tf.newaxis],
            "probabilities": tf.nn.softmax(out),
            "logits": out
        }
        class_out = tf.estimator.export.ClassificationOutput
        return tf.estimator.EstimatorSpec(mode, 
                                          predictions=predictions,
                                          export_outputs={
                                              "predict":class_out(
                                                  scores=predictions["probabilities"],
                                                  classes=tf.as_string(predictions["class_ids"])
                                              )
                                          })
    
    # Compute the loss and accuracy
    # we will need these for TRAIN and EVAL
    loss = tf.losses.sparse_softmax_cross_entropy(labels=labels,
                                                  logits=out)
    acc = tf.metrics.accuracy(labels=labels,
                                 predictions=tf.argmax(out, 1)[:, tf.newaxis],
                                 name='accuracy')
    # Create a summary scalar; the Estimator will handle the FileWriters
    tf.summary.scalar("accuracy", acc[1])
    
    # If EVAL mode, return the loss and the metrics we want to evaluate
    # here we will choose accuracy
    if mode == tf.estimator.ModeKeys.EVAL:
        return tf.estimator.EstimatorSpec(mode,
                                         loss=loss,
                                         eval_metric_ops={'accuracy': acc})
    
    # If TRAIN mode, create an optimizer and a train operator
    if mode == tf.estimator.ModeKeys.TRAIN:
        optimizer = tf.train.AdagradOptimizer(.01)
        train = optimizer.minimize(loss, global_step = tf.train.get_global_step())
        return tf.estimator.EstimatorSpec(mode, loss=loss, train_op=train)

Now that we have a model function, we can wrap it in an `Estimator` and use it in exactly the same way as before.  Let's pass in a `model_dir` so we can use TensorBoard to view the results of the training.

In [None]:
import time
custom_est = tf.estimator.Estimator(model_fn=model_fn, 
                                    model_dir='tb/{}'.format(time.time()),
                                    params={
    "feature_columns":my_feature_columns
})

As before, we can train our estimator with the same training input function.

In [None]:
custom_est.train(
    input_fn=lambda:train_input_fn(50),
    steps=2000)

We can also evaluate using the test input function.

In [None]:
custom_est.evaluate(input_fn=lambda:test_input_fn(100))

Now notice that our model checkpoints are saved into the directory.  We can check this with a quick Bash command.

In [None]:
! ls {custom_est.model_dir}

We can also take a model made with `Keras` and turn it directly into an estimator. This is probably how you would want to do things in most cases.

In [None]:
model = tf.keras.Sequential(
        [
            tf.keras.layers.Dense(100, activation='relu', 
                                  name='pixels', 
                                  input_shape=(28*28,)),
            tf.keras.layers.Dense(100, activation='relu'),
            tf.keras.layers.Dense(10, activation='softmax')
        ]
    )

model.compile(optimizer='adagrad',
             loss='categorical_crossentropy',
             metrics=['accuracy'])

keras_est = tf.keras.estimator.model_to_estimator(
    keras_model=model,
    model_dir='tb/{}'.format(time.time())
)

Notice that when we look at the model inputs, `Keras` adds some extra text to the input layer name (`pixels_input` instead of `pixels`). To account for this we can use the `name` parameter of the `train_input_fn`.

In [None]:
model.inputs

In [None]:
keras_est.train(input_fn=lambda:train_input_fn(100, 
                                               name='pixels_input', 
                                               one_hot=True),
               steps=2000)
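
The `one_hot=True` flag is needed because the Keras model was compiled with `categorical_crossentropy`, which expects one-hot labels, whereas the custom model function used a sparse loss on integer labels. Both compute the same quantity. A plain-Python sketch of the arithmetic (all function names here are illustrative, not TensorFlow or Keras APIs):

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    shift = max(logits)
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def one_hot(label, depth):
    """Encode an integer label as a list with a single 1.0."""
    return [1.0 if i == label else 0.0 for i in range(depth)]


def sparse_cross_entropy(label, logits):
    """Cross entropy from an integer label and raw logits."""
    return -math.log(softmax(logits)[label])


def categorical_cross_entropy(target, probs):
    """Cross entropy from a one-hot target and predicted probabilities."""
    return -sum(t * math.log(p) for t, p in zip(target, probs))


logits = [2.0, 0.5, -1.0]
label = 0
sparse = sparse_cross_entropy(label, logits)
dense = categorical_cross_entropy(one_hot(label, 3), softmax(logits))

print(sparse, dense)
```

The two values agree up to floating-point rounding, so the choice between them is purely a matter of label format.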

We can evaluate it as before.

In [None]:
keras_est.evaluate(input_fn=lambda:test_input_fn(100,
                                                 name='pixels_input', 
                                                 one_hot=True))

We now have a nice model, potentially with checkpoints, but what if we want to deploy it into production?  How can we save this model in a format useful for TensorFlow Serving?

## Serving an `Estimator`

The main thing we need to write is a function that takes the data we want to perform inference upon and processes it into the proper input for our model.  This can be very useful when the inference data is not in the same format as the training data.  This function is usually called the `serving_input_receiver_fn`; let's define one here.  Often the estimator will receive something like a serialized string which needs to be parsed, but for now we will have it receive data in the correct input format.  We can use the handy `build_raw_serving_input_receiver_fn` in this case.

In [None]:
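def serving_input_receiver_fn():
    # Raw input: a batch of flattened 28x28 images under the 'pixels' key,
    # matching the numeric feature column the model was built with
    return tf.estimator.export.build_raw_serving_input_receiver_fn(
          {"pixels": tf.placeholder(dtype=tf.float32, shape=[None, 28*28])}
    )()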

In [None]:
custom_est.export_savedmodel("tb/saved_model", 
                             serving_input_receiver_fn)

Now we can use this model with TensorFlow Serving to make some real predictions.

*Copyright &copy; 2019 The Data Incubator.  All rights reserved.*