# HW3: Profiling and Hyperparameter tuning

You will gain experience with two practical tools in this assignment.

* You will use the [TensorFlow Profiler](https://www.tensorflow.org/guide/profiler) to analyze the input pipeline of a slow-running program, then add a few lines of code to improve performance. The profiler is built into [TensorBoard](https://www.tensorflow.org/tensorboard), a popular visualization tool (parts of which work with both TensorFlow and PyTorch).

* You will use [Keras Tuner](https://www.tensorflow.org/tutorials/keras/keras_tuner) to tune the hyperparameters for a text classifier. Keras Tuner is an open-source hyperparameter tuning package that works with TensorFlow, scikit-learn, and other frameworks.

Of course, you are welcome to use code from the website and book for this assignment. Please cite your sources (when in doubt, it never hurts to include a note re: the resources you used).

## Submission instructions

To submit this assignment, please upload a zip file to CourseWorks. Your zip file should include:
* This notebook (saved, with output)
* Screenshots of the TF Profiler as described below.

# Part 1: Profile a slow text classifier using the TensorFlow Profiler

The most common performance bottleneck in a DL program is the input pipeline. In a nutshell, modern GPUs are so fast they often sit idle while waiting for data to be loaded off disk, and/or preprocessed by the CPU (informally, this is called "GPU starvation").
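To make the idea of GPU starvation concrete, here is a rough back-of-the-envelope sketch. The per-batch timings are hypothetical numbers chosen for illustration, not measurements from this assignment, and the model of the pipeline (a fixed load cost and a fixed train cost per batch) is deliberately simplistic.

```python
def epoch_seconds(load_ms, train_ms, steps, overlapped):
    """Rough epoch time given a fixed per-batch load cost and train cost."""
    # Without overlap, each step waits for loading and then for training.
    # With overlap (prefetching), the slower of the two sets the pace.
    per_step = max(load_ms, train_ms) if overlapped else load_ms + train_ms
    return per_step * steps / 1000.0


# Hypothetical timings: 40 ms to load and preprocess a batch, 10 ms to train on it.
sequential = epoch_seconds(load_ms=40, train_ms=10, steps=200, overlapped=False)
overlapped = epoch_seconds(load_ms=40, train_ms=10, steps=200, overlapped=True)

# Hypothetical cached pipeline: loading drops to 2 ms, so training dominates.
cached = epoch_seconds(load_ms=2, train_ms=10, steps=200, overlapped=True)

print(sequential, overlapped, cached)  # 10.0 8.0 2.0
```

Under these assumed numbers, overlapping alone still leaves the program input-bound, because loading is the slower stage. Only once the load cost falls below the train cost does the accelerator stay busy.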

In this part of the assignment, you will improve the runtime performance of a text classifier (implemented below). The author of the text classifier has forgotten to optimize the input pipeline, causing it to run roughly **5x slower**, depending on your GPU.

You will practice using the [TensorFlow Profiler](https://www.tensorflow.org/guide/profiler) to identify a performance problem, and [tf.data](https://www.tensorflow.org/guide/data_performance) to fix it. 

**Reading**

You can learn about the [TF Profiler](https://www.tensorflow.org/guide/profiler) and [tf.data](https://www.tensorflow.org/guide/data_performance) with these links. Also see the [example code](https://www.tensorflow.org/tensorboard/tensorboard_profiling_keras) for the profiler.

For this assignment, you should use [caching](https://www.tensorflow.org/guide/data_performance#caching) and [prefetching](https://www.tensorflow.org/guide/data_performance#prefetching) to improve the input pipeline.
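As a minimal sketch of what caching and prefetching look like in `tf.data`, the example below applies both to a toy dataset of integers rather than to the assignment's pipeline. It uses `tf.data.experimental.AUTOTUNE`, which newer TensorFlow releases also expose as `tf.data.AUTOTUNE`; check which name your version provides.

```python
import tensorflow as tf

# Toy stand-in for a real pipeline: 8 integers, "preprocessed" by a map, then batched.
ds = tf.data.Dataset.range(8).map(lambda x: x * 2).batch(4)

# cache() keeps the mapped batches in memory after the first full pass.
# prefetch() prepares upcoming batches while the current one is being consumed.
AUTOTUNE = tf.data.experimental.AUTOTUNE
fast_ds = ds.cache().prefetch(buffer_size=AUTOTUNE)

batches = [batch.numpy().tolist() for batch in fast_ds]
print(batches)  # [[0, 2, 4, 6], [8, 10, 12, 14]]
```

Neither call changes the data that comes out of the dataset; they only change when and how often the upstream work is done.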

## Starter code

The following code trains a simple text classifier on a [dataset](https://storage.googleapis.com/download.tensorflow.org/data/stack_overflow_16k.tar.gz) of programming questions extracted from Stack Overflow. This code is written for you. Each question ("How do I sort a dictionary by value?") is labeled with exactly one tag (`Python`, `CSharp`, `JavaScript`, or `Java`). The code below will train a model to predict the tag for a question. 

## Instructions

The goal of this section is for you to learn the mechanics of profiling code, so you can take advantage of it in your future work. Optimizing the input pipeline itself is not meant to be difficult (you can browse through existing tutorials on [tensorflow.org/tutorials](https://www.tensorflow.org/tutorials) to see how). I mainly want you to be aware of issues like these (so you can catch them in your own code down the road).

After reading and running this code to train a text classifier, you will:

1. Modify the code below by adding a few lines to profile it with the TF Profiler
1. Re-run the code, open the TF Profiler, and navigate to the section showing the input pipeline. Take a screenshot of the Profiler that shows the program is input-bound (and include the screenshot with your submission). 
1. Modify the code below by writing a few lines to optimize the data input pipeline, using [caching](https://www.tensorflow.org/guide/data_performance#caching) and [prefetching](https://www.tensorflow.org/guide/data_performance#prefetching).
1. Re-run the code, and take a screenshot of the TF Profiler showing the input pipeline is faster (and include that screenshot with your submission).

That's all you need to do for full credit on this section. If you're interested in diving deeper, both the profiler and tf.data are pretty extensive - you can continue reading on your own, and see what you can do to make the input pipeline even faster (there's a new feature called [snapshot](https://www.tensorflow.org/api_docs/python/tf/data/experimental/snapshot?version=nightly) you could try, for example, that was released a couple weeks ago).



In [None]:
# Install the TF Profiler
!pip install -U tensorboard_plugin_profile

In [None]:
# Load the TensorBoard notebook extension
# If you are not running this notebook in https://colab.research.google.com/,
# you will need to install TensorBoard on your machine.
%load_ext tensorboard

In [None]:
import collections
import datetime
import pathlib
import re
import string

import tensorflow as tf

from tensorflow.keras import layers
from tensorflow.keras import losses
from tensorflow.keras import preprocessing
from tensorflow.keras import utils
from tensorflow.keras.layers.experimental.preprocessing import TextVectorization

### Download and explore the dataset

Download the dataset, and explore the directory structure.

In [None]:
data_url = 'https://storage.googleapis.com/download.tensorflow.org/data/stack_overflow_16k.tar.gz'

In [None]:
dataset = utils.get_file(
    'stack_overflow_16k.tar.gz',
    data_url,
    untar=True,
    cache_dir='stack_overflow',
    cache_subdir='')

In [None]:
dataset_dir = pathlib.Path(dataset).parent

In [None]:
list(dataset_dir.iterdir())

In [None]:
train_dir = dataset_dir/'train'
list(train_dir.iterdir())

The `train/csharp`, `train/java`, `train/python` and `train/javascript` directories contain many text files, each of which is a Stack Overflow question.

In [None]:
sample_file = train_dir/'python/0.txt'
with open(sample_file) as f:
  print(f.read())

### Load the dataset

Create an input pipeline to load the data off disk and prepare it into a format suitable for training using [text_dataset_from_directory](https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/text_dataset_from_directory). This utility creates a labeled `tf.data.Dataset` from a directory structure as follows.

```
train/
...csharp/
......1.txt
......2.txt
...java/
......1.txt
......2.txt
...javascript/
......1.txt
......2.txt
...python/
......1.txt
......2.txt
```

This dataset has already been divided into train and test, but it lacks a validation set. Create a validation set using an 80:20 split of the training data by using the `validation_split` argument below.

In [None]:
batch_size = 32
seed = 0

raw_train_ds = preprocessing.text_dataset_from_directory(
  train_dir,
  batch_size=batch_size,
  validation_split=0.2,
  subset='training',
  seed=seed)

There are 8,000 examples in the training folder, of which you will use 80% (or 6,400) for training. As you will see in a moment, you can train a model by passing a `tf.data.Dataset` directly to `model.fit`. First, iterate over the dataset and print out a few examples, to get a feel for the data.

In [None]:
for text_batch, label_batch in raw_train_ds.take(1):
  for i in range(3):
    print("Question: ", text_batch.numpy()[i])
    print("Label:", label_batch.numpy()[i])

Note: To increase the difficulty of the classification problem, the dataset author replaced occurrences of the words *Python*, *CSharp*, *JavaScript*, or *Java* in the programming question with the word *blank*. This is to prevent the classifier from just memorizing a few words, and to make it more interesting to experiment with a few techniques.

The labels are `0`, `1`, `2` or `3`. To see which of these correspond to which string label, you can check the `class_names` property on the dataset.

In [None]:
for i, label in enumerate(raw_train_ds.class_names):
  print("Label", i, "corresponds to", label)

Next, you will create a validation and test dataset. You will use the remaining 1,600 questions from the training set for validation.

Note:  When using the `validation_split` and `subset` arguments, make sure to either specify a random seed, or to pass `shuffle=False`, so that the validation and training splits have no overlap.

In [None]:
raw_val_ds = preprocessing.text_dataset_from_directory(
    train_dir,
    batch_size=batch_size,
    validation_split=0.2,
    subset='validation',
    seed=seed)
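The note above about seeds can be illustrated with a simplified, pure-Python model of a shuffled split. This is only a sketch of the idea, not the Keras implementation, and the helper `split_indices` is hypothetical.

```python
import random


def split_indices(n, validation_split, seed):
    """Shuffles range(n) with `seed`, then returns (training, validation) index sets."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    n_val = int(n * validation_split)
    return set(indices[:-n_val]), set(indices[-n_val:])


train_a, _ = split_indices(8000, 0.2, seed=0)
_, val_same_seed = split_indices(8000, 0.2, seed=0)   # same seed as the training call
_, val_other_seed = split_indices(8000, 0.2, seed=1)  # different seed

print(len(train_a), len(val_same_seed))      # 6400 1600
print(len(train_a & val_same_seed))          # 0
print(len(train_a & val_other_seed) > 0)     # True
```

With matching seeds both calls see the same shuffled order, so the two subsets are disjoint. With mismatched seeds, some validation examples also appear in the training subset, which would inflate validation accuracy.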

Create a test dataset.

In [None]:
test_dir = dataset_dir / 'test'
raw_test_ds = preprocessing.text_dataset_from_directory(
    test_dir, batch_size=batch_size)

### Prepare the dataset for training

Next, you will standardize, tokenize, and vectorize the data using the `preprocessing.TextVectorization` layer.

* Standardization refers to preprocessing the text, typically to remove punctuation or HTML elements to simplify the dataset.

* Tokenization refers to splitting strings into tokens (for example, splitting a sentence into individual words by splitting on whitespace).

* Vectorization refers to converting tokens into numbers so they can be fed into a neural network.

All of these tasks can be accomplished with this layer. You can learn more about each of these in the [API doc](https://www.tensorflow.org/api_docs/python/tf/keras/layers/experimental/preprocessing/TextVectorization).

* The default standardization converts text to lowercase and removes punctuation.

* The default tokenizer splits on whitespace.

* The default vectorization mode is `int`. This outputs integer indices (one per token). This mode can be used to build models that take word order into account. Here, you will use `binary` to build a bag-of-words model.

In [None]:
VOCAB_SIZE = 10000

vectorize_layer = TextVectorization(
    max_tokens=VOCAB_SIZE,
    output_mode='binary')
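To make the three steps concrete, here is a pure-Python sketch of standardization, tokenization, and binary vectorization. It mimics the defaults described above on a tiny hypothetical vocabulary; the real layer learns its vocabulary during `adapt` and handles details (such as out-of-vocabulary tokens) that this sketch ignores.

```python
import re
import string


def standardize(text):
    """Lowercases the text and strips punctuation."""
    return re.sub("[%s]" % re.escape(string.punctuation), "", text.lower())


def tokenize(text):
    """Splits on whitespace."""
    return text.split()


def bag_of_words(tokens, vocab):
    """Returns 1 for each vocabulary word present in the tokens, else 0."""
    present = set(tokens)
    return [1 if word in present else 0 for word in vocab]


# Hypothetical vocabulary, for illustration only.
vocab = ["a", "dictionary", "list", "sort", "value", "loop"]

tokens = tokenize(standardize("How do I sort a dictionary by value?"))
print(tokens)  # ['how', 'do', 'i', 'sort', 'a', 'dictionary', 'by', 'value']
print(bag_of_words(tokens, vocab))  # [1, 1, 0, 1, 1, 0]
```

The output vector has one slot per vocabulary word, so its length is fixed regardless of the question's length, and word order is discarded.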

Next, you will call `adapt` using the training set. This will cause the `TextVectorization` layer to learn the vocabulary which it will use to convert sentences to a bag-of-words dataset.

Note: it's important to only use your training data when calling adapt (using the validation or test set would leak information, of course).

In [None]:
# Make a text-only dataset (without labels), then call adapt
train_text = raw_train_ds.map(lambda text, labels: text)
vectorize_layer.adapt(train_text)

Now that the layer is ready, take a look at how it preprocesses text by calling it on a batch of data from the training set.

In [None]:
def vectorize_text(text, label):
  text = tf.expand_dims(text, -1)
  return vectorize_layer(text), label

In [None]:
# Retrieve a batch of 32 questions and labels from the training set
text_batch, label_batch = next(iter(raw_train_ds))
first_question, first_label = text_batch[0], label_batch[0]

print("Question", first_question)
print("Label", first_label)

This is the result of applying the layer. The one-hot indices correspond to the tokens present in the programming question.

In [None]:
one_hot_bow = vectorize_text(first_question, first_label)[0]
print("Bag of words (one-hot encoded):", one_hot_bow)

The layer contains a vocabulary list that you can use to recover the original sentence (if using `int` mode), or to see the words selected by the one-hot encoding (if using `binary` mode). Of course, with binary mode word order is lost.

In [None]:
vocab = vectorize_layer.get_vocabulary()
words = tf.gather(vocab, tf.where(one_hot_bow[0]))
print("Bag of words (tokens)", words)

You are nearly ready to train your model. As a final preprocessing step, you will apply the `TextVectorization` layer you created earlier to the train, validation, and test datasets.

In [None]:
train_ds = raw_train_ds.map(vectorize_text)
val_ds = raw_val_ds.map(vectorize_text)
test_ds = raw_test_ds.map(vectorize_text)

### Configure the dataset for performance

The author has forgotten to implement this section. You will do so later after profiling the code.

In [None]:
##### 
# TODO: your code here
# After profiling the code (and taking a screenshot of the slow input pipeline)
# add code here to improve the performance
#####







### Train the model

In [None]:
model = tf.keras.Sequential([layers.Dense(4)])

model.compile(
    loss=losses.SparseCategoricalCrossentropy(from_logits=True),
    optimizer='adam',
    metrics=['accuracy'])


#####
# TODO: your code here
# Update and expand the code below to profile your program
#####
history = model.fit(train_ds, validation_data=val_ds, epochs=10)






Notice this is a fairly small model:

In [None]:
print(model.summary())

Evaluate the model on the test set:

In [None]:
loss, accuracy = model.evaluate(test_ds)
print("Accuracy: {:2.2%}".format(accuracy))

### Export the model

In the code above, you applied the `TextVectorization` layer to the dataset. If you want to make your model capable of processing raw strings (to simplify deploying it), you can include the `TextVectorization` layer inside your model. To do so, create a new model as follows.

In [None]:
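export_model = tf.keras.Sequential(
    [vectorize_layer, model,
     # Softmax turns the 4 logits into class probabilities that sum to 1,
     # which is what the loss below expects when from_logits=False
     layers.Activation('softmax')])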

export_model.compile(
    loss=losses.SparseCategoricalCrossentropy(from_logits=False),
    optimizer='adam',
    metrics=['accuracy'])

# Test it with `raw_test_ds`, which yields raw strings
loss, accuracy = export_model.evaluate(raw_test_ds)
print("Accuracy: {:2.2%}".format(accuracy))

Now your model can take raw strings as input and predict a score for each label using `model.predict`. Define a function to find the label with the maximum score:

In [None]:
def get_string_labels(predicted_scores_batch):
  predicted_int_labels = tf.argmax(predicted_scores_batch, axis=1)
  predicted_labels = tf.gather(raw_train_ds.class_names, predicted_int_labels)
  return predicted_labels
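The mapping from scores to a label can be sketched without TensorFlow. The logits below are hypothetical, and the class order is assumed to be alphabetical by directory name. The sketch also checks that applying softmax does not change which label wins.

```python
import math

# Assumed class order (alphabetical by directory name).
class_names = ["csharp", "java", "javascript", "python"]


def softmax(logits):
    """Converts logits to probabilities that sum to 1."""
    shifted = [x - max(logits) for x in logits]  # subtract max for numerical stability
    exps = [math.exp(x) for x in shifted]
    total = sum(exps)
    return [e / total for e in exps]


def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=values.__getitem__)


logits = [0.1, -1.2, 0.3, 2.0]  # hypothetical scores for one question
probs = softmax(logits)

print(class_names[argmax(logits)])  # python
print(class_names[argmax(probs)])   # python
```

Because softmax preserves the ordering of the scores, the predicted label is the same whether you take the argmax of the logits or of the probabilities.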

### Run inference on new data

In [None]:
inputs = [
    "how do I extract keys from a dict into a list?",  # python
    "debug public static void main(string[] args) {...}",  # java
]
predicted_scores = export_model.predict(inputs)
predicted_labels = get_string_labels(predicted_scores)
for question, label in zip(inputs, predicted_labels):
  print("Question: ", question)
  print("Predicted label: ", label.numpy())

After reading and running the code above:
1. Add code to profile it using the TF Profiler (search for "# TODO: your code here")
1. Take a screenshot of the profiler showing a slow input pipeline (and include the screenshot with your submission)
1. Add code to improve the performance of the input pipeline (search for "# TODO: your code here")
1. Re-run the code, then take a screenshot of the profiler showing that your input pipeline is faster (and include the screenshot with your submission).
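For the profiling step, one common approach is the Keras `TensorBoard` callback with its `profile_batch` argument. The sketch below only constructs the callback; the log directory name and the batch range are hypothetical choices, and you should adapt them to your run.

```python
import datetime

import tensorflow as tf

# Hypothetical log directory; any writable path works.
log_dir = "logs/" + datetime.datetime.now().strftime("%Y%m%d-%H%M%S")

# profile_batch selects the range of training batches to profile.
# The range (10, 20) is an arbitrary choice for illustration.
tensorboard_callback = tf.keras.callbacks.TensorBoard(
    log_dir=log_dir,
    profile_batch=(10, 20))
```

You would then pass `callbacks=[tensorboard_callback]` to `model.fit` and open the results with `%tensorboard --logdir logs`. Note that a cache only pays off after the first full pass over the data, so if you want the profile to reflect a cached pipeline, you may need a batch range that falls after the first epoch (check how your TensorFlow version counts batches).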

# Part 2: Hyperparameter tuning

There are many hyperparameters you can explore for your models (number and width of layers, various activation functions, weight initialization strategies, optimizers, etc). 

One way to tune these for a model you're developing is to use [Keras Tuner](https://keras-team.github.io/keras-tuner/), an open-source hyperparameter tuner that works with TensorFlow, sklearn, and other frameworks.

In this part of the assignment, you will use Keras Tuner to improve a text classification model published on https://tensorflow.org. In the previous part of the assignment, your example code was based on [this tutorial](https://www.tensorflow.org/tutorials/load_data/text) (recently published, just a week ago).

A developer recently updated this tutorial, but they did not tune the `int` text classification model from part one of that tutorial:

```
model = tf.keras.Sequential([
    layers.Embedding(vocab_size, 64, mask_zero=True),
    layers.Conv1D(64, 5, padding="valid", activation="relu", strides=2),
    layers.GlobalMaxPooling1D(),
    layers.Dense(num_labels)
])
```

Above is the `int` model from part one of that tutorial. Although it has 660,868 parameters (compared to just 40,004 for the `binary` model above it in the tutorial!), the validation accuracy of the `int` model is lower. 

**Instructions**

Use Keras Tuner to improve the `int` model. There are several strategies you can use:

* You can explore alternative sizes for the Embedding and Conv1D layers (perhaps more or fewer neurons are helpful?) 
* You can explore different arrangements and types of layers.
* If you like, you can experiment with RNNs (you can find example code [here](https://www.tensorflow.org/tutorials/text/text_classification_rnn)).

You can improve the model in one of several ways (or ideally, multiple). Your improved model can:
* Have higher accuracy on the validation set
* Be smaller (fewer parameters)
* Be faster to train (wall clock time)

Add code below to install Keras Tuner, and optimize this model. After you've finished running your experiments, add a text cell below that lists:
* Your best model (inside ``` quotes ```, to make it easy to read)
* The validation accuracy on the same dataset from that tutorial
* A brief description of the experiments you ran to improve it.

**Recommended reading**

Keras Tuner
* [Blog post](https://blog.tensorflow.org/2020/01/hyperparameter-tuning-with-keras-tuner.html)
* [Tutorial](https://www.tensorflow.org/tutorials/keras/keras_tuner)

In [None]:
#####
# TODO: your code here
# Install Keras Tuner, and use it to run experiments
# to improve the `int` text classification model 
# discussed above.
# Include your complete code with your submission in this notebook
#####

TODO: your write-up here. In this text cell, briefly list:
* Your best model (inside ``` quotes ```)
* The validation accuracy on the same dataset from that tutorial
* A brief description of the experiments you ran to improve it.