In [1]:
import tensorflow as tf
from tensorflow import keras
import numpy as np

## Generating Shakespearean Text using a Character RNN

We'll start exploring NLP by creating an RNN that takes in a sequence of characters and tries to predict the next character in the sequence. Check out this [cool article](https://karpathy.github.io/2015/05/21/rnn-effectiveness/) for examples of what such a model can generate (including Algebraic Geometry)!

We'll train our model on Shakespeare's text.

#### Creating the training set

In [2]:
shakespeare_url = 'https://homl.info/shakespeare'
filepath = keras.utils.get_file('shakespeare.txt', shakespeare_url)
with open(filepath) as f:
    shakespeare_text = f.read()

In [3]:
shakespeare_text[:15]

'First Citizen:\n'

To process the text, we'll encode each character as an integer by using Keras' ```Tokenizer``` class

In [4]:
tokenizer = keras.preprocessing.text.Tokenizer(char_level=True)
tokenizer.fit_on_texts(shakespeare_text)

Now we can give the tokenizer a string and get back its encoding

In [5]:
tokenizer.texts_to_sequences([shakespeare_text[:15]])

[[20, 6, 9, 8, 3, 1, 19, 6, 3, 6, 36, 2, 10, 24, 11]]

In [6]:
max_id = len(tokenizer.word_index)  # number of distinct characters
max_id

39

In [7]:
# total number of characters
dataset_size = tokenizer.document_count
dataset_size

1115394

Encode the text (subtracting 1 to get ranges from 0 to 38)

In [8]:
[encoded] = np.array(tokenizer.texts_to_sequences([shakespeare_text])) - 1

In [9]:
encoded

array([19,  5,  8, ..., 20, 26, 10])

#### Splitting a sequential Dataset

Now we need to split the data into training, validation and test sets. But how do we do that with characters? We can't just shuffle all the characters in the text.

With time series, we generally split on time: for example, we might take years 2000 to 2012 for training, 2013 to 2015 for validation and 2016 to 2018 for testing.

In other cases we can split along other dimensions, such as industry type or country.

Splitting across time is safe; however, it implicitly assumes that patterns learned in the past will still exist in the future, i.e. that the time series is *stationary*. To check that the time series is sufficiently stationary, we can plot the model's error on the validation set across time: if the model performs better on the first part of the validation set than on the last part, then the time series might not be stationary enough. Training the model on a shorter time span might be better.

For the Shakespeare example we'll take the first 90% of the text for the training set, keeping the rest for validation and test.

In [10]:
train_size = dataset_size *90 // 100
dataset = tf.data.Dataset.from_tensor_slices(encoded[:train_size])

#### Chopping data into windows

We'll use ```dataset.window``` to convert the one long instance we have into many smaller windows of text. Every instance in training will be a fairly short substring of the full text and the RNN will be unrolled over the length of the substrings. This is called **truncated backpropagation through time**

In [11]:
n_steps = 100 # tune-able parameter
window_length = n_steps + 1 # target will be the input shifted 1 character ahead
dataset = dataset.window(window_length, shift=1, drop_remainder=True)

Using ```shift=1``` we create overlapping windows: the first window will contain characters 0 to 100, the second 1 to 101 and so on...

```drop_remainder``` ensures all windows are 101 characters long and we don't have to do any padding

```window()``` creates a dataset containing windows, each of which is also a dataset. It is a *nested dataset*. This is useful when we want to transform each window by calling dataset methods but it cannot be used directly for training. The model expects tensors, not datasets. 

The ```flat_map``` method converts a nested dataset into a *flat dataset* (i.e. without nesting). It also takes a function as an argument, which allows you to transform each nested dataset _before_ flattening.

For example, passing ```lambda ds: ds.batch(2)``` to ```flat_map``` turns the nested dataset $\{\{1, 2\}, \{3, 4, 5, 6\}\}$ into the flat dataset $\{[1,2], [3,4], [5,6]\}$.
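To make the windowing concrete, here is a minimal plain-Python sketch of what overlapping windows with `shift=1` and `drop_remainder=True` look like on a toy sequence, and how each window later yields an input and a target shifted by one step. The helper `sliding_windows` and the toy data are hypothetical stand-ins written for illustration; they are not part of `tf.data`.

```python
def sliding_windows(seq, window_length, shift=1):
    """Yield only full-length windows, mimicking drop_remainder=True."""
    for start in range(0, len(seq) - window_length + 1, shift):
        yield seq[start:start + window_length]


encoded_toy = list(range(10))  # stand-in for the encoded text
windows = list(sliding_windows(encoded_toy, window_length=4, shift=1))

# Each window gives an input and a target shifted one step ahead
pairs = [(w[:-1], w[1:]) for w in windows]

print(windows[:2])
print(pairs[0])
```

Every window has exactly `window_length` elements, so no padding is needed, and consecutive windows overlap on all but one element.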

In [12]:
dataset = dataset.flat_map(lambda window: window.batch(window_length))

By passing ```window_length``` to ```batch```, we get a single tensor for each window (all windows have exactly that length). That is, the dataset now contains tensors of 101 characters each. Now we can shuffle the windows, batch them and separate inputs from targets. See Figure 16-1 for a pictorial representation of the process.

In [13]:
batch_size = 32
dataset = dataset.shuffle(10000).batch(batch_size)
dataset = dataset.map(lambda windows: (windows[:, :-1], windows[:, 1:]))

In [14]:
for X, y in dataset.take(1):
    print(X)
    print(y)

tf.Tensor(
[[24  5  9 ...  0  8 13]
 [11 11 23 ...  1 12 26]
 [ 0 16  6 ... 22  0  4]
 ...
 [ 0 19  3 ...  9 17  0]
 [19 19  1 ...  7 13  8]
 [ 9 20  0 ...  0  4  8]], shape=(32, 100), dtype=int32)
tf.Tensor(
[[ 5  9 20 ...  8 13  2]
 [11 23 10 ... 12 26  0]
 [16  6  3 ...  0  4  9]
 ...
 [19  3  8 ... 17  0 15]
 [19  1  8 ... 13  8  1]
 [20  0  3 ...  4  8  1]], shape=(32, 100), dtype=int32)


We'll one-hot encode the inputs, since there are fairly few distinct characters (only 39).

In [15]:
dataset = dataset.map(lambda X_batch, Y_batch: (tf.one_hot(X_batch, depth=max_id), Y_batch))
dataset = dataset.prefetch(1)

#### Building and Training the Char-RNN model

We'll create a model with two GRU layers that feed into a time-distributed Dense layer of 39 units (```max_id```), since we want to output a probability for each possible character at each time step.

**Note:** Running the code below takes a good while...

In [18]:
model = keras.models.Sequential([
    keras.layers.GRU(128, return_sequences=True, input_shape=[None, max_id],
                     dropout=0.2, recurrent_dropout=0.2),
    keras.layers.GRU(128, return_sequences=True,
                     dropout=0.2, recurrent_dropout=0.2),
    keras.layers.TimeDistributed(keras.layers.Dense(max_id, activation='softmax'))
])
model.compile(loss='sparse_categorical_crossentropy', optimizer='adam')
history = model.fit(dataset, epochs=3)
model.save('saved_models/16_nlp_with_rnns/shakespeare.h5')

In [19]:
model = keras.models.load_model('saved_models/16_nlp_with_rnns/shakespeare.h5')

OSError: SavedModel file does not exist at: saved_models/16_nlp_with_rnns/shakespeare.h5/{saved_model.pbtxt|saved_model.pb}

#### Using the Char-RNN model

To pass some inputs to our model, we need to apply the same preprocessing we did for training it

In [20]:
def preprocess(texts):
    X = np.array(tokenizer.texts_to_sequences(texts)) - 1
    return tf.one_hot(X, max_id)

And a test:

In [21]:
X_new = preprocess(['How are yo'])
y_pred = np.argmax(model.predict(X_new), axis=-1) + 1 # Account for the 1 we removed during preprocessing
y_pred

NameError: name 'model' is not defined

The model generates one prediction for each character in the input. Since we're only interested in the prediction for the last character, we simply pick the last prediction.

In [22]:
tokenizer.sequences_to_texts(y_pred)[0][-1] # 1st sentence, last char

NameError: name 'y_pred' is not defined

#### Generating new text

Now that we have a model, we could start using it by feeding in some text, generating a prediction, appending it to the text, feeding it in again, getting a new prediction, and so on.

While this approach does work, it often leads to the same words being repeated over and over again. Instead, we can sample the next character from the predicted distribution using ```tf.random.categorical()```, which generates more diverse and interesting text.

```categorical``` samples random class indices, given the class log probabilities (logits). We can also introduce a notion of *temperature*, which controls how far the samples stray from the most likely characters. We divide the logits by the temperature: a temperature close to 0 favours high-probability characters, a temperature of 1 leaves the predicted distribution unchanged, and a very high temperature gives all characters nearly equal probability.
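A quick way to see what the temperature does is to rescale a small distribution by hand. The sketch below uses only the standard library; `apply_temperature` and the toy probabilities are hypothetical names chosen for illustration. It divides the log probabilities by the temperature and renormalizes with a softmax, which is the distribution that sampling from the rescaled logits would follow.

```python
import math


def apply_temperature(probas, temperature):
    """Return the distribution obtained from log(p) / temperature."""
    logits = [math.log(p) / temperature for p in probas]
    top = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - top) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


probas = [0.7, 0.2, 0.1]
cold = apply_temperature(probas, 0.1)    # sharpened towards the top class
same = apply_temperature(probas, 1.0)    # unchanged
hot = apply_temperature(probas, 100.0)   # close to uniform

print(cold, same, hot)
```

With a low temperature almost all the mass lands on the most likely character; with a very high one the three classes become nearly equiprobable.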

We'll define the ```next_char``` function to do this for us

In [23]:
def next_char(text, model, temperature=1):
    X_new = preprocess([text])
    y_proba = model.predict(X_new)[0, -1:, :]
    rescaled_logits = tf.math.log(y_proba) / temperature
    char_id = tf.random.categorical(rescaled_logits, num_samples=1) + 1
    return tokenizer.sequences_to_texts(char_id.numpy())[0]

And a helper function that will generate the text and fill it in

In [24]:
def complete_text(text, model, n_chars=50, temperature=1):
    for _ in range(n_chars):
        text+= next_char(text, model, temperature)
    return text

In [25]:
print(complete_text('t', model, temperature=0.5))

NameError: name 'model' is not defined

In [26]:
print(complete_text('test', model, temperature=0.5))

NameError: name 'model' is not defined

In [27]:
print(complete_text('adam', model, temperature=0.8))

NameError: name 'model' is not defined

In [28]:
print(complete_text('u', model, temperature=1))

NameError: name 'model' is not defined

In [29]:
print(complete_text('k', model, temperature=2))

NameError: name 'model' is not defined

To get better text we can add more layers, train for longer, or add regularization. The main limitation of this model is that it cannot learn patterns longer than ```n_steps``` (100 characters). Stateful RNNs can help with that.

#### Stateful RNN

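The RNNs discussed so far are *stateless*: at each training iteration the model starts with a hidden state full of zeros, updates it at each time step, and throws it away after the last time step. If we instead preserve the final state after processing one training batch and use it as the initial state for the next batch, the RNN will learn long-term patterns despite only backpropagating through short sequences. This is called a *stateful RNN*.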

It has a catch though: training a stateful RNN only makes sense if each input sequence in a batch starts exactly where the corresponding sequence in the previous batch left off.

Instead of using shuffled data, we will use sequential, non-overlapping sequences. To do this we'll create the dataset using ```shift=n_steps``` and naturally we won't ```shuffle``` the data. 

Batching is where things get a bit harder. If we were to use ```batch(32)```, the first batch would contain windows 1 through 32 and the next batch windows 33 to 64, so the first window of the second batch (window 33) would not start where the first window of the first batch (window 1) left off. One approach to this is to use 'batches' containing a single window.

In [30]:
dataset = tf.data.Dataset.from_tensor_slices(encoded[:train_size])
dataset = dataset.window(window_length, shift=n_steps, drop_remainder=True)
dataset = dataset.flat_map(lambda window: window.batch(window_length))
dataset = dataset.batch(1)
dataset = dataset.map(lambda windows: (windows[:, :-1], windows[:, 1:]))
dataset = dataset.map(
    lambda X_batch, Y_batch: (tf.one_hot(X_batch, depth=max_id), Y_batch))
dataset = dataset.prefetch(1)

Another approach would be to divide the corpus into 32 texts of equal length, create one dataset of consecutive input sequences for each of them and then use ```tf.data.Dataset.zip(tuple(datasets)).map(lambda *windows: tf.stack(windows))``` to create proper batches, where the nth input sequence in a batch starts off exactly where the nth input sequence ended in the previous batch.

In [31]:
batch_size = 32
encoded_parts = np.array_split(encoded[:train_size], batch_size)
datasets = []
for encoded_part in encoded_parts:
    dataset = tf.data.Dataset.from_tensor_slices(encoded_part)
    dataset = dataset.window(window_length, shift=n_steps, drop_remainder=True)
    dataset = dataset.flat_map(lambda window: window.batch(window_length))
    datasets.append(dataset)
dataset = tf.data.Dataset.zip(tuple(datasets)).map(lambda *windows: tf.stack(windows))
dataset = dataset.map(lambda windows: (windows[:, :-1], windows[:, 1:]))
dataset = dataset.map(
    lambda X_batch, Y_batch: (tf.one_hot(X_batch, depth=max_id), Y_batch))
dataset = dataset.prefetch(1)
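To check the continuity property that stateful training relies on, here is a toy plain-Python version of the same idea. All names (`consecutive_windows`, `corpus`, the tiny sizes) are hypothetical and chosen for illustration; the corpus is cut with simple equal slices rather than `np.array_split`.

```python
def consecutive_windows(seq, window_length, shift):
    """Full-length windows whose starts are `shift` elements apart."""
    return [seq[s:s + window_length]
            for s in range(0, len(seq) - window_length + 1, shift)]


corpus = list(range(26))      # stand-in for the encoded text
batch_size, n_steps = 2, 4
part_len = len(corpus) // batch_size
parts = [corpus[k * part_len:(k + 1) * part_len] for k in range(batch_size)]

# One list of windows per part, then zip them so that row k of every
# batch always comes from part k
per_part = [consecutive_windows(p, n_steps + 1, n_steps) for p in parts]
batches = [list(rows) for rows in zip(*per_part)]

for batch in batches:
    print([w[:-1] for w in batch])  # inputs only (targets are w[1:])
```

Row 0 of the inputs reads `[0, 1, 2, 3]`, then `[4, 5, 6, 7]`, and so on, so the state left by one batch is a meaningful starting point for the same row of the next batch.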

Now to create the RNN we must set its ```stateful``` parameter as well as specify the ```batch_input_shape``` for the first layer. We leave the second dimension unspecified since the inputs can have any length.

In [32]:
model = keras.models.Sequential([
    keras.layers.GRU(128, return_sequences=True, stateful=True,
                     dropout=0.2, recurrent_dropout=0.2,
                     batch_input_shape=[batch_size, None, max_id]),
    keras.layers.GRU(128, return_sequences=True, stateful=True,
                     dropout=0.2, recurrent_dropout=0.2),
    keras.layers.TimeDistributed(keras.layers.Dense(max_id, activation='softmax'))
])

We also need to add a little callback to ensure we reset the states at the beginning of each epoch, before going back to the start of the text.

In [33]:
class ResetStatesCallback(keras.callbacks.Callback):
    def on_epoch_begin(self, epoch, logs):
        self.model.reset_states()

In [35]:
model.compile(loss='sparse_categorical_crossentropy', optimizer='adam')
history = model.fit(dataset, epochs=10, callbacks=[ResetStatesCallback()])
model.save('saved_models/16_nlp_with_rnns/stateful_shake.h5')

In [36]:
model = keras.models.load_model('saved_models/16_nlp_with_rnns/stateful_shake.h5')

OSError: SavedModel file does not exist at: saved_models/16_nlp_with_rnns/stateful_shake.h5/{saved_model.pbtxt|saved_model.pb}

Now that this model is trained, it can only make predictions for batches of the same size as the one used during training. To avoid this restriction, we can create a stateless model and copy the stateful model's weights to it.

In [37]:
stateless_model = keras.models.Sequential([
    keras.layers.GRU(128, return_sequences=True, input_shape=[None, max_id]),
    keras.layers.GRU(128, return_sequences=True),
    keras.layers.TimeDistributed(keras.layers.Dense(max_id, activation='softmax'))
])

In [38]:
stateless_model.set_weights(model.get_weights())

And then use it for some predictions!

In [39]:
print(complete_text('u', stateless_model))

uk$mvo
inuyqqohyhf whust  aep  ghhtge.i:p irrto3gob


In [40]:
print(complete_text('adam', stateless_model, temperature=0.8))

adamooauw f ,seiwtehjrtu eoheunh -ntemnhh
nlu lielihsl


## Sentiment Analysis

For this we'll use the IMDB reviews dataset

In [41]:
(X_train, y_train), (X_test, y_test) = keras.datasets.imdb.load_data()



In [42]:
X_train[0][:10]

[1, 14, 22, 16, 43, 530, 973, 1622, 1385, 65]

From the description, the dataset has already been preprocessed, with each word encoded by its frequency rank (low integers correspond to frequent words).

The integers 0, 1 and 2 are special: they represent the padding token, the Start of Sequence (SoS) token and unknown words, respectively. Let's write a function to decode the text.

In [43]:
word_index = keras.datasets.imdb.get_word_index()

In [44]:
def decode(text):
    id_to_word = {id_+3: word for word, id_ in word_index.items()}
    for id_, token in enumerate(('<pad>', '<SoS>', '<unk>')):
        id_to_word[id_] = token
    return " ".join(id_to_word[c] for c in text)

In [45]:
decode(X_train[0][:100])

"<SoS> this film was just brilliant casting location scenery story direction everyone's really suited the part they played and you could just imagine being there robert redford's is an amazing actor and now the same being director norman's father came from the same scottish island as myself so i loved the fact there was a real connection with this film the witty remarks throughout the film were great it was just brilliant so much that i bought the film as soon as it was released for retail and would recommend it to everyone to watch and the fly fishing was"

In a real project we'd have to take care of preprocessing ourselves. Keras's Tokenizer works well for English and many other languages, but there are exceptions (see pg. 535). Even in English, splitting on spaces is not always ideal, for example with 'San Francisco' or '#ILoveDeepLearning'.

One solution came in a paper by [Taku Kudo](https://arxiv.org/abs/1804.10959) that introduced an unsupervised learning technique to tokenize and detokenize text at the subword level in a language-independent way. Google's [SentencePiece](https://github.com/google/sentencepiece) is an implementation of this.

[TF.Text](https://medium.com/tensorflow/introducing-tf-text-438c8552bd5e) contains various tokenization strategies, including [WordPiece](https://arxiv.org/abs/1609.08144).

We'll explore some TF operations to do preprocessing by loading the original IMDB reviews dataset.

In [49]:
import tensorflow_datasets as tfds

datasets, info = tfds.load('imdb_reviews', as_supervised=True, with_info=True)
train_size = info.splits['train'].num_examples

In [50]:
def preprocess(X_batch, y_batch):
    X_batch = tf.strings.substr(X_batch, 0, 300)  # Truncate: keep only the first 300 chars to speed up training
    X_batch = tf.strings.regex_replace(X_batch, b"<br\\s*/?>", b" ")  # Replace <br /> tags with spaces
    X_batch = tf.strings.regex_replace(X_batch, b"[^a-zA-Z']", b" ")  # Replace anything other than letters or apostrophes with spaces
    X_batch = tf.strings.split(X_batch)
    return X_batch.to_tensor(default_value=b"<pad>"), y_batch
                                       

The next step is to construct the vocabulary by going through the dataset and keeping track of word occurrences.

In [51]:
from collections import Counter

vocabulary = Counter()
for X_batch, y_batch in datasets['train'].batch(32).map(preprocess):
    for review in X_batch:
        vocabulary.update(list(review.numpy()))

In [52]:
vocabulary.most_common()[:3]

[(b'<pad>', 214309), (b'the', 61137), (b'a', 38564)]

We'll truncate the vocabulary to the 10,000 most common words

In [53]:
vocab_size = 10_000
truncated_vocabulary = [word for word, count in  vocabulary.most_common()[:vocab_size]]

Then replace each word with its ID from the vocabulary.

In [54]:
words = tf.constant(truncated_vocabulary)
word_ids = tf.range(len(truncated_vocabulary), dtype=tf.int64)
vocab_init = tf.lookup.KeyValueTensorInitializer(words, word_ids)
num_oov_buckets = 1000
table = tf.lookup.StaticVocabularyTable(vocab_init, num_oov_buckets)

This table will serve as our lookup for a given input. In the example below, faaaaantastic is not in the vocabulary, so it is mapped to one of the out-of-vocabulary (OOV) buckets, whose IDs start at 10,000.

In [55]:
table.lookup(tf.constant([b"This move was faaaaantastic".split()]))

<tf.Tensor: shape=(1, 4), dtype=int64, numpy=array([[   22,   951,    11, 10791]])>
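The OOV mechanism can be sketched in a few lines of plain Python. This is only an illustration of the idea: the helper `lookup` and the tiny vocabulary are hypothetical, and `zlib.crc32` is an assumed stand-in for the hash function TensorFlow uses internally, so the bucket numbers will not match the real table.

```python
import zlib


def lookup(word, vocab, num_oov_buckets):
    """Known words keep their ID; unknown words are hashed into a bucket."""
    if word in vocab:
        return vocab[word]
    bucket = zlib.crc32(word.encode("utf-8")) % num_oov_buckets
    return len(vocab) + bucket  # OOV IDs start right after the vocabulary


vocab = {"<pad>": 0, "the": 1, "a": 2}
ids = [lookup(w, vocab, num_oov_buckets=5)
       for w in ["the", "faaaaantastic", "a"]]
print(ids)
```

The same unknown word always lands in the same bucket, but two different unknown words may collide. Using more buckets reduces collisions at the cost of more rows in the embedding matrix.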

**Tip:** check out TF-Transform, in particular ```tft.compute_and_apply_vocabulary```, it goes through the dataset to find vocabulary and generate TF operations required to encode each word with the vocab.

Now to create the final training set we batch the reviews, apply the previously defined ```preprocess``` function and encode the words with the vocabulary we just built.

In [56]:
def encode_words(X_batch, y_batch):
    return table.lookup(X_batch), y_batch

train_set = datasets['train'].batch(32).map(preprocess)
train_set = train_set.map(encode_words).prefetch(1)

In [57]:
for X, y in train_set.take(1):
    print(X)
    print(y)

tf.Tensor(
[[  22   11   28 ...    0    0    0]
 [   6   21   70 ...    0    0    0]
 [4099 6881    1 ...    0    0    0]
 ...
 [  22   12  118 ...  331 1047    0]
 [1757 4101  451 ...    0    0    0]
 [3365 4392    6 ...    0    0    0]], shape=(32, 60), dtype=int64)
tf.Tensor([0 0 0 1 1 1 0 0 0 0 0 1 1 0 1 0 1 1 1 0 1 1 1 1 1 0 0 0 1 0 0 0], shape=(32,), dtype=int64)


And train!

In [58]:
embed_size = 128
model = keras.models.Sequential([
    keras.layers.Embedding(vocab_size + num_oov_buckets, embed_size,
                           input_shape=[None]),
    keras.layers.GRU(128, return_sequences=True),
    keras.layers.GRU(128),
    keras.layers.Dense(1, activation="sigmoid")
])
model.compile(loss="binary_crossentropy", optimizer="adam",
              metrics=["accuracy"])
history = model.fit(train_set, epochs=5)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
117/782 [===>..........................] - ETA: 1:10 - loss: 0.2020 - accuracy: 0.9271

KeyboardInterrupt: 

The first layer is an Embedding layer that converts word IDs into embeddings. Its embedding matrix has one row per word ID (vocab_size + num_oov_buckets) and one column per embedding dimension (128 here, but this is a hyperparameter). Whereas the model's inputs are 2D tensors of shape **[batch_size, time_steps]**, the Embedding layer outputs a 3D tensor of shape **[batch_size, time_steps, embedding_size]**, which is what the recurrent layers expect.
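Conceptually, an embedding lookup is just row indexing into a matrix. The following plain-Python sketch (toy sizes and random values, all names hypothetical) shows how a batch of word IDs of shape [batch_size, time_steps] becomes a nested structure of shape [batch_size, time_steps, embedding_size].

```python
import random

random.seed(0)
n_ids, embed_size = 6, 3  # toy values, far smaller than the real model

# One row per word ID, one column per embedding dimension
embedding_matrix = [[random.uniform(-1, 1) for _ in range(embed_size)]
                    for _ in range(n_ids)]

batch = [[1, 4, 0],
         [2, 2, 5]]  # shape [batch_size=2, time_steps=3]

# Replace every ID by the corresponding row of the matrix
embedded = [[embedding_matrix[word_id] for word_id in sequence]
            for sequence in batch]

print(len(embedded), len(embedded[0]), len(embedded[0][0]))
```

In the real layer the matrix rows are trainable weights, so words that behave similarly for the task end up with similar vectors.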

### Masking

This model will have to learn to ignore padding tokens (which have index zero). Instead, we can set the Embedding layer's ```mask_zero``` parameter to True so that tokens whose ID is 0 are masked. (Note: in this case the pad token has index zero because it is the most common token; in general it is a good idea to map padding to 0.) With this, padding is ignored by the layers downstream!

With this parameter set to True, the Embedding layer creates a *mask tensor* equal to K.not_equal(inputs, 0): a boolean tensor with entries equal to False where inputs==0 and True otherwise. This mask is propagated downstream as long as the time dimension is preserved. For example, the first GRU layer outputs sequences so it preserves the mask. The second doesn't, so the mask is not transmitted to the dense layer. Each layer may handle the mask differently, but in general they simply ignore masked time steps.

For example, a recurrent layer simply copies the output from the previous time step when it encounters a masked time step. If the mask propagates all the way to the output (e.g. in a vec-to-seq or seq-to-seq architecture), then the masked time steps will not contribute to the loss.
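The copy-the-previous-output behaviour can be illustrated with a toy recurrence in plain Python. `mask_of` and `toy_rnn` are hypothetical helpers written for this sketch; the update rule is arbitrary and only stands in for a real recurrent cell.

```python
def mask_of(ids):
    """True for real tokens, False for padding (ID 0)."""
    return [i != 0 for i in ids]


def toy_rnn(values, mask):
    """Update the state on real steps; repeat the last output on masked steps."""
    state, outputs = 0.0, []
    for value, keep in zip(values, mask):
        if keep:
            state = 0.5 * state + value  # arbitrary toy update rule
        outputs.append(state)
    return outputs


ids = [3, 7, 2, 0, 0]  # a sequence padded with two zeros
mask = mask_of(ids)
outputs = toy_rnn([float(i) for i in ids], mask)
print(mask)
print(outputs)
```

The final output is the one computed at the last real token, which is why the padding no longer influences the prediction.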

Note: LSTMs and GRUs have an implementation optimized for GPUs, based on Nvidia's cuDNN library; however, this implementation does not support masking. The optimized implementation also requires you to use the default values for several hyperparameters: activation, recurrent_activation, etc.

All layers that receive the mask must support masking (else an exception is raised). All recurrent layers, as well as TimeDistributed and a few other layers support this. 

See pg 539 for more info

Masking works best for simple sequential models; it doesn't always work for more complex models (e.g. if you need to mix a Conv1D layer with recurrent layers). In such cases you will need to explicitly compute the mask.

Below is an example that builds the same model as above, but using the Functional API and handling masks manually.

In [63]:
import os
root_logdir = os.path.join(os.curdir, 'my_logs')

def get_run_logdir():
    import time
    run_id = time.strftime("run_%Y_%m_%d-%H_%M_%S")
    return os.path.join(root_logdir, run_id)

callbacks = [keras.callbacks.TensorBoard(get_run_logdir())]

In [55]:
K = keras.backend
inputs = keras.layers.Input(shape=[None])
mask = keras.layers.Lambda(lambda inputs: K.not_equal(inputs, 0))(inputs)
z = keras.layers.Embedding(vocab_size + num_oov_buckets, embed_size)(inputs)
z = keras.layers.GRU(128, return_sequences=True)(z, mask=mask)
z = keras.layers.GRU(128)(z, mask=mask)
outputs = keras.layers.Dense(1, activation='sigmoid')(z)

model = keras.models.Model(inputs=[inputs], outputs=[outputs])
model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])
history = model.fit(train_set, epochs=5, callbacks=callbacks)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


In [59]:
test_set = datasets['test'].batch(32).map(preprocess)
test_set = test_set.map(encode_words).prefetch(1)

Let's generate predictions for 1 batch

In [60]:
for batch_X, batch_y in test_set.take(1):
    y_preds = np.round(model.predict(batch_X)).flatten()  # predict on the batch we are iterating over
    print('True values\n', batch_y)
    print('\nPredictions\n', y_preds)

True values
 tf.Tensor([1 1 0 0 1 1 1 1 0 1 0 0 1 0 1 0 1 0 1 0 0 1 0 0 1 1 0 0 0 1 1 1], shape=(32,), dtype=int64)

Predictions
 [1. 1. 0. 1. 1. 1. 1. 1. 1. 0. 1. 0. 1. 1. 1. 1. 1. 1. 1. 0. 0. 1. 1. 0.
 1. 0. 0. 1. 1. 1. 1. 0.]


Not too shabby, now let's look at Pretrained Embeddings!

### Reusing Pretrained Embeddings

Using pre-trained models is pretty easy with tensorflow_hub. For this example we'll use ```nnlm-en-dim50``` sentence embedding module version 1. 

This module is a sentence encoder: it takes strings as input and encodes each one as a single vector (a 50-dimensional vector in this case). Internally, it parses the string and embeds each word using an embedding matrix that was pretrained on the Google News 7B corpus. It then computes the mean of all the word embeddings, and the result is the sentence embedding.
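The averaging step can be sketched with toy two-dimensional word vectors. Everything here is hypothetical (the vectors, the helper `sentence_embedding`, and the choice to skip unknown words); it only illustrates the plain mean described above, not the module's actual internals.

```python
word_vectors = {
    "great": [1.0, 0.0],
    "movie": [0.0, 1.0],
    "really": [0.5, 0.5],
}


def sentence_embedding(sentence, word_vectors):
    """Mean of the vectors of the known words (unknown words are skipped)."""
    vectors = [word_vectors[w] for w in sentence.lower().split()
               if w in word_vectors]
    n_dims = len(next(iter(word_vectors.values())))
    if not vectors:
        return [0.0] * n_dims
    return [sum(v[k] for v in vectors) / len(vectors) for k in range(n_dims)]


embedding = sentence_embedding("Really great movie", word_vectors)
print(embedding)
```

Because the result is a mean, the sentence vector has the same size whatever the sentence length, which is what lets plain Dense layers sit directly on top of it.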

By default ```hub.KerasLayer``` is not trainable, but we can set ```trainable=True``` when creating it to make it trainable.

In [61]:
import tensorflow_hub as hub

model = keras.Sequential([
    hub.KerasLayer("https://tfhub.dev/google/tf2-preview/nnlm-en-dim50/1",
                   dtype=tf.string, input_shape=[], output_shape=[50]),
    keras.layers.Dense(128, activation='relu'),
    keras.layers.Dense(1, activation='sigmoid')
])

model.compile(loss="binary_crossentropy", optimizer='adam', metrics=['accuracy'])

Reload the dataset before training

In [64]:
datasets, info = tfds.load("imdb_reviews", as_supervised=True, with_info=True)
train_size = info.splits["train"].num_examples
batch_size = 32
train_set = datasets["train"].repeat().batch(batch_size).prefetch(1)
history = model.fit(train_set, steps_per_epoch=train_size // batch_size, epochs=5,
                    callbacks=[keras.callbacks.TensorBoard(get_run_logdir())])

Epoch 1/5
Instructions for updating:
use `tf.profiler.experimental.stop` instead.


Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


## An Encoder-Decoder Network for Neural Machine Translation

We'll explore text translation by building a simple [neural machine translation model](https://arxiv.org/abs/1409.3215). The picture and explanation on page 543 are the best way to understand it.

In summary, we feed the source sentence to the encoder and (during training) the correct translation, shifted back one step, to the decoder: at each step the decoder receives the word it should have output at the previous step. We also reverse the source sentence before feeding it to the encoder, so that the beginning of the sentence is fed last.
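The shifting and reversing can be shown on a single sentence pair. This is a minimal sketch with hypothetical token names (`<sos>` and `<eos>` mark the start and end of the target sequence); a real pipeline would work on word IDs rather than strings.

```python
source = ["I", "drink", "milk"]
target = ["Je", "bois", "du", "lait", "<eos>"]

# The encoder reads the source sentence in reverse order
encoder_inputs = source[::-1]

# During training the decoder sees the target shifted back by one step:
# its input at step t is the word it should have output at step t - 1
decoder_inputs = ["<sos>"] + target[:-1]

for step, (given, expected) in enumerate(zip(decoder_inputs, target)):
    print(step, given, "->", expected)
```

At each step the decoder is asked to predict the next target word given the correct previous one, regardless of what it actually predicted before.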

At each time step the decoder outputs a score for each word in the output vocabulary; the softmax function turns these scores into probabilities, and the word with the highest probability is output. Training can be done using the ```sparse_categorical_crossentropy``` loss.

At inference, we don't have the target output, so at each time step, we input the output word of the previous time step.

The book mentions a couple of extra notes before implementing the model as well as the following:
- When the output vocabulary is large, outputting a probability for every word is very slow (e.g. with a vocabulary of 50,000 words, computing the softmax over such a large vector is computationally intensive). A solution is to use a *sampled softmax*, which looks only at the logits of the correct word and of a random sample of incorrect words, and computes an approximation of the loss based on these logits. Use this for training and the usual softmax at inference.

See book for full code and explanation. No dataset was given for this so I'm skipping this bit.

#### Bidirectional RNNs

A regular recurrent layer only looks at past and present inputs before generating its output; it doesn't look ahead. This makes sense for time series forecasting, but for NLP it often helps to look ahead before encoding a given word. For example, the meaning of the word queen in the following phrases is defined by what comes after it: "Queen of the UK", "the queen of hearts", "the queen bee".

A bidirectional layer is simply two recurrent layers running on the same inputs, one reading from left to right and the other from right to left. Their outputs are combined at each time step (typically by concatenating them). We can wrap a recurrent layer with ```keras.layers.Bidirectional``` to make it bidirectional.

In [61]:
keras.layers.Bidirectional(keras.layers.GRU(10, return_sequences=True))

<tensorflow.python.keras.layers.wrappers.Bidirectional at 0x7f8807771d30>

Although the GRU above has 10 units, the Bidirectional wrapper creates two GRUs (one per direction), so its outputs have 20 values per time step.
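Why the output width doubles can be seen with a toy "recurrent layer" that has a single unit. In this plain-Python sketch the hypothetical `running_sum` plays the role of a one-unit recurrent layer, and `bidirectional` runs it in both directions and concatenates the results per time step.

```python
def running_sum(xs):
    """A one-unit toy recurrent layer: one output value per time step."""
    total, outputs = 0, []
    for x in xs:
        total += x
        outputs.append([total])
    return outputs


def bidirectional(xs):
    forward = running_sum(xs)
    # Run on the reversed input, then flip back to align the time steps
    backward = running_sum(xs[::-1])[::-1]
    return [f + b for f, b in zip(forward, backward)]


outputs = bidirectional([1, 2, 3])
print(outputs)
```

Each time step now carries one value summarizing the past and one summarizing the future, so a layer with 10 units per direction yields 20 values per step.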

#### Beam Search

Page 547 explains Beam Search, which is a way to expand the space of possible output sentences. Best to read this

## Attention Mechanisms

Attention is a way to tell the machine to focus on a specific word at a particular time step. To construct it, we will take the existing encoder-decoder architecture and make a slight modification.

Instead of sending only the encoder's final hidden state to the decoder, we send all of its outputs. At each time step the decoder computes a weighted sum of these outputs, and the weights determine which words it focuses on at this step. To compute the weights we add an *alignment model* (or *attention layer*): a time-distributed Dense layer that computes a score (or energy) for each encoder output, measuring how well that output is aligned with the decoder's previous hidden state. The scores then go through a softmax layer, which yields one weight per encoder output; all the weights add up to 1 (figure on page 550).

This mechanism is called [*Bahdanau attention*](https://arxiv.org/abs/1409.0473) or *concatenative (additive) attention*.

[*Luong attention*](https://arxiv.org/abs/1508.04025) or *multiplicative attention* was proposed later; it changes how the similarity between the decoder's hidden state and the encoder outputs is computed. The authors propose simply computing the dot product of the two vectors, which is often a good similarity measure and is quicker to compute. The dot product scores also go through a softmax layer to produce the final weights.

They also propose using the decoder's hidden state at the current time step ($\textbf{h}_{(t)}$) rather than at the previous time step ($\textbf{h}_{(t-1)}$), then using the output of the attention mechanism ($\widetilde{\textbf{h}}_{(t)}$) directly to compute the decoder's predictions (rather than using it to compute the decoder's current hidden state).

Finally, they propose a variation on the dot product, where the encoder outputs first go through a linear transformation (a time distributed Dense layer without bias term) before the dot products are computed. This is called the general dot product approach. They found that the dot product variants performed better than concatenative attention. 

Below are the equations for each mechanism:

$$ \widetilde{\mathbf{h}}_{(t)} = \sum_i\alpha_{(t,i)} \mathbf{y}_{(i)}$$

with

$$ \alpha_{(t, i)} = \frac{\exp(e_{(t,i)})}{\sum_{i'}\exp(e_{(t,i')})}$$

and 

$$ e_{(t,i)} = \begin{cases}
                \textbf{h}_{(t)}^{\intercal}\textbf{y}_{(i)} & & \text{dot}  \\
                \textbf{h}_{(t)}^{\intercal}\mathbf{W}\textbf{y}_{(i)} & & \text{general}\\
                \textbf{v}^{\intercal}\text{tanh}(\textbf{W}[\textbf{h}_{(t)};\textbf{y}_{(i)}]) & & \text{concat}
               \end{cases}$$
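As a worked instance of the equations above, here is a minimal plain-Python sketch of the *dot* variant: it computes the scores $e_{(t,i)}$, the weights $\alpha_{(t,i)}$ and the weighted sum of the encoder outputs. The function names and the toy vectors are hypothetical, and real implementations operate on batched tensors instead of lists.

```python
import math


def softmax(scores):
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot_attention(h, encoder_outputs):
    """Return the attention weights and the weighted sum of encoder outputs."""
    scores = [sum(a * b for a, b in zip(h, y)) for y in encoder_outputs]
    weights = softmax(scores)
    n_dims = len(encoder_outputs[0])
    context = [sum(w * y[k] for w, y in zip(weights, encoder_outputs))
               for k in range(n_dims)]
    return weights, context


h_t = [1.0, 0.0]                              # decoder hidden state
ys = [[2.0, 0.0], [0.0, 2.0], [-2.0, 0.0]]    # encoder outputs
alphas, h_tilde = dot_attention(h_t, ys)
print(alphas, h_tilde)
```

The encoder output that points in the same direction as the decoder state receives the largest weight, so it dominates the weighted sum.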

To add attention mechanisms to an Encoder-Decoder model, use TensorFlow Addons:

```
import tensorflow_addons as tfa

attention_mechanism = tfa.seq2seq.attention_wrapper.LuongAttention(
    n_units, encoder_outputs, memory_sequence_length=encoder_sequence_length)

attention_decoder_cell = tfa.seq2seq.attention_wrapper.AttentionWrapper(
    decoder_cell, attention_mechanism, attention_layer_size=n_units)
```

#### Visual Attention

The attention mechanism was also adopted for generating image captions. Page 553 explains a little bit of what happens there, along with a nice pictorial representation.

### Explainability

An added bonus of attention mechanisms is that they can help us understand what led the model to produce its output. [This paper from 2016](https://arxiv.org/pdf/1602.04938.pdf) suggests an approach to explainability: learning an interpretable model locally around a classifier's prediction.

### Attention is all you need: the Transformer architecture

A [groundbreaking paper from 2017](https://arxiv.org/abs/1706.03762) laid the foundations for a new type of architecture called the *Transformer*, which obtained incredible results without recurrent or convolutional layers. Page 555 shows the Transformer architecture, along with its explanation. I won't reproduce it here, since the explanation relies on the picture.

### Positional Encodings

One of the novel features of the Transformer architecture is the use of Positional Encodings. These are dense vectors that encode the position of a word within a sentence. The authors used a fixed positional encoding, its advantage over a learned positional encoding being that it can be extended to arbitrarily long sentences. 

The positional encoding $\text{PE}_{p,i}$, i.e. the *i*th component of the encoding for the word located at the *p*th position in the sentence (with $d$ the number of dimensions of the encoding), is given by

$$ \text{PE}_{p,i} = \begin{cases} \sin(p/10000^{i/d}) & \text{if } i \text{ is even} \\ \cos(p/10000^{(i-1)/d}) & \text{if } i \text{ is odd} \end{cases}$$

Explanation along with image on page 557
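The formula can be evaluated directly for a single position. This plain-Python sketch (the function name `positional_encoding` is hypothetical) follows the two cases above literally, without the vectorization used in a real layer.

```python
import math


def positional_encoding(p, d):
    """Encoding of position p as a list of d components."""
    return [math.sin(p / 10000 ** (i / d)) if i % 2 == 0
            else math.cos(p / 10000 ** ((i - 1) / d))
            for i in range(d)]


pe_first = positional_encoding(0, 4)
pe_second = positional_encoding(1, 4)
print(pe_first)
print(pe_second)
```

Position 0 always maps to alternating zeros and ones, and each pair of components oscillates at its own frequency as the position grows, which gives every position a distinct pattern.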

We can define our own class for Positional Encoding

In [None]:
class PositionalEncoding(keras.layers.Layer):
    def __init__(self, max_steps, max_dims, dtype=tf.float32, **kwargs):
        super().__init__(dtype=dtype, **kwargs)
        if max_dims % 2 == 1: max_dims += 1 # max_dims must be even
        p, i = np.meshgrid(np.arange(max_steps), np.arange(max_dims // 2))
        pos_emb = np.empty((1, max_steps, max_dims))  # Adding an extra dimension to use broadcasting rules on call
        pos_emb[0, :, ::2] = np.sin(p / 10000**(2 * i / max_dims)).T
        pos_emb[0, :, 1::2] = np.cos(p / 10000**(2 * i / max_dims)).T
        self.positional_encoding = tf.constant(pos_emb.astype(self.dtype))

    def call(self, inputs):
        shape = tf.shape(inputs)
        return inputs + self.positional_encoding[:, :shape[-2], :shape[-1]]

And then create the first layers of the Transformer

In [None]:
embed_size = 512
max_steps = 500
vocab_size = 10000

encoder_inputs = keras.layers.Input(shape=[None], dtype=np.int32)
decoder_inputs = keras.layers.Input(shape=[None], dtype=np.int32)
embeddings = keras.layers.Embedding(vocab_size, embed_size)
encoder_embeddings = embeddings(encoder_inputs)
decoder_embeddings = embeddings(decoder_inputs)
positional_encoding = PositionalEncoding(max_steps, max_dims=embed_size)
encoder_in = positional_encoding(encoder_embeddings)
decoder_in = positional_encoding(decoder_embeddings)

### Multi-head attention

Let's start by looking at the *Scaled Dot-Product Attention* layer. Suppose the encoder analyzed the sentence "*They played chess*" and was able to understand that the word "They" is the subject and "played" is the verb, and encoded this information in the representations of these words. Suppose also that the decoder has already translated "They" and thinks it should translate the verb next; to do this, it needs to fetch the verb from the input sentence. This works like a dictionary lookup: it is as if the encoder created a dictionary {'subject':'They', 'verb':'played', ...} and the decoder wanted to look up the value for the key 'verb'.

The model, however, doesn't have discrete tokens to represent the keys. It has vectorized representations of these concepts, so the key it uses for the lookup (called the *query*) will not perfectly match any key in the dictionary. To solve this, we compute a similarity measure between the query and each key in the dictionary, then use softmax to convert the similarities to weights that add up to 1. If the key that represents the verb is by far the most similar to the query, its weight will be close to 1. The model then computes a weighted sum of the corresponding values, so the result will be close to the representation of the word 'played' in this example.

The similarity measure used is the dot product, like in Luong attention. The equation is shown below

$$ \text{Attention}(\mathbf{Q, K, V}) = \text{softmax}\Big( \frac{\mathbf{QK}^\intercal}{\sqrt{d_{keys}}}\Big)\mathbf{V} $$

Where 
- $\mathbf{Q}$ is a matrix with one row per query, of shape \[$n_{queries}, d_{keys}$\], where $d_{keys}$ is the number of dimensions of each query and each key
- $\mathbf{K}$ is a matrix with one row per key, of shape \[$n_{keys}, d_{keys}$\]
- $\mathbf{V}$ is a matrix with one row per value, of shape \[$n_{keys}, d_{values}$\], where $d_{values}$ is the number of dimensions of each value
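The equation can be checked on tiny matrices with a plain-Python sketch. The function below is a hypothetical, unbatched illustration using nested lists (one row per query, key or value, as in the definitions above); it is not how a production layer would be written.

```python
import math


def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_keys)) V, computed row by row."""
    d_keys = len(K[0])
    n_values = len(V[0])
    outputs = []
    for q in Q:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_keys)
                  for k in K]
        top = max(scores)  # subtract the max for numerical stability
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        outputs.append([sum(w * v[j] for w, v in zip(weights, V))
                        for j in range(n_values)])
    return outputs


Q = [[10.0, 0.0]]                      # one query, d_keys = 2
K = [[1.0, 0.0], [0.0, 1.0]]           # two keys
V = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]] # two values, d_values = 3
result = scaled_dot_product_attention(Q, K, V)
print(result)
```

The query is far more similar to the first key than to the second, so the output is almost exactly the first value row, which is the soft dictionary lookup described above.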