Alan Turing's famous Turing test helps evaluate whether a machine's intelligence matches a human's. Turing called it the imitation game: a machine has to try to fool a human into thinking it is human.

Recurrent Neural Networks (RNNs) are a common approach to language tasks, and they can be applied in several ways:

- A Character RNN that predicts the next character in a sentence, first as a Stateless RNN and then as a Stateful RNN.
- Sentiment Analysis, which extracts the feeling expressed in a sentence.
- Neural Machine Translation (NMT), which translates text between languages.

We will also look at how to boost RNN performance with an Encoder-Decoder architecture and Attention Mechanisms, which allow the network to focus on a select part of the inputs at each time step.

Finally, we will look at the Transformer, a very successful NLP architecture, before discussing GPT-2 and BERT.

In [1]:
import sys
assert sys.version_info >= (3, 5)  # this notebook requires Python 3.5+

import numpy as np
import tensorflow as tf
assert tf.__version__ >= "2.0"  # this notebook requires TensorFlow 2.x
from tensorflow import keras
import matplotlib.pyplot as plt

# Shakespeare Dataset

Below is an example of how to work with text data: converting it to IDs with a tokenizer, and splitting it into sets, because sequential text cannot be shuffled the way tabular data can.

In [None]:
shakespeare_url = "https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt"
filepath = keras.utils.get_file("shakespeare.txt", shakespeare_url)
with open(filepath) as f:
    shakespeare_text = f.read()

Downloading data from https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt


In [None]:
print(shakespeare_text[60:250])



All:
Speak, speak.

First Citizen:
You are all resolved rather to die than to famish?

All:
Resolved. resolved.

First Citizen:
First, you know Caius Marcius is chief enemy to the people.



In [None]:
"".join(sorted(set(shakespeare_text.lower()))) # list of characters within dataset

"\n !$&',-.3:;?abcdefghijklmnopqrstuvwxyz"

## Tokenize Text data

In [None]:
# convert all characters into a unique character ID
tokenizer = keras.preprocessing.text.Tokenizer(char_level=True)
tokenizer.fit_on_texts(shakespeare_text)

In [None]:
tokenizer.texts_to_sequences('Romeo')

[[9], [4], [15], [2], [4]]

In [None]:
"".join(tokenizer.sequences_to_texts([[9], [4], [15], [2], [4]]))

'romeo'

In [None]:
max_id = len(tokenizer.word_index) # number of distinct characters
dataset_size = tokenizer.document_count # total number of characters

Note that the tokenizer assigns IDs from 1 to 39, so when we convert the entire text to IDs we subtract 1 to get IDs from 0 to 38.


In [None]:
[encoded] = np.array(tokenizer.texts_to_sequences([shakespeare_text])) - 1

Before we talk about how to split the text data into training, validation and test sets, let's first talk about how to split time series data.

##### Splitting Time series data

The safest way is to split the data across time. For example, take the years 2000 to 2016 as the training set, 2017 to 2019 as the validation set, and leave 2020 to 2021 as the test set. Ensure there is no overlap between the sets.

There are two problems: correlation between time series, and the assumption that your data is stationary.
- **Correlation** between time series can lead to an optimistically biased generalization error, because the training and test sets both contain time series that are correlated with each other. In these scenarios we should avoid having correlated time series across the training and test sets.

- Assuming that your data is a **stationary** time series (i.e. the mean, variance and autocorrelation do not change over time). This assumption works well for most time series, but in some the patterns disappear over time. In these scenarios we benefit from training the model on short time spans. You can plot the model's error on the validation set over time; if the errors increase towards the end of the data, the data is not stationary enough.

For example, if you have financial data for many companies, some companies are well correlated because of the sectors they are in. Traders exploit these correlations once they spot them, which can make the patterns disappear soon afterwards. The correlation, alongside the non-stationary nature of the data, prevents us from obtaining a generalizable model.

Ultimately, how you split time series data depends on the task at hand. 
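As a minimal sketch of the split across time described above, the snippet below uses a made-up yearly series (the `years` and `values` arrays are assumptions for illustration, not data from this notebook) and selects each set with a mask on the year, without shuffling.

```python
import numpy as np

# Hypothetical data: one observation per year from 2000 to 2021.
years = np.arange(2000, 2022)
values = np.random.default_rng(42).normal(size=len(years))

# Split across time: no shuffling and no overlap between the sets.
train_mask = years <= 2016
valid_mask = (years >= 2017) & (years <= 2019)
test_mask = years >= 2020

X_train = values[train_mask]
X_valid = values[valid_mask]
X_test = values[test_mask]

print(len(X_train), len(X_valid), len(X_test))
```

Because each mask covers a disjoint range of years, every observation lands in exactly one set.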




## Splitting Sequential Text data

Splitting text data is fairly simple: the sets must not overlap, and it is a good idea to leave a gap between them so that a paragraph does not straddle two sets.

In [None]:
train_size = dataset_size * 90 // 100 # take the first 90% of the characters for training (integer division)
dataset = tf.data.Dataset.from_tensor_slices(encoded[:train_size])

The `dataset` is now a single sequence of over one million characters. Recall how RNNs work from the previous notebook. If we trained the neural network on it directly, it would be equivalent to training a deep neural network with over a million layers - with only one (very long) instance!

Instead, we need to convert this dataset into smaller windows of text. The window length is the maximum pattern length the RNN can learn. The RNN will only be unrolled over the length of these substrings; this is called **Truncated Backpropagation Through Time (TBPTT)**. Read [this](https://www.quora.com/Whats-the-key-difference-between-backprop-and-truncated-backprop-through-time) Quora answer to understand the difference between backpropagation through time and its truncated version.


In [None]:
n_steps = 100
window_length = n_steps + 1
dataset = dataset.window(size=window_length, shift=1, drop_remainder=True)

The `shift` argument sets the offset between consecutive windows to 1 character. For example, the first window covers characters 0 to 100, the next covers 1 to 101, and so on. Setting `drop_remainder=True` ensures every window has exactly `size` characters. Otherwise, the last 100 windows would shrink from 100 characters down to 1.
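To make the effect of `shift` and `drop_remainder` concrete, here is a plain-Python sketch on a six-element list. The helper `sliding_windows` is hypothetical and only imitates the windowing behaviour described above; it is not part of `tf.data`.

```python
def sliding_windows(sequence, size, shift=1, drop_remainder=True):
    """Hypothetical helper imitating window(size, shift, drop_remainder)."""
    windows = []
    for start in range(0, len(sequence), shift):
        window = sequence[start:start + size]
        if drop_remainder and len(window) < size:
            break  # all later windows would be short as well
        windows.append(window)
    return windows

toy = list(range(6))

# Every window has exactly 3 elements.
full = sliding_windows(toy, size=3, shift=1, drop_remainder=True)

# Without drop_remainder the trailing windows shrink towards length 1.
ragged = sliding_windows(toy, size=3, shift=1, drop_remainder=False)

print(full)
print([len(w) for w in ragged])
```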

In [None]:
dataset # datasets within a dataset

<WindowDataset shapes: DatasetSpec(TensorSpec(shape=(), dtype=tf.int64, name=None), TensorShape([])), types: DatasetSpec(TensorSpec(shape=(), dtype=tf.int64, name=None), TensorShape([]))>

In [None]:
# we now need to flatten it, as the model only accepts tensors
dataset = dataset.flat_map(lambda window: window.batch(window_length))

# flat_map flattens the nested dataset into a flat dataset
# the lambda batches each window into a single tensor of window_length elements

# for example, if 
# example = {{1, 2}, {3, 4, 5, 6}, {7, 8, 9, 10}}
# then example.flat_map(lambda eg: eg.batch(2)), would become
# {{1, 2}, {3, 4}, {5, 6}, {7, 8}, {9, 10}}

Now that the dataset is in the right shape, we can shuffle these windows so that gradient descent sees instances that are independent and identically distributed across the training set.

In [None]:
batch_size = 32
dataset = dataset.shuffle(10000, seed=42).batch(batch_size)
dataset = dataset.map(lambda windows: (windows[:, :-1], windows[:, 1:])) # X, y

In [None]:
z = [1, 2, 3, 4, 5]
(z[: -1], z[1: ]) # the targets are the inputs shifted one step ahead (the next character)

([1, 2, 3, 4], [2, 3, 4, 5])

In [None]:
# one hot encode the dataset as there are not many unique characters ~ 39
dataset = dataset.map(
    lambda X_batch, y_batch: (tf.one_hot(X_batch, depth=max_id), y_batch))

# calling prefetch allows later elements to be prepared while the current element is being processed
dataset = dataset.prefetch(1)

In [None]:
for X_batch, y_batch in dataset.take(1):
  print(X_batch.shape, y_batch.shape)

(32, 100, 39) (32, 100)


## Build Model - Char RNN

We can train a model on all of Shakespeare's work and then use it to predict the next character in a sentence. This can be used to produce novel text and is pretty fun to read about.

Read this blog by Andrej Karpathy: https://karpathy.github.io/2015/05/21/rnn-effectiveness/

In [None]:
model = keras.models.Sequential([
  keras.layers.GRU(128, return_sequences=True, input_shape=[None, max_id], 
                   dropout=0.2,), # recurrent_dropout=0.2), #  prevents GPU support
  keras.layers.GRU(128, return_sequences=True,
                   dropout=0.2,), # recurrent_dropout=0.2), #  prevents GPU support
  keras.layers.TimeDistributed(keras.layers.Dense(max_id, activation='softmax'))
])

model.compile(loss='sparse_categorical_crossentropy', optimizer='adam')
model.fit(dataset, epochs=5) #  this cell will take a long time to run!

Epoch 1/5
   8063/Unknown - 1602s 198ms/step - loss: 1.7187

KeyboardInterrupt: ignored

## Make Predictions

In [None]:
def preprocess(texts):
  """
  Function that preprocesses text data and returns one hot encoded data. 
  """
  X = np.array(tokenizer.texts_to_sequences(texts)) - 1
  return tf.one_hot(X, max_id)

X_new = preprocess(['great new'])
Y_pred = np.argmax(model(X_new), axis=-1)
print(tokenizer.sequences_to_texts(Y_pred + 1)[0][-1]) # print first sentence last character

s


Although it is amusing and satisfying to have predicted the next character, always picking the most likely character does not work well in practice, because the model tends to repeat the same words over and over again.

Instead, we can pick the next character randomly, with a probability given by the model's estimates, which generates more diverse and interesting text. We can use the `tf.random.categorical` function, which samples from logits; dividing the logits by a hyperparameter called the temperature controls the randomness. Values close to 0 favour high-probability characters, while very high values give all characters nearly equal probability.
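The effect of the temperature can be sketched in NumPy. This is a stand-in for illustration, not a call to `tf.random.categorical`; the function `sample_with_temperature` and the three-class probability vector are assumptions made up for this example.

```python
import numpy as np

def sample_with_temperature(probas, temperature=1.0, rng=None):
    """Hypothetical helper: rescale probabilities by a temperature, then sample."""
    if rng is None:
        rng = np.random.default_rng()
    logits = np.log(probas) / temperature   # divide the logits by the temperature
    logits = logits - logits.max()          # numerical stability before exponentiating
    rescaled = np.exp(logits)
    rescaled = rescaled / rescaled.sum()    # softmax of the rescaled logits
    sampled_id = rng.choice(len(probas), p=rescaled)
    return sampled_id, rescaled

probas = np.array([0.6, 0.3, 0.1])  # made-up model output for three characters

_, cold = sample_with_temperature(probas, temperature=0.2)
_, hot = sample_with_temperature(probas, temperature=5.0)

print(cold.round(3))  # almost all mass on the most likely character
print(hot.round(3))   # close to a uniform distribution
```

A low temperature sharpens the distribution towards the most likely character, while a high temperature flattens it.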

The model works for short sequences, but to learn patterns that span more than one window you can use Stateful RNNs.

# Stateful RNN

So far we have trained Stateless RNNs: at each training iteration the model starts with a hidden state full of zeros, updates it at each time step, and discards it after the last time step.

![Stateless vs Stateful](https://media.springernature.com/original/springer-static/image/chp%3A10.1007%2F978-3-030-27947-9_24/MediaObjects/480892_1_En_24_Fig3_HTML.png)

Stateful RNNs reuse the state between batches instead of reinitializing them. This can allow the model to learn long term patterns.
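The difference can be illustrated with a toy single-unit recurrent cell in NumPy. The weights and the two small batches below are made up for illustration; this is only a sketch of the idea, not how Keras implements stateful layers.

```python
import numpy as np

def run_rnn(sequence, h, w_x=0.5, w_h=0.9):
    """Toy single-unit recurrent cell with hypothetical fixed weights."""
    for x_t in sequence:
        h = np.tanh(w_x * x_t + w_h * h)
    return h

batch_1 = np.array([1.0, 0.5, -0.3])
batch_2 = np.array([0.2, 0.8, -0.1])

# Stateless: every batch starts from a zero state.
stateless_h = run_rnn(batch_2, h=0.0)

# Stateful: the second batch starts from the state left by the first.
carried = run_rnn(batch_1, h=0.0)
stateful_h = run_rnn(batch_2, h=carried)

# Carrying the state is equivalent to processing one long sequence.
full_h = run_rnn(np.concatenate([batch_1, batch_2]), h=0.0)

print(np.isclose(stateful_h, full_h), np.isclose(stateless_h, stateful_h))
```

The stateful result matches a single pass over the concatenated sequence, which is why the second batch must continue exactly where the first one ended.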

One thing we need to change about the input dataset when using Stateful RNNs is how the batches are built: each input sequence in a batch must start exactly where the corresponding sequence in the previous batch left off. There should be no overlap between the input sequences like we saw with the `window()` method earlier. Stateful RNNs require sequential and non-overlapping input sequences.

Unfortunately, this is not easy to do and requires a lot of code.
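Before the full `tf.data` pipeline, the required batch layout can be seen on a toy array. This NumPy sketch uses made-up sizes (24 "characters", 2 sequences per batch, 3 steps) and a hypothetical helper `make_windows`; it only mirrors the idea of splitting the text into `batch_size` parts and taking consecutive windows from each part.

```python
import numpy as np

toy_text = np.arange(24)        # stands in for the encoded characters
batch_size, n_steps = 2, 3
window_length = n_steps + 1     # n_steps inputs plus one extra character for the targets

def make_windows(part):
    """Windows of n_steps + 1 characters, shifted by n_steps (hypothetical helper)."""
    last_start = len(part) - window_length
    return [part[i:i + window_length] for i in range(0, last_start + 1, n_steps)]

# One part of the text per position in the batch.
parts = np.array_split(toy_text, batch_size)

# Batch k stacks the k-th window of every part.
batches = [np.stack(ws) for ws in zip(*(make_windows(p) for p in parts))]

for windows in batches:
    X, Y = windows[:, :-1], windows[:, 1:]
    print(X.tolist(), Y.tolist())
```

Row `i` of each batch continues row `i` of the previous batch, so the state kept for that row stays meaningful.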

In [None]:
batch_size = 32
encoded_parts = np.array_split(encoded[:train_size], batch_size) # split into 32 parts
# len(encoded_parts) = 32 

datasets = []
n_steps = 100
window_length = n_steps + 1

for encoded_part in encoded_parts:
  dataset = tf.data.Dataset.from_tensor_slices(encoded_part) # dataset object
  dataset = dataset.window(window_length, shift=n_steps,
                           drop_remainder=True) # consecutive windows share only one character
  dataset = dataset.flat_map(lambda window: (window.batch(window_length))) # flatten windows
  datasets.append(dataset)


dataset = tf.data.Dataset.zip(tuple(datasets)).map(lambda *windows: tf.stack(windows)) # stack the matching window from each part into one batch
dataset = dataset.map(lambda windows: (windows[:, :-1], windows[:, 1:]))

dataset = dataset.map(
    lambda X_batch, Y_batch: (tf.one_hot(X_batch, depth=max_id), Y_batch))
dataset = dataset.prefetch(1)

Make sure you specify `stateful=True` and the `batch_input_shape`, so that tensorflow can preserve a state for each input sequence in the batch.

In [None]:
model = keras.models.Sequential([
  keras.layers.GRU(128, return_sequences = True, stateful = True,
                   dropout = 0.2, # recurrent_dropout=0.2,
                   batch_input_shape = [batch_size, None, max_id]),
  keras.layers.GRU(128, return_sequences = True, stateful = True,
                   dropout = 0.2),
  keras.layers.TimeDistributed(keras.layers.Dense(max_id, activation='softmax'))
])

In [None]:
class ResetStatesCallback(keras.callbacks.Callback):
  """
  Callback used in Stateful RNN model, to reset states at the beginning of
  each epoch.
  """
  def on_epoch_begin(self, epoch, logs=None):
    self.model.reset_states()

In [None]:
model.compile(loss='sparse_categorical_crossentropy', optimizer='adam')
model.fit(dataset, epochs=10, callbacks=[ResetStatesCallback()])

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<tensorflow.python.keras.callbacks.History at 0x7f55ee451cd0>

# Sentiment Analysis

Sentiment Analysis is the task of classifying a piece of text as either positive (1) or negative (0). A popular dataset for this, something like the "hello world" of sentiment analysis, is the IMDB reviews dataset.

In [2]:
(X_train, y_train), (X_test, y_test) = keras.datasets.imdb.load_data()

Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb.npz




`keras` provides a preprocessed version of the dataset: the reviews have been tokenized into words, punctuation removed, words converted to lowercase and indexed by frequency (low values represent more frequent words).

The integers 0, 1 and 2 are special: they represent the padding token, the start-of-sequence (SOS) token and unknown words, respectively.

In [13]:
word_index = keras.datasets.imdb.get_word_index()
# Reverse the word index to obtain a dict mapping indices to words
inverted_word_index = dict((i+3, word) for (word, i) in word_index.items())

for i, token in enumerate(('<pad>', '<sos>', '<unk>')):
  inverted_word_index[i] = token
# Decode the fifth sequence (index 4) in the dataset
decoded_sequence = " ".join(inverted_word_index[i] for i in X_train[4])
decoded_sequence

"<sos> worst mistake of my life br br i picked this movie up at target for 5 because i figured hey it's sandler i can get some cheap laughs i was wrong completely wrong mid way through the film all three of my friends were asleep and i was still suffering worst plot worst script worst movie i have ever seen i wanted to hit my head up against a wall for an hour then i'd stop and you know why because it felt damn good upon bashing my head in i stuck that damn movie in the microwave and watched it burn and that felt better than anything else i've ever done it took american psycho army of darkness and kill bill just to get over that crap i hate you sandler for actually going through with this and ruining a whole day of my life"

Looks like a movie with Adam Sandler in it - pretty funny review, definitely negative.

Notice that the review still contains artifacts such as `br` (left over from HTML line breaks) and that some punctuation, such as apostrophes, has been left in.

Tokenizing words by splitting on spaces might not work in all situations. Some names contain a space but form a single unit, for example San Francisco. Fortunately, we can take advantage of subword tokenization techniques and open source tools such as Google's SentencePiece, Byte Pair Encoding or WordPiece. These tools help convert text into integers that can be used in a model.

Instead of using the ready-made dataset, let's use the byte string IMDB data from tensorflow.

### Load IMDB Reviews

In [15]:
import tensorflow_datasets as tfds

datasets, info = tfds.load("imdb_reviews", as_supervised=True, with_info=True)

Downloading and preparing dataset imdb_reviews/plain_text/1.0.0 (download: 80.23 MiB, generated: Unknown size, total: 80.23 MiB) to /root/tensorflow_datasets/imdb_reviews/plain_text/1.0.0...
Dataset imdb_reviews downloaded and prepared to /root/tensorflow_datasets/imdb_reviews/plain_text/1.0.0. Subsequent calls will reuse this data.


In [21]:
for X_batch, y_batch in datasets["train"].shuffle(20).batch(2).take(1):
    for review, label in zip(X_batch.numpy(), y_batch.numpy()):
        print("Review:", review.decode("utf-8")[:200], "...")
        print("Label:", label, "= Positive" if label else "= Negative")
        print()

Review: Nathan Detroit runs illegal craps games for high rollers in NYC, but the heat is on and he can't find a secure location. He bets chronic gambler Sky Masterson that Sky can't make a prim missionary, Sa ...
Label: 0 = Negative

Review: During a sleepless night, I was switching through the channels & found this embarrassment of a movie. What were they thinking?<br /><br />If this is life after "Remote Control" for Kari (Wuhrer) Salin ...
Label: 0 = Negative



In [17]:
datasets.keys()

dict_keys(['test', 'train', 'unsupervised'])

In [None]:
train_size = info.splits['train'].num_examples
test_size = info.splits['test'].num_examples

### Preprocess data - Encoding

When creating tensorflow models, it is best to restrict the preprocessing steps to tensorflow operations only.

In [29]:
def preprocess(X_batch, y_batch):
  """
  Preprocesses the data and returns a dense tensor
  """
  X_batch = tf.strings.substr(X_batch, 0, 300) # shorten reviews to speed up processing
  X_batch = tf.strings.regex_replace(X_batch, rb"<br\s*/?>", b" ") # replace <br /> tags with spaces
  X_batch = tf.strings.regex_replace(X_batch, b"[^a-zA-Z]", b" ")
  X_batch = tf.strings.split(X_batch) #  returns a ragged tensor
  return X_batch.to_tensor(default_value=b"<pad>"), y_batch

In [30]:
from collections import Counter

vocabulary = Counter()

# create vocabulary dictionary
for X_batch, y_batch in datasets['train'].batch(32).map(preprocess):
  for review in X_batch:
    vocabulary.update(list(review.numpy()))

In [31]:
len(vocabulary)

49739

That is a total of ~50,000 words. Not all words will be important, so let's truncate the vocabulary to the 10,000 most common words.

In [34]:
vocabulary.most_common()[:5]

[(b'<pad>', 224494),
 (b'the', 61156),
 (b'a', 38569),
 (b'of', 33984),
 (b'and', 33432)]

In [35]:
vocab_size = 10000
truncated_vocabulary = [word for word, count in vocabulary.most_common()[:vocab_size]]

In [38]:
# map each word in the truncated vocabulary to an integer ID
words = tf.constant(truncated_vocabulary)
word_ids = tf.range(len(words), dtype=tf.int64)
vocab_init = tf.lookup.KeyValueTensorInitializer(words, word_ids)

# specify the number of out-of-vocabulary (OOV) buckets
num_oov_buckets = 1000
table = tf.lookup.StaticVocabularyTable(vocab_init, num_oov_buckets=num_oov_buckets)
table

<tensorflow.python.ops.lookup_ops.StaticVocabularyTable at 0x7f89869910d0>

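The out-of-vocabulary (OOV) buckets give us a margin for words that are not in the vocabulary table. Each unknown word is hashed into one of the `num_oov_buckets` buckets, so it receives an ID between 10,000 and 10,999 (i.e. from `vocab_size` to `vocab_size + num_oov_buckets - 1`).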
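The bucket idea can be sketched in plain Python. The toy vocabulary, the `lookup` helper and the use of `zlib.crc32` as the hash are all assumptions for illustration; `StaticVocabularyTable` uses its own hashing internally, so the actual bucket IDs will differ.

```python
import zlib

vocab = ["<pad>", "the", "a", "of", "and"]  # toy vocabulary, most common words first
num_oov_buckets = 3
word_to_id = {word: i for i, word in enumerate(vocab)}

def lookup(word):
    """Hypothetical lookup: known words keep their ID, unknown words get a bucket ID."""
    if word in word_to_id:
        return word_to_id[word]
    # Stand-in hash (CRC32); unknown words land in one of the OOV buckets.
    bucket = zlib.crc32(word.encode("utf-8")) % num_oov_buckets
    return len(vocab) + bucket

print(lookup("the"))
print(lookup("zebra"))  # some ID from len(vocab) to len(vocab) + num_oov_buckets - 1
```

Two different unknown words may share a bucket, so more buckets mean fewer collisions at the cost of a larger embedding matrix.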

In [39]:
def encode_words(X_batch, y_batch):
  """
  Uses a vocabulary table to encode the reviews
  """
  return table.lookup(X_batch), y_batch

train_set = datasets['train'].batch(32).map(preprocess)
train_set = train_set.map(encode_words).prefetch(1)

### Train Model

In [40]:
embedding_dimension_size = 128

model = keras.models.Sequential([
  keras.layers.Embedding(input_dim=vocab_size + num_oov_buckets,
                         output_dim=embedding_dimension_size,
                         input_shape=[None], 
                         mask_zero=True), #  ignores padding tokens, i.e. id of 0
  keras.layers.GRU(128, return_sequences=True),
  keras.layers.GRU(128, return_sequences=False),
  keras.layers.Dense(1, activation='sigmoid')
])

model.compile(loss='binary_crossentropy', optimizer='adam',
              metrics=['accuracy'])

model.fit(train_set, epochs=5)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<tensorflow.python.keras.callbacks.History at 0x7f8985318350>

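Setting `mask_zero=True` tells the model to ignore tokens with `id=0`. Here that is `<pad>`, because it is the most frequent token in the vocabulary. If the padding token does not have ID 0 in your vocabulary, arrange for it to.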
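Conceptually, the mask is just a boolean tensor that is `True` wherever the input ID is not 0. The NumPy sketch below shows that idea on a made-up padded batch; it is an illustration, not the Keras internals.

```python
import numpy as np

# Hypothetical batch of encoded reviews, padded with ID 0 to equal length.
batch = np.array([[12, 7, 431, 0, 0],
                  [5, 98, 3, 44, 2]])

mask = batch != 0            # True where there is a real token
lengths = mask.sum(axis=1)   # number of real tokens in each review

print(mask)
print(lengths)
```

Layers that support masking skip the time steps where the mask is `False`, so the padding does not influence the final state.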


In [44]:
test_set = datasets['test'].batch(32).map(preprocess)
test_set = test_set.map(encode_words).prefetch(1)

In [45]:
model.evaluate(test_set)



[0.8169209957122803, 0.7268400192260742]

This works, but manually training your own model for a task as common as Sentiment Analysis is unnecessary. We can take advantage of pretrained embeddings available for tensorflow.

Embeddings need to be learned, and words like amazing, awesome and fantastic end up close to one another in the embedding space. Instead of relearning this, we can reuse a pretrained model that was trained on a huge corpus: 7 billion words!



### Reusing Pretrained Embeddings

There are numerous pretrained models for tensorflow; see the [word embeddings](https://www.tensorflow.org/text/guide/word_embeddings) guide and the TF Hub repository [here](https://tfhub.dev). Browse the models in the repository and copy the code across to use in your project.

In [2]:
# load data 
import tensorflow_datasets as tfds

datasets, info = tfds.load("imdb_reviews", as_supervised=True, with_info=True)
train_size = info.splits["train"].num_examples
batch_size = 32
train_set = datasets["train"].batch(batch_size).prefetch(1)


In [3]:
# load pretrained model from tf hub
import tensorflow_hub as hub

model = keras.Sequential([
  hub.KerasLayer('https://tfhub.dev/google/tf2-preview/nnlm-en-dim50/1',
                 dtype=tf.string, input_shape=[], output_shape=[50]),
  keras.layers.Dense(128, activation='relu'), # non-linearity; without it the two Dense layers collapse into one linear map
  keras.layers.Dense(1, activation='sigmoid')
])

model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

In [4]:
model.fit(train_set, epochs=5)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<tensorflow.python.keras.callbacks.History at 0x7fe32dfb3b50>

# Encoder Decoder Translation

Translating sentences is more complex. In sentiment analysis, the first 100 or so words are often enough to conclude whether a review is positive or negative, but a translator needs the entire text. We cannot skip the rest of the corpus after the first 100 words; it would not be much of a translator!

This is where the Encoder-Decoder architecture is used. The encoder is fed the English words while the decoder is fed the target words (i.e. the translation). Each target sequence starts with a `<SOS>` (start of sequence) token and ends with an `<EOS>` (end of sequence) token.

![Example of Translation Model](https://camo.githubusercontent.com/2b7ba2f149230fe06f19cbee902fa29559181b427728c96afd1c5cca43fe5372/68747470733a2f2f736d65726974792e636f6d2f6d656469612f696d616765732f61727469636c65732f323031362f676e6d745f617263685f315f656e635f6465632e737667)


As you can see, the decoder output at each step is passed through a softmax function. The decoder outputs a score for every word in the vocabulary, which could be thousands of words! This is why the **sampled softmax** is often used during training to speed up computation; you can access it with `tf.nn.sampled_softmax_loss()`. As with a regular softmax, the word with the highest probability is the output, so you can use the `sparse_categorical_crossentropy` loss.


At inference time (after training) the target sequence can no longer be fed to the decoder, instead, the decoder is fed the output at the previous step.
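The inference loop can be sketched as greedy decoding. The token IDs and the `toy_decoder_step` function below are hypothetical stand-ins for a trained decoder, which would also carry its recurrent state from step to step; only the feedback of the previous output into the next step is shown.

```python
import numpy as np

SOS, EOS = 0, 1  # hypothetical token IDs

def toy_decoder_step(prev_token, vocab_size=5):
    """Hypothetical stand-in for one decoder step: returns a score per token."""
    scores = np.zeros(vocab_size)
    next_token = prev_token + 2 if prev_token + 2 < vocab_size else EOS
    scores[next_token] = 1.0
    return scores

def greedy_decode(step_fn, max_len=10):
    """Feed each predicted token back in as the next decoder input."""
    tokens = [SOS]
    for _ in range(max_len):
        scores = step_fn(tokens[-1])          # decoder sees its own previous output
        next_token = int(np.argmax(scores))   # pick the highest-scoring token
        tokens.append(next_token)
        if next_token == EOS:                 # stop at the end-of-sequence token
            break
    return tokens

decoded = greedy_decode(toy_decoder_step)
print(decoded)
```

The `max_len` cap guards against a decoder that never emits the end-of-sequence token.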

The [`TensorFlow Addons`](https://www.tensorflow.org/addons) project provides production-ready tools that help us build sequence-to-sequence models.


In [7]:
!pip install tensorflow_addons

Collecting tensorflow_addons
  Downloading tensorflow_addons-0.13.0-cp37-cp37m-manylinux2010_x86_64.whl (679 kB)

In [21]:
import tensorflow_addons as tfa

# define arbitrary vocab and embedding dimension size
vocab_size = 100
embed_size = 10

# inputs to encoder and decoder
encoder_inputs = keras.layers.Input(shape=[None], dtype=np.int32)
decoder_inputs = keras.layers.Input(shape=[None], dtype=np.int32)
sequence_lengths = keras.layers.Input(shape=[], dtype=np.int32)

# define embeddings space for encoder and decoder
embeddings = keras.layers.Embedding(vocab_size, embed_size)
encoder_embeddings = embeddings(encoder_inputs)
decoder_embeddings = embeddings(decoder_inputs)

# encoder network 
encoder = keras.layers.LSTM(512, return_state=True)
encoder_outputs, state_h, state_c = encoder(encoder_embeddings) # o, h, c 
encoder_state = [state_h, state_c]

# define sampler to tell the decoder what the output should be 
sampler = tfa.seq2seq.sampler.TrainingSampler() 

# define decoder network using tfa 
decoder_cells = keras.layers.LSTMCell(512)
output_layer = keras.layers.Dense(vocab_size)
#  samples output distribution and produces the input for the next decoding step
decoder = tfa.seq2seq.basic_decoder.BasicDecoder(decoder_cells, sampler, output_layer=output_layer)

final_outputs, final_state, final_sequence_lengths = decoder(
    decoder_embeddings, initial_state=encoder_state,
    sequence_length=sequence_lengths
)

Y_proba = tf.nn.softmax(final_outputs.rnn_output)

model = keras.Model(inputs=[encoder_inputs, decoder_inputs, sequence_lengths],
                    outputs=Y_proba)

In [22]:
model.compile(loss="sparse_categorical_crossentropy", optimizer="adam")

Read more about Neural Networks for Translations [here](https://www.tensorflow.org/addons/tutorials/networks_seq2seq_nmt). 

# Bidirectional RNNs

# Attention Mechanisms