**Natural Language Processing with RNNs and Attention**

This notebook is inspired by the handson-ml2 GitHub repository by Aurélien Géron:

https://github.com/ageron/handson-ml2

# Setup

First, let's import a few common modules and make sure Matplotlib plots figures inline. We also check that Python 3.5 or later is installed (Python 2.x may work, but it is deprecated, so we strongly recommend Python 3), as well as Scikit-Learn ≥0.20 and TensorFlow ≥2.0.

In [1]:
# Python ≥3.5 is required
import sys
assert sys.version_info >= (3, 5)

# Scikit-Learn ≥0.20 is required
import sklearn
assert sklearn.__version__ >= "0.20"

# TensorFlow ≥2.0 is required
import tensorflow as tf
from tensorflow import keras
assert tf.__version__ >= "2.0"

# Common imports
import numpy as np
import os

# to make this notebook's output stable across runs
np.random.seed(42)
tf.random.set_seed(42)

# To plot pretty figures
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt
mpl.rc('axes', labelsize=14)
mpl.rc('xtick', labelsize=12)
mpl.rc('ytick', labelsize=12)

# Reduce TensorFlow's logging output (os is already imported above)
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'
import logging
tf.get_logger().setLevel(logging.ERROR)

import tensorflow_hub as hub

# Char-RNN

## Splitting a sequence into batches of shuffled windows

For example, let's split the sequence 0 to 14 into windows of length 5, each shifted by 2 (e.g.,`[0, 1, 2, 3, 4]`, `[2, 3, 4, 5, 6]`, etc.), then shuffle them, and split them into inputs (the first 4 steps) and targets (the last 4 steps) (e.g., `[2, 3, 4, 5, 6]` would be split into `[[2, 3, 4, 5], [3, 4, 5, 6]]`), then create batches of 3 such input/target pairs:

In [137]:
np.random.seed(42)
tf.random.set_seed(42)

n_steps = 5
dataset = tf.data.Dataset.range(15)
# [0 to 14]

In [138]:
def printds(ds):
    # Print every element of the dataset passed in (not the global `dataset`)
    print(list(ds.as_numpy_iterator()))

In [139]:
printds(dataset)

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]


https://www.tensorflow.org/api_docs/python/tf/data/Dataset

In [140]:
dataset = dataset.window(n_steps, shift=2, drop_remainder=True)


In [141]:
for window in dataset:
    print([elem.numpy() for elem in window])

[0, 1, 2, 3, 4]
[2, 3, 4, 5, 6]
[4, 5, 6, 7, 8]
[6, 7, 8, 9, 10]
[8, 9, 10, 11, 12]
[10, 11, 12, 13, 14]


In [142]:
# The nested dataset of windows is useless as is: flatten each window into a single tensor
dataset = dataset.flat_map(lambda window: window.batch(n_steps))
printds(dataset)

[array([0, 1, 2, 3, 4]), array([2, 3, 4, 5, 6]), array([4, 5, 6, 7, 8]), array([ 6,  7,  8,  9, 10]), array([ 8,  9, 10, 11, 12]), array([10, 11, 12, 13, 14])]


In [143]:
dataset = dataset.shuffle(10).map(lambda window: (window[:-1], window[1:]))
printds(dataset)

[(array([6, 7, 8, 9]), array([ 7,  8,  9, 10])), (array([2, 3, 4, 5]), array([3, 4, 5, 6])), (array([4, 5, 6, 7]), array([5, 6, 7, 8])), (array([0, 1, 2, 3]), array([1, 2, 3, 4])), (array([ 8,  9, 10, 11]), array([ 9, 10, 11, 12])), (array([10, 11, 12, 13]), array([11, 12, 13, 14]))]


In [144]:
dataset = dataset.batch(3).prefetch(1)
printds(dataset)

[(array([[ 4,  5,  6,  7],
       [ 0,  1,  2,  3],
       [ 8,  9, 10, 11]]), array([[ 5,  6,  7,  8],
       [ 1,  2,  3,  4],
       [ 9, 10, 11, 12]])), (array([[10, 11, 12, 13],
       [ 2,  3,  4,  5],
       [ 6,  7,  8,  9]]), array([[11, 12, 13, 14],
       [ 3,  4,  5,  6],
       [ 7,  8,  9, 10]]))]


In [145]:

for index, (X_batch, Y_batch) in enumerate(dataset):
    print("_" * 20, "Batch", index, "\nX_batch")
    print(X_batch.numpy())
    print("=" * 5, "\nY_batch")
    print(Y_batch.numpy())

____________________ Batch 0 
X_batch
[[ 2  3  4  5]
 [ 0  1  2  3]
 [ 8  9 10 11]]
===== 
Y_batch
[[ 3  4  5  6]
 [ 1  2  3  4]
 [ 9 10 11 12]]
____________________ Batch 1 
X_batch
[[10 11 12 13]
 [ 4  5  6  7]
 [ 6  7  8  9]]
===== 
Y_batch
[[11 12 13 14]
 [ 5  6  7  8]
 [ 7  8  9 10]]


## Loading the Data and Preparing the Dataset

First, we need all of Shakespeare's work. The text comes from Andrej Karpathy's Char-RNN project. It could be downloaded with Keras's handy `get_file()` function, but here we read a copy that is already stored locally.

In [1]:
filepath = '/cxldata/dlcourse/shakespeare.txt'
with open(filepath) as f:
    shakespeare_text = f.read()

In [2]:
print(shakespeare_text[:148])

First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You are all resolved rather to die than to famish?



In [5]:
"".join(sorted(set(shakespeare_text.lower())))

"\n !$&',-.3:;?abcdefghijklmnopqrstuvwxyz"

Next, we must encode every character as an integer. For this, we will use Keras’s `Tokenizer` class.

First we need to fit a tokenizer to the text: it will find all the characters used in the text and map each of them to a different character ID, from 1 to the number of distinct characters (it does not start at 0, so we can use that value for masking).

We set `char_level=True` to get character-level encoding rather than the default word-level encoding. Note that this tokenizer converts the text to lowercase by default (but you can set `lower=False` if you do not want that).

In [6]:
tokenizer = keras.preprocessing.text.Tokenizer(char_level=True)
tokenizer.fit_on_texts(shakespeare_text)

We can now use this tokenizer to encode a sentence (or a list of sentences) as a list of character IDs.

In [7]:
tokenizer.texts_to_sequences(["First"])

[[20, 6, 9, 8, 3]]

And we can use the same tokenizer to decode a list of character IDs back into text (note that the decoded characters are lowercase and separated by spaces).

In [8]:
tokenizer.sequences_to_texts([[20, 6, 9, 8, 3]])

['f i r s t']
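
To make the tokenizer's behavior more concrete, here is a minimal pure-Python sketch of character-level tokenization. It is an illustration only, not Keras's implementation: the helper names `fit_char_index()`, `encode()` and `decode()` are hypothetical, and the sketch assumes IDs are assigned by descending character frequency, starting at 1 so that 0 stays free for masking.

```python
from collections import Counter

def fit_char_index(text, lower=True):
    """Map each distinct character to an ID, starting at 1 (hypothetical helper)."""
    if lower:
        text = text.lower()
    counts = Counter(text)
    # most_common() sorts by descending count; ties keep first-seen order
    return {ch: i for i, (ch, _) in enumerate(counts.most_common(), start=1)}

def encode(text, char_index, lower=True):
    """Convert a string to a list of character IDs, skipping unknown characters."""
    if lower:
        text = text.lower()
    return [char_index[ch] for ch in text if ch in char_index]

def decode(ids, char_index):
    """Convert a list of character IDs back to a string."""
    id_to_char = {i: ch for ch, i in char_index.items()}
    return "".join(id_to_char[i] for i in ids)

sample = "First Citizen: hear me speak."
char_index = fit_char_index(sample)
ids = encode("First", char_index)
print(ids)
print(decode(ids, char_index))  # the text comes back lowercased
```

Unlike `sequences_to_texts()`, this sketch joins the decoded characters without spaces.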

In [9]:
max_id = len(tokenizer.word_index) # number of distinct characters
dataset_size = tokenizer.document_count # total number of characters

Now we will encode the full text so each character is represented by its ID (we subtract 1 to get IDs from 0 to 38, rather than from 1 to 39).

In [10]:
[encoded] = np.array(tokenizer.texts_to_sequences([shakespeare_text])) - 1

Now we will split the sequential dataset. We will take the first 90% of the text for the training set (keeping the rest for the validation set and the test set), and create a `tf.data.Dataset` that will return each character one by one from this set.

In [11]:
train_size = dataset_size * 90 // 100
# Creates a Dataset whose elements are slices of the given tensors
dataset = tf.data.Dataset.from_tensor_slices(encoded[:train_size])

Now we will chop the sequential dataset into multiple windows.

Every instance in the dataset will be a fairly short substring of the whole text, and the RNN will be unrolled only over the length of these substrings. This is called **truncated backpropagation through time**.

In [12]:
n_steps = 100

# By default, the window() method creates non-overlapping windows, 
# but to get the largest possible training set we use shift=1 so that 
# the first window contains characters 0 to 100, the second contains characters 1 to 101, and so on.
window_length = n_steps + 1 # target = input shifted 1 character ahead

# The window() method to convert the long sequence of characters of the dataset into many smaller windows of text.
# To ensure that all windows are exactly 101 characters long (which will allow us to create batches 
# without having to do any padding), we set drop_remainder=True
# (otherwise the last 100 windows will contain 100 characters, 99 characters, and so on down to 1 character)
dataset = dataset.repeat().window(window_length, shift=1, drop_remainder=True)

The `window()` method creates a dataset that contains windows, each of which is also represented as a dataset. It’s a nested dataset, analogous to a list of lists. However, we cannot use a nested dataset directly for training, as our model will expect tensors as input, not datasets. So, we must call the `flat_map()` method: it converts a nested dataset into a flat dataset.

In [13]:
dataset = dataset.flat_map(lambda window: window.batch(window_length))

Notice that we called `batch(window_length)` on each window: since all windows have exactly that length, we will get a single tensor for each of them. Now the dataset contains consecutive windows of 101 characters each.

In [14]:
np.random.seed(42)
tf.random.set_seed(42)

Since Gradient Descent works best when the instances in the training set are independent and identically distributed, we need to shuffle these windows. Then we can batch the windows and separate the inputs (the first 100 characters) from the targets (the last 100 characters, i.e., the inputs shifted one character ahead).

In [15]:
batch_size = 32
dataset = dataset.shuffle(10000).batch(batch_size)
dataset = dataset.map(lambda windows: (windows[:, :-1], windows[:, 1:]))
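
The windowing and input/target split can be hard to picture on 101-character windows, so here is a small pure-Python sketch of the same idea on a short string. The helper `make_windows()` is hypothetical (it is not part of `tf.data`) and only mimics `window(..., shift=1, drop_remainder=True)` followed by the `[:-1]` / `[1:]` split. Shuffling and batching are left out.

```python
def make_windows(seq, window_length, shift=1):
    """Return all full windows of `seq` (like drop_remainder=True)."""
    last_start = len(seq) - window_length
    return [seq[i:i + window_length] for i in range(0, last_start + 1, shift)]

text = "hello world"
n_steps = 4
window_length = n_steps + 1  # target = input shifted 1 character ahead

windows = make_windows(text, window_length)
# Inputs are the first n_steps characters, targets the last n_steps characters
pairs = [(w[:-1], w[1:]) for w in windows]

for inputs, targets in pairs[:3]:
    print(repr(inputs), "->", repr(targets))
```

Each target string is the input string shifted by one character, so at every time step the model is asked to predict the character that follows.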

**This is a visual summary of the dataset preparation steps discussed so far.**
![title](images/preparing_dataset_shuffle_windows.png)

Categorical input features should generally be encoded, usually as one-hot vectors or as embeddings. Here, we will encode each character using a one-hot vector because there are fairly few distinct characters (only 39).

In [16]:
dataset = dataset.map(
    lambda X_batch, Y_batch: (tf.one_hot(X_batch, depth=max_id), Y_batch))
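
As a rough illustration of what `tf.one_hot()` produces, the pure-Python sketch below turns each character ID into a vector that is all zeros except for a single 1 at the index of the ID. The helper `one_hot()` is hypothetical and only handles a flat list of IDs, whereas the TensorFlow function works on whole batches.

```python
def one_hot(ids, depth):
    """One-hot encode a list of integer IDs (hypothetical helper)."""
    return [[1.0 if j == i else 0.0 for j in range(depth)] for i in ids]

depth = 5            # stand-in for max_id (39 in this notebook)
sequence = [2, 0, 4]  # three character IDs

encoded_sequence = one_hot(sequence, depth)
for row in encoded_sequence:
    print(row)
```

A sequence of 3 IDs becomes a 3 × 5 matrix here. In the same way, a batch of shape `[32, 100]` becomes a tensor of shape `[32, 100, 39]` in the notebook.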

In [17]:
dataset = dataset.prefetch(1)

In [18]:
for X_batch, Y_batch in dataset.take(1):
    print(X_batch.shape, Y_batch.shape)

(32, 100, 39) (32, 100)


We have prepared the dataset. Now we will create the model.

## Creating and Training the Model

Now we will create a model to predict the next character based on the previous 100 characters.

The output layer is a time-distributed Dense layer, and the output probabilities should sum to 1 at each time step, so we apply the softmax activation function to the outputs of the Dense layer.

In [19]:
model = keras.models.Sequential([
    keras.layers.GRU(128, return_sequences=True, input_shape=[None, max_id],
                     dropout=0.2, recurrent_dropout=0.2),
    keras.layers.GRU(128, return_sequences=True,
                     dropout=0.2, recurrent_dropout=0.2),
    keras.layers.TimeDistributed(keras.layers.Dense(max_id,
                                                    activation="softmax"))
])
model.compile(loss="sparse_categorical_crossentropy", optimizer="adam")

**Please note that the following code takes a considerable amount of time to run. We therefore trained the model on a different system, saved it, and load the saved model here. The same approach is used throughout this notebook.**

In [20]:
# history = model.fit(dataset, steps_per_epoch=train_size // batch_size,
#                    epochs=10)
# np.save('history/shakespeare.npy', history.history)
# model.save("models/shakespeare.h5")

In [21]:
model = keras.models.load_model('models/shakespeare.h5')
history = np.load('history/shakespeare.npy', allow_pickle=True).item()

## Using the Model to Generate Text

To feed some text to the model that we created above, we first need to preprocess it like we did earlier. This `preprocess()` function will serve this purpose:

In [22]:
def preprocess(texts):
    X = np.array(tokenizer.texts_to_sequences(texts)) - 1
    return tf.one_hot(X, max_id)

Now let’s use the model to predict the next letter in some text:

In [23]:
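X_new = preprocess(["How are yo"])
# predict_classes() is deprecated and missing from recent TensorFlow releases,
# so we take the argmax of the predicted probabilities instead
Y_pred = np.argmax(model.predict(X_new), axis=-1)
tokenizer.sequences_to_texts(Y_pred + 1)[0][-1] # 1st sentence, last char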

'u'

To generate new text using the Char-RNN model, we can pick the next character randomly, with a probability equal to the estimated probability, using TensorFlow’s `tf.random.categorical()` function. This will generate more diverse and interesting text.

The `categorical()` function samples random class indices, given the class log probabilities (logits).

In [24]:
tf.random.set_seed(42)

tf.random.categorical([[np.log(0.5), np.log(0.4), np.log(0.1)]], num_samples=40).numpy()

array([[0, 1, 0, 2, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 1, 1, 0, 2, 1, 0, 2, 1,
        0, 1, 2, 1, 1, 1, 2, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 2]])

To have more control over the diversity of the generated text, we can divide the logits by a number called the `temperature`, which we can tweak as we wish. A `temperature` close to 0 will favor the high-probability characters, while a very high `temperature` will give all characters an equal probability.

The following `next_char()` function uses this approach to pick the next character to add to the input text.

In [25]:
def next_char(text, temperature=1):
    X_new = preprocess([text])
    y_proba = model.predict(X_new)[0, -1:, :]
    rescaled_logits = tf.math.log(y_proba) / temperature
    char_id = tf.random.categorical(rescaled_logits, num_samples=1) + 1
    return tokenizer.sequences_to_texts(char_id.numpy())[0]
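
To see the effect of the temperature numerically, the following pure-Python sketch rescales a small probability distribution the same way `next_char()` does: take the log, divide by the temperature, and renormalize with a softmax. The helper `apply_temperature()` is hypothetical and works on a plain list. It computes the distribution that sampling from the rescaled logits would follow, and does not sample anything itself.

```python
import math

def apply_temperature(probas, temperature):
    """Return the distribution obtained after dividing log-probabilities by `temperature`."""
    logits = [math.log(p) / temperature for p in probas]
    # Softmax, subtracting the max logit for numerical stability
    top = max(logits)
    exps = [math.exp(logit - top) for logit in logits]
    total = sum(exps)
    return [e / total for e in exps]

probas = [0.5, 0.4, 0.1]
for temperature in (0.2, 1.0, 5.0):
    rescaled = apply_temperature(probas, temperature)
    print(temperature, [round(p, 3) for p in rescaled])
```

A low temperature concentrates the probability mass on the most likely class, a temperature of 1 leaves the distribution unchanged, and a high temperature flattens it toward uniform.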

In [26]:
tf.random.set_seed(42)

next_char("How are yo", temperature=1)

'u'

Next, we will write a small function that will repeatedly call `next_char()` to get the next character and append it to the given text.

In [27]:
def complete_text(text, n_chars=50, temperature=1):
    for _ in range(n_chars):
        text += next_char(text, temperature)
    return text

We are now ready to generate some text! Let’s try with different temperatures.

In [28]:
tf.random.set_seed(42)

print(complete_text("t", temperature=0.2))

the belly to the belly,
who comes the belly and the


In [29]:
print(complete_text("t", temperature=1))

thing? or why did be great, then?

bianca:
a good c


In [30]:
print(complete_text("t", temperature=2))

th no cytyunes
fft afne charp oak, mean'd: for scog


Our Shakespeare model works best at a temperature close to 1.


## Stateful RNN

In [31]:
tf.random.set_seed(42)

A stateful RNN only makes sense if each input sequence in a batch starts exactly where the corresponding sequence in the previous batch left off. So the first thing we need to do to build a stateful RNN is to use sequential and non-overlapping input sequences (unlike stateless RNNs).

When creating the Dataset, we must therefore use `shift=n_steps` (instead of `shift=1`) when calling the `window()` method.

Batching is harder with a stateful RNN. The simplest option, shown in the first cell below, is to use batches containing a single window. To use larger batches, the second cell chops Shakespeare's text into 32 texts of equal length, creates one dataset of consecutive input sequences for each of them, and finally uses `tf.data.Dataset.zip(datasets).map(lambda *windows: tf.stack(windows))` to create proper consecutive batches, where the _nth_ input sequence in a batch starts off exactly where the _nth_ input sequence ended in the previous batch.

In [32]:
dataset = tf.data.Dataset.from_tensor_slices(encoded[:train_size])
dataset = dataset.window(window_length, shift=n_steps, drop_remainder=True)
dataset = dataset.flat_map(lambda window: window.batch(window_length))
dataset = dataset.repeat().batch(1)
dataset = dataset.map(lambda windows: (windows[:, :-1], windows[:, 1:]))
dataset = dataset.map(
    lambda X_batch, Y_batch: (tf.one_hot(X_batch, depth=max_id), Y_batch))
dataset = dataset.prefetch(1)

In [33]:
batch_size = 32
encoded_parts = np.array_split(encoded[:train_size], batch_size)
datasets = []
for encoded_part in encoded_parts:
    dataset = tf.data.Dataset.from_tensor_slices(encoded_part)
    dataset = dataset.window(window_length, shift=n_steps, drop_remainder=True)
    dataset = dataset.flat_map(lambda window: window.batch(window_length))
    datasets.append(dataset)
dataset = tf.data.Dataset.zip(tuple(datasets)).map(lambda *windows: tf.stack(windows))
dataset = dataset.repeat().map(lambda windows: (windows[:, :-1], windows[:, 1:]))
dataset = dataset.map(
    lambda X_batch, Y_batch: (tf.one_hot(X_batch, depth=max_id), Y_batch))
dataset = dataset.prefetch(1)
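
The batching scheme above can be checked on a toy sequence. The sketch below is pure Python and uses a hypothetical helper, `stateful_batches()`, which splits the sequence into equal parts by simple truncation (the notebook uses `np.array_split()`, which handles uneven lengths differently). It then verifies that each row of a batch continues the same row of the previous batch.

```python
def stateful_batches(seq, batch_size, n_steps):
    """Build batches where row n of each batch continues row n of the previous batch."""
    window_length = n_steps + 1
    part_len = len(seq) // batch_size  # assumption: any leftover items are dropped
    parts = [seq[i * part_len:(i + 1) * part_len] for i in range(batch_size)]
    windows_per_part = [
        [part[j:j + window_length]
         for j in range(0, len(part) - window_length + 1, n_steps)]
        for part in parts
    ]
    # zip(*...) groups the k-th window of every part into the k-th batch
    return list(zip(*windows_per_part))

seq = list(range(40))
batches = stateful_batches(seq, batch_size=2, n_steps=3)

for index, batch in enumerate(batches[:2]):
    inputs = [window[:-1] for window in batch]
    targets = [window[1:] for window in batch]
    print("Batch", index, "inputs:", inputs, "targets:", targets)
```

With `shift=n_steps`, consecutive windows overlap by exactly one element, so the inputs (each window minus its last element) are contiguous and non-overlapping from one batch to the next.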

**This is a visual summary of the steps discussed so far.**
![title](images/stateful_rnn.png)

Now we will create the stateful RNN model by setting `stateful=True` in every recurrent layer and passing the `batch_input_shape` argument to the first layer, since a stateful RNN needs to know the batch size.

In [34]:
model = keras.models.Sequential([
    keras.layers.GRU(128, return_sequences=True, stateful=True,
                     dropout=0.2, recurrent_dropout=0.2,
                     batch_input_shape=[batch_size, None, max_id]),
    keras.layers.GRU(128, return_sequences=True, stateful=True,
                     dropout=0.2, recurrent_dropout=0.2),
    keras.layers.TimeDistributed(keras.layers.Dense(max_id,
                                                    activation="softmax"))
])

We need to reset the states before each epoch, when we go back to the beginning of the text. For this, we can use this small callback:

In [35]:
class ResetStatesCallback(keras.callbacks.Callback):
    def on_epoch_begin(self, epoch, logs):
        self.model.reset_states()

In [36]:
model.compile(loss="sparse_categorical_crossentropy", optimizer="adam")
steps_per_epoch = train_size // batch_size // n_steps

Now we can fit the model (for more epochs, because each epoch is much shorter than earlier: the windows no longer overlap, so there are far fewer of them).

In [37]:
# history = model.fit(dataset, steps_per_epoch=steps_per_epoch, epochs=50,
#                     callbacks=[ResetStatesCallback()])
# np.save('history/stateful_rnn.npy', history.history)
# model.save("models/stateful_rnn.h5")

In [38]:
model = keras.models.load_model('models/stateful_rnn.h5')
history = np.load('history/stateful_rnn.npy', allow_pickle=True).item()

To use the model with different batch sizes, we need to create a stateless copy. We can get rid of dropout since it is only used during training:

In [39]:
stateless_model = keras.models.Sequential([
    keras.layers.GRU(128, return_sequences=True, input_shape=[None, max_id]),
    keras.layers.GRU(128, return_sequences=True),
    keras.layers.TimeDistributed(keras.layers.Dense(max_id,
                                                    activation="softmax"))
])

To set the weights, we first need to build the model (so the weights get created):

In [40]:
stateless_model.build(tf.TensorShape([None, None, max_id]))

In [41]:
stateless_model.set_weights(model.get_weights())
model = stateless_model

In [42]:
tf.random.set_seed(42)

print(complete_text("t"))

thought thet thy censent;
and sight me now, fut tho



# Sentiment Analysis

In [43]:
tf.random.set_seed(42)

You can load the IMDB dataset easily:

In [44]:
(X_train, y_train), (X_test, y_test) = keras.datasets.imdb.load_data()

This dataset is already preprocessed.

- `X_train` consists of a list of reviews, each of which is represented as a NumPy array of integers, where each integer represents a word.
- All punctuation was removed, and then words were converted to lowercase, split by spaces, and finally indexed by frequency (so low integers correspond to frequent words).
- The integers 0, 1, and 2 are special: they represent the padding token, the **start-of-sequence (SOS)** token, and unknown words, respectively.

In [45]:
X_train[0][:10]

[1, 14, 22, 16, 43, 530, 973, 1622, 1385, 65]

If you want to visualize a review, you can decode it like this:

In [46]:
word_index = keras.datasets.imdb.get_word_index()
id_to_word = {id_ + 3: word for word, id_ in word_index.items()}
for id_, token in enumerate(("<pad>", "<sos>", "<unk>")):
    id_to_word[id_] = token
" ".join([id_to_word[id_] for id_ in X_train[0][:10]])

'<sos> this film was just brilliant casting location scenery story'


First, we will load the original IMDb reviews, as text (byte strings), using TensorFlow Datasets.

In [47]:
import tensorflow_datasets as tfds

datasets, info = tfds.load("imdb_reviews", as_supervised=True, with_info=True)

In [48]:
datasets.keys()

dict_keys(['test', 'train', 'unsupervised'])

In [49]:
train_size = info.splits["train"].num_examples
test_size = info.splits["test"].num_examples

In [50]:
train_size, test_size

(25000, 25000)

In [51]:
for X_batch, y_batch in datasets["train"].batch(2).take(1):
    for review, label in zip(X_batch.numpy(), y_batch.numpy()):
        print("Review:", review.decode("utf-8")[:200], "...")
        print("Label:", label, "= Positive" if label else "= Negative")
        print()

Review: This was an absolutely terrible movie. Don't be lured in by Christopher Walken or Michael Ironside. Both are great actors, but this must simply be their worst role in history. Even their great acting  ...
Label: 0 = Negative

Review: I have been known to fall asleep during films, but this is usually due to a combination of things including, really tired, being warm and comfortable on the sette and having just eaten a lot. However  ...
Label: 0 = Negative



Now we will create the preprocessing function. It will:
- Truncate the reviews, keeping only the first 300 characters of each, since you can generally tell whether a review is positive or not in the first sentence or two.
- Use regular expressions to replace `<br />` tags with spaces, and to replace any characters other than letters and quotes with spaces.
- Split the reviews by the spaces, which returns a ragged tensor, and convert this ragged tensor to a dense tensor, padding all reviews with the padding token `"<pad>"` so that they all have the same length.

In [52]:
def preprocess(X_batch, y_batch):
    X_batch = tf.strings.substr(X_batch, 0, 300)
    X_batch = tf.strings.regex_replace(X_batch, rb"<br\s*/?>", b" ")
    X_batch = tf.strings.regex_replace(X_batch, b"[^a-zA-Z']", b" ")
    X_batch = tf.strings.split(X_batch)
    return X_batch.to_tensor(default_value=b"<pad>"), y_batch
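
For readers less familiar with `tf.strings`, the same steps can be sketched with Python's standard `re` module. This is only an approximation for illustration: `preprocess_py()` is a hypothetical helper that works on ordinary Python strings rather than tensors of byte strings, and it truncates by characters, which matches the TensorFlow version only for single-byte text.

```python
import re

def preprocess_py(reviews, max_chars=300, pad="<pad>"):
    """Truncate, clean, split and pad a list of review strings (hypothetical helper)."""
    token_lists = []
    for review in reviews:
        text = review[:max_chars]                # keep the start of the review
        text = re.sub(r"<br\s*/?>", " ", text)   # replace <br /> tags with spaces
        text = re.sub(r"[^a-zA-Z']", " ", text)  # keep only letters and quotes
        token_lists.append(text.split())         # split on whitespace
    longest = max(len(tokens) for tokens in token_lists)
    # Pad every review to the length of the longest one in the batch
    return [tokens + [pad] * (longest - len(tokens)) for tokens in token_lists]

rows = preprocess_py(["Great movie!<br />Loved it.", "Don't bother."])
for row in rows:
    print(row)
```

As in the TensorFlow version, the padded length depends on the longest review in the batch, so it can differ from one batch to the next.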

In [53]:
preprocess(X_batch, y_batch)

(<tf.Tensor: shape=(2, 53), dtype=string, numpy=
 array([[b'This', b'was', b'an', b'absolutely', b'terrible', b'movie',
         b"Don't", b'be', b'lured', b'in', b'by', b'Christopher',
         b'Walken', b'or', b'Michael', b'Ironside', b'Both', b'are',
         b'great', b'actors', b'but', b'this', b'must', b'simply', b'be',
         b'their', b'worst', b'role', b'in', b'history', b'Even',
         b'their', b'great', b'acting', b'could', b'not', b'redeem',
         b'this', b"movie's", b'ridiculous', b'storyline', b'This',
         b'movie', b'is', b'an', b'early', b'nineties', b'US',
         b'propaganda', b'pi', b'<pad>', b'<pad>', b'<pad>'],
        [b'I', b'have', b'been', b'known', b'to', b'fall', b'asleep',
         b'during', b'films', b'but', b'this', b'is', b'usually', b'due',
         b'to', b'a', b'combination', b'of', b'things', b'including',
         b'really', b'tired', b'being', b'warm', b'and', b'comfortable',
         b'on', b'the', b'sette', b'and', b'having', b'j

Next, we will construct the vocabulary. This requires going through the whole training set once, applying our `preprocess()` function, and using a `Counter` to count the number of occurrences of each word.

In [54]:
from collections import Counter

vocabulary = Counter()
for X_batch, y_batch in datasets["train"].batch(32).map(preprocess):
    for review in X_batch:
        vocabulary.update(list(review.numpy()))

Let’s look at the three most common words:

In [55]:
vocabulary.most_common()[:3]

[(b'<pad>', 214309), (b'the', 61137), (b'a', 38564)]

In [56]:
len(vocabulary)

53893

Now we will truncate the vocabulary, keeping only the 10,000 most common words:

In [57]:
vocab_size = 10000
truncated_vocabulary = [
    word for word, count in vocabulary.most_common()[:vocab_size]]

In [58]:
word_to_id = {word: index for index, word in enumerate(truncated_vocabulary)}
for word in b"This movie was faaaaaantastic".split():
    print(word_to_id.get(word) or vocab_size)

22
12
11
10000


Now we need to add a preprocessing step to replace each word with its ID (i.e., its index in the vocabulary). We will create a lookup table for this, using 1,000 out-of-vocabulary (oov) buckets.

In [59]:
words = tf.constant(truncated_vocabulary)
word_ids = tf.range(len(truncated_vocabulary), dtype=tf.int64)
vocab_init = tf.lookup.KeyValueTensorInitializer(words, word_ids) # Table initializers given keys and values tensors.
num_oov_buckets = 1000
# String to Id table wrapper that assigns out-of-vocabulary keys to buckets.
table = tf.lookup.StaticVocabularyTable(vocab_init, num_oov_buckets)

Let's use the above table to look up the IDs of a few words:

In [60]:
table.lookup(tf.constant([b"This movie was faaaaaantastic".split()]))

<tf.Tensor: shape=(1, 4), dtype=int64, numpy=array([[   22,    12,    11, 10053]])>

**Note:** The words “this,” “movie,” and “was” were found in the table, so their IDs are lower than 10,000, while the word “faaaaaantastic” was not found, so it was mapped to one of the oov buckets, with an ID greater than or equal to 10,000.
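
The idea behind out-of-vocabulary buckets can be sketched in a few lines of pure Python. The helper `make_lookup()` below is hypothetical, and it uses `zlib.crc32` as a stand-in hash. TensorFlow uses its own hash function, so the bucket chosen for a given word will generally differ from what `StaticVocabularyTable` returns.

```python
import zlib

def make_lookup(vocab, num_oov_buckets):
    """Return a function mapping words to IDs, hashing unknown words into OOV buckets."""
    word_to_id = {word: index for index, word in enumerate(vocab)}

    def lookup(word):
        if word in word_to_id:
            return word_to_id[word]
        # Assumption: crc32 stands in for TensorFlow's internal hash function
        return len(vocab) + zlib.crc32(word) % num_oov_buckets

    return lookup

lookup = make_lookup([b"<pad>", b"the", b"movie"], num_oov_buckets=4)
print(lookup(b"movie"))           # known word: its index in the vocabulary
print(lookup(b"faaaaaantastic"))  # unknown word: an ID in [3, 7)
```

Because the bucket is derived from a hash of the word, the same unknown word always maps to the same ID, while different unknown words may collide when there are few buckets.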

Now we will create the final training set. We batch the reviews, then convert them to short sequences of words using the `preprocess()` function, then encode these words using a simple `encode_words()` function that uses the table we just built, and finally `prefetch` the next batch.

In [61]:
def encode_words(X_batch, y_batch):
    return table.lookup(X_batch), y_batch

train_set = datasets["train"].repeat().batch(32).map(preprocess)
train_set = train_set.map(encode_words).prefetch(1)

In [62]:
for X_batch, y_batch in train_set.take(1):
    print(X_batch)
    print(y_batch)

tf.Tensor(
[[  22   11   28 ...    0    0    0]
 [   6   21   70 ...    0    0    0]
 [4099 6881    1 ...    0    0    0]
 ...
 [  22   12  118 ...  331 1047    0]
 [1757 4101  451 ...    0    0    0]
 [3365 4392    6 ...    0    0    0]], shape=(32, 60), dtype=int64)
tf.Tensor([0 0 0 1 1 1 0 0 0 0 0 1 1 0 1 0 1 1 1 0 1 1 1 1 1 0 0 0 1 0 0 0], shape=(32,), dtype=int64)


Now that we have preprocessed and created the dataset, we can create the model:

The first layer is an `Embedding` layer, which will convert word IDs into embeddings. The embedding matrix needs to have one row per word ID (`vocab_size + num_oov_buckets`) and one column per embedding dimension (this example uses 128 dimensions, but this is a hyperparameter you could tune). Whereas the inputs of the model will be 2D tensors of shape `[batch size, time steps]`, the output of the Embedding layer will be a 3D tensor of shape `[batch size, time steps, embedding size]`.

In [63]:
embed_size = 128
model = keras.models.Sequential([
    keras.layers.Embedding(vocab_size + num_oov_buckets, embed_size,
                           mask_zero=True, # not shown in the book
                           input_shape=[None]),
    keras.layers.GRU(128, return_sequences=True),
    keras.layers.GRU(128),
    keras.layers.Dense(1, activation="sigmoid")
])
model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])
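
Conceptually, an embedding layer is a trainable lookup table: each word ID selects one row of the embedding matrix. The pure-Python sketch below illustrates the shapes involved with a tiny, randomly initialized matrix. The names `embedding_matrix` and `embed()` are hypothetical, and nothing is trained here.

```python
import random

random.seed(0)

num_ids = 6     # stand-in for vocab_size + num_oov_buckets
embed_size = 3  # stand-in for the 128 dimensions used in the notebook

# One row per word ID, one column per embedding dimension
embedding_matrix = [
    [random.uniform(-0.05, 0.05) for _ in range(embed_size)]
    for _ in range(num_ids)
]

def embed(batch_ids):
    """Replace every word ID with its row of the embedding matrix."""
    return [[embedding_matrix[word_id] for word_id in sequence]
            for sequence in batch_ids]

batch = [[1, 4, 0], [2, 2, 5]]  # shape [batch size, time steps] = [2, 3]
embedded = embed(batch)

# Resulting shape: [batch size, time steps, embedding size]
print(len(embedded), len(embedded[0]), len(embedded[0][0]))
```

Identical word IDs always map to the same vector, which is why the two occurrences of ID 2 in the second sequence get the same embedding.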

In [64]:
# history = model.fit(train_set, steps_per_epoch=train_size // 32, epochs=5)
# np.save('history/sentiment_analysis.npy', history.history)
# model.save("models/sentiment_analysis.h5")

In [65]:
model = keras.models.load_model('models/sentiment_analysis.h5')
history = np.load('history/sentiment_analysis.npy', allow_pickle=True).item()


### Manual masking

Without masking, the model would need to learn that the padding tokens should be ignored. The previous model avoids this by setting `mask_zero=True` when creating the `Embedding` layer: padding tokens (whose ID is 0) are then ignored by all downstream layers. The following model is equivalent to the previous one, except it is built using the Functional API and handles masking manually:

In [66]:
K = keras.backend
embed_size = 128
inputs = keras.layers.Input(shape=[None])
mask = keras.layers.Lambda(lambda inputs: K.not_equal(inputs, 0))(inputs)
z = keras.layers.Embedding(vocab_size + num_oov_buckets, embed_size)(inputs)
z = keras.layers.GRU(128, return_sequences=True)(z, mask=mask)
z = keras.layers.GRU(128)(z, mask=mask)
outputs = keras.layers.Dense(1, activation="sigmoid")(z)
model = keras.models.Model(inputs=[inputs], outputs=[outputs])
model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])
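
To give an intuition for what the mask does, here is a pure-Python sketch. It assumes that a recurrent layer skips masked time steps and simply carries its state forward. The helpers `compute_mask()` and `run_toy_rnn()` are hypothetical, and the "recurrent update" is a toy arithmetic rule, not a GRU.

```python
def compute_mask(batch_ids, pad_id=0):
    """True where there is a real token, False where there is padding."""
    return [[word_id != pad_id for word_id in sequence] for sequence in batch_ids]

def run_toy_rnn(sequence, mask=None):
    """Toy recurrence: state = 2 * state + input, skipping masked steps."""
    state = 0
    for step, word_id in enumerate(sequence):
        if mask is not None and not mask[step]:
            continue  # masked step: keep the previous state unchanged
        state = 2 * state + word_id
    return state

batch = [[3, 1, 0, 0]]  # one review padded with two zeros
mask = compute_mask(batch)

print(mask)
print(run_toy_rnn(batch[0], mask[0]))  # padding steps are skipped
print(run_toy_rnn(batch[0]))           # padding steps alter the state
```

Without the mask, the padding steps keep modifying the state, so the final state depends on how much padding the batch happened to need.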

In [67]:
# history = model.fit(train_set, steps_per_epoch=train_size // 32, epochs=5)
# np.save('history/manual_masking.npy', history.history)
# model.save("models/manual_masking.h5")
