**Chapter 16 – Natural Language Processing with RNNs and Attention**

_This notebook contains all the sample code in chapter 16._

<table align="left">
  <td>
    <a target="_blank" href="https://colab.research.google.com/github/jflanigan/handson-ml2/blob/master/16_nlp_with_rnns_and_attention.ipynb"><img src="https://www.tensorflow.org/images/colab_logo_32px.png" />Run in Google Colab</a>
  </td>
</table>

# Setup

First, let's import a few common modules, make sure Matplotlib plots figures inline, and prepare a function to save the figures. We also check that Python 3.5 or later is installed (Python 2.x may work, but it is deprecated, so we strongly recommend Python 3), as well as Scikit-Learn ≥0.20 and TensorFlow ≥2.0.

In [1]:
# Python ≥3.5 is required
import sys
assert sys.version_info >= (3, 5)

# Scikit-Learn ≥0.20 is required
import sklearn
assert sklearn.__version__ >= "0.20"

try:
    # %tensorflow_version only exists in Colab.
    %tensorflow_version 2.x
    !pip install -q -U tensorflow-addons
    IS_COLAB = True
except Exception:
    IS_COLAB = False

# TensorFlow ≥2.0 is required
import tensorflow as tf
from tensorflow import keras
assert tf.__version__ >= "2.0"

if not tf.config.list_physical_devices('GPU'):
    print("No GPU was detected. LSTMs and CNNs can be very slow without a GPU.")
    if IS_COLAB:
        print("Go to Runtime > Change runtime and select a GPU hardware accelerator.")

# Common imports
import numpy as np
import os

# to make this notebook's output stable across runs
np.random.seed(42)
tf.random.set_seed(42)

# To plot pretty figures
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt
mpl.rc('axes', labelsize=14)
mpl.rc('xtick', labelsize=12)
mpl.rc('ytick', labelsize=12)

# Where to save the figures
PROJECT_ROOT_DIR = "."
CHAPTER_ID = "nlp"
IMAGES_PATH = os.path.join(PROJECT_ROOT_DIR, "images", CHAPTER_ID)
os.makedirs(IMAGES_PATH, exist_ok=True)

def save_fig(fig_id, tight_layout=True, fig_extension="png", resolution=300):
    path = os.path.join(IMAGES_PATH, fig_id + "." + fig_extension)
    print("Saving figure", fig_id)
    if tight_layout:
        plt.tight_layout()
    plt.savefig(path, format=fig_extension, dpi=resolution)

Colab only includes TensorFlow 2.x; %tensorflow_version has no effect.



# Sentiment Analysis

In [2]:
tf.random.set_seed(42)

You can load the IMDB dataset easily:

In [3]:
(X_train, y_train), (X_test, y_test) = keras.datasets.imdb.load_data()

Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb.npz


In [4]:
X_train[0][:10]

[1, 14, 22, 16, 43, 530, 973, 1622, 1385, 65]

In [5]:
word_index = keras.datasets.imdb.get_word_index()
id_to_word = {id_ + 3: word for word, id_ in word_index.items()}
for id_, token in enumerate(("<pad>", "<sos>", "<unk>")):
    id_to_word[id_] = token
" ".join([id_to_word[id_] for id_ in X_train[0][:10]])

Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb_word_index.json


'<sos> this film was just brilliant casting location scenery story'
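The `+ 3` offset in the cell above keeps IDs 0, 1 and 2 free for the special tokens. The following is a minimal sketch of that convention on its own, using a made-up three-word index; `toy_word_index`, `toy_id_to_word` and `decode` are hypothetical names introduced here for illustration and are not part of Keras.

```python
# Toy stand-in for keras.datasets.imdb.get_word_index(); the words and
# ranks below are invented for illustration only.
toy_word_index = {"the": 1, "film": 2, "brilliant": 3}

# Shift every ID by 3 so that 0, 1 and 2 stay free for the special tokens.
toy_id_to_word = {id_ + 3: word for word, id_ in toy_word_index.items()}
for id_, token in enumerate(("<pad>", "<sos>", "<unk>")):
    toy_id_to_word[id_] = token

def decode(ids):
    """Map a sequence of integer IDs back to a space-separated string."""
    # Falling back to "<unk>" for unseen IDs is an assumption of this sketch.
    return " ".join(toy_id_to_word.get(id_, "<unk>") for id_ in ids)

print(decode([1, 4, 5, 6, 0, 0]))  # <sos> the film brilliant <pad> <pad>
```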

In [6]:
import tensorflow_datasets as tfds

datasets, info = tfds.load("imdb_reviews", as_supervised=True, with_info=True)

Downloading and preparing dataset 80.23 MiB (download: 80.23 MiB, generated: Unknown size, total: 80.23 MiB) to /root/tensorflow_datasets/imdb_reviews/plain_text/1.0.0...


Dataset imdb_reviews downloaded and prepared to /root/tensorflow_datasets/imdb_reviews/plain_text/1.0.0. Subsequent calls will reuse this data.


In [7]:
datasets.keys()

dict_keys([Split('train'), Split('test'), Split('unsupervised')])

In [8]:
train_size = info.splits["train"].num_examples
test_size = info.splits["test"].num_examples

In [9]:
train_size, test_size

(25000, 25000)

In [10]:
for X_batch, y_batch in datasets["train"].batch(2).take(1):
    for review, label in zip(X_batch.numpy(), y_batch.numpy()):
        print("Review:", review.decode("utf-8")[:200], "...")
        print("Label:", label, "= Positive" if label else "= Negative")
        print()

Review: This was an absolutely terrible movie. Don't be lured in by Christopher Walken or Michael Ironside. Both are great actors, but this must simply be their worst role in history. Even their great acting  ...
Label: 0 = Negative

Review: I have been known to fall asleep during films, but this is usually due to a combination of things including, really tired, being warm and comfortable on the sette and having just eaten a lot. However  ...
Label: 0 = Negative



In [11]:
def preprocess(X_batch, y_batch):
    X_batch = tf.strings.substr(X_batch, 0, 300)
    X_batch = tf.strings.regex_replace(X_batch, rb"<br\s*/?>", b" ")
    X_batch = tf.strings.regex_replace(X_batch, b"[^a-zA-Z']", b" ")
    X_batch = tf.strings.split(X_batch)
    return X_batch.to_tensor(default_value=b"<pad>"), y_batch
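To see what each step of `preprocess` does without running TensorFlow, here is a rough pure-Python equivalent that works on one review at a time. It assumes `tf.strings.substr` is used with its default byte unit; `preprocess_text`, `pad_batch` and `MAX_BYTES` are hypothetical names used only in this sketch.

```python
import re

MAX_BYTES = 300  # same cutoff as tf.strings.substr(X_batch, 0, 300)

def preprocess_text(review: bytes) -> list:
    """Truncate, replace <br> tags, drop non-letters, and split on whitespace."""
    review = review[:MAX_BYTES]
    review = re.sub(rb"<br\s*/?>", b" ", review)   # HTML line breaks become spaces
    review = re.sub(rb"[^a-zA-Z']", b" ", review)  # keep letters and apostrophes only
    return review.split()

def pad_batch(token_lists, pad=b"<pad>"):
    """Right-pad every token list to the length of the longest one."""
    width = max(len(tokens) for tokens in token_lists)
    return [tokens + [pad] * (width - len(tokens)) for tokens in token_lists]

batch = pad_batch([preprocess_text(b"Great movie!<br />Loved it."),
                   preprocess_text(b"Don't bother.")])
for tokens in batch:
    print(tokens)
```

Because the truncation happens first, the model only ever sees the opening 300 bytes of a review, however long the original text is.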

In [12]:
preprocess(X_batch, y_batch)

(<tf.Tensor: shape=(2, 53), dtype=string, numpy=
 array([[b'This', b'was', b'an', b'absolutely', b'terrible', b'movie',
         b"Don't", b'be', b'lured', b'in', b'by', b'Christopher',
         b'Walken', b'or', b'Michael', b'Ironside', b'Both', b'are',
         b'great', b'actors', b'but', b'this', b'must', b'simply', b'be',
         b'their', b'worst', b'role', b'in', b'history', b'Even',
         b'their', b'great', b'acting', b'could', b'not', b'redeem',
         b'this', b"movie's", b'ridiculous', b'storyline', b'This',
         b'movie', b'is', b'an', b'early', b'nineties', b'US',
         b'propaganda', b'pi', b'<pad>', b'<pad>', b'<pad>'],
        [b'I', b'have', b'been', b'known', b'to', b'fall', b'asleep',
         b'during', b'films', b'but', b'this', b'is', b'usually', b'due',
         b'to', b'a', b'combination', b'of', b'things', b'including',
         b'really', b'tired', b'being', b'warm', b'and', b'comfortable',
         b'on', b'the', b'sette', b'and', b'having', b'j

In [13]:
from collections import Counter

vocabulary = Counter()
for X_batch, y_batch in datasets["train"].batch(32).map(preprocess):
    for review in X_batch:
        vocabulary.update(list(review.numpy()))

In [14]:
vocabulary.most_common()[:3]

[(b'<pad>', 214309), (b'the', 61137), (b'a', 38564)]

In [15]:
len(vocabulary)

53893

In [16]:
vocab_size = 10000
truncated_vocabulary = [
    word for word, count in vocabulary.most_common()[:vocab_size]]
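The counting and truncation steps can be tried on a few hand-written token lists. This is only a sketch: `toy_reviews`, `toy_vocabulary`, `toy_vocab_size` and `toy_truncated` are hypothetical names, and the tokens are invented.

```python
from collections import Counter

# Three already-tokenized "reviews" (invented data)
toy_reviews = [
    [b"the", b"movie", b"was", b"great"],
    [b"the", b"movie", b"was", b"the", b"worst"],
    [b"the", b"movie", b"plot"],
]

# Count every token across all reviews
toy_vocabulary = Counter()
for tokens in toy_reviews:
    toy_vocabulary.update(tokens)

# Keep only the most frequent tokens; most_common(n) returns the same
# entries as most_common()[:n]
toy_vocab_size = 2
toy_truncated = [word for word, count in toy_vocabulary.most_common(toy_vocab_size)]
print(toy_truncated)  # [b'the', b'movie']
```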

In [17]:
word_to_id = {word: index for index, word in enumerate(truncated_vocabulary)}
for word in b"This movie was faaaaaantastic".split():
    print(word_to_id.get(word, vocab_size))

22
12
11
10000


In [18]:
words = tf.constant(truncated_vocabulary)
word_ids = tf.range(len(truncated_vocabulary), dtype=tf.int64)
vocab_init = tf.lookup.KeyValueTensorInitializer(words, word_ids)
num_oov_buckets = 1000
table = tf.lookup.StaticVocabularyTable(vocab_init, num_oov_buckets)

In [19]:
table.lookup(tf.constant([b"This movie was faaaaaantastic".split()]))

<tf.Tensor: shape=(1, 4), dtype=int64, numpy=array([[   22,    12,    11, 10053]])>
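The ID 10053 returned for the unknown word is `vocab_size` plus a bucket index between 0 and `num_oov_buckets - 1`. The sketch below imitates that behaviour in plain Python. It uses `zlib.crc32` as a stand-in hash, which is an assumption: TensorFlow uses its own fingerprint function, so the bucket chosen for a given word will differ. All `toy_*` names and `lookup` are hypothetical.

```python
import zlib

toy_vocab = [b"<pad>", b"the", b"movie"]  # hypothetical truncated vocabulary
toy_num_oov_buckets = 4                   # the notebook uses 1000

toy_word_to_id = {word: index for index, word in enumerate(toy_vocab)}

def lookup(word: bytes) -> int:
    """Return the vocabulary ID, or a hashed OOV bucket ID for unknown words."""
    if word in toy_word_to_id:
        return toy_word_to_id[word]
    # Stand-in hash; TensorFlow's table uses a different fingerprint function
    bucket = zlib.crc32(word) % toy_num_oov_buckets
    return len(toy_vocab) + bucket

ids = [lookup(word) for word in b"the movie was faaaaaantastic".split()]
print(ids)
```

Hashing unknown words into several buckets, rather than one shared `<unk>` ID, lets the embedding layer learn somewhat different vectors for different unknown words, at the cost of occasional collisions.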

In [20]:
def encode_words(X_batch, y_batch):
    return table.lookup(X_batch), y_batch

train_set = datasets["train"].repeat().batch(32).map(preprocess)
train_set = train_set.map(encode_words).prefetch(1)

In [21]:
for X_batch, y_batch in train_set.take(1):
    print(X_batch)
    print(y_batch)

tf.Tensor(
[[  22   11   28 ...    0    0    0]
 [   6   21   70 ...    0    0    0]
 [4099 6881    1 ...    0    0    0]
 ...
 [  22   12  118 ...  331 1047    0]
 [1757 4101  451 ...    0    0    0]
 [3365 4392    6 ...    0    0    0]], shape=(32, 60), dtype=int64)
tf.Tensor([0 0 0 1 1 1 0 0 0 0 0 1 1 0 1 0 1 1 1 0 1 1 1 1 1 0 0 0 1 0 0 0], shape=(32,), dtype=int64)


----- OUR CODE STARTS HERE -----


In [24]:
embed_size = 16
adam = keras.optimizers.Adam(learning_rate=1e-3)
model = keras.models.Sequential([
    keras.layers.Embedding(vocab_size + num_oov_buckets, embed_size,
                           mask_zero=True, # not shown in the book
                           input_shape=[None]),
    keras.layers.SimpleRNN(64, activation="tanh", dropout=0.5),
    keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(loss="binary_crossentropy", optimizer=adam, metrics=["accuracy"])
history = model.fit(train_set, steps_per_epoch=train_size // 32, epochs=20)

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


In [25]:
# Build the test set with the same pipeline used for the training set above
test_set = datasets["test"].repeat().batch(32).map(preprocess)
test_set = test_set.map(encode_words).prefetch(1)
test_loss, test_accuracy = model.evaluate(test_set, steps=test_size // 32)
print("Test Loss:", test_loss)
print("Test Accuracy:", test_accuracy)

Test Loss: 1.0037076473236084
Test Accuracy: 0.7057058215141296
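Because both datasets are repeated indefinitely, the number of steps has to be given explicitly. The short calculation below shows what `test_size // 32` implies; the sizes are the ones reported by `info.splits` earlier, and the variable names `steps` and `examples_seen` are introduced here only for illustration.

```python
test_size = 25000   # reported by info.splits["test"].num_examples
batch_size = 32

steps = test_size // batch_size       # floor division drops the partial batch
examples_seen = steps * batch_size
print(steps, examples_seen, test_size - examples_seen)  # 781 24992 8
```

So one evaluation pass covers 24,992 of the 25,000 test reviews. The same arithmetic applies to `steps_per_epoch=train_size // 32` during training.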


LSTM implementation

In [26]:
# Mostly the same as the model above; only the SimpleRNN layer is swapped for an LSTM
embed_size = 16
adam = keras.optimizers.Adam(learning_rate=1e-3)
lstm = keras.models.Sequential([
    keras.layers.Embedding(vocab_size + num_oov_buckets, embed_size,
                           mask_zero=True, # not shown in the book
                           input_shape=[None]),
    keras.layers.LSTM(64, activation="tanh", dropout=0.5),
    keras.layers.Dense(1, activation="sigmoid"),
])
lstm.compile(loss="binary_crossentropy", optimizer=adam, metrics=["accuracy"])
newHistory = lstm.fit(train_set, steps_per_epoch=train_size // 32, epochs=20)


Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20
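The two models differ only in their recurrent layer, so it can help to compare their sizes. The sketch below counts trainable parameters by hand, assuming the usual Keras layout of one input kernel, one recurrent kernel and one bias vector per gate; the helper `recurrent_params` is a hypothetical name, and `model.summary()` would be the authoritative check.

```python
vocab_size, num_oov_buckets, embed_size, units = 10000, 1000, 16, 64

def recurrent_params(input_dim, units, gates=1):
    """Kernel + recurrent kernel + bias, repeated once per gate."""
    return gates * units * (input_dim + units + 1)

embedding = (vocab_size + num_oov_buckets) * embed_size
simple_rnn_layer = recurrent_params(embed_size, units)       # one set of weights
lstm_layer = recurrent_params(embed_size, units, gates=4)    # four sets of weights
dense = units + 1                                            # 64 weights + 1 bias

print("SimpleRNN model:", embedding + simple_rnn_layer + dense)  # 181249
print("LSTM model:     ", embedding + lstm_layer + dense)        # 196801
```

Under these assumptions the embedding table dominates both models, and the LSTM layer has four times as many weights as the SimpleRNN layer.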


In [27]:
# test_set was already built in the previous evaluation cell
test_loss, test_accuracy = lstm.evaluate(test_set, steps=test_size // 32)
print("Test Loss:", test_loss)
print("Test Accuracy:", test_accuracy)

Test Loss: 0.8557994961738586
Test Accuracy: 0.7334747314453125


In [42]:
def evaluate_models_by_length(datasets, length, preprocess, encode_words, model, lstm_model):
    # `length` is the minimum size of a raw review, measured in bytes of UTF-8
    # text (one byte per character for plain ASCII). Raising it restricts the
    # evaluation to progressively longer reviews. Note that `preprocess` still
    # truncates every review to its first 300 bytes before tokenizing.
    long_sequences = []

    # Keep only the test reviews that are at least `length` bytes long
    for x, y in datasets["test"]:
        sequence = x.numpy()  # the raw review as a Python bytes object
        if len(sequence) >= length:
            long_sequences.append((sequence, y.numpy()))

    if long_sequences:
        # Unzip the list of (review, label) tuples into two separate tuples
        long_x, long_y = zip(*long_sequences)
    else:
        # No review met the length requirement, so fall back to empty lists
        long_x, long_y = [], []

    # Convert the reviews and labels to tensors
    long_x = tf.convert_to_tensor(long_x, dtype=tf.string)
    long_y = tf.convert_to_tensor(long_y, dtype=tf.int8)

    # Report how many reviews will be evaluated
    print(f"Number of sequences in set: {len(long_x)}")

    # Apply the same text cleanup and tokenization used for training
    long_x, long_y = preprocess(long_x, long_y)

    # Map each token to its integer ID with the vocabulary lookup table
    long_x, long_y = encode_words(long_x, long_y)

    # Evaluate both models passed in as arguments on the same filtered set
    rnn_evaluation = model.evaluate(long_x, long_y)
    lstm_evaluation = lstm_model.evaluate(long_x, long_y)

    return rnn_evaluation, lstm_evaluation

In [43]:
rnn_eval_3000, lstm_eval_3000 = evaluate_models_by_length(datasets, 3000, preprocess, encode_words, model, lstm)
rnn_eval_4000, lstm_eval_4000 = evaluate_models_by_length(datasets, 4000, preprocess, encode_words, model, lstm)
rnn_eval_5000, lstm_eval_5000 = evaluate_models_by_length(datasets, 5000, preprocess, encode_words, model, lstm)

print("RNN Evaluation at 3000 chars:", rnn_eval_3000)
print("LSTM Evaluation at 3000 chars:", lstm_eval_3000)
print("RNN Evaluation at 4000 chars:", rnn_eval_4000)
print("LSTM Evaluation at 4000 chars:", lstm_eval_4000)
print("RNN Evaluation at 5000 chars:", rnn_eval_5000)
print("LSTM Evaluation at 5000 chars:", lstm_eval_5000)

Number of sequences in set: 1683
Number of sequences in set: 719
Number of sequences in set: 295
RNN Evaluation at 3000 chars: [1.2290725708007812, 0.6595365405082703]
LSTM Evaluation at 3000 chars: [1.1797053813934326, 0.6458704471588135]
RNN Evaluation at 4000 chars: [1.2418506145477295, 0.6495131850242615]
LSTM Evaluation at 4000 chars: [1.1655522584915161, 0.6550765037536621]
RNN Evaluation at 5000 chars: [1.2920520305633545, 0.6271186470985413]
LSTM Evaluation at 5000 chars: [1.1795945167541504, 0.6440678238868713]


In [48]:
def display_results(length, datasets, preprocess, encode_words, model, lstm_model):
    print("-" * 40)  # Divider for readability
    rnn_eval, lstm_eval = evaluate_models_by_length(datasets, length, preprocess, encode_words, model, lstm_model)

    print(f"RNN Evaluation at {length} chars: Accuracy = {rnn_eval[1]:.4f}, Loss = {rnn_eval[0]:.4f}")
    print(f"LSTM Evaluation at {length} chars: Accuracy = {lstm_eval[1]:.4f}, Loss = {lstm_eval[0]:.4f}")

# Run evaluations for different character lengths
lengths = [3000, 4000, 5000, 6000, 7000]
for length in lengths:
    display_results(length, datasets, preprocess, encode_words, model, lstm)

----------------------------------------
Number of sequences in set: 1683
RNN Evaluation at 3000 chars: Accuracy = 0.6595, Loss = 1.2291
LSTM Evaluation at 3000 chars: Accuracy = 0.6459, Loss = 1.1797
----------------------------------------
Number of sequences in set: 719
RNN Evaluation at 4000 chars: Accuracy = 0.6495, Loss = 1.2419
LSTM Evaluation at 4000 chars: Accuracy = 0.6551, Loss = 1.1656
----------------------------------------
Number of sequences in set: 295
RNN Evaluation at 5000 chars: Accuracy = 0.6271, Loss = 1.2921
LSTM Evaluation at 5000 chars: Accuracy = 0.6441, Loss = 1.1796
----------------------------------------
Number of sequences in set: 26
RNN Evaluation at 6000 chars: Accuracy = 0.5769, Loss = 1.2710
LSTM Evaluation at 6000 chars: Accuracy = 0.5000, Loss = 1.7659
----------------------------------------
Number of sequences in set: 7
RNN Evaluation at 7000 chars: Accuracy = 0.5714, Loss = 1.2670
LSTM Evaluation at 7000 chars: Accuracy = 0.4286, Loss = 1.7852
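The filtered sets shrink quickly, from 1683 reviews down to 7, so the accuracies for the longest reviews rest on very few examples. As a rough guide, the sketch below computes a binomial standard error for each accuracy reported above. It assumes the predictions are independent, which is a simplification, and `standard_error` is a hypothetical helper.

```python
import math

# (minimum length, number of reviews, RNN accuracy, LSTM accuracy),
# copied from the output above
results = [
    (3000, 1683, 0.6595, 0.6459),
    (4000, 719, 0.6495, 0.6551),
    (5000, 295, 0.6271, 0.6441),
    (6000, 26, 0.5769, 0.5000),
    (7000, 7, 0.5714, 0.4286),
]

def standard_error(accuracy, n):
    """Binomial standard error of an accuracy measured on n examples."""
    return math.sqrt(accuracy * (1 - accuracy) / n)

for length, n, rnn_acc, lstm_acc in results:
    print(f"{length}: n={n:4d}  "
          f"RNN {rnn_acc:.4f} ± {standard_error(rnn_acc, n):.4f}  "
          f"LSTM {lstm_acc:.4f} ± {standard_error(lstm_acc, n):.4f}")
```

At 6000 and 7000 the gap between the two models is smaller than one standard error of either accuracy, so those rows are best read as inconclusive rather than as evidence that one model handles long reviews better.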
