This is a companion notebook for the book [Deep Learning with Python, Second Edition](https://www.manning.com/books/deep-learning-with-python-second-edition?a_aid=keras&a_bid=76564dff). For readability, it only contains runnable code blocks and section titles, and omits everything else in the book: text paragraphs, figures, and pseudocode.

**If you want to be able to follow what's going on, I recommend reading the notebook side by side with your copy of the book.**

This notebook was generated for TensorFlow 2.6.

### Processing words as a sequence: the Sequence Model approach

#### A first practical example

**Downloading the data**

In [None]:
!curl -O https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
!tar -xf aclImdb_v1.tar.gz
!rm -r aclImdb/train/unsup

**Preparing the data**

In [None]:
import os, pathlib, shutil, random
from tensorflow import keras
batch_size = 32
base_dir = pathlib.Path("../dlkeras/aclImdb")
val_dir = base_dir / "val"
train_dir = base_dir / "train"
for category in ("neg", "pos"):
    os.makedirs(val_dir / category)
    files = os.listdir(train_dir / category)
    random.Random(1337).shuffle(files)
    num_val_samples = int(0.2 * len(files))
    val_files = files[-num_val_samples:]
    for fname in val_files:
        shutil.move(train_dir / category / fname,
                    val_dir / category / fname)



In [4]:
train_ds = keras.utils.text_dataset_from_directory(
    "../dlkeras/aclImdb/train", batch_size=batch_size
)
val_ds = keras.utils.text_dataset_from_directory(
    "../dlkeras/aclImdb/val", batch_size=batch_size
)
test_ds = keras.utils.text_dataset_from_directory(
    "../dlkeras/aclImdb/test", batch_size=batch_size
)
text_only_train_ds = train_ds.map(lambda x, y: x)

Found 21002 files belonging to 3 classes.
Found 3972 files belonging to 2 classes.
Found 25000 files belonging to 2 classes.


In [52]:
train_ds

<BatchDataset shapes: ((None,), (None,)), types: (tf.string, tf.int32)>

In [24]:
one_record = train_ds.take(1).get_single_element()[0][0]

In [30]:
one_label = train_ds.take(1).get_single_element()[1][2]

In [32]:
one_record.numpy().decode('utf8'), one_label.numpy()

("I used to LOVE this movie as a kid but, seeing it again 20+ years later, it actually sucks. Up The Academy might have been ahead of it's time back in 1980, but it has almost nothing to offer today! Movies like Caddyshack and Stripes hold-up much better today than this steaming dogpile. No T&A. No great jokes except for the one-liners we've all heard a million times by now.<br /><br />I recently bought the DVD in hopes that it would be the gem I remembered it being. Well, I was WAY off! The soundtrack had only 2-3 widely-recognizable hits (not the smash compilation others had mentioned) and the frequent voice-overs were terrible. The only thing that was interesting, to me, was predicting what the character's lines were before they said them. Yep, I watched this movie that much back then! <br /><br />The only reason I am writing this review is to give my two cents on why this movie should be forgotten, sorry to say. :(",
 1)

In [57]:
text_only_train_ds.take(1).get_single_element()[2].numpy()

b'Now please don\'t start calling me names like, "unpatriotic" , "weirdo" and more .<br /><br />The very length of this movie (4 hours .. !!!) is its biggest mistake . No editing at all - seems like J.P. Dutta fell in love with his project too much . Even Lagaan was 4 hours long - but it was entertaining and gave a message as well .<br /><br />It\'s based on true incidents and real people . Kudos to it , but were the repetitive war scenes really needed ? On top of it the focus constantly shifted from one battalion / squadron to another and it was impossible to keep a track of them all .<br /><br />Between the skirmishes , there were songs about loneliness , lovesickness and related stuff . There were chummy conversations . In the beginning it gave some relief from the violence but became so monotonous later that one could even correctly predict nature of the forthcoming talk .<br /><br />Why were the soldiers walking around as if they were lions in jungle , fully unaware that enemy was

**Preparing integer sequence datasets**

In [7]:
from tensorflow.keras import layers

max_length = 600
max_tokens = 20000
text_vectorization = layers.TextVectorization(
    max_tokens=max_tokens,
    output_mode="int",
    output_sequence_length=max_length,
)
text_vectorization.adapt(text_only_train_ds)

int_train_ds = train_ds.map(lambda x, y: (text_vectorization(x), y))
int_val_ds = val_ds.map(lambda x, y: (text_vectorization(x), y))
int_test_ds = test_ds.map(lambda x, y: (text_vectorization(x), y))

**A sequence model built on top of one-hot encoded vector sequences**

In [33]:
import tensorflow as tf
inputs = keras.Input(shape=(None,), dtype="int64")
embedded = tf.one_hot(inputs, depth=max_tokens)
x = layers.Bidirectional(layers.LSTM(32))(embedded)
x = layers.Dropout(0.5)(x)
outputs = layers.Dense(1, activation="sigmoid")(x)
model = keras.Model(inputs, outputs)
model.compile(optimizer="rmsprop",
              loss="binary_crossentropy",
              metrics=["accuracy"])
model.summary()

Extension horovod.torch has not been built: /usr/local/lib/python3.8/site-packages/horovod/torch/mpi_lib/_mpi_lib.cpython-38-x86_64-linux-gnu.so not found
If this is not expected, reinstall Horovod with HOROVOD_WITH_PYTORCH=1 to debug the build error.
[2022-02-16 14:53:15.805 tensorflow-2-6-gpu-ml-g4dn-2xlarge-19b876f82216afed60344884c91d:20 INFO utils.py:27] RULE_JOB_STOP_SIGNAL_FILENAME: None
[2022-02-16 14:53:15.829 tensorflow-2-6-gpu-ml-g4dn-2xlarge-19b876f82216afed60344884c91d:20 INFO profiler_config_parser.py:111] Unable to find config at /opt/ml/input/config/profilerconfig.json. Profiler is disabled.
Model: "model"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_1 (InputLayer)         [(None, None)]            0         
_________________________________________________________________
tf.one_hot (TFOpLambda)      (None, None, 20000)       0         
________________________________________

**Training a first basic sequence model**

In [34]:
callbacks = [
    keras.callbacks.ModelCheckpoint("one_hot_bidir_lstm.keras",
                                    save_best_only=True)
]
model.fit(int_train_ds, validation_data=int_val_ds, epochs=10, callbacks=callbacks)
model = keras.models.load_model("one_hot_bidir_lstm.keras")
print(f"Test acc: {model.evaluate(int_test_ds)[1]:.3f}")

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Test acc: 0.500


In [35]:
evluacion_modelo = model.evaluate(int_test_ds)



#### Understanding word embeddings

##### Learning word embeddings with the `Embedding` layer

**Instantiating an `Embedding` layer**

In [36]:
embedding_layer = layers.Embedding(input_dim=max_tokens, output_dim=256)

**Model that uses an Embedding layer trained from scratch**

In [37]:
inputs = keras.Input(shape=(None,), dtype="int64")
embedded = layers.Embedding(input_dim=max_tokens, output_dim=256)(inputs)
x = layers.Bidirectional(layers.LSTM(32))(embedded)
x = layers.Dropout(0.5)(x)
outputs = layers.Dense(1, activation="sigmoid")(x)
model = keras.Model(inputs, outputs)
model.compile(optimizer="rmsprop",
              loss="binary_crossentropy",
              metrics=["accuracy"])
model.summary()

callbacks = [
    keras.callbacks.ModelCheckpoint("embeddings_bidir_gru.keras",
                                    save_best_only=True)
]
model.fit(int_train_ds, validation_data=int_val_ds, epochs=10, callbacks=callbacks)
model = keras.models.load_model("embeddings_bidir_gru.keras")
print(f"Test acc: {model.evaluate(int_test_ds)[1]:.3f}")

Model: "model_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_2 (InputLayer)         [(None, None)]            0         
_________________________________________________________________
embedding_1 (Embedding)      (None, None, 256)         5120000   
_________________________________________________________________
bidirectional_1 (Bidirection (None, 64)                73984     
_________________________________________________________________
dropout_1 (Dropout)          (None, 64)                0         
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 65        
Total params: 5,194,049
Trainable params: 5,194,049
Non-trainable params: 0
_________________________________________________________________
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Test

###### Understanding padding & masking

**Model that uses an Embedding layer trained from scratch, with masking enabled**

In [38]:
inputs = keras.Input(shape=(None,), dtype="int64")
embedded = layers.Embedding(
    input_dim=max_tokens, output_dim=256, mask_zero=True)(inputs)
x = layers.Bidirectional(layers.LSTM(32))(embedded)
x = layers.Dropout(0.5)(x)
outputs = layers.Dense(1, activation="sigmoid")(x)
model = keras.Model(inputs, outputs)
model.compile(optimizer="rmsprop",
              loss="binary_crossentropy",
              metrics=["accuracy"])
model.summary()

callbacks = [
    keras.callbacks.ModelCheckpoint("embeddings_bidir_gru_with_masking.keras",
                                    save_best_only=True)
]
model.fit(int_train_ds, validation_data=int_val_ds, epochs=10, callbacks=callbacks)
model = keras.models.load_model("embeddings_bidir_gru_with_masking.keras")
print(f"Test acc: {model.evaluate(int_test_ds)[1]:.3f}")

Model: "model_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_3 (InputLayer)         [(None, None)]            0         
_________________________________________________________________
embedding_2 (Embedding)      (None, None, 256)         5120000   
_________________________________________________________________
bidirectional_2 (Bidirection (None, 64)                73984     
_________________________________________________________________
dropout_2 (Dropout)          (None, 64)                0         
_________________________________________________________________
dense_2 (Dense)              (None, 1)                 65        
Total params: 5,194,049
Trainable params: 5,194,049
Non-trainable params: 0
_________________________________________________________________
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Test

##### Using pretrained word embeddings

###### Downloading the GloVe word embeddings

In [39]:
!wget http://nlp.stanford.edu/data/glove.6B.zip
!unzip -q glove.6B.zip

--2022-02-16 17:39:35--  http://nlp.stanford.edu/data/glove.6B.zip
Resolving nlp.stanford.edu (nlp.stanford.edu)... 171.64.67.140
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://nlp.stanford.edu/data/glove.6B.zip [following]
--2022-02-16 17:39:35--  https://nlp.stanford.edu/data/glove.6B.zip
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: http://downloads.cs.stanford.edu/nlp/data/glove.6B.zip [following]
--2022-02-16 17:39:35--  http://downloads.cs.stanford.edu/nlp/data/glove.6B.zip
Resolving downloads.cs.stanford.edu (downloads.cs.stanford.edu)... 171.64.64.22
Connecting to downloads.cs.stanford.edu (downloads.cs.stanford.edu)|171.64.64.22|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 862182613 (822M) [application/zip]
Saving to: ‘glove.6B.zip’


2022-0

**Parsing the GloVe word-embeddings file**

In [40]:
import numpy as np
path_to_glove_file = "glove.6B.100d.txt"

embeddings_index = {}
with open(path_to_glove_file) as f:
    for line in f:
        word, coefs = line.split(maxsplit=1)
        coefs = np.fromstring(coefs, "f", sep=" ")
        embeddings_index[word] = coefs

print(f"Found {len(embeddings_index)} word vectors.")

Found 400000 word vectors.


###### Loading the GloVe embeddings in the model

**Preparing the GloVe word-embeddings matrix**

In [41]:
embedding_dim = 100

vocabulary = text_vectorization.get_vocabulary()
word_index = dict(zip(vocabulary, range(len(vocabulary))))

embedding_matrix = np.zeros((max_tokens, embedding_dim))
for word, i in word_index.items():
    if i < max_tokens:
        embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector

In [42]:
embedding_layer = layers.Embedding(
    max_tokens,
    embedding_dim,
    embeddings_initializer=keras.initializers.Constant(embedding_matrix),
    trainable=False,
    mask_zero=True,
)

###### Training a simple bidirectional LSTM on top of the GloVe embeddings

**Model that uses aget_single_elementetrained Embedding layer**

In [43]:
inputs = keras.Input(shape=(None,), dtype="int64")
embedded = embedding_layer(inputs)
x = layers.Bidirectional(layers.LSTM(32))(embedded)
x = layers.Dropout(0.5)(x)
outputs = layers.Dense(1, activation="sigmoid")(x)
model = keras.Model(inputs, outputs)
model.compile(optimizer="rmsprop",
              loss="binary_crossentropy",
              metrics=["accuracy"])
model.summary()

callbacks = [
    keras.callbacks.ModelCheckpoint("glove_embeddings_sequence_model.keras",
                                    save_best_only=True)
]
model.fit(int_train_ds, validation_data=int_val_ds, epochs=10, callbacks=callbacks)
model = keras.models.load_model("glove_embeddings_sequence_model.keras")
print(f"Test acc: {model.evaluate(int_test_ds)[1]:.3f}")

Model: "model_3"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_4 (InputLayer)         [(None, None)]            0         
_________________________________________________________________
embedding_3 (Embedding)      (None, None, 100)         2000000   
_________________________________________________________________
bidirectional_3 (Bidirection (None, 64)                34048     
_________________________________________________________________
dropout_3 (Dropout)          (None, 64)                0         
_________________________________________________________________
dense_3 (Dense)              (None, 1)                 65        
Total params: 2,034,113
Trainable params: 34,113
Non-trainable params: 2,000,000
_________________________________________________________________
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10

In [59]:
int_train_ds, int_test_ds

(<MapDataset shapes: ((None, 600), (None,)), types: (tf.int64, tf.int32)>,
 <MapDataset shapes: ((None, 600), (None,)), types: (tf.int64, tf.int32)>)

In [65]:

int_train_ds.take(1).get_single_element()[1][0], int_train_ds.take(1).get_single_element()[0][0]

(<tf.Tensor: shape=(), dtype=int32, numpy=1>,
 <tf.Tensor: shape=(600,), dtype=int64, numpy=
 array([  713,   422,    21,     2,  3751,     7,    31,  2584,    27,
           14,   323,  1423,   148,     6,   412,    31,  6851,    27,
          287,     2,  1299,     5,     4,  2694,     1,     7,    34,
            1,  7158,    37,    14, 14070,    17,  8643,     1,  3920,
           31,     2,   575,     5,    29,    31,   575,     5,  6851,
           99,     2,   112,  1798,    23,  6063,   148,   166,     1,
          158,   138,    74,    27,   847,     6,   208,     6,     2,
          169,     6,   400,     2,  1030,    37,   299,    12,    27,
           60,    26,    61,  1423,   148,     6,   412,    27,   444,
            6,   120,     2,  1030,    12,    27,     7,   123,  1181,
            3,     1, 15532,  5548,   137,   124,   490,     7,    15,
           50,    15,  1955,    56,    10,    26,   107,    11,   768,
           13,   516,    30,  7722,     6,   412,    53