# Chp 11: Part 4

### 11.4.4 When to use sequence models over bag-of-words models?

You may sometimes hear that bag-of-words methods are outdated, and that Transformer-based sequence models are the way to go, no matter what task or dataset you’re looking at. This is definitely not the case: a small stack of Dense layers on top of a bag-of-bigrams remains a perfectly valid and relevant approach in many cases. In fact, among the various techniques that we’ve tried on the IMDB dataset throughout this chapter, the best performing so far was the bag-of-bigrams!

So, when should you prefer one approach over the other?

In 2017, my team and I ran a systematic analysis of the performance of various text classification techniques across many different types of text datasets, and we discovered a remarkable and surprising rule of thumb for deciding whether to go with a bag-of-words model or a sequence model. A golden constant of sorts.

It turns out that, when approaching a new text classification task, you should pay close attention to the ratio between the number of samples in your training data and the mean number of words per sample.

    ratio = (number of samples in your training data) / (mean number of words per sample)

1. If that ratio is small (less than 1,500) then the bag-of-bigrams model will perform better (and as a bonus, it will be much faster to train and to iterate on). 

2. If that ratio is higher than 1,500, then you should go with a sequence model. 

**In other words, sequence models work best when lots of training data is available and when each sample is relatively short.**

* So if you’re classifying 1,000-word-long documents, and you have 100,000 of them, you should go with a bigram model (ratio: `100_000 / 1000 = 100`).
* If you’re classifying tweets that are 40 words long on average, and you have 50,000 of them, you should also go with a bigram model (ratio: `50_000 / 40 = 1,250`).
* But if you increase your dataset size to 500,000 tweets, then go with a Transformer encoder (ratio: `500_000 / 40 = 12,500`).
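
The rule of thumb above is simple enough to express as a small helper. The following is a minimal sketch, not part of the original analysis: the function name `recommend_model` is hypothetical, and sending a ratio of exactly 1,500 to the sequence model is an assumption, since the text only covers ratios below and above the threshold.

```python
def recommend_model(num_samples, mean_words_per_sample, threshold=1500):
    """Return (ratio, suggestion) following the samples/words rule of thumb."""
    if mean_words_per_sample <= 0:
        raise ValueError("mean_words_per_sample must be positive")
    ratio = num_samples / mean_words_per_sample
    # Assumption: the boundary case (ratio == threshold) goes to the sequence model.
    suggestion = "bag-of-bigrams" if ratio < threshold else "sequence model"
    return ratio, suggestion


# The three scenarios described in the text.
for n, w in [(100_000, 1000), (50_000, 40), (500_000, 40)]:
    ratio, suggestion = recommend_model(n, w)
    print(f"{n:>7} samples, {w:>4} words/sample -> ratio {ratio:,.0f}: {suggestion}")
```

Keeping the threshold as a parameter makes it easy to see how sensitive the recommendation is for datasets whose ratio sits close to 1,500.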

What about the IMDB movie review classification task? We had 20,000 training samples and an average word count of 233, so our rule of thumb points towards a bigram model—which confirms what we found out in practice.

    ratio = 20_000 / 233 ≈ 85.84

This intuitively makes sense: the input of a sequence model represents a richer and more complex space, and thus it takes more data to map out that space. Meanwhile, a plain set of terms is a space so simple that you can train a logistic regression on top using just a few hundred or a few thousand samples. In addition, the shorter a sample is, the less the model can afford to discard any of the information it contains. In particular, word order becomes more important, and discarding it can create ambiguity. The sentences "this movie is the bomb" and "this movie was a bomb" have very close unigram representations, which could confuse a bag-of-words model, but a sequence model could tell which one is negative and which one is positive. With a longer sample, word statistics would become more reliable and the topic or sentiment would be more apparent from the word histogram alone.
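
To make the "bomb" example concrete, here is a small pure-Python sketch that compares the word n-grams of the two sentences. The helper `ngrams` and its whitespace tokenization are illustrative assumptions; they are not the vectorization used elsewhere in this chapter.

```python
def ngrams(sentence, n):
    """Return the set of word n-grams in a whitespace-tokenized sentence."""
    words = sentence.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


a = "this movie is the bomb"
b = "this movie was a bomb"

for n, name in [(1, "unigrams"), (2, "bigrams")]:
    shared = ngrams(a, n) & ngrams(b, n)
    print(f"shared {name}: {sorted(shared)}")
# shared unigrams: ['bomb', 'movie', 'this']
# shared bigrams: ['this movie']
```

Three of the five unigrams are shared, including "bomb", so the unigram views look alike. The words that carry the sentiment difference only become distinctive once some word order is kept ("the bomb" versus "a bomb").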

Now, keep in mind that this heuristic rule was developed specifically for text classification. It may not necessarily hold for other NLP tasks: when it comes to machine translation, for instance, the Transformer shines especially for very long sequences, compared to RNNs. Our heuristic is also just a rule of thumb, rather than a scientific law, so expect it to work most of the time, but not necessarily every time.

---
## 11.5 Beyond text classification: sequence-to-sequence learning, pg.400

In [None]:
# treat as globals for now
vocab_size = 15000
sequence_length = 20


def setup():
    import tensorflow as tf
    from tensorflow import keras

    # every layer uses a 16-bit compute dtype and float32 variable dtype by default.
    # most of the forward pass of the model will be done in float16,
    # (with the exception of numerically unstable operations like softmax),
    # while the weights of the model will be stored and updated in float32.
    keras.mixed_precision.set_global_policy("mixed_float16")
    
setup()


def HR():
    # print char * numeric
    print('-' * 80)


def listing11_5_1():
    import os

    dirpath = 'spa-eng'
    if not os.path.isdir(dirpath):
        print(f'{dirpath} not found, creating directory')
        HR()
        try:
            !curl -O http://storage.googleapis.com/download.tensorflow.org/data/spa-eng.zip
            !unzip -q spa-eng.zip
        except Exception as ex:
            print(f"Not able to create directory due to error {ex}")
            
listing11_5_1()

Your GPU may run slowly with dtype policy mixed_float16 because it does not have compute capability of at least 7.0. Your GPU:
  Tesla P4, compute capability 6.1
See https://developer.nvidia.com/cuda-gpus for a list of GPUs and their compute capabilities.
spa-eng not found, creating directory
--------------------------------------------------------------------------------
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 2576k  100 2576k    0     0  10.1M      0 --:--:-- --:--:-- --:--:-- 10.1M


In [None]:
# Create global to carry the original data, text_pairs

def listing11_5_2():
    import random

    # The text file contains one example per line: an English sentence, 
    # followed by a tab character, followed by the corresponding Spanish sentence.
    
    text_file = "spa-eng/spa.txt"
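    # Read as UTF-8 explicitly so accented Spanish characters decode the same on every platform.
    with open(text_file, encoding="utf-8") as f: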
        lines = f.read().split("\n")[:-1]
    text_pairs = []

    # Iterate over the lines in the file.
    for line in lines:
        # Each line contains an English phrase and its Spanish translation, tab-separated.
        english, spanish = line.split("\t")
        # prepend "[start]" and append "[end]" to the Spanish sentence
        spanish = "[start] " + spanish + " [end]"
        text_pairs.append((english, spanish))

    print(random.choice(text_pairs))

    return text_pairs

text_pairs = listing11_5_2()
print(len(text_pairs))

('He neglects his studies.', '[start] Él descuida sus estudios. [end]')
118964


In [None]:
# p.402

def listing11_28():
    import random

    random.shuffle(text_pairs)
    num_val_samples = int(0.15 * len(text_pairs))
    num_train_samples = len(text_pairs) - 2 * num_val_samples

    train_pairs = text_pairs[:num_train_samples]
    val_pairs = text_pairs[num_train_samples:num_train_samples + num_val_samples]
    test_pairs = text_pairs[num_train_samples + num_val_samples:]

    print("total text_pairs length:", len(text_pairs))
    print("train_pairs length:", len(train_pairs))
    print("val_pairs length:", len(val_pairs))
    print("test_pairs:", len(test_pairs))
    print()

    ###########################################################

    # Listing 11.28 Vectorizing the English and Spanish text pairs

    import tensorflow as tf
    import string
    import re

    strip_chars = string.punctuation + "¿"
    strip_chars = strip_chars.replace("[", "")
    strip_chars = strip_chars.replace("]", "")

    def custom_standardization(input_string):
        lowercase = tf.strings.lower(input_string)
        return tf.strings.regex_replace(
            lowercase, f"[{re.escape(strip_chars)}]", "")


    source_vectorization = tf.keras.layers.experimental.preprocessing.TextVectorization(
        max_tokens=vocab_size,
        output_mode="int",
        output_sequence_length=sequence_length,
    )
    target_vectorization = tf.keras.layers.experimental.preprocessing.TextVectorization(
        max_tokens=vocab_size,
        output_mode="int",
        output_sequence_length=sequence_length + 1,
        standardize=custom_standardization,
    )
    train_english_texts = [pair[0] for pair in train_pairs]
    train_spanish_texts = [pair[1] for pair in train_pairs]
    source_vectorization.adapt(train_english_texts)
    target_vectorization.adapt(train_spanish_texts)


    ###########################################################

    # Listing 11.29 Preparing training and validation datasets for the translation task

    batch_size = 64

    def format_dataset(eng, spa):
        eng = source_vectorization(eng)
        spa = target_vectorization(spa)
        return ({
            "english": eng,
            "spanish": spa[:, :-1],
        }, spa[:, 1:])

    def make_dataset(pairs):
        eng_texts, spa_texts = zip(*pairs)
        eng_texts = list(eng_texts)
        spa_texts = list(spa_texts)
        dataset = tf.data.Dataset.from_tensor_slices((eng_texts, spa_texts))
        dataset = dataset.batch(batch_size)
        dataset = dataset.map(format_dataset)
        return dataset.shuffle(2048).prefetch(16).cache()

    train_ds = make_dataset(train_pairs)
    val_ds = make_dataset(val_pairs)

    ###########################################################

    for inputs, targets in train_ds.take(1):
        print(f"inputs['english'].shape: {inputs['english'].shape}")
        print(f"inputs['spanish'].shape: {inputs['spanish'].shape}")
        print(f"targets.shape: {targets.shape}")

    return train_ds, val_ds, target_vectorization, source_vectorization, test_pairs

train_ds, val_ds, target_vectorization, source_vectorization, test_pairs = listing11_28()


total text_pairs length: 118964
train_pairs length: 83276
val_pairs length: 17844
test_pairs: 17844

inputs['english'].shape: (64, 20)
inputs['spanish'].shape: (64, 20)
targets.shape: (64, 20)
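
The matching shapes above follow from how `format_dataset` slices the target. The vectorized Spanish sentence has `sequence_length + 1` tokens; the decoder input drops the last one and the target drops the first, so the two are offset by one step. Below is a minimal pure-Python sketch of that offset. The token ids and the short length are made up for illustration and are not values from the dataset.

```python
# Hypothetical token ids for "[start] a b c [end]" padded with 0 to length 6
# (standing in for sequence_length + 1 = 21 in the notebook).
spa = [2, 17, 45, 9, 3, 0]

decoder_input = spa[:-1]   # what the decoder sees: tokens 0..N
target = spa[1:]           # what it must predict: tokens 1..N+1

for step, (seen, expected) in enumerate(zip(decoder_input, target)):
    print(f"step {step}: input token {seen} -> target token {expected}")
```

At every step the target is simply the token that follows the current input token, which is what lets the decoder be trained to predict the next word from the words before it.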


---
### 11.5.2 Sequence-to-sequence learning with RNNs

Recurrent neural networks dominated sequence-to-sequence learning from 2015 to 2017, before being overtaken by Transformer. They were the basis for many real-world machine translation systems—as mentioned in chapter 10, Google Translate circa 2017 was powered by a stack of seven large LSTM layers. It’s still worth learning about this approach today, as it provides an easy entry point to understand sequence-to-sequence models.


The simplest, naive way to use RNNs to turn a sequence into another sequence is to keep the output of the RNN at each time step—in Keras, like this:

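    from tensorflow import keras
    from tensorflow.keras import layers

    # vocab_size and sequence_length are the globals defined earlier in this notebook
    inputs = keras.Input(shape=(sequence_length,), dtype="int64")
    x = layers.Embedding(input_dim=vocab_size, output_dim=128)(inputs)
    x = layers.LSTM(32, return_sequences=True)(x)
    outputs = layers.Dense(vocab_size, activation="softmax")(x)
    model = keras.Model(inputs, outputs)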


However, there are two major issues with this approach:

1. The target sequence must always be the same length as the source sequence. In practice, this is rarely the case. Technically, this isn’t critical, as you could always pad either the source sequence or the target sequence to make their lengths match.
2. Due to the step-by-step nature of RNNs, the model would only be looking at tokens 0...N in the source sequence in order to predict token N in the target sequence. This constraint makes this setup unsuitable for most tasks, in particular translation. Consider translating "The weather is nice today" to French—that would be "Il fait beau aujourd’hui". You’d need to be able to predict "Il" from just "The", "Il fait" from just "The weather", etc., which is simply impossible.

If you’re a human translator, you’d start by reading the entire source sentence before starting to translate it. This is especially important if you’re dealing with languages that have wildly different word ordering, like English and Japanese. And that’s exactly what standard sequence-to-sequence models do.


In a proper sequence-to-sequence setup (see figure 11.13), you would first use an RNN (the encoder) to turn the entire source sequence into a single vector (or set of vectors). This could be the last output of the RNN, or alternatively, its final internal state vectors. Then you would use this vector (or vectors) as the initial state of another RNN (the decoder), which would look at elements 0...N in the target sequence, and try to predict step N+1 in the target sequence.

Let’s implement this in Keras with GRU-based encoders and decoders. The choice of GRU rather than LSTM makes things a bit simpler, since GRU only has a single state vector, whereas LSTM has multiple. Let’s start with the encoder.


In [None]:
# Listing 11.30 GRU-based encoder
# Sequence-to-sequence learning with RNNs
# GRU-based encoder

def listing11_30():
    from tensorflow import keras
    from tensorflow.keras import layers
    import numpy as np
    import random
    
    embed_dim = 256
    latent_dim = 1024
    source = keras.Input(shape=(None,), dtype="int64", name="english")
    x = layers.Embedding(vocab_size, embed_dim, mask_zero=True)(source)
    encoded_source = layers.Bidirectional(layers.GRU(latent_dim), merge_mode="sum")(x)

    ###########################################################

    # 11.31 GRU-based decoder and the end-to-end model
    past_target = keras.Input(shape=(None,), dtype="int64", name="spanish")
    x = layers.Embedding(vocab_size, embed_dim, mask_zero=True)(past_target)
    decoder_gru = layers.GRU(latent_dim, return_sequences=True)
    x = decoder_gru(x, initial_state=encoded_source)
    x = layers.Dropout(0.5)(x)
    target_next_step = layers.Dense(vocab_size, activation="softmax")(x)
    seq2seq_rnn = keras.Model([source, past_target], target_next_step)

    ###########################################################

    # Listing 11.32 Training our recurrent sequence-to-sequence model
    # This takes a long time
    seq2seq_rnn.compile(
        optimizer="rmsprop",
        loss="sparse_categorical_crossentropy",
        metrics=["accuracy"])

    seq2seq_rnn.fit(
        train_ds, 
        #epochs=3, 
        epochs=1, 
        validation_data=val_ds
    )

    ###########################################################

    # Using our model for inference
    # Translating new sentences with our RNN encoder and decoder

    spa_vocab = target_vectorization.get_vocabulary()
    spa_index_lookup = dict(zip(range(len(spa_vocab)), spa_vocab))
    max_decoded_sentence_length = 20

    def decode_sequence(input_sentence):
        tokenized_input_sentence = source_vectorization([input_sentence])
        decoded_sentence = "[start]"
        for i in range(max_decoded_sentence_length):
            tokenized_target_sentence = target_vectorization([decoded_sentence])
            next_token_predictions = seq2seq_rnn.predict(
                [tokenized_input_sentence, tokenized_target_sentence])
            sampled_token_index = np.argmax(next_token_predictions[0, i, :])
            sampled_token = spa_index_lookup[sampled_token_index]
            decoded_sentence += " " + sampled_token
            if sampled_token == "[end]":
                break
        return decoded_sentence

    test_eng_texts = [pair[0] for pair in test_pairs]
    for _ in range(20):
        input_sentence = random.choice(test_eng_texts)
        print("-")
        print(input_sentence)
        print(decode_sequence(input_sentence))


listing11_30()

-
This might not be enough.
[start] esto no puede ser tan [end]
-
Don't forget to pick me up at 6 o'clock tomorrow.
[start] no me va a la mañana mañana [end]
-
Tom put out the fire.
[start] tom se [UNK] la luz [end]
-
That's no longer possible.
[start] eso no es más que yo [end]
-
I hope things change.
[start] espero que [UNK] [end]
-
I need a good dictionary.
[start] necesito un buen trabajo [end]
-
Do you like to study?
[start] te gusta ir [end]
-
Tom doesn't know the reason why Mary is absent.
[start] tom no sabe la verdad de tom está haciendo [end]
-
I live in the house.
[start] yo en la casa [end]
-
I will see him after I get back.
[start] me lo vi cuando yo me [UNK] [end]
-
Tom was nice to everyone.
[start] tom estaba muy tarde a las seis [end]
-
He is not likely to succeed.
[start] Él no es miedo [end]
-
I thought I knew them.
[start] pensé que yo te había ido [end]
-
I am trying to learn English.
[start] estoy de acuerdo de sus amigos [end]
-
Tom picked up a book, opened it, an
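
The `decode_sequence` loop used above is a greedy decoder: at each step it keeps only the highest-scoring next token and feeds the growing sentence back in. The sketch below isolates that control flow with a toy stand-in for the model. The lookup table `toy_next_token_scores`, its words and its scores are entirely hypothetical, and unlike the real model it looks only at the last token rather than the source sentence and the full decoded prefix.

```python
# Hypothetical next-token table standing in for the trained model:
# maps the last generated token to scores over candidate next tokens.
toy_next_token_scores = {
    "[start]": {"hola": 0.9, "mundo": 0.1},
    "hola": {"mundo": 0.7, "[end]": 0.3},
    "mundo": {"[end]": 0.8, "hola": 0.2},
}


def greedy_decode(max_len=20):
    """Repeatedly append the highest-scoring next token until [end] or max_len."""
    decoded = ["[start]"]
    for _ in range(max_len):
        scores = toy_next_token_scores[decoded[-1]]
        next_token = max(scores, key=scores.get)  # argmax over candidates
        decoded.append(next_token)
        if next_token == "[end]":
            break
    return " ".join(decoded)


print(greedy_decode())  # [start] hola mundo [end]
```

Because only the single best token is kept at each step, an early mistake cannot be undone later, which is one plausible reason some of the translations above drift away from the source sentence.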

---

In [None]:
# 11.5.3 Sequence-to-sequence learning with Transformer, p.409
# Listing 11.35 The TransformerDecoder, p.411

# The Transformer Decoder
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

class TransformerDecoder(layers.Layer):
    def __init__(self, embed_dim, dense_dim, num_heads, **kwargs):
        super().__init__(**kwargs)
        self.embed_dim = embed_dim
        self.dense_dim = dense_dim
        self.num_heads = num_heads
        self.attention_1 = layers.MultiHeadAttention(
            num_heads=num_heads, key_dim=embed_dim)
        self.attention_2 = layers.MultiHeadAttention(
            num_heads=num_heads, key_dim=embed_dim)
        self.dense_proj = keras.Sequential(
            [layers.Dense(dense_dim, activation="relu"),
             layers.Dense(embed_dim),]
        )
        self.layernorm_1 = layers.LayerNormalization()
        self.layernorm_2 = layers.LayerNormalization()
        self.layernorm_3 = layers.LayerNormalization()
        self.supports_masking = True

    def get_config(self):
        config = super().get_config()
        config.update({
            "embed_dim": self.embed_dim,
            "num_heads": self.num_heads,
            "dense_dim": self.dense_dim,
        })
        return config


    # Listing 11.36 TransformerDecoder method that generates a "causal mask"
    def get_causal_attention_mask(self, inputs):
        input_shape = tf.shape(inputs)
        batch_size, sequence_length = input_shape[0], input_shape[1]
        i = tf.range(sequence_length)[:, tf.newaxis]
        j = tf.range(sequence_length)
        mask = tf.cast(i >= j, dtype="int32")
        mask = tf.reshape(mask, (1, input_shape[1], input_shape[1]))
        mult = tf.concat(
            [tf.expand_dims(batch_size, -1),
             tf.constant([1, 1], dtype=tf.int32)], axis=0)
        return tf.tile(mask, mult)


    # Listing 11.37 The forward pass of the TransformerDecoder
    def call(self, inputs, encoder_outputs, mask=None):
        causal_mask = self.get_causal_attention_mask(inputs)
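        if mask is not None:
            padding_mask = tf.cast(
                mask[:, tf.newaxis, :], dtype="int32")
            padding_mask = tf.minimum(padding_mask, causal_mask)
        else:
            # No mask was propagated: leave the second attention unmasked
            # instead of referencing an undefined `padding_mask` below.
            padding_mask = None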
        attention_output_1 = self.attention_1(
            query=inputs,
            value=inputs,
            key=inputs,
            attention_mask=causal_mask)
        attention_output_1 = self.layernorm_1(inputs + attention_output_1)
        attention_output_2 = self.attention_2(
            query=attention_output_1,
            value=encoder_outputs,
            key=encoder_outputs,
            attention_mask=padding_mask,
        )
        attention_output_2 = self.layernorm_2(
            attention_output_1 + attention_output_2)
        proj_output = self.dense_proj(attention_output_2)
        return self.layernorm_3(attention_output_2 + proj_output)

###########################################################

# Putting it all together: a Transformer for machine translation
# PositionalEmbedding layer

class PositionalEmbedding(layers.Layer):
    def __init__(self, sequence_length, input_dim, output_dim, **kwargs):
        super().__init__(**kwargs)
        self.token_embeddings = layers.Embedding(
            input_dim=input_dim, output_dim=output_dim)
        self.position_embeddings = layers.Embedding(
            input_dim=sequence_length, output_dim=output_dim)
        self.sequence_length = sequence_length
        self.input_dim = input_dim
        self.output_dim = output_dim

    def call(self, inputs):
        length = tf.shape(inputs)[-1]
        positions = tf.range(start=0, limit=length, delta=1)
        embedded_tokens = self.token_embeddings(inputs)
        embedded_positions = self.position_embeddings(positions)
        return embedded_tokens + embedded_positions

    def compute_mask(self, inputs, mask=None):
        return tf.math.not_equal(inputs, 0)

    def get_config(self):
        config = super(PositionalEmbedding, self).get_config()
        config.update({
            "output_dim": self.output_dim,
            "sequence_length": self.sequence_length,
            "input_dim": self.input_dim,
        })
        return config

###########################################################

# End-to-end Transformer

class TransformerEncoder(layers.Layer):
    def __init__(self, embed_dim, dense_dim, num_heads, **kwargs):
        super().__init__(**kwargs)
        self.embed_dim = embed_dim
        self.dense_dim = dense_dim
        self.num_heads = num_heads
        self.attention = layers.MultiHeadAttention(
            num_heads=num_heads, key_dim=embed_dim)
        self.dense_proj = keras.Sequential(
            [layers.Dense(dense_dim, activation="relu"),
             layers.Dense(embed_dim),]
        )
        self.layernorm_1 = layers.LayerNormalization()
        self.layernorm_2 = layers.LayerNormalization()

    def call(self, inputs, mask=None):
        if mask is not None:
            mask = mask[:, tf.newaxis, :]
        attention_output = self.attention(
            inputs, inputs, attention_mask=mask)
        proj_input = self.layernorm_1(inputs + attention_output)
        proj_output = self.dense_proj(proj_input)
        return self.layernorm_2(proj_input + proj_output)

    def get_config(self):
        config = super().get_config()
        config.update({
            "embed_dim": self.embed_dim,
            "num_heads": self.num_heads,
            "dense_dim": self.dense_dim,
        })
        return config
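
To make the causal mask built in `get_causal_attention_mask` concrete, here is a pure-Python sketch of the same `i >= j` comparison, together with the element-wise minimum that `call` uses to merge it with a padding mask. The helper names and the example padding vector are illustrative assumptions; the layer itself works on batched tensors.

```python
def causal_mask(sequence_length):
    """Row i, column j is 1 when position i may attend to position j (j <= i)."""
    return [[int(i >= j) for j in range(sequence_length)]
            for i in range(sequence_length)]


def combine_with_padding(causal, padding):
    """Element-wise minimum of the causal mask and a per-position padding mask."""
    return [[min(c, p) for c, p in zip(row, padding)] for row in causal]


mask = causal_mask(4)
for row in mask:
    print(row)

# Hypothetical padding mask: the last position is padding (0).
combined = combine_with_padding(mask, [1, 1, 1, 0])
for row in combined:
    print(row)
```

The first print shows a lower-triangular matrix, so each position can see itself and earlier positions only. In the combined mask the padded column is zeroed in every row, so no position attends to padding.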


In [None]:
# Listing 11.38 End-to-end Transformer

def listing11_38():
    import random
    
    embed_dim = 256
    dense_dim = 2048
    num_heads = 8

    encoder_inputs = keras.Input(shape=(None,), dtype="int64", name="english")
    x = PositionalEmbedding(sequence_length, vocab_size, embed_dim)(encoder_inputs)

    encoder_outputs = TransformerEncoder(
        embed_dim, dense_dim, num_heads
        )(x)

    decoder_inputs = keras.Input(shape=(None,), dtype="int64", name="spanish")
    x = PositionalEmbedding(sequence_length, vocab_size, embed_dim)(decoder_inputs)
    x = TransformerDecoder(embed_dim, dense_dim, num_heads)(x, encoder_outputs)
    x = layers.Dropout(0.5)(x)
    decoder_outputs = layers.Dense(vocab_size, activation="softmax")(x)
    transformer = keras.Model([encoder_inputs, decoder_inputs], decoder_outputs)


    ###########################################################


    # Listing 11.39 Training the sequence-to-sequence Transformer
    # Training the sequence-to-sequence Transformer
    transformer.compile(
        optimizer="rmsprop",
        loss="sparse_categorical_crossentropy",
        metrics=["accuracy"])

    transformer.fit(
        train_ds, 
        # epochs=30,
        epochs=1, 
        validation_data=val_ds
    )


    # Translating new sentences with our Transformer model
    import numpy as np
    spa_vocab = target_vectorization.get_vocabulary()
    spa_index_lookup = dict(zip(range(len(spa_vocab)), spa_vocab))
    max_decoded_sentence_length = 20

    def decode_sequence(input_sentence):
        tokenized_input_sentence = source_vectorization([input_sentence])
        decoded_sentence = "[start]"
        for i in range(max_decoded_sentence_length):
            tokenized_target_sentence = target_vectorization(
                [decoded_sentence])[:, :-1]
            predictions = transformer(
                [tokenized_input_sentence, tokenized_target_sentence])
            sampled_token_index = np.argmax(predictions[0, i, :])
            sampled_token = spa_index_lookup[sampled_token_index]
            decoded_sentence += " " + sampled_token
            if sampled_token == "[end]":
                break
        return decoded_sentence

    test_eng_texts = [pair[0] for pair in test_pairs]
    for _ in range(20):
        input_sentence = random.choice(test_eng_texts)
        print("-")
        print(input_sentence)
        print(decode_sequence(input_sentence))

listing11_38()

-
You're small.
[start] eres un poco [end]
-
His behavior never ceases to surprise me.
[start] su nunca le [UNK] para que me [UNK] [end]
-
My mother was sick for two days.
[start] mi madre estaba dos días [end]
-
We got there at the same time.
[start] nos hemos estado en el tiempo [end]
-
When do we start?
[start] cuándo nos queremos nosotros [end]
-
I don't feel anything.
[start] no me siento nada [end]
-
Don't you know that you are the laughingstock of the whole town?
[start] no quieres que hacer el tiempo para la ciudad [end]
-
Is that you?
[start] es lo que es [end]
-
Do I have to wear a tie at work?
[start] tengo que ser un trabajo para hacer trabajo [end]
-
Knock it off, Tom.
[start] [UNK] a tom [end]
-
Tom was whistling a song his mother had taught him.
[start] tom estaba acostumbrado a una historia que había oído que él le había hecho su madre [end]
-
What's your favorite yoga pose?
[start] cuál es tu comida [UNK] [end]
-
How many times do I have to tell you?
[start] cuánto vec