<a href="https://colab.research.google.com/github/erinijapranckeviciene/MF54609_18981_1_20241/blob/main/FC_Chapter11_Transformers.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

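### Aside: why validation loss can be lower than training loss

Regularization such as dropout is active only during training, so the training loss can end up higher than the validation loss. See: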

https://pyimagesearch.com/2019/10/14/why-is-my-validation-loss-lower-than-my-training-loss/#:~:text=Regularization%20methods%20often%20sacrifice%20training,lower%20than%20your%20training%20loss.
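
Besides regularization being active only at training time, the way the two numbers are measured also matters: Keras reports the training loss as a running average over all batches of an epoch, while the validation loss is computed once at the end of the epoch, with the weights already improved. A minimal sketch with made-up per-batch losses (the values are illustrative and not taken from this notebook):

```python
# Hypothetical per-batch training losses within one epoch; the model improves
# from batch to batch, so later batches have a lower loss.
batch_losses = [0.9, 0.7, 0.5, 0.4, 0.3]

# The training loss shown in the progress bar is the average over the epoch.
reported_train_loss = sum(batch_losses) / len(batch_losses)

# Validation runs after the last batch; assume (for illustration only) that the
# model then performs about as well as on the final training batch.
val_loss_at_epoch_end = batch_losses[-1]

print(f"reported training loss: {reported_train_loss:.2f}")
print(f"validation loss:        {val_loss_at_epoch_end:.2f}")
```

Under these assumptions the reported training loss is higher than the validation loss even though no overfitting or data leak is involved.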

### Prepare IMDB data as before

In [None]:
!wget https://github.com/erinijapranckeviciene/MF54609_18981_1_20241/raw/refs/heads/main/datasets/RNN/aclImdb.zip
!unzip -qq aclImdb.zip

In [2]:
import os, pathlib, shutil, random
from tensorflow import keras

batch_size = 32
train_ds = keras.utils.text_dataset_from_directory("aclImdb/train", batch_size=batch_size)
val_ds = keras.utils.text_dataset_from_directory("aclImdb/val", batch_size=batch_size)
test_ds = keras.utils.text_dataset_from_directory("aclImdb/test", batch_size=batch_size)

Found 20000 files belonging to 2 classes.
Found 5000 files belonging to 2 classes.
Found 25000 files belonging to 2 classes.


### Text-only dataset

In [3]:
# Drop the targets and keep only the raw text.
# The resulting dataset is used to build the vocabulary.
text_only_train_ds = train_ds.map(lambda x, y: x)
for inputs in text_only_train_ds:
    print("inputs.shape:", inputs.shape)
    print("inputs.dtype:", inputs.dtype)
    print("inputs[0]:", inputs[0])
    break

inputs.shape: (32,)
inputs.dtype: <dtype: 'string'>
inputs[0]: tf.Tensor(b'This is the kind of movie that my enemies content I watch all the time, but it\'s not bloody true. I only watch it once in a while to make sure that it\'s as bad as I first thought it was.<br /><br />Some kind of mobsters hijack a Boeing 747. (That, at least, is an improvement over having Boeing hijack a good part of the Pentagon.) The airplane goes down in the Bermuda triangle and sinks pressurized to the bottoms, a kind of post-facto submarine.<br /><br />It has one of those all-star casts, the stars either falling or barely above the horizon.<br /><br />"We\'re on our own!", says pilot Jack Lemon. He is so right. Except for George Kennedy. He\'s in all these disaster movies.<br /><br />Watch another movie instead. Oh, not "Airport" the original. That\'s no good either. Instead, watch a decent flick about stuck airplanes like "Flight of the Phoenix."', shape=(), dtype=string)


### Prepare the integer-sequence dataset

In [4]:
import tensorflow as tf
from tensorflow.keras import layers

# max_length is the number of token indices each review is padded or truncated to.
# A length of 250 gave ~0.85 accuracy on the test data;
# a length of 600 with 4 LSTM units did not train at all.
# The reviews are about 300 words long.
max_length = 300
max_tokens = 20000
text_vectorization = layers.TextVectorization(max_tokens=max_tokens, output_mode="int", output_sequence_length=max_length)

text_vectorization.adapt(text_only_train_ds)

int_train_ds = train_ds.map(lambda x, y: (text_vectorization(x), y), num_parallel_calls=4)
int_val_ds = val_ds.map(lambda x, y: (text_vectorization(x), y), num_parallel_calls=4)
int_test_ds = test_ds.map(lambda x, y: (text_vectorization(x), y), num_parallel_calls=4)
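
To make concrete what `TextVectorization` does with `output_mode="int"`, here is a plain-Python sketch of the same idea: lowercase and strip punctuation, split on whitespace, rank tokens by frequency, reserve index 0 for padding and index 1 for out-of-vocabulary tokens, then pad or truncate to a fixed length. The helper names (`standardize`, `build_vocab`, `vectorize`) and the alphabetical tie-breaking are assumptions made for this illustration; the real layer is implemented differently.

```python
import string
from collections import Counter

PAD, OOV = 0, 1  # index 0 = padding, index 1 = out-of-vocabulary token

def standardize(text):
    """Lowercase the text and strip ASCII punctuation."""
    return text.lower().translate(str.maketrans("", "", string.punctuation))

def build_vocab(texts, max_tokens):
    """Map the most frequent tokens to indices 2, 3, ... (0 and 1 are reserved)."""
    counts = Counter(tok for t in texts for tok in standardize(t).split())
    # Most frequent first; ties broken alphabetically (a choice made for this sketch).
    ranked = sorted(counts, key=lambda tok: (-counts[tok], tok))
    return {tok: i + 2 for i, tok in enumerate(ranked[: max_tokens - 2])}

def vectorize(text, vocab, length):
    """Encode text as exactly `length` integers, truncating or zero-padding."""
    ids = [vocab.get(tok, OOV) for tok in standardize(text).split()][:length]
    return ids + [PAD] * (length - len(ids))

toy_texts = ["A great movie!", "A bad movie.", "Great cast, bad plot."]
toy_vocab = build_vocab(toy_texts, max_tokens=6)
toy_ids = vectorize("A great unknown movie", toy_vocab, length=6)
print(toy_vocab)
print(toy_ids)
```

The trailing zeros in the encoded review printed by the verification cell are this same padding.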

### Verify

In [5]:
for inputs, targets in int_train_ds:
    print(inputs.shape)
    print(inputs[0])
    print(targets[0])
    # Keep one encoded review in test_input; it can be used to check a tf.one_hot() transformation
    test_input = inputs[0]
    break

(32, 300)
tf.Tensor(
[ 1475     4   169    80     9     7    38   910    84   791  1547    19
   392    29     5     2   242    94    10    26   122   108    10   118
     9     7   486   429    19   589     2   151  4547    39   386     8
     4  1864     2  3910    24   412  1989     3     2   225    61    10
    26   108   136  1026  1216    14    31     4  1214 18027    31    47
  1913   399  1340   770   148   580    10    62  1071    12    81  4838
   295     3  1363   703    94    19    31   219     4   754  1128     5
   501   138    27  3487    19    31   219    29    50   149     2    87
   291   235     5     2    18    65  1729   217     3   596   865     3
     2   583    14    62   282  2473     6     2   286  1126  4454     4
   169    19    12     7    33   229     2   114    11    20    44     6
  1437     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0

## Transformers (F. Chollet, Chapter 11)

#### 11.4 The Transformer architecture

..."Transformers were introduced in the seminal paper “Attention is all you need” by Vaswani et al. The gist of the paper is right there in the title: as it turned out, a simple mechanism called “neural attention” could be used to build powerful sequence models that didn’t feature any recurrent layers or convolution layers."...
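
As a rough illustration of the "neural attention" idea, the sketch below computes scaled dot-product self-attention in plain Python: every token vector is compared with every other one, the scores are turned into weights with a softmax, and each output is a weighted sum of the token vectors. It deliberately leaves out the learned query/key/value projections and the multiple heads that `MultiHeadAttention` adds, and the toy vectors are made up.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(x):
    """x: list of token vectors. Returns (outputs, weights)."""
    d = len(x[0])
    # Score of every token against every other token, scaled by sqrt(d).
    scores = [[sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in x]
              for q in x]
    weights = [softmax(r) for r in scores]
    # Each output vector is a weighted sum of all token vectors.
    outputs = [[sum(w * v[j] for w, v in zip(wr, x)) for j in range(d)]
               for wr in weights]
    return outputs, weights

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(tokens)
for row in weights:
    print([round(w, 3) for w in row])
```

Each row of `weights` sums to 1, and the first token attends more to itself and to the third token than to the orthogonal second token.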

### Listing 11.21 Transformer encoder implemented as a subclassed Layer

In [6]:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

class TransformerEncoder(layers.Layer):
  def __init__(self, embed_dim, dense_dim, num_heads, **kwargs):
    super().__init__(**kwargs)
    self.embed_dim = embed_dim
    self.dense_dim = dense_dim
    self.num_heads = num_heads
    self.attention = layers.MultiHeadAttention(
        num_heads=num_heads, key_dim=embed_dim)
    self.dense_proj = keras.Sequential(
        [layers.Dense(dense_dim, activation="relu"), layers.Dense(embed_dim),]
    )
    self.layernorm_1 = layers.LayerNormalization()
    self.layernorm_2 = layers.LayerNormalization()

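  def call(self, inputs, mask=None):
    if mask is not None:
        # Add an axis so the (batch, seq_len) padding mask broadcasts
        # over the query dimension expected by attention_mask.
        mask = mask[:, tf.newaxis, :]
    # Self-attention: the sequence serves as both query and value.
    attention_output = self.attention(
        inputs, inputs, attention_mask=mask
    )
    # Residual connection + layer normalization around each sublayer.
    proj_input = self.layernorm_1(inputs + attention_output)
    proj_output = self.dense_proj(proj_input)
    return self.layernorm_2(proj_input + proj_output)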

  def get_config(self):
    config = super().get_config()
    config.update({
        "embed_dim": self.embed_dim,
        "num_heads": self.num_heads,
        "dense_dim": self.dense_dim,
    })
    return config
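
The two `layernorm_*` calls in the encoder each wrap a residual connection: the sublayer output is added back to its input, and the sum is normalized per token. A minimal plain-Python sketch of that "add and norm" step follows; the learned scale and offset of a real `LayerNormalization` layer are omitted, and `eps` and the toy vectors are illustrative values.

```python
import math

def layer_norm(vec, eps=1e-3):
    """Normalize one token vector to zero mean and (roughly) unit variance."""
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    return [(v - mean) / math.sqrt(var + eps) for v in vec]

def add_and_norm(inputs, sublayer_output):
    """Residual connection followed by layer normalization, token by token."""
    return [layer_norm([a + b for a, b in zip(x, y)])
            for x, y in zip(inputs, sublayer_output)]

token_vectors = [[1.0, 2.0, 3.0, 4.0], [0.5, 0.5, 2.0, 1.0]]
attention_out = [[0.1, -0.2, 0.3, 0.0], [1.0, 0.0, -1.0, 0.5]]
normed = add_and_norm(token_vectors, attention_out)
for row in normed:
    print([round(v, 3) for v in row])
```

Normalization runs over the feature axis of each token independently, so it does not mix information between positions or between samples in a batch.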


### Listing 11.22 Using the Transformer encoder for text classification

In [7]:
vocab_size = 20000
embed_dim = 256
num_heads = 2
dense_dim = 32

inputs = keras.Input(shape=(None,), dtype="int64")
x = layers.Embedding(vocab_size, embed_dim)(inputs)
x = TransformerEncoder(embed_dim, dense_dim, num_heads)(x)
x = layers.GlobalMaxPooling1D()(x)
x = layers.Dropout(0.5)(x)
outputs = layers.Dense(1, activation="sigmoid")(x)
model = keras.Model(inputs, outputs)

model.compile(optimizer="rmsprop", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
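
`GlobalMaxPooling1D` is what turns the encoder's per-token output into a single vector per review: for every feature it keeps the maximum value found at any position. A small sketch of that reduction with made-up numbers (three positions, three features):

```python
def global_max_pool(sequence):
    """sequence: list of token vectors (timesteps x features)."""
    # zip(*sequence) groups the values of each feature across all positions.
    return [max(feature) for feature in zip(*sequence)]

encoded = [[0.1, 0.9, -0.3],
           [0.7, 0.2, -0.1],
           [0.4, 0.5, -0.8]]
pooled = global_max_pool(encoded)
print(pooled)
```

The pooled vector has one entry per feature regardless of sequence length, which is why the model accepts inputs of shape `(None,)`.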


### Listing 11.23 Training and evaluating the Transformer encoder based model

In [8]:
callbacks = [ keras.callbacks.ModelCheckpoint("transformer_encoder.keras", save_best_only=True) ]
model.fit(int_train_ds, validation_data=int_val_ds, epochs=20, callbacks=callbacks)

model = keras.models.load_model("transformer_encoder.keras", custom_objects={"TransformerEncoder": TransformerEncoder})
print(f"Test acc: {model.evaluate(int_test_ds)[1]:.3f}")


Epoch 1/20
625/625 ━━━━━━━━━━━━━━━━━━━━ 30s 35ms/step - accuracy: 0.6180 - loss: 0.7790 - val_accuracy: 0.8264 - val_loss: 0.3843
Epoch 2/20
625/625 ━━━━━━━━━━━━━━━━━━━━ 20s 31ms/step - accuracy: 0.8244 - loss: 0.3969 - val_accuracy: 0.8520 - val_loss: 0.3461
Epoch 3/20
625/625 ━━━━━━━━━━━━━━━━━━━━ 21s 32ms/step - accuracy: 0.8480 - loss: 0.3487 - val_accuracy: 0.8618 - val_loss: 0.3271
Epoch 4/20
625/625 ━━━━━━━━━━━━━━━━━━━━ 20s 32ms/step - accuracy: 0.8621 - loss: 0.3253 - val_accuracy: 0.8550 - val_loss: 0.3372
Epoch 5/20
625/625 ━━━━━━━━━━━━━━━━━━━━ 21s 32ms/step - accuracy: 0.8754 - loss: 0.2986 - val_accuracy: 0.8634 - val_loss: 0.3192
Epoch 6/20
625/625 ━━━━━━━━━━━━━━━━━━━━ 20s 32ms/step - accuracy: 0.8839 - loss: 0.2797 - val_accuracy: 0.8616 - val_loss: 0.3202
Epoch 7/20
...

782/782 ━━━━━━━━━━━━━━━━━━━━ 7s 8ms/step - accuracy: 0.8541 - loss: 0.3428
Test acc: 0.858
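
With `save_best_only=True`, `ModelCheckpoint` monitors `val_loss` by default and overwrites the saved file only when that value improves, so the model reloaded for testing is the best epoch rather than the last one. The sketch below replays that rule on the six `val_loss` values visible in the log above; the log is truncated after epoch 6, so the overall best epoch of the 20-epoch run cannot be read from it.

```python
# val_loss for the six epochs visible in the (truncated) training log
val_losses = [0.3843, 0.3461, 0.3271, 0.3372, 0.3192, 0.3202]

best = float("inf")
saved_epochs = []
for epoch, val_loss in enumerate(val_losses, start=1):
    if val_loss < best:  # improvement: the checkpoint file is overwritten
        best = val_loss
        saved_epochs.append(epoch)

print("checkpoint written after epochs:", saved_epochs)
print("best val_loss so far:", best)
```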


### Listing 11.24 Implementing positional embedding as a subclassed layer

In [13]:
class PositionalEmbedding(layers.Layer):
  def __init__(self, sequence_length, input_dim, output_dim, **kwargs):
    super().__init__(**kwargs)
    self.token_embeddings = layers.Embedding(
        input_dim=input_dim, output_dim=output_dim)
    self.position_embeddings = layers.Embedding(
        input_dim=sequence_length, output_dim=output_dim)
    self.sequence_length = sequence_length
    self.input_dim = input_dim
    self.output_dim = output_dim

  def call(self, inputs):
    length = tf.shape(inputs)[-1]
    positions = tf.range(start=0, limit=length, delta=1)
    embedded_tokens = self.token_embeddings(inputs)
    embedded_positions = self.position_embeddings(positions)
    return embedded_tokens + embedded_positions

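  # compute_mask is left disabled, so no padding mask is passed to later layers.
  # The first version below was a mistake; the Lambda version is the correction.
  # def compute_mask(self, inputs, mask=None):
  #   return tf.math.not_equal(inputs, 0)
  #   return layers.Lambda(lambda x: tf.math.not_equal(x, 0))(inputs)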

  def get_config(self):
    config = super().get_config()
    config.update({
      "output_dim": self.output_dim,
      "sequence_length": self.sequence_length,
      "input_dim": self.input_dim,
      })
    return config
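
The core of `PositionalEmbedding.call` is a sum of two lookups: one vector chosen by the token index and one chosen by the position in the sequence. A plain-Python sketch with tiny hypothetical tables (a real layer learns both tables during training):

```python
def positional_embedding(token_ids, token_table, position_table):
    """Sum the token vector and the position vector at every position."""
    return [
        [t + p for t, p in zip(token_table[tok], position_table[pos])]
        for pos, tok in enumerate(token_ids)
    ]

# Hypothetical 2-dimensional embedding tables, for illustration only.
token_table = {0: [0.0, 0.0], 7: [1.0, 2.0], 9: [3.0, 1.0]}
position_table = [[0.1, 0.1], [0.2, 0.2], [0.3, 0.3]]

embedded = positional_embedding([7, 9, 7], token_table, position_table)
for vec in embedded:
    print([round(v, 2) for v in vec])
```

Token 7 appears at positions 0 and 2 and receives a different vector each time, which is how word order becomes visible to the otherwise order-agnostic attention layer.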

### Listing 11.25 Combining the Transformer encoder with positional embedding

In [14]:
vocab_size = 20000
sequence_length = 600
embed_dim = 256
num_heads = 2
dense_dim = 32

inputs = keras.Input(shape=(None,), dtype="int64")
x = PositionalEmbedding(sequence_length, vocab_size, embed_dim)(inputs)
x = TransformerEncoder(embed_dim, dense_dim, num_heads)(x)
x = layers.GlobalMaxPooling1D()(x)
x = layers.Dropout(0.5)(x)
outputs = layers.Dense(1, activation="sigmoid")(x)
model = keras.Model(inputs, outputs)

model.compile(optimizer="rmsprop", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()


### Re-vectorize the dataset with a sequence length of 600


In [11]:
text_vectorization = layers.TextVectorization(max_tokens=vocab_size, output_mode="int", output_sequence_length=sequence_length)

text_vectorization.adapt(text_only_train_ds)

int_train_ds = train_ds.map(lambda x, y: (text_vectorization(x), y), num_parallel_calls=4)
int_val_ds = val_ds.map(lambda x, y: (text_vectorization(x), y), num_parallel_calls=4)
int_test_ds = test_ds.map(lambda x, y: (text_vectorization(x), y), num_parallel_calls=4)

### Train and evaluate the model

In [15]:
callbacks = [keras.callbacks.ModelCheckpoint("full_transformer_encoder.keras", save_best_only=True)]
model.fit(int_train_ds, validation_data=int_val_ds, epochs=20, callbacks=callbacks)



Epoch 1/20
625/625 ━━━━━━━━━━━━━━━━━━━━ 69s 98ms/step - accuracy: 0.6039 - loss: 0.7495 - val_accuracy: 0.8090 - val_loss: 0.4182
Epoch 2/20
625/625 ━━━━━━━━━━━━━━━━━━━━ 59s 95ms/step - accuracy: 0.7993 - loss: 0.4331 - val_accuracy: 0.8272 - val_loss: 0.3877
Epoch 3/20
625/625 ━━━━━━━━━━━━━━━━━━━━ 83s 96ms/step - accuracy: 0.8285 - loss: 0.3815 - val_accuracy: 0.8388 - val_loss: 0.3565
Epoch 4/20
625/625 ━━━━━━━━━━━━━━━━━━━━ 83s 98ms/step - accuracy: 0.8542 - loss: 0.3395 - val_accuracy: 0.8562 - val_loss: 0.3404
Epoch 5/20
625/625 ━━━━━━━━━━━━━━━━━━━━ 61s 98ms/step - accuracy: 0.8687 - loss: 0.3071 - val_accuracy: 0.8556 - val_loss: 0.3422
Epoch 6/20
625/625 ━━━━━━━━━━━━━━━━━━━━ 62s 100ms/step - accuracy: 0.8847 - loss: 0.2757 - val_accuracy: 0.8522 - val_loss: 0.3603
Epoch 7/20
...

<keras.src.callbacks.history.History at 0x78776ff70e50>

In [16]:
model = keras.models.load_model("full_transformer_encoder.keras", custom_objects={"TransformerEncoder": TransformerEncoder, "PositionalEmbedding": PositionalEmbedding})
print(f"Test acc: {model.evaluate(int_test_ds)[1]:.3f}")



782/782 ━━━━━━━━━━━━━━━━━━━━ 16s 19ms/step - accuracy: 0.8431 - loss: 0.3563
Test acc: 0.846


### 11.5.1 A machine translation example