# Attention and Transformer Networks

In this notebook you will use a transformer to perform a natural language processing task. The code below is adapted from [this example](https://keras.io/examples/nlp/text_classification_with_transformer/) in the Keras documentation. The website uses an older version of Keras, so some of its syntax is outdated, but the concepts are still valid.

Read through the code and then see the assignment at the bottom of the notebook. Note that training a transformer model (even a simple one) can take a bit of time.


In [10]:
#############
## IMPORTS ##
#############
# Keras imports to create the transformer
import keras
from keras import ops
from keras import layers

In [11]:
############################
## CREATE THE TRANSFORMER ##
############################
# Create a class that defines a transformer block. It
# inherits from the Keras Layer class, so it has all of
# the functionality of that class (for example, tracking
# its weights and being callable on tensors). Note that fit
# and predict belong to keras.Model, not to Layer. We will
# override the initialization function and the call function,
# but everything else stays the same.

# If you are unfamiliar with inheritance in Python (or
# inheritance in general), I recommend you do a bit of research
# into the topic before continuing.
class TransformerBlock(layers.Layer):
    # Define the initialization function which takes three
    # arguments:
    # embed_dim is the embedding size for the tokens,
    # num_heads is the number of heads for the attention layer,
    # ff_dim is the size of the feedforward neural network.
    # The dropout rate of the model can also be changed though a
    # default is provided.
    # QUESTION 1: Do some research and provide context for the
    # embedding size and the number of heads for the attention layer.
    # The Keras or Tensorflow documentation may be helpful in addition
    # to a general internet search. Remember to cite any sources which are
    # external to course materials (i.e. provided lecture notes and the textbook).
    def __init__(self, embed_dim, num_heads, ff_dim, rate=0.1):
        # Initialize the parent class
        super().__init__()
        # Define a multi-head attention network using the provided
        # parameters
        self.att = layers.MultiHeadAttention(num_heads=num_heads,
                                             key_dim=embed_dim)
        # Define a feedforward neural network with the provided
        # parameters and a relu activation function. Note that you
        # could make the activation function an argument of __init__
        # instead of hard-coding it (hint for assignment question 4)
        self.ffn = keras.Sequential(
            [layers.Dense(ff_dim, activation="relu"),
             layers.Dense(embed_dim),])

        # Add normalization and dropout to improve performance. For
        # assignment question 4 you will likely want to change these
        # from the default values or remove them entirely
        self.layernorm1 = layers.LayerNormalization(epsilon=1e-6)
        self.layernorm2 = layers.LayerNormalization(epsilon=1e-6)
        self.dropout1 = layers.Dropout(rate)
        self.dropout2 = layers.Dropout(rate)

    # Define the call function, which connects the layers in the order
    # needed to define a transformer block (i.e. an attention layer and
    # then a feedforward neural network). Normalization and dropout are
    # also added; they are optional but may improve performance.
    def call(self, inputs):
        attn_output = self.att(inputs, inputs)
        attn_output = self.dropout1(attn_output)
        out1 = self.layernorm1(inputs + attn_output)
        ffn_output = self.ffn(out1)
        ffn_output = self.dropout2(ffn_output)
        return self.layernorm2(out1 + ffn_output)
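
To make the attention step in `call` more concrete, here is a minimal NumPy sketch of single-head scaled dot-product self-attention. It is an illustration under simplifying assumptions, not what `layers.MultiHeadAttention` does internally: it uses one head only and omits biases, the output projection, and dropout. The names `softmax`, `self_attention`, `w_q`, `w_k`, and `w_v` are hypothetical and do not appear in the notebook.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def self_attention(x, w_q, w_k, w_v):
    # x has shape (seq_len, embed_dim); each projection matrix has
    # shape (embed_dim, key_dim)
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    # Compare every query with every key and scale by sqrt(key_dim)
    scores = q @ k.T / np.sqrt(k.shape[-1])      # (seq_len, seq_len)
    # Each row becomes a set of weights over all positions
    weights = softmax(scores, axis=-1)
    # Each output vector is a weighted average of the value vectors
    return weights @ v, weights

# Toy sizes chosen for illustration only
rng = np.random.default_rng(0)
seq_len, embed_dim, key_dim = 4, 8, 8
x = rng.normal(size=(seq_len, embed_dim))
w_q, w_k, w_v = (rng.normal(size=(embed_dim, key_dim)) for _ in range(3))

attn_out, attn_weights = self_attention(x, w_q, w_k, w_v)
print(attn_out.shape)       # (4, 8)
print(attn_weights.shape)   # (4, 4)
```

Because the output here has the same shape as the input, it can be added back to `x`, which is the residual connection that `inputs + attn_output` expresses in the class above.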

In [12]:
########################################
## TOKEN AND POSITION EMBEDDING CLASS ##
########################################
# Create a class, which also inherits from the Keras Layer class, that
# embeds the data. The reviews are already tokenized (each word is an
# integer), so this layer only maps tokens and positions to vectors.
# Tokenizing and embedding here are similar (if not exactly the same)
# to concepts we covered last week.
# Question 2: Compare tokenization and position embedding, as they relate
# to training a transformer, with tokenization and sequence length, as they
# relate to recurrent neural networks. You may need to do some outside
# research here. If you use external resources, please cite them.
class TokenAndPositionEmbedding(layers.Layer):
    # Define the initialization function. We will only consider phrases
    # whose tokenized length is at most a maximum length (maxlen). This
    # is not required but does speed up the process.
    def __init__(self, maxlen, vocab_size, embed_dim):
        # Initialize the parent class
        super().__init__()
        # Define the token embedding and position embedding layers
        # using Keras Embedding layers
        self.token_emb = layers.Embedding(input_dim=vocab_size,
                                          output_dim=embed_dim)
        self.pos_emb = layers.Embedding(input_dim=maxlen,
                                        output_dim=embed_dim)

    # Override the call function.
    # Question 3: What does this function do, and why is it needed?
    def call(self, x):
        maxlen = ops.shape(x)[-1]
        positions = ops.arange(start=0, stop=maxlen, step=1)
        positions = self.pos_emb(positions)
        x = self.token_emb(x)
        return x + positions
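
The shapes involved in `x + positions` can be easier to follow with plain arrays. The sketch below is a NumPy stand-in, assuming that an embedding layer behaves like a lookup table indexed by integer ids. The tables are random rather than learned, and the names `token_table` and `pos_table` and the toy sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, maxlen, embed_dim = 10, 5, 4     # toy sizes, not the notebook's

# Stand-ins for the two learned embedding tables
token_table = rng.normal(size=(vocab_size, embed_dim))
pos_table = rng.normal(size=(maxlen, embed_dim))

# Two already-tokenized sequences, shape (batch, seq_len)
tokens = np.array([[3, 7, 3, 0, 0],
                   [1, 1, 1, 1, 1]])
# One position index per time step: [0, 1, 2, 3, 4]
positions = np.arange(tokens.shape[-1])

# Look up both tables; the (seq_len, embed_dim) position vectors
# broadcast across the batch dimension when added
embedded = token_table[tokens] + pos_table[positions]
print(embedded.shape)   # (2, 5, 4)
```

In the first sequence, token 3 appears at positions 0 and 2. Its two embedded vectors differ only by the difference between the two position vectors, which is how the same word can be distinguished by where it occurs.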

In [13]:
#########################
## IMPORT THE DATA SET ##
#########################
# Here we will be looking at the IMDB dataset:
# https://keras.io/api/datasets/imdb/
# The x data are tokenized reviews of movies and the y data
# are the sentiment expressed through the review (positive
# or negative). Thus this is a classification problem.

# Only consider the top 20k words
vocab_size = 20000
# Limit each movie review to 200 words (tokens)
maxlen = 200

# Import the data which is automatically split into two datasets,
# here a training and validation data set.
(x_train, y_train),(x_val, y_val) = keras.datasets.imdb.load_data(num_words=vocab_size)
print(len(x_train), "Training sequences")
print(len(x_val), "Validation sequences")

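# Pad or truncate the X data so every review is exactly maxlen (200)
# tokens long. With the default arguments, pad_sequences keeps the
# last 200 tokens of longer reviews and adds zeros to the front of
# shorter ones. Truncating is optional but does decrease the training time.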
x_train = keras.utils.pad_sequences(x_train, maxlen=maxlen)
x_val = keras.utils.pad_sequences(x_val, maxlen=maxlen)

25000 Training sequences
25000 Validation sequences
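
The effect of padding and truncating can be seen on short lists. This is a pure-Python sketch of the behavior I understand `keras.utils.pad_sequences` to have with its default `padding="pre"` and `truncating="pre"` arguments. The helper name `pad_or_truncate_pre` is hypothetical, and the sketch assumes `maxlen` is a positive integer.

```python
def pad_or_truncate_pre(seq, maxlen, value=0):
    # Keep only the last `maxlen` tokens of a long sequence
    seq = list(seq)[-maxlen:]
    # Add `value` to the front of a short sequence until it is `maxlen` long
    return [value] * (maxlen - len(seq)) + seq

print(pad_or_truncate_pre([1, 2, 3, 4, 5, 6], maxlen=4))  # [3, 4, 5, 6]
print(pad_or_truncate_pre([1, 2], maxlen=4))              # [0, 0, 1, 2]
```

If you would rather keep the beginning of each review, `pad_sequences` accepts `truncating="post"`.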


In [5]:
# Define the hyperparameters of the model
# Embedding size for each token
embed_dim = 32
# Number of attention heads
num_heads = 2
# Hidden layer size in feed forward network
# inside transformer
ff_dim = 32

# Create the model, we start with an input layer
inputs = layers.Input(shape=(maxlen,))
# next we add the embedding layer
embedding_layer = TokenAndPositionEmbedding(maxlen,
                                            vocab_size, embed_dim)
x = embedding_layer(inputs)
# Now the transformer layer
transformer_block = TransformerBlock(embed_dim,
                                     num_heads, ff_dim)
x = transformer_block(x)
# And we finish with a pooling layer and a dense
# layer before adding the output layer
x = layers.GlobalAveragePooling1D()(x)
x = layers.Dropout(0.1)(x)
x = layers.Dense(20, activation="relu")(x)
x = layers.Dropout(0.1)(x)
outputs = layers.Dense(2, activation="softmax")(x)

# Question 4: What is the purpose of the pooling and
# dense layers between the transformer block and the
# output layer?

# Define the model
model = keras.Model(inputs=inputs, outputs=outputs)
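
As a shape check for the pooling step, the sketch below reproduces in NumPy what I understand `GlobalAveragePooling1D` to compute when no mask is involved: an average over the sequence axis. The array `block_output` is random stand-in data with toy sizes, not output from the model above.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, seq_len, embed_dim = 2, 5, 4           # toy sizes for illustration

# Stand-in for the transformer block output: one vector per token
block_output = rng.normal(size=(batch, seq_len, embed_dim))

# Average over the sequence axis, leaving one vector per review
pooled = block_output.mean(axis=1)
print(block_output.shape)   # (2, 5, 4)
print(pooled.shape)         # (2, 4)
```

The pooled array has a fixed size per review regardless of sequence length, which is the kind of input a `Dense` layer expects.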

In [6]:
# Compile and train the model as a classification problem
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
history = model.fit(
    x_train, y_train, batch_size=32, epochs=2,
    validation_data=(x_val, y_val)
)

Epoch 1/2
782/782 ━━━━━━━━━━━━━━━━━━━━ 26s 32ms/step - accuracy: 0.7096 - loss: 0.5178 - val_accuracy: 0.8614 - val_loss: 0.3085
Epoch 2/2
782/782 ━━━━━━━━━━━━━━━━━━━━ 30s 39ms/step - accuracy: 0.9293 - loss: 0.1874 - val_accuracy: 0.8728 - val_loss: 0.3116
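
The loss and accuracy values reported during training can be reproduced by hand on a tiny example. The sketch below computes a sparse categorical cross-entropy and an accuracy in NumPy for integer labels and two-class probabilities, mirroring the `Dense(2, activation="softmax")` output. It is a simplified illustration: the numbers are made up, and it skips details such as the clipping of probabilities that a library implementation may apply.

```python
import numpy as np

def sparse_categorical_crossentropy(y_true, probs):
    # y_true: integer class labels, shape (batch,)
    # probs: predicted class probabilities, shape (batch, num_classes)
    # Pick out the probability assigned to the correct class of each sample
    picked = probs[np.arange(len(y_true)), y_true]
    # Average the negative log-probabilities over the batch
    return float(np.mean(-np.log(picked)))

# Made-up labels and predictions for three reviews
y_true = np.array([1, 0, 1])
probs = np.array([[0.1, 0.9],    # confident and correct
                  [0.8, 0.2],    # fairly confident and correct
                  [0.7, 0.3]])   # wrong: the true class gets only 0.3

loss = sparse_categorical_crossentropy(y_true, probs)
accuracy = float(np.mean(np.argmax(probs, axis=1) == y_true))
print(loss, accuracy)
```

The labels stay as integers (0 or 1) rather than one-hot vectors, which is why the "sparse" variant of the loss is used in `model.compile`.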


In [7]:
model.evaluate(x_val, y_val)

782/782 ━━━━━━━━━━━━━━━━━━━━ 10s 12ms/step - accuracy: 0.8736 - loss: 0.3129


[0.31158268451690674, 0.872759997844696]

## Assignment:
1. Last week we learned how to use recurrent neural networks for a similar natural language processing task. Do some research and compare and contrast recurrent neural networks and transformer models as they relate to natural language processing. Cite your sources if they are external to the course materials.
2. For each of the above code cells, add a comment of at least 1-2 sentences explaining what the cell does, how it relates to the information learned this week, and how it relates to information learned earlier in the course. Additionally, answer any questions asked in the existing comments, either in the code cell or below. Label each answer below with the question number.
3. Rewrite the data import cell so that there are three data sets: training, validation, and test. You can decide how large each set is. Redo the cells above so that, after training, the accuracy of the trained model is determined with the test set. Comment on the initial accuracy.
4. Perform hyperparameter tuning to attempt to improve the accuracy of the model. In addition to the "normal" hyperparameters of loss function, activation function, and number of hidden layers, you can now also adjust the embedding size and the number of attention heads.
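
For item 3, one possible way to build three data sets is to shuffle the indices once and slice them. The sketch below shows the idea on synthetic arrays. The function name `three_way_split`, the 15% fractions, and the synthetic data are all assumptions made for illustration, so choose sizes that suit the IMDB data.

```python
import numpy as np

def three_way_split(x, y, val_frac=0.15, test_frac=0.15, seed=0):
    # Shuffle the indices once so the three sets do not overlap
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(x))
    n_test = round(len(x) * test_frac)
    n_val = round(len(x) * val_frac)
    # Slice the shuffled indices into test, validation, and training parts
    test_idx = idx[:n_test]
    val_idx = idx[n_test:n_test + n_val]
    train_idx = idx[n_test + n_val:]
    return ((x[train_idx], y[train_idx]),
            (x[val_idx], y[val_idx]),
            (x[test_idx], y[test_idx]))

# Synthetic stand-in data: 100 samples with binary labels
x_all = np.arange(100).reshape(100, 1)
y_all = np.arange(100) % 2

(train_x, train_y), (val_x, val_y), (test_x, test_y) = three_way_split(x_all, y_all)
print(len(train_x), len(val_x), len(test_x))   # 70 15 15
```

The validation set would then go to `validation_data` in `model.fit`, and the test set would be used only once, in `model.evaluate`, after training is finished.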