# Attention with Keras

© Data Trainers LLC. GPL v 3.0.

Author: Axel Sirota

A whole new world of opportunities opens up once we use the layer implementations of the attention components. As of July 2023, Keras ships 3 attention layers:

- AdditiveAttention: This is the original attention from the Bahdanau paper, expressed with the Q, K, V formulation we saw in demo 2; in this case, K=V.
- Attention: This is the dot-product attention from *Luong et al.* that we saw in the first demo.
- MultiHeadAttention: The general attention everyone uses, and the one we will learn in this demo! It is basically many self-attention heads running in parallel, with their outputs combined.
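
To make the difference between the first two layers concrete, here is a minimal NumPy sketch of the two score functions. It assumes the simplified additive form (a single learned `scale` vector applied to `tanh(q + k)`, with no separate projection matrices), which is our reading of what the Keras layer does, and it sets K=V as in the list above. The helper names (`softmax`, `additive_scores`, `dot_product_scores`, `attend`) are illustrative and are not Keras APIs.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def additive_scores(query, key, scale):
    # Bahdanau-style score (assumed simplified form): sum_d scale_d * tanh(q_d + k_d).
    # query: [batch, Tq, dim], key: [batch, Tv, dim], scale: [dim]
    q = query[:, :, None, :]                          # [batch, Tq, 1, dim]
    k = key[:, None, :, :]                            # [batch, 1, Tv, dim]
    return np.sum(scale * np.tanh(q + k), axis=-1)    # [batch, Tq, Tv]

def dot_product_scores(query, key):
    # Luong-style score: plain dot product between each query and each key.
    return query @ key.transpose(0, 2, 1)             # [batch, Tq, Tv]

def attend(scores, value):
    # Turn scores into weights over the Tv axis and mix the values with them.
    weights = softmax(scores, axis=-1)
    return weights @ value, weights

rng = np.random.default_rng(0)
batch, Tq, Tv, dim = 3, 10, 10, 20
query = rng.normal(size=(batch, Tq, dim))
value = rng.normal(size=(batch, Tv, dim))
key = value                   # K = V, as in the list above
scale = np.ones(dim)          # stand-in for the learned scale vector

additive_out, additive_w = attend(additive_scores(query, key, scale), value)
dot_out, dot_w = attend(dot_product_scores(query, key), value)
print(additive_out.shape, dot_out.shape)  # (3, 10, 20) (3, 10, 20)
```

Both variants produce one output vector per query position; only the way the scores are computed differs.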

Let's get to it!


## Prep

In [ ]:
!pip install -U nltk gensim 'numpy<2' 'tensorflow-text==2.15.0' 'keras-nlp' 'keras-preprocessing'

Let's run some helper functions to set up the GPUs.

In [ ]:
import multiprocessing
import tensorflow as tf
import sys
from keras.models import Sequential
from keras.layers import Dense, Embedding, Lambda, ELU, Conv1D, MaxPooling1D, Dropout
from keras.preprocessing import sequence
from keras.preprocessing.text import Tokenizer
from sklearn.model_selection import train_test_split
from keras import preprocessing
from textblob import TextBlob, Word
from keras_preprocessing.sequence import pad_sequences
from keras.initializers import Constant
from tensorflow.keras.layers.experimental import preprocessing
from tensorflow.keras import Model, Input
import numpy as np
import re
import random
import os
import pandas as pd
import gensim
import warnings
import nltk
import time

TRACE = False

def set_seeds_and_trace():
  os.environ['PYTHONHASHSEED'] = '0'
  np.random.seed(42)
  tf.random.set_seed(42)
  random.seed(42)
  if TRACE:
    tf.debugging.set_log_device_placement(True)

def set_session_with_gpus_and_cores():
  cores = multiprocessing.cpu_count()
  gpus = len(tf.config.list_physical_devices('GPU'))
  config = tf.compat.v1.ConfigProto( device_count = {'GPU': gpus  , 'CPU': cores} , intra_op_parallelism_threads=1, inter_op_parallelism_threads=1)
  sess = tf.compat.v1.Session(config=config)
  tf.compat.v1.keras.backend.set_session(sess)

set_seeds_and_trace()
set_session_with_gpus_and_cores()
warnings.filterwarnings('ignore')
nltk.download('punkt')

## Attention *a la Bahdanau*

The easiest way to test a layer in Keras is to create a simple model that uses it, so we will do just that! This also shows how easy it is to add attention to your models, something we will use extensively when creating THE Transformer from scratch.

Notice we need a custom model class because the inputs need to be the query and the value, and they could have different embeddings as well.

In [3]:
class AttentionModel(tf.keras.Model):

  def __init__(self, vocab_size, max_tokens, embedding_dim, dropout_rate):
    super().__init__()
    self.embedding = tf.keras.layers.Embedding(input_dim=vocab_size, output_dim=embedding_dim, input_length=max_tokens)
    self.attention = tf.keras.layers.AdditiveAttention()
    self.dropout = tf.keras.layers.Dropout(dropout_rate)
    self.dense = tf.keras.layers.Dense(vocab_size, activation='softmax')

  def call(self, inputs, training=False):

    query, value = inputs
    # Query embeddings of shape [batch_size, Tq, dimension].
    query_embeddings = self.embedding(query)
    # Value embeddings of shape [batch_size, Tv, dimension].
    value_embeddings = self.embedding(value)
    # Notice we could have one embedding for the inputs and another for the outputs; we will see more of that later
    x = self.attention([query_embeddings, value_embeddings])
    x = self.dropout(x, training=training)
    x = self.dense(x, training=training)
    return x

  def build_graph(self, max_tokens, embedding_dim):
    query_input = tf.keras.Input(shape=(None, max_tokens, embedding_dim), dtype='int32')
    value_input = tf.keras.Input(shape=(None, max_tokens, embedding_dim), dtype='int32')
    x = (query_input, value_input)
    return Model(inputs=x, outputs=self.call(x))

In [4]:
model = AttentionModel(100, 10, 20, 0)

In [5]:
model.summary()

ValueError: ignored

Oh no! We need to call the model before we can summarize it. That is simple: let's simulate 3 sentences!

In [6]:
embedding_dim = 20
max_tokens = 10
query = tf.constant(np.random.randint(0, embedding_dim, size=(3,max_tokens)))
value = tf.constant(np.random.randint(0, embedding_dim, size=(3,max_tokens)))

response = model((query,value) )

In [7]:
response.shape

TensorShape([3, 10, 100])

In [8]:
model.summary()

Model: "attention_model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       multiple                  2000      
                                                                 
 additive_attention (Additi  multiple                  20        
 veAttention)                                                    
                                                                 
 dropout (Dropout)           multiple                  0         
                                                                 
 dense (Dense)               multiple                  2100      
                                                                 
Total params: 4120 (16.09 KB)
Trainable params: 4120 (16.09 KB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________


In [9]:
model.build_graph(max_tokens=max_tokens, embedding_dim=embedding_dim)

<keras.src.engine.functional.Functional at 0x7b106007fa00>

In [10]:
model.summary()

Model: "attention_model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, None, 10, 20, 20   2000      
                             )                                   
                                                                 
 additive_attention (Additi  (None, None, 10, 20, 20   20        
 veAttention)                )                                   
                                                                 
 dropout (Dropout)           (None, None, 10, 20, 20   0         
                             )                                   
                                                                 
 dense (Dense)               (None, None, 10, 20, 10   2100      
                             0)                                  
                                                                 
Total params: 4120 (16.09 KB)
Trainable params: 412

Notice that attention adds very few parameters (20 out of 4,120 here), passes a lot of contextual knowledge on to the following layers, and is parallelizable.
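
As a sanity check, the parameter counts in the summary above can be reproduced with plain arithmetic. This is a sketch under one assumption: that the additive attention layer's only weights are a single scale vector of size `embedding_dim`, which is consistent with the 20 parameters reported.

```python
vocab_size, embedding_dim = 100, 20

# Embedding: one vector of size embedding_dim per vocabulary entry.
embedding_params = vocab_size * embedding_dim

# Additive attention (assumption): a single learned scale vector.
attention_params = embedding_dim

# Dense: kernel of shape [embedding_dim, vocab_size] plus one bias per output unit.
dense_params = embedding_dim * vocab_size + vocab_size

total_params = embedding_params + attention_params + dense_params
print(embedding_params, attention_params, dense_params, total_params)
# 2000 20 2100 4120
```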

## MultiHead Attention

Now you are ready to see Multi-Head Attention. The idea is quite simple: just as a CNN has many filters, with each convolution checking a different aspect of an image, having many self-attention heads lets us check different aspects of our entity, globally. As an image:

<figure>
<center>
<img src='https://www.dropbox.com/s/wjfxpap06viclhv/mha.png?raw=1'  />
<figcaption>Attention</figcaption></center>
</figure>

Each head performs scaled dot-product attention, as we did before with that weird formula, and then we concatenate the heads' outputs!
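
Here is a minimal NumPy sketch of that recipe: project the inputs into per-head queries, keys and values, run scaled dot-product attention in every head, concatenate the heads, and apply an output projection. It is not the Keras implementation: biases and dropout are left out, the weights are random stand-ins for learned ones, and the names (`multi_head_attention`, `wq`, `wk`, `wv`, `wo`) are made up for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(query, value, wq, wk, wv, wo):
    # query: [batch, Tq, dim], value: [batch, Tv, dim]; the key is the value (K = V).
    # wq, wk, wv: [dim, heads, head_dim]; wo: [heads * head_dim, dim].
    q = np.einsum('btd,dhk->bhtk', query, wq)    # [batch, heads, Tq, head_dim]
    k = np.einsum('bsd,dhk->bhsk', value, wk)    # [batch, heads, Tv, head_dim]
    v = np.einsum('bsd,dhk->bhsk', value, wv)    # [batch, heads, Tv, head_dim]

    head_dim = q.shape[-1]
    # Scaled dot-product attention, computed independently in each head.
    scores = q @ k.transpose(0, 1, 3, 2) / np.sqrt(head_dim)   # [batch, heads, Tq, Tv]
    weights = softmax(scores, axis=-1)
    heads = weights @ v                                        # [batch, heads, Tq, head_dim]

    # Concatenate the heads along the feature axis, then project back to dim.
    b, h, t, hd = heads.shape
    concat = heads.transpose(0, 2, 1, 3).reshape(b, t, h * hd)  # [batch, Tq, heads * head_dim]
    return concat @ wo, weights

rng = np.random.default_rng(0)
batch, T, dim, num_heads, head_dim = 3, 10, 20, 3, 2
query = rng.normal(size=(batch, T, dim))
value = rng.normal(size=(batch, T, dim))
wq, wk, wv = (rng.normal(size=(dim, num_heads, head_dim)) for _ in range(3))
wo = rng.normal(size=(num_heads * head_dim, dim))

out, weights = multi_head_attention(query, value, wq, wk, wv, wo)
print(out.shape, weights.shape)  # (3, 10, 20) (3, 3, 10, 10)
```

Each head gets its own small projection, so every head can attend to a different pattern while the total cost stays modest.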

In [11]:
class MultiHeadAttentionModel(tf.keras.Model):

  def __init__(self, num_heads, vocab_size, attention_dim, max_tokens, embedding_dim, dropout_rate):
    super().__init__()
    self.embedding = tf.keras.layers.Embedding(input_dim=vocab_size, output_dim=embedding_dim, input_length=max_tokens)

    # key_dim is the size of each attention head for query and key; we can also pass value_dim if the value head size should differ
    self.attention = tf.keras.layers.MultiHeadAttention(num_heads=num_heads, key_dim=attention_dim, dropout=dropout_rate)

    self.dense = tf.keras.layers.Dense(vocab_size, activation='softmax')

  def call(self, inputs, training=False):

    query, value = inputs
    # Query embeddings of shape [batch_size, Tq, dimension].
    query_embeddings = self.embedding(query)
    # Value embeddings of shape [batch_size, Tv, dimension].
    value_embeddings = self.embedding(value)
    x, weights = self.attention(query_embeddings, value_embeddings, return_attention_scores=True)  # We return the scores to do our plot!
    x = self.dense(x, training=training)
    return x, weights

  def build_graph(self, max_tokens, embedding_dim):
    query_input = tf.keras.Input(shape=(max_tokens, embedding_dim), dtype='int32')
    value_input = tf.keras.Input(shape=(max_tokens, embedding_dim), dtype='int32')
    x = (query_input, value_input)
    return Model(inputs=x, outputs=self.call(x))

In [12]:
vocab_size=100
model = MultiHeadAttentionModel(num_heads=3, vocab_size=vocab_size, attention_dim=2, max_tokens=max_tokens, embedding_dim=embedding_dim, dropout_rate=0)

In [13]:
model.build_graph(max_tokens=max_tokens, embedding_dim=embedding_dim)

<keras.src.engine.functional.Functional at 0x7b105cbb3b80>

In [14]:
query = tf.constant(np.random.randint(0,vocab_size, size=(3,max_tokens, 10)))
value = tf.constant(np.random.randint(0,vocab_size, size=(3,max_tokens, 10)))

response, weights = model((query,value) )

In [15]:
response.shape

TensorShape([3, 10, 10, 100])

**Can you guess where each value in `response.shape` comes from?**

In [16]:
weights.shape

TensorShape([3, 3, 10, 10, 10, 10])

**And for the weights?**

In [17]:
model.summary()

Model: "multi_head_attention_model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_1 (Embedding)     (None, 10, 20, 20)        2000      
                                                                 
 multi_head_attention (Mult  ((None, 10, 20, 20),      518       
 iHeadAttention)              (None, 3, 10, 20, 10,              
                             20))                                
                                                                 
 dense_1 (Dense)             (None, 10, 20, 100)       2100      
                                                                 
Total params: 4618 (18.04 KB)
Trainable params: 4618 (18.04 KB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________


Again, notice that even an attention layer as complex as multi-head attention did not add many parameters (518 out of 4,618), yet it adds a lot of lexical intelligence.
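
The 518 parameters of the multi-head attention layer can also be recovered by hand. The sketch below assumes the layer holds four dense projections with biases (query, key, value and output) and that the value head size defaults to `key_dim`; under those assumptions the arithmetic matches the summary above.

```python
num_heads, key_dim, embedding_dim, vocab_size = 3, 2, 20, 100
value_dim = key_dim  # assumption: value head size defaults to key_dim

# Each input projection maps embedding_dim -> (num_heads, head size), plus biases.
query_proj = embedding_dim * num_heads * key_dim + num_heads * key_dim
key_proj = embedding_dim * num_heads * key_dim + num_heads * key_dim
value_proj = embedding_dim * num_heads * value_dim + num_heads * value_dim

# The output projection maps the concatenated heads back to embedding_dim, plus biases.
output_proj = num_heads * value_dim * embedding_dim + embedding_dim

attention_params = query_proj + key_proj + value_proj + output_proj
embedding_params = vocab_size * embedding_dim
dense_params = embedding_dim * vocab_size + vocab_size

total_params = embedding_params + attention_params + dense_params
print(attention_params, total_params)
# 518 4618
```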