<a href="https://colab.research.google.com/github/anamitradm/finetuning-llms-hf/blob/main/5_Attention_with_Keras_mysoliution.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Attention with Keras

© Data Trainers LLC. GPL v 3.0.

Author: Axel Sirota

A whole new world of opportunities opens up once we use the layer implementations of the attention components. As of July 2023, three layers are implemented:

- AdditiveAttention: This is the original attention from the Bahdanau paper, expressed with the Q, K, V concept we saw in demo 2; in this case K=V by default.
- Attention: This is the dot-product attention from *Luong et al.* that we saw in the first demo.
- MultiHeadAttention: The general attention everyone uses, and the one we will learn in this demo! It is basically several attention heads run in parallel, with their outputs concatenated.
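To make the list concrete, here is a minimal sketch that calls the three layers on random tensors. The sizes (`batch`, `t_q`, `t_v`, `dim`) and the head settings are arbitrary values picked for illustration, not values from the demos.

```python
import numpy as np
import tensorflow as tf

# Illustrative sizes (assumptions): 2 sentences, 4 query steps, 6 value steps, 8 features
batch, t_q, t_v, dim = 2, 4, 6, 8
rng = np.random.default_rng(0)
query = tf.constant(rng.normal(size=(batch, t_q, dim)), dtype=tf.float32)
value = tf.constant(rng.normal(size=(batch, t_v, dim)), dtype=tf.float32)

additive = tf.keras.layers.AdditiveAttention()  # Bahdanau-style
dot = tf.keras.layers.Attention()               # Luong-style dot product
mha = tf.keras.layers.MultiHeadAttention(num_heads=2, key_dim=4)

# The first two layers take a [query, value] list; the key defaults to the value.
out_additive = additive([query, value])
out_dot = dot([query, value])
# MultiHeadAttention takes keyword arguments instead of a list.
out_mha = mha(query=query, value=value)

print(out_additive.shape, out_dot.shape, out_mha.shape)
```

In all three cases the output has one vector per query step, so it keeps the query's `[batch, t_q, dim]` shape.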

Let's get to it!


## Prep

In [2]:
!pip install -U nltk 'gensim==4.2.0' 'keras-nlp' 'keras-preprocessing' 'tensorflow-text>=2.15'
!pip install Keras-Preprocessing


In [5]:
import multiprocessing
import tensorflow as tf
from tensorflow import keras
import sys
import keras.backend as K
from keras.models import Sequential
from keras.layers import Dense, Embedding, Lambda, ELU, Conv1D, MaxPooling1D, Dropout
from keras.preprocessing import sequence
from tensorflow.keras.preprocessing.text import Tokenizer
from sklearn.model_selection import train_test_split
from keras import preprocessing
from textblob import TextBlob, Word
from keras_preprocessing.sequence import pad_sequences
from keras.initializers import Constant
from tensorflow.python.keras import backend as K

from tensorflow.keras import Model, Input
import numpy as np
import re
import random
import os
import pandas as pd
import gensim
import warnings
import nltk
import time

TRACE = False

def set_seeds_and_trace():
  os.environ['PYTHONHASHSEED'] = '0'
  np.random.seed(42)
  tf.random.set_seed(42)
  random.seed(42)
  if TRACE:
    tf.debugging.set_log_device_placement(True)

def set_session_with_gpus_and_cores():
  cores = multiprocessing.cpu_count()
  gpus = len(tf.config.list_physical_devices('GPU'))
  config = tf.compat.v1.ConfigProto( device_count = {'GPU': gpus  , 'CPU': cores} , intra_op_parallelism_threads=1, inter_op_parallelism_threads=1)
  sess = tf.compat.v1.Session(config=config)
  K.set_session(sess)

set_seeds_and_trace()
set_session_with_gpus_and_cores()
warnings.filterwarnings('ignore')
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

## Attention *a la Bahdanau*

The easiest way to test a layer in Keras is to build a simple model around it, so we will do just that! This also shows how easy it is to add attention to your models, something we will use extensively when creating THE Transformer from scratch.

Notice we need a custom model class because the inputs need to be the query and the value, and the two could use different embeddings as well.

In [6]:
class AttentionModel(tf.keras.Model):

  def __init__(self, vocab_size, max_tokens, embedding_dim, dropout_rate):
    super().__init__()
    # Classic Embedding layer, shared by query and value; token id 0 is treated as padding
    self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim, mask_zero=True)
    # Bahdanau-style attention; when no key is passed, the value doubles as the key
    self.attention = tf.keras.layers.AdditiveAttention()
    self.dropout = tf.keras.layers.Dropout(dropout_rate)
    self.dense = tf.keras.layers.Dense(vocab_size, activation='softmax')

  def call(self, inputs, training=False):
    # inputs is a (query, value) pair of token-id tensors of shape [batch_size, T]
    query, value = inputs

    # Query embeddings of shape [batch_size, Tq, dimension].
    query_embeddings = self.embedding(query)

    # Value embeddings of shape [batch_size, Tv, dimension].
    value_embeddings = self.embedding(value)

    # Notice we could have an embedding for the inputs and another embedding for outputs, we will see more of that later
    # Apply attention
    x = self.attention([query_embeddings, value_embeddings])
    x = self.dropout(x, training=training)
    x = self.dense(x) # Pass the output of the attention layer to the dense layer
    return x

  def build_graph(self, max_tokens, embedding_dim):
    # embedding_dim is unused here: the inputs are token ids, not embeddings
    query_input = tf.keras.Input(shape=(max_tokens,), dtype='int32')
    value_input = tf.keras.Input(shape=(max_tokens,), dtype='int32')
    x = (query_input, value_input)
    return Model(inputs=x, outputs=self.call(x))

In [7]:
model = AttentionModel(100, 10, 20, 0)

In [8]:
model.summary()

Oh no! We need to call the model first. Well, that is simple: let's simulate 3 sentences!

In [None]:

vocab_size = 100
embedding_dim = 20
max_tokens = 10
# Token ids must lie in [0, vocab_size)
query = tf.constant(np.random.randint(0, vocab_size, size=(3,max_tokens)))
value = tf.constant(np.random.randint(0, vocab_size, size=(3,max_tokens)))
x = (query, value)
# Call the model we created above instead of building a second one
response = model(x, training=False)

In [None]:
response.shape

TensorShape([3, 10, 100])

In [None]:
model.summary()

In [None]:
model.build_graph(max_tokens=max_tokens, embedding_dim=embedding_dim)

<Functional name=functional_1, built=True>

In [None]:
model.summary()

Notice that attention adds very few parameters, passes a lot of knowledge on to the following layers, and is parallelizable.
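To put a number on "very few parameters", the sketch below builds the two single-head layers on random inputs and counts their trainable weights. The tensor sizes are illustrative assumptions chosen to match the embedding dimension of 20 used above.

```python
import tensorflow as tf

dim = 20  # same size as the embedding dimension used in this demo
query = tf.random.normal((3, 10, dim))
value = tf.random.normal((3, 10, dim))

additive = tf.keras.layers.AdditiveAttention()  # use_scale=True by default
dot = tf.keras.layers.Attention()               # use_scale=False by default

# Layers create their weights on the first call
_ = additive([query, value])
_ = dot([query, value])

additive_params = additive.count_params()
dot_params = dot.count_params()
print(additive_params, dot_params)
```

With default settings, `AdditiveAttention` learns a single scale vector with one entry per feature, and plain dot-product `Attention` learns nothing at all; almost all of the model's parameters live in the embedding and dense layers.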

## MultiHead Attention

Now you are ready to see multi-head attention. The idea is quite simple: just as a CNN has many filters, with each convolution picking up a different aspect of an image, having many attention heads lets us check different aspects of our entity, globally. As a picture:

<figure>
<center>
<img src='https://www.dropbox.com/s/wjfxpap06viclhv/mha.png?raw=1'  />
<figcaption>Attention</figcaption></center>
</figure>

Each head performs scaled dot-product attention, as we did before with the weird formula, and then we concatenate the heads!
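The "project, attend per head, concatenate" recipe can be sketched in plain NumPy. This is a simplified illustration, not the Keras implementation: the weight names (`w_q`, `w_k`, `w_v`, `w_o`), the sizes, and the absence of biases and masking are all assumptions made to keep it short.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(q, k, v):
    # softmax(Q K^T / sqrt(d_k)) V
    d_k = q.shape[-1]
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(d_k)
    weights = softmax(scores, axis=-1)
    return weights @ v, weights

def multi_head_attention(x_q, x_v, w_q, w_k, w_v, w_o):
    # w_q, w_k, w_v: [num_heads, dim, head_dim]; w_o: [num_heads * head_dim, dim]
    heads = []
    for h in range(w_q.shape[0]):
        q = x_q @ w_q[h]  # each head gets its own projections
        k = x_v @ w_k[h]
        v = x_v @ w_v[h]
        head_output, _ = scaled_dot_product_attention(q, k, v)
        heads.append(head_output)
    concat = np.concatenate(heads, axis=-1)  # [batch, t_q, num_heads * head_dim]
    return concat @ w_o                      # project back to [batch, t_q, dim]

rng = np.random.default_rng(0)
batch, t_q, t_v, dim, num_heads, head_dim = 2, 4, 6, 8, 2, 3
x_q = rng.normal(size=(batch, t_q, dim))
x_v = rng.normal(size=(batch, t_v, dim))
w_q, w_k, w_v = (rng.normal(size=(num_heads, dim, head_dim)) for _ in range(3))
w_o = rng.normal(size=(num_heads * head_dim, dim))

output = multi_head_attention(x_q, x_v, w_q, w_k, w_v, w_o)
print(output.shape)
```

Because every head has its own projection matrices, each one can learn to attend to a different aspect of the sequence, which is the analogy with CNN filters made above.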

In [1]:
class MultiHeadAttentionModel(tf.keras.Model):

  def __init__(self, num_heads, vocab_size, attention_dim, max_tokens, embedding_dim, dropout_rate):
    super().__init__()
    self.embedding = tf.keras.layers.Embedding(input_dim=vocab_size, output_dim=embedding_dim)

    # Add MHA, key_dim stands for size of each attention head for query and key, we can also pass value_dim if the value heads need a different size
    self.attention = tf.keras.layers.MultiHeadAttention(num_heads=num_heads, key_dim=attention_dim)
    self.dropout = tf.keras.layers.Dropout(dropout_rate)
    self.dense = tf.keras.layers.Dense(vocab_size, activation='softmax')

  def call(self, inputs, training=False):
    # inputs is a (query, value) pair of token-id tensors of shape [batch_size, T]
    query, value = inputs
    # Query embeddings of shape [batch_size, Tq, dimension].
    query_embeddings = self.embedding(query)
    # Value embeddings of shape [batch_size, Tv, dimension].
    value_embeddings = self.embedding(value)
    # Apply attention and also return the weights. We return the scores to do our plot!
    # MultiHeadAttention takes keyword arguments, not a [query, value] list
    x, weights = self.attention(query=query_embeddings, value=value_embeddings, return_attention_scores=True, training=training)
    x = self.dropout(x, training=training)
    x = self.dense(x)
    return x, weights

  def build_graph(self, max_tokens, embedding_dim):
    # embedding_dim is unused here: the inputs are token ids, not embeddings
    query_input = tf.keras.Input(shape=(max_tokens,), dtype='int32')
    value_input = tf.keras.Input(shape=(max_tokens,), dtype='int32')
    x = (query_input, value_input)
    return Model(inputs=x, outputs=self.call(x))

In [None]:
vocab_size=100
model = MultiHeadAttentionModel(num_heads=3, vocab_size=vocab_size, attention_dim=2, max_tokens=max_tokens, embedding_dim=embedding_dim, dropout_rate=0)

In [None]:
model.build_graph(max_tokens=max_tokens, embedding_dim=embedding_dim)

In [None]:
# Token ids of shape [batch_size, max_tokens]
query = tf.constant(np.random.randint(0,vocab_size, size=(3,max_tokens)))
value = tf.constant(np.random.randint(0,vocab_size, size=(3,max_tokens)))

response, weights = model((query,value) )

In [None]:
response.shape

TensorShape([3, 10, 100])

**Can you guess where each value in `response.shape` comes from?**

In [None]:
weights.shape

**And for the weights?**

In [None]:
model.summary()

Again, notice that even attention as complex as multi-head attention did not add many parameters, yet it adds a lot of lexical intelligence.
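A rough back-of-the-envelope count backs this up. The helper below, `mha_param_count`, is a hypothetical function written for this note; it assumes the standard layout of three per-head projections (query, key, value) plus one output projection, all with biases, and that the value heads have the same size as the key heads.

```python
def mha_param_count(model_dim, num_heads, key_dim):
    # Query, key and value projections: kernel [model_dim, num_heads, key_dim] plus bias
    projection = model_dim * num_heads * key_dim + num_heads * key_dim
    # Output projection: kernel [num_heads, key_dim, model_dim] plus bias
    output = num_heads * key_dim * model_dim + model_dim
    return 3 * projection + output

# Settings used in this demo
vocab_size, embedding_dim = 100, 20
attention_params = mha_param_count(embedding_dim, num_heads=3, key_dim=2)
embedding_params = vocab_size * embedding_dim            # embedding table
dense_params = embedding_dim * vocab_size + vocab_size   # final softmax layer

print(attention_params, embedding_params, dense_params)  # 518 2000 2100
```

Under these assumptions the attention block is roughly a quarter the size of either the embedding or the final dense layer. Comparing the result with the attention row of `model.summary()` is a quick way to check the assumptions.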