<a id='top'></a><a name='top'></a>
# Chapter 16: Implementing Multi-Head Attention in Keras

* [Introduction](#introduction)
* [16.0 Imports and Setup](#16.0)
* [16.1 Recap of Multi-Head Attention](#16.1)
* [16.2 Implementing Multi-Head Attention from Scratch](#16.2)
* [16.3 Testing Out the Code](#16.3)
* [Extra](#extra)

---
<a name='introduction'></a><a id='introduction'></a>
# Introduction
<a href="#top">[back to top]</a>

### Dataset

* TODO

### Explore
* The layers that form part of the multi-head attention mechanism
* How to implement the multi-head attention mechanism from scratch

---
<a name='16.0'></a><a id='16.0'></a>
# 16.0 Imports and Setup
<a href="#top">[back to top]</a>

In [1]:
req_file = "requirements_16.txt"

In [2]:
%%writefile {req_file}
isort
scikit-learn-intelex
watermark

Overwriting requirements_16.txt


In [3]:
import sys
IS_COLAB = 'google.colab' in sys.modules

if IS_COLAB:
    print("Installing packages")
    !pip install --upgrade --quiet -r {req_file}
else:
    print("Running locally.")

# Need to import before sklearn
from sklearnex import patch_sklearn
patch_sklearn()

Running locally.


Intel(R) Extension for Scikit-learn* enabled (https://github.com/intel/scikit-learn-intelex)


In [3]:
%%writefile imports.py
import locale
import math
import pprint
import warnings

import numpy as np
import tensorflow as tf
from tensorflow.keras import Model
from tensorflow.keras.activations import softmax
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import Layer
from tqdm.auto import tqdm
from watermark import watermark

Overwriting imports.py


In [4]:
!isort imports.py --sl
!cat imports.py

import locale
import math
import pprint

import numpy as np
import tensorflow as tf
from tensorflow.keras import Model
from tensorflow.keras.activations import softmax
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import Layer
from tqdm.auto import tqdm
from watermark import watermark


In [5]:
import locale
import math
import pprint
import warnings

import numpy as np
import tensorflow as tf
from tensorflow.keras import Model
from tensorflow.keras.activations import softmax
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import Layer
from tqdm.auto import tqdm
from watermark import watermark

In [6]:
def HR():
    print("-"*40)
    
def getpreferredencoding(do_setlocale = True):
    return "UTF-8"

locale.getpreferredencoding = getpreferredencoding
warnings.filterwarnings('default')
BASE_DIR = '.'
pp = pprint.PrettyPrinter(indent=4)

seed = 42

print(watermark(iversions=True,globals_=globals(),python=True,machine=True))

Python implementation: CPython
Python version       : 3.8.12
IPython version      : 7.34.0

Compiler    : Clang 13.0.0 (clang-1300.0.29.3)
OS          : Darwin
Release     : 21.6.0
Machine     : x86_64
Processor   : i386
CPU cores   : 4
Architecture: 64bit

tensorflow: 2.9.3
numpy     : 1.23.5



---
<a name='16.1'></a><a id='16.1'></a>
# 16.1 Recap of Multi-Head Attention
<a href="#top">[back to top]</a>
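
As a brief reminder, the standard formulation of multi-head attention (from the paper "Attention Is All You Need") is given here for reference. The symbols $h$, $d_k$, $d_v$ and $d_{model}$ match the variable names used in this chapter's code; the per-head projection matrices $W_i^Q$, $W_i^K$, $W_i^V$ are the paper's notation and are assumed here, since the notebook uses a single `Dense` layer per input and splits its output across heads.

$$
\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V
$$

$$
\text{head}_i = \text{Attention}(QW_i^{Q},\; KW_i^{K},\; VW_i^{V}), \qquad i = 1, \dots, h
$$

$$
\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \dots, \text{head}_h)\,W^{O}
$$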

---
<a name='16.2'></a><a id='16.2'></a>
# 16.2 Implementing Multi-Head Attention from Scratch
<a href="#top">[back to top]</a>

In [7]:
class DotProductAttention(Layer):
    def __init__(self, **kwargs):
        super().__init__(**kwargs)

    def call(self, queries, keys, values, d_k, mask=None):
        # Scoring the queries against the keys after transposing the latter, and scaling
        scores = tf.matmul(queries, keys, transpose_b=True) / tf.math.sqrt(tf.cast(d_k, tf.float32))

        # Apply mask to the attention scores
        if mask is not None:
            scores += -1e9 * mask

        # Computing the weights by a softmax operation
        weights = softmax(scores)

        # Computing the attention by a weighted sum of the value vectors
        return tf.matmul(weights, values)
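
To see what the mask term does, the same computation can be sketched in plain NumPy. This is only an illustration under stated assumptions, not part of the Keras layer: the helper names `softmax` and `dot_product_attention`, the tiny tensor sizes, and the look-ahead mask are all hypothetical. As in the layer above, a mask value of 1 marks a position to hide.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability before exponentiating
    shifted = x - np.max(x, axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=axis, keepdims=True)


def dot_product_attention(queries, keys, values, d_k, mask=None):
    # Score the queries against the transposed keys and scale by sqrt(d_k)
    scores = queries @ np.swapaxes(keys, -1, -2) / np.sqrt(d_k)
    # Positions where mask == 1 receive a very large negative score
    if mask is not None:
        scores = scores + -1e9 * mask
    # Softmax over the key axis turns scores into attention weights
    weights = softmax(scores)
    # Weighted sum of the value vectors; weights returned for inspection
    return weights @ values, weights


rng = np.random.default_rng(0)
seq_len, d_k, d_v = 4, 8, 8  # illustrative sizes only
q = rng.random((1, seq_len, d_k))
k = rng.random((1, seq_len, d_k))
v = rng.random((1, seq_len, d_v))

# Hypothetical look-ahead mask: the strict upper triangle is hidden
look_ahead_mask = np.triu(np.ones((seq_len, seq_len)), k=1)

output, weights = dot_product_attention(q, k, v, d_k, look_ahead_mask)
print(output.shape)
print(weights[0].round(3))
```

Each row of `weights` still sums to 1, but the entries above the diagonal collapse to zero, so the first position can only attend to itself.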

In [8]:
# Implementing the Multi-Head Attention
class MultiHeadAttention(Layer):
    def __init__(self, h, d_k, d_v, d_model, **kwargs):
        super().__init__(**kwargs)
        self.attention = DotProductAttention()  # Scaled dot product attention
        self.heads = h  # Number of attention heads to use
        self.d_k = d_k  # Dimensionality of the linearly projected queries and keys
        self.d_v = d_v  # Dimensionality of the linearly projected values
        self.d_model = d_model  # Dimensionality of the model
        self.W_q = Dense(d_k)   # Learned projection matrix for the queries
        self.W_k = Dense(d_k)   # Learned projection matrix for the keys
        self.W_v = Dense(d_v)   # Learned projection matrix for the values
        self.W_o = Dense(d_model) # Learned projection matrix for the multi-head output

        
    def reshape_tensor(self, x, heads, flag):
        if flag:
            # Tensor shape after reshaping and transposing:
            # (batch_size, heads, seq_length, -1)
            x = tf.reshape(x, shape=(tf.shape(x)[0], tf.shape(x)[1], heads, -1))
            x = tf.transpose(x, perm=(0, 2, 1, 3))
        else:
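            # Reverting the reshaping and transposing operations:
            # (batch_size, seq_length, d_v)
            # This branch is only applied to the attention output, whose
            # concatenated width is d_v (identical to d_k when d_k == d_v)
            x = tf.transpose(x, perm=(0, 2, 1, 3))
            x = tf.reshape(x, shape=(tf.shape(x)[0], tf.shape(x)[1], self.d_v))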
        return x

    
    def call(self, queries, keys, values, mask=None):
        # Rearrange the queries to be able to compute all heads in parallel
        q_reshaped = self.reshape_tensor(self.W_q(queries), self.heads, True)
        # Resulting tensor shape: (batch_size, heads, input_seq_length, -1)

        # Rearrange the keys to be able to compute all heads in parallel
        k_reshaped = self.reshape_tensor(self.W_k(keys), self.heads, True)
        # Resulting tensor shape: (batch_size, heads, input_seq_length, -1)

        # Rearrange the values to be able to compute all heads in parallel
        v_reshaped = self.reshape_tensor(self.W_v(values), self.heads, True)
        # Resulting tensor shape: (batch_size, heads, input_seq_length, -1)

        # Compute the multi-head attention output using the reshaped queries,
        # keys, and values
        o_reshaped = self.attention(q_reshaped, k_reshaped, v_reshaped, self.d_k, mask)
        # Resulting tensor shape: (batch_size, heads, input_seq_length, -1)

        # Rearrange back the output into concatenated form
        output = self.reshape_tensor(o_reshaped, self.heads, False)
        # Resulting tensor shape: (batch_size, input_seq_length, d_v)

        # Apply one final linear projection to the output to generate the multi-head
        # attention. Resulting tensor shape: (batch_size, input_seq_length, d_model)
        return self.W_o(output)
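
The reshape-and-transpose step in `reshape_tensor` is easier to follow on a small array. The NumPy sketch below mirrors the two branches with hypothetical helpers `split_heads` and `merge_heads` (these names are not part of the notebook) and checks that merging undoes splitting.

```python
import numpy as np


def split_heads(x, heads):
    # (batch, seq_len, width) -> (batch, heads, seq_len, width // heads)
    batch, seq_len, width = x.shape
    x = x.reshape(batch, seq_len, heads, width // heads)
    return x.transpose(0, 2, 1, 3)


def merge_heads(x):
    # (batch, heads, seq_len, depth) -> (batch, seq_len, heads * depth)
    batch, heads, seq_len, depth = x.shape
    return x.transpose(0, 2, 1, 3).reshape(batch, seq_len, heads * depth)


batch, seq_len, heads, d_k = 2, 5, 8, 64  # illustrative sizes only
x = np.arange(batch * seq_len * d_k, dtype=np.float32).reshape(batch, seq_len, d_k)

per_head = split_heads(x, heads)
print(per_head.shape)  # (2, 8, 5, 8)

# Head i holds the i-th contiguous slice of the last axis
assert np.array_equal(per_head[:, 3], x[:, :, 24:32])

# Merging the heads restores the original layout exactly
restored = merge_heads(per_head)
assert np.array_equal(restored, x)
```

With these sizes, each of the 8 heads sees a slice of width 64 / 8 = 8, which is the value that the `-1` in the `tf.reshape` call resolves to.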

In [9]:
tf.random.set_seed(seed)
np.random.seed(seed)

input_seq_length = 5  # Maximum length of the input sequence
h = 8  # Number of self-attention heads
d_k = 64  # Dimensionality of the linearly projected queries and keys
d_v = 64  # Dimensionality of the linearly projected values
d_model = 512  # Dimensionality of the model sub-layers' outputs
batch_size = 64  # Batch size from the training process

queries = np.random.random((batch_size, input_seq_length, d_k))
keys = np.random.random((batch_size, input_seq_length, d_k))
values = np.random.random((batch_size, input_seq_length, d_v))

multihead_attention = MultiHeadAttention(h, d_k, d_v, d_model)

test = multihead_attention(queries, keys, values)
print(test.shape)
HR()
print(test.numpy().round(4))

(64, 5, 512)
----------------------------------------
[[[-0.483   0.0095  0.2367 ...  0.0082 -0.14    0.3893]
  [-0.483   0.0083  0.2397 ...  0.0093 -0.1388  0.3903]
  [-0.4804  0.0089  0.2385 ...  0.0082 -0.1377  0.392 ]
  [-0.482   0.0086  0.239  ...  0.0061 -0.1368  0.3931]
  [-0.4803  0.0096  0.2381 ...  0.0064 -0.139   0.3917]]

 [[-0.3068 -0.1063  0.0525 ...  0.1303 -0.1617  0.3285]
  [-0.3079 -0.109   0.0535 ...  0.1306 -0.1605  0.3273]
  [-0.3074 -0.1055  0.0474 ...  0.1335 -0.1588  0.3233]
  [-0.3063 -0.1089  0.0549 ...  0.1296 -0.1609  0.3292]
  [-0.3093 -0.1074  0.054  ...  0.1316 -0.1613  0.3283]]

 [[-0.4409 -0.0034  0.0741 ...  0.0784 -0.1859  0.3943]
  [-0.4421 -0.0052  0.0719 ...  0.0801 -0.1858  0.3927]
  [-0.4401 -0.0073  0.0695 ...  0.0807 -0.1872  0.3899]
  [-0.4416 -0.007   0.0694 ...  0.0819 -0.1852  0.3903]
  [-0.4441 -0.0122  0.0711 ...  0.0843 -0.1868  0.3896]]

 ...

 [[-0.427  -0.025   0.0841 ...  0.1249 -0.1349  0.3819]
  [-0.4269 -0.0239  0.0825 ...  0.1256

2023-06-20 11:12:58.542612: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
