<a href="https://colab.research.google.com/github/cheroualiyakoub/Attention-is-all-you-need/blob/main/1.%20Attention%20Is%20All%20You%20Need%20Paper/Transformers.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Natural Language Processing | Transformer

**IMPORTANT**<br>
Enable **GPU acceleration** by going to *Runtime > Change Runtime Type*. Keep in mind that, on certain tiers, you're not guaranteed GPU access depending on usage history and current load.
<br><br>
Also, if you're running this in the cloud rather than a local Jupyter server on your machine, then the notebook will *time out* after a period of inactivity.
<br><br>
Refer to this link on how to run Colab notebooks locally on your machine to avoid this issue:<br>
https://research.google.com/colaboratory/local-runtimes.html

In [34]:
!pip install BPEmb

import math
import numpy as np
import tensorflow as tf

from bpemb import BPEmb



# Transformers From Scratch

We'll build a transformer from scratch, layer-by-layer. We'll start with the **Multi-Head Self-Attention** layer since that's the most involved bit. Once we have that working, the rest of the model will look familiar if you've been following the course so far.

## Multi-Head Self-Attention

#### Scaled Dot Product Self-Attention


Inside each attention head is a **Scaled Dot Product Self-Attention** operation as we covered in the slides. Given *queries*, *keys*, and *values*, the operation returns a new "mix" of the values.

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

The following function implements this and also takes a mask to account for padding and for masking future tokens for decoding (i.e. **look-ahead mask**). We'll cover masking later in the notebook.

In [35]:
def scaled_dot_product_attention(query, key, value, mask=None):
  key_dim = tf.cast(tf.shape(key)[-1], tf.float32)
  scaled_scores = tf.matmul(query, key, transpose_b=True) / np.sqrt(key_dim)

  if mask is not None:
    scaled_scores = tf.where(mask==0, -np.inf, scaled_scores)

  softmax = tf.keras.layers.Softmax()
  weights = softmax(scaled_scores)
  return tf.matmul(weights, value), weights
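
To make the masking step concrete before it is covered in detail, here is a minimal NumPy sketch of the same computation with a look-ahead mask applied. The helper names (`softmax_np`, `attention_np`) and the random inputs are illustrative assumptions, not part of the notebook's TensorFlow implementation.

```python
import numpy as np

def softmax_np(scores):
    # Subtract the row-wise max for numerical stability before exponentiating.
    shifted = scores - np.max(scores, axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=-1, keepdims=True)

def attention_np(query, key, value, mask=None):
    # softmax(Q K^T / sqrt(d_k)) V, mirroring the formula above.
    d_k = key.shape[-1]
    scores = query @ np.swapaxes(key, -1, -2) / np.sqrt(d_k)
    if mask is not None:
        # Positions where the mask is 0 get a score of -inf, so their weight is 0.
        scores = np.where(mask == 0, -np.inf, scores)
    weights = softmax_np(scores)
    return weights @ value, weights

rng = np.random.default_rng(0)
q = rng.random((3, 4))
k = rng.random((3, 4))
v = rng.random((3, 4))

# Look-ahead mask: position i may only attend to positions 0..i.
look_ahead_mask = np.tril(np.ones((3, 3)))
masked_out, masked_weights = attention_np(q, k, v, look_ahead_mask)
print(masked_weights.round(3))
```

Every row of the weights still sums to 1, and the first position can only attend to itself, so its output equals its own value vector.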

Suppose our *queries*, *keys*, and *values* each have a sequence length of 3 and a dimension of 4.

In [36]:
seq_len = 3
embed_dim = 4

queries = np.random.rand(seq_len, embed_dim)
keys = np.random.rand(seq_len, embed_dim)
values = np.random.rand(seq_len, embed_dim)

print("Queries:\n", queries)

Queries:
 [[0.09016255 0.05664876 0.98422602 0.23257678]
 [0.64433269 0.11234856 0.32055715 0.21039094]
 [0.53803271 0.42018061 0.81938405 0.90825181]]


This would be the self-attention output and weights.

In [37]:
output, attn_weights = scaled_dot_product_attention(queries, keys, values)

print("Output\n", output, "\n")
print("Weights\n", attn_weights)

Output
 tf.Tensor(
[[0.10308886 0.5407429  0.5554236  0.51850426]
 [0.11125512 0.54034936 0.5352532  0.5036852 ]
 [0.1103044  0.5422263  0.53625214 0.5054381 ]], shape=(3, 4), dtype=float32) 

Weights
 tf.Tensor(
[[0.28296292 0.39674172 0.32029533]
 [0.33348697 0.29106882 0.37544426]
 [0.33065036 0.30199978 0.36734986]], shape=(3, 3), dtype=float32)


#### Generating queries, keys, and values for multiple heads.

Now that we have a way to calculate self-attention, let's actually generate the input *queries*, *keys*, and *values* for multiple heads.
<br><br>
In the slides (and in most references), each attention head had its <u>own separate</u> set of *query*, *key*, and *value* weights. Each weight matrix was of dimension $d \times d/h$, where $h$ was the number of heads.

<img src="https://github.com/cheroualiyakoub/Attention-is-all-you-need/blob/main/1.%20Attention%20Is%20All%20You%20Need%20Paper/pic/multi_head_self_attention.png?raw=1" alt="Multi-head self-attention diagram" />

It's easier to understand things this way and we can certainly code it this way as well. But we can also "simulate" different heads with a single query matrix, single key matrix, and single value matrix.
<br><br>
We'll do both. First we'll create *query*, *key*, and *value* vectors using separate weights per head.
<br><br>
In the slides, we used an example of 12-dimensional embeddings processed by three attention heads, and we'll do the same here.

In [38]:
batch_size = 1
seq_len = 3
embed_dim = 12
num_heads = 3
head_dim = embed_dim // num_heads

print(f"Dimension of each head: {head_dim}")

Dimension of each head: 4
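
The integer division above only yields a clean split when the embedding size is divisible by the number of heads. A small guard like the following makes that assumption explicit; the function name `compute_head_dim` is a hypothetical helper, not something the notebook defines.

```python
def compute_head_dim(embed_dim, num_heads):
    # Each head receives an equal slice of the embedding, so the split must be exact.
    if embed_dim % num_heads != 0:
        raise ValueError(
            f"embed_dim ({embed_dim}) must be divisible by num_heads ({num_heads})"
        )
    return embed_dim // num_heads

print(compute_head_dim(12, 3))
```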


**Using separate weight matrices per head**

Suppose these are our input embeddings. Here we have a batch of 1 containing a sequence of length 3, with each element being a 12-dimensional embedding.

In [39]:
x = np.random.rand(batch_size, seq_len, embed_dim).round(1)
print("Input shape: ", x.shape, "\n")
print("Input:\n", x)

Input shape:  (1, 3, 12) 

Input:
 [[[0.5 0.3 0.1 0.  0.  0.8 0.3 0.3 0.4 0.9 1.  0.4]
  [0.9 0.3 0.3 0.6 0.6 0.9 0.8 0.4 0.7 0.2 0.6 0.9]
  [0.8 0.1 0.7 0.1 0.9 0.7 0.5 0.1 0.6 0.3 0.9 0.4]]]


We'll declare three sets of *query* weights (one for each head), three sets of *key* weights, and three sets of *value* weights. Remember each weight matrix should have a dimension of $d \times d/h$.

In [40]:
# The query weights for each head.
wq0 = np.random.rand(embed_dim, head_dim).round(1)
wq1 = np.random.rand(embed_dim, head_dim).round(1)
wq2 = np.random.rand(embed_dim, head_dim).round(1)

# The key weights for each head.
wk0 = np.random.rand(embed_dim, head_dim).round(1)
wk1 = np.random.rand(embed_dim, head_dim).round(1)
wk2 = np.random.rand(embed_dim, head_dim).round(1)

# The value weights for each head.
wv0 = np.random.rand(embed_dim, head_dim).round(1)
wv1 = np.random.rand(embed_dim, head_dim).round(1)
wv2 = np.random.rand(embed_dim, head_dim).round(1)

In [41]:
print("The three sets of query weights (one for each head):")
print("wq0:\n", wq0)
print("wq1:\n", wq1)
print("wq2:\n", wq2)

The three sets of query weights (one for each head):
wq0:
 [[0.4 0.2 0.4 0.3]
 [0.8 0.7 0.1 0.1]
 [0.3 0.  0.4 0.3]
 [0.6 0.8 0.7 0.9]
 [0.8 0.4 0.1 0.4]
 [0.2 0.7 0.1 0.8]
 [0.2 0.1 0.3 0.7]
 [0.9 0.4 0.2 0.9]
 [0.9 0.6 0.8 0.1]
 [0.8 0.9 0.1 0.5]
 [0.2 0.8 0.1 0.8]
 [0.8 0.7 0.8 0.5]]
wq1:
 [[0.7 0.8 0.9 0.8]
 [0.9 0.4 0.6 0.6]
 [0.1 0.6 0.4 0.8]
 [1.  0.6 0.5 0.1]
 [0.1 0.5 0.1 0.8]
 [0.2 0.2 0.5 0.6]
 [0.2 0.3 0.1 0.4]
 [0.6 0.8 0.6 0.8]
 [0.5 0.2 0.2 0.1]
 [0.8 0.1 0.8 0.9]
 [0.8 0.7 0.1 1. ]
 [0.4 0.1 0.3 0.9]]
wq2:
 [[0.9 0.8 0.2 0.6]
 [0.4 0.5 0.9 0.5]
 [0.4 0.9 0.6 0.9]
 [0.2 0.3 0.9 0.9]
 [0.3 0.5 0.8 0.3]
 [0.8 0.9 0.8 0.9]
 [0.2 0.3 0.1 0.2]
 [0.6 0.2 0.2 0.7]
 [0.2 0.6 0.2 0.1]
 [0.7 0.9 0.7 0.4]
 [0.1 0.9 0.2 0.6]
 [1.  0.6 0.6 0.5]]


We'll generate our *queries*, *keys*, and *values* for each head by multiplying our input by the weights.

In [42]:
# Generated queries, keys, and values for the first head.
q0 = np.dot(x, wq0)
k0 = np.dot(x, wk0)
v0 = np.dot(x, wv0)

# Generated queries, keys, and values for the second head.
q1 = np.dot(x, wq1)
k1 = np.dot(x, wk1)
v1 = np.dot(x, wv1)

# Generated queries, keys, and values for the third head.
q2 = np.dot(x, wq2)
k2 = np.dot(x, wk2)
v2 = np.dot(x, wv2)

These are the resulting *query*, *key*, and *value* vectors for the first head.

In [43]:
print("Q, K, and V for first head:\n")

print(f"q0 {q0.shape}:\n", q0, "\n")
print(f"k0 {k0.shape}:\n", k0, "\n")
print(f"v0 {v0.shape}:\n", v0)

Q, K, and V for first head:

q0 (1, 3, 4):
 [[[2.56 3.15 1.33 2.82]
  [3.86 3.69 2.76 3.91]
  [3.   2.88 1.93 3.04]]] 

k0 (1, 3, 4):
 [[[3.17 2.36 2.56 2.43]
  [4.21 2.91 3.55 3.32]
  [3.57 2.25 3.13 2.72]]] 

v0 (1, 3, 4):
 [[[2.82 2.52 2.7  2.05]
  [3.62 3.01 3.49 3.44]
  [3.29 2.42 3.1  3.07]]]


Now that we have our Q, K, V vectors, we can just pass them to our self-attention operation. Here we're calculating the output and attention weights for the first head.

In [44]:
out0, attn_weights0 = scaled_dot_product_attention(q0, k0, v0)

print("Output from first attention head: ", out0, "\n")
print("Attention weights from first head: ", attn_weights0)

Output from first attention head:  tf.Tensor(
[[[3.5920599 2.9744961 3.459369  3.401097 ]
  [3.613446  3.0002985 3.4825878 3.4315946]
  [3.600311  2.983282  3.4681337 3.4134767]]], shape=(1, 3, 4), dtype=float32) 

Attention weights from first head:  tf.Tensor(
[[[0.01536692 0.93721914 0.04741397]
  [0.00214407 0.9831933  0.01466269]
  [0.00902194 0.9531857  0.03779246]]], shape=(1, 3, 3), dtype=float32)


Here are the other two (attention weights are ignored).

In [45]:
out1, _ = scaled_dot_product_attention(q1, k1, v1)
out2, _ = scaled_dot_product_attention(q2, k2, v2)

print("Output from second attention head: ", out1, "\n")
print("Output from third attention head: ", out2)

Output from second attention head:  tf.Tensor(
[[[3.1222017 3.6094034 3.178613  2.502714 ]
  [3.1327453 3.6216776 3.1971664 2.5129533]
  [3.127195  3.615222  3.1873956 2.5075629]]], shape=(1, 3, 4), dtype=float32) 

Output from third attention head:  tf.Tensor(
[[[2.8246992 2.1105528 3.5810847 3.5944724]
  [2.828389  2.11703   3.5872064 3.5983238]
  [2.8269708 2.114478  3.5848002 3.5968463]]], shape=(1, 3, 4), dtype=float32)


As we covered in the slides, once we have each head's output, we concatenate them and then put them through a linear layer for further processing.

In [46]:
combined_out_a = np.concatenate((out0, out1, out2), axis=-1)
print(f"Combined output from all heads {combined_out_a.shape}:")
print(combined_out_a)

# The final step would be to run combined_out_a through a linear/dense layer
# for further processing.

Combined output from all heads (1, 3, 12):
[[[3.5920599 2.9744961 3.459369  3.401097  3.1222017 3.6094034 3.178613
   2.502714  2.8246992 2.1105528 3.5810847 3.5944724]
  [3.613446  3.0002985 3.4825878 3.4315946 3.1327453 3.6216776 3.1971664
   2.5129533 2.828389  2.11703   3.5872064 3.5983238]
  [3.600311  2.983282  3.4681337 3.4134767 3.127195  3.615222  3.1873956
   2.5075629 2.8269708 2.114478  3.5848002 3.5968463]]]
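
The linear step mentioned in the comment above can be sketched in NumPy as a single matrix multiplication. The weight matrix `w_o`, the bias `b_o`, and the random stand-in for the concatenated head outputs are assumptions for illustration; in a trained model these parameters are learned.

```python
import numpy as np

rng = np.random.default_rng(0)
batch_size, seq_len, embed_dim = 1, 3, 12

# Stand-in for the concatenated head outputs (same shape as combined_out_a).
combined = rng.random((batch_size, seq_len, embed_dim))

# Hypothetical output projection: a d x d weight matrix plus a bias.
w_o = rng.random((embed_dim, embed_dim))
b_o = np.zeros(embed_dim)

# The same projection is applied to every position in the sequence.
projected = combined @ w_o + b_o
print(projected.shape)
```

The projection keeps the shape at (batch_size, seq_len, embed_dim) while letting information from different heads mix within each position.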


So that's a complete run of **multi-head self-attention** using separate sets of weights per head.<br>

Let's now get the same thing done using a single query weight matrix, single key weight matrix, and single value weight matrix.<br><br>
These were our separate per-head query weights:

In [47]:
print("Query weights for first head: \n", wq0, "\n")
print("Query weights for second head: \n", wq1, "\n")
print("Query weights for third head: \n", wq2)

Query weights for first head: 
 [[0.4 0.2 0.4 0.3]
 [0.8 0.7 0.1 0.1]
 [0.3 0.  0.4 0.3]
 [0.6 0.8 0.7 0.9]
 [0.8 0.4 0.1 0.4]
 [0.2 0.7 0.1 0.8]
 [0.2 0.1 0.3 0.7]
 [0.9 0.4 0.2 0.9]
 [0.9 0.6 0.8 0.1]
 [0.8 0.9 0.1 0.5]
 [0.2 0.8 0.1 0.8]
 [0.8 0.7 0.8 0.5]] 

Query weights for second head: 
 [[0.7 0.8 0.9 0.8]
 [0.9 0.4 0.6 0.6]
 [0.1 0.6 0.4 0.8]
 [1.  0.6 0.5 0.1]
 [0.1 0.5 0.1 0.8]
 [0.2 0.2 0.5 0.6]
 [0.2 0.3 0.1 0.4]
 [0.6 0.8 0.6 0.8]
 [0.5 0.2 0.2 0.1]
 [0.8 0.1 0.8 0.9]
 [0.8 0.7 0.1 1. ]
 [0.4 0.1 0.3 0.9]] 

Query weights for third head: 
 [[0.9 0.8 0.2 0.6]
 [0.4 0.5 0.9 0.5]
 [0.4 0.9 0.6 0.9]
 [0.2 0.3 0.9 0.9]
 [0.3 0.5 0.8 0.3]
 [0.8 0.9 0.8 0.9]
 [0.2 0.3 0.1 0.2]
 [0.6 0.2 0.2 0.7]
 [0.2 0.6 0.2 0.1]
 [0.7 0.9 0.7 0.4]
 [0.1 0.9 0.2 0.6]
 [1.  0.6 0.6 0.5]]


Suppose that instead of declaring three separate query weight matrices, we had declared just one: a single $d \times d$ matrix. Here we concatenate our per-head query weights rather than declaring a new set of weights, so that we get the same results.

In [48]:
wq = np.concatenate((wq0, wq1, wq2), axis=1)
print(f"Single query weight matrix {wq.shape}: \n", wq)

Single query weight matrix (12, 12): 
 [[0.4 0.2 0.4 0.3 0.7 0.8 0.9 0.8 0.9 0.8 0.2 0.6]
 [0.8 0.7 0.1 0.1 0.9 0.4 0.6 0.6 0.4 0.5 0.9 0.5]
 [0.3 0.  0.4 0.3 0.1 0.6 0.4 0.8 0.4 0.9 0.6 0.9]
 [0.6 0.8 0.7 0.9 1.  0.6 0.5 0.1 0.2 0.3 0.9 0.9]
 [0.8 0.4 0.1 0.4 0.1 0.5 0.1 0.8 0.3 0.5 0.8 0.3]
 [0.2 0.7 0.1 0.8 0.2 0.2 0.5 0.6 0.8 0.9 0.8 0.9]
 [0.2 0.1 0.3 0.7 0.2 0.3 0.1 0.4 0.2 0.3 0.1 0.2]
 [0.9 0.4 0.2 0.9 0.6 0.8 0.6 0.8 0.6 0.2 0.2 0.7]
 [0.9 0.6 0.8 0.1 0.5 0.2 0.2 0.1 0.2 0.6 0.2 0.1]
 [0.8 0.9 0.1 0.5 0.8 0.1 0.8 0.9 0.7 0.9 0.7 0.4]
 [0.2 0.8 0.1 0.8 0.8 0.7 0.1 1.  0.1 0.9 0.2 0.6]
 [0.8 0.7 0.8 0.5 0.4 0.1 0.3 0.9 1.  0.6 0.6 0.5]]


In the same vein, pretend we declared a single key weight matrix, and single value weight matrix.

In [49]:
wk = np.concatenate((wk0, wk1, wk2), axis=1)
wv = np.concatenate((wv0, wv1, wv2), axis=1)

print(f"Single key weight matrix {wk.shape}:\n", wk, "\n")
print(f"Single value weight matrix {wv.shape}:\n", wv)

Single key weight matrix (12, 12):
 [[0.2 0.1 0.4 0.1 0.7 0.7 0.5 0.5 0.9 0.5 0.8 0.1]
 [0.3 0.5 0.  0.8 0.1 0.4 0.6 0.1 0.9 0.3 0.7 0.5]
 [0.4 0.1 0.4 0.7 0.9 0.3 0.3 0.9 0.3 0.1 0.7 0.9]
 [0.1 0.9 0.8 0.7 0.  0.6 0.7 0.8 0.7 1.  0.1 0. ]
 [0.7 0.4 0.9 0.4 0.6 0.4 0.6 0.3 0.2 0.6 0.2 0.8]
 [0.9 0.6 0.6 0.6 0.3 0.7 0.7 0.2 0.8 0.3 0.7 0.8]
 [0.3 0.2 0.2 0.6 0.7 0.8 0.2 0.8 0.8 0.6 0.5 0.9]
 [0.8 0.6 0.4 1.  0.1 0.2 0.9 0.1 0.8 0.2 0.9 0.8]
 [0.9 0.8 0.3 0.3 0.7 0.3 0.5 0.3 0.6 0.5 0.8 0.6]
 [0.7 0.9 1.  0.5 0.3 0.3 0.1 0.7 0.8 0.  0.4 0.9]
 [0.5 0.3 0.4 0.5 0.2 0.6 0.1 0.4 0.1 0.1 0.7 1. ]
 [1.  0.  0.6 0.1 1.  0.5 0.3 0.5 0.  1.  0.3 0. ]] 

Single value weight matrix (12, 12):
 [[0.4 0.5 0.6 1.  0.7 0.1 0.2 0.2 0.9 0.4 0.5 0.9]
 [0.1 0.2 0.1 0.8 0.9 0.9 1.  0.7 0.  0.6 0.2 0.2]
 [0.3 0.1 0.2 0.6 0.6 0.9 0.4 0.9 1.  0.3 0.1 0.4]
 [0.2 0.1 0.3 0.2 0.6 0.4 0.8 0.1 0.3 0.7 0.2 0. ]
 [0.6 0.2 0.6 0.8 0.4 0.8 0.8 0.1 0.  0.  0.4 0.1]
 [0.8 0.9 0.3 0.3 0.4 0.8 0.  0.1 0.8 0.1 0.4 0.7]
 [0.6

Now we can calculate all our *queries*, *keys*, and *values* with three dot products.

In [50]:
q_s = np.dot(x, wq)
k_s = np.dot(x, wk)
v_s = np.dot(x, wv)

These are our resulting query vectors (we'll call them "combined queries"). How do we simulate different heads with this?

In [51]:
print(f"Query vectors using a single weight matrix {q_s.shape}:\n", q_s)

Query vectors using a single weight matrix (1, 3, 12):
 [[[2.56 3.15 1.33 2.82 2.91 1.98 2.3  3.71 2.7  3.7  2.31 2.73]
  [3.86 3.69 2.76 3.91 3.52 3.09 2.87 4.52 3.71 4.43 3.47 3.89]
  [3.   2.88 1.93 3.04 2.63 2.8  2.23 4.28 2.87 4.28 2.86 3.24]]]


Somehow, we need to separate these vectors such that they're treated like three separate sets by the self-attention operation.

In [52]:
print(q0, "\n")
print(q1, "\n")
print(q2)

[[[2.56 3.15 1.33 2.82]
  [3.86 3.69 2.76 3.91]
  [3.   2.88 1.93 3.04]]] 

[[[2.91 1.98 2.3  3.71]
  [3.52 3.09 2.87 4.52]
  [2.63 2.8  2.23 4.28]]] 

[[[2.7  3.7  2.31 2.73]
  [3.71 4.43 3.47 3.89]
  [2.87 4.28 2.86 3.24]]]


Notice how each set of per-head queries looks like we took the combined queries, and chopped them vertically every four dimensions.
<br><br>
We can split our combined queries into $h$ heads, each of dimension $d/h$, using **reshape** and **transpose**.<br><br>
The first step is to *reshape* our combined queries from a shape of:<br>
(batch_size, seq_len, embed_dim)<br>

into a shape of<br>
 (batch_size, seq_len, num_heads, head_dim).
 <br>

 https://www.tensorflow.org/api_docs/python/tf/reshape

In [53]:
# Note: we can achieve the same thing by passing -1 instead of seq_len.
q_s_reshaped = tf.reshape(q_s, (batch_size, seq_len, num_heads, head_dim))
print(f"Combined queries: {q_s.shape}\n", q_s, "\n")
print(f"Reshaped into separate heads: {q_s_reshaped.shape}\n", q_s_reshaped)

Combined queries: (1, 3, 12)
 [[[2.56 3.15 1.33 2.82 2.91 1.98 2.3  3.71 2.7  3.7  2.31 2.73]
  [3.86 3.69 2.76 3.91 3.52 3.09 2.87 4.52 3.71 4.43 3.47 3.89]
  [3.   2.88 1.93 3.04 2.63 2.8  2.23 4.28 2.87 4.28 2.86 3.24]]] 

Reshaped into separate heads: (1, 3, 3, 4)
 tf.Tensor(
[[[[2.56 3.15 1.33 2.82]
   [2.91 1.98 2.3  3.71]
   [2.7  3.7  2.31 2.73]]

  [[3.86 3.69 2.76 3.91]
   [3.52 3.09 2.87 4.52]
   [3.71 4.43 3.47 3.89]]

  [[3.   2.88 1.93 3.04]
   [2.63 2.8  2.23 4.28]
   [2.87 4.28 2.86 3.24]]]], shape=(1, 3, 3, 4), dtype=float64)


At this point, we have our desired shape. The next step is to *transpose* it so that it simulates vertically chopping our combined queries. By transposing, our matrix dimensions become:<br>
(batch_size, num_heads, seq_len, head_dim)<br>

https://www.tensorflow.org/api_docs/python/tf/transpose

In [54]:
q_s_transposed = tf.transpose(q_s_reshaped, perm=[0, 2, 1, 3]).numpy()
print(f"Queries transposed into \"separate\" heads {q_s_transposed.shape}:\n",
      q_s_transposed)

Queries transposed into "separate" heads (1, 3, 3, 4):
 [[[[2.56 3.15 1.33 2.82]
   [3.86 3.69 2.76 3.91]
   [3.   2.88 1.93 3.04]]

  [[2.91 1.98 2.3  3.71]
   [3.52 3.09 2.87 4.52]
   [2.63 2.8  2.23 4.28]]

  [[2.7  3.7  2.31 2.73]
   [3.71 4.43 3.47 3.89]
   [2.87 4.28 2.86 3.24]]]]


If we compare this against the separate per-head queries we calculated previously, we see the same result except we now have all our queries in a single matrix.

In [55]:
print("The separate per-head query matrices from before: ")
print(q0, "\n")
print(q1, "\n")
print(q2)

The separate per-head query matrices from before: 
[[[2.56 3.15 1.33 2.82]
  [3.86 3.69 2.76 3.91]
  [3.   2.88 1.93 3.04]]] 

[[[2.91 1.98 2.3  3.71]
  [3.52 3.09 2.87 4.52]
  [2.63 2.8  2.23 4.28]]] 

[[[2.7  3.7  2.31 2.73]
  [3.71 4.43 3.47 3.89]
  [2.87 4.28 2.86 3.24]]]
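
Rather than comparing the printed matrices by eye, the equivalence can also be checked programmatically. The following is a minimal NumPy sketch with freshly generated random inputs (the variable names are illustrative and do not reuse the notebook's values); it compares per-head projections against a single combined projection followed by reshape and transpose.

```python
import numpy as np

rng = np.random.default_rng(0)
batch_size, seq_len, embed_dim, num_heads = 1, 3, 12, 3
head_dim = embed_dim // num_heads

x = rng.random((batch_size, seq_len, embed_dim))
per_head_weights = [rng.random((embed_dim, head_dim)) for _ in range(num_heads)]

# Separate projection per head, stacked along a new "heads" axis.
per_head_queries = np.stack([x @ w for w in per_head_weights], axis=1)

# Single projection with the concatenated weights, then reshape and transpose.
w_combined = np.concatenate(per_head_weights, axis=1)
combined = x @ w_combined
split = combined.reshape(batch_size, seq_len, num_heads, head_dim)
split = split.transpose(0, 2, 1, 3)

print(np.allclose(per_head_queries, split))
```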


Let's do the exact same thing with our combined keys and values.

In [56]:
k_s_transposed = tf.transpose(tf.reshape(k_s, (batch_size, -1, num_heads, head_dim)), perm=[0, 2, 1, 3]).numpy()
v_s_transposed = tf.transpose(tf.reshape(v_s, (batch_size, -1, num_heads, head_dim)), perm=[0, 2, 1, 3]).numpy()

print(f"Keys for all heads in a single matrix {k_s_transposed.shape}: \n", k_s_transposed, "\n")
print(f"Values for all heads in a single matrix {v_s_transposed.shape}: \n", v_s_transposed)

Keys for all heads in a single matrix (1, 3, 3, 4): 
 [[[[3.17 2.36 2.56 2.43]
   [4.21 2.91 3.55 3.32]
   [3.57 2.25 3.13 2.72]]

  [[2.1  2.55 1.86 2.15]
   [3.73 3.87 3.35 3.31]
   [3.4  3.15 2.5  2.89]]

  [[2.93 1.53 3.16 3.49]
   [4.03 3.67 4.04 3.95]
   [3.   2.46 3.57 4.1 ]]]] 

Values for all heads in a single matrix (1, 3, 3, 4): 
 [[[[2.82 2.52 2.7  2.05]
   [3.62 3.01 3.49 3.44]
   [3.29 2.42 3.1  3.07]]

  [[2.36 2.23 2.11 1.77]
   [3.14 3.63 3.21 2.52]
   [2.79 3.23 2.59 2.18]]

  [[2.25 1.69 3.13 2.97]
   [2.83 2.12 3.59 3.6 ]
   [2.57 1.63 3.13 3.33]]]]


Set up this way, we can now calculate the outputs from all attention heads with a single call to our self-attention operation.

In [57]:
all_heads_output, all_attn_weights = scaled_dot_product_attention(q_s_transposed,
                                                                  k_s_transposed,
                                                                  v_s_transposed)
print("Self attention output:\n", all_heads_output)

Self attention output:
 tf.Tensor(
[[[[3.5920599 2.9744961 3.459369  3.401097 ]
   [3.613446  3.0002985 3.4825878 3.4315946]
   [3.600311  2.983282  3.4681337 3.4134767]]

  [[3.1222017 3.6094034 3.178613  2.502714 ]
   [3.1327453 3.6216776 3.1971664 2.5129533]
   [3.127195  3.615222  3.1873956 2.5075629]]

  [[2.8246992 2.1105528 3.5810847 3.5944724]
   [2.828389  2.11703   3.5872064 3.5983238]
   [2.8269708 2.114478  3.5848002 3.5968463]]]], shape=(1, 3, 3, 4), dtype=float32)


As a sanity check, we can compare this against the outputs from individual heads we calculated earlier:

In [58]:
print("Per head outputs from using separate sets of weights per head:")
print(out0, "\n")
print(out1, "\n")
print(out2)

Per head outputs from using separate sets of weights per head:
tf.Tensor(
[[[3.5920599 2.9744961 3.459369  3.401097 ]
  [3.613446  3.0002985 3.4825878 3.4315946]
  [3.600311  2.983282  3.4681337 3.4134767]]], shape=(1, 3, 4), dtype=float32) 

tf.Tensor(
[[[3.1222017 3.6094034 3.178613  2.502714 ]
  [3.1327453 3.6216776 3.1971664 2.5129533]
  [3.127195  3.615222  3.1873956 2.5075629]]], shape=(1, 3, 4), dtype=float32) 

tf.Tensor(
[[[2.8246992 2.1105528 3.5810847 3.5944724]
  [2.828389  2.11703   3.5872064 3.5983238]
  [2.8269708 2.114478  3.5848002 3.5968463]]], shape=(1, 3, 4), dtype=float32)


To get the final concatenated result, we need to reverse our **reshape** and **transpose** operation, starting with the **transpose** this time.

In [59]:
combined_out_b = tf.reshape(tf.transpose(all_heads_output, perm=[0, 2, 1, 3]),
                            shape=(batch_size, seq_len, embed_dim))
print("Final output from using single query, key, value matrices:\n",
      combined_out_b, "\n")
print("Final output from using separate query, key, value matrices per head:\n",
      combined_out_a)

Final output from using single query, key, value matrices:
 tf.Tensor(
[[[3.5920599 2.9744961 3.459369  3.401097  3.1222017 3.6094034 3.178613
   2.502714  2.8246992 2.1105528 3.5810847 3.5944724]
  [3.613446  3.0002985 3.4825878 3.4315946 3.1327453 3.6216776 3.1971664
   2.5129533 2.828389  2.11703   3.5872064 3.5983238]
  [3.600311  2.983282  3.4681337 3.4134767 3.127195  3.615222  3.1873956
   2.5075629 2.8269708 2.114478  3.5848002 3.5968463]]], shape=(1, 3, 12), dtype=float32) 

Final output from using separate query, key, value matrices per head:
 [[[3.5920599 2.9744961 3.459369  3.401097  3.1222017 3.6094034 3.178613
   2.502714  2.8246992 2.1105528 3.5810847 3.5944724]
  [3.613446  3.0002985 3.4825878 3.4315946 3.1327453 3.6216776 3.1971664
   2.5129533 2.828389  2.11703   3.5872064 3.5983238]
  [3.600311  2.983282  3.4681337 3.4134767 3.127195  3.615222  3.1873956
   2.5075629 2.8269708 2.114478  3.5848002 3.5968463]]]
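
The order of the two reversing steps matters. This NumPy sketch (the helper names `split_heads_np` and `merge_heads_np` are hypothetical) shows that transposing back before reshaping restores the original layout, while reshaping directly does not.

```python
import numpy as np

def split_heads_np(x, num_heads):
    # (batch, seq, d_model) -> (batch, heads, seq, head_dim)
    batch_size, seq_len, d_model = x.shape
    x = x.reshape(batch_size, seq_len, num_heads, d_model // num_heads)
    return x.transpose(0, 2, 1, 3)

def merge_heads_np(x):
    # (batch, heads, seq, head_dim) -> (batch, seq, d_model)
    batch_size, num_heads, seq_len, head_dim = x.shape
    x = x.transpose(0, 2, 1, 3)
    return x.reshape(batch_size, seq_len, num_heads * head_dim)

original = np.arange(1 * 3 * 12, dtype=float).reshape(1, 3, 12)
heads = split_heads_np(original, num_heads=3)
restored = merge_heads_np(heads)

# Reshaping without undoing the transpose first mixes values from different positions.
wrong = heads.reshape(1, 3, 12)
print(np.array_equal(restored, original), np.array_equal(wrong, original))
```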


We can encapsulate everything we just covered in a class.

In [60]:
class MultiHeadSelfAttention(tf.keras.layers.Layer):
  def __init__(self, d_model, num_heads):
    super(MultiHeadSelfAttention, self).__init__()
    self.d_model = d_model
    self.num_heads = num_heads

    self.d_head = self.d_model // self.num_heads

    self.wq = tf.keras.layers.Dense(self.d_model)
    self.wk = tf.keras.layers.Dense(self.d_model)
    self.wv = tf.keras.layers.Dense(self.d_model)

    # Linear layer to generate the final output.
    self.dense = tf.keras.layers.Dense(self.d_model)

  def split_heads(self, x):
    # Use the dynamic shape so this also works when the batch size is unknown at trace time.
    batch_size = tf.shape(x)[0]

    split_inputs = tf.reshape(x, (batch_size, -1, self.num_heads, self.d_head))
    return tf.transpose(split_inputs, perm=[0, 2, 1, 3])

  def merge_heads(self, x):
    # Use the dynamic shape so this also works when the batch size is unknown at trace time.
    batch_size = tf.shape(x)[0]

    merged_inputs = tf.transpose(x, perm=[0, 2, 1, 3])
    return tf.reshape(merged_inputs, (batch_size, -1, self.d_model))

  def call(self, q, k, v, mask):
    qs = self.wq(q)
    ks = self.wk(k)
    vs = self.wv(v)

    qs = self.split_heads(qs)
    ks = self.split_heads(ks)
    vs = self.split_heads(vs)

    output, attn_weights = scaled_dot_product_attention(qs, ks, vs, mask)
    output = self.merge_heads(output)

    return self.dense(output), attn_weights


In [61]:
mhsa = MultiHeadSelfAttention(12, 3)

output, attn_weights = mhsa(x, x, x, None)
print(f"MHSA output{output.shape}:")
print(output)

MHSA output(1, 3, 12):
tf.Tensor(
[[[-0.15136674  0.47280192 -0.1231066  -0.6842792  -0.9570793
   -0.03200206  0.03029087 -0.7708946   0.69708985  0.11875576
    0.8421662  -0.7319753 ]
  [-0.1577799   0.46053767 -0.13005814 -0.6750201  -0.9376211
   -0.04589014  0.01804432 -0.7830092   0.6967263   0.12486982
    0.83996165 -0.72131324]
  [-0.14494632  0.472653   -0.13980737 -0.6777024  -0.9674089
   -0.03821798  0.01831874 -0.7672671   0.69209206  0.1276465
    0.8365545  -0.7257753 ]]], shape=(1, 3, 12), dtype=float32)
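
As a rough size check, the layer's trainable parameters can be counted by hand. This sketch assumes each of the four `Dense(d_model)` layers keeps its default bias term; the function name `mhsa_param_count` is hypothetical.

```python
def mhsa_param_count(d_model):
    # Four Dense(d_model) layers: query, key, value, and the output projection.
    # Each has a d_model x d_model kernel and a d_model bias.
    per_dense = d_model * d_model + d_model
    return 4 * per_dense

print(mhsa_param_count(12))
```

In this implementation the count does not depend on the number of heads, since the heads are slices of the same projections.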


## Encoder Block

We can now build our **Encoder Block**. In addition to the **Multi-Head Self Attention** layer, the **Encoder Block** also has **skip connections**, **layer normalization steps**, and a **two-layer feed-forward neural network**. The original **Attention Is All You Need** paper also included some **dropout** applied to the self-attention output which isn't shown in the illustration below (see references for a link to the paper).

<div>
<img src="https://github.com/cheroualiyakoub/Attention-is-all-you-need/blob/main/1.%20Attention%20Is%20All%20You%20Need%20Paper/pic/encoder_block.png?raw=1" width="500"/>
</div>

Since a two-layer feed forward neural network is used in multiple places in the transformer, here's a function which creates and returns one.

In [101]:
def feed_forward_network(d_model, hidden_dim):
  return tf.keras.Sequential([
      tf.keras.layers.Dense(hidden_dim, activation='relu'),
      tf.keras.layers.Dense(d_model)
  ])
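
Because a dense layer acts on the last axis only, this network treats every position in the sequence independently with the same weights. The NumPy sketch below illustrates that property; the parameters `w1`, `b1`, `w2`, `b2` are random stand-ins for the two dense layers, not values from the notebook.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, hidden_dim, seq_len = 12, 48, 3

# Hypothetical parameters standing in for the two Dense layers.
w1, b1 = rng.normal(size=(d_model, hidden_dim)), np.zeros(hidden_dim)
w2, b2 = rng.normal(size=(hidden_dim, d_model)), np.zeros(d_model)

def ffn_np(x):
    hidden = np.maximum(x @ w1 + b1, 0.0)  # Dense + ReLU
    return hidden @ w2 + b2                # Dense (no activation)

x = rng.random((1, seq_len, d_model))
whole_sequence = ffn_np(x)
one_position_at_a_time = np.stack(
    [ffn_np(x[:, i]) for i in range(seq_len)], axis=1
)
print(np.allclose(whole_sequence, one_position_at_a_time))
```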

This is our encoder block containing all the layers and steps from the preceding illustration (plus dropout).

In [102]:
class EncoderBlock(tf.keras.layers.Layer):
  def __init__(self, d_model, num_heads, hidden_dim, dropout_rate=0.1):
    super(EncoderBlock, self).__init__()

    self.mhsa = MultiHeadSelfAttention(d_model, num_heads)
    self.ffn = feed_forward_network(d_model, hidden_dim)

    self.dropout1 = tf.keras.layers.Dropout(dropout_rate)
    self.dropout2 = tf.keras.layers.Dropout(dropout_rate)

    self.layernorm1 = tf.keras.layers.LayerNormalization()
    self.layernorm2 = tf.keras.layers.LayerNormalization()

  def call(self, x, training, mask):
    mhsa_output, attn_weights = self.mhsa(x, x, x, mask=mask)
    mhsa_output = self.dropout1(mhsa_output, training=training)
    mhsa_output = self.layernorm1(x + mhsa_output)

    ffn_output = self.ffn(mhsa_output)
    ffn_output = self.dropout2(ffn_output, training=training)
    output = self.layernorm2(mhsa_output + ffn_output)

    return output, attn_weights


Suppose we have an embedding dimension of 12, and we want 3 attention heads and a feed forward network with a hidden dimension of 48 (4x the embedding dimension). We would declare and use a single encoder block like so:

In [103]:
encoder_block = EncoderBlock(12, 3, 48)

block_output,  _ = encoder_block(x, training=True, mask=None)
print(f"Output from single encoder block {block_output.shape}:")
print(block_output)

Output from single encoder block (1, 3, 12):
tf.Tensor(
[[[ 1.1176243  -0.0868856  -0.6481668   0.5764882  -0.4230999
    0.19738159 -0.7557271   0.72194445 -1.9913605   2.0848377
   -0.5503276  -0.24270874]
  [ 1.0540625   0.9510841  -0.9140223   1.0893008   0.39611056
    0.30703992 -1.2452945   0.7734767  -1.7457588   0.5896942
   -1.4558834   0.20018998]
  [ 1.5350673  -0.5104299  -0.15841521  0.33517894  0.9873435
   -0.15205733 -0.68545043  0.8951303  -2.0789797   1.2626112
   -0.8403023  -0.5896963 ]]], shape=(1, 3, 12), dtype=float32)
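
The values above are roughly centred on zero because layer normalization is the last step in the block. Here is a minimal NumPy sketch of a skip connection followed by layer normalization; the learned scale and shift are omitted and the `epsilon` value is an assumption, so this is an approximation of what the Keras layer computes rather than a replacement for it.

```python
import numpy as np

def layer_norm_np(x, epsilon=1e-3):
    # Normalize each position's feature vector to zero mean and unit variance.
    # Learned scale and shift are omitted (equivalent to gamma=1, beta=0).
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + epsilon)

rng = np.random.default_rng(0)
x = rng.random((1, 3, 12))
sublayer_out = rng.normal(size=(1, 3, 12))  # stand-in for attention or FFN output

# Skip connection followed by layer normalization.
normed = layer_norm_np(x + sublayer_out)
print(normed.mean(axis=-1).round(6))
```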


## Word and Positional Embeddings

Let's now deal with the actual input to the **initial** encoder block. The inputs are going to be *positional word embeddings*. That is, word embeddings with some positional information added to them.
<br>

Let's start with **subword** tokenization. For demonstration, we'll use a subword tokenizer called **BPEmb**. It uses **Byte-Pair Encoding** and supports over two hundred languages.

https://bpemb.h-its.org/


In [104]:
# Load the English tokenizer.
bpemb_en = BPEmb(lang="en")

The library comes with embeddings for a number of words.

In [105]:
bpemb_vocab_size, bpemb_embed_size = bpemb_en.vectors.shape
print("Vocabulary size:", bpemb_vocab_size)
print("Embedding size:", bpemb_embed_size)

Vocabulary size: 10000
Embedding size: 100


In [106]:
# Embedding for the word "car".
bpemb_en.vectors[bpemb_en.words.index('car')]

array([-0.305548, -0.325598, -0.134716, -0.078735, -0.660545,  0.076211,
       -0.735487,  0.124533, -0.294402,  0.459688,  0.030137,  0.174041,
       -0.224223,  0.486189, -0.504649, -0.459699,  0.315747,  0.477885,
        0.091398,  0.427867,  0.016524, -0.076833, -0.899727,  0.493158,
       -0.022309, -0.422785, -0.154148,  0.204981,  0.379834,  0.070588,
        0.196073, -0.368222,  0.473406,  0.007409,  0.004303, -0.007823,
       -0.19103 , -0.202509,  0.109878, -0.224521, -0.35741 , -0.611633,
        0.329958, -0.212956, -0.497499, -0.393839, -0.130101, -0.216903,
       -0.105595, -0.076007, -0.483942, -0.139704, -0.161647,  0.136985,
        0.415363, -0.360143,  0.038601, -0.078804, -0.030421,  0.324129,
        0.223378, -0.523636, -0.048317, -0.032248, -0.117367,  0.470519,
        0.225816, -0.222065, -0.225007, -0.165904, -0.334389, -0.20157 ,
        0.572352, -0.268794,  0.301929, -0.005563,  0.387491,  0.261031,
       -0.11613 ,  0.074982, -0.008433,  0.259987, 

We don't need the embeddings since we're going to use our own embedding layer. What we're interested in are the subword tokens and their respective ids. The ids will be used as indexes into our embedding layer.<br>

These are the subword tokens for our example sentence from the slides. **BPEmb** places underscores in front of any tokens which are whole words or intended to begin words.<br>

Remember that subword tokenizers are trained using count frequencies over a corpus. So these subword tokens are specific to **BPEmb**. Another subword tokenizer may output something different. This is why it's important that when we use a pretrained model, we make sure to use the pretrained model's tokenizer. We'll see this when we use pretrained transformers later in this module.

In [107]:
sample_sentence = "Where can I find a pizzeria?"
tokens = bpemb_en.encode(sample_sentence)
print(tokens)

['▁where', '▁can', '▁i', '▁find', '▁a', '▁p', 'iz', 'zer', 'ia', '?']


We can retrieve each subword token's respective id using the *encode_ids* method.

In [108]:
token_seq = np.array(bpemb_en.encode_ids("Where can I find a pizzeria?"))
print(token_seq)

[ 571  280  386 1934    4   24  248 4339  177 9967]
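
Using ids as indexes into an embedding layer can be read literally: the lookup amounts to selecting rows from a matrix. The sketch below uses a random NumPy matrix as a stand-in for an untrained embedding table (the initialization range is arbitrary) together with the ids printed above.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embed_dim = 10000, 12

# Stand-in for an untrained embedding matrix: one row per subword id.
embedding_table = rng.uniform(-0.05, 0.05, size=(vocab_size, embed_dim))

# Subword ids for "Where can I find a pizzeria?" as printed above.
token_ids = np.array([571, 280, 386, 1934, 4, 24, 248, 4339, 177, 9967])

# An embedding lookup is just row indexing.
looked_up = embedding_table[token_ids]
print(looked_up.shape)
```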


Now that we have a way to tokenize and vectorize sentences, we can declare and use an embedding layer with the same vocabulary size as **BPEmb** and a desired embedding size.

In [109]:
token_embed = tf.keras.layers.Embedding(bpemb_vocab_size, embed_dim)
token_embeddings = token_embed(token_seq)

# The untrained embeddings for our sample sentence.
print("Embeddings for: ", sample_sentence)
print(token_embeddings)

Embeddings for:  Where can I find a pizzeria?
tf.Tensor(
[[-0.01327326 -0.01017855 -0.02066456  0.02657075 -0.02569367  0.00632077
  -0.04273451 -0.04425365  0.01838433  0.00137855  0.00076794 -0.0020627 ]
 [-0.00814817  0.01944372 -0.00027491 -0.02275245  0.00209203 -0.04899383
  -0.03662769  0.00645707  0.04375986  0.0152767   0.00136011  0.02722396]
 [-0.04608799  0.04308565 -0.03497688 -0.00121663 -0.04987511  0.00034027
   0.0347109   0.01670711  0.00784076  0.02125475  0.02898685  0.04047329]
 [-0.00551502  0.03952031  0.04245657  0.00645719 -0.02366608 -0.02773635
  -0.01461071 -0.02907307 -0.00964286 -0.04066302  0.02770874 -0.03161144]
 [-0.03092362  0.04348184 -0.03477585  0.00761949  0.03696151  0.03384955
  -0.04752473 -0.04417132 -0.03825296 -0.02933741  0.00465336 -0.03130996]
 [ 0.01258698  0.0067548  -0.02318536  0.04668101  0.03536225  0.0319347
   0.03931178  0.00799087 -0.00249125 -0.00519902 -0.01224021 -0.03285204]
 [ 0.02702895  0.02985645 -0.00750583  0.00866572 

Next, we need to add *positional* information to each token embedding. As we covered in the slides, the original paper used fixed sinusoidal functions, but it's more common these days to just use another set of learned embeddings. We'll do the latter here.<br>

Here, we're declaring an embedding layer with rows equalling a maximum sequence length and columns equalling our token embedding size. We then generate a vector of position ids.

In [110]:
max_seq_len = 256
pos_embed = tf.keras.layers.Embedding(max_seq_len, embed_dim)

# Generate ids for each position of the token sequence.
pos_idx = tf.range(len(token_seq))
print(pos_idx)

tf.Tensor([0 1 2 3 4 5 6 7 8 9], shape=(10,), dtype=int32)
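
For comparison, the sinusoidal scheme from the original paper maps these same position ids to fixed, non-learned vectors. The NumPy sketch below assumes an even embedding size; the function name `sinusoidal_positions` is hypothetical and this approach is not used in the rest of the notebook.

```python
import numpy as np

def sinusoidal_positions(seq_len, embed_dim):
    # PE[pos, 2i]   = sin(pos / 10000^(2i / embed_dim))
    # PE[pos, 2i+1] = cos(pos / 10000^(2i / embed_dim))
    positions = np.arange(seq_len)[:, np.newaxis]
    even_dims = np.arange(0, embed_dim, 2)[np.newaxis, :]
    angles = positions / np.power(10000.0, even_dims / embed_dim)

    encoding = np.zeros((seq_len, embed_dim))
    encoding[:, 0::2] = np.sin(angles)
    encoding[:, 1::2] = np.cos(angles)
    return encoding

pos_encoding = sinusoidal_positions(seq_len=10, embed_dim=12)
print(pos_encoding.shape)
```

Position 0 always encodes to alternating zeros and ones, since sin(0) = 0 and cos(0) = 1.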


We'll use these position ids to index into the positional embedding layer.

In [111]:
# These are our position embeddings.
position_embeddings = pos_embed(pos_idx)
print("Position embeddings for the input sequence\n", position_embeddings)

Position embeddings for the input sequence
 tf.Tensor(
[[-0.03963637 -0.04204594  0.02489932  0.0385805   0.00693009  0.03903357
  -0.01129876  0.04942553 -0.00418737  0.02762145 -0.03317753  0.01927267]
 [-0.01136623  0.02414629  0.02556327 -0.04535156  0.00530325  0.03580023
   0.04164941 -0.01535302 -0.04659935 -0.03705139  0.01586578 -0.04142201]
 [ 0.01792503 -0.0346771  -0.04305955  0.02828251  0.00736285  0.04626873
  -0.04120122 -0.0100264   0.00483947  0.01425353 -0.0138227   0.04550341]
 [ 0.01445282 -0.01102936 -0.03085283 -0.0350817  -0.00785683  0.00040257
   0.0344961  -0.01176101  0.00196458  0.03947881 -0.04884348  0.03268308]
 [-0.01659729  0.02240426  0.01816011  0.03862705 -0.04475918  0.00696085
  -0.04524432  0.02315936  0.02162952 -0.0044598   0.04664299 -0.04124485]
 [ 0.03516365  0.02513752  0.03504122  0.0012566  -0.02159182  0.02019883
  -0.0114519  -0.02247277  0.00727206  0.02733849 -0.00463139  0.0395881 ]
 [ 0.02896013  0.01045686 -0.03674661  0.02118373  

The final step is to add our token and position embeddings. The result will be the input to the first encoder block.

In [112]:
input = token_embeddings + position_embeddings
print("Input to the initial encoder block:\n", input)

Input to the initial encoder block:
 tf.Tensor(
[[-0.05290964 -0.05222449  0.00423475  0.06515125 -0.01876358  0.04535434
  -0.05403328  0.00517188  0.01419696  0.029      -0.0324096   0.01720997]
 [-0.0195144   0.04359001  0.02528837 -0.06810401  0.00739528 -0.0131936
   0.00502172 -0.00889596 -0.0028395  -0.02177469  0.01722588 -0.01419805]
 [-0.02816296  0.00840855 -0.07803643  0.02706588 -0.04251225  0.046609
  -0.00649033  0.00668072  0.01268023  0.03550828  0.01516415  0.0859767 ]
 [ 0.0089378   0.02849095  0.01160374 -0.02862451 -0.03152292 -0.02733378
   0.01988539 -0.04083408 -0.00767828 -0.0011842  -0.02113475  0.00107164]
 [-0.04752091  0.0658861  -0.01661574  0.04624654 -0.00779767  0.0408104
  -0.09276906 -0.02101196 -0.01662344 -0.0337972   0.05129635 -0.07255481]
 [ 0.04775063  0.03189232  0.01185586  0.04793761  0.01377043  0.05213353
   0.02785988 -0.0144819   0.00478082  0.02213948 -0.0168716   0.00673606]
 [ 0.05598909  0.04031331 -0.04425244  0.02984945 -0.00516409 
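
For comparison, here's a minimal NumPy sketch of the sinusoidal alternative from the original paper. The function name `sinusoidal_positional_encoding` is our own and nothing else in this notebook uses it; the point is just to show that these encodings are fixed values computed from the position rather than learned weights.

```python
import numpy as np

def sinusoidal_positional_encoding(max_seq_len, d_model):
    """Hypothetical helper: fixed sinusoidal position encodings."""
    positions = np.arange(max_seq_len)[:, np.newaxis]  # (max_seq_len, 1)
    dims = np.arange(d_model)[np.newaxis, :]           # (1, d_model)

    # Dimensions 2i and 2i+1 share the frequency 1 / 10000^(2i / d_model).
    angle_rates = 1.0 / np.power(10000.0, (2 * (dims // 2)) / d_model)
    angles = positions * angle_rates                   # (max_seq_len, d_model)

    encoding = np.zeros((max_seq_len, d_model), dtype=np.float32)
    encoding[:, 0::2] = np.sin(angles[:, 0::2])  # even dimensions use sine
    encoding[:, 1::2] = np.cos(angles[:, 1::2])  # odd dimensions use cosine
    return encoding

# Same shape as the learned position embeddings for our 10-token sequence.
pe = sinusoidal_positional_encoding(max_seq_len=10, d_model=12)
print(pe.shape)
```

Like the learned version, the result would simply be added to the token embeddings.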

## Encoder

Now that we have an encoder block and a way to embed our tokens with position information, we can create the **encoder** itself.<br>

Given a batch of vectorized sequences, the encoder embeds the tokens, adds positional embeddings, runs the result through its encoder blocks, and returns contextualized token representations.

In [126]:
class Encoder(tf.keras.layers.Layer):
  def __init__(self, num_blocks, d_model, num_heads, hidden_dim, src_vocab_size,
               max_seq_len, dropout_rate=0.1):
    super(Encoder, self).__init__()

    self.d_model = d_model
    self.max_seq_len = max_seq_len

    self.token_embed = tf.keras.layers.Embedding(src_vocab_size, self.d_model)
    self.pos_embed = tf.keras.layers.Embedding(max_seq_len, self.d_model)

    # The original Attention Is All You Need paper applied dropout to the
    # input before feeding it to the first encoder block.
    self.dropout = tf.keras.layers.Dropout(dropout_rate)

    # Create encoder blocks.
    self.blocks = [EncoderBlock(self.d_model, num_heads, hidden_dim, dropout_rate)
                   for _ in range(num_blocks)]

  def call(self, input, training=None, mask=None):
    token_embeds = self.token_embed(input)

    # Generate position indices for a batch of input sequences.
    # Note: this assumes every sequence is padded to exactly max_seq_len.
    num_pos = input.shape[0] * self.max_seq_len
    pos_idx = np.resize(np.arange(self.max_seq_len), num_pos)
    pos_idx = np.reshape(pos_idx, input.shape)
    pos_embeds = self.pos_embed(pos_idx)

    x = self.dropout(token_embeds + pos_embeds, training=training)

    # Run input through successive encoder blocks.
    for block in self.blocks:
      x, weights = block(x, training=training, mask=mask)

    return x, weights

If you're wondering about this code block here:


```
num_pos = input.shape[0] * self.max_seq_len
pos_idx = np.resize(np.arange(self.max_seq_len), num_pos)
pos_idx = np.reshape(pos_idx, input.shape)
pos_embeds = self.pos_embed(pos_idx)
```


This generates position ids, and from them positional embeddings, for a *batch* of input sequences. Suppose this is our batch of input sequences to the encoder.

In [127]:
# Batch of 3 sequences, each of length 10 (10 is also the
# maximum sequence length in this case).
seqs = np.random.randint(0, 10000, size=(3, 10))
print(seqs.shape)
print(seqs)

(3, 10)
[[8540  699 3882 8104 1461 4601  380 5310 3532 6410]
 [8637 6754 3395 9796 3691 4747 2846 7201 4847 5162]
 [9285 8486 7357 3925 4899 5414 3704 5505 9395 7846]]


We need to retrieve a positional embedding for every element in this batch. The first step is to create the respective positional ids...

In [128]:
pos_ids = np.resize(np.arange(seqs.shape[1]), seqs.shape[0] * seqs.shape[1])
print(pos_ids)

[0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9]


...and then reshape them to match the input batch dimensions.

In [129]:
pos_ids = np.reshape(pos_ids, (3, 10))
print(pos_ids.shape)
print(pos_ids)

(3, 10)
[[0 1 2 3 4 5 6 7 8 9]
 [0 1 2 3 4 5 6 7 8 9]
 [0 1 2 3 4 5 6 7 8 9]]


We can now retrieve position embeddings for every token embedding.

In [130]:
pos_embed(pos_ids)

<tf.Tensor: shape=(3, 10, 12), dtype=float32, numpy=
array([[[-0.03963637, -0.04204594,  0.02489932,  0.0385805 ,
          0.00693009,  0.03903357, -0.01129876,  0.04942553,
         -0.00418737,  0.02762145, -0.03317753,  0.01927267],
        [-0.01136623,  0.02414629,  0.02556327, -0.04535156,
          0.00530325,  0.03580023,  0.04164941, -0.01535302,
         -0.04659935, -0.03705139,  0.01586578, -0.04142201],
        [ 0.01792503, -0.0346771 , -0.04305955,  0.02828251,
          0.00736285,  0.04626873, -0.04120122, -0.0100264 ,
          0.00483947,  0.01425353, -0.0138227 ,  0.04550341],
        [ 0.01445282, -0.01102936, -0.03085283, -0.0350817 ,
         -0.00785683,  0.00040257,  0.0344961 , -0.01176101,
          0.00196458,  0.03947881, -0.04884348,  0.03268308],
        [-0.01659729,  0.02240426,  0.01816011,  0.03862705,
         -0.04475918,  0.00696085, -0.04524432,  0.02315936,
          0.02162952, -0.0044598 ,  0.04664299, -0.04124485],
        [ 0.03516365,  0.02
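
One caveat with the resize-then-reshape approach: it only works when the batch is padded to exactly `max_seq_len`. If the sequences are shorter, the `np.reshape` call raises an error because the element counts don't match. Here's a minimal sketch of an alternative that builds the ids from the batch's actual length instead. The helper name `position_ids` is our own, not part of the notebook's classes.

```python
import numpy as np

def position_ids(batch_size, seq_len):
    """Hypothetical helper: one row of ids [0, 1, ..., seq_len - 1] per sequence."""
    return np.tile(np.arange(seq_len), (batch_size, 1))

# For a batch padded to the maximum length, this matches the approach above.
ids = position_ids(batch_size=3, seq_len=10)
resized = np.reshape(np.resize(np.arange(10), 3 * 10), (3, 10))
print(np.array_equal(ids, resized))

# It also handles a batch that is shorter than the maximum length.
short_ids = position_ids(batch_size=3, seq_len=8)
print(short_ids.shape)
```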

Let's try our encoder on a batch of sentences.

In [131]:
input_batch = [
    "Where can I find a pizzeria?",
    "Mass hysteria over listeria.",
    "I ain't no circle back girl."
]

bpemb_en.encode(input_batch)

[['▁where', '▁can', '▁i', '▁find', '▁a', '▁p', 'iz', 'zer', 'ia', '?'],
 ['▁mass', '▁hy', 'ster', 'ia', '▁over', '▁l', 'ister', 'ia', '.'],
 ['▁i', '▁a', 'in', "'", 't', '▁no', '▁circle', '▁back', '▁girl', '.']]

In [132]:
input_seqs = bpemb_en.encode_ids(input_batch)
print("Vectorized inputs:")
input_seqs

Vectorized inputs:


[[571, 280, 386, 1934, 4, 24, 248, 4339, 177, 9967],
 [1535, 1354, 1238, 177, 380, 43, 871, 177, 9935],
 [386, 4, 6, 9937, 9915, 467, 5410, 810, 3692, 9935]]

Note how the input sequences aren't the same length in this batch. In this case, we need to pad them out so that they are.

We'll do this using *pad_sequences*.<br>
https://www.tensorflow.org/api_docs/python/tf/keras/utils/pad_sequences

In [133]:
padded_input_seqs = tf.keras.preprocessing.sequence.pad_sequences(input_seqs, padding="post")
print("Input to the encoder:")
print(padded_input_seqs.shape)
print(padded_input_seqs)

Input to the encoder:
(3, 10)
[[ 571  280  386 1934    4   24  248 4339  177 9967]
 [1535 1354 1238  177  380   43  871  177 9935    0]
 [ 386    4    6 9937 9915  467 5410  810 3692 9935]]
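
If you're curious what `padding="post"` is doing, here's a minimal pure-Python sketch of the same idea. The helper `pad_post` is our own illustration; the real *pad_sequences* supports more options.

```python
def pad_post(seqs, pad_id=0):
    """Hypothetical helper: right-pad every sequence to the longest length."""
    max_len = max(len(seq) for seq in seqs)
    return [list(seq) + [pad_id] * (max_len - len(seq)) for seq in seqs]

example_seqs = [
    [571, 280, 386, 1934, 4, 24, 248, 4339, 177, 9967],
    [1535, 1354, 1238, 177, 380, 43, 871, 177, 9935],
    [386, 4, 6, 9937, 9915, 467, 5410, 810, 3692, 9935],
]

padded = pad_post(example_seqs)
for row in padded:
    print(row)
```

Only the second sequence is shorter than the rest, so it's the only one that gains a trailing 0.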


Since our input now has padding, now's a good time to cover **masking**.
<br>

Given a mask, wherever a mask position is set to 0, the corresponding attention score is set to *-inf*. After the softmax, the attention weight for that position is zero, so nothing attends to it.
<br>

In the slides, we covered *look-ahead* masks for the decoder to prevent it from attending to future tokens, but we also need masks for padding.
<br>

In total, there are three masks involved:
1. The *encoder mask* to mask out any padding in the encoder sequences.

2. The *decoder mask* which is used in the decoder's **first** multi-head self-attention layer. It's a <u>combination of two masks</u>: one to account for the padding in target sequences, and the look-ahead mask.

3. The *memory mask* which is used in the decoder's **second** multi-head attention layer (the cross-attention layer). The keys and values for this layer are going to be the encoder's output, and this mask will ensure the decoder doesn't attend to any encoder output which corresponds to padding. In practice, masks 1 and 3 are often the same.

The *scaled_dot_product_attention* function has this line:
```
  if mask is not None:
    scaled_scores = tf.where(mask==0, -np.inf, scaled_scores)
```
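
To see why setting a score to *-inf* removes a position, here's a small NumPy sketch of a masked softmax. The function `masked_softmax` and the example values are made up for illustration; the masking step mirrors the `tf.where` line above.

```python
import numpy as np

def masked_softmax(scores, mask):
    """Hypothetical helper: softmax over the last axis, ignoring masked positions."""
    # Same idea as the tf.where line: masked scores become -inf.
    scores = np.where(mask == 0, -np.inf, scores)
    # Subtract the row maximum for numerical stability.
    scores = scores - np.max(scores, axis=-1, keepdims=True)
    exp = np.exp(scores)  # exp(-inf) is exactly 0
    return exp / np.sum(exp, axis=-1, keepdims=True)

scores = np.array([[2.0, 1.0, 0.5, 3.0]])
mask = np.array([[1, 1, 1, 0]])  # the last position is padding

weights = masked_softmax(scores, mask)
print(weights)
```

The masked position gets a weight of exactly zero, and the remaining weights still sum to one. Note that this sketch assumes each row keeps at least one unmasked position.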

Let's create an encoder mask for our batch of input sequences.<br>

Wherever there's padding, we want the mask position set to zero.

In [134]:
enc_mask = tf.cast(tf.math.not_equal(padded_input_seqs, 0), tf.float32)
print("Input:")
print(padded_input_seqs, '\n')
print("Encoder mask:")
print(enc_mask)

Input:
[[ 571  280  386 1934    4   24  248 4339  177 9967]
 [1535 1354 1238  177  380   43  871  177 9935    0]
 [ 386    4    6 9937 9915  467 5410  810 3692 9935]] 

Encoder mask:
tf.Tensor(
[[1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
 [1. 1. 1. 1. 1. 1. 1. 1. 1. 0.]
 [1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]], shape=(3, 10), dtype=float32)


Keep in mind that the dimension of the attention matrix (for this example) is going to be:<br>
*(batch size, number of heads, query size, key size)*<br>
(3, 3, 10, 10)

So we need to expand the mask dimensions like so:

In [135]:
enc_mask = enc_mask[:, tf.newaxis, tf.newaxis, :]
enc_mask

<tf.Tensor: shape=(3, 1, 1, 10), dtype=float32, numpy=
array([[[[1., 1., 1., 1., 1., 1., 1., 1., 1., 1.]]],


       [[[1., 1., 1., 1., 1., 1., 1., 1., 1., 0.]]],


       [[[1., 1., 1., 1., 1., 1., 1., 1., 1., 1.]]]], dtype=float32)>

This way, the encoder mask will be *broadcast* across the head and query dimensions.<br>
https://www.tensorflow.org/xla/broadcasting
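
Here's a quick NumPy sketch of what that broadcasting does, using dummy all-zero scores in place of real attention scores. NumPy follows the same broadcasting rules for this case, so the shapes behave the same way.

```python
import numpy as np

# Dummy attention scores: (batch size, number of heads, query size, key size).
scores = np.zeros((3, 3, 10, 10))

# Expanded encoder mask: only the last key position of the second sequence is padding.
mask = np.ones((3, 1, 1, 10))
mask[1, 0, 0, 9] = 0

# The two size-1 axes of the mask stretch to cover every head and every query.
masked_scores = np.where(mask == 0, -np.inf, scores)
print(masked_scores.shape)

# Key position 9 of the second sequence is masked for all heads and all queries.
print(np.all(np.isinf(masked_scores[1, :, :, 9])))
```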

Now we can declare an encoder and pass it batches of vectorized sequences.

In [136]:
num_encoder_blocks = 6

# d_model is the embedding dimension used throughout.
d_model = 12

num_heads = 3

# Feed-forward network hidden dimension width.
ffn_hidden_dim = 48

src_vocab_size = bpemb_vocab_size
max_input_seq_len = padded_input_seqs.shape[1]

encoder = Encoder(
    num_encoder_blocks,
    d_model,
    num_heads,
    ffn_hidden_dim,
    src_vocab_size,
    max_input_seq_len)

We can now pass our input sequences and mask to the encoder.

In [137]:
encoder_output, attn_weights = encoder(padded_input_seqs, training=True, mask=enc_mask)
print(f"Encoder output {encoder_output.shape}:")
print(encoder_output)

Encoder output (3, 10, 12):
tf.Tensor(
[[[-0.33231628 -0.22362208 -2.2380664   0.05607291  0.73056525
   -0.751446   -0.3026732   1.8942666   0.8176175  -0.6552055
    0.97250575  0.0323018 ]
  [ 0.08985625 -0.16067123 -2.0403652  -0.2906918   0.05954127
   -0.6293086  -0.3847365   2.3005133   0.76673275 -0.17724845
    0.9898023  -0.52342427]
  [ 0.49515298 -0.20049411 -2.418865    1.166448   -0.16037431
   -1.1636302   0.24331978  0.730527    0.9133192   0.4839618
    0.8075338  -0.89689887]
  [ 0.1521135  -0.46373174 -2.0667353   1.3020321  -0.04134262
   -0.25867832 -1.1413778   1.7192883   0.76352024 -0.19584723
    0.7536985  -0.5229398 ]
  [ 0.4872014  -0.24148712 -1.9313676   0.9765346  -0.3503809
   -0.92744315 -0.64997387  2.0172946   1.0518029  -0.1584914
    0.29461813 -0.5683076 ]
  [ 0.07985155 -0.38760275 -2.1478546   1.231245   -0.0412124
   -0.5678461  -0.4475841   2.0576181   0.4447183  -0.20920147
    0.5893758  -0.6015073 ]
  [-0.07914703 -0.35984483 -1.9841175   1.

## Decoder Block

Let's build the **Decoder Block**. Everything we did to create the **encoder** block applies here. The major differences are that the **Decoder Block** has:
1. a **Multi-Head Cross-Attention** layer which uses the encoder's outputs as the keys and values.

2. an extra skip/residual connection along with an extra layer normalization step.

<div>
<img src="https://github.com/cheroualiyakoub/Attention-is-all-you-need/blob/main/1.%20Attention%20Is%20All%20You%20Need%20Paper/pic/decoder_block.png?raw=1" width="500"/>
</div>

In [138]:
class DecoderBlock(tf.keras.layers.Layer):
  def __init__(self, d_model, num_heads, hidden_dim, dropout_rate=0.1):
    super(DecoderBlock, self).__init__()

    self.mhsa1 = MultiHeadSelfAttention(d_model, num_heads)
    self.mhsa2 = MultiHeadSelfAttention(d_model, num_heads)

    self.ffn = feed_forward_network(d_model, hidden_dim)

    self.dropout1 = tf.keras.layers.Dropout(dropout_rate)
    self.dropout2 = tf.keras.layers.Dropout(dropout_rate)
    self.dropout3 = tf.keras.layers.Dropout(dropout_rate)

    self.layernorm1 = tf.keras.layers.LayerNormalization()
    self.layernorm2 = tf.keras.layers.LayerNormalization()
    self.layernorm3 = tf.keras.layers.LayerNormalization()

  # Note the decoder block takes two masks: one for the masked self-attention
  # layer (mhsa1) and another for the cross-attention layer (mhsa2).
  def call(self, encoder_output, target, training, decoder_mask, memory_mask):
    mhsa_output1, attn_weights = self.mhsa1(target, target, target, decoder_mask)
    mhsa_output1 = self.dropout1(mhsa_output1, training=training)
    mhsa_output1 = self.layernorm1(mhsa_output1 + target)

    mhsa_output2, attn_weights = self.mhsa2(mhsa_output1, encoder_output,
                                            encoder_output,
                                            memory_mask)
    mhsa_output2 = self.dropout2(mhsa_output2, training=training)
    mhsa_output2 = self.layernorm2(mhsa_output2 + mhsa_output1)

    ffn_output = self.ffn(mhsa_output2)
    ffn_output = self.dropout3(ffn_output, training=training)
    output = self.layernorm3(ffn_output + mhsa_output2)

    return output, attn_weights
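
To make the cross-attention step more concrete, here's a single-head NumPy sketch with random stand-in tensors (no learned projections, no multiple heads). The function `cross_attention` and all the values are assumptions for illustration. The thing to notice is the shape of the attention weights: one row per *target* position and one column per *encoder* position, which is why the memory mask is defined over encoder positions.

```python
import numpy as np

def cross_attention(query, key, value, mask=None):
    """Hypothetical single-head scaled dot-product attention."""
    d_k = key.shape[-1]
    scores = query @ np.swapaxes(key, -1, -2) / np.sqrt(d_k)
    if mask is not None:
        scores = np.where(mask == 0, -np.inf, scores)
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ value, weights

rng = np.random.default_rng(0)
decoder_states = rng.normal(size=(3, 8, 12))    # queries: 8 target positions
encoder_states = rng.normal(size=(3, 10, 12))   # keys/values: 10 source positions

# Memory mask over encoder positions; the last one in sequence 1 is padding.
memory_mask = np.ones((3, 1, 10))
memory_mask[1, 0, 9] = 0

output, weights = cross_attention(decoder_states, encoder_states,
                                  encoder_states, memory_mask)
print(output.shape)   # one vector per target position
print(weights.shape)  # (batch, target length, source length)
```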


## Decoder

The decoder is almost the same as the encoder except it takes the encoder's output as part of its input, and it takes two masks: the decoder mask and memory mask.

In [150]:
class Decoder(tf.keras.layers.Layer):
  def __init__(self, num_blocks, d_model, num_heads, hidden_dim, target_vocab_size,
               max_seq_len, dropout_rate=0.1):
    super(Decoder, self).__init__()

    self.d_model = d_model
    self.max_seq_len = max_seq_len

    self.token_embed = tf.keras.layers.Embedding(target_vocab_size, self.d_model)
    self.pos_embed = tf.keras.layers.Embedding(max_seq_len, self.d_model)

    self.dropout = tf.keras.layers.Dropout(dropout_rate)

    self.blocks = [DecoderBlock(self.d_model, num_heads, hidden_dim, dropout_rate) for _ in range(num_blocks)]

  def call(self, encoder_output, target, training, decoder_mask, memory_mask):
    token_embeds = self.token_embed(target)

    # Generate position indices.
    # Note: this assumes every target sequence is padded to exactly max_seq_len.
    num_pos = target.shape[0] * self.max_seq_len
    pos_idx = np.resize(np.arange(self.max_seq_len), num_pos)
    pos_idx = np.reshape(pos_idx, target.shape)

    pos_embeds = self.pos_embed(pos_idx)

    x = self.dropout(token_embeds + pos_embeds, training=training)

    for block in self.blocks:
      x, weights = block(encoder_output=encoder_output, target=x, training=training, decoder_mask=decoder_mask, memory_mask=memory_mask)

    return x, weights

Before we try the decoder, let's cover the masks involved. The decoder takes two masks:

The *decoder mask* which is a <u>combination of two masks</u>: one to account for the padding in target sequences, and the look-ahead mask. This mask is used in the decoder's **first** multi-head self-attention layer.

The *memory mask* which is used in the decoder's **second** multi-head attention layer (the cross-attention layer). The keys and values for this layer are going to be the encoder's output, and this mask will ensure the decoder doesn't attend to any encoder output which corresponds to padding.

Suppose this is our batch of vectorized target *input* sequences for the decoder. These values are just made up.<br>

**Note**: If you need a refresher on how to prepare target input and output sequences for the decoder, refer to the [seq2seq notebook](https://colab.research.google.com/github/nitinpunjabi/nlp-demystified/blob/main/notebooks/nlpdemystified_seq2seq_and_attention.ipynb).



In [151]:
# Made up values.
target_input_seqs = [
    [1, 652, 723, 123, 62],
    [1, 25,  98, 129, 248, 215, 359, 249],
    [1, 2369, 1259, 125, 486],
]

As we did with the encoder input sequences, we need to pad out this batch so that all sequences within it are the same length.

In [152]:
padded_target_input_seqs = tf.keras.preprocessing.sequence.pad_sequences(target_input_seqs, padding="post")
print("Padded target inputs to the decoder:")
print(padded_target_input_seqs.shape)
print(padded_target_input_seqs)

Padded target inputs to the decoder:
(3, 8)
[[   1  652  723  123   62    0    0    0]
 [   1   25   98  129  248  215  359  249]
 [   1 2369 1259  125  486    0    0    0]]


We can create the padding mask the same way we did for the encoder.

In [153]:
dec_padding_mask = tf.cast(tf.math.not_equal(padded_target_input_seqs, 0), tf.float32)
dec_padding_mask = dec_padding_mask[:, tf.newaxis, tf.newaxis, :]
print(dec_padding_mask)

tf.Tensor(
[[[[1. 1. 1. 1. 1. 0. 0. 0.]]]


 [[[1. 1. 1. 1. 1. 1. 1. 1.]]]


 [[[1. 1. 1. 1. 1. 0. 0. 0.]]]], shape=(3, 1, 1, 8), dtype=float32)


As we covered in the slides, the look-ahead mask is a lower-triangular matrix: 1s on and below the main diagonal, and zeros above it. This is easy to create using the *band_part* method:<br>
https://www.tensorflow.org/api_docs/python/tf/linalg/band_part

In [154]:
target_input_seq_len = padded_target_input_seqs.shape[1]
look_ahead_mask = tf.linalg.band_part(tf.ones((target_input_seq_len,
                                               target_input_seq_len)), -1, 0)
print(look_ahead_mask)

tf.Tensor(
[[1. 0. 0. 0. 0. 0. 0. 0.]
 [1. 1. 0. 0. 0. 0. 0. 0.]
 [1. 1. 1. 0. 0. 0. 0. 0.]
 [1. 1. 1. 1. 0. 0. 0. 0.]
 [1. 1. 1. 1. 1. 0. 0. 0.]
 [1. 1. 1. 1. 1. 1. 0. 0.]
 [1. 1. 1. 1. 1. 1. 1. 0.]
 [1. 1. 1. 1. 1. 1. 1. 1.]], shape=(8, 8), dtype=float32)


To create the decoder mask, we just need to combine the padding and look-ahead masks. Note how the columns of the resulting decoder mask are all zero for padding positions.

In [155]:
dec_mask = tf.minimum(dec_padding_mask, look_ahead_mask)
print("The decoder mask:")
print(dec_mask)

The decoder mask:
tf.Tensor(
[[[[1. 0. 0. 0. 0. 0. 0. 0.]
   [1. 1. 0. 0. 0. 0. 0. 0.]
   [1. 1. 1. 0. 0. 0. 0. 0.]
   [1. 1. 1. 1. 0. 0. 0. 0.]
   [1. 1. 1. 1. 1. 0. 0. 0.]
   [1. 1. 1. 1. 1. 0. 0. 0.]
   [1. 1. 1. 1. 1. 0. 0. 0.]
   [1. 1. 1. 1. 1. 0. 0. 0.]]]


 [[[1. 0. 0. 0. 0. 0. 0. 0.]
   [1. 1. 0. 0. 0. 0. 0. 0.]
   [1. 1. 1. 0. 0. 0. 0. 0.]
   [1. 1. 1. 1. 0. 0. 0. 0.]
   [1. 1. 1. 1. 1. 0. 0. 0.]
   [1. 1. 1. 1. 1. 1. 0. 0.]
   [1. 1. 1. 1. 1. 1. 1. 0.]
   [1. 1. 1. 1. 1. 1. 1. 1.]]]


 [[[1. 0. 0. 0. 0. 0. 0. 0.]
   [1. 1. 0. 0. 0. 0. 0. 0.]
   [1. 1. 1. 0. 0. 0. 0. 0.]
   [1. 1. 1. 1. 0. 0. 0. 0.]
   [1. 1. 1. 1. 1. 0. 0. 0.]
   [1. 1. 1. 1. 1. 0. 0. 0.]
   [1. 1. 1. 1. 1. 0. 0. 0.]
   [1. 1. 1. 1. 1. 0. 0. 0.]]]], shape=(3, 1, 8, 8), dtype=float32)
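
The mask-building steps can be bundled into a few small helpers. Here's a minimal NumPy sketch, assuming the padding id is 0 as in this notebook. The names `padding_mask`, `look_ahead_mask`, and `decoder_mask` are our own, and `np.tril` stands in for *band_part*.

```python
import numpy as np

def padding_mask(padded_seqs, pad_id=0):
    """1 for real tokens, 0 for padding, shaped (batch, 1, 1, seq_len)."""
    mask = (np.asarray(padded_seqs) != pad_id).astype(np.float32)
    return mask[:, np.newaxis, np.newaxis, :]

def look_ahead_mask(seq_len):
    """Lower-triangular matrix of ones, shaped (seq_len, seq_len)."""
    return np.tril(np.ones((seq_len, seq_len), dtype=np.float32))

def decoder_mask(padded_targets, pad_id=0):
    """A position is visible only if both masks allow it."""
    seq_len = np.asarray(padded_targets).shape[1]
    return np.minimum(padding_mask(padded_targets, pad_id),
                      look_ahead_mask(seq_len))

targets = np.array([
    [1, 652, 723, 123, 62, 0, 0, 0],
    [1, 25, 98, 129, 248, 215, 359, 249],
    [1, 2369, 1259, 125, 486, 0, 0, 0],
])

combined = decoder_mask(targets)
print(combined.shape)
print(combined[0, 0])
```

Taking the element-wise minimum works because both masks only contain 0s and 1s, so the result is 1 only where both are 1.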


We can now declare a decoder and pass it everything it needs. In our case, the *memory* mask is the same as the *encoder* mask.

In [156]:
decoder = Decoder(num_blocks=6, d_model=12, num_heads=3, hidden_dim=48,
                  target_vocab_size=10000, max_seq_len=8)
decoder_output, _ = decoder(encoder_output=encoder_output, target=padded_target_input_seqs,
                            training=True, decoder_mask=dec_mask, memory_mask=enc_mask)
print(f"Decoder output {decoder_output.shape}:")
print(decoder_output)

Decoder output (3, 8, 12):
tf.Tensor(
[[[-1.392404    0.80751544  0.5884524   0.11931834  1.9014071
    0.2405626  -1.8042227   0.3398928  -0.44588256 -1.1996797
    0.4185815   0.42645895]
  [-0.8263886   0.56320167  1.7330054   1.1195354   1.1469616
   -0.07810245 -1.8630401   0.13881044  0.11155178 -1.0997264
   -0.8271812  -0.11862742]
  [-1.3095174   0.2652379   1.4245481   0.9675348   1.7069954
   -0.02487501 -1.6928331   0.20464136 -0.2344392  -1.0730258
   -0.42009917  0.18583205]
  [-2.0250492   1.0710709   1.1227536   1.2479886   1.3217863
   -0.33384183 -0.8424877   0.26743492 -0.14789423 -0.99569184
   -0.47291356 -0.21315598]
  [-1.4375656   0.74480414  1.0914987   0.79071176  1.800177
    0.2022104  -1.9317381   0.0810984  -0.12643388 -0.36471936
   -0.52497643 -0.32506704]
  [-0.7676151   1.085856    1.8423499  -0.69002265  1.5989496
   -0.21772891 -1.5515097   0.25026444  0.00386874 -0.9674823
   -0.58172995 -0.00519993]
  [-0.72171456  0.54279494  1.9903524   1.0262178

## Transformer

We now have all the pieces to build the **Transformer** itself, and it's pretty simple.

In [166]:
class Transformer(tf.keras.Model):
  def __init__(self, num_blocks, d_model, num_heads, hidden_dim, source_vocab_size,
               target_vocab_size, max_input_len, max_target_len, dropout_rate=0.1):
    super(Transformer, self).__init__()

    self.encoder = Encoder(num_blocks, d_model, num_heads, hidden_dim, source_vocab_size,
                           max_input_len, dropout_rate)

    self.decoder = Decoder(num_blocks, d_model, num_heads, hidden_dim, target_vocab_size,
                           max_target_len, dropout_rate)

    # The final dense layer to generate logits from the decoder output.
    self.output_layer = tf.keras.layers.Dense(target_vocab_size)

  def call(self, input_seqs, target_input_seqs, training, encoder_mask,
           decoder_mask, memory_mask):
    encoder_output, encoder_attn_weights = self.encoder(input=input_seqs,
                                                        training=training, mask=encoder_mask)

    decoder_output, decoder_attn_weights = self.decoder(encoder_output=encoder_output,
                                                        target=target_input_seqs,
                                                        training=training,
                                                        decoder_mask=decoder_mask,
                                                        memory_mask=memory_mask)

    return self.output_layer(decoder_output), encoder_attn_weights, decoder_attn_weights


In [167]:
transformer = Transformer(
    num_blocks = 6,
    d_model = 12,
    num_heads = 3,
    hidden_dim = 48,
    source_vocab_size = bpemb_vocab_size,
    target_vocab_size = 7000, # made-up target vocab size.
    max_input_len = padded_input_seqs.shape[1],
    max_target_len = padded_target_input_seqs.shape[1])

transformer_output, _, _ = transformer(input_seqs=padded_input_seqs,
                                       target_input_seqs=
                                       padded_target_input_seqs,
                                       training=True,
                                       encoder_mask=enc_mask,
                                       decoder_mask=dec_mask,
                                       memory_mask=enc_mask)
print(f"Transformer output {transformer_output.shape}:")
print(transformer_output) # If training, we would use this output to calculate losses.

Transformer output (3, 8, 7000):
tf.Tensor(
[[[ 0.00731332  0.04587527  0.04748233 ...  0.07824551 -0.07507154
    0.09597214]
  [-0.0104115   0.01235084  0.03681688 ...  0.03563239 -0.05328212
    0.07149391]
  [ 0.00688984  0.01921601  0.05249054 ...  0.05040047 -0.0290722
    0.13523383]
  ...
  [-0.04869027  0.05809747  0.01587982 ...  0.07286657 -0.02220155
    0.0379923 ]
  [-0.02782372  0.0725248   0.05032315 ...  0.12392734 -0.056352
    0.10691591]
  [-0.0214413   0.04324479  0.05715049 ...  0.09436374 -0.0403249
    0.11694759]]

 [[-0.01883995  0.03135853  0.04840583 ...  0.04678973 -0.01536078
    0.102964  ]
  [-0.03395027  0.03969305  0.02485096 ...  0.05448001  0.00550492
    0.1127027 ]
  [-0.04619592  0.04743066  0.02719874 ...  0.07895622  0.00376518
    0.13221996]
  ...
  [-0.04368157  0.0486478   0.00734998 ...  0.08028561  0.01617477
    0.12260336]
  [-0.02967573  0.05021084  0.02383391 ...  0.06259517  0.03073626
    0.12355409]
  [-0.01077404  0.04275585  0.051
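
The comment in the cell above mentions using this output to calculate losses. Here's a rough NumPy sketch of how that could look: cross-entropy over the logits, with padding positions left out of the average. The function `masked_cross_entropy` and the tiny example tensors are made up, and it assumes the padding id is 0; a real training loop would use the target *output* sequences, which we haven't defined in this notebook.

```python
import numpy as np

def masked_cross_entropy(logits, targets, pad_id=0):
    """Hypothetical helper: mean cross-entropy over non-padding positions."""
    # Log-softmax over the vocabulary axis, computed stably.
    logits = logits - logits.max(axis=-1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))

    # Pick out the log-probability assigned to each target token.
    token_log_probs = np.take_along_axis(
        log_probs, targets[..., np.newaxis], axis=-1)[..., 0]

    # Ignore padding when averaging.
    mask = (targets != pad_id).astype(np.float64)
    return -(token_log_probs * mask).sum() / mask.sum()

# Made-up example: batch of 2, sequence length 3, vocabulary of 5.
example_logits = np.zeros((2, 3, 5))
example_targets = np.array([[1, 2, 0],
                            [3, 0, 0]])

loss = masked_cross_entropy(example_logits, example_targets)
print(loss)
```

With all-zero logits every token is equally likely, so the loss equals the log of the vocabulary size regardless of how much padding there is.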

That's the whole original transformer from scratch. One training detail from the paper that we haven't implemented here is learning rate warmup (refer to the paper for more information on this).
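
For reference, here's a small sketch of the learning rate warmup schedule described in the paper: the rate grows linearly for the first `warmup_steps` steps, then decays with the inverse square root of the step number. The function name is our own; the defaults (`d_model=512`, `warmup_steps=4000`) are the paper's base settings, not values used in this notebook.

```python
def warmup_learning_rate(step, d_model=512, warmup_steps=4000):
    """Hypothetical helper: d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5)."""
    step = max(step, 1)  # avoid dividing by zero at step 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

for step in [1, 1000, 4000, 16000, 64000]:
    print(step, warmup_learning_rate(step))
```

The learning rate peaks at `step == warmup_steps`, where the two terms inside `min` are equal.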

It's useful to know how these models work under the hood, but training our own transformer to get impressive results is expensive, both in terms of compute and data.<br>

Fortunately, there's a zoo of **pretrained** transformer models we can use. We'll explore that later.