# Multi-Head Attention

By far the most important component of the Transformer is the Multi-Head Attention mechanism.
After all, the paper is titled "Attention Is All You Need". From here on, Multi-Head Attention is abbreviated as MHA for convenience.

Therefore, before diving deep into the rest of the Transformer, I think it is worthwhile to try implementing
this module from scratch, which in turn will help us understand the subtle parts of MHA.

PyTorch does have an official implementation, which is used in its official Transformer block, but I am a strong believer in
building something from scratch in order to appreciate the details of a de facto implementation.



# Attention
MHA is simply multiple attention blocks run side by side, so in order to truly understand MHA, we first need to take a close look at attention itself.
In the paper, attention is defined by this equation, where $d_k$ is the dimension of the keys:
$\text{Attention}(Q,K,V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$
The paper supports the equation with the following diagram:
![Attention Diagram](images/attention.jpeg)

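The equation above can be written almost line for line in PyTorch. The following is a minimal sketch rather than the official implementation: the function name `scaled_dot_product_attention` and the tensor shapes are my own choices for illustration, and masking and dropout are left out.

```python
import math

import torch


def scaled_dot_product_attention(q, k, v):
    """Sketch of softmax(Q K^T / sqrt(d_k)) V without masking or dropout.

    q: (..., n_q, d_k), k: (..., n_k, d_k), v: (..., n_k, d_v)
    """
    d_k = q.size(-1)
    # Similarity of every query with every key, scaled by sqrt(d_k)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)
    # Normalize over the key axis so each query's weights sum to 1
    weights = torch.softmax(scores, dim=-1)
    # Weighted average of the values
    return weights @ v, weights


torch.manual_seed(0)
q = torch.randn(2, 4, 8)   # batch of 2, 4 queries, d_k = 8
k = torch.randn(2, 6, 8)   # 6 keys, d_k = 8
v = torch.randn(2, 6, 16)  # 6 values, d_v = 16
out, weights = scaled_dot_product_attention(q, k, v)
print(out.shape)      # one d_v-sized output per query
print(weights.shape)  # one weight per (query, key) pair
```

Note that the output has one row per query, while the number of keys and values only needs to match each other.
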
# Q, K, and V
Q, K, and V stand for Query, Key, and Value respectively. To quote this awesome [Cross Validated answer](https://stats.stackexchange.com/questions/421935/what-exactly-are-keys-queries-and-values-in-attention-mechanisms):
> The key/value/query concepts come from retrieval systems. For example, when you type a query to search for some video on Youtube, the search engine will map your query against a set of keys (video title, description etc.) associated with candidate videos in the database, then present you the best matched videos (values).

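To make the retrieval analogy concrete, here is a toy "soft lookup" sketch. All numbers are made up for illustration, and the scores are left unscaled to keep it short. Unlike a real search engine that returns the single best match, attention returns a weighted average of all values, with most of the weight on the best-matching key.

```python
import torch

# Three hypothetical database entries: one key and one value each
keys = torch.eye(3)                              # (3, 3)
values = torch.tensor([[10.0], [20.0], [30.0]])  # (3, 1)

# A query that lines up strongly with the second key
query = torch.tensor([[0.0, 8.0, 0.0]])          # (1, 3)

scores = query @ keys.T                  # how well the query matches each key
weights = torch.softmax(scores, dim=-1)  # turn match scores into weights
result = weights @ values                # weighted average of the values

print(weights)  # almost all of the weight lands on the second key
print(result)   # very close to the second value, 20.0
```
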
One thing that really confused me when I was studying attention is that the paper seems to use the terms Q, K, and V interchangeably, even though they mean different things in different places.
Here is what I mean:
![Attention Diagram 2](images/confusion.png "Attention and MHA")
The inputs to both Scaled Dot-Product Attention and Multi-Head Attention are labeled Q, K, and V. But Multi-Head Attention contains Scaled Dot-Product Attention, and that inner block seems to take in Q, K, and V only after they have been projected by a Linear layer.
Later on, the paper gives the following equation for MHA:
![MHA Equation](images/mha_equation.png "MHA Equation")

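In case the image does not load, here is my own transcription of the MHA definition from the paper, assuming the image shows that definition. Here $h$ is the number of heads and $d_{\text{model}}$ is the model dimension:

$$
\text{MultiHead}(Q,K,V) = \text{Concat}(\text{head}_1, \dots, \text{head}_h)W^O
$$

$$
\text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V)
$$

with $W_i^Q \in \mathbb{R}^{d_{\text{model}} \times d_k}$, $W_i^K \in \mathbb{R}^{d_{\text{model}} \times d_k}$, $W_i^V \in \mathbb{R}^{d_{\text{model}} \times d_v}$, and $W^O \in \mathbb{R}^{hd_v \times d_{\text{model}}}$.
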
Compared with the previous attention equation, which takes in Q, K, and V directly, the attention call inside MHA takes in
$QW_{i}^{Q}$, $KW_{i}^{K}$, and $VW_{i}^{V}$. So which one is it...?

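To see the two sets of "Q, K, V" next to each other, here is a rough sketch of a single head. The sizes and variable names are hypothetical, and I am assuming bias-free `nn.Linear` layers as stand-ins for $W_i^Q$, $W_i^K$, and $W_i^V$. The uppercase tensors are what gets handed to MHA, and the lowercase ones are what the inner attention actually sees.

```python
import math

import torch
from torch import nn

torch.manual_seed(0)
d_model, d_k, d_v = 16, 4, 4  # hypothetical sizes for one head

# What is handed to the MHA block: unprojected, d_model wide
Q = torch.randn(5, d_model)  # 5 query positions
K = torch.randn(7, d_model)  # 7 key positions
V = torch.randn(7, d_model)  # 7 value positions

# Stand-ins for W_i^Q, W_i^K, W_i^V of a single head i
w_q = nn.Linear(d_model, d_k, bias=False)
w_k = nn.Linear(d_model, d_k, bias=False)
w_v = nn.Linear(d_model, d_v, bias=False)

# What Scaled Dot-Product Attention receives inside the head
q_i, k_i, v_i = w_q(Q), w_k(K), w_v(V)

weights = torch.softmax(q_i @ k_i.T / math.sqrt(d_k), dim=-1)
head_i = weights @ v_i

print(Q.shape, q_i.shape)  # d_model wide before projection, d_k wide after
print(head_i.shape)        # one d_v-sized vector per query position
```
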

Maybe it was just me, but if you were also confused by this on your first read-through of the paper, hopefully the following description helps.


```python
import torch
from torch import nn
from torch.nn import Module, ModuleList
```

