Attention is All You Need

A from-scratch implementation of the Transformer as presented in the paper "Attention Is All You Need".

References:

- Excellent Illustration of Transformers: Illustrated Guide to Transformers Neural Network: A step by step explanation
- Keys, Queries and Values in Attention Mechanism: What exactly are keys, queries, and values in attention mechanisms?
- Positional Encoding: Transformer Architecture: The Positional Encoding
- Data Flow, Parameters, and Dimensions in Transformer: Into The Transformer; Transformers: report on Attention Is All You Need

The Transformer is a neural network architecture widely used in natural language processing (NLP) tasks such as machine translation and text classification. It was introduced by Vaswani et al. in the 2017 paper "Attention Is All You Need".

At a high level, the Transformer consists of an encoder and a decoder, each built as a stack of identical layers. Each encoder layer has two sub-layers: a self-attention layer and a position-wise feedforward layer; each decoder layer adds a third sub-layer that attends over the encoder's output. The self-attention layer allows the model to attend to different parts of the input sequence, while the feedforward layer applies a non-linear transformation to the attention output.
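
As a quick sketch of this stacked structure, here is the same composition expressed with PyTorch's built-in modules (purely illustrative; the repository builds these pieces from scratch):

```python
import torch
import torch.nn as nn

# One encoder layer = self-attention sub-layer + feedforward sub-layer;
# the encoder stacks several identical copies of it.
encoder_layer = nn.TransformerEncoderLayer(d_model=512, nhead=8,
                                           dim_feedforward=2048, batch_first=True)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=6)  # 6 identical layers

src = torch.randn(2, 10, 512)   # (batch, sequence length, embedding dim)
out = encoder(src)              # same shape: contextualized representations
print(out.shape)                # torch.Size([2, 10, 512])
```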

Architecture

Now, let's break down the math behind the self-attention layer. Suppose we have an input sequence of length $N$, represented as a matrix $X$, where each row is a word embedding. We want to compute a new sequence of vectors $Z$, where each output vector is a weighted sum of all the input vectors:

$$ Z = AX $$

where $A$ is an $N \times N$ matrix of weights.

However, rather than fixing these weights, we want to compute them dynamically, based on the similarity between each pair of input vectors. This is where self-attention comes in. We first compute a "query" matrix $Q$, a "key" matrix $K$, and a "value" matrix $V$:

$$ Q = XW_q \\ K = XW_k \\ V = XW_v $$

where $W_q$, $W_k$, and $W_v$ are learned weight matrices. Then, we take a softmax over the scaled dot products between queries and keys and use the resulting weights to combine the values:

$$ \text{Attention}(Q, K, V) = \text{softmax}(\frac{Q K^{T}}{\sqrt{d_k}})V $$

where $d_k$ is the dimensionality of the key vectors. The softmax ensures that each row of attention weights sums to $1$, and the $\frac{1}{\sqrt{d_k}}$ scaling keeps the dot products from growing with the dimensionality, which would otherwise push the softmax into regions with very small gradients and destabilize training.
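
Here is a minimal sketch of this computation in PyTorch (my own illustration of the equations above, not necessarily the repository's exact code; the sizes are made up):

```python
import math
import torch
import torch.nn as nn

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V, as in the formula above."""
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)  # pairwise similarities
    weights = torch.softmax(scores, dim=-1)            # each row sums to 1
    return weights @ V                                 # weighted sum of values

N, d_model, d_k = 10, 512, 64                # illustrative sizes
X = torch.randn(N, d_model)                  # input embeddings, one row per token
W_q = nn.Linear(d_model, d_k, bias=False)    # learned projections W_q, W_k, W_v
W_k = nn.Linear(d_model, d_k, bias=False)
W_v = nn.Linear(d_model, d_k, bias=False)
Z = scaled_dot_product_attention(W_q(X), W_k(X), W_v(X))  # (N, d_k)
```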

The matrix $\text{Attention}(Q, K, V)$ is exactly the weighted sum of value vectors we wanted; the self-attention layer then applies one more learned projection to produce its output:

$$ Z = \text{Attention}(Q, K, V) W_o $$

where $W_o$ is another learned output-projection matrix. The output of the self-attention layer is then passed through a feedforward layer with a ReLU activation (in the paper, each sub-layer is additionally wrapped in a residual connection followed by layer normalization), and the process is repeated for each layer in the encoder and decoder.
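
Putting the pieces together, a single encoder layer could look like the following sketch (reusing `scaled_dot_product_attention` from the previous snippet; the residual connections and layer normalization follow the paper, and $d_{model} = 512$, $d_{ff} = 2048$ match its base model):

```python
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    """Self-attention sub-layer followed by a position-wise feedforward
    sub-layer, each wrapped in a residual connection and layer norm."""
    def __init__(self, d_model=512, d_ff=2048):
        super().__init__()
        self.W_q = nn.Linear(d_model, d_model, bias=False)
        self.W_k = nn.Linear(d_model, d_model, bias=False)
        self.W_v = nn.Linear(d_model, d_model, bias=False)
        self.W_o = nn.Linear(d_model, d_model, bias=False)   # output projection
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                nn.Linear(d_ff, d_model))    # feedforward + ReLU
        self.norm1, self.norm2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)

    def forward(self, x):
        attn = scaled_dot_product_attention(self.W_q(x), self.W_k(x), self.W_v(x))
        x = self.norm1(x + self.W_o(attn))   # Z = Attention(Q, K, V) W_o, + residual
        x = self.norm2(x + self.ff(x))       # feedforward sub-layer, + residual
        return x
```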

Multi-Head Attention
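
The paper actually applies attention through several heads in parallel: $Q$, $K$, and $V$ are each split into $h$ heads (the base model uses $h = 8$, so each head works with $d_k = 512 / 8 = 64$ dimensions), attention runs independently per head, and the head outputs are concatenated and projected with $W_o$. This lets different heads attend to different kinds of relationships in the sequence. A minimal sketch, again reusing the attention function above:

```python
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    """Multi-head self-attention: project, split into heads, attend, concat."""
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads, self.d_k = n_heads, d_model // n_heads
        self.W_q = nn.Linear(d_model, d_model, bias=False)
        self.W_k = nn.Linear(d_model, d_model, bias=False)
        self.W_v = nn.Linear(d_model, d_model, bias=False)
        self.W_o = nn.Linear(d_model, d_model, bias=False)

    def _split(self, x):
        # (N, d_model) -> (n_heads, N, d_k): one chunk of the projection per head
        return x.view(-1, self.n_heads, self.d_k).transpose(0, 1)

    def forward(self, x):
        Q = self._split(self.W_q(x))
        K = self._split(self.W_k(x))
        V = self._split(self.W_v(x))
        heads = scaled_dot_product_attention(Q, K, V)   # attention per head
        concat = heads.transpose(0, 1).reshape(-1, self.n_heads * self.d_k)
        return self.W_o(concat)                         # concat heads, project

mha = MultiHeadAttention()
Z = mha(torch.randn(10, 512))   # (10, 512): one output row per input token
```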

Overall, the Transformer architecture is a powerful tool for NLP tasks, and its self-attention mechanism allows it to model long-range dependencies in the input sequence.
