# The Transformer Architecture

This notebook is an attempt to understand what a Transformer is (the neural network architecture, not the movie series).

---

## References

- Alammar, J. (2018). The Illustrated Transformer. http://jalammar.github.io/illustrated-transformer/
- Collis, J. (2017, April 21). Glossary of Deep Learning: Word Embedding. Deeper Learning. https://medium.com/deeper-learning/glossary-of-deep-learning-word-embedding-f90c3cec34ca
- Unzueta, D. (2022, October 18). Fully Connected Layer vs Convolutional Layer: Explained | Built In. Builtin.com. https://builtin.com/machine-learning/fully-connected-layer
- (2021). E2eml.school. https://e2eml.school/transformers.html




## Further reading

Some of these came before the Transformer and are useful for understanding what kind of problems it solves; others are here for diving deeper into the topic.

> Sutskever, I., Vinyals, O., & Le, Q. V. (2014). Sequence to sequence learning with neural networks. Advances in neural information processing systems, 27.

> Cho, K., Van Merriënboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., & Bengio, Y. (2014). Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078.

## Remarks

> Word Embedding aims to create a vector representation with a much lower dimensional space. These are called Word Vectors. Word Vectors are used for semantic parsing, to extract meaning from text to enable natural language understanding. For a language model to be able to predict the meaning of text, it needs to be aware of the contextual similarity of words. **_So an answer to “What is word embedding?” is: it’s a means of building a low-dimensional vector representation from corpus of text, which preserves the contextual similarity of words._**


> Notice how matrix multiplication acts as a lookup table here. Our A matrix is made up of a stack of one-hot vectors. They have ones in the first column, the fourth column, and the third column, respectively. When we work through the matrix multiplication, this serves to pull out the first row, the fourth row, and the third row of the B matrix, in that order. This trick of using a one-hot vector to pull out a particular row of a matrix is at the core of how transformers work.
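The lookup-table behaviour described in the quote above can be checked with a small NumPy sketch. The values in `B` are made up for illustration; only the positions of the ones in `A` (first, fourth, and third columns) follow the quoted description.

```python
import numpy as np

# A: a stack of one-hot row vectors with ones in the first, fourth,
# and third columns, as in the quoted description.
A = np.array([
    [1, 0, 0, 0],
    [0, 0, 0, 1],
    [0, 0, 1, 0],
])

# B: a small table with four rows (hypothetical values).
B = np.array([
    [0.1, 0.2],
    [0.3, 0.4],
    [0.5, 0.6],
    [0.7, 0.8],
])

# Multiplying by the one-hot rows pulls out rows 1, 4, and 3 of B, in that order.
looked_up = A @ B

# The same result by direct row indexing (0-based indices 0, 3, 2).
indexed = B[[0, 3, 2]]

print(looked_up)
print(np.allclose(looked_up, indexed))
```

In practice, libraries usually implement an embedding lookup as direct indexing rather than as an explicit one-hot multiplication, since the two give the same rows and indexing avoids the wasted multiplications by zero.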


## Concepts

- How do fully connected layers work?
  - They receive a tensor and multiply it by a weight matrix, which changes (and often reduces) the size of the tensor's feature dimension. Just remember that the input tensor can be reshaped so that the product $W_{\text{units} \times N} \times T_{N \times M}$ is defined; the result then has shape $\text{units} \times M$.
- What is an MLP (multi-layer perceptron)? How does it work?
- What is tokenization? How is it applied in Transformers?
- What is word embedding?
- What is an Encoder?
- What is the Decoder?
- What is Self-Attention?
- What is beam search and how does it parallelize?
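The shape bookkeeping in the fully connected layer note above can be made concrete with a minimal sketch. The sizes `N`, `M`, and `units` and the random weights are assumptions chosen for illustration. The bias term and activation function that a real layer normally has are left out, to match the product $W \times T$ written in the note.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: N input features, M columns (e.g. samples or tokens),
# and `units` output features.
N, M, units = 8, 5, 3

W = rng.normal(size=(units, N))  # weight matrix, shape (units, N)
T = rng.normal(size=(N, M))      # input tensor, shape (N, M)


def fully_connected(W, T):
    """Apply a bias-free fully connected layer: each output row is a
    weighted sum of all N input rows."""
    return W @ T


out = fully_connected(W, T)
print(out.shape)  # (units, M): the feature dimension went from N to units
```

Choosing `units` smaller than `N` gives the dimension reduction mentioned in the note; choosing it larger would expand the feature dimension instead.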

```mermaid
flowchart BT
    IEB([Input Embedding])
    OEB([Output Embedding])
    PE1([Positional Encoding +])
    PE2([Positional Encoding +])

    IEB --> PE1
    PE1 --> D1 & C1

    OEB --> PE2
    PE2 --> E & F

    subgraph AttentionHead1[" "]
        direction BT
        A1([Add & Norm])
        B1([Feed Forward])
        C1([Add & Norm])
        D1([Multi-Head Attention])
    
        D1 --> C1 
        C1 --> B1 & A1
        B1 --> A1
    end

    A1 --> D2

    subgraph AttentionHead2[" "]
        direction BT
        A2([Add & Norm])
        B2([Feed Forward])
        C2([Add & Norm])
        D2([Multi-Head Attention])
        F([Add & Norm])
        E([Masked Multi-Head Attention])
        
        E --> F
        F --> D2 & C2
        D2 --> C2 
        C2 --> B2 & A2
        B2 --> A2
    end
```
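As a rough companion to the left-hand (encoder) block of the diagram, the following NumPy sketch walks through the same sequence of boxes: attention, Add & Norm, Feed Forward, Add & Norm. It is a simplification under stated assumptions, not a reference implementation: it uses a single attention head instead of multi-head attention, a layer normalization without learned scale and shift, random untrained weights, and hypothetical names and sizes (`encoder_layer`, `d_model`, `d_ff`).

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def layer_norm(x, eps=1e-5):
    # Normalize each position's feature vector (no learned scale/shift here).
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def attention_weights(x, Wq, Wk):
    # Scaled dot-product scores between every pair of positions.
    q, k = x @ Wq, x @ Wk
    return softmax(q @ k.T / np.sqrt(k.shape[-1]))


def encoder_layer(x, p):
    # Single-head self-attention (assumption: stands in for multi-head attention).
    attn = attention_weights(x, p["Wq"], p["Wk"]) @ (x @ p["Wv"])
    x = layer_norm(x + attn)                    # first Add & Norm
    ff = np.maximum(0, x @ p["W1"]) @ p["W2"]   # Feed Forward with ReLU
    return layer_norm(x + ff)                   # second Add & Norm


rng = np.random.default_rng(0)
seq_len, d_model, d_ff = 4, 8, 16  # hypothetical sizes

params = {
    "Wq": rng.normal(size=(d_model, d_model)),
    "Wk": rng.normal(size=(d_model, d_model)),
    "Wv": rng.normal(size=(d_model, d_model)),
    "W1": rng.normal(size=(d_model, d_ff)),
    "W2": rng.normal(size=(d_ff, d_model)),
}

# Stand-in for "Input Embedding + Positional Encoding": one vector per position.
x = rng.normal(size=(seq_len, d_model))
y = encoder_layer(x, params)
print(y.shape)  # same shape as the input: (seq_len, d_model)
```

The two `x + ...` sums correspond to the arrows in the diagram that skip around the attention and feed-forward boxes and go straight into Add & Norm. Because the output has the same shape as the input, layers like this can be stacked.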