Name: Aditya Singh
Transformer
A Transformer (transform/translate) helps in understanding the real-world meaning of data by transforming the relations between its elements (weighted by importance/priority) with respect to their actual, fundamental meaning.
Note: the Transformer acts as the model inside the main model.
- Encoder Layer
  - Input Embeddings
  - Positional Encoding
  - Self-Attention Layer
  - Feed-Forward Network
- Decoder Layer
  - Masked Self-Attention
  - Encoder-Decoder Attention
  - Feed-Forward Network
Attention: it finds, measures, and evaluates the relation of one word (or any data item) with the other data present in the batch/sequence/matrix. There are 2 types of attention:
- Self-Attention: evaluates the connection/relation of each word/data item with every other one in the sequence.
- Multi-Head Attention: multiple self-attentions run in parallel, to evaluate complex relations between entities (data). "Head" refers to one attention mechanism; a sketch appears after the attention formula below.
- Input (tokens/embeddings are the raw input to the Transformer). Each token's embedding is projected/split into 3 vectors: Q, K, V, so every token has its own Q, K, V.
- Positional Encoding (metadata of the data's order, built with sine/cosine functions, as shown below)
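For reference, the sinusoidal form from the original Transformer paper, where $pos$ is the token position, $i$ the dimension index, and $d_{model}$ the embedding size:

$$PE_{(pos,\,2i)} = \sin\left(\frac{pos}{10000^{2i/d_{model}}}\right) \qquad PE_{(pos,\,2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{model}}}\right)$$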
Self-Attention Layer (relation capturing & evaluation, or in technical terms: establishing attention weights)
$Attention(Q,K,V) = softmax\left(\frac{QK^T}{\sqrt{d_k}}\right)V$
Softmax (sums to 1) is used to normalize the scores & to convert them into a probability distribution.
- Q, K, V: Query, Key, Value
- $QK^T$: computes the interaction scores of each token with all others; represents the strength of attention.
- V: Value vectors, a dynamic dictionary. It contains the content/data that is linked with Q & K through multiplication, helping translate the captured relations into the final understanding/output (placement of the word). V helps in understanding the context of the language.
- $\sqrt{d_k}$: scaling that fixes large dot products, which would otherwise cause gradient issues. $d_k$ refers to the dimension of the keys ("d" stands for dimensionality).
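Multi-head attention (from the list above) simply runs several such attentions in parallel and concatenates the results. A simplified sketch that slices the embedding per head; a real implementation would use learned per-head projection matrices instead:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))  # stable softmax over the last axis
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_self_attention(x, num_heads):
    d_model = x.shape[1]
    d_head = d_model // num_heads
    heads = []
    for h in range(num_heads):
        # each head attends over its own slice of the embedding
        q = k = v = x[:, h * d_head:(h + 1) * d_head]
        heads.append(softmax(q @ k.T / np.sqrt(d_head)) @ v)
    return np.concatenate(heads, axis=-1)  # concatenate heads back to (seq_len, d_model)

x = np.random.default_rng(1).normal(size=(4, 8))        # 4 tokens, d_model = 8
print(multi_head_self_attention(x, num_heads=2).shape)  # (4, 8)
```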
- Feed-Forward Network (a simple neural network applied to each position; typically two linear layers with a non-linearity in between)
- Masked Self-Attention (hides future tokens during training so each position can only attend to itself and earlier positions; see the sketch after this list)
- Encoder-Decoder Attention (Q comes from the Decoder, K & V from the Encoder; the Encoder's output is used to focus on relevant parts of the input)
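A sketch of the masking idea: future positions get a score of $-\infty$ before the softmax, so they receive zero attention weight (the scores here are random stand-ins):

```python
import numpy as np

seq_len = 4
scores = np.random.default_rng(2).normal(size=(seq_len, seq_len))  # stand-in Q·K^T scores

# boolean mask: True wherever a position would attend to a future token
future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
scores[future] = -np.inf  # -inf turns into 0 after the softmax

weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
print(np.round(weights, 2))  # lower-triangular: no attention to the future
```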
- Input Tokens → Projected into Keys (K) and Values (V).
- Think: Each token writes its data (V) and a label (K) into a dictionary.
- Queries (Q) "search" this dictionary by comparing Q to all Ks.
- The best-matching Ks (highest attention scores) retrieve their corresponding Vs.
- The output is a weighted blend of the retrieved Vs.
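A tiny worked example of that dictionary analogy, with hand-picked keys and values so the lookup is easy to see (all numbers are made up for illustration):

```python
import numpy as np

# three tokens write a "label" (K) and some "data" (V) into the dictionary
K = np.array([[1.0, 0.0],     # token 0's key
              [0.0, 1.0],     # token 1's key
              [0.5, 0.5]])    # token 2's key
V = np.array([[10.0, 0.0],
              [0.0, 10.0],
              [5.0, 5.0]])

q = np.array([2.0, 0.0])      # a query that best matches token 0's key

scores = K @ q                                   # Q·K similarity: [2.0, 0.0, 1.0]
weights = np.exp(scores) / np.exp(scores).sum()  # softmax over the scores
output = weights @ V                             # weighted blend of the retrieved values
print(np.round(weights, 2))  # ~[0.67, 0.09, 0.24]: most weight on the best match
print(np.round(output, 2))   # blend dominated by token 0's value
```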