# Understanding the Transformer Architecture: A Step-by-Step Breakdown

## Project Overview

This Jupyter Notebook aims to demystify the complex Transformer architecture by breaking it down into understandable components. We will cover the overall structure, focusing on the encoder stack and the crucial role of the self-attention mechanism in generating contextual word embeddings and enabling parallel processing. This is a foundational step towards understanding how Transformers achieve state-of-the-art performance in various NLP tasks.

## Table of Contents

1.  **Transformer as a Block Diagram: High-Level Overview**
2.  **Encoder-Decoder Architecture of Transformers**
    * Stacking Encoders and Decoders
    * Reference: "Attention Is All You Need" Paper
3.  **Inside a Single Encoder Block**
    * Self-Attention Layer
    * Feed-Forward Neural Network Layer
4.  **Inside a Single Decoder Block (Brief Mention)**
5.  **The Role of Self-Attention: Fixed vs. Contextual Embeddings**
    * The Problem with Fixed Embeddings
    * How Self-Attention Creates Contextual Embeddings
6.  **Key Advantages Revisited: Parallelization and Scalability**
7.  **Upcoming: Deep Dive into Self-Attention Mechanism**

---

## 1. Transformer as a Block Diagram: High-Level Overview

At its highest level, a Transformer can be conceptualized as a single block designed to solve **sequence-to-sequence tasks**.

* **Input**: A source sequence, for example an English sentence.
* **Output**: A target sequence, for example the French translation of that sentence.

This block diagram represents the end-to-end functionality, but the real magic lies within its internal structure.

## 2. Encoder-Decoder Architecture of Transformers

The Transformer, like earlier RNN-based sequence-to-sequence models, follows an **Encoder-Decoder architecture**.

* **Encoder**: Responsible for processing the input sequence and transforming it into a rich representation.
* **Decoder**: Responsible for taking the encoder's output and generating the target sequence.

### Stacking Encoders and Decoders

A key characteristic of the Transformer is that it doesn't just use one encoder and one decoder. Instead, it uses **multiple stacked encoders** and **multiple stacked decoders**.

* In the original "Attention Is All You Need" research paper (which we'll reference), the Transformer uses **six encoders** stacked on top of each other, and **six decoders** similarly stacked.
* Information flows from the bottom encoder to the top, and then the final encoder's output is passed to all decoders.
* Within the decoder stack, information flows from bottom to top as well, generating the output sequence.
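
The flow through the two stacks can be sketched in a few lines of NumPy. This is only an illustration of the wiring: `toy_encoder_layer` and `toy_decoder_layer` are hypothetical placeholders that stand in for real encoder and decoder blocks, and the sequence lengths and vector size are arbitrary.

```python
import numpy as np

NUM_LAYERS = 6  # six encoders and six decoders, as in the original paper

def toy_encoder_layer(x):
    # Hypothetical stand-in for a real encoder block: keeps the shape unchanged.
    return np.tanh(x)

def toy_decoder_layer(y, memory):
    # Hypothetical stand-in for a real decoder block: mixes the decoder
    # state with a summary of the encoder output ("memory").
    return np.tanh(y + memory.mean(axis=0))

def run_stack(source, target):
    # Encoder stack: the output of one layer is the input of the next.
    x = source
    for _ in range(NUM_LAYERS):
        x = toy_encoder_layer(x)
    memory = x  # output of the top encoder

    # Decoder stack: every decoder layer receives the same encoder output.
    y = target
    for _ in range(NUM_LAYERS):
        y = toy_decoder_layer(y, memory)
    return memory, y

rng = np.random.default_rng(0)
source = rng.normal(size=(5, 8))  # 5 source tokens, 8-dimensional vectors
target = rng.normal(size=(3, 8))  # 3 target tokens, 8-dimensional vectors
memory, output = run_stack(source, target)
print(memory.shape, output.shape)
```

Note that `memory` is computed once and then reused by all six decoder layers, which mirrors the second bullet above.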

### Reference: "Attention Is All You Need" Paper

The foundational paper for Transformers is ["Attention Is All You Need"](https://arxiv.org/abs/1706.03762) (also available as a [NeurIPS 2017 PDF](https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf)). The paper describes:
* Positional Encoding
* Self-Attention
* Multi-Head Attention
* Feed-Forward Networks
* The overall Transformer architecture that we are dissecting.

### Diagram

![High-level Transformer architecture](images/high-level-architecture.png)

---

## 3. Inside a Single Encoder Block

Each of the `N` (e.g., 6) encoder blocks in the Transformer stack has two primary sub-layers:

1.  **Self-Attention Layer**:
    * This is the most critical and novel component.
    * It allows the encoder to weigh the importance of different words in the input sequence relative to each other when processing a specific word.
    * Unlike RNNs, which process words one at a time, self-attention lets every word "look at" all other words in the sequence at once.
    * We will delve into the exact working of self-attention in the next sections.

2.  **Feed-Forward Neural Network Layer**:
    * This is a simple, position-wise fully connected feed-forward network.
    * It's applied independently to each position (each word's representation) in the sequence.
    * It consists of two linear transformations with a ReLU activation in between.
    * Its purpose is to further process the representations produced by the self-attention layer.
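
To make "position-wise" concrete, here is a minimal NumPy sketch of the feed-forward sub-layer. The weights are random and untrained, and the sizes (`d_model = 8`, `d_ff = 32`) are toy values chosen for illustration; the original paper uses 512 and 2048.

```python
import numpy as np

def feed_forward(x, w1, b1, w2, b2):
    # First linear transformation, ReLU, then a second linear transformation.
    hidden = np.maximum(0.0, x @ w1 + b1)
    return hidden @ w2 + b2

rng = np.random.default_rng(0)
d_model, d_ff, seq_len = 8, 32, 4  # toy sizes (assumption)
w1 = rng.normal(size=(d_model, d_ff))
b1 = np.zeros(d_ff)
w2 = rng.normal(size=(d_ff, d_model))
b2 = np.zeros(d_model)

x = rng.normal(size=(seq_len, d_model))  # one row per position

# Applying the network to the whole sequence at once ...
all_at_once = feed_forward(x, w1, b1, w2, b2)
# ... matches applying it to each position separately with the same weights.
one_by_one = np.stack([feed_forward(row, w1, b1, w2, b2) for row in x])

print(np.allclose(all_at_once, one_by_one))
```

Because the same weights are applied to every row and rows never interact, this sub-layer adds no mixing between positions; all mixing happens in self-attention.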

**Data Flow within an Encoder:**
1.  Input words are first converted into **vector embeddings** (using an Embedding Layer), and positional encodings are added to them. This happens only once, before the first encoder.
2.  These vectors are then passed to the **Self-Attention Layer**.
3.  The output of the Self-Attention Layer is passed to the **Feed-Forward Neural Network Layer**.
4.  The output of the Feed-Forward Neural Network is then passed as input to the *next* encoder in the stack (if any).
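
The four steps can be traced with a simplified sketch. Several things here are assumptions made to keep it short: the vocabulary is a toy one, the weights are random, `simple_self_attention` uses the input itself as query, key, and value (no learned projections), and positional encodings, residual connections, layer normalization, and biases are left out.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = {"how": 0, "are": 1, "you": 2}  # toy vocabulary (assumption)
d_model = 8

embedding_table = rng.normal(size=(len(vocab), d_model))

def softmax(scores):
    scores = scores - scores.max(axis=-1, keepdims=True)
    e = np.exp(scores)
    return e / e.sum(axis=-1, keepdims=True)

def simple_self_attention(x):
    # Simplified: x acts as query, key, and value (no learned projections).
    weights = softmax(x @ x.T / np.sqrt(x.shape[-1]))
    return weights @ x

def make_encoder_block():
    # Each block gets its own random, untrained feed-forward weights.
    w1 = rng.normal(size=(d_model, 4 * d_model)) * 0.1
    w2 = rng.normal(size=(4 * d_model, d_model)) * 0.1

    def block(x):
        z = simple_self_attention(x)         # step 2: self-attention
        return np.maximum(0.0, z @ w1) @ w2  # step 3: feed-forward network

    return block

tokens = ["how", "are", "you"]
x = embedding_table[[vocab[t] for t in tokens]]  # step 1: embedding lookup
for block in [make_encoder_block() for _ in range(2)]:
    x = block(x)                                 # step 4: pass to the next encoder
print(x.shape)
```

The shape stays `(sequence length, d_model)` from one block to the next, which is what allows encoder blocks to be stacked.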

### Diagram

![High-level view of an encoder block](images/encoder-high-level.png)

---

## 4. Inside a Single Decoder Block (Brief Mention)

Each decoder block has the same sub-layers as an encoder block plus one additional attention layer, for three in total:

* **Masked Self-Attention Layer**: Similar to the encoder's self-attention, but it's "masked" to prevent attending to future (yet to be generated) words in the output sequence.
* **Encoder-Decoder Attention Layer**: This unique layer allows the decoder to "pay attention" to the output of the *encoder stack*. This is where the input sequence's representation (from the top encoder) is integrated into the output generation process.
* **Feed-Forward Neural Network Layer**: Same as in the encoder.
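
The masking idea can be shown with a small NumPy sketch. It assumes the common approach of setting blocked scores to negative infinity before the softmax; `causal_mask` and `masked_softmax` are illustrative names, and the all-zero scores are a toy input.

```python
import numpy as np

def causal_mask(seq_len):
    # True where attention is allowed: position i may attend to positions 0..i.
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))

def masked_softmax(scores, mask):
    # Blocked positions get -inf, so they receive zero weight after softmax.
    scores = np.where(mask, scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)
    e = np.exp(scores)
    return e / e.sum(axis=-1, keepdims=True)

scores = np.zeros((4, 4))  # toy scores: every pair is equally similar
weights = masked_softmax(scores, causal_mask(4))
print(np.round(weights, 2))
```

With these toy scores, the first row puts all of its weight on the first position, while the last row spreads it evenly (0.25 each) over all four positions. No row places any weight on a later position.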

We will focus more on the decoder's internal workings in later discussions.

---

## 5. The Role of Self-Attention: Fixed vs. Contextual Embeddings

One of the most profound contributions of the self-attention mechanism is its ability to generate **contextual embeddings**. To understand this, let's first review the problem with traditional "fixed" word embeddings.

### The Problem with Fixed Embeddings

* Traditional word embedding techniques (like Word2Vec, GloVe) assign a **single, fixed vector representation** to each word in a vocabulary.
* **Example**: For the sentence "My name is Krish and I play cricket."
    * The word "cricket" would always have the same vector, regardless of its usage. This fails to capture that "cricket" can refer to a sport or an insect.
    * Similarly, "I" would have a fixed vector, regardless of who "I" refers to or what "I" is doing.

This fixed nature means that the embeddings lack **contextual awareness**. The meaning of a word often depends heavily on the words around it.
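
A fixed embedding is essentially a lookup table, which the following sketch illustrates. The table here holds random vectors as a hypothetical stand-in for a trained Word2Vec or GloVe model, and the two sentences are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["i", "play", "cricket", "a", "chirped", "loudly"]  # toy vocabulary
# Hypothetical stand-in for a trained embedding table: one vector per word.
embedding_table = {word: rng.normal(size=4) for word in vocab}

def embed(sentence):
    # A fixed embedding is a pure lookup: neighbouring words are ignored.
    return [embedding_table[w] for w in sentence.lower().split()]

sport = embed("I play cricket")
insect = embed("A cricket chirped loudly")

# "cricket" is the 3rd word of the first sentence and the 2nd of the second.
print(np.array_equal(sport[2], insect[1]))  # True: identical vector in both
```

The sport and the insect receive exactly the same vector, because nothing in the lookup depends on the rest of the sentence.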

### How Self-Attention Creates Contextual Embeddings

The self-attention mechanism within the Transformer's encoder (and decoder) directly addresses this limitation:

1.  **Input Vector**: When a word (e.g., "how" in the sentence "how are you") enters the self-attention layer, it initially has a fixed vector representation (let's call it $V_{how}$).
2.  **Interaction with Other Words**: The self-attention mechanism allows this word's vector to interact with the vectors of *all other words* in the same input sequence (here, "are" and "you").
3.  **Calculating Relationships**: It calculates "attention scores" that indicate how much each word in the sequence should "attend to" or influence the representation of the current word.
4.  **Creating a New Vector (Contextual Vector)**: Based on these attention scores, a *new vector* is computed for the word. Let's call this $Z_{how}$.
    * This new vector $Z_{how}$ is not fixed. It is a **contextual vector** because its value is derived by considering the relationships and context provided by "are" and "you" (and all other words in the sentence).
    * For example, the contextual vector for "bank" in "river bank" will be different from its contextual vector in "bank account" because of how it interacts with "river" vs. "account" through self-attention.
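
The "bank" example can be reproduced with a deliberately simplified version of self-attention. This sketch assumes random toy embeddings and skips the learned query, key, and value projections that the real mechanism uses, so it shows only the core idea: each output vector is a weighted mix of all input vectors in the sentence.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
# Toy fixed embeddings (random, for illustration only).
fixed = {w: rng.normal(size=d) for w in ["river", "bank", "account"]}

def softmax(scores):
    scores = scores - scores.max(axis=-1, keepdims=True)
    e = np.exp(scores)
    return e / e.sum(axis=-1, keepdims=True)

def contextualize(sentence):
    x = np.stack([fixed[w] for w in sentence])  # fixed input vectors
    scores = x @ x.T / np.sqrt(d)               # attention scores
    weights = softmax(scores)                   # each row sums to 1
    return weights @ x                          # contextual vectors

z_river_bank = contextualize(["river", "bank"])[1]      # "bank" next to "river"
z_bank_account = contextualize(["bank", "account"])[0]  # "bank" next to "account"

# Same input vector for "bank", different contextual vectors.
print(np.allclose(z_river_bank, z_bank_account))
```

The two contextual vectors for "bank" differ because one is mixed with the vector for "river" and the other with the vector for "account".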

**Why Contextual Vectors are Crucial**:
* They capture the nuanced meaning of a word within its specific sentence.
* This rich, context-aware representation is essential for understanding complex language and is a key reason why Transformers excel with long and intricate sentences.

---

## 6. Key Advantages Revisited: Parallelization and Scalability

The self-attention mechanism is fundamental to achieving two major advantages of Transformers over traditional RNNs:

1.  **Parallel Processing**:
    * Since self-attention computes relationships between all words simultaneously (rather than sequentially), **all words in the input sentence can be passed to the self-attention layer *in parallel***.
    * This is a drastic shift from RNNs, where words must be processed one at a time.
    * **Impact**: Enables significantly faster training times, especially for very large datasets and models.

2.  **Scalability**:
    * The ability to parallelize computations makes the Transformer architecture inherently **scalable** for training on massive datasets.
    * This has been a critical factor in the development of today's largest and most powerful NLP models (e.g., BERT, GPT-3, LLaMA).
    * When the data size is huge, Transformers can leverage distributed computing to train much more efficiently than RNN-based models.
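
The difference in data dependencies can be seen in a short NumPy sketch. Both halves use random toy weights, and the attention half again omits learned projections; the point is only the structure of the computation, not a performance benchmark.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d = 6, 8
x = rng.normal(size=(seq_len, d))
w_h = rng.normal(size=(d, d)) * 0.1
w_x = rng.normal(size=(d, d)) * 0.1

# RNN-style: step t needs the hidden state from step t-1, so the loop
# cannot be parallelized across positions.
h = np.zeros(d)
rnn_states = []
for t in range(seq_len):
    h = np.tanh(h @ w_h + x[t] @ w_x)
    rnn_states.append(h)
rnn_states = np.stack(rnn_states)

# Self-attention-style: all pairwise scores come from one matrix product,
# with no dependency between positions.
scores = x @ x.T / np.sqrt(d)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
attended = weights @ x

print(rnn_states.shape, attended.shape)
```

Both paths produce one vector per position, but only the second is expressed as matrix products over the whole sequence, which is the kind of operation GPUs execute efficiently.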

---

## 7. Upcoming: Deep Dive into Self-Attention Mechanism

This notebook has laid the groundwork by introducing the overall Transformer architecture and highlighting the critical role of self-attention and contextual embeddings.

In the **next notebook**, we will delve much deeper into the **exact mathematical workings of the self-attention module**. We will understand:
* How the fixed input vectors are transformed into contextual vectors.
* The concepts of Query, Key, and Value vectors.
* The scaling factor and softmax operation within self-attention.
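
For orientation, these pieces come together in the scaled dot-product attention formula from the "Attention Is All You Need" paper, where $Q$, $K$, and $V$ are the query, key, and value matrices and $d_k$ is the dimension of the key vectors:

$$
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V
$$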

Understanding these details is crucial to fully grasp how Transformers process information and achieve their remarkable capabilities.