## Transformers in Deep Learning

Transformers are a type of **neural network architecture** introduced in the 2017 paper "Attention Is All You Need" by Vaswani et al. They excel at processing sequential data by using a mechanism called `attention` to weigh the importance of different input elements, which enables parallel processing and captures long-range dependencies in the data.

They revolutionized **Natural Language Processing (NLP)** by allowing models to process entire sequences (like sentences or paragraphs) `in parallel`, rather than step-by-step like RNNs or LSTMs.

This architecture forms the foundation of many modern AI advancements, including `large language models (LLMs)` like **GPT**, and has expanded beyond text to applications in image processing, audio generation, and even biology. 

### Key Idea: Attention Mechanism

Instead of processing words in order, Transformers use `self-attention` to figure out which words are important to each other, no matter their position in a sentence.

Example: in the sentence "The cat sat on the mat because it was tired", the word `it` should focus on `cat` for its meaning to be resolved correctly.

Self-attention scores every pair of words in the sequence at once, so each word's representation can draw on context from the whole sentence.
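To make the idea concrete, here is a minimal sketch of scaled dot-product self-attention in plain Python. The token vectors are invented toy values, and the queries, keys, and values are all set to the same vectors for simplicity; a trained model would instead derive them from learned projections of the embeddings.

```python
import math

def softmax(scores):
    """Turn a list of raw scores into weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(queries, keys, values):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Score this query against every key in the sequence.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # The output is the attention-weighted average of the value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

# Toy 2-dimensional vectors, invented for illustration only.
tokens = ["cat", "mat", "it"]
x = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]

# Assumption: Q = K = V = x; a real model learns separate projections.
outputs, weights = self_attention(x, x, x)
for tok, row in zip(tokens, weights):
    print(tok, [round(w, 2) for w in row])
```

Each printed row shows how strongly one token attends to every token in the sequence. With these toy vectors, `it` gives more weight to `cat` than to `mat` simply because its vector was chosen to point in a similar direction.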

### Transformer Architecture
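The original Transformer is built from a stack of `Encoder` blocks and a stack of `Decoder` blocks. Some later models keep only one of the two: BERT is encoder-only, while GPT is decoder-only.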

#### 1. Encoder
- Takes the input sequence (e.g., a sentence in English).
- Uses:
    - **Multi-Head Self-Attention:** Learns relationships between words.
    - **Feed-Forward Network:** Processes the attention output at each position.
    - **Positional Encoding:** Added to the input embeddings to supply word order information (self-attention on its own ignores position).
- The encoder outputs a context-rich representation of the input.
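Positional encoding can be implemented in several ways. The sketch below uses the sinusoidal scheme from the original paper, which is only one option (many models learn position embeddings instead). The function name and the all-zero toy embeddings are illustrative assumptions.

```python
import math

def positional_encoding(seq_len, d_model):
    """Sinusoidal encoding: sine on even dimensions, cosine on odd ones."""
    table = []
    for pos in range(seq_len):
        row = []
        for i in range(d_model):
            # Dimensions 2k and 2k+1 share the same frequency.
            angle = pos / (10000 ** ((2 * (i // 2)) / d_model))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table

pe = positional_encoding(seq_len=4, d_model=6)

# Toy embeddings (all zeros), so the sum below shows the encoding itself.
embeddings = [[0.0] * 6 for _ in range(4)]

# The encoding is added element-wise to the token embeddings.
encoder_input = [
    [e + p for e, p in zip(emb_row, pe_row)]
    for emb_row, pe_row in zip(embeddings, pe)
]
```

Because every position gets a different vector, two identical words at different positions enter the encoder with different representations.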

#### 2. Decoder
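- Takes the encoder’s output, together with the tokens generated so far, and produces the target sequence (e.g., a translation into French).
- Uses:
    - **Masked Multi-Head Self-Attention:** Lets each position attend only to earlier positions, so the model predicts the next word without "seeing the future."
    - **Encoder-Decoder (Cross) Attention:** Lets each target position focus on the relevant encoder outputs.
    - **Feed-Forward Network:** Processes the attention output, as in the encoder.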
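The masking used in the decoder can be sketched as follows. This is a minimal illustration with hypothetical helper names and uniform toy scores: blocked positions are set to negative infinity before the softmax, so they receive zero weight.

```python
import math

def causal_mask(seq_len):
    """True where position i may attend to position j (only j <= i)."""
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]

def masked_softmax(scores, allowed):
    """Softmax over allowed positions only; blocked ones get weight 0."""
    masked = [s if ok else -math.inf for s, ok in zip(scores, allowed)]
    peak = max(masked)  # finite, since a position can always see itself
    exps = [math.exp(s - peak) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]

seq_len = 4
mask = causal_mask(seq_len)

# Uniform toy scores, so the effect of the mask is easy to read.
scores = [[0.0] * seq_len for _ in range(seq_len)]
weights = [masked_softmax(row, allowed) for row, allowed in zip(scores, mask)]

for row in weights:
    print([round(w, 2) for w in row])
```

The first position can attend only to itself, the second to the first two positions, and so on, which is what keeps later tokens hidden during training.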

![image.png](attachment:image.png)


### Key Components

| Component                | Purpose                                                                              |
| ------------------------ | ------------------------------------------------------------------------------------ |
| **Self-Attention**       | Finds relationships between all words in a sequence.                                 |
| **Multi-Head Attention** | Runs several attention heads in parallel to capture different types of relationships. |
| **Feed-Forward Layers**  | Apply the same small network to each position to refine the attention output.        |
| **Positional Encoding**  | Injects order information.                                                           |
| **Layer Normalization**  | Stabilizes training by normalizing the activations at each position.                 |
| **Residual Connections** | Add each sub-layer's input to its output, easing gradient flow through deep stacks.  |
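The last two rows of the table usually appear together as an "add and norm" step around each sub-layer. The sketch below assumes the post-norm arrangement of the original paper (many later models normalize before the sub-layer instead) and omits the learnable scale and bias that real layer normalization includes. The doubling sub-layer is a hypothetical stand-in for attention or a feed-forward network.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize a vector to roughly zero mean and unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def add_and_norm(x, sublayer):
    """Post-norm residual block: LayerNorm(x + sublayer(x))."""
    residual_sum = [a + b for a, b in zip(x, sublayer(x))]
    return layer_norm(residual_sum)

x = [1.0, 2.0, 3.0, 4.0]

# Hypothetical sub-layer that simply doubles its input.
out = add_and_norm(x, lambda v: [2.0 * a for a in v])
print([round(v, 3) for v in out])
```

Because the input `x` is added back before normalization, the sub-layer only has to learn a correction to its input rather than a full replacement for it.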

### How They Work
1. **Tokenization and Embedding:** Input data (like text) is broken down into smaller units called `tokens`. Each token is then mapped to a high-dimensional vector, or "embedding", that represents its meaning, and positional information is added.
2. **Encoding:** The encoder layers process these embeddings, using `self-attention` to relate tokens to one another and produce richer, contextualized representations.
3. **Decoding:** The decoder uses these encoded representations to generate the output sequence one token at a time, with each step conditioned on the tokens generated so far.
4. **Parallel Processing:** Because attention over all positions is computed with matrix operations rather than a step-by-step recurrence, training parallelizes across the whole sequence and is significantly faster than with older, sequential models like RNNs. Generation at inference time still proceeds one token at a time.
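The flow of the steps above can be sketched with a toy example. Everything here is a hypothetical stand-in: the vocabulary and embedding values are invented, the tokenizer just splits on whitespace (real models typically use subword tokenizers), and `toy_next_token` replays a fixed script where a trained decoder, attending to the encoder output, would predict the next token.

```python
# Hypothetical toy vocabulary and embedding table, for illustration only.
vocab = {"<bos>": 0, "<eos>": 1, "the": 2, "cat": 3, "sat": 4}
embedding_table = [[0.1 * i, 0.2 * i, 0.3 * i] for i in range(len(vocab))]

def tokenize(text):
    """Whitespace tokenizer mapping each word to its vocabulary id."""
    return [vocab[word] for word in text.lower().split()]

def embed(token_ids):
    """Look up one embedding vector per token id."""
    return [embedding_table[t] for t in token_ids]

def toy_next_token(generated):
    """Stand-in for a trained decoder: replays a fixed script of ids."""
    script = [2, 3, 4, 1]
    return script[len(generated) - 1]

def greedy_decode(max_len=10):
    """Generate one token at a time until <eos> or max_len is reached."""
    generated = [vocab["<bos>"]]
    while len(generated) < max_len:
        next_id = toy_next_token(generated)
        generated.append(next_id)
        if next_id == vocab["<eos>"]:
            break
    return generated

token_ids = tokenize("The cat sat")
vectors = embed(token_ids)
decoded = greedy_decode()
print(token_ids, len(vectors), decoded)
```

The loop in `greedy_decode` is the part that stays sequential at inference time: each new token depends on everything generated before it.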

### Advantages of Transformers
- **Parallelization** – Process all tokens simultaneously (faster training than RNNs).
- **Long-Range Dependencies** – Captures relationships between distant words.
- **Scalability** – Powers massive models (GPT, BERT, T5).
- **Versatility** – Works for text, images (Vision Transformers), audio, and even protein folding.

### Applications of Transformers
Some of the applications of transformers are:

- **NLP Tasks:** Transformers are used for machine translation, text summarization, named entity recognition and sentiment analysis.
- **Speech Recognition:** They process audio signals to convert speech into transcribed text.
- **Computer Vision:** Transformers are applied to image classification, object detection and image generation.
- **Recommendation Systems:** They model sequences of user interactions to provide personalized recommendations.
- **Text and Music Generation:** Transformers are used for generating text like articles and composing music.

### Real-World Examples
- **GPT** (e.g., ChatGPT) → Text generation.
- **BERT** → Text understanding (search engines, sentiment analysis).
- **ViT (Vision Transformer)** → Image recognition.
- **Whisper** → Speech recognition.