# 🤖 Week 7: Transformers for Generative Tasks

---

## ⚙️ What Are Transformers?

Transformers are **deep learning architectures** based on **self-attention mechanisms**, first introduced in the paper *“Attention Is All You Need”* (Vaswani et al., 2017).

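They reshaped **sequence modeling** by processing all positions of a sequence **in parallel** during training (unlike recurrent networks), modeling **long-range dependencies** directly through attention, and **scaling** well with data and model size.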

---

## 🧠 Core Concepts Recap

- **Self-Attention**: Allows the model to weigh the importance of each part of the input sequence dynamically.
- **Positional Encoding**: Transformers lack recurrence and self-attention is itself order-agnostic, so position encodings are added to the token embeddings to capture sequence order.
- **Encoder-Decoder Architecture**:
  - Encoder: Maps the input sequence to contextual representations
  - Decoder: Generates the output sequence token by token while attending to the encoder output (used in tasks like translation and summarization)
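
The first two concepts above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not a production implementation: it uses a single attention head with no learned projections (queries, keys, and values are all the input itself), and the function names `positional_encoding` and `self_attention` are chosen here for illustration. The `causal` flag shows the masking a decoder applies so that a position cannot attend to later positions.

```python
import math


def positional_encoding(seq_len, d_model):
    """Sinusoidal position encodings in the style of Vaswani et al. (2017)."""
    pe = [[0.0] * d_model for _ in range(seq_len)]
    for pos in range(seq_len):
        for i in range(0, d_model, 2):
            # Even dimensions use sine, odd dimensions use cosine of the same angle.
            angle = pos / (10000 ** (i / d_model))
            pe[pos][i] = math.sin(angle)
            if i + 1 < d_model:
                pe[pos][i + 1] = math.cos(angle)
    return pe


def softmax(scores):
    """Numerically stable softmax over a list of floats (may contain -inf)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x, causal=False):
    """Scaled dot-product self-attention with Q = K = V = x (assumption: no learned weights).

    Returns (outputs, weights), where weights[i][j] is how much position i attends to j.
    """
    n, d = len(x), len(x[0])
    outputs, weights = [], []
    for i in range(n):
        scores = []
        for j in range(n):
            if causal and j > i:
                scores.append(float("-inf"))  # mask out future positions
            else:
                dot = sum(a * b for a, b in zip(x[i], x[j]))
                scores.append(dot / math.sqrt(d))
        w = softmax(scores)
        weights.append(w)
        # Each output is a weighted average of all (visible) input vectors.
        outputs.append([sum(w[j] * x[j][c] for j in range(n)) for c in range(d)])
    return outputs, weights


# Toy input: 3 tokens with 4-dimensional embeddings (made-up values).
embeddings = [[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0], [1.0, 1.0, 0.0, 0.0]]
pe = positional_encoding(len(embeddings), 4)
x = [[e + p for e, p in zip(row, pos)] for row, pos in zip(embeddings, pe)]
outputs, weights = self_attention(x, causal=True)
for row in weights:
    print([round(w, 3) for w in row])
```

With `causal=True`, the first token can only attend to itself, so its weight row puts all of its mass on position 0. Real models add learned query, key, and value projections and run several heads in parallel.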

---

## 🪄 Why Transformers for Generation?

Transformers excel in generative tasks due to:
- **Scalability to large data and models**
- **Effective modeling of sequential and contextual information**
- **Generalization across modalities (text, image, audio)**

---

## 📝 Text Generation

| Model          | Description |
|----------------|-------------|
| **GPT (OpenAI)**   | Autoregressive decoder-only transformer trained to predict next token |
| **BERT (Google)**  | Encoder-only model trained with masked language modeling; good for understanding but not open-ended generation (listed for contrast) |
| **T5 (Text-to-Text Transfer Transformer)** | Unified text-to-text framework, supports summarization, Q&A, translation |
| **BART**         | Encoder-decoder trained with denoising objectives, good for summarization and translation |
| **LLaMA, PaLM, Mistral** | Large-scale decoder-only language models for versatile generation; LLaMA and Mistral have openly released weights, PaLM is closed |

> 🔁 The generative models above (all except BERT) predict output **one token at a time**, conditioning each prediction on the previously generated tokens (autoregressive decoding).
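
The token-by-token loop can be sketched as follows. This is a toy example, not any real model's API: `next_token_scores` is a hypothetical stand-in for a trained network, backed by a hand-written lookup table, and decoding is greedy (always pick the highest-scoring token).

```python
def next_token_scores(context):
    """Hypothetical stand-in for a trained model.

    A real transformer conditions on the whole context; this toy version
    only looks at the last token so the example stays readable.
    """
    table = {
        "<bos>": {"the": 0.9, "a": 0.1},
        "the": {"cat": 0.6, "dog": 0.4},
        "cat": {"sat": 0.7, "<eos>": 0.3},
        "sat": {"<eos>": 1.0},
    }
    return table.get(context[-1], {"<eos>": 1.0})


def generate(prompt, max_new_tokens=10):
    """Greedy autoregressive decoding: append the best next token until <eos>."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        scores = next_token_scores(tokens)       # condition on what exists so far
        token = max(scores, key=scores.get)      # greedy choice
        tokens.append(token)                     # feed the choice back in
        if token == "<eos>":
            break
    return tokens


print(generate(["<bos>"]))  # ['<bos>', 'the', 'cat', 'sat', '<eos>']
```

The key point is the feedback: each chosen token is appended to the context before the next prediction is made, which is why generation is sequential even though training is parallel.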

---

## 🧠 Image Generation with Transformers

Transformers have also been adapted for **image synthesis**:

| Model             | Highlights |
|------------------|------------|
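| **DALL·E**       | Autoregressive transformer over text and image tokens; the original (2021) version tokenizes images with a discrete VAE, a VQ-VAE-style model |
| **Imagen (Google)** | Text-to-image diffusion model conditioned on embeddings from a frozen T5 text encoder |
| **MaskGIT, Parti** | Image generation over discrete image tokens: MaskGIT fills in masked tokens in parallel over a few iterations, Parti decodes autoregressively |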
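
Models in the table that operate on discrete image tokens rely on a tokenizer that maps image patches to entries in a codebook. The lookup step can be sketched as below, assuming nearest-neighbor vector quantization as in VQ-style tokenizers. The codebook and patch values are made up for illustration; in a real system both the patch encoder and the codebook are learned, and the functions `quantize` and `reconstruct` are names chosen here, not a library API.

```python
def quantize(patches, codebook):
    """Map each patch vector to the index of its nearest codebook entry."""
    tokens = []
    for patch in patches:
        # Squared Euclidean distance from this patch to every codebook vector.
        dists = [sum((a - b) ** 2 for a, b in zip(patch, code)) for code in codebook]
        tokens.append(dists.index(min(dists)))
    return tokens


def reconstruct(tokens, codebook):
    """Turn token ids back into (approximate) patch vectors."""
    return [codebook[t] for t in tokens]


# Hypothetical 3-entry codebook and 3 already-encoded patch vectors.
codebook = [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
patches = [[0.1, 0.0], [0.9, 1.2], [0.2, 0.8]]

tokens = quantize(patches, codebook)
print(tokens)  # [0, 1, 2]
print(reconstruct(tokens, codebook))
```

Once an image is a sequence of integer ids like this, a transformer can model it with the same next-token or masked-token objectives used for text.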

---

## 🔊 Audio & Multimodal Generation

Transformers are also used in:
- **Speech synthesis** (e.g., FastSpeech, SpeechT5)
- **Music generation** (e.g., Music Transformer, Jukebox)
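- **Multimodal generation** (e.g., Flamingo, Gato); CLIP, a contrastive image-text model, is not generative itself but is often used to guide or condition generators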

---

## 📈 Key Applications

| Domain             | Use Case Examples                              |
|--------------------|------------------------------------------------|
| 🧾 Text             | Story generation, summarization, translation   |
| 🎨 Vision           | Text-to-image, inpainting, style transfer      |
| 🗣️ Audio            | Text-to-speech, music composition              |
| 🌐 Multimodal       | Image captioning, video narration              |
| 🧪 Code Generation  | Code completion and synthesis (e.g., GitHub Copilot, ChatGPT) |

---

## 📚 Notable Papers & Resources

- [Attention Is All You Need](https://arxiv.org/abs/1706.03762)
- [GPT-3: Language Models are Few-Shot Learners](https://arxiv.org/abs/2005.14165)
- [T5: Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer](https://arxiv.org/abs/1910.10683)
- [DALL·E](https://openai.com/dall-e)
- [Imagen](https://imagen.research.google/)

---

## 💬 Discussion Prompt

> How do decoder-only transformers differ from encoder-decoder transformers in generative tasks?  
> What challenges arise when using transformers for high-resolution image or video generation?

---

## ✅ Summary

- Transformers have become the **backbone of generative AI** across modalities.
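- Decoder-only models (e.g., GPT) generate by continuing a prompt, while encoder-decoder models (e.g., T5, BART) first encode an input and then decode an output, which suits tasks like translation and summarization.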
- Combined with tokenization and pretraining strategies, they enable powerful generation tasks in **text, vision, audio, and beyond**.
