# 🎥 Week 9: Multimodal Generative AI

---

## 🌐 What is Multimodal Generative AI?

**Multimodal Generative AI** refers to models that can understand and generate data across multiple modalities such as:

- 📝 Text
- 🖼️ Images
- 🔊 Audio
- 🎞️ Video
- 🧑‍🤝‍🧑 Human interactions (e.g., gestures, speech)

These models **bridge multiple data types** and enable rich generative experiences, such as generating an image from a text prompt or captioning a video.

---

## 🧠 Why Multimodal?

Real-world data is multimodal. Humans naturally **perceive, communicate, and understand through multiple senses**. Multimodal generative AI aims to mirror this by pairing modalities, for example:

| Modalities   | Task Example                                       |
|--------------|----------------------------------------------------|
| Text → Image | Text-to-image generation (e.g., DALL·E)            |
| Image → Text | Image captioning, visual question answering (VQA)  |
| Audio ↔ Text | Speech-to-text, text-to-speech voice generation    |
| Video → Text | Video narration, scene description                 |
| Text → Code  | Generating UI code from a text description         |

---

## 🧪 Key Multimodal Models

| Model          | Modality Support | Capabilities                            |
|----------------|------------------|-----------------------------------------|
| **CLIP**       | Text + Image     | Learns joint embedding space            |
| **DALL·E 2**   | Text → Image     | High-resolution image generation        |
| **Flamingo**   | Text + Image     | Visual question answering, captioning   |
| **Gato**       | Text, images, actions | Generalist agent: one model for captioning, dialogue, games, and robot control |
| **BLIP / BLIP-2** | Image + Text  | Captioning, visual Q&A, prompted image-to-text generation |
| **Stable Diffusion** | Text → Image | Openly released latent diffusion model for image generation |
| **Make-A-Video (Meta)** | Text → Video | Short video generation              |
| **Gemini**     | Text, image, audio, video | Google's natively multimodal model family; the Bard assistant was later renamed Gemini |
| **LLaVA**      | Vision + LLM     | Visual + language assistant             |

---

## 🔄 How It Works

Multimodal models either:
- **Fuse different modalities** in a shared representation space (e.g., joint embeddings)
- **Generate one modality from another** (e.g., image from text using diffusion models)
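
To make the first idea concrete, here is a minimal sketch of retrieval in a shared embedding space. The vectors are hand-written stand-ins for what an image encoder and a text encoder might produce (they are hypothetical, not outputs of any real model), and `best_match` is an illustrative helper name. The point is only that once both modalities live in the same space, matching a caption to an image reduces to a similarity search.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings standing in for image-encoder outputs.
image_embeddings = {
    "dog_photo": [0.9, 0.1, 0.0],
    "cat_photo": [0.1, 0.9, 0.0],
    "car_photo": [0.0, 0.1, 0.9],
}

# Hypothetical text-encoder output for the prompt "a photo of a dog".
text_embedding = [0.8, 0.2, 0.1]

def best_match(query, candidates):
    """Return the name of the candidate closest to the query vector."""
    return max(candidates, key=lambda name: cosine_similarity(query, candidates[name]))

print(best_match(text_embedding, image_embeddings))  # prints "dog_photo"
```

In a real system the two encoders would be trained so that matching image-text pairs land close together; the search step itself stays this simple.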

They typically use:
- **Cross-attention mechanisms**
- **Multimodal encoders & decoders**
- **Adapters or projection layers** that map modality-specific features into a shared token space
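
Cross-attention, the first mechanism listed above, can be sketched in a few lines. This is a simplified, single-head version in plain Python: queries come from one modality (here, text tokens) and keys and values come from another (here, image patches). Real models also apply learned projection matrices to queries, keys, and values; those are left out here as a simplifying assumption, and the toy vectors are made up for illustration.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys, values):
    """Scaled dot-product attention: each query attends over all keys/values.

    queries: vectors from one modality (e.g., text tokens)
    keys, values: vectors from another modality (e.g., image patches)
    """
    d = len(keys[0])
    outputs = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(dimension).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        # Weighted average of the value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, values))
               for j in range(len(values[0]))]
        outputs.append(out)
    return outputs

# Toy inputs: 2 text tokens attending over 3 image patches (all made up).
text_tokens = [[1.0, 0.0], [0.0, 1.0]]
image_patches = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
fused = cross_attention(text_tokens, image_patches, image_patches)
print(fused)
```

Each output row is a blend of image-patch vectors weighted by how relevant each patch is to the corresponding text token, which is how information flows from one modality into the other.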

---

## 🖇️ Common Architectures

- **Two-Tower Models**: Separate encoders for each modality whose outputs are compared in a shared embedding space (e.g., CLIP)
- **Unified Transformers**: One model processes all inputs as a single token sequence, using modality-specific embeddings
- **Encoder-Decoder + Diffusion**: Encodes the input (e.g., text) and uses it to condition a diffusion model that generates the target modality (image/video)
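
As a rough illustration of how a two-tower model can be trained, the sketch below computes a symmetric contrastive loss over a small batch of paired image and text embeddings, in the spirit of CLIP. It is a plain-Python approximation under stated assumptions: the embeddings are made-up vectors rather than encoder outputs, the function names are illustrative, and the temperature of 0.07 is a fixed example value rather than a learned parameter.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def cross_entropy_row(logits, target):
    """Negative log-softmax probability of the target index."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum - logits[target]

def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss; pair k of images matches pair k of texts."""
    imgs = [normalize(v) for v in image_embs]
    txts = [normalize(v) for v in text_embs]
    n = len(imgs)
    # Cosine similarity of every image with every text, sharpened by temperature.
    logits = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txts]
              for i in imgs]
    # Image-to-text: each row should peak on its own column.
    loss_i2t = sum(cross_entropy_row(logits[k], k) for k in range(n)) / n
    # Text-to-image: each column should peak on its own row.
    loss_t2i = sum(cross_entropy_row([logits[r][k] for r in range(n)], k)
                   for k in range(n)) / n
    return (loss_i2t + loss_t2i) / 2

# Made-up batch: aligned pairs versus the same texts in the wrong order.
images = [[1.0, 0.0], [0.0, 1.0]]
aligned_texts = [[1.0, 0.0], [0.0, 1.0]]
shuffled_texts = [[0.0, 1.0], [1.0, 0.0]]

print(contrastive_loss(images, aligned_texts))   # close to zero
print(contrastive_loss(images, shuffled_texts))  # much larger
```

Minimizing this loss pulls matching image-text pairs together and pushes mismatched pairs apart, which is what makes the shared embedding space usable for retrieval and zero-shot classification.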

---

## 📈 Applications

| Domain           | Use Case                               |
|------------------|----------------------------------------|
| Creativity       | Text-to-image, music generation         |
| Education        | Visual explanations, multimodal tutors |
| Accessibility    | Image captioning for the visually impaired |
| Healthcare       | Radiology reports from images           |
| Social Media     | Auto-captioning, meme generation        |
| E-commerce       | Virtual try-on, image-based search      |

---

## 📚 Notable Resources

- [CLIP (OpenAI)](https://openai.com/research/clip)
- [DALL·E](https://openai.com/dall-e)
- [BLIP (Salesforce)](https://github.com/salesforce/BLIP)
- [Meta's Make-A-Video](https://makeavideo.studio/)
- [Google's Gemini](https://deepmind.google/technologies/gemini/)

---

## 💬 Discussion Prompt

> How do you think multimodal models will impact creative industries like design or filmmaking?  
> What challenges arise when aligning meaning across different modalities?

---

## ✅ Summary

- Multimodal generative AI opens up new frontiers by integrating vision, text, audio, and beyond.
- These models enable creative, conversational, and assistive applications.
- Key advances include shared embeddings, transformer fusion, and diffusion-based generation.

