# Project 5: **Build a Multi-Modal Generation Agent**

Welcome to the final project! You'll start by using open-source text-to-image and text-to-video models to generate content. Then you'll build a **unified multi-modal agent**, similar to modern chatbots, in which a single agent handles general questions, image generation requests, and video generation requests.

By the end of this project, you'll understand how to integrate multiple model types under one routing system that decides which modality to use based on the user's intent.



## Learning Objectives

* Use **Text-to-Image** models to generate images from text prompts
* Generate short clips with a **Text-to-Video** model
* Build a **Multi-Modal Agent** that answers questions and routes media requests
* Build a simple **Gradio** UI and interact with the multi-modal agent

## Roadmap
1. Environment setup
2. Text-to-Image
3. Text-to-Video
4. Multimodal Agent
5. Gradio UI
6. Celebrate

## 1 - Environment Setup

In this project, we'll use open-source Text-to-Image and Text-to-Video models to generate visuals from natural-language prompts. These models are computationally heavy and perform best on GPUs, so we recommend running this notebook in Google Colab or another GPU-enabled environment. We'll load all models from Hugging Face, which requires authentication.

Before continuing:
1. Open this project in Google Colab. [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/bytebyteai/ai-engineering-cohort-3/blob/main/project_5/multimodal_agent.ipynb)
2. Create a Hugging Face account and generate an access token at huggingface.co/settings/tokens
3. Paste your token in the field below.
4. In the Colab environment, enable GPU acceleration by selecting Runtime → Change runtime type → a GPU-enabled machine (e.g., T4 GPU).

In [None]:
from huggingface_hub import login
import os

HF_ACCESS_TOKEN = "YOUR_TOKEN_HERE"

login(token=HF_ACCESS_TOKEN)
os.environ["HF_TOKEN"] = HF_ACCESS_TOKEN

Let's import the required libraries and confirm that PyTorch can detect the available GPU.

In [None]:
import torch, diffusers, transformers, os, random, gc
print('torch', torch.__version__, '| CUDA:', torch.cuda.is_available())

## 2 - Text-to-Image (T2I)
T2I models translate natural-language descriptions into images. They are typically based on diffusion models, which gradually refine random noise into a coherent picture guided by the text prompt. In this section, you'll load and test T2I models to generate images directly from text inputs.

### 2.1: Load a T2I Model
We'll use **Stable Diffusion v1.5**, a well-established **UNet-based** diffusion model that runs comfortably on Google Colab's free GPU tier. It produces solid 512x512 images and is a great starting point for learning the diffusion pipeline.

If you have access to a more powerful machine or a paid Colab tier with extra RAM, you can explore these newer, larger alternatives:

- **SSD-1B** by Segmind (`segmind/SSD-1B`): a distilled version of Stable Diffusion XL (SDXL). Also **UNet-based**, but produces higher-resolution images; its weights take about 2.5 GB in fp16. Requires more GPU memory than the free Colab tier typically provides.
- **PixArt-Sigma** by PixArt-alpha (`PixArt-alpha/PixArt-Sigma-XL-2-512-MS`): a newer model based on a **Diffusion Transformer (DiT)** architecture with 0.6B parameters. DiT replaces the UNet backbone with a transformer, which can improve scalability and image quality, but also demands more memory.

Call `enable_attention_slicing()` on the loaded pipeline to lower peak GPU memory usage and reduce the risk of out-of-memory (OOM) crashes when multiple models coexist.

You'll load the model from Hugging Face using the diffusers library. To learn more: https://huggingface.co/docs/diffusers/main/index

In [None]:
from diffusers import DiffusionPipeline

# Load `stable-diffusion-v1-5` model using DiffusionPipeline.
"""
YOUR CODE HERE (~2 lines of code)
"""

### 2.2: Generate an image

In [None]:
# Generate and display an image from a text prompt using the loaded pipeline
"""
YOUR CODE HERE (~3 lines of code)
"""

### 2.3: Experimenting with `num_inference_steps`

The number of inference steps determines how many refinement passes the diffusion model makes. Fewer steps give quicker but less detailed images, while more steps improve clarity and structure at the cost of speed.

Try generating images with different step counts and compare the results.

In [None]:
# Generate an image for different values of num_inference_steps (e.g., 10, 25, 50) and compare sharpness and detail

import matplotlib.pyplot as plt

images = []  # fill with (steps, image) tuples, as expected by the plotting loop below

"""
YOUR CODE HERE (~5 lines)
"""

# Plot results side-by-side
plt.figure(figsize=(12, 4))
for i, (steps, img) in enumerate(images, 1):
    plt.subplot(1, len(images), i)
    plt.imshow(img)
    plt.axis("off")
    plt.title(f"{steps} steps")
plt.tight_layout()
plt.show()


### 2.4 (Optional): Visualizing the Diffusion Process
Diffusion models start from random noise and iteratively refine it into an image that matches the prompt. If you are curious, visualize all intermediate steps to see how the noise gradually turns into a coherent picture.

In [None]:
import torch
import matplotlib.pyplot as plt

# Step 1: Run the pipeline with 30 inference steps
# Step 2: Capture intermediate latents during generation using a callback
# Step 3: Decode selected latents through the VAE and plot them

"""
YOUR CODE HERE (~15 lines)
"""


### 2.5 (Optional): Experiment with other models
Different text-to-image models vary in speed, style, and visual quality. If you have access to more powerful machines, try swapping in other open-source diffusion models and compare how their outputs differ in detail, realism, or artistic tone.

You can browse available models on Hugging Face here: https://huggingface.co/models?library=diffusers

In [None]:
# Step 1: Replace model_id with another text-to-image model from Hugging Face
# Step 2: Reload the pipeline and generate a few test images
# Step 3: Compare image quality, color balance, and prompt fidelity
"""
YOUR CODE HERE
"""

## 3 - Text-to-Video (T2V)
T2V models extend the idea of diffusion from still images to moving sequences. Instead of generating one frame, they create a series of coherent frames that depict motion consistent with the text prompt. These models are computationally heavier and often generate short clips (typically 2-10 seconds).

In this section, you'll load an open-source video diffusion model and generate videos.

### 3.1: Load a T2V model
We'll use **damo-vilab 1.7B** (`damo-vilab/text-to-video-ms-1.7b`), a **UNet-based** video diffusion model that generates 256x256 clips. It is lightweight enough to run on Google Colab's free GPU tier, making it a practical choice for learning text-to-video generation.

If you have access to more powerful hardware or a paid Colab tier, you can explore this newer alternative:

- **Wan 2.1 T2V 1.3B** by Alibaba (`Wan-AI/Wan2.1-T2V-1.3B-Diffusers`): a modern **DiT-based** video model that generates **480p** video with significantly better motion coherence. It uses a specialized 3D causal VAE to reduce flickering. While the parameter count is comparable, the DiT architecture and higher output resolution require substantially more GPU memory than the free Colab tier provides.

In [None]:
# Free cached GPU tensors before loading the video model.
gc.collect()
torch.cuda.empty_cache()
print("GPU cache cleared.")

Load the `damo-vilab/text-to-video-ms-1.7b` model, which produces short video clips from text prompts and fits within free Colab's memory limits.

In [None]:
from diffusers import DiffusionPipeline

# Load `text-to-video-ms-1.7b` model using DiffusionPipeline.
"""
YOUR CODE HERE (~2 lines of code)
"""

### 3.2: Generate a clip
Create a short video clip from a text prompt using the text-to-video pipeline you just loaded.

In [None]:
# Step 1: Write a text prompt describing the video you want to generate
# Step 2: Run the text-to-video pipeline with your chosen prompt
# Step 3: Store the generated frames in vid_frames (the following cells use this name)
"""
YOUR CODE HERE (~2-3 lines)
"""

### 3.3: Frame inspection
Inspect a single frame to sanity-check colors, resolution, and subject positioning before writing a full video.
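Depending on the `diffusers` version, the frames may come back as floating-point arrays with values in [0, 1] rather than 8-bit integers, and `Image.fromarray` does not accept float RGB arrays. The sketch below shows one way to normalize a frame first. The helper name `to_uint8_frame` is hypothetical, and the [0, 1] value range is an assumption you should check against your own output.

```python
import numpy as np


def to_uint8_frame(frame) -> np.ndarray:
    """Convert one video frame to uint8 so it can be passed to Image.fromarray.

    Hypothetical helper. Assumes float frames hold values in [0, 1].
    """
    arr = np.asarray(frame)
    if arr.dtype == np.uint8:
        # Already in display format; return unchanged.
        return arr
    # Clip to the assumed [0, 1] range, then scale to [0, 255].
    arr = np.clip(arr.astype(np.float32), 0.0, 1.0)
    return (arr * 255).round().astype(np.uint8)
```

With this in place, `Image.fromarray(to_uint8_frame(frame))` should work for either representation.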

In [None]:
import numpy as np
from PIL import Image

# Step 1: Select one frame from vid_frames (e.g., index 0)
# Step 2: Display as a PIL image (use Image.fromarray)
"""
YOUR CODE HERE (~1-2 lines)
"""


### 3.4: Convert frames to MP4
Write the generated frames to an MP4 file so you can preview and share the result.

In [None]:
# Step 1: Use diffusers.utils.export_to_video to write vid_frames to an MP4
# Step 2: Capture and print the saved video path
"""
YOUR CODE HERE (~3-4 lines)
"""

### 3.5: Video inspection
Play the saved video inside the notebook to check motion and temporal consistency.

In [None]:
# Display the saved MP4 inline
from IPython.display import Video

"""
YOUR CODE HERE (1 line of code)
"""

### 3.6 (Optional): Experiment with different configs
Increase `num_frames` for a longer clip, or decrease `num_inference_steps` for faster generation, and observe how each change trades clip length and speed against visual quality.
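Clip length depends on both the number of generated frames and the playback rate used when the MP4 is written. As a rough guide, the arithmetic can be sketched as follows. The helper name `clip_duration_seconds` is hypothetical, and `fps` stands for whatever frame rate you choose when exporting the video; it is not a value fixed by this project.

```python
def clip_duration_seconds(num_frames: int, fps: int) -> float:
    """Return the playback length of a clip in seconds (hypothetical helper)."""
    if num_frames < 0 or fps <= 0:
        raise ValueError("num_frames must be >= 0 and fps must be > 0")
    return num_frames / fps


print(clip_duration_seconds(16, 8))  # 2.0
print(clip_duration_seconds(24, 8))  # 3.0
```

Doubling `num_frames` at a fixed frame rate doubles the clip length, but it also increases generation time and memory use.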

## 4 - Multimodal Generation Agent
Now that you have working text-to-image and text-to-video pipelines, you will build a single agent that routes user requests to the right capability. The agent will read a prompt, infer intent (chat vs. image vs. video), and return the appropriate output.

To do this, we also need a small LLM. It will serve a dual role: answering general questions directly, and acting as a **router** that classifies each user prompt so the agent knows which pipeline to call.

### 4.1: Load an LLM for generic queries
Load `google/gemma-3-1b-it` using the Hugging Face Transformers `pipeline` API. This compact model is small enough to coexist in memory alongside the image and video pipelines on the free Colab tier.

In [None]:
from transformers import pipeline

# Load google/gemma-3-1b-it using HuggingFace pipeline

"""
YOUR CODE HERE (~2-15 lines)
"""

### 4.2: Build a routing mechanism to route requests

**Step 1:** Implement two helper functions:
- `generate_media(prompt, mode)`: a thin wrapper that calls the image pipeline when `mode='image'` or the video pipeline when `mode='video'`.
- `llm_generate(prompt, ...)`: sends a prompt to the Gemma LLM and returns the generated text.

In [None]:
import torch, textwrap, json, re

def generate_media(prompt: str, mode: str):
    # Produce either an image or a short video clip from a text prompt.
    """
    YOUR CODE HERE (~3-6 lines)
    """

def llm_generate(prompt, max_new_tokens=64, temperature=0.7):
    # Generate a response to the prompt with the loaded Gemma model and return the text
    """
    YOUR CODE HERE (~2 lines of code)
    """

**Step 2:** Implement `classify_prompt(prompt)`, the router. This function uses the LLM itself to classify a user prompt into one of three categories: `"qa"`, `"image"`, or `"video"`. For image and video requests, it should also return an `expanded_prompt`: an improved, more detailed version of the user's request that will produce better generation results. On parse failure, default to `"qa"`.
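Small LLMs often wrap their answer in extra text, so the parsing step benefits from being defensive. The sketch below shows one possible approach, not the reference solution: it assumes the LLM was instructed to reply with a JSON object containing `type` and `expanded_prompt`, and it uses a hypothetical helper name, `parse_router_output`. Unlike the minimal `{"type": "qa"}` default, this version also carries the original prompt through in the fallback, which is a design choice of the sketch.

```python
import json
import re

VALID_TYPES = {"qa", "image", "video"}


def parse_router_output(raw: str, original_prompt: str) -> dict:
    """Parse the router LLM's raw reply into a classification dict.

    Hypothetical helper. Assumes the reply contains JSON such as
    {"type": "image", "expanded_prompt": "..."}.
    """
    fallback = {"type": "qa", "expanded_prompt": original_prompt}
    # Take the span from the first "{" to the last "}", ignoring surrounding chatter.
    match = re.search(r"\{.*\}", raw, flags=re.DOTALL)
    if match is None:
        return fallback
    try:
        data = json.loads(match.group(0))
    except json.JSONDecodeError:
        return fallback
    if not isinstance(data, dict):
        return fallback
    kind = str(data.get("type", "")).strip().lower()
    if kind not in VALID_TYPES:
        return fallback
    expanded = data.get("expanded_prompt")
    if not isinstance(expanded, str) or not expanded.strip():
        # No usable expansion: reuse the user's own wording.
        expanded = original_prompt
    return {"type": kind, "expanded_prompt": expanded.strip()}
```

Inside `classify_prompt`, a function like this could be applied to the text returned by `llm_generate`.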

In [None]:
def classify_prompt(prompt: str):
    """Classify the user prompt into QA, image, or video."""

    # Step 1: Define a system prompt explaining how to classify requests (qa, image, video)
    # Step 2: Format the user message and system message as input to the LLM
    # Step 3: Generate a response with llm_generate() and parse it using regex
    # Step 4: Extract fields "type" and "expanded_prompt" from the LLM response
    # Step 5: Return a dict with classification results or default to {"type": "qa"} on failure

    """
    YOUR CODE HERE (~5-25 lines of code)
    """

### 4.3: Build the multimodal agent
Implement `multimodal_agent(user_prompt)`, the main entry point. It takes a single user prompt, calls `classify_prompt` to determine the request type, and then routes to the appropriate handler:
- **QA**: call `llm_generate` to produce a conversational answer.
- **Image**: call `generate_media` with the expanded prompt and `mode='image'`.
- **Video**: call `generate_media` with the expanded prompt and `mode='video'`.
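The routing logic itself can be tried out without loading any model by passing the three collaborators in as arguments. The sketch below is illustrative only: `route_request` and the returned `{"type", "output"}` dict shape are assumptions made for this example, and in the notebook the arguments would be your own `classify_prompt`, `llm_generate`, and `generate_media`.

```python
from typing import Any, Callable, Dict


def route_request(
    user_prompt: str,
    classify: Callable[[str], Dict[str, str]],
    answer: Callable[[str], str],
    make_media: Callable[[str, str], Any],
) -> Dict[str, Any]:
    """Classify the prompt, then call the matching handler (hypothetical helper)."""
    decision = classify(user_prompt)
    kind = decision.get("type", "qa")
    if kind in ("image", "video"):
        # Prefer the router's expanded prompt; fall back to the user's own words.
        media_prompt = decision.get("expanded_prompt") or user_prompt
        return {"type": kind, "output": make_media(media_prompt, kind)}
    # Anything else, including unknown labels, is treated as a plain question.
    return {"type": "qa", "output": answer(user_prompt)}


# Stand-in functions so the routing can be exercised without a GPU.
def fake_classify(prompt: str) -> Dict[str, str]:
    if "draw" in prompt:
        return {"type": "image", "expanded_prompt": prompt + ", detailed"}
    return {"type": "qa"}


result = route_request(
    "draw a cat",
    fake_classify,
    answer=lambda p: "text answer",
    make_media=lambda p, mode: (mode, p),
)
print(result["type"])
```

Returning the request type alongside the output lets the caller decide how to display the result (text, image, or video).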

In [None]:
def multimodal_agent(user_prompt: str):
    # Step 1: Classify the request
    # Step 2: Route the prompt and generate output
    """
    YOUR CODE HERE (~12-16 lines)
    """

### 4.4: Test the agent
Now let's test your multimodal agent end to end. Each prompt should be routed automatically to the matching capability (text Q&A, image generation, or video generation), and the corresponding output should be displayed.

In [None]:
from diffusers.utils import export_to_video
from IPython.display import display, Video

# Step 1: Define a few diverse prompts (QA, image, video)
# Step 2: For each prompt, call multimodal_agent and inspect the returned result
"""
YOUR CODE HERE (~15-18 lines)
"""

Replace the sample queries with your own and verify that the agent chooses the correct generation path.

## 5 - Interactive Web UI

Launch a simple Gradio web interface so you (or your users) can play with the multimodal agent from the browser.
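A Gradio callback can return one value per output component, so a UI with separate text, image, and video components needs the agent's result spread across three slots. The sketch below shows one way to do that mapping. It assumes the agent returns a dict with `type` and `output` keys and that video output is a file path; both are assumptions of this example, and `to_ui_outputs` is a hypothetical helper name.

```python
from typing import Any, Dict, Optional, Tuple


def to_ui_outputs(result: Dict[str, Any]) -> Tuple[Optional[str], Any, Optional[str]]:
    """Map an agent result to (text, image, video_path) for three output widgets.

    Hypothetical helper. Slots that do not apply are set to None.
    """
    kind = result.get("type")
    output = result.get("output")
    if kind == "image":
        return None, output, None
    if kind == "video":
        return None, None, output
    # Default: treat the result as text; show an empty string if nothing came back.
    return ("" if output is None else str(output)), None, None
```

The button's click handler could then call the agent and pass its result through this function before returning.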


In [None]:
import gradio as gr
with gr.Blocks() as demo:
    gr.Markdown('# Multimodal Agent')
    """
    YOUR CODE HERE (~15-18 lines)
    """

demo.launch()

After the UI launches, open the link and generate your own images and videos directly from the browser.

## 🎉 Congratulations!

* You have built a **multi-modal agent** capable of understanding various requests and routing them to the proper model.
* Try experimenting with other T2I and T2V models.
* Try making your system more efficient. For example, load a separate lightweight LLM for routing and a more capable LLM for QA.


👏 **Great job!** Take a moment to celebrate. The techniques you implemented here power many production agents and chatbots.