# CogVideoX - Text-to-Video Generation & Experiments

This notebook demonstrates **inference** with [CogVideoX-2b](https://huggingface.co/THUDM/CogVideoX-2b) and runs **ablation studies** on key hyperparameters.

**Model:** CogVideoX-2b (THUDM) — 2B parameter text-to-video diffusion model  
**Runtime:** Google Colab with T4/A100 GPU  

**Experiments:**
1. Ablation on `guidance_scale` (1, 6, 12) — prompt adherence strength
2. Ablation on `num_inference_steps` (10, 25, 50) — quality vs speed
3. Seed variation (42, 123, 999) — generation diversity
4. Frame count variation (25, 49) — video length vs VRAM
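
The four experiments above follow a one-parameter-at-a-time design: each run changes a single setting and leaves the rest at the baseline. A minimal sketch of that grid as plain data follows; the names `DEFAULTS`, `SWEEPS`, and `build_runs` are illustrative and are not used elsewhere in this notebook.

```python
# Baseline settings used throughout the notebook.
DEFAULTS = {"guidance_scale": 6, "num_inference_steps": 50, "seed": 42, "num_frames": 49}

# Values swept in each experiment (one parameter at a time).
SWEEPS = {
    "guidance_scale": [1, 6, 12],
    "num_inference_steps": [10, 25, 50],
    "seed": [42, 123, 999],
    "num_frames": [25, 49],
}


def build_runs(defaults, sweeps):
    """Return one config per run, changing a single parameter from the defaults."""
    runs = []
    for name, values in sweeps.items():
        for value in values:
            config = dict(defaults)  # copy so the defaults stay untouched
            config[name] = value
            runs.append({"experiment": name, "config": config})
    return runs


runs = build_runs(DEFAULTS, SWEEPS)
repeats = sum(1 for r in runs if r["config"] == DEFAULTS)
print(f"{len(runs)} sweep runs, {repeats} of them identical to the baseline config")
```

Every sweep includes its default value, so four of the eleven sweep runs repeat the baseline configuration. Reusing the baseline result for those would save generation time.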

---
## 1. Setup & GPU Check

In [None]:
!nvidia-smi

In [None]:
# Quote each requirement so the shell does not treat ">=" as an output redirect.
!pip install -q "diffusers>=0.30.1" "transformers>=4.44.2" "accelerate>=0.33.0" "imageio-ffmpeg>=0.5.1"

In [None]:
import torch
import time
import gc
import numpy as np
import matplotlib.pyplot as plt
from diffusers import CogVideoXPipeline
from diffusers.utils import export_to_video
from IPython.display import Video, display, HTML

print(f"CUDA available: {torch.cuda.is_available()}")
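if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"VRAM: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")
else:
    print("No CUDA GPU detected; switch the Colab runtime to a GPU before continuing.")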
print(f"PyTorch: {torch.__version__}")

### Helper function
Utility to run a generation experiment and collect results.

In [None]:
def run_experiment(pipe, prompt, filename, steps=50, frames=49, guidance=6, seed=42, fps=8):
    """Generate a video and return timing + key frames."""
    gc.collect()
    torch.cuda.empty_cache()

    start = time.time()
    video_frames = pipe(
        prompt=prompt,
        num_videos_per_prompt=1,
        num_inference_steps=steps,
        num_frames=frames,
        guidance_scale=guidance,
        generator=torch.Generator(device="cuda").manual_seed(seed),
    ).frames[0]
    elapsed = time.time() - start

    export_to_video(video_frames, filename, fps=fps)

    n = len(video_frames)
    key_frames = [
        np.array(video_frames[0]),
        np.array(video_frames[n // 2]),
        np.array(video_frames[-1]),
    ]

    print(f"  -> {elapsed:.1f}s | {filename}")
    return {"time_s": elapsed, "file": filename, "key_frames": key_frames}


def plot_comparison(results, labels, param_name, title):
    """Plot key frames grid and timing bar chart."""
    n_runs = len(results)
    fig, axes = plt.subplots(n_runs, 3, figsize=(14, 4.5 * n_runs))
    if n_runs == 1:
        axes = [axes]
    frame_names = ["First Frame", "Middle Frame", "Last Frame"]
    for i, (res, label) in enumerate(zip(results, labels)):
        for j, (frame, fn) in enumerate(zip(res["key_frames"], frame_names)):
            axes[i][j].imshow(frame)
            axes[i][j].set_title(f"{label} - {fn}", fontsize=11)
            axes[i][j].axis("off")
    plt.suptitle(title, fontsize=14, y=1.01)
    plt.tight_layout()
    plt.show()

    # Timing chart
    fig, ax = plt.subplots(figsize=(7, 3.5))
    colors = ["#3498db", "#2ecc71", "#e74c3c", "#f39c12"][:n_runs]
    times = [r["time_s"] for r in results]
    bars = ax.bar(labels, times, color=colors)
    ax.set_ylabel("Time (s)")
    ax.set_title(f"Inference Time vs {param_name}")
    for bar, t in zip(bars, times):
        ax.text(bar.get_x() + bar.get_width() / 2, bar.get_height() + 1,
                f"{t:.1f}s", ha="center", fontsize=10)
    plt.tight_layout()
    plt.show()

print("Helpers loaded.")
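
`run_experiment` keeps the first, middle, and last frame of each video. If more key frames are wanted, the same idea extends to `k` evenly spaced indices. The helper below is a hypothetical sketch (`key_frame_indices` is not part of the notebook); for the odd frame counts used here it picks the same three frames.

```python
def key_frame_indices(n_frames, k=3):
    """Return up to k evenly spaced frame indices between 0 and n_frames - 1."""
    if n_frames <= 0:
        raise ValueError("n_frames must be positive")
    if k <= 1 or n_frames == 1:
        return [0]
    k = min(k, n_frames)  # cannot pick more distinct frames than exist
    return [round(i * (n_frames - 1) / (k - 1)) for i in range(k)]


print(key_frame_indices(49))     # [0, 24, 48]
print(key_frame_indices(49, 5))  # [0, 12, 24, 36, 48]
print(key_frame_indices(25))     # [0, 12, 24]
```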

---
## 2. Load Model

In [None]:
pipe = CogVideoXPipeline.from_pretrained(
    "THUDM/CogVideoX-2b",
    torch_dtype=torch.float16
)

pipe.enable_model_cpu_offload()
pipe.vae.enable_tiling()
pipe.vae.enable_slicing()

print("CogVideoX-2b loaded successfully!")

---
## 3. Inference — Baseline Video

Single video generation with default parameters to demonstrate the pipeline.

In [None]:
prompt = (
    "A panda, dressed in a small, red jacket and a tiny hat, sits on a wooden stool "
    "in a serene bamboo forest. The panda's fluffy paws strum a miniature acoustic guitar, "
    "producing soft, melodic tunes. Nearby, a few other pandas gather, watching curiously "
    "and some clapping in rhythm. Sunlight filters through the tall bamboo, casting a gentle "
    "glow on the scene. The panda's face is expressive, showing concentration and joy as it "
    "plays. The background includes a small, flowing stream and vibrant green foliage, "
    "enhancing the peaceful and magical atmosphere of this unique musical performance."
)

print("Generating baseline video (steps=50, GS=6, seed=42, frames=49)...")
baseline = run_experiment(pipe, prompt, "output_baseline.mp4")
Video("output_baseline.mp4", embed=True)

---
## 4. Experiment 1 — Ablation on `guidance_scale`

The `guidance_scale` (classifier-free guidance) controls prompt adherence:
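- **GS=1:** guidance is effectively off (diffusers applies classifier-free guidance only when `guidance_scale > 1`), so prompt conditioning is weak and results are less controlled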
- **GS=6:** balanced quality and adherence (default)
- **GS=12:** strong adherence, risk of artifacts

All other params fixed: steps=50, seed=42, frames=49.
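
To make the role of `guidance_scale` concrete, here is a toy sketch of the standard classifier-free guidance combination, `uncond + scale * (cond - uncond)`. It assumes that formulation and uses made-up three-element "noise predictions" in place of real model outputs; `apply_cfg` is an illustrative name, not a diffusers function.

```python
def apply_cfg(uncond, cond, guidance_scale):
    """Combine unconditional and conditional predictions element-wise."""
    return [u + guidance_scale * (c - u) for u, c in zip(uncond, cond)]


# Made-up values standing in for the two noise predictions.
uncond = [0.0, 1.0, -2.0]
cond = [1.0, 1.5, -1.0]

print(apply_cfg(uncond, cond, 1))  # [1.0, 1.5, -1.0]  (identical to cond)
print(apply_cfg(uncond, cond, 6))  # [6.0, 4.0, 4.0]   (pushed 6x along cond - uncond)
```

With a scale of 1 the result equals the conditional prediction, so nothing is amplified. Larger scales extrapolate further away from the unconditional prediction, which is one plausible source of the oversaturation seen at high values.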

In [None]:
gs_values = [1, 6, 12]
gs_results = []

for gs in gs_values:
    print(f"guidance_scale={gs}")
    res = run_experiment(pipe, prompt, f"output_gs{gs}.mp4", guidance=gs)
    res["param"] = gs
    gs_results.append(res)

print("Done!")

In [None]:
plot_comparison(gs_results, [f"GS={r['param']}" for r in gs_results],
               "guidance_scale", "Experiment 1: guidance_scale Ablation")

In [None]:
for r in gs_results:
    print(f"\nguidance_scale={r['param']} ({r['time_s']:.1f}s)")
    display(Video(r["file"], embed=True))

---
## 5. Experiment 2 — Ablation on `num_inference_steps`

The number of diffusion steps directly affects quality and speed:
- **10 steps:** fast but noisy/grainy output
- **25 steps:** good compromise between speed and quality
- **50 steps:** maximum quality (default), slowest

All other params fixed: GS=6, seed=42, frames=49.
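
The speed side of this trade-off can be read directly from the `time_s` values that `run_experiment` returns. The sketch below computes speedups relative to a chosen baseline value; `speedups` is a hypothetical helper, and the timings are placeholders for illustration, not measurements.

```python
def speedups(results, baseline_param):
    """Map each swept value to (time_s, speedup relative to the baseline value)."""
    by_param = {r["param"]: r["time_s"] for r in results}
    base = by_param[baseline_param]
    return {p: (t, base / t) for p, t in by_param.items()}


# Placeholder timings for illustration only; substitute step_results after running the sweep.
example_results = [
    {"param": 10, "time_s": 40.0},
    {"param": 25, "time_s": 100.0},
    {"param": 50, "time_s": 200.0},
]

for steps, (t, s) in speedups(example_results, baseline_param=50).items():
    print(f"steps={steps}: {t:.0f}s, {s:.1f}x the speed of 50 steps")
```

After running the experiment cell, `speedups(step_results, baseline_param=50)` gives the same table for the real measurements.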

In [None]:
step_values = [10, 25, 50]
step_results = []

for s in step_values:
    print(f"steps={s}")
    res = run_experiment(pipe, prompt, f"output_steps{s}.mp4", steps=s)
    res["param"] = s
    step_results.append(res)

print("Done!")

In [None]:
plot_comparison(step_results, [f"Steps={r['param']}" for r in step_results],
               "num_inference_steps", "Experiment 2: Inference Steps Ablation")

In [None]:
for r in step_results:
    print(f"\nsteps={r['param']} ({r['time_s']:.1f}s)")
    display(Video(r["file"], embed=True))

---
## 6. Experiment 3 — Seed Variation

Different seeds produce different videos from the same prompt, demonstrating generation diversity.
We use three seeds (42, 123, 999) with all other params at default.
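
The seed matters because it fixes the initial noise the diffusion process starts from. The sketch below illustrates the principle with Python's standard `random` module rather than the `torch.Generator` used in `run_experiment`; `initial_noise` is an illustrative stand-in, not the pipeline's sampling code.

```python
import random


def initial_noise(seed, size=4):
    """Draw a small Gaussian 'noise' vector from a generator seeded with `seed`."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(size)]


# Same seed, same starting noise; different seed, different starting noise.
print(initial_noise(42) == initial_noise(42))   # True
print(initial_noise(42) == initial_noise(123))  # False
```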

In [None]:
seed_values = [42, 123, 999]
seed_results = []

for sd in seed_values:
    print(f"seed={sd}")
    res = run_experiment(pipe, prompt, f"output_seed{sd}.mp4", seed=sd)
    res["param"] = sd
    seed_results.append(res)

print("Done!")

In [None]:
plot_comparison(seed_results, [f"Seed={r['param']}" for r in seed_results],
               "seed", "Experiment 3: Seed Variation")

In [None]:
for r in seed_results:
    print(f"\nseed={r['param']} ({r['time_s']:.1f}s)")
    display(Video(r["file"], embed=True))

---
## 7. Experiment 4 — Frame Count Variation

More frames produce longer videos but require more VRAM and time.
We compare 25 frames (~3s) vs 49 frames (~6s) at 8 FPS.
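
The durations above are simply `num_frames / fps`. The sketch below also estimates how many latent frames the model has to denoise, assuming the VAE compresses time by a factor of 4, i.e. `(num_frames - 1) // 4 + 1`. That factor is an assumption about CogVideoX and is not stated in this notebook. Both helper names are illustrative.

```python
def clip_duration_s(num_frames, fps=8):
    """Playback length in seconds at the given frame rate."""
    return num_frames / fps


def latent_frame_count(num_frames, temporal_compression=4):
    """Latent frames under an assumed 4x temporal compression in the VAE."""
    return (num_frames - 1) // temporal_compression + 1


for n in (25, 49):
    print(f"{n} frames: {clip_duration_s(n):.3f}s, {latent_frame_count(n)} latent frames")
# 25 frames: 3.125s, 7 latent frames
# 49 frames: 6.125s, 13 latent frames
```

Under that assumption, going from 25 to 49 frames nearly doubles the latent sequence (7 to 13 frames), which is consistent with the higher VRAM and time cost.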

In [None]:
frame_values = [25, 49]
frame_results = []

for f in frame_values:
    print(f"frames={f}")
    res = run_experiment(pipe, prompt, f"output_frames{f}.mp4", frames=f)
    res["param"] = f
    frame_results.append(res)

print("Done!")

In [None]:
plot_comparison(frame_results, [f"Frames={r['param']}" for r in frame_results],
               "num_frames", "Experiment 4: Frame Count Variation")

In [None]:
for r in frame_results:
    print(f"\nframes={r['param']} ({r['time_s']:.1f}s)")
    display(Video(r["file"], embed=True))

---
## 8. Overall Summary

In [None]:
print("=" * 65)
print("EXPERIMENT SUMMARY")
print("=" * 65)

experiments = [
    ("Exp 1: guidance_scale", gs_results, "guidance_scale"),
    ("Exp 2: inference_steps", step_results, "steps"),
    ("Exp 3: seed", seed_results, "seed"),
    ("Exp 4: num_frames", frame_results, "frames"),
]

for name, res_list, param in experiments:
    print(f"\n{name}")
    print("-" * 45)
    for r in res_list:
        print(f"  {param}={r['param']:<8} -> {r['time_s']:.1f}s  ({r['file']})")

total_runs = sum(len(r) for _, r, _ in experiments) + 1  # +1 baseline
total_time = baseline["time_s"] + sum(r["time_s"] for _, res, _ in experiments for r in res)
print(f"\nTotal runs: {total_runs}")
print(f"Total generation time: {total_time/60:.1f} min")

---
## 9. Conclusions

### guidance_scale (Experiment 1)
- **GS=1**: Blurry and unfocused; with guidance effectively off, the output follows the prompt only loosely.
- **GS=6** (default): Best balance of quality and prompt adherence.
- **GS=12**: Over-conditioned, oversaturated colors and artifacts.
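- **Impact on time**: Minimal between GS=6 and GS=12, since generation time is dominated by the number of diffusion steps rather than guidance strength. GS=1 may run faster, because diffusers skips the unconditional pass when `guidance_scale` is not above 1.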

### num_inference_steps (Experiment 2)
- **10 steps**: Fast (~5x speedup over 50 steps) but noisy, grainy output with visible artifacts.
- **25 steps**: Good compromise — noticeable quality improvement over 10, much faster than 50.
- **50 steps** (default): Best quality, slowest generation.
- **Impact on time**: Approximately linear. Halving the steps roughly halves generation time, apart from fixed costs such as text encoding and VAE decoding.

### seed (Experiment 3)
- Different seeds produce **visually distinct videos** from the same prompt.
- Composition, colors, and motion patterns vary significantly.
- **Impact on time**: None — seed does not affect generation speed.

### num_frames (Experiment 4)
- **25 frames** (~3s video): Faster, less VRAM, but very short output.
- **49 frames** (~6s video): Default, longer and more coherent motion.
- **Impact on time**: Roughly proportional — fewer frames = faster generation.

### Key Takeaway
The two most impactful hyperparameters are **`guidance_scale`** (quality/adherence trade-off) and **`num_inference_steps`** (quality/speed trade-off). The default values (GS=6, steps=50) are well-chosen for maximum quality.