<a href="https://colab.research.google.com/github/ayagup/stablediffusion/blob/main/t2v.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here are several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All"
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
%pip install "torch>=2.0.0" "diffusers>=0.25.0" "transformers>=4.35.0" "accelerate>=0.24.0" "safetensors>=0.4.0" "opencv-python>=4.8.0" "imageio>=2.31.0" "imageio-ffmpeg>=0.4.9" "pillow>=10.0.0" "numpy>=1.24.0" "huggingface-hub>=0.19.0"
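
After installation, it can help to confirm which versions actually ended up in the environment. The following is a minimal sketch using only the standard library; the helper name `installed_versions` is hypothetical, and the package list simply mirrors part of the install command above.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Return {distribution name: version string, or None if not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            # The distribution is not present in this environment.
            found[name] = None
    return found


if __name__ == "__main__":
    # Distribution names taken from the %pip install line above.
    report = installed_versions(
        ["torch", "diffusers", "transformers", "accelerate", "numpy"]
    )
    for name, ver in report.items():
        print(f"{name}: {ver or 'not installed'}")
```

A `None` entry means the distribution was not found, which usually points to a failed or skipped install step.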

In [None]:
"""
Simple Text-to-Video Inference Example
A simplified version for quick testing with publicly available models
"""

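import os

# Reduce TensorFlow log noise. Set this before importing diffusers, which may
# pull in TensorFlow; the variable has no effect once TensorFlow is loaded.
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'

import torch
from diffusers import DiffusionPipeline, DPMSolverMultistepScheduler
from diffusers.utils import export_to_video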


def generate_video(
    prompt: str,
    output_path: str = "output.mp4",
    model_id: str = "damo-vilab/text-to-video-ms-1.7b",
    num_frames: int = 16,
    height: int = 256,
    width: int = 256,
):
    """
    Generate a video from a text prompt using a Hugging Face text-to-video
    diffusion pipeline (default: damo-vilab/text-to-video-ms-1.7b)

    Args:
        prompt: Text description of the video to generate
        output_path: Path to save the output video
        model_id: Hugging Face model identifier
        num_frames: Number of frames to generate
        height: Video height in pixels
        width: Video width in pixels
    """

    # Determine device and check for multiple GPUs
    device = "cuda" if torch.cuda.is_available() else "cpu"

    if torch.cuda.is_available():
        num_gpus = torch.cuda.device_count()
        print(f"Using device: {device}")
        print(f"Number of GPUs available: {num_gpus}")
        for i in range(num_gpus):
            print(f"  GPU {i}: {torch.cuda.get_device_name(i)}")
            print(f"    Memory: {torch.cuda.get_device_properties(i).total_memory / 1e9:.2f} GB")
    else:
        print(f"Using device: {device}")

    # Load model
    print(f"\nLoading model: {model_id}...")
    pipe = DiffusionPipeline.from_pretrained(
        model_id,
        torch_dtype=torch.float16 if device == "cuda" else torch.float32,
        variant="fp16" if device == "cuda" else None,
    )

    # Use faster scheduler
    pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)

    # Memory management
    if device == "cuda" and torch.cuda.device_count() > 1:
        print(f"\n🚀 Multi-GPU mode enabled: Using {torch.cuda.device_count()} GPUs")
        # Sequential CPU offload keeps weights on the CPU and streams submodules
        # to the GPU as needed, trading speed for lower VRAM use.
        pipe.enable_sequential_cpu_offload()
        pipe.enable_vae_slicing()
        # Note: this offload targets a single GPU; it does not shard the model
        # across the available GPUs.
    elif device == "cuda":
        # Model CPU offload moves whole components to the GPU on demand, so the
        # pipeline is not moved with pipe.to("cuda") first.
        pipe.enable_model_cpu_offload()
        pipe.enable_vae_slicing()
    else:
        pipe = pipe.to(device)

    # Generate video
    print(f"Generating video for prompt: '{prompt}'")
    result = pipe(
        prompt=prompt,
        num_frames=num_frames,
        height=height,
        width=width,
        num_inference_steps=50,
        guidance_scale=7.5,
    )

    # Extract frames
    video_frames = result.frames[0]

    # Export to video file
    export_to_video(video_frames, output_path, fps=8)
    print(f"Video saved to: {output_path}")

    return video_frames


if __name__ == "__main__":
    # Example usage
    prompt = "A beautiful woman standing in the middle of a room. full body frame. She is combing her hair."

    try:
        frames = generate_video(
            prompt=prompt,
            output_path="/kaggle/working/sunset_video.mp4",
            num_frames=160,
            height=256,
            width=256,
        )
        print(f"Successfully generated {len(frames)} frames!")
    except Exception as e:
        print(f"Error: {e}")
        import traceback
        traceback.print_exc()


2025-10-18 10:24:57.728035: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1760783097.936641      37 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1760783097.996536      37 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered


Using device: cuda
Number of GPUs available: 2
  GPU 0: Tesla T4
    Memory: 15.83 GB
  GPU 1: Tesla T4
    Memory: 15.83 GB

Loading model: damo-vilab/text-to-video-ms-1.7b...


model_index.json:   0%|          | 0.00/384 [00:00<?, ?B/s]

Fetching 12 files:   0%|          | 0/12 [00:00<?, ?it/s]

scheduler_config.json:   0%|          | 0.00/465 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/644 [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/460 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/755 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/787 [00:00<?, ?B/s]

merges.txt: 0.00B [00:00, ?B/s]

vocab.json: 0.00B [00:00, ?B/s]

text_encoder/model.fp16.safetensors:   0%|          | 0.00/681M [00:00<?, ?B/s]

unet/diffusion_pytorch_model.fp16.safete(…):   0%|          | 0.00/2.82G [00:00<?, ?B/s]

config.json:   0%|          | 0.00/657 [00:00<?, ?B/s]

vae/diffusion_pytorch_model.fp16.safeten(…):   0%|          | 0.00/167M [00:00<?, ?B/s]

Loading pipeline components...:   0%|          | 0/5 [00:00<?, ?it/s]

The TextToVideoSDPipeline has been deprecated and will not receive bug fixes or feature updates after Diffusers version 0.33.1. 



🚀 Multi-GPU mode enabled: Using 2 GPUs
Generating video for prompt: 'A beautiful woman standing in the middle of a room. full body frame. She is combing her hair.'


  0%|          | 0/50 [00:00<?, ?it/s]

Video saved to: /kaggle/working/sunset_video.mp4
Successfully generated 160 frames!
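
The run above writes 160 frames, and `generate_video` exports with `fps=8`, so the clip length is the frame count divided by the frame rate. A small sketch of that arithmetic follows; the helper name `clip_duration_seconds` is hypothetical and is not part of diffusers.

```python
def clip_duration_seconds(num_frames: int, fps: int) -> float:
    """Playback length in seconds of a clip with num_frames frames at fps."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return num_frames / fps


# The run above: 160 frames exported at 8 frames per second.
print(clip_duration_seconds(160, 8))  # 20.0

# The function's default of 16 frames at the same frame rate.
print(clip_duration_seconds(16, 8))   # 2.0
```

At the default `num_frames=16` the same export setting therefore yields a 2-second clip.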


In [None]:
%pip uninstall numpy -y

In [None]:
%pip install "numpy<2.0.0"
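
A reinstall does not change a NumPy module that the running kernel has already imported, so a kernel restart is usually needed before the pin takes effect. As a rough check of what is installed on disk, the sketch below reads the distribution metadata; `major_version` is a hypothetical helper and assumes a plain `X.Y.Z` style version string.

```python
from importlib.metadata import PackageNotFoundError, version


def major_version(version_string: str) -> int:
    """Return the leading integer of a version string such as '1.26.4'."""
    return int(version_string.split(".")[0])


if __name__ == "__main__":
    try:
        installed = version("numpy")
    except PackageNotFoundError:
        print("numpy is not installed")
    else:
        # The pin "numpy<2.0.0" should leave a 1.x release installed.
        status = "OK" if major_version(installed) < 2 else "still 2.x or newer"
        print(f"numpy {installed}: {status}")
```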