## CogVideoX Text-to-Video

This notebook demonstrates how to run [CogVideoX-2b](https://huggingface.co/THUDM/CogVideoX-2b) and [CogVideoX-5b](https://huggingface.co/THUDM/CogVideoX-5b) with 🧨 Diffusers on a free-tier Colab GPU.

Additional resources:
- [Docs](https://huggingface.co/docs/diffusers/en/api/pipelines/cogvideox)
- [Quantization with TorchAO](https://github.com/sayakpaul/diffusers-torchao/)
- [Quantization with Quanto](https://gist.github.com/a-r-r-o-w/31be62828b00a9292821b85c1017effa)

Note: If you hit a seemingly random OOM error, try a Kaggle T4 instance instead. I've found the Colab free-tier T4 to be unreliable: sometimes the notebook runs smoothly, and other times it crashes with an error 🤷🏻‍♂️

#### Mount Google Drive

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


#### Install the necessary requirements

In [2]:
!pip install diffusers transformers hf_transfer
!pip install openpyxl

Collecting hf_transfer
  Downloading hf_transfer-0.1.9-cp38-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (1.7 kB)
Downloading hf_transfer-0.1.9-cp38-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.6 MB)
Installing collected packages: hf_transfer
Successfully installed hf_transfer-0.1.9


In [3]:
#!pip install git+https://github.com/huggingface/accelerate
!pip install accelerate==0.33.0
!pip install streamlit torch transformers

Collecting accelerate==0.33.0
  Downloading accelerate-0.33.0-py3-none-any.whl.metadata (18 kB)
Downloading accelerate-0.33.0-py3-none-any.whl (315 kB)
Installing collected packages: accelerate
  Attempting uninstall: accelerate
    Found existing installation: accelerate 1.2.1
    Uninstalling accelerate-1.2.1:
      Successfully uninstalled accelerate-1.2.1
Successfully installed accelerate-0.33.0
Collecting streamlit
  Downloading streamlit-1.41.1-py2.py3-none-any.whl.metadata (8.5 kB)
Collecting watchdog<7,>=2.1.5 (from streamlit)
  Downloading watchdog-6.0.0-py3-none-manylinux2014_x86_64.whl.metadata (44 kB)
Collecting pydeck<1

#### Import required libraries

Setting `HF_HUB_ENABLE_HF_TRANSFER` in the following block is optional, but when it is enabled, downloading models from the HF Hub is much faster.

In [4]:
import pandas as pd
import cv2
import os
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"

In [5]:
import torch
from diffusers import AutoencoderKLCogVideoX, CogVideoXPipeline, CogVideoXTransformer3DModel
from diffusers.utils import export_to_video
from transformers import T5EncoderModel

The cache for model files in Transformers v4.22.0 has been updated. Migrating your old cache. This is a one-time only operation. You can interrupt this and resume the migration later on by calling `transformers.utils.move_cache()`.


0it [00:00, ?it/s]

In [6]:

# List all files in the folder
folder_path = "/content/drive/MyDrive/Colab Notebooks/data/MIMOS Dataset"  # Updated path to Google Drive
files = os.listdir(folder_path)
print("Files in MIMOS Dataset:", files)

Files in MIMOS Dataset: ['133vid.mp4', '84vid.mp4', '56vid.mp4', '92vid.mp4', '132vid.mp4', '93vid.mp4', '96vid.mp4', '3vid.mp4', '24vid.mp4', '65vid.mp4', '2vid.mp4', '37vid.mp4', '124vid.mp4', '127vid.mp4', '41vid.mp4', '30vid.mp4', '46vid.mp4', '12vid.mp4', '135vid.mp4', '25vid.mp4', '104vid.mp4', '52vid.mp4', '87vid.mp4', '15vid.mp4', '101vid.mp4', '137vid.mp4', '17vid.mp4', '86vid.mp4', '62vid.mp4', '23vid.mp4', '33vid.mp4', '1vid.mp4', '134vid.mp4', '75vid.mp4', '45vid.mp4', '125vid.mp4', '54vid.mp4', '64vid.mp4', '40vid.mp4', '131vid.mp4', '117vid.mp4', '6vid.mp4', '130vid.mp4', '5vid.mp4', '99vid.mp4', '119vid.mp4', '4vid.mp4', '74vid.mp4', '22vid.mp4', '88vid.mp4', '39vid.mp4', '10vid.mp4', '73vid.mp4', '102vid.mp4', '79vid.mp4', '95vid.mp4', '128vid.mp4', '82vid.mp4', '81vid.mp4', '126vid.mp4', '106vid.mp4', '42vid.mp4', '90vid.mp4', '28vid.mp4', '26vid.mp4', '76vid.mp4', '72vid.mp4', '80vid.mp4', '129vid.mp4', '91vid.mp4', '19vid.mp4', '50vid.mp4', '85vid.mp4', '34vid.mp4', 
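`os.listdir` returns names in arbitrary order, as the listing above shows. A minimal sketch of sorting the clips by their numeric prefix and spotting missing indices, assuming every clip follows the `<n>vid.mp4` pattern seen in the output (the helpers `video_index`, `sort_videos`, and `missing_indices` are hypothetical, not part of the notebook):

```python
import re

# Matches names such as "12vid.mp4" and captures the leading number.
VIDEO_NAME = re.compile(r"^(\d+)vid\.mp4$")


def video_index(name):
    """Return the numeric prefix of a clip name, or None if it does not match."""
    match = VIDEO_NAME.match(name)
    return int(match.group(1)) if match else None


def sort_videos(names):
    """Keep only matching clip names and sort them by numeric index."""
    indexed = [(video_index(n), n) for n in names]
    return [n for idx, n in sorted(p for p in indexed if p[0] is not None)]


def missing_indices(names, first=1, last=138):
    """List the indices in [first, last] that have no matching clip."""
    present = {video_index(n) for n in names}
    return [i for i in range(first, last + 1) if i not in present]


sample = ["133vid.mp4", "3vid.mp4", "24vid.mp4", "Caption.xlsx"]
print(sort_videos(sample))            # ['3vid.mp4', '24vid.mp4', '133vid.mp4']
print(missing_indices(sample, 1, 4))  # [1, 2, 4]
```

In the notebook, `sort_videos(files)` would give a stable processing order, and `missing_indices(files)` would flag any clip number that has no file before captions are matched to videos.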

In [7]:
# Updated path to load captions from the MIMOS Dataset folder
captions_path = "/content/drive/MyDrive/Colab Notebooks/data/MIMOS Dataset/Caption.xlsx"
captions_df = pd.read_excel(captions_path, engine='openpyxl')

# Folder path to MIMOS Dataset videos
video_folder = "/content/drive/MyDrive/Colab Notebooks/data/MIMOS Dataset"

# Process each video and its caption
for i in range(1, 139):  # Iterate from 1 to 138
    video_name = f"{i}vid.mp4"  # Files on disk are named 1vid.mp4, 2vid.mp4, etc.
    video_path = os.path.join(video_folder, video_name)  # Path to each video

    # Get the corresponding caption from the DataFrame (assumes row i - 1 holds the caption for video i)
    try:
        caption = captions_df.loc[i - 1, 'caption']  # Assuming captions are indexed from 0
    except KeyError:
        print(f"Warning: Caption not found for video {video_name}")
        caption = ""  # Set an empty caption if not found

    video_data = cv2.VideoCapture(video_path)  # Load video
    if not video_data.isOpened():
        print(f"Warning: Could not open video {video_path}")
        continue
    # ... (Rest of your code to process the video and caption) ...
    video_data.release()  # Free the file handle once the video has been processed

#### Load models and create pipeline

Note: `bfloat16`, the recommended dtype for running "CogVideoX-5b", will cause OOM errors because Turing GPUs lack efficient support for it.

Therefore, we must use `float16`, which might result in poorer generation quality :( The recommended solution is to use Ampere or newer GPUs, which also support efficient quantization kernels from [TorchAO](https://github.com/pytorch/ao).
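The dtype choice described above can be derived from the GPU's CUDA compute capability. A minimal sketch, assuming Turing cards such as the T4 report capability 7.5 and Ampere starts at 8.0; `pick_dtype_name` is a hypothetical helper, not part of Diffusers:

```python
def pick_dtype_name(capability):
    """Return the dtype name to use for a (major, minor) compute capability.

    Ampere and newer (major >= 8) handle bfloat16 efficiently; older
    architectures such as Turing (7.5) fall back to float16.
    """
    major, _minor = capability
    return "bfloat16" if major >= 8 else "float16"


print(pick_dtype_name((7, 5)))  # T4 (Turing)   -> float16
print(pick_dtype_name((8, 0)))  # A100 (Ampere) -> bfloat16

# In the notebook, the result could be mapped to a torch dtype, for example:
#   import torch
#   dtype = getattr(torch, pick_dtype_name(torch.cuda.get_device_capability()))
```

Keeping the decision in a pure function makes it easy to check without a GPU attached.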

In [8]:
# Models: "THUDM/CogVideoX-2b" or "THUDM/CogVideoX-5b"
model_id = "THUDM/CogVideoX-5b"

In [9]:
# Thank you [@camenduru](https://github.com/camenduru)!
# The transformer and text encoder are loaded from Camenduru's checkpoints rather than the
# originals because they were saved via `.save_pretrained` with a max_shard_size of "5GB".
# The original converted model was saved with "10GB" shards, which exhaust the Colab CPU RAM
# during loading and lead to OOM (on the CPU).

!pip install hf_transfer

import os
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"


transformer = CogVideoXTransformer3DModel.from_pretrained("camenduru/cogvideox-5b-float16", subfolder="transformer", torch_dtype=torch.float16)
text_encoder = T5EncoderModel.from_pretrained("camenduru/cogvideox-5b-float16", subfolder="text_encoder", torch_dtype=torch.float16)
vae = AutoencoderKLCogVideoX.from_pretrained(model_id, subfolder="vae", torch_dtype=torch.float16)



The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


transformer/config.json:   0%|          | 0.00/798 [00:00<?, ?B/s]

(…)ion_pytorch_model.safetensors.index.json:   0%|          | 0.00/103k [00:00<?, ?B/s]

(…)pytorch_model-00001-of-00003.safetensors:   0%|          | 0.00/4.98G [00:00<?, ?B/s]

(…)pytorch_model-00002-of-00003.safetensors:   0%|          | 0.00/4.95G [00:00<?, ?B/s]

(…)pytorch_model-00003-of-00003.safetensors:   0%|          | 0.00/1.22G [00:00<?, ?B/s]

text_encoder/config.json:   0%|          | 0.00/791 [00:00<?, ?B/s]

(…)ext_encoder/model.safetensors.index.json:   0%|          | 0.00/19.9k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/3 [00:00<?, ?it/s]

model-00001-of-00003.safetensors:   0%|          | 0.00/4.99G [00:00<?, ?B/s]

model-00002-of-00003.safetensors:   0%|          | 0.00/4.97G [00:00<?, ?B/s]

model-00003-of-00003.safetensors:   0%|          | 0.00/1.58G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

vae/config.json:   0%|          | 0.00/872 [00:00<?, ?B/s]

diffusion_pytorch_model.safetensors:   0%|          | 0.00/862M [00:00<?, ?B/s]

The config attributes {'invert_scale_latents': False} were passed to AutoencoderKLCogVideoX, but are not expected and will be ignored. Please verify your config.json configuration file.


In [10]:
# Create pipeline and run inference
pipe = CogVideoXPipeline.from_pretrained(
    model_id,
    text_encoder=text_encoder,
    transformer=transformer,
    vae=vae,
    torch_dtype=torch.float16,
)

model_index.json:   0%|          | 0.00/411 [00:00<?, ?B/s]

scheduler/scheduler_config.json:   0%|          | 0.00/482 [00:00<?, ?B/s]

tokenizer/added_tokens.json:   0%|          | 0.00/2.59k [00:00<?, ?B/s]

tokenizer/special_tokens_map.json:   0%|          | 0.00/2.54k [00:00<?, ?B/s]

spiece.model:   0%|          | 0.00/792k [00:00<?, ?B/s]

tokenizer/tokenizer_config.json:   0%|          | 0.00/20.6k [00:00<?, ?B/s]

Loading pipeline components...:   0%|          | 0/5 [00:00<?, ?it/s]

#### Enable memory optimizations

Note that sequential CPU offloading is necessary to run the model on Turing or older architectures. It keeps all weights on the CPU and moves only the currently executing `nn.Module` to the GPU. This saves a lot of VRAM but adds significant inference overhead, making generation extremely slow (1 hour+). Unfortunately, this is the only way to run the model on Colab until efficient kernels are supported.
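To see why moving one module at a time saves VRAM, compare the peak amount of weights resident on the GPU under each strategy. This is a toy model with made-up module sizes; the numbers and the `peak_vram_gb` helper are illustrative assumptions, not measurements of CogVideoX:

```python
def peak_vram_gb(module_sizes_gb, sequential_offload):
    """Estimate peak GPU-resident weights for a chain of modules.

    Without offloading, every module stays on the GPU, so the peak is the sum.
    With sequential offloading, only the module currently executing is on the
    GPU, so the peak is the size of the largest single module.
    """
    if sequential_offload:
        return max(module_sizes_gb)
    return sum(module_sizes_gb)


# Hypothetical per-module weight sizes in GB (activations are ignored).
sizes = [0.5, 0.25, 0.25, 1.0]

print(peak_vram_gb(sizes, sequential_offload=False))  # 2.0
print(peak_vram_gb(sizes, sequential_offload=True))   # 1.0
```

The cost is a host-to-device copy for each module on every forward pass, which is roughly where the slowdown comes from.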

In [11]:
pipe.enable_sequential_cpu_offload()
# pipe.vae.enable_tiling()

In [12]:
import pandas as pd
import os
from diffusers import AutoencoderKLCogVideoX, CogVideoXPipeline, CogVideoXTransformer3DModel
from diffusers.utils import export_to_video
from transformers import T5EncoderModel

# ... (Other imports and code for loading models)

# Load captions
captions_path = "/content/drive/MyDrive/Colab Notebooks/data/MIMOS Dataset/Caption.xlsx"
captions_df = pd.read_excel(captions_path, engine='openpyxl')

# Folder to store generated videos
output_folder = "/content/drive/MyDrive/Colab Notebooks/generated_videos"
os.makedirs(output_folder, exist_ok=True)

# Data structure to store video-caption pairs
video_caption_pairs = []
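
`video_caption_pairs` is left empty above. One way it might be filled is sketched here, with the generation step passed in as a function so the loop can be tried without a GPU. `collect_pairs`, `fake_generate`, and the `<n>gen.mp4` naming are hypothetical; in the notebook the callable would wrap the `pipe(...)` call and `export_to_video`:

```python
def collect_pairs(captions, output_folder, generate_fn):
    """Generate one video per non-empty caption and record the pairing.

    generate_fn(caption, output_path) must write the video and return its path.
    """
    pairs = []
    for index, caption in enumerate(captions, start=1):
        if not isinstance(caption, str) or not caption.strip():
            continue  # Skip missing captions (e.g. NaN cells read from Excel)
        output_path = f"{output_folder}/{index}gen.mp4"
        pairs.append({"caption": caption, "video": generate_fn(caption, output_path)})
    return pairs


def fake_generate(caption, output_path):
    """Stand-in for the real pipeline call; it only echoes the target path."""
    return output_path


pairs = collect_pairs(["a cat runs", "", "rain on a window"], "out", fake_generate)
print(pairs)
```

With the real data, the caption list could come from `captions_df["caption"].tolist()`, assuming the spreadsheet has a `caption` column as the earlier cell does.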

#### Training

In [13]:
!pip install datasets transformers
from datasets import Dataset
from transformers import TrainingArguments, Trainer




Collecting datasets
  Downloading datasets-3.2.0-py3-none-any.whl.metadata (20 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.5.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess<0.70.17 (from datasets)
  Downloading multiprocess-0.70.16-py310-none-any.whl.metadata (7.2 kB)
Collecting fsspec<=2024.9.0,>=2023.1.0 (from fsspec[http]<=2024.9.0,>=2023.1.0->datasets)
  Downloading fsspec-2024.9.0-py3-none-any.whl.metadata (11 kB)
Downloading datasets-3.2.0-py3-none-any.whl (480 kB)
Downloading dill-0.3.8-py3-none-any.whl (116 kB)
Downloading fsspec-2024.9.0-py3-none-any.whl 

#### Generate videos

In [14]:


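%%writefile video_generator.py
import os
from functools import lru_cache

import torch
from diffusers import AutoencoderKLCogVideoX, CogVideoXPipeline, CogVideoXTransformer3DModel
from diffusers.utils import export_to_video
from transformers import T5EncoderModel

MODEL_ID = "THUDM/CogVideoX-5b"
SHARDED_ID = "camenduru/cogvideox-5b-float16"  # 5GB shards, as in the loading cell

@lru_cache(maxsize=1)
def get_pipeline() -> CogVideoXPipeline:
    # Streamlit imports this module in its own process, so the notebook's `pipe`
    # is not visible here. Build it once, mirroring the loading cells above.
    dtype = torch.float16
    pipe = CogVideoXPipeline.from_pretrained(
        MODEL_ID,
        transformer=CogVideoXTransformer3DModel.from_pretrained(SHARDED_ID, subfolder="transformer", torch_dtype=dtype),
        text_encoder=T5EncoderModel.from_pretrained(SHARDED_ID, subfolder="text_encoder", torch_dtype=dtype),
        vae=AutoencoderKLCogVideoX.from_pretrained(MODEL_ID, subfolder="vae", torch_dtype=dtype),
        torch_dtype=dtype,
    )
    pipe.enable_sequential_cpu_offload()
    return pipe

def set_output_folder(folder_name="generated_videos") -> str:
    base_path = "/content/drive/MyDrive"  # Adjust this if not using Google Drive
    output_folder = os.path.join(base_path, folder_name)
    os.makedirs(output_folder, exist_ok=True)
    return output_folder

def generate_video_from_caption(caption: str, output_folder: str) -> str:
    os.makedirs(output_folder, exist_ok=True)

    video_name = "generated_video"  # Fixed name: each run overwrites the previous file
    video = get_pipeline()(
        prompt=caption,
        guidance_scale=6,
        use_dynamic_cfg=True,
        num_inference_steps=50,
        height=512,
        width=512
    ).frames[0]

    output_path = os.path.join(output_folder, f"{video_name}.mp4")
    export_to_video(video, output_path, fps=8)

    torch.cuda.empty_cache()  # Release cached GPU memory between generations

    return output_path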

Writing video_generator.py
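
`generate_video_from_caption` writes every result to `generated_video.mp4`, so each run replaces the previous file. A sketch of deriving a distinct, filesystem-safe name from the caption instead; `caption_to_filename` is a hypothetical helper, and the 40-character limit and hash suffix are arbitrary choices:

```python
import hashlib
import re


def caption_to_filename(caption, max_len=40):
    """Build a '<slug>-<hash>.mp4' file name from a caption."""
    # Lowercase and replace every run of non-alphanumeric characters with "-".
    slug = re.sub(r"[^a-z0-9]+", "-", caption.lower()).strip("-")[:max_len].strip("-")
    # A short hash of the full caption keeps names distinct after truncation.
    digest = hashlib.sha1(caption.encode("utf-8")).hexdigest()[:8]
    return f"{slug or 'video'}-{digest}.mp4"


print(caption_to_filename("A cat runs across the beach!"))
```

The returned name could replace the fixed `video_name` when building `output_path`, so that earlier generations stay on disk.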


In [15]:
%%writefile /content/streamlit_app.py
import streamlit as st
from video_generator import generate_video_from_caption, set_output_folder

# Streamlit Interface
st.title("Text-to-Video Generation")
st.write("Enter a caption to generate a video.")

# Input for caption
caption = st.text_input("Caption", placeholder="Type your video caption here...")

# Input for output folder
output_folder_name = st.text_input("Output Folder Name", "generated_videos")

# Set and display the output folder path
output_folder = set_output_folder(output_folder_name)
st.write(f"Videos will be saved to: `{output_folder}`")

# Generate video on button click
if st.button("Generate Video"):
    if caption.strip():
        st.write("Generating video, please wait...")
        try:
            # Call the video generation function
            video_path = generate_video_from_caption(caption, output_folder)
            st.success("Video generated successfully!")
            st.video(video_path)
        except Exception as e:
            st.error(f"Error generating video: {e}")
    else:
        st.warning("Please enter a caption.")

Writing streamlit_app.py


In [21]:
!pip freeze > requirements.txt