## Storage:
This cell points the Hugging Face caches at the network volume (`/workspace/hf`) instead of the local disk, so model and dataset downloads do not fill up the disk. It must run before `transformers` is imported.

In [1]:
import os

os.environ['HF_HOME'] = '/workspace/hf'
os.environ['TRANSFORMERS_CACHE'] = '/workspace/hf/transformers'
os.environ['HF_DATASETS_CACHE'] = '/workspace/hf/datasets'
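
The same idea can be wrapped in a small helper that also creates the cache folders up front. This is only a sketch: `configure_hf_cache` is a hypothetical name, and the three environment variable names are the ones set in the cell above.

```python
import os
from pathlib import Path


def configure_hf_cache(base: str) -> dict:
    """Point the Hugging Face caches at `base` and create the folders.

    `configure_hf_cache` is a hypothetical helper, not part of any library.
    """
    root = Path(base)
    paths = {
        "HF_HOME": root,
        "TRANSFORMERS_CACHE": root / "transformers",
        "HF_DATASETS_CACHE": root / "datasets",
    }
    for name, path in paths.items():
        path.mkdir(parents=True, exist_ok=True)  # make sure the folder exists
        os.environ[name] = str(path)
    return {name: str(path) for name, path in paths.items()}
```

In this notebook the call would be `configure_hf_cache("/workspace/hf")`, placed before any Hugging Face import.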

## Manual Installations:
Installs the dependencies manually, since there is no requirements.txt yet (this will change later).

In [4]:
!echo ">>> Uninstalling potential conflicts..."
!pip uninstall -y torch torchvision torchaudio flash-attn transformers accelerate nvidia-pyindex nvidia-pip || true

!echo ">>> Upgrading pip..."
!pip install --upgrade pip

!echo ">>> Manually installing opencv..."
!pip install opencv-python

!echo ">>> Installing PyTorch 2.3.1 + CUDA 12.1..."
!pip install torch==2.3.1+cu121 torchvision==0.18.1+cu121 torchaudio==2.3.1+cu121 --extra-index-url https://download.pytorch.org/whl/cu121

!echo ">>> Installing pinned Transformers + Accelerate..."
!pip install transformers==4.46.3 accelerate==1.0.1

!echo ">>> Installing other dependencies..."
!pip install decord ffmpeg-python imageio opencv-python

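!echo ">>> Setting CUDA environment variables..."
# `!export` only affects a throwaway subshell, so set the variables on the kernel process instead
# (`os` was imported in the Storage cell)
os.environ["CUDA_HOME"] = "/usr/local/cuda-12.1"
os.environ["LD_LIBRARY_PATH"] = "/usr/local/cuda-12.1/lib64:" + os.environ.get("LD_LIBRARY_PATH", "")

!echo ">>> Setup Complete!"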

>>> Uninstalling potential conflicts...
Found existing installation: torch 2.8.0.dev20250319+cu128
Uninstalling torch-2.8.0.dev20250319+cu128:
  Successfully uninstalled torch-2.8.0.dev20250319+cu128
Found existing installation: torchvision 0.22.0.dev20250319+cu128
Uninstalling torchvision-0.22.0.dev20250319+cu128:
  Successfully uninstalled torchvision-0.22.0.dev20250319+cu128
Found existing installation: torchaudio 2.6.0.dev20250319+cu128
Uninstalling torchaudio-2.6.0.dev20250319+cu128:
  Successfully uninstalled torchaudio-2.6.0.dev20250319+cu128
Found existing installation: accelerate 1.9.0
Uninstalling accelerate-1.9.0:
  Successfully uninstalled accelerate-1.9.0
>>> Upgrading pip...
Collecting pip
  Downloading pip-25.2-py3-none-any.whl.metadata (4.7 kB)
Downloading pip-25.2-py3-none-any.whl (1.8 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.8/1.8 MB 313.0 MB/s eta 0:00:00
Installing collected packages: pip
  Attempting un
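
Since the install cell uninstalls and reinstalls several packages, it can help to confirm afterwards that the pinned versions are the ones actually present. The sketch below uses the standard library's `importlib.metadata`. The helper name `check_pins` is hypothetical, and the pins are copied from the install cell.

```python
from importlib import metadata
from typing import Dict, Optional, Tuple


def check_pins(pins: Dict[str, str]) -> Dict[str, Tuple[str, Optional[str]]]:
    """Return {package: (expected, installed)} for every pin that is not satisfied.

    `installed` is None when the package is missing.
    """
    mismatches = {}
    for name, expected in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None
        if installed != expected:
            mismatches[name] = (expected, installed)
    return mismatches


# Pins taken from the install cell above.
PINS = {
    "torch": "2.3.1+cu121",
    "torchvision": "0.18.1+cu121",
    "torchaudio": "2.3.1+cu121",
    "transformers": "4.46.3",
    "accelerate": "1.0.1",
}

for package, (expected, installed) in check_pins(PINS).items():
    print(f"[WARN] {package}: expected {expected}, found {installed}")
```

If nothing is printed, every pinned package is installed at the expected version.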

## Setting Up and Loading the Model
Here, we import the necessary packages, define the `setup()` helper, and then load the model and its processor.

In [5]:
import os
import subprocess
from typing import List
import cv2  # For frame extraction
from transformers import AutoModelForCausalLM, AutoProcessor
import torch


def setup(repo_url: str = "https://github.com/DAMO-NLP-SG/VideoLLaMA3.git", repo_dir: str = "VideoLLaMA3"):
    """
    Sets up the environment by cloning the model repository and installing dependencies.
    Installs from requirements.txt if the repository provides one; otherwise it
    relies on the dependencies installed manually above.
    """
    # Clone the repo into repo_dir unless it already exists
    if not os.path.exists(repo_dir):
        print(f"[INFO] Cloning repository {repo_url}...")
        subprocess.run(["git", "clone", repo_url, repo_dir], check=True)
    else:
        print(f"[INFO] Repository '{repo_dir}' already exists. Skipping clone.")

    # Look for requirements.txt inside the cloned repo
    #   - may have to change this if requirements.txt is in another folder/has another path
    req_path = os.path.join(repo_dir, "requirements.txt")
    if os.path.exists(req_path):
        print(f"[INFO] Found requirements.txt at {req_path}. Installing...")
        subprocess.run(["pip", "install", "-r", req_path], check=True)
    else:
        # the default dependencies are the ones installed in the manual installation cell
        # this can be sorted out further once we have a unified requirements.txt
        print("[INFO] No requirements.txt found. Using default dependencies.")

# Load the model and processor at module level so the functions below can use them
# (setup() is defined above but is not called in this cell)
print("[INFO] Loading VideoLLaMA model...")
MODEL_NAME = "DAMO-NLP-SG/VideoLLaMA3-7B"
device = "cuda" if torch.cuda.is_available() else "cpu"
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    trust_remote_code=True,
    device_map="auto",
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
    offload_folder="offload"
)
processor = AutoProcessor.from_pretrained(MODEL_NAME, trust_remote_code=True)
print("[INFO] Model loaded successfully.")



[INFO] Loading VideoLLaMA model...


A new version of the following files was downloaded from https://huggingface.co/DAMO-NLP-SG/VideoLLaMA3-7B:
- configuration_videollama3.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.
A new version of the following files was downloaded from https://huggingface.co/DAMO-NLP-SG/VideoLLaMA3-7B:
- modeling_videollama3.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.


Downloading shards:   0%|          | 0/4 [00:00<?, ?it/s]

Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

A new version of the following files was downloaded from https://huggingface.co/DAMO-NLP-SG/VideoLLaMA3-7B:
- processing_videollama3.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.


[INFO] Model loaded successfully.
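
The warnings above suggest pinning a revision so that the remote code files do not change between runs. `from_pretrained` accepts a `revision` argument (a branch name, tag, or commit hash). A minimal sketch of how the load arguments could be assembled follows. `pinned_load_kwargs` is a hypothetical helper and `<commit-hash>` is a placeholder, not a real revision.

```python
from typing import Any, Dict, Optional

MODEL_NAME = "DAMO-NLP-SG/VideoLLaMA3-7B"


def pinned_load_kwargs(revision: Optional[str] = None) -> Dict[str, Any]:
    """Build keyword arguments for `from_pretrained`, optionally pinned to a revision."""
    kwargs: Dict[str, Any] = {"trust_remote_code": True}
    if revision is not None:
        # A branch name, tag, or commit hash on the Hub
        kwargs["revision"] = revision
    return kwargs


# Placeholder value: replace it with a commit hash you have reviewed.
kwargs = pinned_load_kwargs(revision="<commit-hash>")

# The loading calls would then look like this (left commented out in the sketch):
# processor = AutoProcessor.from_pretrained(MODEL_NAME, **kwargs)
# model = AutoModelForCausalLM.from_pretrained(
#     MODEL_NAME, device_map="auto", torch_dtype=torch.float16, **kwargs
# )
```

Passing the same revision to both the model and the processor keeps their remote code in sync.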


# Functions for generating captions

## generate_caption
This function takes one or more image paths in the list *images* and a text prompt *prompt*, passes both to the model, and returns the caption that the model generates.

In [6]:
def generate_caption(images: List[str], prompt: str, tokens: int = 64) -> str:
    """
    Generates a caption for one or more images given a text prompt.

    :param images: List of image file paths (may be empty for a text-only prompt).
    :param prompt: Text prompt to guide the caption generation.
    :param tokens: Maximum number of new tokens to generate (default: 64).
    :return: Generated caption as a string.
    """
    print(f"[INFO] Generating caption for {len(images)} image(s)...")
    conversation = [
        {"role": "system", "content": "You are a helpful assistant."},
        {
            "role": "user",
            "content": [
                *[{"type": "image", "image": {"image_path": img}} for img in images],
                {"type": "text", "text": prompt}
            ]
        },
    ]
    inputs = processor(
        conversation=conversation,
        add_system_prompt=True,
        add_generation_prompt=True,
        return_tensors="pt",
    )
    inputs = {k: v.to(device) if isinstance(v, torch.Tensor) else v for k, v in inputs.items()}
    if "pixel_values" in inputs:
        inputs["pixel_values"] = inputs["pixel_values"].to(torch.float16)
    with torch.inference_mode():
        output_ids = model.generate(**inputs, max_new_tokens=tokens)
    caption = processor.batch_decode(output_ids, skip_special_tokens=True)[0]
    return caption
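
To make the message structure easier to inspect without loading the model, the conversation can be built by a standalone function. This is a sketch that mirrors the structure used in `generate_caption`. The name `build_conversation` and the example file names are hypothetical.

```python
from typing import Any, Dict, List


def build_conversation(images: List[str], prompt: str) -> List[Dict[str, Any]]:
    """Mirror the message structure that generate_caption passes to the processor."""
    image_items = [{"type": "image", "image": {"image_path": path}} for path in images]
    return [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": image_items + [{"type": "text", "text": prompt}]},
    ]


conversation = build_conversation(["frame_0000.jpg", "frame_0001.jpg"], "Describe the scene.")
user_content = conversation[1]["content"]
print([item["type"] for item in user_content])  # ['image', 'image', 'text']
```

With an empty image list the user message contains only the text item, which is how the summary prompts are sent.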

## extract_frames
This function saves one frame every *frame_interval* frames and stores the extracted frames in *output_dir*. If *output_dir* already exists, it is deleted and recreated first.

In [19]:
import shutil
def extract_frames(video_path: str, output_dir: str, frame_interval: int = 120) -> List[str]:
    """
    Extracts frames from a video at a specified interval.

    :param video_path: Path to the input video.
    :param output_dir: Directory where extracted frames will be saved.
    :param frame_interval: Save every `frame_interval` frames (default: 120).
    :return: List of file paths to extracted frames.
    """
    print(f"[INFO] Extracting frames from {video_path} every {frame_interval} frames...")

    if os.path.exists(output_dir):
        shutil.rmtree(output_dir)
        
    os.makedirs(output_dir, exist_ok=True)
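    cap = cv2.VideoCapture(video_path)
    if not cap.isOpened():
        # Fail early instead of silently returning an empty list
        raise IOError(f"Could not open video: {video_path}")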
    count, saved = 0, 0
    frame_paths = []

    while True:
        ret, frame = cap.read()
        if not ret:
            break
        if count % frame_interval == 0:
            frame_filename = os.path.join(output_dir, f"frame_{saved:04d}.jpg")
            cv2.imwrite(frame_filename, frame)
            frame_paths.append(frame_filename)
            saved += 1
        count += 1

    cap.release()
    print(f"[INFO] Extracted {len(frame_paths)} frames.")
    return frame_paths
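
The sampling rule in `extract_frames` keeps the frames whose index is a multiple of *frame_interval*, starting with frame 0. The sketch below reproduces that rule on plain integers, and adds a hypothetical helper for choosing an interval from a frame rate. Both function names are illustrative and not part of the notebook.

```python
from typing import List


def sampled_indices(total_frames: int, frame_interval: int) -> List[int]:
    """Indices of the frames that extract_frames would save."""
    return list(range(0, total_frames, frame_interval))


def interval_for_seconds(fps: float, seconds: float) -> int:
    """Frame interval that samples roughly one frame every `seconds` seconds."""
    return max(1, round(fps * seconds))


print(sampled_indices(10, 4))              # [0, 4, 8]
print(len(sampled_indices(3400, 120)))     # 29
print(interval_for_seconds(30, 4))         # 120
```

The test video in this notebook yields 29 frames at the default interval of 120, which is consistent with a clip of between 3361 and 3480 frames.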

## generate_video_caption
This function extracts frames from the input video, captions each frame independently, and then asks the model to merge the frame captions into a coherent summary of the entire video.

In [20]:
def generate_video_caption(video_path: str) -> str:
    """
    Combines frame-wise captions into a single, coherent video-level caption.

    :param video_path: file path for the video we are working with
    :return: Final video caption string.
    """
    print("[INFO] Generating video-level caption...")
    output_dir = f"{video_path}_frames"
    frames = extract_frames(video_path, output_dir)
    frame_captions = [generate_caption([frame], "You are given a single frame extracted from a video. Generate a detailed and contextually rich caption that describes not only what is visually present (objects, people, setting, actions, emotions) but also infers the likely context of the scene based on visual cues such as lighting, expressions, clothing, or background elements. If the frame seems to be part of an ongoing action or event, include your best guess at what is happening before and after this moment, using natural language to provide a cohesive narrative. Be concise but descriptive, and focus on making the caption feel like a natural description someone might give while watching the video. Limit 64 tokens") for frame in frames]

    # Use the model to merge frame captions into a coherent summary
    merged_prompt = " ".join(frame_captions)
    summary_input = f"Merge these captions into one continuous, coherent, and natural-sounding description of the entire video. Avoid repeating the same details unless necessary, ensure smooth transitions between actions, and infer the overall context or story from the captions. If possible, describe the progression of events as if narrating the video from start to finish: {merged_prompt}"
    final_caption = generate_caption([], summary_input, 512)
    return final_caption
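
The caption-then-merge flow can be tried out without a GPU by injecting the captioning function. The sketch below assumes a callable with the same `(images, prompt, tokens)` signature as `generate_caption`. `summarize_frames` and `fake_caption` are hypothetical names used only for illustration.

```python
from typing import Callable, List

CaptionFn = Callable[[List[str], str, int], str]


def summarize_frames(
    frames: List[str],
    caption_fn: CaptionFn,
    frame_prompt: str,
    merge_prompt: str,
    frame_tokens: int = 64,
    summary_tokens: int = 512,
) -> str:
    """Caption each frame on its own, then ask for one merged summary."""
    frame_captions = [caption_fn([frame], frame_prompt, frame_tokens) for frame in frames]
    merged = " ".join(caption.strip() for caption in frame_captions)
    # The merge step sends no images, only the concatenated captions
    return caption_fn([], f"{merge_prompt}: {merged}", summary_tokens)


def fake_caption(images: List[str], prompt: str, tokens: int) -> str:
    """Stand-in for generate_caption so the flow can run without a model."""
    if images:
        return f"caption of {images[0]}"
    return f"SUMMARY[{prompt}]"


print(summarize_frames(["a.jpg", "b.jpg"], fake_caption, "Describe", "Merge"))
# SUMMARY[Merge: caption of a.jpg caption of b.jpg]
```

Swapping `fake_caption` for `generate_caption` would run the same flow against the real model.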

## generate_video_auto
This function extracts frames from the input video and builds the caption autoregressively: each frame is captioned with the previously generated captions as context, and the frame captions are then summarized into a final video caption.

In [35]:
def build_autoregressive_prompt(generated_captions: List[str]) -> str:
    """Builds the prompt for the next frame from the captions generated so far."""
    prompt = "You are an expert video captioning assistant. Here are the captions generated so far:\n\n"
    for i, cap in enumerate(generated_captions):
        prompt += f"Frame {i+1}: {cap.strip()}\n"
    prompt += f"\nNow describe Frame {len(generated_captions) + 1} in a way that is consistent with previous frames and captures the visual content accurately.\n"
    prompt += f"Frame {len(generated_captions) + 1}:"
    return prompt

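def build_video_summary_prompt(generated_captions: List[str]) -> str:
    """Builds the summary prompt and returns the model's final caption for it (not the prompt itself)."""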
    prompt = (
        "You are a professional film narrator tasked with summarizing a video based on frame-by-frame descriptions.\n\n"
        "Below is a list of captions generated from individual frames of the video. Use these to infer the setting, environment, and the sequence of events.\n\n"
        "Your task is to write a single, vivid, and coherent description that captures:\n"
        "- The environment and setting of the video\n"
        "- The sequence of actions and events\n"
        "- The overall mood, atmosphere, or intent behind the scene\n\n"
        "Avoid repeating the frame captions verbatim. Instead, synthesize the information into a fluid, descriptive paragraph as if narrating the video to someone who cannot see it.\n\n"
        "Frame Captions:\n"
    )
    for i, caption in enumerate(generated_captions):
        prompt += f"Frame {i+1}: {caption.strip()}\n"
    prompt += "\nFinal Video Caption:"
    return generate_caption([], prompt, 512)

def generate_video_auto(video_path: str) -> str:
    """
    Autoregressively creates a caption for an entire video

    :param video_path: file path for the video we are working with
    :return: Final video caption string
    """

    generated_captions = []
    print("[INFO] Generating video-level caption...")
    output_dir = f"{video_path}_frames"
    frames = extract_frames(video_path, output_dir)
    for frame in frames:
        prompt = build_autoregressive_prompt(generated_captions)
        caption = generate_caption([frame], prompt)
        generated_captions.append(caption)

    return build_video_summary_prompt(generated_captions)
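
In the loop above, the prompt grows with every frame: with 29 frames, the last prompt carries 28 earlier captions. One possible variation, which is not what the notebook does, is to include only the most recent captions. The sketch below assumes a hypothetical `max_context` parameter and keeps the original frame numbering.

```python
from typing import List


def build_windowed_prompt(generated_captions: List[str], max_context: int = 5) -> str:
    """Like build_autoregressive_prompt, but only includes the last `max_context` captions."""
    total = len(generated_captions)
    start = max(0, total - max_context)
    prompt = "You are an expert video captioning assistant. Here are the most recent captions:\n\n"
    for i in range(start, total):
        # Keep the original frame numbers so the model sees where it is in the video
        prompt += f"Frame {i + 1}: {generated_captions[i].strip()}\n"
    prompt += (
        f"\nNow describe Frame {total + 1} in a way that is consistent with "
        "previous frames and captures the visual content accurately.\n"
    )
    prompt += f"Frame {total + 1}:"
    return prompt


print(build_windowed_prompt(["a cat sleeps", "a chick arrives", "they cuddle"], max_context=2))
```

The trade-off is that the model loses sight of early frames, so the final summary step still needs the full caption list.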

# Testing Window

In [17]:
print(generate_video_caption("/workspace/video_LLama3/VideoLLaMA3/assets/cat_and_chicken.mp4"))

[INFO] Generating video-level caption...
[INFO] Extracting frames from /workspace/video_LLama3/VideoLLaMA3/assets/cat_and_chicken.mp4 every 120 frames...
[INFO] Extracted 29 frames.
[INFO] Generating caption for 1 image(s)...
[INFO] Generating caption for 1 image(s)...
[INFO] Generating caption for 1 image(s)...
[INFO] Generating caption for 1 image(s)...
[INFO] Generating caption for 1 image(s)...
[INFO] Generating caption for 1 image(s)...
[INFO] Generating caption for 1 image(s)...
[INFO] Generating caption for 1 image(s)...
[INFO] Generating caption for 1 image(s)...
[INFO] Generating caption for 1 image(s)...
[INFO] Generating caption for 1 image(s)...
[INFO] Generating caption for 1 image(s)...
[INFO] Generating caption for 1 image(s)...
[INFO] Generating caption for 1 image(s)...
[INFO] Generating caption for 1 image(s)...
[INFO] Generating caption for 1 image(s)...
[INFO] Generating caption for 1 image(s)...
[INFO] Generating caption for 1 image(s)...
[INFO] Generating caption 

In [36]:
print(generate_video_auto("/workspace/video_LLama3/VideoLLaMA3/assets/cat_and_chicken.mp4"))

[INFO] Generating video-level caption...
[INFO] Extracting frames from /workspace/video_LLama3/VideoLLaMA3/assets/cat_and_chicken.mp4 every 120 frames...
[INFO] Extracted 29 frames.
[INFO] Generating caption for 1 image(s)...
[INFO] Generating caption for 1 image(s)...
[INFO] Generating caption for 1 image(s)...
[INFO] Generating caption for 1 image(s)...
[INFO] Generating caption for 1 image(s)...
[INFO] Generating caption for 1 image(s)...
[INFO] Generating caption for 1 image(s)...
[INFO] Generating caption for 1 image(s)...
[INFO] Generating caption for 1 image(s)...
[INFO] Generating caption for 1 image(s)...
[INFO] Generating caption for 1 image(s)...
[INFO] Generating caption for 1 image(s)...
[INFO] Generating caption for 1 image(s)...
[INFO] Generating caption for 1 image(s)...
[INFO] Generating caption for 1 image(s)...
[INFO] Generating caption for 1 image(s)...
[INFO] Generating caption for 1 image(s)...
[INFO] Generating caption for 1 image(s)...
[INFO] Generating caption 

In [34]:
frames = extract_frames("/workspace/video_LLama3/VideoLLaMA3/assets/cat_and_chicken.mp4", "/workspace/video_LLama3/VideoLLaMA3/assets/cat_and_chicken.mp4_frames")
prompt = """You are a professional video description model. You are given a sequence of video frames representing a short video clip.

Your task is to:
- Analyze all frames in temporal order
- Infer the setting and environment
- Understand the actions or events taking place
- Capture the mood or atmosphere of the video

Generate a single, coherent, and descriptive caption that summarizes the entire video. This caption should read like a natural-language description of what a viewer would see and feel while watching the video.

Avoid listing what's in each frame individually. Instead, synthesize the full sequence into one meaningful and detailed narrative."""
print(generate_caption(frames, prompt, 512))

[INFO] Extracting frames from /workspace/video_LLama3/VideoLLaMA3/assets/cat_and_chicken.mp4 every 120 frames...
[INFO] Extracted 29 frames.
[INFO] Generating caption for 29 image(s)...
a small yellow chick and a small orange kitten are cuddling together. The chick is lying on the kitten's back and the kitten is lying on its side. The kitten is sleeping and the chick is looking around. The chick then snuggles into the kitten's chest and the kitten wakes up. The kitten starts to play with the chick and they both seem to be enjoying each other's company.
