<a href="https://colab.research.google.com/github/brunocostarendon/nvidia-omniverse/blob/main/nvidia_cosmos.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Generating quality driving data with Cosmos

This short demo uses Cosmos-Predict to generate a driving clip and Cosmos-Reason to judge whether the clip is realistic enough for training.

The critic scores the video as real (score = 0). If we decide Cosmos-Reason is a good enough critic, we can scale up generation to build large training datasets and evaluation benchmarks, noting the source of each clip.

We can also argue that this video is not of good enough quality. On review, we see artifacts that propagate from earlier frames into later ones, and the footage is noticeably shaky. We could address the shakiness by trying different prompts, fine-tuning Cosmos-Predict for a single, fixed camera position, or reviewing the datasets used to train the model.
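
One way to note the source of each generated clip is to write a small manifest entry next to it. The sketch below is only an illustration: the `ClipRecord` class, its field names, and the JSON Lines layout are assumptions, not part of Cosmos or of this notebook.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class ClipRecord:
    """Provenance for one generated clip (hypothetical field names)."""
    video_path: str
    generator_model: str
    prompt: str
    seed: int
    critic_model: str
    critic_score: int  # 0 = judged real, 10 = judged AI generated
    synthetic: bool = True  # generated clips are always flagged as synthetic


def to_manifest_line(record: ClipRecord) -> str:
    """Serialize one record as a single JSON Lines entry."""
    return json.dumps(asdict(record), sort_keys=True)


record = ClipRecord(
    video_path="output.mp4",
    generator_model="nvidia/Cosmos-Predict2-2B-Video2World",
    prompt="A drive on the highway at night in Seoul",
    seed=1,
    critic_model="nvidia/Cosmos-Reason1-7B",
    critic_score=0,
)
line = to_manifest_line(record)
print(line)
```

Appending one such line per clip to a manifest file would let a later training or benchmarking job filter clips by critic score and separate synthetic from real footage.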

In [6]:
!pip install cosmos_guardrail
!pip install peft==0.17.0
!sudo apt-get update
!sudo apt-get install -y ffmpeg libavformat-dev libavcodec-dev libavutil-dev
#!pip install pyav
!pip install torchvision
!pip install av
!pip install decord

Hit:1 https://cli.github.com/packages stable InRelease
Hit:2 https://cloud.r-project.org/bin/linux/ubuntu jammy-cran40/ InRelease
Hit:3 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  InRelease
Hit:4 http://archive.ubuntu.com/ubuntu jammy InRelease
Hit:5 http://security.ubuntu.com/ubuntu jammy-security InRelease
Hit:6 http://archive.ubuntu.com/ubuntu jammy-updates InRelease
Hit:7 http://archive.ubuntu.com/ubuntu jammy-backports InRelease
Hit:8 https://r2u.stat.illinois.edu/ubuntu jammy InRelease
Hit:9 https://ppa.launchpadcontent.net/deadsnakes/ppa/ubuntu jammy InRelease
Hit:10 https://ppa.launchpadcontent.net/graphics-drivers/ppa/ubuntu jammy InRelease
Hit:11 https://ppa.launchpadcontent.net/ubuntugis/ppa/ubuntu jammy InRelease
Reading package lists... Done
W: Skipping acq

In [1]:
from huggingface_hub import login
from google.colab import userdata
# Read the Hugging Face token from the Colab secret named HF_TOKEN.
token = userdata.get('HF_TOKEN')
login(token=token)

## Run Cosmos-Predict2-2B

In [6]:
import torch
from diffusers import Cosmos2VideoToWorldPipeline
from diffusers.utils import export_to_video, load_image
import os

os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"
model_id = "nvidia/Cosmos-Predict2-2B-Video2World"
pipe = Cosmos2VideoToWorldPipeline.from_pretrained(model_id, torch_dtype=torch.bfloat16)
pipe.to("cuda")

prompt = "A drive on the highway at night in Seoul, going back home into the city, and driving very well obeying all traffic laws."
# Conditioning image that the generated clip continues from.
image = load_image("/content/night_highway_drive.jpg")

# Fix the seed so the generation is reproducible.
video = pipe(
    image=image, prompt=prompt, generator=torch.Generator().manual_seed(1)
).frames[0]
# Save as output.mp4, the path the critic cell reads later.
export_to_video(video, "output.mp4", fps=16)


Loading pipeline components...:   0%|          | 0/5 [00:00<?, ?it/s]

The config attributes {'final_sigmas_type': 'sigma_min', 'sigma_data': 1.0, 'sigma_max': 80.0, 'sigma_min': 0.002} were passed to FlowMatchEulerDiscreteScheduler, but are not expected and will be ignored. Please verify your scheduler_config.json configuration file.


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

Fetching 146 files:   0%|          | 0/146 [00:00<?, ?it/s]

Downloading shards:   0%|          | 0/3 [00:00<?, ?it/s]

Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

  0%|          | 0/35 [00:00<?, ?it/s]

'output.mp4'

## Run Cosmos-Reason1-7B
We use Cosmos-Reason1-7B to determine whether the video generated by Cosmos-Predict2-2B looks real (score = 0) or generated (score = 10). Cosmos-Reason1-7B returns real, so we treat the generated video as usable for training.

In [2]:
import gc
import torch
gc.collect()
torch.cuda.empty_cache()

In [3]:
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

# Load the critic in bfloat16 to reduce GPU memory use.
critic_model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "nvidia/Cosmos-Reason1-7B",
    torch_dtype=torch.bfloat16,
).to("cuda")
critic_processor = AutoProcessor.from_pretrained("nvidia/Cosmos-Reason1-7B")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.48, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.


In [4]:
import cv2
import decord
import numpy as np

video_path = "/content/output.mp4"
video_frames = decord.VideoReader(video_path)
total_frames = len(video_frames)

# Sample 5 evenly spaced frames so we don't run out of GPU memory.
indices = np.linspace(0, total_frames - 1, num=5).astype(int)
video_frames = video_frames.get_batch(indices.tolist()).asnumpy()
# Downscale each sampled frame to 224x224 to shrink the visual input further.
video_frames = [cv2.resize(frame, (224, 224)) for frame in video_frames]

critic_prompt = (
    "Rate this video between [1-10] as being AI generated (=10) or real (=0)."
)
messages = [
    {
        "role": "user",
        "content": [
            {"type": "video", "path": video_path},
            {"type": "text", "text": critic_prompt},
        ],
    },
]
text_prompt = critic_processor.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
inputs = critic_processor(
    text=[text_prompt],
    videos=[video_frames],
    padding=True,
    return_tensors="pt"
).to("cuda", dtype=torch.bfloat16)
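
The `np.linspace` call in the cell above picks frame indices spread evenly from the first frame to the last. As a minimal sketch of the same idea in plain Python, the hypothetical helper `evenly_spaced_indices` below uses integer arithmetic; it is meant to illustrate the sampling pattern, not to replace the NumPy call, and the 93-frame clip length is an assumed example value.

```python
def evenly_spaced_indices(total_frames: int, num_samples: int) -> list[int]:
    """Pick num_samples frame indices spread across [0, total_frames - 1]."""
    if total_frames <= 0 or num_samples <= 0:
        return []
    if num_samples == 1:
        return [0]
    last = total_frames - 1
    # Integer division truncates toward zero, like .astype(int) on
    # the non-negative values produced by np.linspace.
    return [(i * last) // (num_samples - 1) for i in range(num_samples)]


# Hypothetical 93-frame clip sampled down to 5 frames.
print(evenly_spaced_indices(93, 5))  # [0, 23, 46, 69, 92]
```

Asking for more samples than there are frames produces repeated indices, so a caller may want to cap `num_samples` at `total_frames`.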

In [5]:
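# Greedy decoding (do_sample=False) keeps the score deterministic;
# sampling parameters such as temperature do not apply in this mode.
critique_output = critic_model.generate(
    **inputs,
    max_new_tokens=128,
    do_sample=False
)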

critique = critic_processor.decode(critique_output[0], skip_special_tokens=True)
print(critique)



system
You are a helpful assistant.
user
Rate this video between [1-10] as being AI generated (=10) or real (=0).
assistant
0
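
The decoded text contains the whole chat transcript, so the score has to be pulled out of the assistant's final turn before it can drive a keep-or-discard decision. The sketch below shows one way to do that. The helper names and the `max_score=2` threshold are assumptions for illustration, and it assumes the reply itself does not contain the word "assistant".

```python
import re
from typing import Optional


def parse_critic_score(decoded: str) -> Optional[int]:
    """Return the 0-10 score from the assistant's reply, or None if absent."""
    # Keep only the text after the last "assistant" marker in the transcript.
    _, marker, reply = decoded.rpartition("assistant")
    if not marker:
        reply = decoded
    match = re.search(r"\b(10|[0-9])\b", reply)
    return int(match.group(1)) if match else None


def is_usable_for_training(score: Optional[int], max_score: int = 2) -> bool:
    """Accept a clip only if the critic rated it close to real (hypothetical threshold)."""
    return score is not None and score <= max_score


transcript = (
    "system\nYou are a helpful assistant.\n"
    "user\nRate this video between [1-10] as being AI generated (=10) or real (=0).\n"
    "assistant\n0\n"
)
score = parse_critic_score(transcript)
print(score, is_usable_for_training(score))
```

Returning `None` when no score is found keeps an unparseable reply from being silently treated as "real".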
