In [1]:
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
import torch
from utils.video_utils import load_video, add_pertubation
from utils.genearte_cd import generate_cd
import copy

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-7B-Instruct",
    dtype=torch.bfloat16,  # `torch_dtype` is deprecated in this transformers version
    attn_implementation="flash_attention_2",
    device_map="auto",
)

processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")

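# Load the clip with num_segments=8 (matching num_frames=8 passed to the processor below),
# then build a perturbed copy of it to serve as the noisy input.
clean_images = load_video(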
    "/data/kyubin/VidHalluc/eval/inference/VidHalluc/data/STH/_AZ5jo3y7V4_clip_2_5.mp4",
    num_segments=8,
)
noise_images = copy.deepcopy(clean_images)
noise_images, has_candidate = add_pertubation(video_np=noise_images, gaussian_std=75, THRESHOLD=0.8)

clean_prompt = [{
    "role": "user",
    "content": [
        {"type": "video", "video": clean_images},
        {"type": "text", "text": "Describe this video"},
    ]
}]

noise_prompt = [{
    "role": "user",
    "content": [
        {"type": "video", "video": noise_images},
        {"type": "text", "text": "Describe this video"},
    ]
}]

clean_inputs = processor.apply_chat_template(
    clean_prompt,
    add_generation_prompt=True,
    tokenize=True,
    num_frames=8,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

noise_inputs = processor.apply_chat_template(
    noise_prompt,
    add_generation_prompt=True,
    tokenize=True,
    num_frames=8,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

with torch.inference_mode():
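    # use_cd=False (with cd_alpha=0, cd_beta=0) requests a baseline run without contrastive decoding.
    generate_ids = generate_cd(
        clean_inputs, noise_inputs, model, processor,
        cd_alpha=0, cd_beta=0, use_cd=False
    )
    # Decode only the newly generated tokens, skipping the prompt.
    prompt_len = clean_inputs["input_ids"].shape[1]
    output_text = processor.decode(generate_ids[0, prompt_len:], skip_special_tokens=True)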
    print(output_text)

Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.52, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.
`torch_dtype` is deprecated! Use `dtype` instead!


The image processor of type `Qwen2VLImageProcessor` is now loaded as a fast processor by default, even if the model checkpoint was saved with a slow processor. This is a breaking change and may produce slightly different outputs. To continue using the slow processor, instantiate this class with `use_fast=False`. Note that this behavior will be extended to all models in a future release.


This video appears to be a cooking tutorial from the "Bake Like A Pro" YouTube channel. 

In the initial frames, we see a close-up of a person straining a mixture into a bowl. The text overlay states, "We now want to strain our mixture, to remove most of the moisture." This suggests that the person is preparing a mixture for a recipe and removing excess liquid or moisture to achieve the desired consistency.

In the subsequent frames, the focus shifts to the preparation of pancakes. The person is seen using kitchen tongs to flip some pancake slices in a pan. There is also a bowl with what seems to be pancake batter next to the pan, which indicates that the pancakes are in the process of cooking.

The video likely provides step-by-step instructions on how to make pancakes, with a focus on techniques such as straining the batter and cooking the pancakes properly.
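
The run above passes `use_cd=False` with `cd_alpha=0` and `cd_beta=0`, so the noised frames do not appear to influence the output. For reference, here is a minimal sketch of how a single contrastive-decoding step could combine the two branches. It assumes `generate_cd` follows the common visual contrastive decoding recipe (amplify the clean logits, subtract the noised ones, and mask implausible tokens). The helpers `log_softmax`, `contrastive_step`, and `greedy_pick` are hypothetical and are not taken from `utils.genearte_cd`, whose actual implementation is not shown here.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def contrastive_step(clean_logits, noise_logits, cd_alpha=1.0, cd_beta=0.1):
    """Score next-token candidates for one decoding step (illustrative only).

    Assumed recipe: score = (1 + cd_alpha) * clean - cd_alpha * noise,
    restricted to tokens whose clean probability is at least
    cd_beta times the highest clean probability.
    """
    clean_lp = log_softmax(clean_logits)
    noise_lp = log_softmax(noise_logits)

    # Plausibility cutoff in log space; cd_beta <= 0 disables the mask.
    cutoff = max(clean_lp) + math.log(cd_beta) if cd_beta > 0 else float("-inf")

    scores = []
    for c, n in zip(clean_lp, noise_lp):
        if c < cutoff:
            scores.append(float("-inf"))  # token is implausible under the clean input
        else:
            scores.append((1 + cd_alpha) * c - cd_alpha * n)
    return scores


def greedy_pick(scores):
    """Index of the highest-scoring token."""
    return max(range(len(scores)), key=scores.__getitem__)


# Toy 3-token vocabulary: token 0 is favoured even when the video is noised
# (a language prior), while token 1 depends on the clean visual evidence.
clean = [2.0, 1.9, -3.0]
noise = [2.5, 0.5, -3.0]

baseline = greedy_pick(contrastive_step(clean, noise, cd_alpha=0.0, cd_beta=0.0))
contrast = greedy_pick(contrastive_step(clean, noise, cd_alpha=1.0, cd_beta=0.1))
print(baseline, contrast)
```

With `cd_alpha=0` and `cd_beta=0` the scores reduce to the clean log-probabilities, so the sketch falls back to ordinary greedy decoding. A positive `cd_alpha` penalises tokens that stay likely under the noised video.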

