# Setting up environment

Check the CUDA version

In [1]:
!nvidia-smi

Mon Apr  7 08:55:46 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.03              Driver Version: 560.35.03      CUDA Version: 12.6     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  Tesla T4                       Off |   00000000:00:04.0 Off |                    0 |
| N/A   70C    P0             41W /   70W |       1MiB /  15360MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  Tesla T4                       Off |   00

Configure the CUDA memory allocator

In [2]:
%env PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
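
PyTorch reads `PYTORCH_CUDA_ALLOC_CONF` from the environment of the process that runs `torch`, so the variable has to be set in the notebook kernel itself rather than in a child shell. A minimal sketch in plain Python, assuming it runs before `torch` is imported (the safest ordering):

```python
import os

# Set the allocator option in the kernel process itself, before importing torch.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

# Confirm that the kernel sees the value.
print(os.environ["PYTORCH_CUDA_ALLOC_CONF"])
```

The same approach works in a plain script, where notebook magics are not available.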

Install packages

In [3]:
%pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu126
%pip install --upgrade datasets transformers bitsandbytes
%pip install --upgrade qwen-vl-utils[decord]

Looking in indexes: https://download.pytorch.org/whl/cu126
Note: you may need to restart the kernel to use updated packages.


# Using fine-tuned Qwen 2.5-VL one time

Import packages

In [4]:
import time
import csv
import torch
from datasets import load_dataset
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor, BitsAndBytesConfig
from qwen_vl_utils import process_vision_info

Load dataset and model

In [5]:
dataset = load_dataset("lmms-lab/AISG_Challenge")
device = "cuda" if torch.cuda.is_available() else "cpu"
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-3B-Instruct",
    torch_dtype=torch.float32,
    device_map=device)
processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-3B-Instruct")

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.52, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.


Locate the local video file

In [6]:
def retrieve_video(video_id):
    filename = f"{video_id}.mp4"
    video_path = f"../input/videos/videos/{filename}"
    return video_path
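
A missing file otherwise only surfaces later, inside the video decoding step. The following sketch shows one possible variant that fails early. The name `retrieve_video_checked` and the `video_dir` parameter are hypothetical; the directory layout is the one used in the cell above.

```python
from pathlib import Path

# Assumed dataset layout, taken from the cell above.
VIDEO_DIR = Path("../input/videos/videos")


def retrieve_video_checked(video_id, video_dir=VIDEO_DIR):
    """Return the local path for a video, failing early if the file is missing."""
    video_path = Path(video_dir) / f"{video_id}.mp4"
    if not video_path.is_file():
        raise FileNotFoundError(f"No local video for id {video_id!r}: {video_path}")
    return str(video_path)
```

Raising here keeps the error message tied to the offending `video_id`, which may make long batch runs easier to debug.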

Process sample

In [None]:
def process_test_case(example):
    video_id = example["video_id"]
    question = example["question"]
    question_prompt = example["question_prompt"]
    expected_answer = example["answer"]
    video_path = retrieve_video(video_id)

    conversation = [
        {
            "role": "user",
            "content": [
                {
                    "type": "video",
                    "video": video_path,
                    "max_pixels": 81 * 144,
                    "fps": 1,
                },
                {"type": "text", "text": f"{question}\n{question_prompt}\nAnswer in English only."},
            ],
        }
    ]

    text = processor.apply_chat_template(
        conversation, tokenize=False, add_generation_prompt=True
    )
    image_inputs, video_inputs, video_kwargs = process_vision_info(conversation, return_video_kwargs=True)
    inputs = processor(
        text=[text],
        images=image_inputs,
        videos=video_inputs,
        padding=True,
        return_tensors="pt",
        **video_kwargs,
    )
    inputs = inputs.to(model.device)

    generated_ids = model.generate(**inputs, max_new_tokens=1280)
    generated_ids_trimmed = [
        out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
    ]
    output_text = processor.batch_decode(
        generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
    )[0]
    # Debug output (disabled):
    # print(f"Video URL: {example['youtube_url']}")
    # print(f"Question:\n{question}\n{question_prompt}")
    # print(f"Answer: {output_text}")
    return example['qid'], output_text
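
The slicing step in `process_test_case` exists because, for a decoder-only model such as this one, `generate` returns the prompt tokens followed by the newly generated ones. A toy illustration with plain lists follows. The token ids are made up and the helper name `trim_prompt_tokens` is hypothetical.

```python
def trim_prompt_tokens(input_ids, generated_ids):
    """Drop the echoed prompt tokens from each generated sequence."""
    return [out[len(inp):] for inp, out in zip(input_ids, generated_ids)]


# Toy token ids (made up for illustration): the prompt is 3 tokens long.
prompt = [[101, 7, 8]]
generated = [[101, 7, 8, 42, 43]]
print(trim_prompt_tokens(prompt, generated))  # [[42, 43]]
```

Only the trimmed ids are passed to `batch_decode`, so the decoded answer does not repeat the question.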

Run test cases

In [9]:
start_time = time.time()
output_file = "./results.csv"
with open(output_file, mode="w", newline="") as file:
    writer = csv.writer(file)
    writer.writerow(["qid", "pred"])
    # Single-sample smoke test (disabled):
    # sample = dataset['test'].filter(lambda x: x['qid'] == '0957-3')[0]
    # qid, pred = process_test_case(sample)
    # writer.writerow([qid, pred])
    for sample in dataset['test']:
        print(f"processing qid {sample['qid']}")
        qid, pred = process_test_case(sample)
        writer.writerow([qid, pred])
end_time = time.time()
print(f"Time taken: {end_time - start_time} seconds")

Filter:   0%|          | 0/1500 [00:00<?, ? examples/s]

Video URL: https://www.youtube.com/shorts/XK7kH7pxTcE
Question:
Are there five drainage channels shown in the video?
Please state your answer with a brief explanation.
Answer: Yes, there are five drainage channels shown in the video. The man is seen laying bricks to create these channels in the ground.
Time taken: 10.464519500732422 seconds
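
A full pass over the test split can take a long time, and the loop above reopens `results.csv` in write mode on every start. The sketch below shows one way to make the run resumable. It is not part of the original notebook: `run_resumable` and `load_done_qids` are hypothetical helpers, and `predict` stands in for `process_test_case`.

```python
import csv
import os


def load_done_qids(path):
    """Return the set of qids already present in an existing results file."""
    if not os.path.exists(path):
        return set()
    with open(path, newline="", encoding="utf-8") as f:
        return {row["qid"] for row in csv.DictReader(f)}


def run_resumable(samples, predict, path="results.csv"):
    """Append one row per unprocessed sample, flushing after each write."""
    done = load_done_qids(path)
    write_header = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, mode="a", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        if write_header:
            writer.writerow(["qid", "pred"])
        for sample in samples:
            if sample["qid"] in done:
                continue  # already answered in a previous run
            qid, pred = predict(sample)
            writer.writerow([qid, pred])
            f.flush()  # keep finished rows on disk if the run is interrupted
```

With these helpers the loop would become `run_resumable(dataset['test'], process_test_case)`; restarting the cell then skips every `qid` already written.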
