## CCTV Image Analysis with Qwen3-VL
#### This Colab notebook loads CCTV images from Google Drive and analyzes them with a Qwen3-VL Instruct model. The linked model card is Qwen/Qwen3-VL-8B-Instruct; the loading cell uses the smaller Qwen/Qwen3-VL-2B-Instruct checkpoint. Tasks include counting the persons in each image, visual question answering, and processing multiple images automatically.

[Hugging Face: Qwen3-VL-8B-Instruct](https://huggingface.co/Qwen/Qwen3-VL-8B-Instruct)

In [1]:
from google.colab import drive
drive.mount('/content/drive')


Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [2]:
import os
import glob

folder_path = "/content/drive/My Drive/CCTV Generated Images"

image_paths = sorted(glob.glob(os.path.join(folder_path, "*.png")))

print(f"Found {len(image_paths)} images.")
image_paths[:5]


Found 165 images.


['/content/drive/My Drive/CCTV Generated Images/cctv-generated-1763214102512-1.png',
 '/content/drive/My Drive/CCTV Generated Images/cctv-generated-1763214105051-1.png',
 '/content/drive/My Drive/CCTV Generated Images/cctv-generated-1763214107200-1.png',
 '/content/drive/My Drive/CCTV Generated Images/cctv-generated-1763214109558-1.png',
 '/content/drive/My Drive/CCTV Generated Images/cctv-generated-1763214111809-1.png']

In [3]:
!pip install transformers accelerate bitsandbytes timm einops
!pip install opencv-python pillow




In [5]:
from transformers import AutoProcessor, AutoModelForVision2Seq
from PIL import Image
import torch

# Load model + processor
model_name = "Qwen/Qwen3-VL-2B-Instruct"
processor = AutoProcessor.from_pretrained(model_name)
model = AutoModelForVision2Seq.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto"
)

# ---- Load your LOCAL image ----
img = Image.open(image_paths[0]).convert("RGB")   # <-- change index for different images

# ---- Build chat messages ----
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},               # <-- image placeholder
            {"type": "text", "text": "What objects do you see in this image? Count persons too."}
        ],
    }
]

# ---- Apply chat template (must NOT tokenize yet) ----
chat_prompt = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=False
)

# ---- Prepare model inputs (image + text) ----
inputs = processor(
    text=chat_prompt,
    images=img,
    return_tensors="pt"
).to(model.device)

# ---- Generate output ----
# Note: max_new_tokens=100 cuts longer answers off mid-sentence; raise it for full descriptions.
output_ids = model.generate(**inputs, max_new_tokens=100)

# ---- Decode only the newly generated tokens ----
generated_text = processor.batch_decode(
    output_ids[:, inputs["input_ids"].shape[-1]:],
    skip_special_tokens=True
)[0]

print(generated_text)


Based on the image provided, here is a detailed list of the objects and people visible:

### Objects

- **People**: There are four men in the room.
    - A man in a dark suit and light-colored pants is standing at the desk, interacting with a computer.
    - A man in a grey hoodie and dark pants is standing with his back to the camera.
    - A man in a light-colored, long-sleeved shirt and brown pants is holding a document and looking at
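
The introduction mentions counting persons per image and processing multiple images automatically, while the cell above handles a single image and returns free-form text. The following is a minimal sketch of one way to bridge that gap. The names `extract_person_count` and `count_persons_in_images` are hypothetical helpers, not part of the notebook or of `transformers`, and `describe` is assumed to be any callable that maps an image path to the model's text answer (for example, a function wrapping the processor and `generate` steps shown above). The regex only picks up the first count that appears next to a person-like noun, so it is a rough heuristic; prompting the model to reply with a bare integer would likely be more reliable.

```python
import re

# Hypothetical lookup for spelled-out counts; extend as needed.
_WORD_TO_INT = {
    "zero": 0, "no": 0, "one": 1, "two": 2, "three": 3, "four": 4,
    "five": 5, "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10,
}

# Matches e.g. "four men", "3 people", "no persons" (case-insensitive).
_COUNT_PATTERN = re.compile(
    r"\b(\d+|" + "|".join(_WORD_TO_INT) + r")\s+"
    r"(?:people|persons?|men|man|women|woman|individuals)\b",
    flags=re.IGNORECASE,
)


def extract_person_count(text):
    """Return the first person count mentioned in `text`, or None if absent."""
    match = _COUNT_PATTERN.search(text)
    if match is None:
        return None
    token = match.group(1).lower()
    return int(token) if token.isdigit() else _WORD_TO_INT[token]


def count_persons_in_images(paths, describe):
    """Apply `describe` to each path and parse a person count from its answer.

    `describe` is an assumed callable: image path -> generated text.
    """
    return {path: extract_person_count(describe(path)) for path in paths}
```

In the notebook, this could be called as `count_persons_in_images(image_paths, describe)` once `describe` wraps the inference steps. Images whose answer contains no recognizable count map to `None`, so they can be reviewed by hand.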
