# Gozumol - Qwen2-VL Testing Notebook

This notebook tests the Qwen2-VL model as a visual assistant.

**Purpose**: Test the model's ability to describe environments and provide navigation guidance for visually impaired users.

## Requirements

```text
transformers>=4.45.0
torch>=2.0.0
pillow>=10.0.0
qwen-vl-utils
```

## 1. Install Dependencies

In [None]:
# Install required packages
# (quote the version specifiers so the shell does not treat ">" as a redirect)
!pip install "transformers>=4.45.0"
!pip install "torch>=2.0.0"
!pip install "pillow>=10.0.0"
!pip install qwen-vl-utils
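
After installing, it can help to confirm which versions actually ended up in the environment. The following is a minimal sketch using only the standard library; the `REQUIRED` list and the `installed_versions` helper are illustrative names, not part of the original notebook.

```python
from importlib import metadata

# Distribution names taken from the Requirements list above.
REQUIRED = ["transformers", "torch", "pillow", "qwen-vl-utils"]

def installed_versions(packages):
    """Map each distribution name to its installed version, or None if missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions

for name, version in installed_versions(REQUIRED).items():
    print(f"{name}: {version or 'NOT INSTALLED'}")
```

This only reports versions; comparing them against the minimums is left to the reader.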

## 2. Load Model and Processor

In [None]:
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
import torch

# Model configuration
MODEL_ID = "Qwen/Qwen2-VL-2B-Instruct"

# Load model
print("Loading model...")
model = Qwen2VLForConditionalGeneration.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.float16,
    device_map="auto",
)

# Load processor
print("Loading processor...")
processor = AutoProcessor.from_pretrained(MODEL_ID)

print("Model loaded successfully!")
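
The cell above always loads in `float16`, which assumes a GPU is available. One possible way to pick the load settings from the hardware is sketched below. `choose_load_kwargs` is a hypothetical helper, and the CPU fallback to `float32` is an assumption (half precision on CPU can be slow or unsupported for some operations, depending on the PyTorch version), not something measured in this notebook.

```python
def choose_load_kwargs(has_cuda: bool, has_bf16: bool = False) -> dict:
    """Return load settings as plain values (hypothetical helper).

    The dtype is returned as a string so this function has no torch
    dependency; resolve it with getattr(torch, name) at the call site.
    """
    if not has_cuda:
        # Assumption: prefer full precision when running on CPU.
        return {"dtype_name": "float32", "device_map": "cpu"}
    dtype_name = "bfloat16" if has_bf16 else "float16"
    return {"dtype_name": dtype_name, "device_map": "auto"}

# Usage with torch (shown as comments, not executed here):
#   has_cuda = torch.cuda.is_available()
#   kw = choose_load_kwargs(has_cuda, has_cuda and torch.cuda.is_bf16_supported())
#   model = Qwen2VLForConditionalGeneration.from_pretrained(
#       MODEL_ID,
#       torch_dtype=getattr(torch, kw["dtype_name"]),
#       device_map=kw["device_map"],
#   )
print(choose_load_kwargs(has_cuda=False))
```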

## 3. Define Vision Assistant Interface

This function builds the chat messages, runs the processor and `model.generate`, and returns the decoded response together with per-stage timings.

In [None]:
import time
from typing import Optional

from PIL import Image

def qwen_vision_inference(
    processor,
    model,
    image: Image.Image,
    user_prompt: str,
    system_prompt: Optional[str] = None,
    max_new_tokens: int = 1024
):
    """
    Generate a response from the Qwen2-VL model.
    
    Parameters:
        processor: The AutoProcessor for the model
        model: The loaded Qwen2-VL model
        image: PIL Image to analyze
        user_prompt: User's question/instruction
        system_prompt: Optional system message
        max_new_tokens: Maximum tokens to generate
    
    Returns:
        Tuple of (response_text, timing_info)
    """
    # Build messages
    messages = []
    
    if system_prompt:
        messages.append({
            "role": "system",
            "content": system_prompt
        })
    
    messages.append({
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": user_prompt}
        ]
    })
    
    # Track timing (perf_counter is monotonic, so it is safer for measuring intervals)
    timing_info = {}

    # Process inputs
    start = time.perf_counter()
    text_prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
    inputs = processor(
        text=[text_prompt],
        images=[image],
        padding=True,
        return_tensors="pt"
    )
    inputs = inputs.to(model.device)
    timing_info["processor_time"] = time.perf_counter() - start

    # Generate response
    start = time.perf_counter()
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    timing_info["generation_time"] = time.perf_counter() - start

    # Extract generated tokens only (drop the prompt tokens from each sequence)
    generated_ids = [
        full_ids[len(prompt_ids):]
        for prompt_ids, full_ids in zip(inputs.input_ids, output_ids)
    ]

    # Decode response
    start = time.perf_counter()
    response = processor.batch_decode(
        generated_ids,
        skip_special_tokens=True,
        clean_up_tokenization_spaces=True
    )[0]
    timing_info["decode_time"] = time.perf_counter() - start
    
    # Calculate total time
    timing_info["total_time"] = (
        timing_info["processor_time"] +
        timing_info["generation_time"] +
        timing_info["decode_time"]
    )
    
    return response, timing_info
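
The message structure built inside the function can be looked at in isolation. The sketch below factors it into a standalone helper so the shape is easy to inspect without loading the model; `build_messages` is an illustrative name and is not used by the notebook itself.

```python
from typing import Optional

def build_messages(user_prompt: str, system_prompt: Optional[str] = None) -> list:
    """Build the chat messages: optional system turn, then image + text user turn."""
    messages = []
    if system_prompt:
        messages.append({"role": "system", "content": system_prompt})
    messages.append({
        "role": "user",
        "content": [
            {"type": "image"},  # placeholder; the pixels are passed to the processor
            {"type": "text", "text": user_prompt},
        ],
    })
    return messages

example = build_messages("What is ahead of me?", "You are a guide.")
print([m["role"] for m in example])  # ['system', 'user']
```

The `{"type": "image"}` entry only marks where the image goes in the prompt; the image itself is supplied separately through `images=[image]`.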

## 4. Display Helper Function

In [None]:
from IPython.display import display, Markdown

def run_and_display(image, user_prompt, system_prompt=None, test_name="Test"):
    """
    Run inference and display results in a formatted way.
    
    Parameters:
        image: PIL Image to analyze
        user_prompt: User's prompt
        system_prompt: Optional system prompt
        test_name: Name to display for this test
    
    Returns:
        Tuple of (response, timing)
    """
    display(Markdown(f"## {test_name}"))
    
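    # Show the user prompt (truncated to 200 characters, with an ellipsis only if cut)
    prompt_preview = user_prompt if len(user_prompt) <= 200 else user_prompt[:200] + "..."
    display(Markdown(f"**User Prompt:** {prompt_preview}"))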
    
    # Run inference
    response, timing = qwen_vision_inference(
        processor, model, image, user_prompt, system_prompt
    )
    
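    # Display response (prefix every line so multi-line output stays in the blockquote)
    display(Markdown("### Response:"))
    quoted = "\n".join(f"> {line}" for line in response.splitlines())
    display(Markdown(quoted))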
    
    # Display timing
    display(Markdown("### Timing:"))
    display(Markdown(f"""
- Processor time: {timing['processor_time']:.3f}s
- Generation time: {timing['generation_time']:.3f}s
- Decode time: {timing['decode_time']:.3f}s
- **Total time: {timing['total_time']:.3f}s**
    """))
    
    return response, timing
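
The string formatting in the display helper can be separated from the notebook display calls, which makes it easier to check. This is a minimal sketch; `preview` and `format_timing` are hypothetical helpers, and they assume the timing keys returned by `qwen_vision_inference`.

```python
def preview(text: str, limit: int = 200) -> str:
    """Collapse whitespace and truncate, adding an ellipsis only when text was cut."""
    flat = " ".join(text.split())
    return flat if len(flat) <= limit else flat[:limit].rstrip() + "..."

def format_timing(timing: dict) -> str:
    """Render the timing dict as a Markdown bullet list."""
    rows = [
        ("Processor time", "processor_time"),
        ("Generation time", "generation_time"),
        ("Decode time", "decode_time"),
    ]
    lines = [f"- {label}: {timing[key]:.3f}s" for label, key in rows]
    lines.append(f"- **Total time: {timing['total_time']:.3f}s**")
    return "\n".join(lines)

print(preview("Describe the surroundings\nin a friendly tone."))
```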

## 5. Visual Assistance Prompts

These prompts are written for navigation assistance for visually impaired users.

In [None]:
# System prompt for visual assistance
SYSTEM_PROMPT = """You are an AI assistant guiding a visually impaired person through a live camera feed. 
Your role is to be a calm, trustworthy companion who helps them navigate safely and independently."""

# User prompt for environment description
USER_PROMPT = """Describe the surroundings in a friendly, conversational tone, speaking directly to the user. 
Give practical and helpful information about where they are, what is happening around them, and how busy the area is. 
If there are moving vehicles, bicycles, or potential dangers, clearly warn the user and gently guide them 
(for example, tell them to be careful, wait, or stay alert). 
Avoid tiny details, colors, or technical descriptions, but provide enough context to help the user feel oriented and informed, 
as if you are a calm and trustworthy companion walking next to them."""

## 6. Test: Outdoor Street Scene

Test the model with an outdoor urban scene.

In [None]:
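import requests
from io import BytesIO
from PIL import Image

# Load test image - outdoor urban scene
image_url = "https://heytripster.com/wp-content/uploads/2023/08/Trastevere-rome.jpg"
resp = requests.get(image_url, timeout=30)
resp.raise_for_status()  # fail early on HTTP errors instead of handing an error page to PIL
image = Image.open(BytesIO(resp.content)).convert("RGB")

# Display the test image
display(image.resize((400, 300)))

# Run test (assign the result so the notebook does not echo the returned tuple)
response_1, timing_1 = run_and_display(image, USER_PROMPT, SYSTEM_PROMPT, "Test 1: Outdoor Urban Scene")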
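
Resizing the preview to a fixed `(400, 300)` stretches any image that is not 4:3. One way to keep the aspect ratio is to compute the target size first, as sketched below. `fit_within` is a hypothetical helper and only affects the preview, not the image passed to the model.

```python
def fit_within(size, max_size=(400, 300)):
    """Return (width, height) scaled to fit inside max_size, keeping aspect ratio."""
    width, height = size
    max_w, max_h = max_size
    scale = min(max_w / width, max_h / height, 1.0)  # never upscale
    return max(1, round(width * scale)), max(1, round(height * scale))

print(fit_within((1600, 600)))  # (400, 150)

# Usage in the notebook (PIL's image.size is (width, height)):
#   display(image.resize(fit_within(image.size)))
```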

## 7. Test: Another Street Scene

Test with a different urban environment.

In [None]:
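# Load second test image
from io import BytesIO

image_url_2 = "https://oitheblog.com/wp-content/uploads/2016/03/SAM_7237.jpg"
resp_2 = requests.get(image_url_2, timeout=30)
resp_2.raise_for_status()  # fail early on HTTP errors
image_2 = Image.open(BytesIO(resp_2.content)).convert("RGB")

# Display the test image
display(image_2.resize((400, 300)))

# Run test (assign the result so the notebook does not echo the returned tuple)
response_2, timing_2 = run_and_display(image_2, USER_PROMPT, SYSTEM_PROMPT, "Test 2: Urban Street Scene")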

## 8. Test: Quick Safety Scan

Test with a condensed prompt for quick safety assessment.

In [None]:
# Quick scan prompt - for rapid safety checks
QUICK_SCAN_PROMPT = """Quickly scan the environment and report only critical safety information. Focus on:
1. Any immediate dangers or obstacles
2. Moving vehicles or people on collision paths
3. Whether it's safe to proceed forward

Keep your response to 2-3 sentences maximum."""

# Run test (assign the result so the notebook does not echo the returned tuple)
response_3, timing_3 = run_and_display(image, QUICK_SCAN_PROMPT, SYSTEM_PROMPT, "Test 3: Quick Safety Scan")
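
The quick-scan prompt asks for 2-3 sentences, but nothing in the notebook checks that the model complies. A rough check could look like the sketch below. `count_sentences` and `within_limit` are hypothetical helpers, and the splitting rule is an approximation that will miscount abbreviations such as "e.g.".

```python
import re

def count_sentences(text: str) -> int:
    """Rough count: split after '.', '!' or '?' when followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return len([p for p in parts if p])

def within_limit(text: str, max_sentences: int = 3) -> bool:
    """True when the response stays within the requested sentence budget."""
    return count_sentences(text) <= max_sentences

sample = "Stop. A car is coming from your left! Wait here."
print(count_sentences(sample), within_limit(sample))  # 3 True
```

Lowering `max_new_tokens` for this prompt is another option, though `run_and_display` as written does not expose that parameter.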

## 9. Test: Traffic Safety Check

Test with prompts focused on crossing safety.

In [None]:
# Traffic safety system prompt
TRAFFIC_SYSTEM_PROMPT = """You are an AI assistant specialized in traffic safety for visually impaired pedestrians. 
Your primary focus is keeping the user safe in traffic-heavy environments.

Your critical responsibilities:
- IMMEDIATELY warn about any approaching vehicles
- Clearly state traffic light status (red/green/changing)
- Identify safe crossing opportunities
- Provide clear WAIT or GO guidance for crossings"""

# Crossing assistance prompt
CROSSING_PROMPT = """Help me safely cross this street or intersection. Provide:
1. Current traffic light status (if visible)
2. Any approaching vehicles from any direction
3. Clear instruction: WAIT or SAFE TO CROSS

Be very clear and direct - safety is the top priority."""

# Run test (assign the result so the notebook does not echo the returned tuple)
response_4, timing_4 = run_and_display(image_2, CROSSING_PROMPT, TRAFFIC_SYSTEM_PROMPT, "Test 4: Traffic Safety / Crossing Check")
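
Because the crossing prompt asks for an explicit WAIT or SAFE TO CROSS instruction, the free-text response could be reduced to one of those two values when evaluating the test. The sketch below is one conservative way to do that. `parse_crossing_decision` is a hypothetical helper, and the rule that anything ambiguous becomes WAIT is an assumption. Keyword matching like this is only a test aid, not a safety mechanism.

```python
import re

# Phrases that negate a "safe to cross" statement (illustrative, not exhaustive).
NEGATIONS = ("NOT SAFE", "N'T SAFE", "NEVER SAFE", "UNSAFE")

def parse_crossing_decision(response: str) -> str:
    """Return "SAFE TO CROSS" only when it is the sole, un-negated instruction."""
    # Normalize case, whitespace, and curly apostrophes before matching.
    text = " ".join(response.upper().replace("\u2019", "'").split())
    says_go = "SAFE TO CROSS" in text
    says_wait = re.search(r"\bWAIT\b", text) is not None
    negated = any(marker in text for marker in NEGATIONS)
    if says_go and not says_wait and not negated:
        return "SAFE TO CROSS"
    return "WAIT"  # default to the cautious answer

print(parse_crossing_decision("It is not safe to cross yet."))  # WAIT
```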

## 10. Test with Your Own Image

Upload your own image to test the visual assistance system.

In [None]:
# For Google Colab: Upload your own image
# from google.colab import files
# uploaded = files.upload()
# image_path = list(uploaded.keys())[0]
# custom_image = Image.open(image_path)

# For local testing: Load from path
# custom_image = Image.open("path/to/your/image.jpg")

# Uncomment and modify the following to test with your image:
# run_and_display(custom_image, USER_PROMPT, SYSTEM_PROMPT, "Custom Image Test")

## Summary

This notebook demonstrates the core visual assistance functionality of Gozumol using the Qwen2-VL model.

**Model Characteristics:**
- Qwen2-VL-2B-Instruct: lightweight (about 2B parameters)
- Intended as a balance between speed and accuracy; the timing output of each test shows the observed latency
- Small enough to be a candidate for edge device deployment

**Key Features Tested:**
- Environment description for outdoor scenes
- Quick safety scans for rapid assessment
- Traffic and crossing safety checks
- Warm, companion-like conversational tone

**Comparison with Phi-4:**
- Qwen2-VL-2B has fewer parameters and is expected to run faster
- Phi-4 may provide more detailed descriptions
- Both appear usable for real-time assistance, though this notebook only measures Qwen2-VL
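
To put numbers behind the speed claims, the timing dictionaries returned by each test could be averaged. The sketch below assumes a list of `timing_info` dicts with the keys produced by `qwen_vision_inference`; `summarize_timings` is a hypothetical helper and the sample values are placeholders, not measurements.

```python
from statistics import mean

TIMING_KEYS = ("processor_time", "generation_time", "decode_time", "total_time")

def summarize_timings(timings):
    """Average each timing key across runs; returns {} when no runs are given."""
    if not timings:
        return {}
    return {key: mean(t[key] for t in timings) for key in TIMING_KEYS}

# Placeholder values for illustration only.
runs = [
    {"processor_time": 0.5, "generation_time": 2.0, "decode_time": 0.5, "total_time": 3.0},
    {"processor_time": 1.5, "generation_time": 4.0, "decode_time": 0.5, "total_time": 6.0},
]
for key, value in summarize_timings(runs).items():
    print(f"{key}: {value:.3f}s")
```

Running the same prompts and images through another model and comparing the two summaries would give a like-for-like basis for the comparison above.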