# Gozumol - Phi-4 Multimodal Testing Notebook

This notebook tests the Phi-4 Multimodal model (`microsoft/Phi-4-multimodal-instruct`) as a visual assistant.

**Purpose**: Test the model's ability to describe environments and provide navigation guidance for visually impaired users.

## Requirements

```text
torch==2.6.0
flash_attn==2.7.4.post1
transformers==4.48.2
accelerate==1.3.0
pillow==11.1.0
```
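
The pins above can be checked against the active environment before loading the model. The following is a minimal sketch using only the standard library; the helper name `find_version_mismatches` is hypothetical, and it assumes that the distribution names resolve as written (for example `flash_attn`) and that a local build suffix such as `+cu124` should be ignored when comparing.

```python
from importlib.metadata import PackageNotFoundError, version

# Pinned versions copied from the requirements list above.
PINNED = {
    "torch": "2.6.0",
    "flash_attn": "2.7.4.post1",
    "transformers": "4.48.2",
    "accelerate": "1.3.0",
    "pillow": "11.1.0",
}


def find_version_mismatches(pinned, get_version=version):
    """Return {package: (expected, installed_or_None)} for every mismatch."""
    mismatches = {}
    for name, expected in pinned.items():
        try:
            installed = get_version(name)
        except PackageNotFoundError:
            installed = None  # Package is not installed at all
        # Assumption: ignore a local build tag such as "+cu124"
        base = installed.split("+")[0] if installed else None
        if base != expected:
            mismatches[name] = (expected, installed)
    return mismatches


for name, (expected, installed) in find_version_mismatches(PINNED).items():
    print(f"{name}: expected {expected}, found {installed or 'not installed'}")
```

If the loop prints nothing, every pinned package matches.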

## 1. Install Dependencies

In [None]:
# Install required packages (torch first: flash_attn builds against the installed torch)
!pip install torch==2.6.0
!pip install flash_attn==2.7.4.post1
!pip install transformers==4.48.2
!pip install accelerate==1.3.0
!pip install pillow==11.1.0

## 2. Load Model and Processor

In [None]:
import requests
import torch
import time
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor, GenerationConfig

# Model configuration
MODEL_ID = "microsoft/Phi-4-multimodal-instruct"

# Load processor and model
print("Loading processor...")
processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)

print("Loading model...")
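model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    device_map="cuda",  # Places the weights on the GPU at load time
    torch_dtype="auto",
    trust_remote_code=True,
    attn_implementation="flash_attention_2",
)  # No extra .cuda() call needed: device_map already moved the model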

generation_config = GenerationConfig.from_pretrained(MODEL_ID)
print("Model loaded successfully!")
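
The loading cell hard-codes `attn_implementation="flash_attention_2"`, which needs the `flash_attn` package and a recent GPU. As a hedged sketch, the choice could be made explicit with a small pure function. The name `pick_attn_implementation` is hypothetical, and falling back to `"eager"` is an assumption about what works for this checkpoint, not something this notebook verifies.

```python
def pick_attn_implementation(cuda_available, compute_capability, flash_attn_installed):
    """Choose a value for the `attn_implementation` argument.

    compute_capability is a (major, minor) tuple such as (8, 0), or None.
    """
    if (
        cuda_available
        and flash_attn_installed
        and compute_capability is not None
        and compute_capability[0] >= 8  # FlashAttention-2 targets Ampere or newer
    ):
        return "flash_attention_2"
    return "eager"  # Assumed fallback: slower, but no extra dependency


# Possible usage in this notebook (requires torch, so left as comments):
# import importlib.util
# has_cuda = torch.cuda.is_available()
# attn = pick_attn_implementation(
#     has_cuda,
#     torch.cuda.get_device_capability() if has_cuda else None,
#     importlib.util.find_spec("flash_attn") is not None,
# )
```

The loading cell above still assumes a CUDA device because of `device_map="cuda"`; this helper only selects the attention backend.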

## 3. Define Vision Assistant Interface

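The function below assembles a chat prompt from Phi-4's special tokens (`<|system|>`, `<|user|>`, `<|assistant|>`, `<|end|>`), inserts numbered `<|image_N|>` and `<|audio_N|>` placeholders for attached media, runs generation, and records how long each stage takes.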

In [None]:
def vision_assistant_inference(
    processor,
    model,
    system_prompt: str,
    content_list: list,
    generation_config,
    max_new_tokens: int = 1024
):
    """
    Generate a response from the Phi-4 multimodal model.

    Parameters:
        processor: The AutoProcessor for the model
        model: The loaded Phi-4 model
        system_prompt: System message defining assistant behavior
        content_list: List of dicts with keys 'type' ('text', 'image', or 'audio'),
            'content', and an optional 'role' ('user' by default)
        generation_config: Generation configuration
        max_new_tokens: Maximum tokens to generate

    Returns:
        Tuple of (formatted_prompt, response_text, timing_info)
    """
    # Token definitions
    system_token = "<|system|>"
    user_token = "<|user|>"
    assistant_token = "<|assistant|>"
    end_token = "<|end|>"

    # Build prompt with system message
    complete_prompt = f"{system_token}{system_prompt}{end_token}"

    # Collect media
    images = []
    audios = []

    # Process content items
    current_role = None
    role_content = ""

    for item in content_list:
        item_type = item["type"]
        item_role = item.get("role", "user")

        # Handle role transitions
        if current_role is not None and current_role != item_role:
            role_token = user_token if current_role == "user" else assistant_token
            complete_prompt += f"{role_token}{role_content}{end_token}"
            role_content = ""

        current_role = item_role

        # Process content types
        if item_type == "text":
            role_content += item["content"]
        elif item_type == "image":
            image_index = len(images) + 1
            role_content += f"<|image_{image_index}|>"
            images.append(item["content"])
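        elif item_type == "audio":
            audio_index = len(audios) + 1
            role_content += f"<|audio_{audio_index}|>"
            audios.append(item["content"])
        else:
            # Fail loudly instead of silently dropping unknown content
            raise ValueError(f"Unsupported content type: {item_type!r}")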

    # Add final role content
    if current_role is not None:
        role_token = user_token if current_role == "user" else assistant_token
        complete_prompt += f"{role_token}{role_content}{end_token}"

    # Append the assistant token so the model generates the next assistant turn
    if current_role != "assistant":
        complete_prompt += f"{assistant_token}"

    # Track timing
    timing_info = {}

    # Process inputs
    timing_info["start_processor"] = time.time()
    inputs = processor(
        text=complete_prompt,
        images=images if images else None,
        audios=audios if audios else None,
        return_tensors="pt"
    ).to(model.device)  # Follow the model's device instead of hard-coding cuda:0
    timing_info["processor_time"] = time.time() - timing_info["start_processor"]

    # Generate response
    timing_info["start_generation"] = time.time()
    generate_ids = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        generation_config=generation_config,
    )
    timing_info["generation_time"] = time.time() - timing_info["start_generation"]

    # Extract new tokens only
    generate_ids = generate_ids[:, inputs["input_ids"].shape[1]:]

    # Decode response
    timing_info["start_decode"] = time.time()
    response = processor.batch_decode(
        generate_ids,
        skip_special_tokens=True,
        clean_up_tokenization_spaces=False
    )[0]
    timing_info["decode_time"] = time.time() - timing_info["start_decode"]

    # Calculate total time
    timing_info["total_time"] = (
        timing_info["processor_time"] +
        timing_info["generation_time"] +
        timing_info["decode_time"]
    )

    return complete_prompt, response, timing_info
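
To see the prompt string this logic produces without loading the model, the assembly step can be isolated. The sketch below mirrors the role-transition and placeholder handling of `vision_assistant_inference`; the function name `build_prompt` is hypothetical, and it does not call the processor or validate item types.

```python
def build_prompt(system_prompt, content_list):
    """Mirror the prompt-assembly logic without loading the model."""
    prompt = f"<|system|>{system_prompt}<|end|>"
    counts = {"image": 0, "audio": 0}
    current_role, role_content = None, ""

    def role_token(role):
        return "<|user|>" if role == "user" else "<|assistant|>"

    for item in content_list:
        role = item.get("role", "user")
        # Close the previous turn whenever the role changes
        if current_role is not None and role != current_role:
            prompt += f"{role_token(current_role)}{role_content}<|end|>"
            role_content = ""
        current_role = role
        if item["type"] == "text":
            role_content += item["content"]
        else:  # "image" or "audio": insert a numbered placeholder
            counts[item["type"]] += 1
            role_content += f"<|{item['type']}_{counts[item['type']]}|>"

    if current_role is not None:
        prompt += f"{role_token(current_role)}{role_content}<|end|>"
    if current_role != "assistant":
        prompt += "<|assistant|>"  # Cue the model to answer
    return prompt


demo = [
    {"type": "image", "content": None, "role": "user"},
    {"type": "text", "content": "What is ahead of me?", "role": "user"},
]
print(build_prompt("You are a guide.", demo))
# <|system|>You are a guide.<|end|><|user|><|image_1|>What is ahead of me?<|end|><|assistant|>
```

Consecutive items with the same role are merged into one turn, which is why the image placeholder and the text share a single `<|user|>` block.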

## 4. Display Helper Function

In [None]:
from IPython.display import display, Markdown

def run_and_display(system_prompt, content_list, test_name="Test"):
    """
    Run inference and display results in a formatted way.

    Parameters:
        system_prompt: System message for the assistant
        content_list: List of content items
        test_name: Name to display for this test

    Returns:
        Tuple of (prompt, response, timing)
    """
    display(Markdown(f"## {test_name}"))

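    # Show the last text item, truncated to 200 characters
    user_text = [item["content"] for item in content_list if item["type"] == "text"]
    if user_text:
        last_text = user_text[-1]
        preview = last_text[:200] + ("..." if len(last_text) > 200 else "")
        display(Markdown(f"**User Prompt:** {preview}"))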

    # Run inference
    prompt, response, timing = vision_assistant_inference(
        processor, model, system_prompt, content_list, generation_config
    )

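    # Display response (prefix every line so multi-paragraph answers stay quoted)
    display(Markdown("### Response:"))
    quoted_response = response.replace("\n", "\n> ")
    display(Markdown(f"> {quoted_response}"))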

    # Display timing
    display(Markdown("### Timing:"))
    display(Markdown(f"""
- Processor time: {timing['processor_time']:.3f}s
- Generation time: {timing['generation_time']:.3f}s
- Decode time: {timing['decode_time']:.3f}s
- **Total time: {timing['total_time']:.3f}s**
    """))

    return prompt, response, timing
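
Because `run_and_display` returns the timing dictionary, timings from several runs can be averaged per stage. This is a minimal sketch; `summarize_timings` is a hypothetical helper, and the numbers in the example are illustrative placeholders, not measurements.

```python
from statistics import mean


def summarize_timings(timings):
    """Average each stage across a non-empty list of timing dicts."""
    stages = ("processor_time", "generation_time", "decode_time", "total_time")
    return {stage: mean(t[stage] for t in timings) for stage in stages}


# Placeholder values for illustration only (not real measurements)
example_timings = [
    {"processor_time": 0.25, "generation_time": 4.0, "decode_time": 0.5, "total_time": 4.75},
    {"processor_time": 0.75, "generation_time": 6.0, "decode_time": 0.5, "total_time": 7.25},
]
for stage, seconds in summarize_timings(example_timings).items():
    print(f"{stage}: {seconds:.3f}s")
```

The extra `start_*` keys that `vision_assistant_inference` stores are ignored, since only the four duration keys are read.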

## 5. Visual Assistance Prompts

These prompts are written for navigation assistance for visually impaired users.

In [None]:
# System prompt for visual assistance
SYSTEM_PROMPT = """You are an AI assistant guiding a visually impaired person through a live camera feed. 
Your role is to be a calm, trustworthy companion who helps them navigate safely and independently."""

# User prompt for environment description
USER_PROMPT = """Describe the surroundings in a friendly, conversational tone, speaking directly to the user. 
Give practical and helpful information about where they are, what is happening around them, and how busy the area is. 
If there are moving vehicles, bicycles, or potential dangers, clearly warn the user and gently guide them 
(for example, tell them to be careful, wait, or stay alert). 
Avoid tiny details, colors, or technical descriptions, but provide enough context to help the user feel oriented and informed, 
as if you are a calm and trustworthy companion walking next to them."""

## 6. Test: Outdoor Street Scene

Test the model with an outdoor urban scene.

In [None]:
# Load test image - outdoor urban scene
image_url = "https://heytripster.com/wp-content/uploads/2023/08/Trastevere-rome.jpg"
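http_response = requests.get(image_url, stream=True, timeout=30)
http_response.raise_for_status()  # Fail early if the download did not succeed
image = Image.open(http_response.raw)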

# Display the test image
display(image.resize((400, 300)))

# Build content list
content_list = [
    {"type": "image", "content": image, "role": "user"},
    {"type": "text", "content": USER_PROMPT, "role": "user"},
]

# Run test (keep the result so the notebook does not echo the returned tuple)
result = run_and_display(SYSTEM_PROMPT, content_list, "Test 1: Outdoor Urban Scene")
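
The preview above resizes every image to a fixed 400x300, which stretches images with a different aspect ratio. A sketch of an aspect-preserving alternative follows; `fit_within` is a hypothetical helper that only computes the target size. Pillow's own `Image.thumbnail` does a similar resize in place.

```python
def fit_within(width, height, max_width=400, max_height=300):
    """Return (w, h) scaled to fit inside the box, keeping the aspect ratio."""
    scale = min(max_width / width, max_height / height, 1.0)  # Never upscale
    return max(1, round(width * scale)), max(1, round(height * scale))


print(fit_within(1600, 900))  # (400, 225)
print(fit_within(300, 600))   # (150, 300)

# Possible usage in this notebook (PIL's image.size is (width, height)):
# display(image.resize(fit_within(*image.size)))
```

This affects only the preview; the model still receives the original, unresized image.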

## 7. Test: Another Street Scene

Test with a different urban environment.

In [None]:
# Load second test image
image_url_2 = "https://oitheblog.com/wp-content/uploads/2016/03/SAM_7237.jpg"
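http_response_2 = requests.get(image_url_2, stream=True, timeout=30)
http_response_2.raise_for_status()  # Fail early if the download did not succeed
image_2 = Image.open(http_response_2.raw)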

# Display the test image
display(image_2.resize((400, 300)))

# Build content list
content_list_2 = [
    {"type": "image", "content": image_2, "role": "user"},
    {"type": "text", "content": USER_PROMPT, "role": "user"},
]

# Run test (keep the result so the notebook does not echo the returned tuple)
result_2 = run_and_display(SYSTEM_PROMPT, content_list_2, "Test 2: Urban Street Scene")

## 8. Test: Quick Safety Scan

Test with a condensed prompt for quick safety assessment.

In [None]:
# Quick scan prompt - for rapid safety checks
QUICK_SCAN_PROMPT = """Quickly scan the environment and report only critical safety information. Focus on:
1. Any immediate dangers or obstacles
2. Moving vehicles or people on collision paths
3. Whether it's safe to proceed forward

Keep your response to 2-3 sentences maximum."""

# Build content list with quick scan prompt
content_list_quick = [
    {"type": "image", "content": image, "role": "user"},
    {"type": "text", "content": QUICK_SCAN_PROMPT, "role": "user"},
]

# Run test (keep the result so the notebook does not echo the returned tuple)
result_quick = run_and_display(SYSTEM_PROMPT, content_list_quick, "Test 3: Quick Safety Scan")
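
The quick scan prompt asks for 2-3 sentences at most, so one way to check whether a response respects that limit is a rough sentence count. This is a sketch under a simple assumption: a sentence ends with `.`, `!`, or `?` followed by whitespace, so abbreviations and decimals may be miscounted. Both function names are hypothetical.

```python
import re


def count_sentences(text):
    """Rough count: split after '.', '!' or '?' when followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return len([part for part in parts if part])


def within_sentence_limit(text, limit=3):
    """True if the text has at most `limit` sentences by the rough count."""
    return count_sentences(text) <= limit


print(count_sentences("Stop. A car is coming! Wait here."))  # 3
print(within_sentence_limit("One. Two. Three. Four."))       # False
```

After unpacking `prompt, response, timing = run_and_display(...)`, calling `within_sentence_limit(response)` gives a quick pass/fail signal for the length instruction.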

## 9. Test: Traffic Safety Check

Test with prompts focused on crossing safety.

In [None]:
# Traffic safety system prompt
TRAFFIC_SYSTEM_PROMPT = """You are an AI assistant specialized in traffic safety for visually impaired pedestrians. 
Your primary focus is keeping the user safe in traffic-heavy environments.

Your critical responsibilities:
- IMMEDIATELY warn about any approaching vehicles
- Clearly state traffic light status (red/green/changing)
- Identify safe crossing opportunities
- Provide clear WAIT or GO guidance for crossings"""

# Crossing assistance prompt
CROSSING_PROMPT = """Help me safely cross this street or intersection. Provide:
1. Current traffic light status (if visible)
2. Any approaching vehicles from any direction
3. Clear instruction: WAIT or SAFE TO CROSS

Be very clear and direct - safety is the top priority."""

# Build content list for crossing assistance
content_list_crossing = [
    {"type": "image", "content": image_2, "role": "user"},
    {"type": "text", "content": CROSSING_PROMPT, "role": "user"},
]

# Run test (keep the result so the notebook does not echo the returned tuple)
result_crossing = run_and_display(TRAFFIC_SYSTEM_PROMPT, content_list_crossing, "Test 4: Traffic Safety / Crossing Check")
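
The crossing prompt asks for an explicit WAIT or SAFE TO CROSS instruction, so a test harness could extract that decision from the free-text response. The sketch below uses keyword matching as a stand-in for real evaluation. The name `parse_crossing_decision` and the fail-safe rule (anything ambiguous, negated, or missing resolves to WAIT) are assumptions, not part of the notebook.

```python
import re


def parse_crossing_decision(response):
    """Map a free-text response to 'WAIT' or 'SAFE TO CROSS'.

    Fail-safe: ambiguous, negated, or missing instructions resolve to 'WAIT'.
    """
    text = response.upper()
    says_wait = re.search(r"\bWAIT\b", text) is not None
    says_safe = re.search(r"\bSAFE TO CROSS\b", text) is not None
    # A negation earlier in the same sentence, e.g. "NOT SAFE TO CROSS"
    negated = re.search(
        r"\b(?:NOT|NEVER|ISN'T)\b[^.!?]*?\bSAFE TO CROSS\b", text
    ) is not None

    if says_safe and not says_wait and not negated:
        return "SAFE TO CROSS"
    return "WAIT"


print(parse_crossing_decision("WAIT. A car is approaching from the left."))
print(parse_crossing_decision("It is not safe to cross yet."))
```

Both example calls print `WAIT`. Only a response that says SAFE TO CROSS without any WAIT or negation is treated as permission to cross.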

## 10. Test with Your Own Image

Upload your own image to test the visual assistance system.

In [None]:
# For Google Colab: Upload your own image
# from google.colab import files
# uploaded = files.upload()
# image_path = list(uploaded.keys())[0]
# custom_image = Image.open(image_path)

# For local testing: Load from path
# custom_image = Image.open("path/to/your/image.jpg")

# Uncomment and modify the following to test with your image:
# content_list_custom = [
#     {"type": "image", "content": custom_image, "role": "user"},
#     {"type": "text", "content": USER_PROMPT, "role": "user"},
# ]
# run_and_display(SYSTEM_PROMPT, content_list_custom, "Custom Image Test")

## Summary

This notebook demonstrates the core visual assistance functionality of Gozumol using the Phi-4 Multimodal model.

**Key Features Tested:**
- Environment description for outdoor scenes
- Quick safety scans for rapid assessment
- Traffic and crossing safety checks
- Warm, companion-like conversational tone
- Action-oriented guidance (wait, proceed, be careful)

**Next Steps:**
- Integrate with real-time camera feeds
- Add text-to-speech for audio output
- Optimize for edge device deployment
- Test with more diverse scenarios (indoor, crowded areas, etc.)