# VLM - Vision Language Model Examples

This notebook walks through practical examples of using Vision Language Models (VLMs), which process images and text together.

## What is a VLM?
- **Purpose**: Understanding images + text together
- **Input**: Images + Text prompts
- **Output**: Text descriptions/answers
- **Examples**: GPT-4V, Claude 3, LLaVA, Gemini Vision

---

## Setup

In [None]:
# Install required packages
!pip install openai anthropic pillow requests matplotlib python-dotenv -q

In [None]:
import os
import base64
import requests
from io import BytesIO
from PIL import Image
import matplotlib.pyplot as plt
from openai import OpenAI
from anthropic import Anthropic
from dotenv import load_dotenv

load_dotenv()

# Initialize clients
openai_client = OpenAI(api_key=os.getenv('OPENAI_API_KEY'))
anthropic_client = Anthropic(api_key=os.getenv('ANTHROPIC_API_KEY'))

print("✓ Setup complete!")

## Helper Functions

In [None]:
def encode_image(image_path):
    """Encode image to base64"""
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode('utf-8')

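def download_sample_image(url):
    """Download an image from a URL and return it as a PIL Image"""
    response = requests.get(url, timeout=30)
    response.raise_for_status()  # Fail early on HTTP errors instead of handing an error page to PIL
    img = Image.open(BytesIO(response.content))
    return img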

def display_image(img):
    """Display image with matplotlib"""
    plt.figure(figsize=(10, 6))
    plt.imshow(img)
    plt.axis('off')
    plt.show()

print("✓ Helper functions defined")
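
`encode_image` returns raw base64, but the examples in this notebook pass an `image_url`. For local files, one option is to wrap the base64 string in a `data:` URL, which the OpenAI `image_url` field accepts. The sketch below is a minimal illustration: `sniff_media_type` and `image_file_to_data_url` are hypothetical helpers (not part of any library), and the format list covers only PNG, JPEG, GIF, and WebP.

```python
import base64

# Leading "magic" bytes for a few common image formats.
_SIGNATURES = [
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"\xff\xd8\xff", "image/jpeg"),
    (b"GIF87a", "image/gif"),
    (b"GIF89a", "image/gif"),
]

def sniff_media_type(data: bytes) -> str:
    """Guess an image MIME type from its leading bytes (hypothetical helper)."""
    for signature, media_type in _SIGNATURES:
        if data.startswith(signature):
            return media_type
    # WebP files start with "RIFF", a 4-byte size, then "WEBP".
    if data[:4] == b"RIFF" and data[8:12] == b"WEBP":
        return "image/webp"
    raise ValueError("Unrecognized image format")

def image_file_to_data_url(image_path: str) -> str:
    """Read a local image and return it as a base64 data URL (hypothetical helper)."""
    with open(image_path, "rb") as image_file:
        data = image_file.read()
    encoded = base64.b64encode(data).decode("utf-8")
    return f"data:{sniff_media_type(data)};base64,{encoded}"
```

The returned string can be passed wherever the examples expect an image URL, for instance `describe_image_gpt4v(image_file_to_data_url("photo.jpg"))`, where `photo.jpg` is a placeholder path.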

---
## Example 1: Image Description

Get detailed descriptions of images.

In [None]:
def describe_image_gpt4v(image_url, prompt="Describe this image in detail."):
    """Describe image using GPT-4 Vision"""
    response = openai_client.chat.completions.create(
        model="gpt-4-vision-preview",
        messages=[
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {"type": "image_url", "image_url": {"url": image_url}}
                ]
            }
        ],
        max_tokens=300
    )
    return response.choices[0].message.content

# Example with a sample image
sample_url = "https://images.unsplash.com/photo-1517849845537-4d257902454a?w=800"

print("Image:")
img = download_sample_image(sample_url)
display_image(img)

print("\nVLM Description:")
description = describe_image_gpt4v(sample_url)
print(description)

---
## Example 2: Visual Question Answering (VQA)

Ask specific questions about images.

In [None]:
def ask_about_image(image_url, question):
    """Ask questions about an image"""
    response = openai_client.chat.completions.create(
        model="gpt-4-vision-preview",
        messages=[
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": question},
                    {"type": "image_url", "image_url": {"url": image_url}}
                ]
            }
        ],
        max_tokens=200
    )
    return response.choices[0].message.content

# Multiple questions about the same image
questions = [
    "What animals are in this image?",
    "What is the setting or environment?",
    "What colors are prominent in this image?",
    "What mood or atmosphere does this image convey?"
]

print("Image:")
display_image(img)

print("\nQuestions and Answers:")
for q in questions:
    answer = ask_about_image(sample_url, q)
    print(f"\nQ: {q}")
    print(f"A: {answer}")
    print("-" * 80)
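
The loop above sends the same image once per question. When the questions are independent, a possible alternative is to combine them into one numbered prompt so the image is sent only once. This is a sketch: `build_batched_question` is a hypothetical helper, and the instruction wording is just one reasonable choice.

```python
def build_batched_question(questions):
    """Combine several questions into one numbered prompt (hypothetical helper)."""
    if not questions:
        raise ValueError("questions must not be empty")
    numbered = "\n".join(f"{i}. {q}" for i, q in enumerate(questions, start=1))
    return (
        "Answer each of the following questions about the image. "
        "Number your answers to match the questions.\n" + numbered
    )

batched_prompt = build_batched_question([
    "What animals are in this image?",
    "What is the setting or environment?",
])
print(batched_prompt)
```

The result can be passed as the `question` argument of `ask_about_image`. Note that `ask_about_image` caps the reply at 200 tokens, so a long batch of questions may need a higher `max_tokens`.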

---
## Example 3: OCR and Text Reading

Extract text from images such as documents, signs, and menus.

In [None]:
def extract_text_from_image(image_url):
    """Extract all text from an image"""
    response = openai_client.chat.completions.create(
        model="gpt-4-vision-preview",
        messages=[
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": "Extract all text from this image. Maintain formatting and structure."},
                    {"type": "image_url", "image_url": {"url": image_url}}
                ]
            }
        ],
        max_tokens=500
    )
    return response.choices[0].message.content

# Example with a document or sign image
# Note: Replace with actual image URL containing text
print("This example would work with an image containing text (menu, sign, document, etc.)")
print("\nExample workflow:")
print("1. Provide image with text")
print("2. VLM reads and extracts all text")
print("3. Text is returned in structured format")
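
Small text is easy to lose when an image is processed at low resolution. The OpenAI `image_url` content part accepts an optional `detail` field (`"low"`, `"high"`, or `"auto"`), and requesting `"high"` may help with text-heavy images. The sketch below only builds the request payload and makes no API call; `build_ocr_messages` is a hypothetical helper and the example URL is a placeholder.

```python
def build_ocr_messages(image_url, detail="high"):
    """Build a chat `messages` list for text extraction (hypothetical helper)."""
    if detail not in {"low", "high", "auto"}:
        raise ValueError("detail must be 'low', 'high', or 'auto'")
    return [
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "Extract all text from this image. Maintain formatting and structure.",
                },
                # `detail` controls how much image resolution the model works with.
                {"type": "image_url", "image_url": {"url": image_url, "detail": detail}},
            ],
        }
    ]

# Placeholder URL; replace with a real image that contains text.
ocr_messages = build_ocr_messages("https://example.com/menu.jpg")
print(ocr_messages[0]["content"][1])
```

These messages can be passed as the `messages` argument of `openai_client.chat.completions.create`, in the same way `extract_text_from_image` does.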

---
## Example 4: Image Comparison

Compare multiple images and identify differences.

In [None]:
def compare_images(image_url1, image_url2):
    """Compare two images"""
    response = openai_client.chat.completions.create(
        model="gpt-4-vision-preview",
        messages=[
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": "Compare these two images. What are the similarities and differences?"},
                    {"type": "image_url", "image_url": {"url": image_url1}},
                    {"type": "image_url", "image_url": {"url": image_url2}}
                ]
            }
        ],
        max_tokens=400
    )
    return response.choices[0].message.content

# Example with two different images
url1 = "https://images.unsplash.com/photo-1517849845537-4d257902454a?w=400"
url2 = "https://images.unsplash.com/photo-1548199973-03cce0bbc87b?w=400"

print("Image 1:")
img1 = download_sample_image(url1)
display_image(img1)

print("\nImage 2:")
img2 = download_sample_image(url2)
display_image(img2)

print("\nComparison:")
comparison = compare_images(url1, url2)
print(comparison)

---
## Example 5: Chart and Graph Analysis

Analyze data visualizations and extract insights.

In [None]:
def analyze_chart(image_url):
    """Analyze charts and extract data/insights"""
    response = openai_client.chat.completions.create(
        model="gpt-4-vision-preview",
        messages=[
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": """Analyze this chart or graph:
                    1. What type of visualization is it?
                    2. What data does it show?
                    3. What are the key trends or insights?
                    4. What conclusions can be drawn?"""},
                    {"type": "image_url", "image_url": {"url": image_url}}
                ]
            }
        ],
        max_tokens=400
    )
    return response.choices[0].message.content

print("Chart analysis example - would work with actual chart/graph images")
print("\nCapabilities:")
print("✓ Identify chart type (bar, line, pie, scatter, etc.)")
print("✓ Extract data points and values")
print("✓ Identify trends and patterns")
print("✓ Provide insights and conclusions")
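
To use extracted data points in code, it can help to ask the model for JSON and then parse the reply. Models often wrap JSON in a Markdown code fence or surround it with commentary, so the parser needs to be tolerant. The sketch below assumes the prompt asked for a single JSON object; `extract_json` is a hypothetical helper and the sample reply is made up for illustration.

```python
import json
import re

FENCE = "`" * 3  # Markdown code fence marker, built indirectly to keep this cell readable

def extract_json(reply):
    """Parse the first JSON object found in a model reply (hypothetical helper)."""
    pattern = re.escape(FENCE) + r"(?:json)?\s*(.*?)" + re.escape(FENCE)
    fenced = re.search(pattern, reply, flags=re.DOTALL)
    candidate = fenced.group(1) if fenced else reply
    # Keep only the span from the first "{" to the last "}".
    start = candidate.find("{")
    end = candidate.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("No JSON object found in reply")
    return json.loads(candidate[start:end + 1])

# Made-up reply, in the shape a model might return.
sample_reply = (
    "Here is the analysis:\n"
    + FENCE + "json\n"
    + '{"chart_type": "bar", "series": [{"label": "Q1", "value": 10}]}\n'
    + FENCE
)
chart = extract_json(sample_reply)
print(chart["chart_type"], chart["series"][0]["value"])
```

Values read from a chart by a VLM are estimates, so it is worth checking them against the source data when that is available.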

---
## Example 6: Object Detection and Counting

Identify and count specific objects in images.

In [None]:
def count_objects(image_url, object_type):
    """Count specific objects in an image"""
    response = openai_client.chat.completions.create(
        model="gpt-4-vision-preview",
        messages=[
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": f"How many {object_type} are in this image? List their locations and characteristics."},
                    {"type": "image_url", "image_url": {"url": image_url}}
                ]
            }
        ],
        max_tokens=300
    )
    return response.choices[0].message.content

# Example
print("Counting dogs in image:")
display_image(img)
result = count_objects(sample_url, "dogs")
print(f"\nResult:\n{result}")
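
Counts from a VLM can vary between runs. One way to reduce that variance is to call `count_objects` several times and take the most common number. The sketch below covers only the aggregation step: `first_integer` and `majority_count` are hypothetical helpers, and they assume each answer states the count as the first digit sequence in the text (spelled-out numbers such as "two" are not handled).

```python
import re
from collections import Counter

def first_integer(text):
    """Return the first integer written in digits in `text`, or None."""
    match = re.search(r"\d+", text)
    return int(match.group()) if match else None

def majority_count(answers):
    """Return the most common count across several answers (hypothetical helper)."""
    counts = [n for n in (first_integer(a) for a in answers) if n is not None]
    if not counts:
        return None
    return Counter(counts).most_common(1)[0][0]

# Made-up answers, standing in for repeated count_objects() calls.
sample_answers = [
    "There are 2 dogs in the image.",
    "I count 3 dogs near the fence.",
    "2 dogs are visible in the foreground.",
]
print(majority_count(sample_answers))
```

Asking the model to start its reply with the number alone would make this parsing more reliable.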

---
## Example 7: Image Accessibility (Alt Text Generation)

Generate descriptive alt text for accessibility.

In [None]:
def generate_alt_text(image_url, length="medium"):
    """Generate accessibility-friendly alt text"""
    length_prompts = {
        "short": "Generate a concise alt text (1 sentence) for accessibility.",
        "medium": "Generate a descriptive alt text (2-3 sentences) for accessibility.",
        "long": "Generate a detailed alt text describing all important elements for visually impaired users."
    }
    
    response = openai_client.chat.completions.create(
        model="gpt-4-vision-preview",
        messages=[
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": length_prompts[length]},
                    {"type": "image_url", "image_url": {"url": image_url}}
                ]
            }
        ],
        max_tokens=200
    )
    return response.choices[0].message.content

# Generate different lengths of alt text
print("Image:")
display_image(img)

for length in ["short", "medium", "long"]:
    print(f"\n{length.upper()} Alt Text:")
    print(generate_alt_text(sample_url, length))

---
## Example 8: Product Analysis for E-commerce

Analyze product images for descriptions and features.

In [None]:
def analyze_product(image_url):
    """Analyze product image for e-commerce"""
    response = openai_client.chat.completions.create(
        model="gpt-4-vision-preview",
        messages=[
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": """Analyze this product image and provide:
                    1. Product type and category
                    2. Key features and characteristics
                    3. Suggested product title
                    4. Marketing description (2-3 sentences)
                    5. Estimated price range"""},
                    {"type": "image_url", "image_url": {"url": image_url}}
                ]
            }
        ],
        max_tokens=400
    )
    return response.choices[0].message.content

print("Product Analysis Example")
print("\nCapabilities for E-commerce:")
print("✓ Automatic product categorization")
print("✓ Feature extraction")
print("✓ Title and description generation")
print("✓ Quality assessment")
print("✓ Price range estimation")

---
## Example 9: Using Claude 3 Vision (Alternative VLM)

Compare with Anthropic's Claude 3 Vision model.

In [None]:
def describe_image_claude(image_url, prompt="Describe this image."):
    """Describe image using Claude 3"""
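    # Download the image and base64-encode it
    response = requests.get(image_url, timeout=30)
    response.raise_for_status()
    image_data = base64.b64encode(response.content).decode('utf-8')
    
    # Prefer the media type reported by the server: extension-less URLs
    # (such as the Unsplash samples) cannot be classified from the URL alone
    media_type = response.headers.get("Content-Type", "image/jpeg").split(";")[0].strip()
    if not media_type.startswith("image/"):
        media_type = "image/jpeg"  # Fall back to the original default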
    
    message = anthropic_client.messages.create(
        model="claude-3-sonnet-20240229",
        max_tokens=300,
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "image",
                        "source": {
                            "type": "base64",
                            "media_type": media_type,
                            "data": image_data,
                        },
                    },
                    {
                        "type": "text",
                        "text": prompt
                    }
                ],
            }
        ],
    )
    return message.content[0].text

# Compare GPT-4V vs Claude 3
print("Comparing VLM Models:\n")
print("="*80)

prompt = "Describe this image in detail."

try:
    print("\nGPT-4 Vision:")
    gpt4_response = describe_image_gpt4v(sample_url, prompt)
    print(gpt4_response)
except Exception as e:
    print(f"Error: {e}")

print("\n" + "="*80)

try:
    print("\nClaude 3:")
    claude_response = describe_image_claude(sample_url, prompt)
    print(claude_response)
except Exception as e:
    print(f"Error: {e}")
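
The two `try`/`except` blocks above repeat the same pattern. A small harness can run any number of describe functions, record each result or error, and time each call. This is a sketch: `compare_models` is a hypothetical helper, and the two stub functions stand in for real calls such as `describe_image_gpt4v` and `describe_image_claude`.

```python
import time

def compare_models(describers, image_url, prompt):
    """Run each describe function and collect text, error, and elapsed seconds."""
    results = {}
    for name, describe in describers.items():
        start = time.perf_counter()
        try:
            text, error = describe(image_url, prompt), None
        except Exception as exc:  # Keep going so one failing model does not stop the comparison
            text, error = None, str(exc)
        results[name] = {
            "text": text,
            "error": error,
            "seconds": time.perf_counter() - start,
        }
    return results

# Stubs for illustration only; replace with the notebook's real functions.
def fake_describer(image_url, prompt):
    return f"Stub description for {image_url}"

def failing_describer(image_url, prompt):
    raise RuntimeError("simulated API failure")

results = compare_models(
    {"stub": fake_describer, "broken": failing_describer},
    "https://example.com/image.jpg",  # placeholder URL
    "Describe this image in detail.",
)
for name, outcome in results.items():
    print(name, outcome["error"] or outcome["text"], f"{outcome['seconds']:.3f}s")
```

With the real clients configured, the dictionary would be `{"GPT-4 Vision": describe_image_gpt4v, "Claude 3": describe_image_claude}`.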

---
## Summary

This notebook demonstrated VLM capabilities:

1. ✅ **Image Description** - Detailed scene understanding
2. ✅ **Visual Q&A** - Answer specific questions about images
3. ✅ **OCR** - Text extraction from images
4. ✅ **Image Comparison** - Identify similarities and differences
5. ✅ **Chart Analysis** - Extract insights from visualizations
6. ✅ **Object Detection** - Count and locate objects
7. ✅ **Accessibility** - Generate alt text for screen readers
8. ✅ **Product Analysis** - E-commerce descriptions
9. ✅ **Model Comparison** - GPT-4V vs Claude 3

### Key Takeaways:
- VLMs combine vision and language understanding
- Can handle complex visual reasoning tasks
- Useful for accessibility, e-commerce, analysis
- Different models have different strengths

### Use Cases:
- 📸 Content moderation
- 🛒 E-commerce product descriptions
- ♿ Accessibility (alt text generation)
- 📊 Data visualization analysis
- 📄 Document understanding (OCR + context)
- 🔍 Visual search and categorization

### Limitations:
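- Image resolution limits: large images are often downscaled before processing, so fine detail and small text can be lost
- Text-only output: these models interpret images but do not generate them
- May hallucinate details that are not in the image, so verify extracted text, counts, and numbers
- Image inputs add processing time compared with text-only LLM requests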

### Next Steps:
- Explore Large Multimodal Models (LMMs) for multi-modal tasks (audio + video)
- Try the Segment Anything Model (SAM) for image segmentation
- Learn about Large Action Models (LAMs) for taking actions based on visual input