# 8.1 Vision (Image Understanding)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/forhow134/ai-coding-guide/blob/main/demos/08-multimodal/vision.ipynb)

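**Estimated API cost: ~$0.02**

Use the GPT-4o model family (the cells below call `gpt-4o-mini`) to describe images, run OCR, and compare multiple images.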

In [None]:
!pip install -q openai pillow

In [None]:
import os
from getpass import getpass

if not os.environ.get('OPENAI_API_KEY'):
    os.environ['OPENAI_API_KEY'] = getpass('OpenAI API Key: ')

## Experiment 1: Basic Image Understanding (URL)

In [None]:
from openai import OpenAI

client = OpenAI()

image_url = 'https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg'

response = client.chat.completions.create(
    model='gpt-4o-mini',
    messages=[{
        'role': 'user',
        'content': [
            {'type': 'text', 'text': 'Describe this image in detail.'},
            {'type': 'image_url', 'image_url': {'url': image_url}}
        ]
    }],
    max_tokens=300
)

print(response.choices[0].message.content)

## Experiment 2: Local Images via Base64

In [None]:
import base64
from openai import OpenAI

client = OpenAI()

def encode_image(image_path):
    with open(image_path, 'rb') as f:
        return base64.b64encode(f.read()).decode('utf-8')

# Example: if you have a local image, uncomment the lines below
# image_base64 = encode_image('screenshot.png')
# response = client.chat.completions.create(
#     model='gpt-4o-mini',
#     messages=[{
#         'role': 'user',
#         'content': [
#             {'type': 'text', 'text': 'Extract the text in this image.'},
#             {'type': 'image_url', 'image_url': {'url': f'data:image/png;base64,{image_base64}'}}
#         ]
#     }]
# )
# print(response.choices[0].message.content)

print('Tip: pass a local image path to encode_image() and uncomment the request to test')
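
The commented request above hardcodes `image/png` in the data URL, which only matches PNG files. As a minimal sketch, the MIME type can instead be guessed from the file extension with the standard-library `mimetypes` module. `to_data_url` is a hypothetical helper name, not part of the `openai` package, and it trusts the extension rather than inspecting the file contents.

```python
import base64
import mimetypes


def to_data_url(image_path):
    """Return a data URL for a local image, guessing the MIME type from its extension."""
    mime, _ = mimetypes.guess_type(image_path)
    if mime is None or not mime.startswith('image/'):
        raise ValueError(f'cannot determine an image MIME type for {image_path!r}')
    with open(image_path, 'rb') as f:
        encoded = base64.b64encode(f.read()).decode('utf-8')
    return f'data:{mime};base64,{encoded}'
```

The returned string can be used as the `url` value of an `image_url` content part, in place of the hand-built f-string.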

## Experiment 3: Comparing Multiple Images

In [None]:
from openai import OpenAI

client = OpenAI()

img1 = 'https://upload.wikimedia.org/wikipedia/commons/thumb/3/3a/Cat03.jpg/1200px-Cat03.jpg'
img2 = 'https://upload.wikimedia.org/wikipedia/commons/thumb/4/4d/Cat_November_2010-1a.jpg/1200px-Cat_November_2010-1a.jpg'

response = client.chat.completions.create(
    model='gpt-4o-mini',
    messages=[{
        'role': 'user',
        'content': [
            {'type': 'text', 'text': 'Compare these two images and describe the differences.'},
            {'type': 'image_url', 'image_url': {'url': img1}},
            {'type': 'image_url', 'image_url': {'url': img2}}
        ]
    }]
)

print(response.choices[0].message.content)
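
The content list above can also be built programmatically, which is a convenient place to set the `detail` field (`low`, `high`, or `auto`) on each image part. The sketch below only constructs the message dictionary and makes no API call. `image_part` and `vision_message` are hypothetical helper names, and the `example.com` URLs are placeholders.

```python
def image_part(url, detail='auto'):
    """Build one image content part with an explicit detail level."""
    if detail not in ('low', 'high', 'auto'):
        raise ValueError(f'unsupported detail level: {detail!r}')
    return {'type': 'image_url', 'image_url': {'url': url, 'detail': detail}}


def vision_message(prompt, urls, detail='auto'):
    """Build a user message with one text part followed by one part per image."""
    content = [{'type': 'text', 'text': prompt}]
    content += [image_part(u, detail) for u in urls]
    return {'role': 'user', 'content': content}


# Placeholder URLs for illustration only
message = vision_message(
    'Compare these two images and describe the differences.',
    ['https://example.com/a.jpg', 'https://example.com/b.jpg'],
    detail='low',
)
print(message)
```

The result can be passed as `messages=[message]` to `client.chat.completions.create`. Choosing `low` trades accuracy on fine detail for speed and a lower token count.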

## Experiment 4: Structured Data Extraction (OCR)

In [None]:
from openai import OpenAI
import json

client = OpenAI()

# Assume an image that contains a table, such as an invoice (placeholder URL)
image_url = 'https://example.com/invoice.jpg'

# response = client.chat.completions.create(
#     model='gpt-4o-mini',
#     messages=[{
#         'role': 'user',
#         'content': [
#             {'type': 'text', 'text': 'Extract the invoice information and return it as JSON: {"amount": amount, "date": date, "items": []}'},
#             {'type': 'image_url', 'image_url': {'url': image_url}}
#         ]
#     }],
#     response_format={'type': 'json_object'}
# )

# data = json.loads(response.choices[0].message.content)
# print(json.dumps(data, indent=2, ensure_ascii=False))

print('Tip: replace image_url with a real invoice image URL, then uncomment the request to test')
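
JSON mode is meant to return syntactically valid JSON, but it does not by itself guarantee that the keys requested in the prompt are present. A minimal sketch of a defensive parse step follows. `parse_invoice` and `REQUIRED_KEYS` are hypothetical names, and the sample string is made up for illustration rather than real model output.

```python
import json

# Keys requested in the prompt above
REQUIRED_KEYS = ('amount', 'date', 'items')


def parse_invoice(raw):
    """Parse the model's reply and check that the expected keys are present."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError('expected a JSON object')
    missing = [k for k in REQUIRED_KEYS if k not in data]
    if missing:
        raise ValueError(f'missing keys: {missing}')
    if not isinstance(data['items'], list):
        raise ValueError('"items" should be a list')
    return data


# Made-up reply for illustration only, not real model output
sample = '{"amount": 128.5, "date": "2024-01-15", "items": ["paper", "toner"]}'
invoice = parse_invoice(sample)
print(json.dumps(invoice, indent=2, ensure_ascii=False))
```

In the notebook, `response.choices[0].message.content` would take the place of `sample`.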

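## Key Takeaways

1. **Two input methods**: a URL (publicly reachable image) or Base64 (local file)
2. **`detail` parameter**: `low` (faster, cheaper) vs `high` (more accurate)
3. **Multi-image support**: several images can be sent in one request
4. **JSON mode**: extract structured data
5. **Image billing**: based on resolution and the `detail` setting; a 1024×1024 image at high detail is about 765 tokens on GPT-4o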
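
The 765-token figure can be reproduced with the tile-based formula OpenAI has published for GPT-4o. Low detail is a flat 85 tokens. High detail resizes the image to fit within 2048×2048, shrinks the shortest side to 768 px, and then charges 170 tokens per 512 px tile plus 85. The sketch below is an estimate under those assumptions. `estimate_image_tokens` is a hypothetical helper, it assumes small images are not upscaled, and other models (including `gpt-4o-mini`) may use different constants.

```python
import math


def estimate_image_tokens(width, height, detail='high'):
    """Estimate image input tokens using the GPT-4o tile formula (an approximation)."""
    if detail == 'low':
        return 85  # flat cost, independent of resolution
    # Any other value ('high' or 'auto') is treated as high detail here.
    # Step 1: shrink to fit inside a 2048 x 2048 square (never upscale)
    scale = min(1.0, 2048 / max(width, height))
    w, h = width * scale, height * scale
    # Step 2: shrink so the shortest side is at most 768 px
    scale = min(1.0, 768 / min(w, h))
    w, h = w * scale, h * scale
    # Step 3: count 512 px tiles, 170 tokens each, plus an 85-token base
    tiles = math.ceil(w / 512) * math.ceil(h / 512)
    return 85 + 170 * tiles


print(estimate_image_tokens(1024, 1024, 'high'))  # 765
print(estimate_image_tokens(1024, 1024, 'low'))   # 85
```

A 1024×1024 image is resized to 768×768, which covers four tiles: 4 × 170 + 85 = 765.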

---

**Next**: [8.2 Image Generation](./image_generation.ipynb)