<a href="https://colab.research.google.com/github/run-llama/llama_index/blob/main/docs/docs/examples/multi_modal/dashscope_multi_modal.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Multi-Modal LLM using the DashScope qwen-vl model for image reasoning

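In this notebook, we show how to use the DashScope qwen-vl multi-modal LLM class/abstraction for image understanding and reasoning.
Async operations are not currently supported.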
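
Because only synchronous calls are available, code that already runs inside an event loop may want to push the blocking call onto a worker thread. The sketch below illustrates that pattern with the standard library's `asyncio.to_thread` (Python 3.9+). `blocking_complete` is a hypothetical stand-in for a synchronous call such as `complete`; it is not part of the DashScope integration.

```python
import asyncio
import time


def blocking_complete(prompt):
    """Hypothetical stand-in for a synchronous LLM call such as `complete`."""
    time.sleep(0.01)  # simulate the time spent waiting on the network
    return f"response to: {prompt}"


async def complete_in_thread(prompt):
    # Run the blocking call in a worker thread so the event loop stays free.
    return await asyncio.to_thread(blocking_complete, prompt)


async def main():
    # Both calls run concurrently; gather returns results in argument order.
    return await asyncio.gather(
        complete_in_thread("a"),
        complete_in_thread("b"),
    )


results = asyncio.run(main())
print(results)
```

In a notebook, which already runs an event loop, `await main()` would replace `asyncio.run(main())`.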

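We also show several functions we now support for the DashScope LLM:
* `complete` (sync): for a single prompt and a list of images
* `chat` (sync): for multiple chat messages
* `stream complete` (sync): for streaming the output of complete
* `stream chat` (sync): for streaming the output of chat
* Multi-round conversation.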


In [None]:
!pip install -U llama-index-multi-modal-llms-dashscope

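## Use DashScope to understand images from URLs

In this example, we use DashScope to understand images referenced by URL. The URLs are converted into image documents with `load_image_urls`, and those documents are passed to the qwen-vl model together with a text prompt.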


In [None]:
# Set the API key
%env DASHSCOPE_API_KEY=YOUR_DASHSCOPE_API_KEY
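
`%env` is an IPython magic, so it only works inside a notebook. In a plain Python script the same variable can be read from the process environment instead. The following is a minimal sketch that fails early when the key is missing or still set to the placeholder above; `require_api_key` is a hypothetical helper, not part of LlamaIndex or DashScope.

```python
import os


def require_api_key(env=os.environ, name="DASHSCOPE_API_KEY"):
    """Return the API key from `env`, or raise if it is unset or a placeholder."""
    key = env.get(name, "").strip()
    if not key or key == "YOUR_DASHSCOPE_API_KEY":
        raise RuntimeError(f"Set {name} to a real API key before creating the model.")
    return key


# Example with an explicit mapping; by default os.environ is inspected.
print(require_api_key({"DASHSCOPE_API_KEY": "sk-example"}))
```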


## Initialize `DashScopeMultiModal` and load images from URLs


In [None]:
from llama_index.multi_modal_llms.dashscope import (
    DashScopeMultiModal,
    DashScopeMultiModalModels,
)

from llama_index.core.multi_modal_llms.generic_utils import load_image_urls


image_urls = [
    "https://dashscope.oss-cn-beijing.aliyuncs.com/images/dog_and_girl.jpeg",
]

image_documents = load_image_urls(image_urls)

dashscope_multi_modal_llm = DashScopeMultiModal(
    model_name=DashScopeMultiModalModels.QWEN_VL_MAX,
)

### Complete a prompt with a single image


In [None]:
complete_response = dashscope_multi_modal_llm.complete(
    prompt="What's in the image?",
    image_documents=image_documents,
)
print(complete_response)

The image captures a serene moment on a sandy beach at sunset. A woman, dressed in a blue and white plaid shirt, is seated on the ground. She is holding a treat in her hand, which is being gently taken by a dog. The dog, wearing a blue harness, is sitting next to the woman, its paw resting on her leg. The backdrop of this heartwarming scene is the vast ocean, with the sun setting in the distance, casting a warm glow over the entire landscape. The image beautifully encapsulates the bond between the woman and her dog, set against the tranquil beauty of nature.


In [None]:
### Complete a prompt with multiple images
multi_image_urls = [
    "https://dashscope.oss-cn-beijing.aliyuncs.com/images/dog_and_girl.jpeg",
    "https://dashscope.oss-cn-beijing.aliyuncs.com/images/panda.jpeg",
]

multi_image_documents = load_image_urls(multi_image_urls)
complete_response = dashscope_multi_modal_llm.complete(
    prompt="What animals are in the pictures?",
    image_documents=multi_image_documents,
)
print(complete_response)

There is a dog in Picture 1, and there is a panda in Picture 2.


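### Stream complete a prompt with images

`stream_complete` takes the same prompt and image documents as `complete`, but returns the response incrementally. Each chunk exposes the newly generated text in its `delta` attribute.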


In [None]:
stream_complete_response = dashscope_multi_modal_llm.stream_complete(
    prompt="What's in the image?",
    image_documents=image_documents,
)

for r in stream_complete_response:
    print(r.delta, end="")

The image captures a serene moment on a sandy beach at sunset. A woman, dressed in a blue and white plaid shirt, is seated on the ground. She is holding a treat in her hand, which is being gently taken by a dog. The dog, wearing a blue harness, is sitting next to the woman, its paw resting on her leg. The backdrop of this heartwarming scene is the vast ocean, with the sun setting in the distance, casting a warm glow over the entire landscape. The image beautifully encapsulates the bond between the woman and her dog, set against the tranquil beauty of nature.
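
The loop above only prints each chunk. When the full text is also needed afterwards, the deltas can be accumulated while streaming. Below is a minimal sketch of that idea; `collect_stream` is a hypothetical helper, and the fake chunks built with `SimpleNamespace` only mimic the `delta` attribute used above so that the example runs without calling the API.

```python
from types import SimpleNamespace


def collect_stream(chunks, on_delta=None):
    """Consume a stream of chunks with a `delta` attribute and return the full text."""
    parts = []
    for chunk in chunks:
        delta = chunk.delta or ""  # treat a missing delta as empty text
        if on_delta is not None:
            on_delta(delta)  # e.g. print each piece as it arrives
        parts.append(delta)
    return "".join(parts)


# Stand-in for the generator returned by a streaming call.
fake_stream = (
    SimpleNamespace(delta=piece) for piece in ["The image ", "shows ", "a dog."]
)
full_text = collect_stream(fake_stream, on_delta=lambda d: print(d, end=""))
```

With the real model, the generator returned by `stream_complete` would take the place of `fake_stream`.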

### Multi-round conversation with chat messages


In [None]:
from llama_index.core.base.llms.types import MessageRole
from llama_index.multi_modal_llms.dashscope.utils import (
    create_dashscope_multi_modal_chat_message,
)

# First round: the user asks about the image.
chat_message_user_1 = create_dashscope_multi_modal_chat_message(
    "What's in the image?", MessageRole.USER, image_documents
)
chat_response = dashscope_multi_modal_llm.chat([chat_message_user_1])
print(chat_response.message.content[0]["text"])

# Wrap the model's answer as an assistant message so it becomes part of the history.
chat_message_assistent_1 = create_dashscope_multi_modal_chat_message(
    chat_response.message.content[0]["text"], MessageRole.ASSISTANT, None
)

# Second round: a follow-up question without new images.
chat_message_user_2 = create_dashscope_multi_modal_chat_message(
    "what are they doing?", MessageRole.USER, None
)

# Send the whole history so the model can resolve "they".
chat_response = dashscope_multi_modal_llm.chat(
    [chat_message_user_1, chat_message_assistent_1, chat_message_user_2]
)
print(chat_response.message.content[0]["text"])

The image shows two photos of a panda sitting on a wooden log in an enclosure. In the top photo, the panda is sitting upright with its front paws on the log, facing three crows that are perched on the log. The panda looks alert and curious, while the crows seem to be observing the panda. In the bottom photo, the panda is lying down on the log, its head resting on its front paws. One crow has landed on the ground next to the log, and it seems to be interacting with the panda. The background of the photo shows green plants and a wire fence, creating a natural and relaxed atmosphere.
The woman is sitting on the beach with her dog, and they are giving each other high fives. The panda and the crows are sitting together on a log, and the panda seems to be communicating with the crows.
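
The cell above reads the reply with `chat_response.message.content[0]["text"]`, which relies on the first content item being a text entry. A slightly more defensive way to pull the text out is sketched below. `extract_text` is a hypothetical helper, and the assumption that `content` may also be a plain string or contain non-text entries (such as an `"image"` item) is illustrative rather than documented behavior.

```python
def extract_text(content):
    """Join every text entry of a multi-modal message content into one string."""
    if isinstance(content, str):
        return content
    return "".join(
        item["text"]
        for item in content
        if isinstance(item, dict) and "text" in item
    )


# Stand-in for `chat_response.message.content`; the "image" entry is hypothetical.
sample_content = [
    {"image": "https://example.com/picture.jpeg"},
    {"text": "There is a dog in Picture 1."},
]
print(extract_text(sample_content))
```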


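### Stream chat a prompt response with a chat message list

`stream_chat` takes the same list of chat messages as `chat` and yields the response incrementally, with the newly generated text available in each chunk's `delta` attribute.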


In [None]:
stream_chat_response = dashscope_multi_modal_llm.stream_chat(
    [chat_message_user_1, chat_message_assistent_1, chat_message_user_2]
)
for r in stream_chat_response:
    print(r.delta, end="")

The woman is sitting on the beach, holding a treat in her hand, while the dog is sitting next to her, taking the treat from her hand.

### Use images from local files
Local files are referenced with `file://` URLs:  
    Linux and macOS file path: file:///home/images/test.png  
    Windows file path: file://D:/images/abc.png  
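
The two formats above can be produced from an absolute path by prefixing its forward-slash form with `file://`. The sketch below shows this with `pathlib`; `to_file_url` is a hypothetical helper written to match the examples above. Note that `pathlib`'s own `as_uri()` writes Windows paths with three slashes (`file:///D:/...`), which differs from the Windows format shown here.

```python
from pathlib import PurePosixPath, PureWindowsPath


def to_file_url(path):
    """Build a `file://` URL in the format shown above from an absolute pure path."""
    if not path.is_absolute():
        raise ValueError(f"expected an absolute path, got {path}")
    # as_posix() converts backslashes to forward slashes on Windows paths.
    return "file://" + path.as_posix()


print(to_file_url(PurePosixPath("/home/images/test.png")))
print(to_file_url(PureWindowsPath(r"D:\images\abc.png")))
```

Pure paths are used so the example behaves the same on any operating system; in real code, `pathlib.Path(...).resolve()` would give an absolute path of the native flavor.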


In [None]:
from llama_index.multi_modal_llms.dashscope.utils import load_local_images

local_images = [
    "file://THE_FILE_PATH1",
    "file://THE_FILE_PATH2",
]

image_documents = load_local_images(local_images)
chat_message_local = create_dashscope_multi_modal_chat_message(
    "What animals are in the pictures?", MessageRole.USER, image_documents
)
chat_response = dashscope_multi_modal_llm.chat([chat_message_local])
print(chat_response.message.content[0]["text"])

There is a dog in Picture 1, and there is a panda in Picture 2.
