[Feature Request] MultimodalPrompt #317

Open
chenllliang opened this issue Oct 16, 2023 · 6 comments
Labels: enhancement (New feature or request)

@chenllliang

Motivation

To enable multimodal perception for agents, the first step is a flexible multimodal prompt class.

The class should have some basic features:

  1. Flexible to add new modalities.
  2. Conversion between human-readable and machine-readable formats.
  3. Easy to save and transfer among different agents.

I am willing to add this feature.

Solution

The prompt could have the following form:

Listen to this audio {audio} and see the image {image} to describe the scene.

I suggest using base64 encoding to store all modality information.
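
A rough sketch of what the class could look like (class, method, and field names here are illustrative, not an existing API): the text stays a human-readable template, each modality payload is stored base64-encoded under its placeholder name, and JSON serialization makes the prompt easy to save and pass between agents.

import base64
import json


class MultimodalPrompt:
    def __init__(self, template: str):
        # Human-readable template, e.g.
        # "Listen to this audio {audio} and see the image {image} to describe the scene."
        self.template = template
        # Maps placeholder name -> (modality tag, base64-encoded payload).
        self.assets: dict[str, tuple[str, str]] = {}

    def add_asset(self, name: str, modality: str, raw_bytes: bytes) -> None:
        # A new modality only needs a new tag string, which keeps the class extensible.
        self.assets[name] = (modality, base64.b64encode(raw_bytes).decode("ascii"))

    def to_human_readable(self) -> str:
        # Placeholders stay visible, so a human can read the prompt as plain text.
        return self.template

    def to_json(self) -> str:
        # Machine-readable form that is easy to save or hand to another agent.
        return json.dumps({"template": self.template, "assets": self.assets})

    @classmethod
    def from_json(cls, payload: str) -> "MultimodalPrompt":
        # Rebuild a prompt received from another agent.
        data = json.loads(payload)
        prompt = cls(data["template"])
        prompt.assets = {name: tuple(value) for name, value in data["assets"].items()}
        return prompt


# Example usage (file names are placeholders):
# prompt = MultimodalPrompt(
#     "Listen to this audio {audio} and see the image {image} to describe the scene.")
# prompt.add_asset("audio", "audio/wav", open("scene.wav", "rb").read())
# prompt.add_asset("image", "image/png", open("scene.png", "rb").read())
# payload = prompt.to_json()                 # transfer this string to another agent
# restored = MultimodalPrompt.from_json(payload)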

Alternatives

No response

Additional context

No response

chenllliang added the enhancement (New feature or request) label on Oct 16, 2023
@lightaime (Member)

Thanks @chenllliang. That is a great idea! Is there any reference for implementing MultimodalPrompt this way?

@chenllliang (Author)

@lightaime Hi, the design is inspired by recent advances in multimodal LLMs such as MMICL and MiniGPT-5, which support interleaved text and images as input. The multimodal information can appear at different places in the prompt (at the head of the prompt for most current VLMs). The MultimodalPrompt class can be instantiated differently for each multimodal LLM, for example:

  1. For MMICL, the input prompt includes the image id and reference information:
1. Interleaved Image-Text Data

Input:  Image 0 is <image0> {image 0}
        ...
        Image j is <imagej> {image j}
        {question}

MMICL:  {answer}

2. In-Context Demonstration Data

Input:  Image 0 is <image0> {image 0}.
        {question} 
        {answer} 
        ...
        Image j is <imagej> {image j}.
        {question} 

MMICL:  {answer}

Different VLMs handle multimodal information differently, so MultimodalPrompt should be flexible enough to fit different models.
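
As a rough illustration (the formatter below is hypothetical; only the numbered-image layout comes from the MMICL templates above), a model-specific renderer could turn the same stored prompt into the layout each VLM expects:

def render_for_mmicl(prompt: "MultimodalPrompt", question: str) -> str:
    # Hypothetical formatter building on the MultimodalPrompt sketch above.
    lines = []
    image_names = [name for name, (modality, _) in prompt.assets.items()
                   if modality.startswith("image")]
    # MMICL numbers the images and introduces each with an explicit reference line.
    for idx, name in enumerate(image_names):
        lines.append(f"Image {idx} is <image{idx}> {{{name}}}")
    lines.append(question)
    return "\n".join(lines)

# A VLM that only accepts images at the head of the prompt would get its own
# formatter, while the stored MultimodalPrompt itself stays unchanged.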

@lightaime (Member)

Thanks for the explanation! I think it is promising (although I have no clue whether GPT-4V does it this way). Please feel free to open a pull request. Also happy to discuss more if you want.

@chenllliang (Author)

I don't know either lol, but from the GPT-4V interface provided to users, it seems that GPT-4V simply concatenates the images and the text.

@lightaime (Member)

I guess they number the images as well. I tried uploading multiple images and asking which one is the first and which is the second, and it worked as expected.

@chenllliang (Author)

Thanks for the information.
