[Feature Request] MultimodalPrompt #317

Open
chenllliang opened this issue Oct 16, 2023 · 6 comments
Labels: enhancement (New feature or request)

@chenllliang

Motivation

To enable multimodal perception for agents, the first step is a flexible multimodal prompt class.

The class should have some basic features:

  1. Flexible to add new modalities.
  2. Conversion between human-readable and machine-readable formats.
  3. Easy to save and transfer among different agents.

I am willing to add this feature.

Solution

The prompt could have the following form:

Listen to this audio {audio} and see the image {image} to describe the scene.

I suggest using base64 encoding to store all modality information.
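
A rough sketch of what the class could look like (class, method, and field names here are illustrative, not an existing API): the text stays a human-readable template, each modality payload is stored base64-encoded under its placeholder name, and JSON serialization makes the prompt easy to save and pass between agents.

import base64
import json


class MultimodalPrompt:
    def __init__(self, template: str):
        # Human-readable template, e.g.
        # "Listen to this audio {audio} and see the image {image} to describe the scene."
        self.template = template
        # Maps placeholder name -> (modality tag, base64-encoded payload).
        self.assets: dict[str, tuple[str, str]] = {}

    def add_asset(self, name: str, modality: str, raw_bytes: bytes) -> None:
        # A new modality only needs a new tag string, which keeps the class extensible.
        self.assets[name] = (modality, base64.b64encode(raw_bytes).decode("ascii"))

    def to_human_readable(self) -> str:
        # Placeholders stay visible, so a human can read the prompt as plain text.
        return self.template

    def to_json(self) -> str:
        # Machine-readable form that is easy to save or hand to another agent.
        return json.dumps({"template": self.template, "assets": self.assets})

    @classmethod
    def from_json(cls, payload: str) -> "MultimodalPrompt":
        # Rebuild a prompt received from another agent.
        data = json.loads(payload)
        prompt = cls(data["template"])
        prompt.assets = {name: tuple(value) for name, value in data["assets"].items()}
        return prompt


# Example usage (file names are placeholders):
# prompt = MultimodalPrompt(
#     "Listen to this audio {audio} and see the image {image} to describe the scene.")
# prompt.add_asset("audio", "audio/wav", open("scene.wav", "rb").read())
# prompt.add_asset("image", "image/png", open("scene.png", "rb").read())
# payload = prompt.to_json()                 # transfer this string to another agent
# restored = MultimodalPrompt.from_json(payload)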

Alternatives

No response

Additional context

No response

chenllliang added the enhancement (New feature or request) label on Oct 16, 2023
@lightaime (Member)

Thanks @chenllliang. That is a great idea! Is there any reference for implementing MultimodalPrompt this way?

@chenllliang (Author)

@lightaime Hi, the design is inspired by recent advances in multimodal LLMs such as MMICL and MiniGPT-5, which support interleaved text and images as input. The multimodal information can appear at different places in the prompt (at the head of the prompt for most current VLMs). The MultimodalPrompt class can be instantiated differently for each multimodal LLM, for example:

  1. For MMICL, the input prompt includes the image id and reference information:
1. Interleaved Image-Text Data

Input:  Image 0 is <image0> {image 0}
        ...
        Image j is <imagej> {image j}
        {question}

MMICL:  {answer}

2. In-Context Demonstration Data

Input:  Image 0 is <image0> {image 0}.
        {question} 
        {answer} 
        ...
        Image j is <imagej> {image j}.
        {question} 

MMICL:  {answer}

Different VLMs handle multimodal information differently, so MultimodalPrompt should be flexible enough to fit different models.
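
As a rough illustration (the formatter below is hypothetical; only the numbered-image layout comes from the MMICL templates above), a model-specific renderer could turn the same stored prompt into the layout each VLM expects:

def render_for_mmicl(prompt: "MultimodalPrompt", question: str) -> str:
    # Hypothetical formatter building on the MultimodalPrompt sketch above.
    lines = []
    image_names = [name for name, (modality, _) in prompt.assets.items()
                   if modality.startswith("image")]
    # MMICL numbers the images and introduces each with an explicit reference line.
    for idx, name in enumerate(image_names):
        lines.append(f"Image {idx} is <image{idx}> {{{name}}}")
    lines.append(question)
    return "\n".join(lines)

# A VLM that only accepts images at the head of the prompt would get its own
# formatter, while the stored MultimodalPrompt itself stays unchanged.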

@lightaime (Member)

Thanks for the explanation! I think it is promising (although I have no clue whether GPT-4V does it this way). Please feel free to open a pull request. Also happy to discuss more if you want.

@chenllliang (Author)

I don't know either lol, but from the GPT-4V interface provided to users, it seems that GPT-4V simply concatenates the images and the text.

@lightaime (Member)

I guess they number the images as well. I tried uploading multiple images and asking which one is the first and which is the second, and it worked as expected.

@chenllliang (Author)

Thanks for the information.
