[Feature Request] MultimodalPrompt #317
Comments
Thanks @chenllliang. That is a great idea! Is there any reference on implementing …
@lightaime Hi, the design is inspired by current advances in multimodal LLMs like MMICL and MiniGPT-5, which support interleaved text and images as input. The multimodal information can appear at different places in the prompt (at the head of the prompt for most current VLMs).
Different VLMs handle the multimodal information differently, …
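As a rough illustration of the interleaved design described above, a prompt could be an ordered list of typed segments, so an image can sit at the head, middle, or tail of the text. This is only a sketch; the segment names here are hypothetical and not part of any existing codebase:

```python
from dataclasses import dataclass
from typing import List, Union

# Hypothetical segment types for an interleaved prompt.
@dataclass
class TextSegment:
    text: str

@dataclass
class ImageSegment:
    data: str  # e.g. base64-encoded image bytes

# An interleaved prompt is an ordered list of segments, so multimodal
# content can appear anywhere in the prompt, not just at the head.
Prompt = List[Union[TextSegment, ImageSegment]]

prompt: Prompt = [
    ImageSegment(data="<base64...>"),  # head placement, as in most current VLMs
    TextSegment(text="Describe the scene in this image."),
]
```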
Thanks for the explanation! I think it is promising (although I have no clue if …
I don't know either lol, but from the GPT-4V interface provided to users, it seems that GPT-4V simply concatenates the image and the text.
I guess they number the images as well. I tried uploading multiple images and asking what the first one is and what the second one is, and it does work as expected.
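For what it's worth, the public OpenAI chat API does let a caller interleave text and multiple image parts in a single message, which is one way to reproduce this multi-image experiment programmatically. The sketch below assumes the v1 Python SDK; the model name and image URLs are placeholders and may differ across API versions:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Interleave text and two images in one user message, mirroring the
# "what is the first image, what is the second" experiment above.
response = client.chat.completions.create(
    model="gpt-4-vision-preview",  # model name may differ by API version
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "What is in the first image, and what is in the second?"},
                {"type": "image_url",
                 "image_url": {"url": "https://example.com/first.png"}},
                {"type": "image_url",
                 "image_url": {"url": "https://example.com/second.png"}},
            ],
        }
    ],
)
print(response.choices[0].message.content)
```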
Thanks for the information.
Required prerequisites
Motivation
To enable multimodal perception for agents, we first need a flexible multimodal prompt class.
The class should have some basic features:
I am willing to add this feature.
Solution
The prompt could have the following form:
Listen to this audio {audio} and see the image {image} to describe the scene.
I suggest using base64 encoding to store all modality information.
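A minimal sketch of what such a class could look like, assuming named placeholders in the template and base64 storage for every modality; the class and method names here are hypothetical, not an existing API:

```python
import base64
from pathlib import Path

# Hypothetical MultimodalPrompt sketch: a text template with named
# placeholders plus a dict of base64-encoded modality payloads.
class MultimodalPrompt:
    def __init__(self, template: str) -> None:
        # Template with named placeholders, e.g. "{audio}", "{image}".
        self.template = template
        self.modalities: dict[str, str] = {}

    def attach(self, name: str, path: str) -> None:
        # Store every modality uniformly as a base64 string.
        raw = Path(path).read_bytes()
        self.modalities[name] = base64.b64encode(raw).decode("ascii")

    def render(self) -> str:
        # Replace each placeholder with a short reference token; the
        # base64 payloads travel alongside the text so the backend VLM
        # can inject them wherever its input format requires.
        refs = {name: f"<{name}>" for name in self.modalities}
        return self.template.format(**refs)

prompt = MultimodalPrompt(
    "Listen to this audio {audio} and see the image {image} to describe the scene."
)
prompt.attach("audio", "scene.wav")  # hypothetical local files
prompt.attach("image", "scene.png")
print(prompt.render())               # text with <audio>/<image> reference tokens
print(list(prompt.modalities))       # base64 payloads keyed by placeholder name
```

Keeping the payloads keyed by placeholder name rather than inlining them in the text leaves each backend free to place the raw data wherever its input format expects it (head of prompt, interleaved, etc.).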
Alternatives
No response
Additional context
No response