Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support idefics multimodal #7

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

sedrickkeh
Copy link

@sedrickkeh sedrickkeh commented Apr 26, 2024

This is similar to the Llama implementation in #4, but extended to multimodal HF models.

Idefics2 by HuggingFace supports multiple-image inputs. Its API output format is quite similar to ChatGPT's output format. I initially tried it with 50 frames, which is what GPT4-V was using, but that gave OOM, so I lowered the num_frames to 10.

Other multimodal models on HF should be quite similar to implement, though I think for things like Llava, multi-image input may not be supported off the shelf.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants