Support idefics multimodal #7

sedrickkeh · 2024-04-26T01:11:32Z

This is similar to the Llama implementation in #4, but extended to multimodal HF models.

Idefics2 by HuggingFace supports multiple-image inputs. Its API output format is quite similar to ChatGPT's output format. I initially tried it with 50 frames, which is what GPT4-V was using, but that gave OOM, so I lowered the num_frames to 10.

Other multimodal models on HF should be quite similar to implement, though I think for things like Llava, multi-image input may not be supported off the shelf.

arjunmajum and others added 3 commits April 19, 2024 16:26

Add llama script

2d21e6d

Update README.md

70e733d

add idefics support

474d423

facebook-github-bot added the cla signed label Apr 26, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Support idefics multimodal #7

Support idefics multimodal #7

sedrickkeh commented Apr 26, 2024 •

edited

Support idefics multimodal #7

Are you sure you want to change the base?

Support idefics multimodal #7

Conversation

sedrickkeh commented Apr 26, 2024 • edited

sedrickkeh commented Apr 26, 2024 •

edited