This repo contains the code and data for our benchmark paper:
Multimodal Needle in a Haystack: Benchmarking Long-Context Capability of Multimodal LLMs
H. Wang, H. Shi, S. Tan, W. Qin, W. Wang, T. Zhang, A. Nambi, T. Ganu, H. Wang
[Paper] [MMNeedle Dataset]
To use this benchmark, please download the MMNeedle dataset at this link. Alternatively, you can construct your own version of MMNeedle by following the instructions below.
[2024-06-27] New project page set up for MMNeedle.
[2024-06-24] We released the leaderboard for Multimodal Long Context Understanding on Papers with Code!
[2024-06-17] We released the paper, code, and data for Multimodal Needle in a Haystack (MMNeedle) benchmark!
MMNeedle Evaluation Overview. Correct answers are marked with a checkmark (✓).
MMNeedle Evaluation Performance Comparison (Claude-3 refers to Claude 3 Opus, and Gemini-1.0/1.5 refers to Gemini Pro 1.0/1.5). The x-axis shows the different models, and the y-axis shows the settings with varying input image number M and stitching size N. For each row, i.e., each setting (M,N), we show the average accuracy (%) of each model. Within each stitched image, the color of row r, column c indicates the accuracy of predicting the exact position for samples whose "needle" sub-image is at position (r,c) of the stitched image. For the M=10 setting, we show the average accuracy at each location (r,c) over 10 images. A redder cell indicates lower accuracy, while a greener cell indicates higher accuracy. The best result in each row is underlined.
conda env create -f context.yml
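Then activate the environment before running the steps below. A minimal sketch, assuming the environment defined in context.yml is named context; check the name field in context.yml or conda env list if it differs:
conda activate context   # environment name assumed from context.yml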
Download MS COCO
Put the val2014 and annotations_trainval directories in the current directory.
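For reference, a minimal download sketch using the official MS COCO URLs; how the extracted annotations/ folder maps onto the annotations_trainval/ directory above is an assumption, so adjust the placement to match what the scripts expect:
wget http://images.cocodataset.org/zips/val2014.zip                           # COCO val2014 images
wget http://images.cocodataset.org/annotations/annotations_trainval2014.zip   # COCO 2014 train/val annotations
unzip val2014.zip                    # creates val2014/
unzip annotations_trainval2014.zip   # creates annotations/ (place its contents where annotations_trainval/ expects them)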
python ./annotations_trainval/file_to_caption.py   # extract image-to-caption mappings from the COCO annotations
python sample_images.py                            # sample images from COCO val2014
python sample_stitched_images.py                   # create the stitched (haystack) images
python sample_single_needles.py                    # construct single-needle test samples
python sample_multiple_needles.py                  # construct multi-needle test samples
export BEGIN=0                 # index of the first test sample
export N_SEQ=1000              # number of test samples to run
export N_NEEDLES=1             # number of needles per haystack (single-needle setting)
export MODEL_PROVIDER='Gemini'
bash test.sh
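The same environment variables can be adjusted for other settings. For example, a minimal sketch of a multi-needle run; the exact needle counts accepted by test.sh are an assumption, so check the script for the supported values:
export N_NEEDLES=2   # assumed example value for the multi-needle setting
bash test.sh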
export BEGIN=0               # index of the first sample to evaluate
export N_SEQ=1000            # number of samples to evaluate
python evaluate.py           # evaluate single-needle results
python evaluate_multi.py     # evaluate multi-needle results
@misc{wang2024multimodal,
title={Multimodal Needle in a Haystack: Benchmarking Long-Context Capability of Multimodal Large Language Models},
author={Hengyi Wang and
Haizhou Shi and
Shiwei Tan and
Weiyi Qin and
Wenyuan Wang and
Tunyu Zhang and
Akshay Nambi and
Tanuja Ganu and
Hao Wang},
year={2024},
eprint={2406.11230},
archivePrefix={arXiv},
primaryClass={cs.LG}
}