This is the repo for the paper: 'MMRA: A Benchmark for Multi-granularity Multi-image Relational Association'.
Our benchmark dataset is released on the Hugging Face Hub (m-a-p/MMRA), Google Drive, and Baidu Netdisk.
The MMRA.zip archive on Google Drive and Baidu Netdisk contains a metadata.json file with all the sample information, from which the relevant questions, options, and image pairs can be fed to LVLMs.
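As a rough sketch of how one metadata entry might be turned into a multiple-choice prompt for an LVLM (the keys `question` and `options` below are illustrative assumptions, not the exact metadata.json schema; check the file before use):

```python
# Build a multiple-choice prompt from one metadata entry.
# NOTE: the field names "question" and "options" are assumptions about
# the metadata.json schema -- verify them against the released file.
def build_prompt(sample):
    lines = [sample["question"]]
    for label, option in zip("ABCD", sample["options"]):
        lines.append(f"{label}. {option}")
    lines.append("Answer with the option letter only.")
    return "\n".join(lines)

example = {
    "question": "Which relation holds between the two images?",
    "options": ["UsageSimilarity", "SubEvent", "SimilarEvent", "Near"],
}
print(build_prompt(example))
```

The image pair would be passed alongside this text in whatever format the target model expects.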
We define a multi-image relational association task and meticulously curate the MMRA benchmark, a Multi-granularity Multi-image Relational Association benchmark consisting of 1,024 samples. To systematically and comprehensively evaluate mainstream LVLMs, we establish an associational relation system among images that contains 11 subtasks (e.g., UsageSimilarity, SubEvent) at two granularity levels (i.e., "image" and "entity"), based on the relations in ConceptNet.

Our experiments reveal that on the MMRA benchmark, current multi-image LVLMs exhibit distinct advantages and disadvantages across the subtasks. Notably, fine-grained, entity-level multi-image perception tasks pose a greater challenge for LVLMs than image-level tasks, and tasks involving spatial perception are especially difficult. Additionally, our findings indicate that while LVLMs demonstrate a strong capability to perceive image details, improving their ability to associate information across multiple images hinges on strengthening the reasoning capabilities of their language model component. Moreover, we explored the ability of LVLMs to perceive image sequences within our multi-image association task; our experiments indicate that the majority of current LVLMs do not adequately model image sequences during pre-training.
You can load our dataset with the following code:
import datasets

MMRA_data = datasets.load_dataset('m-a-p/MMRA')['train']
print(MMRA_data[0])
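Once the samples and your model's predictions are in hand, per-subtask accuracy is a few lines of bookkeeping. The sketch below uses toy stand-in data; the keys `subtask` and `answer` are assumptions about the sample schema, not the confirmed field names:

```python
from collections import defaultdict

# Toy stand-ins for dataset samples and model predictions.
# NOTE: the keys "subtask" and "answer" are assumed -- verify them
# against the actual MMRA schema before use.
samples = [
    {"subtask": "SubEvent", "answer": "A"},
    {"subtask": "SubEvent", "answer": "B"},
    {"subtask": "UsageSimilarity", "answer": "C"},
]
predictions = ["A", "B", "D"]

def subtask_accuracy(samples, predictions):
    """Return accuracy per subtask, given aligned samples and predictions."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for sample, pred in zip(samples, predictions):
        total[sample["subtask"]] += 1
        correct[sample["subtask"]] += pred == sample["answer"]
    return {name: correct[name] / total[name] for name in total}

print(subtask_accuracy(samples, predictions))
# → {'SubEvent': 1.0, 'UsageSimilarity': 0.0}
```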
We provide a demo to reproduce the results of the LVLMs (i.e., Idefics2, Mantis, Phi3, Qwen-VL-Chat, GPT4o, and GPT4v). You can run the following commands for the open-source multi-image LVLMs:
python ./evaluate/evaluate_demo.py --model_name Idefics2 --model_id HuggingFaceM4/idefics2-8b
python ./evaluate/evaluate_demo.py --model_name Qwen-VL-Chat --model_id Qwen/Qwen-VL-Chat
python ./evaluate/evaluate_demo.py --model_name Mantis_Idefics2 --model_id TIGER-Lab/Mantis-8B-Idefics2
python ./evaluate/evaluate_demo.py --model_name Phi3v --model_id microsoft/Phi-3-vision-128k-instruct
If you want to use OpenAI's API, change the function "gpt_4_api" to select a different model (e.g., GPT4o or GPT4v) and pass your API key and base URL via the "--API_KEY" and "--BASE_URL" arguments.
If you want to evaluate your own model, you only need to adapt the functions "gpt_message" and "process_sample" to your model, which is also simple.
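As a rough sketch of what such an adaptation might look like, the hypothetical `my_model_message` below mirrors the role of "gpt_message"; the chat-message structure is an assumption about your model's input format, not the demo's actual API:

```python
# Hypothetical replacement for gpt_message, adapted to a custom model.
# NOTE: the message structure below is an assumption -- adjust it to
# whatever chat/input format your model actually expects.
def my_model_message(question, options, image_paths):
    option_text = "\n".join(f"{l}. {o}" for l, o in zip("ABCD", options))
    # One content part per image, then the question text with options.
    content = [{"type": "image", "path": p} for p in image_paths]
    content.append({
        "type": "text",
        "text": f"{question}\n{option_text}\nAnswer with the option letter only.",
    })
    return [{"role": "user", "content": content}]

msg = my_model_message(
    "Which relation holds between the two images?",
    ["SubEvent", "Near", "UsageSimilarity", "SimilarEvent"],
    ["img_1.jpg", "img_2.jpg"],
)
print(msg[0]["role"])  # → user
```

"process_sample" would then unpack each MMRA sample into these arguments before calling the model.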
Our original results were obtained with the lmms-eval framework. To make it easier for others to use our code and benchmark to evaluate their own models, we replicated the lmms-eval setup and provide this demo code. The results from our demo code are generally consistent with those of lmms-eval, with only slight differences.
BibTeX:
@article{wu2024mmra,
title={MMRA: A Benchmark for Multi-granularity Multi-image Relational Association},
author={Wu, Siwei and Zhu, Kang and Bai, Yu and Liang, Yiming and Li, Yizhi and Wu, Haoning and Liu, Jiaheng and Liu, Ruibo and Qu, Xingwei and Cheng, Xuxin and others},
journal={arXiv preprint arXiv:2407.17379},
year={2024}
}