MaVEn (Multi-granularity Hybrid Visual Encoding) is a novel visual encoding framework designed for Multimodal Large Language Models (MLLMs) to enhance their performance in multi-image reasoning and single-image visual comprehension tasks. By combining discrete visual tokens for high-level semantic abstraction and continuous representations for fine-grained details, MaVEn achieves state-of-the-art performance on multiple benchmarks, bridging the gap between visual encoding and language understanding.
- Hybrid Encoding Framework: Combines discrete and continuous representations for comprehensive visual understanding.
- Efficient Patch Reduction: A dynamic patch-reduction mechanism cuts computational overhead while maintaining high performance.
- Versatile Applicability: Excels in both multi-image reasoning tasks (e.g., DemonBench, SEED-Bench) and single-image tasks (e.g., VQA, MMBench).
- Zero-Shot Capability: Demonstrates strong zero-shot performance on multimodal benchmarks.
MaVEn employs a multi-granularity hybrid encoding strategy:
- Discrete Encoding: Captures high-dimensional, abstract semantics.
- Continuous Encoding: Preserves fine-grained, low-level details.
- Patch Selector: Dynamically selects relevant visual tokens based on task requirements.
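As a rough illustration of the three components above, the sketch below quantizes patch features against a codebook to obtain discrete semantic tokens, keeps the raw features as the continuous representation, and prunes them with a query-conditioned top-k selector. This is a minimal NumPy stand-in for intuition only; the function names, dimensions, and scoring rule are our own assumptions, not MaVEn's actual implementation.

```python
import numpy as np

def discrete_tokens(patch_feats, codebook):
    """Map each patch to the index of its nearest codebook entry
    (a stand-in for the discrete, high-level semantic tokens)."""
    # Squared Euclidean distance between every patch and every code vector.
    d = ((patch_feats[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return d.argmin(axis=1)

def select_patches(patch_feats, query_feat, keep_ratio=0.25):
    """Score continuous patch features against a query embedding and keep
    the most relevant fraction (a stand-in for the patch selector)."""
    p = patch_feats / np.linalg.norm(patch_feats, axis=1, keepdims=True)
    q = query_feat / np.linalg.norm(query_feat)
    k = max(1, int(len(patch_feats) * keep_ratio))
    keep = np.argsort(p @ q)[::-1][:k]  # indices of the top-k patches
    return np.sort(keep)

rng = np.random.default_rng(0)
patches = rng.normal(size=(16, 8))   # 16 patches, 8-dim continuous features
codebook = rng.normal(size=(32, 8))  # 32 discrete code vectors
query = rng.normal(size=8)           # task/query embedding

tokens = discrete_tokens(patches, codebook)  # one discrete id per patch
kept = select_patches(patches, query)        # indices of retained patches
hybrid = (tokens, patches[kept])             # discrete ids + pruned continuous feats
print(tokens.shape, patches[kept].shape)
```

The point of the hybrid output is that the language model sees both a compact discrete summary of every patch and full continuous detail for only the patches the selector deems task-relevant.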
MaVEn achieves state-of-the-art performance across various benchmarks:
- DemonBench: Superior multi-image reasoning, visual relation inference, and multi-modal cloze tasks.
- SEED-Bench: Exceptional video understanding in action recognition and procedure comprehension.
- VQA: Outperforms existing MLLMs like LLaVA-1.5, BLIP2, and Qwen-VL-Chat in single-image visual question answering.
- MMBench: Demonstrates significant gains in multimodal benchmarks.
For detailed results, refer to our paper.
- Python 3.8+
- PyTorch 1.11+
- CUDA 11.3+ (optional, for GPU acceleration)

1. Clone the repository:

   ```bash
   git clone https://github.com/your-repo/MaVEn.git
   cd MaVEn
   ```

2. Install dependencies:

   ```bash
   pip install -r requirements.txt
   ```

3. Download pre-trained weights and place them in the `checkpoints/` directory.
Use the following script for inference on multi-image tasks:

```python
from maven import MaVEn

model = MaVEn.load_pretrained('checkpoints/maven_model.pth')
result = model.infer(images=['image1.jpg', 'image2.jpg'], question='What is common between these images?')
print(result)
```

Train MaVEn on a custom dataset:

```bash
python train.py --config configs/maven_config.yaml --data_path /path/to/data
```

Evaluate MaVEn on benchmarks:

```bash
python evaluate.py --config configs/maven_config.yaml --checkpoint checkpoints/maven_model.pth
```

```
MaVEn/
├── configs/      # Configuration files
├── data/         # Data loaders and preprocessing
├── models/       # Model architecture
├── checkpoints/  # Pre-trained model weights
├── scripts/      # Helper scripts
├── train.py      # Training script
├── evaluate.py   # Evaluation script
└── README.md     # Project documentation
```
If you find MaVEn helpful in your research, please cite our paper:
```bibtex
@article{maven2024,
  title={MaVEn: Multi-Granularity Hybrid Visual Encoding Framework for Multimodal Large Language Models},
  author={Author Name and Others},
  journal={arXiv preprint},
  year={2024}
}
```

This project is licensed under the MIT License. See the LICENSE file for details.
Happy coding with MaVEn! 🚀


