MaVEn (Multi-granularity Hybrid Visual Encoding) is a novel visual encoding framework designed for Multimodal Large Language Models (MLLMs) to enhance their performance in multi-image reasoning and single-image visual comprehension tasks. By combining discrete visual tokens for high-level semantic abstraction and continuous representations for fine-grained details, MaVEn achieves state-of-the-art performance on multiple benchmarks, bridging the gap between visual encoding and language understanding.
- Hybrid Encoding Framework: Combines discrete and continuous representations for comprehensive visual understanding.
- Efficient Patch Reduction: A dynamic patch-reduction mechanism cuts computational overhead while maintaining high performance.
- Versatile Applicability: Excels in both multi-image reasoning tasks (e.g., DemonBench, SEED-Bench) and single-image tasks (e.g., VQA, MMBench).
- Zero-Shot Capability: Demonstrates strong zero-shot performance on multimodal benchmarks.
MaVEn employs a multi-granularity hybrid encoding strategy:
- Discrete Encoding: Captures high-dimensional, abstract semantics.
- Continuous Encoding: Preserves fine-grained, low-level details.
- Patch Selector: Dynamically selects relevant visual tokens based on task requirements.
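As a rough illustration of the three components above, the sketch below quantizes patch features against a codebook to obtain discrete semantic tokens, keeps the raw features as the continuous representation, and prunes them with a query-conditioned top-k selector. This is a minimal NumPy stand-in for intuition only; the function names, dimensions, and scoring rule are our own assumptions, not MaVEn's actual implementation.

```python
import numpy as np

def discrete_tokens(patch_feats, codebook):
    """Map each patch to the index of its nearest codebook entry
    (a stand-in for the discrete, high-level semantic tokens)."""
    # Squared Euclidean distance between every patch and every code vector.
    d = ((patch_feats[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return d.argmin(axis=1)

def select_patches(patch_feats, query_feat, keep_ratio=0.25):
    """Score continuous patch features against a query embedding and keep
    the most relevant fraction (a stand-in for the patch selector)."""
    p = patch_feats / np.linalg.norm(patch_feats, axis=1, keepdims=True)
    q = query_feat / np.linalg.norm(query_feat)
    k = max(1, int(len(patch_feats) * keep_ratio))
    keep = np.argsort(p @ q)[::-1][:k]  # indices of the top-k patches
    return np.sort(keep)

rng = np.random.default_rng(0)
patches = rng.normal(size=(16, 8))   # 16 patches, 8-dim continuous features
codebook = rng.normal(size=(32, 8))  # 32 discrete code vectors
query = rng.normal(size=8)           # task/query embedding

tokens = discrete_tokens(patches, codebook)  # one discrete id per patch
kept = select_patches(patches, query)        # indices of retained patches
hybrid = (tokens, patches[kept])             # discrete ids + pruned continuous feats
print(tokens.shape, patches[kept].shape)
```

The point of the hybrid output is that the language model sees both a compact discrete summary of every patch and full continuous detail for only the patches the selector deems task-relevant.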
MaVEn achieves state-of-the-art performance across various benchmarks:
- DemonBench: Superior multi-image reasoning, visual relation inference, and multi-modal cloze tasks.
- SEED-Bench: Exceptional video understanding in action recognition and procedure comprehension.
- VQA: Outperforms existing MLLMs like LLaVA-1.5, BLIP2, and Qwen-VL-Chat in single-image visual question answering.
- MMBench: Demonstrates significant gains in multimodal benchmarks.
For detailed results, refer to our paper.
- Python 3.8+
- PyTorch 1.11+
- CUDA 11.3+ (optional, for GPU acceleration)

1. Clone the repository:

   ```bash
   git clone https://github.com/your-repo/MaVEn.git
   cd MaVEn
   ```

2. Install dependencies:

   ```bash
   pip install -r requirements.txt
   ```

3. Download pre-trained weights and place them in the `checkpoints/` directory.
Use the following script for inference on multi-image tasks:

```python
from maven import MaVEn

model = MaVEn.load_pretrained('checkpoints/maven_model.pth')
result = model.infer(images=['image1.jpg', 'image2.jpg'], question='What is common between these images?')
print(result)
```

Train MaVEn on a custom dataset:

```bash
python train.py --config configs/maven_config.yaml --data_path /path/to/data
```

Evaluate MaVEn on benchmarks:

```bash
python evaluate.py --config configs/maven_config.yaml --checkpoint checkpoints/maven_model.pth
```

```
MaVEn/
├── configs/      # Configuration files
├── data/         # Data loaders and preprocessing
├── models/       # Model architecture
├── checkpoints/  # Pre-trained model weights
├── scripts/      # Helper scripts
├── train.py      # Training script
├── evaluate.py   # Evaluation script
└── README.md     # Project documentation
```
If you find MaVEn helpful in your research, please cite our paper:
```bibtex
@article{maven2024,
  title={MaVEn: Multi-Granularity Hybrid Visual Encoding Framework for Multimodal Large Language Models},
  author={Author Name and Others},
  journal={arXiv preprint},
  year={2024}
}
```

This project is licensed under the MIT License. See the LICENSE file for details.
Happy coding with MaVEn! 🚀


