
ViMoE: Vision Mixture of Experts with Multimodal Context Awareness


🔥 News

  • [2026.01] 🚀 We release the ViMoE code!

👀 About ViMoE

ViMoE introduces three key innovations to the mixture-of-vision-experts paradigm:

  1. Token-Level Sparse Expert Activation (TLSEA): Unlike prior methods that select experts at the image level, ViMoE lets different spatial tokens use different expert combinations, achieving fine-grained, content-aware feature extraction (a minimal routing sketch follows this list).

  2. Hierarchical Context Aggregation (HCA): Captures multi-scale visual context and fuses it with textual context to guide expert routing at different granularities.

  3. Expert Confidence Calibration (ECC): Learns to estimate and calibrate expert contribution confidence to reduce noise from unreliable features.
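A minimal sketch of the token-level routing idea behind TLSEA (illustrative only: variable names and shapes are assumptions, and the real module adds context conditioning and an auxiliary loss). It reuses the hyperparameters from the TLSEA example further below (1024-dim tokens, 7 experts, top-3):

import torch
import torch.nn as nn
import torch.nn.functional as F

# Sketch: each spatial token picks its own top-k experts, instead of a
# single expert set for the whole image.
router = nn.Linear(1024, 7)                     # dim -> num_experts
tokens = torch.randn(1, 576, 1024)              # (batch, tokens, dim), e.g. a 24x24 grid

logits = router(tokens)                         # (B, N, 7) per-token expert scores
topk_vals, topk_idx = logits.topk(3, dim=-1)    # every token keeps its 3 best experts
weights = F.softmax(topk_vals, dim=-1)          # mixture weights over the kept experts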

🖼️ Framework Overview

[Figure: overview of the ViMoE framework]

📊 Results

MLLM Benchmarks

| Model | LLM | MME (Perception/Cognition) | MMBench | QBench | MathVista | POPE |
|---|---|---|---|---|---|---|
| MoVA-8B | Llama3-8B | 1595.8/347.5 | 75.3 | 70.8 | 37.7 | 89.3 |
| ViMoE-8B | Llama3-8B | 1612.3/358.2 | 76.8 | 72.3 | 39.2 | 90.1 |

Visual Question Answering

| Model | VQAv2 | GQA | TextVQA | ChartQA | DocVQA |
|---|---|---|---|---|---|
| MoVA-8B | 83.5 | 65.2 | 77.1 | 70.5 | 83.8 |
| ViMoE-8B | 84.1 | 66.5 | 78.3 | 72.1 | 85.2 |

🛠️ Installation

# Clone the repository
git clone https://github.com/arrdel/vimoe.git
cd vimoe

# Create conda environment
conda create -n vimoe python=3.10 -y
conda activate vimoe

# Install dependencies
pip install -e .
pip install flash-attn --no-build-isolation

🚀 Quick Start

Inference

import torch

from vimoe import ViMoEAdapter, ViMoEConfig

# Load configuration
config = ViMoEConfig()

# Initialize adapter
adapter = ViMoEAdapter(config)

# Load pretrained weights
adapter.load_state_dict(torch.load("vimoe_adapter.pth"))

# Forward pass (features, routing_weights, and prompts are prepared upstream)
output, aux_loss = adapter(features, routing_weights, prompts)
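The inputs above are placeholders. A hedged example with dummy tensors (the shapes are illustrative assumptions, not the exact interface, which is set by ViMoEConfig):

import torch

# Hypothetical shapes: batch of 1, 576 visual tokens, 1024-dim features, 7 experts
features = torch.randn(1, 576, 1024)
routing_weights = torch.rand(1, 576, 7)
prompts = ["Describe the image."]

output, aux_loss = adapter(features, routing_weights, prompts)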

Training

# Pretraining
bash scripts/pretrain.sh

# Supervised Fine-tuning
bash scripts/finetune.sh

📁 Project Structure

ViMoE/
├── vimoe/
│   ├── __init__.py
│   ├── config.py              # Configuration classes
│   ├── constants.py           # Constants
│   ├── utils.py               # Utility functions
│   └── model/
│       ├── __init__.py
│       └── vimoe_adapter.py   # Core ViMoE-Adapter implementation
├── paper/
│   ├── main.tex               # Main paper
│   ├── references.bib         # Bibliography
│   ├── sec/                   # Paper sections
│   └── tables/                # Result tables
├── scripts/
│   ├── pretrain.sh
│   └── finetune.sh
└── README.md

🔑 Key Components

Token-Level Sparse Expert Activation (TLSEA)

from vimoe.model import TokenLevelSparseExpertActivation

tlsea = TokenLevelSparseExpertActivation(
    in_channels=1024,
    num_experts=7,
    topk=3
)

# Returns per-token routing weights
routing_weights, indices, aux_loss = tlsea(tokens, context, coarse_mask)
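The returned aux_loss is presumably a load-balancing term that keeps the experts evenly used. A common formulation (Switch-Transformer style) is sketched below as an assumption; the actual ViMoE loss may differ:

import torch
import torch.nn.functional as F

# Hypothetical sketch of a standard load-balancing auxiliary loss.
def load_balance_loss(logits, topk_idx, num_experts=7):
    # logits: (B, N, E) raw router scores; topk_idx: (B, N, k) chosen experts
    probs = F.softmax(logits, dim=-1).mean(dim=(0, 1))   # mean router prob per expert
    counts = F.one_hot(topk_idx, num_experts).float()    # (B, N, k, E)
    frac = counts.sum(dim=2).mean(dim=(0, 1))            # fraction of tokens per expert
    return num_experts * (probs * frac).sum()            # minimized by a uniform spread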

Hierarchical Context Aggregation (HCA)

from vimoe.model import HierarchicalContextAggregation

hca = HierarchicalContextAggregation(
    in_channels=1024,
    context_levels=[1, 2, 4],
    text_dim=1024
)

# Aggregates multi-scale visual-textual context
context = hca(visual_features, text_features)
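The context_levels=[1, 2, 4] argument suggests pyramid-style pooling over the feature map. A self-contained sketch of that idea (pyramid_context is a hypothetical helper; the real HCA module also fuses the text features):

import torch
import torch.nn.functional as F

def pyramid_context(feat_map, levels=(1, 2, 4)):
    # feat_map: (B, C, H, W) spatial features
    pooled = [F.adaptive_avg_pool2d(feat_map, s).flatten(2) for s in levels]
    return torch.cat(pooled, dim=-1)   # (B, C, 1 + 4 + 16) multi-scale context tokens

ctx = pyramid_context(torch.randn(1, 1024, 24, 24))   # -> (1, 1024, 21)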

Expert Confidence Calibration (ECC)

from vimoe.model import ExpertConfidenceCalibration

ecc = ExpertConfidenceCalibration(
    in_channels=1024,
    num_experts=7
)

# Calibrates routing weights based on confidence
calibrated_weights, confidence = ecc(base_features, expert_features, routing_weights)
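One plausible reading of "calibrates routing weights based on confidence" is to down-weight experts the model is unsure about and renormalize. A hypothetical sketch (the actual ECC module learns the confidence estimate):

import torch

def calibrate(routing_weights, confidence, eps=1e-6):
    # routing_weights, confidence: (B, N, E); confidence assumed in [0, 1]
    scaled = routing_weights * confidence                       # suppress low-confidence experts
    return scaled / (scaled.sum(dim=-1, keepdim=True) + eps)   # renormalize per token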

🙏 Acknowledgements

This project builds upon the excellent work of:

📄 License

This project is licensed under the Apache 2.0 License - see the LICENSE file for details.
