- [2026.01] 🚀 We release the ViMoE code!
ViMoE introduces three key innovations to the mixture-of-vision-experts paradigm:
- Token-Level Sparse Expert Activation (TLSEA): Unlike prior methods that select experts at the image level, ViMoE lets different spatial tokens use different expert combinations, achieving fine-grained, content-aware feature extraction (see the routing sketch after this list).
- Hierarchical Context Aggregation (HCA): Captures multi-scale visual context and fuses it with textual context to guide expert routing at different granularities.
- Expert Confidence Calibration (ECC): Learns to estimate and calibrate expert contribution confidence to reduce noise from unreliable features.
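To make the token-level routing idea concrete, here is a minimal, self-contained sketch of per-token top-k expert selection. It is illustrative only and is not the ViMoE implementation: the module name `TokenTopKRouter` and all shapes are assumptions, and the actual TLSEA module (shown later in this README) also conditions on aggregated context and a coarse routing mask.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TokenTopKRouter(nn.Module):
    """Toy per-token top-k router (illustrative; not the ViMoE API)."""

    def __init__(self, dim, num_experts, topk):
        super().__init__()
        self.gate = nn.Linear(dim, num_experts)  # per-token routing logits
        self.topk = topk

    def forward(self, tokens):                            # tokens: (B, N, dim)
        logits = self.gate(tokens)                        # (B, N, num_experts)
        topk_vals, topk_idx = logits.topk(self.topk, -1)  # each token picks its own experts
        weights = F.softmax(topk_vals, dim=-1)            # renormalize over the selected experts
        # Scatter the sparse weights back into a dense (B, N, num_experts) map
        routing = torch.zeros_like(logits).scatter(-1, topk_idx, weights)
        return routing, topk_idx

router = TokenTopKRouter(dim=1024, num_experts=7, topk=3)
routing, idx = router(torch.randn(2, 576, 1024))  # 576 spatial tokens, 3 active experts each
```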
| Model | LLM | MME (Perception/Cognition) | MMBench | QBench | MathVista | POPE |
|---|---|---|---|---|---|---|
| MoVA-8B | Llama3-8B | 1595.8/347.5 | 75.3 | 70.8 | 37.7 | 89.3 |
| ViMoE-8B | Llama3-8B | 1612.3/358.2 | 76.8 | 72.3 | 39.2 | 90.1 |
| Model | VQAv2 | GQA | TextVQA | ChartQA | DocVQA |
|---|---|---|---|---|---|
| MoVA-8B | 83.5 | 65.2 | 77.1 | 70.5 | 83.8 |
| ViMoE-8B | 84.1 | 66.5 | 78.3 | 72.1 | 85.2 |
```bash
# Clone the repository
git clone https://github.com/arrdel/vimoe.git
cd vimoe

# Create conda environment
conda create -n vimoe python=3.10 -y
conda activate vimoe

# Install dependencies
pip install -e .
pip install flash-attn --no-build-isolation
```

```python
import torch

from vimoe import ViMoEAdapter, ViMoEConfig
# Load configuration
config = ViMoEConfig()
# Initialize adapter
adapter = ViMoEAdapter(config)
# Load pretrained weights
adapter.load_state_dict(torch.load("vimoe_adapter.pth"))
# Forward pass
output, aux_loss = adapter(features, routing_weights, prompts)
```

```bash
# Pretraining
bash scripts/pretrain.sh

# Supervised Fine-tuning
bash scripts/finetune.sh
```

```
ViMoE/
├── vimoe/
│ ├── __init__.py
│ ├── config.py # Configuration classes
│ ├── constants.py # Constants
│ ├── utils.py # Utility functions
│ └── model/
│ ├── __init__.py
│ └── vimoe_adapter.py # Core ViMoE-Adapter implementation
├── paper/
│ ├── main.tex # Main paper
│ ├── references.bib # Bibliography
│ ├── sec/ # Paper sections
│ └── tables/ # Result tables
├── scripts/
│ ├── pretrain.sh
│ └── finetune.sh
└── README.md
```

```python
from vimoe.model import TokenLevelSparseExpertActivation
tlsea = TokenLevelSparseExpertActivation(
    in_channels=1024,
    num_experts=7,
    topk=3
)
# Returns per-token routing weights
routing_weights, indices, aux_loss = tlsea(tokens, context, coarse_mask)
```
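The auxiliary loss returned alongside the routing weights is typically added to the main training objective to keep expert utilization balanced across tokens, a standard practice when training sparse MoE routers; see `vimoe/model/vimoe_adapter.py` for the exact formulation used here.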
```python
from vimoe.model import HierarchicalContextAggregation

hca = HierarchicalContextAggregation(
    in_channels=1024,
    context_levels=[1, 2, 4],
    text_dim=1024
)
# Aggregates multi-scale visual-textual context
context = hca(visual_features, text_features)
```
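For intuition, hierarchical context aggregation can be pictured as pooling the visual feature map at a few grid resolutions, projecting each scale, and fusing the result with a text embedding. The sketch below is illustrative only and is not the HCA module above; all names, shapes, and the fusion choice (concatenation plus a linear layer) are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleContext(nn.Module):
    """Toy multi-scale visual-text context aggregation (illustrative only)."""

    def __init__(self, dim, text_dim, levels=(1, 2, 4)):
        super().__init__()
        self.levels = levels
        # One projection per pyramid level (1x1, 2x2, 4x4 pooled grids)
        self.proj = nn.ModuleList(nn.Linear(dim * l * l, dim) for l in levels)
        self.fuse = nn.Linear(dim * len(levels) + text_dim, dim)

    def forward(self, visual, text):  # visual: (B, C, H, W), text: (B, text_dim)
        scales = [
            proj(F.adaptive_avg_pool2d(visual, l).flatten(1))  # (B, dim) per pyramid level
            for proj, l in zip(self.proj, self.levels)
        ]
        ctx = torch.cat(scales + [text], dim=-1)  # concatenate scales with the text context
        return self.fuse(ctx)                     # (B, dim) guidance vector for routing

hca_toy = MultiScaleContext(dim=1024, text_dim=1024)
ctx = hca_toy(torch.randn(2, 1024, 24, 24), torch.randn(2, 1024))
```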
```python
from vimoe.model import ExpertConfidenceCalibration

ecc = ExpertConfidenceCalibration(
    in_channels=1024,
    num_experts=7
)
# Calibrates routing weights based on confidence
calibrated_weights, confidence = ecc(base_features, expert_features, routing_weights)
```
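For intuition, confidence calibration can be pictured as scoring each expert's features against the base features and using that score to rescale (and renormalize) the per-token routing weights. The sketch below is illustrative only and is not the ECC module above; names, shapes, and the scoring function are assumptions.

```python
import torch
import torch.nn as nn

class ConfidenceCalibrator(nn.Module):
    """Toy confidence-based rescaling of routing weights (illustrative only)."""

    def __init__(self, dim):
        super().__init__()
        # Score each expert's features against the base features
        self.score = nn.Linear(2 * dim, 1)

    def forward(self, base, experts, routing):
        # base: (B, N, dim), experts: (B, N, E, dim), routing: (B, N, E)
        b = base.unsqueeze(2).expand_as(experts)  # broadcast base features to every expert
        conf = torch.sigmoid(self.score(torch.cat([b, experts], dim=-1))).squeeze(-1)
        calibrated = routing * conf               # down-weight low-confidence experts
        calibrated = calibrated / calibrated.sum(-1, keepdim=True).clamp_min(1e-6)
        return calibrated, conf

ecc_toy = ConfidenceCalibrator(dim=1024)
weights = torch.softmax(torch.randn(2, 576, 7), dim=-1)
calibrated, conf = ecc_toy(torch.randn(2, 576, 1024), torch.randn(2, 576, 7, 1024), weights)
```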
This project builds upon the excellent work of:

This project is licensed under the Apache 2.0 License - see the LICENSE file for details.
