
ViMoE: Vision Mixture of Experts with Multimodal Context Awareness


🔥 News

  • [2026.01] 🚀 We release the ViMoE code!

👀 About ViMoE

ViMoE introduces three key innovations to the mixture-of-vision-experts paradigm:

  1. Token-Level Sparse Expert Activation (TLSEA): Unlike prior methods that select experts at the image level, ViMoE lets different spatial tokens use different expert combinations, achieving fine-grained, content-aware feature extraction (a minimal routing sketch follows this list).

  2. Hierarchical Context Aggregation (HCA): Captures multi-scale visual context and fuses it with textual context to guide expert routing at different granularities.

  3. Expert Confidence Calibration (ECC): Learns to estimate and calibrate expert contribution confidence to reduce noise from unreliable features.
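A minimal sketch of the token-level routing idea behind TLSEA (illustrative only: variable names and shapes are assumptions, and the real module adds context conditioning and an auxiliary loss). It reuses the hyperparameters from the TLSEA example further below (1024-dim tokens, 7 experts, top-3):

import torch
import torch.nn as nn
import torch.nn.functional as F

# Sketch: each spatial token picks its own top-k experts, instead of a
# single expert set for the whole image.
router = nn.Linear(1024, 7)                     # dim -> num_experts
tokens = torch.randn(1, 576, 1024)              # (batch, tokens, dim), e.g. a 24x24 grid

logits = router(tokens)                         # (B, N, 7) per-token expert scores
topk_vals, topk_idx = logits.topk(3, dim=-1)    # every token keeps its 3 best experts
weights = F.softmax(topk_vals, dim=-1)          # mixture weights over the kept experts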

🖼️ Framework Overview

[Figure: overview of the ViMoE framework]

📊 Results

MLLM Benchmarks

| Model | LLM | MME (Perception/Cognition) | MMBench | QBench | MathVista | POPE |
|---|---|---|---|---|---|---|
| MoVA-8B | Llama3-8B | 1595.8/347.5 | 75.3 | 70.8 | 37.7 | 89.3 |
| ViMoE-8B | Llama3-8B | 1612.3/358.2 | 76.8 | 72.3 | 39.2 | 90.1 |

Visual Question Answering

| Model | VQAv2 | GQA | TextVQA | ChartQA | DocVQA |
|---|---|---|---|---|---|
| MoVA-8B | 83.5 | 65.2 | 77.1 | 70.5 | 83.8 |
| ViMoE-8B | 84.1 | 66.5 | 78.3 | 72.1 | 85.2 |

🛠️ Installation

# Clone the repository
git clone https://github.com/arrdel/vimoe.git
cd vimoe

# Create conda environment
conda create -n vimoe python=3.10 -y
conda activate vimoe

# Install dependencies
pip install -e .
pip install flash-attn --no-build-isolation

🚀 Quick Start

Inference

import torch

from vimoe import ViMoEAdapter, ViMoEConfig

# Load configuration
config = ViMoEConfig()

# Initialize adapter
adapter = ViMoEAdapter(config)

# Load pretrained weights
adapter.load_state_dict(torch.load("vimoe_adapter.pth"))

# Forward pass (features, routing_weights, and prompts are prepared upstream)
output, aux_loss = adapter(features, routing_weights, prompts)
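The inputs above are placeholders. A hedged example with dummy tensors (the shapes are illustrative assumptions, not the exact interface, which is set by ViMoEConfig):

import torch

# Hypothetical shapes: batch of 1, 576 visual tokens, 1024-dim features, 7 experts
features = torch.randn(1, 576, 1024)
routing_weights = torch.rand(1, 576, 7)
prompts = ["Describe the image."]

output, aux_loss = adapter(features, routing_weights, prompts)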

Training

# Pretraining
bash scripts/pretrain.sh

# Supervised Fine-tuning
bash scripts/finetune.sh

📁 Project Structure

ViMoE/
├── vimoe/
│   ├── __init__.py
│   ├── config.py              # Configuration classes
│   ├── constants.py           # Constants
│   ├── utils.py               # Utility functions
│   └── model/
│       ├── __init__.py
│       └── vimoe_adapter.py   # Core ViMoE-Adapter implementation
├── paper/
│   ├── main.tex               # Main paper
│   ├── references.bib         # Bibliography
│   ├── sec/                   # Paper sections
│   └── tables/                # Result tables
├── scripts/
│   ├── pretrain.sh
│   └── finetune.sh
└── README.md

🔑 Key Components

Token-Level Sparse Expert Activation (TLSEA)

from vimoe.model import TokenLevelSparseExpertActivation

tlsea = TokenLevelSparseExpertActivation(
    in_channels=1024,
    num_experts=7,
    topk=3
)

# Returns per-token routing weights
routing_weights, indices, aux_loss = tlsea(tokens, context, coarse_mask)
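The returned aux_loss is presumably a load-balancing term that keeps the experts evenly used. A common formulation (Switch-Transformer style) is sketched below as an assumption; the actual ViMoE loss may differ:

import torch
import torch.nn.functional as F

# Hypothetical sketch of a standard load-balancing auxiliary loss.
def load_balance_loss(logits, topk_idx, num_experts=7):
    # logits: (B, N, E) raw router scores; topk_idx: (B, N, k) chosen experts
    probs = F.softmax(logits, dim=-1).mean(dim=(0, 1))   # mean router prob per expert
    counts = F.one_hot(topk_idx, num_experts).float()    # (B, N, k, E)
    frac = counts.sum(dim=2).mean(dim=(0, 1))            # fraction of tokens per expert
    return num_experts * (probs * frac).sum()            # minimized by a uniform spread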

Hierarchical Context Aggregation (HCA)

from vimoe.model import HierarchicalContextAggregation

hca = HierarchicalContextAggregation(
    in_channels=1024,
    context_levels=[1, 2, 4],
    text_dim=1024
)

# Aggregates multi-scale visual-textual context
context = hca(visual_features, text_features)
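The context_levels=[1, 2, 4] argument suggests pyramid-style pooling over the feature map. A self-contained sketch of that idea (pyramid_context is a hypothetical helper; the real HCA module also fuses the text features):

import torch
import torch.nn.functional as F

def pyramid_context(feat_map, levels=(1, 2, 4)):
    # feat_map: (B, C, H, W) spatial features
    pooled = [F.adaptive_avg_pool2d(feat_map, s).flatten(2) for s in levels]
    return torch.cat(pooled, dim=-1)   # (B, C, 1 + 4 + 16) multi-scale context tokens

ctx = pyramid_context(torch.randn(1, 1024, 24, 24))   # -> (1, 1024, 21)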

Expert Confidence Calibration (ECC)

from vimoe.model import ExpertConfidenceCalibration

ecc = ExpertConfidenceCalibration(
    in_channels=1024,
    num_experts=7
)

# Calibrates routing weights based on confidence
calibrated_weights, confidence = ecc(base_features, expert_features, routing_weights)
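One plausible reading of "calibrates routing weights based on confidence" is to down-weight experts the model is unsure about and renormalize. A hypothetical sketch (the actual ECC module learns the confidence estimate):

import torch

def calibrate(routing_weights, confidence, eps=1e-6):
    # routing_weights, confidence: (B, N, E); confidence assumed in [0, 1]
    scaled = routing_weights * confidence                       # suppress low-confidence experts
    return scaled / (scaled.sum(dim=-1, keepdim=True) + eps)   # renormalize per token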

🙏 Acknowledgements

This project builds upon the excellent work of:

📄 License

This project is licensed under the Apache 2.0 License - see the LICENSE file for details.
