Three models. One decision. Inspired by the MAGI supercomputer from Neon Genesis Evangelion.
MAGI is not another agent framework. It is a structured disagreement engine: the same question goes to three different LLMs, each with a different perspective. They vote, debate, and critique each other to produce a Decision Dossier with the ruling, confidence, minority report, and full trace.
Three cheap models using MAGI critique mode scored 88% on our benchmark. A single Claude Sonnet 4.6 scored 76%.
Vote alone (72%) did not beat the single strong model. Critique did. The models caught each other's mistakes.
The value is not "more accurate answers." It is better decision quality: seeing where models agree, where they disagree, and why.
There are several EVA-inspired multi-model projects. Here's what makes this one different.
Other projects do voting. Three models answer, pick the majority. That's it.
MAGI does structured disagreement. Models don't just answer in parallel. They read each other's answers, critique the reasoning, and revise their positions across multiple rounds. The system tracks who changed their mind and why.
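The loop at the heart of that critique phase can be sketched roughly like this (a minimal illustration only, not MAGI's actual `protocols/critique.py`; `answer` and `revise` stand in for real LLM calls):

```python
# Minimal sketch of a multi-round critique loop with mind-change tracking.
# `answer` and `revise` are stand-ins for real LLM calls.
def critique_rounds(nodes, question, answer, revise, max_rounds=3):
    positions = {name: answer(name, question) for name in nodes}
    mind_changes = []
    for round_no in range(1, max_rounds + 1):
        new_positions = {}
        for name in nodes:
            # Each node reads its peers' current positions before revising.
            peers = {n: p for n, p in positions.items() if n != name}
            new_positions[name] = revise(name, question, positions[name], peers)
            if new_positions[name] != positions[name]:
                mind_changes.append((round_no, name))
        positions = new_positions
        if len(set(positions.values())) == 1:  # consensus reached
            break
    return positions, mind_changes
```

The key difference from plain voting is that positions are revised *after* reading peer answers, and each revision is recorded.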
| Capability | Voting projects | MAGI |
|---|---|---|
| Multi-model query | Yes | Yes |
| Majority vote | Yes | Yes |
| Multi-round critique (ICE) | No | Yes |
| Mind change tracking | No | Yes |
| Adaptive protocol selection | No | Yes |
| Minority report / dissent analysis | No | Yes |
| Benchmark: ensemble > single model | No | Yes (88% > 76%) |
| Fault tolerance (node failures) | No | Yes |
| NERV hexagonal dashboard | No | Yes |
| CLI toolchain (diff, judge, bench) | No | Yes |
The key finding: vote alone (72%) does not beat a single strong model (76%). Every voting-only project hits this ceiling. MAGI's critique mode breaks through it (88%) by letting models catch each other's mistakes.
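For contrast, the vote-only baseline amounts to little more than a majority count over extracted positions (a simplified sketch, not MAGI's actual `protocols/vote.py`):

```python
from collections import Counter

# Simplified sketch of majority voting over extracted positions.
# In MAGI, positions come from structured extraction of each node's answer.
def majority_vote(positions):
    counts = Counter(positions.values())
    ruling, votes = counts.most_common(1)[0]
    confidence = votes / len(positions)          # share of nodes behind the ruling
    minority = {n: p for n, p in positions.items() if p != ruling}
    return ruling, confidence, minority
```

Nothing here lets a wrong majority be corrected, which is exactly the ceiling critique mode breaks.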
A NeurIPS 2025 paper (Debate or Vote) found that "debate doesn't systematically improve beliefs." But their debate asks models to persuade humans. MAGI's ICE protocol asks models to find errors in each other's reasoning. Different mechanism, different result.
```shell
pip install magi-system
```

Or from source:

```shell
git clone https://github.com/fshiori/magi.git
cd magi
pip install -e ".[dev]"
```

Set your API key (OpenRouter gives you access to all models with one key):

```shell
export OPENROUTER_API_KEY=sk-or-...
```
```shell
# Ask a question — three models debate, one decision emerges
magi ask "Should we use microservices or a monolith?"

# Multi-model code review (the killer use case)
magi diff --staged

# Critique mode: models debate until consensus (slower, higher quality)
magi ask "Is Rust better than Go for backend services?" --mode critique

# Adaptive mode: auto-selects vote/critique/escalate based on disagreement
magi ask "What caused the 2008 financial crisis?" --mode adaptive

# Multi-model answer scoring
magi judge -q "What is quantum entanglement?" -a "It means particles are connected"

# NERV Command Center — real-time dashboard
pip install magi-system[web]
magi dashboard

# Run benchmark, view analytics, replay decisions
magi bench
magi analytics
magi replay <trace-id>

# List persona presets
magi presets
```

```
You ──▶ MAGI Engine ──▶ 3 LLMs in parallel ──▶ Protocol ──▶ Decision Dossier
                               │                   │
                           Melchior            Vote (fast)
                           Balthasar           Critique (debate)
                           Casper              Adaptive (auto)
```
Each Decision Dossier contains:
- Ruling — the final answer
- Confidence — how much the models agreed (0-100%)
- Minority Report — dissenting opinions and why they disagree
- Mind Changes — which models changed position during debate
- Trace — full JSONL history for replay and analytics
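A minimal sketch of such a dossier as a dataclass (field names follow the list above; the real dataclass lives in `magi/core/decision.py` and may differ in detail):

```python
from dataclasses import dataclass, field

# Illustrative sketch of a Decision Dossier, mirroring the fields listed above.
@dataclass
class DecisionDossier:
    ruling: str                       # the final answer
    confidence: float                 # 0.0 - 1.0 agreement
    minority_report: list             # dissenting opinions and reasons
    mind_changes: list                # nodes that changed position during debate
    trace: list = field(default_factory=list)  # JSONL events for replay
```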
| Protocol | When to use | How it works |
|---|---|---|
| `vote` | Fast answers, clear-cut questions | Parallel query, structured position extraction, majority wins |
| `critique` | Complex or controversial questions | Multi-round debate (ICE), models critique each other until consensus |
| `escalate` | Forced decision on high-disagreement topics | Critique with 2-round limit, highest-trust node makes final call |
| `adaptive` | Default for most use cases | Auto-selects based on agreement score: high=vote, medium=critique, low=escalate |
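The adaptive row boils down to a threshold function over the agreement score (the cutoffs here are illustrative assumptions, not MAGI's actual values):

```python
# Sketch of adaptive protocol selection; thresholds are illustrative only.
def select_protocol(agreement: float) -> str:
    if agreement >= 0.8:   # nodes mostly agree -> fast path
        return "vote"
    if agreement >= 0.4:   # moderate disagreement -> debate it out
        return "critique"
    return "escalate"      # deep disagreement -> forced decision
```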
MAGI comes with 5 built-in perspective sets:
```shell
$ magi presets
code-review   Security Analyst / Performance Engineer / Code Quality Reviewer
eva           Melchior / Balthasar / Casper
research      Methodologist / Domain Expert / Devil's Advocate
strategy      Optimist / Pessimist / Pragmatist
writing       Editor / Reader Advocate / Fact Checker
```

```shell
# Use a specific preset
magi ask "Should we expand to the EU market?" --preset strategy

# magi diff always uses the code-review preset automatically
```

```python
import asyncio

from magi import MAGI

engine = MAGI(
    melchior="openrouter/deepseek/deepseek-v3.2",
    balthasar="openrouter/xiaomi/mimo-v2-pro",
    casper="openrouter/minimax/minimax-m2.7",
)

decision = asyncio.run(engine.ask(
    "What are the security implications of this API design?",
    mode="adaptive",
))

print(decision.ruling)           # The final answer
print(decision.confidence)       # 0.0 - 1.0
print(decision.minority_report)  # Dissenting views
print(decision.mind_changes)     # Who changed their mind
print(decision.protocol_used)    # Which protocol was selected
```

MAGI uses LiteLLM under the hood, so it supports 100+ LLM providers.
```shell
# Direct providers
export OPENAI_API_KEY=sk-...
export ANTHROPIC_API_KEY=sk-ant-...
export GOOGLE_API_KEY=AI...

# Or use OpenRouter for all models with one key
export OPENROUTER_API_KEY=sk-or-...
```

```shell
magi ask "your question" \
  --melchior openrouter/anthropic/claude-sonnet-4.6 \
  --balthasar openrouter/openai/gpt-4o \
  --casper openrouter/google/gemini-2.5-pro
```

Tested on 25 MMLU-style questions across 5 categories:
| Group | Accuracy | Time | Errors |
|---|---|---|---|
| Claude Sonnet 4.6 (single) | 76% | 128s | 1 |
| 3x Cheap Models (vote) | 72% | 1745s | 0 |
| 3x Cheap Models (critique) | 88% | 1577s | 0 |
Models used: Xiaomi MiMo-v2-pro, MiniMax M2.7, DeepSeek V3.2
Key finding: Vote alone doesn't beat a strong single model. Critique mode does, by letting models catch each other's mistakes through structured debate.
MAGI keeps working when models fail:
- 1 of 3 fails — continues with 2 nodes, marks decision as degraded
- 2 of 3 fail — falls back to a single-model response
- All 3 fail — raises `MagiUnavailableError` (never guesses)
- Timeouts — 30s default per node, exponential backoff on rate limits
- Reasoning models — automatically extracts from `reasoning_content` (e.g., MiniMax M2.7)
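That degradation ladder can be sketched as follows (illustrative only; `MagiUnavailableError` is the real exception name from the list above, everything else is an assumption):

```python
# Sketch of the fault-tolerance ladder described above (illustrative).
class MagiUnavailableError(RuntimeError):
    """Raised when no node returned an answer."""

def degrade(responses):
    # responses: node name -> answer, or None if the node failed / timed out
    alive = {n: r for n, r in responses.items() if r is not None}
    if len(alive) == 3:
        return alive, "full"
    if len(alive) == 2:
        return alive, "degraded"      # continue, mark the decision degraded
    if len(alive) == 1:
        return alive, "single-model"  # fall back to one node's answer
    raise MagiUnavailableError("all nodes failed; refusing to guess")
```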
```
magi/
├── core/
│   ├── engine.py      # MAGI engine, coordinates nodes
│   ├── node.py        # LLM node wrapper with persona
│   └── decision.py    # Decision dossier dataclass
├── protocols/
│   ├── vote.py        # Structured voting with position extraction
│   ├── critique.py    # ICE (Iterative Consensus Ensemble)
│   └── adaptive.py    # Dynamic protocol selection
├── commands/
│   ├── diff.py        # Multi-model code review
│   ├── judge.py       # Multi-model answer scoring
│   └── analytics.py   # Trace analysis and replay
├── web/
│   ├── server.py      # FastAPI + WebSocket server
│   └── static/        # NERV Command Center UI
├── presets/           # Persona preset definitions
├── bench/             # Benchmark runner and datasets
├── trace/             # JSONL trace logging
└── cli.py             # Click CLI entry point
```
```shell
git clone https://github.com/fshiori/magi.git
cd magi
uv venv && uv pip install -e ".[dev]"
python -m pytest tests/ -v
```

83 tests covering all protocols, degradation modes, and edge cases.
Published on PyPI.
Real-time dashboard showing the three MAGI nodes thinking, debating, and reaching a decision. EVA-accurate hexagonal layout with vote status lamps (承認/否決/膠着).
```shell
pip install magi-system[web]
magi dashboard
# Open http://localhost:3000
```

Features:
- Live WebSocket streaming of node responses
- Critique round tracking with agreement score
- EVA-style verdict display: 承認 (approve), 否決 (reject), 膠着 (deadlock)
- Click any hexagon to see the full response
- Auto-popup when nodes complete
- Markdown rendering for LLM output
- MAGI-as-API-Gateway — OpenAI-compatible proxy, any app just changes `base_url`
- LLM-as-judge agreement scoring (replace word-overlap heuristic)
- Scorecard weighted voting (after sufficient data collection)
- Streaming token output in NERV UI
In Evangelion, MAGI is a trio of supercomputers created by Dr. Naoko Akagi. Each embodies a different aspect of her personality: Melchior (the scientist), Balthasar (the mother), and Casper (the woman). Decisions are made by majority vote among the three.
MAGI applies this concept to LLMs: same question, three different perspectives, structured disagreement produces better decisions than any single model alone.
MIT
