MAGI — Disagreement OS for LLMs

Three models. One decision. Inspired by the MAGI supercomputer from Neon Genesis Evangelion.

MAGI is not another agent framework. It is a structured disagreement engine: the same question goes to three different LLMs, each with a different perspective. They vote, debate, and critique each other to produce a Decision Dossier with the ruling, confidence, minority report, and full trace.

[Screenshot: NERV Command Center dashboard]

Why?

Three cheap models using MAGI critique mode scored 88% on our benchmark. A single Claude Sonnet 4.6 scored 76%.

Vote alone (72%) did not beat the single strong model. Critique did. The models caught each other's mistakes.

The value is not "more accurate answers." It is better decision quality: seeing where models agree, where they disagree, and why.

How MAGI Differs

There are several EVA-inspired multi-model projects. Here's what makes this one different.

Other projects do voting. Three models answer, pick the majority. That's it.

MAGI does structured disagreement. Models don't just answer in parallel. They read each other's answers, critique the reasoning, and revise their positions across multiple rounds. The system tracks who changed their mind and why.
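The multi-round loop described above can be sketched roughly as follows. This is a minimal synchronous illustration, not MAGI's actual implementation: `critique_fn`, the early-stopping rule, and all names here are assumptions.

```python
def ice(initial: dict[str, str], critique_fn, max_rounds: int = 3):
    """Sketch of an iterative critique loop: each round, every node sees its
    peers' current answers and may revise; stop once nobody changes position."""
    answers = dict(initial)
    mind_changes = []  # (round, node) pairs recording who revised and when
    for round_no in range(1, max_rounds + 1):
        changed = False
        for node in answers:
            peers = {n: a for n, a in answers.items() if n != node}
            # In the real system this is an LLM call with the peers' answers
            # injected into the prompt; here it is any callable.
            revised = critique_fn(node, answers[node], peers)
            if revised != answers[node]:
                answers[node] = revised
                mind_changes.append((round_no, node))
                changed = True
        if not changed:
            break
    return answers, mind_changes
```

The returned `mind_changes` list is what makes "who changed their mind and why" trackable after the fact.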

  Capability                            Voting projects   MAGI
  Multi-model query                     Yes               Yes
  Majority vote                         Yes               Yes
  Multi-round critique (ICE)            No                Yes
  Mind change tracking                  No                Yes
  Adaptive protocol selection           No                Yes
  Minority report / dissent analysis    No                Yes
  Benchmark: ensemble > single model    No                Yes (88% > 76%)
  Fault tolerance (node failures)       No                Yes
  NERV hexagonal dashboard              No                Yes
  CLI toolchain (diff, judge, bench)    No                Yes

The key finding: vote alone (72%) does not beat a single strong model (76%). Every voting-only project hits this ceiling. MAGI's critique mode breaks through it (88%) by letting models catch each other's mistakes.

A NeurIPS 2025 paper (Debate or Vote) found that "debate doesn't systematically improve beliefs." But their debate asks models to persuade humans. MAGI's ICE protocol asks models to find errors in each other's reasoning. Different mechanism, different result.

Install

pip install magi-system

Or from source:

git clone https://github.com/fshiori/magi.git
cd magi
pip install -e ".[dev]"

Quick Start

# Set your API key (OpenRouter gives you access to all models with one key)
export OPENROUTER_API_KEY=sk-or-...

# Ask a question — three models debate, one decision emerges
magi ask "Should we use microservices or a monolith?"

# Multi-model code review (the killer use case)
magi diff --staged

# Critique mode: models debate until consensus (slower, higher quality)
magi ask "Is Rust better than Go for backend services?" --mode critique

# Adaptive mode: auto-selects vote/critique/escalate based on disagreement
magi ask "What caused the 2008 financial crisis?" --mode adaptive

# Multi-model answer scoring
magi judge -q "What is quantum entanglement?" -a "It means particles are connected"

# NERV Command Center — real-time dashboard
pip install magi-system[web]
magi dashboard

# Run benchmark, view analytics, replay decisions
magi bench
magi analytics
magi replay <trace-id>

# List persona presets
magi presets

How It Works

  You ──▶ MAGI Engine ──▶ 3 LLMs in parallel ──▶ Protocol ──▶ Decision Dossier
                              │                       │
                         Melchior               Vote (fast)
                         Balthasar          Critique (debate)
                         Casper            Adaptive (auto)

Each Decision Dossier contains:

  • Ruling — the final answer
  • Confidence — how much the models agreed (0-100%)
  • Minority Report — dissenting opinions and why they disagree
  • Mind Changes — which models changed position during debate
  • Trace — full JSONL history for replay and analytics
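The dossier fields above map naturally onto a small dataclass. The sketch below is illustrative only; the actual class name, field names, and types in MAGI are assumptions here.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class DecisionDossier:
    """Illustrative shape of a Decision Dossier (names are assumptions)."""
    ruling: str                                      # the final answer
    confidence: float                                # agreement score, 0.0-1.0
    minority_report: Optional[str] = None            # dissent, if any node held out
    mind_changes: list = field(default_factory=list) # who revised, in which round
    trace: list = field(default_factory=list)        # JSONL-style event dicts

dossier = DecisionDossier(ruling="Start with a monolith", confidence=0.87)
```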

Protocols

  Protocol   When to use                                   How it works
  vote       Fast answers, clear-cut questions             Parallel query, structured position extraction, majority wins
  critique   Complex or controversial questions            Multi-round debate (ICE); models critique each other until consensus
  escalate   Forced decision on high-disagreement topics   Critique with a 2-round limit; highest-trust node makes the final call
  adaptive   Default for most use cases                    Auto-selects by agreement score: high = vote, medium = critique, low = escalate
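The adaptive row amounts to thresholding an agreement score. A sketch of that dispatch, where the 0.8 and 0.4 cutoffs are invented for illustration and are not MAGI's actual values:

```python
def select_protocol(agreement: float) -> str:
    """Map a 0-1 agreement score to a protocol: high agreement needs no
    debate, medium benefits from critique, low forces an escalated ruling.
    Thresholds here are illustrative assumptions."""
    if agreement >= 0.8:
        return "vote"
    if agreement >= 0.4:
        return "critique"
    return "escalate"
```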

Persona Presets

MAGI comes with 5 built-in perspective sets:

$ magi presets

  code-review     Security Analyst / Performance Engineer / Code Quality Reviewer
  eva             Melchior / Balthasar / Casper
  research        Methodologist / Domain Expert / Devil's Advocate
  strategy        Optimist / Pessimist / Pragmatist
  writing         Editor / Reader Advocate / Fact Checker

# Use a specific preset
magi ask "Should we expand to the EU market?" --preset strategy

# magi diff always uses code-review preset automatically
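A preset is essentially three system prompts, one per node. A hypothetical definition in the shape of the `strategy` preset above; the dict layout and prompt wording are assumptions, not the shipped definitions:

```python
# Hypothetical preset definition: one persona prompt per MAGI node.
STRATEGY_PRESET = {
    "melchior": "You are an optimist: argue the upside and the opportunities.",
    "balthasar": "You are a pessimist: surface risks and failure modes.",
    "casper": "You are a pragmatist: weigh both sides against real constraints.",
}
```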

Python API

import asyncio
from magi import MAGI

engine = MAGI(
    melchior="openrouter/deepseek/deepseek-v3.2",
    balthasar="openrouter/xiaomi/mimo-v2-pro",
    casper="openrouter/minimax/minimax-m2.7",
)

decision = asyncio.run(engine.ask(
    "What are the security implications of this API design?",
    mode="adaptive",
))

print(decision.ruling)          # The final answer
print(decision.confidence)      # 0.0 - 1.0
print(decision.minority_report) # Dissenting views
print(decision.mind_changes)    # Who changed their mind
print(decision.protocol_used)   # Which protocol was selected

Configuration

MAGI uses LiteLLM under the hood, so it supports 100+ LLM providers.

API Keys

# Direct providers
export OPENAI_API_KEY=sk-...
export ANTHROPIC_API_KEY=sk-ant-...
export GOOGLE_API_KEY=AI...

# Or use OpenRouter for all models with one key
export OPENROUTER_API_KEY=sk-or-...

Using OpenRouter

magi ask "your question" \
  --melchior openrouter/anthropic/claude-sonnet-4.6 \
  --balthasar openrouter/openai/gpt-4o \
  --casper openrouter/google/gemini-2.5-pro

Benchmark Results

Tested on 25 MMLU-style questions across 5 categories:

  Group                        Accuracy   Time     Errors
  Claude Sonnet 4.6 (single)   76%        128s     1
  3x Cheap Models (vote)       72%        1745s    0
  3x Cheap Models (critique)   88%        1577s    0

Models used: Xiaomi MiMo-v2-pro, MiniMax M2.7, DeepSeek V3.2

Key finding: Vote alone doesn't beat a strong single model. Critique mode does, by letting models catch each other's mistakes through structured debate.

Fault Tolerance

MAGI keeps working when models fail:

  • 1 of 3 fails — continues with 2 nodes, marks decision as degraded
  • 2 of 3 fail — falls back to single model response
  • All 3 fail — raises MagiUnavailableError (never guesses)
  • Timeouts — 30s default per node, exponential backoff on rate limits
  • Reasoning models — automatically extracts from reasoning_content (e.g., MiniMax M2.7)
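The degradation ladder above can be expressed as a small function. In this sketch, `MagiUnavailableError` is the only name taken from the list; everything else is illustrative:

```python
class MagiUnavailableError(RuntimeError):
    """No node produced an answer; MAGI refuses to guess."""

def degrade(answers: dict) -> tuple[list, str]:
    """Given per-node answers (None means that node failed), return the
    surviving answers plus a status label describing how degraded the
    resulting decision is. Illustrative, not MAGI's actual code."""
    alive = [a for a in answers.values() if a is not None]
    if not alive:
        raise MagiUnavailableError("all three MAGI nodes failed")
    status = {3: "full", 2: "degraded", 1: "single-model-fallback"}[len(alive)]
    return alive, status
```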

Project Structure

magi/
├── core/
│   ├── engine.py       # MAGI engine, coordinates nodes
│   ├── node.py         # LLM node wrapper with persona
│   └── decision.py     # Decision dossier dataclass
├── protocols/
│   ├── vote.py         # Structured voting with position extraction
│   ├── critique.py     # ICE (Iterative Consensus Ensemble)
│   └── adaptive.py     # Dynamic protocol selection
├── commands/
│   ├── diff.py         # Multi-model code review
│   ├── judge.py        # Multi-model answer scoring
│   └── analytics.py    # Trace analysis and replay
├── web/
│   ├── server.py       # FastAPI + WebSocket server
│   └── static/         # NERV Command Center UI
├── presets/             # Persona preset definitions
├── bench/              # Benchmark runner and datasets
├── trace/              # JSONL trace logging
└── cli.py              # Click CLI entry point

Development

git clone https://github.com/fshiori/magi.git
cd magi
uv venv && uv pip install -e ".[dev]"
python -m pytest tests/ -v

83 tests covering all protocols, degradation modes, and edge cases.

Published on PyPI.

NERV Command Center

Real-time dashboard showing the three MAGI nodes thinking, debating, and reaching a decision. EVA-accurate hexagonal layout with vote status lamps (承認/否決/膠着).

pip install magi-system[web]
magi dashboard
# Open http://localhost:3000

Features:

  • Live WebSocket streaming of node responses
  • Critique round tracking with agreement score
  • EVA-style verdict display: 承認 (approve), 否決 (reject), 膠着 (deadlock)
  • Click any hexagon to see the full response
  • Auto-popup when nodes complete
  • Markdown rendering for LLM output

Roadmap

  • MAGI-as-API-Gateway — OpenAI-compatible proxy, any app just changes base_url
  • LLM-as-judge agreement scoring (replace word-overlap heuristic)
  • Scorecard weighted voting (after sufficient data collection)
  • Streaming token output in NERV UI
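The word-overlap heuristic the roadmap wants to replace is presumably something like Jaccard similarity over token sets; the exact formula MAGI uses is an assumption here:

```python
def agreement(a: str, b: str) -> float:
    """Jaccard word overlap between two answers: |A ∩ B| / |A ∪ B|.
    Crude by design: 'yes' and 'absolutely yes' score low despite agreeing,
    which is exactly why the roadmap proposes an LLM-as-judge replacement."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if (wa or wb) else 1.0
```

Surface overlap is a poor proxy for semantic agreement, which motivates the LLM-as-judge roadmap item.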

Name

In Evangelion, MAGI is a trio of supercomputers created by Dr. Naoko Akagi. Each embodies a different aspect of her personality: Melchior (the scientist), Balthasar (the mother), and Casper (the woman). Decisions are made by majority vote among the three.

MAGI applies this concept to LLMs: same question, three different perspectives, structured disagreement produces better decisions than any single model alone.

License

MIT
