# MI9

MI9 is a modular framework for synthetic scenario generation, runtime governance, and LLM-judged evaluation of agentic AI systems.

Paper: MI9 - Agent Intelligence Protocol: Runtime Governance for Agentic AI Systems (arXiv:2508.03858)
- Full pipeline: Scenario → Governance → Evaluation (scripts + prompts)
- Structured outputs: per-run `evaluation.json` and corpus-level `evaluation_summary.json`
- LLM-as-judge with configurable rubrics and strict JSON contracts
- Concurrent, fast batch processing (multithreaded / asyncio where appropriate)
- Reproducible via seeds, prompt freezes, and model version pins
- Paper-ready tables/figures derivable directly from `evaluation_summary.json`
## Table of Contents

- Overview
- Requirements
- Installation
- Quick Start
- Pipeline
- Usage
- Metrics
- Outputs & Schemas
- Reproducibility
- Project Structure
- Configuration
- Testing & CI
- Roadmap
- Citation
- License
- Acknowledgments
## Overview

MI9 operationalizes agentic AI governance as an end-to-end, repeatable workflow:
- Generate Scenarios: Create diverse, realistic agent tasks across domains.
- Generate Governance: Apply a governance model to produce traces & multi-system logs.
- Evaluate Governance: Use an LLM judge to score compliance, risk discovery, and trace quality.
📖 Read the paper: https://arxiv.org/abs/2508.03858
## Requirements

- OS: macOS / Linux / Windows
- Python: 3.10+
- API: Google Generative AI (Gemini) key in `GOOGLE_API_KEY`
- Pip: `pip>=22`
## Installation

```bash
git clone https://github.com/ORG/REPO.git
cd REPO
# (Optional) create a virtual environment
python -m venv .venv && source .venv/bin/activate # Windows: .venv\Scripts\activate
# Install dependencies
pip install -r requirements.txt
# or minimal:
pip install google-generativeai
```

Set your key:

```bash
export GOOGLE_API_KEY='your-api-key-here' # macOS/Linux
# PowerShell (Windows):
# $env:GOOGLE_API_KEY = "your-api-key-here"
```

## Quick Start

```bash
# 1) Generate scenarios
python src/generate_scenario.py \
--output-dir data/ \
--count 5 \
--classes 'Autonomous Vehicle Navigation' 'Medical Diagnosis Assistant' \
--num-workers 8
# 2) Generate governance traces
python src/generate_governance.py data/ --concurrency 10
# 3) Evaluate governance (LLM-as-judge)
python src/evaluate_governance.py \
--input-dir data/ \
--num-workers 8 \
--model gemini-2.5-flash-latest \
--evaluation-prompt prompts/evaluation.txt
```

Outputs:

- Per-run: `data/<RUN_ID>/evaluation.json`
- Summary: `data/evaluation_summary.json`
## Pipeline

```mermaid
flowchart LR
A[Scenario Prompt] -->|generate_scenario.py| B[scenario.json]
B -->|generate_governance.py| C[governance.json]
B --> D[evaluation.txt]
C --> E[evaluate_governance.py]
D --> E
E --> F[evaluation.json per run]
E --> G[evaluation_summary.json corpus]
```
## Usage

### 1) Generate scenarios

```bash
python src/generate_scenario.py \
--output-dir data/ \
--count 5 \
--classes 'Autonomous Vehicle Navigation' 'Medical Diagnosis Assistant' \
--num-workers 8 \
--model gemini-2.5-flash-latest \
--api-key $GOOGLE_API_KEY \
--scenario-prompt prompts/scenario_prompt.txt
```

### 2) Generate governance traces

```bash
python src/generate_governance.py data/ \
--concurrency 10 \
--model gemini-2.5-flash-latest \
--api-key $GOOGLE_API_KEY \
--governance-prompt prompts/governance_prompt.txt \
--overwrite  # optional
```

### 3) Evaluate governance (LLM-as-judge)

```bash
python src/evaluate_governance.py \
--input-dir data/ \
--num-workers 8 \
--model gemini-2.5-flash-latest \
--api-key $GOOGLE_API_KEY \
--evaluation-prompt prompts/evaluation.txt \
--overwrite  # optional
```

## Metrics

Default rubric (customizable via `prompts/evaluation.txt`):
- `compliance_score` (0–1)
- `violations[]` with grounded evidence
- `risk_identification_recall` (0–1)
- `mitigation_quality` (0–1)
- `emergent_risk_tags[]` (safety, security, privacy, legal, reputational, …)
- `coherence` (0–1)
- `grounding` (0–1)
- `action_validity` (0–1)
- `latency_ms`, `token_cost_estimate`
- `judge_notes`
If your scenarios include gold labels, plug in a post-processor to compute precision / recall / F1 for risk/violation detection.
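A minimal post-processor sketch, assuming a hypothetical per-run `gold.json` holding gold-label policy tags (that file and the tag-level matching are illustrative, not part of the shipped schema):

```python
import json
from pathlib import Path

def prf1(data_dir: str) -> dict:
    """Micro-averaged precision/recall/F1 for violation detection."""
    tp = fp = fn = 0
    for run in Path(data_dir).iterdir():
        eval_path, gold_path = run / "evaluation.json", run / "gold.json"
        if not (eval_path.is_file() and gold_path.is_file()):
            continue  # skip runs without gold labels (and non-run files)
        predicted = {v["policy"] for v in json.loads(eval_path.read_text())["violations"]}
        gold = set(json.loads(gold_path.read_text()))  # e.g., ["Privacy", "Safety"]
        tp += len(predicted & gold)
        fp += len(predicted - gold)
        fn += len(gold - predicted)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

print(prf1("data/"))
```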
## Outputs & Schemas

Per-run `evaluation.json`:

```json
{
"run_id": "1",
"model": "gemini-2.5-flash-latest",
"metrics": {
"compliance_score": 0.88,
"risk_identification_recall": 0.67,
"mitigation_quality": 0.74,
"coherence": 0.91,
"grounding": 0.85,
"action_validity": 0.90
},
"violations": [
{"policy": "Privacy", "severity": "medium", "evidence": "Excerpt or span reference"}
],
"emergent_risk_tags": ["safety", "privacy"],
"operational": {"latency_ms": 1240, "token_cost_estimate": 0.0031},
"judge_notes": "Concise rationale with references to scenario/log lines."
}
```

Corpus-level `evaluation_summary.json`:

```json
{
"total_runs": 50,
"by_class": {
"Autonomous Vehicle Navigation": 20,
"Medical Diagnosis Assistant": 30
},
"macro_avgs": {
"compliance_score": 0.81,
"risk_identification_recall": 0.62,
"mitigation_quality": 0.70,
"coherence": 0.88,
"grounding": 0.84,
"action_validity": 0.87
},
"emergent_risk_counts": {
"safety": 19,
"security": 11,
"privacy": 14,
"legal": 9,
"reputational": 8
}
}
```

## Reproducibility

- Seeds: Deterministic random seeds for any sampling (see the manifest sketch after this list).
- Model Pins: Always specify `--model` and prefer explicit version tags.
- Prompt Freezes: Commit `prompts/*.txt` with versioned filenames for each study.
- Concurrency: `--num-workers` affects throughput, not scores.
- Audit Trails: Use `--verbose` to log prompts, responses, and parse events.
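One lightweight way to follow the pin/freeze guidance is to write a per-study manifest; this is a sketch under assumptions (the manifest path and field names are not produced by the scripts):

```python
import hashlib
import json
import random
from pathlib import Path

SEED = 1234                         # fixed seed for any sampling you add
MODEL = "gemini-2.5-flash-latest"   # pin explicitly; prefer versioned tags

random.seed(SEED)

# Hash the frozen prompts so the exact rubric text is auditable later.
prompt_hashes = {
    p.name: hashlib.sha256(p.read_bytes()).hexdigest()
    for p in sorted(Path("prompts").glob("*.txt"))
}

Path("data").mkdir(exist_ok=True)
Path("data/run_manifest.json").write_text(json.dumps(
    {"seed": SEED, "model": MODEL, "prompt_sha256": prompt_hashes},
    indent=2,
))
```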
## Project Structure

```text
.
├── data/
│   ├── 1/
│   │   ├── scenario.json
│   │   ├── governance.json
│   │   └── evaluation.json
│   └── evaluation_summary.json
├── prompts/
│   ├── scenario_prompt.txt
│   ├── governance_prompt.txt
│   └── evaluation.txt
├── src/
│   ├── generate_scenario.py
│   ├── generate_governance.py
│   └── evaluate_governance.py
├── requirements.txt
├── LICENSE
└── README.md
```
## Configuration

All core behavior is controlled via CLI flags and prompt templates:

- Prompts: Edit `prompts/evaluation.txt` to change rubric dimensions and strict JSON output keys.
- Hooks: Add post-processing in `src/evaluate_governance.py` to compute extra metrics (e.g., bootstrap CIs, PR curves); see the sketch after this list.
- Filtering: Point `--input-dir` to specific runs or pass multiple run paths.
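As a hook example, a percentile-bootstrap CI over per-run compliance scores might look like this (the function and its placement are illustrative, not existing code):

```python
import json
import random
from pathlib import Path

def bootstrap_ci(values, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean."""
    rng = random.Random(seed)
    means = sorted(
        sum(rng.choices(values, k=len(values))) / len(values)
        for _ in range(n_resamples)
    )
    lo = means[int(alpha / 2 * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

scores = [
    json.loads(p.read_text())["metrics"]["compliance_score"]
    for p in Path("data").glob("*/evaluation.json")
]
if scores:
    print("compliance_score 95% CI:", bootstrap_ci(scores, seed=1234))
```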
## Testing & CI

- Unit tests (suggested): prompt parsing, JSON validation, and aggregation logic.
- Smoke tests: single-sample end-to-end run.
- CI: add a workflow to lint (`ruff`/`flake8`), test (`pytest`), and validate JSON schemas; a minimal schema test is sketched after this list.
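A minimal `pytest` sketch for the JSON-validation piece (key names mirror the per-run schema above; the test file path is an assumption):

```python
# tests/test_evaluation_schema.py
import json
from pathlib import Path

REQUIRED_METRICS = {
    "compliance_score", "risk_identification_recall", "mitigation_quality",
    "coherence", "grounding", "action_validity",
}

def test_evaluation_json_contract():
    paths = list(Path("data").glob("*/evaluation.json"))
    assert paths, "run the pipeline once before testing"
    for path in paths:
        doc = json.loads(path.read_text())
        # Every rubric dimension must be present and lie in [0, 1].
        assert REQUIRED_METRICS.issubset(doc["metrics"])
        for key in REQUIRED_METRICS:
            assert 0.0 <= doc["metrics"][key] <= 1.0
```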
## Roadmap

- Gold-label alignment for select domains (auto PR curves)
- Additional judge ensembles & adjudication strategies
- Pluggable risk taxonomies (sector-specific)
- Cost & latency dashboards
- Exporters: CSV/LaTeX for paper appendices
## Citation

If you use MI9 or the evaluation suite, please cite:

```bibtex
@article{mi9_2025,
author = {Wang, Charles L. and Singhal, Trisha and Kelkar, Ameya and Tuo, Jason},
title = {MI9 -- Agent Intelligence Protocol: Runtime Governance for Agentic AI Systems},
journal = {arXiv:2508.03858 [cs.AI]},
year = {2025},
doi = {10.48550/arXiv.2508.03858},
url = {https://arxiv.org/abs/2508.03858}
}
```

## License

Released under MIT (see `LICENSE`).
Update the badge at the top if you choose Apache-2.0 or another license.
## Acknowledgments

Thanks to the contributors and reviewers who shaped the MI9 evaluation rubric and reference implementation.
Contributions are welcome: please open an Issue or start a Discussion to propose metrics, gold labels, or domain packs.
💡 Tip: To produce camera-ready tables, consume `data/evaluation_summary.json` directly in your plotting/table scripts, and pin the prompt version and model tag in your appendix; a small exporter sketch follows.
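For instance, a small exporter can turn the macro averages into LaTeX table rows (the script and column choices are illustrative):

```python
import json
from pathlib import Path

summary = json.loads(Path("data/evaluation_summary.json").read_text())

# One LaTeX table row per rubric dimension, rounded for the appendix.
rows = [
    f"{name.replace('_', ' ')} & {value:.2f} \\\\"
    for name, value in summary["macro_avgs"].items()
]
print("\n".join(rows))
```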