
πŸ›°οΈ MI9 -- Agent Intelligence Protocol: Runtime Governance for Agentic AI Systems

arXiv Python 3.10+ License: MIT CI Reproducibility Issues Discussions

MI9 is a modular framework for synthetic scenario generation, runtime governance, and LLM-judged evaluation of agentic AI systems.

Paper: MI9 – Agent Intelligence Protocol: Runtime Governance for Agentic AI Systems (arXiv:2508.03858)


✨ Highlights

  • Full pipeline: Scenario ➜ Governance ➜ Evaluation (scripts + prompts)
  • Structured outputs: Per-run evaluation.json and corpus-level evaluation_summary.json
  • LLM-as-judge with configurable rubrics and strict JSON contracts
  • Concurrent, fast batch processing (multithreaded / asyncio where appropriate)
  • Reproducible via seeds, prompt freezes, and model version pins
  • Paper-ready tables/figures derivable directly from evaluation_summary.json

πŸ—ΊοΈ Table of Contents


🔎 Overview

MI9 operationalizes agentic AI governance as an end-to-end, repeatable workflow:

  1. Generate Scenarios – Create diverse, realistic agent tasks across domains.
  2. Generate Governance – Apply a governance model to produce traces & multi-system logs.
  3. Evaluate Governance – Use an LLM judge to score compliance, risk discovery, and trace quality.

📄 Read the paper: arXiv:2508.03858


💻 Requirements

  • OS: macOS / Linux / Windows
  • Python: 3.10+
  • API: Google Generative AI (Gemini) key in GOOGLE_API_KEY
  • Pip: pip>=22

βš™οΈ Installation

git clone https://github.com/charleslwang/MI9-Eval.git
cd MI9-Eval

# (Optional) create a virtual environment
python -m venv .venv && source .venv/bin/activate  # Windows: .venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt
# or minimal:
pip install google-generativeai

Set your key:

export GOOGLE_API_KEY='your-api-key-here'     # macOS/Linux
# PowerShell (Windows):
# $env:GOOGLE_API_KEY = "your-api-key-here"
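
Optional sanity check: a minimal sketch using the google-generativeai client to confirm the key is picked up before launching a batch. The model tag mirrors the one used elsewhere in this README; substitute any tag available to your key:

import os
import google.generativeai as genai

# Read the key from the environment, matching the export above.
genai.configure(api_key=os.environ["GOOGLE_API_KEY"])

# One tiny request; getting a response back means the key works.
model = genai.GenerativeModel("gemini-2.5-flash-latest")
print(model.generate_content("ping").text)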

🚀 Quick Start

# 1) Generate scenarios
python src/generate_scenario.py \
  --output-dir data/ \
  --count 5 \
  --classes 'Autonomous Vehicle Navigation' 'Medical Diagnosis Assistant' \
  --num-workers 8

# 2) Generate governance traces
python src/generate_governance.py data/ --concurrency 10

# 3) Evaluate governance (LLM-as-judge)
python src/evaluate_governance.py \
  --input-dir data/ \
  --num-workers 8 \
  --model gemini-2.5-flash-latest \
  --evaluation-prompt prompts/evaluation.txt

Outputs

  • Per-run: data/<RUN_ID>/evaluation.json
  • Summary: data/evaluation_summary.json
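
A quick way to inspect the summary from Python; a minimal sketch, assuming the schema shown under Outputs & Schemas below:

import json
from pathlib import Path

# Corpus-level summary written by evaluate_governance.py.
summary = json.loads(Path("data/evaluation_summary.json").read_text())
print(f"total runs: {summary['total_runs']}")
for metric, value in summary["macro_avgs"].items():
    print(f"  {metric}: {value:.2f}")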

🔧 Pipeline

flowchart LR
  A[Scenario Prompt] -->|generate_scenario.py| B[scenario.json]
  B -->|generate_governance.py| C[governance.json]
  B --> D[evaluation.txt]
  C --> E[evaluate_governance.py]
  D --> E
  E --> F[evaluation.json per run]
  E --> G[evaluation_summary.json corpus]

🧰 Usage

1) Scenario Generation

python src/generate_scenario.py \
  --output-dir data/ \
  --count 5 \
  --classes 'Autonomous Vehicle Navigation' 'Medical Diagnosis Assistant' \
  --num-workers 8 \
  --model gemini-2.5-flash-latest \
  --api-key $GOOGLE_API_KEY \
  --scenario-prompt prompts/scenario_prompt.txt

2) Governance Generation

python src/generate_governance.py data/ \
  --concurrency 10 \
  --model gemini-2.5-flash-latest \
  --api-key $GOOGLE_API_KEY \
  --governance-prompt prompts/governance_prompt.txt \
  --overwrite   # optional

3) Governance Evaluation

python src/evaluate_governance.py \
  --input-dir data/ \
  --num-workers 8 \
  --model gemini-2.5-flash-latest \
  --api-key $GOOGLE_API_KEY \
  --evaluation-prompt prompts/evaluation.txt \
  --overwrite   # optional

πŸ“ Metrics

Default rubric (customizable via prompts/evaluation.txt):

Policy Compliance

  • compliance_score (0–1)
  • violations[] with grounded evidence

Risk Discovery & Mitigation

  • risk_identification_recall (0–1)
  • mitigation_quality (0–1)
  • emergent_risk_tags[] (safety, security, privacy, legal, reputational, …)

Trace Quality

  • coherence (0–1)
  • grounding (0–1)
  • action_validity (0–1)

Operational (if available)

  • latency_ms, token_cost_estimate
  • judge_notes

If your scenarios include gold labels, plug in a post-processor to compute precision / recall / F1 for risk/violation detection.
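
A hedged sketch of such a post-processor follows; the gold_violations field on scenario.json is a hypothetical label format, not part of the shipped schema:

import json
from pathlib import Path

def prf1(run_dir: str) -> tuple[float, float, float]:
    run = Path(run_dir)
    # Hypothetical gold labels: a list of policy names expected to be flagged.
    gold = set(json.loads((run / "scenario.json").read_text()).get("gold_violations", []))
    # Judge-flagged policies from the per-run evaluation output.
    pred = {v["policy"] for v in json.loads((run / "evaluation.json").read_text())["violations"]}
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

print(prf1("data/1"))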


📦 Outputs & Schemas

Per-run – evaluation.json

{
  "run_id": "1",
  "model": "gemini-2.5-flash-latest",
  "metrics": {
    "compliance_score": 0.88,
    "risk_identification_recall": 0.67,
    "mitigation_quality": 0.74,
    "coherence": 0.91,
    "grounding": 0.85,
    "action_validity": 0.90
  },
  "violations": [
    {"policy": "Privacy", "severity": "medium", "evidence": "Excerpt or span reference"}
  ],
  "emergent_risk_tags": ["safety", "privacy"],
  "operational": {"latency_ms": 1240, "token_cost_estimate": 0.0031},
  "judge_notes": "Concise rationale with references to scenario/log lines."
}

Corpus – evaluation_summary.json

{
  "total_runs": 50,
  "by_class": {
    "Autonomous Vehicle Navigation": 20,
    "Medical Diagnosis Assistant": 30
  },
  "macro_avgs": {
    "compliance_score": 0.81,
    "risk_identification_recall": 0.62,
    "mitigation_quality": 0.70,
    "coherence": 0.88,
    "grounding": 0.84,
    "action_validity": 0.87
  },
  "emergent_risk_counts": {
    "safety": 19,
    "security": 11,
    "privacy": 14,
    "legal": 9,
    "reputational": 8
  }
}
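
For sanity-checking, the macro averages can be re-derived from the per-run files; a minimal sketch, assuming the directory layout shown under Project Structure:

import json
from collections import defaultdict
from pathlib import Path

def summarize(data_dir: str = "data") -> dict:
    # Macro-average every numeric metric across data/<RUN_ID>/evaluation.json.
    totals = defaultdict(float)
    counts = defaultdict(int)
    runs = 0
    for path in Path(data_dir).glob("*/evaluation.json"):
        runs += 1
        for name, value in json.loads(path.read_text())["metrics"].items():
            totals[name] += value
            counts[name] += 1
    return {
        "total_runs": runs,
        "macro_avgs": {m: totals[m] / counts[m] for m in totals},
    }

print(json.dumps(summarize(), indent=2))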

πŸ” Reproducibility

  • Seeds: Deterministic random seeds for any sampling (see the manifest sketch after this list).
  • Model Pins: Always specify --model and prefer explicit version tags.
  • Prompt Freezes: Commit prompts/*.txt with versioned filenames for each study.
  • Concurrency: --num-workers affects throughput, not scores.
  • Audit Trails: Use --verbose to log prompts, responses, and parse events.
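
One lightweight way to freeze these choices in a single place; a minimal sketch, assuming a data/run_manifest.json sidecar fits your workflow:

import hashlib
import json
import random
from pathlib import Path

SEED = 1234
random.seed(SEED)  # apply the same seed wherever your scripts sample

# Record seed, model tag, and a hash of each prompt file next to the outputs.
manifest = {
    "seed": SEED,
    "model": "gemini-2.5-flash-latest",
    "prompt_sha256": {
        p.name: hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(Path("prompts").glob("*.txt"))
    },
}
Path("data/run_manifest.json").write_text(json.dumps(manifest, indent=2))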

πŸ—‚οΈ Project Structure

.
├── data/
│   ├── 1/
│   │   ├── scenario.json
│   │   ├── governance.json
│   │   └── evaluation.json
│   └── evaluation_summary.json
├── prompts/
│   ├── scenario_prompt.txt
│   ├── governance_prompt.txt
│   └── evaluation.txt
├── src/
│   ├── generate_scenario.py
│   ├── generate_governance.py
│   └── evaluate_governance.py
├── requirements.txt
├── LICENSE
└── README.md

βš™οΈ Configuration

All core behavior is controlled via CLI flags and prompt templates:

  • Prompts: Edit prompts/evaluation.txt to change rubric dimensions and strict JSON output keys.
  • Hooks: Add post-processing in src/evaluate_governance.py to compute extra metrics (e.g., bootstrap CIs, PR curves); see the sketch after this list.
  • Filtering: Point --input-dir to specific runs or pass multiple run paths.
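
As an illustration of such a hook, a plain percentile-bootstrap CI over per-run compliance scores; a standalone sketch, not code from src/:

import json
import random
from pathlib import Path

def bootstrap_ci(values, n_boot=10_000, alpha=0.05, seed=0):
    # Percentile bootstrap over the mean of `values`.
    rng = random.Random(seed)
    means = sorted(
        sum(rng.choices(values, k=len(values))) / len(values)
        for _ in range(n_boot)
    )
    lower = means[int(alpha / 2 * n_boot)]
    upper = means[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper

scores = [
    json.loads(p.read_text())["metrics"]["compliance_score"]
    for p in Path("data").glob("*/evaluation.json")
]
assert scores, "no evaluation.json files found under data/"
print(bootstrap_ci(scores))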

✅ Testing & CI

  • Unit tests (suggested): prompt parsing, JSON validation, and aggregation logic; a starter test follows this list.
  • Smoke tests: single-sample end-to-end run.
  • CI: add a workflow to lint (ruff/flake8), test (pytest), and validate JSON schemas.
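
A starter test along those lines, assuming the default metric keys from the Metrics section:

# tests/test_schema.py -- a hedged starting point; adjust keys to your rubric.
import json
from pathlib import Path

REQUIRED_METRICS = {
    "compliance_score", "risk_identification_recall", "mitigation_quality",
    "coherence", "grounding", "action_validity",
}

def test_evaluation_schema():
    for path in Path("data").glob("*/evaluation.json"):
        doc = json.loads(path.read_text())
        # Every run must carry the full rubric, with scores in [0, 1].
        assert REQUIRED_METRICS <= doc["metrics"].keys()
        assert all(0.0 <= doc["metrics"][m] <= 1.0 for m in REQUIRED_METRICS)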

🧭 Roadmap

  • Gold-label alignment for select domains (auto PR curves)
  • Additional judge ensembles & adjudication strategies
  • Pluggable risk taxonomies (sector-specific)
  • Cost & latency dashboards
  • Exporters: CSV/LaTeX for paper appendices

📚 Citation

If you use MI9 or the evaluation suite, please cite:

@article{mi9_2025,
  author       = {Wang, Charles L. and Singhal, Trisha and Kelkar, Ameya and Tuo, Jason},
  title        = {MI9 -- Agent Intelligence Protocol: Runtime Governance for Agentic AI Systems},
  journal      = {arXiv:2508.03858 [cs.AI]},
  year         = {2025},
  doi          = {10.48550/arXiv.2508.03858},
  url          = {https://arxiv.org/abs/2508.03858}
}

📜 License

Released under MIT (see LICENSE).

Update the badge at the top if you choose Apache-2.0 or another license.


πŸ™ Acknowledgments

Thanks to the contributors and reviewers who shaped the MI9 evaluation rubric and reference implementation.

Contributions welcome: please open an Issue or start a Discussion to propose metrics, gold labels, or domain packs.


💡 Tip: To produce camera-ready tables, consume data/evaluation_summary.json directly in your plotting/table scripts and pin the prompt version + model tag in your appendix.
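
For example, a small sketch that renders macro_avgs as booktabs rows (field names as in the schema above; adapt to your table layout):

import json
from pathlib import Path

summary = json.loads(Path("data/evaluation_summary.json").read_text())

# One LaTeX row per macro-averaged metric.
rows = "\n".join(
    f"{metric.replace('_', ' ')} & {value:.2f} \\\\"
    for metric, value in summary["macro_avgs"].items()
)
table = (
    "\\begin{tabular}{lr}\n\\toprule\n"
    "Metric & Macro avg. \\\\\n\\midrule\n"
    + rows +
    "\n\\bottomrule\n\\end{tabular}"
)
print(table)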
