MoralStack is a governance layer that decides whether, how, and under what constraints a response should be generated before text generation starts.
- Core Idea
- Decision Model
- Architecture
- Benchmark Results
- Quickstart
- Configuration
- Running the Benchmark
- Web UI
- Documentation
- Limitations & Trade-offs
## Core Idea

Traditional LLM pipelines optimize for helpfulness first. MoralStack adds an explicit policy layer that separates:

- Decision: `NORMAL_COMPLETE`, `SAFE_COMPLETE`, or `REFUSE`
- Generation: produce text consistent with the selected decision

This keeps decision logic auditable and minimizes unsafe false negatives in sensitive contexts.
## Decision Model

Every request produces an explicit `final_action`:

| Action | Meaning |
|---|---|
| `NORMAL_COMPLETE` | Direct response |
| `SAFE_COMPLETE` | Responsible response with safeguards |
| `REFUSE` | Refusal with safe redirection |
Single source of truth for bounds and action selection:

- Module: `moralstack/runtime/decision/safe_complete_policy.py`
- API: `compute_action_bounds(...)`, `decide_final_action(...)`

`SAFE_COMPLETE` is a first-class policy action and is not inferred from text disclaimers.
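The bounds-then-decide split can be illustrated with a minimal, self-contained sketch. The function names mirror the documented API, but the signatures, the `ActionBounds` fields, the domain list, and all thresholds below are illustrative assumptions, not the real implementation:

```python
from dataclasses import dataclass

@dataclass
class ActionBounds:
    """Illustrative bounds; the real fields live in safe_complete_policy.py."""
    refuse_floor: float         # risk at or above this forces REFUSE
    safe_complete_floor: float  # risk at or above this forces at least SAFE_COMPLETE

def compute_action_bounds(domain: str) -> ActionBounds:
    # Hypothetical rule: regulated domains get a lower SAFE_COMPLETE floor,
    # so safeguards kick in earlier.
    if domain in {"medical", "legal", "financial"}:
        return ActionBounds(refuse_floor=0.95, safe_complete_floor=0.2)
    return ActionBounds(refuse_floor=0.95, safe_complete_floor=0.5)

def decide_final_action(risk: float, bounds: ActionBounds) -> str:
    # The decision is a pure function of the risk score and the bounds,
    # which is what keeps it auditable.
    if risk >= bounds.refuse_floor:
        return "REFUSE"
    if risk >= bounds.safe_complete_floor:
        return "SAFE_COMPLETE"
    return "NORMAL_COMPLETE"

bounds = compute_action_bounds("medical")
print(decide_final_action(0.3, bounds))  # SAFE_COMPLETE
```

Because the action is computed before generation, the same inputs always map to the same `final_action`; no text-level disclaimer detection is involved.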
## Architecture

High-level flow:

- Risk and context analysis (parallel mini-estimators: intent, operational risk, signal detection)
- Policy bounds computation + domain overlay application
- Routing (`FAST_PATH` or `DELIBERATIVE_PATH`)
- Deliberation (critic → simulator → perspectives → hindsight) when needed
- Response assembly aligned with `final_action`
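The first and third steps above can be sketched with `asyncio`. The estimator bodies, their fixed scores, and the 0.35 routing threshold are placeholders (the real estimators call an LLM); only the shape — concurrent mini-estimators feeding a routing decision — reflects the documented flow:

```python
import asyncio

# Stand-ins for the three mini-estimators; real ones are LLM-backed.
async def estimate_intent(prompt: str) -> float:
    return 0.2  # placeholder score in [0, 1]

async def estimate_operational_risk(prompt: str) -> float:
    return 0.4

async def detect_signals(prompt: str) -> float:
    return 0.1

async def route(prompt: str) -> str:
    # Run the estimators concurrently rather than sequentially.
    scores = await asyncio.gather(
        estimate_intent(prompt),
        estimate_operational_risk(prompt),
        detect_signals(prompt),
    )
    # Hypothetical routing rule: any elevated score triggers deliberation.
    return "DELIBERATIVE_PATH" if max(scores) >= 0.35 else "FAST_PATH"

print(asyncio.run(route("example query")))  # DELIBERATIVE_PATH
```

Running the estimators in parallel keeps the analysis stage's latency close to that of a single call.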
Main packages:

- `moralstack/runtime/` — orchestration runtime
- `moralstack/orchestration/` — controller, routing, deliberation services
- `moralstack/models/risk/` — risk estimation and calibration
- `moralstack/constitution/` — constitution schema, loader, store (YAML-driven)
- `moralstack/persistence/` — DB and file persistence modes
- `moralstack/ui/` — FastAPI dashboard (`moralstack-ui`)
## Benchmark Results

Evaluated on 84 questions spanning adversarial prompts, dual-use domains, regulated topics (legal, medical, financial), and false-positive torture tests. The judge model (GPT-5.2) is independent of both the baseline and MoralStack generation.
| Metric | Baseline (GPT-4o) | MoralStack |
|---|---|---|
| False Negatives (no refusal when needed) | 13 | 0 |
| Information Leakage | 14 (16.7%) | 0 (0%) |
| False Positives (refusal on legitimate queries) | 0 | 0 |
| Utility Preservation (legitimate queries answered) | 62/62 | 62/62 |
| Safe Redirection on Refusal | 1/22 (4.5%) | 22/22 (100%) |
| | Baseline | MoralStack | Tie |
|---|---|---|---|
| Wins | 1 | 53 | 30 |
| Avg Safety Score | 7.73/10 | 9.39/10 | — |
Decision-compliance confusion matrix (expected vs. predicted action):

| Expected \ Predicted | NC | SC | REFUSE |
|---|---|---|---|
| NC | 9 | 1 | 0 |
| SC | 0 | 52 | 0 |
| REFUSE | 0 | 0 | 22 |
98.8% compliance rate. Zero system errors. The single off-diagonal cell (1 NC→SC) is a health-domain query where MoralStack adds a professional-consultation disclaimer — a reasonable policy choice for regulated content.
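The compliance rate follows directly from the matrix diagonal:

```python
# Confusion matrix from the benchmark (rows = expected, cols = predicted).
matrix = {
    "NC":     {"NC": 9, "SC": 1,  "REFUSE": 0},
    "SC":     {"NC": 0, "SC": 52, "REFUSE": 0},
    "REFUSE": {"NC": 0, "SC": 0,  "REFUSE": 22},
}

diagonal = sum(matrix[a][a] for a in matrix)               # 83 exact matches
total = sum(sum(row.values()) for row in matrix.values())  # 84 questions
print(f"{100 * diagonal / total:.1f}%")  # 98.8%
```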
Note: This benchmark demonstrates proof-of-concept effectiveness on 84 curated questions. It is not a claim of production-grade coverage across all possible inputs. We encourage independent evaluation.
| | Baseline | MoralStack |
|---|---|---|
| Avg Latency | ~5s | ~60s |
Deliberative paths add latency by design. Latency-reducing optimizations include speculative decoding, parallel risk estimation, lighter models for simulator and policy rewrite (see Limitations and Configuration).
## Quickstart

Prerequisites:

- Python 3.11+
- OpenAI API key

Before installing, create a virtual environment and activate it:

```bash
python -m venv venv
source venv/bin/activate
```

One-command install (recommended):

```bash
python scripts/install.py
```

This installs the package in editable mode with all extras (dev, ui) and registers the `moralstack` and `moralstack-ui` CLI entry points.

Manual install (equivalent to `install.py`):

```bash
pip install -e ".[dev,ui]"
```

Create your environment file and set at least the API key:

```bash
cp .env.minimal .env
```

```
OPENAI_API_KEY=sk-...
```

Run:

```bash
moralstack
```

Useful commands:

```bash
moralstack --help
moralstack --verbose
moralstack --mock
```

Legacy wrapper (same runtime entrypoint): `python scripts/mstack_run.py`
## Configuration

Environment is loaded via `moralstack/utils/env_loader.py`:

- `.env` values are loaded with `override=True` (non-empty `.env` values override existing env vars)
- optional empty values are purged after load to avoid invalid client configuration

Key variables:

- `OPENAI_MODEL` (default `gpt-4o`)
- `MORALSTACK_POLICY_REWRITE_MODEL` (optional; model for the deliberative `rewrite()` at cycle 2+; if unset, same as `OPENAI_MODEL`. `.env.template` sets `gpt-4.1-nano` for lower rewrite latency.)
- `OPENAI_TIMEOUT_MS` (default `60000`)
- `OPENAI_MAX_RETRIES` (default `3`)
- `OPENAI_TEMPERATURE` (code fallback default `0.7`; `.env.template` starter value `0.1`)
- `OPENAI_TOP_P` (code fallback default `0.9`; `.env.template` starter value `0.8`)
- `MORALSTACK_DB_PATH` (enables SQLite persistence)
- `MORALSTACK_PERSIST_MODE` (`db_only`, `dual`, or `file_only`)
- `MORALSTACK_ORCHESTRATOR_BORDERLINE_REFUSE_UPPER` (default `0.95`)

For the full variable reference, see INSTALL.md and docs/modules/*.md.
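Putting those variables together, a minimal `.env` might look like the fragment below. All values are illustrative (the `MORALSTACK_DB_PATH` value in particular is a hypothetical path); only `OPENAI_API_KEY` is required:

```ini
OPENAI_API_KEY=sk-...
OPENAI_MODEL=gpt-4o
OPENAI_TEMPERATURE=0.1
OPENAI_TOP_P=0.8
OPENAI_TIMEOUT_MS=60000
MORALSTACK_DB_PATH=./moralstack.db
MORALSTACK_PERSIST_MODE=db_only
```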
Default models by component (each can be overridden via its env var; see INSTALL.md and module docs):

| Component | Default model | Env variable |
|---|---|---|
| Policy (generation) | `gpt-4o` | `OPENAI_MODEL` |
| Policy (rewrite) | same as primary, or `gpt-4.1-nano` in `.env.template` | `MORALSTACK_POLICY_REWRITE_MODEL` |
| Risk estimator | follows `OPENAI_MODEL` unless set | `MORALSTACK_RISK_MODEL` |
| Critic | follows `OPENAI_MODEL` unless set | `MORALSTACK_CRITIC_MODEL` |
| Simulator | follows `OPENAI_MODEL` unless set | `MORALSTACK_SIMULATOR_MODEL` |
| Perspectives | follows `OPENAI_MODEL` unless set | `MORALSTACK_PERSPECTIVES_MODEL` |
| Hindsight | follows `OPENAI_MODEL` unless set | `MORALSTACK_HINDSIGHT_MODEL` |
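The "follows `OPENAI_MODEL` unless set" pattern is a standard env-var fallback chain. A sketch (the helper name `resolve_model` is hypothetical, not part of the MoralStack API):

```python
import os

def resolve_model(component_var: str, default: str = "gpt-4o") -> str:
    """Return the component's model if set, else OPENAI_MODEL, else the default."""
    return (
        os.environ.get(component_var)
        or os.environ.get("OPENAI_MODEL")
        or default
    )

os.environ["OPENAI_MODEL"] = "gpt-4o"
os.environ["MORALSTACK_RISK_MODEL"] = "gpt-4.1-nano"

print(resolve_model("MORALSTACK_RISK_MODEL"))    # gpt-4.1-nano (explicit override)
print(resolve_model("MORALSTACK_CRITIC_MODEL"))  # gpt-4o (falls back to OPENAI_MODEL)
```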
## Running the Benchmark

```bash
python scripts/benchmark_moralstack.py
```

The benchmark supports separate baseline and judge models via:

- `MORALSTACK_BENCHMARK_BASELINE_MODEL`
- `MORALSTACK_BENCHMARK_JUDGE_MODEL`

When the judge model differs from the generation model, the judge is treated as independent.
## Web UI

Install UI extras and configure the DB path:

```bash
pip install -e ".[ui]"
```

Set:

- `MORALSTACK_DB_PATH`
- `MORALSTACK_UI_USERNAME`
- `MORALSTACK_UI_PASSWORD`

Start:

```bash
moralstack-ui
```

Open http://localhost:8765/ (or the port set in `MORALSTACK_UI_PORT`).
## Documentation

- INSTALL.md
- Architecture spec
- Decision policy
- Constitution design
- Module docs
- Development guide
- Limitations and trade-offs
## Limitations & Trade-offs

MoralStack makes deliberate trade-offs:

- Deliberation over speed: deliberative paths run multiple LLM calls (risk → critic → simulator → perspectives → hindsight). Average response time is ~60s vs ~5s for raw GPT-4o. This is a design choice: governance takes time.
- Multi-model cost: a single deliberative request makes 7-9 LLM calls. Example profile: `.env.minimal` uses `gpt-4.1-nano` for policy rewrite and simulator, and `gpt-4o-mini` for perspectives (all overridable via env).
- Benchmark scope: 84 curated questions demonstrate the approach but do not cover all edge cases. We recommend running your own evaluations on domain-specific inputs.
- LLM non-determinism: despite low temperature settings across all modules, LLM outputs can vary between runs. The system includes deterministic guardrails in code to bound this variance, but perfect reproducibility is not guaranteed.
Latency has been reduced through speculative decoding (predicted outputs for draft revisions), parallel risk estimation, lighter models for simulator and rewrite (gpt-4.1-nano), structured JSON output enforcement, and soft-revision prompt constraints. Further optimizations (early-exit on low-risk queries, context-mode switching) are planned.
See full discussion in docs/limitations_and_tradeoffs.md.
