# Reduce runtime latency without degrading deliberation quality, and add first-class observability for scheduling/runtime decisions

## Summary

Improve MoralStack’s end-to-end latency by eliminating unnecessary waits, removing duplicated retrieval work, and making orchestration more runtime-aware, **without weakening the current deliberative quality**.

At the same time, introduce a proper observability model for orchestration decisions so the request detail view can fully explain **what happened, why it happened, and what was skipped/reused/cancelled** during execution.

This work must be implemented as a sequence of safe, incremental changes and must preserve the current governance semantics.

---

## Motivation

MoralStack is already mature and includes useful latency-oriented primitives such as speculative generation and parallel module execution. However, the current runtime still has important inefficiencies:

- speculative generation is started in parallel with risk estimation, but the controller still waits for both results even when the speculative draft will never be used;
- the deliberative path supports parallel evaluation, but does not always short-circuit expensive modules when the critic has already established a hard violation;
- constitution/relevant-principle retrieval can be repeated multiple times within a single request;
- the domain prefilter cache is invalidated too aggressively;
- some runtime decisions that materially affect latency are not modeled explicitly in persistence or UI, which makes the request detail page incomplete from an audit/debug perspective.

This issue addresses both sides of the problem:

1. **Performance:** reduce avoidable latency without reducing deliberative rigor.
2. **Auditability:** make runtime orchestration decisions first-class, persisted, queryable, and visible in the UI.

---

## Goals

### Performance goals
- reduce wall-clock latency by eliminating avoidable blocking;
- reduce duplicated constitution/retrieval work within the same request;
- reduce parse-related retry/jitter in risk and retrieval paths;
- prepare the architecture for future multi-turn conversational governance.

### Observability goals
- persist structured runtime/orchestration decisions separately from raw LLM calls;
- make scheduling/gating/reuse decisions visible in the request detail page;
- preserve or improve the auditability of the current pipeline;
- keep markdown/report export aligned with the UI.

---

## Non-goals

This issue does **not** aim to:

- weaken the deliberative process;
- reduce the number of modules by default;
- blindly reduce the number of deliberation cycles;
- shorten prompts in a way that harms reasoning quality;
- replace structured runtime observability with generic debug logs.

---

## Verified evidence from the current codebase

The following findings were verified against the current codebase and drive this plan:

1. **Speculative overlap exists but is not truly lazy**
   - `moralstack/orchestration/controller.py`
   - `_run_speculative_overlap()` currently waits for both risk estimation and speculative generation before returning.
   - Result: in `REFUSE` / `DOMAIN_EXCLUDED` style paths, we still pay avoidable waiting time.

2. **Full parallel evaluation exists but does not provide true latency short-circuiting**
   - `moralstack/orchestration/deliberation_runner.py`
   - Both `_run_full_parallel_evaluation()` and `_run_critic_gated_parallel()` already exist.
   - Result: the runtime already has the right primitives, but the scheduler is not yet fully risk-aware.

3. **Relevant-principles retrieval is duplicated**
   - `moralstack/orchestration/deliberation_runner.py`
   - `moralstack/runtime/modules/critic_module.py`
   - Result: the same request can trigger repeated `get_relevant_principles()` work across runner + critic.

4. **DomainPrefilter cache invalidation is too aggressive**
   - `moralstack/constitution/retriever.py`
   - `set_domain_keywords()` clears cache even when the keyword map has not actually changed.
   - Result: cache hit rate is artificially reduced.

5. **Risk/retrieval stack still has room for structured-output hardening**
   - the cognitive modules already produce more structured outputs;
   - Result: the risk estimator and retrieval path lag behind and still incur avoidable parse-related retries/jitter.

6. **Current persistence/UI are not enough for new runtime decisions**
   - `moralstack/persistence/db.py`
   - `moralstack/persistence/sink.py`
   - `moralstack/reports/...`
   - `moralstack/ui/app.py`
   - `moralstack/ui/templates/request.html`
   - Result: LLM calls, decision traces, and debug events exist, but runtime decisions such as “speculative discarded”, “simulator skipped”, “critic short-circuit”, or “retrieval reused” are not yet modeled as first-class entities.

---

## High-level implementation strategy

This work should be implemented in phases. The ordering is important.

### Phase 0 — Build observability first
Before introducing new latency optimizations, add the persistence/report/UI substrate needed to explain those optimizations.

### Phase 1 — Apply low-risk, high-ROI fixes
- fix unnecessary cache invalidation;
- reduce object/client recreation overhead;
- improve structured outputs in risk/retrieval.

### Phase 2 — Remove avoidable blocking
- make speculative overlap truly lazy;
- stop waiting for speculative results when they are not needed.

### Phase 3 — Eliminate duplicated retrieval work
- retrieve relevant principles once per request path and reuse downstream.

### Phase 4 — Introduce runtime-aware scheduling
- choose between `critic_gated` and `full_parallel` based on risk posture;
- make simulator gating and cancellation decisions explicit and observable.

### Phase 5 — Add conservative early convergence
- allow cycle-1 convergence only when module outputs are strongly aligned.

### Phase 6 — Prepare the schema/runtime for future multi-turn governance
- add conversation-oriented persistence fields and state model foundations.

---

## Scope

# A. Observability foundation

## A1. Add a first-class `orchestration_events` persistence layer

Introduce a new table dedicated to runtime/orchestration decisions that are not equivalent to raw LLM calls.

### Proposed schema
Add a new `orchestration_events` table with fields such as:

- `id`
- `run_id`
- `request_id`
- `cycle`
- `stage`
- `component`
- `event_type`
- `decision`
- `status`
- `sequence`
- `started_at`
- `duration_ms`
- `reason_codes_json`
- `inputs_json`
- `outputs_json`
- `payload_json`
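
A minimal sketch of how this table could look, assuming the persistence layer is SQLite-backed (this issue does not state which database `db.py` uses). The column types, the index, and the `insert_event` helper are illustrative assumptions; only the column names come from the list above.

```python
import json
import sqlite3

# Hypothetical DDL: column names follow the proposed schema, types are assumed.
ORCHESTRATION_EVENTS_DDL = """
CREATE TABLE IF NOT EXISTS orchestration_events (
    id                INTEGER PRIMARY KEY AUTOINCREMENT,
    run_id            TEXT NOT NULL,
    request_id        TEXT NOT NULL,
    cycle             INTEGER,
    stage             TEXT,
    component         TEXT,
    event_type        TEXT NOT NULL,
    decision          TEXT,
    status            TEXT,
    sequence          INTEGER NOT NULL,
    started_at        TEXT,
    duration_ms       REAL,
    reason_codes_json TEXT,
    inputs_json       TEXT,
    outputs_json      TEXT,
    payload_json      TEXT
);
CREATE INDEX IF NOT EXISTS idx_orch_events_request
    ON orchestration_events (request_id, sequence);
"""


def insert_event(conn, *, run_id, request_id, event_type, sequence,
                 decision=None, reason_codes=()):
    """Insert one event; list-like fields are stored as JSON text."""
    conn.execute(
        "INSERT INTO orchestration_events "
        "(run_id, request_id, event_type, sequence, decision, reason_codes_json) "
        "VALUES (?, ?, ?, ?, ?, ?)",
        (run_id, request_id, event_type, sequence, decision,
         json.dumps(list(reason_codes))),
    )


# Usage: two events for one request, read back in execution order.
conn = sqlite3.connect(":memory:")
conn.executescript(ORCHESTRATION_EVENTS_DDL)
insert_event(conn, run_id="run-1", request_id="req-1",
             event_type="SPECULATIVE_STARTED", sequence=1)
insert_event(conn, run_id="run-1", request_id="req-1",
             event_type="SPECULATIVE_RESULT_DISCARDED", sequence=2,
             decision="discard", reason_codes=["RISK_REFUSE"])
rows = conn.execute(
    "SELECT event_type, decision FROM orchestration_events "
    "WHERE request_id = ? ORDER BY sequence", ("req-1",)
).fetchall()
```

Ordering by an explicit `sequence` rather than by timestamp keeps the UI table stable when events start within the same clock tick.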

### Why this is needed
`debug_events` are useful as raw logs, but they are too generic to drive a clear request detail UI. We need a queryable, stable runtime event model for:

- scheduler strategy selection;
- speculative start/join/use/discard;
- simulator gate decisions;
- retrieval reuse decisions;
- cache invalidation/hit/miss events;
- convergence decisions;
- module cancellation / short-circuit outcomes.

### Files to update
- `moralstack/persistence/db.py`
- `moralstack/persistence/sink.py`
- `moralstack/persistence/__init__.py`
- any relevant persistence helpers

---

## A2. Extend `llm_calls` with runtime outcome metadata

Add minimal metadata to `llm_calls` so the UI can tell whether a call was:

- normal or speculative;
- used or discarded;
- cached/reused/skipped/cancelled.

### New fields
- `call_kind` (`normal | speculative | synthetic`)
- `call_outcome` (`used | discarded | skipped | cancelled | cached | none`)
- `cache_status` (`hit | miss | reused | none`)
- `related_event_id` (optional)
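
One possible shape for a backward-compatible migration, again assuming SQLite. The defaults and the `migrate_llm_calls` helper are hypothetical; the stand-in `llm_calls` table below has far fewer columns than the real one.

```python
import sqlite3

# Hypothetical column definitions; defaults keep pre-existing rows valid.
NEW_LLM_CALL_COLUMNS = {
    "call_kind": "TEXT NOT NULL DEFAULT 'normal'",
    "call_outcome": "TEXT NOT NULL DEFAULT 'none'",
    "cache_status": "TEXT NOT NULL DEFAULT 'none'",
    "related_event_id": "INTEGER",
}


def migrate_llm_calls(conn):
    """Add any missing columns and return their names; safe to run repeatedly."""
    existing = {row[1] for row in conn.execute("PRAGMA table_info(llm_calls)")}
    added = []
    for name, ddl in NEW_LLM_CALL_COLUMNS.items():
        if name not in existing:
            conn.execute(f"ALTER TABLE llm_calls ADD COLUMN {name} {ddl}")
            added.append(name)
    return added


# Usage: an old row survives the migration and picks up the defaults.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE llm_calls (id INTEGER PRIMARY KEY, request_id TEXT)")
conn.execute("INSERT INTO llm_calls (request_id) VALUES ('req-old')")
first_run = migrate_llm_calls(conn)
second_run = migrate_llm_calls(conn)
old_row = conn.execute(
    "SELECT call_kind, call_outcome, cache_status, related_event_id FROM llm_calls"
).fetchone()
```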

### Why this is needed
A speculative generation call should not look identical to a normal call in the UI.

### Files to update
- `moralstack/persistence/db.py`
- `moralstack/persistence/sink.py`
- all code paths persisting `llm_calls`

---

## A3. Add new decision trace stages

Introduce explicit audit-grade trace stages for:

- `RISK_ASSESSMENT`
- `REQUEST_ANALYSIS_CONTEXT`
- `CYCLE_SUMMARY`

### `RISK_ASSESSMENT` should include
- `risk_score`
- `risk_category`
- `operational_risk`
- `intent_to_harm`
- `requested_instructions`
- `intent_operational`
- `risk_policy_action`
- `estimation_mode`
- `detected_domain`
- `activated_signals`

### `REQUEST_ANALYSIS_CONTEXT` should include
- `relevant_principles`
- `constitution_domain`
- `retrieval_count`
- `reuse_targets`
- `prefilter_cache_status`
- `parallel_retrieval`

### `CYCLE_SUMMARY` should include
- `cycle`
- `scheduler_strategy`
- `modules_planned`
- `modules_executed`
- `modules_skipped`
- `modules_cancelled`
- `critic_decision`
- `violations_count`
- `violated_hard`
- `semantic_expected_harm`
- `perspectives_weighted_approval`
- `convergence_decision`
- `convergence_reason`
- `next_action`

### Files to update
- `moralstack/orchestration/controller.py`
- `moralstack/orchestration/deliberation_runner.py`
- trace/decision logging helpers

---

## A4. Upgrade the request detail UI

The request detail page should expose runtime decisions clearly and not rely only on raw debug events.

### Add new UI sections
1. **Execution Strategy**
   - risk estimation mode
   - speculative generation outcome
   - scheduler strategy
   - retrieval reuse summary
   - simulator gating summary
   - convergence summary

2. **Runtime Decisions**
   - ordered table of orchestration events
   - columns like:
     - cycle
     - stage
     - component
     - event
     - decision
     - reason
     - duration
     - status/badge

3. **Cycle Cards**
   - one card per deliberation cycle
   - strategy chosen
   - modules planned/executed/skipped/cancelled
   - critic outcome
   - simulator outcome or gate reason
   - perspectives outcome
   - convergence decision

4. **Semantic Timeline Enhancements**
   - speculative call badges
   - cancelled modules
   - skipped modules
   - reused computations
   - cache hits

### Keep existing raw diagnostics
- retain `Debug Events` / raw traces as advanced fallback sections;
- they should not remain the primary way to understand runtime behavior.

### Files to update
- `moralstack/reports/model.py`
- `moralstack/reports/orchestrator_observability.py`
- add new report/view-model helpers as needed
- `moralstack/ui/app.py`
- `moralstack/ui/templates/request.html`
- related static assets if needed

---

# B. High-ROI performance fixes

## B1. Make `DomainPrefilter.set_domain_keywords()` idempotent

### Problem
The prefilter cache is invalidated even when the keyword map has not changed.

### Change
Only invalidate if the effective keyword configuration actually changed.
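
A sketch of the idempotent setter, using a simplified stand-in for `DomainPrefilter`. The attribute names, the boolean return value, and the assumption that keyword order and duplicates do not matter are illustrative, not taken from `retriever.py`.

```python
class DomainPrefilter:
    """Simplified stand-in; only the cache/keyword handling is sketched."""

    def __init__(self):
        self._domain_keywords = {}
        self._cache = {}
        self.invalidations = 0

    @staticmethod
    def _normalize(keywords):
        # Order-insensitive view of the keyword map (assumed semantics).
        return {domain: frozenset(words) for domain, words in keywords.items()}

    def set_domain_keywords(self, keywords):
        """Replace the keyword map; clear the cache only on a real change."""
        normalized = self._normalize(keywords)
        if normalized == self._domain_keywords:
            return False  # no effective change: keep cached results
        self._domain_keywords = normalized
        self._cache.clear()  # DOMAIN_PREFILTER_CACHE_INVALIDATED would be emitted here
        self.invalidations += 1
        return True


# Usage: re-applying an equivalent map keeps the cache warm.
prefilter = DomainPrefilter()
changed_first = prefilter.set_domain_keywords({"medical": ["dose", "drug"]})
prefilter._cache["some query"] = "medical"
changed_again = prefilter.set_domain_keywords({"medical": ["drug", "dose"]})
```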

### Expected benefit
Improved cache hit rate and lower repeated domain prefilter cost.

### Required observability
Emit structured events such as:
- `DOMAIN_PREFILTER_CACHE_HIT`
- `DOMAIN_PREFILTER_CACHE_MISS`
- `DOMAIN_PREFILTER_CACHE_INVALIDATED`

### Files
- `moralstack/constitution/retriever.py`

---

## B2. Reuse OpenAI clients / policy objects where safe

### Problem
The retrieval stack and some mini-estimator paths recreate clients/policy objects too often.

### Change
Introduce safe client/policy pooling/reuse to avoid repeated initialization overhead.
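
One lightweight way to get reuse is a memoized factory keyed by configuration. The sketch below uses `functools.lru_cache` with a `FakeClient` stand-in so it stays self-contained; `get_client`, its parameters, and the example URL are hypothetical.

```python
from functools import lru_cache


class FakeClient:
    """Stand-in for an SDK client; a real one would hold a connection pool."""
    instances = 0

    def __init__(self, base_url, timeout_s):
        FakeClient.instances += 1
        self.base_url = base_url
        self.timeout_s = timeout_s


@lru_cache(maxsize=8)
def get_client(base_url, timeout_s=30.0):
    """Return one shared client per distinct configuration."""
    return FakeClient(base_url, timeout_s)


# Usage: identical configuration reuses the instance, a different one does not.
a = get_client("https://api.example.invalid/v1")
b = get_client("https://api.example.invalid/v1")
c = get_client("https://api.example.invalid/v1", timeout_s=5.0)
```

Note that `lru_cache` may still invoke the factory more than once if several threads request the same uncached key at the same moment, so an explicit lock may be preferable if client construction is expensive.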

### Expected benefit
Lower infrastructure overhead and improved connection reuse.

### Notes
This is mainly an efficiency change and does not need prominent request-detail visualization, but may be surfaced in advanced diagnostics or run-level summaries.

### Files
- `moralstack/constitution/retriever.py`
- `moralstack/models/risk/estimator.py`

---

## B3. Strengthen structured-output enforcement in risk/retrieval

### Problem
Risk/retrieval still have avoidable parse-related retries/jitter.

### Change
Use `response_format={"type":"json_object"}` where appropriate and update parsers/fallbacks accordingly.
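
On the parsing side, a strict-first parser with a single tolerant fallback could report the observability fields listed below. This is a sketch only: `ParseResult`, `parse_risk_json`, and the choice of required keys are assumptions, and the model call itself is omitted.

```python
import json
from dataclasses import dataclass


@dataclass
class ParseResult:
    data: dict
    parse_status: str      # "ok" | "fallback" | "failed"
    parse_attempts: int
    fallback_used: bool    # True when the strict parse did not succeed


def parse_risk_json(raw, required=("risk_score", "risk_category")):
    """Try a strict parse first, then the outermost {...} span as a fallback."""
    candidates = [raw]
    start, end = raw.find("{"), raw.rfind("}")
    if 0 <= start < end:
        candidates.append(raw[start:end + 1])
    for attempt, text in enumerate(candidates, start=1):
        try:
            data = json.loads(text)
        except json.JSONDecodeError:
            continue
        if isinstance(data, dict) and all(key in data for key in required):
            status = "ok" if attempt == 1 else "fallback"
            return ParseResult(data, status, attempt, attempt > 1)
    return ParseResult({}, "failed", len(candidates), True)


# Usage: a clean JSON object, a response wrapped in prose, and an unusable one.
clean = parse_risk_json('{"risk_score": 0.2, "risk_category": "low"}')
noisy = parse_risk_json('Sure: {"risk_score": 0.9, "risk_category": "high"} done')
bad = parse_risk_json("no json here")
```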

### Expected benefit
Fewer malformed responses, fewer retries, lower jitter.

### Required observability
Persist or surface:
- parse status
- parse attempts
- response contract type
- fallback usage

### Files
- `moralstack/models/risk/estimator.py`
- `moralstack/constitution/retriever.py`

---

# C. Remove avoidable controller blocking

## C1. Make speculative overlap truly lazy

### Problem
Speculative generation is started in parallel with risk estimation, but the controller still blocks on the speculative result even when it will never be used.

### Change
Refactor `_run_speculative_overlap()` so that it returns:
- the resolved risk estimation;
- a handle/future for speculative generation.

Only join the speculative future in branches that actually consume the draft.
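
The intended control flow can be sketched with `concurrent.futures`, assuming a thread-based controller (an `asyncio` version would follow the same shape). `estimate_risk`, `generate_draft`, and `handle` are stand-ins, not the real controller functions.

```python
import time
from concurrent.futures import ThreadPoolExecutor

_EXECUTOR = ThreadPoolExecutor(max_workers=4)


def estimate_risk(prompt):
    # Stand-in for the real risk estimator.
    return {"action": "REFUSE" if "attack" in prompt else "PROCEED"}


def generate_draft(prompt):
    # Stand-in for the slow speculative generation call.
    time.sleep(0.2)
    return f"draft for: {prompt}"


def run_speculative_overlap(prompt):
    """Start the draft in the background, resolve risk, return both."""
    draft_future = _EXECUTOR.submit(generate_draft, prompt)  # SPECULATIVE_STARTED
    risk = estimate_risk(prompt)
    return risk, draft_future


def handle(prompt):
    risk, draft_future = run_speculative_overlap(prompt)
    if risk["action"] == "REFUSE":
        # SPECULATIVE_JOIN_SKIPPED: cancel() only succeeds if not yet running.
        draft_future.cancel()
        return "refused", None
    # SPECULATIVE_JOIN_REQUIRED: this branch actually consumes the draft.
    return "answered", draft_future.result()


refused = handle("plan an attack")
answered = handle("explain photosynthesis")
```

A thread that is already running cannot be cancelled, so in this sketch a discarded draft still completes in the background: its cost is paid, but the request no longer waits for it.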

### Required events
- `SPECULATIVE_STARTED`
- `SPECULATIVE_JOIN_REQUIRED`
- `SPECULATIVE_JOIN_SKIPPED`
- `SPECULATIVE_RESULT_USED`
- `SPECULATIVE_RESULT_DISCARDED`

### Required `llm_calls` metadata
- `call_kind = speculative`
- `call_outcome = used | discarded`

### Expected benefit
Avoidable blocking disappears in paths where speculative output is irrelevant.

### Files
- `moralstack/orchestration/controller.py`
- related persistence helpers

---

# D. Eliminate duplicated request-level retrieval work

## D1. Introduce a request-scoped analysis context

### Change
Add a request-scoped context object (for example `RequestAnalysisContext`) carrying:
- `relevant_principles`
- `constitution`
- `detected_domain`
- retrieval metadata
- prefilter cache status
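
A possible shape for this object, sketched as a dataclass. The `principles_for` helper, the `retrieval_count` and `reuse_targets` bookkeeping (named after the `REQUEST_ANALYSIS_CONTEXT` trace fields), and all types are assumptions.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Optional


@dataclass
class RequestAnalysisContext:
    """Request-scoped holder; field names follow the list above."""
    relevant_principles: Optional[list] = None
    constitution: Any = None
    detected_domain: Optional[str] = None
    retrieval_metadata: dict = field(default_factory=dict)
    prefilter_cache_status: str = "none"
    retrieval_count: int = 0
    reuse_targets: list = field(default_factory=list)

    def principles_for(self, consumer: str, retrieve: Callable[[], list]) -> list:
        """Retrieve on first use, then hand the same list to later consumers."""
        if self.relevant_principles is None:
            self.relevant_principles = retrieve()
            self.retrieval_count += 1            # RELEVANT_PRINCIPLES_RETRIEVED
        else:
            self.reuse_targets.append(consumer)  # RELEVANT_PRINCIPLES_REUSED
        return self.relevant_principles


# Usage: runner and critic share one retrieval.
calls = []


def fake_retrieve():
    calls.append(1)
    return ["do_no_harm", "honesty"]


ctx = RequestAnalysisContext(detected_domain="medical")
runner_view = ctx.principles_for("runner", fake_retrieve)
critic_view = ctx.principles_for("critic", fake_retrieve)
```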

### Files
- `moralstack/orchestration/types.py`
- `moralstack/orchestration/deliberation_runner.py`

---

## D2. Retrieve relevant principles once and reuse them downstream

### Problem
Relevant-principles retrieval can happen multiple times for a single request.

### Change
Perform retrieval once at the beginning of the deliberative path and pass the result to downstream consumers such as the critic.

### Required events
- `RELEVANT_PRINCIPLES_RETRIEVED`
- `RELEVANT_PRINCIPLES_REUSED`

### Required trace
- `REQUEST_ANALYSIS_CONTEXT`

### Expected benefit
Reduced duplicated retrieval cost without changing policy semantics.

### Files
- `moralstack/orchestration/deliberation_runner.py`
- `moralstack/runtime/modules/critic_module.py`

---

# E. Runtime-aware scheduling in the deliberative cycle

## E1. Dynamically choose `critic_gated` vs `full_parallel`

### Problem
The code already supports both strategies, but the runtime is not yet choosing dynamically based on risk posture.

### Change
Inside the deliberation runner, choose scheduling strategy based on the request’s posture, for example using signals like:
- high-risk category;
- `intent_to_harm = true`;
- `operational_risk = HIGH`;
- request for operational instructions in sensitive contexts;
- prior-cycle hard violations.
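
As an illustration, strategy selection could be a pure function from posture to strategy plus reason codes. The sketch assumes that any risk signal should select `critic_gated` (so the critic can close the path before other modules run) and that benign requests use `full_parallel`; the `RiskPosture` type, the `"high"` category value, and the reason codes are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RiskPosture:
    """Hypothetical bundle of the signals listed above."""
    risk_category: str = "low"
    intent_to_harm: bool = False
    operational_risk: str = "LOW"
    instructions_in_sensitive_context: bool = False
    prior_cycle_hard_violation: bool = False


def select_strategy(posture):
    """Return (strategy, reason_codes); any risk signal selects critic_gated."""
    checks = [
        (posture.risk_category == "high", "HIGH_RISK_CATEGORY"),
        (posture.intent_to_harm, "INTENT_TO_HARM"),
        (posture.operational_risk == "HIGH", "OPERATIONAL_RISK_HIGH"),
        (posture.instructions_in_sensitive_context, "SENSITIVE_INSTRUCTIONS"),
        (posture.prior_cycle_hard_violation, "PRIOR_HARD_VIOLATION"),
    ]
    reasons = [code for triggered, code in checks if triggered]
    return ("critic_gated" if reasons else "full_parallel"), reasons


# Usage: the reason codes can feed a PARALLEL_STRATEGY_SELECTED event directly.
benign = select_strategy(RiskPosture())
risky = select_strategy(RiskPosture(intent_to_harm=True, operational_risk="HIGH"))
```

Keeping the function pure makes the decision easy to unit-test and easy to persist alongside its inputs.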

### Required events
- `PARALLEL_STRATEGY_SELECTED`
- `CRITIC_SHORT_CIRCUIT_TRIGGERED`
- `PARALLEL_MODULE_CANCEL_ATTEMPTED`
- `PARALLEL_MODULE_CANCELLED`
- `PARALLEL_MODULE_COMPLETED_AFTER_SHORT_CIRCUIT`

### Required trace
- `CYCLE_SUMMARY.scheduler_strategy`

### Expected benefit
Avoid paying for expensive modules in cases where the critic can safely close the path early.

### Files
- `moralstack/orchestration/deliberation_runner.py`

---

## E2. Make simulator gating explicit, conservative, and observable

### Problem
Simulator gating exists, but it is not yet the default operating mode and its decisions are not first-class in observability.

### Change
Enable simulator gating only after observability and tests are in place, and use conservative thresholds/logic such as:
- skip simulator when the critic returns `PROCEED` with zero violations;
- skip simulator when previous-cycle expected harm is very low and the draft did not materially change.
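
The two skip rules above could be expressed as a small gate function. The threshold value, parameter names, and reason codes are assumptions; the defaults are chosen so that missing information results in running the simulator.

```python
def simulator_gate(critic_decision, violations_count,
                   prev_expected_harm=None, draft_changed=True,
                   harm_threshold=0.05):
    """Return (run_simulator, reason_code) for a SIMULATOR_GATE_DECISION event."""
    if critic_decision == "PROCEED" and violations_count == 0:
        return False, "CRITIC_CLEAN"
    if (prev_expected_harm is not None
            and prev_expected_harm <= harm_threshold
            and not draft_changed):
        return False, "LOW_PRIOR_HARM_DRAFT_UNCHANGED"
    # Conservative default: when in doubt, run the simulator.
    return True, "DEFAULT_RUN"


# Usage: one case per rule, plus the default.
clean = simulator_gate("PROCEED", 0)
stable = simulator_gate("REVISE", 1, prev_expected_harm=0.01, draft_changed=False)
unknown = simulator_gate("REVISE", 1)
```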

### Required events
- `SIMULATOR_GATE_DECISION`
- `SIMULATOR_EXECUTED`
- `SIMULATOR_SKIPPED`

### Required trace
- module skip/execution information in `CYCLE_SUMMARY`

### Expected benefit
Reduced cycle cost in clearly converged/clean cases without degrading governance quality.

### Files
- `moralstack/orchestration/config_loader.py`
- `moralstack/orchestration/deliberation_runner.py`

---

# F. Conservative early convergence

## F1. Allow early cycle-1 convergence only when all signals strongly align

### Change
Extend convergence logic so that cycle 1 can terminate early only when:
- critic is clean;
- expected harm is very low;
- perspectives are strongly aligned;
- no residual concern indicates the need for another cycle.
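
A sketch of the conjunction of these conditions, with every failed check reported so that `CYCLE_SUMMARY.convergence_reason` can be populated. The thresholds (`max_harm`, `min_approval`), the `CycleSignals` type, the `"PROCEED"` critic value as the meaning of "clean", and the reason codes are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CycleSignals:
    cycle: int
    critic_decision: str
    violations_count: int
    semantic_expected_harm: float
    perspectives_weighted_approval: float
    residual_concerns: tuple = ()


def evaluate_early_convergence(signals, max_harm=0.05, min_approval=0.9):
    """Return (accepted, failed_checks); accept only if every check passes."""
    checks = [
        (signals.critic_decision == "PROCEED" and signals.violations_count == 0,
         "CRITIC_NOT_CLEAN"),
        (signals.semantic_expected_harm <= max_harm, "EXPECTED_HARM_TOO_HIGH"),
        (signals.perspectives_weighted_approval >= min_approval,
         "PERSPECTIVES_NOT_ALIGNED"),
        (not signals.residual_concerns, "RESIDUAL_CONCERNS_PRESENT"),
    ]
    failed = [code for ok, code in checks if not ok]
    return (not failed), failed


# Usage: a fully aligned cycle 1, and one that must continue.
accepted = evaluate_early_convergence(CycleSignals(1, "PROCEED", 0, 0.01, 0.95))
rejected = evaluate_early_convergence(
    CycleSignals(1, "PROCEED", 0, 0.01, 0.70, residual_concerns=("tone",)))
```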

### Required events
- `CONVERGENCE_EVALUATED`
- `EARLY_CONVERGENCE_ACCEPTED`
- `EARLY_CONVERGENCE_REJECTED`

### Required trace
- `CYCLE_SUMMARY.convergence_decision`
- `CYCLE_SUMMARY.convergence_reason`

### Expected benefit
Avoid unnecessary second cycles in genuinely converged cases.

### Files
- `moralstack/orchestration/convergence_evaluator.py`

---

# G. Multi-turn readiness

## G1. Prepare persistence/schema for future conversational governance

### Change
Add fields such as:
- `conversation_id`
- `turn_index`
- `parent_request_id`

### Future direction
This issue does not require full conversational governance yet, but the schema should be ready for it.

### Files
- `moralstack/persistence/db.py`
- request persistence model(s)

---

## G2. Lay groundwork for a future `ConversationGovernanceState`

### Change
Prepare an optional runtime state model that can later support:
- domain carry-forward;
- overlay reuse;
- prior posture carry-forward;
- future delta-based refreshes.

### Note
This is preparatory and should come last.

---

## Testing requirements

The implementation must include or update tests for:

### Persistence
- `orchestration_events` creation, insertion, and retrieval
- `llm_calls` backward-compatible extension

### Observability/reporting
- execution strategy view model
- runtime decisions view model
- cycle summary rendering
- markdown export consistency

### UI
- request detail rendering of:
  - speculative used/discarded
  - scheduler strategy
  - simulator skipped/executed
  - convergence decisions
  - retrieval reuse summary

### Runtime behavior
- lazy speculative join
- single retrieval + downstream reuse
- dynamic scheduler selection
- simulator gating
- early convergence decisions

---

## Acceptance criteria

This issue is complete only when all of the following are true:

1. **Latency optimizations are implemented without changing governance semantics.**
2. **A first-class `orchestration_events` layer exists and is persisted.**
3. **`llm_calls` can distinguish speculative/used/discarded/cancelled/skipped behavior.**
4. **Request detail UI clearly shows execution strategy and runtime decisions.**
5. **Cycle-level summaries explain what was executed, skipped, reused, or cancelled.**
6. **Relevant-principles retrieval is performed once per request path and reused downstream.**
7. **Speculative generation is no longer awaited in branches that do not consume it.**
8. **Risk-aware scheduling between `critic_gated` and `full_parallel` is implemented and observable.**
9. **Simulator gating and early convergence are both conservative, tested, and visible in traces/UI.**
10. **Existing reports and raw diagnostics continue to work.**

---

## Recommended implementation order

### Block A — Observability before optimization
1. add `orchestration_events`
2. extend `llm_calls`
3. add new decision trace stages
4. update report/view-model layer
5. update request detail UI

### Block B — Low-risk, high-ROI fixes
6. make `set_domain_keywords()` idempotent
7. reuse clients/policies
8. strengthen structured outputs in risk/retrieval

### Block C — Remove avoidable waits
9. make speculative overlap truly lazy
10. introduce request-scoped analysis context and retrieval reuse

### Block D — Runtime-aware scheduling
11. dynamic `critic_gated` vs `full_parallel`
12. observable simulator gating
13. conservative early convergence

### Block E — Future-proofing
14. conversation-ready schema fields
15. runtime foundations for future conversation state

---

## Risks and cautions

- Do not implement multiple scheduler/gating changes at once without observability.
- Do not rely on raw debug events as the primary UI source.
- Do not reduce latency by weakening the deliberative process.
- Do not change thresholds and scheduling policy in the same commit unless strongly justified and well tested.
- Ensure backward compatibility for existing DB/report/UI paths wherever possible.

---

## Expected outcome

After this issue is completed, MoralStack should provide:

### Better latency
Through:
- fewer unnecessary waits;
- less duplicated retrieval work;
- smarter scheduling;
- fewer parse retries/jitter;
- better cache behavior.

### Better auditability
Through:
- structured orchestration events;
- richer traces;
- semantically richer LLM call records;
- a request detail page that explains runtime decisions clearly.

### Better readiness for multi-turn governance
Through:
- conversation-ready schema foundations;
- explicit separation between LLM calls, audit traces, and runtime orchestration decisions.

---
