
Reduce runtime latency without degrading deliberation quality, and add first-class observability for scheduling/runtime decisions #3

@fdidonato

Summary

Improve MoralStack’s end-to-end latency by eliminating unnecessary waits, removing duplicated retrieval work, and making orchestration more runtime-aware, without weakening the current deliberative quality.

At the same time, introduce a proper observability model for orchestration decisions so the request detail view can fully explain what happened, why it happened, and what was skipped/reused/cancelled during execution.

This work must be implemented as a sequence of safe, incremental changes and must preserve the current governance semantics.


Motivation

MoralStack is already mature and includes useful latency-oriented primitives such as speculative generation and parallel module execution. However, the current runtime still has important inefficiencies:

  • speculative generation is started in parallel with risk estimation, but the controller still waits for both results even when the speculative draft will never be used;
  • the deliberative path supports parallel evaluation, but does not always short-circuit expensive modules when the critic has already established a hard violation;
  • constitution/relevant-principle retrieval can be repeated multiple times within a single request;
  • the domain prefilter cache is invalidated too aggressively;
  • some runtime decisions that materially affect latency are not modeled explicitly in persistence or UI, which makes the request detail page incomplete from an audit/debug perspective.

This issue addresses both sides of the problem:

  1. Performance: reduce avoidable latency without reducing deliberative rigor.
  2. Auditability: make runtime orchestration decisions first-class, persisted, queryable, and visible in the UI.

Goals

Performance goals

  • reduce wall-clock latency by eliminating avoidable blocking;
  • reduce duplicated constitution/retrieval work within the same request;
  • reduce parse-related retry/jitter in risk and retrieval paths;
  • prepare the architecture for future multi-turn conversational governance.

Observability goals

  • persist structured runtime/orchestration decisions separately from raw LLM calls;
  • make scheduling/gating/reuse decisions visible in the request detail page;
  • preserve or improve the auditability of the current pipeline;
  • keep markdown/report export aligned with the UI.

Non-goals

This issue does not aim to:

  • weaken the deliberative process;
  • reduce the number of modules by default;
  • blindly reduce the number of deliberation cycles;
  • shorten prompts in a way that harms reasoning quality;
  • replace structured runtime observability with generic debug logs.

Verified evidence from the current codebase

The following findings were verified against the current codebase and drive this plan:

  1. Speculative overlap exists but is not truly lazy

    • moralstack/orchestration/controller.py
    • _run_speculative_overlap() currently waits for both risk estimation and speculative generation before returning.
    • Result: in REFUSE / DOMAIN_EXCLUDED style paths, we still pay avoidable waiting time.
  2. Full parallel evaluation exists but does not provide true latency short-circuiting

    • moralstack/orchestration/deliberation_runner.py
    • Both _run_full_parallel_evaluation() and _run_critic_gated_parallel() already exist.
    • Result: the runtime already has the right primitives, but the scheduler is not yet fully risk-aware.
  3. Relevant-principles retrieval is duplicated

    • moralstack/orchestration/deliberation_runner.py
    • moralstack/runtime/modules/critic_module.py
    • Result: the same request can trigger repeated get_relevant_principles() work across runner + critic.
  4. DomainPrefilter cache invalidation is too aggressive

    • moralstack/constitution/retriever.py
    • set_domain_keywords() clears cache even when the keyword map has not actually changed.
    • Result: cache hit rate is artificially reduced.
  5. Risk/retrieval stack still has room for structured-output hardening

    • cognitive modules are already more structured;
    • risk estimator and retrieval path still have room to reduce parse-related retries/jitter.
  6. Current persistence/UI are not enough for new runtime decisions

    • moralstack/persistence/db.py
    • moralstack/persistence/sink.py
    • moralstack/reports/...
    • moralstack/ui/app.py
    • moralstack/ui/templates/request.html
    • Result: LLM calls, decision traces, and debug events exist, but runtime decisions such as “speculative discarded”, “simulator skipped”, “critic short-circuit”, or “retrieval reused” are not yet modeled as first-class entities.

High-level implementation strategy

This work should be implemented in phases. The ordering is important.

Phase 0 — Build observability first

Before introducing new latency optimizations, add the persistence/report/UI substrate needed to explain those optimizations.

Phase 1 — Apply low-risk, high-ROI fixes

  • fix unnecessary cache invalidation;
  • reduce object/client recreation overhead;
  • improve structured outputs in risk/retrieval.

Phase 2 — Remove avoidable blocking

  • make speculative overlap truly lazy;
  • stop waiting for speculative results when they are not needed.

Phase 3 — Eliminate duplicated retrieval work

  • retrieve relevant principles once per request path and reuse downstream.

Phase 4 — Introduce runtime-aware scheduling

  • choose between critic_gated and full_parallel based on risk posture;
  • make simulator gating and cancellation decisions explicit and observable.

Phase 5 — Add conservative early convergence

  • allow cycle-1 convergence only when module outputs are strongly aligned.

Phase 6 — Prepare the schema/runtime for future multi-turn governance

  • add conversation-oriented persistence fields and state model foundations.

Scope

A. Observability foundation

A1. Add a first-class orchestration_events persistence layer

Introduce a new table dedicated to runtime/orchestration decisions that are not equivalent to raw LLM calls.

Proposed schema

Add a new orchestration_events table with fields such as:

  • id
  • run_id
  • request_id
  • cycle
  • stage
  • component
  • event_type
  • decision
  • status
  • sequence
  • started_at
  • duration_ms
  • reason_codes_json
  • inputs_json
  • outputs_json
  • payload_json
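
As a rough illustration of the proposed table, the sketch below creates it in an in-memory SQLite database and reads events back in execution order. It assumes SQLite as the storage engine; the column types, the index, and the helpers `insert_event` and `events_for_request` are hypothetical and are not taken from moralstack/persistence/db.py.

```python
import json
import sqlite3

# Column names follow the proposed field list; the SQL types, constraints,
# and index are assumptions made for this sketch.
ORCHESTRATION_EVENTS_DDL = """
CREATE TABLE IF NOT EXISTS orchestration_events (
    id                INTEGER PRIMARY KEY AUTOINCREMENT,
    run_id            TEXT NOT NULL,
    request_id        TEXT NOT NULL,
    cycle             INTEGER,
    stage             TEXT,
    component         TEXT,
    event_type        TEXT NOT NULL,
    decision          TEXT,
    status            TEXT,
    sequence          INTEGER NOT NULL,
    started_at        TEXT,
    duration_ms       REAL,
    reason_codes_json TEXT,
    inputs_json       TEXT,
    outputs_json      TEXT,
    payload_json      TEXT
);
CREATE INDEX IF NOT EXISTS idx_orchestration_events_request
    ON orchestration_events (request_id, sequence);
"""


def insert_event(conn, *, run_id, request_id, event_type, sequence,
                 decision=None, cycle=None, reason_codes=None):
    """Insert one event and return its row id (hypothetical helper)."""
    cur = conn.execute(
        "INSERT INTO orchestration_events "
        "(run_id, request_id, cycle, event_type, decision, sequence, reason_codes_json) "
        "VALUES (?, ?, ?, ?, ?, ?, ?)",
        (run_id, request_id, cycle, event_type, decision, sequence,
         json.dumps(reason_codes or [])),
    )
    return cur.lastrowid


def events_for_request(conn, request_id):
    """Return (event_type, decision, reason_codes) tuples ordered by sequence."""
    rows = conn.execute(
        "SELECT event_type, decision, reason_codes_json "
        "FROM orchestration_events WHERE request_id = ? ORDER BY sequence",
        (request_id,),
    ).fetchall()
    return [(etype, decision, json.loads(codes)) for etype, decision, codes in rows]


conn = sqlite3.connect(":memory:")
conn.executescript(ORCHESTRATION_EVENTS_DDL)
insert_event(conn, run_id="run-1", request_id="req-1",
             event_type="SPECULATIVE_STARTED", sequence=1)
insert_event(conn, run_id="run-1", request_id="req-1",
             event_type="SPECULATIVE_RESULT_DISCARDED", sequence=2,
             decision="discard", reason_codes=["RISK_POLICY_REFUSE"])
```

Ordering by an explicit `sequence` column rather than by timestamp keeps the Runtime Decisions table stable even when several events share the same `started_at` value.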

Why this is needed

debug_events are useful as raw logs, but they are too generic to drive a clear request detail UI. We need a queryable, stable runtime event model for:

  • scheduler strategy selection;
  • speculative start/join/use/discard;
  • simulator gate decisions;
  • retrieval reuse decisions;
  • cache invalidation/hit/miss events;
  • convergence decisions;
  • module cancellation / short-circuit outcomes.

Files to update

  • moralstack/persistence/db.py
  • moralstack/persistence/sink.py
  • moralstack/persistence/__init__.py
  • any relevant persistence helpers

A2. Extend llm_calls with runtime outcome metadata

Add minimal metadata to llm_calls so the UI can tell whether a call was:

  • normal or speculative;
  • used or discarded;
  • cached/reused/skipped/cancelled.

New fields

  • call_kind (normal | speculative | synthetic)
  • call_outcome (used | discarded | skipped | cancelled | cached | none)
  • cache_status (hit | miss | reused | none)
  • related_event_id (optional)
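
One possible way to add these fields without breaking existing rows is an additive, re-runnable migration. The sketch below assumes SQLite; the defaults chosen for legacy rows (`normal`, `none`, `none`) and the helper name `migrate_llm_calls` are assumptions, not the project's actual migration code.

```python
import sqlite3

# New columns and their declarations. Defaults are assumptions chosen so that
# rows written before the migration remain readable.
NEW_LLM_CALL_COLUMNS = {
    "call_kind": "TEXT NOT NULL DEFAULT 'normal'",
    "call_outcome": "TEXT NOT NULL DEFAULT 'none'",
    "cache_status": "TEXT NOT NULL DEFAULT 'none'",
    "related_event_id": "INTEGER",
}


def migrate_llm_calls(conn):
    """Add any missing columns to llm_calls; return the names that were added."""
    # PRAGMA table_info yields one row per column; index 1 is the column name.
    existing = {row[1] for row in conn.execute("PRAGMA table_info(llm_calls)")}
    added = []
    for name, declaration in NEW_LLM_CALL_COLUMNS.items():
        if name not in existing:
            conn.execute(f"ALTER TABLE llm_calls ADD COLUMN {name} {declaration}")
            added.append(name)
    return added


# Demo against a minimal legacy table (the real llm_calls has more columns).
legacy = sqlite3.connect(":memory:")
legacy.execute("CREATE TABLE llm_calls (id INTEGER PRIMARY KEY, model TEXT)")
legacy.execute("INSERT INTO llm_calls (model) VALUES ('example-model')")
migrate_llm_calls(legacy)
```

Because the function checks existing columns first, running it twice is a no-op, which makes it safe to call at startup.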

Why this is needed

A speculative generation call should not look identical to a normal call in the UI.

Files to update

  • moralstack/persistence/db.py
  • moralstack/persistence/sink.py
  • all code paths persisting llm_calls

A3. Add new decision trace stages

Introduce explicit audit-grade trace stages for:

  • RISK_ASSESSMENT
  • REQUEST_ANALYSIS_CONTEXT
  • CYCLE_SUMMARY

RISK_ASSESSMENT should include

  • risk_score
  • risk_category
  • operational_risk
  • intent_to_harm
  • requested_instructions
  • intent_operational
  • risk_policy_action
  • estimation_mode
  • detected_domain
  • activated_signals

REQUEST_ANALYSIS_CONTEXT should include

  • relevant_principles
  • constitution_domain
  • retrieval_count
  • reuse_targets
  • prefilter_cache_status
  • parallel_retrieval

CYCLE_SUMMARY should include

  • cycle
  • scheduler_strategy
  • modules_planned
  • modules_executed
  • modules_skipped
  • modules_cancelled
  • critic_decision
  • violations_count
  • violated_hard
  • semantic_expected_harm
  • perspectives_weighted_approval
  • convergence_decision
  • convergence_reason
  • next_action

Files to update

  • moralstack/orchestration/controller.py
  • moralstack/orchestration/deliberation_runner.py
  • trace/decision logging helpers

A4. Upgrade the request detail UI

The request detail page should expose runtime decisions clearly and not rely only on raw debug events.

Add new UI sections

  1. Execution Strategy

    • risk estimation mode
    • speculative generation outcome
    • scheduler strategy
    • retrieval reuse summary
    • simulator gating summary
    • convergence summary
  2. Runtime Decisions

    • ordered table of orchestration events
    • columns like:
      • cycle
      • stage
      • component
      • event
      • decision
      • reason
      • duration
      • status/badge
  3. Cycle Cards

    • one card per deliberation cycle
    • strategy chosen
    • modules planned/executed/skipped/cancelled
    • critic outcome
    • simulator outcome or gate reason
    • perspectives outcome
    • convergence decision
  4. Semantic Timeline Enhancements

    • speculative call badges
    • cancelled modules
    • skipped modules
    • reused computations
    • cache hits

Keep existing raw diagnostics

  • retain Debug Events / raw traces as advanced fallback sections;
  • they should not remain the primary way to understand runtime behavior.

Files to update

  • moralstack/reports/model.py
  • moralstack/reports/orchestrator_observability.py
  • add new report/view-model helpers as needed
  • moralstack/ui/app.py
  • moralstack/ui/templates/request.html
  • related static assets if needed

B. High-ROI performance fixes

B1. Make DomainPrefilter.set_domain_keywords() idempotent

Problem

The prefilter cache is invalidated even when the keyword map has not changed.

Change

Only invalidate if the effective keyword configuration actually changed.
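
A minimal sketch of the idempotent behavior follows. `DomainPrefilterSketch` is a hypothetical stand-in, not the real `DomainPrefilter`: the internal cache layout, the substring-matching `detect` method, and the in-memory `events` list are assumptions used only to make the invalidation rule concrete.

```python
class DomainPrefilterSketch:
    """Hypothetical stand-in for DomainPrefilter; only the cache logic is shown."""

    def __init__(self):
        self._keywords_key = ()   # normalized form of the current keyword map
        self._cache = {}          # text -> list of detected domains
        self.events = []          # stands in for structured event emission

    @staticmethod
    def _normalize(keywords):
        # Order-insensitive, duplicate-insensitive representation of the map.
        return tuple(sorted(
            (domain, tuple(sorted(set(words)))) for domain, words in keywords.items()
        ))

    def set_domain_keywords(self, keywords):
        """Replace the keyword map; clear the cache only on an effective change."""
        key = self._normalize(keywords)
        if key == self._keywords_key:
            return False          # same configuration: keep the cache
        self._keywords_key = key
        self._cache.clear()
        self.events.append("DOMAIN_PREFILTER_CACHE_INVALIDATED")
        return True

    def detect(self, text):
        if text in self._cache:
            self.events.append("DOMAIN_PREFILTER_CACHE_HIT")
            return self._cache[text]
        self.events.append("DOMAIN_PREFILTER_CACHE_MISS")
        lowered = text.lower()
        result = sorted(
            domain for domain, words in self._keywords_key
            if any(word in lowered for word in words)
        )
        self._cache[text] = result
        return result


prefilter = DomainPrefilterSketch()
prefilter.set_domain_keywords({"medical": ["dose", "drug"]})
prefilter.detect("What is a safe drug dose?")
prefilter.set_domain_keywords({"medical": ["drug", "dose"]})  # reordered only
prefilter.detect("What is a safe drug dose?")                 # served from cache
```

Comparing a normalized key rather than the raw dict means that reordering keywords or passing an equal but distinct dict object does not count as a change.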

Expected benefit

Improved cache hit rate and lower repeated domain prefilter cost.

Required observability

Emit structured events such as:

  • DOMAIN_PREFILTER_CACHE_HIT
  • DOMAIN_PREFILTER_CACHE_MISS
  • DOMAIN_PREFILTER_CACHE_INVALIDATED

Files

  • moralstack/constitution/retriever.py

B2. Reuse OpenAI clients / policy objects where safe

Problem

The retrieval stack and some mini-estimator paths recreate clients/policy objects too often.

Change

Introduce safe client/policy pooling/reuse to avoid repeated initialization overhead.

Expected benefit

Lower infrastructure overhead and improved connection reuse.

Notes

This is mainly an efficiency change and does not need prominent request-detail visualization, but may be surfaced in advanced diagnostics or run-level summaries.

Files

  • moralstack/constitution/retriever.py
  • moralstack/models/risk/estimator.py

B3. Strengthen structured-output enforcement in risk/retrieval

Problem

Risk/retrieval still have avoidable parse-related retries/jitter.

Change

Use response_format={"type":"json_object"} where appropriate and update parsers/fallbacks accordingly.
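
The sketch below shows how a parser could record the observability fields listed in this section while still tolerating occasional malformed output. It makes no network call: `JSON_MODE_KWARGS` only shows the extra request argument, and `ParseResult`, `parse_structured`, the status labels, and the two-attempt recovery strategy are all assumptions for illustration, not the existing estimator or retriever code.

```python
import json
from dataclasses import dataclass
from typing import Any, Dict

# Extra keyword argument for the chat-completion request (JSON mode).
JSON_MODE_KWARGS = {"response_format": {"type": "json_object"}}


@dataclass
class ParseResult:
    data: Dict[str, Any]
    parse_status: str       # "ok" | "recovered" | "failed" (hypothetical labels)
    parse_attempts: int
    response_contract: str
    fallback_used: bool


def parse_structured(raw: str, fallback: Dict[str, Any]) -> ParseResult:
    """Parse a model response as a JSON object, recording how parsing went."""
    # Attempt 1: the whole response is a JSON object, as JSON mode should ensure.
    try:
        data = json.loads(raw)
        if isinstance(data, dict):
            return ParseResult(data, "ok", 1, "json_object", False)
    except json.JSONDecodeError:
        pass
    # Attempt 2: recover an object embedded in surrounding text.
    start, end = raw.find("{"), raw.rfind("}")
    if 0 <= start < end:
        try:
            data = json.loads(raw[start:end + 1])
            if isinstance(data, dict):
                return ParseResult(data, "recovered", 2, "json_object", False)
        except json.JSONDecodeError:
            pass
    # Give up and return a copy of the caller-supplied fallback.
    return ParseResult(dict(fallback), "failed", 2, "json_object", True)


conservative_default = {"risk_score": 1.0}
clean = parse_structured('{"risk_score": 0.2}', conservative_default)
noisy = parse_structured('Sure: {"risk_score": 0.9} done', conservative_default)
broken = parse_structured("no json here", conservative_default)
```

Returning the parse metadata alongside the data lets the caller persist `parse_status`, `parse_attempts`, and `fallback_used` without re-deriving them from logs.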

Expected benefit

Fewer malformed responses, fewer retries, lower jitter.

Required observability

Persist or surface:

  • parse status
  • parse attempts
  • response contract type
  • fallback usage

Files

  • moralstack/models/risk/estimator.py
  • moralstack/constitution/retriever.py

C. Remove avoidable controller blocking

C1. Make speculative overlap truly lazy

Problem

Speculative generation is started in parallel with risk estimation, but the controller still blocks on the speculative result even when it will never be used.

Change

Refactor _run_speculative_overlap() so that it returns:

  • the resolved risk estimation;
  • a handle/future for speculative generation.

Only join the speculative future in branches that actually consume the draft.
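
A minimal sketch of the lazy join, assuming a thread-pool execution model (the real controller may use a different concurrency mechanism). The function names, the `events` list, the `PROCEED` value, and the refusal string are hypothetical; `REFUSE` and `DOMAIN_EXCLUDED` stand for the non-consuming paths mentioned earlier in this issue.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def run_speculative_overlap(executor, estimate_risk, generate_draft, prompt):
    """Start the draft in the background, resolve risk, return (risk, future)."""
    draft_future = executor.submit(generate_draft, prompt)
    risk_action = estimate_risk(prompt)   # resolved before returning
    return risk_action, draft_future      # the draft is NOT joined here


def handle_request(executor, estimate_risk, generate_draft, prompt, events):
    events.append("SPECULATIVE_STARTED")
    risk_action, draft_future = run_speculative_overlap(
        executor, estimate_risk, generate_draft, prompt
    )
    if risk_action in ("REFUSE", "DOMAIN_EXCLUDED"):
        # Best effort: cancel() only prevents a task that has not started yet.
        draft_future.cancel()
        events.extend(["SPECULATIVE_JOIN_SKIPPED", "SPECULATIVE_RESULT_DISCARDED"])
        return "[refusal]"
    events.append("SPECULATIVE_JOIN_REQUIRED")
    draft = draft_future.result()         # join only in the consuming branch
    events.append("SPECULATIVE_RESULT_USED")
    return draft


# Demo: the draft blocks until released, yet the refusal path returns at once.
release = threading.Event()


def slow_draft(prompt):
    release.wait(timeout=5)
    return "draft:" + prompt


demo_events = []
with ThreadPoolExecutor(max_workers=2) as pool:
    demo_result = handle_request(pool, lambda p: "REFUSE", slow_draft, "hi", demo_events)
    release.set()   # let the background task finish so the pool can shut down
```

Note that a thread-based future cannot interrupt an in-flight LLM call, so the discarded call may still complete and incur cost; this is why the sketch records a discard event rather than claiming a cancellation.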

Required events

  • SPECULATIVE_STARTED
  • SPECULATIVE_JOIN_REQUIRED
  • SPECULATIVE_JOIN_SKIPPED
  • SPECULATIVE_RESULT_USED
  • SPECULATIVE_RESULT_DISCARDED

Required llm_calls metadata

  • call_kind = speculative
  • call_outcome = used | discarded

Expected benefit

Avoidable blocking disappears in paths where speculative output is irrelevant.

Files

  • moralstack/orchestration/controller.py
  • related persistence helpers

D. Eliminate duplicated request-level retrieval work

D1. Introduce a request-scoped analysis context

Change

Add a request-scoped context object (for example RequestAnalysisContext) carrying:

  • relevant_principles
  • constitution
  • detected_domain
  • retrieval metadata
  • prefilter cache status

Files

  • moralstack/orchestration/types.py
  • moralstack/orchestration/deliberation_runner.py

D2. Retrieve relevant principles once and reuse them downstream

Problem

Relevant-principles retrieval can happen multiple times for a single request.

Change

Perform retrieval once at the beginning of the deliberative path and pass the result to downstream consumers such as the critic.
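
The retrieve-once pattern could look roughly like the sketch below. Only a subset of the proposed context fields is shown, and `build_context`, `principles_for`, the `retrieve` callable, and the `events` list are hypothetical names introduced for illustration; the real `get_relevant_principles()` signature is not assumed.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class RequestAnalysisContext:
    """Request-scoped retrieval results (subset of the proposed fields)."""
    relevant_principles: List[str]
    detected_domain: Optional[str] = None
    retrieval_count: int = 1
    reuse_targets: List[str] = field(default_factory=list)


def build_context(query: str, retrieve: Callable[[str], List[str]],
                  events: List[str],
                  detected_domain: Optional[str] = None) -> RequestAnalysisContext:
    """Run retrieval exactly once at the start of the deliberative path."""
    principles = list(retrieve(query))
    events.append("RELEVANT_PRINCIPLES_RETRIEVED")
    return RequestAnalysisContext(principles, detected_domain)


def principles_for(ctx: RequestAnalysisContext, consumer: str,
                   events: List[str]) -> List[str]:
    """Hand the already-retrieved principles to a downstream consumer."""
    ctx.reuse_targets.append(consumer)
    events.append("RELEVANT_PRINCIPLES_REUSED")
    return ctx.relevant_principles


# Demo with a fake retriever that counts how often it is called.
retrieval_calls = []


def fake_retrieve(query):
    retrieval_calls.append(query)
    return ["do-no-harm", "honesty"]


demo_events = []
ctx = build_context("how do I ...", fake_retrieve, demo_events, detected_domain="general")
critic_principles = principles_for(ctx, "critic", demo_events)
```

The `reuse_targets` list gives the `REQUEST_ANALYSIS_CONTEXT` trace a direct record of which components consumed the shared result.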

Required events

  • RELEVANT_PRINCIPLES_RETRIEVED
  • RELEVANT_PRINCIPLES_REUSED

Required trace

  • REQUEST_ANALYSIS_CONTEXT

Expected benefit

Reduced duplicated retrieval cost without changing policy semantics.

Files

  • moralstack/orchestration/deliberation_runner.py
  • moralstack/runtime/modules/critic_module.py

E. Runtime-aware scheduling in the deliberative cycle

E1. Dynamically choose critic_gated vs full_parallel

Problem

The code already supports both strategies, but the runtime is not yet choosing dynamically based on risk posture.

Change

Inside the deliberation runner, choose scheduling strategy based on the request’s posture, for example using signals like:

  • high-risk category;
  • intent_to_harm = true;
  • operational_risk = HIGH;
  • request for operational instructions in sensitive contexts;
  • prior-cycle hard violations.
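
One way the selection could be expressed is as a pure function over the signals above. This is a sketch under assumptions: the mapping (any risk signal selects `critic_gated`, otherwise `full_parallel`), the `RiskPosture` type, and the reason-code strings are hypothetical and would need to be validated against the actual governance policy.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class RiskPosture:
    """Hypothetical bundle of the signals listed above."""
    high_risk_category: bool = False
    intent_to_harm: bool = False
    operational_risk: str = "LOW"
    sensitive_instructions_requested: bool = False
    prior_cycle_hard_violation: bool = False


def select_strategy(posture: RiskPosture) -> Tuple[str, List[str]]:
    """Return (strategy, reason_codes) for a PARALLEL_STRATEGY_SELECTED event."""
    reasons = []
    if posture.high_risk_category:
        reasons.append("HIGH_RISK_CATEGORY")
    if posture.intent_to_harm:
        reasons.append("INTENT_TO_HARM")
    if posture.operational_risk == "HIGH":
        reasons.append("OPERATIONAL_RISK_HIGH")
    if posture.sensitive_instructions_requested:
        reasons.append("SENSITIVE_INSTRUCTIONS_REQUESTED")
    if posture.prior_cycle_hard_violation:
        reasons.append("PRIOR_CYCLE_HARD_VIOLATION")
    # Assumption: when a hard violation is likely, run the critic first so the
    # expensive modules are never started; otherwise run everything in parallel.
    strategy = "critic_gated" if reasons else "full_parallel"
    return strategy, reasons


benign = select_strategy(RiskPosture())
risky = select_strategy(RiskPosture(intent_to_harm=True, operational_risk="HIGH"))
```

Returning the reason codes together with the strategy makes the decision directly persistable in `reason_codes_json` and in `CYCLE_SUMMARY.scheduler_strategy`.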

Required events

  • PARALLEL_STRATEGY_SELECTED
  • CRITIC_SHORT_CIRCUIT_TRIGGERED
  • PARALLEL_MODULE_CANCEL_ATTEMPTED
  • PARALLEL_MODULE_CANCELLED
  • PARALLEL_MODULE_COMPLETED_AFTER_SHORT_CIRCUIT

Required trace

  • CYCLE_SUMMARY.scheduler_strategy

Expected benefit

Avoid paying for expensive modules in cases where the critic can safely close the path early.

Files

  • moralstack/orchestration/deliberation_runner.py

E2. Make simulator gating explicit, conservative, and observable

Problem

Simulator gating exists, but it is not yet the default operating mode and its decisions are not first-class in observability.

Change

Enable simulator gating only after observability and tests are in place, and use conservative thresholds/logic such as:

  • skip simulator when the critic returns PROCEED with zero violations;
  • skip simulator when previous-cycle expected harm is very low and the draft did not materially change.
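
The two skip conditions can be sketched as a small gate function. The `0.05` harm threshold, the reason-code strings, and the `REVISE` decision value used in the demo are assumptions for illustration; real values should come from configuration and tests.

```python
from typing import Optional, Tuple


def simulator_gate(critic_decision: str,
                   violations_count: int,
                   prev_expected_harm: Optional[float],
                   draft_changed: bool,
                   harm_threshold: float = 0.05) -> Tuple[bool, str]:
    """Return (run_simulator, reason_code) for a SIMULATOR_GATE_DECISION event."""
    # Skip condition 1: the critic is clean.
    if critic_decision == "PROCEED" and violations_count == 0:
        return False, "CRITIC_CLEAN"
    # Skip condition 2: very low harm last cycle and no material draft change.
    if (prev_expected_harm is not None
            and prev_expected_harm <= harm_threshold
            and not draft_changed):
        return False, "LOW_PRIOR_HARM_UNCHANGED_DRAFT"
    # Conservative default: run the simulator whenever no skip condition holds.
    return True, "NO_SKIP_CONDITION_MET"


first_cycle = simulator_gate("REVISE", 2, None, True)
clean_cycle = simulator_gate("PROCEED", 0, None, True)
```

Defaulting to execution keeps the gate conservative: a missing or unexpected input can only cause the simulator to run, never to be skipped.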

Required events

  • SIMULATOR_GATE_DECISION
  • SIMULATOR_EXECUTED
  • SIMULATOR_SKIPPED

Required trace

  • module skip/execution information in CYCLE_SUMMARY

Expected benefit

Reduced cycle cost in clearly converged/clean cases without degrading governance quality.

Files

  • moralstack/orchestration/config_loader.py
  • moralstack/orchestration/deliberation_runner.py

F. Conservative early convergence

F1. Allow early cycle-1 convergence only when all signals strongly align

Change

Extend convergence logic so that cycle 1 can terminate early only when:

  • critic is clean;
  • expected harm is very low;
  • perspectives are strongly aligned;
  • no residual concern indicates the need for another cycle.
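
A sketch of the all-signals-must-align rule follows. The field names mirror the `CYCLE_SUMMARY` entries proposed in this issue, while the `CycleSignals` type, the thresholds (`0.05` and `0.9`), and the reason-code strings are assumptions, not values from convergence_evaluator.py.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class CycleSignals:
    """Hypothetical cycle-1 outputs used by the convergence check."""
    critic_decision: str
    violations_count: int
    semantic_expected_harm: float
    perspectives_weighted_approval: float
    residual_concerns: List[str] = field(default_factory=list)


def evaluate_early_convergence(signals: CycleSignals,
                               max_harm: float = 0.05,
                               min_approval: float = 0.9) -> Tuple[bool, List[str]]:
    """Accept early convergence only if every check passes.

    Returns (accepted, failed_checks); an empty list means all signals aligned.
    """
    failed = []
    if signals.critic_decision != "PROCEED" or signals.violations_count > 0:
        failed.append("CRITIC_NOT_CLEAN")
    if signals.semantic_expected_harm > max_harm:
        failed.append("EXPECTED_HARM_TOO_HIGH")
    if signals.perspectives_weighted_approval < min_approval:
        failed.append("PERSPECTIVES_NOT_ALIGNED")
    if signals.residual_concerns:
        failed.append("RESIDUAL_CONCERNS_PRESENT")
    return (not failed), failed


aligned = evaluate_early_convergence(CycleSignals("PROCEED", 0, 0.01, 0.95))
doubtful = evaluate_early_convergence(
    CycleSignals("PROCEED", 0, 0.01, 0.70, residual_concerns=["ambiguous intent"])
)
```

The list of failed checks maps naturally onto `CYCLE_SUMMARY.convergence_reason` for the `EARLY_CONVERGENCE_REJECTED` case.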

Required events

  • CONVERGENCE_EVALUATED
  • EARLY_CONVERGENCE_ACCEPTED
  • EARLY_CONVERGENCE_REJECTED

Required trace

  • CYCLE_SUMMARY.convergence_decision
  • CYCLE_SUMMARY.convergence_reason

Expected benefit

Avoid unnecessary second cycles in genuinely converged cases.

Files

  • moralstack/orchestration/convergence_evaluator.py

G. Multi-turn readiness

G1. Prepare persistence/schema for future conversational governance

Change

Add fields such as:

  • conversation_id
  • turn_index
  • parent_request_id

Future direction

This issue does not require full conversational governance yet, but the schema should be ready for it.

Files

  • moralstack/persistence/db.py
  • request persistence model(s)

G2. Lay groundwork for a future ConversationGovernanceState

Change

Prepare an optional runtime state model that can later support:

  • domain carry-forward;
  • overlay reuse;
  • prior posture carry-forward;
  • future delta-based refreshes.

Note

This is preparatory and should come last.


Testing requirements

The implementation must include or update tests for:

Persistence

  • orchestration_events creation, insertion, and retrieval
  • llm_calls backward-compatible extension

Observability/reporting

  • execution strategy view model
  • runtime decisions view model
  • cycle summary rendering
  • markdown export consistency

UI

  • request detail rendering of:
    • speculative used/discarded
    • scheduler strategy
    • simulator skipped/executed
    • convergence decisions
    • retrieval reuse summary

Runtime behavior

  • lazy speculative join
  • single retrieval + downstream reuse
  • dynamic scheduler selection
  • simulator gating
  • early convergence decisions

Acceptance criteria

This issue is complete only when all of the following are true:

  1. Latency optimizations are implemented without changing governance semantics.
  2. A first-class orchestration_events layer exists and is persisted.
  3. llm_calls can distinguish speculative/used/discarded/cancelled/skipped behavior.
  4. Request detail UI clearly shows execution strategy and runtime decisions.
  5. Cycle-level summaries explain what was executed, skipped, reused, or cancelled.
  6. Relevant-principles retrieval is performed once per request path and reused downstream.
  7. Speculative generation is no longer awaited in branches that do not consume it.
  8. Risk-aware scheduling between critic_gated and full_parallel is implemented and observable.
  9. Simulator gating and early convergence are both conservative, tested, and visible in traces/UI.
  10. Existing reports and raw diagnostics continue to work.

Recommended implementation order

Block A — Observability before optimization

  1. add orchestration_events
  2. extend llm_calls
  3. add new decision trace stages
  4. update report/view-model layer
  5. update request detail UI

Block B — Low-risk, high-ROI fixes

  1. make set_domain_keywords() idempotent
  2. reuse clients/policies
  3. strengthen structured outputs in risk/retrieval

Block C — Remove avoidable waits

  1. make speculative overlap truly lazy
  2. introduce request-scoped analysis context and retrieval reuse

Block D — Runtime-aware scheduling

  1. dynamic critic_gated vs full_parallel
  2. observable simulator gating
  3. conservative early convergence

Block E — Future-proofing

  1. conversation-ready schema fields
  2. runtime foundations for future conversation state

Risks and cautions

  • Do not implement multiple scheduler/gating changes at once without observability.
  • Do not rely on raw debug events as the primary UI source.
  • Do not reduce latency by weakening the deliberative process.
  • Do not change thresholds and scheduling policy in the same commit unless strongly justified and well tested.
  • Ensure backward compatibility for existing DB/report/UI paths wherever possible.

Expected outcome

After this issue is completed, MoralStack should provide:

Better latency

Through:

  • fewer unnecessary waits;
  • less duplicated retrieval work;
  • smarter scheduling;
  • fewer parse retries/jitter;
  • better cache behavior.

Better auditability

Through:

  • structured orchestration events;
  • richer traces;
  • semantically richer LLM call records;
  • a request detail page that explains runtime decisions clearly.

Better readiness for multi-turn governance

Through:

  • conversation-ready schema foundations;
  • explicit separation between LLM calls, audit traces, and runtime orchestration decisions.
