feat(moderation): add input moderation guard for the UI assistant #13
Merged
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ution) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Add GET /api/moderation/:type/:id (reports enabled state) and POST /api/moderation/:type/:id (runs moderation and returns allow/block verdict with refusal message). Includes service with timeout/fail-open logic and full API test suite (8 tests, all passing). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
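The timeout/fail-open behaviour described in this commit can be pictured with a minimal sketch. It is not the code from `api/src/moderation/service.ts`: the names `moderateFailOpen`, `Verdict` and `DEFAULT_TIMEOUT_MS` are hypothetical, and mapping a timeout or error to a `skip` verdict is an assumption (the PR only says the service is fail-open with a 1.5s timeout).

```typescript
// Hypothetical verdict shape; the real type in the PR may differ.
type Verdict = { action: "allow" | "block" | "skip"; reason?: string };

// 1.5s, matching the timeout quoted in the PR description.
const DEFAULT_TIMEOUT_MS = 1500;

// Runs the classifier but never lets it fail or stall the request:
// a timeout or a thrown error resolves to "skip" (assumed fail-open mapping).
async function moderateFailOpen(
  classify: () => Promise<Verdict>,
  timeoutMs: number = DEFAULT_TIMEOUT_MS,
): Promise<Verdict> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<Verdict>((resolve) => {
    timer = setTimeout(() => resolve({ action: "skip", reason: "timeout" }), timeoutMs);
  });
  try {
    // Whichever settles first wins: the model verdict or the timeout.
    return await Promise.race([classify(), timeout]);
  } catch {
    // Provider or parsing errors must not block the user message.
    return { action: "skip", reason: "error" };
  } finally {
    clearTimeout(timer);
  }
}
```

Under these assumptions a slow or failing moderator degrades to an unmoderated turn rather than a blocked one, which is the trade-off "fail-open" implies.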
…locks The session recorder now captures every moderation verdict via recordModerationDecision (action + category + reason + skipped), surfaced as a 'moderation: <verdict>' trace entry. Recording happens at the gate for any visible turn, with a post-stream safety net for turns that emit no visible output, so allow/skip decisions are auditable in the debug dialog (not just blocks). Stays client-side/ephemeral — no server log. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Render moderation trace entries with a localized action chip (Allowed/Blocked/Skipped, color-coded) plus labeled category and reason, instead of the generic JSON fallback. Adds an e2e asserting the renderer. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…s-on Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…eration Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…odule Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…marizer/assistant fallback Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…essage Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… timeout const Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… refusal Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…mposable Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ariants Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Integrate latest main (anonymous-action token on gateway/summary, progressive tool disclosure, cache metrics in debug traces, drop user name from prompt).

Conflict resolutions:
- api/src/models/mock-model.ts: keep both processMockModeratorPrompt and processSelectToolsSeam; take main's processForModel signature (adds tools?).
- ui/src/components/AgentChat.vue: main chat keeps both refusalMessage and toolExploration options; evaluator chat unmoderated (no refusalMessage).
- ui/src/composables/use-agent-chat.ts: keep refusalMessage + toolExploration options and MODERATION_TIMEOUT_MS + exploration state side by side. Moderation now rides main's gatewayFetch, so it inherits anonymous-token handling.
- docs/architecture.md: keep both sections; renumber tool-disclosure to §9.
- ui/dts/auto-imports.d.ts: union of both generated import sets.

Synced deps (npm install) to pick up @data-fair/lib-utils ^1.11.0 (needed by main's markdown.ts headingClasses). check-types and lint clean.

Known pre-existing environmental failure (NOT from this merge): the gateway "anonymous request with valid token succeeds" test fails with a JWT NotBeforeError due to clock skew on the long-running simple-directory container.
Add a per-message input-moderation guard for the UI-integrated assistant. Each new user message is classified by a configurable moderator model (falling back to the summarizer model) and out-of-scope / abusive / prompt-injection messages are blocked with a generic refusal before any assistant output is shown.
Why: protect the assistant from profanity, prompt-injection, persona-override, and out-of-scope abuse, without delaying the request — moderation runs concurrently with the assistant turn and only withholds the first visible byte until the verdict arrives.
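The "withhold only the first visible byte" idea can be sketched as follows. This is an illustrative model, not the gate in `use-agent-chat.ts`: `gateStream` and `Verdict` are hypothetical names, the assistant stream is reduced to an `AsyncIterable<string>`, and the caller is assumed to have started both the assistant request and the moderation request before calling it.

```typescript
// Hypothetical verdict shape used only for this sketch.
type Verdict = { action: "allow" | "block" | "skip" };

// Wraps an assistant stream so that nothing is emitted before the verdict.
// The verdict promise is already in flight, so on the allow path the only
// added latency is whatever remains of moderation when the first chunk arrives.
async function* gateStream(
  chunks: AsyncIterable<string>,
  verdict: Promise<Verdict>,
  refusalMessage: string,
): AsyncGenerator<string> {
  let decided = false;
  for await (const chunk of chunks) {
    if (!decided) {
      const { action } = await verdict; // hold the first chunk until decided
      decided = true;
      if (action === "block") {
        yield refusalMessage; // replace all assistant output with the refusal
        return;
      }
    }
    yield chunk;
  }
}
```

Because the verdict is only consulted when a chunk arrives, a stream that yields nothing never emits the refusal in this sketch.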
What changed:
- `api/src/moderation/{operations,service,router}.ts`: new module with a moderation prompt builder, tolerant verdict parser, model resolution (moderator → summarizer → skip), and a fail-open 1.5s-timeout service. `GET` returns the enabled flag, `POST` returns the verdict.
- `api/types/settings/schema.js`: new `moderator` model slot and a `moderation` settings block (`enabled`, `refusalMessage`).
- `ui/src/composables/use-agent-chat.ts`: parallel gate that buffers the stream until the verdict; a block shows the refusal and drops the message from model context.
- `ui/src/components/agent-chat/AgentChatDebugDialog.vue` + `session-recorder.ts`: dedicated trace renderer; every decision (allow/skip/block) is recorded client-side.
- `docs/architecture.md`: new "Input Moderation Guard" section.
- `mock-moderator` model variant + unit/api/e2e tests.
- `ui/src/components/vjsf/*` and `api/doc/settings/put-req/*` (build-types output).

Regression risks:
- `api/src/moderation/router.ts` only calls `reqSession`, with no `assertCanUseModel`/`assertRoleQuota`/`checkQuota`, unlike `gateway/router.ts` and `summary/router.ts`. Since `reqSession` permits anonymous sessions, any caller can `POST /api/moderation/<any-owner>/<id>` with arbitrary text and trigger that owner's configured model (consuming their provider API key/budget) with no quota accounting. `GET` also discloses any owner's moderation-enabled flag. Cheap per call, but ungated.
- `use-agent-chat.ts` records the verdict but does not show the refusal if the stream produced no `text-delta`/`tool-call`. Edge case; a normal turn always streams something.
- `replaceOne` now always writes a `moderation` field (defaulting to `{ enabled: false }`); existing settings docs without it are unaffected on read via `emptySettings`/`?? defaultModeration`.
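For readers unfamiliar with what a "tolerant verdict parser" might involve, here is a minimal sketch. It assumes the moderator model is asked to reply with a JSON object carrying `action`, `category` and `reason`, and that the reply may be wrapped in prose or code fences. The function name `parseVerdict`, the `ParsedVerdict` type and the reply format are all assumptions for illustration; the actual parser in `api/src/moderation/operations.ts` may work differently.

```typescript
// Hypothetical parsed shape; field names follow the trace fields mentioned in the PR.
type ParsedVerdict = { action: "allow" | "block"; category?: string; reason?: string };

// Extracts the outermost {...} span from a model reply and validates it.
// Returns null when nothing usable is found, leaving the fail-open
// decision to the caller.
function parseVerdict(raw: string): ParsedVerdict | null {
  const start = raw.indexOf("{");
  const end = raw.lastIndexOf("}");
  if (start === -1 || end <= start) return null;

  let parsed: unknown;
  try {
    parsed = JSON.parse(raw.slice(start, end + 1));
  } catch {
    return null; // malformed JSON: not a usable verdict
  }
  if (typeof parsed !== "object" || parsed === null) return null;

  const obj = parsed as Record<string, unknown>;
  // Accept "Block", " allow " and similar variations.
  const normalized = typeof obj.action === "string" ? obj.action.trim().toLowerCase() : "";
  let action: ParsedVerdict["action"];
  if (normalized === "allow") action = "allow";
  else if (normalized === "block") action = "block";
  else return null;

  return {
    action,
    category: typeof obj.category === "string" ? obj.category : undefined,
    reason: typeof obj.reason === "string" ? obj.reason : undefined,
  };
}
```

Returning `null` instead of throwing keeps the parser composable with a fail-open caller, which can treat an unparseable reply the same way as a timeout.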