
fix(databases-on-aws): correct DSQL type guidance, add cluster-lifecycle troubleshooting#155

Merged
krokoko merged 2 commits into awslabs:main from anwesham-lab:dsql-types
May 4, 2026
@anwesham-lab
Member

@anwesham-lab anwesham-lab commented May 4, 2026

Summary

  • Correct DSQL type guidance in the databases-on-aws skill: JSON is a supported column type (1 MiB, auto-compressed); JSONB, arrays, and INET remain runtime-only. Quick-reference type lists are replaced with pointers to the canonical AWS supported data types doc plus an awsknowledge verify row, so the skill does not drift as DSQL's type surface evolves.
  • Add two entries to troubleshooting.md under a new Cluster Lifecycle section: the FATAL: unable to accept connection, waking up cluster error emitted when connecting to an INACTIVE cluster, and the FailedPrecondition returned when backing up an IDLE/INACTIVE cluster. Links to the cluster lifecycle docs.
  • Remove stale JSON.stringify(...) wrapping in data-operations.md examples now that metadata is a JSON column.
  • Extend the Tier 2 functional eval suite with three new evals covering the updated behaviors (JSON column storage, array-column rejection, INACTIVE-cluster wake flow), plus the grader clauses they need. Add --eval-ids to the runner so subsets can be run without executing the full suite.
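
A minimal sketch of how a comma-separated `--eval-ids` flag could select a subset of evals. The helper names (`parse_eval_ids`, `select_evals`, `build_parser`) and the `{"id": ...}` eval shape are assumptions made for illustration; they are not taken from run_functional_evals.py.

```python
import argparse
import sys
from typing import Optional


def parse_eval_ids(raw: str) -> set[int]:
    """Turn "6,7,8" into {6, 7, 8}; argparse reports bad input as a usage error."""
    ids = {int(part) for part in raw.split(",") if part.strip()}
    if not ids:
        raise argparse.ArgumentTypeError("expected at least one eval id, e.g. 6,7,8")
    return ids


def select_evals(evals: list[dict], requested: Optional[set[int]]) -> list[dict]:
    """Keep only the requested evals and warn about ids that do not exist."""
    if requested is None:
        return evals  # no flag given: run the full suite
    selected = [item for item in evals if item["id"] in requested]
    missing = requested - {item["id"] for item in selected}
    if missing:
        print(f"warning: unknown eval ids {sorted(missing)}", file=sys.stderr)
    return selected


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Run functional evals (sketch)")
    parser.add_argument("--eval-ids", type=parse_eval_ids, default=None,
                        help="comma-separated eval ids, e.g. 6,7,8")
    return parser
```

With this shape, `--eval-ids 6,7,8` runs three evals, and an id that is not in the suite produces a warning rather than a silent no-op.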

Test plan

  • Verified against a live DSQL cluster in us-west-2:
    • CREATE TABLE ... (payload JSON) → accepted; information_schema reports data_type = json
    • CREATE TABLE ... (payload JSONB) → rejected with ERROR: datatype jsonb not supported
    • CREATE TABLE ... (tags TEXT[]) → rejected with ERROR: datatype text[] not supported
    • payload::jsonb->>'key', payload::jsonb @> '{...}'::jsonb, string_to_array(...) all succeed
  • Verified lifecycle behavior against a live cluster: INACTIVE cluster returns the documented FATAL error on first connection and reaches ACTIVE after polling; IDLE cluster wakes transparently
  • Ran the three new Tier 2 evals end-to-end via run_functional_evals.py --eval-ids 6,7,8: 10 / 10 expectations pass (eval 6: JSON column storage 3/3; eval 7: array storage guidance 3/3; eval 8: INACTIVE-cluster wake flow 4/4)
  • mise run build passes locally (lint, format, security scans green)

Generated with Claude Code


By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of the project license.

@anwesham-lab anwesham-lab requested review from a team as code owners May 4, 2026 10:47
@anwesham-lab anwesham-lab force-pushed the dsql-types branch 6 times, most recently from 6671e09 to 715fa78 Compare May 4, 2026 11:50
@anwesham-lab
Member Author

Multi-agent audit of PR #155

Spawned 20+ independent review agents across two orchestrations: (1) a 13-agent fleet covering code review, simplification, comments, test coverage, silent failures, type design, security, regex correctness, cross-refs, drift, docs fact-check, authoring-style audit, and an independent second-opinion code-reviewer; (2) the /code-review:code-review skill's 5 Sonnet reviewers with 7 Haiku confidence scorers.

All findings at ≥60 confidence were reviewed; legitimate ones below were applied to the commit. Findings below 60 or ruled false-positives after verification are listed for transparency.

Legitimate findings applied

| # | Source (agent / skill) | File:line | Confidence | Finding | Fix |
|---|---|---|---|---|---|
| 1 | silent-failure-hunter | run_functional_evals.py:354 | 95 | Anti-regression regex `must\s+store\s+json\s+as\s+text` is too narrow: misses "arrays/JSON as TEXT", "replace JSON columns with TEXT", passive forms | Replaced with a "does not claim JSONB is a valid column type" check (5-pattern DDL-context detection), which tests the actually load-bearing invariant |
| 2 | regex-audit | run_functional_evals.py:339 | 75 | `\bjson\b.*column` uses greedy unbounded `.*`; excludes jsonb via word boundary | Removed branch entirely; no longer needed with simplified eval 6 |
| 3 | regex-audit | run_functional_evals.py:364 | 85 | Array-unsupported regex misses "cannot", "not available", "not allowed", "only at runtime" | Widened alternation, expanded window to 80 chars |
| 4 | drift-audit / superpowers-code-reviewer | troubleshooting.md:100 | 90 | Stale "Or use JSON.stringify: \"..\"" contradicts new JSON-column guidance | Consolidated 4 bullets → 2; replaced JSON.stringify path with JSON column option |
| 5 | pr-test-analyzer | evals.json (missing eval) | 95 | FailedPrecondition backup-on-IDLE path added to troubleshooting.md but has no eval | Added eval 9: FailedPrecondition prompt with 3 expectations, 3 new grader branches, 3/3 PASS live |
| 6 | regex-audit | run_functional_evals.py:380 | 70 | Bare `\bseparator\b`, `\bdelimiter\b`, `\bcsv\b` fire anywhere in transcript (false positives) | Tightened to require array/tag co-occurrence within 120-char window |
| 7 | regex-audit | run_functional_evals.py:408 | 70 | Bare `inactive` matches any mention (false positives) | Tightened to `cluster.{0,60}inactive\|inactive.{0,60}cluster\|inactive\s+state\|in\s+the\s+inactive` |
| 8 | regex-audit | run_functional_evals.py:424 | 80 | Poll-until-ACTIVE window too tight (40 chars); verb set missing "sleep", "until", "keep checking" | Widened to 80 chars + verb alternation |
| 9 | regex-audit | run_functional_evals.py:432 | 85 | Retry-after-ACTIVE missed "reconnect", "re-establish", "open new connection" | Widened verb set |
| 10 | authoring-style / code-simplifier | development-guide.md:128 | 85 | Dense paragraph combining MUST + URLs + runtime-only caveat | Split to two one-line bullets |
| 11 | code-simplifier | SKILL.md:209 | 85 | Two rules on one bullet (MUST arrays... SHOULD JSON...) | Removed redundant SHOULD rule (JSON column type was tautological once JSONB-not-column-type was stated in dev-guide.md) |
| 12 | user feedback (live-test-driven) | troubleshooting.md:60 | | Eval 8 regression: agent said IDLE when the FATAL error is INACTIVE-only (verified against live cluster: INACTIVE returns FATAL, IDLE wakes transparently) | Added "The cluster is `INACTIVE` and waking up." disambiguation |
| 13 | user feedback (house-style) | multiple .md | | "SHOULD store JSON in a JSON column" is tautological given context that both JSON and TEXT are valid (verified: node-pg auto-serializes JS objects to both column types identically against live cluster) | Dropped redundant rule from SKILL.md, development-guide.md (2 sites), patterns.md, onboarding.md (2 sites) |
| 14 | user feedback | patterns.md:132 | | Section header "Data Serialization" misleading (rules are about runtime-only types, not serialization) | Renamed to "Runtime-Only Types" with clearer framing |
| 15 | drift-audit + context7 | examples/schema.md:24 | 85 | Pre-existing `metadata TEXT` now contradictory; flagged by superpowers agent as unfixed legacy | Investigated: both JSON and TEXT are valid, so no change needed (pre-existing shape is a valid choice, skill no longer prescribes one over the other) |
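
To illustrate finding 7, here is a small sketch of the tightened INACTIVE check, using the pattern quoted in the table. The function name `mentions_inactive_cluster` and the case-insensitive flag are assumptions; the actual grader may lowercase the transcript first instead.

```python
import re

# Pattern as quoted in finding 7: "inactive" only counts when it co-occurs with
# "cluster" inside a bounded 60-character window, or in a state-like phrase.
INACTIVE_CLUSTER = re.compile(
    r"cluster.{0,60}inactive"
    r"|inactive.{0,60}cluster"
    r"|inactive\s+state"
    r"|in\s+the\s+inactive",
    re.IGNORECASE,
)


def mentions_inactive_cluster(text: str) -> bool:
    """Return True only when the text ties INACTIVE to the cluster or its state."""
    return INACTIVE_CLUSTER.search(text) is not None
```

The bounded `.{0,60}` windows keep unrelated uses of the word (an inactive user, an inactive feature flag) from satisfying the expectation.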

False positives / low-confidence findings (verified and dismissed)

| # | Source | File:line | Confidence | Claim | Verification |
|---|---|---|---|---|---|
| A | shallow-bug-scan | run_functional_evals.py:489 | 0 | `missing` calc order wrong: computed before filtering | Read code: `eval_items` is filtered before `missing` is computed; `requested - {filtered_ids}` correctly yields `requested - all_available_ids`. Confirmed by type-design-analyzer and scoring-agent-2. |
| B | shallow-bug-scan | patterns.md:150 | 0 | `fromTextJSON` function removal may cause ReferenceError downstream | patterns.md is documentation markdown (example code snippets, not executed). `grep -rn fromTextJSON` finds no callers. |
| C | shallow-bug-scan / type-design | run_functional_evals.py:482 | 50 | `--eval-ids ""` triggers ValueError with ugly argparse message | Rare edge case; argparse does catch ValueError and prints usage error. Not worth churn. |
| D | code-simplifier | troubleshooting.md:64 | 45 | IDLE and INACTIVE used interchangeably; should be just INACTIVE | Verified against live cluster: distinct states with distinct behaviors. Backup error mentions both because the FailedPrecondition can apply to either state per AWS docs. |
| E | code-comments / prior-PR | multiple | 55 | Potential duplication between SKILL.md and development-guide.md per PR #66 precedent | Intentional per PR #66 author response: entry file holds quick-hit rules; reference file holds detail. Reviewed and kept the pattern. |
| F | security | patterns.md:145-162 | 55 | SQL examples missing `tenant_id` | Pre-existing; not touched by this PR. Flagged for a future cleanup PR. |
| G | context7 | troubleshooting.md:59 | | FATAL wire error string not literally in AWS docs | Reproduced live against cluster obtk34wjzneklebnsltvbsvjnu in us-west-2 during this PR; empirically observed, not a documentation claim |
| H | authoring-style | multiple | 55 | Casing drift (JSON vs json) | Intentional: uppercase backticked matches existing skill convention for all SQL type names |
| I | code-comments | development-guide.md:57 | 60 | MUST→SHOULD downgrade under section titled "Rules" | Original MUST was factually wrong (JSON column is supported). SHOULD is semantically correct since both JSON and TEXT work. |
| J | cross-references | all anchors/links | 0 | Broken links | `mise run lint:cross-refs` passes; anchors in AWS docs verified via WebFetch |

Agent fleet results

  • pr-review-toolkit agents: code-reviewer, code-simplifier, comment-analyzer, pr-test-analyzer, silent-failure-hunter, type-design-analyzer (6 agents total)
  • superpowers:code-reviewer: 1 independent second-opinion agent
  • general-purpose agents (specialized): security review, context7 documentation fact-check, authoring-style audit, regex correctness audit, cross-reference/link audit — 5 agents
  • Explore: drift / legacy-language audit — 1 agent
  • /code-review:code-review skill orchestration: 5 parallel Sonnet reviewers (CLAUDE.md audit, shallow bug scan, git blame history, prior PR comments, code-comments compliance) + 7 Haiku confidence scorers
  • Total: 22 review agents, 0 issues at ≥80 confidence per the /code-review:code-review rubric; 10 issues at ≥60 confidence consolidated across the broader audit

Live-cluster verification (us-west-2, cluster obtk34wjzneklebnsltvbsvjnu)

  • CREATE TABLE ... (payload JSON) → accepted; information_schema reports data_type = json
  • CREATE TABLE ... (payload JSONB) → rejected with ERROR: datatype jsonb not supported
  • CREATE TABLE ... (tags TEXT[]) → rejected with ERROR: datatype text[] not supported
  • payload::jsonb->>'key', payload::jsonb @> '{...}'::jsonb → work as expected
  • INACTIVE-state connection → FATAL: unable to accept connection, waking up cluster, please retry later (reproduced), transitions to ACTIVE after ~2 minutes
  • IDLE-state connection (cluster 4rtqsc5o7ejuixmltsa7w4ns6y) → wakes transparently, no FATAL error
  • node-postgres auto-serialization: both JSON and TEXT columns accept a raw JS object parameter (verified via direct pg.Client query; typeof stored text = string)
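
The INACTIVE wake flow observed above (FATAL on first connection, then ACTIVE after polling) can be sketched as a connect, poll, reconnect loop. This is an illustration under stated assumptions: `connect` and `get_status` are hypothetical callables supplied by the caller (for example, a driver connect call and a cluster-status lookup), and the poll interval and timeout are arbitrary defaults, not values from the skill.

```python
import time
from typing import Any, Callable

# Substring of the FATAL error observed on first connection to an INACTIVE cluster.
WAKING_UP = "unable to accept connection, waking up cluster"


def connect_with_wake(
    connect: Callable[[], Any],
    get_status: Callable[[], str],
    *,
    poll_seconds: float = 10.0,
    timeout_seconds: float = 300.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Any:
    """Connect once; on the wake-up error, poll until ACTIVE and reconnect."""
    try:
        return connect()
    except Exception as exc:
        if WAKING_UP not in str(exc):
            raise  # unrelated failure: do not mask it
    deadline = clock() + timeout_seconds
    while get_status() != "ACTIVE":
        if clock() >= deadline:
            raise TimeoutError("cluster did not reach ACTIVE before the timeout")
        sleep(poll_seconds)
    return connect()  # open a fresh connection once the cluster is ACTIVE
```

An IDLE cluster never reaches the `except` branch in this sketch, which matches the observation that IDLE wakes transparently on connection.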

Eval results (4 evals, live agent run against updated skill)

  • eval-6 (JSON column storage): 2/2 PASS
  • eval-7 (array storage): 2/2 PASS
  • eval-8 (INACTIVE cluster error): 4/4 PASS
  • eval-9 (FailedPrecondition backup): 3/3 PASS
  • Total: 11/11 (100%)

Force-pushed as commit 715fa78.

@anwesham-lab
Member Author

Round 2 audit summary

Second-round independent audit against HEAD 715fa78 — same fleet structure as round 1 (13 agents across pr-review-toolkit, superpowers, specialized general-purpose, Explore). Found issues the first round missed plus confirmed some earlier findings still held.

Legitimate findings applied in round 2

| # | Source (agent) | File:line | Confidence | Finding | Fix |
|---|---|---|---|---|---|
| 1 | context7 (awsknowledge MCP verification) | evals.json (eval 9 prompt) | 92 | `aws dsql create-cluster-backup` is a fabricated CLI command; it doesn't exist in `aws dsql help`. Real DSQL backups go through AWS Backup (`aws backup start-backup-job`) per AWS Backup for Aurora DSQL. Verified by running `aws dsql create-cluster-backup` locally → argument operation: Found invalid choice. | Changed eval 9 prompt to `aws backup start-backup-job`. Verified the correct command works locally: start-backup-job with the DSQL cluster ARN returns a valid BackupJobId when the cluster is ACTIVE |
| 2 | silent-failure-hunter + regex-audit + pr-test-analyzer + superpowers | run_functional_evals.py:346-360 | 95 | Anti-assertion "does not claim JSONB is a valid column type" silently passes on empty/truncated transcripts: no positive signal required before evaluating the negative clause. Three independent agents caught this. | Gated on `re.search(r"\bjsonb?\b", full_text)`: no JSON/JSONB mention means inconclusive → passed=False, evidence="No JSON/JSONB mention in transcript" |
| 3 | regex-audit + superpowers | run_functional_evals.py:348 | 85 | `\bjsonb\s+column\b` false-positives on negated correct answers: "don't use a jsonb column", "jsonb is not a valid column type"; both flip a CORRECT answer to FAIL | Rewrote as affirmative-patterns + negation-guard: `(use\|declare\|define\|create)\s+\w*\s*(a\s+)?jsonb\s+(column\|type)` excluding `(not\|don'?t\|cannot\|never\|avoid\|instead\s+of)\s+\w*\s*jsonb\s+(column\|type)` |
| 4 | superpowers + regex-audit | run_functional_evals.py:351 | 75 | `create\s+table.{0,200}\bjsonb\b` uses default single-line regex. Real CREATE TABLE DDL spans newlines, so a wrong answer with `CREATE TABLE users (\n id UUID,\n preferences JSONB,\n ...)` evades detection | Added re.DOTALL flag + used `[\s\S]{0,400}?` pattern so multi-line CREATE TABLE matches |
| 5 | (user feedback on my applied fix) | 4 markdown files | | "MUST store arrays as TEXT" was too narrow. Arrays can legitimately be serialized as TEXT or JSON. When should which be used? | Analyzed trade-offs: TEXT better for homogeneous short strings; JSON better when elements contain commas or aren't homogeneous. Aligned guidance across SKILL.md, development-guide.md, patterns.md, onboarding.md: MUST serialize arrays as TEXT or JSON. Added round-trip query-time cast guidance (`string_to_array(text, ',')` or `jsonb_array_elements_text(json::jsonb)`), both verified against live DSQL cluster |
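
Findings 2 to 4 combine into one grading shape: require a positive signal first, then look for affirmative JSONB-column claims while ignoring negated ones, and scan CREATE TABLE bodies across newlines. The sketch below is a simplified illustration of that shape, not the grader from this PR; the function name, the 30-character negation lookback, and the exact patterns are assumptions.

```python
import re

MENTIONS_JSON = re.compile(r"\bjsonb?\b")
# Affirmative phrasing such as "use a jsonb column" or "declare jsonb type".
AFFIRMS_JSONB_COLUMN = re.compile(
    r"\b(?:use|declare|define|create)\s+(?:an?\s+)?jsonb\s+(?:column|type)\b"
)
NEGATION = re.compile(r"\b(?:not|don'?t|cannot|never|avoid|instead\s+of)\b")
# [\s\S] matches newlines, so multi-line CREATE TABLE bodies are covered.
JSONB_IN_DDL = re.compile(r"create\s+table\b[\s\S]{0,400}?\bjsonb\b")


def grade_no_jsonb_column_claim(transcript: str) -> tuple[bool, str]:
    """Pass only when JSON is discussed and JSONB is never offered as a column type."""
    text = transcript.lower()
    if not MENTIONS_JSON.search(text):
        # Gate: an empty or off-topic transcript must not pass a negative check.
        return False, "No JSON/JSONB mention in transcript"
    if JSONB_IN_DDL.search(text):
        return False, "JSONB appears inside a CREATE TABLE statement"
    for match in AFFIRMS_JSONB_COLUMN.finditer(text):
        preceding = text[max(0, match.start() - 30):match.start()]
        if not NEGATION.search(preceding):
            return False, f"Affirmative JSONB column claim: {match.group(0)!r}"
    return True, "No affirmative JSONB column claim found"
```

Windowed heuristics like these still misjudge some phrasings, for example a correct `CREATE TABLE ... JSON` followed closely by a `::jsonb` cast.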

False positives / lower-confidence findings (verified and dismissed)

| # | Source | File:line | Confidence | Claim | Verification |
|---|---|---|---|---|---|
| A | context7 | troubleshooting.md:60 | 85 | "FATAL: unable to accept connection" framing should mention IDLE in addition to INACTIVE | Dismissed. Live-cluster verification established: IDLE wakes transparently on connection with no FATAL error; only INACTIVE emits the FATAL error on first connection. Agent receiving the FATAL error is by definition on an INACTIVE cluster. The existing "IDLE / INACTIVE" framing on the FailedPrecondition entry is already correct (that error applies to both) |
| B | comment-analyzer | data-operations.md:57,108 + patterns.md:151-154 | 90 | node-postgres doesn't auto-stringify plain JS objects; will store "[object Object]" | Empirically disproven. Ran live node-pg tests against DSQL cluster obtk34wjzneklebnsltvbsvjnu: both JSON and TEXT columns receive a raw JS object parameter correctly; node-pg's prepareObject calls JSON.stringify(val) internally for any non-primitive non-Date non-Buffer. Stored output is valid JSON (`{"theme":"dark",...}`), not "[object Object]" |
| C | silent-failure-hunter + pr code-reviewer | run_functional_evals.py:482-493 | 50-90 | `--eval-ids ""` raises ValueError with ugly argparse message; no argparse.ArgumentTypeError wrapping | Left as-is: edge case, argparse does catch ValueError and print a usage error. Not worth churn |
| D | authoring-style + code-simplifier | cross-file duplication of "cast to JSONB at query time" | 80-90 | Rule stated in 5 files violates §Maintenance "state each rule once" | Dismissed per PR #66 precedent. Prior reviewer on PR #66 raised this pattern; author (me) explained evaluation-driven decision to keep entry-file rules for reliability. Each site serves a different audience (SKILL.md = quick-hit for agent entry, development-guide.md = detail, troubleshooting.md = error-specific, patterns.md = example-centric, onboarding.md = checklist). Kept the pattern |
| E | code-simplifier | run_functional_evals.py:172 | 95 | Pre-existing eval-5 shadowing bug: "3,000 row" in `exp_lower` branch fires before batching AND 3,000 branch | Out of scope. Pre-existing in base commit d6faaf5, not introduced by this PR. Separate fix |
| F | pr-test-analyzer | evals.json:73 (eval 7) | 45 | Eval 7 "Can I use TEXT[]?" is yes/no; agent answering just "no, use TEXT" passes without naming runtime-only framing | Dismissed. Round 1 agent already observed that runtime-only is vocabulary, not required behavior. The skill's actual rule is "don't use TEXT[]", which the eval does test |
| G | security + regex-audit | all new regex patterns | | Catastrophic backtracking risk | Clean: all new patterns use bounded `.{0,N}` quantifiers with short, non-overlapping alternations. No nested unbounded quantifiers |
| H | cross-refs | all links/anchors | | Potential broken links | Clean: `mise run lint:cross-refs` passes; AWS docs anchors verified via WebFetch (200 OK); the `aws dsql create-cluster-backup` command the round-1 cross-refs agent verified turned out to be the fabricated one caught by context7 |
| I | drift audit | whole repo | | Residual JSON-as-TEXT guidance in other plugins or docs | Clean: no contradictions anywhere in docs/, other plugins, or top-level README |

Live-cluster verification this round

  • aws dsql create-cluster-backup → argument operation: Found invalid choice (fabricated)
  • aws backup start-backup-job --resource-arn arn:aws:dsql:...:cluster/... → returns valid BackupJobId when cluster is ACTIVE (verified live)
  • SELECT string_to_array('backend,api,database', ',') → returns {backend,api,database}
  • SELECT jsonb_array_elements_text('["backend","api","database"]'::json::jsonb) → returns 3 rows ✓
  • Plain JS object → node-pg → JSON or TEXT column: stored as valid JSON string in both cases ✓
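
The TEXT-versus-JSON trade-off for arrays can be shown from the application side. This is a hedged sketch with hypothetical helper names; it mirrors the guidance that delimiter-joined TEXT suits homogeneous short strings, while JSON is safer when elements may contain commas or mixed types.

```python
import json


def encode_tags_as_text(tags: list[str], sep: str = ",") -> str:
    """Join short, homogeneous strings into one TEXT value."""
    for tag in tags:
        if sep in tag:
            # A delimiter inside an element would corrupt the round trip.
            raise ValueError(f"tag {tag!r} contains the separator {sep!r}; use JSON instead")
    return sep.join(tags)


def decode_tags_from_text(value: str, sep: str = ",") -> list[str]:
    """Split a TEXT value back into a list (empty string means no tags)."""
    return value.split(sep) if value else []


def encode_items_as_json(items: list) -> str:
    """Serialize any JSON-compatible list, including elements with commas."""
    return json.dumps(items)


def decode_items_from_json(value: str) -> list:
    return json.loads(value)
```

On the SQL side, the matching query-time casts are the ones verified above: `string_to_array(...)` for the TEXT form and `jsonb_array_elements_text(...::jsonb)` for the JSON form.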

Round 2 regrade (existing transcripts, new grader)

  • eval-6: 2/2
  • eval-7: 2/2
  • eval-8: 4/4
  • eval-9: 3/3
  • Total: 11/11

Force-pushed as commit 9bc81e3.

@anwesham-lab anwesham-lab force-pushed the dsql-types branch 2 times, most recently from 33cae90 to 17cfc72 Compare May 4, 2026 13:26
gxjx-x previously approved these changes May 4, 2026
@anwesham-lab anwesham-lab force-pushed the dsql-types branch 2 times, most recently from 47f8e41 to 3c7a7f2 Compare May 4, 2026 13:55
@anwesham-lab anwesham-lab force-pushed the dsql-types branch 3 times, most recently from a2c1221 to 6c43cfd Compare May 4, 2026 14:19
…cle troubleshooting

Update DSQL skill to reflect current DSQL type support: JSON is a supported
column type (1 MiB, auto-compressed), while JSONB, arrays, and INET remain
runtime-only. Replace the narrow quick-reference type list with a pointer to
the canonical AWS docs and an awsknowledge verify-query row, so the skill
does not drift as DSQL's type surface evolves.

Add troubleshooting entries for the INACTIVE-cluster wake error and the
FailedPrecondition returned when backing up an IDLE/INACTIVE cluster, with a
pointer to the cluster-lifecycle documentation.

Extend the Tier 2 functional eval suite with four new evals covering the
updated behaviors — JSON column storage, array-column rejection, the
INACTIVE-cluster wake flow, and FailedPrecondition backup-on-idle — plus
grader clauses they need and an --eval-ids flag on the runner for targeted
subset runs.

Verified against live cluster: JSON accepted as column type; JSONB and TEXT[]
rejected with "datatype not supported"; ::jsonb cast + ->> / @> operators
work as expected. node-postgres auto-serializes JS objects for both JSON and
TEXT columns.

All four new evals pass 11/11 expectations when run against the updated skill.

Co-Authored-By: anwesham-lab <64298192+anwesham-lab@users.noreply.github.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@anwesham-lab
Member Author

anwesham-lab commented May 4, 2026

Rounds 3-9 condensed audit summary

Seven audit rounds after R1/R2. Each round ran the same fleet shape:

  • 14-agent independent audit (pr-code-reviewer, code-simplifier, comment-analyzer, pr-test-analyzer, silent-failure-hunter, type-design-analyzer, superpowers:code-reviewer, security, context7 docs fact-check, authoring-style, regex correctness, cross-refs, drift, + 1 Explore/general-purpose specialist)
  • 5-agent /code-review:code-review step-4 review (CLAUDE.md compliance, shallow bugs, git-blame history, prior PR comments, code comments)
  • ≥5-agent /code-review:code-review step-5 confidence scoring (one scorer per step-4 finding)

Total: 24+ agents per round.

  • R3-R5: step-4 reviewers were Sonnet; step-5 scorers were Haiku; independent audit fleet was Opus by default
  • R6 onwards: all agents (independent audit + step-4 + step-5) switched to Opus only — no Sonnet or Haiku anywhere in R6-R9

Per-round diff of what actually changed, with emphasis on markdown/content edits. Intermediate findings that were later rewritten or obsoleted are folded into their final resolution.


Round 3 — broad regex fleet + first code-review skill orchestration (Sonnet + Haiku)

Markdown edits applied:

| File | Finding | Fix |
|---|---|---|
| plugins/databases-on-aws/skills/dsql/references/onboarding.md | Duplicated "serialize as TEXT or JSON" rule (code-simplifier conf 90) | Deduped |
| plugins/databases-on-aws/skills/dsql/references/development-guide.md:89-92 | Grammar bug "instead implementation:" (code-simplifier conf 75) | Reworded to "instead implement it as:" |
| plugins/databases-on-aws/skills/dsql/references/development-guide.md | Duplicate MUST-verify-types rule at :57 and :127; same file §Maintenance violation (authoring-style, high) | Consolidated to a single site |
| plugins/databases-on-aws/skills/dsql/references/examples/patterns.md:136 | Two rules on one bullet (code-simplifier conf 55) | Split to two bullets |

Runner edits: 6 regex issues converged across 5 agents: affirmative `\bjsonb\s*(?:not null|,|\))` false-positives on `::jsonb)` / `::jsonb,`; duplicate `define` in the verb alternation; negation guard too narrow for "don't use a jsonb column"; negation guard global instead of local; positive gate matched bare "json". Rewrote the anti-regression check with balanced affirmative patterns plus a locally scoped negation guard.

Dismissed: aws backup start-backup-job doesn't target DSQL (factually wrong; verified live); duplicate cross-file rules (intentional per PR #66 precedent).


Round 4 — regression from R3's over-correction (Sonnet + Haiku)

Markdown edits applied:

| File | Finding | Fix |
|---|---|---|
| plugins/databases-on-aws/skills/dsql/references/development-guide.md (Supported Data Types section) | R3's dedup left the section as a signpost-to-signpost: it only pointed at Schema Design Rules, which itself only said "verify via awsknowledge" (code-simplifier, med-high) | Restored actionable text: MUST-verify + explicit pointer to canonical AWS docs + runtime-only callout; eliminated the two-hop indirection |

Runner edits: Grader-density cleanup, dead regex branches removed; 5 superpowers findings all rated "suggestion" with verdict "MERGE".

Dismissed: Minor stylistic comment on DDL-restriction overstatement.


Round 5 — silent-pass regression + live user regression (Sonnet + Haiku)

Markdown edits applied:

| File | Finding | Fix |
|---|---|---|
| plugins/databases-on-aws/skills/dsql/references/troubleshooting.md | User-driven live regression: eval-8 agent said "IDLE" when FATAL unable to accept connection, waking up cluster is INACTIVE-only (verified live against cluster obtk34wjzneklebnsltvbsvjnu: IDLE wakes transparently; only INACTIVE emits FATAL) | Added disambiguation line "The cluster is `INACTIVE` and waking up" |

Runner edits (material):

  • Critical silent-pass in evals 7/8/9 (pr-test-analyzer, conf 75+): full_text scope included tool results, so an agent that just read troubleshooting.md without synthesis passed 4/4. Fix: scoped grader to text only.
  • Sentence-trim order-dependent bug: for sep in (". ", "\n\n"): idx = window.rfind(sep) iterated sequentially, so the second iteration searched an already-trimmed window (type-design conf 85). Fix: compute max of both rfind results first.
  • "cast as jsonb" regression: R4's verb-alternation addition (as/of) false-failed correct cast advice. Fix: removed as/of from verb set.
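
The sentence-trim fix (search every separator against the untrimmed window, take the maximum, slice once) can be sketched as follows. The function name and the choice to keep the text after the latest boundary are assumptions for illustration; the runner's real window handling is not shown in this thread.

```python
SENTENCE_BREAKS = (". ", "\n\n")


def trim_to_current_sentence(window: str) -> str:
    """Keep only the text after the latest sentence or paragraph boundary."""
    cut = 0
    for sep in SENTENCE_BREAKS:
        idx = window.rfind(sep)  # always search the original, untrimmed window
        if idx != -1:
            cut = max(cut, idx + len(sep))
    return window[cut:]  # slice exactly once, after all separators are located
```

Because no separator is searched against an already-sliced string, the result does not depend on the order of `SENTENCE_BREAKS`.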

Dismissed: \bno\b over-blocks "no longer"/"no need" (rated PARTIAL 45); pre-existing eval-5 shadowing bug (out of scope).


Round 6 — pivot from regex to LLM-as-judge (first all-Opus round)

Convergence observation: Both code-simplifier opus and pr-test-analyzer opus independently concluded the regex grader had crossed the complexity threshold. R5's full_text→text fix introduced new false-negatives (agent says "state is INACTIVE" where INACTIVE appears only in tool result) — pure whack-a-mole between silent-pass and false-negative.

Decision: Migrate evals 6-9 from regex to LLM-as-judge.

Runner edits (material):

  • Added _llm_judge() function — shells to claude -p per expectation with the user prompt, assertion, and agent's final text; parses {passed, evidence} verdict
  • Deleted ~155 lines of regex elif branches for evals 6-9
  • evals.json gained "llm_judge": true flag per-eval
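
A rough sketch of a fail-closed judge call in the spirit of `_llm_judge()`. Only the `claude -p` invocation and the `{passed, evidence}` verdict shape come from the description above; the prompt wording, the `--model` pass-through, the timeout, and the assumption that stdout contains only a JSON object are illustrative guesses, and `llm_judge` / `build_judge_prompt` are hypothetical names.

```python
import json
import subprocess
from typing import Callable, Optional


def build_judge_prompt(user_prompt: str, assertion: str, final_text: str) -> str:
    """Assemble the judge prompt (wording is an assumption, not the PR's prompt)."""
    return (
        "You are grading an agent reply.\n"
        f"User prompt:\n{user_prompt}\n\n"
        f"Assertion to check:\n{assertion}\n\n"
        f"Agent final text:\n{final_text}\n\n"
        'Reply with only JSON: {"passed": true|false, "evidence": "..."}'
    )


def llm_judge(user_prompt: str, assertion: str, final_text: str, *,
              model: Optional[str] = None,
              run: Callable = subprocess.run,
              timeout_seconds: float = 120.0) -> dict:
    """Ask the judge for a verdict; every failure mode grades as passed=False."""
    cmd = ["claude", "-p", build_judge_prompt(user_prompt, assertion, final_text)]
    if model:
        cmd += ["--model", model]
    try:
        proc = run(cmd, capture_output=True, text=True, timeout=timeout_seconds)
    except (subprocess.TimeoutExpired, OSError) as exc:
        return {"passed": False, "evidence": f"judge did not run: {exc}"}
    if proc.returncode != 0:
        return {"passed": False, "evidence": f"judge exited with {proc.returncode}"}
    try:
        verdict = json.loads(proc.stdout)
        passed = verdict["passed"] is True  # anything but literal true fails
        evidence = str(verdict.get("evidence", ""))
    except (json.JSONDecodeError, AttributeError, KeyError, TypeError) as exc:
        return {"passed": False, "evidence": f"unparseable verdict: {exc!r}"}
    return {"passed": passed, "evidence": evidence}
```

Injecting `run` keeps the sketch testable without the CLI installed; a real runner would leave the default in place.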

Markdown edits applied:

| File | Finding | Fix |
|---|---|---|
| tools/evals/databases-on-aws/README.md | New grader architecture undocumented | Added Grader-modes section explaining regex (evals 1-5) vs LLM judge (evals 6-9), cost (~$0.01-0.05/expectation) and latency trade-offs |
| tools/evals/databases-on-aws/README.md | Stale eval counts ("7 evals / 23 assertions") after evals 6-9 were added | Updated table to 9 evals / 31 assertions |

Cumulative scorer (full-history re-audit of all 19 findings from R1-R5): only 1 finding at ≥80 still open — --eval-ids "" argparse UX nit (applied 30-second fix).


Round 7 — judge-path hardening + README sync (all Opus)

Runner edits:

  • _llm_judge caught JSONDecodeError only; silent-failure opus flagged AttributeError/TypeError/KeyError on non-dict replies (conf 75). Broadened exception catch with inline rationale
  • \{.*?\} non-greedy regex truncated valid verdicts with nested } in evidence strings (conf 70). First-pass fix (later rewritten in R8)
  • judge_model=args.model CLI wiring bug

Markdown edits: README eval-counts sync completed.

Verdict: superpowers opus and pr-code-reviewer opus both decisive "merge-ready". 5 deferrable follow-ups logged (per-expectation parallelism, temperature pinning, caching, TypedDict schema, extending LLM-judge to evals 2+5).


Round 8 — balanced-brace parser + judge/subject model split (all Opus)

Runner edits:

  • silent-failure opus (conf 75): R7's \{.*?\} non-greedy regex still failed on nested } inside quoted evidence. Wrote _extract_balanced_json_object() — stateful char-by-char parser tracking depth, quote-state, and backslash-escapes (respects JSON escape rules correctly)
  • pr-test-analyzer opus (conf 75): judge_model=args.model silently entangled subject and judge models — bumping --model to test a new subject silently swapped the judge, invalidating baseline grades. Added separate --judge-model CLI flag plumbed through grade_eval → _llm_judge
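
A balanced-brace extractor of the kind described for `_extract_balanced_json_object()` can be sketched as a small state machine. This is an independent illustration (the function name is shortened on purpose), tracking brace depth, whether the scanner is inside a string, and backslash escapes.

```python
import json
from typing import Optional


def extract_balanced_json_object(text: str) -> Optional[str]:
    """Return the first balanced {...} span in text, or None if there is none."""
    start = text.find("{")
    if start == -1:
        return None  # no opening brace at all
    depth = 0
    in_string = False
    escaped = False
    for i in range(start, len(text)):
        ch = text[i]
        if in_string:
            if escaped:
                escaped = False      # this character was escaped; skip it
            elif ch == "\\":
                escaped = True       # next character is escaped
            elif ch == '"':
                in_string = False    # closing quote
        elif ch == '"':
            in_string = True
        elif ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return text[start:i + 1]
    return None  # ran out of input before the braces balanced


reply = 'Verdict: {"passed": true, "evidence": "saw } and \\"x\\" {"} trailing prose'
verdict = json.loads(extract_balanced_json_object(reply))
```

Braces and quotes inside the evidence string no longer end the match early, which is the failure the non-greedy `\{.*?\}` regex had.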

Verdict: pr-code-reviewer opus: "MERGE-READY. No issues at confidence ≥60."


Round 9 — converged (all Opus)

Markdown edits applied:

| File | Finding | Fix |
|---|---|---|
| tools/evals/databases-on-aws/README.md | `--judge-model` flag undocumented (pr-code-reviewer opus, conf ~65) | Added one-liner in Grader-modes section noting the flag and recommending pinning the judge model across `--model` bumps |
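
Keeping the subject and judge models on separate flags might look like the sketch below. The flag names `--model` and `--judge-model` come from this thread; the parser layout, the `resolve_models` helper, and the model names used when calling it are placeholders.

```python
import argparse
from typing import Optional


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Functional eval runner (sketch)")
    parser.add_argument("--model", default=None,
                        help="model for the agent under test (the subject)")
    parser.add_argument("--judge-model", default=None,
                        help="model for the LLM judge; pin it across --model bumps")
    return parser


def resolve_models(args: argparse.Namespace) -> tuple[Optional[str], Optional[str]]:
    """Return (subject_model, judge_model) without letting one leak into the other."""
    return args.model, args.judge_model
```

Bumping `--model` to try a new subject leaves the judge untouched, so earlier grades stay comparable.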

All 14 other R9 categories clean:

| Category | Verdict |
|---|---|
| CLAUDE.md compliance | Scoped to databases-on-aws/**; no new deps; manifests/schemas untouched |
| Shallow bugs | Fail-closed judge paths, balanced-brace respects escapes/quotes, `sys.exit(main() or 0)` preserves int returns |
| Git history | No prior-PR bug regressions; mcp/tools/*/safe_query.py not touched |
| Prior PRs | No applicable feedback carries over |
| Code comments | All comments justify rationale; no WHAT restatements |
| Silent failures | `_extract_balanced_json_object` explicit None on unbalanced/no-opener; caller returns passed=False; `--eval-ids` warns on missing, aborts on empty |
| Test coverage | Timeout/non-zero/non-dict/invalid-JSON judge failures all fail-closed; warmup retry; no ≥6/10 gaps |
| Type design | `judge_model: str \| None` plumbed consistently |
| Regex correctness | Balanced-brace parser verified across 10 edge cases (nested, string-wrapped, escaped quotes/backslashes, unbalanced, multi-object, fenced prose) |
| Code simplification | No blocking simplifications; optional topic-keyword dict nit noted but not worth indirection |
| Authoring style | Imperative/prescriptive; RFC 2119 reserved for harm; canonical-doc pointers over embedded lists |
| Security | Bandit 0, Gitleaks 0, Checkov 71/71, Grype 0; 2 Semgrep findings pre-existing and unrelated |
| Cross-refs | 0 errors, 1 pre-existing warning |
| Doc drift | JSON/JSONB/array guidance internally consistent across SKILL.md, dev-guide, examples, mysql-migrations, onboarding, troubleshooting; eval counts match evals.json |
| Context7 fact-check | No new documented claims introduced |

Recurring dismissals (across multiple rounds)

| Source | Claim | Why dismissed |
|---|---|---|
| Multiple rounds | Cross-file duplication of "cast to JSONB at query time" violates §Maintenance | Intentional per PR #66 precedent: each site serves a different reader (SKILL.md quick-hit, dev-guide detail, troubleshooting error-specific, patterns example-centric, onboarding checklist) |
| R2 comment-analyzer | node-pg doesn't auto-stringify plain JS objects; would store "[object Object]" | Empirically disproven against live cluster; node-pg's prepareObject calls JSON.stringify internally for non-primitives |
| R3 pr-code-reviewer | `aws backup start-backup-job` doesn't target DSQL | Factually wrong; AWS Backup integrates with DSQL per docs; verified live BackupJobId returned |
| R5 regex-audit opus | Uppercase JSONB evades grader | Didn't read `text = result_text.lower()` at line 116; all 5 uppercase test vectors pass |
| R5+ code-simplifier | Pre-existing eval-5 shadowing bug | Out of scope: pre-existing in base commit d6faaf5 |
| Multiple | `--eval-ids ""` raises ValueError with argparse UX | Argparse does catch and print usage error |

Convergence path

  • R1-R2: rapid-iteration regex fixes + content corrections (CLI fabrication, stale bullets, tautological rules) — see prior comments for details
  • R3-R5: regex grader whack-a-mole under Sonnet+Haiku review — every fix introduced a new class of false positive or false negative
  • R6: decisive pivot from regex to LLM-as-judge for semantic evals, triggered by two Opus agents independently reaching the same conclusion; full fleet switched to Opus-only
  • R7-R8: hardening the new judge path under Opus review (exception breadth, balanced-brace JSON extraction, separate --judge-model flag)
  • R9 (all Opus): pure documentation of the --judge-model flag; no remaining blockers

Final HEAD: f967415. Ready to merge.

@anwesham-lab anwesham-lab requested a review from gxjx-x May 4, 2026 15:36
@krokoko
Contributor

krokoko commented May 4, 2026

Can you please bump the version of the plugin in the relevant sections? Thanks!

gxjx-x previously approved these changes May 4, 2026
Ships the DSQL skill updates from this branch: corrected type guidance
(JSON supported, JSONB/arrays/INET runtime-only), cluster-lifecycle
troubleshooting (INACTIVE wake, FailedPrecondition on IDLE/INACTIVE
backup), and the expanded Tier 2 eval suite (evals 6-9) with LLM-judge
grading and the --eval-ids/--judge-model runner flags.

Co-Authored-By: anwesham-lab <64298192+anwesham-lab@users.noreply.github.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@anwesham-lab
Member Author

Bumped databases-on-aws version 1.0.0 → 1.1.0 (a351c8b):

  • plugins/databases-on-aws/.claude-plugin/plugin.json
  • plugins/databases-on-aws/.codex-plugin/plugin.json
  • .claude-plugin/marketplace.json

Minor bump rather than patch since this ships new user-visible guidance (cluster-lifecycle troubleshooting, new type guidance for JSON/JSONB/arrays) and eval suite expansion, not just bug fixes.

HEAD: a351c8b.

@anwesham-lab
Member Author

> can you please bump the version of the plugin in relevant sections ? Thanks !

@krokoko done

Member

@scottschreckengaust scottschreckengaust left a comment


LGTM

@krokoko krokoko added this pull request to the merge queue May 4, 2026
Merged via the queue into awslabs:main with commit 5305a2b May 4, 2026
22 checks passed
6 participants