fix(databases-on-aws): correct DSQL type guidance, add cluster-lifecycle troubleshooting by anwesham-lab · Pull Request #155 · awslabs/agent-plugins

anwesham-lab · 2026-05-04T10:47:24Z

Summary

Correct DSQL type guidance in the databases-on-aws skill: JSON is a supported column type (1 MiB, auto-compressed); JSONB, arrays, and INET remain runtime-only. Quick-reference type lists are replaced with pointers to the canonical AWS supported data types doc plus an awsknowledge verify row, so the skill does not drift as DSQL's type surface evolves.
Add two entries to troubleshooting.md under a new Cluster Lifecycle section: the FATAL: unable to accept connection, waking up cluster error emitted when connecting to an INACTIVE cluster, and the FailedPrecondition returned when backing up an IDLE/INACTIVE cluster. Links to the cluster lifecycle docs.
Remove stale JSON.stringify(...) wrapping in data-operations.md examples now that metadata is a JSON column.
Extend the Tier 2 functional eval suite with three new evals covering the updated behaviors (JSON column storage, array-column rejection, INACTIVE-cluster wake flow), plus the grader clauses they need. Add --eval-ids to the runner so subsets can be run without executing the full suite.

Test plan

Verified against a live DSQL cluster in us-west-2:
- CREATE TABLE ... (payload JSON) → accepted; information_schema reports data_type = json
- CREATE TABLE ... (payload JSONB) → rejected with ERROR: datatype jsonb not supported
- CREATE TABLE ... (tags TEXT[]) → rejected with ERROR: datatype text[] not supported
- payload::jsonb->>'key', payload::jsonb @> '{...}'::jsonb, string_to_array(...) all succeed
Verified lifecycle behavior against a live cluster: INACTIVE cluster returns the documented FATAL error on first connection and reaches ACTIVE after polling; IDLE cluster wakes transparently
Ran the three new Tier 2 evals end-to-end via run_functional_evals.py --eval-ids 6,7,8: 10 / 10 expectations pass (eval 6: JSON column storage 3/3; eval 7: array storage guidance 3/3; eval 8: INACTIVE-cluster wake flow 4/4)
mise run build passes locally (lint, format, security scans green)

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of the project license.

anwesham-lab · 2026-05-04T11:51:06Z

Multi-agent audit of PR #155

Spawned 20+ independent review agents across two orchestrations: (1) a 13-agent fleet covering code review, simplification, comments, test coverage, silent failures, type design, security, regex correctness, cross-refs, drift, docs fact-check, authoring-style audit, and an independent second-opinion code-reviewer; (2) the /code-review:code-review skill's 5 Sonnet reviewers with 7 Haiku confidence scorers.

All findings at ≥60 confidence were reviewed, and the legitimate ones (first table below) were applied to the commit. Findings below 60, or ruled false positives after verification, are listed in the second table for transparency.

Legitimate findings applied

#	Source (agent / skill)	File:line	Confidence	Finding	Fix
1	silent-failure-hunter	`run_functional_evals.py:354`	95	Anti-regression regex `must\s+store\s+json\s+as\s+text` is too narrow — misses `arrays/JSON as TEXT`, `replace JSON columns with TEXT`, passive forms	Replaced with `does not claim JSONB is a valid column type` check (5-pattern DDL-context detection) — tests the actually load-bearing invariant
2	regex-audit	`run_functional_evals.py:339`	75	`\bjson\b.column` uses greedy unbounded `.`; excludes `jsonb` via word boundary	Removed branch entirely; no longer needed with simplified eval 6
3	regex-audit	`run_functional_evals.py:364`	85	Array-unsupported regex misses `cannot`, `not available`, `not allowed`, `only at runtime`	Widened alternation, expanded window to 80 chars
4	drift-audit / superpowers-code-reviewer	`troubleshooting.md:100`	90	Stale `"Or use JSON.stringify: \"..\""` contradicts new JSON-column guidance	Consolidated 4 bullets → 2; replaced `JSON.stringify` path with `JSON` column option
5	pr-test-analyzer	`evals.json` (missing eval)	95	`FailedPrecondition` backup-on-IDLE path added to `troubleshooting.md` but has no eval	Added eval 9: `FailedPrecondition` prompt with 3 expectations, 3 new grader branches, 3/3 PASS live
6	regex-audit	`run_functional_evals.py:380`	70	Bare `\bseparator\b`, `\bdelimiter\b`, `\bcsv\b` fire anywhere in transcript (false positives)	Tightened to require array/tag co-occurrence within 120-char window
7	regex-audit	`run_functional_evals.py:408`	70	Bare `inactive` matches any mention (false positives)	Tightened to `cluster.{0,60}inactive\|inactive.{0,60}cluster\|inactive\s+state\|in\s+the\s+inactive`
8	regex-audit	`run_functional_evals.py:424`	80	Poll-until-ACTIVE window too tight (40 chars); verb set missing `sleep`, `until`, `keep checking`	Widened to 80 chars + verb alternation
9	regex-audit	`run_functional_evals.py:432`	85	Retry-after-ACTIVE missed `reconnect`, `re-establish`, `open new connection`	Widened verb set
10	authoring-style / code-simplifier	`development-guide.md:128`	85	Dense paragraph combining MUST + URLs + runtime-only caveat	Split to two one-line bullets
11	code-simplifier	`SKILL.md:209`	85	Two rules on one bullet (`MUST arrays... SHOULD JSON...`)	Removed redundant SHOULD rule (JSON column type was tautological once JSONB-not-column-type was stated in dev-guide.md)
12	user feedback (live-test-driven)	`troubleshooting.md:60`	—	Eval 8 regression: agent said `IDLE` when the FATAL error is INACTIVE-only (verified against live cluster: INACTIVE returns FATAL, IDLE wakes transparently)	Added the disambiguation line "The cluster is `INACTIVE` and waking up."
13	user feedback (house-style)	multiple `.md`	—	`SHOULD store JSON in a JSON column` is tautological given context that both JSON and TEXT are valid (verified: node-pg auto-serializes JS objects to both column types identically against live cluster)	Dropped redundant rule from SKILL.md, development-guide.md (2 sites), patterns.md, onboarding.md (2 sites)
14	user feedback	`patterns.md:132`	—	Section header `Data Serialization` misleading (rules are about runtime-only types, not serialization)	Renamed to `Runtime-Only Types` with clearer framing
15	drift-audit + context7	`examples/schema.md:24`	85	Pre-existing `metadata TEXT` now contradictory — flagged by superpowers agent as unfixed legacy	Investigated: both JSON and TEXT are valid — no change needed (pre-existing shape is a valid choice, skill no longer prescribes one over the other)
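
Findings 6 and 7 both replace a bare keyword match with a bounded co-occurrence window. A minimal sketch of that idea, using the pattern quoted in finding 7 with the table's escaped pipes written as plain alternation; the function name is hypothetical and the lowercasing step is an assumption about how the grader normalizes text:

```python
import re

# Pattern text taken from finding 7; "inactive" must sit near "cluster"
# (within 60 characters) or inside a state phrase to count as a match.
INACTIVE_CLUSTER = re.compile(
    r"cluster.{0,60}inactive|inactive.{0,60}cluster|inactive\s+state|in\s+the\s+inactive"
)


def mentions_inactive_cluster(text: str) -> bool:
    """Return True only when "inactive" appears in a cluster-state context."""
    # Hypothetical helper: lowercase first so `INACTIVE` in prose still matches.
    return INACTIVE_CLUSTER.search(text.lower()) is not None
```

Because `.` does not match newlines by default, the window also stops at line breaks, which keeps unrelated mentions on neighboring lines from pairing up.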

False positives / low-confidence findings (verified and dismissed)

#	Source	File:line	Confidence	Claim	Verification
A	shallow-bug-scan	`run_functional_evals.py:489`	0	`missing` calc order wrong — computed before filtering	Read code: `eval_items` is filtered before `missing` is computed; `requested - {filtered_ids}` correctly yields `requested - all_available_ids`. Confirmed by type-design-analyzer and scoring-agent-2.
B	shallow-bug-scan	`patterns.md:150`	0	`fromTextJSON` function removal may cause `ReferenceError` downstream	`patterns.md` is documentation markdown — example code snippets, not executed. `grep -rn fromTextJSON` finds no callers.
C	shallow-bug-scan / type-design	`run_functional_evals.py:482`	50	`--eval-ids ""` triggers `ValueError` with ugly argparse message	Rare edge case; argparse does catch ValueError and prints usage error. Not worth churn.
D	code-simplifier	`troubleshooting.md:64`	45	`IDLE` and `INACTIVE` used interchangeably — should be just `INACTIVE`	Verified against live cluster: distinct states with distinct behaviors. Backup error mentions both because the FailedPrecondition can apply to either state per AWS docs.
E	code-comments / prior-PR	multiple	55	Potential duplication between SKILL.md and development-guide.md per PR #66 precedent	Intentional per PR #66 author response — entry file holds quick-hit rules; reference file holds detail. Reviewed and kept the pattern.
F	security	`patterns.md:145-162`	55	SQL examples missing `tenant_id`	Pre-existing; not touched by this PR. Flagged for a future cleanup PR.
G	context7	`troubleshooting.md:59`	—	FATAL wire error string not literally in AWS docs	Reproduced live against cluster `obtk34wjzneklebnsltvbsvjnu` in us-west-2 during this PR — empirically observed, not a documentation claim
H	authoring-style	multiple	55	Casing drift (`JSON` vs `json`)	Intentional — uppercase backticked matches existing skill convention for all SQL type names
I	code-comments	`development-guide.md:57`	60	MUST→SHOULD downgrade under section titled "Rules"	Original MUST was factually wrong (JSON column is supported). SHOULD is semantically correct since both JSON and TEXT work.
J	cross-references	all anchors/links	0	Broken links	`mise run lint:cross-refs` passes; anchors in AWS docs verified via WebFetch
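
Rows A and C both concern `--eval-ids` handling. The sketch below shows one way to parse the flag and compute the missing ids after filtering; it is not the runner's actual code. The function names and the `id` key on each eval are assumptions, and the `ArgumentTypeError` wrapping is the alternative row C chose not to adopt at this point:

```python
import argparse


def parse_eval_ids(raw: str) -> set[int]:
    """Parse "6,7,8" into {6, 7, 8}; reject empty input with a readable error."""
    parts = [p.strip() for p in raw.split(",") if p.strip()]
    if not parts:
        raise argparse.ArgumentTypeError("--eval-ids needs at least one id, e.g. 6,7,8")
    try:
        return {int(p) for p in parts}
    except ValueError as exc:
        raise argparse.ArgumentTypeError(f"eval ids must be integers: {raw!r}") from exc


def select_evals(evals: list[dict], requested: set[int]) -> tuple[list[dict], set[int]]:
    """Filter first, then derive the missing ids from what was actually found."""
    selected = [e for e in evals if e["id"] in requested]
    # Equivalent to requested - all_available_ids, because every id that
    # exists and was requested ends up in `selected`.
    missing = requested - {e["id"] for e in selected}
    return selected, missing
```

Passing `parse_eval_ids` as `type=` to `add_argument` would let argparse turn the `ArgumentTypeError` into a normal usage message.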

Agent fleet results

pr-review-toolkit agents: code-reviewer, code-simplifier, comment-analyzer, pr-test-analyzer, silent-failure-hunter, type-design-analyzer (6 agents)
superpowers:code-reviewer: 1 independent second-opinion agent
general-purpose agents (specialized): security review, context7 documentation fact-check, authoring-style audit, regex correctness audit, cross-reference/link audit — 5 agents
Explore: drift / legacy-language audit — 1 agent
/code-review:code-review skill orchestration: 5 parallel Sonnet reviewers (CLAUDE.md audit, shallow bug scan, git blame history, prior PR comments, code-comments compliance) + 7 Haiku confidence scorers
Total: 25 review agents (13 in the independent fleet, 5 Sonnet reviewers, 7 Haiku scorers), 0 issues at ≥80 confidence per the /code-review:code-review rubric; 10 issues at ≥60 confidence consolidated across the broader audit

Live-cluster verification (us-west-2, cluster `obtk34wjzneklebnsltvbsvjnu`)

CREATE TABLE ... (payload JSON) → accepted; information_schema reports data_type = json
CREATE TABLE ... (payload JSONB) → ERROR: datatype jsonb not supported
CREATE TABLE ... (tags TEXT[]) → ERROR: datatype text[] not supported
payload::jsonb->>'key', payload::jsonb @> '{...}'::jsonb → work as expected
INACTIVE-state connection → FATAL: unable to accept connection, waking up cluster, please retry later (reproduced), transitions to ACTIVE after ~2 minutes
IDLE-state connection (cluster 4rtqsc5o7ejuixmltsa7w4ns6y) → wakes transparently, no FATAL error
node-postgres auto-serialization: both JSON and TEXT columns accept a raw JS object parameter (verified via direct pg.Client query; typeof stored text = string)
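
The INACTIVE wake flow verified above (FATAL on first connection, poll until ACTIVE, then retry) can be sketched as below. This is an illustration under assumptions, not the skill's code: `connect` and `get_status` are caller-supplied callables, and the polling interval and timeout are placeholder values.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")

# Substring of the FATAL message reproduced against the live cluster.
WAKING_MARKER = "waking up cluster"


def connect_with_wake(
    connect: Callable[[], T],
    get_status: Callable[[], str],
    poll_seconds: float = 10.0,      # assumption: not a documented value
    timeout_seconds: float = 300.0,  # assumption: wake took ~2 minutes in the test above
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Connect; on the waking-up FATAL error, poll until ACTIVE, then retry once."""
    try:
        return connect()
    except Exception as exc:  # driver error types vary, so match on the message
        if WAKING_MARKER not in str(exc):
            raise
    waited = 0.0
    while get_status() != "ACTIVE":
        if waited >= timeout_seconds:
            raise TimeoutError("cluster did not reach ACTIVE in time")
        sleep(poll_seconds)
        waited += poll_seconds
    return connect()
```

An IDLE cluster never enters the polling branch, since it wakes transparently and the first `connect()` succeeds.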

Eval results (4 evals, live agent run against updated skill)

eval-6 (JSON column storage): 2/2 PASS
eval-7 (array storage): 2/2 PASS
eval-8 (INACTIVE cluster error): 4/4 PASS
eval-9 (FailedPrecondition backup): 3/3 PASS
Total: 11/11 (100%)

Force-pushed as commit 715fa78.

anwesham-lab · 2026-05-04T13:09:40Z

Round 2 audit summary

Second-round independent audit against HEAD 715fa78, using the same fleet structure as round 1 (13 agents across pr-review-toolkit, superpowers, specialized general-purpose, Explore). It found issues the first round missed and confirmed that some earlier findings still held.

Legitimate findings applied in round 2

#	Source (agent)	File:line	Confidence	Finding	Fix
1	context7 (awsknowledge MCP verification)	`evals.json` (eval 9 prompt)	92	`aws dsql create-cluster-backup` is a fabricated CLI command — doesn't exist in `aws dsql help`. Real DSQL backups go through AWS Backup (`aws backup start-backup-job`) per AWS Backup for Aurora DSQL. Verified by running `aws dsql create-cluster-backup` locally → `argument operation: Found invalid choice`.	Changed eval 9 prompt to `aws backup start-backup-job`. Verified the correct command works locally — `start-backup-job` with the DSQL cluster ARN returns a valid `BackupJobId` when the cluster is ACTIVE
2	silent-failure-hunter + regex-audit + pr-test-analyzer + superpowers	`run_functional_evals.py:346-360`	95	Anti-assertion `"does not claim JSONB is a valid column type"` silently passes on empty/truncated transcripts: no positive signal is required before evaluating the negative clause. Four independent agents caught this.	Gated on `re.search(r"\bjsonb?\b", full_text)`: no JSON/JSONB mention means inconclusive → `passed=False`, `evidence="No JSON/JSONB mention in transcript"`
3	regex-audit + superpowers	`run_functional_evals.py:348`	85	`\bjsonb\s+column\b` false-positives on negated correct answers: "don't use a jsonb column", "jsonb is not a valid column type" — both flip a CORRECT answer to FAIL	Rewrote as affirmative-patterns + negation-guard: `(use\|declare\|define\|create)\s+\w\s(a\s+)?jsonb\s+(column\|type)` excluding `(not\|don'?t\|cannot\|never\|avoid\|instead\s+of)\s+\w\sjsonb\s+(column\|type)`
4	superpowers + regex-audit	`run_functional_evals.py:351`	75	`create\s+table.{0,200}\bjsonb\b` uses default single-line regex. Real CREATE TABLE DDL spans newlines, so a wrong answer with `CREATE TABLE users (\n id UUID,\n preferences JSONB,\n ...)` evades detection	Added `re.DOTALL` flag + used `[\s\S]{0,400}?` pattern so multi-line CREATE TABLE matches
5	(user feedback on my applied fix)	4 markdown files	—	"MUST store arrays as TEXT" was too narrow. Arrays can legitimately be serialized as TEXT or JSON. When should which be used? Analyzed trade-offs: TEXT better for homogeneous short strings; JSON better when elements contain commas or aren't homogeneous	Aligned guidance across SKILL.md, development-guide.md, patterns.md, onboarding.md: `MUST serialize arrays as TEXT or JSON`. Added round-trip query-time cast guidance (`string_to_array(text, ',')` or `jsonb_array_elements_text(json::jsonb)`) — both verified against live DSQL cluster
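
Findings 2 and 4 can be illustrated together: a negative assertion that is gated on a positive signal, plus DDL detection that spans newlines. The sketch below is a simplified stand-in, not the PR's grader. The function name and result wording are hypothetical, and the `(?<!::)` lookbehind that skips `::jsonb` casts is an added assumption:

```python
import re

# Positive gate from finding 2: the transcript must mention JSON or JSONB at all.
JSON_MENTION = re.compile(r"\bjsonb?\b")
# Multi-line DDL check from finding 4; [\s\S] matches newlines, and the
# lookbehind (an assumption here) ignores query-time casts such as ::jsonb.
DDL_JSONB = re.compile(r"create\s+table[\s\S]{0,400}?(?<!::)\bjsonb\b")


def grade_no_jsonb_column_claim(transcript: str) -> dict:
    """Fail closed: the negative assertion passes only if the topic was discussed."""
    text = transcript.lower()
    if not JSON_MENTION.search(text):
        return {"passed": False, "evidence": "No JSON/JSONB mention in transcript"}
    if DDL_JSONB.search(text):
        return {"passed": False, "evidence": "JSONB used as a column type in CREATE TABLE"}
    return {"passed": True, "evidence": "JSON discussed; no JSONB column DDL found"}
```

An empty or truncated transcript now fails on the first check instead of passing by default.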

False positives / lower-confidence findings (verified and dismissed)

#	Source	File:line	Confidence	Claim	Verification
A	context7	`troubleshooting.md:60`	85	"FATAL: unable to accept connection" framing should mention IDLE in addition to INACTIVE	Dismissed. Live-cluster verification established: IDLE wakes transparently on connection with no FATAL error; only INACTIVE emits the FATAL error on first connection. Agent receiving the FATAL error is by definition on an INACTIVE cluster. The existing "IDLE / INACTIVE" framing on the FailedPrecondition entry is already correct (that error applies to both)
B	comment-analyzer	`data-operations.md:57,108` + `patterns.md:151-154`	90	`node-postgres` doesn't auto-stringify plain JS objects — will store `"[object Object]"`	Empirically disproven. Ran live node-pg tests against DSQL cluster `obtk34wjzneklebnsltvbsvjnu`: both `JSON` and `TEXT` columns receive a raw JS object parameter correctly; node-pg's `prepareObject` calls `JSON.stringify(val)` internally for any non-primitive non-Date non-Buffer. Stored output is valid JSON (`{"theme":"dark",...}`), not `"[object Object]"`
C	silent-failure-hunter + pr code-reviewer	`run_functional_evals.py:482-493`	50-90	`--eval-ids ""` raises `ValueError` with ugly argparse message; no `argparse.ArgumentTypeError` wrapping	Left as-is — edge case, argparse does catch ValueError and print a usage error. Not worth churn
D	authoring-style + code-simplifier	cross-file duplication of "cast to JSONB at query time"	80-90	Rule stated in 5 files violates §Maintenance "state each rule once"	Dismissed per PR #66 precedent. Prior reviewer on PR #66 raised this pattern; author (me) explained evaluation-driven decision to keep entry-file rules for reliability. Each site serves a different audience (SKILL.md = quick-hit for agent entry, development-guide.md = detail, troubleshooting.md = error-specific, patterns.md = example-centric, onboarding.md = checklist). Kept the pattern
E	code-simplifier	`run_functional_evals.py:172`	95	Pre-existing eval-5 shadowing bug: `"3,000 row" in exp_lower` branch fires before `batching AND 3,000` branch	Out of scope. Pre-existing in base commit `d6faaf5`, not introduced by this PR. Separate fix
F	pr-test-analyzer	`evals.json:73` (eval 7)	45	Eval 7 "Can I use TEXT[]?" is yes/no; agent answering just "no, use TEXT" passes without naming runtime-only framing	Dismissed. Round 1 agent already observed that runtime-only is vocabulary, not required behavior. The skill's actual rule is "don't use TEXT[]", which the eval does test
G	security + regex-audit	all new regex patterns	—	Catastrophic backtracking risk	Clean — all new patterns use bounded `.{0,N}` quantifiers with short, non-overlapping alternations. No nested unbounded quantifiers
H	cross-refs	all links/anchors	—	Potential broken links	Clean — `mise run lint:cross-refs` passes; AWS docs anchors verified via WebFetch (200 OK); the `aws dsql create-cluster-backup` command the round-1 cross-refs agent verified turned out to be the fabricated one caught by context7
I	drift audit	whole repo	—	Residual JSON-as-TEXT guidance in other plugins or docs	Clean — no contradictions anywhere in `docs/`, other plugins, or top-level README

Live-cluster verification this round

aws dsql create-cluster-backup → argument operation: Found invalid choice (fabricated)
aws backup start-backup-job --resource-arn arn:aws:dsql:...:cluster/... → returns valid BackupJobId when cluster is ACTIVE (verified live)
SELECT string_to_array('backend,api,database', ',') → returns {backend,api,database} ✓
SELECT jsonb_array_elements_text('["backend","api","database"]'::json::jsonb) → returns 3 rows ✓
Plain JS object → node-pg → JSON or TEXT column: stored as valid JSON string in both cases ✓
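
The TEXT-versus-JSON trade-off for arrays from finding 5 comes down to whether elements can contain the separator. A small client-side sketch, with hypothetical helper names; the stored values would be read back with `string_to_array(text, ',')` or `jsonb_array_elements_text(json::jsonb)` as verified above:

```python
import json


def tags_to_text(tags: list[str], sep: str = ",") -> str:
    """Delimited TEXT: compact, but only safe when no element contains the separator."""
    for tag in tags:
        if sep in tag:
            raise ValueError(f"{tag!r} contains the separator {sep!r}; use JSON instead")
    return sep.join(tags)


def tags_to_json(tags: list) -> str:
    """JSON text: tolerates commas inside elements and mixed element types."""
    return json.dumps(tags)


def text_to_tags(value: str, sep: str = ",") -> list[str]:
    """Client-side counterpart of string_to_array for the delimited form."""
    return value.split(sep) if value else []
```

Raising on an embedded separator is a design choice for the sketch: it surfaces the case where the delimited form would silently split one element into two.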

Round 2 regrade (existing transcripts, new grader)

eval-6: 2/2
eval-7: 2/2
eval-8: 4/4
eval-9: 3/3
Total: 11/11

Force-pushed as commit 9bc81e3.

…cle troubleshooting

Update DSQL skill to reflect current DSQL type support: JSON is a supported column type (1 MiB, auto-compressed), while JSONB, arrays, and INET remain runtime-only. Replace the narrow quick-reference type list with a pointer to the canonical AWS docs and an awsknowledge verify-query row, so the skill does not drift as DSQL's type surface evolves.

Add troubleshooting entries for the INACTIVE-cluster wake error and the FailedPrecondition returned when backing up an IDLE/INACTIVE cluster, with a pointer to the cluster-lifecycle documentation.

Extend the Tier 2 functional eval suite with four new evals covering the updated behaviors (JSON column storage, array-column rejection, the INACTIVE-cluster wake flow, and FailedPrecondition backup-on-idle), plus the grader clauses they need and an --eval-ids flag on the runner for targeted subset runs.

Verified against live cluster: JSON accepted as column type; JSONB and TEXT[] rejected with "datatype not supported"; ::jsonb cast + ->> / @> operators work as expected. node-postgres auto-serializes JS objects for both JSON and TEXT columns. All four new evals pass 11/11 expectations when run against the updated skill.

Co-Authored-By: anwesham-lab <64298192+anwesham-lab@users.noreply.github.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

anwesham-lab · 2026-05-04T14:27:46Z

Rounds 3-9 condensed audit summary

Seven audit rounds after R1/R2. Each round ran the same fleet shape:

14-agent independent audit (pr-code-reviewer, code-simplifier, comment-analyzer, pr-test-analyzer, silent-failure-hunter, type-design-analyzer, superpowers:code-reviewer, security, context7 docs fact-check, authoring-style, regex correctness, cross-refs, drift, + 1 Explore/general-purpose specialist)
5-agent /code-review:code-review step-4 review (CLAUDE.md compliance, shallow bugs, git-blame history, prior PR comments, code comments)
≥5-agent /code-review:code-review step-5 confidence scoring (one scorer per step-4 finding)

Total: 24+ agents per round.

R3-R5: step-4 reviewers were Sonnet; step-5 scorers were Haiku; independent audit fleet was Opus by default
R6 onwards: all agents (independent audit + step-4 + step-5) switched to Opus only — no Sonnet or Haiku anywhere in R6-R9

Per-round diff of what actually changed, with emphasis on markdown/content edits. Intermediate findings that were later rewritten or obsoleted are folded into their final resolution.

Round 3 — broad regex fleet + first code-review skill orchestration (Sonnet + Haiku)

Markdown edits applied:

File	Finding	Fix
`plugins/databases-on-aws/skills/dsql/references/onboarding.md`	Duplicated "serialize as TEXT or JSON" rule (code-simplifier conf 90)	Deduped
`plugins/databases-on-aws/skills/dsql/references/development-guide.md:89-92`	Grammar bug "instead implementation:" (code-simplifier conf 75)	Reworded to "instead implement it as:"
`plugins/databases-on-aws/skills/dsql/references/development-guide.md`	Duplicate MUST-verify-types rule at `:57` and `:127` — same file §Maintenance violation (authoring-style, high)	Consolidated to a single site
`plugins/databases-on-aws/skills/dsql/references/examples/patterns.md:136`	Two rules on one bullet (code-simplifier conf 55)	Split to two bullets

Runner edits: 6 regex issues converged across 5 agents: the affirmative pattern `\bjsonb\s*(?:not null\|,\|\))` false-positived on `::jsonb)` and `::jsonb,`; `define` was duplicated in the verb alternation; the negation guard was too narrow for "don't use a jsonb column"; the negation guard was global instead of local; and the positive gate matched bare "json". Rewrote the anti-regression check with a balanced affirmative pattern plus a locally scoped negation guard.

Dismissed: aws backup start-backup-job doesn't target DSQL (factually wrong; verified live); duplicate cross-file rules (intentional per PR #66 precedent).

Round 4 — regression from R3's over-correction (Sonnet + Haiku)

Markdown edits applied:

File	Finding	Fix
`plugins/databases-on-aws/skills/dsql/references/development-guide.md` (Supported Data Types section)	R3's dedup left the section as a signpost-to-signpost — it only pointed at Schema Design Rules which itself only said "verify via awsknowledge" (code-simplifier, med-high)	Restored actionable text: MUST-verify + explicit pointer to canonical AWS docs + runtime-only callout — eliminated the two-hop indirection

Runner edits: Grader-density cleanup, dead regex branches removed; 5 superpowers findings all rated "suggestion" with verdict "MERGE".

Dismissed: Minor stylistic comment on DDL-restriction overstatement.

Round 5 — silent-pass regression + live user regression (Sonnet + Haiku)

Markdown edits applied:

File	Finding	Fix
`plugins/databases-on-aws/skills/dsql/references/troubleshooting.md`	User-driven live regression: eval-8 agent said "IDLE" when FATAL `unable to accept connection, waking up cluster` is INACTIVE-only (verified live against cluster `obtk34wjzneklebnsltvbsvjnu` — IDLE wakes transparently; only INACTIVE emits FATAL)	Added disambiguation line "The cluster is `INACTIVE` and waking up"

Runner edits (material):

Critical silent-pass in evals 7/8/9 (pr-test-analyzer, conf 75+): full_text scope included tool results, so an agent that just read troubleshooting.md without synthesis passed 4/4. Fix: scoped grader to text only.
Sentence-trim order-dependent bug: `for sep in (". ", "\n\n"): idx = window.rfind(sep)` iterated sequentially, so the second iteration searched an already-trimmed window (type-design conf 85). Fix: compute the max of both `rfind` results first.
"cast as jsonb" regression: R4's verb-alternation addition (as/of) false-failed correct cast advice. Fix: removed as/of from verb set.
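
The sentence-trim fix above (take the latest boundary across all separators before cutting anything) might look like the following. The function name is hypothetical, and keeping the text after the boundary is an assumption about what the grader does with the window:

```python
def trim_to_sentence_start(window: str) -> str:
    """Drop any partial leading sentences by cutting at the last boundary found."""
    separators = (". ", "\n\n")
    # Search the untrimmed window for every separator before cutting, so the
    # result does not depend on the order the separators are listed in.
    cuts = [window.rfind(sep) + len(sep) for sep in separators if sep in window]
    return window[max(cuts):] if cuts else window
```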

Dismissed: \bno\b over-blocks "no longer"/"no need" (rated PARTIAL 45); pre-existing eval-5 shadowing bug (out of scope).

Round 6 — pivot from regex to LLM-as-judge (first all-Opus round)

Convergence observation: Both code-simplifier opus and pr-test-analyzer opus independently concluded the regex grader had crossed the complexity threshold. R5's full_text→text fix introduced new false-negatives (agent says "state is INACTIVE" where INACTIVE appears only in tool result) — pure whack-a-mole between silent-pass and false-negative.

Decision: Migrate evals 6-9 from regex to LLM-as-judge.

Runner edits (material):

Added _llm_judge() function — shells to claude -p per expectation with the user prompt, assertion, and agent's final text; parses {passed, evidence} verdict
Deleted ~155 lines of regex elif branches for evals 6-9
evals.json gained "llm_judge": true flag per-eval
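
A rough sketch of the judge call described above, written to fail closed. It is not the PR's `_llm_judge()`: the prompt wording, timeout, and the injectable `run` parameter are assumptions, and it expects the reply to be bare JSON. Only `claude -p` itself comes from the description.

```python
import json
import subprocess
from typing import Callable


def run_claude(prompt: str) -> str:
    """Shell out to `claude -p`; the 120 s timeout is a placeholder value."""
    proc = subprocess.run(
        ["claude", "-p", prompt], capture_output=True, text=True, timeout=120, check=True
    )
    return proc.stdout


def llm_judge(
    user_prompt: str,
    assertion: str,
    final_text: str,
    run: Callable[[str], str] = run_claude,
) -> dict:
    """Ask the judge for a {passed, evidence} verdict; any failure grades as not passed."""
    prompt = (
        "Reply with JSON only: {\"passed\": true|false, \"evidence\": \"...\"}.\n"
        f"User prompt: {user_prompt}\nAssertion: {assertion}\nAgent answer: {final_text}"
    )
    try:
        verdict = json.loads(run(prompt))
        # Require a literal JSON true so a string like "false" cannot pass.
        return {"passed": verdict["passed"] is True, "evidence": str(verdict["evidence"])}
    except (subprocess.SubprocessError, OSError, json.JSONDecodeError,
            AttributeError, TypeError, KeyError) as exc:
        return {"passed": False, "evidence": f"judge error: {type(exc).__name__}"}
```

Injecting `run` keeps the grading logic testable without invoking the CLI.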

Markdown edits applied:

File	Finding	Fix
`tools/evals/databases-on-aws/README.md`	New grader architecture undocumented	Added Grader-modes section explaining regex (evals 1-5) vs LLM judge (evals 6-9), cost (~$0.01-0.05/expectation) and latency trade-offs
`tools/evals/databases-on-aws/README.md`	Stale eval counts ("7 evals / 23 assertions") after evals 6-9 were added	Updated table to 9 evals / 31 assertions

Cumulative scorer (full-history re-audit of all 19 findings from R1-R5): only 1 finding at ≥80 still open — --eval-ids "" argparse UX nit (applied 30-second fix).

Round 7 — judge-path hardening + README sync (all Opus)

Runner edits:

_llm_judge caught JSONDecodeError only; silent-failure opus flagged AttributeError/TypeError/KeyError on non-dict replies (conf 75). Broadened exception catch with inline rationale
\{.*?\} non-greedy regex truncated valid verdicts with nested } in evidence strings (conf 70). First-pass fix (later rewritten in R8)
judge_model=args.model CLI wiring bug

Markdown edits: README eval-counts sync completed.

Verdict: superpowers opus and pr-code-reviewer opus both decisive "merge-ready". 5 deferrable follow-ups logged (per-expectation parallelism, temperature pinning, caching, TypedDict schema, extending LLM-judge to evals 2+5).

Round 8 — balanced-brace parser + judge/subject model split (all Opus)

Runner edits:

silent-failure opus (conf 75): R7's \{.*?\} non-greedy regex still failed on nested } inside quoted evidence. Wrote _extract_balanced_json_object() — stateful char-by-char parser tracking depth, quote-state, and backslash-escapes (respects JSON escape rules correctly)
pr-test-analyzer opus (conf 75): judge_model=args.model silently entangled subject and judge models — bumping --model to test a new subject silently swapped the judge, invalidating baseline grades. Added separate --judge-model CLI flag plumbed through grade_eval → _llm_judge
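
The balanced-brace extraction can be sketched as a small state machine over depth, quote state, and backslash escapes. This is an illustrative reimplementation, not the PR's `_extract_balanced_json_object()`; `parse_verdict` is a hypothetical caller added to show the fail-closed `None` path.

```python
import json
from typing import Optional


def extract_balanced_json_object(text: str) -> Optional[str]:
    """Return the first brace-balanced {...} span, or None if there is none."""
    start = text.find("{")
    if start == -1:
        return None
    depth = 0
    in_string = False
    escaped = False
    for i in range(start, len(text)):
        ch = text[i]
        if in_string:
            if escaped:
                escaped = False      # this character was escaped; ignore it
            elif ch == "\\":
                escaped = True       # next character is escaped
            elif ch == '"':
                in_string = False    # closing quote
            continue                 # braces inside strings never count
        if ch == '"':
            in_string = True
        elif ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return text[start : i + 1]
    return None                      # unbalanced input


def parse_verdict(reply: str) -> Optional[dict]:
    """Extract and decode a verdict; None signals the caller to grade as failed."""
    raw = extract_balanced_json_object(reply)
    if raw is None:
        return None
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        return None
    return parsed if isinstance(parsed, dict) else None
```

Unlike the non-greedy `\{.*?\}` regex, a `}` inside a quoted evidence string does not end the match here.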

Verdict: pr-code-reviewer opus: "MERGE-READY. No issues at confidence ≥60."

Round 9 — converged (all Opus)

Markdown edits applied:

File	Finding	Fix
`tools/evals/databases-on-aws/README.md`	`--judge-model` flag undocumented (pr-code-reviewer opus, conf ~65)	Added one-liner in Grader-modes section noting the flag and recommending pinning the judge model across `--model` bumps
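
To illustrate why the two flags are kept separate, here is a hypothetical slice of the runner's CLI. The parser layout, the `judge_command` helper, and forwarding the judge model through the CLI's `--model` option are assumptions; only the `--model` and `--judge-model` flag names come from the rounds above.

```python
import argparse
from typing import Optional


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Hypothetical slice of the eval runner CLI")
    parser.add_argument("--model", default=None, help="model under test (the subject)")
    parser.add_argument(
        "--judge-model", default=None, help="model that grades; pin it across --model bumps"
    )
    return parser


def judge_command(prompt: str, judge_model: Optional[str]) -> list[str]:
    """Build the judge invocation from the judge model only, never from --model."""
    cmd = ["claude", "-p", prompt]
    if judge_model is not None:
        cmd += ["--model", judge_model]
    return cmd
```

With this split, changing `--model` to test a new subject leaves the judge invocation unchanged, so earlier grades stay comparable.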

All 15 other R9 categories clean:

Category	Verdict
CLAUDE.md compliance	Scoped to `databases-on-aws/**`; no new deps; manifests/schemas untouched
Shallow bugs	Fail-closed judge paths, balanced-brace respects escapes/quotes, `sys.exit(main() or 0)` preserves int returns
Git history	No prior-PR bug regressions; `mcp/tools/*`/`safe_query.py` not touched
Prior PRs	No applicable feedback carries over
Code comments	All comments justify rationale; no WHAT restatements
Silent failures	`_extract_balanced_json_object` explicit `None` on unbalanced/no-opener; caller returns `passed=False`; `--eval-ids` warns on missing, aborts on empty
Test coverage	Timeout/non-zero/non-dict/invalid-JSON judge failures all fail-closed; warmup retry; no ≥6/10 gaps
Type design	`judge_model: str \| None` plumbed consistently
Regex correctness	Balanced-brace parser verified across 10 edge cases (nested, string-wrapped, escaped quotes/backslashes, unbalanced, multi-object, fenced prose)
Code simplification	No blocking simplifications; optional topic-keyword dict nit noted but not worth indirection
Authoring style	Imperative/prescriptive; RFC 2119 reserved for harm; canonical-doc pointers over embedded lists
Security	Bandit 0, Gitleaks 0, Checkov 71/71, Grype 0; 2 Semgrep findings pre-existing and unrelated
Cross-refs	0 errors, 1 pre-existing warning
Doc drift	JSON/JSONB/array guidance internally consistent across SKILL.md, dev-guide, examples, mysql-migrations, onboarding, troubleshooting; eval counts match `evals.json`
Context7 fact-check	No new documented claims introduced

Recurring dismissals (across multiple rounds)

Source	Claim	Why dismissed
Multiple rounds	Cross-file duplication of "cast to JSONB at query time" violates §Maintenance	Intentional per PR #66 precedent — each site serves a different reader (SKILL.md quick-hit, dev-guide detail, troubleshooting error-specific, patterns example-centric, onboarding checklist)
R2 comment-analyzer	node-pg doesn't auto-stringify plain JS objects — would store `"[object Object]"`	Empirically disproven against live cluster; node-pg's `prepareObject` calls `JSON.stringify` internally for non-primitives
R3 pr-code-reviewer	`aws backup start-backup-job` doesn't target DSQL	Factually wrong; AWS Backup integrates with DSQL per docs; verified live `BackupJobId` returned
R5 regex-audit opus	Uppercase `JSONB` evades grader	Didn't read `text = result_text.lower()` at line 116 — all 5 uppercase test vectors pass
R5+ code-simplifier	Pre-existing eval-5 shadowing bug	Out of scope — pre-existing in base commit `d6faaf5`
Multiple	`--eval-ids ""` raises ValueError with argparse UX	Argparse does catch and print usage error

Convergence path

R1-R2: rapid-iteration regex fixes + content corrections (CLI fabrication, stale bullets, tautological rules) — see prior comments for details
R3-R5: regex grader whack-a-mole under Sonnet+Haiku review — every fix introduced a new class of false positive or false negative
R6: decisive pivot from regex to LLM-as-judge for semantic evals, triggered by two Opus agents independently reaching the same conclusion; full fleet switched to Opus-only
R7-R8: hardening the new judge path under Opus review (exception breadth, balanced-brace JSON extraction, separate --judge-model flag)
R9 (all Opus): pure documentation of the --judge-model flag; no remaining blockers

Final HEAD: f967415. Ready to merge.

krokoko · 2026-05-04T18:13:23Z

Can you please bump the version of the plugin in the relevant sections? Thanks!

Ships the DSQL skill updates from this branch: corrected type guidance (JSON supported, JSONB/arrays/INET runtime-only), cluster-lifecycle troubleshooting (INACTIVE wake, FailedPrecondition on IDLE/INACTIVE backup), and the expanded Tier 2 eval suite (evals 6-9) with LLM-judge grading and the --eval-ids/--judge-model runner flags.

Co-Authored-By: anwesham-lab <64298192+anwesham-lab@users.noreply.github.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

anwesham-lab · 2026-05-04T19:25:26Z

Bumped databases-on-aws version 1.0.0 → 1.1.0 (a351c8b):

plugins/databases-on-aws/.claude-plugin/plugin.json
plugins/databases-on-aws/.codex-plugin/plugin.json
.claude-plugin/marketplace.json

Minor bump rather than patch since this ships new user-visible guidance (cluster-lifecycle troubleshooting, new type guidance for JSON/JSONB/arrays) and eval suite expansion, not just bug fixes.

HEAD: a351c8b.

anwesham-lab · 2026-05-04T19:26:08Z

> can you please bump the version of the plugin in relevant sections ? Thanks !

@krokoko done

scottschreckengaust

LGTM

anwesham-lab requested review from a team, krokoko, scottschreckengaust and theagenticguy May 4, 2026 10:47

anwesham-lab requested review from a team as code owners May 4, 2026 10:47

anwesham-lab requested review from Benjscho, Morlej, amaksimo, gxjx-x, pkale and praba2210 May 4, 2026 10:47

anwesham-lab force-pushed the dsql-types branch 6 times, most recently from 6671e09 to 715fa78 Compare May 4, 2026 11:50

anwesham-lab assigned anwesham-lab, pkale and amaksimo May 4, 2026

anwesham-lab force-pushed the dsql-types branch from 715fa78 to 9bc81e3 Compare May 4, 2026 13:07

anwesham-lab force-pushed the dsql-types branch 2 times, most recently from 33cae90 to 17cfc72 Compare May 4, 2026 13:26

gxjx-x previously approved these changes May 4, 2026


anwesham-lab dismissed gxjx-x’s stale review via 47f8e41 May 4, 2026 13:45

anwesham-lab force-pushed the dsql-types branch 2 times, most recently from 47f8e41 to 3c7a7f2 Compare May 4, 2026 13:55

anwesham-lab force-pushed the dsql-types branch 3 times, most recently from a2c1221 to 6c43cfd Compare May 4, 2026 14:19

anwesham-lab force-pushed the dsql-types branch from 6c43cfd to f967415 Compare May 4, 2026 14:27

anwesham-lab requested a review from gxjx-x May 4, 2026 15:36

gxjx-x previously approved these changes May 4, 2026


anwesham-lab dismissed gxjx-x’s stale review via a351c8b May 4, 2026 19:24

scottschreckengaust approved these changes May 4, 2026


krokoko approved these changes May 4, 2026


krokoko added this pull request to the merge queue May 4, 2026

Merged via the queue into awslabs:main with commit 5305a2b May 4, 2026
22 checks passed

krokoko merged 2 commits into awslabs:main from anwesham-lab:dsql-types
