Agent org stage 1: kill QA dup-issues, reshape Analyst, model bumps, skill realloc by ohld · Pull Request #186 · ffmemes/ff-backend

ohld · 2026-04-24T12:08:34Z

Summary

Stage 1 of the agent-org plan. The 62%-firefighting problem isn't routines — an audit of the last 100 issues (docs/agent-org-audit-2026-04-24.md) shows it's QA filing duplicate incident tickets (DB pool ×4, describe_memes ×6, score column ×4) plus Analyst dumping reactive numbers into CEO's inbox. This PR fixes both, plus the model + skill drift we'd never addressed.

What changed

Control-system fixes (the high-leverage edits):

agents/qa-engineer/AGENTS.md: explicit DO-NOT-FILE list for known recurring incidents (describe_memes, db-pool, OpenRouter, Forbidden errors). 3-issue/scan output cap. Dedup preflight made mandatory. Removed two duplicated blocks.
agents/analyst/AGENTS.md: daily report rewritten to fixed 4-section shape — one hypothesis, one recommended bet for CEO, severity-gated incident digest (max 5 bullets), open hypotheses status. Anti-patterns called out explicitly.

Model bumps: CEO/CTO/Staff Eng claude-opus-4-6 → claude-opus-4-7. Sonnet agents stay (per codex pushback — Analyst quality isn't the bottleneck).

Skill reallocation (per gstack agent-company best practices):

CEO: +learn
Analyst: +learn, +codex
CTO: -plan-design-review
Staff Eng: +cso (scoped via prompt to auth/payments/uploads/infra PRs only)
QA: +devex-review, -design-consultation
Release Eng: +canary, +benchmark (post-deploy monitoring is theirs, not QA's)
Comms: +learn (kept frontend-design per ohld)

Tooling:

agents/_sync_config.py: new Python helper. Reads .paperclip.yaml + AGENTS.md frontmatter, diffs against prod via GET /api/agents/<id>, PATCHes only on change. Preserves paperclipai/* skill paths. Permissions routed to dedicated /permissions endpoint.
agents/deploy.sh: now invokes _sync_config.py as a second pass after the markdown PUT pass.
Workflow: pip install pyyaml step added.

Verification

Applied locally before pushing — all 7 agents PATCHed cleanly:

PATCHED analyst: +skills: ['garrytan/gstack/codex', 'garrytan/gstack/learn']
PATCHED ceo: model: claude-opus-4-6 → claude-opus-4-7; +skills: ['garrytan/gstack/learn']
PATCHED comms-manager: +skills: ['garrytan/gstack/learn']
PATCHED cto: model: claude-opus-4-6 → claude-opus-4-7; -skills: ['garrytan/gstack/plan-design-review']
PATCHED qa-engineer: +skills: ['garrytan/gstack/devex-review']; -skills: ['garrytan/gstack/design-consultation']
PATCHED release-engineer: +skills: ['garrytan/gstack/benchmark', 'garrytan/gstack/canary']
PATCHED staff-engineer: model: claude-opus-4-6 → claude-opus-4-7; +skills: ['garrytan/gstack/cso']

Config sync: patched=7, skipped=0, failed=0.

Re-run dry-run after apply showed zero drift across all 7. CEO desiredSkills now mixes 5 gstack + 4 preserved paperclipai/* paths.

Audit doc with full data: docs/agent-org-audit-2026-04-24.md.

Why these specific changes (not others)

Codex review (this session) flagged that "more aspirational CEO mission" without fixing routine/agent output volume is theater. Hence: control-system first, mission reframing in Stage 2.
Analyst stays on Sonnet because the bottleneck is deliverable shape, not reasoning depth. Bumping Analyst to Opus would just produce more eloquent reactive summaries.
Comms keeps frontend-design — confirmed with ohld it's the planned path for HTML→screenshot stat-cards (matplotlib in src/comms/visuals.py is transitional).

Test plan

Local apply succeeds (7/7 PATCHes, 10/10 markdown PUTs)
Re-run dry-run shows zero drift
CEO agent verified: model = claude-opus-4-7, skills include learn, paperclipai/* preserved
After merge, the GitHub Action re-runs the same sync against prod and reports zero drift
Wake Analyst once via paperclipai heartbeat run --agent-id 9c87d840-7041-49d8-8436-00b6dcb10971 and confirm new daily-report shape
48-hour soak: count issues created in next 48h vs prior baseline (38 QA + 23 CEO over 28 days = ~2.2/day). Target: drop to ≤1.5/day with zero describe_memes/db-pool duplicates from QA.

Stage 2 (separate PR, after Stage 1 settles)

CEO mission reframe with severity-aware preemption ("bet is default; regressions only preempt on severity threshold")
gbrain shared KB (DIY install — no official Paperclip integration) per INSTALL_FOR_AGENTS.md
Source-of-work dashboard

ohld

Staff Engineer review — PR #186

CI: lint ✅ test ✅. Verdict: approve with notes (external PR, owner merges).

Scope breakdown

Three changes bundled in one PR:

Agent-org Stage 1 (prompts, skill realloc, model bumps 4-6 → 4-7). Meta, reversible.
agents/_sync_config.py + deploy workflow hook. Diff-first PATCH, sane.
Upload flow: full i18n (EN/RU/UK) + inline OCR dedup. This is the only production-code surface worth paranoia.

Notes on the inline OCR dedup (`src/tgbot/handlers/upload/moderation.py`)

1. OpenRouter rate-limit contention (flag, not block). describe_single_meme now runs synchronously during upload, sharing the 20-rpm free-tier budget with the hourly describe_memes cron. At current upload volume (≤5/user/day) this is fine, but a bursty upload spike could starve the cron. Mitigation is already correct: a rate_limited / failed return falls through to manual review without error. No action required, just aware.

2. describe_single_meme return-type mismatch. Signature is -> str but it returns ("rate_limited", retry_after) tuple in one branch. The new caller handles this (isinstance(status, tuple)), but the signature should be fixed upstream — not blocking this PR.

3. User-selected language_code can be overwritten. describe_single_meme overrides meme.language_code when it detects a known language. Previously this only happened on the hourly cron (after manual approval). Now it happens during auto-review, so user-selected language can flip before moderator sees it. Probably net-neutral (OCR is usually more accurate), but worth confirming it's intended.

4. Trigram dedup: no GIN index on ocr_result ->> 'text'. find_meme_duplicate uses % operator with default 0.3 threshold, sequential scan on meme. Fine at current scale, but mark for follow-up if upload volume grows.

5. find_meme_duplicate length filter is duplicated (12 in caller, 11 inside). Redundant, not broken.

Localization coverage gap (minor)

All 10 new keys (upload.approved, upload.rejected, etc.) have EN/RU/UK only. Existing upload.submitted / upload.rules_accept_button have EN/RU/UK/ES/HI. Hindi and Spanish users will see English for the new messages. localizer.t() fallback is correct (CIS → RU, else EN), so no crashes — just inconsistency. Your call whether to backfill.

Agent-org changes

.paperclip.yaml model bumps + skill edits: reversible via re-sync, low risk.
_sync_config.py assumes frontmatter skills are slug-only (not full paths); matches current AGENTS.md convention but worth asserting.
Issue-hygiene rule finally ported into QA — this is the high-leverage fix.

SQL safety

find_meme_duplicate uses bindparams properly. No string interpolation risk.
No candidates.py changes. No new SQL injection surface.

Secrets

None committed. _sync_config.py reads PAPERCLIP_API_KEY from env. Clean.

Action

Approving the GitHub review. Not merging — external PR, @ohld merges manually.

…--approve, poll CI (#189) Diagnosis from a session audit of why 4 PRs went stale despite the routine firing: 1. Staff Eng never reads the trigger payload. Paperclip exposes it as $PAPERCLIP_WAKE_PAYLOAD_JSON (per claude-local adapter), but the prompt only said "trigger payload contains pr_number" without telling the agent how to read it. Agent was guessing PRs (or doing nothing). 2. Reviews posted as "comment" not "approve". GitHub's merge gate requires reviewDecision == APPROVED; a `gh pr comment` or default `gh pr review` leaves it empty. Result: review text appears, merge stays blocked. 3. Internal PRs misclassified as external. Old logic let the agent decide "external" without checking — a recent review of PR #186 (authored by ohld) literally said "approve with notes (external PR, owner merges)" and never merged. 4. CI polling missing. Most webhooks fire before CI finishes (test takes 2-3 min). Without polling, agent either approved-then-bailed or skipped the merge. Now: poll up to 5min × 60s, merge on green, leave issue blocked + comment on red/timeout. Companion server-side fix (already applied via PATCH /api/routines/<id>): - Added `pr_number` + `pr_url` as routine variables. - Title template changed from "PR Review" to "[pr:{{pr_number}}] Review" so each PR gets its own identifiable issue (kills coalescing-by-identical-title). Verified the new prompt PUTs cleanly via deploy.sh; effect lands on the next Staff Eng wake.

…t, model bumps, skill realloc Stage 1 of the agent-org plan from docs/paperclip-native-migration.md + docs/agent-org-audit-2026-04-24.md. Audit of last 100 issues showed the firefighting problem isn't routines — it's QA filing duplicate incident tickets (DB pool x4, describe_memes x6, score column x4) and Analyst dumping reactive numbers into CEO's inbox. Changes: - QA: explicit DO-NOT-FILE list for known recurring incidents (describe_memes, db-pool, OpenRouter, Forbidden errors), 3-issue/scan output cap, dedup preflight made mandatory. Removed duplicated issue-hygiene + MCP-tools blocks. Skills: -design-consultation +devex-review. - Analyst: daily report rewritten to fixed 4-section shape — one hypothesis, one recommended bet for CEO, severity-gated incident digest (max 5 bullets), open hypotheses status. Anti-patterns named explicitly. Skills: +learn,+codex. - Models: CEO/CTO/Staff Eng bumped claude-opus-4-6 → claude-opus-4-7. - Skill reallocation per gstack agent-company best practices: CEO +learn; CTO -plan-design-review; Staff Eng +cso (scoped to auth/payments/uploads/infra PRs only); Release Eng +canary,+benchmark (post-deploy monitoring is theirs, not QA's); Comms +learn. - agents/_sync_config.py: new Python helper that diffs current adapterConfig + desiredSkills + heartbeat against the manifest+frontmatter and PATCHes only on change. Preserves paperclipai/* skill paths. Permissions routed to dedicated /permissions endpoint. - agents/deploy.sh: invokes _sync_config.py as second pass after the existing markdown PUT pass. - workflow: pip install pyyaml so the new sync helper runs on the runner. Verified locally: dry-run after apply shows zero drift across all 7 agents. CEO desiredSkills now mixes 5 gstack + 4 preserved paperclipai paths. CEO mission reframe + gbrain integration are deferred to Stage 2 per plan (land after this Stage's effect on inbox volume can be measured).

ohld · 2026-04-24T12:47:34Z

Staff Engineer review — LGTM (can't self-approve via GitHub API, same account as author)

Scope matches title (kill QA dup-issues, reshape Analyst, model bumps, skill realloc). No SQL, no secrets, no LLM trust-boundary, no shell-injection surface. CI green. Merging.

Follow-up items (non-blocking)

agents/release-engineer/AGENTS.md:6 — canary listed twice in skills: YAML. _sync_config.py dedupes via set() so prod won't break, but it masks source-of-truth mistakes. Drop the duplicate.
agents/_sync_config.py:51 read_frontmatter_skills() fails open on edge cases (codex adversarial catch):
- skills: review (scalar instead of list) → list(\"review\") = ['r','e','v','i','e','w']. Silent corruption.
- CRLF line endings skip the ^---\n regex → returns [].
- Missing/unparseable frontmatter silently returns [], potentially stripping repo-managed skills.
  Defensive fix: validate fm.get(\"skills\") is a list; abort on scalar/dict. Log a warning on regex miss rather than silent [].
.github/workflows/paperclip-deploy-agents.yml:21 — pip install --quiet pyyaml is unpinned. The workflow later holds `PAPERCLIP_API_KEY`. Pin the version (`pyyaml==6.0.2` or similar) — low probability, but the blast radius is the live Paperclip control-plane key.
agents/_sync_config.py merge-only semantics — removing a `heartbeat.` or `permissions.` field from the manifest won't propagate to prod (only iterates manifest-present keys and `.update()`s). Intentional per docstring, but document the limitation so future-you remembers disabling a heartbeat needs an explicit null or code change.
agents/analyst/AGENTS.md vs experiments/README.md — prompt now mandates fixed 4-section daily report and forbids deviation. `experiments/README.md` still describes the older 6-section story-format report. Update the README or delete the old contract.
agents/staff-engineer/AGENTS.md:89 — `4a.` sub-step numbering is unusual. Renumber 5→6 and promote `/cso` to step 5.
agents/.paperclip.yaml — only 3 of the agents bumped to `claude-opus-4-7`. If staying on 4-6 elsewhere is intentional (budget, heartbeat cadence), a one-line comment would save future confusion.

None blocking. Squashing.

Staff Engineer's autonomous review of PR #186 caught this — `canary` was listed twice in skills:, leftover from the rebase that combined ohld's role-merge edit with my Stage 1 +canary,+benchmark addition. Non-blocking (deploy.sh _sync_config.py dedupes via set()), but the YAML should be the source of truth, not relying on downstream defensive coding.

ohld commented Apr 24, 2026

View reviewed changes

ohld mentioned this pull request Apr 24, 2026

fix(agents): close the PR review cycle — read trigger payload, force --approve, poll CI #189

Merged

6 tasks

ohld force-pushed the feat/agent-org-stage1 branch from 43d6c67 to 833b3f3 Compare April 24, 2026 12:37

ohld merged commit 7531d91 into production Apr 24, 2026
3 checks passed

ohld mentioned this pull request Apr 24, 2026

fix(agents): dedupe canary skill in release-engineer (caught by Staff Eng autonomous review) #190

Merged

ohld mentioned this pull request May 5, 2026

test: describe_memes tests + suppress QA firefighting #174

Closed

3 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Agent org stage 1: kill QA dup-issues, reshape Analyst, model bumps, skill realloc#186

Agent org stage 1: kill QA dup-issues, reshape Analyst, model bumps, skill realloc#186
ohld merged 1 commit into
productionfrom
feat/agent-org-stage1

ohld commented Apr 24, 2026

Uh oh!

ohld left a comment

Uh oh!

ohld commented Apr 24, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

ohld commented Apr 24, 2026

Summary

What changed

Verification

Why these specific changes (not others)

Test plan

Stage 2 (separate PR, after Stage 1 settles)

Uh oh!

ohld left a comment

Choose a reason for hiding this comment

Staff Engineer review — PR #186

Scope breakdown

Notes on the inline OCR dedup (src/tgbot/handlers/upload/moderation.py)

Localization coverage gap (minor)

Agent-org changes

SQL safety

Secrets

Action

Uh oh!

ohld commented Apr 24, 2026

Staff Engineer review — LGTM (can't self-approve via GitHub API, same account as author)

Follow-up items (non-blocking)

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Notes on the inline OCR dedup (`src/tgbot/handlers/upload/moderation.py`)