Agent org stage 1: kill QA dup-issues, reshape Analyst, model bumps, skill realloc#186
Conversation
ohld
left a comment
There was a problem hiding this comment.
Staff Engineer review — PR #186
CI: lint ✅ test ✅. Verdict: approve with notes (external PR, owner merges).
Scope breakdown
Three changes bundled in one PR:
- Agent-org Stage 1 (prompts, skill realloc, model bumps 4-6 → 4-7). Meta, reversible.
agents/_sync_config.py+ deploy workflow hook. Diff-first PATCH, sane.- Upload flow: full i18n (EN/RU/UK) + inline OCR dedup. This is the only production-code surface worth paranoia.
Notes on the inline OCR dedup (src/tgbot/handlers/upload/moderation.py)
1. OpenRouter rate-limit contention (flag, not block). describe_single_meme now runs synchronously during upload, sharing the 20-rpm free-tier budget with the hourly describe_memes cron. At current upload volume (≤5/user/day) this is fine, but a bursty upload spike could starve the cron. Mitigation is already correct: a rate_limited / failed return falls through to manual review without error. No action required, just aware.
2. describe_single_meme return-type mismatch. Signature is -> str but it returns ("rate_limited", retry_after) tuple in one branch. The new caller handles this (isinstance(status, tuple)), but the signature should be fixed upstream — not blocking this PR.
3. User-selected language_code can be overwritten. describe_single_meme overrides meme.language_code when it detects a known language. Previously this only happened on the hourly cron (after manual approval). Now it happens during auto-review, so user-selected language can flip before moderator sees it. Probably net-neutral (OCR is usually more accurate), but worth confirming it's intended.
4. Trigram dedup: no GIN index on ocr_result ->> 'text'. find_meme_duplicate uses % operator with default 0.3 threshold, sequential scan on meme. Fine at current scale, but mark for follow-up if upload volume grows.
5. find_meme_duplicate length filter is duplicated (12 in caller, 11 inside). Redundant, not broken.
Localization coverage gap (minor)
All 10 new keys (upload.approved, upload.rejected, etc.) have EN/RU/UK only. Existing upload.submitted / upload.rules_accept_button have EN/RU/UK/ES/HI. Hindi and Spanish users will see English for the new messages. localizer.t() fallback is correct (CIS → RU, else EN), so no crashes — just inconsistency. Your call whether to backfill.
Agent-org changes
.paperclip.yamlmodel bumps + skill edits: reversible via re-sync, low risk._sync_config.pyassumes frontmatter skills are slug-only (not full paths); matches current AGENTS.md convention but worth asserting.- Issue-hygiene rule finally ported into QA — this is the high-leverage fix.
SQL safety
find_meme_duplicateusesbindparamsproperly. No string interpolation risk.- No
candidates.pychanges. No new SQL injection surface.
Secrets
- None committed.
_sync_config.pyreadsPAPERCLIP_API_KEYfrom env. Clean.
Action
Approving the GitHub review. Not merging — external PR, @ohld merges manually.
…--approve, poll CI (#189) Diagnosis from a session audit of why 4 PRs went stale despite the routine firing: 1. Staff Eng never reads the trigger payload. Paperclip exposes it as $PAPERCLIP_WAKE_PAYLOAD_JSON (per claude-local adapter), but the prompt only said "trigger payload contains pr_number" without telling the agent how to read it. Agent was guessing PRs (or doing nothing). 2. Reviews posted as "comment" not "approve". GitHub's merge gate requires reviewDecision == APPROVED; a `gh pr comment` or default `gh pr review` leaves it empty. Result: review text appears, merge stays blocked. 3. Internal PRs misclassified as external. Old logic let the agent decide "external" without checking — a recent review of PR #186 (authored by ohld) literally said "approve with notes (external PR, owner merges)" and never merged. 4. CI polling missing. Most webhooks fire before CI finishes (test takes 2-3 min). Without polling, agent either approved-then-bailed or skipped the merge. Now: poll up to 5min × 60s, merge on green, leave issue blocked + comment on red/timeout. Companion server-side fix (already applied via PATCH /api/routines/<id>): - Added `pr_number` + `pr_url` as routine variables. - Title template changed from "PR Review" to "[pr:{{pr_number}}] Review" so each PR gets its own identifiable issue (kills coalescing-by-identical-title). Verified the new prompt PUTs cleanly via deploy.sh; effect lands on the next Staff Eng wake.
…t, model bumps, skill realloc Stage 1 of the agent-org plan from docs/paperclip-native-migration.md + docs/agent-org-audit-2026-04-24.md. Audit of last 100 issues showed the firefighting problem isn't routines — it's QA filing duplicate incident tickets (DB pool x4, describe_memes x6, score column x4) and Analyst dumping reactive numbers into CEO's inbox. Changes: - QA: explicit DO-NOT-FILE list for known recurring incidents (describe_memes, db-pool, OpenRouter, Forbidden errors), 3-issue/scan output cap, dedup preflight made mandatory. Removed duplicated issue-hygiene + MCP-tools blocks. Skills: -design-consultation +devex-review. - Analyst: daily report rewritten to fixed 4-section shape — one hypothesis, one recommended bet for CEO, severity-gated incident digest (max 5 bullets), open hypotheses status. Anti-patterns named explicitly. Skills: +learn,+codex. - Models: CEO/CTO/Staff Eng bumped claude-opus-4-6 → claude-opus-4-7. - Skill reallocation per gstack agent-company best practices: CEO +learn; CTO -plan-design-review; Staff Eng +cso (scoped to auth/payments/uploads/infra PRs only); Release Eng +canary,+benchmark (post-deploy monitoring is theirs, not QA's); Comms +learn. - agents/_sync_config.py: new Python helper that diffs current adapterConfig + desiredSkills + heartbeat against the manifest+frontmatter and PATCHes only on change. Preserves paperclipai/* skill paths. Permissions routed to dedicated /permissions endpoint. - agents/deploy.sh: invokes _sync_config.py as second pass after the existing markdown PUT pass. - workflow: pip install pyyaml so the new sync helper runs on the runner. Verified locally: dry-run after apply shows zero drift across all 7 agents. CEO desiredSkills now mixes 5 gstack + 4 preserved paperclipai paths. CEO mission reframe + gbrain integration are deferred to Stage 2 per plan (land after this Stage's effect on inbox volume can be measured).
43d6c67 to
833b3f3
Compare
Staff Engineer review — LGTM (can't self-approve via GitHub API, same account as author)Scope matches title (kill QA dup-issues, reshape Analyst, model bumps, skill realloc). No SQL, no secrets, no LLM trust-boundary, no shell-injection surface. CI green. Merging. Follow-up items (non-blocking)
None blocking. Squashing. |
Staff Engineer's autonomous review of PR #186 caught this — `canary` was listed twice in skills:, leftover from the rebase that combined ohld's role-merge edit with my Stage 1 +canary,+benchmark addition. Non-blocking (deploy.sh _sync_config.py dedupes via set()), but the YAML should be the source of truth, not relying on downstream defensive coding.
Staff Engineer's autonomous review of PR #186 caught this — `canary` was listed twice in skills:, leftover from the rebase that combined ohld's role-merge edit with my Stage 1 +canary,+benchmark addition. Non-blocking (deploy.sh _sync_config.py dedupes via set()), but the YAML should be the source of truth, not relying on downstream defensive coding.
Summary
Stage 1 of the agent-org plan. The 62%-firefighting problem isn't routines — an audit of the last 100 issues (
docs/agent-org-audit-2026-04-24.md) shows it's QA filing duplicate incident tickets (DB pool ×4, describe_memes ×6, score column ×4) plus Analyst dumping reactive numbers into CEO's inbox. This PR fixes both, plus the model + skill drift we'd never addressed.What changed
Control-system fixes (the high-leverage edits):
agents/qa-engineer/AGENTS.md: explicit DO-NOT-FILE list for known recurring incidents (describe_memes,db-pool, OpenRouter, Forbidden errors). 3-issue/scan output cap. Dedup preflight made mandatory. Removed two duplicated blocks.agents/analyst/AGENTS.md: daily report rewritten to fixed 4-section shape — one hypothesis, one recommended bet for CEO, severity-gated incident digest (max 5 bullets), open hypotheses status. Anti-patterns called out explicitly.Model bumps: CEO/CTO/Staff Eng
claude-opus-4-6→claude-opus-4-7. Sonnet agents stay (per codex pushback — Analyst quality isn't the bottleneck).Skill reallocation (per gstack agent-company best practices):
learnlearn, +codexplan-design-reviewcso(scoped via prompt to auth/payments/uploads/infra PRs only)devex-review, -design-consultationcanary, +benchmark(post-deploy monitoring is theirs, not QA's)learn(keptfrontend-designper ohld)Tooling:
agents/_sync_config.py: new Python helper. Reads.paperclip.yaml+ AGENTS.md frontmatter, diffs against prod viaGET /api/agents/<id>, PATCHes only on change. Preservespaperclipai/*skill paths. Permissions routed to dedicated/permissionsendpoint.agents/deploy.sh: now invokes_sync_config.pyas a second pass after the markdown PUT pass.pip install pyyamlstep added.Verification
Applied locally before pushing — all 7 agents PATCHed cleanly:
Re-run dry-run after apply showed zero drift across all 7. CEO
desiredSkillsnow mixes 5 gstack + 4 preservedpaperclipai/*paths.Audit doc with full data:
docs/agent-org-audit-2026-04-24.md.Why these specific changes (not others)
frontend-design— confirmed with ohld it's the planned path for HTML→screenshot stat-cards (matplotlib insrc/comms/visuals.pyis transitional).Test plan
claude-opus-4-7, skills includelearn, paperclipai/* preservedpaperclipai heartbeat run --agent-id 9c87d840-7041-49d8-8436-00b6dcb10971and confirm new daily-report shapedescribe_memes/db-poolduplicates from QA.Stage 2 (separate PR, after Stage 1 settles)
INSTALL_FOR_AGENTS.md