Agent org stage 1: kill QA dup-issues, reshape Analyst, model bumps, skill realloc #186

Merged: ohld merged 1 commit into production from feat/agent-org-stage1 on Apr 24, 2026

ohld commented Apr 24, 2026

Summary

Stage 1 of the agent-org plan. The 62%-firefighting problem isn't routines — an audit of the last 100 issues (docs/agent-org-audit-2026-04-24.md) shows it's QA filing duplicate incident tickets (DB pool ×4, describe_memes ×6, score column ×4) plus Analyst dumping reactive numbers into CEO's inbox. This PR fixes both, plus the model + skill drift we'd never addressed.

What changed

Control-system fixes (the high-leverage edits):

  • agents/qa-engineer/AGENTS.md: explicit DO-NOT-FILE list for known recurring incidents (describe_memes, db-pool, OpenRouter, Forbidden errors). 3-issue/scan output cap. Dedup preflight made mandatory. Removed two duplicated blocks.
  • agents/analyst/AGENTS.md: daily report rewritten to fixed 4-section shape — one hypothesis, one recommended bet for CEO, severity-gated incident digest (max 5 bullets), open hypotheses status. Anti-patterns called out explicitly.

Model bumps: CEO/CTO/Staff Eng claude-opus-4-6 → claude-opus-4-7. Sonnet agents stay (per codex pushback: Analyst quality isn't the bottleneck).

Skill reallocation (per gstack agent-company best practices):

  • CEO: +learn
  • Analyst: +learn, +codex
  • CTO: -plan-design-review
  • Staff Eng: +cso (scoped via prompt to auth/payments/uploads/infra PRs only)
  • QA: +devex-review, -design-consultation
  • Release Eng: +canary, +benchmark (post-deploy monitoring is theirs, not QA's)
  • Comms: +learn (kept frontend-design per ohld)

Tooling:

  • agents/_sync_config.py: new Python helper. Reads .paperclip.yaml + AGENTS.md frontmatter, diffs against prod via GET /api/agents/<id>, PATCHes only on change. Preserves paperclipai/* skill paths. Permissions routed to dedicated /permissions endpoint.
  • agents/deploy.sh: now invokes _sync_config.py as a second pass after the markdown PUT pass.
  • Workflow: pip install pyyaml step added.
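The diff-first behaviour described for `agents/_sync_config.py` can be sketched roughly as below. This is a minimal illustration under assumptions, not the helper's actual code: the function name `plan_patch`, the flat dict shapes, and the skill path `paperclipai/example-skill` are hypothetical, and the real helper also reads `.paperclip.yaml` and talks to the API.

```python
from typing import Any

PRESERVED_PREFIX = "paperclipai/"  # skills under this prefix are not repo-managed


def plan_patch(current: dict[str, Any], desired: dict[str, Any]) -> dict[str, Any]:
    """Return only the fields that differ; an empty dict means 'skip the PATCH'."""
    patch: dict[str, Any] = {}

    if "model" in desired and desired["model"] != current.get("model"):
        patch["model"] = desired["model"]

    if "desiredSkills" in desired:
        current_skills = current.get("desiredSkills", [])
        # Keep externally managed skills that the repo manifest does not list.
        preserved = {s for s in current_skills if s.startswith(PRESERVED_PREFIX)}
        merged = sorted(preserved | set(desired["desiredSkills"]))
        if merged != sorted(set(current_skills)):
            patch["desiredSkills"] = merged

    return patch


if __name__ == "__main__":
    current = {
        "model": "claude-opus-4-6",
        "desiredSkills": ["paperclipai/example-skill", "garrytan/gstack/codex"],
    }
    desired = {
        "model": "claude-opus-4-7",
        "desiredSkills": ["garrytan/gstack/codex", "garrytan/gstack/learn"],
    }
    print(plan_patch(current, desired))
```

Applying the returned patch and planning again yields an empty dict, which corresponds to the "zero drift" state the dry-run checks for.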

Verification

Applied locally before pushing — all 7 agents PATCHed cleanly:

PATCHED analyst: +skills: ['garrytan/gstack/codex', 'garrytan/gstack/learn']
PATCHED ceo: model: claude-opus-4-6 → claude-opus-4-7; +skills: ['garrytan/gstack/learn']
PATCHED comms-manager: +skills: ['garrytan/gstack/learn']
PATCHED cto: model: claude-opus-4-6 → claude-opus-4-7; -skills: ['garrytan/gstack/plan-design-review']
PATCHED qa-engineer: +skills: ['garrytan/gstack/devex-review']; -skills: ['garrytan/gstack/design-consultation']
PATCHED release-engineer: +skills: ['garrytan/gstack/benchmark', 'garrytan/gstack/canary']
PATCHED staff-engineer: model: claude-opus-4-6 → claude-opus-4-7; +skills: ['garrytan/gstack/cso']

Config sync: patched=7, skipped=0, failed=0.

Re-run dry-run after apply showed zero drift across all 7. CEO desiredSkills now mixes 5 gstack + 4 preserved paperclipai/* paths.

Audit doc with full data: docs/agent-org-audit-2026-04-24.md.

Why these specific changes (not others)

  • Codex review (this session) flagged that "more aspirational CEO mission" without fixing routine/agent output volume is theater. Hence: control-system first, mission reframing in Stage 2.
  • Analyst stays on Sonnet because the bottleneck is deliverable shape, not reasoning depth. Bumping Analyst to Opus would just produce more eloquent reactive summaries.
  • Comms keeps frontend-design — confirmed with ohld it's the planned path for HTML→screenshot stat-cards (matplotlib in src/comms/visuals.py is transitional).

Test plan

  • Local apply succeeds (7/7 PATCHes, 10/10 markdown PUTs)
  • Re-run dry-run shows zero drift
  • CEO agent verified: model = claude-opus-4-7, skills include learn, paperclipai/* preserved
  • After merge, the GitHub Action re-runs the same sync against prod and reports zero drift
  • Wake Analyst once via paperclipai heartbeat run --agent-id 9c87d840-7041-49d8-8436-00b6dcb10971 and confirm new daily-report shape
  • 48-hour soak: count issues created in next 48h vs prior baseline (38 QA + 23 CEO over 28 days = ~2.2/day). Target: drop to ≤1.5/day with zero describe_memes/db-pool duplicates from QA.
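For reference, the soak-test arithmetic works out as follows. This is only a worked calculation from the numbers quoted above; the variable names are illustrative, and a two-day window is a small sample, so the result is a rough signal rather than a firm pass/fail.

```python
# Baseline from the audit: issues filed over the prior 28 days.
qa_issues, ceo_issues, days = 38, 23, 28
baseline_per_day = (qa_issues + ceo_issues) / days  # 61 / 28

target_per_day = 1.5
soak_days = 2  # the 48-hour soak window
max_issues_in_soak = int(target_per_day * soak_days)

print(f"baseline: {baseline_per_day:.2f}/day")  # baseline: 2.18/day
print(f"soak passes if <= {max_issues_in_soak} issues in {soak_days} days")
```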

Stage 2 (separate PR, after Stage 1 settles)

  • CEO mission reframe with severity-aware preemption ("bet is default; regressions only preempt on severity threshold")
  • gbrain shared KB (DIY install — no official Paperclip integration) per INSTALL_FOR_AGENTS.md
  • Source-of-work dashboard

ohld left a comment

Staff Engineer review — PR #186

CI: lint ✅ test ✅. Verdict: approve with notes (external PR, owner merges).

Scope breakdown

Three changes bundled in one PR:

  1. Agent-org Stage 1 (prompts, skill realloc, model bumps 4-6 → 4-7). Meta, reversible.
  2. agents/_sync_config.py + deploy workflow hook. Diff-first PATCH, sane.
  3. Upload flow: full i18n (EN/RU/UK) + inline OCR dedup. This is the only production-code surface worth paranoia.

Notes on the inline OCR dedup (src/tgbot/handlers/upload/moderation.py)

1. OpenRouter rate-limit contention (flag, not block). describe_single_meme now runs synchronously during upload, sharing the 20-rpm free-tier budget with the hourly describe_memes cron. At current upload volume (≤5/user/day) this is fine, but a bursty upload spike could starve the cron. Mitigation is already correct: a rate_limited / failed return falls through to manual review without error. No action required, just aware.

2. describe_single_meme return-type mismatch. Signature is -> str but it returns ("rate_limited", retry_after) tuple in one branch. The new caller handles this (isinstance(status, tuple)), but the signature should be fixed upstream — not blocking this PR.

3. User-selected language_code can be overwritten. describe_single_meme overrides meme.language_code when it detects a known language. Previously this only happened on the hourly cron (after manual approval). Now it happens during auto-review, so user-selected language can flip before moderator sees it. Probably net-neutral (OCR is usually more accurate), but worth confirming it's intended.

4. Trigram dedup: no GIN index on ocr_result ->> 'text'. find_meme_duplicate uses % operator with default 0.3 threshold, sequential scan on meme. Fine at current scale, but mark for follow-up if upload volume grows.

5. find_meme_duplicate length filter is duplicated (12 in caller, 11 inside). Redundant, not broken.
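One possible shape for the upstream fix mentioned in note 2 is a small result type, so the signature no longer claims `-> str` while returning a tuple in one branch. This is a sketch only: `DescribeResult`, `needs_manual_review`, and the `"ok"` status value are hypothetical names, not taken from the codebase; `"rate_limited"` and `"failed"` come from the notes above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DescribeResult:
    """Hypothetical return type replacing the bare str / tuple mix."""

    status: str  # e.g. "ok" (assumed), "failed", "rate_limited"
    retry_after: Optional[float] = None  # only meaningful for "rate_limited"


def needs_manual_review(result: DescribeResult) -> bool:
    """Anything other than a clean description falls through to manual review."""
    return result.status != "ok"
```

With a single return type, the caller no longer needs an `isinstance(status, tuple)` check to tell the branches apart.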

Localization coverage gap (minor)

All 10 new keys (upload.approved, upload.rejected, etc.) have EN/RU/UK only. Existing upload.submitted / upload.rules_accept_button have EN/RU/UK/ES/HI. Hindi and Spanish users will see English for the new messages. localizer.t() fallback is correct (CIS → RU, else EN), so no crashes — just inconsistency. Your call whether to backfill.
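The fallback behaviour described here (CIS → RU, else EN) might look roughly like the sketch below. It is not the real `localizer.t()`: the function name `translate`, the membership of `CIS_LANGS`, and the catalog strings are placeholders chosen for illustration.

```python
# Assumed membership, for illustration only; the real list may differ.
CIS_LANGS = {"be", "kk", "uz"}

# Placeholder catalog with one of the new keys; strings are illustrative.
CATALOG = {
    "upload.approved": {"en": "Approved", "ru": "Одобрено", "uk": "Схвалено"},
}


def translate(key: str, lang: str, catalog: dict = CATALOG) -> str:
    """Return the translation for lang, falling back to RU for CIS, else EN."""
    translations = catalog[key]
    if lang in translations:
        return translations[lang]
    fallback = "ru" if lang in CIS_LANGS else "en"
    return translations[fallback]
```

Under these assumptions, a Spanish or Hindi user gets the English string for `upload.approved`, which is the inconsistency the note points out.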

Agent-org changes

  • .paperclip.yaml model bumps + skill edits: reversible via re-sync, low risk.
  • _sync_config.py assumes frontmatter skills are slug-only (not full paths); matches current AGENTS.md convention but worth asserting.
  • Issue-hygiene rule finally ported into QA — this is the high-leverage fix.

SQL safety

  • find_meme_duplicate uses bindparams properly. No string interpolation risk.
  • No candidates.py changes. No new SQL injection surface.

Secrets

  • None committed. _sync_config.py reads PAPERCLIP_API_KEY from env. Clean.

Action

Approving the GitHub review. Not merging — external PR, @ohld merges manually.

ohld added a commit that referenced this pull request Apr 24, 2026
…--approve, poll CI (#189)

Diagnosis from a session audit of why 4 PRs went stale despite the routine firing:

1. Staff Eng never reads the trigger payload. Paperclip exposes it as
   $PAPERCLIP_WAKE_PAYLOAD_JSON (per claude-local adapter), but the prompt
   only said "trigger payload contains pr_number" without telling the agent
   how to read it. Agent was guessing PRs (or doing nothing).

2. Reviews posted as "comment" not "approve". GitHub's merge gate requires
   reviewDecision == APPROVED; a `gh pr comment` or default `gh pr review`
   leaves it empty. Result: review text appears, merge stays blocked.

3. Internal PRs misclassified as external. Old logic let the agent decide
   "external" without checking — a recent review of PR #186 (authored by
   ohld) literally said "approve with notes (external PR, owner merges)"
   and never merged.

4. CI polling missing. Most webhooks fire before CI finishes (test takes
   2-3 min). Without polling, agent either approved-then-bailed or skipped
   the merge. Now: poll every 60s for up to 5 min, merge on green, leave issue
   blocked + comment on red/timeout.

Companion server-side fix (already applied via PATCH /api/routines/<id>):
- Added `pr_number` + `pr_url` as routine variables.
- Title template changed from "PR Review" to "[pr:{{pr_number}}] Review"
  so each PR gets its own identifiable issue (kills coalescing-by-identical-title).

Verified the new prompt PUTs cleanly via deploy.sh; effect lands on the
next Staff Eng wake.
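Points 1 and 4 of the diagnosis above can be sketched as two small helpers. This is an illustration under assumptions, not the prompt's actual logic: it assumes `pr_number` sits at the top level of the payload JSON, reads the polling budget as roughly five checks a minute apart, and uses hypothetical names (`read_pr_number`, `poll_ci`) and status strings (`"green"`, `"red"`, `"pending"`).

```python
import json
import os
import time
from typing import Callable, Mapping, Optional


def read_pr_number(env: Mapping[str, str] = os.environ) -> Optional[int]:
    """Parse pr_number from the wake payload; None if absent or unparseable."""
    raw = env.get("PAPERCLIP_WAKE_PAYLOAD_JSON", "")
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(payload, dict):
        return None
    try:
        return int(payload["pr_number"])
    except (KeyError, TypeError, ValueError):
        return None


def poll_ci(
    check: Callable[[], str],
    attempts: int = 5,
    interval: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call check() until it reports 'green' or 'red'; otherwise 'timeout'."""
    for attempt in range(attempts):
        status = check()
        if status in ("green", "red"):
            return status
        if attempt < attempts - 1:
            sleep(interval)  # wait before the next check, not after the last
    return "timeout"
```

The caller would merge on `"green"` and leave the issue blocked with a comment on `"red"` or `"timeout"`. Passing `sleep` as a parameter keeps the loop testable without real waiting.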
…t, model bumps, skill realloc

Stage 1 of the agent-org plan from docs/paperclip-native-migration.md +
docs/agent-org-audit-2026-04-24.md. Audit of last 100 issues showed the
firefighting problem isn't routines — it's QA filing duplicate incident
tickets (DB pool x4, describe_memes x6, score column x4) and Analyst
dumping reactive numbers into CEO's inbox.

Changes:
- QA: explicit DO-NOT-FILE list for known recurring incidents
  (describe_memes, db-pool, OpenRouter, Forbidden errors), 3-issue/scan
  output cap, dedup preflight made mandatory. Removed duplicated
  issue-hygiene + MCP-tools blocks. Skills: -design-consultation +devex-review.
- Analyst: daily report rewritten to fixed 4-section shape — one hypothesis,
  one recommended bet for CEO, severity-gated incident digest (max 5 bullets),
  open hypotheses status. Anti-patterns named explicitly. Skills: +learn,+codex.
- Models: CEO/CTO/Staff Eng bumped claude-opus-4-6 → claude-opus-4-7.
- Skill reallocation per gstack agent-company best practices:
  CEO +learn; CTO -plan-design-review; Staff Eng +cso (scoped to
  auth/payments/uploads/infra PRs only); Release Eng +canary,+benchmark
  (post-deploy monitoring is theirs, not QA's); Comms +learn.
- agents/_sync_config.py: new Python helper that diffs current adapterConfig
  + desiredSkills + heartbeat against the manifest+frontmatter and
  PATCHes only on change. Preserves paperclipai/* skill paths. Permissions
  routed to dedicated /permissions endpoint.
- agents/deploy.sh: invokes _sync_config.py as second pass after the
  existing markdown PUT pass.
- workflow: pip install pyyaml so the new sync helper runs on the runner.

Verified locally: dry-run after apply shows zero drift across all 7 agents.
CEO desiredSkills now mixes 5 gstack + 4 preserved paperclipai paths.

CEO mission reframe + gbrain integration are deferred to Stage 2 per plan
(land after this Stage's effect on inbox volume can be measured).
ohld force-pushed the feat/agent-org-stage1 branch from 43d6c67 to 833b3f3 Compare April 24, 2026 12:37

ohld commented Apr 24, 2026

Staff Engineer review — LGTM (can't self-approve via GitHub API, same account as author)

Scope matches title (kill QA dup-issues, reshape Analyst, model bumps, skill realloc). No SQL, no secrets, no LLM trust-boundary, no shell-injection surface. CI green. Merging.

Follow-up items (non-blocking)

  1. agents/release-engineer/AGENTS.md:6: `canary` listed twice in the `skills:` YAML list. _sync_config.py dedupes via set() so prod won't break, but it masks source-of-truth mistakes. Drop the duplicate.

  2. agents/_sync_config.py:51 read_frontmatter_skills() fails open on edge cases (codex adversarial catch):

    • skills: review (scalar instead of list) → list("review") = ['r','e','v','i','e','w']. Silent corruption.
    • CRLF line endings skip the ^---\n regex → returns [].
    • Missing/unparseable frontmatter silently returns [], potentially stripping repo-managed skills.
      Defensive fix: validate fm.get("skills") is a list; abort on scalar/dict. Log a warning on regex miss rather than silent [].

  3. .github/workflows/paperclip-deploy-agents.yml:21: `pip install --quiet pyyaml` is unpinned. The workflow later holds `PAPERCLIP_API_KEY`. Pin the version (`pyyaml==6.0.2` or similar): low probability, but the blast radius is the live Paperclip control-plane key.

  4. agents/_sync_config.py merge-only semantics — removing a `heartbeat.` or `permissions.` field from the manifest won't propagate to prod (only iterates manifest-present keys and `.update()`s). Intentional per docstring, but document the limitation so future-you remembers disabling a heartbeat needs an explicit null or code change.

  5. agents/analyst/AGENTS.md vs experiments/README.md — prompt now mandates fixed 4-section daily report and forbids deviation. `experiments/README.md` still describes the older 6-section story-format report. Update the README or delete the old contract.

  6. agents/staff-engineer/AGENTS.md:89 — `4a.` sub-step numbering is unusual. Renumber 5→6 and promote `/cso` to step 5.

  7. agents/.paperclip.yaml — only 3 of the agents bumped to `claude-opus-4-7`. If staying on 4-6 elsewhere is intentional (budget, heartbeat cadence), a one-line comment would save future confusion.
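The defensive fix suggested in item 2 could look something like the sketch below. It is not the current `read_frontmatter_skills()`: the names `extract_frontmatter` and `validate_skills` are hypothetical, YAML parsing is left out so the block stays dependency-free, and raising on a regex miss is a stricter choice than the warning proposed above.

```python
import re

# Accept both LF and CRLF line endings around the frontmatter fences.
FRONTMATTER_RE = re.compile(r"\A---\r?\n(.*?)\r?\n---\s*(?:\r?\n|\Z)", re.DOTALL)


def extract_frontmatter(text: str) -> str:
    """Return the raw frontmatter body; raise instead of silently returning ''."""
    match = FRONTMATTER_RE.match(text)
    if match is None:
        raise ValueError("no frontmatter block found")
    return match.group(1)


def validate_skills(frontmatter: dict) -> list[str]:
    """Accept only a list of strings for `skills`; reject scalars and dicts."""
    skills = frontmatter.get("skills", [])
    if not isinstance(skills, list) or not all(isinstance(s, str) for s in skills):
        raise ValueError(f"skills must be a list of strings, got {skills!r}")
    return list(skills)
```

With this shape, `skills: review` parsed as a scalar aborts the sync rather than being split into single characters.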

None blocking. Squashing.

ohld merged commit 7531d91 into production Apr 24, 2026
3 checks passed
ohld added a commit that referenced this pull request Apr 24, 2026
Staff Engineer's autonomous review of PR #186 caught this — `canary` was
listed twice in skills:, leftover from the rebase that combined ohld's
role-merge edit with my Stage 1 +canary,+benchmark addition.

Non-blocking (deploy.sh _sync_config.py dedupes via set()), but the YAML
should be the source of truth, not relying on downstream defensive coding.