Skip to content

feat(start): document onboarding contract + ghost-user monitor (FFM-907)#224

Merged
ohld merged 1 commit into
productionfrom
fix/ffm-907-start-onboarding-audit
May 5, 2026
Merged

feat(start): document onboarding contract + ghost-user monitor (FFM-907)#224
ohld merged 1 commit into
productionfrom
fix/ffm-907-start-onboarding-audit

Conversation

@ohld
Copy link
Copy Markdown
Member

@ohld ohld commented May 2, 2026

Summary

Pattern-level fix for FFM-907 — Sega's kitchen-branch leak (PR #222) was a symptom; this PR makes the same regression impossible to silently re-introduce and adds an alarm so we never depend on a friend complaining again.

Audit

handle_start does seven side effects every new user MUST get. After PR #222 they're all hoisted above the deep_link ladder:

# Side effect Universal?
1 user_tg upsert ✅ above ladder (save_user_data)
2 user upsert ✅ above ladder
3 deep_link saved on user_tg ✅ above ladder
4 user_info cache populated ✅ above ladder
5 user_deep_link_log row ✅ above ladder
6 Admin chat notification ✅ above ladder
7 user_language rows seeded ✅ above ladder (PR #222)

Per-branch behaviour (handle_language_settings, handle_invited_user, next_message, etc.) is intentional and stays inside the ladder.

What's in this PR

  1. Contributor comment at the top of handle_start spelling out the rule: hoist universal side effects above the ladder; never bury them inside per-branch blocks. Cites the FFM-907 incident so future agents/humans understand the why.
  2. src/flows/monitors/ghost_users.py — Prefect flow, runs every minute. Counts users created 1–5 min ago without a user_meme_reaction row (= WARN) and >5 min ago (= ERROR), posts a single summary to the admin chat with sample IDs. Filters out blocked-acquisition deep_links (silent drop is by design — they don't even create a user row, but the SQL is defensive).
  3. tests/tgbot/test_start.py — regression coverage per deep_link variant: bare /start, kitchen, wrapped, giveaway_77, s_*_* share-link, likefollowbot blocked, plus existing-user path and idempotency of the lazy lang init. Each asserts the row-level post-conditions in user_tg + user + user_language + user_deep_link_log.

Why a Prefect flow vs in-process timer

Per-request scheduling would burn an event-loop slot per /start and lose state on bot restart. A 1-min Prefect cron is one query against user + user_meme_reaction (both already covered by indexes on created_at / user_id), reuses the existing failure-alert path, and is easy to pause/inspect.

Test plan

  • CI: tests/tgbot/test_start.py passes (8 cases)
  • After deploy: Prefect UI shows Monitor ghost users running every minute
  • Manually trigger by creating a user row with no reactions — confirm WARN appears in admin chat within 60s
  • No false positives on a normal day (most ghosts come from the language-picker drop-off; if rate is too noisy, raise the WARN threshold)
  • Confirm user.created_at is real wall-clock (not now() skew across replicas)

🤖 Generated with Claude Code

…s (FFM-907)

PR #222 plugged the kitchen-branch leak (Sega's case). This audit ensures
no future deep_link branch can re-introduce the same drift, and adds a
runtime alarm so we don't have to wait for a friend to complain.

- Contributor-facing comment in handle_start documenting the rule:
  every universal onboarding side effect lives ABOVE the deep_link
  ladder. New side effects must hoist, not bury inside per-branch blocks.
- src/flows/monitors/ghost_users.py: Prefect flow runs every minute,
  counts new users (1-5min ago = WARN, >5min ago = ERROR) with no
  user_meme_reaction row, posts a single summary to admin chat.
  Filters out blocked-acquisition deep_links (intentional silent drop).
- tests/tgbot/test_start.py: regression coverage for each known
  deep_link variant (none / kitchen / wrapped / giveaway_77 / s_*_* /
  blocked-acquisition / existing user). Asserts user_tg + user +
  user_language + user_deep_link_log rows after every created=True path,
  plus idempotency of the lazy lang init.

Co-Authored-By: Paperclip <noreply@paperclip.ing>
@ohld
Copy link
Copy Markdown
Member Author

ohld commented May 5, 2026

STAFF ENGINEER REVIEW: APPROVED — Structural review clean. SQL safety verified: _build_exempt_clause interpolates Python constants only (BLOCKED_ACQUISITION_CHANNELS = frozenset({'likefollowbot'}), prefix tuple ('tapps',)), no user input. Tests cover every deep_link branch with row-level invariants. Documentation comment in handle_start correctly captures the FFM-907 contract.

Codex adversarial findings (non-blocking, suggested follow-ups)

  1. Alert fatigue from language-picker drop-off — monitor will fire WARN/ERROR every minute even on healthy days because users who never finish onboarding count as ghosts. PR description already acknowledges this; raise the WARN threshold or add a per-WARN cooldown if signal-to-noise is bad post-deploy.
  2. No Prefect concurrency guard — with * * * * * cron + 60s timeout, a hung run can overlap the next scheduled run. Worst case is duplicate admin alerts (read-only query, no locks). Consider prefect.deployments.concurrency_limit=1 on the deployment.
  3. Monitor SQL untestedtests/tgbot/test_start.py regression-tests handle_start (correct scope for FFM-907) but the ghost-detection SQL itself has no integration test. Add a small tests/flows/test_ghost_users.py later: insert 4 fixture users (warn / error / has-reaction / blocked deep_link), assert counts.
  4. Unbounded ARRAY_AGG — theoretical at current FF-memes scale; not worth fixing unless ghost rate spikes massively.

None of these are SQL injection, race conditions on writes, or secrets. Auto-merge queued.

@ohld ohld merged commit c03fa31 into production May 5, 2026
3 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant