Skip to content

diag: /admin/memory endpoint for in-process leak investigation#75

Merged
bk86a merged 1 commit intomainfrom
diag/admin-memory-endpoint
May 3, 2026
Merged

diag: /admin/memory endpoint for in-process leak investigation#75
bk86a merged 1 commit intomainfrom
diag/admin-memory-endpoint

Conversation

@bk86a
Copy link
Copy Markdown
Owner

@bk86a bk86a commented May 3, 2026

Summary

  • Adds GET /admin/memory, trusted-token-gated, returning the levers to localise in-process growth: every module-scoped dict's len(), /proc/self/status (VmRSS/VmHWM/RssAnon/RssFile/Threads), FD count, asyncio task count + sample, and a top-30 gc.get_objects() type histogram.
  • include_in_schema=False keeps it out of the public OpenAPI doc. Mirrors /admin/refresh-estimates's auth shape (401 without bearer, 200 otherwise).
  • Intended as a diagnostic, not a permanent feature — a gc.get_objects() walk on a 4 GB heap runs in hundreds of ms, fine for manual snapshots, not a hot path.

Why

We're investigating a memory growth that began Sat 2 May ~06:00 — pre-deploy baseline was ~810 MB pod RAM (docs/performance.md:106-116), now over 4 GB and climbing. Two leading hypotheses (slowapi MemoryStorage accumulating per-IP, and the new periodic estimates refresh swap path) were both reproduced locally and disproven — MemoryStorage evicts properly under sustained traffic; refresh_estimates_once stays flat across 200 ticks. Without instrumented snapshots from the running container we'd be guessing.

Test plan

  • pytest tests/test_api.py::TestAdminMemoryEndpoint — 3 tests: 401 without token, 401 with invalid bearer, 200 returns the expected payload shape (sizes / proc / asyncio_tasks / thread_count / gc_top_30_types).
  • pytest tests/test_api.py — 48/48 passing.
  • ruff check app/main.py tests/test_api.py — clean.
  • After merge + deploy: snapshot once shortly after restart, again 6-12 h later, diff to identify the growing object class.

Operator usage (post-deploy)

curl -s https://api.datatoolset.eu/admin/memory \
  -H "Authorization: Bearer <trusted-token>" | jq

🤖 Generated with Claude Code

Trusted-token-gated GET that returns the levers we need to localise an
in-process growth: lengths of every module-scoped dict (`_lookup`,
`_estimates`, `_prefix_index`, `_nuts_names`, `auth._db_tokens`, slowapi
`_storage.{storage,expirations,events,locks}`), the relevant
/proc/self/status fields (VmRSS / VmHWM / RssAnon / RssFile / Threads),
asyncio task count + sample, file-descriptor count, and a top-30
`gc.get_objects()` type histogram.

include_in_schema=False so it stays out of the public OpenAPI doc.
GET (no side effects); the gc walk runs in O(n_objects) so a single call
on a 4 GB heap takes ~hundreds of ms — fine for a manually-triggered
diagnostic, not intended as a hot path.

Mirrors the auth shape of /admin/refresh-estimates: 401 without a
trusted bearer, 200 otherwise. Tests cover both auth paths plus the
response shape.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@bk86a bk86a merged commit f8904af into main May 3, 2026
11 checks passed
@bk86a bk86a deleted the diag/admin-memory-endpoint branch May 3, 2026 18:10
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: c8c038f59c

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread app/main.py
Comment on lines +507 to +509
gc.collect()
type_counts = Counter(type(o).__name__ for o in gc.get_objects())
top_types = [{"type": t, "count": c} for t, c in type_counts.most_common(30)]
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Offload heap scan from event-loop handler

This endpoint is declared async but performs heavy synchronous CPU work (gc.collect() and full gc.get_objects() traversal), so each /admin/memory call can block the worker event loop and pause unrelated requests handled by the same process. In production, even occasional operator snapshots can introduce noticeable latency spikes under load; this diagnostic logic should run in a threadpool (e.g., asyncio.to_thread) or the route should be def so FastAPI executes it off the main loop.

Useful? React with 👍 / 👎.

bk86a added a commit that referenced this pull request May 3, 2026
Two changes addressing PR #75 review (Codex P2):

1. Move the `gc.get_objects()` + Counter loop into `asyncio.to_thread`.
   On a multi-GB heap the walk iterates 10M+ objects; the pure-Python
   loop releases the GIL between iterations so other coroutines on the
   worker can interleave instead of stalling for the duration of one
   operator snapshot.

2. Drop `gc.collect()`. On a multi-GB heap collection costs seconds
   and holds the GIL throughout (no thread offload helps). For a leak
   investigation, transient-cycle noise is negligible compared to
   whatever growth we'd be chasing — skipping it gives a better
   blocking profile without measurably degrading the histogram.

`asyncio.all_tasks()` is captured before the offload because it must
run on the event loop. Endpoint contract is unchanged; existing tests
in tests/test_api.py::TestAdminMemoryEndpoint still pass.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
bk86a added a commit that referenced this pull request May 3, 2026
Bumps __version__ to 0.19.0 and folds the [Unreleased] CHANGELOG block
into a dated [0.19.0] release. Adds entries for the two diagnostic-PR
landings since the prior cut: /admin/memory endpoint (#75) and the
follow-up offload of its gc walk to a thread (#76).

Other changes already in [Unreleased] from the prior week:
- / root endpoint
- persistent-volume support (entrypoint chown + gosu)
- provider-agnostic deployment scaffolding (compose.yaml + Makefile)
- periodic refresh of tercet_missing_codes.csv (#44)
- 85 new postal-code estimates from postal_code_monitor.py (#74)
- uvicorn --proxy-headers --forwarded-allow-ips '*' in CMD
- entrypoint safe under non-root --user
- multi-worker performance re-baseline
- estimates_refresh serialisation lock
- perf_test run_warm fix
- previously-stale __version__ catch-up

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant