CI flake diagnostics: readiness-check + startup phase logging#4873
CI flake diagnostics: readiness-check + startup phase logging#4873habdelra wants to merge 3 commits into
Conversation
The "Error: Timed out waiting for: .../_readiness-check" failure in the
host-test job recurs without enough log signal to root-cause. Today the
realm-server log artifact jumps from "is starting indexing" to nothing
for 10 minutes; the worker log similarly silences between
"completed invalidations" and "completed index visit"; the wait-on
output never shows what HTTP response (if any) it saw.
Temporary diagnostic logging, all tagged "[ci-readiness-diag]":
- Realm#readinessCheck: entry log, 15s heartbeat while #startedUp is
pending, fulfillment log with total wait time.
- Realm.#startup: per-phase info logs (copy / isNewIndex /
fullIndexOnStartup / awaiting fullIndex / awaiting isWorldReadable
/ complete) so we can see exactly where a realm is parked.
- RealmIndexUpdater.publishFullIndex: log job_id and queue.publish
duration the moment the from-scratch job is durably enqueued.
- IndexRunner.fromScratch: visit-loop progress every 25 files
(filesCompleted / totalFiles / lastUrl / elapsedMs).
- ci-host.yaml host-test step: DEBUG=start-server-and-test so wait-on
prints each probe's HTTP status / connection error.
Removal: `git grep ci-readiness-diag` locates every block. Each new log
line carries the "[ci-readiness-diag]" prefix in its message so search
on log artifacts also finds them.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Pull request overview
Adds temporary, tagged diagnostic logging and CI verbosity to better pinpoint where the realm-server startup/indexing pipeline stalls when _readiness-check timeouts occur in CI.
Changes:
- Add
[ci-readiness-diag]logging around_readiness-checkwaiting and realm startup phases. - Add enqueue-time diagnostics for from-scratch indexing jobs and progress heartbeats during the visit loop.
- Enable
start-server-and-testdebug output inci-hostso wait-on prints per-probe status/errors.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
.github/workflows/ci-host.yaml |
Enables DEBUG=start-server-and-test to surface wait-on probe diagnostics in CI output. |
packages/runtime-common/realm.ts |
Adds readiness-check waiting logs and startup phase markers. |
packages/runtime-common/realm-index-updater.ts |
Logs from-scratch job enqueue details (job id + timing). |
packages/runtime-common/index-runner.ts |
Adds periodic visit-loop progress logs during from-scratch indexing. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Two findings on PR #4873: 1. readinessCheck created a setInterval heartbeat per probe. wait-on polls every 0.5–2s for up to 10 minutes, so a stalled startup would accumulate ~300 concurrent intervals all heartbeating the same Deferred, ~300 log lines every 15s. Moved the heartbeat to #startup so it runs exactly once per realm-lifetime regardless of probe count. readinessCheck keeps only entry + fulfillment logs. 2. publishFullIndex labeled the elapsed time "queue.publish took", but startedAt is captured before getRealmOwnerUsername() and the clearLastModified write inside enqueueReindexRealmJob. Renamed the label to "enqueue path took" so it accurately covers the full path from "is starting indexing" to "durably enqueued". Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
maybe these new logs should be at a debug level with a common log category, and then we just enable that logger as part of the realm server startup for host tests, so this is purely opt in |
Per review: the diagnostic logs shouldn't be `info`-level with
`[ci-readiness-diag]` baked into each message — they should be `debug`
on a common logger category that's opt-in via `LOG_LEVELS`. That makes
them invisible in production by default, and the cleanup story is
"drop `readiness-diag=debug` from `HOST_TEST_LOG_LEVELS`" with the
code-removal as a follow-up.
- New named logger `readiness-diag` on each of `Realm`,
`RealmIndexUpdater`, and `IndexRunner` (as `#readinessDiag`).
- Every prior `this.#log.info('[ci-readiness-diag] ...')` call site
rewritten as `this.#readinessDiag.debug('...')`. The string prefix
drops out — the logger name does that job.
- `HOST_TEST_LOG_LEVELS` in `mise-tasks/test-services/host` now
includes `readiness-diag=debug`. Production realm-server keeps
`*=info` so the new logger stays silent there.
- Renamed in-code marker comments from `[ci-readiness-diag]` to
`[readiness-diag]` for consistency with the logger name.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
[Claude Code 🤖] Good call — moved them all to a new Knock-on cleanup: dropped the Pushed as 43f127e. Cleanup path is now two-stage:
|
Host Test Results 1 files 1 suites 1h 43m 56s ⏱️ Results for commit 43f127e. Realm Server Test Results 1 files ± 0 1 suites +1 8m 10s ⏱️ + 8m 10s Results for commit 43f127e. ± Comparison against earlier commit 9ec47d4. |
Why
The host-test job recurrently fails with:
Seen on this branch's run #26061756182 and on a
mainrun ~50 min earlier (#26058294854), same 4 realm URLs each time. wait-on waits 10 min, then aborts.Today's signal floor:
realm-server.logjumps fromis starting indexingto nothing for 10 minutes.worker-manager.logsilences betweencompleted invalidations in Xmsandcompleted index visit in Yms(no per-file heartbeat in the visit loop).We can't tell whether the hang is in
queue.publish, in worker pickup, in the visit loop, inisWorldReadable, or in the realm-server's HTTP stack itself.What this PR adds
Temporary diagnostic logging, every line tagged
[ci-readiness-diag]:Realm#readinessCheck#startedUpis pending, fulfillment with total wait timeRealm.#startupcopy/isNewIndex/fullIndexOnStartup/awaiting fullIndex/awaiting isWorldReadable/completeRealmIndexUpdater.publishFullIndexjob_id+queue.publishduration the moment the from-scratch job is durably enqueuedIndexRunner.fromScratchvisit loopfilesCompleted/totalFiles,lastUrl,elapsedMsci-host.yamlhost-test stepDEBUG=start-server-and-testenv so wait-on prints each probe's HTTP status / connection errorExisting log levels already pass info through for the affected loggers (
*=info,realm-index-updater=debug,index-runner=debug,index-perf=debug), so no level changes were needed. Lines land in the existingrealm-server-log,test-realms-log, andworker-manager-logartifacts, plus the failing step's stdout (for the wait-on trace).Removal
git grep ci-readiness-diaglocates every block — comment markers + the log-string prefix make it findable both in source and in the log artifacts.Files
.github/workflows/ci-host.yamlpackages/runtime-common/realm.tspackages/runtime-common/realm-index-updater.tspackages/runtime-common/index-runner.tsTest plan
🤖 Generated with Claude Code