CI flake diagnostics: readiness-check + startup phase logging by habdelra · Pull Request #4873 · cardstack/boxel

habdelra · 2026-05-18T22:30:15Z

Why

The host-test job recurrently fails with:

Error: Timed out waiting for: https-get://localhost:4201/base/_readiness-check, .../node-test/..., .../skills/..., .../test/...

Seen on this branch's run #26061756182 and on a main run ~50 min earlier (#26058294854), same 4 realm URLs each time. wait-on waits 10 min, then aborts.

Today's signal floor:

realm-server.log jumps from is starting indexing to nothing for 10 minutes.
worker-manager.log silences between completed invalidations in Xms and completed index visit in Yms (no per-file heartbeat in the visit loop).
The wait-on failure message tells us nothing about what the realm-server responded — timeout? 500? not listening?

We can't tell whether the hang is in queue.publish, in worker pickup, in the visit loop, in isWorldReadable, or in the realm-server's HTTP stack itself.

What this PR adds

Temporary diagnostic logging, every line tagged [ci-readiness-diag]:

Location	Logs
`Realm#readinessCheck`	entry, 15s heartbeat while `#startedUp` is pending, fulfillment with total wait time
`Realm.#startup`	per-phase info logs: `copy` / `isNewIndex` / `fullIndexOnStartup` / `awaiting fullIndex` / `awaiting isWorldReadable` / `complete`
`RealmIndexUpdater.publishFullIndex`	`job_id` + `queue.publish` duration the moment the from-scratch job is durably enqueued
`IndexRunner.fromScratch` visit loop	progress line every 25 files: `filesCompleted/totalFiles`, `lastUrl`, `elapsedMs`
`ci-host.yaml` host-test step	`DEBUG=start-server-and-test` env so wait-on prints each probe's HTTP status / connection error

Existing log levels already pass info through for the affected loggers (*=info, realm-index-updater=debug, index-runner=debug, index-perf=debug), so no level changes were needed. Lines land in the existing realm-server-log, test-realms-log, and worker-manager-log artifacts, plus the failing step's stdout (for the wait-on trace).

Removal

git grep ci-readiness-diag locates every block — comment markers + the log-string prefix make it findable both in source and in the log artifacts.

Files

.github/workflows/ci-host.yaml
packages/runtime-common/realm.ts
packages/runtime-common/realm-index-updater.ts
packages/runtime-common/index-runner.ts

Test plan

CI passes (the new logs are info-level, no behavioral change)
On the next reproduction, the realm-server artifact should now show which realm is parked at which startup phase, and the wait-on output should show the actual HTTP response code

🤖 Generated with Claude Code

The "Error: Timed out waiting for: .../_readiness-check" failure in the host-test job recurs without enough log signal to root-cause. Today the realm-server log artifact jumps from "is starting indexing" to nothing for 10 minutes; the worker log similarly silences between "completed invalidations" and "completed index visit"; the wait-on output never shows what HTTP response (if any) it saw. Temporary diagnostic logging, all tagged "[ci-readiness-diag]": - Realm#readinessCheck: entry log, 15s heartbeat while #startedUp is pending, fulfillment log with total wait time. - Realm.#startup: per-phase info logs (copy / isNewIndex / fullIndexOnStartup / awaiting fullIndex / awaiting isWorldReadable / complete) so we can see exactly where a realm is parked. - RealmIndexUpdater.publishFullIndex: log job_id and queue.publish duration the moment the from-scratch job is durably enqueued. - IndexRunner.fromScratch: visit-loop progress every 25 files (filesCompleted / totalFiles / lastUrl / elapsedMs). - ci-host.yaml host-test step: DEBUG=start-server-and-test so wait-on prints each probe's HTTP status / connection error. Removal: `git grep ci-readiness-diag` locates every block. Each new log line carries the "[ci-readiness-diag]" prefix in its message so search on log artifacts also finds them. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Copilot

Pull request overview

Adds temporary, tagged diagnostic logging and CI verbosity to better pinpoint where the realm-server startup/indexing pipeline stalls when _readiness-check timeouts occur in CI.

Changes:

Add [ci-readiness-diag] logging around _readiness-check waiting and realm startup phases.
Add enqueue-time diagnostics for from-scratch indexing jobs and progress heartbeats during the visit loop.
Enable start-server-and-test debug output in ci-host so wait-on prints per-probe status/errors.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

File	Description
`.github/workflows/ci-host.yaml`	Enables `DEBUG=start-server-and-test` to surface wait-on probe diagnostics in CI output.
`packages/runtime-common/realm.ts`	Adds readiness-check waiting logs and startup phase markers.
`packages/runtime-common/realm-index-updater.ts`	Logs from-scratch job enqueue details (job id + timing).
`packages/runtime-common/index-runner.ts`	Adds periodic visit-loop progress logs during from-scratch indexing.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Two findings on PR #4873: 1. readinessCheck created a setInterval heartbeat per probe. wait-on polls every 0.5–2s for up to 10 minutes, so a stalled startup would accumulate ~300 concurrent intervals all heartbeating the same Deferred, ~300 log lines every 15s. Moved the heartbeat to #startup so it runs exactly once per realm-lifetime regardless of probe count. readinessCheck keeps only entry + fulfillment logs. 2. publishFullIndex labeled the elapsed time "queue.publish took", but startedAt is captured before getRealmOwnerUsername() and the clearLastModified write inside enqueueReindexRealmJob. Renamed the label to "enqueue path took" so it accurately covers the full path from "is starting indexing" to "durably enqueued". Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

habdelra · 2026-05-18T22:47:32Z

maybe these new logs should be at a debug level with a common log category, and then we just enable that logger as part of the realm server startup for host tests, so this is purely opt in

Per review: the diagnostic logs shouldn't be `info`-level with `[ci-readiness-diag]` baked into each message — they should be `debug` on a common logger category that's opt-in via `LOG_LEVELS`. That makes them invisible in production by default, and the cleanup story is "drop `readiness-diag=debug` from `HOST_TEST_LOG_LEVELS`" with the code-removal as a follow-up. - New named logger `readiness-diag` on each of `Realm`, `RealmIndexUpdater`, and `IndexRunner` (as `#readinessDiag`). - Every prior `this.#log.info('[ci-readiness-diag] ...')` call site rewritten as `this.#readinessDiag.debug('...')`. The string prefix drops out — the logger name does that job. - `HOST_TEST_LOG_LEVELS` in `mise-tasks/test-services/host` now includes `readiness-diag=debug`. Production realm-server keeps `*=info` so the new logger stays silent there. - Renamed in-code marker comments from `[ci-readiness-diag]` to `[readiness-diag]` for consistency with the logger name. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

habdelra · 2026-05-18T22:51:51Z

[Claude Code 🤖]

Good call — moved them all to a new readiness-diag logger at debug, enabled via readiness-diag=debug in HOST_TEST_LOG_LEVELS. Production realm-server keeps its *=info default so this is purely opt-in in CI.

Knock-on cleanup: dropped the [ci-readiness-diag] string prefix from every log message (the logger name carries that signal now), and renamed the in-code marker comments to [readiness-diag] for consistency.

Pushed as 43f127e.

Cleanup path is now two-stage:

Drop readiness-diag=debug from HOST_TEST_LOG_LEVELS → logs silent everywhere.
git grep readiness-diag → delete the call sites + the #readinessDiag fields when convenient.

github-actions · 2026-05-18T23:34:34Z

Host Test Results

1 files 1 suites 1h 43m 56s ⏱️
2 661 tests 2 646 ✅ 15 💤 0 ❌
2 680 runs 2 665 ✅ 15 💤 0 ❌

Results for commit 43f127e.

Realm Server Test Results

1 files ± 0 1 suites +1 8m 10s ⏱️ + 8m 10s
1 408 tests +1 408 1 408 ✅ +1 408 0 💤 ±0 0 ❌ ±0
1 495 runs +1 495 1 495 ✅ +1 495 0 💤 ±0 0 ❌ ±0

Results for commit 43f127e. ± Comparison against earlier commit 9ec47d4.

habdelra marked this pull request as ready for review May 18, 2026 22:30

habdelra requested a review from Copilot May 18, 2026 22:30

Copilot started reviewing on behalf of habdelra May 18, 2026 22:31 View session

Copilot AI reviewed May 18, 2026

View reviewed changes

Comment thread packages/runtime-common/realm.ts Outdated

Comment thread packages/runtime-common/realm-index-updater.ts Outdated

habdelra requested a review from a team May 18, 2026 23:14

jurgenwerk approved these changes May 19, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

CI flake diagnostics: readiness-check + startup phase logging#4873

CI flake diagnostics: readiness-check + startup phase logging#4873
habdelra wants to merge 3 commits into
mainfrom
worktree-ci-readiness-diagnostics

habdelra commented May 18, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

habdelra commented May 18, 2026

Uh oh!

habdelra commented May 18, 2026

Uh oh!

github-actions Bot commented May 18, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

habdelra commented May 18, 2026

Why

What this PR adds

Removal

Files

Test plan

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

habdelra commented May 18, 2026

Uh oh!

habdelra commented May 18, 2026

Uh oh!

github-actions Bot commented May 18, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Host Test Results

Realm Server Test Results

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

github-actions Bot commented May 18, 2026 •

edited

Loading