fix(ci): remove NODE_OPTIONS causing OOM in Algolia workflow #17711
The user docs build has been OOM-failing since April 29: the `next build` for 2200+ MDX pages exceeds the 7 GB RAM on `ubuntu-latest`. Upgrade to `ubuntu-latest-4core` (16 GB), bump the Node heap to 12 GB, add a concurrency group to cancel superseded runs, and drop the unnecessary `generate-og-images` step.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
          NODE_OPTIONS: '--max-old-space-size=12288'
        steps:
          - uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6
Bug: The shared concurrency key can cause a new push to cancel an in-progress documentation update, leading to a stale search index for the cancelled job.
Severity: MEDIUM
Suggested Fix
Modify the concurrency group key to be unique for each documentation type. For example, incorporate the output of the path filter step into the key, such as `${{ github.workflow }}-${{ github.ref }}-${{ steps.filter.outputs.user_docs }}-${{ steps.filter.outputs.dev_docs }}`, to prevent unrelated updates from cancelling each other.
Location: .github/workflows/algolia-index.yml#L14-L17
Potential issue: The GitHub Actions workflow uses a `concurrency` group key that is
identical for all pushes to the `master` branch. Due to `cancel-in-progress: true`, a
new push can cancel an in-progress workflow run. The workflow has separate conditional
steps for updating user and developer documentation based on file paths. If a push
updating user docs cancels a run that was updating developer docs, the new run will not
re-trigger the developer docs update. This is because the path filter only evaluates
files in the new push, causing the developer documentation search index to become
silently stale. The long build times for each documentation set make this race condition
likely.
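
The race described above can be sketched with a toy model. This is only an illustration under stated assumptions: the function `staleDocSets`, the `stillRunningAtNextPush` flag, and the doc-set names are hypothetical and do not come from the workflow; the model simply treats each push as a run that re-indexes the doc sets its own changed files touch.

```javascript
// Toy model of the workflow: each push starts a run that re-indexes only the
// doc sets its own changed files touch (the path filter), and a run that is
// still in progress when the next push arrives may be cancelled.
function staleDocSets(pushes, cancelInProgress) {
  const changed = new Set();
  const indexed = new Set();
  pushes.forEach((push, i) => {
    push.changed.forEach(docSet => changed.add(docSet));
    const isLast = i === pushes.length - 1;
    const cancelled = cancelInProgress && !isLast && push.stillRunningAtNextPush;
    if (!cancelled) push.changed.forEach(docSet => indexed.add(docSet));
  });
  // Doc sets that changed but were never re-indexed by a completed run
  return [...changed].filter(docSet => !indexed.has(docSet));
}

// Hypothetical sequence: a dev-docs push is still building when a
// user-docs push lands on the same branch.
const pushes = [
  { changed: ['dev_docs'], stillRunningAtNextPush: true },
  { changed: ['user_docs'], stillRunningAtNextPush: false },
];

console.log(staleDocSets(pushes, true)); // [ 'dev_docs' ]
console.log(staleDocSets(pushes, false)); // []
```

With cancellation enabled, the developer docs change is never indexed because the second run's path filter only sees the user docs change.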
The `NODE_OPTIONS: '--max-old-space-size=6144'` setting was letting Node greedily allocate 6 GB heap on a 7 GB runner, leaving only 1 GB for the OS and triggering the OOM killer. The `lint-404s` workflow proves the same `next build` succeeds on `ubuntu-latest` without this override. Also remove unnecessary `generate-og-images` step (not needed for Algolia indexing).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Reviewed by Cursor Bugbot for commit f365b48.
      index:
        name: Update Algolia index
        runs-on: ubuntu-latest
        env:
NODE_OPTIONS removed instead of bumped to 12 GB
High Severity
The NODE_OPTIONS: '--max-old-space-size=6144' env var was removed entirely instead of being bumped to 12 GB (--max-old-space-size=12288) as the PR description states. Without this setting, Node.js defaults to a ~2–4 GB heap limit — significantly less than the previous 6 GB. Since the build was already OOM-killing at 6 GB, removing the heap override will make the OOM problem worse, not better. The runner also remains ubuntu-latest rather than ubuntu-latest-4core as described.
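
One way to check the heap limit a Node process is actually running with, and the headroom a given limit leaves on a runner, is sketched below. The helper names `heapLimitMiB` and `headroomMiB` are hypothetical; `v8.getHeapStatistics()` is Node's built-in API. The default limit depends on the Node version and the machine, so the first printed value will vary.

```javascript
const v8 = require('node:v8');

const MIB = 1024 * 1024;

// Effective V8 heap limit for this process, in MiB. The value grows when
// --max-old-space-size is raised (for example via NODE_OPTIONS).
function heapLimitMiB() {
  return Math.round(v8.getHeapStatistics().heap_size_limit / MIB);
}

// RAM left for everything outside the V8 heap (OS, native memory, other
// processes) if the heap were to grow all the way to its limit.
function headroomMiB(runnerRamMiB, heapMiB) {
  return runnerRamMiB - heapMiB;
}

console.log(`heap limit in this process: ${heapLimitMiB()} MiB`);
console.log(`7 GB runner with a 6144 MiB heap: ${headroomMiB(7 * 1024, 6144)} MiB left`);
```

Running the first helper inside the CI job, with and without the `NODE_OPTIONS` override, would show which limit the build is really subject to.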
Revert to using `pnpm build` and `pnpm build:developer-docs` instead of spelling out each sub-command. This matches what lint-404s uses and is proven to work.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Keep the explicit command chain that only runs what's needed for Algolia indexing (no generate-og-images, no generate-md-exports).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@sfanahata yeah i removed it instead of bumping because
## DESCRIBE YOUR PR

Follow-up to #17711. The previous PR fixed `NODE_OPTIONS` causing `next build` to OOM, but the algolia script itself was also a problem: it fired `Promise.all` on all ~9,820 pages simultaneously with no concurrency limit, no caching, and no monitoring.

**Changes:**

1. **Concurrency limit**: `p-limit(50)` caps concurrent page processing (same pattern as `generate-md-exports.mjs`). Previously all 9,820 pages were processed simultaneously via unbounded `Promise.all`.
2. **Content-hash caching**: the MD5 hash of each HTML file is used as a cache key. Algolia records for unchanged pages are read from `.next/cache/algolia-records/` (already covered by the GitHub Actions cache step). The first run processes all pages; subsequent runs only reprocess pages whose HTML changed.
3. **Sentry metrics**: tracks page count, record count, generation duration, and cache hit/miss rate via the `ALGOLIA_SENTRY_DSN` env var. No-op if the secret isn't set.

**Context:**

- 212 failed runs burned **157 hours** of CI time (~$75) since late March
- The `generate-md-exports.mjs` script processes the same HTML files but uses worker threads, `p-limit`, and caching; the algolia script had none of these
- Developer docs (304 pages) always succeeded; user docs (9,820 pages → 234K records) consistently OOMed

**Note:** The `ALGOLIA_SENTRY_DSN` secret needs to be added to the repo for metrics to flow. Without it, Sentry init is skipped and the metrics calls are no-ops.

## IS YOUR CHANGE URGENT?

- [ ] Urgent deadline (GA date, etc.):
- [x] Other deadline: Algolia search index hasn't updated since April 28
- [ ] None: Not urgent, can wait up to 1 week+

## SLA

- Teamwork makes the dream work, so please add a reviewer to your PRs.
- Please give the docs team up to 1 week to review your PR unless you've added an urgent due date to it.

Thanks in advance for your help!

## PRE-MERGE CHECKLIST

- [x] Checked Vercel preview for correctness, including links
- [ ] PR was reviewed and approved by any necessary SMEs (subject matter experts)
- [ ] PR was reviewed and approved by a member of the [Sentry docs team](https://github.com/orgs/getsentry/teams/docs)

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: getsantry[bot] <66042841+getsantry[bot]@users.noreply.github.com>
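
The concurrency limit and content-hash caching described in the follow-up can be sketched roughly as follows. This is not the script from the PR: `createLimiter` is a small hand-rolled stand-in for `p-limit` so the example has no dependencies, the cache is an in-memory `Map` rather than files under `.next/cache/algolia-records/`, and `processPages`, `buildRecords`, and the page shape `{ path, html }` are hypothetical names chosen for illustration.

```javascript
const { createHash } = require('node:crypto');

// Hand-rolled stand-in for p-limit: run at most `max` tasks at once.
function createLimiter(max) {
  let active = 0;
  const queue = [];
  const next = () => {
    if (active >= max || queue.length === 0) return;
    active++;
    const { task, resolve, reject } = queue.shift();
    Promise.resolve()
      .then(task)
      .then(resolve, reject)
      .finally(() => {
        active--;
        next(); // start the next queued task, if any
      });
  };
  return task =>
    new Promise((resolve, reject) => {
      queue.push({ task, resolve, reject });
      next();
    });
}

// Hypothetical page processor. `pages` is [{ path, html }], `buildRecords`
// turns one page into an array of search records, and `cache` maps the MD5
// of a page's HTML to its previously built records.
async function processPages(pages, buildRecords, cache = new Map(), max = 50) {
  const limit = createLimiter(max);
  const stats = { hits: 0, misses: 0 };
  const perPage = await Promise.all(
    pages.map(page =>
      limit(async () => {
        const key = createHash('md5').update(page.html).digest('hex');
        if (cache.has(key)) {
          stats.hits++;
          return cache.get(key); // unchanged HTML: reuse cached records
        }
        stats.misses++;
        const records = await buildRecords(page);
        cache.set(key, records);
        return records;
      })
    )
  );
  return { records: perPage.flat(), stats };
}
```

`Promise.all` is still used to collect results, but each task only starts once the limiter has a free slot, so at most `max` pages are in flight at a time. Keying the cache on the HTML alone assumes no two pages share byte-identical HTML.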
## DESCRIBE YOUR PR

The Algolia user docs build has been increasingly flaky since late March and is now failing ~100% of the time. All failures show the same pattern: `exit code 143` (SIGTERM) with "runner received shutdown signal" after ~56 minutes, meaning the runner is being OOM-killed.

### Root cause

The `NODE_OPTIONS: '--max-old-space-size=6144'` setting (added in #17283 on April 9 as an attempted fix) tells Node it can allocate up to 6 GB heap on a 7 GB runner, leaving only ~1 GB for the OS/kernel and triggering the OOM killer.

### Evidence

The `lint-404s` workflow does the **same** `pnpm build` → `next build` on the same `ubuntu-latest` runner **without** `NODE_OPTIONS` and has a **0% failure rate**. The algolia workflow (identical build but with `NODE_OPTIONS=6144`) has been failing consistently.

Before April 9 (no `NODE_OPTIONS`, no `.next/cache`), the workflow was already flaky at ~30-40%. The April 9 fix added both cache (which helps) and `NODE_OPTIONS` (which hurts), netting a worse outcome. This PR keeps the cache but removes `NODE_OPTIONS`.

### Timeline

- **Late March**: Failures start appearing (~5-30% rate), no NODE_OPTIONS, no build cache
- **April 9 (#17283)**: Added `NODE_OPTIONS=6144` + `.next/cache`; failure rate increased to ~50-65%
- **Late April–now**: User docs build has not succeeded once since April 28

### What this PR does

- Removes `NODE_OPTIONS: '--max-old-space-size=6144'` (the main culprit)
- Removes unnecessary `generate-og-images` from the user docs build step

### What this does NOT do

- We cannot 100% verify locally because we can't replicate the 7 GB memory constraint of the GitHub Actions runner. The strongest evidence is the `lint-404s` comparison (same build, same runner, no `NODE_OPTIONS`, 0% failure rate).

## IS YOUR CHANGE URGENT?

Help us prioritize incoming PRs by letting us know when the change needs to go live.

- [ ] Urgent deadline (GA date, etc.):
- [x] Other deadline: Algolia search index for user docs hasn't been updated since April 28
- [ ] None: Not urgent, can wait up to 1 week+

## SLA

- Teamwork makes the dream work, so please add a reviewer to your PRs.
- Please give the docs team up to 1 week to review your PR unless you've added an urgent due date to it.

Thanks in advance for your help!

## PRE-MERGE CHECKLIST

*Make sure you've checked the following before merging your changes:*

- [x] Checked Vercel preview for correctness, including links
- [ ] PR was reviewed and approved by any necessary SMEs (subject matter experts)
- [ ] PR was reviewed and approved by a member of the [Sentry docs team](https://github.com/orgs/getsentry/teams/docs)

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

