
fix(ci): remove NODE_OPTIONS causing OOM in Algolia workflow #17711

Merged
sergical merged 5 commits into master from fix/algolia-workflow-oom
May 11, 2026

Conversation

@sergical
Member

@sergical sergical commented May 11, 2026

DESCRIBE YOUR PR

The Algolia user docs build has been increasingly flaky since late March and is now failing ~100% of the time. All failures show the same pattern: exit code 143 (SIGTERM) with "runner received shutdown signal" after ~56 minutes — the runner is being OOM-killed.
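For readers decoding the failure signature: by shell convention, an exit status above 128 means the process was terminated by signal number `status - 128`. The sketch below only illustrates that arithmetic; `decodeExitCode` and the small signal table are hypothetical helpers, not part of the workflow.

```javascript
// Linux signal numbers, limited to the two most relevant here.
const SIGNAL_NAMES = { 9: 'SIGKILL', 15: 'SIGTERM' };

// Decode a shell-style exit status: values above 128 conventionally mean
// "terminated by signal (status - 128)".
function decodeExitCode(code) {
  if (code <= 128) {
    return { signaled: false, code };
  }
  const signal = code - 128;
  return { signaled: true, signal, name: SIGNAL_NAMES[signal] ?? `signal ${signal}` };
}

console.log(decodeExitCode(143)); // { signaled: true, signal: 15, name: 'SIGTERM' }
```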

Root cause

The NODE_OPTIONS: '--max-old-space-size=6144' setting (added in #17283 on April 9 as an attempted fix) tells Node it can allocate up to 6 GB heap on a 7 GB runner, leaving only ~1 GB for the OS/kernel — triggering the OOM killer.
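The headroom arithmetic behind this can be sketched as follows. It assumes the 7 GB runner figure quoted above and treats `--max-old-space-size` as a cap that the heap may grow to, not memory reserved up front; `headroomMiB` is a hypothetical helper for illustration.

```javascript
const MIB_PER_GIB = 1024;

// Memory left for everything outside the V8 old-space heap (other Node
// allocations, child processes, the OS) if the heap grows to its cap.
function headroomMiB(runnerGiB, maxOldSpaceMiB) {
  return runnerGiB * MIB_PER_GIB - maxOldSpaceMiB;
}

console.log(headroomMiB(7, 6144)); // 1024
```

With a higher cap, V8 can defer garbage collection longer, so a build that would otherwise stay small may grow toward the cap and squeeze out the remaining ~1 GB.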

Evidence

The lint-404s workflow does the same pnpm build → next build on the same ubuntu-latest runner without NODE_OPTIONS and has a 0% failure rate. The algolia workflow (identical build, but with NODE_OPTIONS=6144) has been failing consistently.

Before April 9 (no NODE_OPTIONS, no .next/cache), the workflow was already flaky at ~30-40%. The April 9 fix added both cache (which helps) and NODE_OPTIONS (which hurts), netting a worse outcome. This PR keeps the cache but removes NODE_OPTIONS.

Timeline

  • Late March: Failures start appearing (~5-30% rate), no NODE_OPTIONS, no build cache
  • April 9 (fix(ci): resolve OOM in Algolia index workflow #17283): Added NODE_OPTIONS=6144 + .next/cache — failure rate increased to ~50-65%
  • Late April–now: User docs build has not succeeded once since April 28

What this PR does

  • Removes NODE_OPTIONS: '--max-old-space-size=6144' (the main culprit)
  • Removes unnecessary generate-og-images from the user docs build step

What this does NOT do

  • We cannot 100% verify locally because we can't replicate the 7 GB memory constraint of the GitHub Actions runner. The strongest evidence is the lint-404s comparison (same build, same runner, no NODE_OPTIONS, 0% failure rate).

IS YOUR CHANGE URGENT?

Help us prioritize incoming PRs by letting us know when the change needs to go live.

  • Urgent deadline (GA date, etc.):
  • Other deadline: Algolia search index for user docs hasn't been updated since April 28
  • None: Not urgent, can wait up to 1 week+

SLA

  • Teamwork makes the dream work, so please add a reviewer to your PRs.
  • Please give the docs team up to 1 week to review your PR unless you've added an urgent due date to it.
    Thanks in advance for your help!

PRE-MERGE CHECKLIST

Make sure you've checked the following before merging your changes:

  • Checked Vercel preview for correctness, including links
  • PR was reviewed and approved by any necessary SMEs (subject matter experts)
  • PR was reviewed and approved by a member of the Sentry docs team

The user docs build has been OOM-failing since April 29 — the `next build`
for 2200+ MDX pages exceeds the 7 GB RAM on `ubuntu-latest`. Upgrade to
`ubuntu-latest-4core` (16 GB), bump the Node heap to 12 GB, add a
concurrency group to cancel superseded runs, and drop the unnecessary
`generate-og-images` step.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@vercel

vercel Bot commented May 11, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| develop-docs | Ready | Preview, Comment | May 11, 2026 5:59pm |
| sentry-docs | Ready | Preview, Comment | May 11, 2026 5:59pm |


Comment thread .github/workflows/algolia-index.yml Outdated
Comment thread .github/workflows/algolia-index.yml Outdated
Comment on lines 14 to 17
      NODE_OPTIONS: '--max-old-space-size=12288'
    steps:
      - uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6

Contributor


Bug: The shared concurrency key can cause a new push to cancel an in-progress documentation update, leading to a stale search index for the cancelled job.
Severity: MEDIUM

Suggested Fix

Modify the concurrency group key to be unique for each documentation type. For example, incorporate the output of the path filter step into the key, such as ${{ github.workflow }}-${{ github.ref }}-${{ steps.filter.outputs.user_docs }}-${{ steps.filter.outputs.dev_docs }}, to prevent unrelated updates from cancelling each other.

Prompt for AI Agent
Review the code at the location below. A potential bug has been identified by an AI
agent. Verify if this is a real issue. If it is, propose a fix; if not, explain why it's
not valid.

Location: .github/workflows/algolia-index.yml#L14-L17

Potential issue: The GitHub Actions workflow uses a `concurrency` group key that is
identical for all pushes to the `master` branch. Due to `cancel-in-progress: true`, a
new push can cancel an in-progress workflow run. The workflow has separate conditional
steps for updating user and developer documentation based on file paths. If a push
updating user docs cancels a run that was updating developer docs, the new run will not
re-trigger the developer docs update. This is because the path filter only evaluates
files in the new push, causing the developer documentation search index to become
silently stale. The long build times for each documentation set make this race condition
likely.
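The race described above can be modelled with a small simulation. This is a sketch under stated assumptions, not the workflow's actual behaviour: every push is assumed to arrive while the previous run in its group is still in progress, and `indexedDocs`, the push objects, and both key functions are hypothetical. Whether a real concurrency key can reference step outputs is not verified here.

```javascript
// Each push triggers a run; `docs` lists which indexes its path filter selects.
function indexedDocs(pushes, groupKey) {
  const inProgress = new Map(); // group key -> run currently in progress
  for (const push of pushes) {
    // cancel-in-progress: a new run replaces any running one in the same group.
    inProgress.set(groupKey(push), push);
  }
  // Only runs that were never cancelled finish and update their indexes.
  const updated = new Set();
  for (const run of inProgress.values()) {
    run.docs.forEach(doc => updated.add(doc));
  }
  return [...updated].sort();
}

const pushes = [
  { sha: 'a1', docs: ['dev'] },
  { sha: 'b2', docs: ['user'] },
];

const sharedKey = () => 'algolia-refs/heads/master';
const perDocKey = push => `algolia-${push.docs.join('+')}`;

console.log(indexedDocs(pushes, sharedKey)); // [ 'user' ]
console.log(indexedDocs(pushes, perDocKey)); // [ 'dev', 'user' ]
```

With the shared key, the developer docs run is cancelled and never retried, which is the stale-index case the comment warns about.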


The `NODE_OPTIONS: '--max-old-space-size=6144'` setting was letting Node
greedily allocate 6 GB heap on a 7 GB runner, leaving only 1 GB for the
OS and triggering the OOM killer. The `lint-404s` workflow proves the
same `next build` succeeds on `ubuntu-latest` without this override.

Also remove unnecessary `generate-og-images` step (not needed for
Algolia indexing).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Contributor

@cursor cursor Bot left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.

Reviewed by Cursor Bugbot for commit f365b48.

  index:
    name: Update Algolia index
    runs-on: ubuntu-latest
    env:
Contributor


NODE_OPTIONS removed instead of bumped to 12 GB

High Severity

The NODE_OPTIONS: '--max-old-space-size=6144' env var was removed entirely instead of being bumped to 12 GB (--max-old-space-size=12288) as the PR description states. Without this setting, Node.js defaults to a ~2–4 GB heap limit — significantly less than the previous 6 GB. Since the build was already OOM-killing at 6 GB, removing the heap override will make the OOM problem worse, not better. The runner also remains ubuntu-latest rather than ubuntu-latest-4core as described.
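The default heap limit depends on the Node.js version and the memory available on the machine, so rather than relying on an approximate figure, the effective limit can be printed from inside the build environment. A minimal sketch using Node's built-in `v8` module; `heapLimitMiB` is a hypothetical helper name.

```javascript
const v8 = require('node:v8');

// heap_size_limit is reported in bytes; convert to MiB so it can be compared
// with the value passed to --max-old-space-size.
function heapLimitMiB() {
  return Math.round(v8.getHeapStatistics().heap_size_limit / (1024 * 1024));
}

console.log(`effective heap limit: ${heapLimitMiB()} MiB`);
```

Running this once with and once without `NODE_OPTIONS` on the runner would show what limit each configuration actually gets.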


@sergical sergical changed the title fix(ci): use larger runner for Algolia indexing workflow fix(ci): remove NODE_OPTIONS causing OOM in Algolia workflow May 11, 2026
sergical and others added 2 commits May 11, 2026 13:47
Revert to using `pnpm build` and `pnpm build:developer-docs` instead of
spelling out each sub-command. This matches what lint-404s uses and is
proven to work.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Keep the explicit command chain that only runs what's needed for Algolia
indexing (no generate-og-images, no generate-md-exports).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Contributor

@sfanahata sfanahata left a comment


@sergical I know you said this is one that we need to push to see, so looks fine except for the cursor bot comment that looks like it might be an issue. Mind checking that before merging?

@sergical
Member Author

@sfanahata yeah, I removed it instead of bumping because the lint-404s workflow isn't failing the next build step with any configuration, so I just want to validate that the default config actually solves the problem vs. bumping the config

@sergical sergical merged commit 7a4ecdf into master May 11, 2026
20 checks passed
@sergical sergical deleted the fix/algolia-workflow-oom branch May 11, 2026 18:19
sergical added a commit that referenced this pull request May 12, 2026
## DESCRIBE YOUR PR

Follow-up to #17711. The previous PR fixed `NODE_OPTIONS` causing `next
build` to OOM, but the algolia script itself was also a problem — it
fired `Promise.all` on all ~9,820 pages simultaneously with no
concurrency limit, no caching, and no monitoring.

**Changes:**

1. **Concurrency limit** — `p-limit(50)` caps concurrent page processing
(same pattern as `generate-md-exports.mjs`). Previously all 9,820 pages
were processed simultaneously via unbounded `Promise.all`.

2. **Content-hash caching** — MD5 hash of each HTML file is used as a
cache key. Algolia records for unchanged pages are read from
`.next/cache/algolia-records/` (already covered by the GitHub Actions
cache step). First run processes all pages; subsequent runs only
reprocess pages whose HTML changed.

3. **Sentry metrics** — tracks page count, record count, generation
duration, and cache hit/miss rate via `ALGOLIA_SENTRY_DSN` env var.
No-op if the secret isn't set.
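
The concurrency cap in item 1 can be sketched without dependencies. The limiter below is a hand-rolled stand-in for `p-limit`, not the library itself, and `processPages` and `processPage` are hypothetical names for the per-page work.

```javascript
// Minimal stand-in for p-limit: returns a wrapper that runs at most
// `concurrency` tasks at once and queues the rest.
function createLimit(concurrency) {
  let active = 0;
  const queue = [];

  const next = () => {
    if (active >= concurrency || queue.length === 0) return;
    active++;
    const { task, resolve, reject } = queue.shift();
    Promise.resolve()
      .then(task)
      .then(resolve, reject)
      .finally(() => {
        active--;
        next(); // start the next queued task, if any
      });
  };

  return task =>
    new Promise((resolve, reject) => {
      queue.push({ task, resolve, reject });
      next();
    });
}

// Promise.all still collects every result, but only `concurrency` pages
// are being processed at any moment.
async function processPages(pages, processPage, concurrency = 50) {
  const limit = createLimit(concurrency);
  return Promise.all(pages.map(page => limit(() => processPage(page))));
}
```

The difference from the unbounded version is that each page's work starts only when a slot frees up, so peak memory scales with the limit instead of with the page count.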

**Context:**
- 212 failed runs burned **157 hours** of CI time (~$75) since late
March
- The `generate-md-exports.mjs` script processes the same HTML files but
uses worker threads, `p-limit`, and caching — the algolia script had
none of these
- Developer docs (304 pages) always succeeded; user docs (9,820 pages →
234K records) consistently OOMed
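
The content-hash caching from item 2 could look roughly like this. It is a sketch under assumptions: the one-JSON-file-per-hash layout is a guess at how `.next/cache/algolia-records/` might be organised, and `recordsForPage` and `buildRecords` are hypothetical names.

```javascript
const crypto = require('node:crypto');
const fs = require('node:fs');
const path = require('node:path');

// Returns cached records when the HTML content hash matches; otherwise runs
// `buildRecords` (a hypothetical HTML -> records function) and stores the result.
function recordsForPage(htmlPath, cacheDir, buildRecords) {
  const html = fs.readFileSync(htmlPath);
  const hash = crypto.createHash('md5').update(html).digest('hex');
  const cacheFile = path.join(cacheDir, `${hash}.json`);

  if (fs.existsSync(cacheFile)) {
    return { hit: true, records: JSON.parse(fs.readFileSync(cacheFile, 'utf8')) };
  }

  const records = buildRecords(html.toString('utf8'));
  fs.mkdirSync(cacheDir, { recursive: true });
  fs.writeFileSync(cacheFile, JSON.stringify(records));
  return { hit: false, records };
}
```

MD5 is used here only as a change detector, matching the description above, not for any security purpose.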

**Note:** The `ALGOLIA_SENTRY_DSN` secret needs to be added to the repo
for metrics to flow. Without it, Sentry init is skipped and the metrics
calls are no-ops.
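
The no-op behaviour described in the note can be sketched with a small wrapper. This deliberately avoids the real Sentry SDK calls: `createMetrics` and the injected `createClient` factory are hypothetical stand-ins, and only the `ALGOLIA_SENTRY_DSN` variable name comes from the description above.

```javascript
// `createClient` is a hypothetical factory standing in for real SDK setup;
// it is only invoked when a DSN is present.
function createMetrics(dsn, createClient) {
  if (!dsn) {
    // No DSN: skip initialization and hand back no-op recorders.
    return { enabled: false, gauge() {}, flush: async () => {} };
  }
  const client = createClient(dsn);
  return {
    enabled: true,
    gauge: (name, value) => client.gauge(name, value),
    flush: () => client.flush(),
  };
}

const metrics = createMetrics(process.env.ALGOLIA_SENTRY_DSN, () => ({
  gauge: (name, value) => console.log(`[metrics] ${name}=${value}`),
  flush: async () => {},
}));
metrics.gauge('algolia.pages', 9820); // silently ignored when the DSN is unset
```

Callers never need to check whether metrics are enabled, which keeps the indexing code identical in CI runs with and without the secret.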

## IS YOUR CHANGE URGENT?

- [ ] Urgent deadline (GA date, etc.):
- [x] Other deadline: Algolia search index hasn't updated since April 28
- [ ] None: Not urgent, can wait up to 1 week+

## SLA

- Teamwork makes the dream work, so please add a reviewer to your PRs.
- Please give the docs team up to 1 week to review your PR unless you've
added an urgent due date to it.
Thanks in advance for your help!

## PRE-MERGE CHECKLIST

- [x] Checked Vercel preview for correctness, including links
- [ ] PR was reviewed and approved by any necessary SMEs (subject matter
experts)
- [ ] PR was reviewed and approved by a member of the [Sentry docs
team](https://github.com/orgs/getsentry/teams/docs)

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: getsantry[bot] <66042841+getsantry[bot]@users.noreply.github.com>
sentrivana pushed a commit that referenced this pull request May 18, 2026
sentrivana pushed a commit that referenced this pull request May 18, 2026
@github-actions github-actions Bot locked and limited conversation to collaborators May 27, 2026