
java.lang.OutOfMemoryError when upgrading from Bazel 7.0.2 to 7.1.1 #21803

Closed
ivan-golub opened this issue Mar 25, 2024 · 18 comments
Labels
P1 I'll work on this now. (Assignee required) team-ExternalDeps External dependency handling, remote repositories, WORKSPACE file. type: bug

Comments

@ivan-golub
Contributor

Description of the bug:

Upgrading from Bazel 7.0.2 to 7.1.1 resulted in a consistent OOM exception being thrown during bazel query:

04:45:00 [Bazel] Loading: 15 packages loaded
04:45:02 [Bazel] Loading: 268 packages loaded
04:45:02 [Bazel]     currently loading: bzl ... (1719 packages)
04:45:04 [Bazel] Loading: 1990 packages loaded
04:45:04 [Bazel]     currently loading: @@bazel_tools//tools/jdk ... (12 packages)
04:45:05 [Bazel] Loading: 2067 packages loaded
04:45:05 [Bazel]     currently loading: @@bazel_tools//tools/jdk ... (2 packages)
04:45:06 [Bazel] Loading: 2078 packages loaded
04:45:06 [Bazel]     currently loading: @@remote_java_tools//java_tools/zlib
04:45:07 [Bazel] Loading: 2079 packages loaded
04:45:07 [Bazel]     currently loading: @@remote_java_tools//java_tools/zlib
04:45:08 [Bazel] Loading: 2171 packages loaded
04:45:09 [Bazel] Loading: 2364 packages loaded
04:45:10 [Bazel] Loading: 2543 packages loaded
04:45:11 [Bazel] Loading: 2650 packages loaded
04:45:13 [Bazel] Loading: 2889 packages loaded
04:45:14 [Bazel] Loading: 2889 packages loaded
04:45:15 [Bazel] Loading: 2931 packages loaded
04:45:16 [Bazel] Loading: 3036 packages loaded
04:45:17 [Bazel] Loading: 3457 packages loaded
04:45:18 [Bazel] Loading: 3862 packages loaded
04:45:19 [Bazel] Loading: 4655 packages loaded
04:45:20 [Bazel] Loading: 5464 packages loaded
04:45:21 [Bazel] Loading: 6297 packages loaded
04:45:22 [Bazel] Loading: 7193 packages loaded
04:45:23 [Bazel] Loading: 7996 packages loaded
04:45:23 [Bazel] FATAL: bazel ran out of memory and crashed. Printing stack trace:
04:45:23 [Bazel] java.lang.OutOfMemoryError: unable to create native thread: possibly out of memory or process/resource limits reached
java.lang.OutOfMemoryError: unable to create native thread: possibly out of memory or process/resource limits reached
	at java.base/java.lang.Thread.start0(Native Method)
	at java.base/java.lang.Thread.start(Unknown Source)
	at java.base/java.lang.System$2.start(Unknown Source)
	at java.base/jdk.internal.vm.SharedThreadContainer.start(Unknown Source)
	at java.base/java.util.concurrent.ForkJoinPool.createWorker(Unknown Source)
	at java.base/java.util.concurrent.ForkJoinPool.tryCompensate(Unknown Source)
	at java.base/java.util.concurrent.ForkJoinPool.compensatedBlock(Unknown Source)
	at java.base/java.util.concurrent.ForkJoinPool.managedBlock(Unknown Source)
	at java.base/java.util.concurrent.SynchronousQueue$TransferStack.transfer(Unknown Source)
	at java.base/java.util.concurrent.SynchronousQueue.take(Unknown Source)
	at com.google.devtools.build.lib.bazel.repository.starlark.StarlarkRepositoryFunction.fetch(StarlarkRepositoryFunction.java:170)
	at com.google.devtools.build.lib.rules.repository.RepositoryDelegatorFunction.fetchRepository(RepositoryDelegatorFunction.java:418)
	at com.google.devtools.build.lib.rules.repository.RepositoryDelegatorFunction.compute(RepositoryDelegatorFunction.java:205)
	at com.google.devtools.build.skyframe.AbstractParallelEvaluator$Evaluate.run(AbstractParallelEvaluator.java:461)
	at com.google.devtools.build.lib.concurrent.AbstractQueueVisitor$WrappedRunnable.run(AbstractQueueVisitor.java:414)
	at java.base/java.util.concurrent.ForkJoinTask$AdaptedRunnableAction.exec(Unknown Source)
	at java.base/java.util.concurrent.ForkJoinTask.doExec(Unknown Source)
	at java.base/java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(Unknown Source)
	at java.base/java.util.concurrent.ForkJoinPool.scan(Unknown Source)
	at java.base/java.util.concurrent.ForkJoinPool.runWorker(Unknown Source)
	at java.base/java.util.concurrent.ForkJoinWorkerThread.run(Unknown Source)

Which category does this issue belong to?

Core

What's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.

No response

Which operating system are you running Bazel on?

linux

What is the output of bazel info release?

No response

If bazel info release returns development version or (@non-git), tell us how you built Bazel.

No response

What's the output of git remote get-url origin; git rev-parse HEAD ?

No response

Is this a regression? If yes, please try to identify the Bazel commit where the bug was introduced.

No response

Have you found anything relevant by searching the web?

A suggestion from the public Bazel Slack to disable experimental_worker_for_repo_fetching with --experimental_worker_for_repo_fetching=off resolved the issue.

Any other information, logs, or outputs that you want to share?

No response

@github-actions github-actions bot added the team-Core Skyframe, bazel query, BEP, options parsing, bazelrc label Mar 25, 2024
@Wyverald Wyverald added P1 I'll work on this now. (Assignee required) team-ExternalDeps External dependency handling, remote repositories, WORKSPACE file. and removed untriaged team-Core Skyframe, bazel query, BEP, options parsing, bazelrc labels Mar 25, 2024
@fmeum
Collaborator

fmeum commented Mar 25, 2024

@ivan-golub How many CPUs do you have on your machine? And how many repos are there that could potentially be fetched in parallel?

My suspicion: Loom doesn't use non-blocking file I/O yet and instead creates additional native threads when a virtual thread is blocked on file operations. If too many repos are blocked on them in parallel, this could run into the same thread limits as with a native thread pool. Hopefully we don't reach the OS limit and just need to tweak some native memory settings.
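
The stack trace above bottoms out in ForkJoinPool.tryCompensate → createWorker → Thread.start, i.e. the crash happens while the pool is spawning an extra platform thread to "compensate" for a worker that blocked. Below is a minimal, self-contained Java sketch of that mechanism (illustrative only, not Bazel code; the class and variable names are made up):

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ForkJoinPool;

/**
 * Hypothetical demo of ForkJoinPool "compensation": when a worker thread blocks
 * through ForkJoinPool.managedBlock (which SynchronousQueue.take() does in the
 * stack trace above), the pool may start an extra platform thread to keep its
 * target parallelism. With enough concurrently blocked workers, native thread
 * creation itself can fail with the reported OutOfMemoryError.
 */
public class CompensationDemo {
  public static void main(String[] args) throws Exception {
    ForkJoinPool pool = new ForkJoinPool(4);       // small pool, think "capped evaluator threads"
    CountDownLatch never = new CountDownLatch(1);  // nothing ever counts this down

    for (int i = 0; i < 64; i++) {
      pool.submit(() -> {
        try {
          // Blocking via managedBlock tells the pool this worker is stuck,
          // so it is allowed to spawn a compensation thread in its place.
          ForkJoinPool.managedBlock(new ForkJoinPool.ManagedBlocker() {
            @Override public boolean block() throws InterruptedException {
              never.await();
              return true;
            }
            @Override public boolean isReleasable() {
              return false;
            }
          });
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
        }
      });
    }

    Thread.sleep(2_000);
    // Prints far more than 4: roughly one live worker per blocked task.
    System.out.println("live pool threads: " + pool.getPoolSize());
    pool.shutdownNow();
  }
}
```

In the trace above, the blocked callers appear to be Skyframe evaluator threads waiting on repo fetches, so the number of compensation threads would scale with how many repos are being fetched at once.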

@Wyverald
Member

Thanks for the report. Some questions:

  • Do you have Bzlmod enabled? (check your .bazelrc file for --noenable_bzlmod)
  • Does the OOM happen with any query? Or what are the queries that cause this?
  • Do you have a sense of how many external repos you have defined? (you could run something like bazel query //external:all-targets | wc -l)

@ivan-golub
Contributor Author

How many CPUs do you have on your machine?

32 cores

How many repos are there that could potentially be fetched in parallel?
bazel query //external:all-targets | wc -l

7131

Do you have Bzlmod enabled? (check your .bazelrc file for --noenable_bzlmod)

Bzlmod disabled

Does the OOM happen with any query?

Definitely happens with the wildcard query //... that we use for diff-aware builds.

@meteorcloudy
Member

@bazel-io fork 7.2.0

@jfirebaugh

We (Figma) experienced this too.

  • Bzlmod is not enabled (--noenable_bzlmod)
  • OOM happens for us on any bazel action that involves fetching a significant number of repositories (mostly bazel build rather than bazel query)
  • bazel query //external:all-targets | wc -l: 20720

@fmeum
Collaborator

fmeum commented Mar 26, 2024

@ivan-golub @jfirebaugh Could you share at least a rough breakdown of which rulesets/repo rules contribute to this number of external repos?

@ivan-golub
Contributor Author

ivan-golub commented Mar 26, 2024

Could you share at least a rough breakdown of which rulesets/repo rules contribute to this number of external repos?

It's an Android repo in our case, so androidx, android_tools, dagger, kotlin, jdk, Robolectric, some JetBrains libs, internal protobuf repos, and 1st/3rd-party libs.

@keertk
Member

keertk commented Apr 29, 2024

Is this still on track for 7.2? We're aiming to create the first RC on 5/13.

@Wyverald
Member

Wyverald commented May 8, 2024

This one is hard to pin down. We'd still like to fix it if we can get a hold of it, but it's possible that the fix will only land in time for 7.2.1. Marked #21815 as a soft blocker.

@mpereira

mpereira commented Jun 3, 2024

I've also been experiencing this for a while now. In my case, changing more than a couple of requirement versions in requirements.in and then running a bazel build for a Python zip would spawn many, many python.pip_install.tools.wheel_installer.wheel_installer processes, eventually using 100% of the machine's memory and CPU.

Analyzing: target <redacted>_bin (125 packages loaded, 3295 targets configured)
[1 / 1] checking cached actions
    Fetching repository @@rules_python~~pip~pip_312_starlette; starting 22s
    Fetching repository @@rules_python~~pip~pip_312_requests; starting 22s
    Fetching repository @@rules_python~~pip~pip_312_tenacity; starting 22s
    Fetching repository @@rules_python~~pip~pip_312_pydantic_yaml; starting 22s
    Fetching repository @@rules_python~~pip~pip_312_nest_asyncio; starting 22s
    Fetching repository @@rules_python~~pip~pip_312_playwright_stealth; starting 22s
    Fetching repository @@rules_python~~pip~pip_312_typer; starting 22s
    Fetching repository @@rules_python~~pip~pip_312_uuid_utils; starting 22s ... (30 fetches)

Server terminated abruptly (error code: 14, error message: 'Connection reset by peer', log file: '/root/.cache/bazel/_bazel_root/b8ccf54d4a62f705275a4051f267d262/server/jvm.out')

I tried many different flags like --jobs=1 --local_resources=cpu=1.0 --local_resources=memory=256, but those have no effect on the repository-fetching code.

I was trying to solve this issue again today, saw the experimental_worker_for_repo_fetching flag in the Bazel changelogs, and found the issue mentioning that it recently started defaulting to auto: #21082

I tried running a bazel build with --experimental_worker_for_repo_fetching=off and that immediately fixed the issue! Memory consumption during repository fetching was minimal.

I hope other folks running into bazel build OOMs due to --experimental_worker_for_repo_fetching=auto also find this issue.

@GorshkovNikita

GorshkovNikita commented Jun 4, 2024

We also experienced this issue after updating Bazel from 6.4.0 to 7.1.2. We use rules_nixpkgs for external repositories. When I build a target with a lot of external dependencies, I see that Bazel generates too many nix build processes (over 300), and our dev container crashes with OOM. The problem is that the number of processes is now limited only by host resources (there are 256 cores on the host), whereas we used --loading_phase_threads to limit the number of spawned processes to match the resources allocated to a particular dev container. Now it seems that the value of --loading_phase_threads is ignored for some reason.

I can confirm that I can't reproduce the issue with Bazel 7.0.2.

Is there an estimate for when it will be resolved?

Update:

I tried running a bazel build with --experimental_worker_for_repo_fetching=off and that immediately fixed the issue! Memory consumption during repository fetching was minimal.

This fixed issue for me as well. Thanks @mpereira !
Now the maximum number of processes is 2× the value of --loading_phase_threads.

@fmeum
Collaborator

fmeum commented Jun 5, 2024

I tried to reproduce this for a while, both on synthetic and real-world projects, but I haven't observed any meaningful difference between 7.0.2 and 7.1.2. --loading_phase_threads is being honored in my experiments and I can have > 200,000 repos in a dependency chain on my laptop.

The flip of --experimental_worker_for_repo_fetching in 7.1.2 may result in more repo rules doing actual work in parallel than before simply because less time is spent in restarts, but the number of concurrent repository_ctx.execute calls should still be limited by --loading_phase_threads, even with Skyframe enabled.
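
For context on where the blocked take() in the stack trace comes from: with the worker enabled, a repo fetch is handed off to a dedicated worker thread and the calling Skyframe thread blocks on a SynchronousQueue for the result, which is what lets the fetch keep its state across Skyframe restarts. A rough, hypothetical sketch of that handoff pattern (not Bazel's actual implementation; all names are made up):

```java
import java.util.concurrent.SynchronousQueue;

/**
 * Hypothetical sketch of a worker-based repo fetch handoff, loosely modeled on the
 * stack trace above (StarlarkRepositoryFunction.fetch blocking in SynchronousQueue.take).
 * Everything here is illustrative, not Bazel's actual code.
 */
class RepoFetchHandoffSketch {
  static String fetchViaWorker(String repoName) throws InterruptedException {
    SynchronousQueue<String> handoff = new SynchronousQueue<>();
    Thread.ofVirtual().name("repo-fetch-" + repoName).start(() -> {
      try {
        // The worker does the actual fetch; because it is a separate thread, it can
        // keep partial progress alive instead of redoing work after a Skyframe restart.
        String fetched = "fetched contents of " + repoName;
        handoff.put(fetched);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    });
    // The calling (Skyframe) thread blocks here. When it is a ForkJoinPool worker,
    // this take() is exactly the managedBlock/compensation path seen in the trace.
    return handoff.take();
  }

  public static void main(String[] args) throws InterruptedException {
    System.out.println(fetchViaWorker("remote_java_tools"));
  }
}
```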

If you reported an issue in this thread, could you try running with 7.2.0rc2 and an explicit --loading_phase_threads value and then share a Starlark profile (can be emitted into the workspace directory with --profile)? A standalone reproducer would be ideal, but the profile would already be very helpful.

@fmeum
Collaborator

fmeum commented Jun 7, 2024

I found a clean reproducer on Slack (thanks @hugocarr): https://github.com/hugocarr/cloud_repro/tree/hugo/requirements_oom

I get an OOM with --loading_phase_threads=4 that I don't get with --loading_phase_threads=4 --experimental_worker_for_repo_fetching=off.

The profiles show that with off, there are never more than 4 concurrent executes; with auto, there are hundreds.
profile.fails.json
profile.works.json

@meteorcloudy @Wyverald

@fmeum
Collaborator

fmeum commented Jun 7, 2024

Good news: I tested the repro with 7.2.0rc3 and it seems that the issue is fixed there. I don't know why though.
profile.rc3.json

@meteorcloudy
Member

Maybe somehow fixed by #22573?

@ivan-golub Can you please also verify this issue no longer exists with 7.2.0rc3?

@hugocarr

hugocarr commented Jun 7, 2024

Just to reiterate: I was chatting with @fmeum in the Bazel Slack about an OOM we were experiencing during the fetch stage when running bazel build @pypi//... for a project with many 3rd-party Python dependencies.

It appears that upgrading from 7.1.2 to 7.2.0rc3 solves this issue for us. Memory stays stable. Not sure if it's directly causal but wanted to add a 👍 for #thissolvedmyproblem

@jfirebaugh

This is fixed in 7.2.0 for us too.

@Wyverald
Member

Thanks for the reports! I'll go ahead and close this for now. If new reports surface, we can revisit.
