
java.lang.OutOfMemoryError when upgrading from Bazel 7.0.2 to 7.1.1 #21803

Closed
ivan-golub opened this issue Mar 25, 2024 · 18 comments
Labels
P1 I'll work on this now. (Assignee required) team-ExternalDeps External dependency handling, remote repositories, WORKSPACE file. type: bug

Comments

@ivan-golub
Contributor

Description of the bug:

Upgrading from Bazel 7.0.2 to 7.1.1 resulted in a consistent OOM exception being thrown during bazel query:

04:45:00 [Bazel] Loading: 15 packages loaded
04:45:02 [Bazel] Loading: 268 packages loaded
04:45:02 [Bazel]     currently loading: bzl ... (1719 packages)
04:45:04 [Bazel] Loading: 1990 packages loaded
04:45:04 [Bazel]     currently loading: @@bazel_tools//tools/jdk ... (12 packages)
04:45:05 [Bazel] Loading: 2067 packages loaded
04:45:05 [Bazel]     currently loading: @@bazel_tools//tools/jdk ... (2 packages)
04:45:06 [Bazel] Loading: 2078 packages loaded
04:45:06 [Bazel]     currently loading: @@remote_java_tools//java_tools/zlib
04:45:07 [Bazel] Loading: 2079 packages loaded
04:45:07 [Bazel]     currently loading: @@remote_java_tools//java_tools/zlib
04:45:08 [Bazel] Loading: 2171 packages loaded
04:45:09 [Bazel] Loading: 2364 packages loaded
04:45:10 [Bazel] Loading: 2543 packages loaded
04:45:11 [Bazel] Loading: 2650 packages loaded
04:45:13 [Bazel] Loading: 2889 packages loaded
04:45:14 [Bazel] Loading: 2889 packages loaded
04:45:15 [Bazel] Loading: 2931 packages loaded
04:45:16 [Bazel] Loading: 3036 packages loaded
04:45:17 [Bazel] Loading: 3457 packages loaded
04:45:18 [Bazel] Loading: 3862 packages loaded
04:45:19 [Bazel] Loading: 4655 packages loaded
04:45:20 [Bazel] Loading: 5464 packages loaded
04:45:21 [Bazel] Loading: 6297 packages loaded
04:45:22 [Bazel] Loading: 7193 packages loaded
04:45:23 [Bazel] Loading: 7996 packages loaded
04:45:23 [Bazel] FATAL: bazel ran out of memory and crashed. Printing stack trace:
04:45:23 [Bazel] java.lang.OutOfMemoryError: unable to create native thread: possibly out of memory or process/resource limits reached
java.lang.OutOfMemoryError: unable to create native thread: possibly out of memory or process/resource limits reached
	at java.base/java.lang.Thread.start0(Native Method)
	at java.base/java.lang.Thread.start(Unknown Source)
	at java.base/java.lang.System$2.start(Unknown Source)
	at java.base/jdk.internal.vm.SharedThreadContainer.start(Unknown Source)
	at java.base/java.util.concurrent.ForkJoinPool.createWorker(Unknown Source)
	at java.base/java.util.concurrent.ForkJoinPool.tryCompensate(Unknown Source)
	at java.base/java.util.concurrent.ForkJoinPool.compensatedBlock(Unknown Source)
	at java.base/java.util.concurrent.ForkJoinPool.managedBlock(Unknown Source)
	at java.base/java.util.concurrent.SynchronousQueue$TransferStack.transfer(Unknown Source)
	at java.base/java.util.concurrent.SynchronousQueue.take(Unknown Source)
	at com.google.devtools.build.lib.bazel.repository.starlark.StarlarkRepositoryFunction.fetch(StarlarkRepositoryFunction.java:170)
	at com.google.devtools.build.lib.rules.repository.RepositoryDelegatorFunction.fetchRepository(RepositoryDelegatorFunction.java:418)
	at com.google.devtools.build.lib.rules.repository.RepositoryDelegatorFunction.compute(RepositoryDelegatorFunction.java:205)
	at com.google.devtools.build.skyframe.AbstractParallelEvaluator$Evaluate.run(AbstractParallelEvaluator.java:461)
	at com.google.devtools.build.lib.concurrent.AbstractQueueVisitor$WrappedRunnable.run(AbstractQueueVisitor.java:414)
	at java.base/java.util.concurrent.ForkJoinTask$AdaptedRunnableAction.exec(Unknown Source)
	at java.base/java.util.concurrent.ForkJoinTask.doExec(Unknown Source)
	at java.base/java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(Unknown Source)
	at java.base/java.util.concurrent.ForkJoinPool.scan(Unknown Source)
	at java.base/java.util.concurrent.ForkJoinPool.runWorker(Unknown Source)
	at java.base/java.util.concurrent.ForkJoinWorkerThread.run(Unknown Source)

Which category does this issue belong to?

Core

What's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.

No response

Which operating system are you running Bazel on?

linux

What is the output of bazel info release?

No response

If bazel info release returns development version or (@non-git), tell us how you built Bazel.

No response

What's the output of git remote get-url origin; git rev-parse HEAD ?

No response

Is this a regression? If yes, please try to identify the Bazel commit where the bug was introduced.

No response

Have you found anything relevant by searching the web?

A suggestion from the public Bazel Slack to disable experimental_worker_for_repo_fetching with --experimental_worker_for_repo_fetching=off resolved the issue.

Any other information, logs, or outputs that you want to share?

No response

@github-actions github-actions bot added the team-Core Skyframe, bazel query, BEP, options parsing, bazelrc label Mar 25, 2024
@Wyverald Wyverald added P1 I'll work on this now. (Assignee required) team-ExternalDeps External dependency handling, remote repositories, WORKSPACE file. and removed untriaged team-Core Skyframe, bazel query, BEP, options parsing, bazelrc labels Mar 25, 2024
@fmeum
Collaborator

fmeum commented Mar 25, 2024

@ivan-golub How many CPUs do you have on your machine? And how many repos are there that could potentially be fetched in parallel?

My suspicion: Loom doesn't use non-blocking file I/O yet and instead creates additional native threads when a virtual thread is blocked on file operations. If too many repos are blocked on them in parallel, this could run into the same thread limits as with a native thread pool. Hopefully we don't reach the OS limit and just need to tweak some native memory settings.
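
The stack trace above bottoms out in ForkJoinPool.tryCompensate → createWorker → Thread.start, i.e. the crash happens while the pool is spawning an extra platform thread to "compensate" for a worker that blocked. Below is a minimal, self-contained Java sketch of that mechanism (illustrative only, not Bazel code; the class and variable names are made up):

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ForkJoinPool;

/**
 * Hypothetical demo of ForkJoinPool "compensation": when a worker thread blocks
 * through ForkJoinPool.managedBlock (which SynchronousQueue.take() does in the
 * stack trace above), the pool may start an extra platform thread to keep its
 * target parallelism. With enough concurrently blocked workers, native thread
 * creation itself can fail with the reported OutOfMemoryError.
 */
public class CompensationDemo {
  public static void main(String[] args) throws Exception {
    ForkJoinPool pool = new ForkJoinPool(4);       // small pool, think "capped evaluator threads"
    CountDownLatch never = new CountDownLatch(1);  // nothing ever counts this down

    for (int i = 0; i < 64; i++) {
      pool.submit(() -> {
        try {
          // Blocking via managedBlock tells the pool this worker is stuck,
          // so it is allowed to spawn a compensation thread in its place.
          ForkJoinPool.managedBlock(new ForkJoinPool.ManagedBlocker() {
            @Override public boolean block() throws InterruptedException {
              never.await();
              return true;
            }
            @Override public boolean isReleasable() {
              return false;
            }
          });
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
        }
      });
    }

    Thread.sleep(2_000);
    // Prints far more than 4: roughly one live worker per blocked task.
    System.out.println("live pool threads: " + pool.getPoolSize());
    pool.shutdownNow();
  }
}
```

In the trace above, the blocked callers appear to be Skyframe evaluator threads waiting on repo fetches, so the number of compensation threads would scale with how many repos are being fetched at once.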

@Wyverald
Member

Thanks for the report. Some questions:

  • Do you have Bzlmod enabled? (check your .bazelrc file for --noenable_bzlmod)
  • Does the OOM happen with any query? Or what are the queries that cause this?
  • Do you have a sense of how many external repos you have defined? (you could run something like bazel query //external:all-targets | wc -l)

@ivan-golub
Contributor Author

How many CPUs do you have on your machine?

32 cores

How many repos are there that could potentially be fetched in parallel?
bazel query //external:all-targets | wc -l

7131

Do you have Bzlmod enabled? (check your .bazelrc file for --noenable_bzlmod)

Bzlmod disabled

Does the OOM happen with any query?

Definitely happens with the wildcard query //... that we use for diff-aware builds.

@meteorcloudy
Member

@bazel-io fork 7.2.0

@jfirebaugh

We (Figma) experienced this too.

  • Bzlmod is not enabled (--noenable_bzlmod)
  • OOM happens for us on any bazel action that involves fetching a significant number of repositories (mostly bazel build rather than bazel query)
  • bazel query //external:all-targets | wc -l: 20720

@fmeum
Collaborator

fmeum commented Mar 26, 2024

@ivan-golub @jfirebaugh Could you share at least a rough breakdown of which rulesets/repo rules contribute to this number of external repos?

@ivan-golub
Contributor Author

ivan-golub commented Mar 26, 2024

Could you share at least a rough breakdown of which rulesets/repo rules contribute to this number of external repos?

It's an Android repo in our case, so androidx, android_tools, dagger, kotlin, jdk, Robolectric, some JetBrains libs, internal protobuf repos, and 1st/3rd-party libs.

@keertk
Member

keertk commented Apr 29, 2024

Is this still on track for 7.2? We're aiming to create the first RC on 5/13.

@Wyverald
Member

Wyverald commented May 8, 2024

This one is hard to pin down. We'd still like to fix it if we can get a hold of it, but it's possible that the fix will only land in time for 7.2.1. Marked #21815 as a soft blocker.

@mpereira

mpereira commented Jun 3, 2024

I've also been experiencing this for a while now. In my case, changing more than a couple of requirement versions in requirements.in and then running a bazel build for a Python zip would spawn many, many python.pip_install.tools.wheel_installer.wheel_installer processes, eventually using 100% of the machine's memory and CPU.

Analyzing: target <redacted>_bin (125 packages loaded, 3295 targets configured)
[1 / 1] checking cached actions
    Fetching repository @@rules_python~~pip~pip_312_starlette; starting 22s
    Fetching repository @@rules_python~~pip~pip_312_requests; starting 22s
    Fetching repository @@rules_python~~pip~pip_312_tenacity; starting 22s
    Fetching repository @@rules_python~~pip~pip_312_pydantic_yaml; starting 22s
    Fetching repository @@rules_python~~pip~pip_312_nest_asyncio; starting 22s
    Fetching repository @@rules_python~~pip~pip_312_playwright_stealth; starting 22s
    Fetching repository @@rules_python~~pip~pip_312_typer; starting 22s
    Fetching repository @@rules_python~~pip~pip_312_uuid_utils; starting 22s ... (30 fetches)

Server terminated abruptly (error code: 14, error message: 'Connection reset by peer', log file: '/root/.cache/bazel/_bazel_root/b8ccf54d4a62f705275a4051f267d262/server/jvm.out')

I tried many different flags like --jobs=1 --local_resources=cpu=1.0 --local_resources=memory=256, but those have no effect on the repository-fetching code.

I was trying to solve this issue again today, saw the experimental_worker_for_repo_fetching flag in the Bazel changelogs, and found the issue mentioning that it recently started defaulting to auto: #21082

I tried running a bazel build with --experimental_worker_for_repo_fetching=off and that immediately fixed the issue! Memory consumption during repository fetching was minimal.

I hope other folks running into bazel build OOMs due to --experimental_worker_for_repo_fetching=auto also find this issue.

@GorshkovNikita

GorshkovNikita commented Jun 4, 2024

We also experienced this issue after updating Bazel from 6.4.0 to 7.1.2. We use rules_nixpkgs for external repositories. When I build a target with a lot of external dependencies, I see that Bazel generates too many nix build processes (over 300), and our dev container crashes with OOM. The problem is that the number of processes is now limited only by host resources (there are 256 cores on the host), whereas we used --loading_phase_threads to limit the number of spawned processes to match the resources allocated to a particular dev container. Now it seems that the value of --loading_phase_threads is ignored for some reason.

I can confirm that I can't reproduce the issue with Bazel 7.0.2.

Is there an estimate for when it will be resolved?

Update:

I tried running a bazel build with --experimental_worker_for_repo_fetching=off and that immediately fixed the issue! Memory consumption during repository fetching was minimal.

This fixed issue for me as well. Thanks @mpereira !
Now the maximum number of processes is 2× the value of --loading_phase_threads.

@fmeum
Collaborator

fmeum commented Jun 5, 2024

I tried to reproduce this for a while, both on synthetic and real-world projects, but I haven't observed any meaningful difference between 7.0.2 and 7.1.2. --loading_phase_threads is being honored in my experiments and I can have > 200,000 repos in a dependency chain on my laptop.

The flip of --experimental_worker_for_repo_fetching in 7.1.2 may result in more repo rules doing actual work in parallel than before simply because less time is spent in restarts, but the number of concurrent repository_ctx.execute calls should still be limited by --loading_phase_threads, even with Skyframe enabled.
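
For context on where the blocked take() in the stack trace comes from: with the worker enabled, a repo fetch is handed off to a dedicated worker thread and the calling Skyframe thread blocks on a SynchronousQueue for the result, which is what lets the fetch keep its state across Skyframe restarts. A rough, hypothetical sketch of that handoff pattern (not Bazel's actual implementation; all names are made up):

```java
import java.util.concurrent.SynchronousQueue;

/**
 * Hypothetical sketch of a worker-based repo fetch handoff, loosely modeled on the
 * stack trace above (StarlarkRepositoryFunction.fetch blocking in SynchronousQueue.take).
 * Everything here is illustrative, not Bazel's actual code.
 */
class RepoFetchHandoffSketch {
  static String fetchViaWorker(String repoName) throws InterruptedException {
    SynchronousQueue<String> handoff = new SynchronousQueue<>();
    Thread.ofVirtual().name("repo-fetch-" + repoName).start(() -> {
      try {
        // The worker does the actual fetch; because it is a separate thread, it can
        // keep partial progress alive instead of redoing work after a Skyframe restart.
        String fetched = "fetched contents of " + repoName;
        handoff.put(fetched);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    });
    // The calling (Skyframe) thread blocks here. When it is a ForkJoinPool worker,
    // this take() is exactly the managedBlock/compensation path seen in the trace.
    return handoff.take();
  }

  public static void main(String[] args) throws InterruptedException {
    System.out.println(fetchViaWorker("remote_java_tools"));
  }
}
```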

If you reported an issue in this thread, could you try running with 7.2.0rc2 and an explicit --loading_phase_threads value and then share a Starlark profile (can be emitted into the workspace directory with --profile)? A standalone reproducer would be ideal, but the profile would already be very helpful.

@fmeum
Collaborator

fmeum commented Jun 7, 2024

I found a clean reproducer on Slack (thanks @hugocarr): https://github.com/hugocarr/cloud_repro/tree/hugo/requirements_oom

I get an OOM with --loading_phase_threads=4 that I don't get with --loading_phase_threads=4 --experimental_worker_for_repo_fetching=off.

The profiles show that with off, there are never more than 4 concurrent executes; with auto, there are hundreds.
profile.fails.json
profile.works.json

@meteorcloudy @Wyverald

@fmeum
Collaborator

fmeum commented Jun 7, 2024

Good news: I tested the repro with 7.2.0rc3 and it seems that the issue is fixed there. I don't know why though.
profile.rc3.json

@meteorcloudy
Member

Maybe somehow fixed by #22573?

@ivan-golub Can you please also verify this issue no longer exists with 7.2.0rc3?

@hugocarr

hugocarr commented Jun 7, 2024

Just to reiterate: I was chatting with @fmeum in the Bazel Slack about an OOM we were experiencing during the fetch stage when running bazel build @pypi//... for a project with many 3rd-party Python dependencies.

It appears that upgrading from 7.1.2 to 7.2.0rc3 solves this issue for us. Memory stays stable. Not sure if it's directly causal but wanted to add a 👍 for #thissolvedmyproblem

@jfirebaugh

This is fixed in 7.2.0 for us too.

@Wyverald
Member

Thanks for the reports! I'll go ahead and close this for now. If new reports surface, we can revisit.
