
Bazel deadlocks hosts with large numbers of cores #11868

Closed
stellaraccident opened this issue Jul 29, 2020 · 19 comments
Labels: team-Performance

@stellaraccident

Description of the problem / feature request:

Bazel appears to be extremely sensitive to the number of cores available on the local machine: at high core counts (>=64), Bazel will predictably deadlock, often bringing the machine down with it, if there is any I/O latency or contention. We have experienced this consistently on multiple Linux hosts, both physical and virtual, although it happens at a much higher frequency on virtual machines, even when they are provisioned to be fairly isolated and running on SSDs. On a GCE n1-highcpu-64 VM (96 cores, 57 GB RAM, 100 GB SSD), we can trigger this deadlock on roughly 80% of builds of our project (which includes parts of TensorFlow and LLVM as deps and has on the order of ~7000 actions). In such a configuration, the deadlock usually occurs after ~2000-3000 actions and always with the same pattern: Bazel reports "128 jobs running", but watching top shows very low utilization (say 16-30 processes), high CPU usage from the Bazel Java process (200-400%), and a tendency for jobs to "get lost", eventually with no active jobs running (from the perspective of top).

On internal developer specialist workstations (with normal disks, not SSD), the occurrence is 100% and happens much sooner in a build.

I have found two workarounds that mitigate the problem:

  1. Set the --output_base to a directory on tmpfs (under /dev/shm), making sure that it is sized appropriately.
  2. Set --spawn_strategy=standalone

For the first, this seems to make things relatively reliable. Looking at top, bazel gets close to saturating all cores (90-95% consistent utilization). For the second, it seems to help but deadlocks still occur at a lower rate. When not triggering a deadlock, the utilization seems relatively high. I have a low sample count for the second option.
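
For reference, the rough shape of those invocations (paths here are illustrative, not our exact setup):

# Workaround 1: put the output base on tmpfs (make sure /dev/shm is sized appropriately)
bazel --output_base=/dev/shm/bazel-out build //...

# Workaround 2: disable sandboxing
bazel build --spawn_strategy=standalone //...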

Note also that we maintain roughly parallel CMake/Ninja builds for a substantial fraction of the code, have never had any such issues with it, and in general, it is much more reliable at utilizing all available cores for cc compilation jobs than Bazel is. This is a fairly apples to apples comparison running on the same systems.

I have no real knowledge of Bazel internals, but all of the evidence I have seen suggests that at high core counts, Bazel is extremely sensitive to I/O latency, which appears to exacerbate some kind of locking issue; this can snowball into something really bad that leaves machines unresponsive with no obvious resource contention. On occasion I have seen such machines eventually wake back up after an hour or so if an external agent kills processes.

Feature requests: what underlying problem are you trying to solve with this feature?

Bazel should operate reliably regardless of the machine size.

Bugs: what's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.

Comment out this line and run our build pipeline. Based on internal, ad-hoc testing, I suspect that this can be easily triggered on affected machines by building TensorFlow, or another such project with ~thousands of actions.

Alternatively, building our project on such a machine repros easily: https://google.github.io/iree/get-started/getting-started-linux-bazel

What operating system are you running Bazel on?

Various. Most commonly Debian 10.

What's the output of bazel info release?

Various - we've experienced this over many versions over ~months.

Here is one:
release 3.3.1

If bazel info release returns "development version" or "(@non-git)", tell us how you built Bazel.

N/A - although I have also experienced this with custom-built Bazel versions on exotic configs.

What's the output of git remote get-url origin ; git rev-parse master ; git rev-parse HEAD ?

https://github.com/google/iree.git
c96bbb1d38d3fe81230e38ce3214d80b922ba4c3
c96bbb1d38d3fe81230e38ce3214d80b922ba4c3

Have you found anything relevant by searching the web?

No.

Any other information, logs, or outputs that you want to share?

I can follow up with any artifacts you think might be valuable. I have not found anything worthwhile so far, and when it gets into a really bad state, I'm often on a remote ssh connection and the machine locks up to the point that it is hard to do much.

@jin added the team-Performance and untriaged labels Jul 30, 2020
@meisterT
Member

I tried to repro with iree on a 72 core machine and ran into build errors like these:

In file included from external/llvm-project/llvm/lib/MC/MCParser/WasmAsmParser.cpp:19:
In file included from external/llvm-project/llvm/include/llvm/BinaryFormat/Wasm.h:17:
In file included from external/llvm-project/llvm/include/llvm/ADT/ArrayRef.h:14:
In file included from external/llvm-project/llvm/include/llvm/ADT/SmallVector.h:20:
external/llvm-project/llvm/include/llvm/Support/MathExtras.h:396:23: error: no template named 'enable_if_t' in namespace 'std'
constexpr inline std::enable_if_t<(N < 64), bool> isUInt(uint64_t X) {
                 ~~~~~^
external/llvm-project/llvm/include/llvm/Support/MathExtras.h:401:23: error: no template named 'enable_if_t' in namespace 'std'
constexpr inline std::enable_if_t<N >= 64, bool> isUInt(uint64_t X) {
                 ~~~~~^
external/llvm-project/llvm/include/llvm/Support/MathExtras.h:406:35: error: no function template matches function template specialization 'isUInt'
template <> constexpr inline bool isUInt<8>(uint64_t x) {
                                  ^
fatal error: too many errors emitted, stopping now [-ferror-limit=]
20 errors generated.

Do you know what I should do to get rid of them?

@meisterT
Member

Couldn't repro with tensorflow either.

@meisterT
Member

Ok, I have tried iree now on a GCP instance n1-highcpu-64 and the build succeeded without bringing the machine to a halt.

@stellaraccident
Author

Thanks for giving it a try. I'm not able to add non-collaborators to this issue directly, so I will direct a couple of my co-workers here who were having the same issue and may have a more reproducible case. Fwiw, we have been able to largely ignore this for many months because the issue was sporadic; recently, however, with larger VMs and different I/O setups, it has started happening at a much higher frequency. It isn't surprising that a couple of test runs in a different environment did not surface the issue, and we can try to get you a better isolated case.

Regarding the build errors on the IREE project, what was your command line? And can you confirm that you pulled submodules and ran the configure script in the root?

@larsrc-google
Contributor

When it deadlocks, are you sometimes able to Ctrl-C and break out of it? If so, could you attach the profile file (see https://docs.bazel.build/versions/master/skylark/performance.html#performance-profiling). If not, the command.log file might still have interesting clues.
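
For example, adding something like --profile=/tmp/prof.json to the build command should produce a JSON trace profile that can be loaded into chrome://tracing for inspection (exact flags depend on your Bazel version; see the doc linked above).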

@meisterT
Member

meisterT commented Aug 3, 2020

A stack trace (triggered by Ctrl+\) can also help to understand what's going on.

Regarding the build errors on the IREE project, what was your command line? And can you confirm that you pulled submodules and ran the configure script in the root?

Feel free to ignore. This didn't happen on the GCP instance, so I must have done something wrong on the physical machine.

@phoenix-meadowlark
Contributor

phoenix-meadowlark commented Aug 5, 2020

When it deadlocks, are you sometimes able to Ctrl-C and break out of it? If so, could you attach the profile file (see https://docs.bazel.build/versions/master/skylark/performance.html#performance-profiling). If not, the command.log file might still have interesting clues.

In my experience I can usually break out of it. It takes quite a while unless you interrupt three times, and it can still take quite a while even then. Anecdotally, Bazel becomes more difficult to interrupt the more times you stop and restart these jobs in quick succession (e.g. while debugging).

I ran bazel build iree/... --profile=/tmp/bazel_profile; the first 490 actions completed in a few seconds, and then everything froze. Three actions completed in the next two minutes, and then I interrupted once. It only took a few minutes to shut down. The profile can be found in this gist. The specs of the machine are 96 Intel Xeon cores and 180 GB of RAM.

To get around this issue,* I've been building on this machine by specifying this in my .bazelrc:

build --disk_cache=/tmp/iree/bazel-cache/  # Disk is fine here.
startup --output_base=/dev/shm/iree/bazel-out

I've also tried using --spawn_strategy=standalone and --spawn_strategy=local, but both run into build errors related to C++ missing includes, and I haven't debugged that further.

*Edited for clarity.

@phoenix-meadowlark
Contributor

The command.log wasn't particularly helpful. Most of the information was overwritten:

INFO: Analyzed 901 targets (215 packages loaded, 9180 targets configured).
INFO: Found 901 targets...
ERROR: build interrupted
INFO: Elapsed time: 181.408s, Critical Path: 162.58s
INFO: 220 processes: 220 linux-sandbox.
FAILED: Build did NOT complete successfully

Related to that, is there a way to make bazel output the name and build time for each action that it runs to stdout? Somewhat similar to how cmake lists each target it builds?

@larsrc-google
Contributor

The first interesting thing I see is that starting at about 20s into the build, there are some very long sandbox.createFileSystem parts. Can you give some details on the disk and filesystem this is on? Is it close to full? Is it low on inodes? Is it a remote filesystem?

My first intuition is that the Bazel server leaks open files. This would match that it gets harder to interrupt after restarting the jobs. Could you check if the server process has many fds? You can use jps to get the PID of the server, then ls /proc/<pid>/fd to see how many fds it has open.
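
For example, a quick (illustrative) way to count them, assuming the server jar shows up in jps output as A-server.jar:

ls /proc/$(jps | awk '/A-server.jar/ {print $1}')/fd | wc -l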

@phoenix-meadowlark
Contributor

phoenix-meadowlark commented Aug 5, 2020

The disk is mostly empty. The machine is an internal developer instance.

$ sudo fdisk -l
Disk /dev/sda: 300 GiB, 322122547200 bytes, 629145600 sectors
Disk model: PersistentDisk
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disklabel type: gpt
Disk identifier: 243D4D3D-7B78-4C58-83A1-7A45DDAFA374

Device       Start       End   Sectors   Size Type
/dev/sda1     2048   3908297   3906250   1.9G EFI System
/dev/sda2  3908298   7814547   3906250   1.9G Linux filesystem
/dev/sda3  7814548 629145566 621331019 296.3G Linux LVM

Partition 2 does not start on physical sector boundary.
Partition 3 does not start on physical sector boundary.


Disk /dev/mapper/******_20200320-root: 294.42 GiB, 316116303872 bytes, 617414656 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Alignment offset: 2048 bytes


Disk /dev/mapper/******_20200320-swap: 1.88 GiB, 2000683008 bytes, 3907584 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Alignment offset: 2048 bytes
$ jps
2547974 A-server.jar
2564566 Jps
$ ls /proc/2547974/fd/ # tab
zsh: do you wish to see all 1034 possibilities (58 lines)?

The number of items in .../fd/ varies between 700 and 1100.

@meisterT
Member

meisterT commented Aug 6, 2020

I tried with the .bazelrc from #11868 (comment) but couldn't repro either.

Some data on the sandbox.createFileSystem task for my run:

# NumSamples = 7814; Min = 65.00; Max = 351754.00
# Mean = 22501.716406; Variance = 353940719.598612; SD = 18813.312297; Median 27878.000000
# each ∎ represents a count of 95
   65.0000 - 35233.9000 [  7160]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
35233.9000 - 70402.8000 [   615]: ∎∎∎∎∎∎
70402.8000 - 105571.7000 [    11]: 
105571.7000 - 140740.6000 [     3]: 
140740.6000 - 175909.5000 [     6]: 
175909.5000 - 211078.4000 [     6]: 
211078.4000 - 246247.3000 [     5]: 
246247.3000 - 281416.2000 [     3]: 
281416.2000 - 316585.1000 [     4]: 
316585.1000 - 351754.0000 [     1]: 

@phoenix-meadowlark
Contributor

Sorry, that .bazelrc is what I've been using to get around the problem. Removing startup --output_base=/dev/shm/iree/bazel-out causes bazel to deadlock.

@philwo
Member

philwo commented Aug 7, 2020

I wrote a little benchmark tool that creates 2500 * 64 = 160000 symlinks and deletes them again using the same syscalls that Bazel uses: https://gist.github.com/philwo/9619f39c7d92ca75805bad810c4604b4

Expected behavior:
No disk I/O observable, because the symlinks don't live long enough that the kernel would write them to disk.

Actual behavior:
Case 1: symlinks to short target names (max. 59 characters)
We observe a short burst of ~150 IOPS / ~80 MB of writes after the benchmark finishes, which is probably a transaction in ext4's journal?

Case 2: symlinks to long target names (>= 60 characters)
When creating symlinks to long target names, ext4 stores them as regular files instead of as "fast symlinks" inside the inode.

In this case, iostat -k 1 shows that the benchmark tool saturates the disk and constantly writes ~180 MB/s while it is running. This is what happens during the Bazel build, too.
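
For illustration (this is not the gist above, just a rough shell approximation of the same pattern; ln -s and rm boil down to essentially the same symlink/unlinkat syscalls, though the extra process spawning adds overhead the real benchmark does not have):

cd $(mktemp -d)
target=$(printf 'x%.0s' $(seq 1 80))   # >= 60 chars, so ext4 cannot use a fast symlink
for round in $(seq 1 2500); do
  for i in $(seq 1 64); do ln -s "$target" "link_$i"; done
  rm -f link_*
done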

Open question: Is it possible to tune the kernel or ext4 behavior so that these symlinks don't hit the disk?

@meisterT
Member

meisterT commented Aug 7, 2020

We could reproduce this now. @philwo and I had a closer look at what happens for the iree project.

The first thing to note is that this is not a deadlock (at least we didn't see one); it is just super slow. Eventually it finishes.

Most of the actions in this project have ~2850 inputs (mostly via external/llvm-project/llvm). Given that a symlink takes up 4 kB on disk, this is reported as ~11 MB. Each of those symlinks needs to be created and deleted again.

Note that by default Bazel uses the number of logical cores as the number of jobs, so in our case 64.

According to https://cloud.google.com/compute/docs/disks/performance, for a disk of 128 GB the sustained write performance is ~60 MB/s. While Bazel is running, we could confirm with the help of iotop that this is maxed out.

With strace we double-checked that Bazel is not doing anything weird (like syncs), just the expected symlink and unlinkat calls.
What is still unclear to us is why this hits the disk at all, since everything is deleted quickly and there is plenty of RAM available, so we would expect the disk cache to absorb it.
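
A rough back-of-the-envelope illustration (assuming the symlink metadata really is journalled to disk, as observed): ~2850 inputs × 4 kB ≈ 11 MB of symlink metadata per action, and across the ~7000 actions of this build that is on the order of 75-80 GB of writes; at ~60 MB/s sustained, that alone corresponds to more than 20 minutes of disk time, before counting the deletions or any real outputs.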

What you can do to work around:

  • Move your sandbox into RAM: --sandbox_base=/dev/shm. This will work if you don't have any action that produces large outputs which would otherwise consume all your RAM. (This is our recommendation.)
  • Run your build with --spawn_strategy=standalone, which disables sandboxing.
  • Increase the disk size in the VM to scale I/O performance with the number of cores/Bazel jobs.
  • Decrease the size of your toolchain, though I don't know whether that's feasible or how much work it would be.
  • Try https://github.com/bazelbuild/sandboxfs (cc @jmmv for the current state of the project).

@philwo
Member

philwo commented Aug 7, 2020

Interesting fact: The only two filesystems that produced absolutely no I/O during the benchmark were btrfs and ext4 without a journal (mkfs.ext4 -O ^has_journal).

I couldn't find any mount flags to convince journalled ext4 to reduce the I/O when dealing with slow symlinks, so I think we hit a dead end here.

As Tobi already wrote, my recommendation is also to go with --sandbox_base=/dev/shm so that the symlinks are created on a tmpfs, which circumvents this issue.
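
For example, a minimal .bazelrc sketch of that recommendation (adjust to your machine):

# Keep sandbox scratch directories (and their symlink forests) on tmpfs:
build --sandbox_base=/dev/shm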

@philwo
Member

philwo commented Aug 7, 2020

Another interesting finding: when creating regular files instead of symlinks, the assumption that they do not hit the disk (because they are deleted quickly enough) holds.

@meisterT
Member

Closing as we have diagnosed the problem and given workarounds.

@stellaraccident
Author

Thanks for the diagnosis and workarounds - will let you know if any of those run into issues (I distinctly recall having tried each of those in a different configuration and still hitting the same point; however, there was hardly any experimental control when I was just trying things in the dark).

It is unlikely that we are going to impose yet another layer of infrastructure on our users (i.e. sandboxfs) to help Bazel achieve a level of performance that other build systems already achieve out of the box. Given that Bazel is the outlier here, I do question whether this bug should be closed: we chased our tail on this for quite some time and have seen anecdotal evidence online that others have hit it without a solution (just practicing avoidance, like we did until it started repro'ing regularly). If it can't actually be "fixed" out of the box, some kind of warning/breadcrumb when a build graph has such a high degree of fanout seems like it might be appropriate. That fanout is a known source of extreme overhead; even if it doesn't severely slow down a specific build, maxing out disk subsystems creating symlinks is hardly a "nice" thing to be doing to people.

@philwo
Member

philwo commented Aug 10, 2020

It is unlikely that we are going to impose yet another layer of infrastructure on our users (i.e. sandboxfs)

I also wouldn't recommend doing that. We just gave it as an example of what's possible with Bazel; you'll have to figure out whether it's a good option for your use case. For example, it might not be useful to recommend to your users, but it might be interesting for speeding up your CI system, if you want that and it helps.

That said, I think --sandbox_base=/dev/shm will solve your issues on Linux and thus you won't need sandboxfs in any case. This option cannot be enabled by default, because it would break builds on machines with not enough RAM to hold all intermediate output files - whether that's because the machine just doesn't have much RAM or because the build has steps like "And now, let's concatenate ALL the results into one huge 120 GB output file". But for your build it seems fine.

We are always looking into making sandboxing faster, but please realize that I'm 99% sure that we have examined all possible ways of doing filesystem sandboxing on Linux and macOS, from old-school to modern, from basic to exotic kernel features. So far we have not found a better way that works for the average user's Linux box than symlinking (and symlinks aren't great either).

We could come up with something faster if, say, running Bazel as root were fair game. Or loading a Bazel-specific kernel module. Or running a helper daemon as root. Or only supporting builds on XFS and btrfs. But that's not what we want: we want Bazel to work out of the box for all users, so we have to use what the average Linux distro offers us, and that's really not much, unfortunately.

to help Bazel achieve a level of performance that the other build systems already do out of the box

CMake / Ninja do not offer sandboxing, so this isn't a fair comparison at all. If you run your build with --strategy=local to disable sandboxing, Bazel should perform roughly on par with Ninja. We tried that, but your build failed without sandboxing, so we couldn't benchmark it for comparison.
