Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

go_tool_binary: set GOMAXPROCS to 1 #3536

Merged
merged 1 commit into from Apr 19, 2023

Conversation

sluongng
Copy link
Contributor

What type of PR is this?

Uncomment one line below and remove others.

Bug fix

Feature
Documentation
Other

What does this PR do? Why is it needed?
On darwin, in repositories with many transitions configured such as
rules_go.git, the action GoToolchainBinaryBuild could be run multiple
times in parallel by Bazel.

Each of these actions would execute a command like this in a sandbox:

$GO build -o $OUTPUT -trimpath \
  go/tools/builders/ar.go \
  go/tools/builders/asm.go \
  go/tools/builders/builder.go \
  go/tools/builders/cgo2.go \
  go/tools/builders/compilepkg.go \
  go/tools/builders/cover.go \
  go/tools/builders/edit.go \
  go/tools/builders/embedcfg.go \
  go/tools/builders/env.go \
  go/tools/builders/filter.go \
  go/tools/builders/filter_buildid.go \
  go/tools/builders/flags.go \
  go/tools/builders/generate_nogo_main.go \
  go/tools/builders/generate_test_main.go \
  go/tools/builders/importcfg.go \
  go/tools/builders/link.go \
  go/tools/builders/pack.go \
  go/tools/builders/read.go \
  go/tools/builders/replicate.go \
  go/tools/builders/stdlib.go \
  go/tools/builders/stdliblist.go \
  go/tools/builders/path.go

Then, each go build process would execute multiple go tools compile
command up to the number of max Go routines it is set to. The default
max number of Go routines for each go build process is the number of
core on the machine executing the action.

On a local build, this means that we could have up to n x n number of
goroutines running in parallel when combining the asynchronousity of
both Bazel and go build.

After upgrading to Go 1.20.x, we observed that such parallelism could
get a local MacOS machine run into a deadlock on the OS-level, halting
spawn of new processes such as git config, or new Chrome tab browser.
Typical sampling of a go build process stuck in deadlock would result
in the following call graphs:

2430 Thread_44978   DispatchQueue_1: com.apple.main-thread  (serial)
+ 2430 ???  (in <unknown binary>)  [0x1358]
+   2430 runtime.asmcgocall.abi0  (in go) + 124  [0x100b702ac]
+     2430 runtime.syscall6.abi0  (in go) + 56  [0x100b71a98]
+       2430 __wait4_nocancel  (in libsystem_kernel.dylib) + 8  [0x1a17a04f4]

2430 Thread_44999
+ 2430 runtime.asmcgocall.abi0  (in go) + 201  [0x100b702f9]
+   2430 runtime.pthread_cond_timedwait_relative_np_trampoline.abi0  (in go) + 28  [0x100b717ec]
+     2430 _pthread_cond_wait  (in libsystem_pthread.dylib) + 1276  [0x1a17d45a0]
+       2430 __psynch_cvwait  (in libsystem_kernel.dylib) + 8  [0x1a1797710]

2430 Thread_45038
+ 2430 runtime.kevent_trampoline.abi0  (in go) + 40  [0x100b71518]
+   2430 kevent  (in libsystem_kernel.dylib) + 8  [0x1a179a060]

We could not reproduce outside of Bazel (using bash & to parallelize go build). We also could not reproduce this issue with Go 1.19 releases.

Let's apply a rudimentry workaround by limitting the concurrency of go build process by setting it's max Go routines to 1. This should make
this and future race conditions happen a lot less on all platforms.
By doing this, we rely more on Bazel and less on Go runtime to saturate
our CPU resources.

This change should not affect remote execution as the default
ResourceSet for each action is set at 1 CPU and 250MB RAM.

Which issues(s) does this PR fix?

Related to golang/go#59657

Other notes for review

@sluongng sluongng force-pushed the sluongng/go-toolchain-binary-1 branch 2 times, most recently from 7646fe8 to 8de2570 Compare April 17, 2023 12:58
Copy link
Collaborator

@fmeum fmeum left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tyler-french Could you test this at Uber? We should also include it in the patch release.

@fmeum
Copy link
Collaborator

fmeum commented Apr 17, 2023

@sluongng Could you copy these lines over to the definitions of the failing pipeline runs? Bazel CI changed again a few hours ago.

@sluongng
Copy link
Contributor Author

@sluongng Could you copy these lines over to the definitions of the failing pipeline runs? Bazel CI changed again a few hours ago.

I would not mind rebasing on top of origin/master.

I will leave the CI fixing to a separate PR?

On darwin, in repositories with many transitions configured such as
rules_go.git, the action GoToolchainBinaryBuild could be run multiple
times in parallel by Bazel.

Each of these actions would execute a command like this in a sandbox:

  ```
  $GO build -o $OUTPUT -trimpath \
    go/tools/builders/ar.go \
    go/tools/builders/asm.go \
    go/tools/builders/builder.go \
    go/tools/builders/cgo2.go \
    go/tools/builders/compilepkg.go \
    go/tools/builders/cover.go \
    go/tools/builders/edit.go \
    go/tools/builders/embedcfg.go \
    go/tools/builders/env.go \
    go/tools/builders/filter.go \
    go/tools/builders/filter_buildid.go \
    go/tools/builders/flags.go \
    go/tools/builders/generate_nogo_main.go \
    go/tools/builders/generate_test_main.go \
    go/tools/builders/importcfg.go \
    go/tools/builders/link.go \
    go/tools/builders/pack.go \
    go/tools/builders/read.go \
    go/tools/builders/replicate.go \
    go/tools/builders/stdlib.go \
    go/tools/builders/stdliblist.go \
    go/tools/builders/path.go
  ```

Then, each `go build` process would execute multiple `go tools compile`
command up to the number of max Go routines it is set to. The default
max number of Go routines for each `go build` process is the number of
core on the machine executing the action.

On a local build, this means that we could have up to n x n number of
goroutines running in parallel when combining the asynchronousity of
both Bazel and `go build`.

After upgrading to Go 1.20.x, we observed that such parallelism could
get a local MacOS machine run into a deadlock on the OS-level, halting
spawn of new processes such as `git config`, or new Chrome tab browser.
Typical sampling of a `go build` process stuck in deadlock would result
in the following call graphs:

  ```
  2430 Thread_44978   DispatchQueue_1: com.apple.main-thread  (serial)
  + 2430 ???  (in <unknown binary>)  [0x1358]
  +   2430 runtime.asmcgocall.abi0  (in go) + 124  [0x100b702ac]
  +     2430 runtime.syscall6.abi0  (in go) + 56  [0x100b71a98]
  +       2430 __wait4_nocancel  (in libsystem_kernel.dylib) + 8  [0x1a17a04f4]

  2430 Thread_44999
  + 2430 runtime.asmcgocall.abi0  (in go) + 201  [0x100b702f9]
  +   2430 runtime.pthread_cond_timedwait_relative_np_trampoline.abi0  (in go) + 28  [0x100b717ec]
  +     2430 _pthread_cond_wait  (in libsystem_pthread.dylib) + 1276  [0x1a17d45a0]
  +       2430 __psynch_cvwait  (in libsystem_kernel.dylib) + 8  [0x1a1797710]

  2430 Thread_45038
  + 2430 runtime.kevent_trampoline.abi0  (in go) + 40  [0x100b71518]
  +   2430 kevent  (in libsystem_kernel.dylib) + 8  [0x1a179a060]
  ```

We could not reproduce outside of Bazel (using bash & to parallelize `go
build`). We also could not reproduce this issue with Go 1.19 releases.

Let's apply a rudimentry workaround by limitting the concurrency of `go
build` process by setting it's max Go routines to 1. This should make
this and future race conditions happen a lot less on all platforms.
By doing this, we rely more on Bazel and less on Go runtime to saturate
our CPU resources.

This change should not affect remote execution as the default
ResourceSet for each action is set at 1 CPU and 250MB RAM.
@sluongng sluongng force-pushed the sluongng/go-toolchain-binary-1 branch from 8de2570 to d4bce7f Compare April 18, 2023 11:46
@tyler-french
Copy link
Contributor

@tyler-french Could you test this at Uber? We should also include it in the patch release.

Tested at Uber

@fmeum fmeum merged commit 7377180 into bazelbuild:master Apr 19, 2023
2 checks passed
linzhp pushed a commit that referenced this pull request Apr 20, 2023
On darwin, in repositories with many transitions configured such as
rules_go.git, the action GoToolchainBinaryBuild could be run multiple
times in parallel by Bazel.

Each of these actions would execute a command like this in a sandbox:

  ```
  $GO build -o $OUTPUT -trimpath \
    go/tools/builders/ar.go \
    go/tools/builders/asm.go \
    go/tools/builders/builder.go \
    go/tools/builders/cgo2.go \
    go/tools/builders/compilepkg.go \
    go/tools/builders/cover.go \
    go/tools/builders/edit.go \
    go/tools/builders/embedcfg.go \
    go/tools/builders/env.go \
    go/tools/builders/filter.go \
    go/tools/builders/filter_buildid.go \
    go/tools/builders/flags.go \
    go/tools/builders/generate_nogo_main.go \
    go/tools/builders/generate_test_main.go \
    go/tools/builders/importcfg.go \
    go/tools/builders/link.go \
    go/tools/builders/pack.go \
    go/tools/builders/read.go \
    go/tools/builders/replicate.go \
    go/tools/builders/stdlib.go \
    go/tools/builders/stdliblist.go \
    go/tools/builders/path.go
  ```

Then, each `go build` process would execute multiple `go tools compile`
command up to the number of max Go routines it is set to. The default
max number of Go routines for each `go build` process is the number of
core on the machine executing the action.

On a local build, this means that we could have up to n x n number of
goroutines running in parallel when combining the asynchronousity of
both Bazel and `go build`.

After upgrading to Go 1.20.x, we observed that such parallelism could
get a local MacOS machine run into a deadlock on the OS-level, halting
spawn of new processes such as `git config`, or new Chrome tab browser.
Typical sampling of a `go build` process stuck in deadlock would result
in the following call graphs:

  ```
  2430 Thread_44978   DispatchQueue_1: com.apple.main-thread  (serial)
  + 2430 ???  (in <unknown binary>)  [0x1358]
  +   2430 runtime.asmcgocall.abi0  (in go) + 124  [0x100b702ac]
  +     2430 runtime.syscall6.abi0  (in go) + 56  [0x100b71a98]
  +       2430 __wait4_nocancel  (in libsystem_kernel.dylib) + 8  [0x1a17a04f4]

  2430 Thread_44999
  + 2430 runtime.asmcgocall.abi0  (in go) + 201  [0x100b702f9]
  +   2430 runtime.pthread_cond_timedwait_relative_np_trampoline.abi0  (in go) + 28  [0x100b717ec]
  +     2430 _pthread_cond_wait  (in libsystem_pthread.dylib) + 1276  [0x1a17d45a0]
  +       2430 __psynch_cvwait  (in libsystem_kernel.dylib) + 8  [0x1a1797710]

  2430 Thread_45038
  + 2430 runtime.kevent_trampoline.abi0  (in go) + 40  [0x100b71518]
  +   2430 kevent  (in libsystem_kernel.dylib) + 8  [0x1a179a060]
  ```

We could not reproduce outside of Bazel (using bash & to parallelize `go
build`). We also could not reproduce this issue with Go 1.19 releases.

Let's apply a rudimentry workaround by limitting the concurrency of `go
build` process by setting it's max Go routines to 1. This should make
this and future race conditions happen a lot less on all platforms.
By doing this, we rely more on Bazel and less on Go runtime to saturate
our CPU resources.

This change should not affect remote execution as the default
ResourceSet for each action is set at 1 CPU and 250MB RAM.
tingilee pushed a commit to tingilee/rules_go that referenced this pull request Jul 19, 2023
On darwin, in repositories with many transitions configured such as
rules_go.git, the action GoToolchainBinaryBuild could be run multiple
times in parallel by Bazel.

Each of these actions would execute a command like this in a sandbox:

  ```
  $GO build -o $OUTPUT -trimpath \
    go/tools/builders/ar.go \
    go/tools/builders/asm.go \
    go/tools/builders/builder.go \
    go/tools/builders/cgo2.go \
    go/tools/builders/compilepkg.go \
    go/tools/builders/cover.go \
    go/tools/builders/edit.go \
    go/tools/builders/embedcfg.go \
    go/tools/builders/env.go \
    go/tools/builders/filter.go \
    go/tools/builders/filter_buildid.go \
    go/tools/builders/flags.go \
    go/tools/builders/generate_nogo_main.go \
    go/tools/builders/generate_test_main.go \
    go/tools/builders/importcfg.go \
    go/tools/builders/link.go \
    go/tools/builders/pack.go \
    go/tools/builders/read.go \
    go/tools/builders/replicate.go \
    go/tools/builders/stdlib.go \
    go/tools/builders/stdliblist.go \
    go/tools/builders/path.go
  ```

Then, each `go build` process would execute multiple `go tools compile`
command up to the number of max Go routines it is set to. The default
max number of Go routines for each `go build` process is the number of
core on the machine executing the action.

On a local build, this means that we could have up to n x n number of
goroutines running in parallel when combining the asynchronousity of
both Bazel and `go build`.

After upgrading to Go 1.20.x, we observed that such parallelism could
get a local MacOS machine run into a deadlock on the OS-level, halting
spawn of new processes such as `git config`, or new Chrome tab browser.
Typical sampling of a `go build` process stuck in deadlock would result
in the following call graphs:

  ```
  2430 Thread_44978   DispatchQueue_1: com.apple.main-thread  (serial)
  + 2430 ???  (in <unknown binary>)  [0x1358]
  +   2430 runtime.asmcgocall.abi0  (in go) + 124  [0x100b702ac]
  +     2430 runtime.syscall6.abi0  (in go) + 56  [0x100b71a98]
  +       2430 __wait4_nocancel  (in libsystem_kernel.dylib) + 8  [0x1a17a04f4]

  2430 Thread_44999
  + 2430 runtime.asmcgocall.abi0  (in go) + 201  [0x100b702f9]
  +   2430 runtime.pthread_cond_timedwait_relative_np_trampoline.abi0  (in go) + 28  [0x100b717ec]
  +     2430 _pthread_cond_wait  (in libsystem_pthread.dylib) + 1276  [0x1a17d45a0]
  +       2430 __psynch_cvwait  (in libsystem_kernel.dylib) + 8  [0x1a1797710]

  2430 Thread_45038
  + 2430 runtime.kevent_trampoline.abi0  (in go) + 40  [0x100b71518]
  +   2430 kevent  (in libsystem_kernel.dylib) + 8  [0x1a179a060]
  ```

We could not reproduce outside of Bazel (using bash & to parallelize `go
build`). We also could not reproduce this issue with Go 1.19 releases.

Let's apply a rudimentry workaround by limitting the concurrency of `go
build` process by setting it's max Go routines to 1. This should make
this and future race conditions happen a lot less on all platforms.
By doing this, we rely more on Bazel and less on Go runtime to saturate
our CPU resources.

This change should not affect remote execution as the default
ResourceSet for each action is set at 1 CPU and 250MB RAM.
@hauserx
Copy link
Contributor

hauserx commented Mar 11, 2024

@sluongng Are you still getting multiple builds on bazel 7? I have seen that it reduced number of builder builds from 2 to 1 for my toy example.

If I understand correctly in theory only single builder should be needed as it does not need to be compiled against target configuration, but should be in exec configuration itself - i.e. single builder should be able to build for pure, race, target platform, etc. But not sure if resetting config is enough here, maybe @fmeum has more insights including why bazel 7 helped in some cases.

@fmeum
Copy link
Collaborator

fmeum commented Mar 11, 2024

Bazel 7 should reduce the number of builder builds to at most 2 (target and exec) since identical configurations now result in identical output paths (--experimental_output_directory_naming_scheme=diff_against_dynamic_baseline). While it is true that the target and exec configuration doesn't matter in any way that affects the builder build, this is currently not something we can lower down to 1 as there is no way to reset the exec "bit" of the exec configuration. This is what path mapping will solve eventually.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants