Toolchain resolution problems with multiple platforms and remote execution #8636

Closed
eytankidron opened this issue Jun 14, 2019 · 6 comments
Labels
P2 We'll consider working on this in future. (Assignee optional) team-Configurability Issues for Configurability team type: bug

Comments

@eytankidron

When playing around with remote execution platform selection, I'm hitting a toolchain resolution error with Bazel 0.26.1.

In .bazelrc, I'm setting:
build:remote --incompatible_enable_cc_toolchain_resolution
build:remote --extra_toolchains=@rbe_default//config:cc-toolchain
build:remote --extra_execution_platforms=//:plain_platform,@rbe_default//config:platform
build:remote --host_platform=//:plain_platform
build:remote --platforms=@rbe_default//config:platform

Here, :plain_platform does not declare the constraint_values needed to satisfy the cc_toolchain, and it also uses a Docker image that does not contain clang.

platform(
    name = "plain_platform",
    remote_execution_properties = """
        properties: {
          name: "container-image"
          value:"docker://gcr.io/gcp-runtimes/ubuntu_16_0_4@sha256:096632d8fb3e78fbd58ae6a2b25ed46020dc70e65d89bca774af6f7b2de6898c"
        }
        properties {
          name: "OSFamily"
          value: "Linux"
        }
        """,
)

My BUILD file defines a cc_library and a genrule which uses that cc_library as a tool.

cc_library(
    name = "hello-world-lib",
    srcs = ["hello-world-lib.cc"],
    hdrs = ["hello-world-lib.h"],
)

genrule(
    name = "echo",
    outs = ["echo.txt"],
    cmd = "echo $(locations :hello-world-lib) > $@",
    tools = [
        ":hello-world-lib",
    ],
)

Building :hello-world-lib correctly chooses the platform @rbe_default//config:platform, and just works.
However, building :echo fails with the following errors:

$ bazel build --config=remote --noremote_accept_cached :echo
...
ERROR: While resolving toolchains for target //:hello-world-lib: no matching toolchains found for types @bazel_tools//tools/cpp:toolchain_type
ERROR: Analysis of target '//:echo' failed; build aborted: no matching toolchains found for types @bazel_tools//tools/cpp:toolchain_type
...

I don't think :echo should need a cc toolchain. I believe it should be able to choose //:plain_platform. But if I'm wrong and it does need a cc toolchain, it should be able to choose @rbe_default//config:platform as well. Instead it does neither and just fails.

Note that the error does not reproduce if --host_platform=@rbe_default//config:platform.
It also does not reproduce if :echo does not use the cc_library as a tool.

@katre
Member

katre commented Jun 14, 2019

Can you create a plain git repository with this example that I can clone and test? I'd like to verify against something as close as possible to what you are using.

The cc toolchain dependency was removed from genrule a few releases ago, and shouldn't still exist, so that's confusing.

I'll investigate this as I can.

@katre katre self-assigned this Jun 14, 2019
@katre katre added P2 We'll consider working on this in future. (Assignee optional) team-Configurability Issues for Configurability team type: bug labels Jun 14, 2019
eytankidron added a commit to eytankidron/toolchain-resolution-example that referenced this issue Jun 14, 2019
In order to reproduce this:
* Install docker
* Ask me to grant permissions to access projects/rbe-eytankidron-prod-0/instances/default_instance, or just use a different RBE instance.
* $ bazel build --config=remote --noremote_accept_cached :echo
@katre
Member

katre commented Jun 17, 2019

Thanks for the sample repo, this made it much easier to debug the problem. I encountered three issues when building your sample:

  1. The platform //:plain_platform is defined to not have any constraint_values, which means that it cannot match any of the registered cc toolchains (all of which have at least os and cpu-level constraints). The toolchain resolution debug output shows this, although it is hard to read (I had to step through the code with a debugger, which is not ideal):
INFO: ToolchainResolution: Selected execution platform //:plain_platform, 
INFO: ToolchainResolution: Looking for toolchain of type @bazel_tools//tools/cpp:toolchain_type...
INFO: ToolchainResolution:   Considering toolchain @bazel_toolchains//configs/ubuntu16_04_clang/9.0.0/bazel_0.26.0/cc:cc-compiler-k8...
INFO: ToolchainResolution:     Toolchain constraint @bazel_tools//platforms:os has value @bazel_tools//platforms:linux, which does not match value <missing> from the target platform //:plain_platform
INFO: ToolchainResolution:     Toolchain constraint @bazel_tools//platforms:cpu has value @bazel_tools//platforms:x86_64, which does not match value <missing> from the target platform //:plain_platform
INFO: ToolchainResolution:   Rejected toolchain @bazel_toolchains//configs/ubuntu16_04_clang/9.0.0/bazel_0.26.0/cc:cc-compiler-k8, because of target platform mismatch

Here we can see that toolchain resolution is trying the toolchain @bazel_toolchains//configs/ubuntu16_04_clang/9.0.0/bazel_0.26.0/cc:cc-compiler-k8, but it cannot be used with the target platform //:plain_platform, because the toolchain has cpu and os constraints but the target does not.

The fix here was to add constraints to //:plain_platform:

platform(
    name = "plain_platform",
    remote_execution_properties = """
        properties: {
          name: "container-image"
          value:"docker://gcr.io/gcp-runtimes/ubuntu_16_0_4@sha256:096632d8fb3e78fbd58ae6a2b25ed46020dc70e65d89bca774af6f7b2de6898c"
        }
        properties {
           name: "OSFamily"
           value:  "Linux"
        }
        """,
    constraint_values = [
      "@bazel_tools//platforms:linux",
      "@bazel_tools//platforms:x86_64",
    ],
)
  2. The second problem was the ordering of your platforms. In .bazelrc, you have:
build:remote --extra_execution_platforms=//:plain_platform,@rbe_default//config:platform

This tells toolchain resolution to prefer using //:plain_platform as the execution platform, since it is listed first and the --extra_execution_platforms flag has higher precedence than the register_execution_platforms() function in WORKSPACE.

However, when running the build (after fixing the definition of //:plain_platform), this error occurs:

ERROR: /usr/local/google/home/jcater/repos/toolchain-resolution-example/BUILD:1:1: Couldn't build file _objs/hello-world-lib/hello-world-lib.o: C++ compilation of rule '//:hello-world-lib' failed (Exit 127)
docker: Error response from daemon: OCI runtime create failed: container_linux.go:345: starting container process caused "exec: \"/usr/bin/gcc\": stat /usr/bin/gcc: no such file or directory": unknown.

This is because //:plain_platform is running remotely, but has selected the local gcc toolchain, and gcc is not installed in the docker image.

Why was the local toolchain selected?

INFO: ToolchainResolution:   Considering toolchain @bazel_toolchains//configs/ubuntu16_04_clang/9.0.0/bazel_0.26.0/cc:cc-compiler-k8...
INFO: ToolchainResolution:     Toolchain constraint @bazel_tools//tools/cpp:cc_compiler has value @bazel_tools//tools/cpp:clang, which does not match value <missing> from the execution platform //:plain_platform
INFO: ToolchainResolution:     Toolchain constraint @bazel_tools//tools/cpp:cc_compiler has value @bazel_tools//tools/cpp:clang, which does not match value <missing> from the execution platform //:plain_platform

The cc toolchain @bazel_toolchains//configs/ubuntu16_04_clang/9.0.0/bazel_0.26.0/cc:cc-compiler-k8 also has the constraint @bazel_tools//tools/cpp:clang, which needs to be added to //:plain_platform.

  3. Finally, the build fails with this error:
ERROR: /usr/local/google/home/jcater/repos/toolchain-resolution-example/BUILD:1:1: Couldn't build file _objs/hello-world-lib/hello-world-lib.o: C++ compilation of rule '//:hello-world-lib' failed (Exit 127)
docker: Error response from daemon: OCI runtime create failed: container_linux.go:345: starting container process caused "exec: \"/usr/local/bin/clang\": stat /usr/local/bin/clang: no such file or directory": unknown.

It appears that the docker image specified by //:plain_platform does not have clang installed where the toolchain thinks it is.

The build does work if I swap the line in .bazelrc:

build:remote --extra_execution_platforms=@rbe_default//config:platform,//:plain_platform

This prefers the default RBE platform, which does have clang installed, so that platform is used to compile //:hello-world-lib.

I've sent PR eytankidron/toolchain-resolution-example#1 to show the changes.
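
To make the ordering effect concrete, here is a rough Python sketch of the selection logic described above. It is a simplified model, not Bazel's actual implementation: the short constraint names ("os", "cpu", "cc_compiler"), the assumption that the RBE platform declares all three, and the "local gcc (hypothetical)" toolchain with no execution constraints are stand-ins chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Platform:
    label: str
    # constraint_setting -> constraint_value, e.g. "os" -> "linux"
    constraints: dict = field(default_factory=dict)


@dataclass
class Toolchain:
    label: str
    target_compatible_with: dict = field(default_factory=dict)
    exec_compatible_with: dict = field(default_factory=dict)


def satisfies(platform, required):
    """True if the platform declares every required constraint value."""
    return all(platform.constraints.get(s) == v for s, v in required.items())


def resolve(target, exec_platforms, toolchains):
    """Return (exec platform label, toolchain label) for the first execution
    platform, in listed order, that has a compatible toolchain; else None."""
    for exec_platform in exec_platforms:
        for tc in toolchains:
            if satisfies(target, tc.target_compatible_with) and satisfies(
                exec_platform, tc.exec_compatible_with
            ):
                return exec_platform.label, tc.label
    return None


LINUX_X86 = {"os": "linux", "cpu": "x86_64"}
WITH_CLANG = {**LINUX_X86, "cc_compiler": "clang"}

# Assumed: the RBE platform declares os, cpu and clang constraints.
rbe = Platform("@rbe_default//config:platform", WITH_CLANG)
# //:plain_platform after the os/cpu fix, still without the clang constraint.
plain = Platform("//:plain_platform", LINUX_X86)

toolchains = [
    Toolchain("cc-compiler-k8 (clang)", LINUX_X86, {"cc_compiler": "clang"}),
    # Hypothetical stand-in for the local gcc toolchain, no exec constraints.
    Toolchain("local gcc (hypothetical)", LINUX_X86, {}),
]

# Target platform is the RBE platform in both cases (--platforms).
plain_first = resolve(rbe, [plain, rbe], toolchains)
rbe_first = resolve(rbe, [rbe, plain], toolchains)
print(plain_first)
print(rbe_first)
```

In this model, listing //:plain_platform first pairs it with the gcc stand-in (the clang toolchain is rejected for the missing cc_compiler constraint), while listing the RBE platform first selects the clang toolchain.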

@eytankidron
Author

Just to be clear, I intentionally set up //:plain_platform to not support the cc_toolchain by not having clang in its docker image and by not specifying the constraint_values. I also intentionally put it first in the order of platforms to be considered. This was by design, in order to get different targets to use different platforms. I realize that if I give @rbe_default//config:platform a higher priority the bug does not reproduce.

What I expected to happen was that bazel would select @rbe_default//config:platform for the cc_library target and //:plain_platform for the genrule target. I'm still not clear why //:plain_platform could not be used for the genrule target. In other words, why does the genrule target require a cc_toolchain?

And a secondary question is this: if bazel does have a valid reason to decide that the genrule target needs the platform to be compatible with the cc_toolchain (presumably due to its tools dependency), the fact that //:plain_platform does not have the appropriate constraint_values should have ruled out that platform, and as a result bazel should have used @rbe_default//config:platform instead. But that did not happen.

@katre
Member

katre commented Jun 17, 2019

At no point does the genrule require a cc toolchain. The genrule declares a tool dependency, and //:hello-world-lib requires a cc toolchain (because it is a cc_library).

In your initial example, there is no platform that can build //:hello-world-lib:

  • The genrule //:echo selected //:plain_platform as its execution platform
  • Therefore, //:plain_platform is the target platform for //:hello-world-lib
  • However, //:plain_platform does not declare os or cpu constraints, so there is no cc toolchain that can target it.
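
The three steps above can be sketched in Python as follows. This is only an illustrative model under stated assumptions, not Bazel's code: the short constraint names and the constraints listed for @rbe_default//config:platform are assumed, and only the toolchain's target-side constraints are modelled.

```python
def satisfies(platform_constraints, required):
    """True if the platform declares every required constraint value."""
    return all(platform_constraints.get(s) == v for s, v in required.items())


# Platforms as in the initial example: //:plain_platform has no constraints.
PLATFORMS = {
    "//:plain_platform": {},
    # Assumed constraints for the RBE platform.
    "@rbe_default//config:platform": {"os": "linux", "cpu": "x86_64"},
}

# Target-side constraints of the registered cc toolchain (os and cpu).
CC_TOOLCHAINS = {"cc-compiler-k8": {"os": "linux", "cpu": "x86_64"}}

# Order from --extra_execution_platforms in the initial .bazelrc.
EXEC_ORDER = ["//:plain_platform", "@rbe_default//config:platform"]


def cc_toolchains_targeting(target_platform):
    """Labels of cc toolchains able to produce output for target_platform."""
    constraints = PLATFORMS[target_platform]
    return [
        label
        for label, required in CC_TOOLCHAINS.items()
        if satisfies(constraints, required)
    ]


# Step 1: the genrule needs no toolchains, so the first listed execution
# platform is acceptable for //:echo.
echo_exec = EXEC_ORDER[0]

# Step 2: a tool must run where the genrule's actions run, so the tool's
# target platform is the genrule's execution platform.
lib_target_as_tool = echo_exec

# Step 3: look for a cc toolchain that can target that platform.
as_tool = cc_toolchains_targeting(lib_target_as_tool)

# Building //:hello-world-lib directly uses --platforms as the target platform.
direct = cc_toolchains_targeting("@rbe_default//config:platform")

print(as_tool)
print(direct)
```

In this model the tool build finds no candidate toolchains because of its target platform, while the direct build of //:hello-world-lib finds one.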

@eytankidron
Author

> In your initial example, there is no platform that can build //:hello-world-lib:

Why? Can't @rbe_default//config:platform be that platform?

> The genrule //:echo selected //:plain_platform as its execution platform
> Therefore, //:plain_platform is the target platform for //:hello-world-lib

I thought the whole point was that different targets in the same invocation can be handled by different platforms. Is that not the case?

@katre
Member

katre commented Jun 18, 2019

It is entirely possible that different targets have different execution platforms. Every configured target has two platforms it is concerned with: the target platform (where the output will execute) and the execution platform (where build actions will execute).

The target //:hello-world-lib ends up with a target platform of //:plain_platform. There is no cc toolchain, however, that can generate output for that platform. This is entirely unrelated to the execution platform that is selected: the problem is with the target platform.

@katre katre closed this as completed Jul 8, 2019