
TensorFlow is broken with Bazel@HEAD #16296

Closed
meteorcloudy opened this issue Sep 19, 2022 · 40 comments
Assignees
Labels
breakage P1 I'll work on this now. (Assignee required) team-Rules-CPP Issues for C++ rules type: bug

Comments

@meteorcloudy
Member

https://buildkite.com/bazel/bazel-at-head-plus-downstream/builds/2640#01835384-a5a8-4cef-839a-92041e93f6fb

(03:29:23) ERROR: /var/lib/buildkite-agent/builds/bk-docker-062w/bazel-downstream-projects/tensorflow/tensorflow/BUILD:1409:19: Executing genrule //tensorflow:tf_python_api_gen_v2 failed: (Aborted): bash failed: error executing command (from target //tensorflow:tf_python_api_gen_v2)
  (cd /var/lib/buildkite-agent/.cache/bazel/_bazel_buildkite-agent/18887e87f83f3c86081828342b40a00a/execroot/org_tensorflow && \
  exec env - \
    PATH=/var/lib/buildkite-agent/.cache/bazelisk/local/-tmp-tmppfqvvf95-bazel/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin \
    PYTHON_BIN_PATH=/usr/bin/python3 \
    PYTHON_LIB_PATH=/usr/lib/python3/dist-packages \
    TF2_BEHAVIOR=1 \
  /bin/bash -c 'source external/bazel_tools/tools/genrule/genrule-setup.sh; bazel-out/k8-opt-exec-50AE0418/bin/tensorflow/create_tensorflow.python_api_tf_python_api_gen_v2 --root_init_template=tensorflow/api_template.__init__.py --apidir=bazel-out/k8-opt/bin/tensorflow_api/v2/ --apiname=tensorflow --apiversion=2  --compat_apiversion=1 --compat_apiversion=2  -- 
...
...

Culprit finder: https://buildkite.com/bazel/culprit-finder/builds/3607#01834b10-64ec-4e64-812f-2357b4ccd39a
Bisect shows dadc49e is the culprit

@meteorcloudy meteorcloudy added type: bug P1 I'll work on this now. (Assignee required) breakage team-Rules-CPP Issues for C++ rules labels Sep 19, 2022
@meteorcloudy meteorcloudy added this to the 6.0.0 release blockers milestone Sep 19, 2022
@meteorcloudy
Member Author

meteorcloudy commented Sep 19, 2022

/cc @c-parsons @oquenchil Can you look into this? This is blocking the 6.0 cut since it's breaking TensorFlow.

My guess is dadc49e somehow caused the command line to be too long?

If dadc49e is valid, maybe the fix should come from the TensorFlow side?

@c-parsons
Contributor

It's unclear to me how dadc49e could be causing this failure, both from looking at the failure itself and at the surrounding builds.

  1. The commit in question changes cc_shared_library validation, and shouldn't practically change actions or command lines.
    The broken target is a genrule, and the underlying error description (I think) is:
terminate called after throwing an instance of 'google::protobuf::FatalException'
  what():  CHECK failed: GeneratedDatabase()->Add(encoded_file_descriptor, size):
/bin/bash: line 1: 87712 Aborted

I'm unsure how this could be related...

  2. It looks like TensorFlow succeeded with Bazel@HEAD on Aug 3, which was well after dadc49e (Jun 10)?
    https://buildkite.com/bazel/bazel-at-head-plus-downstream/builds/2573

I'm rerunning the culprit finder, just in case this was a flake. If I get a similar result pointing to that commit, I'll need to dig much deeper.

@meteorcloudy
Member Author

It looks like TensorFlow succeeded with Bazel@HEAD on Aug 3, which was well after dadc49e (Jun 10)?

Yes, that's why we didn't discover this earlier. TensorFlow only started using cc_shared_library recently. I believe it's the newer TensorFlow commits that are incompatible with this change.

@meteorcloudy
Member Author

I can probably help you bisect TensorFlow to find out which change is incompatible with dadc49e

@meteorcloudy
Member Author

meteorcloudy commented Sep 19, 2022

A known good TF commit is a3182a1eb28b5b6d081391669db3546325b5f1f1, and a known bad TF commit is 322c0a555914006635603c51f39d073ce83cd368

@meteorcloudy
Member Author

meteorcloudy commented Sep 19, 2022

Bisecting on my gLinux machine with:

$ docker run -d -it gcr.io/bazel-public/ubuntu2004-java11
3d0814114fd76a1013f8f78028da45fd9b07ad56d7d55be2df44e365f0e50388
pcloudy@pcloudy:~
$ docker exec -it 3d0814114fd76a1013f8f78028da45fd9b07ad56d7d55be2df44e365f0e50388  /bin/bash
root@3d0814114fd7:/# mkdir workdir
root@3d0814114fd7:/# cd workdir/
root@3d0814114fd7:/workdir# git clone https://github.com/tensorflow/tensorflow.git
...
root@3d0814114fd7:/workdir# cd tensorflow/
root@3d0814114fd7:/workdir/tensorflow# export USE_BAZEL_VERSION=dadc49e437018f482640ed76fae5307daf9911a8
root@3d0814114fd7:/workdir/tensorflow# pip3 install portpicker keras_applications keras_preprocessing future packaging dataclasses
...
root@3d0814114fd7:/workdir/tensorflow# ./configure
...
root@3d0814114fd7:/workdir/tensorflow# git bisect start d7f0308d5c2eb647d7348ee1eacd2042c0504b27 ae909b0995845407a8fc77e16726a6314386544a
Bisecting: 1193 revisions left to test after this (roughly 10 steps)
[a3182a1eb28b5b6d081391669db3546325b5f1f1] [xla:mlir] NFC: Use std::string_view instead of llvm::StringRef
root@3d0814114fd7:/workdir/tensorflow# git bisect run bazel build //tensorflow/tools/pip_package:build_pip_package
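(As a side note, git bisect run treats exit status 0 as good, 1-124 as bad, and 125 as "skip this commit". Instead of calling bazel directly, a small wrapper could be used to skip commits that fail for unrelated reasons; the script below is an illustrative sketch, not something actually used in this thread.)

    # bisect_check.py (illustrative sketch): run with `git bisect run python3 bisect_check.py`
    import subprocess
    import sys

    TARGET = "//tensorflow/tools/pip_package:build_pip_package"

    build = subprocess.run(["bazel", "build", TARGET])
    if build.returncode == 0:
        sys.exit(0)  # builds cleanly: good commit
    # A smarter check could parse the build log and exit 125 for failures that
    # are clearly unrelated to the change being bisected, so git bisect skips them.
    sys.exit(1)  # build failed: bad commit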

@c-parsons
Contributor

Indeed, the second culprit finder still points to that commit.

I'd be interested in the results of your bisect.
Sorry, it's been a bit of a day with unrelated issues. I can look into this more actively tomorrow.

@meteorcloudy
Member Author

The bisect is taking longer than I expected; it was interrupted by unrelated build failures.

@meteorcloudy
Member Author

I can confirm I can reproduce this issue at tensorflow/tensorflow@007f31c with Bazel at dadc49e, but not at the parent commit of tensorflow/tensorflow@007f31c

@c-parsons
Contributor

I'm having issues getting even a clean build passing with tensorflow, either at HEAD or at tensorflow/tensorflow@007f31c (with bazel 5.3.1) :\

In both cases:

bazel build -c opt --output_filter=^$ --test_env=HOME --test_env=BAZELISK_USER_AGENT --test_env=USE_BAZEL_VERSION -- //tensorflow/tools/pip_package:build_pip_package

Error at tensorflow/tensorflow@007f31c:

ERROR: /usr/local/google/home/cparsons/.cache/bazel/_bazel_cparsons/67a079ece8e55c9ef2890c54a8d15515/external/boringssl/BUILD:130:11: Compiling src/third_party/fiat/curve25519.c failed: (Exit 1): gcc failed: error executing command /usr/bin/gcc -U_FORTIFY_SOURCE -fstack-protector -Wall -Wunused-but-set-parameter -Wno-free-nonheap-object -fno-omit-frame-pointer -g0 -O2 '-D_FORTIFY_SOURCE=1' -DNDEBUG -ffunction-sections ... (remaining 44 arguments skipped)
external/boringssl/src/third_party/fiat/curve25519.c:511:57: error: argument 2 of type 'const uint8_t[32]' {aka 'const unsigned char[32]'} with mismatched bound [-Werror=array-parameter=]
  511 | int x25519_ge_frombytes_vartime(ge_p3 *h, const uint8_t s[32]) {
      |                                           ~~~~~~~~~~~~~~^~~~~
In file included from external/boringssl/src/third_party/fiat/curve25519.c:41:
external/boringssl/src/third_party/fiat/internal.h:117:58: note: previously declared as 'const uint8_t *' {aka 'const unsigned char *'}
  117 | int x25519_ge_frombytes_vartime(ge_p3 *h, const uint8_t *s);
      |                                           ~~~~~~~~~~~~~~~^
external/boringssl/src/third_party/fiat/curve25519.c:831:57: error: argument 2 of type 'const uint8_t *' {aka 'const unsigned char *'} declared as a pointer [-Werror=array-parameter=]
  831 | void x25519_ge_scalarmult_base(ge_p3 *h, const uint8_t *a) {
      |                                          ~~~~~~~~~~~~~~~^
external/boringssl/src/third_party/fiat/internal.h:125:56: note: previously declared as an array 'const uint8_t[32]' {aka 'const unsigned char[32]'}
  125 | void x25519_ge_scalarmult_base(ge_p3 *h, const uint8_t a[32]);
      |                                          ~~~~~~~~~~~~~~^~~~~
cc1: all warnings being treated as errors
Target //tensorflow/tools/pip_package:build_pip_package failed to build

Error at HEAD:

ERROR: /usr/local/google/home/cparsons/tensorflow-github/tensorflow/tensorflow/BUILD:1409:19: Executing genrule //tensorflow:tf_python_api_gen_v2 failed: (Exit 1): bash failed: error executing command /bin/bash -c ... (remaining 1 argument skipped)
2022-09-20 13:21:57.687575: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  SSE3 SSE4.1 SSE4.2 AVX AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
Traceback (most recent call last):
  File "/usr/local/google/home/cparsons/.cache/bazel/_bazel_cparsons/67a079ece8e55c9ef2890c54a8d15515/execroot/org_tensorflow/bazel-out/k8-opt/bin/tensorflow/create_tensorflow.python_api_tf_python_api_gen_v2.runfiles/org_tensorflow/tensorflow/python/tools/api/generator/create_python_api.py", line 22, in <module>
    from tensorflow.python.tools.api.generator import doc_srcs
  File "/usr/local/google/home/cparsons/.cache/bazel/_bazel_cparsons/67a079ece8e55c9ef2890c54a8d15515/execroot/org_tensorflow/bazel-out/k8-opt/bin/tensorflow/create_tensorflow.python_api_tf_python_api_gen_v2.runfiles/org_tensorflow/tensorflow/python/__init__.py", line 42, in <module>
    from tensorflow.python import data
  File "/usr/local/google/home/cparsons/.cache/bazel/_bazel_cparsons/67a079ece8e55c9ef2890c54a8d15515/execroot/org_tensorflow/bazel-out/k8-opt/bin/tensorflow/create_tensorflow.python_api_tf_python_api_gen_v2.runfiles/org_tensorflow/tensorflow/python/data/__init__.py", line 21, in <module>
    from tensorflow.python.data import experimental
  File "/usr/local/google/home/cparsons/.cache/bazel/_bazel_cparsons/67a079ece8e55c9ef2890c54a8d15515/execroot/org_tensorflow/bazel-out/k8-opt/bin/tensorflow/create_tensorflow.python_api_tf_python_api_gen_v2.runfiles/org_tensorflow/tensorflow/python/data/experimental/__init__.py", line 96, in <module>
    from tensorflow.python.data.experimental import service
  File "/usr/local/google/home/cparsons/.cache/bazel/_bazel_cparsons/67a079ece8e55c9ef2890c54a8d15515/execroot/org_tensorflow/bazel-out/k8-opt/bin/tensorflow/create_tensorflow.python_api_tf_python_api_gen_v2.runfiles/org_tensorflow/tensorflow/python/data/experimental/service/__init__.py", line 419, in <module>
    from tensorflow.python.data.experimental.ops.data_service_ops import distribute
  File "/usr/local/google/home/cparsons/.cache/bazel/_bazel_cparsons/67a079ece8e55c9ef2890c54a8d15515/execroot/org_tensorflow/bazel-out/k8-opt/bin/tensorflow/create_tensorflow.python_api_tf_python_api_gen_v2.runfiles/org_tensorflow/tensorflow/python/data/experimental/ops/data_service_ops.py", line 22, in <module>
    from tensorflow.python.data.experimental.ops import compression_ops
  File "/usr/local/google/home/cparsons/.cache/bazel/_bazel_cparsons/67a079ece8e55c9ef2890c54a8d15515/execroot/org_tensorflow/bazel-out/k8-opt/bin/tensorflow/create_tensorflow.python_api_tf_python_api_gen_v2.runfiles/org_tensorflow/tensorflow/python/data/experimental/ops/compression_ops.py", line 16, in <module>
    from tensorflow.python.data.util import structure
  File "/usr/local/google/home/cparsons/.cache/bazel/_bazel_cparsons/67a079ece8e55c9ef2890c54a8d15515/execroot/org_tensorflow/bazel-out/k8-opt/bin/tensorflow/create_tensorflow.python_api_tf_python_api_gen_v2.runfiles/org_tensorflow/tensorflow/python/data/util/structure.py", line 29, in <module>
    from tensorflow.python.ops import resource_variable_ops
  File "/usr/local/google/home/cparsons/.cache/bazel/_bazel_cparsons/67a079ece8e55c9ef2890c54a8d15515/execroot/org_tensorflow/bazel-out/k8-opt/bin/tensorflow/create_tensorflow.python_api_tf_python_api_gen_v2.runfiles/org_tensorflow/tensorflow/python/ops/resource_variable_ops.py", line 36, in <module>
    from tensorflow.python.framework import meta_graph
  File "/usr/local/google/home/cparsons/.cache/bazel/_bazel_cparsons/67a079ece8e55c9ef2890c54a8d15515/execroot/org_tensorflow/bazel-out/k8-opt/bin/tensorflow/create_tensorflow.python_api_tf_python_api_gen_v2.runfiles/org_tensorflow/tensorflow/python/framework/meta_graph.py", line 18, in <module>
    from packaging import version as packaging_version  # pylint: disable=g-bad-import-order
ModuleNotFoundError: No module named 'packaging'
Target //tensorflow/tools/pip_package:build_pip_package failed to build
Use --verbose_failures to see the command lines of failed build steps.
ERROR: /usr/local/google/home/cparsons/tensorflow-github/tensorflow/tensorflow/lite/python/BUILD:68:10 Middleman _middlemen/tensorflow_Slite_Spython_Stflite_Uconvert-runfiles failed: (Exit 1): bash failed: error executing command /bin/bash -c ... (remaining 1 argument skipped)

Tensorflow seems green at HEAD, so I'm not sure what's up.

@meteorcloudy
Member Author

cc1: all warnings being treated as errors

Ah, I was also hitting this strange error. A temporary hack is to

vim bazel-tensorflow/external/boringssl/BUILD

Then comment out the -Wall, -Werror flags in the BUILD file.
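(Roughly like the following; this is an illustrative sketch only, the real target names and flag lists in external/boringssl/BUILD differ.)

    # Illustrative sketch, not the actual boringssl BUILD content.
    cc_library(
        name = "crypto",  # hypothetical target name
        srcs = glob(["src/**/*.c"]),
        copts = [
            # "-Wall",    # commented out
            # "-Werror",  # commented out so the array-parameter warning is no longer fatal
        ],
    )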

ModuleNotFoundError: No module named 'packaging'

I guess you'll have to run pip3 install portpicker keras_applications keras_preprocessing future packaging dataclasses

@c-parsons
Contributor

Thanks a bunch! I've reproduced the error successfully.

Still don't have any leads on the underlying failure -- it may take me a bit.

@c-parsons
Contributor

This has been exceedingly hard to debug, both because TensorFlow's setup is very complicated and because the debugging tools for tracking cc_shared_library dependencies could use some work.

I've made a little progress, and have a fix, but I'm not sure if it's the most principled solution.

Effectively, it seems that, with dadc49e, we collect static dependencies of cc_shared_library across any deps-like edges.

This results in a change to //tensorflow/python:_pywrap_tensorflow_internal.so (there are symbol differences before/after this change). However, there's no change to several other shared libraries, such as //tensorflow:libtensorflow_framework.so.

By doing some print debugging and diffing, I tried to track down which link-static dependencies were new to this shared library as a result of the Bazel change. The new ones include several filegroup dependencies reached via deps and srcs.
Many of these added dependencies have no effect, but there must be at least one that does.

Amending the check in the offending change to:

if fieldname == "_cc_toolchain" or fieldname == "target_compatible_with" or fieldname == "srcs" or fieldname == "hdrs":
    continue

seems to make tensorflow's build pass successfully.

Still, I'm not sure this is really a principled change. I feel like I haven't found the true root cause of TensorFlow's failure; the proposed amendment makes the build pass, but it feels like a hacky, unprincipled solution.

@meteorcloudy
Member Author

@c-parsons Thanks!

/cc @oquenchil Do you have any insight into this issue? How should the aspect work for cc_shared_library?

/cc @learning-to-play, who is working on TF's cc_shared_library migration.

@oquenchil
Contributor

Hi Chris,

I agree with you that adding more attribute names to the check would get a bit out of hand. Even hard-coding "_cc_toolchain" does not seem ideal on second look; I can imagine someone's custom C++ rule in Starlark placing the toolchain in an attribute called "_toolchain".

From the change description and attached issues I can't really tell what prompted this change in the first place. Maybe if we go back to the original problem, we can find a different way to do this that doesn't break TF.

@oquenchil
Contributor

Looking at this a bit more, I can see why you sent the change in the first place. First, just to document the problem, let me rephrase what Chris found during debugging.

The cc_shared_library implementation decides what to link statically based on the structure of the build graph; it uses the aspect to figure out that structure. It tries to link statically every library in the transitive closure of the direct deps of the shared library (i.e. the roots) that is reachable without going through dynamic libraries. That's why we need the structure of the graph.

Chris's original change is correct and wrong at the same time. It statically links libraries that should be linked statically but weren't before (because his project has custom C++ rules with different attribute names, which is a valid use case), but it also statically links libraries that shouldn't be.

The proposed fix of hard-coding the attribute names "srcs" and "hdrs" cuts the chain that builds the graph structure as it propagates via genrules, filegroups, etc. I think that chain can be cut earlier by using required_providers on the aspect definition. This would skip the genrule and filegroup altogether; it also forces the custom C++ rules to advertise the CcInfo provider, so, Chris, you might want to check that this works for your project too, not just for TensorFlow.

If it works, I think the whole line if fieldname == "_cc_toolchain" or fieldname == "target_compatible_with" or fieldname == "srcs" or fieldname == "hdrs": can be removed, including the field names that were already there: _cc_toolchain and target_compatible_with.
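A minimal sketch of what that could look like on the aspect definition (only required_providers is the point here; the implementation function name and the other fields are assumptions, not the actual code in experimental_cc_shared_library.bzl):

    # Sketch only: restrict aspect propagation to dependencies that advertise CcInfo.
    graph_structure_aspect = aspect(
        implementation = _graph_structure_aspect_impl,  # existing implementation
        attr_aspects = ["*"],
        # The aspect is not applied to targets whose rule does not advertise CcInfo
        # (via `provides = [CcInfo]`), so genrule/filegroup edges are cut before
        # the aspect implementation even runs.
        required_providers = [CcInfo],
    )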

Note:
There is still one edge case where a precompiled library coming from a filegroup or generated by a genrule isn't linked statically when it should be, but if any project bumps into that in the future, they should wrap that target in a cc_import, which would be the principled thing to do. This shouldn't happen with TF now, since with the old code they were definitely not relying on any filegroup or genrule being linked that way anyway.

@c-parsons
Contributor

Thanks for the summary! You are indeed correct on my project's motivations.

Using required_providers with CcInfo sounds like a good idea, and a more principled solution. Let me give this a try on our project and I'll get back to you.

@c-parsons
Contributor

I've run into some issues using the required_providers approach. My understanding is that this will only propagate the aspect to targets which advertise their provider using provides =. This is only a subset of targets which propagate this provider.

For example, the native implementation of cc_import propagates CcInfo, but does not advertise using provides = [CcInfo]. As a result, the aspect will not propagate to cc_import.

We can get around this by checking for CcInfo in dep manually in the aspect. This is a bit of a hack, but I've confirmed it keeps my project working.

However, this seems to not fix the Tensorflow issue.

Briefly, the proposed solution is to change

    for fieldname in dir(ctx.rule.attr):
        if fieldname == "_cc_toolchain" or fieldname == "target_compatible_with":
            continue
        deps = getattr(ctx.rule.attr, fieldname, None)
        if type(deps) == "list":
            for dep in deps:
                if type(dep) == "Target" and GraphNodeInfo in dep:
                    children.append(dep[GraphNodeInfo])
        # Single dep.
        elif type(deps) == "Target" and GraphNodeInfo in deps:
            children.append(deps[GraphNodeInfo])

to:

    for fieldname in dir(ctx.rule.attr):
        deps = getattr(ctx.rule.attr, fieldname, None)
        if type(deps) == "list":
            for dep in deps:
                if type(dep) == "Target" and GraphNodeInfo in dep and CcInfo in dep:
                    children.append(dep[GraphNodeInfo])

            # Single dep.
        elif type(deps) == "Target" and GraphNodeInfo in deps and CcInfo in deps:
            children.append(deps[GraphNodeInfo])

@meteorcloudy
Member Author

I guess the reason is that the CcInfo approach didn't prevent some dependencies from propagating via the "srcs" and "hdrs" attributes, because some targets in those attributes do have the CcInfo provider?

@meteorcloudy
Member Author

I think one unsolved mystery is why linking more dependencies causes a crash.

@c-parsons
Contributor

Indeed, can we get any help from TF here to demystify their build?

I'm inclined to go with the hacky solution I outlined in #16296 (comment) for the time being. We can follow up with a more principled solution after we unblock the release. WDYT?

@oquenchil
Contributor

It feels like we are missing a way to express this in Starlark. Chris, what are your custom attribute names? If we are going to hardcode hdrs and srcs, I'd prefer that we go back to the way it was before your change and include your attribute names in a list alongside "deps", unless they are very specific to your project rather than generic.

In any case, unblocking this will definitely require one of those two hacky solutions while we come up with a better way.

@c-parsons
Contributor

We use a number of custom Starlark rules in our graph (generally hidden away by macros). Attribute names include:

"root", "roots", "shared", "includes_deps", "system_libraries", "stub_target", "library_target", "root_target", "source_target"

I'm frankly not sure without further investigation whether we'd need to allowlist all of these attribute names -- it's possible we only need "root", "roots", and "shared".

If it's worth investigating the viability of this path (that a hacky allowlist is better than a hacky denylist), I'm happy to give this a whirl in my project. Let me know.

@oquenchil
Contributor

So I was thinking about this some more yesterday, and I think the right way (and the way to have done it from the very beginning) is with the concept of aspect hints introduced by @scentini and @hlopko.

I said "concept" because we don't need to use the built-in support for aspect_hints. There is no need for individual targets of your custom rule to set the aspect hints. It would just be the rule definition that does this. So cc_shared_library can have a new provider PropagationHints that is public just like CcSharedLibraryInfo. In there you would put "root", "roots" and "shared" which we would read from the aspect implementation to decide whether we should merge the results.

If the PropagationHints provider is not present, then propagation would happen just as it did before your original change, via the "deps" attribute. I should probably come back to this later and make sure that in the default case it only propagates via the "deps" attribute if the dependency has CcInfo, but that's a change in behavior that may break things and is unrelated to your change, so I'll take care of it.
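A rough sketch of the shape this could take; PropagationHints does not exist yet, so every name and field below is a proposal/assumption, not current Bazel API:

    # Proposal sketch only; PropagationHints and its field are assumptions.
    PropagationHints = provider(
        doc = "Attribute names the cc_shared_library graph aspect should follow.",
        fields = ["attributes"],
    )

    def _custom_cc_rule_impl(ctx):
        # Hypothetical custom C++ rule: the rule definition (not each target)
        # declares which of its attributes behave like "deps".
        return [
            ctx.attr.root_target[CcInfo],
            PropagationHints(attributes = ["root_target", "roots", "shared"]),
        ]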

@meteorcloudy
Member Author

@c-parsons Does Pedro's proposal work for your use case?

@c-parsons
Contributor

It probably does, but I haven't had the cycles to verify. I hope to get you an answer tomorrow. Sorry about this!

@meteorcloudy
Member Author

meteorcloudy commented Oct 6, 2022

@c-parsons No worries, we pushed the release cut back another 2 weeks (to 2022-10-24), so we'll have some extra time for this. That said, this is still one of the few hard release blockers for 6.0.

@c-parsons
Contributor

OK, I have confirmed the approach described above fixes my project. Furthermore, after about another day of diagnosing and debugging, I've determined we would only need to add PropagationHints to a single additional rule, to support the attribute "root_target".

In light of this finding, we technically have a workaround: we can rename root_target of that rule to deps. Thankfully, this is a fairly easy refactoring as this rule is private to a macro.

So this raises the question: do we want to fix forward with your PropagationHints suggestion, or roll back the original offending commit dadc49e?

Issues with the roll-forward suggestion
One snag in taking the PropagationHints approach is that there isn't currently an easy and clean way to expose a namespace-scoped PropagationHints provider to users. Correct me if I'm wrong, but I think users can't import "PropagationHints" as a symbol from experimental_cc_shared_library.bzl; they would need to refer to it as a global native symbol. At which point, do we name it CcSharedLibraryPropagationHints? Do we make it part of the cc_common Starlark module?

This requires more thought, IMO.

Issues with the rollback suggestion
I'm admittedly not 100% confident in my analysis of how limited the root problem is in my project. Our push for the upstream commit dadc49e actually predates the private root_target rule, which is evidence that our rule / macro structure has changed significantly in the last 3 months.
Thus, I have concerns that we'll need a principled, customizable solution to this problem, if not now then very soon.

My suggestion
I suggest we roll back dadc49e to unblock 6.0, but flag this as something we need to get solved at very high priority (though not necessarily as a 6.0 blocker!). It helps that my project follows Bazel@HEAD very closely, so it should be reasonable to solve this after the 6.0 cut.

What do you think?

@meteorcloudy
Member Author

Thanks @c-parsons for such a detailed analysis; I agree with your suggestion! I think we can even cherry-pick the solution into 6.x minor releases if necessary. Ideally it should be made backwards compatible, but since cc_shared_library is still behind the experimental flag, I guess it's fine to make even a small incompatible change?

@c-parsons
Contributor

Relatedly, it looks like 21904a9 somewhat contradicts the philosophy we've discussed here.

@meteorcloudy
Member Author

/cc @fmeum

@meteorcloudy
Member Author

I see, it looks like 21904a9 is blocking the rollback of dadc49e.

@fmeum
Collaborator

fmeum commented Oct 7, 2022

@c-parsons @meteorcloudy I created this commit over half a year ago, feel free to change the logic in any way you see fit - including rolling it back. It only affects the non-Starlark implementation anyway.

@meteorcloudy
Member Author

meteorcloudy commented Oct 10, 2022

@oquenchil @c-parsons I guess we should roll back 21904a9 as well.

@oquenchil
Contributor

I agree with rolling back both changes.

The logic from 21904a9 is also in the Starlark implementation now, and I think cc_library.implementation_deps should indeed follow the same philosophy, but we can cut cc_library some slack for now, since I cannot imagine a future where cc_library and cc_shared_library are not living next to each other, maintained by the same owners.

If we need to roll back 21904a9 too in order to cleanly roll back dadc49e, then, as Fabian says, that should not be an issue.

copybara-service bot pushed a commit that referenced this issue Oct 11, 2022
*** Reason for rollback ***

Due to #16296 (comment)

*** Original change description ***

Collect implementation_deps in graph node aspect

This is needed for implementation_deps of cc_library targets to be linked into cc_binary targets with dynamic_deps and cc_shared_library targets.

Fixes #14731

Closes #14730.

PiperOrigin-RevId: 480360695
Change-Id: Ic1b9c17ce3af730fa0131f5d87e5ca99ae2740c5
copybara-service bot pushed a commit that referenced this issue Oct 11, 2022
*** Reason for rollback ***

Breaks Tensorflow. See #16296

*** Original change description ***

Apply graph_structure aspect across all edges

RELNOTES: None.
PiperOrigin-RevId: 480381529
Change-Id: Ied2f6b14a4b80e3c77691cf716c199b9c8a63f85
@meteorcloudy
Member Author

This is strange: after rolling back both commits, TF still failed with the same error:
https://buildkite.com/bazel/bazel-at-head-plus-downstream/builds/2675#0183d0a7-0257-4221-aa4a-b29ab6dc0908

@meteorcloudy
Member Author

2022-10-13 12:35:12.477198: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  SSE4.2 AVX AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-10-13 12:35:20.982806: F ./tensorflow/core/framework/collective.h:501] Non-OK-status: CollectiveRegistry::Register(collective_name, factory) status: INTERNAL: Already registered collective AllToAll
/bin/bash: line 1: 48649 Abort trap: 6           bazel-out/darwin-opt-exec-

This indicates some library was probably linked twice.
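Conceptually, that is the failure mode you get when the same registration code ends up linked into the binary twice; a hypothetical sketch of the pattern (not TensorFlow's actual registry code):

    # Hypothetical illustration, not TensorFlow code: a registry that refuses
    # duplicate names aborts if the same library's registrations run twice.
    _collectives = {}

    def register_collective(name, factory):
        if name in _collectives:
            raise RuntimeError("Already registered collective " + name)
        _collectives[name] = factory

    register_collective("AllToAll", object())
    register_collective("AllToAll", object())  # second linked copy -> fatal error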

@meteorcloudy
Member Author

I found there are still some differences in cc_shared_library behavior between Bazel HEAD and Bazel 5.3.1:
https://gist.github.com/meteorcloudy/392355424b6986a8c7eb3361813d3849

@oquenchil My guess is that some change caused the failure by making some libraries be linked twice. Can you help debug this to see whether this is a bug in cc_shared_library or in TF?
I can help confirm if reverting all changes in cc_shared_library fixes the failure.

jrguzman-ms pushed a commit to msft-mirror-aosp/platform.build.bazel that referenced this issue Jan 24, 2023
This hack allows us to support native cc_shared_library aspect
propagation along the "deps" attribute.

See
bazelbuild/bazel#16296 (comment)
for details.

Test: mixed_droid, both at HEAD and after a rollback of the referenced
Bazel commit

Change-Id: I2247b974a5cd7cc105ec1d674569d9eacdbc302b
copybara-service bot pushed a commit that referenced this issue Feb 6, 2023
This is rolling forward ff927f7 which I mistakenly confused as the root cause for #16296. The implementation does have an issue with too much propagation which is fixed on this change with required_providers and required_aspect_providers restrictions on the aspect.

Fixes #17091.

RELNOTES:none
PiperOrigin-RevId: 507457624
Change-Id: I7317c4532ef20cd7584c55aacc3eca82c50a618b
copybara-service bot pushed a commit that referenced this issue Mar 14, 2023
One reason why cc_shared_library wasn't working for upb_proto_library was because indirect cc_shared_library.deps (i.e.  linker_inputs not provided by the immediate dependency) were being dropped, that was fixed in f9008f6.

A second reason why it wasn't working was because until now cc_shared_library has relied on transitive deps providing CcInfo or ProtoInfo while the upb_proto_library aspect provides a custom wrapped CcInfo.  This change fixes that via aspect hints. (It would still require a change in the upb rule implementation).

A third reason why it didn't work for upb was because to avoid a collision with the other aspects applied on proto_libraries, the upb_proto_library implementation added a suffix to the name when calling cc_common.create_linking_context(). This in turn caused the `owner` used in the `linker_inputs` to not match the labels of the targets seen by the cc_shared_library. To fix this, the hints provider accepts a list of owners.

Apart from those features, the hints provider also allows specifying a list of attributes for the cases where we don't want the result of every dependency to be used (this hasn't come up yet and it doesn't affect upb).

This idea was more or less described [here](#16296 (comment)).

Note: I am using the words "aspect hints" because conceptually they are. However, none of this is using the aspect_hints mechanism supported by Bazel rules. Here we are simply using providers.

Fixes #17676.

Closes #17725.

PiperOrigin-RevId: 516488095
Change-Id: I603ed8df90fef0636525d60707692930cd23fa5a