Tensorflow build is broken in Bazel Nightly and in 0.10 #4474
Comments
I bisected this to 2aeaeba @jmillikin-stripe |
I'll be rolling back 2aeaeba |
Talked with lberki@ offline, assigning to him for investigation |
Paging @jmillikin-stripe and @htuch. According to my admittedly limited knowledge, this change appears to have been a mistake. The point of mostly static linking is that every library is linked statically except for the runtime libraries, which are linked dynamically, so why would one link runtime libraries statically in that mode? I'll send out a rollback to be cherry-picked into the 0.10.0 release tomorrow morning CEST unless a strong argument against it and a workaround for this issue come up. @iirina : Why is this only coming to light now, rather than right after that change was submitted? |
I think the idea is that libstdc++ is a language/compiler library, not an environment specific runtime library. @mattklein123 for Envoy perspective on what a mostly static link should be. |
Yes, libstdc++ is supposed to be safe to link statically. The primary goal of "mostly-static" (as I understand it from GOOG days) is to statically link everything except glibc, which depends on dynamic linking for core functionality. I don't know much about Tensorflow or CUDA, but I notice https://github.com/tensorflow/tensorflow/blob/master/tensorflow/contrib/fused_conv/BUILD is referencing shared objects and https://github.com/tensorflow/tensorflow/blob/master/tensorflow/tensorflow.bzl is setting If you remove |
@lberki can we hold back on this for a short period; I'll reach out to you on IM. |
On the high level topic, libstdc++ should definitely be statically linked in "mostly static" builds. libstdc++ is a compiler/application library, not a system library like glibc. Thus, an application mostly statically linked to everything other than glibc can be dropped on a system that only has a compatible glibc version. This is very important. |
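To make the "mostly static" expectation above concrete, here is a minimal sketch in Python that checks `ldd` output for a dynamic libstdc++ dependency. It assumes you have already captured the output of `ldd` on the binary; the function names and the sample text are hypothetical and for illustration only, not something Bazel or TensorFlow provides.

```python
def dynamic_deps(ldd_output):
    """Return the base names of shared libraries listed in `ldd` output."""
    names = []
    for line in ldd_output.splitlines():
        fields = line.split()
        if not fields:
            continue
        # The first field is the library name or its absolute path.
        base = fields[0].rsplit("/", 1)[-1]
        if ".so" in base:  # skips messages such as "not a dynamic executable"
            names.append(base)
    return names


def links_libstdcxx_dynamically(ldd_output):
    """True if libstdc++ appears among the dynamic dependencies."""
    return any(n.startswith("libstdc++.so") for n in dynamic_deps(ldd_output))


# Hypothetical `ldd` output for a binary that links libstdc++ statically
# (addresses shortened for illustration).
SAMPLE_MOSTLY_STATIC = """\
    linux-vdso.so.1 (0x00007ffc0000)
    libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f000000)
    libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f000000)
    /lib64/ld-linux-x86-64.so.2 (0x00007f000000)
"""
```

Under the definition given in this comment, a mostly static binary would list glibc pieces such as `libc.so.6` but no `libstdc++.so.6`, so `links_libstdcxx_dynamically` should return `False` for it.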
@jhseu and @allenlavoie to check if he has any ideas on this. |
Happy to help debug, but I haven't been able to reproduce the failure. I tried with Bazel nightly #198 and TensorFlow at head (tensorflow/tensorflow@4595f1c), and I'm getting a successful build.
Could be related to the version of libstdc++? I'm using GLIBCXX_3.4.22 |
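For comparing machines on this point, a rough sketch: the version symbols in a libstdc++ shared object can usually be listed with something like `strings /path/to/libstdc++.so.6 | grep GLIBCXX` (the path depends on the distribution). The hypothetical helper below, which is an illustration and not part of any tool mentioned in this thread, picks the newest `GLIBCXX_x.y.z` tag out of that text.

```python
import re

# Matches version tags such as GLIBCXX_3.4.19, but not names like
# GLIBCXX_DEBUG_MESSAGE_LENGTH, because a digit must follow the underscore.
_GLIBCXX = re.compile(r"GLIBCXX_(\d+(?:\.\d+)*)")


def newest_glibcxx(symbol_text):
    """Return the highest GLIBCXX_x.y.z tag found in the text, or None."""
    versions = {
        tuple(int(part) for part in match.group(1).split("."))
        for match in _GLIBCXX.finditer(symbol_text)
    }
    if not versions:
        return None
    # Compare numerically so that 3.4.19 sorts above 3.4.9.
    return "GLIBCXX_" + ".".join(str(part) for part in max(versions))
```

Running this on the output from two machines gives a quick way to see whether they ship different libstdc++ versions.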
Ok, interesting, I can reliably reproduce on my machine, and CI machines repro this too. We need to look more. |
Ok, the debug yielded that:
This is problematic at least for gcc 4.8.4 (libstdc++.so.6.0.19) |
Machines that have no issues use gcc 6.3.0. |
Let's roll back. This is a case of linking Now, one could make the case that TensorFlow is doing something weird (namely, linking dynamic libraries into a binary built in mostly-static mode), but I'd like to make that decision without time pressure. |
@dslomov, my hunch is that it just so happens that the standard library of gcc 6.3.0 is accidentally more resilient to issues like this. |
TF is now green on our CI, but I see no rollback. Was this fixed on TF side? |
I think some machines have gcc 6.3.0 and some have gcc 4.8.4, that is why we haven't detected this earlier. |
and also why CI is now green |
Is there a reason why we have machines with different gcc versions for our CI? |
@dslomov I don't think that explains it. The rollback probably fixed this for now, but I still do not understand why the CI did not catch this. The job is green on our CI because it uses the latest bazel release (0.9.0) to build TF and that release doesn't have this bug. About machines with different gcc versions: Jenkins runs the job both on Ubuntu 16 and on Ubuntu 14. I checked and each of the Ubuntu 14 machines uses gcc 4.8.4, and each of the Ubuntu 16 machines uses gcc 5.4.0. So this should have been detected earlier. |
All the Tensorflow jobs that used bazel built @ HEAD have been red for a while https://ci.bazel.build/job/Global/job/TensorFlow/ I don't think anyone was checking on them. |
But then why did we have a green build recently? Are our CI machines not uniform enough? |
*** Reason for rollback *** Breaks C++ on gcc 4.8.4 (specifically, TensorFlow: #4474) Fixes #4474 *** Original change description *** When linking mostly-static Linux binaries, link libstdc++.a explicitly. This allows libstdc++ to be statically linked, which is normally only possible when invoking GCC as `g++` with the `-static-libstdc++` flag. Fixes #2840 See envoyproxy/envoy#415 for additional background and context. cc @htuch (for Envoy) and @calpeyser @hlopko (who I talked to earlier about this)... *** RELNOTES: None. PiperOrigin-RevId: 182519445
The 0.10.0 bazel has problems with static-linking on linux: bazelbuild/bazel#4474. This PR bumps to the latest bazel that produces proper binaries w/o the linking issue.
I was wondering what combination of bazel and gcc is now needed to successfully build tensorflow. It looks a bit confusing, and I'm unsure which versions or which specific bazel commits to try. I currently have 0.16 and get this error, but the bug is filed against 0.10. Is there a version in between where this was fixed, or do I need bazel < 0.10? |
Failure:
https://ci.bazel.build/blue/organizations/jenkins/Global%2FTensorFlow/detail/TensorFlow/375/pipeline/
Smaller repro: