Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tensorflow build is broken in Bazel Nightly and in 0.10 #4474

Closed
dslomov opened this issue Jan 18, 2018 · 23 comments
Closed

Tensorflow build is broken in Bazel Nightly and in 0.10 #4474

dslomov opened this issue Jan 18, 2018 · 23 comments
Assignees
Labels
breakage P0 This is an emergency and more important than other current work. (Assignee required)

Comments

@dslomov
Copy link
Contributor

dslomov commented Jan 18, 2018

Failure:
https://ci.bazel.build/blue/organizations/jenkins/Global%2FTensorFlow/detail/TensorFlow/375/pipeline/
Smaller repro:

/bazel build //tensorflow/contrib/fused_conv:fused_conv2d_bias_activation_op_pygenrule
.........
INFO: Analysed target //tensorflow/contrib/fused_conv:fused_conv2d_bias_activation_op_pygenrule (46 packages loaded).
INFO: Found 1 target...
ERROR: /usr/local/google/home/dslomov/work/tensorflow/tensorflow/contrib/fused_conv/BUILD:105:1: Executing genrule //tensorflow/contrib/fused_conv:fused_conv2d_bias_activation_op_pygenrule failed (Aborted): bash failed: error executing command /bin/bash -c ... (remaining 1 argument(s) skipped)
*** Error in `bazel-out/host/bin/tensorflow/contrib/fused_conv/gen_fused_conv2d_bias_activation_op_py_wrappers_cc': munmap_chunk(): invalid pointer: 0x0000000000480320 ***
/bin/bash: line 1: 186348 Aborted                 (core dumped) bazel-out/host/bin/tensorflow/contrib/fused_conv/gen_fused_conv2d_bias_activation_op_py_wrappers_cc , '' 0 0 > bazel-out/k8-opt/genfiles/tensorflow/contrib/fused_conv/ops/gen_fused_conv2d_bias_activation_op.py
Target //tensorflow/contrib/fused_conv:fused_conv2d_bias_activation_op_pygenrule failed to build
Use --verbose_failures to see the command lines of failed build steps.
INFO: Elapsed time: 5.692s, Critical Path: 1.64s
FAILED: Build did NOT complete successfully
@dslomov dslomov self-assigned this Jan 18, 2018
@dslomov dslomov added P0 This is an emergency and more important than other current work. (Assignee required) Release blocker labels Jan 18, 2018
@dslomov
Copy link
Contributor Author

dslomov commented Jan 18, 2018

I bisected this to 2aeaeba @jmillikin-stripe

@dslomov
Copy link
Contributor Author

dslomov commented Jan 18, 2018

I'll be rollbacking 2aeaeba

@dslomov
Copy link
Contributor Author

dslomov commented Jan 18, 2018

Talked with lberki@ offline, assigning to him for investigation

@lberki
Copy link
Contributor

lberki commented Jan 18, 2018

Paging @jmillikin-stripe and @htuch

This change appears to have been a mistake according to my admittedly limited knowledge. The point of mostly static linking is that every library is linked statically except for the runtime libraries, which are linked dynamically, so why would one link runtime libraries statically in that mode?

I'll send out a rollback to be cherry-picked into the 0.10.0 release tomorrow morning CEST unless a strong argument against it and a workaround for this issue comes up.

@iirina : How come this comes to light now (as opposed to right after having submitted that change?)

@htuch
Copy link

htuch commented Jan 18, 2018

I think the idea is that libstdc++ is a language/compiler library, not an environment specific runtime library. @mattklein123 for Envoy perspective on what a mostly static link should be.

@jmillikin-stripe
Copy link
Contributor

Yes, libstdc++ is supposed to be safe to link statically. The primary goal of "mostly-static" (as I understand it from GOOG days) is to statically link everything except glibc, which depends on dynamic linking for core functionality.

I don't know much about Tensorflow or CUDA, but I notice https://github.com/tensorflow/tensorflow/blob/master/tensorflow/contrib/fused_conv/BUILD is referencing shared objects and https://github.com/tensorflow/tensorflow/blob/master/tensorflow/tensorflow.bzl is setting alwayslink=1, linkstatic=1. Is it possible that the new Bazel flags are exposing a pre-existing bug in the compilation of gen_fused_conv2d_bias_activation_op_py_wrappers_cc?

If you remove linkstatic=1 from the tf_cc_binary() call in tf_gen_op_wrapper_py (https://github.com/tensorflow/tensorflow/blob/2c5667cad1eccf599b8834557a89453222f29e0b/tensorflow/tensorflow.bzl#L508), does that fix the error?

@htuch
Copy link

htuch commented Jan 18, 2018

@lberki can we hold back on this for a short period; I'll reach out to you on IM.

@mattklein123
Copy link

mattklein123 commented Jan 18, 2018

On the high level topic, libstdc++ should definitely be statically linked in "mostly static" builds. libstdc++ is a compiler/application library, not a system library like glibc. Thus, an application mostly statically linked to everything other than glibc can be dropped on a system that only has a compatible glibc version. This is very important.

@gunan
Copy link

gunan commented Jan 18, 2018

@jhseu and @allenlavoie to check if he has any ideas on this.

@allenlavoie
Copy link
Contributor

Happy to help debug, but I haven't been able to reproduce the failure.

I tried with Bazel nightly #198 and TensorFlow at head (tensorflow/tensorflow@4595f1c), and I'm getting a successful build.

> bazel version
Build label: 
Build target: bazel-out/k8-fastbuild/bin/src/main/java/com/google/devtools/build/lib/bazel/BazelServer_deploy.jar
Build time: Thu Jul 13 02:35:31 +50017 (1516233609331)
Build timestamp: 1516233609331
Build timestamp as int: 1516233609331
> yes "" | ./configure
> bazel build //tensorflow/contrib/fused_conv:fused_conv2d_bias_activation_op_pygenrule
[...]
Target //tensorflow/contrib/fused_conv:fused_conv2d_bias_activation_op_pygenrule up-to-date:
  bazel-genfiles/tensorflow/contrib/fused_conv/ops/gen_fused_conv2d_bias_activation_op.py
INFO: Elapsed time: 141.657s, Critical Path: 34.78s
INFO: Build completed successfully, 1011 total actions

Could be related to the version of libstdc++? I'm using GLIBCXX_3.4.22

@dslomov
Copy link
Contributor Author

dslomov commented Jan 19, 2018

Ok, interesting, I can reliably reproduce on my machine, and CI machines repro this too. We need to look more.

@dslomov
Copy link
Contributor Author

dslomov commented Jan 19, 2018

Ok, the debug yielded that:

  • //tensorflow:libtensorflow_framework.so is linked against a dynamic version of libstdc++ (-lstdc++ im params)
  • binaries dependent on it (e.g. //tensorflow/contrib/fused_conv:gen_fused_conv2d_bias_activation_op_py_wrappers_cc) is linked against a static version of libstfc++ (-l:libstdc++.a)

This is problematic at least for gcc 4.8.4 (libstdc++.so.6.0.19)

@dslomov
Copy link
Contributor Author

dslomov commented Jan 19, 2018

Machines that have no issues use gcc 6.3.0.

@lberki
Copy link
Contributor

lberki commented Jan 19, 2018

Let's roll back.

This is a case of linking libstdc++ twice, but unfortunately, there is no easy workaround for TensorFlow and I'm not convinced the offending patch does the right thing: it makes dynamic libraries link libstdc++ dynamically but binaries link it statically, thus making dynamic libraries built with Bazel and binaries built with Bazel incompatible. That doesn't sound good.

Now, one could make the case that TensorFlow is doing something weird (namely, linking dynamic libraries into a binary built in mostly-static mode), but I'd like to make that decision without time pressure.

@lberki
Copy link
Contributor

lberki commented Jan 19, 2018

@dslomov , my hunch is that it's just so happens that the standard library of gcc 6.3.0 is accidentally more resilient to issues like this.

@iirina
Copy link
Contributor

iirina commented Jan 19, 2018

TF is now green on our CI, but I see no rollback. Was this fixed on TF side?

@dslomov
Copy link
Contributor Author

dslomov commented Jan 19, 2018

I think some machines have gcc 6.3.0 and some have gcc 4.8.4, that is why we haven't detected this earlier.

@dslomov
Copy link
Contributor Author

dslomov commented Jan 19, 2018

and also why CI is now green

@iirina
Copy link
Contributor

iirina commented Jan 19, 2018

Is there a reason why we have machines with different gcc versions for our CI?

@iirina
Copy link
Contributor

iirina commented Jan 19, 2018

@dslomov I don't think that explains it. The rollback probably fixed this for now, but I still do not understand why the CI did not catch this.

The job is green on our CI because it uses the latest bazel release (0.9.0) to build TF and that release doesn't have this bug.

About machines with different gcc versions: Jenkins runs the job both on Ubuntu 16 and on Ubuntu 14. I checked and each of the Ubuntu 14 machines uses gcc 4.8.4, and each of the Ubuntu 16 machines uses gcc 5.4.0. So this should have been detected earlier.

@iirina
Copy link
Contributor

iirina commented Jan 19, 2018

All the Tensorflow jobs that used bazel built @ HEAD have been red for a while

https://ci.bazel.build/job/Global/job/TensorFlow/

I don't think anyone was checking on them.

@lberki
Copy link
Contributor

lberki commented Jan 19, 2018

Put then why did we have a green build recently? Are our CI machines not uniform enough?

iirina pushed a commit that referenced this issue Jan 19, 2018
*** Reason for rollback ***

Breaks C++ on gcc 4.8.4 (specifically, TensorFlow: #4474)

Fixes #4474

*** Original change description ***

When linking mostly-static Linux binaries, link libstdc++.a explicitly.

This allows libstdc++ to be statically linked, which is normally only
possible when invoking GCC as `g++` with the `-static-libstdc++` flag.

Fixes #2840

See envoyproxy/envoy#415 for additional
background and context.

cc @htuch (for Envoy) and @calpeyser @hlopko (who I talked to earlier about this)...

***

RELNOTES: None.
PiperOrigin-RevId: 182519445
iirina pushed a commit that referenced this issue Jan 23, 2018
*** Reason for rollback ***

Breaks C++ on gcc 4.8.4 (specifically, TensorFlow: #4474)

Fixes #4474

*** Original change description ***

When linking mostly-static Linux binaries, link libstdc++.a explicitly.

This allows libstdc++ to be statically linked, which is normally only
possible when invoking GCC as `g++` with the `-static-libstdc++` flag.

Fixes #2840

See envoyproxy/envoy#415 for additional
background and context.

cc @htuch (for Envoy) and @calpeyser @hlopko (who I talked to earlier about this)...

***

RELNOTES: None.
PiperOrigin-RevId: 182519445
iirina pushed a commit that referenced this issue Jan 24, 2018
*** Reason for rollback ***

Breaks C++ on gcc 4.8.4 (specifically, TensorFlow: #4474)

Fixes #4474

*** Original change description ***

When linking mostly-static Linux binaries, link libstdc++.a explicitly.

This allows libstdc++ to be statically linked, which is normally only
possible when invoking GCC as `g++` with the `-static-libstdc++` flag.

Fixes #2840

See envoyproxy/envoy#415 for additional
background and context.

cc @htuch (for Envoy) and @calpeyser @hlopko (who I talked to earlier about this)...

***

RELNOTES: None.
PiperOrigin-RevId: 182519445
iirina pushed a commit that referenced this issue Jan 26, 2018
*** Reason for rollback ***

Breaks C++ on gcc 4.8.4 (specifically, TensorFlow: #4474)

Fixes #4474

*** Original change description ***

When linking mostly-static Linux binaries, link libstdc++.a explicitly.

This allows libstdc++ to be statically linked, which is normally only
possible when invoking GCC as `g++` with the `-static-libstdc++` flag.

Fixes #2840

See envoyproxy/envoy#415 for additional
background and context.

cc @htuch (for Envoy) and @calpeyser @hlopko (who I talked to earlier about this)...

***

RELNOTES: None.
PiperOrigin-RevId: 182519445
iirina pushed a commit that referenced this issue Jan 26, 2018
*** Reason for rollback ***

Breaks C++ on gcc 4.8.4 (specifically, TensorFlow: #4474)

Fixes #4474

*** Original change description ***

When linking mostly-static Linux binaries, link libstdc++.a explicitly.

This allows libstdc++ to be statically linked, which is normally only
possible when invoking GCC as `g++` with the `-static-libstdc++` flag.

Fixes #2840

See envoyproxy/envoy#415 for additional
background and context.

cc @htuch (for Envoy) and @calpeyser @hlopko (who I talked to earlier about this)...

***

RELNOTES: None.
PiperOrigin-RevId: 182519445
iirina pushed a commit that referenced this issue Jan 30, 2018
*** Reason for rollback ***

Breaks C++ on gcc 4.8.4 (specifically, TensorFlow: #4474)

Fixes #4474

*** Original change description ***

When linking mostly-static Linux binaries, link libstdc++.a explicitly.

This allows libstdc++ to be statically linked, which is normally only
possible when invoking GCC as `g++` with the `-static-libstdc++` flag.

Fixes #2840

See envoyproxy/envoy#415 for additional
background and context.

cc @htuch (for Envoy) and @calpeyser @hlopko (who I talked to earlier about this)...

***

RELNOTES: None.
PiperOrigin-RevId: 182519445
iirina pushed a commit that referenced this issue Jan 30, 2018
*** Reason for rollback ***

Breaks C++ on gcc 4.8.4 (specifically, TensorFlow: #4474)

Fixes #4474

*** Original change description ***

When linking mostly-static Linux binaries, link libstdc++.a explicitly.

This allows libstdc++ to be statically linked, which is normally only
possible when invoking GCC as `g++` with the `-static-libstdc++` flag.

Fixes #2840

See envoyproxy/envoy#415 for additional
background and context.

cc @htuch (for Envoy) and @calpeyser @hlopko (who I talked to earlier about this)...

***

RELNOTES: None.
PiperOrigin-RevId: 182519445
iirina pushed a commit that referenced this issue Feb 12, 2018
*** Reason for rollback ***

Breaks C++ on gcc 4.8.4 (specifically, TensorFlow: #4474)

Fixes #4474

*** Original change description ***

When linking mostly-static Linux binaries, link libstdc++.a explicitly.

This allows libstdc++ to be statically linked, which is normally only
possible when invoking GCC as `g++` with the `-static-libstdc++` flag.

Fixes #2840

See envoyproxy/envoy#415 for additional
background and context.

cc @htuch (for Envoy) and @calpeyser @hlopko (who I talked to earlier about this)...

***

RELNOTES: None.
PiperOrigin-RevId: 182519445
gunan pushed a commit to tensorflow/tensorflow that referenced this issue Feb 27, 2018
The 0.10.0 bazel has problems with static-linking on linux:
bazelbuild/bazel#4474. This PR bumps to the
latest bazel that produces proper binaries w/o the linking issue.
werkt pushed a commit to werkt/bazel that referenced this issue Mar 2, 2018
*** Reason for rollback ***

Breaks C++ on gcc 4.8.4 (specifically, TensorFlow: bazelbuild#4474)

Fixes bazelbuild#4474

*** Original change description ***

When linking mostly-static Linux binaries, link libstdc++.a explicitly.

This allows libstdc++ to be statically linked, which is normally only
possible when invoking GCC as `g++` with the `-static-libstdc++` flag.

Fixes bazelbuild#2840

See envoyproxy/envoy#415 for additional
background and context.

cc @htuch (for Envoy) and @calpeyser @hlopko (who I talked to earlier about this)...

***

RELNOTES: None.
PiperOrigin-RevId: 182519445
StanislawAntol pushed a commit to StanislawAntol/tensorflow that referenced this issue Mar 23, 2018
The 0.10.0 bazel has problems with static-linking on linux:
bazelbuild/bazel#4474. This PR bumps to the
latest bazel that produces proper binaries w/o the linking issue.
@Kaju-Bubanja
Copy link

I was wondering what combination of bazel and gcc is now needed to successfully built tensorflow? It looks a bit confusing and I'm unsure of what versions to try and what specific commits of bazel to try. I currently have 0.16 and get this error but the bug is filled for 0.10 is there a version in between where this was fixed or do i need bazel < 0.10?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
breakage P0 This is an emergency and more important than other current work. (Assignee required)
Projects
None yet
Development

No branches or pull requests

10 participants