
Bazel does not handle symlinks in system include paths for CUDA #10167

Open
Flamefire opened this issue Nov 5, 2019 · 7 comments
Labels
  • not stale: Issues or PRs that are inactive but not considered stale
  • P4: This is either out of scope or we don't have bandwidth to review a PR. (No assignee)
  • team-Rules-CPP: Issues for C++ rules
  • type: bug

Comments

@Flamefire (Contributor)

Description of the problem / feature request:

Very similar to #7315:
If the compiler's system includes are installed in a path that is referenced by a symlink, the dependency check will fail.

Example:

  • Compiler includes in real path /software/haswell/GCCcore/8.2.0/include/c++/8.2.0/x86_64-pc-linux-gnu/8.2.0/include
  • Symlink exists so user sees /sw/installed/GCCcore/8.2.0/lib/gcc/x86_64-pc-linux-gnu/8.2.0/include/
  • Toolchain file (e.g. created by TensorFlow for CUDA) will contain the second one (the symlinked /sw/installed path)
  • The check uses the unresolved path and tests only the string prefix in
    if (FileSystemUtils.startsWithAny(execPath, permittedSystemIncludePrefixes)) {
  • Error will be something like:

```
ERROR: /tmp/easybuild-tmp/eb-5QGVSJ/tmpMZolLj-bazel-build/external/fft2d/BUILD.bazel:27:1: undeclared inclusion(s) in rule '@fft2d//:fft2d':
this rule is missing dependency declarations for the following files included by 'external/fft2d/fft2d/fftsg2d.c':
  '/sw/installed/GCCcore/8.2.0/lib/gcc/x86_64-pc-linux-gnu/8.2.0/include/stddef.h'
  '/sw/installed/GCCcore/8.2.0/lib/gcc/x86_64-pc-linux-gnu/8.2.0/include/stdarg.h'
Target //tensorflow/tools/pip_package:build_pip_package failed to build
```
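The failure mechanism can be reproduced outside Bazel. The following Python sketch (a hypothetical stand-in, not Bazel's actual Java implementation) shows how a plain string-prefix check rejects a header reached through a symlinked root, while a realpath comparison would accept it:

```python
import os
import tempfile

# Plain string-prefix test, mirroring the idea of FileSystemUtils.startsWithAny
# (illustration only; the real check is in Bazel's Java code).
def is_permitted(include_path, permitted_prefixes):
    return any(include_path.startswith(prefix) for prefix in permitted_prefixes)

tmp = tempfile.mkdtemp()
real_root = os.path.join(tmp, "software", "GCCcore", "include")  # real install path
os.makedirs(real_root)
link_root = os.path.join(tmp, "sw")                              # symlinked view
os.symlink(os.path.join(tmp, "software"), link_root)

# The compiler reports the header through the symlinked view, but only the
# real path was recorded as a permitted prefix.
header_via_link = os.path.join(link_root, "GCCcore", "include", "stddef.h")

prefix_check = is_permitted(header_via_link, [real_root])
realpath_check = os.path.realpath(header_via_link).startswith(os.path.realpath(real_root))

print(prefix_check)    # False: triggers the "undeclared inclusion" error
print(realpath_check)  # True: it is the same file once symlinks are resolved
```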

Feature requests: what underlying problem are you trying to solve with this feature?

Building TensorFlow on an HPC cluster. See related issue tensorflow/tensorflow#33975

Bugs: what's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.

See above: set up the compiler behind a symlinked path and try to build TensorFlow 2 with CUDA.

What operating system are you running Bazel on?

RHEL 7.4

What's the output of bazel info release?

0.26.1 (@non-git)

If bazel info release returns "development version" or "(@non-git)", tell us how you built Bazel.

With an EasyBuild recipe

Have you found anything relevant by searching the web?

Possible solutions:

  1. Patch TensorFlow to include non-resolved paths: https://github.com/easybuilders/easybuild-easyconfigs/blob/develop/easybuild/easyconfigs/t/TensorFlow/TensorFlow-1.14.0_fix-cuda-build.patch
  2. Unify the logic of https://github.com/tensorflow/tensorflow/blob/ce69fe54695361d88c3b73d18e0d647baeb50c49/third_party/gpus/cuda_configure.bzl#L280 with
    def get_escaped_cxx_inc_directories(repository_ctx, cc, lang_flag, additional_flags = []):
    as indicated at https://github.com/tensorflow/tensorflow/blob/ce69fe54695361d88c3b73d18e0d647baeb50c49/third_party/gpus/cuda_configure.bzl#L221, and handle it in Bazel (add both resolved and non-resolved paths)
  3. Check resolved dependencies against each other in https://github.com/bazelbuild/bazel/blob/c42f93f2285e0d8f1ea70ccf3168200db34235f5/src/main/java/com/google/devtools/build/lib/rules/cpp/HeaderDiscovery.java
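Proposed solution 3 could look roughly like the following (a hedged Python sketch only; HeaderDiscovery.java is Java, and the helper name here is invented for illustration): resolve both the discovered header and every permitted prefix before comparing, so a symlinked view of the toolchain is still accepted.

```python
import os

def is_declared_inclusion(header, permitted_prefixes):
    # Resolve symlinks on both sides before the prefix comparison, so a
    # header seen through a symlinked toolchain root still matches the
    # permitted system include prefixes.
    resolved_header = os.path.realpath(header)
    return any(
        resolved_header == os.path.realpath(prefix)
        or resolved_header.startswith(os.path.realpath(prefix) + os.sep)
        for prefix in permitted_prefixes
    )

# With no symlinks involved, the behavior matches the current prefix check:
print(is_declared_inclusion("/opt/gcc/include/stddef.h", ["/opt/gcc/include"]))  # True
print(is_declared_inclusion("/opt/other/stddef.h", ["/opt/gcc/include"]))        # False
```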
@Flamefire (Contributor, Author)

Update: proposed solution 1 does not work, because it relies on the compiler being located where the includes are. This works for e.g.:

  • /sw/installed/GCCcore/8.3.0/lib/gcc/x86_64-pc-linux-gnu/8.3.0/include/stdarg.h
  • /sw/installed/GCCcore/8.3.0/bin/g++

But it fails when e.g. using ccache:

  • /sw/installed/GCCcore/8.3.0/lib/gcc/x86_64-pc-linux-gnu/8.3.0/include/stdarg.h
  • /sw/installed/GCCcore/8.3.0/bin/g++
  • /usr/bin/g++ -> /usr/bin/ccache g++

So the only real solution is to compare realpaths, as suggested in solution 3.

@scentini (Contributor)

cc @oquenchil

@scentini scentini added P2 We'll consider working on this in future. (Assignee optional) and removed untriaged labels Dec 11, 2019
@oquenchil oquenchil added P4 This is either out of scope or we don't have bandwidth to review a PR. (No assignee) type: bug and removed P2 We'll consider working on this in future. (Assignee optional) labels Nov 19, 2020
@sgowroji sgowroji added the stale Issues or PRs that are stale (no activity for 30 days) label Feb 10, 2023
@sgowroji (Member)

Hi there! We're doing a clean up of old issues and will be closing this one. Please reopen if you’d like to discuss anything further. We’ll respond as soon as we have the bandwidth/resources to do so.

@oquenchil oquenchil closed this as not planned (won't fix, can't repro, duplicate, stale) Feb 16, 2023
@Flamefire (Contributor, Author)

This is still an issue in Bazel. However, current TensorFlow has a patch that resolves symlinks to the compiler before passing it to Bazel, which works but is only a workaround.

@sgowroji sgowroji added not stale Issues or PRs that are inactive but not considered stale and removed stale Issues or PRs that are stale (no activity for 30 days) labels Feb 16, 2023
@sgowroji sgowroji reopened this Feb 16, 2023
@vors commented Feb 24, 2023

Hello! We are hitting the same problem when trying to make nvcc+sccache work in https://github.com/pytorch/pytorch.
Any workarounds would be highly appreciated.

pytorchmergebot pushed a commit to pytorch/pytorch that referenced this issue Feb 28, 2023
Fixes #79348

This change is mostly focused on enabling nvcc+sccache in the PyTorch CI.

Along the way we had to make a couple of tweaks:
1.  Split the rules_cc from the rules_cuda that embedded them before. This is needed in order to apply a different patch to the rules_cc compared to the one that rules_cuda applies by default. This is in turn needed because we need to work around an nvcc behavior where it doesn't send `-iquote xxx` to the host compiler, but it does send `-isystem xxx`. So we work around this problem by (ab)using `-isystem` instead. Without it we are getting errors like `xxx` is not found.

2. Work around a bug in Bazel bazelbuild/bazel#10167 that prevents us from using a straightforward and honest `nvcc` sccache wrapper. Instead we generate an ad-hoc Bazel-specific nvcc wrapper that has internal knowledge of the relative Bazel paths to local_cuda. This allows us to work around the issue with CUDA symlinks. Without it we are getting `undeclared inclusion(s) in rule` all over the place for CUDA headers.

## Test plan

Green CI build https://github.com/pytorch/pytorch/actions/runs/4267147180/jobs/7428431740

Note that now it says "CUDA" in the sccache output

```
+ sccache --show-stats
Compile requests                    9784
Compile requests executed           6726
Cache hits                          6200
Cache hits (C/C++)                  6131
Cache hits (CUDA)                     69
Cache misses                         519
Cache misses (C/C++)                 201
Cache misses (CUDA)                  318
Cache timeouts                         0
Cache read errors                      0
Forced recaches                        0
Cache write errors                     0
Compilation failures                   0
Cache errors                           7
Cache errors (C/C++)                   7
Non-cacheable compilations             0
Non-cacheable calls                 2893
Non-compilation calls                165
Unsupported compiler calls             0
Average cache write                0.116 s
Average cache read miss           23.722 s
Average cache read hit             0.057 s
Failed distributed compilations        0
```
Pull Request resolved: #95528
Approved by: https://github.com/huydhn
cyyever pushed a commit to cyyever/pytorch_private that referenced this issue Mar 2, 2023
cyyever pushed a commit to cyyever/pytorch_private that referenced this issue Mar 5, 2023
cyyever pushed a commit to cyyever/pytorch_private that referenced this issue Mar 5, 2023
@lgulich commented Oct 30, 2023

> This is still an issue in Bazel. However current TensorFlow has a patch that resolves symlinks to the compiler before passing it to Bazel which works but is only a workaround.

@Flamefire Could you share a link to this patch? I am facing the same problem (for a non-tensorflow workspace) and would like to investigate if I could reuse your fix. Thanks in advance!

@Flamefire (Contributor, Author)

@lgulich The patch I was referring to is tensorflow/tensorflow@58b236b.
However, that doesn't fully solve the issue when compiler wrappers (e.g. ccache) are used. We use a patch I submitted as a PR there, but it has been pending for ages: tensorflow/tensorflow#60668

jhavukainen pushed a commit to kulinseth/pytorch that referenced this issue Mar 15, 2024

7 participants