Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

drop unnecessary CUDA stub libraries from $LIBRARY_PATH #2793

Merged
merged 5 commits into from Jul 7, 2023

Conversation

Flamefire
Copy link
Contributor

@Flamefire Flamefire commented Sep 27, 2022

(created using eb --new-pr)

Found stuff like this in the output of PyTorch builds:

  Cannot generate a safe runtime search path for target caffe2_nvrtc because
  files in some directories may conflict with libraries in implicit
  directories:

    runtime library [libnvrtc.so.11.2] in /beegfs/ws/1/s3248973-EasyBuild/easybuild-ml/software/CUDA/11.3.1/lib may be hidden by files in:
      /beegfs/ws/1/s3248973-EasyBuild/easybuild-ml/software/CUDA/11.3.1/lib/stubs

So I checked for duplicate files and found a lot, see command and output below.

Hence I added some code to remove libraries from the stubs folder which are also present in the lib64 folder. I can't imagine having a need for stub libraries where we have the "real" ones, so this should avoid issues.

$ for f in $(ls CUDA/11.3.1/lib64/stubs); do if [ -e "CUDA/11.3.1/lib64/$f" ]; then echo "Double $f"; else echo "OK $f"; fi; done
Double libcublasLt.so
Double libcublasLt.so.11
Double libcublas.so
Double libcublas.so.11
OK libcuda.so
OK libcuda.so.1
Double libcufft.so
Double libcufft.so.10
Double libcufftw.so
Double libcufftw.so.10
Double libcurand.so
Double libcurand.so.10
Double libcusolverMg.so
Double libcusolverMg.so.11
Double libcusolver.so
Double libcusolver.so.11
Double libcusparse.so
Double libcusparse.so.11
Double libnppc.so
Double libnppc.so.11
Double libnppial.so
Double libnppial.so.11
Double libnppicc.so
Double libnppicc.so.11
Double libnppidei.so
Double libnppidei.so.11
Double libnppif.so
Double libnppif.so.11
Double libnppig.so
Double libnppig.so.11
Double libnppim.so
Double libnppim.so.11
Double libnppist.so
Double libnppist.so.11
Double libnppisu.so
Double libnppisu.so.11
Double libnppitc.so
Double libnppitc.so.11
Double libnpps.so
Double libnpps.so.11
OK libnvidia-ml.so
OK libnvidia-ml.so.1
Double libnvjpeg.so
Double libnvjpeg.so.11
Double libnvrtc.so
Double libnvrtc.so.11.2

Copy link
Member

@ocaisa ocaisa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some minor comments, will test it now

easybuild/easyblocks/c/cuda.py Outdated Show resolved Hide resolved
easybuild/easyblocks/c/cuda.py Outdated Show resolved Hide resolved
Copy link
Member

@ocaisa ocaisa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

easybuild/easyblocks/c/cuda.py Show resolved Hide resolved
ocaisa
ocaisa previously requested changes Sep 27, 2022
Copy link
Member

@ocaisa ocaisa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need to move the for loop to after stubs_dir has been populated

@ocaisa ocaisa dismissed their stale review September 27, 2022 13:20

Didn't notice the change

@Flamefire Flamefire requested a review from ocaisa October 17, 2022 09:11
@Flamefire
Copy link
Contributor Author

@ocaisa Any updates?

@boegel boegel added this to the 4.x milestone Dec 21, 2022
@ocaisa
Copy link
Member

ocaisa commented Jun 29, 2023

@Flamefire Sorry for not following up on this but I would have really liked another set of eyes to look over this and think of some of the potential unintended consequences.

@Flamefire
Copy link
Contributor Author

Yes would be great to have someone double-checking my assumption from above:

I can't imagine how removing the stubs could lead to any issues because we have the full libs and only remove the stubs when there is a full lib. So I'd say its fine

In other words: Why would the stub libraries be required if we have the full libraries available? The whole purpose of the former is to provide placeholders for linking purposes when the latter aren't available.

FWIW: I build PyTorch and TF (with tests) with this successfully.

@casparvl
Copy link
Contributor

casparvl commented Jun 30, 2023

Just to confirm if I understand correctly:

if we have the full libraries available

By this you mean that the full libraries are also available from the CUDA installation? In that case, I agree with you, I don't see the purpose.

As you are no doubt aware, the stubs libraries should be built against when cross compiling on nodes that don't have the driver installed (i.e. non-GPU nodes), and should provide stubs for the libraries that are normally provided by the driver installation (such as libcuda.so). If the stubs indeed contain duplicates with real libraries that are installed with the SDK, I don't see the point either. I'd assume their removal is safe, but I guess we'll only know for sure if we try :)

@Flamefire
Copy link
Contributor Author

Just to confirm if I understand correctly:

if we have the full libraries available

By this you mean that the full libraries are also available from the CUDA installation? In that case, I agree with you, I don't see the purpose.

Yes, see the output in the PR description: Basically all we need is libcuda.so/libcuda.so.1 and libnvidia-ml.so/libnvidia-ml.so.1 in the stub folder but not e.g. libcufft.so or libcurand.so which are also (as full libraries) in the lib64 folder of this module.

As you are no doubt aware, the stubs libraries should be built against when cross compiling on nodes that don't have the driver installed (i.e. non-GPU nodes), and should provide stubs for the libraries that are normally provided by the driver installation (such as libcuda.so). If the stubs indeed contain duplicates with real libraries that are installed with the SDK

That's exactly the intention: For every library in the stubs folder, check if it is in the lib64|lib folder and if so remove it from the stubs folder.

Result:

prepend_path("LD_LIBRARY_PATH","/sw/installed/CUDA/11.7.0/lib")
prepend_path("LD_LIBRARY_PATH","/sw/installed/CUDA/11.7.0/extras/CUPTI/lib64")
prepend_path("LD_LIBRARY_PATH","/sw/installed/CUDA/11.7.0/nvvm/lib64")
prepend_path("LIBRARY_PATH","/sw/installed/CUDA/11.7.0/lib")
prepend_path("LIBRARY_PATH","/sw/installed/CUDA/11.7.0/stubs/lib64")
$ ls /sw/installed/CUDA/11.7.0/stubs/lib64
libcuda.so  libcuda.so.1  libnvidia-ml.so  libnvidia-ml.so.1

$ ls /sw/installed/CUDA/11.7.0/lib
cmake         ...             libcufft.so     ...          libcurand.so  ...

Note how the stubs folder is only in $LIBRARY_PATH but not in $LD_LIBRARY_PATH so I'm sure it is safe to assume the libraries from the lib folder (which is in both) are preferable. Hence, it's safe to remove duplicates from the stubs folder.

Copy link
Member

@ocaisa ocaisa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM
Have communicated the change with the other maintainers and no-one raised concerns about modifying the default stubs library of CUDA

@boegel boegel modified the milestones: 4.x, release after 4.7.3 Jul 5, 2023
@boegel boegel added the change label Jul 5, 2023
@ocaisa
Copy link
Member

ocaisa commented Jul 7, 2023

Test report by @ocaisa

Overview of tested easyconfigs (in order)

  • SUCCESS CUDA-12.1.0.eb

Build succeeded for 1 out of 1 (1 easyconfigs in total)
node1.int.techhpc.learnhpc.eu - Linux Rocky Linux 8.8 (Green Obsidian), x86_64, Intel(R) Xeon(R) CPU E5-2667 v3 @ 3.20GHz (haswell), Python 3.9.16
See https://gist.github.com/ocaisa/8e517924c895874fb7809031a08e0347 for a full test report.

@ocaisa
Copy link
Member

ocaisa commented Jul 7, 2023

Test report by @ocaisa

Overview of tested easyconfigs (in order)

  • SUCCESS CUDA-12.1.0.eb

Build succeeded for 1 out of 1 (1 easyconfigs in total)
node1.int.techhpc.learnhpc.eu - Linux Rocky Linux 8.8 (Green Obsidian), x86_64, Intel(R) Xeon(R) CPU E5-2667 v3 @ 3.20GHz (haswell), Python 3.9.16
See https://gist.github.com/ocaisa/80f7dd8698ad458fdafb189f1dd7e334 for a full test report.

@ocaisa
Copy link
Member

ocaisa commented Jul 7, 2023

Checked again that it does as expected. Going in, thanks for your patience @Flamefire

@ocaisa ocaisa merged commit fd59013 into easybuilders:develop Jul 7, 2023
47 checks passed
@boegel
Copy link
Member

boegel commented Jul 7, 2023

We briefly discussed this during the last EasyBuild conf call, and I agree it makes sense to make this change.

It does make me wonder though: why are stubs libraries provided when the real libraries are also provided?!
I feel like we're missing something here...

Maybe @ajdecon can help us out here?

@ajdecon
Copy link
Contributor

ajdecon commented Jul 7, 2023

@boegel : This nerd-sniped me for a bit. 😉

My understanding is that the stub libraries are used to run binaries that are linked against CUDA, but which support running in a CPU-only mode. Those hosts can have the stub libraries installed to avoid link errors at runtime, without having to install the whole CUDA toolkit. (Or include the much larger GPU-enabled CUDA libs.)

An example of this is the Triton Inference Server, where we provide container images for both GPU hosts and CPU-only hosts. The build script for the CPU build of the container copies the stub libs into the container rather than installing the full CUDA toolkit.

@Flamefire Flamefire deleted the 20220927134957_new_pr_cuda branch July 15, 2023 12:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants