Samples installation in CUDAcore causes failure #2497

Closed
branfosj opened this issue Jun 28, 2021 · 5 comments

@branfosj
Member

While installing CUDAcore 11.3.1, the samples installation step fails with the following error:

[ERROR]: Failed to update permissions for /rds/bear-apps/devel/eb-sjb-up/tmp/CUDAcore/11.3.1/system-system//NVIDIA_CUDA-11.3_Samples/1_Utilities

(The path shown is the build path.)

This causes the CUDA installer to remove everything it has installed. I've tested with different build paths (GPFS storage, /dev/shm, and local disk /scratch) and those also fail.

If I do not install the samples (and remove the corresponding sanity check), the installation completes.
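
For anyone comparing build paths as described above, a small probe can show whether permission updates work at all under a given location. This is only a sketch under stated assumptions: the function name `can_update_permissions` and the probe directory name are hypothetical, and the check does not reproduce what the CUDA installer actually does. It only tests whether a plain `chmod` succeeds on a freshly created directory.

```python
import os
import stat
import tempfile


def can_update_permissions(build_path):
    """Return True if a directory created under build_path can be chmod-ed.

    Hypothetical helper for illustration; not part of EasyBuild or CUDA.
    """
    try:
        # Create a scratch directory under the candidate build path;
        # it is removed automatically when the block exits.
        with tempfile.TemporaryDirectory(dir=build_path) as scratch:
            probe = os.path.join(scratch, "probe_dir")
            os.mkdir(probe)
            # Try to change the permissions of the probe directory.
            os.chmod(probe, 0o755)
            # Confirm that the requested mode was actually applied.
            return stat.S_IMODE(os.stat(probe).st_mode) == 0o755
    except OSError:
        # Missing path, read-only filesystem, or refused chmod.
        return False


if __name__ == "__main__":
    # Candidate locations mentioned in this thread; adjust as needed.
    for path in ("/dev/shm", "/scratch", tempfile.gettempdir()):
        print(path, can_update_permissions(path))
```

If the probe succeeds in a location where the samples step still fails, the storage backend is probably not the cause, and the environment the build runs in is worth checking next.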

@Micket
Contributor

Micket commented Jun 28, 2021

Is this unique to POWER?

@branfosj
Member Author

branfosj commented Jun 28, 2021

No. I'm seeing it on both x86_64 and POWER. I already had CUDAcore-11.3.1.eb installed on x86_64, from before the installation of the samples was added to the CUDA easyblock (#2374).

@branfosj
Member Author

I'm seeing the same issue with:

  • CUDAcore-11.3.0.eb
  • CUDAcore-11.2.2.eb
  • CUDAcore-11.2.1.eb
  • CUDAcore-11.1.1.eb
  • CUDAcore-11.0.2.eb

@Micket
Contributor

Micket commented Jun 28, 2021

eb CUDAcore-11.3.0.eb --include-easyblocks-from-pr 2374 --rebuild

Using the latest eb from develop, I end up with a working installation (with samples). The build dir is on a ramdisk.

branfosj changed the title from "Samples installation in CUDAcore 11.3.1 causes failure" to "Samples installation in CUDAcore causes failure" on Jun 30, 2021
@branfosj
Member Author

I've solved my issue. If I install inside a Slurm job, the cgroup causes problems with the installation of the samples. If I log on to the node directly, the samples install fine.
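
A quick way to tell whether a build is running in the situation described above is to check for a Slurm job environment and look at the process's cgroup membership. The sketch below assumes a Linux node where `/proc/self/cgroup` is readable, and relies on Slurm setting `SLURM_JOB_ID` inside a job. The function names are hypothetical, and the thread does not say which cgroup restriction is responsible, so this only reports the context and does not diagnose the failure.

```python
import os


def slurm_job_id(environ=None):
    """Return the Slurm job ID from the environment, or None if not in a job."""
    env = os.environ if environ is None else environ
    return env.get("SLURM_JOB_ID")


def read_cgroup_membership(path="/proc/self/cgroup"):
    """Return the non-empty lines of the cgroup file, or [] if it is unreadable."""
    try:
        with open(path) as handle:
            return [line.strip() for line in handle if line.strip()]
    except OSError:
        return []


def slurm_warning(environ=None):
    """Return a warning string when running inside a Slurm job, else None."""
    job_id = slurm_job_id(environ)
    if job_id is None:
        return None
    return (
        "Running inside Slurm job %s: the samples installation "
        "may fail here; consider building from a direct login." % job_id
    )


if __name__ == "__main__":
    message = slurm_warning()
    if message:
        print(message)
    # Show which cgroups the current process belongs to.
    for line in read_cgroup_membership():
        print(line)
```

Running this both inside a job and from a direct login on the same node gives a side-by-side view of the two environments.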

2 participants