@fwyzard
Contributor

@fwyzard fwyzard commented Sep 29, 2021

Update to CUDA 11.4.2 (SDK 11.4.20210830):

Update cuDNN to v8.2.2.26 for CUDA 11.4:

  • enable the use of cuDNN for ONNX on ARM for CentOS 8

Update the CUDA external packaging:

  • include the CUDA runtime static library in the external package
  • create symlinks for the library stubs to satisfy link-time checks
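
As a rough illustration of the stub symlinks in the list above: the stubs directory of a CUDA toolkit typically ships unversioned files such as libcuda.so, while libraries that depend on the driver record the versioned name libcuda.so.1, so a link-time check that resolves those dependencies can fail unless the versioned name exists as well. The sketch below is not the spec-file change from this PR; the function name `link_stub_sonames`, the directory layout and the list of stubs are assumptions made for illustration.

```python
from pathlib import Path

# Assumed mapping from unversioned stub name to the versioned name that
# dependent libraries ask for; the real list in the PR may differ.
ASSUMED_STUBS = {
    "libcuda.so": "libcuda.so.1",
    "libnvidia-ml.so": "libnvidia-ml.so.1",
}


def link_stub_sonames(stubs_dir, stubs=ASSUMED_STUBS):
    """Create '<versioned name> -> <stub>' symlinks inside stubs_dir.

    Stubs that are not present are skipped. Returns the created link paths.
    """
    stubs_dir = Path(stubs_dir)
    created = []
    for stub, versioned in stubs.items():
        if not (stubs_dir / stub).exists():
            continue
        link = stubs_dir / versioned
        if link.is_symlink() or link.exists():
            link.unlink()  # replace a stale link, like `ln -sf`
        link.symlink_to(stub)  # relative target, so the tree stays relocatable
        created.append(link)
    return created
```

Calling `link_stub_sonames` on a toolkit's stubs directory is roughly equivalent to running `ln -sf libcuda.so libcuda.so.1` there for each stub that is present.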

Add the cuda-compatible-runtime test as a new external.

Update Eigen.

Backport #7257, #7277, #7278, #7279 et al.

@cmsbuild
Contributor

A new Pull Request was created by @fwyzard (Andrea Bocci) for branch IB/CMSSW_12_0_X/master.

@cmsbuild, @smuzaffar, @iarspider, @mrodozov can you please review it and sign it when ready? Thanks.
@perrotta, @dpiparo, @qliphy you are the release managers for this.
cms-bot commands are listed here

@fwyzard
Contributor Author

fwyzard commented Sep 29, 2021

enable gpu

@fwyzard
Contributor Author

fwyzard commented Sep 29, 2021

please test

@fwyzard
Contributor Author

fwyzard commented Sep 29, 2021

please test for cc8_aarch64_gcc9

@fwyzard
Contributor Author

fwyzard commented Sep 29, 2021

please test for cc8_amd64_gcc9

@fwyzard
Contributor Author

fwyzard commented Sep 29, 2021

@smuzaffar @perrotta @qliphy this PR should backport all the CUDA-related changes from CMSSW 12.1.x to 12.0.x.

Please let me know if you think this is OK, or if I should make a more limited backport of the minimal changes required to fix the issue mentioned yesterday (and discussed on this GGUS ticket).

@smuzaffar
Contributor

@fwyzard, we also need new build rules with the updated NVIDIA runtime hook to go with this, right?

@fwyzard
Contributor Author

fwyzard commented Sep 29, 2021

Yes - but the old rules should still work with this update, so we can test in two separate steps.

@cmsbuild
Contributor

-1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-b61c8f/19222/summary.html
COMMIT: da32034
CMSSW: CMSSW_12_0_X_2021-09-28-2300/slc7_amd64_gcc900
Additional Tests: GPU
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmsdist/7340/19222/install.sh to create a dev area with all the needed externals and cmssw changes.

External Build

I found a compilation error when building:

+ for FILE in '$FILES'
++ basename src/common.cpp
+ /data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/slc7_amd64_gcc900/external/cuda/11.4.2-0939a3504c82d9c20346029080003d72/bin/nvcc -DALPAKA_ACC_GPU_CUDA_ENABLED -DCUPLA_STREAM_ASYNC_ENABLED=1 -DALPAKA_DEBUG=0 -I/data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/slc7_amd64_gcc900/external/cuda/11.4.2-0939a3504c82d9c20346029080003d72/include -I/data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/slc7_amd64_gcc900/external/tbb/v2021.2.0-fcaf3e8d37e2c0c2807c93f2e5bba226/include -I/data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/slc7_amd64_gcc900/external/boost/1.75.0-5ba0079faea30e2a96d0dd57a4ddb60f/include -I/data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/slc7_amd64_gcc900/external/alpaka/0.6.0-e2249872940140983aaf1c217659a754/include -Iinclude -std=c++17 -O3 --generate-line-info --source-in-ptx --display-error-number --expt-relaxed-constexpr --extended-lambda -gencode 'arch=compute_60,code=[sm_60,compute_60]' -gencode 'arch=compute_70,code=[sm_70,compute_70]' -gencode 'arch=compute_75,code=[sm_75,compute_75]' -Wno-deprecated-gpu-targets -Xcudafe --diag_suppress=esa_on_defaulted_function_ignored --cudart shared -Xcompiler '-std=c++17 -O2 -pthread -fPIC -Wall -Wextra' -x cu -c src/common.cpp -o build/cuda/common.cpp.o
/data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/slc7_amd64_gcc900/external/alpaka/0.6.0-e2249872940140983aaf1c217659a754/include/alpaka/event/EventGenericThreads.hpp: In instantiation of 'void alpaka::traits::generic::currentThreadWaitForDevice(const TDev&) [with TDev = alpaka::DevCpu]':
/data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/slc7_amd64_gcc900/external/alpaka/0.6.0-e2249872940140983aaf1c217659a754/include/alpaka/dev/cpu/Wait.hpp:33:40:   required from here
/data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/slc7_amd64_gcc900/external/alpaka/0.6.0-e2249872940140983aaf1c217659a754/include/alpaka/event/EventGenericThreads.hpp:280:20: error: '__T30' was not declared in this scope
280 |                 auto vQueues(dev.getAllQueues());
|                 ~~~^~~~~~~~~~~~~~~~~~~~~
error: Bad exit status from /data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/tmp/rpm-tmp.cnlSom (%build)




@cmsbuild
Contributor

-1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-b61c8f/19223/summary.html
COMMIT: da32034
CMSSW: CMSSW_12_0_X_2021-09-28-2300/cc8_amd64_gcc9
Additional Tests: GPU
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmsdist/7340/19223/install.sh to create a dev area with all the needed externals and cmssw changes.

External Build

I found a compilation error when building:

+ for FILE in $FILES
++ basename src/common.cpp
+ /data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/cc8_amd64_gcc9/external/cuda/11.4.2-41ceec0a69aac72c010538b7cd374b9a/bin/nvcc -DALPAKA_ACC_GPU_CUDA_ENABLED -DCUPLA_STREAM_ASYNC_ENABLED=1 -DALPAKA_DEBUG=0 -I/data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/cc8_amd64_gcc9/external/cuda/11.4.2-41ceec0a69aac72c010538b7cd374b9a/include -I/data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/cc8_amd64_gcc9/external/tbb/v2021.2.0-ea64429748bfcab7118ee07caf0ec8a1/include -I/data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/cc8_amd64_gcc9/external/boost/1.75.0-a01e7a4f514707c75c38f8f8cc6f5b30/include -I/data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/cc8_amd64_gcc9/external/alpaka/0.6.0-7658d74aec51e7ed2677aa0655a7c1cb/include -Iinclude -std=c++17 -O3 --generate-line-info --source-in-ptx --display-error-number --expt-relaxed-constexpr --extended-lambda -gencode 'arch=compute_60,code=[sm_60,compute_60]' -gencode 'arch=compute_70,code=[sm_70,compute_70]' -gencode 'arch=compute_75,code=[sm_75,compute_75]' -Wno-deprecated-gpu-targets -Xcudafe --diag_suppress=esa_on_defaulted_function_ignored --cudart shared -Xcompiler '-std=c++17 -O2 -pthread -fPIC -Wall -Wextra' -x cu -c src/common.cpp -o build/cuda/common.cpp.o
/data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/cc8_amd64_gcc9/external/alpaka/0.6.0-7658d74aec51e7ed2677aa0655a7c1cb/include/alpaka/event/EventGenericThreads.hpp: In instantiation of 'void alpaka::traits::generic::currentThreadWaitForDevice(const TDev&) [with TDev = alpaka::DevCpu]':
/data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/cc8_amd64_gcc9/external/alpaka/0.6.0-7658d74aec51e7ed2677aa0655a7c1cb/include/alpaka/dev/cpu/Wait.hpp:33:40:   required from here
/data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/cc8_amd64_gcc9/external/alpaka/0.6.0-7658d74aec51e7ed2677aa0655a7c1cb/include/alpaka/event/EventGenericThreads.hpp:280:20: error: '__T30' was not declared in this scope
280 |                 auto vQueues(dev.getAllQueues());
|                 ~~~^~~~~~~~~~~~~~~~~~~~~
error: Bad exit status from /data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/tmp/rpm-tmp.fbHwdb (%build)




@fwyzard
Contributor Author

fwyzard commented Sep 29, 2021

I forgot that we need the alpaka / cupla updates as well :-/

@fwyzard
Contributor Author

fwyzard commented Sep 29, 2021

please test

@fwyzard
Contributor Author

fwyzard commented Sep 29, 2021

please test for cc8_amd64_gcc9

@cmsbuild
Contributor

Pull request #7340 was updated.

@fwyzard
Contributor Author

fwyzard commented Sep 29, 2021

please test for cc8_aarch64_gcc9

@cmsbuild
Contributor

-1

Failed Tests: Build
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-b61c8f/19239/summary.html
COMMIT: b076f72
CMSSW: CMSSW_12_0_X_2021-09-29-1100/slc7_amd64_gcc900
Additional Tests: GPU
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmsdist/7340/19239/install.sh to create a dev area with all the needed externals and cmssw changes.

Build

I found a compilation error when building:

/data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/slc7_amd64_gcc900/external/eigen/f612df273689a19d25b45ca4f8269463207c4fee/include/eigen3/Eigen/src/Core/Solve.h(72): warning #20011-D: calling a __host__ function("Eigen::PartialPivLU< ::Eigen::Matrix > ::cols() const") from a __host__ __device__ function("Eigen::Solve< ::Eigen::PartialPivLU< ::Eigen::Matrix > ,  ::Eigen::CwiseNullaryOp< ::Eigen::internal::scalar_identity_op ,  ::Eigen::Matrix > > ::rows const") is not allowed

/data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/slc7_amd64_gcc900/external/eigen/f612df273689a19d25b45ca4f8269463207c4fee/include/eigen3/Eigen/src/LU/PartialPivLU.h(412): warning #20011-D: calling a __host__ function("Eigen::internal::FixedInt<(int)-1> ::operator ()(int) const") from a __host__ __device__ function("Eigen::internal::partial_lu_impl ::unblocked_lu") is not allowed

/data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/slc7_amd64_gcc900/external/eigen/f612df273689a19d25b45ca4f8269463207c4fee/include/eigen3/Eigen/src/LU/PartialPivLU.h(412): error: identifier "Eigen::fix<(int)-1> " is undefined in device code

/data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/slc7_amd64_gcc900/external/eigen/f612df273689a19d25b45ca4f8269463207c4fee/include/eigen3/Eigen/src/LU/PartialPivLU.h(422): warning #20011-D: calling a __host__ function("Eigen::internal::FixedInt<(int)-1> ::operator ()(int) const") from a __host__ __device__ function("Eigen::internal::partial_lu_impl ::unblocked_lu") is not allowed

/data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/slc7_amd64_gcc900/external/eigen/f612df273689a19d25b45ca4f8269463207c4fee/include/eigen3/Eigen/src/LU/PartialPivLU.h(422): error: identifier "Eigen::fix<(int)-1> " is undefined in device code



@cmsbuild
Contributor

-1

Failed Tests: Build
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-b61c8f/19240/summary.html
COMMIT: b076f72
CMSSW: CMSSW_12_0_X_2021-09-28-2300/cc8_amd64_gcc9
Additional Tests: GPU
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmsdist/7340/19240/install.sh to create a dev area with all the needed externals and cmssw changes.

Build

I found a compilation error when building:

/data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/cc8_amd64_gcc9/external/eigen/f612df273689a19d25b45ca4f8269463207c4fee/include/eigen3/Eigen/src/Core/Solve.h(72): warning #20011-D: calling a __host__ function("Eigen::PartialPivLU< ::Eigen::Matrix > ::cols() const") from a __host__ __device__ function("Eigen::Solve< ::Eigen::PartialPivLU< ::Eigen::Matrix > ,  ::Eigen::CwiseNullaryOp< ::Eigen::internal::scalar_identity_op ,  ::Eigen::Matrix > > ::rows const") is not allowed

/data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/cc8_amd64_gcc9/external/eigen/f612df273689a19d25b45ca4f8269463207c4fee/include/eigen3/Eigen/src/LU/PartialPivLU.h(412): warning #20011-D: calling a __host__ function("Eigen::internal::FixedInt<(int)-1> ::operator ()(int) const") from a __host__ __device__ function("Eigen::internal::partial_lu_impl ::unblocked_lu") is not allowed

/data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/cc8_amd64_gcc9/external/eigen/f612df273689a19d25b45ca4f8269463207c4fee/include/eigen3/Eigen/src/LU/PartialPivLU.h(412): error: identifier "Eigen::fix<(int)-1> " is undefined in device code

/data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/cc8_amd64_gcc9/external/eigen/f612df273689a19d25b45ca4f8269463207c4fee/include/eigen3/Eigen/src/LU/PartialPivLU.h(422): warning #20011-D: calling a __host__ function("Eigen::internal::FixedInt<(int)-1> ::operator ()(int) const") from a __host__ __device__ function("Eigen::internal::partial_lu_impl ::unblocked_lu") is not allowed

/data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/cc8_amd64_gcc9/external/eigen/f612df273689a19d25b45ca4f8269463207c4fee/include/eigen3/Eigen/src/LU/PartialPivLU.h(422): error: identifier "Eigen::fix<(int)-1> " is undefined in device code
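
When reading the two Eigen build logs above, it helps to separate the nvcc warnings from the errors: the `warning #20011-D` lines are diagnostics, while the `error:` lines are presumably what the bot reports as the compilation failure. A minimal sketch of such a filter follows; the function name `split_diagnostics` and the regular expression are assumptions based only on the line shapes visible in these logs, and are not part of cms-bot.

```python
import re

# Shape of the nvcc diagnostics seen in the logs:
#   <file>(<line>): warning #<code>: <message>
#   <file>(<line>): error: <message>
DIAGNOSTIC = re.compile(
    r"^(?P<file>.+?)\((?P<line>\d+)\): "
    r"(?P<kind>warning|error)(?: #(?P<code>[\w-]+))?: (?P<msg>.*)$"
)


def split_diagnostics(log_text):
    """Return (errors, warnings); each entry is (file name, line, code, message)."""
    errors, warnings = [], []
    for raw in log_text.splitlines():
        match = DIAGNOSTIC.match(raw.strip())
        if match is None:
            continue  # not an nvcc diagnostic line
        entry = (
            match["file"].rsplit("/", 1)[-1],  # keep only the header name
            int(match["line"]),
            match["code"],  # None when the line carries no "#<code>"
            match["msg"],
        )
        (errors if match["kind"] == "error" else warnings).append(entry)
    return errors, warnings
```

Applied to either log, this would leave the two `PartialPivLU.h` entries (lines 412 and 422) in the error list and the three `#20011-D` entries in the warning list.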



@fwyzard
Contributor Author

fwyzard commented Sep 29, 2021 via email

@cmsbuild
Contributor

Pull request #7340 was updated.

@fwyzard
Contributor Author

fwyzard commented Sep 29, 2021

please test

@fwyzard
Contributor Author

fwyzard commented Sep 29, 2021

OK, this is starting to accumulate too many changes for a simple backport, and it may also need to bring in TensorFlow and who knows what else...

I've made a minimal backport at #7346.

@fwyzard fwyzard closed this Sep 29, 2021
@fwyzard fwyzard deleted the IB/CMSSW_12_0_X/master__cuda_11.4.2 branch September 29, 2021 22:38
@cmsbuild
Contributor

+1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-b61c8f/19255/summary.html
COMMIT: bfd920c
CMSSW: CMSSW_12_0_X_2021-09-29-1100/slc7_amd64_gcc900
Additional Tests: GPU
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmsdist/7340/19255/install.sh to create a dev area with all the needed externals and cmssw changes.

GPU Comparison Summary

Summary:

  • No significant changes to the logs found
  • Reco comparison results: 0 differences found in the comparisons
  • Reco comparison had 3 failed jobs
  • DQMHistoTests: Total files compared: 4
  • DQMHistoTests: Total histograms compared: 19735
  • DQMHistoTests: Total failures: 6
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 19729
  • DQMHistoTests: Total skipped: 0
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 3 files compared)
  • Checked 12 log files, 9 edm output root files, 4 DQM output files
  • TriggerResults: no differences found

Comparison Summary

Summary:

  • No significant changes to the logs found
  • Reco comparison results: 0 differences found in the comparisons
  • DQMHistoTests: Total files compared: 39
  • DQMHistoTests: Total histograms compared: 2998564
  • DQMHistoTests: Total failures: 0
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 2998542
  • DQMHistoTests: Total skipped: 22
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 38 files compared)
  • Checked 165 log files, 37 edm output root files, 39 DQM output files
  • TriggerResults: no differences found

@fwyzard
Contributor Author

fwyzard commented Sep 30, 2021

(sorry, wrong PR)
