
Update to CUDA 11.4.1 #7257

Conversation

@fwyzard (Contributor) commented Aug 31, 2021

Update to CUDA 11.4.1 (11.4.20210728):

  • CUDA runtime version 11.4.108
  • NVIDIA drivers version 470.57.02
  • cuDNN version 8.2.2.26

Add support for GCC 11 and clang 12.

See https://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html.

@cmsbuild (Contributor) commented Aug 31, 2021

A new Pull Request was created by @fwyzard (Andrea Bocci) for branch IB/CMSSW_12_1_X/master.

@cmsbuild, @smuzaffar, @mrodozov, @iarspider can you please review it and eventually sign? Thanks.
@perrotta, @dpiparo, @qliphy you are the release managers for this.
cms-bot commands are listed here

@fwyzard (Contributor Author) commented Aug 31, 2021

please test

@fwyzard (Contributor Author) commented Aug 31, 2021

Now that all externals are fixed, it should be OK to upgrade the default CMSSW releases to CUDA 11.4.

@fwyzard (Contributor Author) commented Aug 31, 2021

backport #7197

@fwyzard (Contributor Author) commented Aug 31, 2021

enable gpu

@fwyzard (Contributor Author) commented Aug 31, 2021

please test for slc7_amd64_gcc10

@fwyzard (Contributor Author) commented Aug 31, 2021

please test for slc7_aarch64_gcc9

@fwyzard (Contributor Author) commented Aug 31, 2021

please test for slc7_ppc64le_gcc9

@cmsbuild (Contributor)

+1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-655e55/18155/summary.html
COMMIT: a8b7b2c
CMSSW: CMSSW_12_1_X_2021-08-30-2300/slc7_amd64_gcc900
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmsdist/7257/18155/install.sh to create a dev area with all the needed externals and cmssw changes.

Comparison Summary

Summary:

  • No significant changes to the logs found
  • Reco comparison results: 0 differences found in the comparisons
  • DQMHistoTests: Total files compared: 39
  • DQMHistoTests: Total histograms compared: 3000404
  • DQMHistoTests: Total failures: 0
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 3000382
  • DQMHistoTests: Total skipped: 22
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB (38 files compared)
  • Checked 165 log files, 37 edm output root files, 39 DQM output files
  • TriggerResults: no differences found

@smuzaffar (Contributor)

please test
let's run the GPU tests

@cmsbuild (Contributor)

-1

Failed Tests: UnitTests
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-655e55/18178/summary.html
COMMIT: a8b7b2c
CMSSW: CMSSW_12_1_X_2021-08-30-2300/slc7_aarch64_gcc9
Additional Tests: GPU
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmsdist/7257/18178/install.sh to create a dev area with all the needed externals and cmssw changes.

The following merge commits were also included on top of IB + this PR after doing git cms-merge-topic:

You can see more details here:
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-655e55/18178/git-recent-commits.json
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-655e55/18178/git-merge-result

Unit Tests

I found errors in the following unit tests:

---> test testFWCoreConcurrency had ERRORS
---> test testFWCoreUtilities had ERRORS
---> test TestFWCoreServicesDriver had ERRORS

@cmsbuild (Contributor) commented Sep 1, 2021

+1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-655e55/18179/summary.html
COMMIT: a8b7b2c
CMSSW: CMSSW_12_1_X_2021-08-31-1100/slc7_amd64_gcc900
Additional Tests: GPU
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmsdist/7257/18179/install.sh to create a dev area with all the needed externals and cmssw changes.

The following merge commits were also included on top of IB + this PR after doing git cms-merge-topic:

You can see more details here:
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-655e55/18179/git-recent-commits.json
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-655e55/18179/git-merge-result

GPU Comparison Summary

Summary:

  • No significant changes to the logs found
  • Reco comparison results: 0 differences found in the comparisons
  • DQMHistoTests: Total files compared: 4
  • DQMHistoTests: Total histograms compared: 19735
  • DQMHistoTests: Total failures: 6
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 19729
  • DQMHistoTests: Total skipped: 0
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB (3 files compared)
  • Checked 12 log files, 9 edm output root files, 4 DQM output files
  • TriggerResults: no differences found

Comparison Summary

The workflow 140.53 has different files in step1_dasquery.log than the ones found in the baseline. You may want to check and retrigger the tests if necessary. You can check it in the "files" directory in the results of the comparisons.

Summary:

  • No significant changes to the logs found
  • Reco comparison results: 1303 differences found in the comparisons
  • DQMHistoTests: Total files compared: 39
  • DQMHistoTests: Total histograms compared: 3000404
  • DQMHistoTests: Total failures: 3676
  • DQMHistoTests: Total nulls: 20
  • DQMHistoTests: Total successes: 2996686
  • DQMHistoTests: Total skipped: 22
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 45.699 KiB (38 files compared)
  • DQMHistoSizes: changed (140.53): 44.531 KiB Hcal/DigiRunHarvesting
  • DQMHistoSizes: changed (140.53): 1.172 KiB RPC/DCSInfo
  • DQMHistoSizes: changed (312.0): -0.004 KiB MessageLogger/Warnings
  • Checked 165 log files, 37 edm output root files, 39 DQM output files
  • TriggerResults: no differences found

@fwyzard (Contributor Author) commented Sep 1, 2021

@smuzaffar could you point me to the logs of the jobs that failed to use the GPU?

@smuzaffar (Contributor)

@fwyzard, no job failed. It is just that, due to the newer CUDA driver shipped with CMSSW, our tests are not actually using the GPUs.

@fwyzard (Contributor Author) commented Sep 1, 2021

It is just that, due to the newer CUDA driver shipped with CMSSW, our tests are not actually using the GPUs.

Are you sure they didn't use the GPU?

For slc7_amd64_gcc900 the Matrix GPU Tests logs (here and for example here) look fine:

31-Aug-2021 21:50:35 UTC  Initiating request to open file file:step2.root
31-Aug-2021 21:50:37 UTC  Successfully opened file file:step2.root
Begin processing the 1st record. Run 1, Event 1, LumiSection 1 on stream 0 at 31-Aug-2021 21:50:43.581 UTC
Begin processing the 2nd record. Run 1, Event 2, LumiSection 1 on stream 0 at 31-Aug-2021 21:50:54.140 UTC
Begin processing the 3rd record. Run 1, Event 3, LumiSection 1 on stream 0 at 31-Aug-2021 21:50:54.361 UTC
Begin processing the 4th record. Run 1, Event 4, LumiSection 1 on stream 0 at 31-Aug-2021 21:50:54.447 UTC
Begin processing the 5th record. Run 1, Event 5, LumiSection 1 on stream 0 at 31-Aug-2021 21:50:54.722 UTC
Begin processing the 6th record. Run 1, Event 6, LumiSection 1 on stream 0 at 31-Aug-2021 21:50:54.930 UTC
Begin processing the 7th record. Run 1, Event 7, LumiSection 1 on stream 0 at 31-Aug-2021 21:50:55.147 UTC
Begin processing the 8th record. Run 1, Event 8, LumiSection 1 on stream 0 at 31-Aug-2021 21:50:55.558 UTC
Begin processing the 9th record. Run 1, Event 9, LumiSection 1 on stream 0 at 31-Aug-2021 21:50:55.806 UTC
Begin processing the 10th record. Run 1, Event 10, LumiSection 1 on stream 0 at 31-Aug-2021 21:50:55.968 UTC
31-Aug-2021 21:50:56 UTC  Closed file file:step2.root

If we don't use the GPU, there should be a warning about it from the CUDAService.
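
For illustration, here is a minimal sketch in plain CUDA C++ (not the actual CUDAService code) of the kind of startup probe that produces such a warning; the helper name cudaIsAvailable is hypothetical:

#include <cstdio>
#include <cuda_runtime.h>

// Probe the CUDA runtime once at startup: if it cannot be initialised or no
// device is visible, report it so the framework can fall back to CPU-only mode.
bool cudaIsAvailable() {
  int nDevices = 0;
  cudaError_t status = cudaGetDeviceCount(&nDevices);
  if (status != cudaSuccess || nDevices == 0) {
    std::fprintf(stderr,
                 "Failed to initialize the CUDA runtime (%s).\n"
                 "Disabling the CUDAService.\n",
                 status == cudaSuccess ? "no device found" : cudaGetErrorString(status));
    return false;
  }
  return true;
}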

@smuzaffar (Contributor) commented Sep 1, 2021

By the way, I forced the environment to use the system CUDA driver, and the GPU workflows worked fine:

Singularity> echo $LD_LIBRARY_PATH | tr : '\n'
/pool/condor/dir_27828/del/CMSSW_12_1_X_2021-08-31-1100/biglib/slc7_amd64_gcc900
/pool/condor/dir_27828/del/CMSSW_12_1_X_2021-08-31-1100/lib/slc7_amd64_gcc900
/pool/condor/dir_27828/del/CMSSW_12_1_X_2021-08-31-1100/external/slc7_amd64_gcc900/lib
/cvmfs/cms-ib.cern.ch/week0/slc7_amd64_gcc900/cms/cmssw-patch/CMSSW_12_1_X_2021-08-31-1100/biglib/slc7_amd64_gcc900
/cvmfs/cms-ib.cern.ch/week0/slc7_amd64_gcc900/cms/cmssw-patch/CMSSW_12_1_X_2021-08-31-1100/lib/slc7_amd64_gcc900
/cvmfs/cms-ib.cern.ch/nweek-02696/slc7_amd64_gcc900/cms/cmssw/CMSSW_12_1_X_2021-08-30-1100/biglib/slc7_amd64_gcc900
/cvmfs/cms-ib.cern.ch/nweek-02696/slc7_amd64_gcc900/cms/cmssw/CMSSW_12_1_X_2021-08-30-1100/lib/slc7_amd64_gcc900
/cvmfs/cms-ci.cern.ch/week0/PR_f90b4fc1/slc7_amd64_gcc900/external/llvm/12.0.0-c5cf9d7847e75fb54c7457bce214412e/lib64
/cvmfs/cms-ci.cern.ch/week0/PR_f90b4fc1/slc7_amd64_gcc900/external/gcc/9.3.0/lib64
/cvmfs/cms-ci.cern.ch/week0/PR_f90b4fc1/slc7_amd64_gcc900/external/gcc/9.3.0/lib
/.singularity.d/libs
Singularity> cat runall-report-step123-.log 
10824.502_TTbar_13+2018_Patatrack_PixelOnlyGPU+TTbar_13TeV_TuneCUETP8M1_GenSimINPUT+Digi+RecoFakeHLT+HARVESTFakeHLT Step0-PASSED Step1-PASSED Step2-PASSED Step3-PASSED  - time date Wed Sep  1 06:41:04 2021-date Wed Sep  1 06:36:33 2021; exit: 0 0 0 0
10824.512_TTbar_13+2018_Patatrack_ECALOnlyGPU+TTbar_13TeV_TuneCUETP8M1_GenSimINPUT+Digi+RecoFakeHLT+HARVESTFakeHLT Step0-PASSED Step1-PASSED Step2-PASSED Step3-PASSED  - time date Wed Sep  1 06:40:50 2021-date Wed Sep  1 06:36:34 2021; exit: 0 0 0 0
10824.522_TTbar_13+2018_Patatrack_HCALOnlyGPU+TTbar_13TeV_TuneCUETP8M1_GenSimINPUT+Digi+RecoFakeHLT+HARVESTFakeHLT Step0-PASSED Step1-PASSED Step2-PASSED Step3-PASSED  - time date Wed Sep  1 06:40:50 2021-date Wed Sep  1 06:36:34 2021; exit: 0 0 0 0
3 3 3 3 tests passed, 0 0 0 0 failed

@fwyzard (Contributor Author) commented Sep 1, 2021

The expected behaviour, when the NVIDIA drivers from CMSSW are newer than those from the system, is that CMSSW uses its own copy of the drivers at run time.
If the GPU is a "server" model (P100, V100, T4, A10, etc.) or a "professional graphics" model (T6000, A6000, etc.), this should work fine.
Only the gaming GPUs do not support it, and would fail to run.
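
For reference, the driver/runtime version relationship can be checked programmatically with the public CUDA runtime API; a minimal sketch, assuming the usual encoding 1000*major + 10*minor (e.g. 11040 for CUDA 11.4):

#include <cstdio>
#include <cuda_runtime.h>

int main() {
  int driverVersion = 0;
  int runtimeVersion = 0;
  cudaDriverGetVersion(&driverVersion);    // latest CUDA version supported by the installed driver (0 if none)
  cudaRuntimeGetVersion(&runtimeVersion);  // CUDA version of the runtime this binary is linked against
  std::printf("driver supports CUDA %d.%d, runtime is CUDA %d.%d\n",
              driverVersion / 1000, (driverVersion % 100) / 10,
              runtimeVersion / 1000, (runtimeVersion % 100) / 10);
  if (driverVersion < runtimeVersion) {
    // The system driver is too old for this runtime: this is the case where
    // forward-compatibility driver libraries (like those shipped with CMSSW)
    // would be needed.
    std::printf("system driver is older than the runtime: compatibility libraries needed\n");
  }
  return 0;
}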

@fwyzard (Contributor Author) commented Sep 1, 2021

By the way, I forced the environment to use the system CUDA driver, and the GPU workflows worked fine

Do you have the logs from these commands?

@smuzaffar (Contributor)

No, the GPU Condor node I was working on was destroyed a few minutes ago :-( I am testing it again.

@smuzaffar (Contributor)

@fwyzard, the runTheMatrix logs for the GPU workflows are available under https://muzaffar.web.cern.ch/gpu/ now. Note that in this case we used the system CUDA driver.

@fwyzard (Contributor Author) commented Sep 1, 2021

So, here is the log we expect when a GPU workflow is running on a GPU (from the 10824.502 workflow):

$ cmsRun step3_RAW2DIGI_RECO_VALIDATION_DQM.py

01-Sep-2021 09:21:13 CEST  Initiating request to open file file:step2.root
01-Sep-2021 09:21:13 CEST  Successfully opened file file:step2.root
Begin processing the 1st record. Run 1, Event 1, LumiSection 1 on stream 0 at 01-Sep-2021 09:21:22.263 CEST
Begin processing the 2nd record. Run 1, Event 2, LumiSection 1 on stream 0 at 01-Sep-2021 09:21:24.350 CEST
Begin processing the 3rd record. Run 1, Event 3, LumiSection 1 on stream 0 at 01-Sep-2021 09:21:24.520 CEST
Begin processing the 4th record. Run 1, Event 4, LumiSection 1 on stream 0 at 01-Sep-2021 09:21:24.610 CEST
Begin processing the 5th record. Run 1, Event 5, LumiSection 1 on stream 0 at 01-Sep-2021 09:21:24.827 CEST
Begin processing the 6th record. Run 1, Event 6, LumiSection 1 on stream 0 at 01-Sep-2021 09:21:24.990 CEST
Begin processing the 7th record. Run 1, Event 7, LumiSection 1 on stream 0 at 01-Sep-2021 09:21:25.190 CEST
Begin processing the 8th record. Run 1, Event 8, LumiSection 1 on stream 0 at 01-Sep-2021 09:21:25.506 CEST
Begin processing the 9th record. Run 1, Event 9, LumiSection 1 on stream 0 at 01-Sep-2021 09:21:25.685 CEST
Begin processing the 10th record. Run 1, Event 10, LumiSection 1 on stream 0 at 01-Sep-2021 09:21:25.813 CEST
01-Sep-2021 09:21:26 CEST  Closed file file:step2.root

And here is the log we expect when the same configuration is run without a GPU:

$ CUDA_VISIBLE_DEVICES= cmsRun step3_RAW2DIGI_RECO_VALIDATION_DQM.py

%MSG-w CUDAService:  (NoModuleName) 01-Sep-2021 09:22:13 CEST pre-events
Failed to initialize the CUDA runtime.
Disabling the CUDAService.
%MSG
01-Sep-2021 09:22:19 CEST  Initiating request to open file file:step2.root
01-Sep-2021 09:22:19 CEST  Successfully opened file file:step2.root
Begin processing the 1st record. Run 1, Event 1, LumiSection 1 on stream 0 at 01-Sep-2021 09:22:27.534 CEST
Begin processing the 2nd record. Run 1, Event 2, LumiSection 1 on stream 0 at 01-Sep-2021 09:22:29.781 CEST
Begin processing the 3rd record. Run 1, Event 3, LumiSection 1 on stream 0 at 01-Sep-2021 09:22:29.990 CEST
Begin processing the 4th record. Run 1, Event 4, LumiSection 1 on stream 0 at 01-Sep-2021 09:22:30.097 CEST
Begin processing the 5th record. Run 1, Event 5, LumiSection 1 on stream 0 at 01-Sep-2021 09:22:30.359 CEST
Begin processing the 6th record. Run 1, Event 6, LumiSection 1 on stream 0 at 01-Sep-2021 09:22:30.574 CEST
Begin processing the 7th record. Run 1, Event 7, LumiSection 1 on stream 0 at 01-Sep-2021 09:22:30.803 CEST
Begin processing the 8th record. Run 1, Event 8, LumiSection 1 on stream 0 at 01-Sep-2021 09:22:31.183 CEST
Begin processing the 9th record. Run 1, Event 9, LumiSection 1 on stream 0 at 01-Sep-2021 09:22:31.413 CEST
Begin processing the 10th record. Run 1, Event 10, LumiSection 1 on stream 0 at 01-Sep-2021 09:22:31.576 CEST
01-Sep-2021 09:22:31 CEST  Closed file file:step2.root

The job should be successful in both cases - the only difference being the warning from the CUDAService.

@fwyzard (Contributor Author) commented Sep 1, 2021

@fwyzard, the runTheMatrix logs for the GPU workflows are available under https://muzaffar.web.cern.ch/gpu/ now. Note that in this case we used the system CUDA driver.

OK, so running with the system driver (v460) and CUDA 11.4 works out of the box.

And I interpret the results of the PR tests as showing that using the driver bundled with CMSSW also works.

@fwyzard (Contributor Author) commented Sep 1, 2021

OK, so running with the system driver (v460) and CUDA 11.4 works out of the box.

Which matches the release notes, which say we should be able to run on any driver >= v450.80.02:

[screenshot of the CUDA release notes table listing the minimum required driver versions]

@fwyzard (Contributor Author) commented Sep 1, 2021

Meanwhile, according to the compatibility guide, the compatibility drivers that we ship with CMSSW would only be required if the system drivers are older, e.g. from CUDA 10.1 or 10.2:

[screenshot of the CUDA compatibility guide table of forward-compatible driver versions]

@smuzaffar (Contributor)

Thanks @fwyzard, there was nothing in the logs about GPU usage or a "device found" message, which is why I was thinking that maybe the GPU was not used. Running these workflows on a non-GPU system shows that:

%MSG-i ThreadStreamSetup:  (NoModuleName) 01-Sep-2021 09:40:27 CEST pre-events
setting # threads 4
setting # streams 4
%MSG
%MSG-w CUDAService:  (NoModuleName) 01-Sep-2021 09:40:28 CEST pre-events
Failed to initialize the CUDA runtime.
Disabling the CUDAService.
%MSG
01-Sep-2021 09:40:34 CEST  Initiating request to open file file:step2.root
01-Sep-2021 09:40:35 CEST  Successfully opened file file:step2.root

so all looks good for this PR to go in

@smuzaffar (Contributor)

+externals

@smuzaffar merged commit bbd39aa into cms-sw:IB/CMSSW_12_1_X/master Sep 1, 2021
@cmsbuild (Contributor) commented Sep 1, 2021

This pull request is fully signed and it will be integrated in one of the next IB/CMSSW_12_1_X/master IBs (tests are also fine). This pull request will now be reviewed by the release team before it's merged. @perrotta, @dpiparo, @qliphy (and backports should be raised in the release meeting by the corresponding L2)

@fwyzard (Contributor Author) commented Sep 1, 2021

Thanks @fwyzard, there was nothing in the logs about GPU usage or a "device found" message, which is why I was thinking that maybe the GPU was not used.

That's a good point.

@makortel, maybe we should add a simple message from the CUDAService that confirms which GPUs and runtime are used?
Just a couple of lines, like

%MSG-s CUDAService: ...
CUDA version 11.4.1, driver version 465.19.01, runtime version 470.57.02
GPU 0: NVIDIA Tesla T4 (sm 7.5)
%MSG

?
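
For illustration, a minimal standalone sketch in plain CUDA C++ (not the actual CUDAService implementation) of how such a per-GPU printout can be assembled from the CUDA runtime API:

#include <cstdio>
#include <cuda_runtime.h>

int main() {
  int nDevices = 0;
  if (cudaGetDeviceCount(&nDevices) != cudaSuccess) {
    return 0;  // no usable CUDA runtime: nothing to report
  }
  // One line per GPU: index, marketing name, and compute capability ("sm").
  for (int i = 0; i < nDevices; ++i) {
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, i) == cudaSuccess) {
      std::printf("GPU %d: %s (sm %d.%d)\n", i, prop.name, prop.major, prop.minor);
    }
  }
  return 0;
}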

@fwyzard deleted the IB/CMSSW_12_1_X/master_cuda_11.4.1 branch September 1, 2021 10:08
@makortel (Contributor) commented Sep 1, 2021

Seems that a printout like that would be useful. I'd go with LogInfo though, following the number-of-threads-and-streams printout.

@fwyzard (Contributor Author) commented Sep 1, 2021

We already have a more verbose LogInfo in the CUDAService.
How do we enable (by default) the new simpler message without bringing in the complex one? A different category?

@makortel (Contributor) commented Sep 1, 2021

Good point, I got confused by the order of that printout and the application of the MessageLogger configuration (which suppresses INFO by default). Maybe LogSystem would be OK.

@fwyzard (Contributor Author) commented Sep 2, 2021

I actually managed to use LogInfo, adding a new verbose flag to the CUDAService.
See #35117 for more details and a sample output.
