[TF] Added build with GPU support but default is to build without GPU #7617

Merged
merged 7 commits on Feb 23, 2022

Conversation

smuzaffar
Contributor

No description provided.

@smuzaffar
Contributor Author

please test

@cmsbuild
Contributor

A new Pull Request was created by @smuzaffar (Malik Shahzad Muzaffar) for branch IB/CMSSW_12_3_X/master.

@smuzaffar, @iarspider can you please review it and eventually sign? Thanks.
@perrotta, @dpiparo, @qliphy you are the release managers for this.
cms-bot commands are listed here

@cmsbuild
Contributor

-1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-049d30/22341/summary.html
COMMIT: 52fd78c
CMSSW: CMSSW_12_3_X_2022-02-10-1100/slc7_amd64_gcc10
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmsdist/7617/22341/install.sh to create a dev area with all the needed externals and cmssw changes.

External Build

I found a compilation error when building:

Target //tensorflow/tools/pip_package:build_pip_package failed to build
INFO: Elapsed time: 66.035s, Critical Path: 2.67s
INFO: 54 processes: 50 internal, 4 local.
FAILED: Build did NOT complete successfully
FAILED: Build did NOT complete successfully
error: Bad exit status from /data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/tmp/rpm-tmp.HeV0DH (%build)


RPM build errors:
Bad exit status from /data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/tmp/rpm-tmp.HeV0DH (%build)



@smuzaffar
Contributor Author

please test

@smuzaffar
Contributor Author

please test for slc7_ppc64le_gcc11

@smuzaffar
Contributor Author

please test for slc7_aarch64_gcc11

@cmsbuild
Contributor

-1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-049d30/22354/summary.html
COMMIT: 52fd78c
CMSSW: CMSSW_12_3_X_2022-02-10-2300/slc7_aarch64_gcc11
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmsdist/7617/22354/install.sh to create a dev area with all the needed externals and cmssw changes.

External Build

I found a compilation error when building:

File "/data/cmsbuild/jenkins_a/workspace/ib-run-pr-tests/pkgtools/scheduler.py", line 267, in doSerial
result = commandSpec[0](*commandSpec[1:])
File "./pkgtools/cmsBuild", line 3651, in installPackage
File "./pkgtools/cmsBuild", line 3399, in installRpm
RpmInstallFailed: Failed to install package cudnn. Reason:
error: Failed dependencies:
	libm.so.6(GLIBC_2.27)(64bit) is needed by external+cudnn+8.2.2.26-5a09bc859d16df5e6a023381eff0b19e-1-1.aarch64

* The action "build-external+tensorflow-sources+2.6.0-e16c1637b92da7a7da348b55d10b8992" was not completed successfully because The following dependencies could not complete:
install-external+cudnn+8.2.2.26-5a09bc859d16df5e6a023381eff0b19e
* The action "build-external+tensorflow+2.6.0-d7d45dfb8a5d2a6b123ee55227ad554e" was not completed successfully because The following dependencies could not complete:


@cmsbuild
Contributor

Pull request #7617 was updated.

@smuzaffar
Contributor Author

please test for slc7_aarch64_gcc11

@smuzaffar
Contributor Author

please test for slc7_ppc64le_gcc11

@smuzaffar
Contributor Author

please test

@cmsbuild
Contributor

Pull request #7617 was updated.

@cmsbuild
Contributor

+1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-049d30/22567/summary.html
COMMIT: a907ab6
CMSSW: CMSSW_12_3_X_2022-02-21-2300/slc7_amd64_gcc10
Additional Tests: GPU,THREADING,PROFILING
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmsdist/7617/22567/install.sh to create a dev area with all the needed externals and cmssw changes.

GPU Comparison Summary

Summary:

  • No significant changes to the logs found
  • Reco comparison results: 14 differences found in the comparisons
  • DQMHistoTests: Total files compared: 4
  • DQMHistoTests: Total histograms compared: 19811
  • DQMHistoTests: Total failures: 2437
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 17374
  • DQMHistoTests: Total skipped: 0
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB (3 files compared)
  • Checked 12 log files, 9 edm output root files, 4 DQM output files
  • TriggerResults: no differences found

Comparison Summary

Summary:

  • No significant changes to the logs found
  • Reco comparison results: 4 differences found in the comparisons
  • DQMHistoTests: Total files compared: 49
  • DQMHistoTests: Total histograms compared: 3965143
  • DQMHistoTests: Total failures: 7
  • DQMHistoTests: Total nulls: 1
  • DQMHistoTests: Total successes: 3965113
  • DQMHistoTests: Total skipped: 22
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: -0.004 KiB (48 files compared)
  • DQMHistoSizes: changed (312.0): -0.004 KiB MessageLogger/Warnings
  • Checked 204 log files, 45 edm output root files, 49 DQM output files
  • TriggerResults: no differences found

@smuzaffar
Contributor Author

smuzaffar commented Feb 22, 2022

@tvami, note that this change allows us to build TF with GPU support, but I have disabled it by default. The TF GPU build currently breaks GPU-based tests (see #7617 (comment)).

@smuzaffar
Contributor Author

With the TF GPU build we get errors like:

2022-02-21 12:56:42.686281: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-02-21 12:56:42.689865: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-02-21 12:56:42.693618: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1510] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 30005 MB memory:  -> device: 0, name: Tesla V100S-PCIE-32GB, pci bus id: 0000:00:07.0, compute capability: 7.0
2022-02-21 12:56:42.963412: I tensorflow/stream_executor/cuda/cuda_driver.cc:732] failed to allocate 29.30G (31463047168 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2022-02-21 12:56:42.967774: I tensorflow/stream_executor/cuda/cuda_driver.cc:732] failed to allocate 26.37G (28316741632 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2022-02-21 12:56:42.971858: I tensorflow/stream_executor/cuda/cuda_driver.cc:732] failed to allocate 23.73G (25485066240 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
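
For context, by default TF tries to reserve close to all of the free GPU memory when it creates the device, which matches the 29.30G allocation attempt on the 32 GB V100S above (other parts of the job may already be holding some of that memory). A minimal sketch of the usual per-process knob, assuming a plain Python TF 2.6 session rather than the CMSSW C++ interface, is to request on-demand growth before the first GPU operation:

import tensorflow as tf

# Ask TF to allocate GPU memory on demand instead of reserving
# (almost) the whole device up front; must run before any GPU op.
for gpu in tf.config.list_physical_devices('GPU'):
    tf.config.experimental.set_memory_growth(gpu, True)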

@tvami

tvami commented Feb 22, 2022

Hi @smuzaffar, ok, thanks, I think this is a good first step. We could integrate this and ask the GPU experts to look into the problem. Maybe we could open a cmssw GitHub issue about it? What do you think?

@tvami

tvami commented Feb 22, 2022

Or maybe I could tag @cms-sw/heterogeneous-l2 here

@makortel
Contributor

Looking at the log
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-049d30/22532/runTheMatrixGPU-results/11634.506_TTbar_14TeV+2021_Patatrack_PixelOnlyTripletsGPU+TTbar_14TeV_TuneCP5_GenSim+Digi+RecoNano+HARVESTNano/step2_TTbar_14TeV+2021_Patatrack_PixelOnlyTripletsGPU+TTbar_14TeV_TuneCP5_GenSim+Digi+RecoNano+HARVESTNano.log
it is not clear to me whether the root error is

2022-02-21 12:56:39.046596: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero

or

2022-02-21 12:56:42.963412: I tensorflow/stream_executor/cuda/cuda_driver.cc:732] failed to allocate 29.30G (31463047168 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
...
2022-02-21 12:56:43.092337: I tensorflow/stream_executor/cuda/cuda_driver.cc:732] failed to allocate 1.01G (1080340992 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory

or

2022-02-21 12:56:44.784921: E tensorflow/stream_executor/cuda/cuda_dnn.cc:374] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR

The job finally fails with "Failed to get convolution algorithm", which is probably related to the cuDNN initialization failure.

Whether or not that failure is connected to the memory allocation failures is not clear to me (the log does not show, e.g., whether any allocation actually succeeds). The implication of the first error (warning?) is also unclear.

Would ML folks have any better insight? @vlimant @gkasieczka @riga @yongbinfeng
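
A tiny standalone probe, assuming the Python tensorflow module from the same install area is importable, might also help narrow this down: running one small convolution on the GPU exercises both the allocator and the cuDNN handle creation that fail in the log, outside of the full CMSSW job.

import tensorflow as tf

# Force one small conv2d onto the GPU; this triggers device creation,
# memory allocation and cuDNN initialization, so the errors above
# should reproduce here if they are environment-related.
x = tf.random.normal([1, 32, 32, 3])
k = tf.random.normal([3, 3, 3, 8])
with tf.device('/GPU:0'):
    y = tf.nn.conv2d(x, k, strides=1, padding='SAME')
print(y.shape)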

@fwyzard
Contributor

fwyzard commented Feb 23, 2022

According to this StackOverflow answer, this

2022-02-21 12:56:39.046596: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero

is just a warning, and it should not be the cause of the failure.

@tvami

tvami commented Feb 23, 2022

I guess the real reason for the workflow to fail is the initialization issue that Matti mentioned:
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-049d30/22532/runTheMatrixGPU-results/11634.506_TTbar_14TeV+2021_Patatrack_PixelOnlyTripletsGPU+TTbar_14TeV_TuneCP5_GenSim+Digi+RecoNano+HARVESTNano/step2_TTbar_14TeV+2021_Patatrack_PixelOnlyTripletsGPU+TTbar_14TeV_TuneCP5_GenSim+Digi+RecoNano+HARVESTNano.log

----- Begin Fatal Exception 21-Feb-2022 12:56:45 UTC-----------------------
An exception of category 'InvalidRun' occurred while
   [0] Constructing the EventProcessor
   [1] Constructing module: class=DeepTauId label='hltHpsPFTauDeepTauProducer'
Exception Message:
error while running session: Unknown: 2 root error(s) found.
  (0) Unknown: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
	 [[{{node inner_egamma_conv_1/convolution}}]]
	 [[inner_all_dropout_4/Identity/_7]]
  (1) Unknown: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
	 [[{{node inner_egamma_conv_1/convolution}}]]
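
This "Could not create cudnn handle" / "Failed to get convolution algorithm" pattern is commonly reported when the GPU memory is already exhausted by the time cuDNN initializes. One thing that might be worth trying (not verified here) is TensorFlow's allow-growth switch; it can be set through the environment, so as far as I know it is honored by the C++ API used in CMSSW as well. A Python sketch:

import os

# TF_FORCE_GPU_ALLOW_GROWTH is read when TF creates the GPU device,
# so it must be set before TF initializes the GPU (here: before import).
os.environ['TF_FORCE_GPU_ALLOW_GROWTH'] = 'true'

import tensorflow as tf
print(tf.config.list_physical_devices('GPU'))

Exporting the same variable in the shell before cmsRun would be the equivalent for the failing workflow, and could be a quick test of whether memory pressure is the culprit.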

@fwyzard
Contributor

fwyzard commented Feb 23, 2022

How can I reproduce the failing tests?

I tried on lxplus-gpu:

/cvmfs/cms-ci.cern.ch/week1/cms-sw/cmsdist/7617/22532/install.sh
cd CMSSW_12_3_X_2022-02-20-2300/src
cmsenv
runTheMatrix.py -w gpu -l 11634.506

and it worked fine:

Preparing to run 11634.506 TTbar_14TeV+2021_Patatrack_PixelOnlyTripletsGPU+TTbar_14TeV_TuneCP5_GenSim+Digi+RecoNano+HARVESTNano
...
11634.506_TTbar_14TeV+2021_Patatrack_PixelOnlyTripletsGPU+TTbar_14TeV_TuneCP5_GenSim+Digi+RecoNano+HARVESTNano Step0-PASSED Step1-PASSED Step2-PASSED Step3-PASSED  - time date Wed Feb 23 01:40:53 2022-date Wed Feb 23 01:30:54 2022; exit: 0 0 0 0
1 1 1 1 tests passed, 0 0 0 0 failed

@tvami

tvami commented Feb 23, 2022

@fwyzard I think that's expected, because the current version of this PR switches the TF GPU support off by default.
I think this commit is the one that changed the default:
62b6e91
and indeed its diff does seem to show that the left-hand side has %define enable_cuda 1 in the tensorflow-requires.file
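
A quick way to check which flavor a given install area actually provides, assuming its Python tensorflow module is importable after cmsenv, might be:

import tensorflow as tf

# True only for a CUDA-enabled TF build; the device list additionally
# shows whether a GPU is actually visible at runtime.
print(tf.test.is_built_with_cuda())
print(tf.config.list_physical_devices('GPU'))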

@fwyzard
Contributor

fwyzard commented Feb 23, 2022 via email

@smuzaffar
Contributor Author

@fwyzard, let me merge this PR and I will open a new PR with GPU enabled, which you can use for testing.

@smuzaffar
Contributor Author

@fwyzard, tensorflow with GPU support is now available via #7648. You can use the /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmsdist/7648/22659/install.sh area to test it.

One thing to note with TF and GPU is that when a GPU is available, TF will use it. I guess that is the reason many tests failed for ppc64le ( #7617 (comment) ), as our PowerPC nodes have GPUs.
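
For reference, if we ever need to keep TF on the CPU even on a node that has a GPU (as on the ppc64le machines), the usual per-process options are to hide the device, sketched here with the Python API; CUDA_VISIBLE_DEVICES would do the same for any CUDA application.

import os

# Option 1: hide all CUDA devices from the whole process.
os.environ['CUDA_VISIBLE_DEVICES'] = ''

import tensorflow as tf

# Option 2 (alternative): let CUDA see the device but tell TF not to use any GPU.
tf.config.set_visible_devices([], 'GPU')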
