Triton test fixes #35328

kpedro88 · 2021-09-17T23:30:35Z

PR description:

Resolves #34547 (GPU IB tests)
Resolves #35206 (PPC IB tests)

PR validation:

Ran tests on machines with appropriate architectures/hardware.

cmsbuild · 2021-09-17T23:35:46Z

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-35328/25373

This PR adds an extra 20KB to repository

cmsbuild · 2021-09-17T23:36:08Z

A new Pull Request was created by @kpedro88 (Kevin Pedro) for master.

It involves the following packages:

HeterogeneousCore/SonicTriton (heterogeneous)

@makortel, @cmsbuild, @fwyzard can you please review it and eventually sign? Thanks.
@makortel, @riga, @rovere this is something you requested to watch as well.
@perrotta, @dpiparo, @qliphy you are the release manager for this.

cms-bot commands are listed here

makortel · 2021-09-17T23:41:05Z

test parameters:

enable tests = gpu
release = slc7_ppc64le_gcc9

makortel · 2021-09-17T23:41:12Z

@cmsbuild, please test

kpedro88 · 2021-09-18T00:00:35Z

@makortel the bot gave your test settings a thumbs down... I think that means there's a syntax error

makortel · 2021-09-18T00:10:48Z

test parameters:

enable_tests = gpu
release = slc7_ppc64le_gcc9

makortel · 2021-09-18T00:11:05Z

@kpedro88 Thanks, let's see if I was more successful now...

makortel · 2021-09-18T00:11:14Z

@cmsbuild, please test

cmsbuild · 2021-09-18T02:14:29Z

+1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-a6d7ce/18725/summary.html
COMMIT: 1815dec
CMSSW: CMSSW_12_1_X_2021-09-16-2300/slc7_ppc64le_gcc9
Additional Tests: GPU
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmssw/35328/18725/install.sh to create a dev area with all the needed externals and cmssw changes.

fwyzard · 2021-09-18T05:46:02Z

HeterogeneousCore/SonicTriton/test/unittest.sh

+		echo "has NVIDIA driver"
+	else
+		echo "missing (or too old) NVIDIA driver"
+		exit 0


Rather than simply bailing out, you could try using the CUDA compatibility drivers that are shipped with CMSSW.

You can see how it is implemented for the CMSSW environment set up by SCRAM in https://github.com/cms-sw/cmssw-config/blob/scramv3/SCRAM/hooks/runtime/00-nvidia-drivers .

For more (somewhat confusing) information, see https://docs.nvidia.com/deploy/cuda-compatibility/ .

kpedro88 · 2021-09-21

Thanks, this is a definite improvement. The PR is updated with this now. I have a few questions and comments that I'll post in the main PR thread.

cmsbuild · 2021-09-18T08:56:49Z

+1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-a6d7ce/18726/summary.html
COMMIT: 1815dec
CMSSW: CMSSW_12_1_X_2021-09-17-1100/slc7_amd64_gcc900
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmssw/35328/18726/install.sh to create a dev area with all the needed externals and cmssw changes.

Comparison Summary

The workflows 140.53 have different files in step1_dasquery.log than the ones found in the baseline. You may want to check and retrigger the tests if necessary. You can check it in the "files" directory in the results of the comparisons

Summary:

No significant changes to the logs found
Reco comparison results: 1299 differences found in the comparisons
DQMHistoTests: Total files compared: 39
DQMHistoTests: Total histograms compared: 3000833
DQMHistoTests: Total failures: 3671
DQMHistoTests: Total nulls: 19
DQMHistoTests: Total successes: 2997121
DQMHistoTests: Total skipped: 22
DQMHistoTests: Total Missing objects: 0
DQMHistoSizes: Histogram memory added: 45.703 KiB( 38 files compared)
DQMHistoSizes: changed ( 140.53 ): 44.531 KiB Hcal/DigiRunHarvesting
DQMHistoSizes: changed ( 140.53 ): 1.172 KiB RPC/DCSInfo
Checked 165 log files, 37 edm output root files, 39 DQM output files
TriggerResults: no differences found

cmsbuild · 2021-09-21T17:56:10Z

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-35328/25435

This PR adds an extra 28KB to repository

cmsbuild · 2021-09-21T17:56:33Z

Pull request #35328 was updated. @makortel, @cmsbuild, @fwyzard can you please check and sign again.

kpedro88 · 2021-09-21T18:19:43Z

Compatibility drivers are now working.

This actually led to a bug report for Triton, triton-inference-server/server#3382, though it's possible the actual bug should be fixed in nvidia-docker or something lower-level. For now, I've implemented a workaround in cmsTriton.

A few questions (for @fwyzard or anyone else who knows):

1. When I run `cuda-compatible-runtime -v` on a (datacenter) GPU with older drivers, it reports:
   ```
   CUDA driver version 11.2
   CUDA runtime version 11.4
   ```
   and succeeds even without putting the compatibility drivers in `LD_LIBRARY_PATH` (as [00-nvidia-drivers](https://github.com/cms-sw/cmssw-config/blob/scramv3/SCRAM/hooks/runtime/00-nvidia-drivers) does if the test fails without the compatibility drivers). Is this expected?
2. What is the right way to test for non-datacenter GPUs, for which the compatibility drivers won't work (if I understand correctly)?

fwyzard · 2021-09-21T19:22:53Z

> A few questions (for @fwyzard or anyone else who knows):
>
> 1. When I run `cuda-compatible-runtime -v` on a (datacenter) GPU with older drivers, it reports:
>    ```
>    CUDA driver version 11.2
>    CUDA runtime version 11.4
>    ```
>    and succeeds even without putting the compatibility drivers in `LD_LIBRARY_PATH` (as [00-nvidia-drivers](https://github.com/cms-sw/cmssw-config/blob/scramv3/SCRAM/hooks/runtime/00-nvidia-drivers) does if the test fails without the compatibility drivers). Is this expected?

Yes, it's expected.
I find the documentation confusing, but the gist seems to be that any CUDA 11.x runtime should work with any CUDA 11.y driver, as long as the underlying NVIDIA driver is >= 450.80.02.
This is a welcome change with respect to the previous versions of CUDA.
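
As an illustration of the driver threshold mentioned above, here is a minimal shell sketch. The function names `version_ge` and `driver_supports_cuda11_minor_compat` are hypothetical and are not taken from `unittest.sh` or `cmsTriton`; the sketch assumes GNU `sort` with `-V` is available, and it only encodes the `450.80.02` threshold quoted in this comment.

```shell
# Hypothetical helper: succeeds if dotted version $1 is >= dotted version $2.
# Assumes GNU sort, whose -V option orders version strings.
version_ge() {
    # If the smaller of the two versions is $2, then $1 >= $2.
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n1)" = "$2" ]
}

# Hypothetical check: is the NVIDIA driver new enough for CUDA 11
# minor-version compatibility (threshold quoted in the comment above)?
driver_supports_cuda11_minor_compat() {
    version_ge "$1" "450.80.02"
}

# Possible usage on a machine with a GPU (not executed here):
#   drv=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n1)
#   driver_supports_cuda11_minor_compat "$drv" && echo "minor-version compatibility available"
```

This checks only the driver version; it says nothing about whether the compatibility drivers are usable on a given GPU.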

> 2. What is the right way to test for non-datacenter GPUs, for which the compatibility drivers won't work (if I understand correctly)?

I'm not aware of any explicit way to test for it.

For the SCRAM check, I resorted to this logic:

1. First, check if the CUDA runtime bundled with CMSSW is compatible with the system libraries; if it is (due to being the same or a newer version, or due to the minor version compatibility), nothing else is needed.
2. Otherwise, check if the CUDA runtime bundled with CMSSW can be used with the compatibility drivers; if the previous check failed and this one works, we have a datacenter GPU, so add the compatibility drivers to the environment.
3. Otherwise, add the stub libraries to the environment; CUDA applications will fail, but at least we can compile and link.
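
The three-step logic above could be sketched roughly as follows. This is not the actual `00-nvidia-drivers` hook: the function name `select_cuda_driver_mode`, its arguments, and the returned labels are hypothetical. It assumes a check command that succeeds when the bundled CUDA runtime works in the current environment (for example `cuda-compatible-runtime`, as used earlier in this thread) and a directory holding the compatibility drivers.

```shell
# Hypothetical sketch of the selection logic described above.
#   $1: command that succeeds if the bundled CUDA runtime works as-is
#   $2: directory containing the compatibility drivers (assumed layout)
# Prints "system", "compat", or "stubs".
select_cuda_driver_mode() {
    local check_cmd="$1" compat_dir="$2"
    if "$check_cmd" >/dev/null 2>&1; then
        # Step 1: the system driver is already compatible; nothing to add.
        echo "system"
    elif LD_LIBRARY_PATH="$compat_dir${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}" \
            "$check_cmd" >/dev/null 2>&1; then
        # Step 2: works only with the compatibility drivers (datacenter GPU).
        echo "compat"
    else
        # Step 3: fall back to stub libraries so code can compile and link.
        echo "stubs"
    fi
}
```

A caller would then extend `LD_LIBRARY_PATH` with the compatibility or stub directory depending on the printed label.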

kpedro88 · 2021-09-21T19:48:01Z

I also find the documentation rather confusing.

From what I can tell with the Triton server, it's a bit pickier than CMSSW: it cares about driver-driver compatibility, not driver-runtime compatibility. I think the check I've implemented handles this properly.

As for non-datacenter GPUs, I guess it's fine if the test just fails on those machines for now, since it's unlikely we'll be running IB tests on them frequently. (I actually have such a GPU, but I keep its drivers up to date in order to continue Triton-related development.)

kpedro88 · 2021-09-21T21:46:29Z

please test

cmsbuild · 2021-09-21T23:24:58Z

+1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-a6d7ce/18808/summary.html
COMMIT: b96778b
CMSSW: CMSSW_12_1_X_2021-09-20-2300/slc7_ppc64le_gcc9
Additional Tests: GPU
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmssw/35328/18808/install.sh to create a dev area with all the needed externals and cmssw changes.

fwyzard · 2021-09-23T09:42:48Z

> From what I can tell with the Triton server, it's a bit pickier than CMSSW: it cares about driver-driver compatibility, not driver-runtime compatibility. I think the check I've implemented handles this properly.

Ah... then maybe Triton (or, possibly, the TensorRT backend) is implemented using the CUDA driver API, rather than the runtime API.

kpedro88 · 2021-09-23T17:26:32Z

@fwyzard @makortel any further comments or tests?

makortel · 2021-09-23T19:16:58Z

Looks ok to me

fwyzard · 2021-09-24T07:39:01Z

The changes to the main part look good.

Skipping the unit test on non-amd64 architectures doesn't seem like the intended behaviour for the unit tests - but I can see the point, since we don't have an "expected to fail" category for the tests.
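
For readers unfamiliar with this pattern, a minimal sketch of an architecture-based skip in a shell test script is shown below. It is not necessarily how this PR implements the skip; the helper name `is_amd64_arch` is hypothetical, and the sketch assumes the architecture string follows the `slc7_amd64_gcc900` / `slc7_ppc64le_gcc9` form seen in this thread.

```shell
# Hypothetical helper: succeeds only for amd64 architecture strings
# such as slc7_amd64_gcc900.
is_amd64_arch() {
    case "$1" in
        *_amd64_*) return 0 ;;
        *) return 1 ;;
    esac
}

# Possible usage at the top of a test script (not executed here):
#   if ! is_amd64_arch "$SCRAM_ARCH"; then
#       echo "skipping test on unsupported architecture $SCRAM_ARCH"
#       exit 0   # reported as a pass, since there is no "expected to fail" category
#   fi
```

Exiting with status 0 makes the skipped test indistinguishable from a passing one, which is the drawback raised in the comment above.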

fwyzard · 2021-09-24T07:39:08Z

+heterogeneous

cmsbuild · 2021-09-24T07:39:26Z

This pull request is fully signed and it will be integrated in one of the next master IBs (tests are also fine). This pull request will now be reviewed by the release team before it's merged. @perrotta, @dpiparo, @qliphy (and backports should be raised in the release meeting by the corresponding L2)

perrotta · 2021-09-24T09:43:03Z

+1

Technical
Tested by the experts

kpedro88 mentioned this pull request Sep 17, 2021

HeterogeneousCore/SonicTriton unit test failing for PPC IBs #35206

Closed

cmsbuild added this to the CMSSW_12_1_X milestone Sep 17, 2021

cmsbuild added code-checks-pending heterogeneous-pending orp-pending pending-signatures tests-pending labels Sep 17, 2021

kpedro88 mentioned this pull request Sep 17, 2021

SonicTriton test failures in GPU IBs #34547

Closed

cmsbuild added code-checks-approved and removed code-checks-pending labels Sep 17, 2021

cmsbuild added tests-started and removed tests-pending labels Sep 17, 2021

fwyzard reviewed Sep 18, 2021

View reviewed changes

cmsbuild added tests-approved and removed tests-started labels Sep 18, 2021

kpedro88 added 5 commits September 20, 2021 09:44

gpu singularity-in-singularity workaround

e19d9e3

check nvidia driver version

dcf080b

disable tests for non-amd64 archs

2f06e80

try to find compatibility drivers

0fe705f

simpler check

03d9f28

cmsbuild added code-checks-approved and removed code-checks-pending labels Sep 21, 2021

cmsbuild added tests-started and removed tests-pending labels Sep 21, 2021

cmsbuild added tests-approved and removed tests-started labels Sep 21, 2021

cmsbuild added fully-signed heterogeneous-approved and removed pending-signatures heterogeneous-pending labels Sep 24, 2021

fwyzard mentioned this pull request Sep 24, 2021

add support for "expected to fail" tests ? #35395

Open

cmsbuild added orp-approved and removed orp-pending labels Sep 24, 2021

cmsbuild merged commit 7c3f222 into cms-sw:master Sep 24, 2021

cmsbuild mentioned this pull request Sep 25, 2021

Add NanoAOD DQM to RelVal wfs #35412

Merged

kpedro88 mentioned this pull request May 2, 2022

Triton fallback server failure in workflow 10805.31 step 3 #37767

Open
