
Fix for crash in gpuClusterChargeCut.h after "warning too many clusters ..." #35829

Closed

Conversation

czangela
Contributor

PR description:

This issue occurs when running on TTbar data generated with Run 3 end-of-2024 conditions.
The full dataset can be found at https://cmsweb.cern.ch/das/request?instance=prod/phys03&input=file+dataset%3D%2FTTbar%2Ftsusa-TTbar_14TeV_GSDR_2024_121X_mcRun3_2024_realistic_v8_10k-58ae893e2544746328b272cd68e5d2f1%2FUSER. A trimmed file containing the crashing event is available as step2_2.root in /afs/cern.ch/work/a/aczirkos/public/gpu_charge_crash_10_25/; running on it is enough to reproduce the error.

When there are too many clusters in a module, the local reconstruction fails in gpuClusterChargeCut.h with the following error:

Warning too many clusters in module 824 in block 369: 2932 > 1024
terminate called after throwing an instance of 'std::runtime_error'
  what():  
/data/cmsbld/jenkins/workspace/build-any-ib/w/tmp/BUILDROOT/16fe195e1e17677c5420a19d749bdb25/opt/cmssw/slc7_amd64_gcc900/cms/cmssw/CMSSW_12_1_X_2021-10-24-2300/src/HeterogeneousCore/CUDAUtilities/src/CachingHostAllocator.h, line 543:
cudaCheck(error = cudaEventRecord(search_key.ready_event, search_key.associated_stream));
cudaErrorIllegalAddress: an illegal memory access was encountered

There is a full log at /afs/cern.ch/work/a/aczirkos/public/gpu_charge_crash_10_25/log.

This is due to an indexing bug: we first set clusterId[i] = invalidModuleId, and then use this value to index the newClusId array, which has size maxNumClustersPerModules. Since invalidModuleId > maxNumClustersPerModules, this is an out-of-bounds access.

This PR fixes this error.
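The pattern behind the bug can be illustrated with a small host-side C++ sketch (the names and values below are taken loosely from the description above; the actual CUDA kernel and constants in CMSSW differ):

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Illustrative constants, not the exact CMSSW values.
constexpr uint16_t invalidClusterId = 9999;
constexpr int maxNumClustersPerModules = 1024;

// Buggy pattern: the sentinel marking an invalid cluster is later
// used as an index into an array of size maxNumClustersPerModules.
int buggyLookup(const std::vector<int>& newClusId, uint16_t clusterId) {
  // Out of bounds when clusterId == invalidClusterId (9999 > 1024).
  return newClusId[clusterId];
}

// Fixed pattern: skip invalid entries before indexing.
int fixedLookup(const std::vector<int>& newClusId, uint16_t clusterId) {
  if (clusterId == invalidClusterId)
    return -1;  // leave invalid clusters untouched
  assert(clusterId < maxNumClustersPerModules);
  return newClusId[clusterId];
}
```

On the GPU the out-of-bounds write corrupts device memory, which is why the failure only surfaces later as a cudaErrorIllegalAddress in an unrelated call.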

PR validation:

This is a short recipe for reproducing the error and checking that this PR fixes it.
On a GPU-equipped machine:

cmsrel CMSSW_12_1_X_2021-10-24-2300
cd CMSSW_12_1_X_2021-10-24-2300/src
cmsenv
git cms-addpkg RecoLocalTracker/SiPixelClusterizer
mkdir ../runtests
cd ../runtests
rsync -avzzr aczirkos@lxplus.cern.ch:/afs/cern.ch/work/a/aczirkos/public/gpu_charge_crash_10_25/* .
# run without fix
cmsRun step3.py
# apply fix and run with it
cd -
git apply ../runtests/fix.patch
scram b -j
cd -
cmsRun step3.py

If this PR is a backport, please specify the original PR and why you need to backport that PR:

kindly pinging @tsusa @connorpa

@cmsbuild
Contributor

-code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-35829/26191

  • This PR adds an extra 12KB to the repository

  • There are other open pull requests which might conflict with the changes you have proposed:

Code checks found code style and quality issues that could be resolved by applying the following patch(es):

@VinInn
Contributor

VinInn commented Oct 25, 2021

Thanks for the fix.
I think that the new version proposed in
https://github.com/cms-sw/cmssw/pull/35598/files#diff-dc414af4275c81d976ca3bb89ac05527f0a84df4fbc5c549368d50e4721e897aR111
should not suffer from this defect.

@slava77
Contributor

slava77 commented Oct 25, 2021

> Thanks for the fix. I think that the new version proposed in https://github.com/cms-sw/cmssw/pull/35598/files#diff-dc414af4275c81d976ca3bb89ac05527f0a84df4fbc5c549368d50e4721e897aR111 should not suffer from this defect.

Does that mean this PR is not needed? (Assuming we sort out the unexpected diffs in the #35598 DQM plots.)

@VinInn
Contributor

VinInn commented Oct 25, 2021

I'm checking further. It seems there can be other issues with that event, which has a huge occupancy in one module.

@VinInn
Contributor

VinInn commented Oct 25, 2021

The clusterizer is nominally limited to 6000 pixels, BUT in reality the max is 16 * blockSize, and blockSize is now 256...
As usual: there is an assert (k < maxiter), BUT the assert is not compiled in.
The max was raised
// 6000 max pixels required for HI operations with no measurable impact on pp performance
without checking the consequences...

In the event in question there are 4340 pixels in the module.
So we should either move maxpixel back to 4K, increase maxiter, or increase the block size to 512.
And for sure let's add a warning in the clusterizer!
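The capacity mismatch can be checked with a few lines of arithmetic (the constant names below are assumptions based on the comment above, not the exact identifiers in the clusterizer):

```cpp
#include <cassert>

// Assumed values from the discussion above; actual CMSSW names differ.
constexpr int maxIter = 16;      // per-thread loop iterations in the kernel
constexpr int blockSize = 256;   // current CUDA block size
constexpr int maxPixels = 6000;  // nominal cap, raised for heavy-ion running

// Effective capacity is maxIter * blockSize, not the nominal maxPixels:
constexpr int effectiveMax = maxIter * blockSize;  // 16 * 256 = 4096

// The kernel's guard is an assert(k < maxIter), which disappears in
// release builds compiled with -DNDEBUG, so the overrun goes unnoticed.
static_assert(effectiveMax < maxPixels, "nominal cap can never be reached");
```

With 4340 pixels in one module the effective capacity of 4096 is exceeded, while raising the block size to 512 would give 16 * 512 = 8192, enough to honor the nominal 6000-pixel cap.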

@VinInn
Contributor

VinInn commented Oct 25, 2021

This commit (or similar) is required as well:
58fd079

@VinInn
Contributor

VinInn commented Oct 25, 2021

Besides the bugs (which should not be there), I would like to understand what kind of events these are:
very few hits and tracks; no PU, I assume/hope.

So: what is causing such an anomalous occupancy in just one module?

@cmsbuild
Contributor

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-35829/26213

@cmsbuild
Contributor

A new Pull Request was created by @czangela for master.

It involves the following packages:

  • RecoLocalTracker/SiPixelClusterizer (reconstruction)

@jpata, @cmsbuild, @slava77 can you please review it and eventually sign? Thanks.
@mtosi, @felicepantaleo, @GiacomoSguazzoni, @JanFSchulte, @rovere, @VinInn, @OzAmram, @ferencek, @dkotlins, @gpetruc, @mmusich, @threus, @tvami this is something you requested to watch as well.
@perrotta, @dpiparo, @qliphy you are the release manager for this.

cms-bot commands are listed here

@czangela
Contributor Author

I will close this PR since #35835 contains the necessary fixes.
@tsusa can give a better comment on the nature of the events, but as far as I understand there should be PU.
Apart from that, they are ttbar events with 2024 conditions. We talked about applying dynamic inefficiency to be able to better test some features, but didn't in the end.

@czangela closed this on Oct 26, 2021
@VinInn
Contributor

VinInn commented Oct 26, 2021

This PR may be useful as a backport.
