
Fix for crash in gpuClusterChargeCut.h after "warning too many clusters ..." #35829

Closed

Conversation

czangela
Contributor

PR description:

This issue occurs when running on TTbar data generated with Run 3 end-of-2024 conditions.
The full dataset can be found at https://cmsweb.cern.ch/das/request?instance=prod/phys03&input=file+dataset%3D%2FTTbar%2Ftsusa-TTbar_14TeV_GSDR_2024_121X_mcRun3_2024_realistic_v8_10k-58ae893e2544746328b272cd68e5d2f1%2FUSER. A trimmed file containing the crashing event is available as step2_2.root in /afs/cern.ch/work/a/aczirkos/public/gpu_charge_crash_10_25/; running on it is enough to reproduce the error.

When there are too many clusters in a module, the local reconstruction fails in gpuClusterChargeCut.h with the following error:

Warning too many clusters in module 824 in block 369: 2932 > 1024
terminate called after throwing an instance of 'std::runtime_error'
  what():  
/data/cmsbld/jenkins/workspace/build-any-ib/w/tmp/BUILDROOT/16fe195e1e17677c5420a19d749bdb25/opt/cmssw/slc7_amd64_gcc900/cms/cmssw/CMSSW_12_1_X_2021-10-24-2300/src/HeterogeneousCore/CUDAUtilities/src/CachingHostAllocator.h, line 543:
cudaCheck(error = cudaEventRecord(search_key.ready_event, search_key.associated_stream));
cudaErrorIllegalAddress: an illegal memory access was encountered

There is a full log at /afs/cern.ch/work/a/aczirkos/public/gpu_charge_crash_10_25/log.

This is due to an indexing bug: we first set clusterId[i] = invalidModuleId, and then use this value to index the newClusId array, which has size maxNumClustersPerModules. Since invalidModuleId > maxNumClustersPerModules, this is an out-of-bounds access.

This PR fixes this error.
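The pattern behind the bug can be illustrated with a small host-side C++ sketch (the names and values below are taken loosely from the description above; the actual CUDA kernel and constants in CMSSW differ):

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Illustrative constants, not the exact CMSSW values.
constexpr uint16_t invalidClusterId = 9999;
constexpr int maxNumClustersPerModules = 1024;

// Buggy pattern: the sentinel marking an invalid cluster is later
// used as an index into an array of size maxNumClustersPerModules.
int buggyLookup(const std::vector<int>& newClusId, uint16_t clusterId) {
  // Out of bounds when clusterId == invalidClusterId (9999 > 1024).
  return newClusId[clusterId];
}

// Fixed pattern: skip invalid entries before indexing.
int fixedLookup(const std::vector<int>& newClusId, uint16_t clusterId) {
  if (clusterId == invalidClusterId)
    return -1;  // leave invalid clusters untouched
  assert(clusterId < maxNumClustersPerModules);
  return newClusId[clusterId];
}
```

On the GPU the out-of-bounds write corrupts device memory, which is why the failure only surfaces later as a cudaErrorIllegalAddress in an unrelated call.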

PR validation:

This is a short recipe for reproducing the error and checking that this PR fixes it.
On a GPU-equipped machine:

cmsrel CMSSW_12_1_X_2021-10-24-2300
cd CMSSW_12_1_X_2021-10-24-2300/src
cmsenv
git cms-addpkg RecoLocalTracker/SiPixelClusterizer
mkdir ../runtests
cd ../runtests
rsync -avzzr aczirkos@lxplus.cern.ch:/afs/cern.ch/work/a/aczirkos/public/gpu_charge_crash_10_25/* .
# run without fix
cmsRun step3.py
# apply fix and run with it
cd -
git apply ../runtests/fix.patch
scram b -j
cd -
cmsRun step3.py

If this PR is a backport, please specify the original PR and why you need to backport that PR:

kindly pinging @tsusa @connorpa

@cmsbuild
Contributor

-code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-35829/26191

  • This PR adds an extra 12KB to the repository

  • There are other open pull requests which might conflict with the changes you have proposed:

Code checks found code style and quality issues that could be resolved by applying the following patch(es):

@VinInn
Contributor

VinInn commented Oct 25, 2021

Thanks for the fix.
I think that the new version proposed in
https://github.com/cms-sw/cmssw/pull/35598/files#diff-dc414af4275c81d976ca3bb89ac05527f0a84df4fbc5c549368d50e4721e897aR111
should not suffer from this defect.

@slava77
Contributor

slava77 commented Oct 25, 2021

> Thanks for the fix. I think that the new version proposed in https://github.com/cms-sw/cmssw/pull/35598/files#diff-dc414af4275c81d976ca3bb89ac05527f0a84df4fbc5c549368d50e4721e897aR111 should not suffer from this defect.

Does that mean this PR is not needed? (Assuming we sort out the unexpected diffs in the #35598 DQM plots.)

@VinInn
Contributor

VinInn commented Oct 25, 2021

I'm checking further. It seems there can be other issues with that event, which has a huge occupancy in one module.

@VinInn
Contributor

VinInn commented Oct 25, 2021

The clusterizer is nominally limited to 6000 pixels, BUT in reality the max is 16 * blockSize, and blockSize is now 256...
As usual: there is an assert (k < maxiter), BUT the assert is not compiled in.
The max was raised
// 6000 max pixels required for HI operations with no measurable impact on pp performance
without checking the consequences...

In the event in question there are 4340 pixels in the module.
So we should either move maxpixel back to 4K, increase maxiter, or increase the block size to 512.
And for sure let's add a warning in the clusterizer!
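The capacity mismatch can be checked with a few lines of arithmetic (the constant names below are assumptions based on the comment above, not the exact identifiers in the clusterizer):

```cpp
#include <cassert>

// Assumed values from the discussion above; actual CMSSW names differ.
constexpr int maxIter = 16;      // per-thread loop iterations in the kernel
constexpr int blockSize = 256;   // current CUDA block size
constexpr int maxPixels = 6000;  // nominal cap, raised for heavy-ion running

// Effective capacity is maxIter * blockSize, not the nominal maxPixels:
constexpr int effectiveMax = maxIter * blockSize;  // 16 * 256 = 4096

// The kernel's guard is an assert(k < maxIter), which disappears in
// release builds compiled with -DNDEBUG, so the overrun goes unnoticed.
static_assert(effectiveMax < maxPixels, "nominal cap can never be reached");
```

With 4340 pixels in one module the effective capacity of 4096 is exceeded, while raising the block size to 512 would give 16 * 512 = 8192, enough to honor the nominal 6000-pixel cap.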

@VinInn
Contributor

VinInn commented Oct 25, 2021

This commit (or similar) is required as well:
58fd079

@VinInn
Contributor

VinInn commented Oct 25, 2021

Besides the bugs (which should not be there), I would like to understand what kind of events these are:
very few hits and tracks; no PU, I assume/hope.

So: what is causing such an anomalous occupancy in just one module?

@cmsbuild
Contributor

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-35829/26213

@cmsbuild
Contributor

A new Pull Request was created by @czangela for master.

It involves the following packages:

  • RecoLocalTracker/SiPixelClusterizer (reconstruction)

@jpata, @cmsbuild, @slava77 can you please review it and eventually sign? Thanks.
@mtosi, @felicepantaleo, @GiacomoSguazzoni, @JanFSchulte, @rovere, @VinInn, @OzAmram, @ferencek, @dkotlins, @gpetruc, @mmusich, @threus, @tvami this is something you requested to watch as well.
@perrotta, @dpiparo, @qliphy you are the release manager for this.

cms-bot commands are listed here

@czangela
Contributor Author

I will close this PR since #35835 contains the necessary fixes.
@tsusa can give a better comment on the nature of the events, but as far as I understand there should be PU.
Apart from that, they are ttbar events with 2024 conditions. We talked about applying dynamic inefficiency to be able to better test some features, but didn't in the end.

@czangela closed this on Oct 26, 2021
@VinInn
Contributor

VinInn commented Oct 26, 2021

This PR may be useful as a backport.
