HLT Farm crashes in run 378940 #44634
Comments
cms-bot internal usage |
A new Issue was created by @wonpoint4. @makortel, @sextonkennedy, @antoniovilela, @Dr15Jones, @smuzaffar, @rappoccio can you please review it and eventually sign/assign? Thanks. cms-bot commands are listed here |
assign hlt, heterogeneous |
New categories assigned: hlt,heterogeneous @Martin-Grunewald,@mmusich,@fwyzard,@makortel you have been requested to review this Pull request/Issue and eventually sign? Thanks |
Executing the reproducer, I see the following stack:
@cms-sw/pf-l2 FYI |
From the stack trace, it seems that an exception was thrown while another exception was being handled.
@mmusich, if you have time to look into this further, could you try running with a single stream / single thread, and post the full stack trace? |
Sure. Adding to the configuration file, I get the following stack, attached: crash_run378940.log |
Thanks. So, "cudaErrorIllegalAddress" is basically the GPU equivalent of "Segmentation violation" :-( What happens with the stack trace is that, once we hit a CUDA error, we raise an exception and start unwinding the stack. While doing that we try to free some CUDA memory, but that call also fails (because, after a cudaErrorIllegalAddress, the CUDA context is left in a corrupted state and every subsequent CUDA call returns an error). Of course this doesn't explain the reason for the error that we hit in the first place... that will need to be debugged. |
Here's a second reproducer (same input events). I see the seg-fault when running on CPU only, too.

```bash
#!/bin/bash -ex
# CMSSW_14_0_4
hltGetConfiguration run:378940 \
  --globaltag 140X_dataRun3_HLT_v3 \
  --data \
  --no-prescale \
  --no-output \
  --max-events -1 \
  --input /store/group/tsg/FOG/debug/240405_run378940/files/run378940_ls0021_index000036_fu-c2b02-31-01_pid1363776.root \
  > hlt.py

cat <<@EOF >> hlt.py
process.options.wantSummary = True
process.options.numberOfThreads = 1
process.options.numberOfStreams = 0
process.options.accelerators = ["*"]
@EOF

CUDA_LAUNCH_BLOCKING=1 \
cmsRun hlt.py &> hlt.log
```

Stack trace here: hlt.log.
|
type pf |
would running in |
the trace was more informative when recompiled with |
Just to note that (see #44634 (comment)) I get a crash even on CPU, so I suspect the issue is unrelated to CUDA or GPUs (but it should be double-checked, of course). In that case, the title of the issue should be updated. @wonpoint4 |
I was wondering if the warning I reported above, generated here: cmssw/RecoParticleFlow/PFClusterProducer/plugins/alpaka/PFClusterSoAProducerKernel.dev.cc (lines 1308 to 1311 in f5861db), might give hints. |
It sort of makes sense to me. I am still investigating in the PF Alpaka kernel, since this number of rechit fractions seems strangely large when the preceding events look more reasonable. |
I'm guessing that
@jsamudio could you check what is the actual SoA size in the event where the crash happens? If this overflow is the cause of the crash, what can be done to avoid it? |
In the event where we see the crash we have
As for adding an error and skipping the event, I understand the idea, but I don't know if I've seen an example of something similar to this before. Perhaps someone else has and could point me to an implementation? |
As a quick workaround, would it work to increase the 120 to something like 250 in the HLT menu? Not as a long-term solution, but to eliminate, or at least reduce, the online crashes while a better solution is being investigated. |
Would this entail a configuration change, or a change in the code (a new online release)? |
I think it's a configuration parameter.
|
Answering myself:

```diff
 process.hltParticleFlowClusterHBHESoA = cms.EDProducer( "PFClusterSoAProducer@alpaka",
     pfRecHits = cms.InputTag( "hltParticleFlowRecHitHBHESoA" ),
     pfClusterParams = cms.ESInputTag( "hltESPPFClusterParams","" ),
     topology = cms.ESInputTag( "hltESPPFRecHitHCALTopology","" ),
     synchronise = cms.bool( False ),
-    pfRecHitFractionAllocation = cms.int32( 120 ),
+    pfRecHitFractionAllocation = cms.int32( 250 ),
     alpaka = cms.untracked.PSet( backend = cms.untracked.string( "" ) )
 )
```
|
FTR, I double-checked that #44634 (comment) avoids the crash in the reproducer, and the HLT throughput is not affected, so it looks like a good short-term solution. Two extra notes.
|
I took a stab at having the error(s) reported properly via exceptions rather than crashes (caused by exceptions being thrown during stack unwinding that was itself triggered by an earlier exception). #44730 should improve the situation. While developing the PR I started to wonder if an Alpaka-specific (or GPU-runtime-specific?) exception type would be useful. |
For the record, this was also tracked at https://its.cern.ch/jira/browse/CMSHLT-3144 |
Reporting the large number of GPU-related HLT crashes last night (elog). Here's the recipe to reproduce the crashes (tested with CMSSW_14_0_4 on lxplus8-gpu).
@cms-sw/hlt-l2 FYI
@cms-sw/heterogeneous-l2 FYI