
HLT Farm crashes in run 378940 #44634

Open
wonpoint4 opened this issue Apr 5, 2024 · 24 comments

wonpoint4 commented Apr 5, 2024

Reporting the large number of GPU-related HLT crashes last night (elog).

  • Related to illegal memory access
  • Not fully understood as HLT menus were unchanged with respect to the previous runs

Here's the recipe to reproduce the crashes (tested with CMSSW_14_0_4 on lxplus8-gpu):

cmsrel CMSSW_14_0_4
cd CMSSW_14_0_4/src
cmsenv

https_proxy=http://cmsproxy.cms:3128 hltConfigFromDB --runNumber 378940 > hlt_run378940.py
cat <<@EOF >> hlt_run378940.py
from EventFilter.Utilities.EvFDaqDirector_cfi import EvFDaqDirector as _EvFDaqDirector
process.EvFDaqDirector = _EvFDaqDirector.clone(
    buBaseDir = '/eos/cms/store/group/phys_muon/wjun/error_stream',
    runNumber = 378940
)
from EventFilter.Utilities.FedRawDataInputSource_cfi import source as _source
process.source = _source.clone(
    fileListMode = True,
    fileNames = (
        '/eos/cms/store/group/phys_muon/wjun/error_stream/run378940/run378940_ls0021_index000036_fu-c2b02-31-01_pid1363776.raw',
    )
)
process.options.wantSummary = True

process.options.numberOfThreads = 1
process.options.numberOfStreams = 0
@EOF

mkdir run378940
cmsRun hlt_run378940.py &> crash_run378940.log

@cms-sw/hlt-l2 FYI
@cms-sw/heterogeneous-l2 FYI

cmsbuild commented Apr 5, 2024

cms-bot internal usage

cmsbuild commented Apr 5, 2024

A new Issue was created by @wonpoint4.

@makortel, @sextonkennedy, @antoniovilela, @Dr15Jones, @smuzaffar, @rappoccio can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

mmusich commented Apr 5, 2024

assign hlt, heterogeneous

cmsbuild commented Apr 5, 2024

New categories assigned: hlt,heterogeneous

@Martin-Grunewald,@mmusich,@fwyzard,@makortel you have been requested to review this Pull request/Issue and eventually sign? Thanks

mmusich commented Apr 5, 2024

Executing the reproducer with CUDA_LAUNCH_BLOCKING=1, I see the following stack:

Fri Apr  5 15:43:25 CEST 2024
Thread 12 (Thread 0x7f8afb1ff640 (LWP 1360807) "cmsRun"):
#0  0x00007f8b5934291f in poll () from /lib64/libc.so.6
#1  0x00007f8b52e5a62f in full_read.constprop () from /cvmfs/cms.cern.ch/el9_amd64_gcc12/cms/cmssw/CMSSW_14_0_4/lib/el9_amd64_gcc12/pluginFWCoreServicesPlugins.so
#2  0x00007f8b52e0ee3c in edm::service::InitRootHandlers::stacktraceFromThread() () from /cvmfs/cms.cern.ch/el9_amd64_gcc12/cms/cmssw/CMSSW_14_0_4/lib/el9_amd64_gcc12/pluginFWCoreServicesPlugins.so
#3  0x00007f8b52e0f7a0 in sig_dostack_then_abort () from /cvmfs/cms.cern.ch/el9_amd64_gcc12/cms/cmssw/CMSSW_14_0_4/lib/el9_amd64_gcc12/pluginFWCoreServicesPlugins.so
#4  <signal handler called>
#5  0x00007f8b592a154c in __pthread_kill_implementation () from /lib64/libc.so.6
#6  0x00007f8b59254d06 in raise () from /lib64/libc.so.6
#7  0x00007f8b592287f3 in abort () from /lib64/libc.so.6
#8  0x00007f8b596aeeea in __gnu_cxx::__verbose_terminate_handler () at ../../../../libstdc++-v3/libsupc++/vterminate.cc:50
#9  0x00007f8b596ace6a in __cxxabiv1::__terminate (handler=<optimized out>) at ../../../../libstdc++-v3/libsupc++/eh_terminate.cc:48
#10 0x00007f8b596abed9 in __cxa_call_terminate (ue_header=0x7f8af8d143c0) at ../../../../libstdc++-v3/libsupc++/eh_call.cc:54
#11 0x00007f8b596ac5f6 in __cxxabiv1::__gxx_personality_v0 (version=<optimized out>, actions=6, exception_class=5138137972254386944, ue_header=<optimized out>, context=0x7f8afb1f88e0) at ../../../../libstdc++-v3/libsupc++/eh_personality.cc:688
#12 0x00007f8b5a11b864 in _Unwind_RaiseException_Phase2 (exc=0x7f8af8d143c0, context=0x7f8afb1f88e0, frames_p=0x7f8afb1f87e8) at ../../../libgcc/unwind.inc:64
#13 0x00007f8b5a11c2bd in _Unwind_Resume (exc=0x7f8af8d143c0) at ../../../libgcc/unwind.inc:242
#14 0x00007f8af1e2e5aa in cms::alpakatools::CachingAllocator<alpaka::DevCpu, alpaka::uniform_cuda_hip::detail::QueueUniformCudaHipRt<alpaka::ApiCudaRt, false> >::free(void*) [clone .cold] () from /cvmfs/cms.cern.ch/el9_amd64_gcc12/cms/cmssw/CMSSW_14_0_4/lib/el9_amd64_gcc12/pluginRecoParticleFlowPFClusterProducersPluginsPortableCudaAsync.so
#15 0x00007f8af1e3bc68 in std::_Sp_counted_ptr_inplace<alpaka::detail::BufCpuImpl<std::byte, std::integral_constant<unsigned long, 1ul>, unsigned int>, std::allocator<void>, (__gnu_cxx::_Lock_policy)2>::_M_dispose() () from /cvmfs/cms.cern.ch/el9_amd64_gcc12/cms/cmssw/CMSSW_14_0_4/lib/el9_amd64_gcc12/pluginRecoParticleFlowPFClusterProducersPluginsPortableCudaAsync.so
#16 0x00007f8af1e30f17 in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() () from /cvmfs/cms.cern.ch/el9_amd64_gcc12/cms/cmssw/CMSSW_14_0_4/lib/el9_amd64_gcc12/pluginRecoParticleFlowPFClusterProducersPluginsPortableCudaAsync.so
#17 0x00007f8af1e500bc in std::_Sp_counted_ptr_inplace<std::tuple<PortableHostCollection<reco::PFRecHitFractionSoALayout<128ul, false> >, std::shared_ptr<alpaka_cuda_async::EDMetadata> >, std::allocator<void>, (__gnu_cxx::_Lock_policy)2>::_M_dispose() () from /cvmfs/cms.cern.ch/el9_amd64_gcc12/cms/cmssw/CMSSW_14_0_4/lib/el9_amd64_gcc12/pluginRecoParticleFlowPFClusterProducersPluginsPortableCudaAsync.so
#18 0x00007f8af1e30f17 in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() () from /cvmfs/cms.cern.ch/el9_amd64_gcc12/cms/cmssw/CMSSW_14_0_4/lib/el9_amd64_gcc12/pluginRecoParticleFlowPFClusterProducersPluginsPortableCudaAsync.so
#19 0x00007f8af1e4b4bc in std::any::_Manager_external<std::shared_ptr<std::tuple<PortableHostCollection<reco::PFRecHitFractionSoALayout<128ul, false> >, std::shared_ptr<alpaka_cuda_async::EDMetadata> > > >::_S_manage(std::any::_Op, std::any const*, std::any::_Arg*) () from /cvmfs/cms.cern.ch/el9_amd64_gcc12/cms/cmssw/CMSSW_14_0_4/lib/el9_amd64_gcc12/pluginRecoParticleFlowPFClusterProducersPluginsPortableCudaAsync.so
#20 0x00007f8b5b6186ca in std::_Sp_counted_ptr_inplace<std::any, std::allocator<void>, (__gnu_cxx::_Lock_policy)2>::_M_dispose() () from /cvmfs/cms.cern.ch/el9_amd64_gcc12/cms/cmssw/CMSSW_14_0_4/lib/el9_amd64_gcc12/libFWCoreFramework.so

@cms-sw/pf-l2 FYI

fwyzard commented Apr 5, 2024

From the stack trace, it seems that an exception was thrown while another exception was being handled:

#6  0x00007f8b59254d06 in raise () from /lib64/libc.so.6

while

#12 0x00007f8b5a11b864 in _Unwind_RaiseException_Phase2 (exc=0x7f8af8d143c0, context=0x7f8afb1f88e0, frames_p=0x7f8afb1f87e8) at ../../../libgcc/unwind.inc:64
#13 0x00007f8b5a11c2bd in _Unwind_Resume (exc=0x7f8af8d143c0) at ../../../libgcc/unwind.inc:242

@mmusich, if you have time to look into this further, could you try running with a single stream / single thread, and post the full stack trace?

mmusich commented Apr 5, 2024

if you have time to look into this further, could you try running with a single stream / single thread, and post the full stack trace?

Sure. Adding to the configuration file

process.options.numberOfThreads = 1
process.options.numberOfStreams = 1

I get the full stack trace, attached here: crash_run378940.log
Right before the stack trace, I notice:

At the end of topoClusterContraction, found large *pcrhFracSize = 2220194
/data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_0_4-el9_amd64_gcc12/build/CMSSW_14_0_4-build/el9_amd64_gcc12/external/alpaka/1.1.0-1dfa0fea4735fa1aac182d6acd03b4c8/include/alpaka/event/EventUniformCudaHipRt.hpp(66) 'TApi::eventDestroy(m_UniformCudaHipEvent)' returned error  : 'cudaErrorIllegalAddress': 'an illegal memory access was encountered'!
/data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_0_4-el9_amd64_gcc12/build/CMSSW_14_0_4-build/el9_amd64_gcc12/external/alpaka/1.1.0-1dfa0fea4735fa1aac182d6acd03b4c8/include/alpaka/mem/buf/BufUniformCudaHipRt.hpp(356) 'TApi::hostFree(ptr)' returned error  : 'cudaErrorIllegalAddress': 'an illegal memory access was encountered'!
terminate called after throwing an instance of 'std::runtime_error'
  what():  /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_0_4-el9_amd64_gcc12/build/CMSSW_14_0_4-build/el9_amd64_gcc12/external/alpaka/1.1.0-1dfa0fea4735fa1aac182d6acd03b4c8/include/alpaka/event/EventUniformCudaHipRt.hpp(160) 'TApi::eventRecord(event.getNativeHandle(), queue.getNativeHandle())' returned error  : 'cudaErrorIllegalAddress': 'an illegal memory access was encountered'!
/data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_0_4-el9_amd64_gcc12/build/CMSSW_14_0_4-build/el9_amd64_gcc12/external/alpaka/1.1.0-1dfa0fea4735fa1aac182d6acd03b4c8/include/alpaka/event/EventUniformCudaHipRt.hpp(66) 'TApi::eventDestroy(m_UniformCudaHipEvent)' returned error  : 'cudaErrorIllegalAddress': 'an illegal memory access was encountered'!

fwyzard commented Apr 5, 2024

Thanks.

So, "cudaErrorIllegalAddress" is basically the GPU equivalent of "Segmentation violation" :-(

What happens in the stack trace is that once we hit a CUDA error, we raise an exception and start unwinding the stack. While doing that, we try to free some CUDA memory, but that call also fails (because the cudaErrorIllegalAddress state is still present), which triggers a second exception. The second exception cannot be handled, which causes the abort.

Of course this doesn't explain the reason for the error that we hit in the first place... that will need to be debugged.
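
For illustration only (not CMSSW code), here is a minimal C++ sketch of this failure mode: a cleanup call that throws while the stack is already unwinding from a first exception ends the program via std::terminate, which is the abort seen above. The DeviceBuffer type and the error strings are made up.

#include <cstdio>
#include <stdexcept>

// Stand-in for a buffer whose cleanup has to call back into the CUDA runtime;
// once the device is in the cudaErrorIllegalAddress state, that call fails too.
struct DeviceBuffer {
  ~DeviceBuffer() noexcept(false) {
    throw std::runtime_error("cudaErrorIllegalAddress while freeing device memory");
  }
};

void runKernel() {
  DeviceBuffer buf;
  // first failure: the kernel hit an illegal memory access
  throw std::runtime_error("cudaErrorIllegalAddress from kernel launch");
  // unwinding destroys 'buf'; its destructor throws a second exception while
  // the first one is still in flight, so the runtime calls std::terminate -> abort
}

int main() {
  try {
    runKernel();
  } catch (std::exception const& e) {
    std::printf("never reached: %s\n", e.what());  // terminate fires before this
  }
  return 0;
}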

missirol commented Apr 5, 2024

Here's a second reproducer (same input events). I see the segfault also when running on CPU only.

#!/bin/bash -ex

# CMSSW_14_0_4

hltGetConfiguration run:378940 \
  --globaltag 140X_dataRun3_HLT_v3 \
  --data \
  --no-prescale \
  --no-output \
  --max-events -1 \
  --input /store/group/tsg/FOG/debug/240405_run378940/files/run378940_ls0021_index000036_fu-c2b02-31-01_pid1363776.root \
  > hlt.py

cat <<@EOF >> hlt.py
process.options.wantSummary = True

process.options.numberOfThreads = 1
process.options.numberOfStreams = 0

process.options.accelerators = ["*"]
@EOF

CUDA_LAUNCH_BLOCKING=1 \
cmsRun hlt.py &> hlt.log

Stack trace here: hlt.log.

Thread 1 (Thread 0x7f44a0bac640 (LWP 3012403) "cmsRun"):
#0  0x00007f44a1779301 in poll () from /lib64/libc.so.6
#1  0x00007f44967d56af in full_read.constprop () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_4/lib/el8_amd64_gcc12/pluginFWCoreServicesPlugins.so
#2  0x00007f4496789dbc in edm::service::InitRootHandlers::stacktraceFromThread() () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_4/lib/el8_amd64_gcc12/pluginFWCoreServicesPlugins.so
#3  0x00007f449678a720 in sig_dostack_then_abort () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_4/lib/el8_amd64_gcc12/pluginFWCoreServicesPlugins.so
#4  <signal handler called>
#5  0x00007f445fc94340 in void alpaka_serial_sync::FastCluster::operator()<false, alpaka::AccCpuSerial<std::integral_constant<unsigned long, 1ul>, unsigned int>, std::enable_if<false, void> >(alpaka::AccCpuSerial<std::integral_constant<unsigned long, 1ul>, unsigned int> const&, reco::PFRecHitSoALayout<128ul, false>::ConstViewTemplateFreeParams<128ul, false, true, false>, reco::PFClusterParamsSoALayout<128ul, false>::ConstViewTemplateFreeParams<128ul, false, true, false>, reco::PFRecHitHCALTopologySoALayout<128ul, false>::ConstViewTemplateFreeParams<128ul, false, true, false>, reco::PFClusteringVarsSoALayout<128ul, false>::ViewTemplateFreeParams<128ul, false, true, false>, reco::PFClusterSoALayout<128ul, false>::ViewTemplateFreeParams<128ul, false, true, false>, reco::PFRecHitFractionSoALayout<128ul, false>::ViewTemplateFreeParams<128ul, false, true, false>) const [clone .constprop.0] [clone .isra.0] () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_4/lib/el8_amd64_gcc12/pluginRecoParticleFlowPFClusterProducersPluginsPortableSerialSync.so
#6  0x00007f445fc95904 in alpaka::TaskKernelCpuSerial<std::integral_constant<unsigned long, 1ul>, unsigned int, alpaka_serial_sync::FastCluster, reco::PFRecHitSoALayout<128ul, false>::ConstViewTemplateFreeParams<128ul, false, true, false> const&, reco::PFClusterParamsSoALayout<128ul, false>::ConstViewTemplateFreeParams<128ul, false, true, false> const&, reco::PFRecHitHCALTopologySoALayout<128ul, false>::ConstViewTemplateFreeParams<128ul, false, true, false> const&, reco::PFClusteringVarsSoALayout<128ul, false>::ViewTemplateFreeParams<128ul, false, true, false>&, reco::PFClusterSoALayout<128ul, false>::ViewTemplateFreeParams<128ul, false, true, false>&, reco::PFRecHitFractionSoALayout<128ul, false>::ViewTemplateFreeParams<128ul, false, true, false>&>::operator()() const () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_4/lib/el8_amd64_gcc12/pluginRecoParticleFlowPFClusterProducersPluginsPortableSerialSync.so
#7  0x00007f445fc9735f in alpaka_serial_sync::PFClusterProducerKernel::execute(alpaka::QueueGenericThreadsBlocking<alpaka::DevCpu>&, PortableHostCollection<reco::PFClusterParamsSoALayout<128ul, false> > const&, PortableHostCollection<reco::PFRecHitHCALTopologySoALayout<128ul, false> > const&, PortableHostCollection<reco::PFClusteringVarsSoALayout<128ul, false> >&, PortableHostCollection<reco::PFClusteringEdgeVarsSoALayout<128ul, false> >&, PortableHostCollection<reco::PFRecHitSoALayout<128ul, false> > const&, PortableHostCollection<reco::PFClusterSoALayout<128ul, false> >&, PortableHostCollection<reco::PFRecHitFractionSoALayout<128ul, false> >&) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_4/lib/el8_amd64_gcc12/pluginRecoParticleFlowPFClusterProducersPluginsPortableSerialSync.so
#8  0x00007f445fc8ddf8 in alpaka_serial_sync::PFClusterSoAProducer::produce(alpaka_serial_sync::device::Event&, alpaka_serial_sync::device::EventSetup const&) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_4/lib/el8_amd64_gcc12/pluginRecoParticleFlowPFClusterProducersPluginsPortableSerialSync.so
#9  0x00007f445fc8c06d in alpaka_serial_sync::stream::EDProducer<>::produce(edm::Event&, edm::EventSetup const&) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_4/lib/el8_amd64_gcc12/pluginRecoParticleFlowPFClusterProducersPluginsPortableSerialSync.so
#10 0x00007f44a41d5e91 in edm::stream::EDProducerAdaptorBase::doEvent(edm::EventTransitionInfo const&, edm::ActivityRegistry*, edm::ModuleCallingContext const*) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_4/lib/el8_amd64_gcc12/libFWCoreFramework.so
#11 0x00007f44a41ba7ae in edm::WorkerT<edm::stream::EDProducerAdaptorBase>::implDo(edm::EventTransitionInfo const&, edm::ModuleCallingContext const*) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_4/lib/el8_amd64_gcc12/libFWCoreFramework.so
#12 0x00007f44a4145669 in std::__exception_ptr::exception_ptr edm::Worker::runModuleAfterAsyncPrefetch<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >(std::__exception_ptr::exception_ptr, edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::Context const*) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_4/lib/el8_amd64_gcc12/libFWCoreFramework.so
#13 0x00007f44a4145bd4 in edm::Worker::RunModuleTask<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >::execute() () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_4/lib/el8_amd64_gcc12/libFWCoreFramework.so
#14 0x00007f44a42fbf28 in tbb::detail::d1::function_task<edm::WaitingTaskList::announce()::{lambda()#1}>::execute(tbb::detail::d1::execution_data&) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_4/lib/el8_amd64_gcc12/libFWCoreConcurrency.so
#15 0x00007f44a2901281 in tbb::detail::r1::task_dispatcher::local_wait_for_all<false, tbb::detail::r1::external_waiter> (waiter=..., t=<optimized out>, this=0x7f449f4d3e00) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_1_0_pre1-el8_amd64_gcc12/build/CMSSW_14_1_0_pre1-build/BUILD/el8_amd64_gcc12/external/tbb/v2021.9.0-c3903c50b52342174dbd3a52854a6e6d/tbb-v2021.9.0/src/tbb/task_dispatcher.h:322
#16 tbb::detail::r1::task_dispatcher::local_wait_for_all<tbb::detail::r1::external_waiter> (waiter=..., t=<optimized out>, this=0x7f449f4d3e00) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_1_0_pre1-el8_amd64_gcc12/build/CMSSW_14_1_0_pre1-build/BUILD/el8_amd64_gcc12/external/tbb/v2021.9.0-c3903c50b52342174dbd3a52854a6e6d/tbb-v2021.9.0/src/tbb/task_dispatcher.h:458
#17 tbb::detail::r1::task_dispatcher::execute_and_wait (t=<optimized out>, wait_ctx=..., w_ctx=...) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_1_0_pre1-el8_amd64_gcc12/build/CMSSW_14_1_0_pre1-build/BUILD/el8_amd64_gcc12/external/tbb/v2021.9.0-c3903c50b52342174dbd3a52854a6e6d/tbb-v2021.9.0/src/tbb/task_dispatcher.cpp:168
#18 0x00007f44a40c8ceb in edm::FinalWaitingTask::wait() () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_4/lib/el8_amd64_gcc12/libFWCoreFramework.so
#19 0x00007f44a40d265a in edm::EventProcessor::processRuns() () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_4/lib/el8_amd64_gcc12/libFWCoreFramework.so
#20 0x00007f44a40d2bb1 in edm::EventProcessor::runToCompletion() () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_4/lib/el8_amd64_gcc12/libFWCoreFramework.so
#21 0x00000000004074ef in tbb::detail::d1::task_arena_function<main::{lambda()#1}::operator()() const::{lambda()#1}, void>::operator()() const ()
#22 0x00007f44a28ed9ad in tbb::detail::r1::task_arena_impl::execute (ta=..., d=...) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_1_0_pre1-el8_amd64_gcc12/build/CMSSW_14_1_0_pre1-build/BUILD/el8_amd64_gcc12/external/tbb/v2021.9.0-c3903c50b52342174dbd3a52854a6e6d/tbb-v2021.9.0/src/tbb/arena.cpp:688
#23 0x0000000000408ed2 in main::{lambda()#1}::operator()() const ()
#24 0x000000000040517c in main ()

Current Modules:

Module: PFClusterSoAProducer@alpaka:hltParticleFlowClusterHBHESoA (crashed)
Module: none

A fatal system signal has occurred: segmentation violation

mmusich commented Apr 5, 2024

type pf

cmsbuild added the pf label Apr 5, 2024

slava77 commented Apr 5, 2024

would running in cuda-gdb help to get more info?
The last time I used it, it was with CUDBG_USE_LEGACY_DEBUGGER=1 cuda-gdb cmsRun

slava77 commented Apr 5, 2024

would running in cuda-gdb help to get more info? The last time I used it, it was with CUDBG_USE_LEGACY_DEBUGGER=1 cuda-gdb cmsRun

The stack trace was more informative when the code was recompiled with --keep passed to nvcc.

missirol commented Apr 5, 2024

Just to note that (see #44634 (comment)) I get a crash even on CPU, so I suspect the issue is unrelated to CUDA or GPUs (but it should be double-checked, of course). In that case, the title of the issue should be updated. @wonpoint4

mmusich commented Apr 5, 2024

I get a crash even on CPU, so I suspect the issue is unrelated to CUDA or GPUs (but it should be double-checked, of course).

I was wondering if the warning I reported above

At the end of topoClusterContraction, found large *pcrhFracSize = 2220194

generated here:

if (pfClusteringVars.pcrhFracSize() > 200000)  // Warning in case the fraction is too large
  printf("At the end of topoClusterContraction, found large *pcrhFracSize = %d\n",
         pfClusteringVars.pcrhFracSize());
}

might give hints.

wonpoint4 changed the title from "HLT Farm GPU-related crashes in run 378940" to "HLT Farm crashes in run 378940" on Apr 5, 2024

jsamudio commented Apr 5, 2024

I get a crash even on CPU, so I suspect the issue is unrelated to CUDA or GPUs (but it should be double-checked, of course).

I was wondering if the warning I reported above

At the end of topoClusterContraction, found large *pcrhFracSize = 2220194

generated here:

if (pfClusteringVars.pcrhFracSize() > 200000)  // Warning in case the fraction is too large
  printf("At the end of topoClusterContraction, found large *pcrhFracSize = %d\n",
         pfClusteringVars.pcrhFracSize());
}

might give hints.

It sort of makes sense to me that, with pcrhFracSize this large, there would be a crash. The rechit-fraction SoA is probably not sized for values this big, and some read/write past the end of that SoA is likely what causes the segfault and the CUDA error.

I am still investigating the PF Alpaka kernel, since this number of rechit fractions seems strangely large while the preceding events look more reasonable.

fwyzard commented Apr 5, 2024

I'm guessing that

  • if pfClusteringVars.pcrhFracSize() is larger than 200000, at some point we had offsets larger than 200000 (see line 1286)
  • which means we had pfClusteringVars[rhIdx].seedFracOffsets() larger than 200000 (see line 1289)
  • which means we tried to access the fracView SoA with an index larger than 200000 (in many places)

@jsamudio could you check what the actual SoA size is in the event where the crash happens?

If this overflow is the cause of the crash, what can be done to avoid it?
I do not mean in the sense of improving the algorithms; I mean from a technical point of view.
Would it be possible to add a check inside the kernel that computes the offset, and make it fail with an explicit error if the size of the SoA is not large enough, but without crashing or stopping the job, only skipping the offending event?
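
Something like the following could implement that check inside the kernel (a rough sketch only: storeFraction, the frac()/pfrhIdx() accessors and the overflow flag are hypothetical, not the actual PFClusterProducerKernel code, and the flag would still have to be read back by the producer to decide how to handle the event):

// Hypothetical device-side guard for writes into the rechit-fraction SoA.
// 'capacity' would be nRecHits * pfRecHitFractionAllocation; 'overflow' is a
// device-visible flag the producer inspects after the kernel has run.
template <typename FracView>
ALPAKA_FN_ACC bool storeFraction(FracView& fracView,
                                 int offset,
                                 int capacity,
                                 int rhIdx,
                                 float fraction,
                                 bool* overflow) {
  if (offset < 0 or offset >= capacity) {
    *overflow = true;  // record the problem instead of writing out of bounds
    return false;
  }
  fracView[offset].pfrhIdx() = rhIdx;  // hypothetical SoA accessors
  fracView[offset].frac() = fraction;
  return true;
}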

jsamudio commented Apr 6, 2024

In the event where we see the crash we have 11,244 PF rechits, and the current allocation is nRecHits * 120, so the fraction SoA has 1,349,280 elements. The reported 2,220,194 is obviously beyond that.
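
Spelling out the bookkeeping (a standalone illustration; the variable names are made up, the numbers are the ones quoted above):

#include <cstdio>

int main() {
  const long nRecHits = 11244;             // PF rechits in the crashing event
  const long allocPerRecHit = 120;         // pfRecHitFractionAllocation in the menu
  const long capacity = nRecHits * allocPerRecHit;  // 1,349,280 fraction slots

  const long needed = 2220194;             // pcrhFracSize printed by the warning

  if (needed > capacity) {
    // Writes beyond 'capacity' land outside the rechit-fraction SoA buffer:
    // a segfault on the serial backend, cudaErrorIllegalAddress on the GPU.
    std::printf("overflow: need %ld slots, allocated %ld (deficit %ld)\n",
                needed, capacity, needed - capacity);
  }
  return 0;
}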

As for adding an error and skipping the event, I understand the idea, but I don't know if I've seen an example of something similar to this before. Perhaps someone else has and could point me to an implementation?

fwyzard commented Apr 6, 2024

As a quick workaround, would it work to increase the 120 to something like 250 in the HLT menu?

Not as a long-term solution, but to eliminate, or at least reduce, the online crashes while a better solution is being investigated.

mmusich commented Apr 6, 2024

As a quick workaround, would it work to increase the 120 to something like 250 in the HLT menu?

Would this entail a configuration change or a change in the code (i.e. a new online release)?

fwyzard commented Apr 6, 2024 via email

mmusich commented Apr 6, 2024

Would this entail a configuration change or a change in the code (i.e. a new online release)?

answering myself:

process.hltParticleFlowClusterHBHESoA = cms.EDProducer( "PFClusterSoAProducer@alpaka",
    pfRecHits = cms.InputTag( "hltParticleFlowRecHitHBHESoA" ),
    pfClusterParams = cms.ESInputTag( "hltESPPFClusterParams","" ),
    topology = cms.ESInputTag( "hltESPPFRecHitHCALTopology","" ),
    synchronise = cms.bool( False ),
-    pfRecHitFractionAllocation = cms.int32( 120 ),
+    pfRecHitFractionAllocation = cms.int32( 250 ),
    alpaka = cms.untracked.PSet(  backend = cms.untracked.string( "" ) )
)

missirol commented Apr 7, 2024

FTR, I double-checked that #44634 (comment) avoids the crash in the reproducer, and the HLT throughput is not affected, so it looks like a good short-term solution.

Two extra notes.

  • The change would have to be done to 2 modules: hltParticleFlowClusterHBHESoA and its serial-sync counterpart.
  • Afaik, these crashes have not occurred during stable beams yet. Run-378940 was a short run during a Fill for "beam loss maps" (I wonder what was going on in this event..).

makortel commented

I took a stab at having the error(s) reported properly via exceptions rather than crashes (caused by exceptions being thrown during the stack unwinding triggered by an earlier exception). #44730 should improve the situation (especially when running with CUDA_LAUNCH_BLOCKING=1), although it doesn't completely prevent the crashes (those, at least in the case of the reproducer in this issue, come from direct CUDA code; that might not be worth the effort to address at this point).

While developing the PR I started to wonder whether an Alpaka-specific exception type (or a GPU-runtime-specific one? or a dedicated cms::Exception category + exit code?) would be useful to quickly disambiguate GPU-related errors from the rest (although it might be better to spin that discussion off into its own issue).
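
As a rough sketch of that idea (hypothetical, not what #44730 does; it only assumes the usual cms::Exception interface within a CMSSW environment, and the helper name and the "AlpakaDeviceError" category are made up):

#include <exception>

#include "FWCore/Utilities/interface/Exception.h"

// Hypothetical helper: rethrow any error escaping the Alpaka / GPU-runtime
// layer as a cms::Exception with a dedicated category, so GPU-related
// failures can be filtered quickly in the logs and mapped to a distinct exit code.
template <typename Func>
void invokeOnDevice(Func&& work) {
  try {
    work();
  } catch (std::exception const& e) {
    cms::Exception ex("AlpakaDeviceError");
    ex << "GPU runtime error while executing device work: " << e.what();
    throw ex;
  }
}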

mmusich commented Apr 19, 2024

For the record, this was also tracked at https://its.cern.ch/jira/browse/CMSHLT-3144.
