
HLT Farm crashes in run 378940 #44634

Open
wonpoint4 opened this issue Apr 5, 2024 · 24 comments

wonpoint4 commented Apr 5, 2024

Reporting the large number of GPU-related HLT crashes last night (elog).

  • Related to illegal memory access
  • Not fully understood as HLT menus were unchanged with respect to the previous runs

Here's the recipe to reproduce the crashes (tested with CMSSW_14_0_4 on lxplus8-gpu):

cmsrel CMSSW_14_0_4
cd CMSSW_14_0_4/src
cmsenv

https_proxy=http://cmsproxy.cms:3128 hltConfigFromDB --runNumber 378940 > hlt_run378940.py
cat <<@EOF >> hlt_run378940.py
from EventFilter.Utilities.EvFDaqDirector_cfi import EvFDaqDirector as _EvFDaqDirector
process.EvFDaqDirector = _EvFDaqDirector.clone(
    buBaseDir = '/eos/cms/store/group/phys_muon/wjun/error_stream',
    runNumber = 378940
)
from EventFilter.Utilities.FedRawDataInputSource_cfi import source as _source
process.source = _source.clone(
    fileListMode = True,
    fileNames = (
        '/eos/cms/store/group/phys_muon/wjun/error_stream/run378940/run378940_ls0021_index000036_fu-c2b02-31-01_pid1363776.raw',
    )
)
process.options.wantSummary = True

process.options.numberOfThreads = 1
process.options.numberOfStreams = 0
@EOF

mkdir run378940
cmsRun hlt_run378940.py &> crash_run378940.log

@cms-sw/hlt-l2 FYI
@cms-sw/heterogeneous-l2 FYI

cmsbuild commented Apr 5, 2024

cms-bot internal usage

cmsbuild commented Apr 5, 2024

A new Issue was created by @wonpoint4.

@makortel, @sextonkennedy, @antoniovilela, @Dr15Jones, @smuzaffar, @rappoccio can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

mmusich commented Apr 5, 2024

assign hlt, heterogeneous

cmsbuild commented Apr 5, 2024

New categories assigned: hlt,heterogeneous

@Martin-Grunewald,@mmusich,@fwyzard,@makortel you have been requested to review this Pull request/Issue and eventually sign? Thanks

mmusich commented Apr 5, 2024

Executing the reproducer with CUDA_LAUNCH_BLOCKING=1, I see the following stack:

Fri Apr  5 15:43:25 CEST 2024
Thread 12 (Thread 0x7f8afb1ff640 (LWP 1360807) "cmsRun"):
#0  0x00007f8b5934291f in poll () from /lib64/libc.so.6
#1  0x00007f8b52e5a62f in full_read.constprop () from /cvmfs/cms.cern.ch/el9_amd64_gcc12/cms/cmssw/CMSSW_14_0_4/lib/el9_amd64_gcc12/pluginFWCoreServicesPlugins.so
#2  0x00007f8b52e0ee3c in edm::service::InitRootHandlers::stacktraceFromThread() () from /cvmfs/cms.cern.ch/el9_amd64_gcc12/cms/cmssw/CMSSW_14_0_4/lib/el9_amd64_gcc12/pluginFWCoreServicesPlugins.so
#3  0x00007f8b52e0f7a0 in sig_dostack_then_abort () from /cvmfs/cms.cern.ch/el9_amd64_gcc12/cms/cmssw/CMSSW_14_0_4/lib/el9_amd64_gcc12/pluginFWCoreServicesPlugins.so
#4  <signal handler called>
#5  0x00007f8b592a154c in __pthread_kill_implementation () from /lib64/libc.so.6
#6  0x00007f8b59254d06 in raise () from /lib64/libc.so.6
#7  0x00007f8b592287f3 in abort () from /lib64/libc.so.6
#8  0x00007f8b596aeeea in __gnu_cxx::__verbose_terminate_handler () at ../../../../libstdc++-v3/libsupc++/vterminate.cc:50
#9  0x00007f8b596ace6a in __cxxabiv1::__terminate (handler=<optimized out>) at ../../../../libstdc++-v3/libsupc++/eh_terminate.cc:48
#10 0x00007f8b596abed9 in __cxa_call_terminate (ue_header=0x7f8af8d143c0) at ../../../../libstdc++-v3/libsupc++/eh_call.cc:54
#11 0x00007f8b596ac5f6 in __cxxabiv1::__gxx_personality_v0 (version=<optimized out>, actions=6, exception_class=5138137972254386944, ue_header=<optimized out>, context=0x7f8afb1f88e0) at ../../../../libstdc++-v3/libsupc++/eh_personality.cc:688
#12 0x00007f8b5a11b864 in _Unwind_RaiseException_Phase2 (exc=0x7f8af8d143c0, context=0x7f8afb1f88e0, frames_p=0x7f8afb1f87e8) at ../../../libgcc/unwind.inc:64
#13 0x00007f8b5a11c2bd in _Unwind_Resume (exc=0x7f8af8d143c0) at ../../../libgcc/unwind.inc:242
#14 0x00007f8af1e2e5aa in cms::alpakatools::CachingAllocator<alpaka::DevCpu, alpaka::uniform_cuda_hip::detail::QueueUniformCudaHipRt<alpaka::ApiCudaRt, false> >::free(void*) [clone .cold] () from /cvmfs/cms.cern.ch/el9_amd64_gcc12/cms/cmssw/CMSSW_14_0_4/lib/el9_amd64_gcc12/pluginRecoParticleFlowPFClusterProducersPluginsPortableCudaAsync.so
#15 0x00007f8af1e3bc68 in std::_Sp_counted_ptr_inplace<alpaka::detail::BufCpuImpl<std::byte, std::integral_constant<unsigned long, 1ul>, unsigned int>, std::allocator<void>, (__gnu_cxx::_Lock_policy)2>::_M_dispose() () from /cvmfs/cms.cern.ch/el9_amd64_gcc12/cms/cmssw/CMSSW_14_0_4/lib/el9_amd64_gcc12/pluginRecoParticleFlowPFClusterProducersPluginsPortableCudaAsync.so
#16 0x00007f8af1e30f17 in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() () from /cvmfs/cms.cern.ch/el9_amd64_gcc12/cms/cmssw/CMSSW_14_0_4/lib/el9_amd64_gcc12/pluginRecoParticleFlowPFClusterProducersPluginsPortableCudaAsync.so
#17 0x00007f8af1e500bc in std::_Sp_counted_ptr_inplace<std::tuple<PortableHostCollection<reco::PFRecHitFractionSoALayout<128ul, false> >, std::shared_ptr<alpaka_cuda_async::EDMetadata> >, std::allocator<void>, (__gnu_cxx::_Lock_policy)2>::_M_dispose() () from /cvmfs/cms.cern.ch/el9_amd64_gcc12/cms/cmssw/CMSSW_14_0_4/lib/el9_amd64_gcc12/pluginRecoParticleFlowPFClusterProducersPluginsPortableCudaAsync.so
#18 0x00007f8af1e30f17 in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() () from /cvmfs/cms.cern.ch/el9_amd64_gcc12/cms/cmssw/CMSSW_14_0_4/lib/el9_amd64_gcc12/pluginRecoParticleFlowPFClusterProducersPluginsPortableCudaAsync.so
#19 0x00007f8af1e4b4bc in std::any::_Manager_external<std::shared_ptr<std::tuple<PortableHostCollection<reco::PFRecHitFractionSoALayout<128ul, false> >, std::shared_ptr<alpaka_cuda_async::EDMetadata> > > >::_S_manage(std::any::_Op, std::any const*, std::any::_Arg*) () from /cvmfs/cms.cern.ch/el9_amd64_gcc12/cms/cmssw/CMSSW_14_0_4/lib/el9_amd64_gcc12/pluginRecoParticleFlowPFClusterProducersPluginsPortableCudaAsync.so
#20 0x00007f8b5b6186ca in std::_Sp_counted_ptr_inplace<std::any, std::allocator<void>, (__gnu_cxx::_Lock_policy)2>::_M_dispose() () from /cvmfs/cms.cern.ch/el9_amd64_gcc12/cms/cmssw/CMSSW_14_0_4/lib/el9_amd64_gcc12/libFWCoreFramework.so

@cms-sw/pf-l2 FYI

fwyzard commented Apr 5, 2024

From the stack trace, it seems that an exception was thrown while another exception was being handled:

#6  0x00007f8b59254d06 in raise () from /lib64/libc.so.6

while

#12 0x00007f8b5a11b864 in _Unwind_RaiseException_Phase2 (exc=0x7f8af8d143c0, context=0x7f8afb1f88e0, frames_p=0x7f8afb1f87e8) at ../../../libgcc/unwind.inc:64
#13 0x00007f8b5a11c2bd in _Unwind_Resume (exc=0x7f8af8d143c0) at ../../../libgcc/unwind.inc:242

@mmusich, if you have time to look into this further, could you try running with a single stream / single thread, and post the full stack trace?

mmusich commented Apr 5, 2024

if you have time to look into this further, could you try running with a single stream / single thread, and post the full stack trace?

Sure. Adding to the configuration file

process.options.numberOfThreads = 1
process.options.numberOfStreams = 1

I get the full stack trace, attached here: crash_run378940.log
Right before the stack trace, I notice:

At the end of topoClusterContraction, found large *pcrhFracSize = 2220194
/data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_0_4-el9_amd64_gcc12/build/CMSSW_14_0_4-build/el9_amd64_gcc12/external/alpaka/1.1.0-1dfa0fea4735fa1aac182d6acd03b4c8/include/alpaka/event/EventUniformCudaHipRt.hpp(66) 'TApi::eventDestroy(m_UniformCudaHipEvent)' returned error  : 'cudaErrorIllegalAddress': 'an illegal memory access was encountered'!
/data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_0_4-el9_amd64_gcc12/build/CMSSW_14_0_4-build/el9_amd64_gcc12/external/alpaka/1.1.0-1dfa0fea4735fa1aac182d6acd03b4c8/include/alpaka/mem/buf/BufUniformCudaHipRt.hpp(356) 'TApi::hostFree(ptr)' returned error  : 'cudaErrorIllegalAddress': 'an illegal memory access was encountered'!
terminate called after throwing an instance of 'std::runtime_error'
  what():  /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_0_4-el9_amd64_gcc12/build/CMSSW_14_0_4-build/el9_amd64_gcc12/external/alpaka/1.1.0-1dfa0fea4735fa1aac182d6acd03b4c8/include/alpaka/event/EventUniformCudaHipRt.hpp(160) 'TApi::eventRecord(event.getNativeHandle(), queue.getNativeHandle())' returned error  : 'cudaErrorIllegalAddress': 'an illegal memory access was encountered'!
/data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_0_4-el9_amd64_gcc12/build/CMSSW_14_0_4-build/el9_amd64_gcc12/external/alpaka/1.1.0-1dfa0fea4735fa1aac182d6acd03b4c8/include/alpaka/event/EventUniformCudaHipRt.hpp(66) 'TApi::eventDestroy(m_UniformCudaHipEvent)' returned error  : 'cudaErrorIllegalAddress': 'an illegal memory access was encountered'!

fwyzard commented Apr 5, 2024

Thanks.

So, "cudaErrorIllegalAddress" is basically the GPU equivalent of "Segmentation violation" :-(

What happens in the stack trace is that once we hit a CUDA error, we raise an exception and start unwinding the stack. While doing that, we try to free some CUDA memory, but that call also fails (because the cudaErrorIllegalAddress state is still present), which triggers a second exception. The second exception cannot be handled, which causes the abort.

Of course this doesn't explain the reason for the error that we hit in the first place... that will need to be debugged.
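
For illustration only (not CMSSW code), here is a minimal C++ sketch of this failure mode: a cleanup call that throws while the stack is already unwinding from a first exception ends the program via std::terminate, which is the abort seen above. The DeviceBuffer type and the error strings are made up.

#include <cstdio>
#include <stdexcept>

// Stand-in for a buffer whose cleanup has to call back into the CUDA runtime;
// once the device is in the cudaErrorIllegalAddress state, that call fails too.
struct DeviceBuffer {
  ~DeviceBuffer() noexcept(false) {
    throw std::runtime_error("cudaErrorIllegalAddress while freeing device memory");
  }
};

void runKernel() {
  DeviceBuffer buf;
  // first failure: the kernel hit an illegal memory access
  throw std::runtime_error("cudaErrorIllegalAddress from kernel launch");
  // unwinding destroys 'buf'; its destructor throws a second exception while
  // the first one is still in flight, so the runtime calls std::terminate -> abort
}

int main() {
  try {
    runKernel();
  } catch (std::exception const& e) {
    std::printf("never reached: %s\n", e.what());  // terminate fires before this
  }
  return 0;
}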

missirol commented Apr 5, 2024

Here's a second reproducer (same input events). I see the segfault also when running on CPU only.

#!/bin/bash -ex

# CMSSW_14_0_4

hltGetConfiguration run:378940 \
  --globaltag 140X_dataRun3_HLT_v3 \
  --data \
  --no-prescale \
  --no-output \
  --max-events -1 \
  --input /store/group/tsg/FOG/debug/240405_run378940/files/run378940_ls0021_index000036_fu-c2b02-31-01_pid1363776.root \
  > hlt.py

cat <<@EOF >> hlt.py
process.options.wantSummary = True

process.options.numberOfThreads = 1
process.options.numberOfStreams = 0

process.options.accelerators = ["*"]
@EOF

CUDA_LAUNCH_BLOCKING=1 \
cmsRun hlt.py &> hlt.log

Stack trace here: hlt.log.

Thread 1 (Thread 0x7f44a0bac640 (LWP 3012403) "cmsRun"):
#0  0x00007f44a1779301 in poll () from /lib64/libc.so.6
#1  0x00007f44967d56af in full_read.constprop () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_4/lib/el8_amd64_gcc12/pluginFWCoreServicesPlugins.so
#2  0x00007f4496789dbc in edm::service::InitRootHandlers::stacktraceFromThread() () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_4/lib/el8_amd64_gcc12/pluginFWCoreServicesPlugins.so
#3  0x00007f449678a720 in sig_dostack_then_abort () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_4/lib/el8_amd64_gcc12/pluginFWCoreServicesPlugins.so
#4  <signal handler called>
#5  0x00007f445fc94340 in void alpaka_serial_sync::FastCluster::operator()<false, alpaka::AccCpuSerial<std::integral_constant<unsigned long, 1ul>, unsigned int>, std::enable_if<false, void> >(alpaka::AccCpuSerial<std::integral_constant<unsigned long, 1ul>, unsigned int> const&, reco::PFRecHitSoALayout<128ul, false>::ConstViewTemplateFreeParams<128ul, false, true, false>, reco::PFClusterParamsSoALayout<128ul, false>::ConstViewTemplateFreeParams<128ul, false, true, false>, reco::PFRecHitHCALTopologySoALayout<128ul, false>::ConstViewTemplateFreeParams<128ul, false, true, false>, reco::PFClusteringVarsSoALayout<128ul, false>::ViewTemplateFreeParams<128ul, false, true, false>, reco::PFClusterSoALayout<128ul, false>::ViewTemplateFreeParams<128ul, false, true, false>, reco::PFRecHitFractionSoALayout<128ul, false>::ViewTemplateFreeParams<128ul, false, true, false>) const [clone .constprop.0] [clone .isra.0] () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_4/lib/el8_amd64_gcc12/pluginRecoParticleFlowPFClusterProducersPluginsPortableSerialSync.so
#6  0x00007f445fc95904 in alpaka::TaskKernelCpuSerial<std::integral_constant<unsigned long, 1ul>, unsigned int, alpaka_serial_sync::FastCluster, reco::PFRecHitSoALayout<128ul, false>::ConstViewTemplateFreeParams<128ul, false, true, false> const&, reco::PFClusterParamsSoALayout<128ul, false>::ConstViewTemplateFreeParams<128ul, false, true, false> const&, reco::PFRecHitHCALTopologySoALayout<128ul, false>::ConstViewTemplateFreeParams<128ul, false, true, false> const&, reco::PFClusteringVarsSoALayout<128ul, false>::ViewTemplateFreeParams<128ul, false, true, false>&, reco::PFClusterSoALayout<128ul, false>::ViewTemplateFreeParams<128ul, false, true, false>&, reco::PFRecHitFractionSoALayout<128ul, false>::ViewTemplateFreeParams<128ul, false, true, false>&>::operator()() const () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_4/lib/el8_amd64_gcc12/pluginRecoParticleFlowPFClusterProducersPluginsPortableSerialSync.so
#7  0x00007f445fc9735f in alpaka_serial_sync::PFClusterProducerKernel::execute(alpaka::QueueGenericThreadsBlocking<alpaka::DevCpu>&, PortableHostCollection<reco::PFClusterParamsSoALayout<128ul, false> > const&, PortableHostCollection<reco::PFRecHitHCALTopologySoALayout<128ul, false> > const&, PortableHostCollection<reco::PFClusteringVarsSoALayout<128ul, false> >&, PortableHostCollection<reco::PFClusteringEdgeVarsSoALayout<128ul, false> >&, PortableHostCollection<reco::PFRecHitSoALayout<128ul, false> > const&, PortableHostCollection<reco::PFClusterSoALayout<128ul, false> >&, PortableHostCollection<reco::PFRecHitFractionSoALayout<128ul, false> >&) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_4/lib/el8_amd64_gcc12/pluginRecoParticleFlowPFClusterProducersPluginsPortableSerialSync.so
#8  0x00007f445fc8ddf8 in alpaka_serial_sync::PFClusterSoAProducer::produce(alpaka_serial_sync::device::Event&, alpaka_serial_sync::device::EventSetup const&) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_4/lib/el8_amd64_gcc12/pluginRecoParticleFlowPFClusterProducersPluginsPortableSerialSync.so
#9  0x00007f445fc8c06d in alpaka_serial_sync::stream::EDProducer<>::produce(edm::Event&, edm::EventSetup const&) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_4/lib/el8_amd64_gcc12/pluginRecoParticleFlowPFClusterProducersPluginsPortableSerialSync.so
#10 0x00007f44a41d5e91 in edm::stream::EDProducerAdaptorBase::doEvent(edm::EventTransitionInfo const&, edm::ActivityRegistry*, edm::ModuleCallingContext const*) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_4/lib/el8_amd64_gcc12/libFWCoreFramework.so
#11 0x00007f44a41ba7ae in edm::WorkerT<edm::stream::EDProducerAdaptorBase>::implDo(edm::EventTransitionInfo const&, edm::ModuleCallingContext const*) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_4/lib/el8_amd64_gcc12/libFWCoreFramework.so
#12 0x00007f44a4145669 in std::__exception_ptr::exception_ptr edm::Worker::runModuleAfterAsyncPrefetch<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >(std::__exception_ptr::exception_ptr, edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::Context const*) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_4/lib/el8_amd64_gcc12/libFWCoreFramework.so
#13 0x00007f44a4145bd4 in edm::Worker::RunModuleTask<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >::execute() () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_4/lib/el8_amd64_gcc12/libFWCoreFramework.so
#14 0x00007f44a42fbf28 in tbb::detail::d1::function_task<edm::WaitingTaskList::announce()::{lambda()#1}>::execute(tbb::detail::d1::execution_data&) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_4/lib/el8_amd64_gcc12/libFWCoreConcurrency.so
#15 0x00007f44a2901281 in tbb::detail::r1::task_dispatcher::local_wait_for_all<false, tbb::detail::r1::external_waiter> (waiter=..., t=<optimized out>, this=0x7f449f4d3e00) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_1_0_pre1-el8_amd64_gcc12/build/CMSSW_14_1_0_pre1-build/BUILD/el8_amd64_gcc12/external/tbb/v2021.9.0-c3903c50b52342174dbd3a52854a6e6d/tbb-v2021.9.0/src/tbb/task_dispatcher.h:322
#16 tbb::detail::r1::task_dispatcher::local_wait_for_all<tbb::detail::r1::external_waiter> (waiter=..., t=<optimized out>, this=0x7f449f4d3e00) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_1_0_pre1-el8_amd64_gcc12/build/CMSSW_14_1_0_pre1-build/BUILD/el8_amd64_gcc12/external/tbb/v2021.9.0-c3903c50b52342174dbd3a52854a6e6d/tbb-v2021.9.0/src/tbb/task_dispatcher.h:458
#17 tbb::detail::r1::task_dispatcher::execute_and_wait (t=<optimized out>, wait_ctx=..., w_ctx=...) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_1_0_pre1-el8_amd64_gcc12/build/CMSSW_14_1_0_pre1-build/BUILD/el8_amd64_gcc12/external/tbb/v2021.9.0-c3903c50b52342174dbd3a52854a6e6d/tbb-v2021.9.0/src/tbb/task_dispatcher.cpp:168
#18 0x00007f44a40c8ceb in edm::FinalWaitingTask::wait() () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_4/lib/el8_amd64_gcc12/libFWCoreFramework.so
#19 0x00007f44a40d265a in edm::EventProcessor::processRuns() () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_4/lib/el8_amd64_gcc12/libFWCoreFramework.so
#20 0x00007f44a40d2bb1 in edm::EventProcessor::runToCompletion() () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_4/lib/el8_amd64_gcc12/libFWCoreFramework.so
#21 0x00000000004074ef in tbb::detail::d1::task_arena_function<main::{lambda()#1}::operator()() const::{lambda()#1}, void>::operator()() const ()
#22 0x00007f44a28ed9ad in tbb::detail::r1::task_arena_impl::execute (ta=..., d=...) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_1_0_pre1-el8_amd64_gcc12/build/CMSSW_14_1_0_pre1-build/BUILD/el8_amd64_gcc12/external/tbb/v2021.9.0-c3903c50b52342174dbd3a52854a6e6d/tbb-v2021.9.0/src/tbb/arena.cpp:688
#23 0x0000000000408ed2 in main::{lambda()#1}::operator()() const ()
#24 0x000000000040517c in main ()

Current Modules:

Module: PFClusterSoAProducer@alpaka:hltParticleFlowClusterHBHESoA (crashed)
Module: none

A fatal system signal has occurred: segmentation violation

mmusich commented Apr 5, 2024

type pf

cmsbuild added the pf label Apr 5, 2024

slava77 commented Apr 5, 2024

would running in cuda-gdb help to get more info?
The last time I used it, it was with CUDBG_USE_LEGACY_DEBUGGER=1 cuda-gdb cmsRun

slava77 commented Apr 5, 2024

would running in cuda-gdb help to get more info? The last time I used it, it was with CUDBG_USE_LEGACY_DEBUGGER=1 cuda-gdb cmsRun

The stack trace was more informative when the code was recompiled with --keep passed to nvcc.

missirol commented Apr 5, 2024

Just to note that (see #44634 (comment)) I get a crash even on CPU, so I suspect the issue is unrelated to CUDA or GPUs (but it should be double-checked, of course). In that case, the title of the issue should be updated. @wonpoint4

mmusich commented Apr 5, 2024

I get a crash even on CPU, so I suspect the issue is unrelated to CUDA or GPUs (but it should be double-checked, of course).

I was wondering if the warning I reported above

At the end of topoClusterContraction, found large *pcrhFracSize = 2220194

generated here:

if (pfClusteringVars.pcrhFracSize() > 200000)  // Warning in case the fraction is too large
  printf("At the end of topoClusterContraction, found large *pcrhFracSize = %d\n",
         pfClusteringVars.pcrhFracSize());
}

might give hints.

wonpoint4 changed the title from "HLT Farm GPU-related crashes in run 378940" to "HLT Farm crashes in run 378940" on Apr 5, 2024

jsamudio commented Apr 5, 2024

I get a crash even on CPU, so I suspect the issue is unrelated to CUDA or GPUs (but it should be double-checked, of course).

I was wondering if the warning I reported above

At the end of topoClusterContraction, found large *pcrhFracSize = 2220194

generated here:

if (pfClusteringVars.pcrhFracSize() > 200000)  // Warning in case the fraction is too large
  printf("At the end of topoClusterContraction, found large *pcrhFracSize = %d\n",
         pfClusteringVars.pcrhFracSize());
}

might give hints.

It sort of makes sense to me that, with pcrhFracSize this large, there would be a crash. The rechit-fraction SoA is probably not sized for values this big, and some read/write past the end of that SoA is likely what causes the segfault and the CUDA error.

I am still investigating the PF Alpaka kernel, since this number of rechit fractions seems strangely large while the preceding events look more reasonable.

fwyzard commented Apr 5, 2024

I'm guessing that

  • if pfClusteringVars.pcrhFracSize() is larger than 200000, at some point we had offsets larger than 200000 (see line 1286)
  • which means we had pfClusteringVars[rhIdx].seedFracOffsets() larger than 200000 (see line 1289)
  • which means we tried to access the fracView SoA with an index larger than 200000 (in many places)

@jsamudio could you check what the actual SoA size is in the event where the crash happens?

If this overflow is the cause of the crash, what can be done to avoid it?
I do not mean in the sense of improving the algorithms; I mean from a technical point of view.
Would it be possible to add a check inside the kernel that computes the offset, and make it fail with an explicit error if the size of the SoA is not large enough, but without crashing or stopping the job, only skipping the offending event?
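
Something like the following could implement that check inside the kernel (a rough sketch only: storeFraction, the frac()/pfrhIdx() accessors and the overflow flag are hypothetical, not the actual PFClusterProducerKernel code, and the flag would still have to be read back by the producer to decide how to handle the event):

// Hypothetical device-side guard for writes into the rechit-fraction SoA.
// 'capacity' would be nRecHits * pfRecHitFractionAllocation; 'overflow' is a
// device-visible flag the producer inspects after the kernel has run.
template <typename FracView>
ALPAKA_FN_ACC bool storeFraction(FracView& fracView,
                                 int offset,
                                 int capacity,
                                 int rhIdx,
                                 float fraction,
                                 bool* overflow) {
  if (offset < 0 or offset >= capacity) {
    *overflow = true;  // record the problem instead of writing out of bounds
    return false;
  }
  fracView[offset].pfrhIdx() = rhIdx;  // hypothetical SoA accessors
  fracView[offset].frac() = fraction;
  return true;
}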

jsamudio commented Apr 6, 2024

In the event where we see the crash we have 11,244 PF rechits, and the current allocation is nRecHits * 120, so the fraction SoA has 1,349,280 elements. The reported 2,220,194 is obviously beyond that.
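
Spelling out the bookkeeping (a standalone illustration; the variable names are made up, the numbers are the ones quoted above):

#include <cstdio>

int main() {
  const long nRecHits = 11244;             // PF rechits in the crashing event
  const long allocPerRecHit = 120;         // pfRecHitFractionAllocation in the menu
  const long capacity = nRecHits * allocPerRecHit;  // 1,349,280 fraction slots

  const long needed = 2220194;             // pcrhFracSize printed by the warning

  if (needed > capacity) {
    // Writes beyond 'capacity' land outside the rechit-fraction SoA buffer:
    // a segfault on the serial backend, cudaErrorIllegalAddress on the GPU.
    std::printf("overflow: need %ld slots, allocated %ld (deficit %ld)\n",
                needed, capacity, needed - capacity);
  }
  return 0;
}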

As for adding an error and skipping the event, I understand the idea, but I don't know if I've seen an example of something similar to this before. Perhaps someone else has and could point me to an implementation?

fwyzard commented Apr 6, 2024

As a quick workaround, would it work to increase the 120 to something like 250 in the HLT menu?

Not as a long-term solution, but to eliminate, or at least reduce, the online crashes while a better solution is being investigated.

mmusich commented Apr 6, 2024

As a quick workaround, would it work to increase the 120 to something like 250 in the HLT menu?

Would this entail a configuration change or a change in the code (i.e. a new online release)?

fwyzard commented Apr 6, 2024 via email

mmusich commented Apr 6, 2024

Would this entail a configuration change or a change in the code (i.e. a new online release)?

answering myself:

process.hltParticleFlowClusterHBHESoA = cms.EDProducer( "PFClusterSoAProducer@alpaka",
    pfRecHits = cms.InputTag( "hltParticleFlowRecHitHBHESoA" ),
    pfClusterParams = cms.ESInputTag( "hltESPPFClusterParams","" ),
    topology = cms.ESInputTag( "hltESPPFRecHitHCALTopology","" ),
    synchronise = cms.bool( False ),
-    pfRecHitFractionAllocation = cms.int32( 120 ),
+    pfRecHitFractionAllocation = cms.int32( 250 ),
    alpaka = cms.untracked.PSet(  backend = cms.untracked.string( "" ) )
)

missirol commented Apr 7, 2024

FTR, I double-checked that #44634 (comment) avoids the crash in the reproducer, and the HLT throughput is not affected, so it looks like a good short-term solution.

Two extra notes.

  • The change would have to be done to 2 modules: hltParticleFlowClusterHBHESoA and its serial-sync counterpart.
  • Afaik, these crashes have not occurred during stable beams yet. Run-378940 was a short run during a Fill for "beam loss maps" (I wonder what was going on in this event..).

makortel commented

I took a stab at having the error(s) reported properly via exceptions rather than crashes (caused by exceptions being thrown during the stack unwinding triggered by an earlier exception). #44730 should improve the situation (especially when running with CUDA_LAUNCH_BLOCKING=1), although it doesn't completely prevent the crashes (those, at least in the case of the reproducer in this issue, come from direct CUDA code; that might not be worth the effort to address at this point).

While developing the PR I started to wonder whether an Alpaka-specific exception type (or a GPU-runtime-specific one? or a dedicated cms::Exception category + exit code?) would be useful to quickly disambiguate GPU-related errors from the rest (although it might be better to spin that discussion off into its own issue).
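
As a rough sketch of that idea (hypothetical, not what #44730 does; it only assumes the usual cms::Exception interface within a CMSSW environment, and the helper name and the "AlpakaDeviceError" category are made up):

#include <exception>

#include "FWCore/Utilities/interface/Exception.h"

// Hypothetical helper: rethrow any error escaping the Alpaka / GPU-runtime
// layer as a cms::Exception with a dedicated category, so GPU-related
// failures can be filtered quickly in the logs and mapped to a distinct exit code.
template <typename Func>
void invokeOnDevice(Func&& work) {
  try {
    work();
  } catch (std::exception const& e) {
    cms::Exception ex("AlpakaDeviceError");
    ex << "GPU runtime error while executing device work: " << e.what();
    throw ex;
  }
}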

mmusich commented Apr 19, 2024

For the record, this was also tracked at https://its.cern.ch/jira/browse/CMSHLT-3144.
