Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Valgrind reports plenty of "Uninitialised value" created by a stack allocation in an Alpaka-kernel in pluginRecoLocalCaloEcalRecProducersPluginsPortableSerialSync.so #44957

Open
VinInn opened this issue May 12, 2024 · 26 comments

Comments

@VinInn
Copy link
Contributor

VinInn commented May 12, 2024

Debugging #44940 I run

valgrind --tool=memcheck --leak-check=full --track-origins=yes --suppressions=${ROOTSYS}/etc/valgrind-root.supp  cmsRun  hlt_run380466.py

(use the script by @mmusich in #44956 to set up the environment)

and see plenty of report of the type

==855852== Conditional jump or move depends on uninitialised value(s)
==855852==    at 0xA8813D24: void std::__introsort_loop<__gnu_cxx::__normal_iterator<reco::PFJet*, std::vector<reco::PFJet, std::allocator<reco::PFJet> > >, long, __gnu_cxx::__ops::_Ite
r_comp_iter<NumericSafeGreaterByPt<reco::PFJet> > >(__gnu_cxx::__normal_iterator<reco::PFJet*, std::vector<reco::PFJet, std::allocator<reco::PFJet> > >, __gnu_cxx::__normal_iterator<rec
o::PFJet*, std::vector<reco::PFJet, std::allocator<reco::PFJet> > >, long, __gnu_cxx::__ops::_Iter_comp_iter<NumericSafeGreaterByPt<reco::PFJet> >) (in /cvmfs/cms-ib.cern.ch/sw/x86_64/n
week-02836/el9_amd64_gcc12/cms/cmssw/CMSSW_14_1_X_2024-05-09-2300/lib/el9_amd64_gcc12/pluginJetMETCorrectionsModulesPlugins.so)
==855852==    by 0xA8813DC3: void std::__introsort_loop<__gnu_cxx::__normal_iterator<reco::PFJet*, std::vector<reco::PFJet, std::allocator<reco::PFJet> > >, long, __gnu_cxx::__ops::_Ite
r_comp_iter<NumericSafeGreaterByPt<reco::PFJet> > >(__gnu_cxx::__normal_iterator<reco::PFJet*, std::vector<reco::PFJet, std::allocator<reco::PFJet> > >, __gnu_cxx::__normal_iterator<rec
o::PFJet*, std::vector<reco::PFJet, std::allocator<reco::PFJet> > >, long, __gnu_cxx::__ops::_Iter_comp_iter<NumericSafeGreaterByPt<reco::PFJet> >) (in /cvmfs/cms-ib.cern.ch/sw/x86_64/n
week-02836/el9_amd64_gcc12/cms/cmssw/CMSSW_14_1_X_2024-05-09-2300/lib/el9_amd64_gcc12/pluginJetMETCorrectionsModulesPlugins.so)
==855852==    by 0xA880BC90: reco::CorrectedJetProducer<reco::PFJet>::produce(edm::StreamID, edm::Event&, edm::EventSetup const&) const (in /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02836/e
l9_amd64_gcc12/cms/cmssw/CMSSW_14_1_X_2024-05-09-2300/lib/el9_amd64_gcc12/pluginJetMETCorrectionsModulesPlugins.so)
==855852==    by 0x4A8B131: edm::global::EDProducerBase::doEvent(edm::EventTransitionInfo const&, edm::ActivityRegistry*, edm::ModuleCallingContext const*) (in /cvmfs/cms-ib.cern.ch/sw/
x86_64/nweek-02836/el9_amd64_gcc12/cms/cmssw/CMSSW_14_1_X_2024-05-09-2300/lib/el9_amd64_gcc12/libFWCoreFramework.so)
==855852==    by 0x4A8495B: edm::WorkerT<edm::global::EDProducerBase>::implDo(edm::EventTransitionInfo const&, edm::ModuleCallingContext const*) (in /cvmfs/cms-ib.cern.ch/sw/x86_64/nwee
k-02836/el9_amd64_gcc12/cms/cmssw/CMSSW_14_1_X_2024-05-09-2300/lib/el9_amd64_gcc12/libFWCoreFramework.so)
==855852==    by 0x4A11528: std::__exception_ptr::exception_ptr edm::Worker::runModuleAfterAsyncPrefetch<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >(std::__ex
ception_ptr::exception_ptr, edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTra
its<edm::EventPrincipal, (edm::BranchActionType)1>::Context const*) (in /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02836/el9_amd64_gcc12/cms/cmssw/CMSSW_14_1_X_2024-05-09-2300/lib/el9_amd64_
gcc12/libFWCoreFramework.so)
==855852==    by 0x4A1B997: edm::Worker::RunModuleTask<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >::execute() (in /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02836/
el9_amd64_gcc12/cms/cmssw/CMSSW_14_1_X_2024-05-09-2300/lib/el9_amd64_gcc12/libFWCoreFramework.so)
==855852==    by 0x4E74F27: tbb::detail::d1::function_task<edm::WaitingTaskList::announce()::{lambda()#1}>::execute(tbb::detail::d1::execution_data&) (in /cvmfs/cms-ib.cern.ch/sw/x86_64
/nweek-02836/el9_amd64_gcc12/cms/cmssw/CMSSW_14_1_X_2024-05-09-2300/lib/el9_amd64_gcc12/libFWCoreConcurrency.so)
==855852==    by 0x640991A: UnknownInlinedFun (task_dispatcher.h:322)
==855852==    by 0x640991A: UnknownInlinedFun (task_dispatcher.h:458)
==855852==    by 0x640991A: UnknownInlinedFun (arena.cpp:137)
==855852==    by 0x640991A: tbb::detail::r1::market::process(rml::job&) (market.cpp:599)
==855852==    by 0x640BACD: UnknownInlinedFun (private_server.cpp:271)
==855852==    by 0x640BACD: tbb::detail::r1::rml::private_worker::thread_routine(void*) (private_server.cpp:221)
==855852==    by 0x68C6801: start_thread (in /usr/lib64/libc.so.6)
==855852==  Uninitialised value was created by a stack allocation
==855852==    at 0xBA2A6EDD: alpaka::TaskKernelCpuSerial<std::integral_constant<unsigned long, 1ul>, unsigned int, alpaka_serial_sync::ecal::multifit::Kernel_time_computation_init, Ecal
DigiSoALayout<128ul, false>::ConstViewTemplateFreeParams<128ul, false, true, false> const&, EcalDigiSoALayout<128ul, false>::ConstViewTemplateFreeParams<128ul, false, true, false> const
&, EcalMultifitConditionsSoALayout<128ul, false>::ConstViewTemplateFreeParams<128ul, false, true, false> const&, float*, float*, float*, bool*, char*>::operator()() const (in /cvmfs/cms
-ib.cern.ch/sw/x86_64/nweek-02836/el9_amd64_gcc12/cms/cmssw/CMSSW_14_1_X_2024-05-09-2300/lib/el9_amd64_gcc12/pluginRecoLocalCaloEcalRecProducersPluginsPortableSerialSync.so)
==855852==
==855852== Conditional jump or move depends on uninitialised value(s)
==855852==    at 0xA8813D24: void std::__introsort_loop<__gnu_cxx::__normal_iterator<reco::PFJet*, std::vector<reco::PFJet, std::allocator<reco::PFJet> > >, long, __gnu_cxx::__ops::_Ite
r_comp_iter<NumericSafeGreaterByPt<reco::PFJet> > >(__gnu_cxx::__normal_iterator<reco::PFJet*, std::vector<reco::PFJet, std::allocator<reco::PFJet> > >, __gnu_cxx::__normal_iterator<rec
o::PFJet*, std::vector<reco::PFJet, std::allocator<reco::PFJet> > >, long, __gnu_cxx::__ops::_Iter_comp_iter<NumericSafeGreaterByPt<reco::PFJet> >) (in /cvmfs/cms-ib.cern.ch/sw/x86_64/n
week-02836/el9_amd64_gcc12/cms/cmssw/CMSSW_14_1_X_2024-05-09-2300/lib/el9_amd64_gcc12/pluginJetMETCorrectionsModulesPlugins.so)
==855852==    by 0xA8813DC3: void std::__introsort_loop<__gnu_cxx::__normal_iterator<reco::PFJet*, std::vector<reco::PFJet, std::allocator<reco::PFJet> > >, long, __gnu_cxx::__ops::_Ite
r_comp_iter<NumericSafeGreaterByPt<reco::PFJet> > >(__gnu_cxx::__normal_iterator<reco::PFJet*, std::vector<reco::PFJet, std::allocator<reco::PFJet> > >, __gnu_cxx::__normal_iterator<rec
o::PFJet*, std::vector<reco::PFJet, std::allocator<reco::PFJet> > >, long, __gnu_cxx::__ops::_Iter_comp_iter<NumericSafeGreaterByPt<reco::PFJet> >) (in /cvmfs/cms-ib.cern.ch/sw/x86_64/n
week-02836/el9_amd64_gcc12/cms/cmssw/CMSSW_14_1_X_2024-05-09-2300/lib/el9_amd64_gcc12/pluginJetMETCorrectionsModulesPlugins.so)
==855852==    by 0xA880BC90: reco::CorrectedJetProducer<reco::PFJet>::produce(edm::StreamID, edm::Event&, edm::EventSetup const&) const (in /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02836/e
l9_amd64_gcc12/cms/cmssw/CMSSW_14_1_X_2024-05-09-2300/lib/el9_amd64_gcc12/pluginJetMETCorrectionsModulesPlugins.so)
==855852==    by 0x4A8B131: edm::global::EDProducerBase::doEvent(edm::EventTransitionInfo const&, edm::ActivityRegistry*, edm::ModuleCallingContext const*) (in /cvmfs/cms-ib.cern.ch/sw/
x86_64/nweek-02836/el9_amd64_gcc12/cms/cmssw/CMSSW_14_1_X_2024-05-09-2300/lib/el9_amd64_gcc12/libFWCoreFramework.so)
==855852==    by 0x4A8495B: edm::WorkerT<edm::global::EDProducerBase>::implDo(edm::EventTransitionInfo const&, edm::ModuleCallingContext const*) (in /cvmfs/cms-ib.cern.ch/sw/x86_64/nwee
k-02836/el9_amd64_gcc12/cms/cmssw/CMSSW_14_1_X_2024-05-09-2300/lib/el9_amd64_gcc12/libFWCoreFramework.so)
==855852==    by 0x4A11528: std::__exception_ptr::exception_ptr edm::Worker::runModuleAfterAsyncPrefetch<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >(std::__ex
ception_ptr::exception_ptr, edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTra
its<edm::EventPrincipal, (edm::BranchActionType)1>::Context const*) (in /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02836/el9_amd64_gcc12/cms/cmssw/CMSSW_14_1_X_2024-05-09-2300/lib/el9_amd64_
gcc12/libFWCoreFramework.so)
==855852==    by 0x4A1B997: edm::Worker::RunModuleTask<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >::execute() (in /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02836/
el9_amd64_gcc12/cms/cmssw/CMSSW_14_1_X_2024-05-09-2300/lib/el9_amd64_gcc12/libFWCoreFramework.so)
==855852==    by 0x4E74F27: tbb::detail::d1::function_task<edm::WaitingTaskList::announce()::{lambda()#1}>::execute(tbb::detail::d1::execution_data&) (in /cvmfs/cms-ib.cern.ch/sw/x86_64
/nweek-02836/el9_amd64_gcc12/cms/cmssw/CMSSW_14_1_X_2024-05-09-2300/lib/el9_amd64_gcc12/libFWCoreConcurrency.so)
==855852==    by 0x640991A: UnknownInlinedFun (task_dispatcher.h:322)
==855852==    by 0x640991A: UnknownInlinedFun (task_dispatcher.h:458)
==855852==    by 0x640991A: UnknownInlinedFun (arena.cpp:137)
==855852==    by 0x640991A: tbb::detail::r1::market::process(rml::job&) (market.cpp:599)
==855852==    by 0x640BACD: UnknownInlinedFun (private_server.cpp:271)
==855852==    by 0x640BACD: tbb::detail::r1::rml::private_worker::thread_routine(void*) (private_server.cpp:221)
==855852==    by 0x68C6801: start_thread (in /usr/lib64/libc.so.6)
==855852==  Uninitialised value was created by a stack allocation
==855852==    at 0xBA2A6EDD: alpaka::TaskKernelCpuSerial<std::integral_constant<unsigned long, 1ul>, unsigned int, alpaka_serial_sync::ecal::multifit::Kernel_time_computation_init, Ecal
DigiSoALayout<128ul, false>::ConstViewTemplateFreeParams<128ul, false, true, false> const&, EcalDigiSoALayout<128ul, false>::ConstViewTemplateFreeParams<128ul, false, true, false> const
&, EcalMultifitConditionsSoALayout<128ul, false>::ConstViewTemplateFreeParams<128ul, false, true, false> const&, float*, float*, float*, bool*, char*>::operator()() const (in /cvmfs/cms
-ib.cern.ch/sw/x86_64/nweek-02836/el9_amd64_gcc12/cms/cmssw/CMSSW_14_1_X_2024-05-09-2300/lib/el9_amd64_gcc12/pluginRecoLocalCaloEcalRecProducersPluginsPortableSerialSync.so)
==855852==

I use something like this to get the full list

egrep -A2 "Conditional jump or move depends on uninitialised|Uninitialised value was created" /data/user/innocent/hltBug/valg4.log | less

once compiled with -g RecoLocalCalo/EcalRecProducers it becomes

==865737==  Uninitialised value was created by a stack allocation
==865737==    at 0xBA263EDD: alpaka::TaskKernelCpuSerial<std::integral_constant<unsigned long, 1ul>, unsigned int, alpaka_serial_sync::ecal::multifit::Kernel_time_computation_init, EcalDigiSoALayout<128ul, false>::ConstViewTemplateFreeParams<128ul, false, true, false> const&, EcalDigiSoALayout<128ul, false>::ConstViewTemplateFreeParams<128ul, false, true, false> const&, EcalMultifitConditionsSoALayout<128ul, false>::ConstViewTemplateFreeParams<128ul, false, true, false> const&, float*, float*, float*, bool*, char*>::operator()() const (TaskKernelCpuSerial.hpp:51)

that does not give further hint
(it should happen somewhere in
https://cmssdt.cern.ch/lxr/source/RecoLocalCalo/EcalRecProducers/plugins/alpaka/TimeComputationKernels.h?v=CMSSW_14_1_X_2024-05-05-2300#0776
If I well understood.

So I'm not able to debug further

@cmsbuild
Copy link
Contributor

cmsbuild commented May 12, 2024

cms-bot internal usage

@cmsbuild
Copy link
Contributor

A new Issue was created by @VinInn.

@smuzaffar, @antoniovilela, @rappoccio, @Dr15Jones, @makortel, @sextonkennedy can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

@mmusich
Copy link
Contributor

mmusich commented May 12, 2024

assign reconstruction, heterogeneous

@cmsbuild
Copy link
Contributor

New categories assigned: reconstruction,heterogeneous

@fwyzard,@jfernan2,@makortel,@mandrenguyen you have been requested to review this Pull request/Issue and eventually sign? Thanks

@makortel
Copy link
Contributor

@cms-sw/ecal-dpg-l2 @thomreis

@fwyzard
Copy link
Contributor

fwyzard commented May 14, 2024

Looking at the code, the only allocation I see inside the Kernel_time_computation_init kernel is the shared memory (which on the serial CPU back-end is just plain memory).

@VinInn, as a simple test, could you try adding something like

for (auto i: cms::alpakatools::uniform_elements(acc, 2*elemsPerBlock)) {
  shrSampleValues[i] = 0;
}
alpaka::syncBlockThreads(acc);

around line 804, and check if that makes valgrind happy ?

@thomreis, could you check if that has any impact on the results ?

@VinInn
Copy link
Contributor Author

VinInn commented May 14, 2024

I can try but valgrind is saying
created by a stack allocation
not
heap

@fwyzard
Copy link
Contributor

fwyzard commented May 14, 2024

Alpaka allocates "shared memory" for the CPU backend as a static data member of a class templated on the kernel type... so that may count as a stack allocation ?

@VinInn
Copy link
Contributor Author

VinInn commented May 14, 2024

most probably yes...
one may check @ TaskKernelCpuSerial.hpp:51

@VinInn
Copy link
Contributor Author

VinInn commented May 14, 2024

zeroing shrSampleValues does not help
WAIT, maybe not recompiled....

@VinInn
Copy link
Contributor Author

VinInn commented May 14, 2024

I confirm that the valgrind issue is still there.
I set the memory to "junk"

      memset(shrSampleValues,0xa5,2*elemsPerBlock*sizeof(ScalarType));
      alpaka::syncBlockThreads(acc);

at least it does not crash....

@makortel
Copy link
Contributor

Alpaka allocates "shared memory" for the CPU backend as a static data member of a class templated on the kernel type

@fwyzard Could you give more details? I'm now wondering how a static data member would work with running multiple instances of the kernel concurrently (naively I'd assume a data race). Poking around I found
https://github.com/alpaka-group/alpaka/blob/1.1.0/include/alpaka/block/shared/dyn/BlockSharedMemDynMember.hpp#L83
as the memory block for both static and dynamic shared memory (that is not class-static), but I could have easily missed something.

@fwyzard
Copy link
Contributor

fwyzard commented May 14, 2024

hi @makortel you are right, it's a regular data member, not a static one (sorry I mixed things up)

@VinInn
Copy link
Contributor Author

VinInn commented May 15, 2024

I actually overlooked all these other ones

=865737==
==865737== Conditional jump or move depends on uninitialised value(s)
==865737==    at 0xBA252001: UnknownInlinedFun (MultifitComputations.h:488)
==865737==    by 0xBA252001: UnknownInlinedFun (AmplitudeComputationKernels.dev.cc:226)
--
==865737==  Uninitialised value was created by a heap allocation
==865737==    at 0x48455F7: operator new(unsigned long, std::align_val_t) (in /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02836/el9_amd64_gcc12/external/valgrind/3.22.0-390bf50f6ee4c321d331c491bff126fd/libexec/valgrind/vgpreload_memcheck-amd64-linux.so)
==865737==    by 0x70CE9C8D: UnknownInlinedFun (AlignedAlloc.hpp:16)
--
==865737== Conditional jump or move depends on uninitialised value(s)
==865737==    at 0xBA25200A: UnknownInlinedFun (MultifitComputations.h:490)
==865737==    by 0xBA25200A: UnknownInlinedFun (AmplitudeComputationKernels.dev.cc:226)
--
==865737==  Uninitialised value was created by a heap allocation
==865737==    at 0x48455F7: operator new(unsigned long, std::align_val_t) (in /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02836/el9_amd64_gcc12/external/valgrind/3.22.0-390bf50f6ee4c321d331c491bff126fd/libexec/valgrind/vgpreload_memcheck-amd64-linux.so)
==865737==    by 0x70CE9C8D: UnknownInlinedFun (AlignedAlloc.hpp:16)
--
==865737== Conditional jump or move depends on uninitialised value(s)
==865737==    at 0xBA2522B0: UnknownInlinedFun (MultifitComputations.h:428)
==865737==    by 0xBA2522B0: UnknownInlinedFun (AmplitudeComputationKernels.dev.cc:226)
--
==865737==  Uninitialised value was created by a heap allocation
==865737==    at 0x48455F7: operator new(unsigned long, std::align_val_t) (in /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02836/el9_amd64_gcc12/external/valgrind/3.22.0-390bf50f6ee4c321d331c491bff126fd/libexec/valgrind/vgpreload_memcheck-amd64-linux.so)
==865737==    by 0x70CE9C8D: UnknownInlinedFun (AlignedAlloc.hpp:16)
--
==865737== Conditional jump or move depends on uninitialised value(s)
==865737==    at 0xBA2522EC: UnknownInlinedFun (MultifitComputations.h:435)
==865737==    by 0xBA2522EC: UnknownInlinedFun (AmplitudeComputationKernels.dev.cc:226)
--
==865737==  Uninitialised value was created by a heap allocation
==865737==    at 0x48455F7: operator new(unsigned long, std::align_val_t) (in /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02836/el9_amd64_gcc12/external/valgrind/3.22.0-390bf50f6ee4c321d331c491bff126fd/libexec/valgrind/vgpreload_memcheck-amd64-linux.so)
==865737==    by 0x70CE9C8D: UnknownInlinedFun (AlignedAlloc.hpp:16)
--
==865737== Conditional jump or move depends on uninitialised value(s)
==865737==    at 0xBA252EDA: UnknownInlinedFun (AmplitudeComputationKernels.dev.cc:243)
==865737==    by 0xBA252EDA: UnknownInlinedFun (invoke.h:61)
--
==865737==  Uninitialised value was created by a heap allocation
==865737==    at 0x48455F7: operator new(unsigned long, std::align_val_t) (in /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02836/el9_amd64_gcc12/external/valgrind/3.22.0-390bf50f6ee4c321d331c491bff126fd/libexec/valgrind/vgpreload_memcheck-amd64-linux.so)
==865737==    by 0x70CE9C8D: UnknownInlinedFun (AlignedAlloc.hpp:16)
--
==865737== Conditional jump or move depends on uninitialised value(s)
==865737==    at 0xBA264B7F: UnknownInlinedFun (TimeComputationKernels.h:1030)
==865737==    by 0xBA264B7F: UnknownInlinedFun (invoke.h:61)
--
==865737==  Uninitialised value was created by a heap allocation
==865737==    at 0x48455F7: operator new(unsigned long, std::align_val_t) (in /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02836/el9_amd64_gcc12/external/valgrind/3.22.0-390bf50f6ee4c321d331c491bff126fd/libexec/valgrind/vgpreload_memcheck-amd64-linux.so)
==865737==    by 0x70CE9C8D: UnknownInlinedFun (AlignedAlloc.hpp:16)
--
==865737== Conditional jump or move depends on uninitialised value(s)
==865737==    at 0xBA264C5B: UnknownInlinedFun (TimeComputationKernels.h:1071)
==865737==    by 0xBA264C5B: UnknownInlinedFun (invoke.h:61)
--
==865737==  Uninitialised value was created by a heap allocation
==865737==    at 0x48455F7: operator new(unsigned long, std::align_val_t) (in /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02836/el9_amd64_gcc12/external/valgrind/3.22.0-390bf50f6ee4c321d331c491bff126fd/libexec/valgrind/vgpreload_memcheck-amd64-linux.so)
==865737==    by 0x70CE9C8D: UnknownInlinedFun (AlignedAlloc.hpp:16)
--
==865737== Conditional jump or move depends on uninitialised value(s)
==865737==    at 0xA2E1F648: UnknownInlinedFun (EcalRecHit.h:128)
==865737==    by 0xA2E1F648: EcalRecHitSimpleAlgo::makeRecHit(EcalUncalibratedRecHit const&, float const&, float const&, unsigned int const&) const (EcalRecHitSimpleAlgo.h:46)
--
==865737==  Uninitialised value was created by a heap allocation
==865737==    at 0x48455F7: operator new(unsigned long, std::align_val_t) (in /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02836/el9_amd64_gcc12/external/valgrind/3.22.0-390bf50f6ee4c321d331c491bff126fd/libexec/valgrind/vgpreload_memcheck-amd64-linux.so)
==865737==    by 0x70CE9C8D: UnknownInlinedFun (AlignedAlloc.hpp:16)
--

who may be also related to #44956

@VinInn
Copy link
Contributor Author

VinInn commented May 15, 2024

looking into the detailed traceback I see this

alpaka_cuda_async::EcalMultifitConditionsHostESProducer::produce(EcalMultifitConditionsRcd const&) (EcalMultifitConditionsHostESProducer.cc:74)

@VinInn
Copy link
Contributor Author

VinInn commented May 16, 2024

it seems that adding

diff --git a/RecoLocalCalo/EcalRecProducers/plugins/alpaka/EcalMultifitConditionsHostESProducer.cc b/RecoLocalCalo/EcalRecProducers/plugins/alpaka/EcalMultifitConditionsHostESProducer.cc
index ddba855853c..96a7792e9cb 100644
--- a/RecoLocalCalo/EcalRecProducers/plugins/alpaka/EcalMultifitConditionsHostESProducer.cc
+++ b/RecoLocalCalo/EcalRecProducers/plugins/alpaka/EcalMultifitConditionsHostESProducer.cc
@@ -72,6 +72,10 @@ namespace ALPAKA_ACCELERATOR_NAMESPACE {
       size_t numberOfXtals = pedestalsData.size();

       auto product = std::make_unique<EcalMultifitConditionsHost>(numberOfXtals, cms::alpakatools::host());
+      {
+         alpaka::QueueCpuBlocking queue{cms::alpakatools::host()};
+         auto buff = product->buffer(); alpaka::memset(queue,buff, 0xa5);
+      }
       auto view = product->view();

       // Filling pedestals

it does not crash and all messages from Valgrind mentioned above disappear.
Of course this is not necessarily a good news...

@fwyzard
Copy link
Contributor

fwyzard commented May 16, 2024

Reading the code of the ESProducer, the only field that seems to not be filled is the rawid.

@VinInn could you check if adding

diff --git a/RecoLocalCalo/EcalRecProducers/plugins/alpaka/EcalMultifitConditionsHostESProducer.cc b/RecoLocalCalo/EcalRecProducers/plugins/alpaka/EcalMultifitConditionsHostESProducer.cc
index ddba855853c..d0ee97230ec 100644
--- a/RecoLocalCalo/EcalRecProducers/plugins/alpaka/EcalMultifitConditionsHostESProducer.cc
+++ b/RecoLocalCalo/EcalRecProducers/plugins/alpaka/EcalMultifitConditionsHostESProducer.cc
@@ -91,6 +91,7 @@ namespace ALPAKA_ACCELERATOR_NAMESPACE {
 
       for (unsigned int i = 0; i < barrelSize; ++i) {
         auto vi = view[i];
+        vi.rawid() = 0;
 
         vi.pedestals_mean_x12() = pedestalsEB[i].mean_x12;
         vi.pedestals_rms_x12() = pedestalsEB[i].rms_x12;
@@ -113,6 +114,7 @@ namespace ALPAKA_ACCELERATOR_NAMESPACE {
       }  // end Barrel loop
       for (unsigned int i = 0; i < endcapSize; ++i) {
         auto vi = view[barrelSize + i];
+        vi.rawid() = 0;
 
         vi.pedestals_mean_x12() = pedestalsEE[i].mean_x12;
         vi.pedestals_rms_x12() = pedestalsEE[i].rms_x12;

instead of the memset() also makes the valgrind messages go away ?

@thomreis should EcalMultifitConditionsHostESProducer fill the rawid ?

@fwyzard
Copy link
Contributor

fwyzard commented May 16, 2024

Mhm, grep does not find any mention of rawid in RecoLocalCalo/EcalRecProducers/ - so it doesn't seem to be used anywhere.

Then, I don't understand what un-initialised memory valgrind is complaining about :-(

@VinInn
Copy link
Contributor Author

VinInn commented May 16, 2024

no it is not enough...
Is not that there are some "padding space" for alignment and somehow it is accessed?

@fwyzard
Copy link
Contributor

fwyzard commented May 16, 2024

Yes, there is padding between the columns... but it should not be accessed, unless the indices used to access the SoA are larger than its nominal capacity.

@VinInn
Copy link
Contributor Author

VinInn commented May 16, 2024

in RecoLocalCalo/EcalRecProducers/plugins/alpaka/TimeComputationKernels.h
pedestal can be accessed with the channel index

   849            pedestal = conditionsDev.pedestals_mean_x12()[ch];

while everywhere else is accessed with the HashedIndex
@thomreis : is this intended or is a bug?

869 auto const mean_x6 = conditionsDev.pedestals_mean_x6()[hashedId];
870 auto const rms_x6 = conditionsDev.pedestals_rms_x6()[hashedId];
871 auto const gain12Over6 = conditionsDev.gain12Over6()[hashedId];
872 sample_value = (static_cast(adc) - mean_x6) * gain12Over6;
873 sample_value_error = rms_x6 * gain12Over6;

etc

in addition the hasindex is computed from the digi rawid
833 auto const hashedId = isBarrel ? ecal::reconstruction::hashedIndexEB(did.rawId())
834 : offsetForHashes + ecal::reconstruction::hashedIndexEE(did.rawId());

and if did.rawId() is junk (as apparently can be, as seen in the companion issue) hashedId can easily be anything....

@fwyzard
Copy link
Contributor

fwyzard commented May 16, 2024

Very good points.

We have the possibility of enabling runtime checks on the indices, let me dig how to do it.
I actually think we should actually enable it by default, if the cost is not too high.

@fwyzard
Copy link
Contributor

fwyzard commented May 16, 2024

Can you try running with #44987 ?

@VinInn
Copy link
Contributor Author

VinInn commented May 16, 2024

what is supposed to happen?
anyhow no message

@fwyzard
Copy link
Contributor

fwyzard commented May 16, 2024

It's supposed to fail an assert or throw an exception if there is an out of bounds access (and if I made the correct changes).

@fwyzard
Copy link
Contributor

fwyzard commented Jun 23, 2024

Did #45210 fix this ?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants