Valgrind reports plenty of "Uninitialised value" created by a stack allocation in an Alpaka-kernel in pluginRecoLocalCaloEcalRecProducersPluginsPortableSerialSync.so #44957
A new Issue was created by @VinInn. @smuzaffar, @antoniovilela, @rappoccio, @Dr15Jones, @makortel, @sextonkennedy can you please review it and eventually sign/assign? Thanks. cms-bot commands are listed here |
assign reconstruction, heterogeneous |
New categories assigned: reconstruction,heterogeneous @fwyzard,@jfernan2,@makortel,@mandrenguyen you have been requested to review this Pull request/Issue and eventually sign? Thanks |
@cms-sw/ecal-dpg-l2 @thomreis |
Looking at the code, the only allocation I see inside the
@VinInn, as a simple test, could you try adding something like
for (auto i : cms::alpakatools::uniform_elements(acc, 2 * elemsPerBlock)) {
  shrSampleValues[i] = 0;
}
alpaka::syncBlockThreads(acc);
around line 804, and check if that makes valgrind happy?
@thomreis, could you check if that has any impact on the results? |
I can try but valgrind is saying |
Alpaka allocates "shared memory" for the CPU backend as a |
most probably yes... |
zeroing shrSampleValues does not help |
I confirm that the valgrind issue is still there.
at least it does not crash.... |
@fwyzard Could you give more details? I'm now wondering how a |
hi @makortel you are right, it's a regular data member, not a |
I actually overlooked all these other ones,
which may also be related to #44956 |
looking into the detailed traceback I see this
|
it seems that adding
it does not crash and all messages from Valgrind mentioned above disappear. |
Reading the code of the ESProducer, the only field that seems to not be filled is the
@VinInn could you check if adding
diff --git a/RecoLocalCalo/EcalRecProducers/plugins/alpaka/EcalMultifitConditionsHostESProducer.cc b/RecoLocalCalo/EcalRecProducers/plugins/alpaka/EcalMultifitConditionsHostESProducer.cc
index ddba855853c..d0ee97230ec 100644
--- a/RecoLocalCalo/EcalRecProducers/plugins/alpaka/EcalMultifitConditionsHostESProducer.cc
+++ b/RecoLocalCalo/EcalRecProducers/plugins/alpaka/EcalMultifitConditionsHostESProducer.cc
@@ -91,6 +91,7 @@ namespace ALPAKA_ACCELERATOR_NAMESPACE {
       for (unsigned int i = 0; i < barrelSize; ++i) {
         auto vi = view[i];
+        vi.rawid() = 0;
         vi.pedestals_mean_x12() = pedestalsEB[i].mean_x12;
         vi.pedestals_rms_x12() = pedestalsEB[i].rms_x12;
@@ -113,6 +114,7 @@ namespace ALPAKA_ACCELERATOR_NAMESPACE {
       }  // end Barrel loop
       for (unsigned int i = 0; i < endcapSize; ++i) {
         auto vi = view[barrelSize + i];
+        vi.rawid() = 0;
         vi.pedestals_mean_x12() = pedestalsEE[i].mean_x12;
         vi.pedestals_rms_x12() = pedestalsEE[i].rms_x12;
instead of the
@thomreis should |
Mhm, then I don't understand which uninitialised memory valgrind is complaining about :-( |
no it is not enough... |
Yes, there is padding between the columns... but it should not be accessed, unless the indices used to access the SoA are larger than its nominal capacity. |
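To make the remark about padding concrete, here is a minimal sketch of how aligned columns in a single buffer leave unused bytes between them. Everything in it is an assumption for illustration: `ToyLayout`, `alignUp`, the two column types and the 128-byte alignment are hypothetical and do not reproduce the actual CMSSW SoA layout.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper: round `bytes` up to the next multiple of `alignment`.
constexpr std::size_t alignUp(std::size_t bytes, std::size_t alignment) {
  return (bytes + alignment - 1) / alignment * alignment;
}

// Hypothetical two-column SoA kept in one buffer, each column aligned.
struct ToyLayout {
  std::size_t elements;   // nominal capacity of every column
  std::size_t alignment;  // assumed column alignment in bytes

  // Column 0 (uint32_t ids) starts at offset 0 and uses this many bytes.
  constexpr std::size_t rawidBytes() const { return elements * sizeof(std::uint32_t); }

  // Column 1 starts at the next aligned offset after column 0.
  constexpr std::size_t meanOffset() const { return alignUp(rawidBytes(), alignment); }

  // Bytes between the end of column 0 and the start of column 1.
  // Nothing fills them, so an index >= elements on column 0 reads them.
  constexpr std::size_t paddingBytes() const { return meanOffset() - rawidBytes(); }
};
```

With 10 elements and a 128-byte alignment, this toy layout has 88 bytes of padding after the first column; with 32 elements the column fills its aligned slot exactly and there is none.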
in RecoLocalCalo/EcalRecProducers/plugins/alpaka/TimeComputationKernels.h
while everywhere else it is accessed with the hashed index, e.g. at line 869:
auto const mean_x6 = conditionsDev.pedestals_mean_x6()[hashedId];
etc.
In addition, the hashed index is computed from the digi rawId, and if did.rawId() is junk (as it apparently can be, as seen in the companion issue), hashedId can easily be anything... |
Very good points. We have the possibility of enabling runtime checks on the indices, let me dig how to do it. |
Can you try running with #44987 ? |
what is supposed to happen? |
It's supposed to fail an assert or throw an exception if there is an out of bounds access (and if I made the correct changes). |
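As an illustration of what such a runtime check amounts to, here is a minimal sketch of a range-checked column accessor. It is not the implementation in #44987: `CheckedColumn` and its members are hypothetical names, and a plain `std::vector` stands in for the SoA column.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical stand-in for one SoA column with a range-checked accessor.
class CheckedColumn {
public:
  // All elements are zero-initialised, so in-range reads are always defined.
  explicit CheckedColumn(std::size_t capacity) : data_(capacity, 0.f) {}

  // Throws instead of silently reading past the nominal capacity.
  float& at(std::size_t index) {
    if (index >= data_.size()) {
      throw std::out_of_range("index " + std::to_string(index) +
                              " is out of range for capacity " +
                              std::to_string(data_.size()));
    }
    return data_[index];
  }

  std::size_t capacity() const { return data_.size(); }

private:
  std::vector<float> data_;
};
```

With a check like this, a junk index derived from a corrupted id would surface as an exception at the access site rather than as a later Valgrind report.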
Did #45210 fix this ? |
While debugging #44940 I ran
(use the script by @mmusich in #44956 to set up the environment)
and saw plenty of reports of the type
I used something like this to get the full list
Once RecoLocalCalo/EcalRecProducers is compiled with -g, it becomes
which does not give any further hint
(it should happen somewhere in
https://cmssdt.cern.ch/lxr/source/RecoLocalCalo/EcalRecProducers/plugins/alpaka/TimeComputationKernels.h?v=CMSSW_14_1_X_2024-05-05-2300#0776
if I understood correctly).
So I'm not able to debug further.