Ignore errors from `alpaka::enqueue()` in `CachingAllocator::free()` #44730

makortel · 2024-04-12T18:06:36Z

PR description:

#44634 reported an HLT job failure caused by an illegal memory access on a GPU. The failure was reported as a crash instead of a caught exception because a second exception was thrown from CachingAllocator<T>::free() by alpaka::enqueue() while objects using cached allocations were being deleted during the stack unwinding of the original exception.

alpaka::enqueue() is used in CachingAllocator<T>::free() to "record" the alpaka Event in the Queue when the freed memory block is to be recached. This PR changes the behavior so that if alpaka::enqueue() throws an exception, the memory block is treated as freed instead of recached.
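
As a rough illustration of the behavior described above, here is a minimal sketch that does not use alpaka at all. `MiniCachingAllocator`, `Block`, `recordEvent()` and `queueInErrorState` are hypothetical stand-ins invented for this example (they are not names from CachingAllocator.h), and the real change may be structured differently.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical stand-ins; none of these names come from alpaka or CMSSW.
struct Block {
  std::size_t bytes;
};

struct MiniCachingAllocator {
  std::vector<Block> cached;       // blocks kept for reuse
  std::size_t released = 0;        // blocks given up instead of recached
  bool queueInErrorState = false;  // simulates a backend in an error state

  // Stand-in for recording an event in a queue: throws if the backend is broken.
  void recordEvent() const {
    if (queueInErrorState) {
      throw std::runtime_error("enqueue failed: backend in error state");
    }
  }

  // Errors from recordEvent() are not propagated: a failed recording
  // downgrades "recache the block" to "treat the block as freed".
  void free(Block block) {
    bool recache = true;
    try {
      recordEvent();
    } catch (...) {
      recache = false;
    }
    if (recache) {
      cached.push_back(block);
    } else {
      ++released;
    }
  }
};
```

With `queueInErrorState` set, a call such as `allocator.free(Block{128})` returns normally and increments `released`; otherwise the block lands in `cached`.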

I checked that the destructors of the alpaka Buffers, Queues, and Events do not throw exceptions, but instead report any errors from the underlying APIs as printouts.

PR validation:

I tested the reproducer in #44634 on a GPU node with CUDA_LAUNCH_BLOCKING=1 cmsRun ..., and the job now reports the exception in a useful way:

----- Begin Fatal Exception 05-Apr-2024 20:44:47 CEST-----------------------
An exception of category 'StdException' occurred while
   [0] Processing  Event run: 378940 lumi: 21 event: 5339574 stream: 0
   [1] Running path 'DST_PFScouting_JetHT_v1'
   [2] Calling method for module PFClusterSoAProducer@alpaka/'hltParticleFlowClusterHBHESoA'
Exception Message:
A std::exception was thrown.
/cvmfs/cms.cern.ch/el8_amd64_gcc12/external/alpaka/1.1.0-c6af69ddd6f2ee5be4f2b069590bae19/include/alpaka/kernel/TaskKernelGpuUniformCudaHipRt.hpp(259) 'TApi::setDevice(queue.m_spQueueImpl->m_dev.getNativeHandle())' A previous API call (not this one) set the error  : 'cudaErrorIllegalAddress': 'an illegal memory access was encountered'!
----- End Fatal Exception -------------------------------------------------

Afterwards the job still crashes, this time in direct CUDA code (cms::cuda::abortOnCudaError() in SiPixelGainCalibrationForHLTGPU::~SiPixelGainCalibrationForHLTGPU()). That is probably not worth addressing at this time, given that the direct CUDA code is expected to be removed later on.

Without CUDA_LAUNCH_BLOCKING=1 the reported exception message is no longer useful, but at least the job contains printouts from Alpaka code that include the cudaErrorIllegalAddress error name. So while not ideal, the log contains more useful information than before this PR.

The added unit test succeeds on the Serial and CUDA backends (without the change in this PR, the unit test fails on the CUDA backend and succeeds on the Serial backend).

If this PR is a backport please specify the original PR and why you need to backport that PR. If this PR will be backported please specify to which release cycle the backport is meant for:

To be backported to 14_0_X

cmsbuild · 2024-04-12T18:07:06Z

cms-bot internal usage

cmsbuild · 2024-04-12T18:13:07Z

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-44730/39941

This PR adds an extra 16KB to repository
There are other open Pull requests which might conflict with changes you have proposed:
- File HeterogeneousCore/AlpakaInterface/interface/CachingAllocator.h modified in PR(s): Reorganise alpaka kernel loop functions #44712

cmsbuild · 2024-04-12T18:13:33Z

A new Pull Request was created by @makortel for master.

It involves the following packages:

HeterogeneousCore/AlpakaInterface (heterogeneous)

@fwyzard, @cmsbuild, @makortel can you please review it and eventually sign? Thanks.
@missirol, @rovere this is something you requested to watch as well.
@sextonkennedy, @antoniovilela, @rappoccio you are the release manager for this.

cms-bot commands are listed here

makortel · 2024-04-12T18:17:47Z

enable gpu

makortel · 2024-04-12T18:17:52Z

@cmsbuild, please test

cmsbuild · 2024-04-12T21:38:44Z

+1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-3906d3/38824/summary.html
COMMIT: 0a5eef6
CMSSW: CMSSW_14_1_X_2024-04-12-1100/el8_amd64_gcc12
Additional Tests: GPU
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmssw/44730/38824/install.sh to create a dev area with all the needed externals and cmssw changes.

Comparison Summary

Summary:

You potentially removed 36 lines from the logs
Reco comparison results: 43 differences found in the comparisons
DQMHistoTests: Total files compared: 48
DQMHistoTests: Total histograms compared: 3316263
DQMHistoTests: Total failures: 0
DQMHistoTests: Total nulls: 0
DQMHistoTests: Total successes: 3316243
DQMHistoTests: Total skipped: 20
DQMHistoTests: Total Missing objects: 0
DQMHistoSizes: Histogram memory added: 0.0 KiB( 47 files compared)
Checked 202 log files, 165 edm output root files, 48 DQM output files
TriggerResults: no differences found

GPU Comparison Summary

Summary:

No significant changes to the logs found
Reco comparison results: 10 differences found in the comparisons
DQMHistoTests: Total files compared: 3
DQMHistoTests: Total histograms compared: 39740
DQMHistoTests: Total failures: 455
DQMHistoTests: Total nulls: 0
DQMHistoTests: Total successes: 39285
DQMHistoTests: Total skipped: 0
DQMHistoTests: Total Missing objects: 0
DQMHistoSizes: Histogram memory added: 0.0 KiB( 2 files compared)
Checked 8 log files, 10 edm output root files, 3 DQM output files
TriggerResults: no differences found

fwyzard · 2024-04-12T22:09:26Z

hi Matti, I'm just trying to follow what should happen:

a CUDA error occurs, which causes an exception
while unwinding the stack, a GPU alpaka buffer is freed
CachingAllocator::free() records an event in the current queue
since the CUDA runtime is in an error state, this fails, and raises another exception, which results in a call to terminate()

is it correct ?

Then, with these changes

the second exception raised by recording the event is caught
the block is not put back into the allocator pool

However, when the block goes out of scope, it should result in a CUDA call to free the memory.
Shouldn't this also cause a second exception, and thus a call to terminate() ?
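
The failure sequence in the first list above follows from a general C++ rule: if a second exception escapes while the stack is being unwound for a first one, `std::terminate()` is called. The sketch below shows the safe side of that rule, with a destructor that swallows its own cleanup error so that the original exception still reaches the handler. `Guard` and `runAndReport()` are hypothetical names made up for this example and have nothing to do with alpaka's types.

```cpp
#include <exception>
#include <stdexcept>
#include <string>

// Hypothetical guard type; it only illustrates the C++ rule, not alpaka code.
struct Guard {
  bool cleanupFails;

  // If the exception thrown below escaped this destructor while another
  // exception is propagating, the program would end in std::terminate().
  ~Guard() {
    try {
      if (cleanupFails) {
        throw std::runtime_error("cleanup failed");
      }
    } catch (...) {
      // Swallow the cleanup error so the original exception keeps propagating.
    }
  }
};

// Returns the message of the exception that reaches the handler.
std::string runAndReport(bool cleanupFails) {
  try {
    Guard guard{cleanupFails};
    throw std::runtime_error("illegal memory access");
  } catch (std::exception const& e) {
    return e.what();
  }
}
```

In both cases (`cleanupFails` true or false) the handler sees the original "illegal memory access" error, which is the outcome this PR aims for in `CachingAllocator::free()`.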

makortel · 2024-04-15T14:33:57Z

> a CUDA error occurs, which causes an exception
> while unwinding the stack, a GPU alpaka buffer is freed
> CachingAllocator::free() records an event in the current queue
> since the CUDA runtime is in an error state, this fails, and raises another exception, which results in a call to terminate()
>
> is it correct ?

Correct.

> Then, with these changes
>
> the second exception raised by recording the event is caught
> the block is not put back into the allocator pool
>
> However, when the block goes out of scope, it should result in a CUDA call to free the memory. Shouldn't this also cause a second exception, and thus a call to terminate() ?

The deleter that the Alpaka CUDA/HIP backend uses in the Alpaka buffer does not throw an exception if cudaFree() returns an error; it prints an error message instead:
https://github.com/alpaka-group/alpaka/blob/a4142d3feb7686d803e1ec5f25d7b2278337f455/include/alpaka/mem/buf/BufUniformCudaHipRt.hpp#L266
https://github.com/alpaka-group/alpaka/blob/a4142d3feb7686d803e1ec5f25d7b2278337f455/include/alpaka/core/UniformCudaHip.hpp#L110-L111

The destructors of Alpaka's Queue and Event follow the same pattern, so they too can be left to be destroyed without special attention.
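
The "print instead of throw" pattern described above can be sketched with a custom deleter on a `std::unique_ptr`. This is not alpaka's code: `backendFree()`, `PrintingDeleter` and `destroyBufferAndCountErrors()` are hypothetical names, host memory stands in for device memory, and the error counter exists only so that the effect can be observed.

```cpp
#include <cstdio>
#include <memory>

// Hypothetical stand-in for a backend free call such as cudaFree():
// returns 0 on success and a non-zero error code otherwise.
int backendFree(int* ptr, bool backendBroken) noexcept {
  delete ptr;  // host memory stands in for device memory here
  return backendBroken ? 1 : 0;
}

// Deleter that reports a failed free on stderr and never throws.
struct PrintingDeleter {
  bool backendBroken;
  int* errorCount;  // illustration only: lets the caller observe the failure

  void operator()(int* ptr) const noexcept {
    int const status = backendFree(ptr, backendBroken);
    if (status != 0) {
      std::fprintf(stderr, "backendFree() returned error %d\n", status);
      ++*errorCount;
    }
  }
};

// Creates and destroys one buffer; returns how many free errors were printed.
int destroyBufferAndCountErrors(bool backendBroken) {
  int errors = 0;
  {
    std::unique_ptr<int, PrintingDeleter> buffer(
        new int(42), PrintingDeleter{backendBroken, &errors});
  }  // the deleter runs here and cannot throw
  return errors;
}
```

Because the deleter is `noexcept` and only prints, destroying such a buffer during stack unwinding cannot trigger a second exception.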

makortel · 2024-04-15T14:42:57Z

I'm thinking of adding a unit test

fwyzard · 2024-04-15T20:42:38Z

> I'm thinking of adding a unit test

ok for me :)

makortel · 2024-04-16T15:01:56Z

Added the test. Without this PR the test fails on the CUDA backend.

fwyzard · 2024-04-16T20:15:24Z

+heterogeneous

cmsbuild · 2024-04-16T20:18:33Z

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-44730/39975

This PR adds an extra 24KB to repository
There are other open Pull requests which might conflict with changes you have proposed:
- File HeterogeneousCore/AlpakaInterface/interface/CachingAllocator.h modified in PR(s): Reorganise alpaka kernel loop functions #44712
- File HeterogeneousCore/AlpakaInterface/test/BuildFile.xml modified in PR(s): [RFC] Add {Copy,Move}ToDeviceCache<T> class templates and moveToDeviceAsync function template #43969

cmsbuild · 2024-04-16T20:18:57Z

Pull request #44730 was updated. can you please check and sign again.

makortel · 2024-04-16T20:30:09Z

@cmsbuild, please test

cmsbuild · 2024-04-16T23:51:06Z

+1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-3906d3/38877/summary.html
COMMIT: 65d51bf
CMSSW: CMSSW_14_1_X_2024-04-16-1100/el8_amd64_gcc12
Additional Tests: GPU
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmssw/44730/38877/install.sh to create a dev area with all the needed externals and cmssw changes.

Comparison Summary

Summary:

You potentially removed 94 lines from the logs
Reco comparison results: 43 differences found in the comparisons
DQMHistoTests: Total files compared: 48
DQMHistoTests: Total histograms compared: 3319475
DQMHistoTests: Total failures: 0
DQMHistoTests: Total nulls: 0
DQMHistoTests: Total successes: 3319455
DQMHistoTests: Total skipped: 20
DQMHistoTests: Total Missing objects: 0
DQMHistoSizes: Histogram memory added: 0.0 KiB( 47 files compared)
Checked 202 log files, 165 edm output root files, 48 DQM output files
TriggerResults: no differences found

GPU Comparison Summary

Summary:

No significant changes to the logs found
Reco comparison results: 18 differences found in the comparisons
DQMHistoTests: Total files compared: 3
DQMHistoTests: Total histograms compared: 39740
DQMHistoTests: Total failures: 787
DQMHistoTests: Total nulls: 0
DQMHistoTests: Total successes: 38953
DQMHistoTests: Total skipped: 0
DQMHistoTests: Total Missing objects: 0
DQMHistoSizes: Histogram memory added: 0.0 KiB( 2 files compared)
Checked 8 log files, 10 edm output root files, 3 DQM output files
TriggerResults: no differences found

rappoccio · 2024-04-18T15:20:28Z

+1

cmsbuild added this to the CMSSW_14_1_X milestone Apr 12, 2024

cmsbuild added pending-signatures tests-pending orp-pending code-checks-pending heterogeneous-pending labels Apr 12, 2024

makortel mentioned this pull request Apr 12, 2024

Ignore errors from alpaka::enqueue() in CachingAllocator::free() cms-sw/framework-team#884

Closed

2 tasks

cmsbuild added code-checks-approved and removed code-checks-pending labels Apr 12, 2024

makortel mentioned this pull request Apr 12, 2024

HLT Farm crashes in run 378940 #44634

Open

cmsbuild added tests-started and removed tests-pending labels Apr 12, 2024

cmsbuild added tests-approved and removed tests-started labels Apr 12, 2024

makortel force-pushed the alpakaCachingAllocatorFreeException branch from 0a5eef6 to eb061c6 Compare April 16, 2024 14:59

cmsbuild added tests-pending code-checks-pending and removed tests-approved code-checks-approved labels Apr 16, 2024

Ignore errors from alpaka::enqueue() in CachingAllocator::free()

65d51bf

makortel force-pushed the alpakaCachingAllocatorFreeException branch from eb061c6 to 65d51bf Compare April 16, 2024 20:14

cmsbuild added tests-pending code-checks-pending and removed tests-approved code-checks-approved labels Apr 16, 2024

cmsbuild added fully-signed tests-started heterogeneous-approved and removed pending-signatures tests-pending heterogeneous-pending labels Apr 16, 2024

cmsbuild added code-checks-approved and removed code-checks-pending labels Apr 16, 2024

cmsbuild added tests-approved and removed tests-started labels Apr 16, 2024

This was referenced Apr 17, 2024

[14_0_X] Ignore errors from alpaka::enqueue() in CachingAllocator::free() #44763

Merged

HLT farm crash in run 379617 #44769

Closed

cmsbuild added orp-approved and removed orp-pending labels Apr 18, 2024

cmsbuild merged commit f81a842 into cms-sw:master Apr 18, 2024
14 checks passed

This was referenced Apr 19, 2024

Update ROCm to version 6.1.0 cms-sw/cmsdist#9143

Merged

Update the #include files for ROCm v6.x #44777

Merged

makortel deleted the alpakaCachingAllocatorFreeException branch April 22, 2024 14:55
