Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

testFWCoreConcurrencyCatch2 failing in MULTIARCH_X, ROOT632_X, ROOT6_X #45194

Closed
iarspider opened this issue Jun 11, 2024 · 13 comments · Fixed by #45402
Closed

testFWCoreConcurrencyCatch2 failing in MULTIARCH_X, ROOT632_X, ROOT6_X #45194

iarspider opened this issue Jun 11, 2024 · 13 comments · Fixed by #45402

Comments

@iarspider
Copy link
Contributor

In CMSSW_14_1_{MULTIARCH_X, ROOT632_X, ROOT6_X}, test testFWCoreConcurrencyCatch2 is failing:

src/FWCore/Concurrency/test/test_catch2_WaitingThreadPool.cc:118: FAILED:
  {Unknown expression after the reported line}
due to a fatal error condition:
  SIGABRT - Abort (abnormal termination) signal
@iarspider
Copy link
Contributor Author

assign FWCore/Concurrency

@cmsbuild
Copy link
Contributor

New categories assigned: core

@Dr15Jones,@makortel,@smuzaffar you have been requested to review this Pull request/Issue and eventually sign? Thanks

@cmsbuild
Copy link
Contributor

cmsbuild commented Jun 11, 2024

cms-bot internal usage

@cmsbuild
Copy link
Contributor

A new Issue was created by @iarspider.

@smuzaffar, @rappoccio, @Dr15Jones, @sextonkennedy, @makortel, @antoniovilela can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

@iarspider
Copy link
Contributor Author

First occurance: CMSSW_14_1_ROOT632_X_2024-06-09-2300

@Dr15Jones
Copy link
Contributor

The problem does not appear to be reproducible (as is true for many threading tests). Was the machine running the tests under a high load?

@Dr15Jones
Copy link
Contributor

The actual failure in the log is

testFWCoreConcurrencyCatch2: src/FWCore/Utilities/interface/ReusableObjectHolder.h:92: edm::ReusableObjectHolder<T, Deleter>::~ReusableObjectHolder() [with T = edm::impl::WaitingThread; Deleter = std::default_delete<edm::impl::WaitingThread>]: Assertion `0 == m_outstandingObjects' failed.

@Dr15Jones
Copy link
Contributor

I believe there is a race-condition in edm::asyncRun when the WaitingThreadPool is being destroyed, there is no guaranteed that the threads it spawned (via WaitingThread) have actually finished being returned to the ReusableObjectHolder so they can be properly deleted.

@iarspider
Copy link
Contributor Author

Happened again in CMSSW_14_1_GEANT4_X_2024-07-07-2300:

===== Test "testFWCoreConcurrencyCatch2" ====
testFWCoreConcurrencyCatch2: src/FWCore/Utilities/interface/ReusableObjectHolder.h:92: edm::ReusableObjectHolder<T, Deleter>::~ReusableObjectHolder() [with T = edm::impl::WaitingThread; Deleter = std::default_delete<edm::impl::WaitingThread>]: Assertion `0 == m_outstandingObjects' failed.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
testFWCoreConcurrencyCatch2 is a Catch v2.13.6 host application.
Run with -? for options

-------------------------------------------------------------------------------
Test WaitingThreadPool
-------------------------------------------------------------------------------
src/FWCore/Concurrency/test/test_catch2_WaitingThreadPool.cc:14
...............................................................................

src/FWCore/Concurrency/test/test_catch2_WaitingThreadPool.cc:242: FAILED:
  {Unknown expression after the reported line}
due to a fatal error condition:
  SIGABRT - Abort (abnormal termination) signal

===============================================================================
test cases:  1 |  0 passed | 1 failed
assertions: 18 | 17 passed | 1 failed

timeout: the monitored command dumped core
/bin/sh: line 1: 3192312 Aborted                 timeout 3600 sh -c 'testFWCoreConcurrencyCatch2 '

---> test testFWCoreConcurrencyCatch2 had ERRORS
TestTime:1
^^^^ End Test testFWCoreConcurrencyCatch2 ^^^^

@makortel
Copy link
Contributor

makortel commented Jul 8, 2024

We took a deeper look with @Dr15Jones. The problem is likely the following: The code in

func_();
// Must return this WaitingThread to the ReusableObjectHolder in
// the WaitingThreadPool before resettting func_ (that holds the
// WaitingTaskWithArenaHolder, that enables the progress in the
// TBB thread pool) in order to meet the requirement of
// ReusableObjectHolder destructor that there are no outstanding
// objects.
thisPtr_.reset();
decltype(func_)().swap(func_);

expects the thisPtr_.reset() (that returns the WaitingThread object to be returned to the ReusableObjectHolder to be called before the WaitingTaskWithArenaHolder::doneWaiting() in the swap() line (that then decrements the refcount of the WaitingTask, and may lead to the contained task to be scheduled if the refcount reaches 0). If the user-provided function throws an exception, the exception is communicated in
try {
convertException::wrap([&func]() { func(); });
} catch (cms::Exception& e) {
e.addContext(errorContext());
holder.doneWaiting(std::current_exception());
}

The problem is that the holder.doneWaiting(std::current_exception()) leads to the refcount to be decreased already during

that can cause the task to be scheduled before the WaitingThread is returned to the ReusableObjectHolder.

The fix would be to add a function presetTaskAsFailed() (that WaitingTaskHolder has) to WaitingTaskWithArenaHolder, and use that to communicate the exception instead of doneWaiting(), and leave the refcount decrement to be done by the WaitingTaskWithArenaHolder destructor.

@makortel
Copy link
Contributor

makortel commented Jul 8, 2024

Hopefully fixed by #45402

@makortel
Copy link
Contributor

+core

We can reopen the issue if #45402 turns out not to be sufficient

@cmsbuild
Copy link
Contributor

This issue is fully signed and ready to be closed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants