Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

make events threadsafe #3042

Merged
merged 4 commits into from
Aug 13, 2021
Merged

make events threadsafe #3042

merged 4 commits into from
Aug 13, 2021

Conversation

lihui815
Copy link
Contributor

@lihui815 lihui815 commented Jul 8, 2021

Fixes Per issue #2874, race conditions can occur between EventT::Connect and EventT::Signal. In order to address this, add lock on Connect.

However, this only keeps thread-safety between Connect (insert) and the Cleanup (delete) referenced from Signal. There is still potential race condition between the rest of Signal (access) and Connect (insert). Adding a mutex lock guard here however will lock up the program as there appears to be a Connect call from a callback, and using recursive mutexes will impact performance to some extent.

Instead, since the conflict occurs between connection insertion and access, add a null check before accessing to each Signal definition.

@iche033
Copy link
Contributor

iche033 commented Jul 15, 2021

thanks for the fix. Do you have a minimal example that reproduces this segfault?

@j-rivero j-rivero self-requested a review July 27, 2021 13:36
@j-rivero
Copy link
Contributor

PR looks great to me. Do we need to include the mutex lock in the destructor too? I've done it in 0096765.

I've setup a branch to include changes in destructor, mutable for mutex and an experimental test that plays with 3 threads trying to force a possible race condition: a6e8999

These changes are in jrivero/gazebo11-tse/test branch in this same repository. If you agree, you can just pull them from my branch. Thanks!

@lihui815
Copy link
Contributor Author

lihui815 commented Aug 9, 2021

@j-rivero thanks & merged in PR

lihui815 and others added 4 commits August 9, 2021 10:55
- initial connection (insert) and cleanup (remove) locked with mutex

- signal (access) checks for object not null before attempting to access (not locking because would need recursive mutex)
Signed-off-by: Jose Luis Rivero <jrivero@osrfoundation.org>
Signed-off-by: Jose Luis Rivero <jrivero@osrfoundation.org>
Signed-off-by: Jose Luis Rivero <jrivero@osrfoundation.org>
@j-rivero
Copy link
Contributor

Umm we have a problem on Mac:

25: [ RUN      ] EventTest.RaceConditions
25: Start create connections thread
25: Start access thread
25: Start remove thread
25: libc++abi.dylib: terminating with uncaught exception of type std::__1::system_error: mutex lock failed: Invalid argument
 25/416 Test  #25: UNIT_Event_TEST .......................................Subprocess aborted***Exception:   3.42 sec
test 26
        Start  26: check_UNIT_Event_TEST

26: Test command: /Users/jenkins/workspace/gazebo-ci-pr_any-homebrew-amd64/gazebo/tools/check_test_ran.py "/Users/jenkins/workspace/gazebo-ci-pr_any-homebrew-amd64/build/test_results/UNIT_Event_TEST.xml"
26: Test timeout computed to be: 10000000
26: Checking for test results in /Users/jenkins/workspace/gazebo-ci-pr_any-homebrew-amd64/build/test_results/UNIT_Event_TEST.xml
26: Cannot find results, writing failure results to /Users/jenkins/workspace/gazebo-ci-pr_any-homebrew-amd64/build/test_results/UNIT_Event_TEST.xml
 26/416 Test  #26: check_UNIT_Event_TEST .................................***Failed    0.06 sec

Need to look into it.

@j-rivero
Copy link
Contributor

Need to look into it.

I don't want to block this PR since it improves the current state of things. I've ticketed the issue of the test problem on Mac #3063.

@j-rivero j-rivero merged commit fc315a2 into gazebosim:gazebo11 Aug 13, 2021
j-rivero pushed a commit that referenced this pull request Aug 13, 2021
- Initial connection (insert) and cleanup (remove) locked with mutex
- Signal (access) checks for object not null before attempting to access (not locking because would need recursive mutex)
- Declare mutex as mutable so EventCount const method can block it
- Include race condition test to play with connections inside an event

(contribution by Sonia Jin from AWS, back port of #3042)
Co-authored-by: Jose Luis Rivero <jrivero@osrfoundation.org>
j-rivero added a commit that referenced this pull request Aug 16, 2021
- Initial connection (insert) and cleanup (remove) locked with mutex
- Signal (access) checks for object not null before attempting to access (not locking because would need recursive mutex)
- Declare mutex as mutable so EventCount const method can block it
- Include race condition test to play with connections inside an event

(contribution by Sonia Jin from AWS, back port of #3042)
Co-authored-by: Jose Luis Rivero <jrivero@osrfoundation.org>
Co-authored-by: sonia <lihui815@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants