Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix race condition on Publisher shutdown #2812

Merged
merged 3 commits into from
Aug 7, 2020
Merged

Conversation

iche033
Copy link
Contributor

@iche033 iche033 commented Aug 4, 2020

The race condition problem occurs when the Publisher::OnPublishComplete callback is invoked from a different thread after the Publisher object is destroyed, causing a crash. There's been a few attempts to fixing this race condition, see this commit and this PR, but the issue persists and it is just harder to reproduce. When successfully reproduced, I get a assertion error in this line when locking the mutex.

The fix is to bind the callback function along with a shared pointer instead of raw pointer this so that the object is kept alive for a little longer until after the callback is complete. In the case of Publisher, this should be safe to do as the OnPublishComplete function is mainly just doing some bookkeeping work. Note that I added a PublisherPrivate class and stored instances of this class in a static variable to avoid breaking ABI.

To reproduce this issue:

  1. Save this model as box.sdf file:
<?xml version="1.0" ?>
<sdf version="1.6">
  <model name="box">
    <pose>0 0 0.5 0 0 0</pose>
    <link name="link">
      <collision name="collision">
        <geometry>
          <box>
            <size>1 1 1</size>
          </box>
        </geometry>
      </collision>
      <visual name="visual">
        <geometry>
          <box>
            <size>1 1 1</size>
          </box>
        </geometry>
      </visual>
    </link>
  </model>
</sdf>
  1. launch gazebo using roslaunch:
roslaunch gazebo_ros empty_world.launch verbose:=true
  1. Run the following script that spawns and deletes the box model 500 times:
#!/bin/bash
for i in {0..500}; do
    echo "$i"
    rosparam set pizza --textfile box.sdf
    rosrun gazebo_ros spawn_model -param pizza -sdf -model box -x 0 -y 0.5 -z 0
    sleep 0.5
    rosservice call gazebo/delete_model '{model_name: box}'
done
  1. Without changes in this PR, the crash occurs on spawn after a couple hundred iterations for me

Signed-off-by: Ian Chen ichen@osrfoundation.org

Signed-off-by: Ian Chen <ichen@osrfoundation.org>
@chapulina chapulina added the 9 Gazebo 9 label Aug 6, 2020
Copy link
Contributor

@chapulina chapulina left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wow, nice ABI gymnastics!

The fix is to bind the callback function along with a shared pointer instead of raw pointer this so that the object is kept alive for a little longer

Interesting, in this case I guess the global static map helped, because if you had used a regular dataPtr as unique_ptr, it would have died together with the Publisher.

@@ -188,12 +250,14 @@ void Publisher::SendMessage()
{
// Send the latest message.
int result = this->publication->Publish(*iter,
boost::bind(&Publisher::OnPublishComplete, this, _1), *pubIter);
boost::bind(&PublisherPrivate::OnPublishComplete,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could use the opportunity to change to std::bind

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done. c1e40a5

@@ -268,6 +308,9 @@ void Publisher::Fini()
TopicManager::Instance()->Unadvertise(this->topic, this->id);

this->node.reset();

std::lock_guard<std::mutex> lock(pubMapMutex);
publisherMap.erase(this->id);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I remember having a reason to destroy all publishers before the node, but I can't remember exactly why. Looking through the code, this pattern seems common:

this->pub.reset();
this->node->Fini();
this->node.reset();

If you don't have a strong reason for terminating the publisher after the node, I'd recommend switching order here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

moved erase before resetting node c1e40a5

Signed-off-by: Ian Chen <ichen@osrfoundation.org>
Signed-off-by: Ian Chen <ichen@osrfoundation.org>
@iche033
Copy link
Contributor Author

iche033 commented Aug 7, 2020

Interesting, in this case I guess the global static map helped, because if you had used a regular dataPtr as unique_ptr, it would have died together with the Publisher.

yep the publisher object has to be a shared ptr in a static map. I added a comment in ce2b9df

Copy link
Contributor

@chapulina chapulina left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The latest updates look good too 👍

@iche033 iche033 merged commit 5881f82 into gazebo9 Aug 7, 2020
@iche033 iche033 deleted the publisher_complete branch August 7, 2020 23:05
@scpeters scpeters mentioned this pull request Jan 25, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
9 Gazebo 9
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants