ROS2 w/ Fast-RTPS: Maximum number of subscriptions to a topic [4190] #359
The reason I think this is a problem with Fast-RTPS (and not elsewhere in the ROS 2 stack) is that I tested this with both RMW_IMPLEMENTATION=rmw_fastrtps_cpp and RMW_IMPLEMENTATION=rmw_opensplice_cpp. With Fast-RTPS I see the dropped messages, while with OpenSplice I do not.
@richiware do you have any update on this? Additional experiments suggest to me that it's a problem in the node discovery process.
It's interesting to note that in the last case the lost messages are always the first ones published; once a node receives a message, it also receives all the others.
@clalancette Thanks for your code example. I was testing with it, and I think what you see is the expected behavior of your example; I had to change your code a little to confirm this. The problem is related to the default QoS that ROS 2 uses to create publishers and subscribers. By default, DurabilityQoS is VOLATILE, and in this case a publisher won't send old data to a newly discovered subscriber. Your example creates several subscribers and a publisher, and the publisher immediately starts to send data. The discovery process needs some time, so by the time the first subscriber is discovered, several samples have already been sent, and the publisher won't resend them to the newly discovered subscriber. I changed the behavior by setting DurabilityQoS to TRANSIENT_LOCAL; with this, the publisher sends its old samples to each newly discovered subscriber:

```cpp
SimplePublisherNode() : Node("publisher_node")
{
  // Start from the default profile so the remaining fields are initialized.
  rmw_qos_profile_t qos = rmw_qos_profile_default;
  qos.history = RMW_QOS_POLICY_HISTORY_KEEP_ALL;
  qos.reliability = RMW_QOS_POLICY_RELIABILITY_RELIABLE;
  qos.durability = RMW_QOS_POLICY_DURABILITY_TRANSIENT_LOCAL;
  _publisher = this->create_publisher<std_msgs::msg::Header>("my_topic", qos);
  auto period = 200ms;
  _timer = this->create_wall_timer(period, std::bind(&SimplePublisherNode::publish, this));
}

SimpleSubscriberNode(std::string name) : Node(name)
{
  rmw_qos_profile_t qos = rmw_qos_profile_default;
  qos.history = RMW_QOS_POLICY_HISTORY_KEEP_ALL;
  qos.reliability = RMW_QOS_POLICY_RELIABILITY_RELIABLE;
  qos.depth = 100;
  qos.durability = RMW_QOS_POLICY_DURABILITY_TRANSIENT_LOCAL;
  _subscriber = this->create_subscription<std_msgs::msg::Header>("my_topic",
      std::bind(&SimpleSubscriberNode::simple_callback, this, std::placeholders::_1), qos);
  msg_count = 0;
}
```
Thanks a lot @richiware for the update. Your point makes perfect sense; however, even with the updated QoS I sometimes see that one node is not discovered in time. In any case, this may not be the optimal solution for my use case, since I need messages as soon as they are published or they are useless. I find it more useful to simply wait more than 30 seconds before starting to publish, while still using the default QoS. If a discovery time of approximately 30 seconds is expected, I think we can close this.
@alsora I hadn't observed such long discovery times during my testing. But you're right, it's not acceptable. I think the QoS of the Endpoint Discovery Phase entities can be improved. I will continue working on this problem.
I'm working on #411 to improve discovery times.
Actually, no waiting time is required if the device running the program is not connected to any network. I guess that makes sense, since nodes then only have to find each other on the same machine; however, it doesn't clearly explain why discovery takes so long when there is a network connection.
Could you test with the release/1.7.1 branch? I want to know whether my changes fixed your problem.
@clalancette In #411 you mention the PR improves your discovery times. Does this PR, already included in the new release v1.7.1, fix this issue? Thanks.
This issue is vastly improved with Fast-RTPS 1.7.1. I ran this test 10 times, and I only saw one case where a single message was dropped. So I think that we can call this problem fixed, and you can close this issue when you are satisfied. Note that I'm going to leave ros2/rmw_fastrtps#249 open to track getting the fix into ROS 2.
Never mind that last bit, I realized that we track master in ROS 2, so we should already have the fix. |
Thanks for your help. As a note: this performance decrease in discovery dates from when we modified the default values for reliable communication to improve performance when sending video (ros2/rmw_fastrtps#202). That must have degraded discovery performance as a side effect. Now discovery uses its own values.
I finally had time to test the new release. I only did some quick tests, i.e. 1 publisher node with 5, 10, or 15 subscriber nodes. The results are interesting, but not entirely positive. Here are some results (outliers excluded from the averages):
Case (1): 1 publisher, 5 subscribers
Case (2): 1 publisher, 10 subscribers
Case (3): 1 publisher, 15 subscribers
The changes I made for this issue were related to reliable communications and focused on:
With these changes, @clalancette's scenario improves from 30 seconds to less than 2 seconds. I think that is a bigger improvement than the few milliseconds by which other scenarios worsen, so it seems better to leave these values as the defaults for discovery. But it's true that each scenario is different and needs its own DDS configuration; we will expose the configuration of the discovery built-in endpoints so users can adjust them to their scenario.

About the 40 seconds you sometimes see: I know the reason. Currently a participant answers another participant's DATA(p) by sending its own DATA(p) to multicast. As the number of participants increases, more DATA(p) arrive at the same time, and I saw the system discard some of them. To solve this, a DATA(p) that is an answer should be sent by unicast. But that change wasn't as easy as the others I made, so I plan to do it as soon as possible.

About the 100 seconds, I don't know what is happening. Can you provide an example that reproduces this case? Thanks a lot for your help.
Yes, the general behavior is now much better. It's also very good to know that you are already aware of what is required to fix the 40-second issue. The experiments labeled as Bouncy were using an old version of Fast-RTPS, the one shipping with that ROS 2 distribution, so the issue is now solved. Thank you a lot for your explanations!
This is a copy of ros2/rmw_fastrtps#249, copied over here at @richiware 's request.
Bug report
Required Info:
- 0.5.1
Steps to reproduce issue
- I create 1 publisher node and 20 subscriber nodes.
- I add all of them to the same executor and make them spin.
- I let them spin for 20 seconds.
Expected behavior
I would expect that all the nodes receive messages.
Actual behavior
Almost every time I run this simple system, a variable number of nodes do not receive any message.
I observed this behavior when the number of subscriptions to a topic is greater than 10.
The problem is not related to the threading implementation (e.g. I tried running each node in a separate std::thread or in a separate process, and nothing changes).
The number of nodes that will not receive anything is not predictable; a few times the system worked perfectly.
Here is a plot of what's going on for a varying number of subscriber nodes and always 1 publisher node.
Note that most of the time, besides the nodes that receive 0 messages, all the others receive 95%-100% of the messages.