
ROS2 w/ Fast-RTPS: Maximum number of subscriptions to a topic [4190] #359

Closed

clalancette opened this issue Dec 19, 2018 · 15 comments

@clalancette (Contributor)
This is a copy of ros2/rmw_fastrtps#249, copied over here at @richiware's request.

Bug report

Required Info:

  • Operating System:
    • Ubuntu 16.04
  • Installation type:
    • from source
  • Version or commit hash:
    • 0.5.1
  • DDS implementation:
    • Fast-RTPS
  • Client library (if applicable):
    • rclcpp

Steps to reproduce issue

#include <memory>
#include <iostream>
#include <thread>
#include <sstream>

#include "rclcpp/rclcpp.hpp"
#include "std_msgs/msg/header.hpp"

using namespace std::chrono_literals;

class SimplePublisherNode : public rclcpp::Node {

public:

    SimplePublisherNode() : Node("publisher_node")
    {
        _publisher = this->create_publisher<std_msgs::msg::Header>("my_topic");

        auto period = 200ms;
        _timer = this->create_wall_timer(period, std::bind(&SimplePublisherNode::publish, this));
    }

    void publish()
    {
        auto message = std::make_shared<std_msgs::msg::Header>();
        _publisher->publish(message);
    }

private:
    rclcpp::Publisher<std_msgs::msg::Header>::SharedPtr _publisher;
    rclcpp::TimerBase::SharedPtr _timer;
};

class SimpleSubscriberNode : public rclcpp::Node {

public:

    SimpleSubscriberNode(std::string name) : Node(name)
    {
        _subscriber = this->create_subscription<std_msgs::msg::Header>("my_topic",
            std::bind(&SimpleSubscriberNode::simple_callback, this, std::placeholders::_1));

        msg_count = 0;
    }

    int get_count() {return msg_count;}

private:

    void simple_callback(const std_msgs::msg::Header::SharedPtr msg)
    {
        msg_count++;
    }

    rclcpp::Subscription<std_msgs::msg::Header>::SharedPtr _subscriber;
    int msg_count;
};


int main(int argc, char ** argv)
{
    rclcpp::init(argc, argv);

    int n_subscribers = 20;

    rclcpp::executors::SingleThreadedExecutor::SharedPtr executor =
        std::make_shared<rclcpp::executors::SingleThreadedExecutor>();

    std::vector<std::shared_ptr<SimpleSubscriberNode>> subscriber_nodes;

    for (int i = 0; i < n_subscribers; i++){
        std::stringstream ss;
        ss << "node_";
        ss << i;
        std::shared_ptr<SimpleSubscriberNode> sub_node = std::make_shared<SimpleSubscriberNode>(ss.str());

        subscriber_nodes.push_back(sub_node);
        executor->add_node(sub_node);
    }

    std::shared_ptr<SimplePublisherNode> pub_node = std::make_shared<SimplePublisherNode>();
    executor->add_node(pub_node);

    std::cout<<"Starting stress test!"<<std::endl;
    std::thread thread([executor](){
        executor->spin();
    });

    thread.detach();

    std::this_thread::sleep_for(std::chrono::seconds(20));

    rclcpp::shutdown();

    for (int i = 0; i < n_subscribers; i++){
        std::shared_ptr<SimpleSubscriberNode> sub_node = subscriber_nodes[i];
        std::cout<<"Node: "<< i<< " received " << sub_node->get_count() << " messages!"<<std::endl;
    }

    return 0;

}

Expected behavior

I create 1 publisher node and 20 subscriber nodes.
I add all of them to the same executor and make them spin.
I let them spin for 20 seconds.

I would expect all of the nodes to receive messages.

Actual behavior

Almost every time I run this simple system, a variable number of nodes do not receive any messages.

I observed this behavior when the number of subscriptions to a topic is greater than 10.

The problem is not related to the threading implementation (e.g. I tried running each node in a separate std::thread or in a separate process and the behavior does not change).

The number of nodes that will not receive anything is not predictable. A few times the system worked perfectly.

Here is a plot of what's going on for a varying number of subscriber nodes, always with 1 publisher node.
Note that most of the time, aside from the nodes that receive 0 messages, all the others receive 95%-100% of the messages.

[Plot: pub_sub_performance]

@clalancette (Contributor, Author)

The reason I think this is a problem with Fast-RTPS (and not elsewhere in the ROS 2 stack) is that I tested this with both RMW_IMPLEMENTATION=rmw_fastrtps_cpp and RMW_IMPLEMENTATION=rmw_opensplice_cpp. With Fast-RTPS, I see the problem with dropped messages, while with OpenSplice, I do not.

richiware changed the title from "ROS2 w/ Fast-RTPS: Maximum number of subscriptions to a topic" to "ROS2 w/ Fast-RTPS: Maximum number of subscriptions to a topic [4190]" on Dec 21, 2018
@alsora (Contributor) commented Jan 24, 2019

@richiware do you have any update on this?

Additional experiments suggest that it's a problem in the node discovery process.
There are 3 possible situations:

  • nodes receiving 100% messages
  • nodes receiving 0% messages
  • nodes receiving X% messages

It's interesting to note that in the last case the lost messages are always the first ones published; once a node receives a message, it also receives all the subsequent ones.

@richiware (Member)

@clalancette Thanks for your code example. I was testing with it and I was able to get:

Starting stress test!
Node: 0 received 99 messages!
Node: 1 received 99 messages!
Node: 2 received 99 messages!
Node: 3 received 99 messages!
Node: 4 received 99 messages!
Node: 5 received 99 messages!
Node: 6 received 99 messages!
Node: 7 received 99 messages!
Node: 8 received 99 messages!
Node: 9 received 99 messages!
Node: 10 received 99 messages!
Node: 11 received 99 messages!
Node: 12 received 99 messages!
Node: 13 received 99 messages!
Node: 14 received 99 messages!
Node: 15 received 99 messages!
Node: 16 received 99 messages!
Node: 17 received 99 messages!
Node: 18 received 99 messages!
Node: 19 received 99 messages!

I think this is the expected behavior of your example. I had to change your code a little to achieve this. The problem is related to the default QoS that ROS2 uses to create publishers and subscribers. By default the DurabilityQoS is VOLATILE, in which case a publisher won't send old data to a newly discovered subscriber.

Your example creates several subscribers and a publisher, and immediately starts to send data. The discovery process needs some time, and by the time a subscriber is first discovered, several samples have already been sent and the publisher won't resend them to the newly discovered subscriber.

I changed the behavior by setting the DurabilityQoS to TRANSIENT_LOCAL. With this, the publisher will send old samples to a newly discovered subscriber.

    SimplePublisherNode() : Node("publisher_node")
    {
        rmw_qos_profile_t qos;
        qos.history = RMW_QOS_POLICY_HISTORY_KEEP_ALL;
        qos.reliability = RMW_QOS_POLICY_RELIABILITY_RELIABLE;
        qos.durability = RMW_QOS_POLICY_DURABILITY_TRANSIENT_LOCAL;
        _publisher = this->create_publisher<std_msgs::msg::Header>("my_topic", qos);

        auto period = 200ms;
        _timer = this->create_wall_timer(period, std::bind(&SimplePublisherNode::publish, this));
    }

    SimpleSubscriberNode(std::string name) : Node(name)
    {
        rmw_qos_profile_t qos;
        qos.history = RMW_QOS_POLICY_HISTORY_KEEP_ALL;
        qos.reliability = RMW_QOS_POLICY_RELIABILITY_RELIABLE;
        qos.depth = 100;
        qos.durability = RMW_QOS_POLICY_DURABILITY_TRANSIENT_LOCAL;
        _subscriber = this->create_subscription<std_msgs::msg::Header>("my_topic",
            std::bind(&SimpleSubscriberNode::simple_callback, this, std::placeholders::_1), qos);

        msg_count = 0;
    }

@alsora (Contributor) commented Jan 31, 2019

Thanks a lot @richiware for the update.

Your point makes perfect sense; however, even with the updated QoS, I sometimes find that one node is not discovered in time.
I added a short wait between the node creation and the call to executor->spin(), and this allows all the messages to be received.
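
Roughly, the workaround looks like this in the main() of the example above (just a sketch; the 2-second value is arbitrary, it only needs to cover the discovery phase):

    // ... all subscriber nodes and the publisher node have been created
    // and added to `executor` ...

    // Give discovery some time to finish before the executor starts spinning
    // (and therefore before the publisher's wall timer starts firing).
    std::this_thread::sleep_for(std::chrono::seconds(2));

    std::thread thread([executor](){
        executor->spin();
    });
    thread.detach();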

However, this may not be the optimal solution for my use case, since I need messages as soon as they are published or they are useless.

I find it more useful to simply wait more than 30 seconds before starting to publish, while still using the default QoS.
Actually, it's strange: I had already tried a wait of approximately 20-30 seconds when I opened this issue with Bouncy, but it was not working.

If a discovery time of approximately 30 seconds is expected, I think we can close this.

@richiware (Member)

@alsora I hadn't noticed the long discovery time during my testing, but you're right, it's not suitable. I think the QoS of the Endpoint Discovery Phase entities can be improved. I will continue working on this problem.

@richiware (Member)

I'm working on #411 to improve discovery times.

@alsora (Contributor) commented Feb 13, 2019

Actually, no waiting time is required if the device running the program is not connected to any network.

I guess that makes sense, since the nodes only have to discover each other within the same machine; however, it does not clearly explain why discovery takes so long when there is a network connection.

@richiware (Member)

Could you test with the release/1.7.1 branch? I want to know whether my changes fixed your problem.

@richiware (Member)

@clalancette In #411 you mention that the PR improves your discovery times. Does this PR, already included in the new release v1.7.1, fix this issue? Thanks.

@clalancette (Contributor, Author)

This issue is vastly improved with Fast-RTPS 1.7.1. I ran this test 10 times, and I only saw one case where a single message was dropped. So I think that we can call this problem fixed, and you can close this issue when you are satisfied. Note that I'm going to leave ros2/rmw_fastrtps#249 open to track getting the fix into ROS 2.

@clalancette (Contributor, Author)

Never mind that last bit, I realized that we track master in ROS 2, so we should already have the fix.

@richiware (Member)

Thanks for your help.

As a note: this performance decrease in discovery dates from when we modified the default values for reliable communication to improve performance when sending video (ros2/rmw_fastrtps#202). That change must have degraded discovery performance. Now discovery uses its own values.

@alsora (Contributor) commented Feb 20, 2019

I finally had time to test the new release version.

I only did some quick tests, i.e. 1 publisher node with 5, 10 or 15 subscriber nodes.
I compared Bouncy, Crystal Patch 1 and the current Crystal MASTER + FastRTPS new release.

The results are interesting, but not entirely positive.
To me it looks like the performance got worse in the simple scenarios, while it definitely improved in the more demanding ones.

Here are some results; outliers are excluded from the averages.

Case (1) 1 publisher 5 subscribers
Bouncy: < 10ms
Crystal: < 10ms
Master: 150ms

Case (2) 1 publisher 10 subscribers
Bouncy: 10ms (10% of the tests labeled as outliers with a discovery time of 2 seconds)
Crystal: 40ms (30% of the tests labeled as outliers with a discovery time of 40 seconds)
Master: 200ms

Case (3) 1 publisher 15 subscribers
Bouncy: N/A; around 80% of the tests timed out after 100 seconds with 1 or 2 nodes still to discover.
Crystal: 100ms (50% of the tests labeled as outliers with a discovery time of 40 seconds)
Master: 700ms (5% of the tests labeled as outliers with a discovery time of 40 seconds)
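
For context, by "discovery time" I mean roughly the time until the publisher node sees all the subscriptions on the topic. A minimal sketch of that kind of measurement with rclcpp (not my exact harness), reusing pub_node, n_subscribers and "my_topic" from the example above:

    // Poll the ROS graph until the publisher node sees every subscription
    // on "my_topic", and measure how long that takes.
    auto start = std::chrono::steady_clock::now();
    while (pub_node->count_subscribers("my_topic") < static_cast<size_t>(n_subscribers)) {
        std::this_thread::sleep_for(std::chrono::milliseconds(1));
    }
    auto elapsed = std::chrono::duration_cast<std::chrono::milliseconds>(
        std::chrono::steady_clock::now() - start);
    std::cout << "Discovery took " << elapsed.count() << " ms" << std::endl;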

@richiware (Member)

The changes I made for this issue were related to reliable communications and focused on:

  • Better usage of the network when sending meta-messages (like HEARTBEAT messages) to several participants. In some places we were not trying to send to multicast when there were several destinations.
  • Increasing some waiting times in order to later make better use of the network, for example the property nackResponseDelay, which I increased to ~100ms. This is the main reason you see those results.

With these changes, @clalancette's scenario improves from 30 seconds to less than 2 seconds. I think this gain outweighs the few milliseconds the other scenarios lose, so it is better to keep these values as the defaults for discovery.

But you are right that each scenario is different and needs its own DDS configuration. We will make the configuration of the discovery built-in endpoints available to the user, so it can be adjusted to each scenario.

About the 40 seconds you sometimes see: I know the reason. Currently a participant answers another participant's DATA(p) by sending its own DATA(p) to multicast. As the number of participants increases, more DATA(p) messages are sent at the same time, and I saw the system discard some of them. To solve this, a DATA(p) that is an answer should be sent by unicast. But this change wasn't as easy as the others I made; I plan to make it as soon as possible.

About the 100 seconds, I don't know what is happening. Can you provide an example that reproduces this case?

And thanks a lot for your help.

@alsora (Contributor) commented Feb 21, 2019

Yes, now the general behavior is way better.
That was just a comparison of the results; knowing the reason why the minimum discovery time has increased is definitely good.

It's also very good to know that you are already aware of what is required to fix the 40-second issue.

The experiments labeled Bouncy were using an old version of FastRTPS (the one shipped with that ROS2 distribution), so that issue is now solved.

Thanks a lot for your explanations!
