
ROS2 w/ Fast-RTPS: Maximum number of subscriptions to a topic [4190] #359

Closed

clalancette opened this issue Dec 19, 2018 · 15 comments

@clalancette (Contributor)
This is a copy of ros2/rmw_fastrtps#249, copied over here at @richiware's request.

Bug report

Required Info:

  • Operating System:
    • Ubuntu 16.04
  • Installation type:
    • from source
  • Version or commit hash:
    • 0.5.1
  • DDS implementation:
    • Fast-RTPS
  • Client library (if applicable):
    • rclcpp

Steps to reproduce issue

#include <memory>
#include <iostream>
#include <thread>
#include <sstream>

#include "rclcpp/rclcpp.hpp"
#include "std_msgs/msg/header.hpp"

using namespace std::chrono_literals;

class SimplePublisherNode : public rclcpp::Node {

public:

    SimplePublisherNode() : Node("publisher_node")
    {
        _publisher = this->create_publisher<std_msgs::msg::Header>("my_topic");

        auto period = 200ms;
        _timer = this->create_wall_timer(period, std::bind(&SimplePublisherNode::publish, this));
    }

    void publish()
    {
        auto message = std::make_shared<std_msgs::msg::Header>();
        _publisher->publish(message);
    }

private:
    rclcpp::Publisher<std_msgs::msg::Header>::SharedPtr _publisher;
    rclcpp::TimerBase::SharedPtr _timer;
};

class SimpleSubscriberNode : public rclcpp::Node {

public:

    SimpleSubscriberNode(std::string name) : Node(name)
    {
        _subscriber = this->create_subscription<std_msgs::msg::Header>("my_topic",
            std::bind(&SimpleSubscriberNode::simple_callback, this, std::placeholders::_1));

        msg_count = 0;
    }

    int get_count() {return msg_count;}

private:

    void simple_callback(const std_msgs::msg::Header::SharedPtr msg)
    {
        msg_count++;
    }

    rclcpp::Subscription<std_msgs::msg::Header>::SharedPtr _subscriber;
    int msg_count;
};


int main(int argc, char ** argv)
{
    rclcpp::init(argc, argv);

    int n_subscribers = 20;

    rclcpp::executors::SingleThreadedExecutor::SharedPtr executor =
        std::make_shared<rclcpp::executors::SingleThreadedExecutor>();

    std::vector<std::shared_ptr<SimpleSubscriberNode>> subscriber_nodes;

    for (int i = 0; i < n_subscribers; i++){
        std::stringstream ss;
        ss << "node_";
        ss << i;
        std::shared_ptr<SimpleSubscriberNode> sub_node = std::make_shared<SimpleSubscriberNode>(ss.str());

        subscriber_nodes.push_back(sub_node);
        executor->add_node(sub_node);
    }

    std::shared_ptr<SimplePublisherNode> pub_node = std::make_shared<SimplePublisherNode>();
    executor->add_node(pub_node);

    std::cout<<"Starting stress test!"<<std::endl;
    std::thread thread([executor](){
        executor->spin();
    });

    thread.detach();

    std::this_thread::sleep_for(std::chrono::seconds(20));

    rclcpp::shutdown();

    for (int i = 0; i < n_subscribers; i++){
        std::shared_ptr<SimpleSubscriberNode> sub_node = subscriber_nodes[i];
        std::cout<<"Node: "<< i<< " received " << sub_node->get_count() << " messages!"<<std::endl;
    }

    return 0;

}

Expected behavior

I create 1 publisher node and 20 subscriber nodes.
I add all of them to the same executor and make them spin.
I let them spin for 20 seconds.

I would expect all of the nodes to receive messages.

Actual behavior

Almost every time I run this simple system, a variable number of nodes do not receive any messages.

I observed this behavior when the number of subscriptions to a topic is greater than 10.

The problem is not related to the threading implementation (e.g. I tried running each node in a separate std::thread or in a separate process and the behavior does not change).

The number of nodes that will not receive anything is not predictable. A few times the system worked perfectly.

Here is a plot of what's going on for a varying number of subscriber nodes, always with 1 publisher node.
Note that most of the time, aside from the nodes that receive 0 messages, all the others receive 95%-100% of the messages.

[Plot: pub_sub_performance]

@clalancette (Contributor, Author)

The reason I think this is a problem with Fast-RTPS (and not elsewhere in the ROS 2 stack) is that I tested this with both RMW_IMPLEMENTATION=rmw_fastrtps_cpp and RMW_IMPLEMENTATION=rmw_opensplice_cpp. With Fast-RTPS, I see the problem with dropped messages, while with OpenSplice, I do not.

richiware changed the title from "ROS2 w/ Fast-RTPS: Maximum number of subscriptions to a topic" to "ROS2 w/ Fast-RTPS: Maximum number of subscriptions to a topic [4190]" on Dec 21, 2018
@alsora (Contributor) commented Jan 24, 2019

@richiware do you have any update on this?

Additional experiments suggest that it's a problem in the node discovery process.
There are 3 possible situations:

  • nodes receiving 100% messages
  • nodes receiving 0% messages
  • nodes receiving X% messages

It's interesting to note that in the last case the lost messages are always the first ones published; once a node receives a message, it also receives all the subsequent ones.

@richiware (Member)

@clalancette Thanks for your code example. I was testing with it and I was able to get:

Starting stress test!
Node: 0 received 99 messages!
Node: 1 received 99 messages!
Node: 2 received 99 messages!
Node: 3 received 99 messages!
Node: 4 received 99 messages!
Node: 5 received 99 messages!
Node: 6 received 99 messages!
Node: 7 received 99 messages!
Node: 8 received 99 messages!
Node: 9 received 99 messages!
Node: 10 received 99 messages!
Node: 11 received 99 messages!
Node: 12 received 99 messages!
Node: 13 received 99 messages!
Node: 14 received 99 messages!
Node: 15 received 99 messages!
Node: 16 received 99 messages!
Node: 17 received 99 messages!
Node: 18 received 99 messages!
Node: 19 received 99 messages!

I think this is the expected behavior of your example. I had to change your code a little to achieve this. The problem is related to the default QoS that ROS2 uses to create publishers and subscribers. By default the DurabilityQoS is VOLATILE, in which case a publisher won't send old data to a newly discovered subscriber.

Your example creates several subscribers and a publisher, and immediately starts to send data. The discovery process needs some time, and by the time a subscriber is first discovered, several samples have already been sent and the publisher won't resend them to the newly discovered subscriber.

I changed the behavior by setting the DurabilityQoS to TRANSIENT_LOCAL. With this, the publisher will send old samples to a newly discovered subscriber.

    SimplePublisherNode() : Node("publisher_node")
    {
        rmw_qos_profile_t qos;
        qos.history = RMW_QOS_POLICY_HISTORY_KEEP_ALL;
        qos.reliability = RMW_QOS_POLICY_RELIABILITY_RELIABLE;
        qos.durability = RMW_QOS_POLICY_DURABILITY_TRANSIENT_LOCAL;
        _publisher = this->create_publisher<std_msgs::msg::Header>("my_topic", qos);

        auto period = 200ms;
        _timer = this->create_wall_timer(period, std::bind(&SimplePublisherNode::publish, this));
    }

    SimpleSubscriberNode(std::string name) : Node(name)
    {
        rmw_qos_profile_t qos;
        qos.history = RMW_QOS_POLICY_HISTORY_KEEP_ALL;
        qos.reliability = RMW_QOS_POLICY_RELIABILITY_RELIABLE;
        qos.depth = 100;
        qos.durability = RMW_QOS_POLICY_DURABILITY_TRANSIENT_LOCAL;
        _subscriber = this->create_subscription<std_msgs::msg::Header>("my_topic",
            std::bind(&SimpleSubscriberNode::simple_callback, this, std::placeholders::_1), qos);

        msg_count = 0;
    }

@alsora (Contributor) commented Jan 31, 2019

Thanks a lot @richiware for the update.

Your point makes perfect sense; however, even with the updated QoS, I sometimes find that one node is not discovered in time.
I added a short wait between the node creation and the call to executor->spin(), and this allows all the messages to be received.
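
Roughly, the workaround looks like this in the main() of the example above (just a sketch; the 2-second value is arbitrary, it only needs to cover the discovery phase):

    // ... all subscriber nodes and the publisher node have been created
    // and added to `executor` ...

    // Give discovery some time to finish before the executor starts spinning
    // (and therefore before the publisher's wall timer starts firing).
    std::this_thread::sleep_for(std::chrono::seconds(2));

    std::thread thread([executor](){
        executor->spin();
    });
    thread.detach();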

However, this may not be the optimal solution for my use case, since I need messages as soon as they are published or they are useless.

I find it more useful to simply wait more than 30 seconds before starting to publish, while still using the default QoS.
Actually, it's strange: I had already tried a wait of approximately 20-30 seconds when I opened this issue with Bouncy, but it was not working.

If a discovery time of approximately 30 seconds is expected, I think we can close this.

@richiware (Member)

@alsora I hadn't noticed the long discovery time during my testing, but you're right, it's not suitable. I think the QoS of the Endpoint Discovery Phase entities can be improved. I will continue working on this problem.

@richiware (Member)

I'm working on #411 to improve discovery times.

@alsora (Contributor) commented Feb 13, 2019

Actually, no waiting time is required if the device running the program is not connected to any network.

I guess that makes sense, since the nodes only have to discover each other within the same machine; however, it does not clearly explain why discovery takes so long when there is a network connection.

@richiware (Member)

Could you test with the release/1.7.1 branch? I want to know whether my changes fixed your problem.

@richiware (Member)

@clalancette In #411 you mention that the PR improves your discovery times. Does this PR, already included in the new release v1.7.1, fix this issue? Thanks.

@clalancette (Contributor, Author)

This issue is vastly improved with Fast-RTPS 1.7.1. I ran this test 10 times, and I only saw one case where a single message was dropped. So I think that we can call this problem fixed, and you can close this issue when you are satisfied. Note that I'm going to leave ros2/rmw_fastrtps#249 open to track getting the fix into ROS 2.

@clalancette (Contributor, Author)

Never mind that last bit, I realized that we track master in ROS 2, so we should already have the fix.

@richiware (Member)

Thanks for your help.

As a note: this performance decrease in discovery dates from when we modified the default values for reliable communication to improve performance when sending video (ros2/rmw_fastrtps#202). That change must have degraded discovery performance. Now discovery uses its own values.

@alsora (Contributor) commented Feb 20, 2019

I finally had time to test the new release version.

I only did some quick tests, i.e. 1 publisher node with 5, 10 or 15 subscriber nodes.
I compared Bouncy, Crystal Patch 1 and the current Crystal MASTER + FastRTPS new release.

The results are interesting, but not entirely positive.
To me it looks like the performance got worse in the simple scenarios, while it definitely improved in the more demanding ones.

Here are some results; outliers are excluded from the averages.

Case (1) 1 publisher 5 subscribers
Bouncy: < 10ms
Crystal: < 10ms
Master: 150ms

Case (2) 1 publisher 10 subscribers
Bouncy: 10ms (10% of the tests labeled as outliers with a discovery time of 2 seconds)
Crystal: 40ms (30% of the tests labeled as outliers with a discovery time of 40 seconds)
Master: 200ms

Case (3) 1 publisher 15 subscribers
Bouncy: N/A; around 80% of the tests timed out after 100 seconds with 1 or 2 nodes still to discover.
Crystal: 100ms (50% of the tests labeled as outliers with a discovery time of 40 seconds)
Master: 700ms (5% of the tests labeled as outliers with a discovery time of 40 seconds)
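
For context, by "discovery time" I mean roughly the time until the publisher node sees all the subscriptions on the topic. A minimal sketch of that kind of measurement with rclcpp (not my exact harness), reusing pub_node, n_subscribers and "my_topic" from the example above:

    // Poll the ROS graph until the publisher node sees every subscription
    // on "my_topic", and measure how long that takes.
    auto start = std::chrono::steady_clock::now();
    while (pub_node->count_subscribers("my_topic") < static_cast<size_t>(n_subscribers)) {
        std::this_thread::sleep_for(std::chrono::milliseconds(1));
    }
    auto elapsed = std::chrono::duration_cast<std::chrono::milliseconds>(
        std::chrono::steady_clock::now() - start);
    std::cout << "Discovery took " << elapsed.count() << " ms" << std::endl;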

@richiware (Member)

The changes I made for this issue were related to reliable communications and focused on:

  • Better usage of the network when sending meta-messages (like HEARTBEAT messages) to several participants. In some places we were not trying to send to multicast when there were several destinations.
  • Increasing some waiting times in order to later make better use of the network, for example the property nackResponseDelay, which I increased to ~100ms. This is the main reason you see those results.

With these changes, @clalancette's scenario improves from 30 seconds to less than 2 seconds. I think this gain outweighs the few milliseconds the other scenarios lose, so it is better to keep these values as the defaults for discovery.

But you are right that each scenario is different and needs its own DDS configuration. We will make the configuration of the discovery built-in endpoints available to the user, so it can be adjusted to each scenario.

About the 40 seconds you sometimes see: I know the reason. Currently a participant answers another participant's DATA(p) by sending its own DATA(p) to multicast. As the number of participants increases, more DATA(p) messages are sent at the same time, and I saw the system discard some of them. To solve this, a DATA(p) that is an answer should be sent by unicast. But this change wasn't as easy as the others I made; I plan to make it as soon as possible.

About the 100 seconds, I don't know what is happening. Can you provide an example that reproduces this case?

And thanks a lot for your help.

@alsora (Contributor) commented Feb 21, 2019

Yes, now the general behavior is way better.
That was just a comparison of the results; knowing the reason why the minimum discovery time has increased is definitely good.

It's also very good to know that you are already aware of what is required to fix the 40-second issue.

The experiments labeled Bouncy were using an old version of FastRTPS (the one shipped with that ROS2 distribution), so that issue is now solved.

Thanks a lot for your explanations!
