router stops after some time #375

Open
guni9191 opened this issue May 22, 2023 · 7 comments

Comments

@guni9191

Hi, after running DDS Router for about 10 minutes, it just stops. The dds-router application is stuck on both the Azure machine and the PC, and ^C does not end the process gracefully.

My test environment is an Azure Ubuntu 20.04 server with a public IP and an Ubuntu PC, connected over TCP. I'm sending some ROS 2 topics, and nothing heavy such as video.

Can anybody guess why it suddenly stops?

@jparisu jparisu self-assigned this May 22, 2023
@jparisu
Contributor

jparisu commented May 22, 2023

Hi @guni9191 ,
Glad to know you are working with DDS Router.
About your problem, without further information we are not able to give you an answer or a solution.
In order to help you, we would need more information regarding your scenario:

Error case

Is the error occurring only when trying to close the DDS Router application, or does the application just stop and then you are not able to close it?
There is an echo participant that would be helpful to check whether the router is frozen or is only unable to stop: https://eprosima-dds-router.readthedocs.io/en/latest/rst/user_manual/participants/echo.html

DDS network

Please let us know the data types and rates that you are using, and also the QoS of your topics.
Some restrictive QoS settings combined with huge data loads may slow down the application drastically.

Network architecture

Are you working locally, over a WAN, or on the same host? What is your bandwidth?

All the information that you are able to give us will help us to solve your problem.

@jparisu jparisu added the need more info More information required label May 22, 2023
@guni9191
Author

guni9191 commented May 23, 2023

> Is the error occurring only when trying to close the DDS Router application, or does the application just stop and then you are not able to close it?
> There is an echo participant that would be helpful to check whether the router is frozen or is only unable to stop: https://eprosima-dds-router.readthedocs.io/en/latest/rst/user_manual/participants/echo.html

=> The application just stops, and then I am not able to stop it. I tried the echo participant you pointed me to, and it stops showing information too. When ^C is pressed, only "Stopping DDS Router" shows up.

> data types and rates that you are using

=> I am using custom ROS 2 message types on 25 topics (nine at 2 Hz, five at 5 Hz, nine at 1 Hz, two at 0.1 Hz). I'm not sure about the data length, but Wireshark shows 7052 frames/sec, and a single frame contains 1304 bytes. Most of the QoS settings are the ROS 2 defaults, except for one topic that uses liveliness QoS. This is an unusually large amount of data, since my local RTPS frames are only 300 bytes on average and there are far fewer of them (only about 200 compared to 7052 frames). Wireshark also shows single frames that contain multiple duplicated messages (TCP payload). Is this normal?
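
To put the Wireshark figures above in terms of link bandwidth, here is a small back-of-the-envelope calculation. It is only a sketch: the helper name `throughput_mbps` is made up for illustration, and it assumes that "about 200" local frames means 200 frames per second, which the comment does not state explicitly.

```python
def throughput_mbps(frames_per_sec: float, bytes_per_frame: float) -> float:
    """Convert a frame rate and a frame size into megabits per second."""
    return frames_per_sec * bytes_per_frame * 8 / 1_000_000


# Figures reported from Wireshark on the WAN (TCP) link.
wan = throughput_mbps(7052, 1304)

# Local RTPS traffic; ASSUMPTION: "about 200" is read as frames per second.
local = throughput_mbps(200, 300)

print(f"WAN:   {wan:.2f} Mbit/s")    # WAN:   73.57 Mbit/s
print(f"Local: {local:.2f} Mbit/s")  # Local: 0.48 Mbit/s
print(f"Ratio: {wan / local:.0f}x")  # Ratio: 153x
```

Under these assumptions the WAN side carries roughly 74 Mbit/s, which is already close to the "at least 100mbps" guessed for the link.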

> Are you working locally, over a WAN, or on the same host? What is your bandwidth?

=> I'm not sure how to check the bandwidth, but I'm guessing it's at least 100 Mbps. There seems to be some kind of firewall on my Wi-Fi, but I'm not sure about my environment. As I said earlier, I'm using an Azure cloud server, so it's WAN. I was measuring the round-trip time using system timestamps, and when the router stops, the RTT reaches almost 20 seconds.

@guni9191
Author

guni9191 commented May 23, 2023

Also, my config for the TCP client is as below:

version: v3.0

allowlist:
  - name: rt/*
    type: A_msgs*
  - name: rt/*
    type: B_msgs*
  - name: rt/*
    type: C_msgs*
    ...

participants:

  - name: SimpleParticipant
    kind: local
    domain: 0

  - name: WanParticipant
    kind: wan
    connection-addresses:
      - ip: azure_cloud_server_public_ip
        port: my_port
        transport: tcp

@jparisu
Contributor

jparisu commented May 23, 2023

@guni9191, thank you for the detailed information.
So far, we do not know what could be causing this issue. We will try to extend our test battery.

If you could help us further, it would be important to know whether the freeze is caused by CPU usage and/or memory usage. An htop analysis would be interesting.
Also, if the application stops due to a deadlock, it would be interesting to get the backtrace (using gdb, for instance) to know whether it is a transport issue or something related to the DDS Router application.
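
One possible way to collect that backtrace from the frozen process is sketched below. The function names and the output file name are made up for illustration; the gdb options used (`-p`, `-batch`, `-ex`) and the `thread apply all bt` command are standard gdb. On Ubuntu, attaching to a process that is already running usually requires root privileges because of the default ptrace restrictions.

```python
import subprocess
from typing import List


def gdb_backtrace_cmd(pid: int) -> List[str]:
    """Build a gdb command line that attaches to `pid`, prints the
    backtrace of every thread, and then exits (detaching from the process)."""
    return ["gdb", "-p", str(pid), "-batch", "-ex", "thread apply all bt"]


def dump_backtraces(pid: int, out_path: str = "ddsrouter_backtrace.txt") -> None:
    """Run gdb against the given process and save everything it prints.

    `out_path` is an arbitrary file name chosen for this sketch.
    """
    result = subprocess.run(gdb_backtrace_cmd(pid), capture_output=True, text=True)
    with open(out_path, "w") as f:
        f.write(result.stdout)
        f.write(result.stderr)


# Example (hypothetical PID of the stuck router process):
#   dump_backtraces(12345)
```

If several threads turn out to be waiting on the same lock, or one thread is blocked in a socket send, that usually narrows down whether the problem sits in the transport or in the application.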

Finally, I guess the large size of your frames is related to TCP. Would you be able to run it with UDP?

@guni9191
Author

@jparisu

  1. The htop analysis didn't show any difference in CPU or RAM usage.
  2. The problem might be related to network bandwidth and latency.
    -> It seems that when I use a faster 1100 Mbps Wi-Fi instead of the 433 Mbps Wi-Fi, the freeze does not occur. Also, since far fewer frames were generated on the 1100 Mbps link, I'm guessing that on a low-bandwidth link TCP packets are more likely to be split and reassembled, generating many more unnecessary frames and heavier traffic.
  3. I think it is not a UDP/TCP matter, although using UDP made things twice as fast. I am not able to test the application with UDP since I cannot modify the router port forwarding in my test environment. Testing a simple pub/sub at home showed an RTT about twice as fast, though.

Can you test the Fast DDS router in a heavy-traffic, low-bandwidth environment? As far as I know, DDS should work robustly in such a difficult situation, and above all, the application should not stop. Thanks in advance for your response.

@guni9191
Author

guni9191 commented Jun 8, 2023

@jparisu
I think I found the reason. As I expected, it was a bandwidth problem.

Let me explain how I found out.

  • Let's say there are two machines, PC A and PC B, and PC B has a public address.
  • Both of them are running the Fast DDS router.
  • "PC A" runs a ROS 2 node "node A" that publishes "topic A" at a total bandwidth of 4 Mbps.
  • "PC B" runs a ROS 2 node "node B" that subscribes to "topic A" and then publishes a "topic B" of the same size.
  • "PC A" runs another ROS 2 node "node C" that subscribes to "topic B".

To limit the bandwidth intentionally, I used the "wondershaper" tool and limited the bandwidth of "PC A" to 6 Mbps download and 2 Mbps upload.

Then "node C" on "PC A" received some of the messages from "topic B" and eventually stopped receiving any messages. When I tried to stop the Fast DDS router on "PC A" in this state, I got the message "Stopping DDS Router", but it did not stop gracefully. If I closed "node A", the router stopped correctly, but closing "node C" did not let the router stop gracefully.

Can you guess why "node C" gradually stopped receiving messages and why the router also got stuck on ^C? If my environment has such limited bandwidth, is there another way to avoid this behavior?

@jparisu
Contributor

jparisu commented Jun 8, 2023

Hi @guni9191

I think we know what could be happening in your scenario. We see two problems here:

Bandwidth

In a scenario with limited bandwidth, the DDS Router may receive messages faster than it can route them. This slows down the whole application until it reaches a point where some messages have to be discarded because of memory limits. Check the following documentation: https://eprosima-dds-router.readthedocs.io/en/latest/rst/user_manual/configuration.html#maximum-history-depth
In this case, there is little that either you or we can do to improve this. Try limiting the number of topics that are forwarded to reduce the traffic: https://eprosima-dds-router.readthedocs.io/en/latest/rst/user_manual/configuration.html#id1 .
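
The effect of a bounded history on a link that fills faster than it drains can be illustrated with a toy model. This is not the DDS Router implementation: the function `simulate` and its parameters are hypothetical, one "message" stands for 1 Mbit and one tick for 1 second (mirroring the 4 Mbps published over a 2 Mbps upload in the experiment above), and the sketch assumes that the oldest message is the one discarded when the queue is full.

```python
from collections import deque


def simulate(arrivals_per_tick: int, sends_per_tick: int, max_depth: int, ticks: int):
    """Simulate a bounded FIFO that receives and forwards messages each tick.

    Returns (final backlog, total dropped). When the queue is full, the
    oldest message is discarded to make room (an assumption of this sketch).
    """
    queue = deque()
    dropped = 0
    next_id = 0
    for _ in range(ticks):
        # Messages arriving from the local publisher.
        for _ in range(arrivals_per_tick):
            if len(queue) >= max_depth:
                queue.popleft()
                dropped += 1
            queue.append(next_id)
            next_id += 1
        # Messages the limited link manages to send in this tick.
        for _ in range(min(sends_per_tick, len(queue))):
            queue.popleft()
    return len(queue), dropped


print(simulate(4, 2, 10, 100))  # (8, 192): the queue stays nearly full and drops 2 per tick
print(simulate(2, 2, 10, 100))  # (0, 0): the link keeps up, nothing is dropped
```

In this model a smaller history depth bounds memory and latency but does not remove the losses; only reducing the incoming traffic does, which is why limiting the forwarded topics is the practical lever here.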

DDS Router closure

We think we have found a bug in the DDS Router thread management that prevents the application from closing until all messages have been forwarded. Thus, if messages arrive faster than they are delivered, this behavior could occur. (We are not sure about this, but it could be the case.)
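
A toy model of this suspected behavior is sketched below. It does not reproduce the DDS Router thread code; the function `ticks_until_exit` and all of its parameters are hypothetical and only show why a loop that insists on draining its backlog before exiting may never finish while a publisher keeps feeding it.

```python
def ticks_until_exit(backlog, arrivals_per_tick, sends_per_tick,
                     drain_before_exit, max_ticks=1000):
    """Count the ticks between a stop request and the exit of a forwarding loop.

    If `drain_before_exit` is True, the loop only exits once the backlog is
    empty. Returns None if it has still not exited after `max_ticks`.
    """
    for tick in range(max_ticks):
        if not drain_before_exit or backlog == 0:
            return tick
        backlog += arrivals_per_tick                # publisher keeps sending
        backlog = max(0, backlog - sends_per_tick)  # link forwards what it can
    return None


# Publisher still running, link slower than the publisher: never exits.
print(ticks_until_exit(8, 4, 2, drain_before_exit=True))   # None
# Publisher closed (no new arrivals): the backlog drains and the loop exits.
print(ticks_until_exit(8, 0, 2, drain_before_exit=True))   # 4
# Loop that checks the stop request without draining first: exits at once.
print(ticks_until_exit(8, 4, 2, drain_before_exit=False))  # 0
```

The second case matches the observation reported above that the router stopped correctly once "node A", the publisher, was closed.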

New DDS Router update

This is not related to this issue, but we have made a major update to the DDS Router: the core logic has been moved to a different repository (https://github.com/eProsima/DDS-Pipe).
This issue should be fixed in this new version. The release is not ready yet, but the Router can be used just as before by adding the new dependency and compiling again.
If you want to try it out, it would help us a lot.

Comment

Are you using different domains or a Discovery Server to force the different nodes to communicate through the router?
I suppose you are, because otherwise you would be experiencing a loop between the routers that would replicate all your messages indefinitely. Just in case, check this: https://eprosima-dds-router.readthedocs.io/en/latest/rst/user_manual/configuration.html#participant-configuration
