
When these processes start at the same time, many dropped packets are generated on the 127.0.0.1 network #4668

Open
TechVortexZ opened this issue Apr 8, 2024 · 8 comments
Labels
in progress Issue or PR which is being reviewed

Comments

@TechVortexZ

Is there an already existing issue for this?

  • I have searched the existing issues

Expected behavior

  1. There are 20 processes and a total of 130 topics running on the same machine.
  2. QoS: both UDP and SHM are enabled, with udp_transport->interfaceWhiteList.push_back("127.0.0.1");
    This means that discovery traffic uses 127.0.0.1 for UDP communication and user data uses SHM communication (see the configuration sketch after this list).
  3. When these processes start at the same time, we expect no packet loss on 127.0.0.1, as reported by ifconfig lo.
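
For reference, a minimal sketch of the transport setup described above, assuming the Fast DDS 2.x C++ API (the reporter's actual configuration format is not shown in the issue):

#include <memory>

#include <fastdds/dds/domain/qos/DomainParticipantQos.hpp>
#include <fastdds/rtps/transport/UDPv4TransportDescriptor.h>
#include <fastdds/rtps/transport/shared_mem/SharedMemTransportDescriptor.h>

using namespace eprosima::fastdds::dds;
using namespace eprosima::fastdds::rtps;

DomainParticipantQos participant_qos;
// Replace the built-in transports with an explicit SHM + UDPv4 pair.
participant_qos.transport().use_builtin_transports = false;

// SHM transport carries the user data.
auto shm_transport = std::make_shared<SharedMemTransportDescriptor>();
participant_qos.transport().user_transports.push_back(shm_transport);

// UDPv4 transport restricted to the loopback interface carries discovery traffic.
auto udp_transport = std::make_shared<UDPv4TransportDescriptor>();
udp_transport->interfaceWhiteList.push_back("127.0.0.1");
participant_qos.transport().user_transports.push_back(udp_transport);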

Current behavior

When these processes start at the same time, there is a lot of packet loss on 127.0.0.1, as shown by ifconfig lo:
[screenshots: ifconfig lo output showing dropped packets on the loopback interface]

We have tried several approaches, but none of them has worked (a QoS-level sketch of item 2 follows this list):

  1. Increase the kernel network buffer sizes
    sudo sysctl -w net.core.wmem_max=209715200  # 200 MB
    sudo sysctl -w net.core.rmem_max=209715200  # 200 MB

  2. Increase the socket buffer sizes in the QoS
    "send_socket_buffer_size": 209715200,
    "listen_socket_buffer_size": 209715200

  3. Increase the txqueuelen of the loopback interface
    sudo ip link set dev lo txqueuelen 10000
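
For item 2, a minimal sketch of what the equivalent setting looks like through the Fast DDS C++ API, assuming the JSON keys above map onto the participant's TransportConfigQos (the reporter's configuration loader is not shown):

#include <fastdds/dds/domain/qos/DomainParticipantQos.hpp>

using namespace eprosima::fastdds::dds;

DomainParticipantQos participant_qos;
// 200 MB send and receive socket buffers (0 means "use the OS default").
participant_qos.transport().send_socket_buffer_size = 209715200;
participant_qos.transport().listen_socket_buffer_size = 209715200;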

Can you help me solve this problem?

Steps to reproduce

See above.

Fast DDS version/commit

v2.12.0

Platform/Architecture

Ubuntu Focal 20.04 arm64

Transport layer

Default configuration, UDPv4 & SHM

Additional context

No response

XML configuration file

No response

Relevant log output

No response

Network traffic capture

No response

@TechVortexZ TechVortexZ added the triage Issue pending classification label Apr 8, 2024
@elianalf
Contributor

elianalf commented Apr 8, 2024

Hi @TechVortexZ, thanks for using Fast DDS.
You might consider that 20 processes and 130 topics are enough to make the network very busy, so the loss may be related to this. If the loss occurs mostly during the discovery phase, you can try changing the initial announcement period: decreasing it will allow participants to be discovered more quickly, while increasing it will reduce the frequency of metatraffic packets, leading to a less busy network. Please let us know if you get better performance with one of these solutions.
Also, please note that version 2.12.x is end of life, so you may want to consider upgrading to our latest version 2.14.x.
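
For reference, a minimal sketch of how the initial announcement settings can be changed through the C++ API, assuming the standard WireProtocolConfigQos path (the count and period values are only placeholders):

#include <fastdds/dds/domain/qos/DomainParticipantQos.hpp>

using namespace eprosima::fastdds::dds;

DomainParticipantQos participant_qos;
// Number of announcements sent at participant startup and the period between them.
auto& initial = participant_qos.wire_protocol().builtin.discovery_config.initial_announcements;
initial.count = 5;
initial.period = eprosima::fastrtps::Duration_t(0, 100000000);  // 100 ms as (seconds, nanoseconds)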

@elianalf elianalf added in progress Issue or PR which is being reviewed and removed triage Issue pending classification labels Apr 8, 2024
@TechVortexZ
Author

If the loss occurs mostly during the discovery phase, you can try changing the initial announcement period: decreasing it will allow participants to be discovered more quickly, while increasing it will reduce the frequency of metatraffic packets, leading to a less busy network.

Hi @elianalf, we decreased the initial announcement period:
"initial_announce_count": 5,
"initial_announce_period": 100ms,
but there are still lost packets.

When we modify the configuration to "avoid_builtin_multicast": false, there are no lost packets. Can you tell me what this parameter does, and why it solves the problem?
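
For context, a sketch of what that JSON key presumably corresponds to in the C++ API, assuming the standard builtin attributes path:

#include <fastdds/dds/domain/qos/DomainParticipantQos.hpp>

using namespace eprosima::fastdds::dds;

DomainParticipantQos participant_qos;
// With the default (true), multicast metatraffic is used for PDP only;
// setting it to false also allows multicast during EDP (see the explanation below).
participant_qos.wire_protocol().builtin.avoid_builtin_multicast = false;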

@TechVortexZ
Author

TechVortexZ commented Apr 9, 2024

However, I noticed that the PDP message interval at startup is not 100 ms, even though I set "initial_announce_period": 100ms. Why is this?
[screenshot: packet capture of PDP messages and their timestamps]

@elianalf
Contributor

elianalf commented Apr 9, 2024

Hi,

When we modify the configuration to "avoid_builtin_multicast": false, there are no lost packets. Can you tell me what this parameter does, and why it solves the problem?

The avoid_builtin_multicast=false setting enables the use of multicast also during the Endpoints Discovery Phase (EDP). It reduces the number of packets sent during EDP because each multicast datagram reaches all participants at once, thereby reducing the traffic.
You could also try re-enabling it (avoid_builtin_multicast=true) and setting the TTL parameter of the UDPv4TransportDescriptor to 0. This way you can be sure that your traffic stays on the local host. To do that, you will also need to set use_builtin_transports=false and add a SharedMemTransportDescriptor and a UDPv4TransportDescriptor to the user transports:

DomainParticipantQos participant_qos;
// Disable the built-in transports and declare SHM + UDPv4 explicitly.
participant_qos.transport().use_builtin_transports = false;
// SHM transport for user data.
auto shm_transport = std::make_shared<SharedMemTransportDescriptor>();
participant_qos.transport().user_transports.push_back(shm_transport);
// UDPv4 transport with TTL = 0 so UDP traffic cannot leave the host.
auto udp_transport = std::make_shared<UDPv4TransportDescriptor>();
udp_transport->TTL = 0;
participant_qos.transport().user_transports.push_back(udp_transport);

However, I noticed that the PDP message interval at startup is not 100 ms, even though I set "initial_announce_period": 100ms. Why is this?

I would need more information about the screenshot. From the information I have, I can tell you that initial_announce_period sets the period for each individual participant; the timestamps you are looking at may come from different participants, which is why the difference is not 100 ms.

@TechVortexZ
Author

Hi @elianalf, thanks for your reply.
I set avoid_builtin_multicast=true and udp_transport->TTL = 0, with both UDP and SHM enabled,
following the reference code you provided, but there are still lost packets.

I would need more information about the screenshot. From the information I have, I can tell you that initial_announce_period sets the period for each individual participant; the timestamps you are looking at may come from different participants, which is why the difference is not 100 ms.

Here are more screenshots to illustrate the PDP messages sent by the same participant.

[screenshots: packet captures of PDP messages from a single participant]

@elianalf
Contributor

Hi,

I set avoid_builtin_multicast=true and udp_transport->TTL = 0, with both UDP and SHM enabled,
following the reference code you provided, but there are still lost packets.

If your application only needs to work on the local host and you get better performance with avoid_builtin_multicast=false, then that is a valid solution. The variable is set to true by default because disabling multicast during EDP can be safer on large networks.

Here are more screenshots to illustrate the PDP messages sent by the same participant.

Not all of these packets are initial announcements. Each participant sends an initial announcement every initial_announce_period, but every time it discovers another participant it also starts sending Data(p) packets to each multicast locator and to the unicast locators of all known participants. So between two initial announcements there may be many other Data(p) packets; that is why the frequency of the packets you highlight is higher.

@TechVortexZ
Author

but every time it discovers another participant it also starts sending Data(p) packets to each multicast locator and to the unicast locators of all known participants. So between two initial announcements there may be many other Data(p) packets; that is why the frequency of the packets you highlight is higher.

Hi @elianalf, thanks for your reply. Your explanation above is correct.

I want to ask one last question.
I found an article on the Fast DDS website: https://www.eprosima.com/index.php/resources-all/scalability/fast-rtps-discovery-mechanisms-analysis. One of the conclusions of this article is that the SDP causes network congestion:

Because of all the previous, it is concluded that the SDP produces network congestion in those cases where a high number of participants are involved in the communication. This leads to a higher packet loss and therefore to a reduction of the overall performance. The protocol implementation is open to optimizations, such as eliminating the duplicate announcements when new participants are discovered (which could lead to a PDP traffic reduction of around 28%), or limiting the announcement reply to a discovered participant to just that new participant (which could cut another 25% of the traffic in the testing scenarios).

It says that Fast DDS will provide optimizations to reduce duplicate announcements. What are these optimizations?

@elianalf
Contributor

Hi,
The article refers to the Discovery Server mechanism. For any further information, I would recommend referring to the Documentation rather than the website, since it is more detailed and constantly updated.
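
For orientation only, a rough sketch of what pointing a participant at a Discovery Server can look like in the Fast DDS 2.x C++ API; the GUID prefix, address, and port below are example values, not something prescribed in this thread:

#include <fastdds/dds/domain/qos/DomainParticipantQos.hpp>
#include <fastdds/rtps/attributes/ServerAttributes.h>
#include <fastrtps/utils/IPLocator.h>

using namespace eprosima::fastdds::dds;
using namespace eprosima::fastrtps::rtps;

DomainParticipantQos client_qos;
// Use the Discovery Server mechanism instead of the simple discovery protocol.
client_qos.wire_protocol().builtin.discovery_config.discoveryProtocol = DiscoveryProtocol_t::CLIENT;

// Describe the server this participant should connect to.
RemoteServerAttributes server;
server.ReadguidPrefix("44.53.00.5f.45.50.52.4f.53.49.4d.41");  // example server GUID prefix
Locator_t server_locator;
IPLocator::setIPv4(server_locator, "127.0.0.1");
server_locator.port = 11811;  // example metatraffic port
server.metatrafficUnicastLocatorList.push_back(server_locator);
client_qos.wire_protocol().builtin.discovery_config.m_DiscoveryServers.push_back(server);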
