
When these processes start at the same time, many dropped packets are generated on the 127.0.0.1 network #4668

Open
TechVortexZ opened this issue Apr 8, 2024 · 8 comments
Labels
in progress Issue or PR which is being reviewed

Comments

@TechVortexZ

Is there an already existing issue for this?

  • I have searched the existing issues

Expected behavior

  1. There are 20 processes and a total of 130 topics running on the same machine.
  2. QoS: both UDP and SHM are enabled, with udp_transport->interfaceWhiteList.push_back("127.0.0.1");
    This means that discovery traffic uses 127.0.0.1 for UDP communication and user data uses SHM communication (see the configuration sketch after this list).
  3. When these processes start at the same time, we expect no packet loss on 127.0.0.1, as reported by ifconfig lo.
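
For reference, a minimal sketch of the transport setup described above, assuming the Fast DDS 2.x C++ API (the reporter's actual configuration format is not shown in the issue):

#include <memory>

#include <fastdds/dds/domain/qos/DomainParticipantQos.hpp>
#include <fastdds/rtps/transport/UDPv4TransportDescriptor.h>
#include <fastdds/rtps/transport/shared_mem/SharedMemTransportDescriptor.h>

using namespace eprosima::fastdds::dds;
using namespace eprosima::fastdds::rtps;

DomainParticipantQos participant_qos;
// Replace the built-in transports with an explicit SHM + UDPv4 pair.
participant_qos.transport().use_builtin_transports = false;

// SHM transport carries the user data.
auto shm_transport = std::make_shared<SharedMemTransportDescriptor>();
participant_qos.transport().user_transports.push_back(shm_transport);

// UDPv4 transport restricted to the loopback interface carries discovery traffic.
auto udp_transport = std::make_shared<UDPv4TransportDescriptor>();
udp_transport->interfaceWhiteList.push_back("127.0.0.1");
participant_qos.transport().user_transports.push_back(udp_transport);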

Current behavior

When these processes start at the same time, there is a lot of packet loss on 127.0.0.1, as shown by ifconfig lo:
[screenshots: ifconfig lo output showing dropped packets on the loopback interface]

We have tried several approaches, but none of them has worked (a QoS-level sketch of item 2 follows this list):

  1. Increase the kernel network buffer sizes
    sudo sysctl -w net.core.wmem_max=209715200  # 200 MB
    sudo sysctl -w net.core.rmem_max=209715200  # 200 MB

  2. Increase the socket buffer sizes in the QoS
    "send_socket_buffer_size": 209715200,
    "listen_socket_buffer_size": 209715200

  3. Increase the txqueuelen of the loopback interface
    sudo ip link set dev lo txqueuelen 10000
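
For item 2, a minimal sketch of what the equivalent setting looks like through the Fast DDS C++ API, assuming the JSON keys above map onto the participant's TransportConfigQos (the reporter's configuration loader is not shown):

#include <fastdds/dds/domain/qos/DomainParticipantQos.hpp>

using namespace eprosima::fastdds::dds;

DomainParticipantQos participant_qos;
// 200 MB send and receive socket buffers (0 means "use the OS default").
participant_qos.transport().send_socket_buffer_size = 209715200;
participant_qos.transport().listen_socket_buffer_size = 209715200;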

Can you help me solve this problem?

Steps to reproduce

See above.

Fast DDS version/commit

v2.12.0

Platform/Architecture

Ubuntu Focal 20.04 arm64

Transport layer

Default configuration, UDPv4 & SHM

Additional context

No response

XML configuration file

No response

Relevant log output

No response

Network traffic capture

No response

@TechVortexZ TechVortexZ added the triage Issue pending classification label Apr 8, 2024
@elianalf
Contributor

elianalf commented Apr 8, 2024

Hi @TechVortexZ, thanks for using Fast DDS.
You might consider that 20 processes and 130 topics are enough to make the network very busy, so the loss may be related to this. If the loss occurs mostly during the discovery phase, you can try changing the initial announcement period: decreasing it will allow participants to be discovered more quickly, while increasing it will reduce the frequency of metatraffic packets, leading to a less busy network. Please let us know if you get better performance with one of these solutions.
Also, please note that version 2.12.x is end of life, so you may want to consider upgrading to our latest version 2.14.x.
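
For reference, a minimal sketch of how the initial announcement settings can be changed through the C++ API, assuming the standard WireProtocolConfigQos path (the count and period values are only placeholders):

#include <fastdds/dds/domain/qos/DomainParticipantQos.hpp>

using namespace eprosima::fastdds::dds;

DomainParticipantQos participant_qos;
// Number of announcements sent at participant startup and the period between them.
auto& initial = participant_qos.wire_protocol().builtin.discovery_config.initial_announcements;
initial.count = 5;
initial.period = eprosima::fastrtps::Duration_t(0, 100000000);  // 100 ms as (seconds, nanoseconds)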

@elianalf elianalf added in progress Issue or PR which is being reviewed and removed triage Issue pending classification labels Apr 8, 2024
@TechVortexZ
Author

If the loss occurs mostly during the discovery phase, you can try changing the initial announcement period: decreasing it will allow participants to be discovered more quickly, while increasing it will reduce the frequency of metatraffic packets, leading to a less busy network.

Hi @elianalf, we decreased the initial announcement period:
"initial_announce_count": 5,
"initial_announce_period": 100ms,
but there are still lost packets.

When we modify the configuration to "avoid_builtin_multicast": false, there are no lost packets. Can you tell me what this parameter does, and why it solves the problem?
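
For context, a sketch of what that JSON key presumably corresponds to in the C++ API, assuming the standard builtin attributes path:

#include <fastdds/dds/domain/qos/DomainParticipantQos.hpp>

using namespace eprosima::fastdds::dds;

DomainParticipantQos participant_qos;
// With the default (true), multicast metatraffic is used for PDP only;
// setting it to false also allows multicast during EDP (see the explanation below).
participant_qos.wire_protocol().builtin.avoid_builtin_multicast = false;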

@TechVortexZ
Author

TechVortexZ commented Apr 9, 2024

However, I noticed that the PDP message interval at startup is not 100 ms, even though I set "initial_announce_period": 100ms. Why is this?
[screenshot: packet capture of PDP messages and their timestamps]

@elianalf
Contributor

elianalf commented Apr 9, 2024

Hi,

When we modify the configuration to "avoid_builtin_multicast": false, there are no lost packets. Can you tell me what this parameter does, and why it solves the problem?

The avoid_builtin_multicast=false setting enables the use of multicast also during the Endpoints Discovery Phase (EDP). It reduces the number of packets sent during EDP because each multicast datagram reaches all participants at once, thereby reducing the traffic.
You could also try re-enabling it (avoid_builtin_multicast=true) and setting the TTL parameter of the UDPv4TransportDescriptor to 0. This way you can be sure that your traffic stays on the local host. To do that, you will also need to set use_builtin_transports=false and add a SharedMemTransportDescriptor and a UDPv4TransportDescriptor to the user transports:

DomainParticipantQos participant_qos;
// Disable the built-in transports and declare SHM + UDPv4 explicitly.
participant_qos.transport().use_builtin_transports = false;
// SHM transport for user data.
auto shm_transport = std::make_shared<SharedMemTransportDescriptor>();
participant_qos.transport().user_transports.push_back(shm_transport);
// UDPv4 transport with TTL = 0 so UDP traffic cannot leave the host.
auto udp_transport = std::make_shared<UDPv4TransportDescriptor>();
udp_transport->TTL = 0;
participant_qos.transport().user_transports.push_back(udp_transport);

However, I noticed that the PDP message interval at startup is not 100 ms, even though I set "initial_announce_period": 100ms. Why is this?

I would need more information about the screenshot. From the information I have, I can tell you that initial_announce_period sets the period for each individual participant; the timestamps you are looking at may come from different participants, which is why the difference is not 100 ms.

@TechVortexZ
Author

Hi @elianalf, thanks for your reply.
I set avoid_builtin_multicast=true and udp_transport->TTL = 0, with both UDP and SHM enabled,
following the reference code you provided, but there are still lost packets.

I would need more information about the screenshot. From the information I have, I can tell you that initial_announce_period sets the period for each individual participant; the timestamps you are looking at may come from different participants, which is why the difference is not 100 ms.

Here are more screenshots to illustrate the PDP messages sent by the same participant.

[screenshots: packet captures of PDP messages from a single participant]

@elianalf
Contributor

Hi,

I set avoid_builtin_multicast=true and udp_transport->TTL = 0, with both UDP and SHM enabled,
following the reference code you provided, but there are still lost packets.

If your application only needs to work on the local host and you get better performance with avoid_builtin_multicast=false, then that is a valid solution. The variable is set to true by default because disabling multicast during EDP can be safer on large networks.

Here are more screenshots to illustrate the PDP messages sent by the same participant.

Not all of these packets are initial announcements. Each participant sends an initial announcement every initial_announce_period, but every time it discovers another participant it also starts sending Data(p) packets to each multicast locator and to the unicast locators of all known participants. So between two initial announcements there may be many other Data(p) packets; that is why the frequency of the packets you highlight is higher.

@TechVortexZ
Author

but every time it discovers another participant it also starts sending Data(p) packets to each multicast locator and to the unicast locators of all known participants. So between two initial announcements there may be many other Data(p) packets; that is why the frequency of the packets you highlight is higher.

Hi @elianalf, thanks for your reply. Your explanation above is correct.

I want to ask one last question.
I found an article on the Fast DDS website: https://www.eprosima.com/index.php/resources-all/scalability/fast-rtps-discovery-mechanisms-analysis. One of the conclusions of this article is that the SDP causes network congestion:

Because of all the previous, it is concluded that the SDP produces network congestion in those cases where a high number of participants are involved in the communication. This leads to a higher packet loss and therefore to a reduction of the overall performance. The protocol implementation is open to optimizations, such as eliminating the duplicate announcements when new participants are discovered (which could lead to a PDP traffic reduction of around 28%), or limiting the announcement reply to a discovered participant to just that new participant (which could cut another 25% of the traffic in the testing scenarios).

It says that Fast DDS will provide optimizations to reduce duplicate announcements. What are these optimizations?

@elianalf
Contributor

Hi,
The article refers to the Discovery Server mechanism. For any further information, I would recommend referring to the Documentation rather than the website, since it is more detailed and constantly updated.
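
For orientation only, a rough sketch of what pointing a participant at a Discovery Server can look like in the Fast DDS 2.x C++ API; the GUID prefix, address, and port below are example values, not something prescribed in this thread:

#include <fastdds/dds/domain/qos/DomainParticipantQos.hpp>
#include <fastdds/rtps/attributes/ServerAttributes.h>
#include <fastrtps/utils/IPLocator.h>

using namespace eprosima::fastdds::dds;
using namespace eprosima::fastrtps::rtps;

DomainParticipantQos client_qos;
// Use the Discovery Server mechanism instead of the simple discovery protocol.
client_qos.wire_protocol().builtin.discovery_config.discoveryProtocol = DiscoveryProtocol_t::CLIENT;

// Describe the server this participant should connect to.
RemoteServerAttributes server;
server.ReadguidPrefix("44.53.00.5f.45.50.52.4f.53.49.4d.41");  // example server GUID prefix
Locator_t server_locator;
IPLocator::setIPv4(server_locator, "127.0.0.1");
server_locator.port = 11811;  // example metatraffic port
server.metatrafficUnicastLocatorList.push_back(server_locator);
client_qos.wire_protocol().builtin.discovery_config.m_DiscoveryServers.push_back(server);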
