Deadlock appears to be caused by data-writing and entity-discovery threads using TCP transport on the writer side #4203

Open
chunyisong opened this issue Jan 4, 2024 · 15 comments
Labels
need more info Issue that requires more info from contributor

Comments

@chunyisong

chunyisong commented Jan 4, 2024

Is there an already existing issue for this?

  • I have searched the existing issues

Expected behavior

Discovering new entities and writing data should not get stuck.

Current behavior

A deadlock appears to be caused by the data-writing and entity-discovery threads when using the TCP transport on the writer side.

I wrote a simple test program, test-dds, to reproduce this bug.

To reproduce this bug, open two different consoles:
In the first one, for the publisher: ./test-dds pub
Then edit DEFAULT_FASTRTPS_PROFILES.xml and change the TCPv4 listening port to 0 or another port number (an equivalent programmatic transport configuration is sketched below).
In the second one, for the subscriber: ./test-dds sub

The deadlock will then most likely occur. If nothing gets stuck, restart the second console.
In the publisher console, the _currentMatchedPubs and _totalPubOkDatas log values no longer change.
In the subscriber console, the _currentMatchedSubs and _totalSubValidDatas log values also no longer change.
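For reference, here is a minimal sketch of the equivalent TCPv4 transport setup in the Fast DDS 2.x C++ API (an assumed illustration, not the actual test-dds code):

```cpp
#include <cstdint>
#include <memory>

#include <fastdds/dds/domain/DomainParticipant.hpp>
#include <fastdds/dds/domain/DomainParticipantFactory.hpp>
#include <fastdds/dds/domain/qos/DomainParticipantQos.hpp>
#include <fastdds/rtps/transport/TCPv4TransportDescriptor.h>

using namespace eprosima::fastdds::dds;

// Create a participant that only uses TCPv4, listening on the given port.
// Port 0 asks the OS for an ephemeral port, mirroring the XML change above.
DomainParticipant* make_tcp_participant(uint16_t listen_port)
{
    auto tcp = std::make_shared<eprosima::fastrtps::rtps::TCPv4TransportDescriptor>();
    tcp->add_listener_port(listen_port);

    DomainParticipantQos qos = PARTICIPANT_QOS_DEFAULT;
    qos.transport().use_builtin_transports = false;
    qos.transport().user_transports.push_back(tcp);

    return DomainParticipantFactory::get_instance()->create_participant(0, qos);
}
```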

Additionally, other tests produce the same deadlock:

  1. Changing the Reliability QoS of the writers or readers (a sketch of this QoS change follows this list).
  2. Using a discovery server (this is the actual deployment; initial peers are used here for simplicity).
  3. Different CMake options when compiling Fast DDS, such as FASTDDS_STATISTICS or STRICT_REALTIME.
  4. Fewer topics and samples lead to a lower probability of deadlock (e.g. ./test-dds sub 100).
  5. Opening multiple sub consoles leads to a higher probability of deadlock.
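Regarding item 1, the Reliability change referred to is of this form in the 2.x API (a hypothetical helper, not the actual test-dds code):

```cpp
#include <fastdds/dds/publisher/DataWriter.hpp>
#include <fastdds/dds/publisher/Publisher.hpp>
#include <fastdds/dds/publisher/qos/DataWriterQos.hpp>
#include <fastdds/dds/topic/Topic.hpp>

using namespace eprosima::fastdds::dds;

// Create a writer with an explicit Reliability kind; switching this kind
// still reproduced the deadlock in the tests described above.
DataWriter* make_writer(Publisher* publisher, Topic* topic, bool reliable)
{
    DataWriterQos wqos = DATAWRITER_QOS_DEFAULT;
    wqos.reliability().kind = reliable ? RELIABLE_RELIABILITY_QOS
                                       : BEST_EFFORT_RELIABILITY_QOS;
    return publisher->create_datawriter(topic, wqos);
}
```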

Fast DDS version/commit

FastDDS v2.13.0/v2.13.1

Platform/Architecture

Ubuntu Focal 20.04 amd64

Transport layer

TCPv4

XML configuration file

DEFAULT_FASTRTPS_PROFILES.xml

Relevant log output

image

Network traffic capture

No response

@chunyisong chunyisong added the triage Issue pending classification label Jan 4, 2024
@Mario-DL
Member

Hi @chunyisong

Thanks for the report. Could you please check whether the issue persists with the latest release, v2.13.1? Some improvements were made in the TCP transport; check the release notes.

@Mario-DL Mario-DL added need more info Issue that requires more info from contributor and removed triage Issue pending classification labels Jan 15, 2024
@chunyisong
Author

chunyisong commented Jan 15, 2024

Hi @Mario-DL

I tested test-dds with Fast DDS v2.13.1. Unfortunately, the deadlock reappeared! With this version, however, the deadlock is harder to trigger. Just starting several subscribers (200 readers per sub) and one publisher (200 writers), without killing the writers, did not reproduce the issue (after about 30 simple trials, maybe I was lucky). But the following steps trigger the deadlock more reliably:

  1. Start a discovery server in one console (fast-discovery-server -i 0 -t 10.8.8.6 -q 17480); a sketch of the matching client-side configuration follows this list.
  2. Edit DEFAULT_FASTRTPS_PROFILES.xml and change the TCPv4 listening port to 0.
  3. Start two subscribers (200 readers each) in two new consoles (./test-dds sub).
  4. Start one publisher (200 writers) in a new console (./test-dds pub).
  5. Wait 30 seconds, kill the publisher process and restart it; the deadlock is then quite likely to occur (if it does not, kill and restart the publisher once more), and the write operations get stuck as follows:
    image
    Actually, the publisher process was killed and restarted only once, so there are only 200 writer topics, but the subscribers still had 400 writers.
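The client-side counterpart of the fast-discovery-server command in step 1 would look roughly like this in the 2.x API (a sketch with assumed header paths; the server address and the default GUID prefix for server id 0 follow the command above):

```cpp
#include <memory>

#include <fastdds/dds/domain/DomainParticipant.hpp>
#include <fastdds/dds/domain/DomainParticipantFactory.hpp>
#include <fastdds/dds/domain/qos/DomainParticipantQos.hpp>
#include <fastdds/rtps/attributes/ServerAttributes.h>
#include <fastdds/rtps/transport/TCPv4TransportDescriptor.h>
#include <fastrtps/utils/IPLocator.h>

using namespace eprosima::fastdds::dds;
using namespace eprosima::fastrtps::rtps;

DomainParticipant* make_client_participant()
{
    DomainParticipantQos qos = PARTICIPANT_QOS_DEFAULT;

    // TCPv4 transport only; listening port 0 lets the OS choose.
    auto tcp = std::make_shared<TCPv4TransportDescriptor>();
    tcp->add_listener_port(0);
    qos.transport().use_builtin_transports = false;
    qos.transport().user_transports.push_back(tcp);

    // Discovery-server CLIENT pointing at "fast-discovery-server -i 0 -t 10.8.8.6 -q 17480".
    qos.wire_protocol().builtin.discovery_config.discoveryProtocol = DiscoveryProtocol_t::CLIENT;

    RemoteServerAttributes server;
    server.ReadguidPrefix("44.53.00.5f.45.50.52.4f.53.49.4d.41");  // default prefix for server id 0

    Locator_t locator;
    locator.kind = LOCATOR_KIND_TCPv4;
    IPLocator::setIPv4(locator, "10.8.8.6");
    IPLocator::setPhysicalPort(locator, 17480);
    IPLocator::setLogicalPort(locator, 17480);
    server.metatrafficUnicastLocatorList.push_back(locator);
    qos.wire_protocol().builtin.discovery_config.m_DiscoveryServers.push_back(server);

    return DomainParticipantFactory::get_instance()->create_participant(0, qos);
}
```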

Additionally, the tests revealed other issues:

  1. Sometimes a "Matching unexisting participant from writer" error occurred (line 1062 in /workspace/fastdds/src/fastrtps/src/cpp/rtps/builtin/discovery/database/DiscoveryDataBase.cpp) after killing the publisher.
  2. Sometimes, or whenever a discovery server error occurred, the server never dropped the killed participant.
  3. After killing the publisher, the DataReaderListeners were almost never called to notify about discovery or match changes (a minimal listener sketch follows this list).
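For item 3, the listener being referred to is of this kind; a minimal sketch (assumed, not the actual test-dds listener) that simply logs match changes:

```cpp
#include <iostream>

#include <fastdds/dds/core/status/SubscriptionMatchedStatus.hpp>
#include <fastdds/dds/subscriber/DataReader.hpp>
#include <fastdds/dds/subscriber/DataReaderListener.hpp>

class MatchLogger : public eprosima::fastdds::dds::DataReaderListener
{
public:
    // Expected to fire with current_count_change == -1 when a matched writer
    // disappears; in the scenario above it is almost never invoked after the
    // publisher is killed.
    void on_subscription_matched(
            eprosima::fastdds::dds::DataReader* /*reader*/,
            const eprosima::fastdds::dds::SubscriptionMatchedStatus& info) override
    {
        std::cout << "matched writers: " << info.current_count
                  << " (change " << info.current_count_change << ")" << std::endl;
    }
};
```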

@JesusPoderoso
Contributor

Hi @chunyisong, thanks for your report!
We will try to reproduce it in the following weeks and come back to you with some feedback.

@JesusPoderoso JesusPoderoso added in progress Issue or PR which is being reviewed and removed need more info Issue that requires more info from contributor labels Jan 23, 2024
@chunyisong
Author

chunyisong commented Jan 31, 2024

Today I reviewed some TCP-related issues; several of them (closed, but the problem may still exist: #4099 #4026 #4033 #3621 #3496) may be about the same TCP deadlock.

@JesusPoderoso
Contributor

JesusPoderoso commented Mar 21, 2024

Hi @chunyisong, thanks for your patience.
We've just released Fast DDS v2.14.0 with some TCP improvements and fixes (see release notes). I think that the TCPSendResources cleanup may have fixed your issue. Could you check if it persists, please?

@JesusPoderoso JesusPoderoso added need more info Issue that requires more info from contributor and removed in progress Issue or PR which is being reviewed labels Mar 21, 2024
@chunyisong
Author

chunyisong commented Mar 25, 2024 via email

@chunyisong
Author

chunyisong commented May 6, 2024

@JesusPoderoso Sorry for the late test. I tested Fast DDS from master today using the same method, and the problem still exists.
The image below shows one process for a sub, one for a pub, and one for the discovery server, when the pub process is killed and restarted immediately. After a while, the sub did remove the matched writers, but the pub did not. And strangely, the pub removed its connection to the discovery server!
image
The image below shows what happens when the sub side is killed (statistics for participants added).
image

Note: the XML uses Reliable / no initial peers / TCP.

@chunyisong
Author

chunyisong commented May 8, 2024

The image below shows two subs on host A (220.8), and a discovery server and two pubs on host B (0.202).

  1. Start the 5 processes.
  2. After a minute, kill and immediately restart pub1/pub2/sub1/sub2 one by one, as quickly as you can.
  3. At this point the left-most sub and the right-most pub are actually already stuck, even though the logging thread still prints statistics messages. The sub matched no pub, while the pub matched all subs but published no data!
  4. After a minute, kill the sub and pub in the middle.
  5. Then try to kill the discovery server: the server is stuck and cannot be killed with Ctrl-C!
  6. Then try to kill the stuck pub and sub: they are also stuck and cannot be killed with Ctrl-C either!
    (screenshot)

@JesusPoderoso This may be a PDP or EDP issue.
Note: profile.xml has to be modified when running on different hosts.

@chunyisong
Author

chunyisong commented May 20, 2024

@JesusPoderoso Tested with Fast DDS 2.14.1 today, and the deadlock appeared easily.
I tested on localhost with a discovery server as follows:

  1. Start a discovery server in a console: bin/fast-discovery-server -i 0 -t 127.0.0.1 -q 17480
  2. Start two pub processes with 200 topics in two consoles: ./test-dds pub
  3. Start two sub processes with the same 200 topics, in sequence and as quickly as possible, in two consoles: ./test-dds sub
  4. Unfortunately I hit the deadlock during the first two tests. At that point, killing the discovery server (Ctrl-C) gets stuck.
  5. If, luckily, all processes transfer data normally during a test, then kill the discovery server and restart it after one minute, and the deadlock appears. After some minutes I killed the one pub whose published-data count was unchanged, and then the discovery-server process ended automatically.
    The image below shows the deadlock of step 5. The right-most pub process log shows that _totalPubOkDatas:11440323 is unchanged. The discovery server can now be killed normally if I first kill the stuck pub.
    image

@JesusPoderoso
Contributor

Hi @chunyisong, thanks for the reproducer!
We are taking a look at it and will get back to you with some feedback.

@chunyisong
Author

@JesusPoderoso According to the documentation of max_blocking_time, the writer should return with a timeout.
But in fact, once the deadlock happens, writers always stay stuck and never return. So this behaviour is not in line with the design.
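For context, this is the behaviour being referred to: with a reliable writer, the blocking time is bounded by max_blocking_time and the write call is expected to give up and report failure instead of blocking forever. A minimal sketch (assumed, not the actual test-dds code; it relies on the 2.x DataWriter::write(void*) overload that returns bool):

```cpp
#include <iostream>

#include <fastdds/dds/publisher/DataWriter.hpp>
#include <fastdds/dds/publisher/qos/DataWriterQos.hpp>

using namespace eprosima::fastdds::dds;

// QoS to apply when creating the writer: reliable, but give up after 200 ms.
DataWriterQos make_bounded_blocking_qos()
{
    DataWriterQos wqos = DATAWRITER_QOS_DEFAULT;
    wqos.reliability().kind = RELIABLE_RELIABILITY_QOS;
    wqos.reliability().max_blocking_time = eprosima::fastrtps::Duration_t(0, 200 * 1000 * 1000);  // 200 ms
    return wqos;
}

// write() should return false once max_blocking_time expires; in the deadlock
// described in this issue the call never returns at all.
template <typename Sample>
void write_with_timeout(DataWriter* writer, Sample& sample)
{
    if (!writer->write(&sample))
    {
        std::cout << "write gave up after max_blocking_time" << std::endl;
    }
}
```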

@wangzm-R

wangzm-R commented Jun 3, 2024

image

The deadlock stack is shown in the screenshot above.

With 2.11.2, if the deadlock occurs, it cannot recover; with 2.14.0, if a deadlock occurs and the subscriber is then recreated, the deadlock clears after about 15 minutes.

@chunyisong
Author

> image
>
> The deadlock stack is shown in the screenshot above.
>
> With 2.11.2, if the deadlock occurs, it cannot recover; with 2.14.0, if a deadlock occurs and the subscriber is then recreated, the deadlock clears after about 15 minutes.

Hi @wangzm-R, did you try v2.14.1?

My test with v2.14.1 still cannot recover until the stuck reader/writer is killed (after that, the stuck discovery server recovers).

@wangzm-R

wangzm-R commented Jun 5, 2024

> image
> The deadlock stack is shown in the screenshot above.
> With 2.11.2, if the deadlock occurs, it cannot recover; with 2.14.0, if a deadlock occurs and the subscriber is then recreated, the deadlock clears after about 15 minutes.
>
> Hi @wangzm-R, did you try v2.14.1?
>
> My test with v2.14.1 still cannot recover until the stuck reader/writer is killed (after that, the stuck discovery server recovers).

The publisher is on Linux; one subscriber is on Linux and another is on Windows. When any subscriber device is powered off (killing the subscriber process is not possible in that case), the publisher thread blocks until that subscriber restarts.

When a subscriber device is powered off, after about 1 min 30 s the publisher thread blocks inside write.
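A setting related to detecting a peer that disappears without closing its sockets (e.g. a powered-off host) is the keep-alive configuration of the TCP transport descriptor. A minimal sketch of tuning it in the 2.x API (illustrative values only, not a confirmed fix for this issue):

```cpp
#include <memory>

#include <fastdds/rtps/transport/TCPv4TransportDescriptor.h>

// Make the TCP transport probe its peers more aggressively so that a
// powered-off remote host is noticed sooner.
std::shared_ptr<eprosima::fastrtps::rtps::TCPv4TransportDescriptor> make_keepalive_tcp()
{
    auto tcp = std::make_shared<eprosima::fastrtps::rtps::TCPv4TransportDescriptor>();
    tcp->add_listener_port(0);
    tcp->keep_alive_frequency_ms = 5000;   // send a keep-alive request every 5 s
    tcp->keep_alive_timeout_ms = 15000;    // drop the connection after 15 s without a response
    return tcp;
}
```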

@chunyisong
Author

Today I tested Fast DDS v2.14.2 with one publisher connected to two subscribers, using 2000 topics, and the deadlock almost always appeared. I also tested the built-in LARGE_DATA mode and a custom large-data configuration. None of the attempts showed any sign of improvement.

From the testing, the problem is suspected to occur in the TCP EDP stage.
@JesusPoderoso Has there been any progress on this issue?
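For reference, the built-in LARGE_DATA mode mentioned above can be selected like this in the 2.x API (a sketch; it can also be enabled through the FASTDDS_BUILTIN_TRANSPORTS=LARGE_DATA environment variable):

```cpp
#include <fastdds/dds/domain/DomainParticipant.hpp>
#include <fastdds/dds/domain/DomainParticipantFactory.hpp>
#include <fastdds/dds/domain/qos/DomainParticipantQos.hpp>
#include <fastdds/rtps/attributes/BuiltinTransports.hpp>

using namespace eprosima::fastdds::dds;

DomainParticipant* make_large_data_participant()
{
    // LARGE_DATA keeps discovery on UDP multicast and moves user data to TCP/SHM.
    DomainParticipantQos qos = PARTICIPANT_QOS_DEFAULT;
    qos.setup_transports(eprosima::fastdds::rtps::BuiltinTransports::LARGE_DATA);
    return DomainParticipantFactory::get_instance()->create_participant(0, qos);
}
```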

> Tested with Fast DDS 2.14.1 today, and the deadlock appeared easily. I tested on localhost with a discovery server as follows:
>
> 1. Start a discovery server in a console: bin/fast-discovery-server -i 0 -t 127.0.0.1 -q 17480
> 2. Start two pub processes with 200 topics in two consoles: ./test-dds pub
> 3. Start two sub processes with the same 200 topics, in sequence and as quickly as possible, in two consoles: ./test-dds sub
> 4. Unfortunately I hit the deadlock during the first two tests. At that point, killing the discovery server (Ctrl-C) gets stuck.
> 5. If, luckily, all processes transfer data normally during a test, then kill the discovery server and restart it after one minute, and the deadlock appears. After some minutes I killed the one pub whose published-data count was unchanged, and then the discovery-server process ended automatically.
