Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Connection problem between runtime and gateway #255

Open
lurenpluto opened this issue May 7, 2023 · 3 comments
Open

Connection problem between runtime and gateway #255

lurenpluto opened this issue May 7, 2023 · 3 comments
Assignees
Labels
BDT bucky data transfer protocol bug Something isn't working SN SN Server

Comments

@lurenpluto
Copy link
Member

After switching active OOD, there is a situation where runtime and OOD cannot be connected, and pings connected through bdt fail

OOD: 5bnZHzXqdRwun6NkzMgksUirAdspUnUwLBFYG91QC55p
runtime: 5bnZHzZXgN4JsbKaFsqZ75HczHHh5dsrcAh9Ake4wKwA

Runtime side

Last connection on runtime side

Details

4904: [2023-05-06 20:47:15.658443 +08:00] INFO [ThreadId(4)] [component\cyfs-lib\src\requestor\bdt.rs:181] http-bdt request to 5bnZHzXqdRwun6NkzMgksUirAdspUnUwLBFYG91QC55p:84 success! during=208ms, seq=TempSeq(4211563353)

The next connections to the gateway are all timeout errors.

Details

4933: [2023-05-06 20:47:26.396892 +08:00] WARN [ThreadId(8)] [component\cyfs-lib\src\requestor\bdt.rs:110] connect to 5bnZHzXqdRwun6NkzMgksUirAdspUnUwLBFYG91QC55p:84 failed! with_desc=true, during=5009ms, err: (Timeout, future has timed out, None)

And after restarting the runtime side, it also can't connect to the gateway, so it looks like there is a problem on the gateway side

OOD side

Last request log received from runtime

Details

[2023-05-06 20:47:18.905150 +08:00] INFO [ThreadId(7)] [component\cyfs-stack\src\interface\http_bdt_listener.rs:167] recv bdt http request. source=5bnZHzZXgN4JsbKaFsqZ75HczHHh5dsrcAh9Ake4wKwA, seq=TempSeq(4211563355), method=POST, url=http:// 5bnzhzxqdrwun6nkzmgksuiradspunuwlbfyg91qc55p:84/non/, len=Some(26)

It looks like the OOD is running and for some reason other devices can't connect to itself

logs.zip

@lurenpluto lurenpluto added bug Something isn't working BDT bucky data transfer protocol labels May 7, 2023
@jing-git
Copy link
Collaborator

jing-git commented May 8, 2023

For 5bnZHzXqdRwun6NkzMgksUirAdspUnUwLBFYG91QC55p, the final ping for keepalive is:
[2023-05-06 20:46:43.015978 +08:00] INFO [ThreadId(19)] [component\cyfs-bdt\src\sn\client\ping\clients.rs:461] PingClients{local:5bnZHzXqdRwun6NkzMgksUirAdspUnUwLBFYG91QC55p} ping-resp, sn: 5bnZVFY5EYo6LXxrUKahLTEYqSExZZ7tkFvEDwfyojMt/L4udp120.24.55.87:8070, seq: 4211558552.
[2023-05-06 20:47:08.043060 +08:00] INFO [ThreadId(19)] [component\cyfs-bdt\src\sn\client\ping\clients.rs:461] PingClients{local:5bnZHzXqdRwun6NkzMgksUirAdspUnUwLBFYG91QC55p} ping-resp, sn: 5bnZVFY5EYo6LXxrUKahLTEYqSExZZ7tkFvEDwfyojMt/L4udp120.24.55.87:8070, seq: 4211558553.

And 5bnZHzZXgN4JsbKaFsqZ75HczHHh5dsrcAh9Ake4wKwA connect 5bnZHzXqdRwun6NkzMgksUirAdspUnUwLBFYG91QC55p
failure starts from:
[2023-05-06 20:57:30.856635 +08:00] ERROR [ThreadId(7)] [component\cyfs-bdt\src\tunnel\builder\connect_stream\builder.rs:328] ConnectStreamBuilder{stream:StreamContainer {sequence:TempSeq(4211602500), local:5bnZHzZXgN4JsbKaFsqZ75HczHHh5dsrcAh9Ake4wKwA, remote:5bnZHzXqdRwun6NkzMgksUirAdspUnUwLBFYG91QC55p, port:84, id:1638150600 }} call sn session failed, sn=5bnZVFY5EYo6LXxrUKahLTEYqSExZZ7tkFvEDwfyojMt, err=err: (NotFound, sn response error, None)

The sn-miner use client_ping_timeout(default is 5min) to purge clients' cache, so, it's because 5bn..55p hasn't pinged from keep alive for too long, SN thinks it has dropped, resulting in a NotFound when 5bn..KwA calls, thus the connection fails.

@lurenpluto
Copy link
Member Author

For 5bnZHzXqdRwun6NkzMgksUirAdspUnUwLBFYG91QC55p, the final ping for keepalive is: [2023-05-06 20:46:43.015978 +08:00] INFO [ThreadId(19)] [component\cyfs-bdt\src\sn\client\ping\clients.rs:461] PingClients{local:5bnZHzXqdRwun6NkzMgksUirAdspUnUwLBFYG91QC55p} ping-resp, sn: 5bnZVFY5EYo6LXxrUKahLTEYqSExZZ7tkFvEDwfyojMt/L4udp120.24.55.87:8070, seq: 4211558552. [2023-05-06 20:47:08.043060 +08:00] INFO [ThreadId(19)] [component\cyfs-bdt\src\sn\client\ping\clients.rs:461] PingClients{local:5bnZHzXqdRwun6NkzMgksUirAdspUnUwLBFYG91QC55p} ping-resp, sn: 5bnZVFY5EYo6LXxrUKahLTEYqSExZZ7tkFvEDwfyojMt/L4udp120.24.55.87:8070, seq: 4211558553.

And 5bnZHzZXgN4JsbKaFsqZ75HczHHh5dsrcAh9Ake4wKwA connect 5bnZHzXqdRwun6NkzMgksUirAdspUnUwLBFYG91QC55p failure starts from: [2023-05-06 20:57:30.856635 +08:00] ERROR [ThreadId(7)] [component\cyfs-bdt\src\tunnel\builder\connect_stream\builder.rs:328] ConnectStreamBuilder{stream:StreamContainer {sequence:TempSeq(4211602500), local:5bnZHzZXgN4JsbKaFsqZ75HczHHh5dsrcAh9Ake4wKwA, remote:5bnZHzXqdRwun6NkzMgksUirAdspUnUwLBFYG91QC55p, port:84, id:1638150600 }} call sn session failed, sn=5bnZVFY5EYo6LXxrUKahLTEYqSExZZ7tkFvEDwfyojMt, err=err: (NotFound, sn response error, None)

The sn-miner use client_ping_timeout(default is 25s) to purge clients' cache, so, it's because 5bn..55p hasn't pinged from keep alive for too long, SN thinks it has dropped, resulting in a NotFound when 5bn..KwA calls, thus the connection fails.

So this problem, is also bdt's sn ping stopped unexpectedly, resulting in the problem of being considered offline by the SN server?

It should be the same or similar problem as the one below
#250

So we need to review the SN ping logic inside the bdt stack to see what would cause the ping loop to be aborted

@jing-git
Copy link
Collaborator

jing-git commented May 24, 2023

All tasks are blocked, causing the bdt-stack is unable to communicate properly:

[2023-05-06 20:47:19.246450 +08:00] WARN [ThreadId(7)] [component\cyfs-stack\src\interface\http_bdt_listener.rs:198] bdt http request complete with error! status=501, seq=TempSeq(4211563355), during=341ms
[2023-05-06 20:48:00.881778 +08:00] INFO [ThreadId(12)] [component\cyfs-debug\src\check\dead.rs:145] process still alive ThreadId(12), 1.1.1.83-beta (23-05-04)
[2023-05-06 20:49:00.882825 +08:00] INFO [ThreadId(12)] [component\cyfs-debug\src\check\dead.rs:145] process still alive ThreadId(12), 1.1.1.83-beta (23-05-04)
[2023-05-06 20:50:00.883364 +08:00] INFO [ThreadId(12)] [component\cyfs-debug\src\check\dead.rs:145] process still alive ThreadId(12), 1.1.1.83-beta (23-05-04)
[2023-05-06 20:51:00.885583 +08:00] INFO [ThreadId(12)] [component\cyfs-debug\src\check\dead.rs:145] process still alive ThreadId(12), 1.1.1.83-beta (23-05-04)
[2023-05-06 20:52:00.887406 +08:00] INFO [ThreadId(12)] [component\cyfs-debug\src\check\dead.rs:145] process still alive ThreadId(12), 1.1.1.83-beta (23-05-04)
[2023-05-06 20:52:00.887488 +08:00] ERROR [ThreadId(12)] [component\cyfs-debug\src\check\dead.rs:122] task system dead timeout, now will exit process! last_active=13327850820839775

After 5 minutes and 30 seconds, when querying 5bnZHzXqdRwun6NkzMgksUirAdspUnUwLBFYG91QC55p through sn, it will definitely not be found, and during this period, there will be no communication with it:

SN remove 5bnZHzXqdRwun6NkzMgksUirAdspUnUwLBFYG91QC55p's cache item, so not found:
[2023-05-06 20:57:30.856635 +08:00] ERROR [ThreadId(7)] [component\cyfs-bdt\src\tunnel\builder\connect_stream\builder.rs:328] ConnectStreamBuilder{stream:StreamContainer {sequence:TempSeq(4211602500), local:5bnZHzZXgN4JsbKaFsqZ75HczHHh5dsrcAh9Ake4wKwA, remote:5bnZHzXqdRwun6NkzMgksUirAdspUnUwLBFYG91QC55p, port:84, id:1638150600 }} call sn session failed, sn=5bnZVFY5EYo6LXxrUKahLTEYqSExZZ7tkFvEDwfyojMt, err=err: (NotFound, sn response error, None)

Task is blocked, so all the packets can no be handle:
[2023-05-06 20:49:24.056021 +08:00] INFO [ThreadId(6)] [component\cyfs-bdt\src\tunnel\udp.rs:148] UdpTunnel{local:L4udp192.168.100.75:8051,remote:L4udp192.168.100.75:8050} dead for connecting timeout

When the check\dead.rs detect this situation, the gateway will be restart? If so, the device an be connected after 5min.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
BDT bucky data transfer protocol bug Something isn't working SN SN Server
Projects
Status: 🐞Discovered Bugs
Development

No branches or pull requests

2 participants