SC FMS is stuck on infinity connection retries #4234

Closed
dincho opened this issue Dec 15, 2023 · 12 comments · Fixed by #4323
Labels: area/statechannels, kind/bug

Comments

@dincho
Member

dincho commented Dec 15, 2023

2023-12-15 10:49:34.131 [warning] <0.9386.2130>@aesc_fsm:noise_accept:4221 Noise accept failed with {exception,{eacces,443}}
2023-12-15 10:49:34.131 [error] <0.1767.0>@aesc_listeners:new_listener_:132 Cannot open State Channel listener on 443: eacces

When the remote FSM is unavailable, the local one keeps retrying forever, with no backoff and no stop condition.

@dincho dincho added kind/bug Issues or PRs related to a bug area/statechannels Issues or PRs related to state channels labels Dec 15, 2023
@dincho dincho changed the title SC FMS is stuff on infinity connection retries SC FMS is stuck on infinity connection retries Dec 15, 2023
@uwiger
Member

uwiger commented Dec 15, 2023

What is the application?
It's the responder doing the accept, and if it fails (there is a default of 3 retries), it may propagate out to something that keeps trying to spawn a server-side responder.

Not saying that this is what happens, only that it would help with more context.

@dincho
Member Author

dincho commented Dec 18, 2023

Unfortunately I don't have many details, other than that the FSM was initialized with the wrong port (443) and I saw this message looping forever. After a restart it was gone.

@davidyuk
Member

davidyuk commented Feb 14, 2024

It seems I reported the same issue in #4235.
I can't use the FSM on testnet or mainnet because of {"event":"died"}.

@mitchelli
Contributor

mitchelli commented Mar 28, 2024

I reproduced this issue with @davidyuk's snippet by changing the port to 443 (a privileged port on Linux; the Mac doesn't have this issue) and changing the role to responder.

17:52:34.621 [debug] No listener on port 443 yet - create one
17:52:34.621 [error] Cannot open State Channel listener on 443: eacces
17:52:34.621 [warning] Noise accept failed with {exception,{eacces,443}}
17:52:34.621 [debug] No listener on port 443 yet - create one
17:52:34.621 [error] Cannot open State Channel listener on 443: eacces
17:52:34.621 [warning] Noise accept failed with {exception,{eacces,443}}
17:52:34.621 [debug] No listener on port 443 yet - create one
17:52:34.622 [error] Cannot open State Channel listener on 443: eacces
17:52:34.622 [warning] Noise accept failed with {exception,{eacces,443}}
17:52:34.630 [debug] No listener on port 443 yet - create one
17:52:34.631 [error] Cannot open State Channel listener on 443: eacces
17:52:34.631 [warning] Noise accept failed with {exception,{eacces,443}}
17:52:34.631 [debug] No listener on port 443 yet - create one
17:52:34.631 [error] Cannot open State Channel listener on 443: eacces
17:52:34.631 [warning] Noise accept failed with {exception,{eacces,443}}
17:52:34.631 [debug] No listener on port 443 yet - create one
17:52:34.631 [error] Cannot open State Channel listener on 443: eacces
17:52:34.631 [warning] Noise accept failed with {exception,{eacces,443}}
17:52:34.631 [debug] No listener on port 443 yet - create one
17:52:34.631 [error] Cannot open State Channel listener on 443: eacces
17:52:34.631 [warning] Noise accept failed with {exception,{eacces,443}}
17:52:34.631 [debug] No listener on port 443 yet - create one
17:52:34.631 [error] Cannot open State Channel listener on 443: eacces
17:52:34.631 [warning] Noise accept failed with {exception,{eacces,443}}
17:52:34.631 [debug] No listener on port 443 yet - create one
17:52:34.631 [error] Cannot open State Channel listener on 443: eacces
17:52:34.631 [warning] Noise accept failed with {exception,{eacces,443}}
17:52:34.631 [debug] No listener on port 443 yet - create one
17:52:34.631 [error] Cannot open State Channel listener on 443: eacces
17:52:34.632 [warning] Noise accept failed with {exception,{eacces,443}}
17:52:34.632 [debug] No listener on port 443 yet - create one
17:52:34.632 [error] Cannot open State Channel listener on 443: eacces
17:52:34.632 [warning] Noise accept failed with {exception,{eacces,443}}
17:52:34.632 [debug] No listener on port 443 yet - create one
17:52:34.632 [error] Cannot open State Channel listener on 443: eacces
17:52:34.632 [warning] Noise accept failed with {exception,{eacces,443}}
17:52:34.657 [debug] No listener on port 443 yet - create one
17:52:34.657 [error] Cannot open State Channel listener on 443: eacces
17:52:34.657 [warning] Noise accept failed with {exception,{eacces,443}}
17:52:34.657 [debug] No listener on port 443 yet - create one
17:52:34.657 [error] Cannot open State Channel listener on 443: eacces
17:52:34.657 [warning] Noise accept failed with {exception,{eacces,443}}

I'll keep on digging.

Can reproduce on the Mac by using nc to listen on the requested port:

18:23:17.911 [error] Cannot open State Channel listener on 443: eaddrinuse
18:23:17.937 [warning] Noise accept failed with {exception,{eaddrinuse,443}}
18:23:17.937 [error] Cannot open State Channel listener on 443: eaddrinuse
18:23:17.937 [warning] Noise accept failed with {exception,{eaddrinuse,443}}
18:23:17.937 [error] Cannot open State Channel listener on 443: eaddrinuse
18:23:17.937 [warning] Noise accept failed with {exception,{eaddrinuse,443}}
18:23:17.937 [error] Cannot open State Channel listener on 443: eaddrinuse
18:23:17.937 [warning] Noise accept failed with {exception,{eaddrinuse,443}}
18:23:17.937 [error] Cannot open State Channel listener on 443: eaddrinuse
18:23:17.937 [warning] Noise accept failed with {exception,{eaddrinuse,443}}
18:23:17.938 [error] Cannot open State Channel listener on 443: eaddrinuse
18:23:17.938 [warning] Noise accept failed with {exception,{eaddrinuse,443}}
18:23:17.938 [error] Cannot open State Channel listener on 443: eaddrinuse
18:23:17.938 [warning] Noise accept failed with {exception,{eaddrinuse,443}}
18:23:17.938 [error] Cannot open State Channel listener on 443: eaddrinuse
18:23:17.938 [warning] Noise accept failed with {exception,{eaddrinuse,443}}
18:23:17.938 [error] Cannot open State Channel listener on 443: eaddrinuse
18:23:17.938 [warning] Noise accept failed with {exception,{eaddrinuse,443}}
18:23:17.938 [error] Cannot open State Channel listener on 443: eaddrinuse
18:23:17.938 [warning] Noise accept failed with {exception,{eaddrinuse,443}}
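
The same `eaddrinuse` condition can be shown in isolation, without a node. The following is a minimal Python sketch that assumes nothing about the node's internals: one socket stands in for `nc` and occupies a port, then a second attempt to listen on that port is refused. The helper name `try_listen` is hypothetical.

```python
import errno
import socket


def try_listen(host: str, port: int):
    """Try to open a listening socket; return (socket, None) or (None, errno name)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.bind((host, port))
        sock.listen()
        return sock, None
    except OSError as exc:
        sock.close()
        return None, errno.errorcode.get(exc.errno, str(exc.errno))


# Stand-in for `nc`: occupy an ephemeral port picked by the OS.
occupier, err = try_listen("127.0.0.1", 0)
port = occupier.getsockname()[1]

# A second listener on the same port is refused.
second, second_err = try_listen("127.0.0.1", port)
print(second_err)
occupier.close()
```

On Linux and macOS the second attempt should report `EADDRINUSE`. Calling `try_listen` with port 443 as an unprivileged user on a default Linux setup would be expected to report `EACCES` instead, matching the first log above.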

@mitchelli mitchelli self-assigned this Mar 28, 2024
@davidyuk
Member

My snippet is expected to work fine on localhost; I have issues on the public endpoints (wss://testnet.aeternity.io/channel and wss://mainnet.aeternity.io/channel).

@mitchelli
Contributor

I was reproducing the issue @dincho observed above. I am not sure if they are related; I will have a look at that next.

@dincho
Member Author

dincho commented Apr 1, 2024

@mitchelli could you please double-check whether the socket configuration includes SO_REUSEADDR?

@dincho
Member Author

dincho commented Apr 1, 2024

In general, the clients and server are expected to keep long-running connections open, which should work fine. But if they need to reconnect quickly for some reason, the reconnect is probably too fast and the socket is still busy (i.e. not properly closed, or the close timeout has not been reached).
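
For reference, this is what setting the option looks like on a plain listening socket. It is a minimal Python sketch, not the node's Erlang code; the helper name `open_listener` is hypothetical. In Erlang the corresponding `gen_tcp:listen/2` option is `{reuseaddr, true}`, and whether the State Channel listener passes it is exactly what would need checking.

```python
import socket


def open_listener(host: str, port: int, reuse_addr: bool = True) -> socket.socket:
    """Open a TCP listener, optionally with SO_REUSEADDR set before bind()."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    if reuse_addr:
        # Must be set before bind(); lets a restarted listener rebind while
        # old connections on the port are still in TIME_WAIT.
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    try:
        sock.bind((host, port))
        sock.listen()
    except OSError:
        sock.close()
        raise
    return sock


listener = open_listener("127.0.0.1", 0)
reuse_enabled = listener.getsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR) != 0
listener.close()
```

On Linux, SO_REUSEADDR only helps with lingering sockets from a previous listener. It does not help when another process is actively listening on the port, or when the port is privileged, which are the two cases reproduced earlier in this thread.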

@mitchelli
Contributor

I think the issue is that the server is trying to open a port for listening that it is not allowed to use. On Linux, the user running the node doesn't have permission to listen on privileged ports. On the Mac I reproduced it by running another process that listened on the port the node wanted, before the node tried to listen. It could be that the API is being used incorrectly. I don't know enough about state channels. I'm not sure why the client can tell the node to listen on a certain port.

@dincho
Member Author

dincho commented Apr 2, 2024

> I'm not sure why the client can tell the node to listen on a certain port.

Ah, now I remember!

This is how it works currently, as the SC FSMs need a communication channel. So, the initiator FSM tries to bind on the corresponding port sent via the API (WS) call.

First off, the OP error (trying to bind/listen on a given port) should have a backoff time and a maximum number of tries, and then die.
So the fix of this should be pretty clear.
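
The behaviour asked for here (bounded attempts, a growing delay between them, then giving up) can be sketched as follows. This is a minimal Python illustration, not the fix that went into the node: the function name `retry_with_backoff`, the 0.5 s base delay, and the doubling factor are assumptions; only the default of three attempts comes from this thread.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    action: Callable[[], T],
    max_attempts: int = 3,
    base_delay: float = 0.5,
    factor: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `action` until it succeeds or `max_attempts` is reached.

    Waits base_delay, base_delay * factor, ... between attempts and re-raises
    the last OSError once the attempts are used up.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    delay = base_delay
    for attempt in range(1, max_attempts + 1):
        try:
            return action()
        except OSError:
            if attempt == max_attempts:
                raise  # give up so the caller can terminate cleanly
            sleep(delay)
            delay *= factor


def always_refused():
    # Stands in for a listen call that keeps failing with eacces.
    raise PermissionError(13, "Permission denied")


delays = []  # record the waits instead of actually sleeping
try:
    retry_with_backoff(always_refused, sleep=delays.append)
except OSError as exc:
    outcome = type(exc).__name__
```

Passing `sleep` as a parameter keeps the sketch testable without real waiting. A real implementation might also skip retries entirely for errors that cannot resolve on their own, such as a permission error on a privileged port.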

However, the general issue (new ticket?) with that approach is that while this concept might be acceptable for local/testing/playground setups, it does not work for production systems. No sane administrator would allow an app to bind on arbitrary ports, let alone ports controlled by a user/external API. Furthermore, this means a port range open in the network firewall, which is also a no-go.

The only way this could work in production is to remove the host/port parameters from the WS API and make them configurable on the server side. It must also work on a single port (multiplexing); that is, a single FSM/node should be able to accept any number of remote FSM Noise connections, regardless of the responders etc.

Currently there are the controlling channel WS API (/channel) to the FSMs and port 3114 (FSM Noise). Knowing the actual FSM host is another issue I should try to solve somehow.

@uwiger
Member

uwiger commented Apr 3, 2024

> I'm not sure why the client can tell the node to listen on a certain port.

It can't. But I think there is a bug in there.

The casino SC demo revealed a few weaknesses in the SC connection handling, and there was a PR (#4011) to address this. The PR wasn't well tested (the demo project was put on ice, I think), but eventually merged anyway. I later noticed that the SC Market demo also broke, but haven't had the time to debug it.

In the SC Market case, it may have something to do with using a timeout on the listener side and constantly restarting them. I don't really think there is any reason to use a timeout on the listeners, but if one does, it needs to play nice with supervision/restarts and of course log reporting.

And yes, an SC responder needs to be able to multiplex acceptors on a single listen socket. This is harder than it sounds, since it has to match a session with the appropriate responder and ensure that reconnecting clients find the same responder as before, including potentially matching Noise crypto keys.
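
To make the matching problem concrete, here is a minimal Python sketch of a registry that routes sessions arriving on one shared listener to the right responder. Everything in it is an assumption for illustration: the `ResponderRegistry` class, the use of the (initiator, responder) public-key pair as the lookup key, and the idea that this pair is known by the time routing happens. It is not how the node implements, or plans to implement, multiplexing.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple

SessionKey = Tuple[bytes, bytes]  # (initiator pubkey, responder pubkey)


@dataclass
class ResponderRegistry:
    """Maps a session key to the responder FSM waiting for that session."""

    _responders: Dict[SessionKey, str] = field(default_factory=dict)

    def register(self, initiator: bytes, responder: bytes, fsm_id: str) -> None:
        key = (initiator, responder)
        if key in self._responders:
            raise KeyError(f"responder already registered for {key!r}")
        self._responders[key] = fsm_id

    def route(self, initiator: bytes, responder: bytes) -> Optional[str]:
        """Return the FSM for an incoming (or reconnecting) session, if any."""
        return self._responders.get((initiator, responder))

    def unregister(self, initiator: bytes, responder: bytes) -> None:
        self._responders.pop((initiator, responder), None)


registry = ResponderRegistry()
registry.register(b"initiator-A", b"responder-X", "fsm-1")
registry.register(b"initiator-B", b"responder-X", "fsm-2")

first = registry.route(b"initiator-A", b"responder-X")
reconnect = registry.route(b"initiator-A", b"responder-X")  # same FSM as before
unknown = registry.route(b"initiator-C", b"responder-X")  # no responder waiting
```

Because the entry stays in place until `unregister` is called, a reconnecting client with the same key pair is routed to the same responder as before. The sketch leaves out the hard parts named above, such as learning the key pair from the Noise handshake and coordinating with supervision and restarts.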

@mitchelli
Contributor

> First of, the OP error (trying to bind/listen) on a given port should have a backoff time and max tries then die. So the fix of this should be pretty clear.

There was a bug in the function clause, which I fixed. Now it tries three times and then fails, but the attempts are very close together. I am not sure if this behaviour is correct:


16:08:27.779 [error] Cannot open State Channel listener on 443: eaddrinuse
16:08:27.779 [warning] Noise accept failed with {exception,{eaddrinuse,443}}
16:08:27.779 [error] Cannot open State Channel listener on 443: eaddrinuse
16:08:27.779 [warning] Noise accept failed with {exception,{eaddrinuse,443}}
16:08:27.779 [error] Cannot open State Channel listener on 443: eaddrinuse
16:08:27.779 [warning] Noise accept failed with {exception,{eaddrinuse,443}}
16:08:27.779 [error] Failed with failed_spawning_noise, Initiator: <<154,122,127,162,221,11,106,64,140,166,149,255,64,126,209,240,176,59,34,163,60,214,74,1,14,146,205,98,247,71,45,16>>, Responder: <<206,167,173,228,112,201,249,157,157,78,64,8,128,168,111,29,73,187,68,75,98,241,26,158,187,100,187,207,235,115,254,243>>
16:08:27.780 [error] Failed to start noise session: Error = failed_spawning_noise
16:08:27.780 [error] CRASH REPORT Process <0.1953.0> with 0 neighbours exited with reason: failed_noise_session_start in gen_statem:init_result/8 line 1023
16:08:27.780 [info] Handler critical error: failed_noise_session_start
