SC FSM is stuck on infinite connection retries #4234
Comments
What is the application? Not saying that this is what happens, only that it would help with more context. |
Unfortunately I don't have many details, other than that the FSM was initialized with the wrong port (443) and I saw this message looping forever. After a restart it was gone. |
seems I reported the same issue in #4235 |
I reproduced this issue with @davidyuk's snippet by changing the port to 443 (a privileged port on Linux; the Mac doesn't have this issue) and changing the role to responder.
I'll keep digging. I can also reproduce it on the Mac by using nc to listen on the requested port:
|
My snippet is expected to work fine on localhost; I have issues on the public endpoints (wss://testnet.aeternity.io/channel and wss://mainnet.aeternity.io/channel). |
I was reproducing the issue @dincho observed above. I am not sure if the two are related; I will have a look at that next. |
@mitchelli could you please double check if the socket configuration includes |
In general the clients/server are expected to have long running open connections which should work fine, but in case they need to reconnect (fast) for some reason, it's probably too fast and the socket state is still BUSY (i.e. not properly closed, or the close timeout not reached). |
I think the issue is that the server is trying to open a port for listening that it is not allowed to open. On Linux, the user running the node doesn't have permission to listen on privileged ports. On the Mac I reproduced it by starting another process that listened on the port the node wanted before the node tried to listen on it. It could be that the API is being used incorrectly. I don't know enough about state channels, and I'm not sure why the client can tell the node to listen on a certain port. |
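The two failure modes described above (port already taken, or a privileged port without permission) can be told apart by the error code the operating system returns from the bind call. The following is a minimal Python sketch of that distinction, not the node's Erlang code; `try_listen` is a hypothetical helper name introduced only for illustration.

```python
import errno
import socket
from typing import Optional


def try_listen(host: str, port: int) -> Optional[socket.socket]:
    """Try to listen on (host, port).

    Returns the listening socket, or None when the OS refuses the bind
    for one of the two reasons discussed in this thread.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.bind((host, port))
        sock.listen()
        return sock
    except OSError as exc:
        sock.close()
        # EADDRINUSE: another process already listens on the port.
        # EACCES: no permission, e.g. a privileged port as a non-root user.
        if exc.errno in (errno.EADDRINUSE, errno.EACCES):
            return None
        raise
```

Calling `try_listen("127.0.0.1", 0)` and then calling it again with the port the first socket was assigned reproduces the "port already in use" case without needing nc.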
Ah, now I remember! This is how it currently works, as the SC FSMs need a communication channel. So, the initiator FSM tries to bind on the corresponding port sent via the API (WS) call. First off, the OP error (trying to bind/listen on a given port) should have a backoff time and a maximum number of tries, and then die. However, the general issue (new ticket?) with that approach is that while this concept might be acceptable for local/testing/playground setups, it does not work for production systems. No sane administrator would allow an app to bind on arbitrary ports, let alone ports controlled by a user/external API; furthermore, this means keeping a port range open in the network firewall, which is also a no-go. The only way this could work in production is to remove the host/port parameters (in the WS API) and make them configurable (server side). It must also work on a single port (multiplexing), that is, a single FSM/node should be able to accept N remote FSM Noise connections, regardless of the responders etc. Currently there are the controlling channel WS API (/channel) to the FSMs and port 3114 (FSM Noise); knowing the actual FSM host is another issue I should try to solve somehow. |
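As an illustration of "a backoff time and max tries, then die", here is a small Python sketch of a bounded retry loop with exponential backoff. It is not taken from the node's code base; the function name `retry_with_backoff` and all default values (five tries, 0.5 s base delay, doubling, 30 s cap) are assumptions chosen for the example.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    attempt: Callable[[], T],
    max_tries: int = 5,      # assumed limit, not a value from the node
    base: float = 0.5,       # assumed first delay in seconds
    factor: float = 2.0,     # each delay is `factor` times the previous one
    cap: float = 30.0,       # upper bound for a single delay
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `attempt` until it succeeds or `max_tries` attempts have failed."""
    delay = base
    for n in range(1, max_tries + 1):
        try:
            return attempt()
        except OSError as exc:
            if n == max_tries:
                # Stop for good instead of looping forever.
                raise RuntimeError(f"giving up after {max_tries} attempts") from exc
            sleep(min(delay, cap))
            delay *= factor
    raise ValueError("max_tries must be at least 1")
```

The `sleep` parameter is injectable so the schedule can be checked without actually waiting; with the defaults above, the waits between the five attempts are 0.5, 1, 2 and 4 seconds.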
It can't. But I think there is a bug in there. The casino SC demo revealed a few weaknesses in the SC connection handling, and there was a PR (#4011) to address this. The PR wasn't well tested (the demo project was put on ice, I think), but it was eventually merged anyway. I later noticed that the SC Market demo also broke, but haven't had the time to debug it. In the SC Market case, it may have something to do with using a timeout on the listeners and constantly restarting them. I don't really think there is any reason to use a timeout on the listeners, but if one does, it needs to play nice with supervision/restarts and of course log reporting. And yes, an SC responder needs to be able to multiplex acceptors on a single listen socket, which is harder than it sounds, since it has to match a session with the appropriate responder, as well as ensure that reconnecting clients find the same responder as before, including potentially matching |
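The session-matching requirement can be sketched as a lookup table that a single acceptor consults after reading some identifying key from a new connection. The Python below is only a data-structure sketch under that assumption: `Responder`, `SessionRouter` and the notion of a `session_key` are hypothetical names, and the thread does not specify what the key would actually consist of.

```python
from typing import Dict, Hashable, List, Optional


class Responder:
    """Stand-in for a responder FSM; it only records connections handed to it."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.connections: List[object] = []

    def attach(self, conn: object) -> None:
        self.connections.append(conn)


class SessionRouter:
    """Maps a session key to the responder that owns that session."""

    def __init__(self) -> None:
        self._by_session: Dict[Hashable, Responder] = {}

    def register(self, session_key: Hashable, responder: Responder) -> None:
        self._by_session[session_key] = responder

    def route(self, session_key: Hashable, conn: object) -> Optional[Responder]:
        """Hand `conn` to the responder registered for `session_key`.

        Returns None for an unknown session; the caller should then close
        the connection instead of retrying.
        """
        responder = self._by_session.get(session_key)
        if responder is not None:
            responder.attach(conn)
        return responder
```

Because the mapping outlives any single connection, a client that reconnects with the same key is routed to the same responder as before, which is the property asked for above.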
There was a bug in the function clause, which I fixed. Now it tries three times and then fails, but the attempts are very close together. I am not sure if the behaviour is correct:
|
When the remote FSM is unavailable, the local one keeps retrying forever, with no backoff and no limit on the number of attempts.