Sometimes connections to a gateway are only established after 10s #4058
Just hit this now on a fresh connection. Client logs:
Ping output:
Environment: Staging
Yeah, I've been hitting it pretty regularly. It is always exactly 10s. I am almost certain it is a problem on the gateway but I haven't looked at the logs yet.
👍 Moving to high priority since it affects all users. I can paste gateway logs here next time I encounter it.
After some more testing trying to reproduce, it seems to happen only with gateways that have been up for some time, making it tricky to debug. Perhaps there's a timer or something that doesn't get reset that affects new clients connecting.
Yeah, it seems to happen when the gateway has been idle for a bit.
Are we logging …
@thomaseizinger Just hit this again. Here are the relevant logs that capture the 10 seconds from the client's initial intent until the pings started flowing:
From the logs above, it seems that the delay may be caused by some sort of WireGuard state? Looks like it didn't work until the first keepalive was sent after the session was set up.
I thought that too but we do have a session much earlier:
This is when the pings started working:
Is it possible it somehow missed a window, or some other lookup failed, causing it to wait until that point? Maybe this happens with gateways after clients have connected and disconnected, maybe from the same IP / peer?
Well, it doesn't happen if the same peer connects again, that is the weird bit. But our ICE timeout is around ~10 seconds, and after that we clean up all the state for this peer. I am not aware of any state we keep around beyond that which would make it faster to connect again afterwards. It is almost as if WireGuard is buffering the data we want to send and needs to send / receive a keep-alive before it will actually encapsulate the packet and we can send it over. But I don't understand how that is related at all to the cold-start symptom.
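As a rough illustration of that hypothesis: a WireGuard-style implementation typically has to queue outbound packets while no session exists and can only flush them once a handshake (or keep-alive exchange) completes. This is a minimal sketch, not Firezone's or boringtun's actual code; all names here are hypothetical.

```rust
use std::collections::VecDeque;

/// Hypothetical per-peer state, for illustration only.
struct Peer {
    session_established: bool,
    /// Packets we wanted to encapsulate before the handshake finished.
    queued: VecDeque<Vec<u8>>,
}

impl Peer {
    /// Try to send an IP packet to this peer.
    fn send(&mut self, packet: Vec<u8>) -> Option<Vec<u8>> {
        if self.session_established {
            // With an active session we can encapsulate immediately.
            Some(encapsulate(&packet))
        } else {
            // No session yet: buffer the packet and wait for the handshake
            // or keep-alive to complete. If nothing proactively flushes this
            // queue, the packet sits here until the next timer fires, which
            // would look exactly like a fixed multi-second delay to the user.
            self.queued.push_back(packet);
            None
        }
    }

    /// Called once the handshake completes; drains everything we buffered.
    fn on_session_established(&mut self) -> Vec<Vec<u8>> {
        self.session_established = true;
        self.queued.drain(..).map(|p| encapsulate(&p)).collect()
    }
}

fn encapsulate(packet: &[u8]) -> Vec<u8> {
    // Placeholder for the real encryption / encapsulation step.
    packet.to_vec()
}
```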
@thomaseizinger It just happened again, and this time I think I got better logs. I think there's a clue: it looks like the state quickly goes to … Once the Relay auth succeeds, the ping started working. This is about 10 seconds from authorization to success. Could it be that connections are held up by pending Relay allocations even if they've already been directly established?
One more set of logs, if it's helpful, from the first ping until the time the pings were flowing. This time it's similar to the previously posted logs, where it waited until the first …
What may be happening is that the first handshake flows over a relay and then flaps to another socket. I wonder if #4164 would help with this.
Can you get logs from the gateway too?
@thomaseizinger All of these are from the staging gateway in AWS.
Ah yes, didn't look properly, sorry 😅 /me sips coffee
I've had a look at the logs but I can't identify anything useful. I think we need to further narrow down when this happens. One thing we could try is restarting the gateway and checking whether it happens on a completely new connection. Another interesting thing to know would be whether we are actually sending data out and it is dropped by the gateway, or whether we are not actually sending it at all. Activating …
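The specific log target is elided above. With the usual `tracing` / `RUST_LOG` setup, trace output for a single target can be enabled without raising the level everywhere; the `wire` target used here is taken from a later comment in this thread and may differ from what was actually meant.

```rust
// Assumes tracing-subscriber with the "env-filter" feature.
use tracing_subscriber::EnvFilter;

fn init_logging() {
    // e.g. RUST_LOG="info,wire=trace" ./firezone-gateway
    let filter = EnvFilter::try_from_default_env()
        .unwrap_or_else(|_| EnvFilter::new("info,wire=trace"));

    tracing_subscriber::fmt()
        .with_env_filter(filter)
        .init();
}
```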
Tested with a fresh gateway restart, doesn't seem to affect the issue. In fact it never happens with a fresh gateway. Should we lengthen the WireGuard timeout to 25s to see if it's that? That would help distinguish WireGuard timeouts from str0m ones? I wonder if it's reproducible in CI somehow. I can enable the trace logs for the staging gateway. |
I think we should first start with
I doubt this: str0m never interacts with our traffic, it is just a state machine that generates STUN messages. We can see from the logs that we have a connection and that we've handshaked the WireGuard tunnel. The causes I can see are:
The coupling with the WireGuard keepalive makes me think it is a problem of WireGuard buffering or rejecting our packet, either on the client or the gateway. It could also be that WireGuard discards it? I'd have to look at the code again but I'd expect there to be logs for that.
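To make the earlier point about str0m concrete: a sans-IO ICE library only tells the caller which STUN message to transmit next and never touches the tunneled packets themselves, so it cannot delay or drop application traffic directly. The sketch below uses illustrative names, not str0m's real API.

```rust
use std::net::SocketAddr;
use std::time::Instant;

/// What the state machine asks the caller to do next.
enum Output {
    /// A STUN message the caller should put on the wire.
    Transmit { to: SocketAddr, payload: Vec<u8> },
    /// Ask to be polled again at this time.
    Timeout(Instant),
    /// A candidate pair was nominated; the data path is now known.
    Connected { remote: SocketAddr },
}

/// Illustrative sans-IO ICE agent: pure state, no sockets.
struct IceAgent {/* internal state only */}

impl IceAgent {
    /// Feed in a STUN message received from the network.
    fn handle_input(&mut self, _from: SocketAddr, _stun: &[u8]) {}

    /// Advance timers.
    fn handle_timeout(&mut self, _now: Instant) {}

    /// Ask the state machine what to do next. Application packets never
    /// pass through here; the agent only influences *where* they are sent.
    fn poll_output(&mut self) -> Option<Output> {
        None
    }
}
```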
@thomaseizinger I think your hypothesis that this is a client issue is correct. Just happened for me again, and I happened to have
So there is definitely an issue somewhere in connection setup. I left Firezone running overnight, came in and ran a ping, and this time it never established a connection; I waited about 20s. The second time I started the ping it worked within 1s, so it must have taken quite a while for the connection to set up (maybe 30s?). Perhaps a way to test this in CI is to expose our timers as knobs we can adjust? Then we can advance some of the timings and try to replicate it. I think it may have to do with connections that get "cleaned up". Here are the client logs (with wire=trace) and the logs from the gateway.
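Exposing the timers as knobs, as suggested above, could be as simple as reading overrides from the environment so CI can shrink the timings and try to provoke the race. This is a hypothetical sketch, not how Firezone actually configures its timeouts; the env-var names are made up.

```rust
use std::time::Duration;

/// Hypothetical helper: read a timer override from the environment so
/// CI can shrink or stretch timings to provoke the bug.
fn timer(env_var: &str, default: Duration) -> Duration {
    std::env::var(env_var)
        .ok()
        .and_then(|v| v.parse::<u64>().ok())
        .map(Duration::from_secs)
        .unwrap_or(default)
}

fn main() {
    // e.g. ICE_TIMEOUT_SECS=2 WG_KEEPALIVE_SECS=1 cargo test ...
    let ice_timeout = timer("ICE_TIMEOUT_SECS", Duration::from_secs(10));
    let wg_keepalive = timer("WG_KEEPALIVE_SECS", Duration::from_secs(25));
    println!("ice timeout: {ice_timeout:?}, keepalive: {wg_keepalive:?}");
}
```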
Very interesting. There are a few things going on here.
Yeah, I stopped after it started working. So it seems the question is: why would the gateway's connection fail? I'm confident the portal wasn't being deployed when this occurred. Could this be a portal bug? Maybe we should be pulling portal logs?
I should have mentioned that the time between cancelling the failed ping and starting the one that succeeded was maybe 10s. At '32 I started. At around '57 I think I started again and it worked right away (did we hit the 20s timeout from the first ping and re-set up the connection in the meantime?). Around '02 I stopped the ping.
I think I found some issues in the |
See:
The above logs with … To further debug this specific issue, we need …
Previously, we would lose one message to the portal upon failing to send it. We now mitigate this in two ways:
1. We also check the error from `poll_ready` and don't even pop a message off our buffer.
2. If sending still fails, we re-queue it at the front of the buffer.
In certain scenarios, as discovered in the logs from #4058, this might have caused the loss of the "answer" message from a gateway to the client, resulting in a state mismatch where the gateway thinks the connection is established and the client times out waiting for the answer.
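A rough sketch of that mitigation, assuming a buffered sink with a `poll_ready`-style readiness check; the names are approximate and not the actual portal-client code.

```rust
use std::collections::VecDeque;

/// Hypothetical outbound buffer toward the portal.
struct Outbound<T> {
    buffer: VecDeque<T>,
}

impl<T> Outbound<T> {
    /// Try to flush one message to the underlying sink.
    fn flush_one<E>(
        &mut self,
        poll_ready: impl Fn() -> Result<(), E>,
        send: impl Fn(&T) -> Result<(), E>,
    ) -> Result<(), E> {
        // 1. Check readiness *before* popping, so a not-ready sink
        //    never costs us a message.
        poll_ready()?;

        let Some(msg) = self.buffer.pop_front() else {
            return Ok(());
        };

        // 2. If the send itself still fails, put the message back at the
        //    front of the buffer instead of dropping it, so e.g. a gateway's
        //    "answer" to the client is retried rather than lost.
        if let Err(e) = send(&msg) {
            self.buffer.push_front(msg);
            return Err(e);
        }

        Ok(())
    }
}
```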
Hit it again this morning. Adding another set of
Have been seeing this much less lately. Here's what I think might be an instance of this bug, or a similar one, that showed up as a flaky iperf3 test: https://github.com/firezone/firezone/actions/runs/8487328603/job/23255094840?pr=4399#step:8:36 Notice the 10s it took the client to be … @conectado and I discovered that a race condition is possible if you start curl before the TUN device has fully come up: curl will use a socket option to override the source IP of the outgoing packet based on the routing table at the time of invocation. Maybe iperf3 does something similar. Perhaps not related to this issue, but worth noting. We could devise a large test matrix to try to replicate this in CI, erroring out if the connection takes longer than a second or so to establish; establishing within a second should always be possible in a low-latency environment like CI.
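A concrete version of the CI check suggested above: wrap connection establishment in a short timeout and fail the test if it doesn't complete within about a second. This is a sketch assuming tokio and a hypothetical `establish_connection` future, not an existing test in the repo.

```rust
use std::time::Duration;
use tokio::time::timeout;

/// Hypothetical stand-in for whatever future resolves once the
/// client-to-gateway connection is fully established.
async fn establish_connection() { /* ... */ }

#[tokio::test]
async fn connection_establishes_quickly() {
    // In a low-latency CI environment, anything slower than ~1s is
    // treated as a regression of the 10s-delay bug.
    timeout(Duration::from_secs(1), establish_connection())
        .await
        .expect("connection took longer than 1s to establish");
}
```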
This seems to have been fixed, I haven't seen it in some time!
In my own testing, I noticed that a connection to a gateway sometimes takes ~10s. Subsequent connections from a restarted client are much quicker, which leads me to believe that it is somehow related to how we cache candidates across connections.