fix(snownet,relay): re-use channels to peers in cooldown period#6276
fix(snownet,relay): re-use channels to peers in cooldown period#6276
Conversation
|
The latest updates on your projects. Learn more about Vercel for Git ↗︎
|
|
| Report | Tue, August 13, 2024 at 16:47:40 UTC |
| Project | Firezone |
| Branch | fix/connlib/use-same-port-cooldown-channel |
| Testbed | github-actions |
Click to view all benchmark results
| Benchmark | Throughput | Throughput Results bits/s | (Δ%) | Throughput Lower Boundary bits/s | (%) |
|---|---|---|---|
| direct-tcp-client2server | ✅ (view plot) | 243,534,207.23 (+0.22%) | 237,660,744.52 (97.59%) |
| direct-tcp-server2client | ✅ (view plot) | 250,583,427.92 (+0.38%) | 242,043,801.68 (96.59%) |
| direct-udp-client2server | ✅ (view plot) | 275,695,376.80 (-4.16%) | 268,674,100.22 (97.45%) |
| direct-udp-server2client | ✅ (view plot) | 399,101,218.69 (+0.15%) | 385,524,238.88 (96.60%) |
| relayed-tcp-client2server | ✅ (view plot) | 249,879,916.75 (+1.32%) | 239,747,421.03 (95.95%) |
| relayed-tcp-server2client | ✅ (view plot) | 262,850,561.60 (+1.82%) | 248,175,083.12 (94.42%) |
| relayed-udp-client2server | ✅ (view plot) | 230,589,010.04 (+0.23%) | 219,130,216.44 (95.03%) |
| relayed-udp-server2client | ✅ (view plot) | 332,359,942.79 (-1.72%) | 317,811,222.45 (95.62%) |
Bencher - Continuous Benchmarking
View Public Perf Page
Docs | Repo | Chat | Help
When we receive an `AllocationMismatch` error response from the relay, it means that our local state is toast and needs to be invalidated. - If we attempted to allocate, the corrective action is to delete the active allocation. - If we attempted to refresh or bind a channel, the corrective action is to make a new allocation. In the case of a channel binding, we also re-schedule the target peer to be rebound to ensure upper layers don't need to retry that. For example, if this happens during a connection setup, we still want to eventually succeed in binding the channel to ensure STUN messages as part of ICE can pass over it without having to first run into an ICE timeout and retry the entire connection. In certain network configurations, we observed that the NAT between connlib and the relay may have fairly short session timers. Currently, allocations have a lifetime of 10 minutes and are refreshed every 5 minutes. If there is no other traffic from connlib during those 5 minutes, the NAT session might get cut and attempting to use the allocation to e.g. bind a channel doesn't work because the relay doesn't recognise the 3-tuple. We deem these situations quite rare. Instead of keeping the NAT session alive with additional traffic, we instead implement this corrective action here which transparently creates a new allocation using our new 3-tuple. Resolves: #6265.
|
To make sure I understand this, the 5 minutes cooldown was the problem? So the binding has expired but we could only use the same channel? |
Yes, the client would try to rebind the same channel, but since the Relay considered it cooling down, would reject. |
For the sake of correctness: The client would try to bind a new channel for a peer that still had a channel in cooldown phase. That is not allowed because it would make it ambigous, which channel to forward data to for incoming traffic from that peer. The fix is to re-activate the already existing channel. |
This is a test-failure detected in https://github.com/firezone/firezone/actions/runs/10426492110/job/28879531621. In the relay, we have fast-path lookup maps to for incoming traffic from peers. This improves throughput as any incoming packet only needs to look-up a single routing entry. Unfortunately, this creates duplication in how the data must be stored. In #6276, we correctly identified that channels must be re-bound on the relay when a client sends `CHANNEL_BIND` message whilst the channel is cooling down. What we failed to identify (and what as now caught by the tests) is that we also need to re-insert the entry into the fast-path lookup map to actually allow data from flowing through the channel.
For efficiency reasons, TURN's data channels don't have any authentication or integrity metadata. Instead, the operate using a short 2-byte channel number to identify the target peer of the data.
To avoid abuse, channel bindings are at most valid for 10 minutes before they need to be refreshed. In case they expire, there is a 5 minute cooldown period, before the same channel number can be bound to a different peer and before the same peer can be bound to a different channel.
We had a similar issue in the past (#5613) where channels got rebound early. Whilst that was fixed and is no longer happening, a case that we didn't consider is what happens if we want to bind a channel to a peer that still has a channel bound but is currently cooling down (i.e. in the 5 minute period after its expiry).
In that case,
snownetwould wrongly assume that there is no channel to this peer and try to bind a new one. That would get rejected by the relay with a bad request.To fix this, we simply need to check whether we still have a channel to this peer and if yes, return the same channel number. On the relay, we need to ensure that we consider a channel as
boundagain when it is being refreshed.We ensure that this doesn't regress in two ways:
ChannelBindingsstructIdletransition to idle for 6 instead of 5 minutes. This ensures that a combination of 2 idle transitions puts the channel bindings into the 10-15 minute time window where rebinding the peer to a different channel fails.Related: #6265.