Skip to content

fix(snownet,relay): re-use channels to peers in cooldown period#6276

Merged
jamilbk merged 4 commits intomainfrom
fix/connlib/use-same-port-cooldown-channel
Aug 13, 2024
Merged

fix(snownet,relay): re-use channels to peers in cooldown period#6276
jamilbk merged 4 commits intomainfrom
fix/connlib/use-same-port-cooldown-channel

Conversation

@thomaseizinger
Copy link
Member

@thomaseizinger thomaseizinger commented Aug 13, 2024

For efficiency reasons, TURN's data channels don't have any authentication or integrity metadata. Instead, the operate using a short 2-byte channel number to identify the target peer of the data.

To avoid abuse, channel bindings are at most valid for 10 minutes before they need to be refreshed. In case they expire, there is a 5 minute cooldown period, before the same channel number can be bound to a different peer and before the same peer can be bound to a different channel.

We had a similar issue in the past (#5613) where channels got rebound early. Whilst that was fixed and is no longer happening, a case that we didn't consider is what happens if we want to bind a channel to a peer that still has a channel bound but is currently cooling down (i.e. in the 5 minute period after its expiry).

In that case, snownet would wrongly assume that there is no channel to this peer and try to bind a new one. That would get rejected by the relay with a bad request.

To fix this, we simply need to check whether we still have a channel to this peer and if yes, return the same channel number. On the relay, we need to ensure that we consider a channel as bound again when it is being refreshed.

We ensure that this doesn't regress in two ways:

  • We add a unit-test for the ChannelBindings struct
  • We modify the Idle transition to idle for 6 instead of 5 minutes. This ensures that a combination of 2 idle transitions puts the channel bindings into the 10-15 minute time window where rebinding the peer to a different channel fails.

Related: #6265.

@vercel
Copy link

vercel bot commented Aug 13, 2024

The latest updates on your projects. Learn more about Vercel for Git ↗︎

Name Status Preview Comments Updated (UTC)
firezone ✅ Ready (Inspect) Visit Preview 💬 Add feedback Aug 13, 2024 4:38pm

@thomaseizinger thomaseizinger changed the title fix(connlib,relay): re-use channels to peers in cooldown period fix(snownet,relay): re-use channels to peers in cooldown period Aug 13, 2024
@thomaseizinger thomaseizinger marked this pull request as ready for review August 13, 2024 06:37
@github-actions
Copy link

github-actions bot commented Aug 13, 2024

🐰Bencher

ReportTue, August 13, 2024 at 16:47:40 UTC
ProjectFirezone
Branchfix/connlib/use-same-port-cooldown-channel
Testbedgithub-actions
Click to view all benchmark results
BenchmarkThroughputThroughput Results
bits/s | (Δ%)
Throughput Lower Boundary
bits/s | (%)
direct-tcp-client2server✅ (view plot)243,534,207.23 (+0.22%)237,660,744.52 (97.59%)
direct-tcp-server2client✅ (view plot)250,583,427.92 (+0.38%)242,043,801.68 (96.59%)
direct-udp-client2server✅ (view plot)275,695,376.80 (-4.16%)268,674,100.22 (97.45%)
direct-udp-server2client✅ (view plot)399,101,218.69 (+0.15%)385,524,238.88 (96.60%)
relayed-tcp-client2server✅ (view plot)249,879,916.75 (+1.32%)239,747,421.03 (95.95%)
relayed-tcp-server2client✅ (view plot)262,850,561.60 (+1.82%)248,175,083.12 (94.42%)
relayed-udp-client2server✅ (view plot)230,589,010.04 (+0.23%)219,130,216.44 (95.03%)
relayed-udp-server2client✅ (view plot)332,359,942.79 (-1.72%)317,811,222.45 (95.62%)

Bencher - Continuous Benchmarking
View Public Perf Page
Docs | Repo | Chat | Help

When we receive an `AllocationMismatch` error response from the relay,
it means that our local state is toast and needs to be invalidated.

- If we attempted to allocate, the corrective action is to delete the
active allocation.
- If we attempted to refresh or bind a channel, the corrective action is
to make a new allocation.

In the case of a channel binding, we also re-schedule the target peer to
be rebound to ensure upper layers don't need to retry that. For example,
if this happens during a connection setup, we still want to eventually
succeed in binding the channel to ensure STUN messages as part of ICE
can pass over it without having to first run into an ICE timeout and
retry the entire connection.

In certain network configurations, we observed that the NAT between
connlib and the relay may have fairly short session timers. Currently,
allocations have a lifetime of 10 minutes and are refreshed every 5
minutes. If there is no other traffic from connlib during those 5
minutes, the NAT session might get cut and attempting to use the
allocation to e.g. bind a channel doesn't work because the relay doesn't
recognise the 3-tuple.

We deem these situations quite rare. Instead of keeping the NAT session
alive with additional traffic, we instead implement this corrective
action here which transparently creates a new allocation using our new
3-tuple.

Resolves: #6265.
@jamilbk jamilbk enabled auto-merge August 13, 2024 16:37
auto-merge was automatically disabled August 13, 2024 16:57

Pull Request is not mergeable

@jamilbk jamilbk enabled auto-merge August 13, 2024 17:00
@jamilbk jamilbk added this pull request to the merge queue Aug 13, 2024
Merged via the queue into main with commit 6e86a4d Aug 13, 2024
@jamilbk jamilbk deleted the fix/connlib/use-same-port-cooldown-channel branch August 13, 2024 17:12
@conectado
Copy link
Contributor

To make sure I understand this, the 5 minutes cooldown was the problem? So the binding has expired but we could only use the same channel?

@jamilbk
Copy link
Member

jamilbk commented Aug 13, 2024

To make sure I understand this, the 5 minutes cooldown was the problem? So the binding has expired but we could only use the same channel?

Yes, the client would try to rebind the same channel, but since the Relay considered it cooling down, would reject.

@thomaseizinger
Copy link
Member Author

thomaseizinger commented Aug 13, 2024

To make sure I understand this, the 5 minutes cooldown was the problem? So the binding has expired but we could only use the same channel?

Yes, the client would try to rebind the same channel, but since the Relay considered it cooling down, would reject.

For the sake of correctness: The client would try to bind a new channel for a peer that still had a channel in cooldown phase. That is not allowed because it would make it ambigous, which channel to forward data to for incoming traffic from that peer.

The fix is to re-activate the already existing channel.

github-merge-queue bot pushed a commit that referenced this pull request Aug 16, 2024
This is a test-failure detected in
https://github.com/firezone/firezone/actions/runs/10426492110/job/28879531621.

In the relay, we have fast-path lookup maps to for incoming traffic from
peers. This improves throughput as any incoming packet only needs to
look-up a single routing entry. Unfortunately, this creates duplication
in how the data must be stored.

In #6276, we correctly identified that channels must be re-bound on the
relay when a client sends `CHANNEL_BIND` message whilst the channel is
cooling down. What we failed to identify (and what as now caught by the
tests) is that we also need to re-insert the entry into the fast-path
lookup map to actually allow data from flowing through the channel.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants