"Peer is already bound to another channel" #3979

jamilbk · 2024-03-05T19:26:01Z

Describe the bug

Gateway stops allowing connections

To Reproduce

Run a Gateway for a few days with heavy usage

Expected behavior

Connections resume once the backoffs subside

Screenshots / Logs

https://firezonehq.slack.com/archives/C0691K7382G/p1709611053234289

Platform (please complete the following information)

Component (i.e. macOS client / Linux client / Gateway / Admin portal): Gateway
Firezone Version (e.g. 1.0.0 or N/A): 1.0.0-pre.10
OS and version: (e.g. Ubuntu 22.04 or N/A): N/A
Deployment method: (e.g. Docker / Systemd / App Store or N/A): Docker

Additional context

thomaseizinger · 2024-03-06T01:09:08Z

> Connections resume once the backoffs subside

Once a backoff has completely failed, it won't restart on its own. It is reset upon the next successful interaction with the Allocation, which typically happens when a client wants to establish a connection. We always refresh all Allocations that we are about to use for the new connection.

I think this works as expected. We have to give up talking to the relay at some point if all we are getting is timeouts.
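
For illustration, the behaviour described in this comment can be sketched as follows. This is a minimal sketch under stated assumptions, not snownet's actual code: the `Backoff` type, its fields, and the doubling schedule are all hypothetical and exist only to show "gives up for good until something resets it".

```rust
use std::time::Duration;

/// Hypothetical backoff: doubles the delay each attempt and gives up
/// once the total time budget is spent. It never restarts by itself.
struct Backoff {
    initial: Duration,
    current: Duration,
    elapsed: Duration,
    max_elapsed: Duration,
}

impl Backoff {
    fn new(initial: Duration, max_elapsed: Duration) -> Self {
        Backoff {
            initial,
            current: initial,
            elapsed: Duration::ZERO,
            max_elapsed,
        }
    }

    /// Returns the next delay, or `None` once the budget is exhausted.
    /// After the first `None`, every later call also returns `None`
    /// until `reset` is called.
    fn next_delay(&mut self) -> Option<Duration> {
        if self.elapsed + self.current > self.max_elapsed {
            return None;
        }
        let delay = self.current;
        self.elapsed += delay;
        self.current *= 2;
        Some(delay)
    }

    /// Called by the owner after a successful interaction, for example
    /// when an allocation is refreshed for a new connection.
    fn reset(&mut self) {
        self.current = self.initial;
        self.elapsed = Duration::ZERO;
    }
}
```

In this sketch the caller is responsible for calling `reset`; nothing inside `Backoff` revives it after it has given up.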

jamilbk · 2024-03-06T01:17:05Z

Closing as can't reproduce

thomaseizinger · 2024-03-06T01:23:40Z

The more interesting logs are the following:

2024-03-05T03:47:12.619241Z  WARN decapsulate{from=34.102.56.34:3478 num_bytes=40}:handle_input{relay=34.102.56.34:3478 id=TransactionId(0x0DECA0E42F2DC59A20E0B138) method=channel bind class=error response rtt=222.323839ms}: snownet::allocation: STUN request failed error=Bad Request
2024-03-05T03:47:12.729503Z  WARN decapsulate{from=35.195.41.65:3478 num_bytes=40}:handle_input{relay=35.195.41.65:3478 id=TransactionId(0xE1BEFC40CCBCE67558E4E78D) method=channel bind class=error response rtt=221.969838ms}: snownet::allocation: STUN request failed error=Bad Request

It would be interesting to see the related logs on the relay for these two transaction IDs.

thomaseizinger · 2024-03-06T01:24:21Z

I can only see IPv4 traffic in these logs, so the timeouts are likely all for the IPv6 relays. I am improving the logs to make that more obvious.

jamilbk · 2024-03-06T01:33:20Z

@thomaseizinger Here are the matching logs -- looks like this particular case went to europe-west-1d:

https://console.cloud.google.com/logs/query;query=%2528resource.type%3D%22gce_instance%22%20AND%20resource.labels.instance_id%3D%227104780589755144343%22%2529%20OR%20%2528resource.type%3D%22global%22%20AND%20jsonPayload.instance.id%3D%227104780589755144343%22%2529%0A--Show%20similar%20entries%0AjsonPayload.message%3D~%22channel%20bind%20failed:%20Bad%20Request%22%0A--End%20of%20show%20similar%20entries;cursorTimestamp=2024-03-05T04:36:31.272864290Z;startTime=2024-03-05T03:40:39.000Z;endTime=2024-03-05T04:44:09.000Z?project=firezone-prod

jamilbk · 2024-03-06T01:37:25Z

"Channel is already bound to a different peer"

https://console.cloud.google.com/logs/query;cursorTimestamp=2024-03-05T03:47:12.660460740Z;endTime=2024-03-05T04:44:09.000Z;query=%2528resource.type%3D%22gce_instance%22%20AND%20resource.labels.instance_id%3D%227104780589755144343%22%2529%20OR%20%2528resource.type%3D%22global%22%20AND%20jsonPayload.instance.id%3D%227104780589755144343%22%2529%0Atimestamp%3D%222024-03-05T03:47:12.660400443Z%22%0AinsertId%3D%221d44m19fc6xajc%22;startTime=2024-03-05T03:40:39.000Z?project=firezone-prod

thomaseizinger · 2024-03-06T01:48:40Z

Interesting. I'll look into that. It appears that we are not advancing the channel numbers correctly somewhere.

jamilbk · 2024-03-06T20:37:53Z

@thomaseizinger Yeah it looks like it's happening a lot:

https://console.cloud.google.com/logs/query;query=%2528resource.type%3D%22gce_instance%22%20AND%20jsonPayload.message%3D~%22Channel%20is%20already%20bound%20to%20a%20different%20peer%22%2529;cursorTimestamp=2024-03-06T19:49:41.006205674Z;duration=P14D?project=firezone-prod

jamilbk · 2024-03-11T19:31:29Z

This is still an issue.

More recent logs:

https://console.cloud.google.com/logs/query;query=%2528resource.type%3D%22gce_instance%22%20AND%20resource.labels.instance_id%3D%227607456471087978648%22%2529%20OR%20%2528resource.type%3D%22global%22%20AND%20jsonPayload.instance.id%3D%227607456471087978648%22%2529;cursorTimestamp=2024-03-11T18:55:51.005111792Z;duration=PT1H?hl=en&project=firezone-prod

jamilbk · 2024-03-11T19:51:29Z

The result of this is that the gateway becomes unable to establish connections. The other error seen during this scenario is "Channel is already bound to a different peer".

The two cases are checked here and here.

It seems like for this to happen, either:

- There's a channel mapping bug in the lookup, or
- Channel allocations are re-using existing numbers?

Previously, the relay neither scheduled a `Wake` command nor did it register a `TimedAction` to expire a channel binding. Such an action was only scheduled after the first refresh. This PR fixes this and adds a test that asserts we can re-bind the same channel to a different peer after 15 minutes. Resolves: #3979.
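
To make the failure mode concrete, here is a minimal sketch of a channel table that records an expiry deadline at the moment a binding is created, not only on refresh. It is not the relay's actual code: `Channels`, `Binding`, `BindError`, and the lazy `expire` call are hypothetical, and the sketch swaps the `Wake` / `TimedAction` scheduling mentioned above for a simple sweep on each `bind`. The timings assume the TURN specification's 10-minute binding lifetime plus a 5-minute cooldown before a channel number may go to a different peer, which together give the 15 minutes mentioned above.

```rust
use std::collections::HashMap;
use std::net::SocketAddr;
use std::time::{Duration, Instant};

/// Assumed lifetime of a channel binding (10 minutes).
const BINDING_LIFETIME: Duration = Duration::from_secs(600);
/// Assumed cooldown before an expired channel may be bound to another peer.
const REBIND_COOLDOWN: Duration = Duration::from_secs(300);

#[derive(Debug, PartialEq)]
enum BindError {
    /// "Channel is already bound to a different peer"
    ChannelBoundToDifferentPeer,
    /// "Peer is already bound to another channel"
    PeerBoundToAnotherChannel,
}

struct Binding {
    peer: SocketAddr,
    expires_at: Instant,
}

#[derive(Default)]
struct Channels {
    by_number: HashMap<u16, Binding>,
}

impl Channels {
    /// Binds `number` to `peer`, or refreshes an existing identical binding.
    fn bind(&mut self, number: u16, peer: SocketAddr, now: Instant) -> Result<(), BindError> {
        // Drop stale bindings first. Without this step (or an equivalent
        // timer), old bindings block every later bind forever.
        self.expire(now);

        if let Some(existing) = self.by_number.get(&number) {
            if existing.peer != peer {
                return Err(BindError::ChannelBoundToDifferentPeer);
            }
        }
        if self.by_number.iter().any(|(n, b)| *n != number && b.peer == peer) {
            return Err(BindError::PeerBoundToAnotherChannel);
        }

        // The deadline is set on the very first bind, not only on refresh.
        let expires_at = now + BINDING_LIFETIME;
        self.by_number.insert(number, Binding { peer, expires_at });
        Ok(())
    }

    /// Removes bindings whose lifetime and cooldown have both passed.
    fn expire(&mut self, now: Instant) {
        self.by_number
            .retain(|_, b| now < b.expires_at + REBIND_COOLDOWN);
    }
}
```

If `expire` were never reached for bindings that had not been refreshed at least once, both error cases from this issue would persist indefinitely for a long-running gateway, which matches the symptoms reported here.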

jamilbk added the needs triage, area/connlib, area/gateway, and kind/bug labels Mar 5, 2024

jamilbk closed this as not planned Mar 6, 2024

jamilbk reopened this Mar 11, 2024

jamilbk changed the title ~~"Unable to queue allocate because we've exceeded our backoffs"~~ "Channel is already bound to a different peer" Mar 11, 2024

jamilbk changed the title ~~"Channel is already bound to a different peer"~~ "Peer is already bound to another channel" Mar 11, 2024

jamilbk added the business_value/critical label Mar 11, 2024

jamilbk mentioned this issue Mar 12, 2024

Investigate why eur-west1-d Relay is being offered to client in NE #4088

Closed

thomaseizinger mentioned this issue Mar 12, 2024

fix(relay): actually expire channels which allows re-binding them #4094

Merged

thomaseizinger closed this as completed in #4094 Mar 12, 2024
