"Peer is already bound to another channel" #3979

Closed
jamilbk opened this issue Mar 5, 2024 · 10 comments · Fixed by #4094
Labels
  • area/connlib: Firezone's core connectivity library
  • area/gateway: Issues involving the Firezone Gateway
  • business_value/critical: Required by 100% of our customer base
  • kind/bug: Something isn't working
  • needs triage: Issues opened by the public or need further labeling

Comments

@jamilbk
Member

jamilbk commented Mar 5, 2024

Describe the bug

Gateway stops allowing connections

To Reproduce

Run a Gateway for a few days with heavy usage

Expected behavior

Connections resume once the backoffs subside

Screenshots / Logs

https://firezonehq.slack.com/archives/C0691K7382G/p1709611053234289

Platform (please complete the following information)

  • Component (i.e. macOS client / Linux client / Gateway / Admin portal): Gateway
  • Firezone Version (e.g. 1.0.0 or N/A): 1.0.0-pre.10
  • OS and version: (e.g. Ubuntu 22.04 or N/A): N/A
  • Deployment method: (e.g. Docker / Systemd / App Store or N/A): Docker

Additional context

@jamilbk jamilbk added the needs triage, area/connlib, area/gateway, and kind/bug labels Mar 5, 2024
@thomaseizinger
Member

> Connections resume once the backoffs subside

Once a backoff completely fails, it is not restarted on its own. It is reset upon the next successful interaction with the Allocation, which typically happens when a client wants to establish a connection. We always refresh all Allocations that we are about to use for the new connection.

I think this works as expected. We have to give up talking to the relay at some point if all we are getting is timeouts.
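
To illustrate the behaviour described above, here is a minimal sketch of a backoff that gives up once its time budget is spent and only starts again after an explicit reset. The `Backoff` type, its fields, and the doubling schedule are assumptions made for this example; they are not taken from connlib.

```rust
use std::time::Duration;

/// Hypothetical retry state for one allocation (illustrative only, not connlib's type).
#[derive(Debug)]
struct Backoff {
    initial: Duration,
    next: Duration,
    elapsed: Duration,
    max_elapsed: Duration,
}

impl Backoff {
    fn new(initial: Duration, max_elapsed: Duration) -> Self {
        Self {
            initial,
            next: initial,
            elapsed: Duration::ZERO,
            max_elapsed,
        }
    }

    /// Returns the wait before the next retry, or `None` once the budget is exhausted.
    fn next_backoff(&mut self) -> Option<Duration> {
        if self.elapsed + self.next > self.max_elapsed {
            // Given up: nothing in here restarts the timer by itself.
            return None;
        }
        let wait = self.next;
        self.elapsed += wait;
        self.next *= 2; // Assumed schedule: double the wait after every attempt.
        Some(wait)
    }

    /// Called after a successful interaction, e.g. a refresh before a new connection.
    fn reset(&mut self) {
        self.next = self.initial;
        self.elapsed = Duration::ZERO;
    }
}
```

With an initial wait of 1s and a budget of 7s, this sketch yields waits of 1s, 2s, and 4s, then keeps returning `None` until `reset` is called.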

@jamilbk
Member Author

jamilbk commented Mar 6, 2024

Closing as can't reproduce

@jamilbk jamilbk closed this as not planned Mar 6, 2024
@thomaseizinger
Member

The more interesting logs are the following:

2024-03-05T03:47:12.619241Z  WARN decapsulate{from=34.102.56.34:3478 num_bytes=40}:handle_input{relay=34.102.56.34:3478 id=TransactionId(0x0DECA0E42F2DC59A20E0B138) method=channel bind class=error response rtt=222.323839ms}: snownet::allocation: STUN request failed error=Bad Request
2024-03-05T03:47:12.729503Z  WARN decapsulate{from=35.195.41.65:3478 num_bytes=40}:handle_input{relay=35.195.41.65:3478 id=TransactionId(0xE1BEFC40CCBCE67558E4E78D) method=channel bind class=error response rtt=221.969838ms}: snownet::allocation: STUN request failed error=Bad Request

It would be interesting to see the related logs on the relay for these two transaction IDs.

@thomaseizinger
Member

I can only see IPv4 traffic in these logs, so all the timeouts are likely for the IPv6 relays. I am improving the logs to make that more obvious.

@thomaseizinger
Member

Interesting. I'll look into that. It appears that we are not advancing the channel numbers correctly somewhere.

@jamilbk jamilbk reopened this Mar 11, 2024
@jamilbk jamilbk changed the title from "Unable to queue allocate because we've exceeded our backoffs" to "Channel is already bound to a different peer" Mar 11, 2024
@jamilbk jamilbk changed the title from "Channel is already bound to a different peer" to "Peer is already bound to another channel" Mar 11, 2024
@jamilbk jamilbk added the business_value/critical label Mar 11, 2024
@jamilbk
Member Author

jamilbk commented Mar 11, 2024

The result of this is that the gateway becomes unable to establish connections. The other error seen during this scenario is "Channel is already bound to a different peer".

The two cases are checked here and here.

It seems like for this to happen, either:

  • There's a channel mapping bug in the lookup, or
  • Channel allocations are re-using existing numbers?
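
For readers unfamiliar with the two errors, the sketch below shows how a TURN-style channel table can reject a bind request in either direction. The `ChannelBindings` type and `BindError` enum are hypothetical names for this example and are not the relay's actual code; only the two error conditions come from this issue.

```rust
use std::collections::HashMap;
use std::net::SocketAddr;

#[derive(Debug, PartialEq)]
enum BindError {
    /// "Peer is already bound to another channel"
    PeerBoundToOtherChannel,
    /// "Channel is already bound to a different peer"
    ChannelBoundToOtherPeer,
}

/// Hypothetical two-way lookup table for a single allocation.
#[derive(Default)]
struct ChannelBindings {
    by_channel: HashMap<u16, SocketAddr>,
    by_peer: HashMap<SocketAddr, u16>,
}

impl ChannelBindings {
    fn bind(&mut self, channel: u16, peer: SocketAddr) -> Result<(), BindError> {
        // A peer may be reachable through only one channel at a time.
        if let Some(existing) = self.by_peer.get(&peer) {
            if *existing != channel {
                return Err(BindError::PeerBoundToOtherChannel);
            }
        }
        // A channel may point at only one peer at a time.
        if let Some(existing) = self.by_channel.get(&channel) {
            if *existing != peer {
                return Err(BindError::ChannelBoundToOtherPeer);
            }
        }
        // Same pair again is a refresh; a new pair creates the binding.
        self.by_channel.insert(channel, peer);
        self.by_peer.insert(peer, channel);
        Ok(())
    }
}
```

In this model, either hypothesis from the list produces an error: a stale entry left in one of the two maps, or a client that reuses a channel number while the old binding still exists.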

github-merge-queue bot pushed a commit that referenced this issue Mar 12, 2024

Previously, the relay neither scheduled a `Wake` command nor did it
register a `TimedAction` to expire a channel binding. Such an action was
only scheduled after the first refresh.

This PR fixes this and adds a test that asserts we can re-bind the same
channel to a different peer after 15 minutes.

Resolves: #3979.
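
As a rough illustration of the fix described in this commit message, the sketch below registers an expiry timer when a binding is first created, not only on refresh. The `Relay` and `Binding` types, the `handle_timeout` method, and the plain `Vec` of timers are assumptions for this example; they stand in for the relay's real `TimedAction` and `Wake` machinery, which is not shown in this issue. The 15-minute constant is taken from the test described above.

```rust
use std::collections::HashMap;
use std::net::SocketAddr;
use std::time::{Duration, Instant};

/// Time after which a channel may be bound to a different peer (from the commit message).
const REBIND_AFTER: Duration = Duration::from_secs(15 * 60);

struct Binding {
    peer: SocketAddr,
    expires_at: Instant,
}

/// Hypothetical relay state: bindings plus pending expiry timers.
#[derive(Default)]
struct Relay {
    bindings: HashMap<u16, Binding>,
    timers: Vec<(Instant, u16)>,
}

impl Relay {
    /// Creates or refreshes a binding and always schedules its expiry.
    fn bind(&mut self, channel: u16, peer: SocketAddr, now: Instant) -> Result<(), &'static str> {
        if let Some(existing) = self.bindings.get(&channel) {
            if existing.peer != peer {
                return Err("Channel is already bound to a different peer");
            }
        }
        let expires_at = now + REBIND_AFTER;
        self.bindings.insert(channel, Binding { peer, expires_at });
        // Per the commit message, this scheduling step was missing on the initial bind.
        self.timers.push((expires_at, channel));
        Ok(())
    }

    /// Fires all timers that are due and drops the bindings they refer to.
    fn handle_timeout(&mut self, now: Instant) {
        let (due, pending): (Vec<_>, Vec<_>) = std::mem::take(&mut self.timers)
            .into_iter()
            .partition(|(at, _)| *at <= now);
        self.timers = pending;
        for (at, channel) in due {
            // Skip stale timers that a later refresh has superseded.
            let still_current = self
                .bindings
                .get(&channel)
                .map_or(false, |b| b.expires_at == at);
            if still_current {
                self.bindings.remove(&channel);
            }
        }
    }
}
```

If the `timers.push` line were skipped for new bindings, a binding that is never refreshed would stay in the table forever, and every later attempt to bind that channel to another peer would fail with the error from this issue.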