Direct peer endpoint can become inactive despite successful probes #124

@jstarstech

Summary

Direct peer endpoints can become inactive or fail to stay selected even when the peers can reach each other directly and a manual nylon probe reports ok.

I observed this with two router nodes on the same L2 LAN:

  • node-2: 192.168.99.10:27931
  • node-7: 192.168.99.254:27931
  • both nodes could ping each other on the LAN
  • both nodes listened on UDP 0.0.0.0:27931
  • nylon probe node-7 from node-2 returned ok
  • nylon probe node-2 from node-7 returned ok

But after some time, the direct endpoint became inactive and traffic to 10.100.200.7/32 was routed via another node
instead of using the direct LAN endpoint.

Observed behavior

On node-2, the direct node-7 endpoint could become inactive:

node-7 (...)
best metric  latest handshake  wireguard endpoint
INF          ...               192.168.99.254:27931

routes
10.100.200.7/32    node-1   node-7

After restarting both nodes, the direct endpoint initially worked:

node-7 (...)
best metric  wireguard endpoint
1000000      192.168.99.254:27931 [active,best]

routes
10.100.200.7/32    node-7   node-7

But after a few seconds the direct endpoint could degrade again, and traffic was routed through another node.

Expected behavior

If two router nodes can exchange probe packets directly, the direct endpoint should remain active and route selection should prefer the direct endpoint when its metric is better than the indirect path.

Suspected root cause

There appear to be two related issues in the probe path:

  1. Probe tokens are registered after SendNylon() completes. On very low latency LAN links, a pong can arrive before the token is inserted into PingBuf, so handleProbePong() does not account for it.
  2. handleProbePing() renews inbound endpoints, but when an inactive endpoint is revived or a new direct endpoint is created, route computation is not refreshed immediately after the endpoint becomes active.
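
To make the first point concrete, here is a minimal sketch of the suspected ordering problem. It is not nylon's actual code: pingBuf, sendFunc, probeRegisterAfterSend and instantPeer are hypothetical names that only imitate the roles of PingBuf, SendNylon() and handleProbePong(). The simulated peer answers before the send call returns, which is roughly what a very low latency LAN link can look like to the sender.

```go
package main

import "sync"

// pingBuf is a hypothetical stand-in for PingBuf: it tracks outstanding
// probe tokens so that a pong can be matched to the ping that caused it.
type pingBuf struct {
	mu      sync.Mutex
	pending map[uint64]struct{}
}

func newPingBuf() *pingBuf {
	return &pingBuf{pending: make(map[uint64]struct{})}
}

// add registers a token as awaiting a pong.
func (b *pingBuf) add(token uint64) {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.pending[token] = struct{}{}
}

// take removes the token and reports whether it was pending.
func (b *pingBuf) take(token uint64) bool {
	b.mu.Lock()
	defer b.mu.Unlock()
	_, ok := b.pending[token]
	delete(b.pending, token)
	return ok
}

// sendFunc is a hypothetical transport hook standing in for SendNylon().
type sendFunc func(token uint64) error

// probeRegisterAfterSend mirrors the suspected bug: the token is only
// registered once the send has completed.
func probeRegisterAfterSend(b *pingBuf, send sendFunc, token uint64) error {
	if err := send(token); err != nil {
		return err
	}
	b.add(token) // too late if the pong has already been handled
	return nil
}

// instantPeer simulates a peer whose pong is handled before send returns.
// It records in *matched whether the pong found a pending token.
func instantPeer(b *pingBuf, matched *bool) sendFunc {
	return func(token uint64) error {
		*matched = b.take(token)
		return nil
	}
}
```

With instantPeer as the transport, probeRegisterAfterSend returns nil (the probe looks ok), the pong is not matched, and the token is left in the buffer with no pong left to claim it.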

This can make manual probe output misleading: probe ok confirms the send path worked, but it does not necessarily mean the pong was processed and route state was refreshed.

Proposed fix

  • Register the probe token in PingBuf before sending the probe packet.
  • Delete the token if sending fails.
  • Build a clean pong response instead of mutating the received probe packet.
  • Recompute routes when an inbound probe revives an inactive direct endpoint or creates a new endpoint.
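
The four points above could look roughly like the following sketch. It is an illustration under assumptions, not the actual patch: node, endpoint, probePacket, sendProbe, handlePing, handlePong and the recomputes counter are hypothetical names, and the counter is only a placeholder for whatever route computation nylon really performs.

```go
package main

import (
	"errors"
	"sync"
)

// errSendFailed is a hypothetical transport error used for illustration.
var errSendFailed = errors.New("send failed")

// probePacket is a hypothetical, simplified probe message.
type probePacket struct {
	Token uint64
	Pong  bool
}

type endpoint struct {
	Active bool
}

// node holds just enough state to illustrate the proposed fix.
type node struct {
	mu         sync.Mutex
	pending    map[uint64]struct{} // stands in for PingBuf
	endpoints  map[string]*endpoint
	recomputes int // placeholder: counts route recomputations
}

func newNode() *node {
	return &node{
		pending:   make(map[uint64]struct{}),
		endpoints: make(map[string]*endpoint),
	}
}

// sendProbe registers the token before sending and deletes it again if
// the send fails, so an early pong always finds its token.
func (n *node) sendProbe(token uint64, send func(probePacket) error) error {
	n.mu.Lock()
	n.pending[token] = struct{}{}
	n.mu.Unlock()

	if err := send(probePacket{Token: token}); err != nil {
		n.mu.Lock()
		delete(n.pending, token)
		n.mu.Unlock()
		return err
	}
	return nil
}

// handlePing renews the inbound endpoint, triggers a route recomputation
// when the endpoint is new or was inactive, and returns a freshly built
// pong instead of mutating the received packet.
func (n *node) handlePing(addr string, ping probePacket) probePacket {
	n.mu.Lock()
	defer n.mu.Unlock()

	ep, ok := n.endpoints[addr]
	if !ok {
		ep = &endpoint{}
		n.endpoints[addr] = ep
	}
	if !ep.Active {
		ep.Active = true
		n.recomputes++ // the real code would recompute routes here
	}
	return probePacket{Token: ping.Token, Pong: true}
}

// handlePong reports whether the pong matched a pending token.
func (n *node) handlePong(pong probePacket) bool {
	n.mu.Lock()
	defer n.mu.Unlock()

	if _, ok := n.pending[pong.Token]; !ok {
		return false
	}
	delete(n.pending, pong.Token)
	return true
}
```

In this sketch the lock is released before send is called, so a pong that is handled while the send is still in progress can take the lock and find its token. Recomputing only on the inactive-to-active transition keeps repeated pings on an already active endpoint from triggering extra route computations.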

I have a patch for this and will open a PR. (I used AI to investigate and resolve the issue)
