Summary
Direct peer endpoints can become inactive or fail to stay selected even when the peers can reach each other directly
and manual probe reports ok.
I observed this with two router nodes on the same L2 LAN:
- node-2: 192.168.99.10:27931
- node-7: 192.168.99.254:27931
- both nodes could ping each other on the LAN
- both nodes listened on UDP 0.0.0.0:27931
- nylon probe node-7 from node-2 returned ok
- nylon probe node-2 from node-7 returned ok
But after some time, the direct endpoint became inactive and traffic to 10.100.200.7/32 was routed via another node
instead of using the direct LAN endpoint.
Observed behavior
On node-2, the direct node-7 endpoint could become inactive:
node-7 (...)
best metric latest handshake wireguard endpoint
INF ... 192.168.99.254:27931
routes
10.100.200.7/32 node-1 node-7
After restarting both nodes, the direct endpoint initially worked:
node-7 (...)
best metric wireguard endpoint
1000000 192.168.99.254:27931 [active,best]
routes
10.100.200.7/32 node-7 node-7
But after a few seconds it could degrade again and route through another node.
Expected behavior
If two router nodes can exchange probe packets directly, the direct endpoint should remain active and route selection should prefer the direct endpoint when its metric is better than the indirect path.
Suspected root cause
There appear to be two related issues in the probe path:
- Probe tokens are registered after SendNylon() completes. On very low latency LAN links, a pong can arrive before the token is inserted into PingBuf, so handleProbePong() does not account for it.
- handleProbePing() renews inbound endpoints, but when an inactive endpoint is revived or a new direct endpoint is created, route computation is not refreshed immediately after the endpoint becomes active.
This can make manual probe output misleading: probe ok confirms the send path worked, but it does not necessarily mean the pong was processed and route state was refreshed.
Proposed fix
- Register the probe token in PingBuf before sending the probe packet.
- Delete the token if sending fails.
- Build a clean pong response instead of mutating the received probe packet.
- Recompute routes when an inbound probe revives an inactive direct endpoint or creates a new endpoint.
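The last item can be sketched as follows. This is a simplified model, not the actual patch: the endpoint and peer types, renewInbound, onInboundProbe, and the recompute callback are all hypothetical names. The idea is that renewing an endpoint reports whether its state actually changed, and routes are recomputed only in that case.

```go
package main

import "fmt"

// endpoint and peer are hypothetical, simplified models of a direct
// endpoint and the neighbour that owns it.
type endpoint struct {
	addr   string
	active bool
}

type peer struct {
	endpoints map[string]*endpoint
}

// renewInbound marks the endpoint at addr as active, creating it if needed.
// It reports whether the set of active endpoints changed.
func (p *peer) renewInbound(addr string) bool {
	ep, ok := p.endpoints[addr]
	if !ok {
		p.endpoints[addr] = &endpoint{addr: addr, active: true}
		return true // new direct endpoint
	}
	if !ep.active {
		ep.active = true
		return true // inactive endpoint revived
	}
	return false // already active, nothing changed
}

// onInboundProbe is a hypothetical stand-in for the inbound probe handler.
// It triggers route recomputation only when the endpoint state changed.
func onInboundProbe(p *peer, addr string, recompute func()) {
	if p.renewInbound(addr) {
		recompute()
	}
}

func main() {
	const addr = "192.168.99.254:27931"
	p := &peer{endpoints: map[string]*endpoint{}}
	recomputes := 0
	recompute := func() { recomputes++ }

	onInboundProbe(p, addr, recompute) // creates the endpoint
	onInboundProbe(p, addr, recompute) // already active, no recompute
	p.endpoints[addr].active = false   // simulate the endpoint timing out
	onInboundProbe(p, addr, recompute) // revives the endpoint

	fmt.Println("route recomputations:", recomputes)
}
```

Gating the recompute on an actual state change keeps steady-state probes cheap while still refreshing routes as soon as a direct endpoint becomes usable.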
I have a patch for this and will open a PR. (I used AI to investigate and resolve the issue)