GET operation fails immediately when no peers available instead of retrying #1858

@sanity

Description

When a peer has no other peers available to forward a GET request to, the operation fails immediately with "reached max retries" instead of actually retrying. This prevents content retrieval in sparse network conditions.

Confirmed Bug

The GET operation code has a logic flaw: when k_closest_potentially_caching() returns an empty list (no available peers), it immediately returns failure WITHOUT:

  1. Actually incrementing the retry counter
  2. Implementing any delay mechanism
  3. Attempting the configured MAX_RETRIES (10)

Evidence from Logs

User's peer logs:

2025-09-26T02:48:53.726028Z  INFO freenet::operations::connect: Immediately requesting more peer connections from gateway

[3.5 minutes pass - user attempts to join River room]

2025-09-26T02:52:23.003037Z  INFO freenet::operations::get: Seek contract, tx: 01K61YZ7JT1PP4WGM8N4ZRDWG2, key: 9L1N9DyVwofcib7PpQqjsmdkcEWLBf8PUVGMLx9LDW1H, target: v6MWKgqHiBMNcGtG

2025-09-26T02:52:23.195452Z  WARN freenet::operations::get: Neither contract or contract value for contract found at peer v6MWKgqHiBMNcGtG, retrying with other peers

2025-09-26T02:52:23.195511Z ERROR freenet::operations::get: Failed getting a value for contract 9L1N9DyVwofcib7PpQqjsmdkcEWLBf8PUVGMLx9LDW1H, reached max retries

Note: The entire operation failed about 192ms after the "Seek contract" log line (02:52:23.003037 to 02:52:23.195511), far too quickly for 10 retries to have been attempted.
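
The timing can be checked directly from the log timestamps. Below is a small sketch of that arithmetic; the helper names `micros_since_midnight` and `elapsed_micros` are made up for illustration and are not part of the Freenet codebase. It assumes both timestamps fall on the same day and carry exactly six fractional digits, as in the logs above.

```rust
/// Parses a time of day like "02:52:23.003037" into microseconds since midnight.
/// Returns None if the input is not HH:MM:SS followed by six fractional digits.
fn micros_since_midnight(t: &str) -> Option<u64> {
    let (hms, frac) = t.split_once('.')?;
    let mut parts = hms.split(':');
    let h: u64 = parts.next()?.parse().ok()?;
    let m: u64 = parts.next()?.parse().ok()?;
    let s: u64 = parts.next()?.parse().ok()?;
    if parts.next().is_some() || frac.len() != 6 {
        return None;
    }
    let us: u64 = frac.parse().ok()?;
    Some(((h * 60 + m) * 60 + s) * 1_000_000 + us)
}

/// Microseconds between two same-day timestamps; None if `end` precedes `start`.
fn elapsed_micros(start: &str, end: &str) -> Option<u64> {
    micros_since_midnight(end)?.checked_sub(micros_since_midnight(start)?)
}
```

Applied to the "Seek contract" and "reached max retries" lines, this gives 192,474 microseconds. Even the first three delays of a 100ms/200ms/400ms backoff would take 700ms, so no backoff of that kind can have run.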

Gateway logs (same transaction):

2025-09-26T02:52:23.092836Z  WARN freenet::operations::get: No other peers found while trying to get the contract, tx: 01K61YZ7JT1PP4WGM8N4ZRDWG2, key: 9L1N9DyVwofcib7PpQqjsmdkcEWLBf8PUVGMLx9LDW1H, this_peer: v6MWKgqHiBMNcGtG
    at crates/core/src/operations/get.rs:1135

Root Cause

In get.rs lines 695-744, when new_candidates.is_empty():

  • If there's a requester peer: returns failure immediately
  • If original requester: logs "reached max retries" and fails
  • Never actually increments retry counter or delays
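
To make the control flow concrete, here is a simplified model of the behavior described above. This is not the code from get.rs: the function `handle_not_found`, the `Outcome` enum, and the use of `String` for peers are all hypothetical stand-ins. Only `MAX_RETRIES` (10) and the `new_candidates.is_empty()` condition come from this report.

```rust
// Value taken from this issue; everything else in this block is illustrative.
const MAX_RETRIES: usize = 10;

#[derive(Debug, PartialEq)]
enum Outcome {
    /// A candidate peer exists, so the request is forwarded to it.
    Forward(String),
    /// This peer is relaying for someone else and reports failure upstream.
    ReturnedFailureToRequester,
    /// This peer is the original requester and gives up.
    FailedMaxRetries { retries_done: usize, retries_allowed: usize },
}

/// Models the reported flaw: an empty candidate list ends the operation
/// without touching the retry counter and without any delay.
fn handle_not_found(new_candidates: Vec<String>, has_requester: bool, retries: usize) -> Outcome {
    match new_candidates.into_iter().next() {
        Some(peer) => Outcome::Forward(peer),
        None if has_requester => Outcome::ReturnedFailureToRequester,
        // The "reached max retries" message is logged here even though
        // `retries` may still be 0.
        None => Outcome::FailedMaxRetries {
            retries_done: retries,
            retries_allowed: MAX_RETRIES,
        },
    }
}
```

In this model, calling `handle_not_found(vec![], false, 0)` yields `FailedMaxRetries` with `retries_done: 0`, which matches the log pattern above: the "max retries" error appears although no retry was counted.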

Why the Gateway Had No Peers (Hypothesis)

Several possible explanations:

  1. Version incompatibility: The v0.1.27 release had just been published, possibly leaving the gateway isolated from peers running older versions
  2. Network bootstrap issue: The gateway may have lost its connections and not yet re-established them
  3. Actual sparse network: Simply very few peers online
  4. Connection failures: Network issues preventing peer connections

We cannot determine the exact cause from the logs alone.

Impact

This bug affects any scenario where peers temporarily have no connections:

  • Network bootstrap/startup
  • Version transitions (if versions are incompatible)
  • Network partitions
  • Connection losses
  • Small/test networks

Proposed Solution

The GET operation needs proper retry logic:

  1. Always increment retry counter, even when no peers available
  2. Implement exponential backoff delays (100ms, 200ms, 400ms, etc.)
  3. Actually attempt MAX_RETRIES times before failing
  4. Consider scheduling retries rather than blocking

This would allow time for:

  • Peer connections to establish
  • Network conditions to improve
  • Bootstrap processes to complete
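
A minimal sketch of what the proposed retry logic could look like, under stated assumptions. The names `backoff_delay`, `get_with_retries`, and `GetOutcome` are hypothetical, as is the 5 second cap on the delay (the proposal does not name one). The peer lookup and the sleep are passed in as closures so the logic can be exercised without a network. A real fix would more likely schedule the retry on the async runtime than block, per point 4 above.

```rust
use std::time::Duration;

// MAX_RETRIES (10) and the 100ms base come from this issue; MAX_DELAY is an assumption.
const MAX_RETRIES: usize = 10;
const BASE_DELAY: Duration = Duration::from_millis(100);
const MAX_DELAY: Duration = Duration::from_secs(5);

/// Delay after failed attempt number `attempt` (0-based):
/// 100ms, 200ms, 400ms, ... capped at MAX_DELAY.
fn backoff_delay(attempt: usize) -> Duration {
    // checked_shl returns None for shifts of 32 or more; treat that as "very large".
    let factor = 1u32.checked_shl(attempt as u32).unwrap_or(u32::MAX);
    BASE_DELAY.saturating_mul(factor).min(MAX_DELAY)
}

#[derive(Debug, PartialEq)]
enum GetOutcome {
    /// A peer became available; `attempts_used` lookups were needed.
    Forward { attempts_used: usize, peer: String },
    /// Every attempt found no peers, so the failure is now genuine.
    NoPeersAfterRetries { attempts_used: usize },
}

/// Retries the peer lookup up to MAX_RETRIES times, sleeping between attempts.
/// `find_candidates` stands in for the k_closest_potentially_caching() lookup.
fn get_with_retries<F, S>(mut find_candidates: F, mut sleep: S) -> GetOutcome
where
    F: FnMut() -> Vec<String>,
    S: FnMut(Duration),
{
    for attempt in 0..MAX_RETRIES {
        if let Some(peer) = find_candidates().into_iter().next() {
            return GetOutcome::Forward { attempts_used: attempt + 1, peer };
        }
        // No peers: this still counts as an attempt. Wait before the next
        // lookup, except after the final one.
        if attempt + 1 < MAX_RETRIES {
            sleep(backoff_delay(attempt));
        }
    }
    GetOutcome::NoPeersAfterRetries { attempts_used: MAX_RETRIES }
}
```

Passing `std::thread::sleep` as the `sleep` argument gives blocking behavior; a test can pass a closure that records the delays instead. With these constants, an operation that never finds a peer waits roughly 23 seconds in total before reporting failure, so the cap and base delay would need tuning against how long peer connections typically take to establish.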

Additional Issues

  1. Misleading error message: "reached max retries" when no retries were attempted
  2. No grace period: Fails immediately even though peer requested more connections
  3. Test gap: Tests use pre-populated networks, missing this edge case

Related Code References

[AI-assisted debugging and comment]
