[client] Admin readOnlyGateway never refreshes metadata on network errors, causing permanent RPC failure during server rolling upgrades #3389

@loserwang1024

Search before asking

  • I searched in the issues and found nothing similar.

Fluss version

0.9.0 (latest release)

Please describe the bug 🐞

During a rolling upgrade of Fluss servers, the Admin client suffers from two problems:

| Exception Type | Meaning | Refresh Strategy |
| --- | --- | --- |
| NetworkException / TimeoutException | Node unreachable (IP changed) | Refresh cluster-level metadata (server list) |
| NotLeaderOrFollowerException / LeaderNotAvailableException | Leader switched | Refresh table/partition-level metadata (leader assignment) |
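
The mapping above could be expressed as a small classifier. This is only a sketch: the exception classes below are local stand-ins named after the Fluss exceptions in the table (not imports of the real classes), and `RefreshScope` and `classify` are hypothetical names introduced for illustration.

```java
import java.util.concurrent.CompletionException;
import java.util.concurrent.ExecutionException;

final class RefreshStrategySketch {
    /** Hypothetical enum: which level of metadata should be refreshed. */
    enum RefreshScope { CLUSTER, TABLE_OR_PARTITION, NONE }

    // Local stand-ins for the exception types named in the table (assumption).
    static class NetworkException extends RuntimeException {
        NetworkException(String message) { super(message); }
    }
    static class TimeoutException extends RuntimeException {
        TimeoutException(String message) { super(message); }
    }
    static class NotLeaderOrFollowerException extends RuntimeException {
        NotLeaderOrFollowerException(String message) { super(message); }
    }
    static class LeaderNotAvailableException extends RuntimeException {
        LeaderNotAvailableException(String message) { super(message); }
    }

    /** Maps a failure to the metadata scope that should be refreshed. */
    static RefreshScope classify(Throwable error) {
        Throwable cause = error;
        // CompletableFuture often wraps the real failure; unwrap it first.
        while ((cause instanceof CompletionException || cause instanceof ExecutionException)
                && cause.getCause() != null) {
            cause = cause.getCause();
        }
        if (cause instanceof NetworkException || cause instanceof TimeoutException) {
            return RefreshScope.CLUSTER; // node unreachable: refresh the server list
        }
        if (cause instanceof NotLeaderOrFollowerException
                || cause instanceof LeaderNotAvailableException) {
            return RefreshScope.TABLE_OR_PARTITION; // leader moved: refresh leader assignment
        }
        return RefreshScope.NONE; // unrelated failure: no refresh
    }
}
```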

Problem 1: Stale metadata never refreshed (permanent failure)

Example: Admin#listPartitionInfos, getLatestLakeSnapshot, getLatestKvSnapshots, getTableInfo

readOnlyGateway is backed by metadataUpdater::getRandomTabletServer as its node supplier. When all tablet server IPs change (e.g., pods restarting with new IPs in Kubernetes), the cached Cluster still holds the old IPs.

The RPC fails with a network error, but nothing triggers updateMetadata() — the CompletableFuture simply completes exceptionally and the stale Cluster remains unchanged. Subsequent calls keep resolving the same stale nodes, making this a permanent failure that no amount of caller-side retries can fix.

Problem 2: Stale leader routing (transient failure)

Example: Admin#listOffsets

  // FlussAdmin.java:529
  metadataUpdater.updateTableOrPartitionMetadata(physicalTablePath.getTablePath(), null);
  // ... then prepareListOffsetsRequests() calls leaderFor() and sends to that leader

Each call starts by refreshing metadata, so it picks up the latest leader assignment. However, if a leader-follower switch occurs in the small window between the metadata refresh and the actual RPC send, the request is routed to the old leader (now a follower).

This is a transient failure — a caller-side retry will trigger a fresh metadata update at the beginning of the next call, resolve the new leader, and succeed.
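
Until the client handles this internally, a caller-side workaround might look like the sketch below. `callWithRetry` is a hypothetical helper, not part of the Fluss API. It retries on any `RuntimeException` for brevity, whereas a real caller would probably retry only on leader-related errors.

```java
import java.util.function.Supplier;

final class CallerRetrySketch {
    /**
     * Runs the call up to maxAttempts times with a fixed pause between attempts.
     * Each new attempt is expected to start with a fresh metadata update,
     * as the listOffsets path does today.
     */
    static <T> T callWithRetry(Supplier<T> call, int maxAttempts, long backoffMillis) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        RuntimeException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return call.get();
            } catch (RuntimeException e) {
                last = e; // remember the failure and try again
            }
            if (attempt < maxAttempts) {
                try {
                    Thread.sleep(backoffMillis);
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt(); // preserve the interrupt flag
                    break;
                }
            }
        }
        throw last; // every attempt failed, or the thread was interrupted
    }
}
```

The wrapped call would be a blocking invocation of an Admin method such as listOffsets. This only helps with Problem 2: for the stale server list described earlier, each retry resolves the same unreachable nodes.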

Solution

Therefore, we plan to address this in two steps:

  1. Urgently fix Problem 1 — refresh metadata upon network errors on the readOnlyGateway path, so the client can recover from stale server addresses.
  2. Problem 2 is less urgent — we can address it later by introducing a more robust metadata mechanism at the framework level.
  • Solve the IP change of the coordinator server.
  • Solve the leader change.
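
A minimal sketch of the first step, under stated assumptions: `callWithRefresh`, `nodeSupplier`, `rpc`, and `refreshMetadata` are hypothetical names, and `NetworkException` is a local stand-in for the Fluss class. In the real client, `nodeSupplier` would correspond to metadataUpdater::getRandomTabletServer and `refreshMetadata` to whatever triggers updateMetadata(). The actual fix may be structured differently.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;
import java.util.function.Function;
import java.util.function.Supplier;

final class RefreshOnNetworkErrorSketch {
    // Local stand-in for the Fluss NetworkException (assumption).
    static class NetworkException extends RuntimeException {
        NetworkException(String message) { super(message); }
    }

    /**
     * Sends one RPC to a node picked from cached metadata. If the RPC fails
     * with a network error, a metadata refresh is triggered so that the NEXT
     * call resolves fresh nodes. The current call still fails.
     */
    static <N, T> CompletableFuture<T> callWithRefresh(
            Supplier<N> nodeSupplier,
            Function<N, CompletableFuture<T>> rpc,
            Runnable refreshMetadata) {
        return rpc.apply(nodeSupplier.get())
                .whenComplete((result, error) -> {
                    if (isNetworkError(error)) {
                        refreshMetadata.run();
                    }
                });
    }

    private static boolean isNetworkError(Throwable error) {
        Throwable cause = error;
        // Dependent stages wrap failures in CompletionException; unwrap first.
        while (cause instanceof CompletionException && cause.getCause() != null) {
            cause = cause.getCause();
        }
        return cause instanceof NetworkException;
    }
}
```

With this shape the first call after an IP change still completes exceptionally, but the cached server list is replaced, so a caller-side retry can now succeed instead of hitting the same stale nodes forever.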

Are you willing to submit a PR?

  • I'm willing to submit a PR!
