[client] Admin readOnlyGateway never refreshes metadata on network errors, causing permanent RPC failure during server rolling upgrades #3389

### Search before asking

- [x] I searched in the [issues](https://github.com/apache/fluss/issues) and found nothing similar.


### Fluss version

0.9.0 (latest release)

### Please describe the bug 🐞

During a rolling upgrade of Fluss servers, the Admin client runs into two distinct problems. Each surfaces as a different exception type and calls for a different metadata refresh strategy:

| Exception Type | Meaning | Refresh Strategy |
|---|---|---|
| `NetworkException` / `TimeoutException` | Node unreachable (IP changed) | Refresh cluster-level metadata (server list) |
| `NotLeaderOrFollowerException` / `LeaderNotAvailableException` | Leader switched | Refresh table/partition-level metadata (leader assignment) |
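
The mapping in the table above could be sketched as a small classifier. This is only an illustration: the four exception classes below are local stand-ins named after the Fluss exceptions in the table, and `RefreshScope` and `scopeFor` are hypothetical names that do not exist in Fluss.

```java
import java.util.concurrent.CompletionException;

public class RefreshStrategySketch {
    // Local stand-ins so the sketch compiles on its own; the real types live in Fluss.
    public static class NetworkException extends RuntimeException {}
    public static class TimeoutException extends RuntimeException {}
    public static class NotLeaderOrFollowerException extends RuntimeException {}
    public static class LeaderNotAvailableException extends RuntimeException {}

    /** Hypothetical enum: which part of the cached metadata should be refreshed. */
    public enum RefreshScope { CLUSTER, TABLE_OR_PARTITION, NONE }

    /** Maps an RPC failure to the refresh strategy listed in the table. */
    public static RefreshScope scopeFor(Throwable error) {
        // CompletableFuture stages often wrap the real cause in a CompletionException.
        Throwable cause = error;
        while (cause instanceof CompletionException && cause.getCause() != null) {
            cause = cause.getCause();
        }
        if (cause instanceof NetworkException || cause instanceof TimeoutException) {
            return RefreshScope.CLUSTER; // node unreachable: refresh the server list
        }
        if (cause instanceof NotLeaderOrFollowerException
                || cause instanceof LeaderNotAvailableException) {
            return RefreshScope.TABLE_OR_PARTITION; // leader switched: refresh leader assignment
        }
        return RefreshScope.NONE; // unrelated failure: no metadata refresh
    }
}
```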

### Problem 1: Stale metadata never refreshed (permanent failure)

Examples: `Admin#listPartitionInfos`, `getLatestLakeSnapshot`, `getLatestKvSnapshots`, `getTableInfo`

`readOnlyGateway` uses `metadataUpdater::getRandomTabletServer` as its node supplier. When all tablet server IPs change (e.g., pods restarting with new IPs in Kubernetes), the cached `Cluster` still holds the old IPs.

The RPC fails with a network error, but nothing triggers `updateMetadata()`: the `CompletableFuture` simply completes exceptionally and the stale `Cluster` remains unchanged. **Subsequent calls keep resolving the same stale nodes, making this a permanent failure that no amount of caller-side retries can fix.**
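
One possible shape for the missing hook is a wrapper that triggers a refresh when the RPC future fails. The sketch below is a minimal illustration and not the actual Fluss code: `callWithRefresh`, `needsRefresh`, and `refreshMetadata` are hypothetical names, and the real fix would have to call into `metadataUpdater` instead of a plain `Runnable`.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;
import java.util.function.Predicate;
import java.util.function.Supplier;

public class RefreshOnFailureSketch {
    /**
     * Issues one RPC. If it fails with an error accepted by needsRefresh,
     * refreshMetadata runs so that the next call resolves fresh nodes.
     * The failure itself still propagates to the caller.
     */
    public static <T> CompletableFuture<T> callWithRefresh(
            Supplier<CompletableFuture<T>> rpc,       // hypothetical: sends the request
            Predicate<Throwable> needsRefresh,        // hypothetical: e.g. network errors
            Runnable refreshMetadata) {               // hypothetical: refreshes the cache
        return rpc.get().whenComplete((result, error) -> {
            if (error != null && needsRefresh.test(unwrap(error))) {
                refreshMetadata.run();
            }
        });
    }

    /** Strips CompletionException wrappers added by dependent future stages. */
    static Throwable unwrap(Throwable t) {
        Throwable cause = t;
        while (cause instanceof CompletionException && cause.getCause() != null) {
            cause = cause.getCause();
        }
        return cause;
    }
}
```

With a hook of this kind, the first call after an IP change still fails, but a caller-side retry can succeed because the cached server list is no longer stale.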

### Problem 2: Stale leader routing (transient failure)

Example: `Admin#listOffsets`

```java
  // FlussAdmin.java:529
  metadataUpdater.updateTableOrPartitionMetadata(physicalTablePath.getTablePath(), null);
  // ... then prepareListOffsetsRequests() calls leaderFor() and sends to that leader
```
Each call starts by refreshing metadata, so it picks up the latest leader assignment. However, if leadership changes in the short window between the metadata refresh and the RPC send, the request is routed to the old leader, which is now a follower.

**This is a transient failure: a caller-side retry triggers a fresh metadata update at the start of the next call, resolves the new leader, and succeeds.**
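
Until a framework-level fix exists, callers can work around this with a bounded retry. The helper below is a generic sketch; `retry` and its parameters are hypothetical and not part of the Fluss client API.

```java
import java.util.function.Predicate;
import java.util.function.Supplier;

public class CallerRetrySketch {
    /**
     * Runs call up to maxAttempts times, retrying only on failures that
     * retriable accepts. Each new attempt re-enters the wrapped call, which
     * (for a method like listOffsets) refreshes metadata first.
     */
    public static <T> T retry(
            Supplier<T> call,
            Predicate<RuntimeException> retriable,
            int maxAttempts,
            long backoffMillis) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        RuntimeException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return call.get();
            } catch (RuntimeException e) {
                if (!retriable.test(e)) {
                    throw e; // not a stale-leader style error: fail fast
                }
                last = e;
                if (attempt < maxAttempts) {
                    try {
                        Thread.sleep(backoffMillis);
                    } catch (InterruptedException ie) {
                        Thread.currentThread().interrupt(); // preserve interrupt status
                        throw last;
                    }
                }
            }
        }
        throw last; // all attempts failed with a retriable error
    }
}
```

A caller would wrap the blocking form of the Admin call in the `Supplier` and pass a predicate that matches the leader-related exceptions listed earlier.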



### Solution


We plan to address this in two steps:

1. Fix Problem 1 urgently: refresh metadata upon network errors on the `readOnlyGateway` path, so the client can recover from stale server addresses.
2. Problem 2 is less urgent: we can address it later by introducing a more robust metadata mechanism at the framework level.

- Handle the IP change of the coordinator server.
- Handle the leader change.



### Are you willing to submit a PR?

- [x] I'm willing to submit a PR!
