rs-dapi-client: per-error-class ban policy — don't ban on rate-limit or client-side decode errors

## Summary

`rs-dapi-client` bans an HPMN endpoint on **any retryable gRPC error**, with no per-`Code` differentiation. This conflates two unrelated concepts — *"is this request retryable?"* and *"is this node unhealthy?"* — and causes healthy nodes to be culled from the rotation under three common failure modes that are **not node faults**:

1. **Rate-limiting** (`Code::ResourceExhausted` + `x-envoy-ratelimited: true`) — the node is signalling "back off", not "I am sick". Banning it concentrates load on the remaining endpoints, which trip the same envoy limit faster — classic cascading failure.
2. **Client-side codec / protocol-version mismatches** (`Code::Unknown` with a server-emitted decoding error) — the *node* is healthy and *responded with a parseable error message about our request*. Banning it punishes every HPMN in turn for what is a single client-side bug.
3. **Server-internal transient errors** (`Code::Internal`) — not all internal errors mean the node is durably bad; some are transient and self-heal.

## Reproduction evidence

A funded e2e run against testnet on 2026-05-20 (`/tmp/marvin-e2e-G91LEw.txt`, ~16,500 lines) showed:

| Error class in logs | Count | Currently bans? | Should ban? |
|---|---:|:---:|:---:|
| `Code::Unknown` + `"could not decode data contracts query"` | **459** | ✅ yes | ❌ no — server-emitted error message from a healthy node about a client-side request encoding |
| `Code::ResourceExhausted` / `x-envoy-ratelimited: true` | 65 | ✅ yes | ❌ no — back off, don't ban |
| `Code::Unavailable` + transport / TLS / connect errors | small | ✅ yes | ✅ yes (genuine node-level fault) |
| `NoAvailableAddresses` (downstream symptom) | 653 | n/a | n/a |

**29 distinct HPMN IPs** (out of 31 evo seeds in `dash-network-seeds`) were banned within ~80 seconds, even though most were healthy and the dominant trigger (459/527 real bans) was the **same** client-side decoding mismatch on every node. The end state was `NoAvailableAddresses` for every subsequent call across 14 test threads, taking down ~36 e2e tests in one cluster.

## Root cause in the code

`packages/rs-dapi-client/src/dapi_client.rs:187-190`:

```rust
Err(error) => {
    if error.can_retry() {
        if let Some(address) = error.address.as_ref() {
            if applied_settings.ban_failed_address {
                if address_list.ban(address) {
                    tracing::warn!(?address, ?error, "ban address {address} due to error: {error}");
```

…and `packages/rs-dapi-client/src/transport/grpc.rs:101-119`:

```rust
impl CanRetry for dapi_grpc::tonic::Status {
    fn can_retry(&self) -> bool {
        matches!(
            self.code(),
            Ok | DataLoss | Cancelled | Unknown
                | DeadlineExceeded | ResourceExhausted
                | Aborted | Internal | Unavailable
        )
    }
}
```

The ban-or-not decision is driven entirely by `can_retry()`. `Unknown`, `ResourceExhausted`, and `Internal` all return `true`, so the very first error of any of those classes from a given node triggers a ban. There is no error-class → policy mapping; there is no exponential-backoff alternative to ban; there is no notion of "this error is the client's fault, not the node's."

## Proposed policy

Split `can_retry()` (semantic retryability) from a new policy classifier `node_health_from_error()`, with three buckets:

| Bucket | Codes | Behaviour |
|---|---|---|
| **Genuine node fault** | `Unavailable` (with transport / TLS / connect cause), `DeadlineExceeded` (after threshold), `Internal` only when accompanied by Tenderdash transport indicators | **Ban** with current cooldown — existing behavior, scoped down. |
| **Throttle signal** | `ResourceExhausted` (especially with `x-envoy-ratelimited: true`) | **Per-endpoint backoff** (short cooldown, NOT a ban). Optionally combine with a client-side concurrency/RPS budget to avoid re-tripping. |
| **Client-side or protocol fault** | `Unknown` with server-emitted human-readable message about a request, `InvalidArgument`, codec mismatches | **Do not ban**. Surface as a caller error. The node is healthy; banning won't help and silently masks the bug. |

Implementation sketch: introduce a `BanPolicy` trait next to `CanRetry`, with the default impl deriving the bucket from `Code` + an optional metadata sniff (e.g. `x-envoy-ratelimited`, `server: envoy`, presence of `drive-error-data-bin`). `dapi_client.rs:187` then calls `policy.classify(error)` instead of `error.can_retry()` for the ban decision, while `can_retry()` continues to control retry orchestration.

A test harness should cover: rate-limit burst across multiple endpoints (assert no bans, assert backoff observed), uniform client-side decoding errors across all endpoints (assert no bans, assert error propagated to caller), genuine transport faults (assert bans as today).

## Impact if fixed

- The 2026-05-20 e2e run would have surfaced the data-contracts query decoding mismatch as a **direct caller error**, instead of cascading into a fleet-wide ban and 36 environmental-looking test failures.
- Production SDK consumers under transient testnet/mainnet rate-limit pressure would back off instead of banning their entire fleet.
- TLS cert misconfigurations and connect failures (the *legitimate* ban triggers) would continue to work as today.

## Related

- Found during analysis of [#3549](https://github.com/dashpay/platform/pull/3549) e2e run; not blocking that PR (this is a pre-existing `rs-dapi-client` policy bug).
- Adjacent surface: [#3350](https://github.com/dashpay/platform/pull/3350) improves *error message readability* (Tenderdash CBOR `data` field) but does not address the ban-policy classification.

<sub>🤖 Co-authored by [Claudius the Magnificent](https://github.com/lklimek/claudius) AI Agent</sub>


Bucket	Codes	Behaviour
Genuine node fault	`Unavailable` (with transport / TLS / connect cause), `DeadlineExceeded` (after threshold), `Internal` only when accompanied by Tenderdash transport indicators	Ban with current cooldown — existing behavior, scoped down.
Throttle signal	`ResourceExhausted` (especially with `x-envoy-ratelimited: true`)	Per-endpoint backoff (short cooldown, NOT a ban). Optionally combine with a client-side concurrency/RPS budget to avoid re-tripping.
Client-side or protocol fault	`Unknown` with server-emitted human-readable message about a request, `InvalidArgument`, codec mismatches	Do not ban. Surface as a caller error. The node is healthy; banning won't help and silently masks the bug.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

rs-dapi-client: per-error-class ban policy — don't ban on rate-limit or client-side decode errors #3695

Summary

Reproduction evidence

Root cause in the code

Proposed policy

Impact if fixed

Related

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Error class in logs	Count	Currently bans?	Should ban?
`Code::Unknown` + `"could not decode data contracts query"`	459	✅ yes	❌ no — server-emitted error message from a healthy node about a client-side request encoding
`Code::ResourceExhausted` / `x-envoy-ratelimited: true`	65	✅ yes	❌ no — back off, don't ban
`Code::Unavailable` + transport / TLS / connect errors	small	✅ yes	✅ yes (genuine node-level fault)
`NoAvailableAddresses` (downstream symptom)	653	n/a	n/a

rs-dapi-client: per-error-class ban policy — don't ban on rate-limit or client-side decode errors #3695

Description

Summary

Reproduction evidence

Root cause in the code

Proposed policy

Impact if fixed

Related

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions