Skip to content

rs-dapi-client: per-error-class ban policy — don't ban on rate-limit or client-side decode errors #3695

@lklimek

Description

@lklimek

Summary

rs-dapi-client bans an HPMN endpoint on any retryable gRPC error, with no per-Code differentiation. This conflates two unrelated concepts — "is this request retryable?" and "is this node unhealthy?" — and causes healthy nodes to be culled from the rotation under three common failure modes that are not node faults:

  1. Rate-limiting (Code::ResourceExhausted + x-envoy-ratelimited: true) — the node is signalling "back off", not "I am sick". Banning it concentrates load on the remaining endpoints, which trip the same envoy limit faster — classic cascading failure.
  2. Client-side codec / protocol-version mismatches (Code::Unknown with a server-emitted decoding error) — the node is healthy and responded with a parseable error message about our request. Banning it punishes every HPMN in turn for what is a single client-side bug.
  3. Server-internal transient errors (Code::Internal) — not all internal errors mean the node is durably bad; some are transient and self-heal.

Reproduction evidence

A funded e2e run against testnet on 2026-05-20 (/tmp/marvin-e2e-G91LEw.txt, ~16,500 lines) showed:

Error class in logs Count Currently bans? Should ban?
Code::Unknown + "could not decode data contracts query" 459 ✅ yes ❌ no — server-emitted error message from a healthy node about a client-side request encoding
Code::ResourceExhausted / x-envoy-ratelimited: true 65 ✅ yes ❌ no — back off, don't ban
Code::Unavailable + transport / TLS / connect errors small ✅ yes ✅ yes (genuine node-level fault)
NoAvailableAddresses (downstream symptom) 653 n/a n/a

29 distinct HPMN IPs (out of 31 evo seeds in dash-network-seeds) were banned within ~80 seconds, even though most were healthy and the dominant trigger (459/527 real bans) was the same client-side decoding mismatch on every node. The end state was NoAvailableAddresses for every subsequent call across 14 test threads, taking down ~36 e2e tests in one cluster.

Root cause in the code

packages/rs-dapi-client/src/dapi_client.rs:187-190:

Err(error) => {
    if error.can_retry() {
        if let Some(address) = error.address.as_ref() {
            if applied_settings.ban_failed_address {
                if address_list.ban(address) {
                    tracing::warn!(?address, ?error, "ban address {address} due to error: {error}");

…and packages/rs-dapi-client/src/transport/grpc.rs:101-119:

impl CanRetry for dapi_grpc::tonic::Status {
    fn can_retry(&self) -> bool {
        matches!(
            self.code(),
            Ok | DataLoss | Cancelled | Unknown
                | DeadlineExceeded | ResourceExhausted
                | Aborted | Internal | Unavailable
        )
    }
}

The ban-or-not decision is driven entirely by can_retry(). Unknown, ResourceExhausted, and Internal all return true, so the very first error of any of those classes from a given node triggers a ban. There is no error-class → policy mapping; there is no exponential-backoff alternative to ban; there is no notion of "this error is the client's fault, not the node's."

Proposed policy

Split can_retry() (semantic retryability) from a new policy classifier node_health_from_error(), with three buckets:

Bucket Codes Behaviour
Genuine node fault Unavailable (with transport / TLS / connect cause), DeadlineExceeded (after threshold), Internal only when accompanied by Tenderdash transport indicators Ban with current cooldown — existing behavior, scoped down.
Throttle signal ResourceExhausted (especially with x-envoy-ratelimited: true) Per-endpoint backoff (short cooldown, NOT a ban). Optionally combine with a client-side concurrency/RPS budget to avoid re-tripping.
Client-side or protocol fault Unknown with server-emitted human-readable message about a request, InvalidArgument, codec mismatches Do not ban. Surface as a caller error. The node is healthy; banning won't help and silently masks the bug.

Implementation sketch: introduce a BanPolicy trait next to CanRetry, with the default impl deriving the bucket from Code + an optional metadata sniff (e.g. x-envoy-ratelimited, server: envoy, presence of drive-error-data-bin). dapi_client.rs:187 then calls policy.classify(error) instead of error.can_retry() for the ban decision, while can_retry() continues to control retry orchestration.

A test harness should cover: rate-limit burst across multiple endpoints (assert no bans, assert backoff observed), uniform client-side decoding errors across all endpoints (assert no bans, assert error propagated to caller), genuine transport faults (assert bans as today).

Impact if fixed

  • The 2026-05-20 e2e run would have surfaced the data-contracts query decoding mismatch as a direct caller error, instead of cascading into a fleet-wide ban and 36 environmental-looking test failures.
  • Production SDK consumers under transient testnet/mainnet rate-limit pressure would back off instead of banning their entire fleet.
  • TLS cert misconfigurations and connect failures (the legitimate ban triggers) would continue to work as today.

Related

  • Found during analysis of #3549 e2e run; not blocking that PR (this is a pre-existing rs-dapi-client policy bug).
  • Adjacent surface: #3350 improves error message readability (Tenderdash CBOR data field) but does not address the ban-policy classification.

🤖 Co-authored by Claudius the Magnificent AI Agent

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions