Summary
rs-dapi-client bans an HPMN endpoint on any retryable gRPC error, with no per-Code differentiation. This conflates two unrelated concepts — "is this request retryable?" and "is this node unhealthy?" — and causes healthy nodes to be culled from the rotation under three common failure modes that are not node faults:
- Rate-limiting (
Code::ResourceExhausted + x-envoy-ratelimited: true) — the node is signalling "back off", not "I am sick". Banning it concentrates load on the remaining endpoints, which trip the same envoy limit faster — classic cascading failure.
- Client-side codec / protocol-version mismatches (
Code::Unknown with a server-emitted decoding error) — the node is healthy and responded with a parseable error message about our request. Banning it punishes every HPMN in turn for what is a single client-side bug.
- Server-internal transient errors (
Code::Internal) — not all internal errors mean the node is durably bad; some are transient and self-heal.
Reproduction evidence
A funded e2e run against testnet on 2026-05-20 (/tmp/marvin-e2e-G91LEw.txt, ~16,500 lines) showed:
| Error class in logs |
Count |
Currently bans? |
Should ban? |
Code::Unknown + "could not decode data contracts query" |
459 |
✅ yes |
❌ no — server-emitted error message from a healthy node about a client-side request encoding |
Code::ResourceExhausted / x-envoy-ratelimited: true |
65 |
✅ yes |
❌ no — back off, don't ban |
Code::Unavailable + transport / TLS / connect errors |
small |
✅ yes |
✅ yes (genuine node-level fault) |
NoAvailableAddresses (downstream symptom) |
653 |
n/a |
n/a |
29 distinct HPMN IPs (out of 31 evo seeds in dash-network-seeds) were banned within ~80 seconds, even though most were healthy and the dominant trigger (459/527 real bans) was the same client-side decoding mismatch on every node. The end state was NoAvailableAddresses for every subsequent call across 14 test threads, taking down ~36 e2e tests in one cluster.
Root cause in the code
packages/rs-dapi-client/src/dapi_client.rs:187-190:
Err(error) => {
if error.can_retry() {
if let Some(address) = error.address.as_ref() {
if applied_settings.ban_failed_address {
if address_list.ban(address) {
tracing::warn!(?address, ?error, "ban address {address} due to error: {error}");
…and packages/rs-dapi-client/src/transport/grpc.rs:101-119:
impl CanRetry for dapi_grpc::tonic::Status {
fn can_retry(&self) -> bool {
matches!(
self.code(),
Ok | DataLoss | Cancelled | Unknown
| DeadlineExceeded | ResourceExhausted
| Aborted | Internal | Unavailable
)
}
}
The ban-or-not decision is driven entirely by can_retry(). Unknown, ResourceExhausted, and Internal all return true, so the very first error of any of those classes from a given node triggers a ban. There is no error-class → policy mapping; there is no exponential-backoff alternative to ban; there is no notion of "this error is the client's fault, not the node's."
Proposed policy
Split can_retry() (semantic retryability) from a new policy classifier node_health_from_error(), with three buckets:
| Bucket |
Codes |
Behaviour |
| Genuine node fault |
Unavailable (with transport / TLS / connect cause), DeadlineExceeded (after threshold), Internal only when accompanied by Tenderdash transport indicators |
Ban with current cooldown — existing behavior, scoped down. |
| Throttle signal |
ResourceExhausted (especially with x-envoy-ratelimited: true) |
Per-endpoint backoff (short cooldown, NOT a ban). Optionally combine with a client-side concurrency/RPS budget to avoid re-tripping. |
| Client-side or protocol fault |
Unknown with server-emitted human-readable message about a request, InvalidArgument, codec mismatches |
Do not ban. Surface as a caller error. The node is healthy; banning won't help and silently masks the bug. |
Implementation sketch: introduce a BanPolicy trait next to CanRetry, with the default impl deriving the bucket from Code + an optional metadata sniff (e.g. x-envoy-ratelimited, server: envoy, presence of drive-error-data-bin). dapi_client.rs:187 then calls policy.classify(error) instead of error.can_retry() for the ban decision, while can_retry() continues to control retry orchestration.
A test harness should cover: rate-limit burst across multiple endpoints (assert no bans, assert backoff observed), uniform client-side decoding errors across all endpoints (assert no bans, assert error propagated to caller), genuine transport faults (assert bans as today).
Impact if fixed
- The 2026-05-20 e2e run would have surfaced the data-contracts query decoding mismatch as a direct caller error, instead of cascading into a fleet-wide ban and 36 environmental-looking test failures.
- Production SDK consumers under transient testnet/mainnet rate-limit pressure would back off instead of banning their entire fleet.
- TLS cert misconfigurations and connect failures (the legitimate ban triggers) would continue to work as today.
Related
- Found during analysis of #3549 e2e run; not blocking that PR (this is a pre-existing
rs-dapi-client policy bug).
- Adjacent surface: #3350 improves error message readability (Tenderdash CBOR
data field) but does not address the ban-policy classification.
🤖 Co-authored by Claudius the Magnificent AI Agent
Summary
rs-dapi-clientbans an HPMN endpoint on any retryable gRPC error, with no per-Codedifferentiation. This conflates two unrelated concepts — "is this request retryable?" and "is this node unhealthy?" — and causes healthy nodes to be culled from the rotation under three common failure modes that are not node faults:Code::ResourceExhausted+x-envoy-ratelimited: true) — the node is signalling "back off", not "I am sick". Banning it concentrates load on the remaining endpoints, which trip the same envoy limit faster — classic cascading failure.Code::Unknownwith a server-emitted decoding error) — the node is healthy and responded with a parseable error message about our request. Banning it punishes every HPMN in turn for what is a single client-side bug.Code::Internal) — not all internal errors mean the node is durably bad; some are transient and self-heal.Reproduction evidence
A funded e2e run against testnet on 2026-05-20 (
/tmp/marvin-e2e-G91LEw.txt, ~16,500 lines) showed:Code::Unknown+"could not decode data contracts query"Code::ResourceExhausted/x-envoy-ratelimited: trueCode::Unavailable+ transport / TLS / connect errorsNoAvailableAddresses(downstream symptom)29 distinct HPMN IPs (out of 31 evo seeds in
dash-network-seeds) were banned within ~80 seconds, even though most were healthy and the dominant trigger (459/527 real bans) was the same client-side decoding mismatch on every node. The end state wasNoAvailableAddressesfor every subsequent call across 14 test threads, taking down ~36 e2e tests in one cluster.Root cause in the code
packages/rs-dapi-client/src/dapi_client.rs:187-190:…and
packages/rs-dapi-client/src/transport/grpc.rs:101-119:The ban-or-not decision is driven entirely by
can_retry().Unknown,ResourceExhausted, andInternalall returntrue, so the very first error of any of those classes from a given node triggers a ban. There is no error-class → policy mapping; there is no exponential-backoff alternative to ban; there is no notion of "this error is the client's fault, not the node's."Proposed policy
Split
can_retry()(semantic retryability) from a new policy classifiernode_health_from_error(), with three buckets:Unavailable(with transport / TLS / connect cause),DeadlineExceeded(after threshold),Internalonly when accompanied by Tenderdash transport indicatorsResourceExhausted(especially withx-envoy-ratelimited: true)Unknownwith server-emitted human-readable message about a request,InvalidArgument, codec mismatchesImplementation sketch: introduce a
BanPolicytrait next toCanRetry, with the default impl deriving the bucket fromCode+ an optional metadata sniff (e.g.x-envoy-ratelimited,server: envoy, presence ofdrive-error-data-bin).dapi_client.rs:187then callspolicy.classify(error)instead oferror.can_retry()for the ban decision, whilecan_retry()continues to control retry orchestration.A test harness should cover: rate-limit burst across multiple endpoints (assert no bans, assert backoff observed), uniform client-side decoding errors across all endpoints (assert no bans, assert error propagated to caller), genuine transport faults (assert bans as today).
Impact if fixed
Related
rs-dapi-clientpolicy bug).datafield) but does not address the ban-policy classification.🤖 Co-authored by Claudius the Magnificent AI Agent