
[Bug] Property repair gossip stalls indefinitely when shard.repair query times out — size-1 buffer drops every subsequent scheduled round #13852

@hanahmily

Symptom

The scheduled Property Repair Test workflow failed once on a clean main (SHA 2078f683), while the same SHA passed 6 runs immediately before and 8+ runs immediately after. The failure is clearly flaky, but the logs reveal a real defect that the flakiness exposes.

CI run: https://github.com/apache/skywalking-banyandb/actions/runs/25017558589/job/73269526641

The full_data integrated test (test/property_repair/full_data/integrated_test.go:158) timed out after 2 h waiting for any node to show both total_propagation_count and repair_success_count increase. Final metrics polled by the test:

node         propagation (0 → final)   repair_success (0 → final)
data-node-1  0                         0
data-node-2  0                         0
data-node-3  2                         0

Propagation moved on data-node-3, but no repair ever succeeded.

Evidence from container logs

The cron fires background repair gossip scheduled on data-node-1 every 3 minutes from 20:33:00 onwards.

Starting 20:39:00, data-node-3 begins logging at ERROR:

ready to added propagation into group shard perf-test-group(0) in a same round, but it's full

This is emitted at banyand/property/gossip/server.go:346 every 3 minutes for ~50 min (18 consecutive rejections). At 21:33:00 the wording flips to "… in a new round, but it's full" (line 335), meaning the scheduleInterval/2 TTL on the originator finally rotated while the per-group/shard buffer was still occupied.

Around the same 21:33:00 mark, data-node-1 and data-node-3 dump tens of:

query older properties failed: iterate document match iterator: context deadline exceeded
query older properties failed: iterate document match iterator: context canceled
failed to repair property perf-test-group/perf-test-property/property-XXXX/...

These originate from the s.search(ctx, iq, nil, 100) call at banyand/property/db/shard.go:370, inside shard.repair.

Root cause

The first repair round after the replicas update (1 → 2, with ~5,000 stale properties to reconcile) gets stuck inside the per-property older-version lookup against the inverted index in shard.repair. The per-property query exceeds its context deadline on this runner.

Compounding the slowness, the gossip handler maintains a buffered channel of size 1 per group/shard (banyand/property/gossip/server.go:318):

groupShard = &groupWithShardPropagation{
    channel:        make(chan *handlingRequest, 1),
    originalNodeID: request.Context.OriginNode,
    latestTime:     time.Now(),
}

and the same-/new-round paths drop new requests when the channel is full instead of cancelling-and-replacing or queueing:

select {
case groupShard.channel <- handlingRequestData:
    q.notifyNewRequest()
default:
    q.s.log.Error().Msgf("ready to added propagation into group shard %s(%d) in a same round, but it's full", request.Group, request.ShardId)
}

Net effect: a single slow search wedges the only buffer slot; every subsequent 3-minute cron tick is dropped with nothing more than an ERROR log line; the test's 2 h budget elapses with effectively one half-completed round.
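
The drop behavior can be reproduced in isolation. The following is a minimal sketch, not the BanyanDB code: `request`, `offer`, and `simulate` are hypothetical names, and the stuck repair is modeled simply as a consumer that never drains the channel. It only shows that a non-blocking send into a size-1 buffer accepts one request and rejects everything after it.

```go
package main

import "fmt"

// request is a simplified, hypothetical stand-in for *handlingRequest.
type request struct{ round int }

// offer performs the same kind of non-blocking send as the handler quoted
// above: it reports true if the request fit into the buffer and false if
// the buffer was full and the request was dropped.
func offer(ch chan request, r request) bool {
	select {
	case ch <- r:
		return true
	default:
		return false
	}
}

// simulate offers `rounds` scheduled rounds to a size-1 buffer that nobody
// drains (modeling a repair wedged in a slow search) and returns how many
// rounds were accepted and how many were dropped.
func simulate(rounds int) (accepted, dropped int) {
	ch := make(chan request, 1)
	for i := 1; i <= rounds; i++ {
		if offer(ch, request{round: i}) {
			accepted++
		} else {
			dropped++
		}
	}
	return accepted, dropped
}

func main() {
	// 10 ticks is an arbitrary number chosen for illustration.
	accepted, dropped := simulate(10)
	fmt.Printf("accepted=%d dropped=%d\n", accepted, dropped) // accepted=1 dropped=9
}
```

However many ticks are offered, only the first one lands; the rest fall through to the `default` branch, which in the real handler is the ERROR log line.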

Why this is flaky in CI

GitHub Actions runners have variable disk/index latency. On most runs the older-property lookup completes inside its deadline. On a slow runner, one wedged round amplifies into a full test stall.

Suggested fix directions

  1. Don't drop scheduled rounds. Make the per-group/shard queue either:
    • cancel and replace when a new round arrives (preempt the stale in-flight round), or
    • hold a small bounded backlog and explicitly fail fast if the in-flight round is older than scheduleInterval.
  2. Bound the older-property lookup. shard.repair runs s.search per property; its context inherits an aggressive deadline that the very first post-replicas-bump reconcile can't meet. Either give the initial reconcile a longer budget, batch the lookups, or make the deadline configurable.
  3. Surface the stall. When a round is rejected as "but it's full", emit a metric and consider failing the round rather than only logging. The current behavior makes the cluster look healthy in metrics while no progress is being made.

Environment

  • Branch / SHA: main @ 2078f683ae967f297d6d7ce23639c9d3f99f50d4
  • Workflow: .github/workflows/property-repair.yml (scheduled)
  • 3-node data-node-{1,2,3} cluster + liaison, prometheus
  • Test parameters: 5,000 properties written at replicas=1, then replicas updated to 2

Logs / artifacts

CI run: https://github.com/apache/skywalking-banyandb/actions/runs/25017558589/job/73269526641
Artifact: test-data- zip from the run (data-node-{1,2,3}.log, liaison.log, prometheus.log).

Metadata

Labels

bug, database
