Symptom
The scheduled Property Repair Test workflow failed once on a clean main (SHA 2078f683) while the same SHA passed 6 runs immediately before and 8+ runs immediately after — clearly flaky, but the logs reveal a real defect that the flakiness exposes.
CI run: https://github.com/apache/skywalking-banyandb/actions/runs/25017558589/job/73269526641
The full_data integrated test (test/property_repair/full_data/integrated_test.go:158) timed out after 2 h waiting for any node to show both total_propagation_count and repair_success_count increase. Final metrics polled by the test:
| node | propagation 0→ | repair_success 0→ |
| --- | --- | --- |
| data-node-1 | 0 | 0 |
| data-node-2 | 0 | 0 |
| data-node-3 | 2 | 0 |
Propagation moved on data-node-3, but no repair ever succeeded.
Evidence from container logs
The cron fires "background repair gossip scheduled" on data-node-1 every 3 minutes from 20:33:00 onwards.
Starting 20:39:00, data-node-3 begins logging at ERROR:
ready to added propagation into group shard perf-test-group(0) in a same round, but it's full
This is emitted at banyand/property/gossip/server.go:346, every 3 minutes for ~50 min (18 consecutive rejections). At 21:33:00 the wording flips to "… in a new round, but it's full" (line 335), meaning the scheduleInterval/2 TTL on the originator finally rotated, but the per-group/shard buffer is still occupied.
Around the same 21:33:00 mark, data-node-1 and data-node-3 dump tens of:
query older properties failed: iterate document match iterator: context deadline exceeded
query older properties failed: iterate document match iterator: context canceled
failed to repair property perf-test-group/perf-test-property/property-XXXX/...
These originate from banyand/property/db/shard.go:370 → s.search(ctx, iq, nil, 100) inside shard.repair.
Root cause
The first repair round after the replicas update (1 → 2, with ~5,000 stale properties to reconcile) gets stuck inside the per-property older-version lookup against the inverted index in shard.repair. The per-property query exceeds its context deadline on this runner.
Compounding the slowness, the gossip handler maintains a buffered channel of size 1 per group/shard (banyand/property/gossip/server.go:318):
groupShard = &groupWithShardPropagation{
	channel:        make(chan *handlingRequest, 1),
	originalNodeID: request.Context.OriginNode,
	latestTime:     time.Now(),
}
and the same-/new-round paths drop new requests when the channel is full instead of cancelling-and-replacing or queueing:
select {
case groupShard.channel <- handlingRequestData:
	q.notifyNewRequest()
default:
	q.s.log.Error().Msgf("ready to added propagation into group shard %s(%d) in a same round, but it's full", request.Group, request.ShardId)
}
Net effect: a single slow search wedges the only buffer slot; every subsequent 3-minute cron tick is silently dropped; the test's 2 h budget elapses with effectively one half-completed round.
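The drop pattern can be reproduced in isolation. The sketch below is a simplified model, not the BanyanDB handler: `offer` and `simulateTicks` are hypothetical names, and it assumes the worker dequeues the first request and then never returns (the wedged search). Under that assumption, 20 cron ticks (20:33:00 through 21:30:00) give one in-flight round, one buffered round, and 18 drops, which lines up with the first rejection at 20:39:00 and the 18 same-round rejections in the logs.

```go
package main

import "fmt"

// offer mimics the handler's non-blocking send: it enqueues req if the
// slot is free and reports false (request dropped) otherwise.
func offer(slot chan int, req int) bool {
	select {
	case slot <- req:
		return true
	default:
		return false
	}
}

// simulateTicks feeds `ticks` scheduled requests into a size-1 slot.
// Assumption: the worker picks up the first request and then hangs,
// so nothing ever drains the slot again.
func simulateTicks(ticks int) (accepted, dropped int) {
	slot := make(chan int, 1)
	workerBusy := false
	for i := 0; i < ticks; i++ {
		if !offer(slot, i) {
			dropped++
			continue
		}
		accepted++
		if !workerBusy {
			<-slot // worker takes the first round and wedges on it
			workerBusy = true
		}
	}
	return accepted, dropped
}

func main() {
	accepted, dropped := simulateTicks(20)
	fmt.Printf("accepted=%d dropped=%d\n", accepted, dropped)
}
```

If this model is right, the two accepted rounds would also account for the propagation count of 2 observed on data-node-3.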
Why this is flaky in CI
GitHub Actions runners have variable disk/index latency. On most runs the older-property lookup completes inside its deadline. On a slow runner, one wedged round amplifies into a full test stall.
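The runner dependence can be illustrated with a fixed deadline and a variable latency. This is a minimal sketch using only the standard `context` package; `lookup` and `lookupTimedOut` are hypothetical stand-ins for the older-property search, and the millisecond values are arbitrary, not the real budget used by shard.repair.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// lookup stands in for one older-property search. The latency argument
// simulates how long the index/disk takes on a given runner.
func lookup(ctx context.Context, latency time.Duration) error {
	timer := time.NewTimer(latency)
	defer timer.Stop()
	select {
	case <-timer.C:
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

// lookupTimedOut runs one lookup under the given budget and reports
// whether it failed with context.DeadlineExceeded.
func lookupTimedOut(budgetMs, latencyMs int) bool {
	budget := time.Duration(budgetMs) * time.Millisecond
	ctx, cancel := context.WithTimeout(context.Background(), budget)
	defer cancel()
	err := lookup(ctx, time.Duration(latencyMs)*time.Millisecond)
	return errors.Is(err, context.DeadlineExceeded)
}

func main() {
	// Same budget, different runner speed (values are illustrative).
	fmt.Println("fast runner timed out:", lookupTimedOut(200, 1))
	fmt.Println("slow runner timed out:", lookupTimedOut(200, 2000))
}
```

The budget is identical in both calls; only the simulated latency decides whether the lookup surfaces the "context deadline exceeded" error seen in the logs.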
Suggested fix directions
- Don't drop scheduled rounds. Make the per-group/shard queue either:
  - cancel-and-replace when a new round arrives (preempt the stale in-flight round), or
  - allow a small bounded queue and explicitly fail-fast if the in-flight round is older than scheduleInterval.
- Bound the older-property lookup. shard.repair runs s.search per property; its context inherits an aggressive deadline that the very first post-replicas-bump reconcile can't meet. Either give the initial reconcile a longer budget, batch the lookups, or make the deadline configurable.
- Surface the stall. When a round is rejected as "but it's full", emit a metric and consider failing the round rather than silently logging — the current behavior makes the cluster look healthy in metrics while no progress is being made.
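One possible shape for the cancel-and-replace direction is sketched below. It is not taken from the BanyanDB code base: `latestWins`, `submit`, `finish`, and the `preempted` counter are hypothetical names, and a real change would have to fit the existing groupWithShardPropagation bookkeeping. The idea is that a new round cancels the in-flight round's context instead of being dropped, and each preemption is counted so it could be exported as a metric.

```go
package main

import (
	"context"
	"fmt"
	"sync"
)

// latestWins is a hypothetical per-group/shard slot: at most one round is
// in flight, and a newer round preempts it instead of being dropped.
type latestWins struct {
	mu        sync.Mutex
	currentID int
	cancel    context.CancelFunc // nil when no round is in flight
	preempted int                // candidate for a metric, so stalls are visible
}

// submit cancels the in-flight round, if any, and returns the context
// that the new round should run under.
func (l *latestWins) submit(parent context.Context, id int) context.Context {
	ctx, cancel := context.WithCancel(parent)
	l.mu.Lock()
	defer l.mu.Unlock()
	if l.cancel != nil {
		l.cancel()
		l.preempted++
	}
	l.currentID, l.cancel = id, cancel
	return ctx
}

// finish releases the slot, but only if round id still owns it.
func (l *latestWins) finish(id int) {
	l.mu.Lock()
	defer l.mu.Unlock()
	if l.cancel != nil && l.currentID == id {
		l.cancel()
		l.cancel = nil
	}
}

// demo submits two rounds back to back and reports what happened to each.
func demo() (firstCancelled, secondActive bool, preempted int) {
	l := &latestWins{}
	ctx1 := l.submit(context.Background(), 1)
	ctx2 := l.submit(context.Background(), 2)
	firstCancelled = ctx1.Err() != nil
	l.finish(1) // stale finish from the preempted round: must be a no-op
	secondActive = ctx2.Err() == nil
	l.finish(2)
	return firstCancelled, secondActive, l.preempted
}

func main() {
	fmt.Println(demo())
}
```

A caveat with this design: if every round legitimately needs longer than the schedule interval, preempting on each tick would starve the reconcile. Pairing it with an age threshold, as in the bounded-queue option above, may be the safer variant.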
Environment
- Branch / SHA: main @ 2078f683ae967f297d6d7ce23639c9d3f99f50d4
- Workflow: .github/workflows/property-repair.yml (scheduled)
- 3-node data-node-{1,2,3} cluster + liaison, prometheus
- Test parameters: 5,000 properties written at replicas=1, then replicas updated to 2
Logs / artifacts
CI run: https://github.com/apache/skywalking-banyandb/actions/runs/25017558589/job/73269526641
Artifact: test-data- zip from the run (data-node-{1,2,3}.log, liaison.log, prometheus.log).