[Bug] Property repair gossip stalls indefinitely when shard.repair query times out — size-1 buffer drops every subsequent scheduled round

## Symptom

The scheduled `Property Repair Test` workflow failed once on a clean `main` (SHA `2078f683`), while the same SHA passed 6 runs immediately before and 8+ runs immediately after. The failure is flaky, but the logs reveal a real defect that the flakiness exposes.

CI run: https://github.com/apache/skywalking-banyandb/actions/runs/25017558589/job/73269526641

The `full_data` integrated test (`test/property_repair/full_data/integrated_test.go:158`) timed out after 2 h waiting for any node to show both `total_propagation_count` *and* `repair_success_count` increase. Final metrics polled by the test:

| node        | `total_propagation_count` (0 → final) | `repair_success_count` (0 → final) |
|-------------|---------------------------------------|------------------------------------|
| data-node-1 | 0                                     | 0                                  |
| data-node-2 | 0                                     | 0                                  |
| data-node-3 | 2                                     | **0**                              |

Propagation moved on data-node-3, but no repair ever succeeded.

## Evidence from container logs

The cron fires `background repair gossip scheduled` on data-node-1 every 3 minutes from 20:33:00 onwards.

Starting **20:39:00**, data-node-3 begins logging at `ERROR`:

```
ready to added propagation into group shard perf-test-group(0) in a same round, but it's full
```

This message is emitted at `banyand/property/gossip/server.go:346`, every 3 minutes for ~50 min (18 consecutive rejections). At **21:33:00** the wording flips to `… in a new round, but it's full` (line 335), meaning the `scheduleInterval/2` TTL on the originator finally rotated, but the per-`group/shard` buffer is still occupied.

Around the same 21:33:00 mark, data-node-1 and data-node-3 each log dozens of entries like:

```
query older properties failed: iterate document match iterator: context deadline exceeded
query older properties failed: iterate document match iterator: context canceled
failed to repair property perf-test-group/perf-test-property/property-XXXX/...
```

These originate from `banyand/property/db/shard.go:370` → `s.search(ctx, iq, nil, 100)` inside `shard.repair`.

## Root cause

The first repair round after the replicas update (1 → 2, with ~5,000 stale properties to reconcile) gets stuck inside the per-property older-version lookup against the inverted index in `shard.repair`. The per-property query exceeds its context deadline on this runner.

Compounding the slowness, the gossip handler maintains a buffered channel of size **1** per `group/shard` (`banyand/property/gossip/server.go:318`):

```go
groupShard = &groupWithShardPropagation{
    channel:        make(chan *handlingRequest, 1),
    originalNodeID: request.Context.OriginNode,
    latestTime:     time.Now(),
}
```

and the same-/new-round paths drop new requests when the channel is full instead of cancelling-and-replacing or queueing:

```go
select {
case groupShard.channel <- handlingRequestData:
    q.notifyNewRequest()
default:
    q.s.log.Error().Msgf("ready to added propagation into group shard %s(%d) in a same round, but it's full", request.Group, request.ShardId)
}
```

Net effect: a single slow `search` wedges the only buffer slot; every subsequent 3-minute cron tick is dropped with nothing more than an `ERROR` log line; the test's 2 h budget elapses with effectively one half-completed round.

## Why this is flaky in CI

GitHub Actions runners have variable disk/index latency. On most runs the older-property lookup completes inside its deadline. On a slow runner, one wedged round amplifies into a full test stall.

## Suggested fix directions

1. **Don't drop scheduled rounds.** Make the per-`group/shard` queue either:
   - cancel-and-replace when a new round arrives (preempt the stale in-flight round), or
   - allow a small bounded queue and explicitly fail-fast if the in-flight round is older than `scheduleInterval`.
2. **Bound the older-property lookup.** `shard.repair` runs `s.search` per property; its context inherits an aggressive deadline that the very first post-replicas-bump reconcile can't meet. Either give the initial reconcile a longer budget, batch the lookups, or make the deadline configurable.
3. **Surface the stall.** When a round is rejected as "but it's full", emit a metric and consider failing the round rather than only logging. The current behavior makes the cluster look healthy in metrics while no progress is being made.

## Environment

- Branch / SHA: `main` @ `2078f683ae967f297d6d7ce23639c9d3f99f50d4`
- Workflow: `.github/workflows/property-repair.yml` (scheduled)
- 3-node `data-node-{1,2,3}` cluster + `liaison`, `prometheus`
- Test parameters: 5,000 properties written at replicas=1, then replicas updated to 2

## Logs / artifacts

CI run: https://github.com/apache/skywalking-banyandb/actions/runs/25017558589/job/73269526641
Artifact: `test-data-` zip from the run (data-node-{1,2,3}.log, liaison.log, prometheus.log).
