
[Bug] BanyanDB 0.10.1 trace merge: "offset must be equal to bytesRead" panic in part_iter, crashes the process #13861

@Felix-wave

Description

Apache SkyWalking Component

BanyanDB

What happened

After upgrading from apache/skywalking-banyandb:0.9.0 to 0.10.1 (with OAP 10.4.0), the BanyanDB process crashes every ~7-8 minutes with:

panic: offset 1400877 must be equal to bytesRead 1400490

Unlike the timestamp-ordering panic in #13860 (which is recovered by grpc-middleware), this one fires from a background mergeLoop goroutine that is not wrapped by recovery, so the process exits and the pod restarts.
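To illustrate the failure mode, here is a minimal, hypothetical sketch (none of these names are from the BanyanDB codebase) of how a mergeLoop-style background goroutine could be shielded so a panic is logged rather than terminating the process:

```go
package main

import (
	"fmt"
	"log"
)

// runWithRecover is a hypothetical helper (not actual BanyanDB code) showing
// how a background loop could convert a panic into a logged error instead of
// letting it crash the whole process.
func runWithRecover(name string, fn func()) {
	defer func() {
		if r := recover(); r != nil {
			log.Printf("background goroutine %q panicked: %v", name, r)
		}
	}()
	fn()
}

func main() {
	done := make(chan struct{})
	go func() {
		defer close(done)
		runWithRecover("mergeLoop", func() {
			// Simulates the panic observed in part_iter.go.
			panic("offset 1400877 must be equal to bytesRead 1400490")
		})
	}()
	<-done
	fmt.Println("process still alive after merge panic")
}
```

With the current code there is no such wrapper on the merge path, so the panic propagates to the runtime and the pod restarts.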

Full stack

goroutine 3900 [running]:
github.com/apache/skywalking-banyandb/pkg/logger.Panicf(...)
github.com/apache/skywalking-banyandb/banyand/trace.(*partMergeIter).mustReadRaw(0xc001ac4000, 0xc002d716b8, 0xc001ac4118)
    /mnt/d/skywalking-banyandb/banyand/trace/part_iter.go:359 +0xf5
github.com/apache/skywalking-banyandb/banyand/trace.(*blockReader).mustReadRaw(...)
    /mnt/d/skywalking-banyandb/banyand/trace/block_reader.go:263
github.com/apache/skywalking-banyandb/banyand/trace.mergeBlocks(...)
    /mnt/d/skywalking-banyandb/banyand/trace/merger.go:421 +0x79e
github.com/apache/skywalking-banyandb/banyand/trace.(*tsTable).mergeParts(...)
    /mnt/d/skywalking-banyandb/banyand/trace/merger.go:344 +0x42a
github.com/apache/skywalking-banyandb/banyand/trace.(*tsTable).mergePartsThenSendIntroduction(...)
    /mnt/d/skywalking-banyandb/banyand/trace/merger.go:118 +0x145
github.com/apache/skywalking-banyandb/banyand/trace.(*tsTable).mergeSnapshot(...)
    /mnt/d/skywalking-banyandb/banyand/trace/merger.go:104 +0x125
github.com/apache/skywalking-banyandb/banyand/trace.(*tsTable).mergeLoop.func1(...)
    /mnt/d/skywalking-banyandb/banyand/trace/merger.go:78 +0x1f9
github.com/apache/skywalking-banyandb/banyand/trace.(*tsTable).mergeLoop(...)
    /mnt/d/skywalking-banyandb/banyand/trace/merger.go:90 +0x271
created by github.com/apache/skywalking-banyandb/banyand/trace.(*tsTable).startLoop in goroutine 157
    /mnt/d/skywalking-banyandb/banyand/trace/tstable.go:130 +0x246

Source location (apache/skywalking-banyandb v0.10.1)

banyand/trace/part_iter.go:354-365:

func (pmi *partMergeIter) mustReadRaw(r *rawBlock, bm *blockMetadata) {
    r.bm = bm
    // spans
    if bm.spans != nil && bm.spans.size > 0 {
        // Validate the reader is aligned to the expected offset
        if bm.spans.offset != pmi.seqReaders.spans.bytesRead {
            logger.Panicf("offset %d must be equal to bytesRead %d", bm.spans.offset, pmi.seqReaders.spans.bytesRead)
        }
        ...
    }
    ...
}

So the merger reads spans sequentially from seqReaders.spans, and each block's bm.spans.offset is expected to match how far the seqReader has advanced (bytesRead). When they diverge — by 387 bytes in our sample — the merger panics. The same pattern (offset must be equal to bytesRead) appears at:

  • banyand/trace/block.go:196 (tag metadata)
  • banyand/trace/block.go:329 (span data)
  • banyand/internal/sidx/block.go
  • banyand/measure/block.go
  • banyand/stream/block.go

So the invariant is repeated across the new (0.10) trace storage engine.
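The invariant can be boiled down to the following simplified model (the type and function names here are illustrative, not the actual BanyanDB types; only the check itself mirrors part_iter.go, and it returns an error rather than panicking):

```go
package main

import "fmt"

// seqReader models, in simplified form, a sequential reader that tracks how
// many bytes it has consumed so far.
type seqReader struct {
	bytesRead uint64
}

// checkSpanOffset mirrors the invariant from part_iter.go: the offset
// declared in the block metadata must equal the reader's current position.
func checkSpanOffset(offset, bytesRead uint64) error {
	if offset != bytesRead {
		return fmt.Errorf("offset %d must be equal to bytesRead %d", offset, bytesRead)
	}
	return nil
}

func main() {
	r := seqReader{bytesRead: 1400490}
	// The divergence observed in our panic: 1400877 - 1400490 = 387 bytes.
	if err := checkSpanOffset(1400877, r.bytesRead); err != nil {
		fmt.Println("invariant violated:", err)
	}
}
```

Any code path that writes block metadata and span data through different bookkeeping can break this invariant, which is presumably why the same check is duplicated across the trace, sidx, measure, and stream engines.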

Cadence and impact

In our cluster, the BanyanDB pod restarted 126 times in 17 hours — roughly once every 8 minutes. Each time, OAP loses its connection to BanyanDB and crash-loops as well (~148 OAP restarts in the same window). Net effect: rolling unavailability — every ~8 minutes there is a 1-2 minute window in which ingestion and queries fail.

For comparison, on 0.9.0 the only panic we saw fired ~once every 28 minutes. 0.10.1 is significantly less stable on our workload, primarily because of this new panic in the merger.

What you expected to happen

The merger should not panic on what is clearly corrupted or out-of-sync block metadata. Reasonable options (maintainers know best):

  1. Skip the offending block with a warning instead of Panicf — at minimum, contain the blast radius to one block instead of restarting the whole DB.
  2. Reposition the seqReader to the offset declared in bm.spans.offset (or vice versa) when divergence is detected — assumes the metadata is the source of truth.
  3. Fail the merge of the affected part but keep the process running and let retention/cleanup eventually drop the corrupted part.
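Option 1 could look roughly like the following sketch. This is hypothetical code, not the actual merger: blockMeta is a stand-in for blockMetadata with only the fields needed here, and the resync-to-metadata choice is one possible policy, not a claim about what the fix should be:

```go
package main

import (
	"fmt"
	"log"
)

// blockMeta is a simplified stand-in for BanyanDB's blockMetadata.
type blockMeta struct {
	id     int
	offset uint64
	size   uint64
}

// mergeBlocks sketches option 1: on divergence, warn, skip the block, and
// resynchronize the reader position to the metadata's view so that later,
// well-aligned blocks can still be merged.
func mergeBlocks(blocks []blockMeta) (merged, skipped int) {
	var bytesRead uint64
	for _, bm := range blocks {
		if bm.offset != bytesRead {
			log.Printf("skipping block %d: offset %d != bytesRead %d", bm.id, bm.offset, bytesRead)
			bytesRead = bm.offset + bm.size // trust metadata, resync
			skipped++
			continue
		}
		bytesRead += bm.size
		merged++
	}
	return merged, skipped
}

func main() {
	blocks := []blockMeta{
		{id: 0, offset: 0, size: 100},
		{id: 1, offset: 487, size: 50}, // diverged by 387 bytes, like our sample
		{id: 2, offset: 537, size: 10},
	}
	m, s := mergeBlocks(blocks)
	fmt.Printf("merged=%d skipped=%d\n", m, s)
}
```

The trade-off is silent data loss for the skipped block versus a full process crash; options 2 and 3 make different choices on the same axis.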

How to reproduce

Steady-state SkyWalking deployment, OAP forwarding traces to standalone BanyanDB. We see this on:

  • BanyanDB: apache/skywalking-banyandb:0.10.1
  • SkyWalking OAP: apache/skywalking-oap-server:10.4.0
  • ~30+ Java services, apache-skywalking-java-agent 9.5.0, JDK 21
  • Standalone BanyanDB on Kubernetes (Aliyun ACK), --trace-root-path=/data/trace
  • 51 GB cumulative on disk after 17h of ingest (stream 38.5G + trace 12.9G + measure 26M)

The very first occurrence happens within ~30 minutes of starting fresh (after fully wiping /data and letting OAP recreate schemas). After that, panic cadence stabilizes at ~8 minutes.

Anything else

This bug is in the new trace storage engine introduced by #713 in 0.10.0; we did not see this panic on 0.9.0 (which uses the older trace path).

We have already reported the related — but distinct — timestamp-ordering panic in the write path as #13860 (recoverable, not crashing the process). Filing this one separately because the failure mode (background merge goroutine, no recovery, full process exit) is different and arguably more disruptive.

Happy to gather more samples (full stack traces over time, sample part dumps if a tool exists, sysrq dumps, anything) on request.

Are you willing to submit a pull request to fix on your own

  • Yes, I am willing to submit a pull request on my own!

Labels

bug (Something isn't working and you are sure it's a bug!), database (BanyanDB - SkyWalking native database)
