
[Bug] BanyanDB trace/measure/stream/sidx merge: missing fsync makes torn parts perpetuate CrashLoopBackOff (root cause for #13861) #13862

@hanahmily

Description


Spun off from #13861. That issue reports a panic: offset must be equal to bytesRead from the trace mergeLoop, restarting the BanyanDB pod every ~8 minutes in production. While investigating I found that the panic is reproducible from a torn spans.bin (or any per-tag *.t/*.tm) at a block boundary, and that the BanyanDB merge write path has a durability gap that allows such torn parts to be left on disk and survive across process restarts.

Filing as a separate issue because the fix lives in the BanyanDB storage layer (it's not a trace-only or merge-only bug — same gap exists across trace, measure, stream, and sidx); the user-facing symptom in #13861 will go away once it lands.

Revision (2026-05-07): This proposal was originally a four-layer fix that included converting six logger.Panicf sites to error returns and adding a quarantine/recovery path. After design review with the maintainer, the panic-on-corruption pattern is BanyanDB's canonical fail-fast contract and is preserved unchanged. The revised proposal is the three preventive fixes only (1 + 2 + 3 below). They eliminate the creation of torn parts in normal operation; pre-fix-deployment torn parts still trigger panic-at-boot, which is the canonical operator-intervention signal. The "Why the panic perpetuates" framing below is preserved for context, but the perpetuation is now solved indirectly: torn parts can no longer be created, so the panic-at-corruption sites become effectively unreachable post-fix.

Reproduction

A standalone unit test forges the on-disk shape ("spans.bin truncated at a block boundary, metadata.json and primary.bin intact") and feeds it back into tst.mergeParts. The first read of the truncated block silently EOF-returns from seqReader.mustReadFull without advancing bytesRead; the following block's offset check then fires with the exact production message:

panic: offset 5300 must be equal to bytesRead 5247

Same shape as #13861's offset 1400877 must be equal to bytesRead 1400490. Reproducer: Test_merger_tornWriteRepro in banyand/trace/merger_repro_test.go (held back from the fix PR — under the revised plan, this test only documents the failure shape; the fix prevents the shape from being created on disk in the first place).

Audit of the merge durability path

banyand/trace/merger.go::mergeParts (lines 469-525), with cross-references:

| Step | What happens | fsync? |
| --- | --- | --- |
| `bw.writers.MustClose()` | Closes SeqWriters for `meta.bin`, `primary.bin`, `spans.bin`, plus each `*.t` / `*.tm` | Yes, but the error is discarded. `seqWriter.Close()` (`pkg/fs/local_file_system.go:656-669`) does `bufio.Flush()` then `_ = SyncAndDropCache(...)`. On Linux `SyncAndDropCache` is `Fdatasync` + `FADV_DONTNEED` (`pkg/fs/local_file_system_linux.go:121-127`); on Darwin `F_FULLFSYNC`; on Windows `FlushFileBuffers`. All are real durability primitives — the bug is that the return value is discarded, so a fdatasync failure (ENOSPC, EIO) silently loses data. The underlying `*os.File` is also never closed in this path, leaking the fd. |
| `pm.mustWriteMetadata(...)` | Writes `metadata.json` | No. `localFileSystem.Write` (`pkg/fs/local_file_system.go:201-233`) opens, writes, defers `file.Close()` — plain `os.File.Close()`, no `Sync()`. |
| `tf.mustWriteTraceIDFilter(...)` | Writes `traceID.filter` | No. Same path. |
| `tt.mustWriteTagType(...)` | Writes `tag.type` | No. Same path. |
| `fileSystem.SyncPath(dstPath)` | fsync(dir) | Yes — but a directory fsync persists names, not contents. |

Correction to the original audit. The first row originally claimed Linux used sync_file_range. It does not — SyncAndDropCache calls unix.Fdatasync followed by unix.Fadvise(FADV_DONTNEED). fdatasync(2) is a real durability primitive (flushes data plus the metadata required to read it; only skips non-essential inode metadata like atime/mtime). The bugs are narrower than "wrong syscall": the error from Fdatasync is silently discarded, and the *os.File handle is leaked.

Failure window: between localFileSystem.Write returning for a metadata file (file content still in page cache) and the directory fsync completing, a hard kill (SIGKILL, OOM, eviction, kernel panic) can leave the part dir with a metadata.json directory entry that points to a zero-or-partial inode. On startup, mustOpenFilePart reads metadata.json and trusts it; the next merge that touches the part panics.

Why the panic perpetuates

Because the panic in the merge goroutine is unrecovered, the process exits, k8s restarts the pod, the merge loop wakes up, picks up the same torn part, panics again — every ~8 minutes in the original report.

How the revised fix addresses this without touching the panic. Phase 1+2+3 below make it impossible for the merger to leave the perpetuating shape on disk in normal operation. After deploy, every merge ends with either a fully-renamed metadata.json (durable, healthy) or only metadata.json.tmp (no final — mustReadMetadata panics at boot, operator moves the part aside). For pre-fix-deployment torn parts that already exist on disk, the panic-at-boot still fires once, and the operator handles it manually — this is the canonical BanyanDB pattern, not a regression introduced by this fix.

Scope

The same seqReader.mustReadFull silent-EOF and the same seqWriter.Close() no-fsync path are used by all four storage engines that share this merge infrastructure:

  • banyand/trace/part_iter.go:359 (this report)
  • banyand/trace/block.go:196,329
  • banyand/internal/sidx/block.go:594,616,652
  • banyand/measure/block.go:246,390
  • banyand/stream/block.go:250,387

One fix to pkg/fs/local_file_system.go covers (1) and (2) for all of them. (3) is a per-engine swap of three Write call sites to WriteAtomic plus a small cleanupLeftoverTmp helper on mustOpenFilePart. None of the existing logger.Panicf sites are modified.

Proposed fix (three preventive layers)

  1. seqWriter.Close() must propagate the durability-call error and close the underlying file. pkg/fs/local_file_system.go:656-669 — replace _ = SyncAndDropCache(...) with a checked call whose error fails the close, and explicitly close *os.File. bw.writers.MustClose() continues to panic on error: that turns silent data loss on disk failure into a visible panic operators can act on.
  2. localFileSystem.Write must fsync before Close. pkg/fs/local_file_system.go:201-233 — metadata.json, traceID.filter, and tag.type are small state files the part depends on; they must not be left less durable than the sibling data files they describe.
  3. Two-phase commit for merge metadata. Add a WriteAtomic(buffer, name, perm) to FileSystem that does write-tmp → fsync → rename → fsync-dir. In each engine's mustWriteMetadata / mustWriteTraceIDFilter / mustWriteTagType, replace the Write call with WriteAtomic. On startup, mustOpenFilePart removes any <file>.tmp sibling whose matching final exists (safe post-rename leftover); a <file>.tmp without its final is left in place and the existing mustReadMetadata panic fires — operator-intervention path, unchanged from today.

Together, (1)+(2)+(3) eliminate the torn-write window. The merger's response to a torn part remains logger.Panicf; only the conditions under which a torn part can exist are tightened.

Explicitly out of scope

  • Conversion of any logger.Panicf("offset %d must be equal to bytesRead %d", ...) to error returns. Preserves BanyanDB's fail-fast contract.
  • recover() boundary in mergeParts. The merger goroutine continues to crash the process on panic.
  • Quarantine of torn source parts to a corrupt/ subdirectory. Operator handles surviving torn parts manually.
  • merger_corrupt_part_total metric. Pod restart count from kubelet remains the operator signal.

Caveat

The reproducer matches the panic shape exactly and the audit confirms the durability gap exists, but I have not yet identified what triggers the very first torn part on a fresh disk in #13861's setup (the reporter sees the first panic ~30 minutes after a clean wipe, with no prior process kill). Two diagnostic questions are pending in the original issue. Either way, the proposed fix is correct on its own merits — it eliminates the catastrophic perpetuation regardless of what causes the initial tear.

A new diagnostic candidate the audit surfaced: the discarded Fdatasync error in seqWriter.Close() could itself be the first-tear cause if the host hit a transient ENOSPC/EIO. Worth asking the reporter to grep dmesg for I/O errors around the time of the first panic.

Implementation plan

A task-by-task plan with TDD steps is saved locally at .omc/plans/2026-05-07-merge-fsync-fix.md (8 tasks across 4 phases: pkg/fs primitives → trace two-phase commit → propagate to measure/stream/sidx → docs + integration test). Will land as a single PR titled fix(banyand): close merge durability gap (#13862).

Labels: bug (Something isn't working and you are sure it's a bug!), database (BanyanDB - SkyWalking native database)