
[Bug] BanyanDB trace/measure/stream/sidx merge: missing fsync makes torn parts perpetuate CrashLoopBackOff (root cause for #13861) #13862

@hanahmily

Description


Spun off from #13861. That issue reports a panic: offset must be equal to bytesRead from the trace mergeLoop, restarting the BanyanDB pod every ~8 minutes in production. While investigating I found that the panic is reproducible from a torn spans.bin (or any per-tag *.t/*.tm) at a block boundary, and that the BanyanDB merge write path has a durability gap that allows such torn parts to be left on disk and survive across process restarts.

Filing as a separate issue because the fix lives in the BanyanDB storage layer (it's not a trace-only or merge-only bug — same gap exists across trace, measure, stream, and sidx); the user-facing symptom in #13861 will go away once it lands.

Revision (2026-05-07): This proposal was originally a four-layer fix that included converting six logger.Panicf sites to error returns and adding a quarantine/recovery path. After design review with the maintainer, the panic-on-corruption pattern is BanyanDB's canonical fail-fast contract and is preserved unchanged. The revised proposal is the three preventive fixes only (1 + 2 + 3 below). They eliminate the creation of torn parts in normal operation; pre-fix-deployment torn parts still trigger panic-at-boot, which is the canonical operator-intervention signal. The "Why the panic perpetuates" framing below is preserved for context, but the perpetuation is now solved indirectly: torn parts can no longer be created, so the panic-at-corruption sites become effectively unreachable post-fix.

Reproduction

A standalone unit test forges the on-disk shape ("spans.bin truncated at a block boundary, metadata.json and primary.bin intact") and feeds it back into tst.mergeParts. The first read of the truncated block silently EOF-returns from seqReader.mustReadFull without advancing bytesRead; the following block's offset check then fires with the exact production message:

panic: offset 5300 must be equal to bytesRead 5247

Same shape as #13861's offset 1400877 must be equal to bytesRead 1400490. Reproducer: Test_merger_tornWriteRepro in banyand/trace/merger_repro_test.go (held back from the fix PR — under the revised plan, this test only documents the failure shape; the fix prevents the shape from being created on disk in the first place).

Audit of the merge durability path

banyand/trace/merger.go::mergeParts (lines 469-525), with cross-references:

| Step | What happens | fsync? |
| --- | --- | --- |
| `bw.writers.MustClose()` | Closes SeqWriters for `meta.bin`, `primary.bin`, `spans.bin`, plus each `*.t` / `*.tm` | Yes, but the error is discarded. `seqWriter.Close()` (`pkg/fs/local_file_system.go:656-669`) does `bufio.Flush()` then `_ = SyncAndDropCache(...)`. On Linux `SyncAndDropCache` is `Fdatasync` + `FADV_DONTNEED` (`pkg/fs/local_file_system_linux.go:121-127`); on Darwin `F_FULLFSYNC`; on Windows `FlushFileBuffers`. All are real durability primitives — the bug is that the return value is discarded, so a fdatasync failure (ENOSPC, EIO) silently loses data. The underlying `*os.File` is also never closed in this path, leaking the fd. |
| `pm.mustWriteMetadata(...)` | Writes `metadata.json` | No. `localFileSystem.Write` (`pkg/fs/local_file_system.go:201-233`) opens, writes, defers `file.Close()` — plain `os.File.Close()`, no `Sync()`. |
| `tf.mustWriteTraceIDFilter(...)` | Writes `traceID.filter` | No. Same path. |
| `tt.mustWriteTagType(...)` | Writes `tag.type` | No. Same path. |
| `fileSystem.SyncPath(dstPath)` | fsync(dir) | Yes — but a directory fsync persists names, not contents. |

Correction to the original audit. The first row originally claimed Linux used sync_file_range. It does not — SyncAndDropCache calls unix.Fdatasync followed by unix.Fadvise(FADV_DONTNEED). fdatasync(2) is a real durability primitive (flushes data plus the metadata required to read it; only skips non-essential inode metadata like atime/mtime). The bugs are narrower than "wrong syscall": the error from Fdatasync is silently discarded, and the *os.File handle is leaked.

Failure window: between localFileSystem.Write returning for a metadata file (file content still in page cache) and the directory fsync completing, a hard kill (SIGKILL, OOM, eviction, kernel panic) can leave the part dir with a metadata.json directory entry that points to a zero-or-partial inode. On startup, mustOpenFilePart reads metadata.json and trusts it; the next merge that touches the part panics.

Why the panic perpetuates

Because the panic in the merge goroutine is unrecovered, the process exits, k8s restarts the pod, the merge loop wakes up, picks up the same torn part, panics again — every ~8 minutes in the original report.

How the revised fix addresses this without touching the panic. Phase 1+2+3 below make it impossible for the merger to leave the perpetuating shape on disk in normal operation. After deploy, every merge ends with either a fully-renamed metadata.json (durable, healthy) or only metadata.json.tmp (no final — mustReadMetadata panics at boot, operator moves the part aside). For pre-fix-deployment torn parts that already exist on disk, the panic-at-boot still fires once, and the operator handles it manually — this is the canonical BanyanDB pattern, not a regression introduced by this fix.

Scope

The same seqReader.mustReadFull silent-EOF and the same seqWriter.Close() no-fsync path are used by all four storage engines that share this merge infrastructure:

  • banyand/trace/part_iter.go:359 (this report)
  • banyand/trace/block.go:196,329
  • banyand/internal/sidx/block.go:594,616,652
  • banyand/measure/block.go:246,390
  • banyand/stream/block.go:250,387

One fix to pkg/fs/local_file_system.go covers (1) and (2) for all of them. (3) is a per-engine swap of three Write call sites to WriteAtomic plus a small cleanupLeftoverTmp helper on mustOpenFilePart. None of the existing logger.Panicf sites are modified.

Proposed fix (three preventive layers)

  1. seqWriter.Close() must propagate the durability-call error and close the underlying file. pkg/fs/local_file_system.go:656-669 — replace _ = SyncAndDropCache(...) with a checked call whose error fails the close, and explicitly close *os.File. bw.writers.MustClose() continues to panic on error: that turns silent data loss on disk failure into a visible panic operators can act on.
  2. localFileSystem.Write must fsync before Close. pkg/fs/local_file_system.go:201-233 — metadata.json, traceID.filter, and tag.type are small state files the part depends on; they must not be left less durable than the sibling data files they describe.
  3. Two-phase commit for merge metadata. Add a WriteAtomic(buffer, name, perm) to FileSystem that does write-tmp → fsync → rename → fsync-dir. In each engine's mustWriteMetadata / mustWriteTraceIDFilter / mustWriteTagType, replace the Write call with WriteAtomic. On startup, mustOpenFilePart removes any <file>.tmp sibling whose matching final exists (safe post-rename leftover); a <file>.tmp without its final is left in place and the existing mustReadMetadata panic fires — operator-intervention path, unchanged from today.

Together, (1)+(2)+(3) eliminate the torn-write window. The merger's response to a torn part remains logger.Panicf; only the conditions under which a torn part can exist are tightened.

Explicitly out of scope

  • Conversion of any logger.Panicf("offset %d must be equal to bytesRead %d", ...) to error returns. Preserves BanyanDB's fail-fast contract.
  • recover() boundary in mergeParts. The merger goroutine continues to crash the process on panic.
  • Quarantine of torn source parts to a corrupt/ subdirectory. Operator handles surviving torn parts manually.
  • merger_corrupt_part_total metric. Pod restart count from kubelet remains the operator signal.

Caveat

The reproducer matches the panic shape exactly and the audit confirms the durability gap exists, but I have not yet identified what triggers the very first torn part on a fresh disk in #13861's setup (the reporter sees the first panic ~30 minutes after a clean wipe, with no prior process kill). Two diagnostic questions are pending in the original issue. Either way, the proposed fix is correct on its own merits — it eliminates the catastrophic perpetuation regardless of what causes the initial tear.

A new diagnostic candidate the audit surfaced: the discarded Fdatasync error in seqWriter.Close() could itself be the first-tear cause if the host hit a transient ENOSPC/EIO. Worth asking the reporter to grep dmesg for I/O errors around the time of the first panic.

Implementation plan

A task-by-task plan with TDD steps is saved locally at .omc/plans/2026-05-07-merge-fsync-fix.md (8 tasks across 4 phases: pkg/fs primitives → trace two-phase commit → propagate to measure/stream/sidx → docs + integration test). Will land as a single PR titled fix(banyand): close merge durability gap (#13862).

Labels: bug (Something isn't working and you are sure it's a bug!), database (BanyanDB - SkyWalking native database)