[Bug] BanyanDB segment loader silently deletes segments with missing/empty metadata file (version-check bypass)

Surfaced during triage of #13861. Filing separately because it's an independent footgun that doesn't depend on (and isn't fixed by) #13862.

## Background

In the #13861 thread I claimed BanyanDB 0.10.1 should refuse to load 0.9.0 data files because the segment file version 1.3.0 (used in 0.9) is not in `banyand/internal/storage/versions.yml` (which on 0.10.1 is `[1.4.0]`, on `main` is `[1.4.0, 1.5.0]`). The reporter's controlled fresh-disk experiment subsequently proved that #13861's panic is unrelated to legacy file format — it reproduces on disks that 0.9 never touched. So the version-check question is *separate* from #13861's root cause, but the original observation still stands: in the reporter's environment, an in-place 0.9.0 → 0.10.1 upgrade *did* start successfully. That should not be possible if the version check is functioning.

Reading the code, I think I see why — and it's worse than "the check didn't fire": **the segment loader silently `MustRMAll`s any segment whose metadata file is missing or empty.**

## Code path

`banyand/internal/storage/segment.go::segmentController.open()` (lines 516-564):

```go
err := loadSegments(sc.location, segPathPrefix, sc, sc.getOptions().SegmentInterval, func(start, end time.Time) error {
    ...
    metadataPath := path.Join(segmentPath, metadataFilename)            // <segPath>/metadata
    rawMeta, readErr := sc.lfs.Read(metadataPath)
    if readErr != nil {
        if errors.Is(readErr, fs.ErrNotExist) {
            invalidSegments = append(invalidSegments, segmentPath)        // (A) missing → "invalid"
            return nil
        }
        return readErr
    }
    if len(rawMeta) == 0 {
        invalidSegments = append(invalidSegments, segmentPath)            // (B) empty → "invalid"
        return nil
    }
    meta, parseErr := readSegmentMeta(rawMeta)
    if parseErr != nil {
        return parseErr                                                   // (C) unparseable / wrong version → propagate
    }
    ...
})
if len(invalidSegments) > 0 {
    sc.l.Warn().Strs("segments", invalidSegments).Msg("invalid segments found (empty or epoch-dated), removing them")
    for i := range invalidSegments {
        sc.lfs.MustRMAll(invalidSegments[i])                              // ← silent data loss
    }
}
```

So:

- **(A) Metadata file missing** → segment dir silently deleted (`MustRMAll`), only a `Warn` log.
- **(B) Metadata file empty** → silently deleted, only a `Warn` log.
- **(C) Metadata file present, version 1.3.0 (or any other unsupported)** → `errVersionIncompatible` returned and propagated up; bootstrap fails. ✓ correct.

The reporter's 0.9.0 → 0.10.1 upgrade hit (A) or (B) rather than (C), depending on whether 0.9.0 wrote a `metadata` file at all and what shape it had. That's why the upgrade *appeared* to succeed — segments from 0.9 were silently destroyed and a new tree was built on top.

## Why this is a bug independent of version-check

The "delete on invalid metadata" branch is wrong for many reasons that have nothing to do with version compatibility:

1. **Silent data loss.** Any operational hiccup that produces an unreadable `metadata` file (permission error temporarily masked by another caller, partial write from a previous crash before the metadata fsync, manual operator inspection that briefly moves the file aside) destroys the entire segment with only a `Warn` log. The user has no signal that data is being deleted.
2. **Version-mismatch gate is bypassed.** A core function of the metadata file is to gate compatibility — but the missing/empty branches make that gate disappear. An operator running an unsupported upgrade gets a clean-looking "fresh database" instead of a "your data is incompatible, here's how to migrate" error.
3. **Crash-loop amplification.** If a writer crashes between segment dir creation and `metadata` file write (a window that exists in `segmentController.create`, lines 599-621: `MkdirPanicIfExist` → `CreateLockFile` → `Write`, with no fsync sequence), the next startup deletes that segment.
4. **`MustRMAll` panics on permission/IO errors**, so an operator who debug-protected `/data` (e.g., `chmod -R a-w` to inspect) gets a hard panic during bootstrap on top of the silent-deletion path.

## Proposed fix

Two changes:

1. **Stop deleting "invalid" segments.** Replace the `MustRMAll` with one of:
   - **Refuse to start** with a clear error: `"segment <path> has missing/unreadable/incompatible metadata; manual intervention required"`. This is the safest default for a database — never auto-delete data the operator didn't ask to delete.
   - Or **quarantine** the segment dir by renaming it to `<segPath>.corrupt-<timestamp>` and emit a metric (`banyandb_corrupt_segment_total{reason=...}`) + WARN log. Operator can review/delete manually.

2. **Treat missing/empty `metadata` as a real error, not as a license to delete.** Empty/missing metadata almost always indicates incomplete write or an unsupported predecessor version — both cases the operator needs to be told about explicitly.

For the original 0.9 → 0.10 upgrade scenario specifically: if 0.9 wrote no `metadata` file at all, the new behavior surfaces a clean "found segment dir without metadata file, refusing to load" error. The operator can then check the upgrade docs (or file an issue, like #13861 turned out to be).

Optionally, a third fix:

3. **Tighten `segmentController.create`'s metadata write durability.** `CreateLockFile` → `Write` → return without explicit fsync of either the file or the directory means a crash between segment-dir creation and metadata commit leaves the (A) state. After fix (1) this is no longer catastrophic (refuse-to-start instead of silent delete), but the fsync gap is the same one #13862 audits for the merge path; the fix shape (`fsync` data → atomic rename → `fsync` dir) applies here too.

## Scope

`banyand/internal/storage/segment.go::open` is shared by all storage engines (`trace`, `measure`, `stream`, `sidx`), so one fix covers all of them.

## Caveat

I haven't yet checked what 0.9.0 actually wrote in segment dirs (whether a `metadata` file existed, whether it had the version inside, whether the format was the same key/JSON). That would let us confirm exactly which branch (A or B) fired in the reporter's upgrade. Code-only audit is enough for this issue though — even if 0.9 wrote a perfect 1.3.0 metadata file and the reporter's upgrade hit branch (C), the silent-delete behavior on (A) and (B) is a footgun in its own right.


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Bug] BanyanDB segment loader silently deletes segments with missing/empty metadata file (version-check bypass) #13863

Background

Code path

Why this is a bug independent of version-check

Proposed fix

Scope

Caveat

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

[Bug] BanyanDB segment loader silently deletes segments with missing/empty metadata file (version-check bypass) #13863

Description

Background

Code path

Why this is a bug independent of version-check

Proposed fix

Scope

Caveat

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions