
Walkable-v8: force-rewrite v7 buckets to v8 on first master-up load (closes lazy-migration gap) #10

@ehsan6sha

Description

Summary

Buckets written before walkable-v8 (pre-v0.6) are stuck with PointerWire::Link(StorageKey) pointers in their HAMT internal nodes. After upgrading to a walkable-v8-enabled SDK (current default: walkable_v8_writer_enabled = true since 0.6.1 / #89), those legacy buckets stay v7 until each shard happens to be touched by a real write. Per the v0.6.1 release notes: "Lazy migration is per-shard, not per-bucket". For users with many existing buckets and infrequent writes, this means offline-walkability never engages on their existing data.

This issue proposes a transparent, one-shot per-bucket force-rewrite that fires on the first master-up load after the SDK upgrade — closing the lazy-migration gap without requiring users to write to every shard manually.

Evidence

A real user device on fula_client 0.5.2 (the published version that includes the #8 fix (#3) plus walkable-v8 writer default-on) shows the failure pattern when the master endpoint is pointed at a non-resolvable URL (deliberate offline simulation):

Old bucket (images) — fails offline:

Forest loaded for bucket: images
listObjects(images, prefix="") error: AnyhowException(Encryption error:
  storage backend error: HTTP error: error sending request for url
  (https://<masked-bogus-master>/images/__fula_forest_v7_nodes/bef00324f310ac3c032a4b94b9779c6af865a7a1854d))

The path __fula_forest_v7_nodes/<storage_key> is the v7 layout. The HAMT walker is fetching an internal node by raw storage_key against master because the parent pointer is Link(StorageKey), not LinkV2 { storage_key, cid }. No CID hint → no gateway-race fallback path engages.
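The pointer distinction above can be sketched roughly as follows. These are hypothetical Rust types for illustration only; the real fula-crypto wire enum may differ in naming and layout:

```rust
// Hypothetical sketch of the two pointer wire variants -- the real
// fula-crypto types may differ in naming and layout.
type StorageKey = String;
type Cid = String;

enum PointerWire {
    // v7: raw storage key only; the walker must resolve it via master.
    Link(StorageKey),
    // v8: storage key plus a CID hint; enables the gateway-race fallback.
    LinkV2 { storage_key: StorageKey, cid: Cid },
}

/// A node is offline-walkable only when every child pointer carries a CID.
fn offline_walkable(children: &[PointerWire]) -> bool {
    children.iter().all(|p| matches!(p, PointerWire::LinkV2 { .. }))
}
```

A single `Link(StorageKey)` anywhere on the walk path is enough to force a master round-trip, which is exactly the failure shown in the log above.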

New bucket (walkable-v8-test-…) — works offline on the same device, same session:

Forest loaded for bucket: walkable-v8-test-1778428540
listObjects(walkable-v8-test-1778428540, prefix=""): 5 files (raw forest=5)

Same SDK, same bogus master URL. Only difference: the manifest has LinkV2 stamps everywhere because it was created from a v8 writer.

Acceptance criteria

A regression test in crates/fula-client/tests/ that:

  1. Creates a bucket entirely under v7 writer (no LinkV2 stamps).
  2. Verifies that list_files_from_forest(bucket) against a DNS-failing master returns Err.
  3. Applies the proposed migration (single SDK call, no operator intervention).
  4. Re-runs list_files_from_forest(bucket) against the same DNS-failing master, asserts it returns the expected file list.
  5. Verifies the post-migration manifest's page_index entries all have cid: Some(_) (i.e., LinkV2 cascade fully fired).

Proposed mechanism (minimal spec)

Trigger. Inside load_forest_internal after the manifest is decoded, scan manifest_snapshot.root.page_index. If ANY entry has cid: None, the bucket has un-migrated v7 pages. Set a "needs migration" flag on the loaded forest cache entry.
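The trigger scan is cheap enough to run on every load. A minimal sketch, where `PageEntry` is a stand-in for the real manifest `page_index` entry type in fula-client:

```rust
// Minimal sketch of the trigger scan; PageEntry is a stand-in for the
// real manifest page_index entry type in fula-client.
struct PageEntry {
    cid: Option<String>, // None => page still on the v7 wire format
}

/// True when ANY page lacks a CID stamp, i.e. un-migrated v7 pages exist.
fn needs_v8_migration(page_index: &[PageEntry]) -> bool {
    page_index.iter().any(|e| e.cid.is_none())
}
```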

Marker. Persist a small flag in the existing BlockCache::METADATA table, keyed by migrate_to_walkable_v8/v1/<bucket_lookup_h_hex> with a 1-byte 0x01 value. Set it after a successful migration flush; check it on the next load to short-circuit.
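The marker layout can be sketched as below. The key shape mirrors the text; the helper name itself is hypothetical:

```rust
// Sketch of the migration marker described above; the key shape mirrors
// the spec text, while the helper name is hypothetical.
const MARKER_DONE: u8 = 0x01;

fn migration_marker_key(bucket_lookup_h_hex: &str) -> String {
    format!("migrate_to_walkable_v8/v1/{bucket_lookup_h_hex}")
}
```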

Mechanism. When the flag is set and the marker is absent:

  1. Iterate every shard in the loaded forest. Mark its HAMT root + every reachable internal node as dirty. The existing dirty-tracker (fula-crypto::wnfs_hamt::Node::dirty) is the only seam needed; flushing a dirty node re-encodes it under the v8 writer (since walkable_v8_writer_enabled = true).
  2. Mark every manifest_snapshot.root.page_index[*].cid entry as needing a re-stamp (or just mark the page dirty — same effect via Phase 1.5).
  3. Call the existing save_sharded_hamt_forest path. The Phase 1.5 / 1.6 / 2 cascade re-encodes pages, writes them, etag-self-verifies the new CIDs, stamps them into the new ManifestRoot, and commits via Phase 2's If-Match CAS.
  4. On successful Phase 2 commit, write the migration marker.
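The four steps above can be sketched as a single sequence. Everything here is hypothetical (trait, method names, error variants); the real seams are fula-crypto's dirty tracker and save_sharded_hamt_forest's Phase 1.5/1.6/2 cascade:

```rust
// Compilable sketch of the four-step flow; all names are hypothetical.
#[derive(Debug, PartialEq)]
enum MigrateError {
    ConcurrentModification, // lost the Phase 2 If-Match CAS
    MasterUnreachable,      // flush deferred until next master-up load
}

trait ForestOps {
    fn mark_all_hamt_nodes_dirty(&mut self); // step 1
    fn mark_all_pages_dirty(&mut self);      // step 2
    fn save_sharded_hamt_forest(&mut self) -> Result<(), MigrateError>; // step 3
    fn write_migration_marker(&mut self);    // step 4
}

fn migrate_v7_to_v8<F: ForestOps>(forest: &mut F) -> Result<(), MigrateError> {
    forest.mark_all_hamt_nodes_dirty();
    forest.mark_all_pages_dirty();
    forest.save_sharded_hamt_forest()?; // Phase 2 If-Match CAS commit
    forest.write_migration_marker();    // only after a successful commit
    Ok(())
}
```

Note the ordering: the marker write sits strictly after the `?` on the flush, which is what makes the retry-on-failure behavior described below automatic.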

Atomicity. Phase 2 root commit uses If-Match on the prior etag. The migration either fully commits a new v8 root OR fails cleanly and the legacy v7 root stays live. Mid-flight crashes leave the old root + some orphan v8 blobs — same cleanup behavior as any other failed Phase 2 commit (no corruption).

One-shot per bucket per device. The marker prevents re-running. If two devices concurrently load + migrate the same bucket, one wins the Phase 2 CAS, the other gets ConcurrentModification and just observes the already-v8 state on retry.

No effect when bucket is already v8. The page_index scan short-circuits cheaply (single integer-equality check per page). Buckets where every page already has cid: Some(_) skip the migration path entirely. Zero overhead on healthy buckets.

No effect when master is unreachable. Migration requires Phase 2 PUT to commit. If master is down at load time, the scan still detects "needs migration" but the actual flush is deferred until next master-up load. The marker is only set after a successful commit, so retry is automatic.

Bounded write cost

Per the existing W.8.4 analysis, a fully-rewritten manifest costs ≤ 5% more bytes than its v7 predecessor (the LinkV2 variant adds 22 bytes per pointer × pointer count). For a bucket with 16 shards, ~80 pages, ~500 internal HAMT nodes (typical user), that's ~50 KB of additional traffic for the migration commit. One-time per bucket per device. Negligible.
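A back-of-envelope check of that figure, assuming 22 extra bytes per stamped pointer. The ~2,300 pointer count is an assumption chosen to match the ~50 KB estimate above; the real fan-out depends on HAMT shape:

```rust
// Back-of-envelope check of the W.8.4 figure: 22 extra bytes per
// stamped LinkV2 pointer. Pointer count is an assumption.
const LINKV2_OVERHEAD_BYTES: u64 = 22;

fn migration_overhead_bytes(pointer_count: u64) -> u64 {
    pointer_count * LINKV2_OVERHEAD_BYTES
}
```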

What this does NOT do

  • No chunk-level migration. Chunks already have content-addressed keys; their CIDs are independent of walkable-v8. Already covered by issue #8 ("Offline reads break after online write: BLOCKS cache not warmed by writer + 3 collateral SDK gaps") and fix #3 ("Walkability", BLOCKS warm-on-write).
  • No retroactive offline-walking of unmigrated buckets. Until a user does a master-up load that fires the migration, their bucket stays v7-only. The migration is opt-in by SDK upgrade + first reconnect.
  • No cross-device coordination. Each device migrates independently the first time it observes the v7 state. If device A migrates and device B reads after, device B sees the v8 root and short-circuits the scan immediately.

Why this is safe to ship default-on

  • The migration is additive — it only RE-WRITES existing data under a stricter format. It does not change any bucket's logical contents.
  • The migration is observable — a tracing::info! line on every fire lets operators measure adoption + diagnose stuck buckets.
  • The migration is rollback-safe — if the migration fix is reverted in a future SDK, the v8-stamped pointers are still readable by every v0.6+ SDK. The marker just becomes stale (gets skipped on next load), and the bucket continues working under whichever wire format the writer emits.

Implementation plan

  1. Add migrate_v7_to_v8_if_needed(bucket) helper on EncryptedClient (~80 LOC).
  2. Wire trigger inside load_forest_internal post-decode (~10 LOC).
  3. Add marker get/set methods on BlockCache mirroring the existing store_users_index_state pattern (~40 LOC).
  4. Add walkable_v8_migrate_v7_bucket integration test in crates/fula-client/tests/ (~200 LOC).
  5. Cross-platform alignment check: fula-flutter + fula-js no-op (this is a Rust-internal path), wasm32 compile-clean.

Total: ~330 LOC + test.

Acceptance test (gold standard)

Filed as part of this PR. End-to-end on the user's real images bucket on s3.cloud.fx.land:

  1. Setup: assume the user's images bucket exists and was originally written under pre-v0.6 SDK (verified by the v7-nodes-URL failure mode above).
  2. Pre-fix: spin up a fresh EncryptedClient, set the endpoint to a non-resolvable URL, attempt list_files_from_forest("images"). MUST return Err with the v7-nodes URL in the error chain.
  3. Apply fix: spin up an EncryptedClient against the REAL master, call the migration trigger (initially manual via a test-only entry point; in production it fires on first load).
  4. Verify post-migration: spin a third EncryptedClient, set endpoint back to bogus URL, repeat list_files_from_forest("images"). MUST succeed.
  5. Verify manifest is v8: inspect the post-migration manifest's page_index, assert every entry has cid: Some(_).

Out of scope / future work

  • Forced re-migration after a wire-format upgrade past v8. When v9 lands, a similar marker-and-scan pattern will be needed. The marker version (the v1 suffix) exists so future migrations don't conflict.
  • Migration progress reporting. Users with very large buckets may want a UI indicator. Out of scope for v1; can be added via a future Phase 19 transparency surface.
  • Per-shard parallelism. The current dirty-flag-and-flush approach re-flushes shards sequentially. Could be parallelized, but adds complexity. Defer until measured to be a bottleneck.
