Summary
Buckets written before walkable-v8 (pre-v0.6) are stuck with PointerWire::Link(StorageKey) pointers in their HAMT internal nodes. After upgrading to a walkable-v8-enabled SDK (current default: walkable_v8_writer_enabled = true since 0.6.1 / #89), those legacy buckets stay v7 until each shard happens to be touched by a real write. Per the v0.6.1 release notes: "Lazy migration is per-shard, not per-bucket". For users with many existing buckets and infrequent writes, this means offline-walkability never engages on their existing data.
This issue proposes a transparent, one-shot per-bucket force-rewrite that fires on the first master-up load after the SDK upgrade — closing the lazy-migration gap without requiring users to write to every shard manually.
Evidence
A real user device on fula_client 0.5.2 (the published version that includes #8 fix #3 + walkable-v8 writer default-on) shows the failure pattern when the master endpoint is mutated to a non-resolvable URL (a deliberate offline simulation):
Old bucket (images) — fails offline:
The path __fula_forest_v7_nodes/<storage_key> is the v7 layout. The HAMT walker is fetching an internal node by raw storage_key against master because the parent pointer is Link(StorageKey), not LinkV2 { storage_key, cid }. No CID hint → no gateway-race fallback path engages.
New bucket (walkable-v8-test-…) — works offline on the same device, same session:
Same SDK, same bogus master URL. Only difference: the manifest has LinkV2 stamps everywhere because it was created from a v8 writer.
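For reference, a minimal sketch of the two pointer wire shapes involved here; the variant names follow the issue text, while the field types are placeholders rather than the real fula-crypto definitions:

```rust
// Placeholder types for illustration only; the real definitions live in fula-crypto.
type StorageKey = String;
type Cid = String;

// Sketch of the two pointer wire variants discussed above.
enum PointerWire {
    // v7: opaque storage key only. The walker must resolve it against master,
    // so the gateway-race fallback has nothing to work with offline.
    Link(StorageKey),
    // v8: storage key plus CID hint. The walker can fetch the block from any
    // gateway and verify it locally, which is what enables offline walkability.
    LinkV2 { storage_key: StorageKey, cid: Cid },
}
```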
Acceptance criteria
A regression test in crates/fula-client/tests/ (sketched after this list) that:
Creates a bucket entirely under v7 writer (no LinkV2 stamps).
Verifies that list_files_from_forest(bucket) against a DNS-failing master returns Err.
Applies the proposed migration (single SDK call, no operator intervention).
Re-runs list_files_from_forest(bucket) against the same DNS-failing master, asserts it returns the expected file list.
Verifies the post-migration manifest's page_index entries all have cid: Some(_) (i.e., LinkV2 cascade fully fired).
Proposed mechanism (minimal spec)
Trigger. Inside load_forest_internal after the manifest is decoded, scan manifest_snapshot.root.page_index. If ANY entry has cid: None, the bucket has un-migrated v7 pages. Set a "needs migration" flag on the loaded forest cache entry.
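A sketch of that trigger scan, assuming a page_index entry exposes an Option-valued cid field as described above (the PageEntry shape is a placeholder):

```rust
// Placeholder shape of a page_index entry, for illustration only.
struct PageEntry { cid: Option<String> }

// Trigger scan run inside load_forest_internal after the manifest is decoded:
// any page without a CID stamp means the bucket still has un-migrated v7 pages.
fn has_unmigrated_v7_pages(page_index: &[PageEntry]) -> bool {
    page_index.iter().any(|entry| entry.cid.is_none())
}
```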
Marker. Persist a small flag in the existing BlockCache::METADATA table keyed migrate_to_walkable_v8/v1/<bucket_lookup_h_hex> → 1-byte 0x01. Set after a successful migration flush. Checked on next load to short-circuit.
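A sketch of the marker key and the short-circuit check; the key layout and the one-byte 0x01 value come from this spec, while the way the cached value is obtained from BlockCache::METADATA is a hypothetical stand-in:

```rust
// Marker key layout from the spec:
//   migrate_to_walkable_v8/v1/<bucket_lookup_h_hex>  ->  0x01 (one byte)
fn migration_marker_key(bucket_lookup_h_hex: &str) -> String {
    format!("migrate_to_walkable_v8/v1/{bucket_lookup_h_hex}")
}

// Short-circuit check on load; `cached_marker` is whatever the
// BlockCache::METADATA table holds for the key above, if anything.
fn migration_already_done(cached_marker: Option<&[u8]>) -> bool {
    matches!(cached_marker, Some([0x01]))
}
```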
Mechanism. When the flag is set and the marker is absent (a sketch follows this list):
Iterate every shard in the loaded forest. Mark its HAMT root + every reachable internal node as dirty. The existing dirty-tracker (fula-crypto::wnfs_hamt::Node::dirty) is the only seam needed; flushing a dirty node re-encodes it under the v8 writer (since walkable_v8_writer_enabled = true).
Mark every manifest_snapshot.root.page_index[*].cid entry as needing a re-stamp (or just mark the page dirty — same effect via Phase 1.5).
Call the existing save_sharded_hamt_forest path. The Phase 1.5 / 1.6 / 2 cascade re-encodes pages, writes them, etag-self-verifies the new CIDs, stamps them into the new ManifestRoot, and commits via Phase 2's If-Match CAS.
On successful Phase 2 commit, write the migration marker.
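Putting the list above together, a sketch of the proposed helper on EncryptedClient. Apart from load_forest_internal, save_sharded_hamt_forest, and the marker key sketched earlier, every type and method name here is a hypothetical stand-in for the existing dirty-tracker seams:

```rust
// Sketch only: names other than those quoted in the issue are hypothetical.
async fn migrate_v7_to_v8_if_needed(&self, bucket: &Bucket) -> Result<(), ClientError> {
    let mut forest = self.load_forest_internal(bucket).await?;
    if !forest.needs_walkable_v8_migration() {
        return Ok(());
    }
    // 1. Mark every shard's HAMT root and all reachable internal nodes dirty,
    //    so flushing re-encodes them under the v8 writer.
    for shard in forest.shards_mut() {
        shard.mark_all_nodes_dirty();
    }
    // 2. Mark every page dirty so Phase 1.5 re-stamps page_index[*].cid.
    forest.mark_all_pages_dirty();
    // 3. Existing Phase 1.5 / 1.6 / 2 cascade: re-encode pages, write them,
    //    etag-self-verify the new CIDs, stamp them into the new ManifestRoot,
    //    and commit via Phase 2's If-Match CAS.
    self.save_sharded_hamt_forest(bucket, &forest).await?;
    // 4. Only after a successful Phase 2 commit, persist the one-shot marker.
    self.block_cache
        .set_metadata(&migration_marker_key(&bucket.lookup_h_hex()), &[0x01])?;
    Ok(())
}
```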
Atomicity. Phase 2 root commit uses If-Match on the prior etag. The migration either fully commits a new v8 root OR fails cleanly and the legacy v7 root stays live. Mid-flight crashes leave the old root + some orphan v8 blobs — same cleanup behavior as any other failed Phase 2 commit (no corruption).
One-shot per bucket per device. The marker prevents re-running. If two devices concurrently load + migrate the same bucket, one wins the Phase 2 CAS, the other gets ConcurrentModification and just observes the already-v8 state on retry.
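To make the CAS semantics concrete, here is a standalone illustration using plain reqwest; this is not the fula-client HTTP layer, and the URL, argument names, and CommitOutcome type are invented for the example:

```rust
use reqwest::{Client, StatusCode};

enum CommitOutcome { Committed, ConcurrentModification, Failed }

// Conditional root commit: only replace the root if the server still holds the
// etag we loaded. A 412 means another writer (e.g. a second device running the
// same migration) won the race, and we simply re-read the new root on retry.
async fn commit_root_cas(
    http: &Client,
    root_url: &str,
    prior_etag: &str,
    new_root_bytes: Vec<u8>,
) -> Result<CommitOutcome, reqwest::Error> {
    let resp = http
        .put(root_url)
        .header("If-Match", prior_etag)
        .body(new_root_bytes)
        .send()
        .await?;

    Ok(match resp.status() {
        StatusCode::PRECONDITION_FAILED => CommitOutcome::ConcurrentModification,
        s if s.is_success() => CommitOutcome::Committed,
        // Any other failure leaves the prior (v7) root live; nothing is replaced.
        _ => CommitOutcome::Failed,
    })
}
```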
No effect when bucket is already v8. The page_index scan short-circuits cheaply (single integer-equality check per page). Buckets where every page already has cid: Some(_) skip the migration path entirely. Zero overhead on healthy buckets.
No effect when master is unreachable. Migration requires Phase 2 PUT to commit. If master is down at load time, the scan still detects "needs migration" but the actual flush is deferred until next master-up load. The marker is only set after a successful commit, so retry is automatic.
Bounded write cost
Per the existing W.8.4 analysis, a fully-rewritten manifest costs ≤ 5% more bytes than its v7 predecessor (the LinkV2 variant adds roughly 22 bytes per pointer). For a bucket with 16 shards, ~80 pages, and ~500 internal HAMT nodes (a typical user), that's ~50 KB of additional traffic for the migration commit. One-time per bucket per device. Negligible.
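As a rough sanity check on the ~50 KB figure (the pointer fan-out here is an assumption, not a measurement): ~500 internal nodes plus ~80 pages averaging around four outgoing pointers each gives roughly 2,300 pointers, and 2,300 × 22 bytes ≈ 50 KB.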
What this does NOT do
No retroactive offline-walking of unmigrated buckets. Until a user does a master-up load that fires the migration, their bucket stays v7-only. The migration is opt-in by SDK upgrade + first reconnect.
No cross-device coordination. Each device migrates independently the first time it observes the v7 state. If device A migrates and device B reads after, device B sees the v8 root and short-circuits the scan immediately.
Why this is safe to ship default-on
The migration is additive — it only RE-WRITES existing data under a stricter format. It does not change any bucket's logical contents.
The migration is observable — a tracing::info! line on every fire lets operators measure adoption + diagnose stuck buckets.
The migration is rollback-safe — if the migration fix is reverted in a future SDK, the v8-stamped pointers are still readable by every v0.6+ SDK. The marker just becomes stale (gets skipped on next load), and the bucket continues working under whichever wire format the writer emits.
Implementation plan
Add migrate_v7_to_v8_if_needed(bucket) helper on EncryptedClient (~80 LOC).
Wire the trigger scan into load_forest_internal post-decode (~10 LOC).
Add marker get/set methods on BlockCache mirroring the existing store_users_index_state pattern (~40 LOC).
Add walkable_v8_migrate_v7_bucket integration test in crates/fula-client/tests/ (~200 LOC).
Cross-platform alignment check: fula-flutter + fula-js no-op (this is a Rust-internal path), wasm32 compile-clean.
Total: ~330 LOC + test.
Acceptance test (gold standard)
Filed as part of this PR. End-to-end on the user's real images bucket on s3.cloud.fx.land:
Setup: assume the user's images bucket exists and was originally written under pre-v0.6 SDK (verified by the v7-nodes-URL failure mode above).
Pre-fix: spin a fresh EncryptedClient, set endpoint to a non-resolvable URL, attempt list_files_from_forest("images"). MUST return Err with the v7-nodes URL in the error chain.
Apply fix: spin up an EncryptedClient against the REAL master, call the migration trigger (initially manual via a test-only entry point; in production it fires on first load).
Verify post-migration: spin a third EncryptedClient, set endpoint back to bogus URL, repeat list_files_from_forest("images"). MUST succeed.
Verify manifest is v8: inspect the post-migration manifest's page_index, assert every entry has cid: Some(_).
Out of scope / future work
Forced re-migration after a wire-format upgrade past v8. When v9 lands, a similar marker-and-scan pattern will be needed. The marker key is versioned (the v1 suffix) precisely so that future migrations don't conflict.
Migration progress reporting. Users with very large buckets may want a UI indicator. Out of scope for v1; can be added via a future Phase 19 transparency surface.
Per-shard parallelism. The current dirty-flag-and-flush approach re-flushes shards sequentially. Could be parallelized, but adds complexity. Defer until measured to be a bottleneck.
Related
project_walkable_v8_default_on.md documents the "lazy migration per-shard" acceptance trade-off that this issue revisits.