Skip to content

Offline reads break after online write: BLOCKS cache not warmed by writer + 3 collateral SDK gaps #8

@ehsan6sha

Description

@ehsan6sha

Reproducer (FxFiles, real device)

User flow with walkableV8WriterEnabled: true + Phase 2.4 + Phase 3.3 all configured:

  1. Offline (S3 endpoint mutated to non-resolvable s33.cloud.fx.land to simulate master-down): open bucket images → succeeds via warm cache.
  2. Online (real s3.cloud.fx.land): upload IMG-20260509-WA0056(1).jpg to images. Etag returned: bafkr4ibeguudvn5bpgdmgu5fciupmu2rydjpyjm4x47halvotdth6zjpti.
  3. Online: listObjects(images, prefix=""): 29 files (raw forest=29) — succeeds.
  4. Offline (back to s33): listObjects(images)FAILS with:
    Forest load for images failed: AnyhowException(Download failed:
      failed to fetch manifest page 0 for bucket images:
      Master unreachable (health gate; down for ~7s))
    

Same failure on face-metadata, fula-metadata, images — every bucket that was written to during the online window.

Server side is fine

Server logs from cloud.fx.land for the same window show:

  • Phase 2 root commits succeed.
  • ipfs-cluster pins every CID (CID pinned successfully cid=bafkr4ie2vudjzb...).
  • IPNS publisher tick succeeds (sequence=144, ipns_name=k51qzi5uqu5dkkd6tv8slgoouzzs505qdcr4cb5egc9rlx7qwq0e794yxj9cg4).
  • Populated forest_manifest_cid (v0.4.4) bucket=face-metadata cid=... per write.

The breakers are all in fula-client.

Four independent breakers (Cluster A — offline-walkability)

Breaker 1 (PRIMARY) — BLOCKS cache not warmed by write path

crates/fula-client/src/encryption.rs flush path (save_sharded_hamt_forest, Phase 1.5 page commits, Phase 1.6 dir-index commits, Phase 2 root commits) writes bytes to master via S3BlobBackend::put but never calls cache.put(&cid, &bytes) on the local BLOCKS table. Verified by grep: zero cache.put calls in the write path; all cache.put calls live in READ wrappers (client.rs:675, 775, 874 and encryption.rs:3589 cold-start).

Consequence: the only way a write's bytes land in BLOCKS is if a subsequent master-up READ for the same CID happens. In the user's flow, that READ did happen (listObjects(images): 29 files). It should have populated BLOCKS via the cid-hint variant of get_object_with_offline_fallback_known_cid. Then offline read should serve from BLOCKS without master OR gateway.

The fact that offline-after-online-list STILL fails suggests either (a) the cache was re-opened against a different file across reinitializeFulaClient and lost data, (b) the cache's cache.get(cid) returns None for the post-upload CIDs because the cache file lock was held by the prior EncryptedClient, or (c) the in-memory forest cache served stale page_ref.cid values that don't match what BLOCKS has. The fix below pre-warms BLOCKS on every write so no read-after-write step is required.

Breaker 2 — list_buckets has zero offline fallback

crates/fula-client/src/client.rs:282-286:

pub async fn list_buckets(&self) -> Result<ListBucketsResult> {
    let response = self.request("GET", "/", None, None, None).await?;
    let text = response.text().await?;
    parse_list_buckets_response(&text)
}

No health-gate check, no cache, no gateway race, no cold-start. DNS error / master-down propagates raw. FxFiles compensates with its own listBucketsCached shim — but every other consumer of the SDK has to invent the same workaround.

Breaker 3 — Cloudflare dead in default gateway list (top priority slot)

crates/fula-client/src/gateway_fetch.rs:82-91 puts cloudflare-ipfs.com/ipfs/{cid} at slot 0. Cloudflare retired the public IPFS gateway in 2024/2025. The race tries top-3, so Cloudflare consumes a slot returning fast errors. Effective gateway race is 2 alive (dweb.link, ipfs.io) instead of 3.

Breaker 4 — Silent block_cache=None → gateway_pool=None cascade

crates/fula-client/src/client.rs:99-132:

let block_cache = if config.block_cache_enabled {
    match build_block_cache(&config) {
        Ok(cache) => Some(Arc::new(cache)),
        Err(e) => {
            warn!(error = %e, "block_cache: failed to open; offline fallback disabled for this session");
            None
        }
    }
} else { None };

let gateway_pool = if config.gateway_fallback_enabled && block_cache.is_some() {
    ...
} else { None };

If the redb cache file is briefly locked (e.g., by a prior EncryptedClient instance during reinitializeFulaClient before its Arc is dropped), the new client gets BlockCacheError::AlreadyOpen, block_cache becomes None, and gateway_pool is also disabled as a consequence. The entire offline fallback path is dead for the session — with only a warn! line that may not surface in flutter logs.

The cascade is too brittle. The gateway race doesn't actually need the block_cache to work — only the (bucket,key)→cid mapping does. The cid-hint variant of the wrapper has a CID directly. Decouple.

Out of scope here (Cluster B — separate issue)

Forest load for tag-metadata failed: AnyhowException(Encryption error: serialization error: expected value at line 1 column 1)
Forest load for website-metadata failed: ... (same)
Forest load for nft-metadata failed: ... (same)

These are serde_json errors — expected value at line 1 column 1 is the classic "unexpected leading byte" pattern. tag/website/nft buckets are FxFiles JSON-keyed singletons (.fula/tags/<userId>.json). Either the SDK is returning empty bytes (which JSON-parses to that error) or FxFiles is mis-handling a typed SDK error as a body. Different code path; will file separately after triage.

Proposed fixes (single PR, four files, ~120 LOC total)

  1. encryption.rs: after every Phase 1.5 / 1.6 / 2 write in the cascade, call cache.put(&cid, &bytes) for the just-written bytes. Best-effort (a BlockTooLarge error doesn't fail the write).
  2. client.rs::list_buckets: cache the response body keyed by a deterministic per-user key, route through health-gate + offline-fallback. On master-down, serve cached ListBucketsResult if present.
  3. gateway_fetch.rs::default_gateway_urls: drop cloudflare-ipfs.com, re-order to dweb.link, ipfs.io, trustless-gateway.link, 4everland.io, gateway.pinata.cloud, plus one IPFS-only fallback added at the end. 5 working gateways > 6 with a dead one.
  4. client.rs::new: when block_cache fails to open, leave gateway_pool enabled for cid-hint fetches. Decouple the gate.

Test plan

Integration test in crates/fula-client/tests/offline_e2e.rs that exactly mirrors the FxFiles flow:

  • Wiremock master, real BlockCache + stubbed GatewayPool (returning failures).
  • Online write a file → online list (warms cache) → tear master down → offline list → MUST succeed solely from BLOCKS.

Test added pre-fix to demonstrate the failure; passes post-fix.

Verification

  • All four fixes pass dual-reviewer audit + advisor signoff per repo's standard discipline for SDK changes.
  • Cross-platform alignment verified: fula-flutter FRB bindings, fula-js wasm-bindgen surface, wasm gating.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions