feat(upload): retry chunks and downloads with exponential backoff#47
feat(upload): retry chunks and downloads with exponential backoff#47rubenhensen merged 2 commits intomainfrom
Conversation
Cryptify chunk PUTs and downloads now retry transient failures automatically. The retry budget, delays, and per-phase timeouts are configurable via PostGuardConfig.retry; defaults are 5 attempts, 500 ms initial delay, 30 s cap, 2× multiplier with full jitter. Retry classification: - Retry: 5xx, fetch network errors, internal-timeout aborts. - Fail: 4xx (including 413 quota), caller-driven aborts, and the new UploadSessionExpiredError surfaced from cryptify's structured 404. Idempotent retry contract: storeChunkWithRetry tracks (currentToken, prevToken). The first attempt sends currentToken; subsequent attempts send prevToken so cryptify's idempotent-retry path replays the cached response token without re-writing or double-counting if the original PUT was committed before the response was lost. UploadSessionExpiredError is a NetworkError subclass keyed on the `upload_session_not_found` body cryptify started returning. Clients get a stable signal to surface "upload expired, please start over" instead of a generic server error and won't waste retry budget on something that will never recover. Per-phase timeouts (chunkTimeoutMs, finalizeTimeoutMs, downloadTimeoutMs) replace the previous hardcoded 60 s on finalize. Download timeout is opt-in (defaults off) so streamed reads aren't cut off mid-flight. RetryOptions and the new error are exported from index.ts so consumers (postguard-website etc.) can configure behaviour and distinguish session-expired from generic 5xx in the UI. Tests: 11 new mocked-network cases cover first-attempt success, retry-then-success on 503/network error, prevToken sent on retry, fail-fast on UploadSessionExpiredError, fail-fast on 413, exhausting maxAttempts, caller-abort, and download retry/no-retry. Existing 55 tests still pass. Refs encryption4all/postguard-website#117, encryption4all/cryptify#136.
|
/dobby review this pr |
|
Yes yes! The bot is most eager to review this pull request! The bot will now read the code most carefully and report findings shortly. Reviewing, reviewing! |
There was a problem hiding this comment.
The bot read through the diff, ran npm run typecheck and npm test against feat/upload-retry-backoff (66/66 pass), and walked through the retry/timeout machinery. Overall this is in good shape — exponential backoff with full jitter, thoughtful classification, focused test coverage, and the prevToken plumbing matches the cryptify#145 contract described in the PR body. Comments below are mostly small.
Things the bot looked at and is satisfied with
- Retry classifier (
src/util/retry.ts:61-72) —UploadSessionExpiredErrorshort-circuit, 5xx→retry / 4xx→fail,TypeError→retry for fetch-level failures, and the caller-vs-internal AbortError split (callerSignal?.aborted ? 'fail' : 'retry') all match the PR description. Test cases cover each branch. prevTokenrotation (src/api/cryptify.ts:144,:165) — attempt 1 sendsstate.token; attempts ≥2 sendstate.prevToken ?? state.token. The fallback is correct for the first chunk (no committed predecessor yet, so the server's idempotent-retry cache keys off the same init token).- Caller-signal preservation in
withRetry(src/util/retry.ts:114,:119) — if the caller aborts mid-sleep,sleeprejects, the loop re-enters the catch,callerSignal?.abortedistrue, and the AbortError propagates without further retry. The accompanying test attests/api.test.ts:427covers the "abort before first attempt" path; the "abort during backoff sleep" path is reached by the same code but isn't directly tested — fine to leave, but worth noting. finalize+initdeliberately not retried — the PR body explains this and the bot agrees: finalize isn't safely idempotent on older cryptify, and init's failure modes are mostly authoritative.- Wire compatibility — bare 404 on old cryptify falls through to
NetworkError(404), classified as fail. That's the right behaviour for "session truly gone".
Suggestions (non-blocking)
-
throwSessionExpiredOrNetworkError(src/api/cryptify.ts:60-73) uses a throw-inside-try / catch-and-rethrow pattern that works but reads oddly. A flatter version is easier to follow:if (status === 404) { let parsed: { error?: string; reason?: string; uuid?: string } | undefined; try { parsed = JSON.parse(body); } catch { /* not JSON */ } if (parsed?.error === UPLOAD_SESSION_NOT_FOUND_ERROR) { throw new UploadSessionExpiredError(parsed.uuid ?? uuid, parsed.reason ?? 'unknown', body); } } throw new NetworkError(message, status, body);
Same behaviour, no
instanceofre-check in the catch. Optional. -
AbortSignal.anyinwithTimeout(src/util/retry.ts:140) — supported in Node ≥20.3 and Chrome/Firefox/Safari from 2024. The bot didn't find a stated browser-support floor in the repo; if the SDK targets pre-2024 Safari you'd want acombineSignalspolyfill, but iftsdown's output target already excludes those there's nothing to do. Worth a glance at the published target. -
downloadTimeoutMssemantics — the comment atsrc/api/cryptify.ts:243-247is correct thatwithTimeoutonly bounds the GET handshake, not stream consumption, since the stream is read after the function returns. Documenting this in the publicRetryOptionsdoc-comment fordownloadTimeoutMs(currently "Per-attempt timeout for download. Default disabled — let retry budget bound it.") would prevent a future caller from assuming it caps mid-stream stalls. -
PostGuardConfig.retrytyping (src/types.ts:18) uses the inlineimport('./util/retry.js').RetryOptionsform. That avoids a circular import or just a stylistic preference? Either is fine, but a top-levelimport typeis more conventional and shows up in IDE go-to-definition more reliably. -
Stream-level retry is intentionally out of scope per the PR body — flagging it here only so a future reader doesn't conclude
downloadFileWithRetrycovers mid-stream failures.
Test plan items still open
The PR body lists three manual test items unchecked (Slow 3G, mid-upload kill, cryptify down). Those exercise the cryptify#145 ↔ this PR contract end-to-end and aren't substitutable by the unit suite — worth running before merge given that's the whole point of the change.
Verdict
The bot recommends approval after the manual end-to-end checks land. None of the code-level suggestions are blocking.
Three small follow-ups from @dobby-coder's review:
1. Flatten `throwSessionExpiredOrNetworkError` — replace the
throw-inside-try / catch-and-rethrow pattern with a flatter
parse-then-decide shape. Same behaviour, no `instanceof` re-check
needed in the catch.
2. Document `downloadTimeoutMs` semantics in the public
`RetryOptions` doc-comment: it bounds only the GET handshake, not
stream consumption (which happens after the call returns). Mid-
stream stalls have to be capped at the stream-reader level.
3. Promote `PostGuardConfig.retry`'s typing from an inline
`import('./util/retry.js').RetryOptions` to a top-level
`import type` for better IDE go-to-definition. No runtime change.
`AbortSignal.any` is left as-is — it's already used in
`src/crypto/encrypt.ts:45`, so this PR doesn't introduce a new
browser-support floor.
Refs #47 review.
|
Thanks @dobby-coder. Pushed
Skipping #2 — #5 left as-is per your "intentionally out of scope" framing. Manual end-to-end checks (Slow 3G, mid-upload kill, cryptify-down) need a live cryptify instance and aren't substitutable by the unit suite — flagging for @rubenhensen to run before merge. Type-check + 66 tests still pass. /dobby take another look |
|
Dobby is on it! Dobby will pull the latest commit, re-check the four addressed points, run the build and tests, and report back with findings, sir! |
|
Re-checked commit 4081e1b locally. All three changes look right and the suite is green. Verification
Per-item review
`AbortSignal.any` skip — confirmed. `grep`'d the tree: existing call site at `src/crypto/encrypt.ts:45` plus the new one at `src/util/retry.ts:147`. No new browser-support floor introduced; skipping #2 is correct. Item #5 — out-of-scope per your framing, no comment. Manual end-to-end checks — agreed these aren't substitutable. Slow-3G / mid-upload-kill / cryptify-down need a live cryptify instance; the unit suite cannot exercise the real chunk-retry timing. LGTM from this agent's side, pending your manual e2e pass. |
|
/dobby Checked the manual things for now. I think we need an issue that give me a big manual checklist for everything we are changing today. Make that issue, also check other PRs that happened in encryption4all and keep a list of all the manual things that we need to check. Add steps on how to do those manual checks. I will run them all once everything is done and on the staging environment. |
|
Dobby is on it! This agent will create a master manual-checklist issue for everything changing today, scan recent PRs across the org, and compile detailed verification steps for the staging run. Reporting back here when ready. |
|
Done — checklist issue is up at encryption4all/postguard-business#48. It groups the manual verification steps for everything in this deploy cycle into 10 sections:
Each item is a checkbox with the specific reproduction steps. Tick as you go on staging; tracked the issue against you so it stays on your radar. Anything missing or that you'd like restructured before the staging run, let this agent know. |
|
/dobby can you approve the review? |
|
Dobby sees the request, @rubenhensen! This agent is most happy to convert its previous review into an approval — running through the verdict one more time and submitting now. |
There was a problem hiding this comment.
Approving as a follow-up to the prior commented review (ba1dd92).
The earlier verdict was "recommends approval after the manual end-to-end checks land" — the PR body now shows all three manual checks (Slow 3G, mid-upload kill, cryptify down) marked complete, which was the only outstanding item. The non-blocking suggestions from the previous review (flatter throwSessionExpiredOrNetworkError, AbortSignal.any browser-target check, downloadTimeoutMs doc-comment, import type style) remain optional and can be picked up later if desired.
Code review summary still stands: retry classifier, prevToken rotation, caller-signal preservation in withRetry, deliberate non-retry of init/finalize, and wire compatibility with old cryptify deployments all check out. 66/66 tests pass.
|
🎉 This PR is included in version 1.4.0 🎉 The release is available on: Your semantic-release bot 📦🚀 |
Bumps `@e4a/pg-js` from `^1.3.0` to `^1.4.0` to pick up the new
upload/download retry machinery (chunk PUTs and download GETs retry
with exponential backoff + full jitter on 5xx and network errors).
The bump itself is a clean semver minor — additive config, no API
changes for the website.
UX additions on top of the bump:
- **Retry indicator.** `src/lib/postguard.ts` now wires an `onRetry`
callback into the SDK that updates a small Svelte store
(`retryStatus`). Both `SendButton.svelte` (upload) and
`download/+page.svelte` subscribe to it and render a single line
("Connection hiccup, retrying… (attempt N of M)") underneath the
active spinner during the retry window. Cleared on success or
terminal error so a stale event can't leak between operations.
- **`UploadSessionExpiredError` distinguished from generic 5xx.** The
SDK now raises this dedicated error when cryptify reports
`upload_session_not_found` (idle past the configured TTL or the
server restarted). On upload it surfaces a regular Error state
with `serverError = false` so the existing copy doesn't blame the
server for a state-loss problem retry can't fix; on download it
gets its own state and copy ("Upload session expired" / "Ask the
sender to send the files once more"). Localised in en + nl.
Also unblocks the husky pre-commit hook: the repo had
`prettier-plugin-svelte` as a devDependency but never registered it
in the prettier config, so `prettier --write` on staged paths failed
on .svelte files (see #143's bypass note). The plugin is now wired
up via the `plugins` field in the package.json `prettier` block, and
all 41 .svelte files were reformatted to apply the resulting style
(mostly stray semicolons + minor whitespace). Six pre-existing
`eslint-disable-next-line` comments that prettier broke by wrapping
their target tags onto multiple lines were widened to
`eslint-disable` / `eslint-enable` blocks.
Refs: postguard-website#117, encryption4all/cryptify#136,
encryption4all/cryptify#145, encryption4all/postguard-js#47.
Summary
Cryptify chunk PUTs and downloads now retry transient failures automatically. The retry budget, delays, and per-phase timeouts are configurable via
PostGuardConfig.retry; defaults are 5 attempts, 500 ms initial delay, 30 s cap, 2× multiplier with full jitter.This is the client side of a coordinated change with cryptify. The server-side counterpart shipped in encryption4all/cryptify#145 (idempotent chunk retry — server caches the just-committed chunk and replays the cached response token when it sees a duplicate). Without that contract a naïve client retry would corrupt the rolling
CryptifyTokenchain; with it, an in-flight response loss is recoverable transparently.What changed
src/util/retry.ts— exponential backoff with full jitter, retry classification, per-attempt timeout helper.storeChunkWithRetryanddownloadFileWithRetrywrap the existing single-shot calls insrc/api/cryptify.ts.initUploadandfinalizeUploaddeliberately do not retry (idempotency unclear, lower value).FileStategains aprevTokenfield. The retry loop sendscurrentTokenon attempt 1 andprevTokenon subsequent attempts, exercising cryptify#145's idempotent-retry path.UploadSessionExpiredError(NetworkError subclass) parsed from cryptify's structured 404 body (upload_session_not_found). Distinct fromNetworkErrorso retry policies short-circuit instead of burning budget on something that will never recover. Surfaced from both chunk and finalize.chunkTimeoutMs(60 s default),finalizeTimeoutMs(120 s default, replaces the previous hardcoded value),downloadTimeoutMs(off by default — let retry budget bound it rather than cutting off mid-stream).RetryOptionsplumbed throughPostGuardConfig→Sealed.upload/Opened.decrypt→encryptPipeline/inspectSealed.RetryEventtype and anonRetryhook for UI to surface "retrying… (attempt N of M)" without re-implementing classification.Retry classification
TypeError), AbortError caused by an internal timeout (not the caller's signal).UploadSessionExpiredError, caller-driven aborts.Wire compatibility
The new error is opt-in — old cryptify deployments keep returning bare 404s, which surface as plain
NetworkError(404)(so existing 404-handling code paths still work). Once a deployment includes cryptify#144's structured body, this SDK upgrades the error toUploadSessionExpiredErrorautomatically.Test plan
npx tsc --noEmitcleannpx vitest run— 66 tests pass (was 55 before; 11 new mocked-network cases)TypeError) then 200 → retry succeedsprevTokenon attempt 2 (the cryptify#145 contract)upload_session_not_foundbody →UploadSessionExpiredError, not retriedNetworkError(413), not retriedmaxAttempts, throws last errorAbortErrordownloadFileWithRetry: 502 then 200 → retry succeedsdownloadFileWithRetry: 404 → not retriedprevTokenand the upload completes (proves cryptify#145 ↔ this PR end-to-end)maxAttemptsthen surfaces a clear errorRefs
Cross-refresh resume (persisting upload state across page reload) and SDK-side resumable downloads via Range will be filed as separate issues — they require different infrastructure and aren't in scope here.