Implement containers snapshot API #6376

Open
gpanders wants to merge 7 commits into main from ganders/snapshot-2
Conversation

@gpanders
Member

@gpanders gpanders commented Mar 20, 2026

Expose a new experimental snapshotDirectory API for containers and include an implementation using Docker volumes (used by Wrangler for local dev).

This is not yet implemented in production for Cloudflare Workers, but lays the groundwork for that feature.


This PR is (as of now) identical in content to #6337 to make reviewing easier for those who reviewed the last PR. I'll address any review feedback in separate commits to keep things separated.

Implementation notes

The local dev implementation of snapshots combines Docker volumes with the Docker archive API. Docker volumes store and manage the snapshots on disk. This is mostly a UX convenience: workerd-created snapshots are visible in docker volume ls and can be easily removed with docker volume prune. We do not bind-mount the volumes when restoring snapshots, however, because Docker volumes do not have the same properties as snapshots (namely, snapshots should be immutable).

Instead, when creating or restoring a snapshot, we create a temporary container (using the same image as the "main" container) with the volume attached, then copy between the containers using the archive API.

NOTE: This is known to perform poorly on macOS. Each archive API call sends the entire snapshot tar contents over the Docker UNIX socket, which crosses the VM<->host boundary on macOS and so carries a lot of overhead. There may be ways to improve this, but for now the plan is to defer those performance improvements to a follow-up PR so that the functional implementation can land sooner.

@gpanders gpanders requested review from a team as code owners March 20, 2026 17:34
Contributor

@ask-bonk ask-bonk bot left a comment

I'm Bonk, and I've done a quick review of your PR.

Adds an experimental snapshotDirectory API for containers backed by Docker volumes, including snapshot create, restore (with custom mount points), and stale-volume cleanup.

Findings (highest severity first):

  1. [MEDIUM] Dangling kj::StringPtr in deferred deleteDockerVolume cleanup (container-client.c++:1421)
  2. [MEDIUM] latest TypeScript types expose experimental-only snapshot fields and types (types/generated-snapshot/latest/index.d.ts:3785)
  3. [LOW] Global mutable bool staleSnapshotVolumeCheckScheduled without thread-safety annotation (container-client.c++:48)

These comments were generated by an AI code review assistant and may contain inaccuracies.

@ask-bonk
Contributor

ask-bonk bot commented Mar 20, 2026

Review posted successfully on PR #6376. Here's a summary of the findings:

3 issues found, ranked by severity:

  1. [MEDIUM] Dangling kj::StringPtr in deferred deleteDockerVolume cleanup (container-client.c++:1420-1422): The KJ_DEFER in snapshotDirectory passes volumeName as a kj::StringPtr to deleteDockerVolume. When the deferred cleanup fires on scope exit, the deleteDockerVolume coroutine suspends, then snapshotDirectory's frame is destroyed — leaving the StringPtr dangling when the inner coroutine resumes. Suggested fix: pass an owned copy via kj::str(volumeName).

  2. [MEDIUM] latest TypeScript types expose experimental-only fields (latest/index.d.ts:3785): Despite the JSG_STRUCT_TS_OVERRIDE_DYNAMIC correctly excluding snapshots in the non-experimental branch, the generated latest types include snapshots, hardTimeout, and all the new snapshot-related types (ContainerDirectorySnapshot, ContainerSnapshotRestoreParams, etc.). Users on latest compat date would see these in autocomplete but get runtime errors.

  3. [LOW] Global mutable bool without synchronization (container-client.c++:48): staleSnapshotVolumeCheckScheduled is a non-atomic global bool. Safe in practice (single event loop), but fragile. Suggested using std::atomic<bool>.

@codecov-commenter

codecov-commenter commented Mar 20, 2026

Codecov Report

❌ Patch coverage is 0% with 310 lines in your changes missing coverage. Please review.
✅ Project coverage is 70.52%. Comparing base (b4fc0eb) to head (e91f795).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/workerd/server/container-client.c++ | 0.00% | 257 Missing ⚠️ |
| src/workerd/api/container.c++ | 0.00% | 53 Missing ⚠️ |

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #6376      +/-   ##
==========================================
- Coverage   70.69%   70.52%   -0.17%     
==========================================
  Files         424      424              
  Lines      115450   115732     +282     
  Branches    18770    18824      +54     
==========================================
+ Hits        81619    81624       +5     
- Misses      22599    22882     +283     
+ Partials    11232    11226       -6     

@codspeed-hq

codspeed-hq bot commented Mar 20, 2026

Merging this PR will not alter performance

✅ 70 untouched benchmarks
⏩ 129 skipped benchmarks [1]


Comparing ganders/snapshot-2 (e91f795) with main (b4fc0eb)

Footnotes

  1. 129 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, click here and archive them to remove them from the performance reports.

@danlapid
Collaborator

CC @anonrig for review

@danlapid danlapid requested a review from anonrig March 20, 2026 18:42
@gpanders
Member Author

/bonk review

Contributor

@ask-bonk ask-bonk bot left a comment

3 findings ranked by severity:

  1. [MEDIUM] normalizePath performs O(n) allocations for n trailing slashes (container-client.c++:51-56)
  2. [LOW] Snapshot ID validation allows all-hyphens, which produces a Docker volume named workerd-snap---- (container-client.c++:1258-1263)
  3. [LOW] normalizePath ordering with validateAbsolutePath in snapshotDirectory (container-client.c++:1388-1389)

Previous review findings from PR #6337 and the prior review on this PR have been addressed:

  • The dangling kj::StringPtr in KJ_DEFER (finding #1 from ask-bonk) is already fixed — line 1421 passes kj::str(volumeName) (owned copy).
  • The staleSnapshotVolumeCheckScheduled global is already std::atomic_bool (line 48).
  • The latest types issue is a pre-existing pipeline problem (author confirmed), not a regression.

Overall: the design is sound. The capnp schema additions, compat-flag gating, path validation, snapshot ID validation during restore, and test coverage (round-trip, named, multi-directory, custom mount point, root restore, cross-DO, non-existent directory) are thorough. The KJ_DEFER + volumeCommitted cleanup pattern correctly handles partial failures.


These comments were generated by an AI code review assistant and may contain inaccuracies.

Comment on lines +1258 to +1263
JSG_REQUIRE(
    snapshotId.size() > 0 && snapshotId.size() <= 64, Error, "Invalid snapshot ID length");
for (auto c: snapshotId) {
  JSG_REQUIRE((c >= 'a' && c <= 'f') || (c >= '0' && c <= '9') || c == '-', Error,
      "Invalid snapshot ID: must contain only hex digits and hyphens");
}
Contributor

[LOW] Snapshot ID validation allows degenerate IDs.

The character-class check permits IDs consisting entirely of hyphens (e.g., "----") or a single hyphen, which would create Docker volume names like workerd-snap----. This is unlikely to cause breakage since Docker accepts these names, but a leading/trailing hyphen check or requiring at least one hex digit would be a cheap defensive measure:

JSG_REQUIRE(snapshotId.size() > 0 && snapshotId.size() <= 64, Error, "Invalid snapshot ID length");
bool hasHex = false;
for (auto c: snapshotId) {
  if ((c >= 'a' && c <= 'f') || (c >= '0' && c <= '9')) {
    hasHex = true;
  } else {
    JSG_REQUIRE(c == '-', Error,
        "Invalid snapshot ID: must contain only hex digits and hyphens");
  }
}
JSG_REQUIRE(hasHex, Error, "Invalid snapshot ID: must contain at least one hex digit");

This is minor — the IDs are generated by randomUUID() during creation and this validation only runs during restore from deserialized data.

Comment on lines +1388 to +1389
validateAbsolutePath(dir);
dir = normalizePath(kj::mv(dir));
Contributor

[LOW] validateAbsolutePath runs after normalizePath, which is fine, but the validation and normalization are also done in the restore path (lines 1266-1268) with normalizePath before validateAbsolutePath. The ordering is consistent and correct — normalizePath strips trailing slashes which doesn't change the absolute-path or path-traversal properties, so validation after normalization is safe.

Noting this for clarity: if normalizePath were ever extended to do more transformations (e.g., collapsing // or resolving .), the validation-after-normalization order could become important. The current implementation is fine.

@ask-bonk
Contributor

ask-bonk bot commented Mar 20, 2026

Review posted on PR #6376. Here's a summary:

Review body (top-level comment, review ID 3983426108):

  • 3 ranked findings with confirmation that prior review findings have been addressed

3 inline comments:

  1. [MEDIUM] container-client.c++:53: normalizePath allocates a new kj::String per trailing slash in its while loop. Suggested a single-pass approach that does at most one allocation.

  2. [LOW] container-client.c++:1258-1263: Snapshot ID validation allows degenerate IDs consisting entirely of hyphens. Suggested requiring at least one hex digit for defense-in-depth.

  3. [LOW] container-client.c++:1388-1389: Noted the validate-after-normalize ordering is correct and consistent between both the snapshot creation and restore paths. Flagged as informational for future maintenance.

Key verification results:

  • The dangling kj::StringPtr finding from the prior review is already fixed (line 1421 uses kj::str(volumeName))
  • The staleSnapshotVolumeCheckScheduled global is already std::atomic_bool
  • The latest types issue is a pre-existing pipeline concern, confirmed by the author
  • The capnp schema, compat-flag gating, path validation, and test coverage are all solid
  • No coroutine lifetime issues found — no jsg::Lock or V8 handles held across co_await

@github-actions

github-actions bot commented Mar 20, 2026

The generated output of @cloudflare/workers-types matches the snapshot in types/generated-snapshot 🎉

@gpanders gpanders force-pushed the ganders/snapshot-2 branch from 971c76e to 7c84356 Compare March 20, 2026 18:59
gpanders and others added 4 commits March 23, 2026 08:34
Expose a new experimental `snapshotDirectory` API for containers and
include an implementation using Docker volumes (used by Wrangler for
local dev).

This is not yet implemented in production for Cloudflare Workers, but
lays the groundwork for that feature.
Co-authored-by: ask-bonk[bot] <249159057+ask-bonk[bot]@users.noreply.github.com>
@gpanders gpanders force-pushed the ganders/snapshot-2 branch from 7c84356 to 6fad9a1 Compare March 23, 2026 13:35
@gpanders
Member Author

Rebased on main and resolved conflicts.

* Use .exchange() to update atomic boolean
* Use consistent ordering for normalize -> validate code paths
* Add tests for snapshotting a stopped container and restoring a
  container with a non-existent snapshot ID
@anonrig
Member

anonrig commented Mar 23, 2026

/bigbonk roast this pr

Co-authored-by: anonrig <anonrig@users.noreply.github.com>
@ask-bonk
Contributor

ask-bonk bot commented Mar 23, 2026

Here's a summary of what was done:

Addressed anonrig's review feedback on src/workerd/api/container.c++:139:

  • Extracted the magic number (1ull << 53) - 1 into a named constexpr uint64_t MAX_SAFE_INTEGER at file scope (line 17)
  • Added a comment explaining the reasoning: snapshot sizes are uint64 in capnp but exposed as double in JS, so we cap at Number.MAX_SAFE_INTEGER to avoid precision loss
  • Updated both usage sites (line 95 in start() and line 143 in snapshotDirectory()) to reference the named constant
  • This follows existing precedent in the codebase (identity-transform-stream.c++:358 uses the same pattern)
  • Replied to the review comment on the PR
