
docs(bench): recontextualise CubeSandbox row + small-N replay numbers#42

Merged
WaylandYang merged 1 commit into main from dev
May 14, 2026

Conversation

@WaylandYang
Contributor

Follow-up to TencentCloud/CubeSandbox#235. The Cube maintainer responded with two clarifications that change how the 20.3 s / 77-of-100 N=100 number should be interpreted (without changing the measurement itself).

What the Cube team said

  1. The reflink-copy race is on a slow code path the original template inadvertently selected. CubeSandbox pre-formats a pool of writable-layer ext4 images at sizes listed in `pool_default_format_size_list` (default `["1Gi"]`). A sandbox whose `writable_layer_size` matches one of those sizes reuses a pool entry — fast path, no `mkfs.ext4` or reflink-copy per sandbox. We passed `--writable-layer-size 2Gi`, which doesn't match, so every sandbox went through the live `mkfs.ext4 + reflink-copy` path. That's where the bad-magic race lives.
  2. Cube's published <60 ms single-instance / P95 90 ms @ N=50 / <200 ms @ N=100 numbers are measured on a 96 vCPU server. Our 20 vCPU host (the dev box) is outside their tested matrix.
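The size-matching rule in point 1 can be sketched as below. `pool_default_format_size_list` and `writable_layer_size` are CubeSandbox's config names, but the function itself is purely illustrative — it is not CubeSandbox's actual code, just the decision the maintainer described:

```shell
#!/usr/bin/env bash
# Illustrative sketch: a requested writable-layer size that exactly matches
# a pool entry takes the fast path (pre-formatted image reuse); anything
# else falls through to the live mkfs.ext4 + reflink-copy slow path.
pool_default_format_size_list="1Gi"   # CubeSandbox default

select_writable_layer_path() {
  local requested="$1"
  for size in $pool_default_format_size_list; do
    if [ "$requested" = "$size" ]; then
      echo "fast-path: reuse pre-formatted ${size} pool image"
      return 0
    fi
  done
  echo "slow-path: live mkfs.ext4 + reflink-copy for ${requested}"
  return 1
}

select_writable_layer_path 1Gi   # matches pool → fast path
select_writable_layer_path 2Gi   # our benchmark config → slow path
```

This is why `--writable-layer-size 2Gi` put every sandbox on the racy path: 2Gi is not in the default pool list, so no pre-formatted image exists to reuse.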

Cube also accepted the first two improvements from our issue (configurable `cmdTimeout`, richer diagnostic info on `newExt4RawByReflinkCopy` failures) and is reviewing the third (drop per-clone `e2fsck`).

Doc changes

  • README.md — both footnotes ¹ rewritten to lead with "slow-path measurement on a host outside CubeSandbox's documented testing matrix". The 20.3 s figure itself stays in the table. The footnote now explicitly says we did not re-test the fast-path configuration.
  • bench/CUBESANDBOX.md — two new sections:
    • Upstream response (2026-05-14) with the clarifications from Cube and the status of our improvement proposals.
    • Small-N replay on the same (slow-path) configuration with N=1/5/10 numbers — done so the row's narrative isn't just "we hit a race once."
  • bench/cube-replay.sh — the script that produced the small-N numbers.
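For orientation, the replay loop has roughly this shape. This is a minimal sketch, not the actual bench/cube-replay.sh from this PR; `spawn_sandbox` is a placeholder for the real CubeSandbox invocation (with `--writable-layer-size 2Gi`), which is not reproduced here:

```shell
#!/usr/bin/env bash
# Sketch of a small-N replay: spawn N sandboxes concurrently, count
# successes, and report wall-clock and per-sandbox cost in milliseconds.
spawn_sandbox() { sleep 0.01; }   # placeholder for the real spawn command

replay() {
  local n="$1" ok=0 pids=""
  local start_ms=$(($(date +%s%N) / 1000000))
  for _ in $(seq "$n"); do
    spawn_sandbox &
    pids="$pids $!"
  done
  for pid in $pids; do
    wait "$pid" && ok=$((ok + 1))   # a failed spawn would not increment ok
  done
  local end_ms=$(($(date +%s%N) / 1000000))
  local wall=$((end_ms - start_ms))
  echo "N=$n succeeded=$ok/$n wall=${wall}ms per-sandbox=$((wall / n))ms"
}

replay 1
replay 5
replay 10
```

Concurrent spawning matters for the per-sandbox column: the fixed setup cost amortises across in-flight sandboxes, which is why per-sandbox cost shrinks as N grows.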

Small-N replay results

Same 2 GiB template (slow path), this dev box (20 vCPU / 30 GiB):

| N  | Succeeded | Wall-clock | Per-sandbox |
|----|-----------|------------|-------------|
| 1  | 1/1       | 924 ms     | 924 ms      |
| 5  | 5/5       | 2,207 ms   | 441 ms      |
| 10 | 10/10     | 2,567 ms   | 257 ms      |

100% success at every size measured — the race is specifically a slow-path-at-high-N phenomenon. The 20.3 s / 100 = 203 ms-per-sandbox figure at N=100 is consistent with this trend of per-sandbox cost shrinking as concurrency grows.

Test plan

  • README footnotes render OK on GitHub
  • bench/CUBESANDBOX.md anchor link from README footnote works
  • bench/cube-replay.sh is executable + matches the numbers reported

🤖 Generated with Claude Code

… small-N replay

After filing TencentCloud/CubeSandbox#235, the Cube maintainer
confirmed two things that recontextualise the 20.3 s / 77-of-100
N=100 number we'd published:

1. The reflink-copy race lives on a slow code path the original
   template inadvertently selected — `--writable-layer-size 2Gi`
   doesn't match the default `pool_default_format_size_list =
   ["1Gi"]`, so every sandbox went through live `mkfs.ext4 +
   reflink-copy` instead of the pool fast path. The fast path
   (writable_layer_size matches pool) doesn't run the racy code.
2. Cube's published "<60 ms single-instance / P95 90 ms @ N=50 /
   <200 ms @ N=100" numbers are measured on a 96 vCPU server. Our
   20 vCPU host is outside their tested matrix.

Neither of those changes the fact that the 20.3 s figure is what we
measured. They do change how the row should be interpreted, so:

- README.md: both footnotes ¹ rewritten to lead with "slow-path
  measurement on a host outside CubeSandbox's documented testing
  matrix". The 20.3 s number stays in the table. The footnote now
  explicitly says we did not re-test the fast-path configuration.
- bench/CUBESANDBOX.md: two new sections.
    "Upstream response (2026-05-14)" — the two clarifications from
    Tencent verbatim, plus a note that they accepted the first two
    fixes from our issue (configurable cmdTimeout, richer error
    diagnostics) and are reviewing the third.
    "Small-N replay on the same (slow-path) configuration" — we
    re-ran with the same 2 GiB template at N=1, N=5, N=10 to fit
    the 30 GiB host RAM budget. 100% success at every size; cold
    start ~924 ms, per-sandbox cost shrinks to 257 ms at N=10
    (consistent with the 20.3 s / 100 = 203 ms-per-sandbox at
    N=100). Confirms the race is a slow-path-at-high-N phenomenon,
    not a general spawn-time issue.
- bench/cube-replay.sh: the script that produced the small-N
  numbers, parking it next to the rest of bench/.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@WaylandYang WaylandYang merged commit 66ff403 into main May 14, 2026
1 check passed
