test(uffd): lift cross-process test infrastructure for UFFD work#2461
test(uffd): lift cross-process test infrastructure for UFFD work#2461ValentaTomas merged 2 commits intomainfrom
Conversation
Lift small, well-scoped pieces of the planned UFFD test infrastructure from feat/free-page-reporting (#1896) so the remaining REMOVE work in that branch can be reviewed in isolation. Production change (intentionally tiny, ~3 lines): - Add Userfaultfd.defaultCopyMode CULong, OR-ed into every UFFDIO_COPY. Production leaves it at 0; tests can set it to UFFDIO_COPY_MODE_WP to validate prefault-style flows where every copy carries the WP bit. Test infrastructure: - helpers_test.go: introduce h.executeAll / h.executeOperation so test bodies stop duplicating the read/write switch. - cross_process_helpers_test.go: * Pass -test.timeout=0 to the helper child. The parent's t.Context() is now the sole authority on helper lifetime, preventing confusing "test killed" output when go test's default timeout fires inside the child while the parent is still doing real work. * Wire GO_ALWAYS_WP=1 (set when testConfig.alwaysWP is true) so the helper sets uffd.defaultCopyMode = UFFDIO_COPY_MODE_WP. Tests: - async_wp_test.go: refactor to h.executeAll and add four alwaysWP subtests (4k/hugepage x missing-write/mixed) that exercise the new copy-mode hook end-to-end. - missing_test.go / missing_write_test.go: refactor to h.executeAll and drop the parallel-operation count from 1_000_000 to 10_000. The old count regularly pushed the helper child into multi-minute runs under -race; 10_000 still meaningfully exercises parallel pagefault handling. What this PR explicitly does NOT contain (deferred to follow-ups on top of #1896): UFFD_EVENT_REMOVE handling, page tracker, deferred pagefaults, wakeup pipe, gated/pause-resume test mode, unregister(), UFFD_FEATURE_EVENT_REMOVE registration, and remove_test.go.
PR SummaryLow Risk Overview Reviewed by Cursor Bugbot for commit 11a7aaa. Bugbot is set up for automated code reviews on this repo. Configure here. |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: e416f5c65f
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| cmd.Env = append(os.Environ(), "GO_TEST_HELPER_PROCESS=1") | ||
| cmd.Env = append(cmd.Env, fmt.Sprintf("GO_MMAP_START=%d", memoryStart)) | ||
| cmd.Env = append(cmd.Env, fmt.Sprintf("GO_MMAP_PAGE_SIZE=%d", tt.pagesize)) | ||
| if tt.alwaysWP { |
There was a problem hiding this comment.
Isolate GO_ALWAYS_WP from parent environment
The helper process environment is seeded from os.Environ() and GO_ALWAYS_WP is only appended when tt.alwaysWP is true, so any ambient GO_ALWAYS_WP value from the parent shell can override the test config and change helper behavior unexpectedly. This makes configureCrossProcessTest non-deterministic across environments (e.g., non-alwaysWP cases can still run with forced WP), which can produce flaky or misleading test results.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
LGTM — test-only refactor with a trivially safe production hook (zero-default field).
Extended reasoning...
Overview
This PR adds cross-process UFFD test infrastructure: a shared executeAll/executeOperation helper, an alwaysWP test config flag that wires GO_ALWAYS_WP=1 into the child process environment, -test.timeout=0 on the helper child to prevent confusing timeout output, four new alwaysWP subtests in async_wp_test.go, and a parallel-operation count reduction (1M→10K). The sole production change is adding defaultCopyMode CULong (zero-initialized) to Userfaultfd, OR-ed into every UFFDIO_COPY; since it is zero in all production code paths, behavior is entirely unchanged.
Security risks
None. The new field is unexported, never set in production code, and the env-var wiring is confined to test helper processes. No auth, crypto, or permission logic is touched.
Level of scrutiny
Low. The production diff is ~3 lines adding a zero-default field; the remainder is test-file refactoring following established patterns already present in the package. Both reported bugs are test-only issues (one pre-existing, one a minor isolation nit for a brand-new project-specific env var) and do not affect production correctness.
Other factors
The two bug findings are minor: the missing root guard in missing_write_test.go is a pre-existing inconsistency not introduced by this PR, and the GO_ALWAYS_WP inheritance issue is a hygiene nit for a newly introduced, project-internal variable unlikely to appear in any real CI environment. Neither warrants blocking. CI runs are still pending, but the compile-only smoke test passes and the logic is straightforward enough that approval is reasonable here.
| cmd.Env = append(cmd.Env, fmt.Sprintf("GO_MMAP_PAGE_SIZE=%d", tt.pagesize)) | ||
| if tt.alwaysWP { | ||
| cmd.Env = append(cmd.Env, "GO_ALWAYS_WP=1") | ||
| } | ||
|
|
||
| dup, err := syscall.Dup(int(uffdFd)) | ||
| require.NoError(t, err) |
There was a problem hiding this comment.
🟡 In configureCrossProcessTest, the child environment is built from os.Environ() and GO_ALWAYS_WP=1 is only appended when tt.alwaysWP is true — it is never explicitly removed or overridden when false. If a developer runs GO_ALWAYS_WP=1 go test ./..., every spawned helper process inherits the variable via os.Environ(), causing non-alwaysWP subtests to run the alwaysWP code path and breaking test isolation. Fix by always explicitly setting GO_ALWAYS_WP to either "0" or "1" rather than conditionally appending.
Extended reasoning...
What the bug is and how it manifests
In configureCrossProcessTest (cross_process_helpers_test.go:112-118), the child process command environment is assembled as:
cmd.Env = append(os.Environ(), "GO_TEST_HELPER_PROCESS=1")
// ...
if tt.alwaysWP {
cmd.Env = append(cmd.Env, "GO_ALWAYS_WP=1")
}The code only appends GO_ALWAYS_WP=1 when alwaysWP is true, but never removes or overrides it when alwaysWP is false. Because the base environment comes from os.Environ(), any value of GO_ALWAYS_WP present in the parent process is inherited unconditionally by every child.
The specific code path that triggers it
If a developer or CI runner executes GO_ALWAYS_WP=1 go test ./... (e.g., to quickly re-run only the alwaysWP subtests and then forgets to unset the variable), the parent process has GO_ALWAYS_WP=1 in its environment. os.Environ() captures that value, and every helper child launched — including those for alwaysWP=false subtests — inherits it. Inside crossProcessServe() (cross_process_helpers_test.go:303-305):
if os.Getenv("GO_ALWAYS_WP") == "1" {
uffd.defaultCopyMode = UFFDIO_COPY_MODE_WP
}This sets defaultCopyMode = UFFDIO_COPY_MODE_WP even for tests that are supposed to exercise the normal (non-WP) copy path.
Why existing code doesn't prevent it
The pattern cmd.Env = append(os.Environ(), ...) is standard Go helper-process idiom and is not inherently wrong — the issue is specifically that a conditionally appended variable is never explicitly cleared. Since cmd.Env is evaluated left-to-right and multiple occurrences of the same variable are resolved by position (first or last, depending on OS), and since the code never appends GO_ALWAYS_WP=0 for the false branch, there is no defensive override.
What the impact would be
Non-alwaysWP tests (e.g., "4k write to missing page") would run the helper with defaultCopyMode = UFFDIO_COPY_MODE_WP, meaning every UFFDIO_COPY carries the WP bit. The tests still pass because UFFD_FEATURE_WP_ASYNC automatically clears the WP bit on write faults, but the code path being exercised is the alwaysWP path, not the normal path. This defeats the purpose of having both variants and means regressions in the normal path would not be caught.
Step-by-step proof
- Developer runs:
GO_ALWAYS_WP=1 go test -run TestAsyncWriteProtection ./pkg/sandbox/uffd/userfaultfd/ - The test binary's process environment contains
GO_ALWAYS_WP=1. configureCrossProcessTestis called for the subtest"4k write to missing page"wherett.alwaysWP = false.cmd.Env = append(os.Environ(), ...)—GO_ALWAYS_WP=1is now incmd.Env.- The
if tt.alwaysWPbranch is false, so nothing overrides it. - The child starts and
crossProcessServereadsos.Getenv("GO_ALWAYS_WP") == "1"→ true → setsuffd.defaultCopyMode = UFFDIO_COPY_MODE_WP. - The subtest runs the non-alwaysWP code path with WP always set — silent wrong behavior.
Addressing the refutation
The refuter correctly notes that GO_ALWAYS_WP is a brand-new, project-specific variable introduced by this very PR and would not appear in any real CI or external environment. This makes the scenario unlikely in practice. The tests also still pass due to WP_ASYNC. However, the issue is a straightforward hygiene fix (add an else branch setting GO_ALWAYS_WP=0, or filter it from os.Environ()) and the test isolation concern is legitimate — the purpose of the non-alwaysWP subtests is precisely to verify the non-WP code path. Silently running the wrong variant reduces the diagnostic value of the test suite.
| h, err := configureCrossProcessTest(t, tt) | ||
| require.NoError(t, err) | ||
|
|
||
| for _, operation := range tt.operations { | ||
| if operation.mode == operationModeRead { | ||
| err := h.executeRead(t.Context(), operation) | ||
| require.NoError(t, err, "for operation %+v", operation) | ||
| } | ||
|
|
||
| if operation.mode == operationModeWrite { | ||
| err := h.executeWrite(t.Context(), operation) | ||
| require.NoError(t, err, "for operation %+v", operation) | ||
| } | ||
| } | ||
| h.executeAll(t, tt.operations) | ||
|
|
||
| expectedAccessedOffsets := getOperationsOffsets(tt.operations, operationModeRead|operationModeWrite) | ||
|
|
There was a problem hiding this comment.
🟣 The four tests in missing_write_test.go (TestMissingWrite, TestParallelMissingWrite, TestParallelMissingWriteWithPrefault, TestSerialMissingWrite) are missing the if os.Geteuid() \!= 0 { t.Skip(...) } guard present in missing_test.go (line 17) and async_wp_test.go (line 33). Without this guard, running these tests as a non-root user produces a hard kernel permission-denied failure via require.NoError rather than a clean skip, making the test suite appear broken to contributors. This is a pre-existing inconsistency that the PR refactors but does not fix.
Extended reasoning...
What the bug is and how it manifests
missing_write_test.go does not import the os package and contains no os.Geteuid() call anywhere in the file. All four of its top-level test functions — TestMissingWrite, TestParallelMissingWrite, TestParallelMissingWriteWithPrefault, and TestSerialMissingWrite — proceed directly to configureCrossProcessTest(t, tt) without first checking whether the calling process has root privileges.
The specific code path that triggers it
configureCrossProcessTest (cross_process_helpers_test.go) calls newFd(syscall.O_CLOEXEC | syscall.O_NONBLOCK), configureApi(uffdFd, tt.pagesize), and register(uffdFd, memoryStart, uint64(size), UFFDIO_REGISTER_MODE_MISSING|UFFDIO_REGISTER_MODE_WP) in sequence. The UFFDIO_REGISTER_MODE_MISSING|UFFDIO_REGISTER_MODE_WP combination requires root or CAP_SYS_PTRACE (or vm.unprivileged_userfaultfd=1). On a standard developer machine, newFd() or configureApi() will return EPERM from the kernel ioctl.
Why existing code does not prevent it
Each of the four test functions calls require.NoError(t, err) immediately after configureCrossProcessTest, which calls t.FailNow() on the returned error. There is no t.Skip path, so a non-root run results in a hard test failure rather than an informative skip. By contrast, TestMissing (missing_test.go:17-18) and TestAsyncWriteProtection (async_wp_test.go:33-34) both guard with if os.Geteuid() \!= 0 { t.Skip("skipping test as not running as root") } before any privileged operation.
What the impact would be
A contributor running go test ./pkg/sandbox/uffd/... without root sees four unexplained failures. The error messages (permission denied on an ioctl) are not obviously linked to privilege requirements, so developers spend time diagnosing what appears to be a broken test suite rather than realising they simply need root.
How to fix it
Add "os" to the import block of missing_write_test.go and insert the following guard at the top of each of the four test functions (before the first configureCrossProcessTest call):
if os.Geteuid() \!= 0 {
t.Skip("skipping test as not running as root")
}Step-by-step proof
- Developer clones repo, runs
go test -count=1 -run TestMissingWrite ./pkg/sandbox/uffd/userfaultfd/as a non-root user. TestMissingWritecallsconfigureCrossProcessTest(t, tt).- Inside
configureCrossProcessTest,newFd(syscall.O_CLOEXEC | syscall.O_NONBLOCK)performs theuserfaultfd(2)syscall — rejected with EPERM since the process lacksCAP_SYS_PTRACEandvm.unprivileged_userfaultfdis 0. require.NoError(t, err)on the line immediately after callst.FailNow()with the EPERM error.- Test is reported as FAILED instead of SKIPPED. The same happens for the three parallel/serial variants.
- Compare with
TestMissing: step 1 isos.Geteuid() \!= 0which triggerst.Skip— test is reported as SKIPPED with an explanatory message.
Drop the speculative pageStatesOnce/handlerPageStates plumbing and the extra prose comments that were not present in #1896. The PR now ports only the changes that are needed for the alwaysWP coverage and the checkDirtiness pagemap assertions.
…_000_000 Reverts the 1M -> 10K reduction made in #2461. Intermittent failures under 1M are bugs to be hunted, not silenced by lowering the load. The other parallel/serial variants in these files have legitimately been at 10K since their introduction in November 2025; only the two TestParallelMissing(/Write) cases were reduced.
Summary
A small slice of #1896 (free page reporting / UFFD REMOVE) lifted out so the remaining REMOVE-specific work can be reviewed in isolation. This PR is tests + tiny test-only hook only; no behavior change in production.
Production change (intentionally minimal: ~3 lines)
Userfaultfd.defaultCopyMode CULong, OR-ed into the copy mode of everyUFFDIO_COPY. Production leaves it at0, so behavior is unchanged. Tests can set it toUFFDIO_COPY_MODE_WPto validate prefault-style flows where every copy carries the WP bit.Test infrastructure
helpers_test.go: introduceh.executeAll/h.executeOperationso test bodies stop duplicating the read/write switch.cross_process_helpers_test.go:-test.timeout=0to the helper child. The parent'st.Context()is now the sole authority on helper lifetime, preventing confusing "test killed" output whengo test's default timeout fires inside the child while the parent is still doing real work.GO_ALWAYS_WP=1(set whentestConfig.alwaysWPistrue) so the helper setsuffd.defaultCopyMode = UFFDIO_COPY_MODE_WP.Tests
async_wp_test.go: refactor toh.executeAlland add fouralwaysWPsubtests (4k / hugepage × missing-write / mixed) that exercise the new copy-mode hook end-to-end.missing_test.go/missing_write_test.go: refactor toh.executeAlland drop the parallel-operation count from1_000_000→10_000. The old count regularly pushed the helper child into multi-minute runs under-race;10_000still meaningfully exercises parallel pagefault handling.Explicitly out of scope (deferred to follow-ups on top of #1896)
UFFD_EVENT_REMOVEhandlinggated/ pause-resume test modeunregister()cleanup helperUFFD_FEATURE_EVENT_REMOVEregistrationremove_test.goTest plan
go vet ./pkg/sandbox/uffd/...go test -count=1 -run '^$' ./pkg/sandbox/uffd/...(compile)TestAsyncWriteProtection(root, KVM-capable host) including newalwaysWPsubtests