Skip to content

test(uffd): lift cross-process test infrastructure for UFFD work#2461

Merged
ValentaTomas merged 2 commits intomainfrom
test/uffd-cross-process-helpers
Apr 21, 2026
Merged

test(uffd): lift cross-process test infrastructure for UFFD work#2461
ValentaTomas merged 2 commits intomainfrom
test/uffd-cross-process-helpers

Conversation

@ValentaTomas
Copy link
Copy Markdown
Member

Summary

A small slice of #1896 (free page reporting / UFFD REMOVE) lifted out so the remaining REMOVE-specific work can be reviewed in isolation. This PR is tests + tiny test-only hook only; no behavior change in production.

Production change (intentionally minimal: ~3 lines)

  • Add Userfaultfd.defaultCopyMode CULong, OR-ed into the copy mode of every UFFDIO_COPY. Production leaves it at 0, so behavior is unchanged. Tests can set it to UFFDIO_COPY_MODE_WP to validate prefault-style flows where every copy carries the WP bit.

Test infrastructure

  • helpers_test.go: introduce h.executeAll / h.executeOperation so test bodies stop duplicating the read/write switch.
  • cross_process_helpers_test.go:
    • Pass -test.timeout=0 to the helper child. The parent's t.Context() is now the sole authority on helper lifetime, preventing confusing "test killed" output when go test's default timeout fires inside the child while the parent is still doing real work.
    • Wire GO_ALWAYS_WP=1 (set when testConfig.alwaysWP is true) so the helper sets uffd.defaultCopyMode = UFFDIO_COPY_MODE_WP.

Tests

  • async_wp_test.go: refactor to h.executeAll and add four alwaysWP subtests (4k / hugepage × missing-write / mixed) that exercise the new copy-mode hook end-to-end.
  • missing_test.go / missing_write_test.go: refactor to h.executeAll and drop the parallel-operation count from 1_000_00010_000. The old count regularly pushed the helper child into multi-minute runs under -race; 10_000 still meaningfully exercises parallel pagefault handling.

Explicitly out of scope (deferred to follow-ups on top of #1896)

  • UFFD_EVENT_REMOVE handling
  • page tracker / page state machine
  • deferred pagefaults + wakeup pipe
  • gated / pause-resume test mode
  • unregister() cleanup helper
  • UFFD_FEATURE_EVENT_REMOVE registration
  • remove_test.go

Test plan

  • go vet ./pkg/sandbox/uffd/...
  • go test -count=1 -run '^$' ./pkg/sandbox/uffd/... (compile)
  • CI x86 unit tests
  • CI ARM64 tests
  • Manually run TestAsyncWriteProtection (root, KVM-capable host) including new alwaysWP subtests

Lift small, well-scoped pieces of the planned UFFD test infrastructure
from feat/free-page-reporting (#1896) so the remaining REMOVE work in
that branch can be reviewed in isolation.

Production change (intentionally tiny, ~3 lines):
- Add Userfaultfd.defaultCopyMode CULong, OR-ed into every UFFDIO_COPY.
  Production leaves it at 0; tests can set it to UFFDIO_COPY_MODE_WP
  to validate prefault-style flows where every copy carries the WP bit.

Test infrastructure:
- helpers_test.go: introduce h.executeAll / h.executeOperation so test
  bodies stop duplicating the read/write switch.
- cross_process_helpers_test.go:
  * Pass -test.timeout=0 to the helper child. The parent's t.Context()
    is now the sole authority on helper lifetime, preventing confusing
    "test killed" output when go test's default timeout fires inside
    the child while the parent is still doing real work.
  * Wire GO_ALWAYS_WP=1 (set when testConfig.alwaysWP is true) so the
    helper sets uffd.defaultCopyMode = UFFDIO_COPY_MODE_WP.

Tests:
- async_wp_test.go: refactor to h.executeAll and add four alwaysWP
  subtests (4k/hugepage x missing-write/mixed) that exercise the new
  copy-mode hook end-to-end.
- missing_test.go / missing_write_test.go: refactor to h.executeAll and
  drop the parallel-operation count from 1_000_000 to 10_000. The old
  count regularly pushed the helper child into multi-minute runs under
  -race; 10_000 still meaningfully exercises parallel pagefault handling.

What this PR explicitly does NOT contain (deferred to follow-ups on
top of #1896): UFFD_EVENT_REMOVE handling, page tracker, deferred
pagefaults, wakeup pipe, gated/pause-resume test mode, unregister(),
UFFD_FEATURE_EVENT_REMOVE registration, and remove_test.go.
@cursor
Copy link
Copy Markdown

cursor Bot commented Apr 20, 2026

PR Summary

Low Risk
Primarily test/infrastructure changes; the only runtime change is a default-noop defaultCopyMode override that could affect UFFD copy behavior if set non-zero, but production leaves it unset.

Overview
Adds a small hook to Userfaultfd to optionally force a default UFFDIO_COPY mode (used only by cross-process tests), then expands and refactors UFFD write-protection tests to run ordered read/write sequences via a shared executor, validate clean/dirty state via pagemap, exercise an always write-protect on copy path end-to-end, and reduce high-parallelism test iteration counts while ensuring the helper process doesn’t self-timeout.

Reviewed by Cursor Bugbot for commit 11a7aaa. Bugbot is set up for automated code reviews on this repo. Configure here.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: e416f5c65f

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines 110 to +113
cmd.Env = append(os.Environ(), "GO_TEST_HELPER_PROCESS=1")
cmd.Env = append(cmd.Env, fmt.Sprintf("GO_MMAP_START=%d", memoryStart))
cmd.Env = append(cmd.Env, fmt.Sprintf("GO_MMAP_PAGE_SIZE=%d", tt.pagesize))
if tt.alwaysWP {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Isolate GO_ALWAYS_WP from parent environment

The helper process environment is seeded from os.Environ() and GO_ALWAYS_WP is only appended when tt.alwaysWP is true, so any ambient GO_ALWAYS_WP value from the parent shell can override the test config and change helper behavior unexpectedly. This makes configureCrossProcessTest non-deterministic across environments (e.g., non-alwaysWP cases can still run with forced WP), which can produce flaky or misleading test results.

Useful? React with 👍 / 👎.

@ValentaTomas ValentaTomas requested a review from djeebus April 20, 2026 23:16
Copy link
Copy Markdown

@claude claude Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM — test-only refactor with a trivially safe production hook (zero-default field).

Extended reasoning...

Overview

This PR adds cross-process UFFD test infrastructure: a shared executeAll/executeOperation helper, an alwaysWP test config flag that wires GO_ALWAYS_WP=1 into the child process environment, -test.timeout=0 on the helper child to prevent confusing timeout output, four new alwaysWP subtests in async_wp_test.go, and a parallel-operation count reduction (1M→10K). The sole production change is adding defaultCopyMode CULong (zero-initialized) to Userfaultfd, OR-ed into every UFFDIO_COPY; since it is zero in all production code paths, behavior is entirely unchanged.

Security risks

None. The new field is unexported, never set in production code, and the env-var wiring is confined to test helper processes. No auth, crypto, or permission logic is touched.

Level of scrutiny

Low. The production diff is ~3 lines adding a zero-default field; the remainder is test-file refactoring following established patterns already present in the package. Both reported bugs are test-only issues (one pre-existing, one a minor isolation nit for a brand-new project-specific env var) and do not affect production correctness.

Other factors

The two bug findings are minor: the missing root guard in missing_write_test.go is a pre-existing inconsistency not introduced by this PR, and the GO_ALWAYS_WP inheritance issue is a hygiene nit for a newly introduced, project-internal variable unlikely to appear in any real CI environment. Neither warrants blocking. CI runs are still pending, but the compile-only smoke test passes and the logic is straightforward enough that approval is reasonable here.

Comment on lines 112 to 118
cmd.Env = append(cmd.Env, fmt.Sprintf("GO_MMAP_PAGE_SIZE=%d", tt.pagesize))
if tt.alwaysWP {
cmd.Env = append(cmd.Env, "GO_ALWAYS_WP=1")
}

dup, err := syscall.Dup(int(uffdFd))
require.NoError(t, err)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 In configureCrossProcessTest, the child environment is built from os.Environ() and GO_ALWAYS_WP=1 is only appended when tt.alwaysWP is true — it is never explicitly removed or overridden when false. If a developer runs GO_ALWAYS_WP=1 go test ./..., every spawned helper process inherits the variable via os.Environ(), causing non-alwaysWP subtests to run the alwaysWP code path and breaking test isolation. Fix by always explicitly setting GO_ALWAYS_WP to either "0" or "1" rather than conditionally appending.

Extended reasoning...

What the bug is and how it manifests

In configureCrossProcessTest (cross_process_helpers_test.go:112-118), the child process command environment is assembled as:

cmd.Env = append(os.Environ(), "GO_TEST_HELPER_PROCESS=1")
// ...
if tt.alwaysWP {
    cmd.Env = append(cmd.Env, "GO_ALWAYS_WP=1")
}

The code only appends GO_ALWAYS_WP=1 when alwaysWP is true, but never removes or overrides it when alwaysWP is false. Because the base environment comes from os.Environ(), any value of GO_ALWAYS_WP present in the parent process is inherited unconditionally by every child.

The specific code path that triggers it

If a developer or CI runner executes GO_ALWAYS_WP=1 go test ./... (e.g., to quickly re-run only the alwaysWP subtests and then forgets to unset the variable), the parent process has GO_ALWAYS_WP=1 in its environment. os.Environ() captures that value, and every helper child launched — including those for alwaysWP=false subtests — inherits it. Inside crossProcessServe() (cross_process_helpers_test.go:303-305):

if os.Getenv("GO_ALWAYS_WP") == "1" {
    uffd.defaultCopyMode = UFFDIO_COPY_MODE_WP
}

This sets defaultCopyMode = UFFDIO_COPY_MODE_WP even for tests that are supposed to exercise the normal (non-WP) copy path.

Why existing code doesn't prevent it

The pattern cmd.Env = append(os.Environ(), ...) is standard Go helper-process idiom and is not inherently wrong — the issue is specifically that a conditionally appended variable is never explicitly cleared. Since cmd.Env is evaluated left-to-right and multiple occurrences of the same variable are resolved by position (first or last, depending on OS), and since the code never appends GO_ALWAYS_WP=0 for the false branch, there is no defensive override.

What the impact would be

Non-alwaysWP tests (e.g., "4k write to missing page") would run the helper with defaultCopyMode = UFFDIO_COPY_MODE_WP, meaning every UFFDIO_COPY carries the WP bit. The tests still pass because UFFD_FEATURE_WP_ASYNC automatically clears the WP bit on write faults, but the code path being exercised is the alwaysWP path, not the normal path. This defeats the purpose of having both variants and means regressions in the normal path would not be caught.

Step-by-step proof

  1. Developer runs: GO_ALWAYS_WP=1 go test -run TestAsyncWriteProtection ./pkg/sandbox/uffd/userfaultfd/
  2. The test binary's process environment contains GO_ALWAYS_WP=1.
  3. configureCrossProcessTest is called for the subtest "4k write to missing page" where tt.alwaysWP = false.
  4. cmd.Env = append(os.Environ(), ...)GO_ALWAYS_WP=1 is now in cmd.Env.
  5. The if tt.alwaysWP branch is false, so nothing overrides it.
  6. The child starts and crossProcessServe reads os.Getenv("GO_ALWAYS_WP") == "1" → true → sets uffd.defaultCopyMode = UFFDIO_COPY_MODE_WP.
  7. The subtest runs the non-alwaysWP code path with WP always set — silent wrong behavior.

Addressing the refutation

The refuter correctly notes that GO_ALWAYS_WP is a brand-new, project-specific variable introduced by this very PR and would not appear in any real CI or external environment. This makes the scenario unlikely in practice. The tests also still pass due to WP_ASYNC. However, the issue is a straightforward hygiene fix (add an else branch setting GO_ALWAYS_WP=0, or filter it from os.Environ()) and the test isolation concern is legitimate — the purpose of the non-alwaysWP subtests is precisely to verify the non-WP code path. Silently running the wrong variant reduces the diagnostic value of the test suite.

Comment on lines 119 to 125
h, err := configureCrossProcessTest(t, tt)
require.NoError(t, err)

for _, operation := range tt.operations {
if operation.mode == operationModeRead {
err := h.executeRead(t.Context(), operation)
require.NoError(t, err, "for operation %+v", operation)
}

if operation.mode == operationModeWrite {
err := h.executeWrite(t.Context(), operation)
require.NoError(t, err, "for operation %+v", operation)
}
}
h.executeAll(t, tt.operations)

expectedAccessedOffsets := getOperationsOffsets(tt.operations, operationModeRead|operationModeWrite)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟣 The four tests in missing_write_test.go (TestMissingWrite, TestParallelMissingWrite, TestParallelMissingWriteWithPrefault, TestSerialMissingWrite) are missing the if os.Geteuid() \!= 0 { t.Skip(...) } guard present in missing_test.go (line 17) and async_wp_test.go (line 33). Without this guard, running these tests as a non-root user produces a hard kernel permission-denied failure via require.NoError rather than a clean skip, making the test suite appear broken to contributors. This is a pre-existing inconsistency that the PR refactors but does not fix.

Extended reasoning...

What the bug is and how it manifests

missing_write_test.go does not import the os package and contains no os.Geteuid() call anywhere in the file. All four of its top-level test functions — TestMissingWrite, TestParallelMissingWrite, TestParallelMissingWriteWithPrefault, and TestSerialMissingWrite — proceed directly to configureCrossProcessTest(t, tt) without first checking whether the calling process has root privileges.

The specific code path that triggers it

configureCrossProcessTest (cross_process_helpers_test.go) calls newFd(syscall.O_CLOEXEC | syscall.O_NONBLOCK), configureApi(uffdFd, tt.pagesize), and register(uffdFd, memoryStart, uint64(size), UFFDIO_REGISTER_MODE_MISSING|UFFDIO_REGISTER_MODE_WP) in sequence. The UFFDIO_REGISTER_MODE_MISSING|UFFDIO_REGISTER_MODE_WP combination requires root or CAP_SYS_PTRACE (or vm.unprivileged_userfaultfd=1). On a standard developer machine, newFd() or configureApi() will return EPERM from the kernel ioctl.

Why existing code does not prevent it

Each of the four test functions calls require.NoError(t, err) immediately after configureCrossProcessTest, which calls t.FailNow() on the returned error. There is no t.Skip path, so a non-root run results in a hard test failure rather than an informative skip. By contrast, TestMissing (missing_test.go:17-18) and TestAsyncWriteProtection (async_wp_test.go:33-34) both guard with if os.Geteuid() \!= 0 { t.Skip("skipping test as not running as root") } before any privileged operation.

What the impact would be

A contributor running go test ./pkg/sandbox/uffd/... without root sees four unexplained failures. The error messages (permission denied on an ioctl) are not obviously linked to privilege requirements, so developers spend time diagnosing what appears to be a broken test suite rather than realising they simply need root.

How to fix it

Add "os" to the import block of missing_write_test.go and insert the following guard at the top of each of the four test functions (before the first configureCrossProcessTest call):

if os.Geteuid() \!= 0 {
    t.Skip("skipping test as not running as root")
}

Step-by-step proof

  1. Developer clones repo, runs go test -count=1 -run TestMissingWrite ./pkg/sandbox/uffd/userfaultfd/ as a non-root user.
  2. TestMissingWrite calls configureCrossProcessTest(t, tt).
  3. Inside configureCrossProcessTest, newFd(syscall.O_CLOEXEC | syscall.O_NONBLOCK) performs the userfaultfd(2) syscall — rejected with EPERM since the process lacks CAP_SYS_PTRACE and vm.unprivileged_userfaultfd is 0.
  4. require.NoError(t, err) on the line immediately after calls t.FailNow() with the EPERM error.
  5. Test is reported as FAILED instead of SKIPPED. The same happens for the three parallel/serial variants.
  6. Compare with TestMissing: step 1 is os.Geteuid() \!= 0 which triggers t.Skip — test is reported as SKIPPED with an explanatory message.

Drop the speculative pageStatesOnce/handlerPageStates plumbing and the
extra prose comments that were not present in #1896. The PR now ports
only the changes that are needed for the alwaysWP coverage and the
checkDirtiness pagemap assertions.
@ValentaTomas ValentaTomas merged commit b72af33 into main Apr 21, 2026
45 checks passed
@ValentaTomas ValentaTomas deleted the test/uffd-cross-process-helpers branch April 21, 2026 00:01
ValentaTomas added a commit that referenced this pull request May 2, 2026
…_000_000

Reverts the 1M -> 10K reduction made in #2461. Intermittent failures
under 1M are bugs to be hunted, not silenced by lowering the load.
The other parallel/serial variants in these files have legitimately
been at 10K since their introduction in November 2025; only the two
TestParallelMissing(/Write) cases were reduced.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants