
roachprod: add functional test framework and tests#161178

Open
williamchoe3 wants to merge 1 commit into cockroachdb:master from williamchoe3:wchoe/roachprod-e2e-bazel

Conversation

@williamchoe3
Contributor

@williamchoe3 williamchoe3 commented Jan 15, 2026

Introduces a lightweight test framework for running roachprod commands against real GCE infrastructure, along with a suite of functional tests targeting roachprod cluster operations on GCE, and a TeamCity CI script for weekly execution. See README.md for more details.

The framework (pkg/cmd/roachprod/test/framework) provides:

  • RoachprodTest harness managing cluster lifecycle
  • Assertion helpers that query roachprod list --json to verify cluster state (node count, cloud, lifetime, etc.)
  • Randomized configuration generation (RandomGCECreateOptions) that produces valid, internally consistent GCE configs (machine type, architecture, storage, zones, etc.) using a seeded RNG for reproducibility
  • Operational helpers for cluster ops and cluster utility helpers

Tests are located in pkg/cmd/roachprod/test/test/ and current coverage includes GCE-related commands: create, destroy, extend, start/stop, reset, version, populate-etc-hosts, and managed clusters.

The CI script (build/teamcity/cockroach/nightlies/roachprod_weekly.sh) builds bazci and runs all test targets inside a Docker container.

See the comments below for more details.

@blathers-crl

blathers-crl Bot commented Jan 15, 2026

Your pull request contains more than 1000 changes. It is strongly encouraged to split big PRs into smaller chunks.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@cockroach-teamcity
Member

This change is Reviewable

@williamchoe3 williamchoe3 force-pushed the wchoe/roachprod-e2e-bazel branch 2 times, most recently from a0b664e to 62c2e33 Compare January 15, 2026 22:21
@williamchoe3 williamchoe3 force-pushed the wchoe/roachprod-e2e-bazel branch 2 times, most recently from 2b54943 to 33a63eb Compare February 17, 2026 04:02
@williamchoe3 williamchoe3 force-pushed the wchoe/roachprod-e2e-bazel branch 3 times, most recently from b77f156 to aed3981 Compare February 18, 2026 20:13
@williamchoe3
Contributor Author

PR #161178 Review: roachprod functional tests

Overview

This PR introduces a lightweight end-to-end testing framework for roachprod that creates real clusters on GCE, runs roachprod commands against them, and asserts cluster state. The framework supports randomized configuration generation, reproducible seeds, and automatic cleanup. Well-structured overall — the separation between framework/ and tests/ is clean, the functional options pattern is idiomatic, and the randomized testing approach with seeded RNG is the right design.


BLOCKERS

B1. Cluster name collisions with --runs_per_test > 1
framework/framework.go:253-257

timestamp := time.Now().Unix()
return fmt.Sprintf("%s-%s-%d", username, testName, timestamp)

The cluster name uses unix seconds for uniqueness. The CI script explicitly supports --runs_per_test="$TEST_COUNT" and the README describes TEST_COUNT as being "for stressing or running randomization tests." When TEST_COUNT > 1, Bazel launches multiple instances of the same test target in parallel. Two instances of TestCreateAMD64 starting in the same second produce the identical cluster name teamcity-testcreateamd64-1738000000. This causes:

  • Both tests trying to create the same cluster (one fails or overwrites)
  • Cleanup in one test destroying the other's cluster mid-test
  • Flaky, confusing failures that are hard to reproduce

Fix: include a per-invocation disambiguator. Bazel sets TEST_RUN_NUMBER for --runs_per_test runs. Alternatively, use nanosecond precision or add a short random suffix:

return fmt.Sprintf("%s-%s-%d-%s", username, testName, timestamp, randomSuffix(4))

B2. BootDiskOnly and UseLocalSSD can both be true, producing conflicting flags
framework/createconfig.go:138-141 (step 8) vs createconfig.go:115-120 (step 3)

Step 3 sets UseLocalSSD = true (50% chance on machines that support it). Step 8 independently sets BootDiskOnly = true (5% chance). Neither checks the other. When both are true, ToCreateArgs emits both --local-ssd and --gce-boot-disk-only, which are contradictory — boot-disk-only means no additional volumes (no local SSDs, no PD), while --local-ssd requests local SSD attachment.

This affects ~2.5% of randomized configs on machines that support local SSD. The resulting roachprod create call will likely either fail or produce undefined behavior.

Fix: step 8 should set UseLocalSSD = false when BootDiskOnly is true, or skip the BootDiskOnly coin flip when UseLocalSSD is already true.


MEDIUM

M1. ToCreateArgs omits --gce-use-bulk-insert=false for provisioned IOPS/throughput
framework/createconfig.go:197-209

TestCreateHyperdisk explicitly notes: "These flags are only wired through the CLI path, so --gce-use-bulk-insert=false is required." But RandomGCECreateOptions can generate configs with hyperdisk-balanced + provisioned IOPS/throughput, and ToCreateArgs never emits --gce-use-bulk-insert=false. The create succeeds (because the bulk insert path just ignores the IOPS values), but the test silently doesn't test what it thinks it's testing — the actual disk gets default IOPS, not the requested 4000.

M2. extractJSON vulnerable to trailing output with } on its own line
framework/framework.go:262-281

The function tracks the last line that equals }. If roachprod outputs a warning after the JSON that happens to contain } on its own line, end extends past the real JSON closing brace, and json.Unmarshal fails with a confusing parse error. The test case "warnings containing braces" only covers inline braces ({"error": "auth failed"}), not standalone } on its own line.

This is unlikely today but fragile — any future change to roachprod's error output could trigger it. Consider parsing from the end backwards, or using json.Decoder which stops at the first complete JSON object.

M3. list_test.go is entirely commented-out dead code

The entire file is a commented-out test with the note "Commenting because this just makes the log unusable rn..." This shouldn't ship in the PR. Either fix the log noise issue and include the test, or remove the file entirely. Dead code in a test framework that "all pipelines will migrate on" sets a bad precedent.

M4. WaitForSSH only probes node 1
framework/operations.go:129

target := fmt.Sprintf("%s:1", tc.clusterName)

For multi-node clusters after a reset, SSH could be available on node 1 but not node 2 or 3. The test then proceeds to run commands on all nodes and gets a flaky failure. Today only TestReset uses this (with 1 node), but the method is general-purpose and will be used for multi-node scenarios as the framework grows.


NITs

N1. joinZones reimplements strings.Join
framework/createconfig.go:233-242

func joinZones(zones []string) string {
    result := ""
    for i, z := range zones { ... }
    return result
}

This is just strings.Join(zones, ",").

N2. AssertClusterCloud uses manual prefix matching instead of strings.HasPrefix
framework/assertions.go:62

if len(provider) >= len(expectedCloud) && provider[:len(expectedCloud)] == expectedCloud {

strings.HasPrefix(provider, expectedCloud)

N3. README typos

  • Line 182: "convinient" → "convenient"
  • Line 278: "uni tests" → "unit tests"

N4. TestExtend lifetime assertion may be brittle
tests/extend_test.go:43

rpt.AssertClusterLifetime(initialLifetime + extension)

This assumes extend is purely additive and the lifetime hasn't drifted due to clock skew or rounding. If roachprod uses server-side timestamps with even a second of drift, this strict equality check could flake. Using a tolerance window (require.InDelta) would be more robust, but this may be fine if roachprod's extend is known to be exact.

@williamchoe3
Contributor Author

williamchoe3 commented Feb 18, 2026

Updated Review

Previously raised items — status

B1. Cluster name collision with --runs_per_test > 1: Fixed. Changed from Unix() (seconds) to UnixNano() (nanoseconds). Two instances starting in the same nanosecond is effectively impossible. VM name length with the nanosecond timestamp is 54 chars (under the 60-char limit).
B2. BootDiskOnly + UseLocalSSD conflict: Fixed. Step 8 now skips the BootDiskOnly coin flip when UseLocalSSD is true, with a clear comment explaining why.
M1. Missing --gce-use-bulk-insert=false for provisioned IOPS: Fixed. ToCreateArgs now emits the flag when either provisioned IOPS or throughput is set, with a comment explaining the BulkInsert API limitation. Test cases updated to assert the flag is present.
M3. Commented-out list_test.go dead code: Fixed. File deleted, BUILD.bazel target removed.
M4. WaitForSSH only probes node 1: Fixed. Now targets the full cluster name (no :1 suffix), so roachprod runs on all nodes. Comment updated to explain the behavior.
N1. joinZones reimplements strings.Join: Fixed. Replaced with strings.Join; the function and its test were deleted.
N2. Manual prefix matching vs strings.HasPrefix: Fixed.
N3. README typos: Fixed. "convinient" → "convenient", "uni tests" → "unit tests".

Remaining items (unchanged, author chose not to address)

MEDIUM

M2. extractJSON vulnerable to trailing output with } on its own line — unchanged. Low probability in practice since roachprod's error messages don't currently produce standalone } lines, but worth keeping in mind as the framework grows.

NIT

N4. TestExtend lifetime strict equality — unchanged. Acceptable if roachprod's extend is known to be exact.


All blockers are resolved. The remaining M2 is a theoretical fragility that's acceptable for now. This looks good to merge.

@williamchoe3 williamchoe3 changed the title roachprod function tests DRAFT roachprod: add functional test framework and tests Feb 18, 2026
@williamchoe3 williamchoe3 force-pushed the wchoe/roachprod-e2e-bazel branch from ad02843 to c011536 Compare February 18, 2026 21:57
@williamchoe3
Contributor Author

williamchoe3 commented Feb 18, 2026

Details

Read the README.md for general overview.

CI: https://teamcity.cockroachdb.com/buildConfiguration/Cockroach_Nightlies_RoachprodFunctionalTestsWeekly?branch=161178&buildTypeTab=overview&mode=builds

Stylistically I took inspiration from mixedversion, with the main goal of making test writing feel "natural". I wanted something simple but extensible, because roachprod does a lot.

I want to call out that I had some trouble with the gcloud CLI auth. Most of the logic around auth is handled in TestMain. The main issue is that the tests run inside Docker inside a Bazel sandbox, and when the Google CLI tries to auth, it attempts to write to the home directory, which it doesn't have access to. The workaround is declaring "HOME": "$$TEST_TMPDIR" in the Bazel test rule. Claude's explanation below.

Without this, roachprod would try to read/write ~/.ssh, ~/.config/gcloud, etc. in a non-existent or read-only directory. By pointing HOME at TEST_TMPDIR, the test gets a writable home directory where TestMain can set up SSH keys and gcloud credentials before the tests run.

This has been working, but it feels like it's doing a lot, and I'm wondering if anyone knows of a simpler approach. I also assumed I would only have to auth a single time, but because each Bazel test rule compiles into its own binary, it seems I need to auth per rule. Again, this has been working; it seems a bit complicated to me, but it also seems like a consequence of each test rule being treated individually.
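For reference, the HOME override described above looks roughly like this in the test rule; the target name and other attributes are illustrative, not the PR's actual BUILD file:

```starlark
go_test(
    name = "framework_test",  # illustrative target name
    srcs = glob(["*_test.go"]),
    env = {
        # Bazel substitutes $$TEST_TMPDIR with the per-test writable scratch
        # directory, giving roachprod/gcloud a HOME they can write ~/.ssh
        # and ~/.config/gcloud into inside the sandbox.
        "HOME": "$$TEST_TMPDIR",
    },
)
```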

I hope to increase the coverage over time. I had more to add, but I think dividing the commands into 3 broad categories (cloud provider facing, crdb facing, utility) makes sense, so I'll try to add baseline coverage in that order and based on priority.

In general, very open to feedback. Hope this is helpful, and a start to increasing the reliability of our tooling!

@williamchoe3 williamchoe3 marked this pull request as ready for review February 19, 2026 06:07
@williamchoe3 williamchoe3 requested review from a team as code owners February 19, 2026 06:07
@williamchoe3 williamchoe3 requested review from DarrylWong and golgeek and removed request for a team February 19, 2026 06:07
Contributor

@golgeek golgeek left a comment


Went through a first pass, will have another look tomorrow.
Good job overall, I left a few comments here and there.

return
}
remaining := time.Until(deadline)
require.Greater(tc.t, remaining, time.Duration(0),
Contributor


This is neat.


rpt.AssertClusterExists()
rpt.AssertClusterNodeCount(numNodes)
rpt.AssertClusterCloud("gce")
Contributor


Shouldn't we also assert amd64 in case one day the default machine type is changed to arm64?

Contributor Author


oops, thx for the catch, added

rpt.AssertClusterNodeCount(numNodes)
rpt.AssertClusterCloud("gce")
rpt.AssertClusterMachineType("n2-standard-4")
rpt.AssertClusterArchitecture("amd64")
Contributor


Do we have the image used in the assertion?
We could have something like AssertClusterMachineImageContains("-fips-").


rpt.AssertClusterExists()
rpt.AssertClusterNodeCount(numNodes)
rpt.AssertClusterCloud("gce")
Contributor


Should probably assert the image :)

Contributor Author


no images :/ can add a TODO to expose it in roachprod list --json
roachprod_test.log


rpt.AssertClusterCloud("gce")
rpt.AssertClusterNodeCount(opts.NumNodes)

// Type assert to GCE options for verification
Contributor


This feels a bit weird in this context.

Unless I'm mistaken, we generate randomized settings but then test arbitrary ones. Since the framework generates the options and has the cluster info, I feel like it has everything it needs to run the correct assertions on its own.

What about RandomGCECreateOptions() would:

  1. generate the randomized options
  2. store a list of asserts and values
  3. expose a AssertRandomizedOptions() that execs all stored assert functions on the cluster info?

Contributor Author


Makes sense, I think overall the assertions part could use a revisit down the line, but I should be able to implement something like this, and I agree it makes more sense.

For assertions as a whole, looking back I was trying to figure out a definition of "correctness" for a cluster create command from the perspective of an end user. I never really formalized that anywhere though, so I can attempt to do this now.

1. roachprod create succeeds
2. all options specified were applied and verified
3. ssh? is that implied by roachprod success? anything else?

So that's why I just went with listing out each Assert for each test. It looks like boilerplate, but I like how readable it is.

But yeah, this breaks down for the randomized test, since the create args are generated at runtime and not visible in the test code.

Contributor Author


implemented, changes described below, let me know if this is different from what you had in mind

// RandomGCECreateOptions (createconfig.go) co-generates assertion closures
// alongside each randomized setting. Each closure captures the expected
// value and calls the appropriate Assert* method above. These types and
// methods support that mechanism.
//
// Registration (in createconfig.go during config generation):
//
//	providerOpts.MachineType = "n2-standard-4"
//	cfg.expect("machine type = n2-standard-4", func(rpt *RoachprodTest) {
//	    rpt.AssertClusterMachineType("n2-standard-4")
//	})
//
// The closure does not execute here — it is stored and run later when the
// caller invokes cfg.AssertAll(rpt)
//
// Execution (in the test, after cluster creation):
//
//	opts := framework.RandomGCECreateOptions(rpt.Rand())
//	rpt.RunExpectSuccess(opts.ToCreateArgs(rpt.ClusterName())...)
//	opts.AssertAll(rpt) // now all registered closures execute

Contributor


Looks great!

Introduce a test framework for running roachprod commands against real GCE
infrastructure, along with a suite of functional tests and a TeamCity CI script
for weekly execution.

The framework (`pkg/cmd/roachprod/test/framework`) provides:
- `RoachprodTest` harness managing cluster lifecycle
- Assertion helpers that query `roachprod list --json` to verify cluster state
  (node count, cloud, lifetime, etc.)
- Randomized configuration generation (`RandomGCECreateOptions`) that produces
  valid, internally consistent GCE configs (machine type, architecture,
  storage, zones, etc.) using a seeded RNG for reproducibility
- Operations helpers for cluster ops and cluster utility helpers

Test coverage includes GCE related commands: create, destroy, extend,
start/stop, reset, version, populate-etc-hosts, and managed clusters.

The CI script (`build/teamcity/cockroach/nightlies/roachprod_weekly.sh`) builds
bazci and runs all test targets inside a Docker container.

Epic: None
Release note: None
@williamchoe3 williamchoe3 force-pushed the wchoe/roachprod-e2e-bazel branch from 09cc96f to 0686e5b Compare February 25, 2026 19:25
@williamchoe3
Contributor Author

Looking at this message: if this had gone in, it would have run under the CI stress TeamCity build, which would have been bad.
https://cockroachlabs.slack.com/archives/CJ0H8Q97C/p1772041242180389?thread_ts=1772038552.542499&cid=CJ0H8Q97C

//
// The closure does not execute here — it is stored and run later when the
// caller invokes cfg.AssertAll(rpt) against a live cluster.
func RandomGCECreateOptions(rng *rand.Rand) *RandomizedClusterConfig {
Contributor


Take this with a grain of salt, since I only thought about this at a high level:

Right now, the function promises to return a valid GCE create config, doing so by cross referencing each random field with GetMachineInfo as the source of truth.

I think this is a bit limited because:

  1. The verification is a test-only detail. If we add a new constraint to an opt, we need to make sure to update our tests as well, i.e. it's extra toil to maintain this RandomGCECreateOptions function.
  2. It might lead us to be overly conservative in our opt randomization. If we add a new storage type, we have to manually add it to the test.

Instead, what if verification was pushed down to the actual roachprod Create command. We already have bits and pieces of verification, e.g. return nil, errors.Newf("--%s-machine-type requires gce in --clouds", gce.ProviderName), we could centralize all the verification in one place.

This would instead let us return a completely random GCE create config, regardless of whether it's valid or not. We can then check if it's valid with our above Verify method; if it's not, we regenerate a new config. If it is, we should be able to run the rest of our assertions as before.

I think this approach is nicer because:

  1. If we generate an invalid combination of flags, it's not a test flake but rather an actual roachprod bug in Verify, i.e. we are more incentivized to fix it, which in turn improves our bad-config checking in the actual command itself.
  2. It simplifies the randomization of this function by a lot, which should hopefully reduce the toil of adding new fields.

Contributor Author

@williamchoe3 williamchoe3 Feb 26, 2026


verification was pushed down to the actual roachprod Create command.

That makes sense. I do like this a lot from the test's PoV. I suppose it's a question of responsibility, i.e. should roachprod create be responsible for a thorough validation step (new), or should roachprod create do some "sanity" verification and then let the provider validate the command (current)? Will think about this; I think there are pros and cons, but it could be something worth pursuing.
