
roachprod: add functional test framework and tests#161178

Open
williamchoe3 wants to merge 1 commit into cockroachdb:master from williamchoe3:wchoe/roachprod-e2e-bazel

Conversation

@williamchoe3
Contributor

@williamchoe3 williamchoe3 commented Jan 15, 2026

Introduces a lightweight test framework for running roachprod commands against real GCE infrastructure, along with a suite of functional tests targeting roachprod cluster operations on GCE, and a TeamCity CI script for weekly execution. See README.md for more details.

The framework (pkg/cmd/roachprod/test/framework) provides:

  • RoachprodTest harness managing cluster lifecycle
  • Assertion helpers that query roachprod list --json to verify cluster state (node count, cloud, lifetime, etc.)
  • Randomized configuration generation (RandomGCECreateOptions) that produces valid, internally consistent GCE configs (machine type, architecture, storage, zones, etc.) using a seeded RNG for reproducibility
  • Operational helpers for cluster ops and cluster utility helpers

Tests are located in pkg/cmd/roachprod/test/test/ and current coverage includes GCE-related commands: create, destroy, extend, start/stop, reset, version, populate-etc-hosts, and managed clusters.

The CI script (build/teamcity/cockroach/nightlies/roachprod_weekly.sh) builds bazci and runs all test targets inside a Docker container.

See the comments below for more details.

@blathers-crl

blathers-crl Bot commented Jan 15, 2026

Your pull request contains more than 1000 changes. It is strongly encouraged to split big PRs into smaller chunks.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@cockroach-teamcity
Member

This change is Reviewable

@williamchoe3 williamchoe3 force-pushed the wchoe/roachprod-e2e-bazel branch 2 times, most recently from a0b664e to 62c2e33 Compare January 15, 2026 22:21
@williamchoe3 williamchoe3 force-pushed the wchoe/roachprod-e2e-bazel branch 2 times, most recently from 2b54943 to 33a63eb Compare February 17, 2026 04:02
@williamchoe3 williamchoe3 force-pushed the wchoe/roachprod-e2e-bazel branch 3 times, most recently from b77f156 to aed3981 Compare February 18, 2026 20:13
@williamchoe3
Contributor Author

PR #161178 Review: roachprod functional tests

Overview

This PR introduces a lightweight end-to-end testing framework for roachprod that creates real clusters on GCE, runs roachprod commands against them, and asserts cluster state. The framework supports randomized configuration generation, reproducible seeds, and automatic cleanup. Well-structured overall — the separation between framework/ and tests/ is clean, the functional options pattern is idiomatic, and the randomized testing approach with seeded RNG is the right design.


BLOCKERS

B1. Cluster name collisions with --runs_per_test > 1
framework/framework.go:253-257

timestamp := time.Now().Unix()
return fmt.Sprintf("%s-%s-%d", username, testName, timestamp)

The cluster name uses unix seconds for uniqueness. The CI script explicitly supports --runs_per_test="$TEST_COUNT" and the README describes TEST_COUNT as being "for stressing or running randomization tests." When TEST_COUNT > 1, Bazel launches multiple instances of the same test target in parallel. Two instances of TestCreateAMD64 starting in the same second produce the identical cluster name teamcity-testcreateamd64-1738000000. This causes:

  • Both tests trying to create the same cluster (one fails or overwrites)
  • Cleanup in one test destroying the other's cluster mid-test
  • Flaky, confusing failures that are hard to reproduce

Fix: include a per-invocation disambiguator. Bazel sets TEST_RUN_NUMBER for --runs_per_test runs. Alternatively, use nanosecond precision or add a short random suffix:

return fmt.Sprintf("%s-%s-%d-%s", username, testName, timestamp, randomSuffix(4))

B2. BootDiskOnly and UseLocalSSD can both be true, producing conflicting flags
framework/createconfig.go:138-141 (step 8) vs createconfig.go:115-120 (step 3)

Step 3 sets UseLocalSSD = true (50% chance on machines that support it). Step 8 independently sets BootDiskOnly = true (5% chance). Neither checks the other. When both are true, ToCreateArgs emits both --local-ssd and --gce-boot-disk-only, which are contradictory — boot-disk-only means no additional volumes (no local SSDs, no PD), while --local-ssd requests local SSD attachment.

This affects ~2.5% of randomized configs on machines that support local SSD. The resulting roachprod create call will likely either fail or produce undefined behavior.

Fix: step 8 should set UseLocalSSD = false when BootDiskOnly is true, or skip the BootDiskOnly coin flip when UseLocalSSD is already true.


MEDIUM

M1. ToCreateArgs omits --gce-use-bulk-insert=false for provisioned IOPS/throughput
framework/createconfig.go:197-209

TestCreateHyperdisk explicitly notes: "These flags are only wired through the CLI path, so --gce-use-bulk-insert=false is required." But RandomGCECreateOptions can generate configs with hyperdisk-balanced + provisioned IOPS/throughput, and ToCreateArgs never emits --gce-use-bulk-insert=false. The create succeeds (because the bulk insert path just ignores the IOPS values), but the test silently doesn't test what it thinks it's testing — the actual disk gets default IOPS, not the requested 4000.

M2. extractJSON vulnerable to trailing output with } on its own line
framework/framework.go:262-281

The function tracks the last line that equals }. If roachprod outputs a warning after the JSON that happens to contain } on its own line, end extends past the real JSON closing brace, and json.Unmarshal fails with a confusing parse error. The test case "warnings containing braces" only covers inline braces ({"error": "auth failed"}), not standalone } on its own line.

This is unlikely today but fragile — any future change to roachprod's error output could trigger it. Consider parsing from the end backwards, or using json.Decoder which stops at the first complete JSON object.

M3. list_test.go is entirely commented-out dead code

The entire file is a commented-out test with the note "Commenting because this just makes the log unusable rn..." This shouldn't ship in the PR. Either fix the log noise issue and include the test, or remove the file entirely. Dead code in a test framework that "all pipelines will migrate on" sets a bad precedent.

M4. WaitForSSH only probes node 1
framework/operations.go:129

target := fmt.Sprintf("%s:1", tc.clusterName)

For multi-node clusters after a reset, SSH could be available on node 1 but not node 2 or 3. The test then proceeds to run commands on all nodes and gets a flaky failure. Today only TestReset uses this (with 1 node), but the method is general-purpose and will be used for multi-node scenarios as the framework grows.


NITs

N1. joinZones reimplements strings.Join
framework/createconfig.go:233-242

func joinZones(zones []string) string {
    result := ""
    for i, z := range zones { ... }
    return result
}

This is just strings.Join(zones, ",").

N2. AssertClusterCloud uses manual prefix matching instead of strings.HasPrefix
framework/assertions.go:62

if len(provider) >= len(expectedCloud) && provider[:len(expectedCloud)] == expectedCloud {

strings.HasPrefix(provider, expectedCloud)

N3. README typos

  • Line 182: "convinient" → "convenient"
  • Line 278: "uni tests" → "unit tests"

N4. TestExtend lifetime assertion may be brittle
tests/extend_test.go:43

rpt.AssertClusterLifetime(initialLifetime + extension)

This assumes extend is purely additive and the lifetime hasn't drifted due to clock skew or rounding. If roachprod uses server-side timestamps with even a second of drift, this strict equality check could flake. Using a tolerance window (require.InDelta) would be more robust, but this may be fine if roachprod's extend is known to be exact.

@williamchoe3
Contributor Author

williamchoe3 commented Feb 18, 2026

Updated Review

Previously raised items — status

B1. Cluster name collision with --runs_per_test > 1: Fixed. Changed from Unix() (seconds) to UnixNano() (nanoseconds). Two instances starting in the same nanosecond is effectively impossible. VM name length with the nanosecond timestamp is 54 chars (under the 60-char limit).
B2. BootDiskOnly + UseLocalSSD conflict: Fixed. Step 8 now skips the BootDiskOnly coin flip when UseLocalSSD is true, with a clear comment explaining why.
M1. Missing --gce-use-bulk-insert=false for provisioned IOPS: Fixed. ToCreateArgs now emits the flag when either provisioned IOPS or throughput is set, with a comment explaining the BulkInsert API limitation. Test cases updated to assert the flag is present.
M3. Commented-out list_test.go dead code: Fixed. File deleted, BUILD.bazel target removed.
M4. WaitForSSH only probes node 1: Fixed. Now targets the full cluster name (no :1 suffix), so roachprod runs on all nodes. Comment updated to explain the behavior.
N1. joinZones reimplements strings.Join: Fixed. Replaced with strings.Join; the function and its test were deleted.
N2. Manual prefix matching vs strings.HasPrefix: Fixed.
N3. README typos: Fixed. "convinient" → "convenient", "uni tests" → "unit tests".

Remaining items (unchanged, author chose not to address)

MEDIUM

M2. extractJSON vulnerable to trailing output with } on its own line — unchanged. Low probability in practice since roachprod's error messages don't currently produce standalone } lines, but worth keeping in mind as the framework grows.

NIT

N4. TestExtend lifetime strict equality — unchanged. Acceptable if roachprod's extend is known to be exact.


All blockers are resolved. The remaining M2 is a theoretical fragility that's acceptable for now. This looks good to merge.

@williamchoe3 williamchoe3 changed the title roachprod function tests DRAFT roachprod: add functional test framework and tests Feb 18, 2026
@williamchoe3 williamchoe3 force-pushed the wchoe/roachprod-e2e-bazel branch from ad02843 to c011536 Compare February 18, 2026 21:57
@williamchoe3
Contributor Author

williamchoe3 commented Feb 18, 2026

Details

Read the README.md for general overview.

CI: https://teamcity.cockroachdb.com/buildConfiguration/Cockroach_Nightlies_RoachprodFunctionalTestsWeekly?branch=161178&buildTypeTab=overview&mode=builds

Stylistically I took inspiration from mixedversion, with the main goal of making test writing feel "natural". I wanted something simple but extensible, because roachprod does a lot.

I want to call out that I had some trouble with the gcloud CLI auth. Most of the logic around auth is handled in TestMain. The main issue is that the tests run inside Docker inside a Bazel sandbox, and when the Google CLI tries to auth, it attempts to write to the home directory, which it doesn't have access to. The workaround is declaring "HOME": "$$TEST_TMPDIR" in the Bazel test rule. Claude's explanation below.

Without this, roachprod would try to read/write ~/.ssh, ~/.config/gcloud, etc. in a non-existent or read-only directory. By pointing HOME at TEST_TMPDIR, the test gets a writable home directory where TestMain can set up SSH keys and gcloud credentials before the tests run.

This has been working, but it feels like it's doing a lot, and I'm wondering if anyone knows of a simpler approach. I also assumed I would only have to auth a single time, but because each Bazel test rule compiles into its own binary, it seems I need to auth per rule. Again, this has been working; it seems a bit complicated to me, but it also seems like a consequence of each test rule being treated individually.
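For reference, the HOME override described above looks roughly like this in the test rule; the target name and other attributes are illustrative, not the PR's actual BUILD file:

```starlark
go_test(
    name = "framework_test",  # illustrative target name
    srcs = glob(["*_test.go"]),
    env = {
        # Bazel substitutes $$TEST_TMPDIR with the per-test writable scratch
        # directory, giving roachprod/gcloud a HOME they can write ~/.ssh
        # and ~/.config/gcloud into inside the sandbox.
        "HOME": "$$TEST_TMPDIR",
    },
)
```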

I hope to increase the coverage over time. I had more to add, but I think dividing the commands into 3 broad categories (cloud provider facing, crdb facing, utility) makes sense, so I'll try to add baseline coverage in that order and based on priority.

In general, very open to feedback. Hope this is helpful, and a start to increasing the reliability of our tooling!

@williamchoe3 williamchoe3 marked this pull request as ready for review February 19, 2026 06:07
@williamchoe3 williamchoe3 requested review from a team as code owners February 19, 2026 06:07
@williamchoe3 williamchoe3 requested review from DarrylWong and golgeek and removed request for a team February 19, 2026 06:07
Contributor

@golgeek golgeek left a comment


Went through a first pass, will have another look tomorrow.
Good job overall, I left a few comments here and there.

return
}
remaining := time.Until(deadline)
require.Greater(tc.t, remaining, time.Duration(0),
Contributor


This is neat.


rpt.AssertClusterExists()
rpt.AssertClusterNodeCount(numNodes)
rpt.AssertClusterCloud("gce")
Contributor


Shouldn't we also assert amd64 in case one day the default machine type is changed to arm64?

Contributor Author


oops, thx for the catch, added

rpt.AssertClusterNodeCount(numNodes)
rpt.AssertClusterCloud("gce")
rpt.AssertClusterMachineType("n2-standard-4")
rpt.AssertClusterArchitecture("amd64")
Contributor


Do we have the image used in the assertion?
We could have something like AssertClusterMachineImageContains("-fips-").


rpt.AssertClusterExists()
rpt.AssertClusterNodeCount(numNodes)
rpt.AssertClusterCloud("gce")
Contributor


Should probably assert the image :)

Contributor Author


no images :/ can add a TODO to expose it in roachprod list --json
roachprod_test.log


rpt.AssertClusterCloud("gce")
rpt.AssertClusterNodeCount(opts.NumNodes)

// Type assert to GCE options for verification
Contributor


This feels a bit weird in this context.

Unless I'm mistaken, we generate randomized settings but then test arbitrary ones. Since the framework generates the options and has the cluster info, I feel like it has everything it needs to run the correct assertions on its own.

What about RandomGCECreateOptions() would:

  1. generate the randomized options
  2. store a list of asserts and values
  3. expose a AssertRandomizedOptions() that execs all stored assert functions on the cluster info?

Contributor Author


Makes sense, I think overall the assertions part could use a revisit down the line, but I should be able to implement something like this, and I agree it makes more sense.

For assertions as a whole, looking back I was trying to figure out a definition of "correctness" for a cluster create command from the perspective of an end user. I never really formalized that anywhere though, so I can attempt to do this now.

1. roachprod create succeeds
2. all options specified were applied and verified
3. ssh? is that implied by roachprod success? anything else?

So that's why I just went with listing out each Assert for each test. It looks like boilerplate, but I like how readable it is.

But yeah, this breaks down for the randomized test, since the create args are generated at runtime and not visible in the test code.

Contributor Author


implemented, changes described below, let me know if this is different from what you had in mind

// RandomGCECreateOptions (createconfig.go) co-generates assertion closures
// alongside each randomized setting. Each closure captures the expected
// value and calls the appropriate Assert* method above. These types and
// methods support that mechanism.
//
// Registration (in createconfig.go during config generation):
//
//	providerOpts.MachineType = "n2-standard-4"
//	cfg.expect("machine type = n2-standard-4", func(rpt *RoachprodTest) {
//	    rpt.AssertClusterMachineType("n2-standard-4")
//	})
//
// The closure does not execute here — it is stored and run later when the
// caller invokes cfg.AssertAll(rpt)
//
// Execution (in the test, after cluster creation):
//
//	opts := framework.RandomGCECreateOptions(rpt.Rand())
//	rpt.RunExpectSuccess(opts.ToCreateArgs(rpt.ClusterName())...)
//	opts.AssertAll(rpt) // now all registered closures execute

Contributor


Looks great!

Introduce a test framework for running roachprod commands against real GCE
infrastructure, along with a suite of functional tests and a TeamCity CI script
for weekly execution.

The framework (`pkg/cmd/roachprod/test/framework`) provides:
- `RoachprodTest` harness managing cluster lifecycle
- Assertion helpers that query `roachprod list --json` to verify cluster state
  (node count, cloud, lifetime, etc.)
- Randomized configuration generation (`RandomGCECreateOptions`) that produces
  valid, internally consistent GCE configs (machine type, architecture,
  storage, zones, etc.) using a seeded RNG for reproducibility
- Operations helpers for cluster ops and cluster utility helpers

Test coverage includes GCE related commands: create, destroy, extend,
start/stop, reset, version, populate-etc-hosts, and managed clusters.

The CI script (`build/teamcity/cockroach/nightlies/roachprod_weekly.sh`) builds
bazci and runs all test targets inside a Docker container.

Epic: None
Release note: None
@williamchoe3 williamchoe3 force-pushed the wchoe/roachprod-e2e-bazel branch from 09cc96f to 0686e5b Compare February 25, 2026 19:25
@williamchoe3
Contributor Author

Looking at this message: if this had gone in, it would have run under the CI stress TeamCity build, which would have been bad.
https://cockroachlabs.slack.com/archives/CJ0H8Q97C/p1772041242180389?thread_ts=1772038552.542499&cid=CJ0H8Q97C

//
// The closure does not execute here — it is stored and run later when the
// caller invokes cfg.AssertAll(rpt) against a live cluster.
func RandomGCECreateOptions(rng *rand.Rand) *RandomizedClusterConfig {
Contributor


Take this with a grain of salt, since I only thought about this at a high level:

Right now, the function promises to return a valid GCE create config, doing so by cross referencing each random field with GetMachineInfo as the source of truth.

I think this is a bit limited because:

  1. The verification is a test-only detail. If we add a new constraint to an opt, we need to make sure to update our tests as well, i.e. it's extra toil to maintain this RandomGCECreateOptions function.
  2. It might lead us to be overly conservative in our opt randomization. If we add a new storage type, we have to manually add it to the test.

Instead, what if verification was pushed down to the actual roachprod Create command. We already have bits and pieces of verification, e.g. return nil, errors.Newf("--%s-machine-type requires gce in --clouds", gce.ProviderName), we could centralize all the verification in one place.

This would instead let us return a completely random GCE create config, regardless of whether it's valid or not. We can then check if it's valid with our above Verify method; if it's not, we regenerate a new config. If it is, we should be able to run the rest of our assertions as before.

I think this approach is nicer because:

  1. If we generate an invalid combination of flags, it's not a test flake but rather an actual roachprod bug in Verify, i.e. we are more incentivized to fix it, which in turn improves our bad-config checking in the actual command itself.
  2. It simplifies the randomization of this function by a lot, which should hopefully reduce the toil of adding new fields.

Contributor Author

@williamchoe3 williamchoe3 Feb 26, 2026


verification was pushed down to the actual roachprod Create command.

That makes sense. I do like this a lot from the test's PoV. I suppose it's a question of responsibility, i.e. should roachprod create be responsible for a thorough validation step (new), or should roachprod create do some "sanity" verification and then let the provider validate the command (current)? Will think about this; I think there are pros and cons, but it could be something worth pursuing.
