roachprod: add functional test framework and tests #161178

williamchoe3 wants to merge 1 commit into cockroachdb:master from …
Conversation
Your pull request contains more than 1000 changes. It is strongly encouraged to split big PRs into smaller chunks.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.
Force-pushed from a0b664e to 62c2e33
Force-pushed from 2b54943 to 33a63eb
Force-pushed from b77f156 to aed3981
**PR #161178 Review: roachprod functional tests**

**Overview**

This PR introduces a lightweight end-to-end testing framework for roachprod that creates real clusters on GCE, runs roachprod commands against them, and asserts cluster state. The framework supports randomized configuration generation, reproducible seeds, and automatic cleanup. Well-structured overall — the separation between the framework and the tests is clean.

**BLOCKERS**

B1. Cluster name collisions with:

```go
timestamp := time.Now().Unix()
return fmt.Sprintf("%s-%s-%d", username, testName, timestamp)
```

The cluster name uses unix seconds for uniqueness, so two invocations that start within the same second collide. The CI script explicitly supports …

Fix: include a per-invocation disambiguator. Bazel sets …

```go
return fmt.Sprintf("%s-%s-%d-%s", username, testName, timestamp, randomSuffix(4))
```
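For illustration, one possible `randomSuffix` (the helper name comes from the suggested fix; this implementation is only a sketch, not part of the PR):

```go
package framework // assumed package; sketch only

import cryptorand "crypto/rand"

// randomSuffix returns n random lowercase-alphanumeric characters. Using
// crypto/rand sidesteps seeding concerns across concurrently launched test
// binaries; the slight modulo bias is irrelevant for a name disambiguator.
func randomSuffix(n int) string {
	const alphabet = "abcdefghijklmnopqrstuvwxyz0123456789"
	b := make([]byte, n)
	if _, err := cryptorand.Read(b); err != nil {
		panic(err) // test-only helper; fail loudly
	}
	for i := range b {
		b[i] = alphabet[int(b[i])%len(alphabet)]
	}
	return string(b)
}
```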
B2. Step 3 sets …

This affects ~2.5% of randomized configs on machines that support local SSD. The resulting … Fix: step 8 should set …

**MEDIUM**

M1. …

M2. The function tracks the last line that equals … This is unlikely today but fragile — any future change to roachprod's error output could trigger it. Consider parsing from the end backwards, or using …

M3. The entire file is a commented-out test with the note "Commenting because this just makes the log unusable rn..." This shouldn't ship in the PR. Either fix the log noise issue and include the test, or remove the file entirely. Dead code in a test framework that "all pipelines will migrate on" sets a bad precedent.

M4.

```go
target := fmt.Sprintf("%s:1", tc.clusterName)
```

For multi-node clusters after a reset, SSH could be available on node 1 but not node 2 or 3. The test then proceeds to run commands on all nodes and gets a flaky failure. Today only …

**NITs**

N1.

```go
func joinZones(zones []string) string {
	result := ""
	for i, z := range zones { ... }
	return result
}
```

This is just `strings.Join(zones, ",")`.

N2.

```go
if len(provider) >= len(expectedCloud) && provider[:len(expectedCloud)] == expectedCloud {
```

→ `strings.HasPrefix(provider, expectedCloud)`

N3. README typos: …
N4. rpt.AssertClusterLifetime(initialLifetime + extension)This assumes |
**Updated Review**

Previously raised items — status: …

Remaining items (unchanged, author chose not to address): MEDIUM M2, NIT N4.

All blockers are resolved. The remaining M2 is a theoretical fragility that's acceptable for now. This looks good to merge.
Force-pushed from ad02843 to c011536
Details

Read the README.md for a general overview. Stylistically I took inspiration from …

I want to call out that I had some trouble with the … This has been working, but it feels like it's doing a lot, and I'm wondering if there's any simpler approach anyone happens to know. Also, I assumed I would only have to auth a single time, but because of how each bazel test rule compiles into its own binary, it seems like I need to auth per rule. Again, this has been working; it seems a bit complicated to me, but that also seems like a consequence of each test rule being treated individually.

I hope to increase the coverage over time. I had more to add, but I think dividing the commands into 3 broad categories (cloud provider facing, crdb facing, utility) makes sense, so I'll try to add baseline coverage in that order and based on priority.

In general, very open to feedback. Hope this is helpful, and a start to increasing the reliability of our tooling!
golgeek left a comment
Went through a first pass, will have another look tomorrow.
Good job overall, I left a few comments here and there.
```go
	return
}
remaining := time.Until(deadline)
require.Greater(tc.t, remaining, time.Duration(0),
```
```go
rpt.AssertClusterExists()
rpt.AssertClusterNodeCount(numNodes)
rpt.AssertClusterCloud("gce")
```
Shouldn't we also assert amd64 in case one day the default machine type is changed to arm64?
oops, thx for the catch, added
```go
rpt.AssertClusterNodeCount(numNodes)
rpt.AssertClusterCloud("gce")
rpt.AssertClusterMachineType("n2-standard-4")
rpt.AssertClusterArchitecture("amd64")
```
Do we have the image used in the assertion?
We could have something like `AssertClusterMachineImageContains("-fips-")`.
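A minimal sketch of what that helper could look like, assuming the framework keeps the VM records parsed from `roachprod list --json` (the `vms()` accessor and `vmInfo` fields here are illustrative, not the framework's actual API):

```go
// Hypothetical record shape; field names are assumptions.
type vmInfo struct {
	Name  string // node name
	Image string // source image the VM was created from
}

// AssertClusterMachineImageContains checks that every VM in the cluster was
// created from an image whose name contains substr (e.g. "-fips-").
func (rpt *RoachprodTest) AssertClusterMachineImageContains(substr string) {
	for _, vm := range rpt.vms() { // vms(): records parsed from `roachprod list --json`
		require.Contains(rpt.t, vm.Image, substr,
			"node %s: image %q does not contain %q", vm.Name, vm.Image, substr)
	}
}
```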
```go
rpt.AssertClusterExists()
rpt.AssertClusterNodeCount(numNodes)
rpt.AssertClusterCloud("gce")
```
Should probably assert the image :)
no images :/ can add a todo to expose it in `roachprod list --json`
roachprod_test.log
```go
rpt.AssertClusterCloud("gce")
rpt.AssertClusterNodeCount(opts.NumNodes)

// Type assert to GCE options for verification
```
This feels a bit weird in this context.
Unless I'm mistaken, we generate randomized settings but somehow test arbitrary ones. Since the framework generates the options and has the cluster info, I feel like it has everything it needs to run the correct assertions on its own.

What if `RandomGCECreateOptions()` would:

- generate the randomized options
- store a list of asserts and values
- expose an `AssertRandomizedOptions()` that execs all stored assert functions on the cluster info?
Makes sense, I think overall the assertions part could use a revisit down the line, but I should be able to implement something like this, and I agree it makes more sense.
For assertions as a whole, looking back I was trying to figure out a definition of "correctness" for a cluster create command from the perspective of an end user. I never really formalized that anywhere though, so I can attempt to do this now.
1. roachprod create succeeds
2. all options specified were applied, verified
3. ssh? is that implied by roachprod success? anything else?
So that's why I just went with listing out each Assert for each test; it looks boilerplate, but I like how readable it is (a sketch of that style is below).
But yeah, this breaks down for the randomized test, since the create args are generated at runtime and not visible in the test code.
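For reference, a sketch of that explicit-asserts style, mapping to the 1/2/3 definition above. The `RunExpectSuccess` and `Assert*` helpers appear elsewhere in this PR; the constructor and cleanup names (`NewRoachprodTest`, `Cleanup`) and the flag spellings are assumptions:

```go
func TestCreateBasic(t *testing.T) {
	// Harness setup; constructor and cleanup names are assumed.
	rpt := framework.NewRoachprodTest(t)
	defer rpt.Cleanup()

	// 1. roachprod create succeeds.
	rpt.RunExpectSuccess("create", rpt.ClusterName(), "--clouds=gce", "--nodes=3")

	// 2. every option we specified was applied.
	rpt.AssertClusterExists()
	rpt.AssertClusterNodeCount(3)
	rpt.AssertClusterCloud("gce")
	rpt.AssertClusterArchitecture("amd64")
}
```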
implemented, changes described below, let me know if this is different from what you had in mind
```go
// RandomGCECreateOptions (createconfig.go) co-generates assertion closures
// alongside each randomized setting. Each closure captures the expected
// value and calls the appropriate Assert* method above. These types and
// methods support that mechanism.
//
// Registration (in createconfig.go during config generation):
//
//	providerOpts.MachineType = "n2-standard-4"
//	cfg.expect("machine type = n2-standard-4", func(rpt *RoachprodTest) {
//		rpt.AssertClusterMachineType("n2-standard-4")
//	})
//
// The closure does not execute here — it is stored and run later when the
// caller invokes cfg.AssertAll(rpt) against a live cluster.
//
// Execution (in the test, after cluster creation):
//
//	opts := framework.RandomGCECreateOptions(rpt.Rand())
//	rpt.RunExpectSuccess(opts.ToCreateArgs(rpt.ClusterName())...)
//	opts.AssertAll(rpt) // now all registered closures execute
```
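For concreteness, here is a minimal sketch of the storage and execution plumbing that comment describes. The `expect` and `AssertAll` names come from the comment above; the struct fields and `rpt.t` are assumptions:

```go
// assertion pairs a human-readable description with a deferred check.
type assertion struct {
	desc string
	fn   func(rpt *RoachprodTest)
}

// RandomizedClusterConfig holds the randomized create options plus the
// assertions registered for them during generation.
type RandomizedClusterConfig struct {
	asserts []assertion
	// ... randomized GCE create options elided ...
}

// expect registers a closure to be run later against the live cluster.
func (cfg *RandomizedClusterConfig) expect(desc string, fn func(*RoachprodTest)) {
	cfg.asserts = append(cfg.asserts, assertion{desc: desc, fn: fn})
}

// AssertAll executes every registered closure, logging each description so
// a failure pinpoints which randomized setting was not applied.
func (cfg *RandomizedClusterConfig) AssertAll(rpt *RoachprodTest) {
	for _, a := range cfg.asserts {
		rpt.t.Logf("asserting: %s", a.desc)
		a.fn(rpt)
	}
}
```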
Introduce a test framework for running roachprod commands against real GCE infrastructure, along with a suite of functional tests and a TeamCity CI script for weekly execution.

The framework (`pkg/cmd/roachprod/test/framework`) provides:

- `RoachprodTest` harness managing cluster lifecycle
- Assertion helpers that query `roachprod list --json` to verify cluster state (node count, cloud, lifetime, etc.)
- Randomized configuration generation (`RandomGCECreateOptions`) that produces valid, internally consistent GCE configs (machine type, architecture, storage, zones, etc.) using a seeded RNG for reproducibility
- Operations helpers for cluster ops and cluster utility helpers

Test coverage includes GCE related commands: create, destroy, extend, start/stop, reset, version, populate-etc-hosts, and managed clusters.

The CI script (`build/teamcity/cockroach/nightlies/roachprod_weekly.sh`) builds bazci and runs all test targets inside a Docker container.

Epic: None
Release note: None
Force-pushed from 09cc96f to 0686e5b
looking at this message, if this would've gone in, it would've run under the CI stress TC build, which would've been bad
```go
//
// The closure does not execute here — it is stored and run later when the
// caller invokes cfg.AssertAll(rpt) against a live cluster.
func RandomGCECreateOptions(rng *rand.Rand) *RandomizedClusterConfig {
```
Take this with a grain of salt, since I only thought about this at a high level:
Right now, the function promises to return a valid GCE create config, doing so by cross-referencing each random field with `GetMachineInfo` as the source of truth.

I think this is a bit limited because:

- The verification is a test-only detail. If we add a new constraint to an opt, we need to make sure to update our tests as well, i.e. it's extra toil to maintain this `RandomGCECreateOptions` function.
- It might lead us to be overly conservative in our opt randomization. If we add a new storage type, we have to manually add it to the test.

Instead, what if verification were pushed down to the actual roachprod Create command? We already have bits and pieces of verification, e.g. `return nil, errors.Newf("--%s-machine-type requires gce in --clouds", gce.ProviderName)`; we could centralize all the verification in one place.

This would instead let us return a completely random GCE create config, regardless of whether it's valid. We can then check whether it's valid with our above Verify method; if it's not, regenerate a new config. If it is, then we should be able to run the rest of our assertions as before.

I think this approach is nicer because:

- If we generate an invalid combination of flags, it's not a test flake but rather an actual roachprod bug in `Verify`, i.e. we are more incentivized to fix it, which in turn improves our bad-config checking in the actual command itself.
- It simplifies the randomization of this function by a lot, which should hopefully reduce the toil of adding new fields.

(A sketch of this generate-then-verify loop follows below.)
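A sketch of the loop described above. `Verify` is the proposed centralized validation on the roachprod create path, and `fullyRandomGCEConfig` is a hypothetical unconstrained generator; neither exists in the PR as written:

```go
// RandomGCECreateOptions under the proposed design: draw configs with no
// cross-field consistency enforced, and let the (proposed) Verify method
// decide validity. Invalid draws are simply retried.
func RandomGCECreateOptions(rng *rand.Rand) *RandomizedClusterConfig {
	for {
		cfg := fullyRandomGCEConfig(rng) // hypothetical: each field drawn independently
		if err := cfg.Verify(); err == nil {
			return cfg
		}
		// A config Verify rejects is discarded here; a bad config that
		// Verify wrongly accepts surfaces later as a real roachprod bug,
		// not a test flake.
	}
}
```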
> verification was pushed down to the actual roachprod Create command.

That makes sense. I do like this a lot from the test's PoV. I suppose it's a question of responsibility, i.e. should roachprod create be responsible for doing a thorough validation step (new), or should roachprod create do some "sanity" verification and then let the provider basically validate the command (current)? Will think about this; I think there are pros/cons, but it could be something worth pursuing.
Introduces a lightweight test framework for running roachprod commands against real GCE infrastructure, along with a suite of functional tests targeting roachprod cluster operations on GCE, and a TeamCity CI script for weekly execution. See `README.md` for more details.

The framework (`pkg/cmd/roachprod/test/framework`) provides:

- `RoachprodTest` harness managing cluster lifecycle
- Assertion helpers that query `roachprod list --json` to verify cluster state (node count, cloud, lifetime, etc.)
- Randomized configuration generation (`RandomGCECreateOptions`) that produces valid, internally consistent GCE configs (machine type, architecture, storage, zones, etc.) using a seeded RNG for reproducibility
- Operations helpers for cluster ops and cluster utility helpers

Tests are located in `pkg/cmd/roachprod/test/test/` and current coverage includes GCE related commands: create, destroy, extend, start/stop, reset, version, populate-etc-hosts, and managed clusters.

The CI script (`build/teamcity/cockroach/nightlies/roachprod_weekly.sh`) builds bazci and runs all test targets inside a Docker container.

See below comments for more details.