roachtest: rewrite the test runner #30977
Conversation
andreimatei
assigned
petermattis
Oct 4, 2018
andreimatei
Oct 4, 2018
Member
The first commit is #30146.
Peter, this patch is nowhere ready, but I'd like a nod on the direction pls.
petermattis
reviewed
Oct 5, 2018
Before this PR, --parallelism was explicitly limited to the number of tests by this code:
// Limit the parallelism to the number of tests. The primary effect this has
// is that we'll log to stdout/stderr if only one test is being run.
if parallelism > len(tests) {
parallelism = len(tests)
}
Perhaps we should just remove that code once your PR to place each run's output in a separate directory is merged. If you want to run a single test multiple times sequentially you can already do --count 100 --parallelism 1.
Reviewed 4 of 4 files at r1.
Reviewable status: complete! 0 of 0 LGTMs obtained
andreimatei
reviewed
Oct 5, 2018
Yes, that code is going away.
I'm not sure what your comment says about the current PR though (if anything). Are you suggesting this PR is not necessary? (cause this PR is all about cluster reuse)
Reviewable status:
complete! 0 of 0 LGTMs obtained
andreimatei
reviewed
Oct 5, 2018
I meant to ask - how does one use/test the roachtest run --cluster option? Is there an easy way to create a cluster that matches a particular test's expectation? Am I supposed to cause a test to fail and use the --debug option to leave a cluster around?
Also, is the roachtest invocation supposed to take ownership of such a cluster? Like, is it supposed to destroy it at the end?
Reviewable status:
complete! 0 of 0 LGTMs obtained
andreimatei
reviewed
Oct 5, 2018
Disregard the last comment; I meant to ask on another PR :P
Reviewable status:
complete! 0 of 0 LGTMs obtained
petermattis
reviewed
Oct 5, 2018
I'm not sure what your comment says about the current PR though (if anything). Are you suggesting this PR is not necessary? (cause this PR is all about cluster reuse)
My comment was in reference to the PR title which only talks about parallelism for repeated runs of the same test. I failed to notice that you're also reusing clusters (yes, this is mentioned in the commit message which I read too fast). Let me take a closer look.
Reviewable status:
complete! 0 of 0 LGTMs obtained
petermattis
requested changes
Oct 5, 2018
Reviewable status:
complete! 0 of 0 LGTMs obtained
pkg/cmd/roachtest/test.go, line 637 at r2 (raw file):
}
sem := make(chan struct{}, parallelism)
clustersPerTLT := parallelism / n
What happens if parallelism == 1 and n == 2? Seems like clustersPerTLT will be 0. Isn't that problematic, or is that what the adjustment below is trying to handle.
pkg/cmd/roachtest/test.go, line 643 at r2 (raw file):
numClusters := clustersPerTLT
if i < parallelism%n {
    numClusters++
I don't understand what this is doing.
pkg/cmd/roachtest/test.go, line 715 at r2 (raw file):
artifactsDir: artifactsDir,
}
f.Put(newCluster(ctx, t, spec.Nodes))
Won't multiple concurrent cluster creation collide on both the cluster name and the artifacts dir?
andreimatei
reviewed
Oct 5, 2018
You don't seem to be against anything, so I got what I wanted for now.
Will ping when it's ready.
Reviewable status:
complete! 0 of 0 LGTMs obtained
pkg/cmd/roachtest/test.go, line 637 at r2 (raw file):
Previously, petermattis (Peter Mattis) wrote…
What happens if parallelism == 1 and n == 2? Seems like clustersPerTLT will be 0. Isn't that problematic, or is that what the adjustment below is trying to handle?
It's not supposed to be problematic. The different runTopLevelTest invocations will contend on sem for creating new clusters.
pkg/cmd/roachtest/test.go, line 643 at r2 (raw file):
Previously, petermattis (Peter Mattis) wrote…
I don't understand what this is doing.
It's giving an extra cluster to some tests when the division of parallelism by the number of tests has a remainder. So in your example (parallelism == 1 and n == 2), the first test would get a guaranteed cluster, and the second one would wait until the first one finishes all its runs.
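The remainder logic being discussed can be sketched as a small function (a minimal illustration with a hypothetical name, distributeClusters; not the PR's actual code):

```go
package main

import "fmt"

// distributeClusters splits `parallelism` guaranteed clusters among n
// top-level tests, giving the first parallelism%n tests one extra cluster,
// mirroring the clustersPerTLT adjustment discussed above.
func distributeClusters(parallelism, n int) []int {
	counts := make([]int, n)
	for i := range counts {
		counts[i] = parallelism / n
		if i < parallelism%n {
			counts[i]++
		}
	}
	return counts
}

func main() {
	// With parallelism == 1 and n == 2, the first test gets the one
	// guaranteed cluster and the second gets none up front.
	fmt.Println(distributeClusters(1, 2)) // [1 0]
	fmt.Println(distributeClusters(5, 3)) // [2 2 1]
}
```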
pkg/cmd/roachtest/test.go, line 715 at r2 (raw file):
Previously, petermattis (Peter Mattis) wrote…
Won't multiple concurrent cluster creation collide on both the cluster name and the artifacts dir?
Yes (except I think you meant to say "cluster name and log file"; the artifacts dir isn't an issue any more, as that's now a property of a test, not of a cluster).
I told you it wasn't ready :P. I'll keep you posted.
petermattis
reviewed
Oct 5, 2018
You don't seem to be against anything, so I got what I wanted for now.
Will ping when it's ready.
Ack. The assignment of parallelism to clusters is a bit more subtle than desirable, though I suppose the same could be said of much of the roachtest code. The channel of channels is interesting. I don't have a better suggestion.
Reviewable status:
complete! 0 of 0 LGTMs obtained
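The "channel of channels" pattern Peter mentions can be illustrated in general form: a per-request reply channel is itself sent over a shared channel, so responses route back to the requester without extra locking. This is a generic sketch with hypothetical names (request, reply), not the PR's actual types:

```go
package main

import (
	"fmt"
	"sync"
)

// request carries its own reply channel; the shared `requests` channel is
// therefore a channel that transports channels.
type request struct {
	name  string
	reply chan string
}

func main() {
	requests := make(chan request)
	var wg sync.WaitGroup

	// A single worker drains the shared channel and answers each request
	// on its embedded per-request channel.
	wg.Add(1)
	go func() {
		defer wg.Done()
		for req := range requests {
			req.reply <- "cluster-for-" + req.name
		}
	}()

	for _, name := range []string{"kv", "tpcc"} {
		r := request{name: name, reply: make(chan string, 1)}
		requests <- r
		fmt.Println(<-r.reply)
	}
	close(requests)
	wg.Wait()
}
```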
andreimatei
changed the title from
roachtest: allow parallel execution of repeated runs
to
roachtest: rewrite the test runner
Oct 16, 2018
andreimatei
added some commits
Oct 4, 2018
andreimatei
Oct 16, 2018
Member
@benesch I'd appreciate a high-level look over this. There's many i's to dot still and I haven't really run the thing, but I think it's ready for a first look.
petermattis
reviewed
Oct 17, 2018
I've only skimmed parts of this and I likely won't have time to give it a thorough look until Friday. I suggest we hold off on merging this until 2.1 is out the door as I expect it to cause some short term instability. Or you have to find a way to break it into smaller pieces.
Reviewable status:
complete! 0 of 0 LGTMs obtained
pkg/cmd/roachtest/main.go, line 196 at r5 (raw file):
cmd.Flags().IntVar(
    &cpuQuota, "cpu-quota", 100,
    "The number of cloud CPUs roachtest is allowed to use at any one time.")
What happens if a single test wants to use more CPU?
pkg/cmd/roachtest/test.go, line 90 at r5 (raw file):
// Any means that the cluster can be used by any other test.
var Any = ClusterReusePolicy{policy: any}
My intuition is that we should somehow fold this into the nodeSpec. The benefit of your approach is that you can upgrade an Any cluster to a tagged cluster, but I'm not sure that is worth supporting.
The nodeSpec approach fits more seamlessly into a world where "nobarrier" and "zfs" are also part of the nodeSpec. nodes(4, tag("acceptance")) could be used to specify a cluster that is shared by the acceptance tests (they would all have this same specification). nodes(6, noreuse) could be used to indicate a cluster that is not reused. I haven't looked at how you're checking to see if two cluster specs are compatible, but this seems to fit in naturally. nodes(4, nobarrier) could indicate that the disks are to be remounted with the "nobarrier" option. You could even imagine sharing a nodes(4) and a nodes(4, nobarrier) cluster by remounting the disks automatically before starting the test. I'm not suggesting you do that work in this PR, just that the nodeSpec mechanism seems like the right place to be specifying cluster reuse.
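Peter's nodes(4, tag("acceptance")) suggestion reads naturally as a functional-options constructor. A minimal sketch of that shape, where all names (nodeSpec, nodeOpt, tag, noreuse, nobarrier) are hypothetical and not roachtest's actual API:

```go
package main

import "fmt"

// nodeSpec is a hypothetical cluster specification carrying reuse policy
// and mount options alongside the node count.
type nodeSpec struct {
	count     int
	tag       string
	reuse     bool
	mountOpts string
}

// nodeOpt mutates a nodeSpec; options compose in the nodes() constructor.
type nodeOpt func(*nodeSpec)

func tag(t string) nodeOpt { return func(s *nodeSpec) { s.tag = t } }

var noreuse nodeOpt = func(s *nodeSpec) { s.reuse = false }

var nobarrier nodeOpt = func(s *nodeSpec) { s.mountOpts = "nobarrier" }

// nodes builds a spec; clusters are reusable by default.
func nodes(count int, opts ...nodeOpt) nodeSpec {
	s := nodeSpec{count: count, reuse: true}
	for _, o := range opts {
		o(&s)
	}
	return s
}

func main() {
	fmt.Printf("%+v\n", nodes(4, tag("acceptance")))
	fmt.Printf("%+v\n", nodes(6, noreuse))
	fmt.Printf("%+v\n", nodes(4, nobarrier))
}
```

Compatibility checks then become straightforward field comparisons between two nodeSpec values.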
pkg/cmd/roachtest/test.go, line 911 at r5 (raw file):
for i := 0; i < parallelism; i++ {
    wg.Add(1)
    stopper.RunWorker(ctx, func(ctx context.Context) {
I'm not seeing what a stopper is adding here. Why not just go func() { ... } and a channel which is closed to indicate that runWorker should exit?
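The plain-goroutine alternative Peter suggests can be sketched like this (illustrative names; the done channel replaces the stopper as the shutdown signal):

```go
package main

import (
	"fmt"
	"sync"
)

func main() {
	done := make(chan struct{}) // closed to request worker shutdown
	work := make(chan int)
	var wg sync.WaitGroup

	const parallelism = 3
	for i := 0; i < parallelism; i++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			for {
				select {
				case <-done:
					return // shutdown requested
				case w, ok := <-work:
					if !ok {
						return
					}
					fmt.Printf("worker %d ran test %d\n", id, w)
				}
			}
		}(i)
	}

	// Unbuffered sends block until a worker receives, so all five items
	// are processed before the shutdown signal below.
	for i := 0; i < 5; i++ {
		work <- i
	}
	close(done)
	wg.Wait()
}
```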
pkg/cmd/roachtest/test.go, line 974 at r5 (raw file):
    name = fmt.Sprintf("%d", timeutil.Now().Unix())
}
name += "-" + t.Name
Including the test name in the cluster name is going to be weird once clusters are being reused across unassociated tests. If the cluster is tagged, I think the tag should be included instead. I'm not sure what to do about untagged clusters.
andreimatei commented Oct 4, 2018 (edited)
This patch reimplements the roachtest test runner. Some issues with the
previous version:
- no way to control the number of tests that run in parallel
This patch takes a clean-slate approach to the runner. It introduces a
runner that takes a --cpu-quota (a number of cloud CPUs) and then, building on the
cluster reuse policies introduced in the previous commit, tries to
schedule tests most efficiently and with the greatest parallelism while
staying under that quota (and also while staying under the
--parallelism, which now acts purely to protect against too many tests
thrashing the local machine).
The runner starts --parallelism workers and lets them compete on a
ResourceGovernor which manages the cloud quota.
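The quota idea can be sketched as a simple blocking pool: workers acquire CPUs before creating a cluster and return them on destroy. The real ResourceGovernor in the PR is more sophisticated; the quotaPool name and shape here are hypothetical, for illustration only:

```go
package main

import (
	"fmt"
	"sync"
)

// quotaPool tracks how many cloud CPUs are still available under the
// --cpu-quota budget; acquire blocks until the request can be satisfied.
type quotaPool struct {
	mu    sync.Mutex
	cond  *sync.Cond
	avail int
}

func newQuotaPool(cpus int) *quotaPool {
	q := &quotaPool{avail: cpus}
	q.cond = sync.NewCond(&q.mu)
	return q
}

// acquire blocks until n CPUs are available, then reserves them.
func (q *quotaPool) acquire(n int) {
	q.mu.Lock()
	defer q.mu.Unlock()
	for q.avail < n {
		q.cond.Wait()
	}
	q.avail -= n
}

// release returns n CPUs to the pool and wakes blocked acquirers.
func (q *quotaPool) release(n int) {
	q.mu.Lock()
	q.avail += n
	q.mu.Unlock()
	q.cond.Broadcast()
}

func main() {
	q := newQuotaPool(16) // e.g. --cpu-quota=16
	q.acquire(8)          // a worker creates an 8-CPU cluster
	q.acquire(8)          // a second worker takes the rest
	q.release(8)          // first cluster destroyed, quota returned
	fmt.Println("quota accounting ok")
}
```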
Tests are seen abstractly as units of work. There's no more subtests,
and --count is supported naturally as more work.
The scheduling policy is not advanced: at any point in time, a worker
has a cluster and will continuously select tests that can reuse that
cluster. When no more tests can reuse it, the cluster is destroyed, new
resources acquired and a new cluster created. Within multiple tests that
can all reuse a cluster, ones that don't soil the cluster are preferred.
Within otherwise equal tests, the ones with more runs remaining are
preferred.
Release note: None