-
Notifications
You must be signed in to change notification settings - Fork 3.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
roachprod: roachtest: refactoring #104915
Conversation
f22da9c
to
dacfacf
Compare
f2827ea
to
97e6226
Compare
105343: roachprod: suppress SCP retries for specific error substrings. r=herkolategan,renatolabs a=smg260 Previously, SCP file transfers would be retried regardless of the error encountered, even if it was one that would unlikley be resolved by retries. This PR adds a slice of substrings to match scp output to, and in which cases not to retry, resulting in less noisy logs. A connect timeout of 10s has also been specified. Epic: none Release note: none This is cherry-picked (and slightly amended) from #104915 Co-authored-by: Miral Gadani <miral@cockroachlabs.com>
97e6226
to
ffaa73f
Compare
b6bc263
to
d953c89
Compare
- Funnel all ssh command calls to `runCmdOnSingleNode`. A boolean is added to control whether we prepend ROACHPROD env vars, which are required for user-supplied commands via `c.Run/RunWithDetails` - `c.ParallelE` now accepts a node list to make it clear which nodes a specified command is running on. Previously, and confusingly, this was achieved via an `int` arg and a captured-by-ref slice of nodes. The int arg referred to the first `int` nodes in the captured slice.
d953c89
to
9a7d06b
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Definitely a lot of changes and it's hard to digest it all in one sitting. Overall I do quite like the changes though, just left minor comments.
I'm looking at the run on this branch that you linked above. I was curious about the new formatting of the command output, but couldn't find anything. Some things also caught my attention and I have the feeling they are not expected:
- the
.failed
log file seems to be missing. - The
jepsen
failure has an empty log file for thejepsen
command.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @herkolategan and @smg260)
pkg/cmd/roachtest/cluster.go
line 589 at r2 (raw file):
err: err, stdout: stdoutString, stderr: stderrString,
Nice, 🔥
pkg/cmd/roachtest/cluster.go
line 2207 at r2 (raw file):
logFileFull := l.File.Name() if err != nil { createFailedFile(logFileFull)
In RunE
, we call createFailedFile(logFile)
. Any reason this needs to be createFailedFile(logFileFull)
in this function?
pkg/cmd/roachtest/cluster.go
line 2226 at r2 (raw file):
func createFailedFile(logFile string) { if logFile == "" { return
Why the check? Can this actually be empty in any of the callers?
pkg/roachprod/roachprod.go
line 1760 at r2 (raw file):
res := &install.RunResultDetails{Node: node} //TODO: this usage may not always be correct, if the target nodes are not sequential
The current implementation of TargetNodes
already guarantees ordering. I'd either remove the TODO
to avoid confusion (same for TODO below)
pkg/roachprod/install/cluster_synced.go
line 876 at r2 (raw file):
res = newRunResultDetails(node, cmdErr) res.CombinedOut = string(out) //output = fmtOut(res.CombinedOut)
Can be removed?
pkg/roachprod/install/cluster_synced.go
line 1134 at r2 (raw file):
runOpts.combinedOut = false return c.runCmdOnSingleNode(ctx, l, n, cmd, runOpts) //sess := c.newSession(l, 1, cmd, withDebugName("ssh-gen-key"))
Delete?
pkg/roachprod/install/cluster_synced.go
line 1148 at r2 (raw file):
runOpts.stdin = bytes.NewReader(sshTar) return c.runCmdOnSingleNode(ctx, l, node, `tar xf -`, runOpts) //sess := c.newSession(l, node, cmd, withDebugName("ssh-dist-key"))
Delete.
pkg/roachprod/install/cluster_synced.go
line 2452 at r2 (raw file):
// on the cluster. // // ParallelE runs at most `concurrency` (or `config.MaxConcurrency` if it is lower) in parallel.
Nit: extra space.
9a7d06b
to
829dab0
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thankyou for the review:
the
.failed
log file seems to be missing.
Addressed in comments below
The
jepsen
failure has an empty log file for thejepsen
command.
The cmd log files show stdout/err only, and the jepsen command redirects both to invoke.log, hence no output.
However, I've made an update to show some info in the command logs regardless:
- the full command being run
- the final status ( or error)
I'll run another roachtest at 0.05 select to showcase the changes, and will update this thread when it's ready.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @herkolategan and @renatolabs)
pkg/cmd/roachtest/cluster.go
line 2207 at r2 (raw file):
Previously, renatolabs (Renato Costa) wrote…
In
RunE
, we callcreateFailedFile(logFile)
. Any reason this needs to becreateFailedFile(logFileFull)
in this function?
Good catch! Explains why there was no failed file for the jepsen
test, but there was for knex
. They should both be logFileFull
pkg/cmd/roachtest/cluster.go
line 2226 at r2 (raw file):
Previously, renatolabs (Renato Costa) wrote…
Why the check? Can this actually be empty in any of the callers?
I saw it occur at some point, but is now defensive.
pkg/roachprod/roachprod.go
line 1760 at r2 (raw file):
Previously, renatolabs (Renato Costa) wrote…
The current implementation of
TargetNodes
already guarantees ordering. I'd either remove theTODO
to avoid confusion (same for TODO below)
You're right - there is a comment to that effect on c.TargetNodes()
that I missed.
pkg/roachprod/install/cluster_synced.go
line 876 at r2 (raw file):
Previously, renatolabs (Renato Costa) wrote…
Can be removed?
Done.
pkg/roachprod/install/cluster_synced.go
line 1134 at r2 (raw file):
Previously, renatolabs (Renato Costa) wrote…
Delete?
Done.
pkg/roachprod/install/cluster_synced.go
line 1148 at r2 (raw file):
Previously, renatolabs (Renato Costa) wrote…
Delete.
Done.
dbdc845
to
8a3205f
Compare
roachtest - Remove `execCmd/Ex` pipes and buffers, and instead dispatch to `roachprod.Run` directly. - `createFailedFile` fn checks for name length - `RunE` wraps `err` with location of full output - `c.RunWithDetails` creates a failed file if any node returns an error (previously just for a roachprod error) - `test.log` shows where a command is logging to - `c.runCmdOnSingleNode` includes command and output in the error - fix `fileExistsOnFirstNode` (was returning true if file not exists instead of vice versa) - update `c.{startNode,initializeCluster,setClusterSettings}` to return (*RunResultDetails, error) - simplify `c.Start` roachprod - `RunResultDetails.Output()` convenience to print cmd output - `c.runCmdOnSingleNode` includes truncated stdout/err in error object; the full command output will still be available in the cmd run log. - `processResults` formatting; use tabs, show `<ok>` or `<err>` prefix instead of blank output
8a3205f
to
3c64af2
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status:
complete! 1 of 0 LGTMs obtained (waiting on @herkolategan, @smg260, and @srosenberg)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 5 of 8 files at r4, 1 of 3 files at r5, 2 of 2 files at r6, all commit messages.
Reviewable status:complete! 1 of 0 LGTMs obtained (waiting on @renatolabs, @smg260, and @srosenberg)
pkg/roachprod/roachprod.go
line 1080 at r6 (raw file):
httpClient := httputil.NewClientWithTimeout(timeout) startTime := timeutil.Now().Unix() err = c.Parallel(ctx, l, c.TargetNodes(), func(ctx context.Context, node install.Node) (*install.RunResultDetails, error) {
Nice!
TFTR! I will hold off merging until I am completely back from PTO so I can handle any unexpected consequences. |
bors r=renatolabs,herkolategan |
Build succeeded: |
blathers backport 23.1 |
Encountered an error creating backports. Some common things that can go wrong:
You might need to create your backport manually using the backport tool. error creating merge commit from 76860d6 to blathers/backport-release-23.1-104915: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict [] you may need to manually resolve merge conflicts with the backport tool. Backport to branch 23.1 failed. See errors above. 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf. |
This commit addresses some usability issues within roachprod/roachtest, and does some minor clean up.
Summary:
roachprod
c.Parallel
now acceptsnodes Nodes
andfunc(ctx context.Context, node Node)
. Previously, these was anint n
representing the firstn
nodes to operate on, andfunc(ctx context.Context, int n)
, where second arg would confusingly represented the index of the node being operated on, which was wholly dependent on a captured-by-reference slice at the call site. This change makes it explicit to the caller and the function, which exact node is targeted.c.Parallel
now returns[]*RunResultDetails, boolean, error
. The slice will contain results, successful and failed, for each node (or thus far, depending on fail-fast behaviour). The boolean is a signal to the caller that at least one of the results in the slice suffered from a command error, which can be inspected in the result itself. Theerror
now more idiomatically represents an unrecoverable error encountered duringParallel
; i.e. an un retryable error inroachprod
. In the latter case, results will benil
.All calls to execute a remote command on a node are all funnelled to
c.runCmdOnSingleNode
, ensuring consistency, and reducing duplicated code. This function accepts aRunCmdOptions
struct to configure it's behaviour, includingstdin/out
runCmdOnSingleNode
: in error cases, include stdout/err output (truncated after n bytes - effectivelytail
). This output will show in thefailure_n.log
and give convenient insight. Full output is always retained in the run log.RunResultDetails.{Stdout,Stderr,CombinedOut}
are now allstring
to represent the most common use case, and a conveniencer.Output()
added to print out the contents ofStderr,Stdout,CombinedOut
Remove complex legacy pipe processing logic from
roachtest
(execCmd,execCmdEx
), previously used whenroachprod
was not a library ofroachtest
When showing the result of a command on each node on the CLI, explicitly display
<ok>
or<err>
followed by output/errorroachtest
skip post-test assertions on a test failure
removal of
execCmd/execCmdRes
which were doing some complex pipe processing before calling roachprod.Run (not required since roachprod is a lib of roachtest)Minor
Informs: #104312
Informs: #104316 (TBD)
Epic: none
Release note: none