Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

roachprod: roachtest: refactoring #104915

Merged
merged 3 commits into from
Aug 17, 2023

Conversation

smg260
Copy link
Contributor

@smg260 smg260 commented Jun 14, 2023

This commit addresses some usability issues within roachprod/roachtest, and does some minor clean up.

Summary:

roachprod

  • c.Parallel now accepts nodes Nodes and func(ctx context.Context, node Node). Previously, these was an int n representing the first n nodes to operate on, and func(ctx context.Context, int n), where second arg would confusingly represented the index of the node being operated on, which was wholly dependent on a captured-by-reference slice at the call site. This change makes it explicit to the caller and the function, which exact node is targeted.

  • c.Parallel now returns []*RunResultDetails, boolean, error. The slice will contain results, successful and failed, for each node (or thus far, depending on fail-fast behaviour). The boolean is a signal to the caller that at least one of the results in the slice suffered from a command error, which can be inspected in the result itself. The error now more idiomatically represents an unrecoverable error encountered during Parallel; i.e. an un retryable error in roachprod. In the latter case, results will be nil.

  • All calls to execute a remote command on a node are all funnelled to c.runCmdOnSingleNode, ensuring consistency, and reducing duplicated code. This function accepts a RunCmdOptions struct to configure it's behaviour, including stdin/out

  • runCmdOnSingleNode : in error cases, include stdout/err output (truncated after n bytes - effectively tail). This output will show in the failure_n.log and give convenient insight. Full output is always retained in the run log.

  • RunResultDetails.{Stdout,Stderr,CombinedOut} are now all string to represent the most common use case, and a convenience r.Output() added to print out the contents of Stderr,Stdout,CombinedOut

  • Remove complex legacy pipe processing logic from roachtest (execCmd,execCmdEx), previously used when roachprod was not a library of roachtest

  • When showing the result of a command on each node on the CLI, explicitly display <ok> or <err> followed by output/error

roachtest

  • skip post-test assertions on a test failure

  • removal of execCmd/execCmdRes which were doing some complex pipe processing before calling roachprod.Run (not required since roachprod is a lib of roachtest)

Minor

  • Clean up various redundant/unused variables
  • Typos
  • Command formatting

Informs: #104312
Informs: #104316 (TBD)

Epic: none
Release note: none

@cockroach-teamcity
Copy link
Member

This change is Reviewable

@smg260 smg260 force-pushed the roachprod_refactor_parallel branch 3 times, most recently from f22da9c to dacfacf Compare June 20, 2023 15:58
@smg260 smg260 force-pushed the roachprod_refactor_parallel branch 5 times, most recently from f2827ea to 97e6226 Compare June 23, 2023 12:50
craig bot pushed a commit that referenced this pull request Jun 23, 2023
105343: roachprod: suppress SCP retries for specific error substrings. r=herkolategan,renatolabs a=smg260

Previously, SCP file transfers would be retried regardless of the error encountered, even if it was one that would unlikley be resolved by retries. This PR adds a slice of substrings to match scp output to, and in which cases not to retry, resulting in less noisy logs.

A connect timeout of 10s has also been specified.

Epic: none
Release note: none


This is cherry-picked (and slightly amended) from #104915

Co-authored-by: Miral Gadani <miral@cockroachlabs.com>
@smg260 smg260 force-pushed the roachprod_refactor_parallel branch from 97e6226 to ffaa73f Compare June 23, 2023 15:58
@smg260 smg260 force-pushed the roachprod_refactor_parallel branch 4 times, most recently from b6bc263 to d953c89 Compare July 26, 2023 21:46
@smg260 smg260 changed the title Roachprod refactor parallel roachprod: roachtest: refactoring Jul 26, 2023
@smg260 smg260 marked this pull request as ready for review July 26, 2023 21:51
@smg260 smg260 requested a review from a team as a code owner July 26, 2023 21:51
@smg260 smg260 requested review from herkolategan and renatolabs and removed request for a team July 26, 2023 21:51
- Funnel all ssh command calls to `runCmdOnSingleNode`. A boolean is added
to control whether we prepend ROACHPROD env vars, which are required for
user-supplied commands via `c.Run/RunWithDetails`

- `c.ParallelE` now accepts a node list to make it clear which nodes a
specified command is running on. Previously, and confusingly, this was
achieved via an `int` arg and a captured-by-ref slice of nodes. The int
arg referred to the first `int` nodes in the captured slice.
@smg260
Copy link
Contributor Author

smg260 commented Jul 27, 2023

@smg260 smg260 force-pushed the roachprod_refactor_parallel branch from d953c89 to 9a7d06b Compare July 27, 2023 18:38
Copy link
Collaborator

@renatolabs renatolabs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Definitely a lot of changes and it's hard to digest it all in one sitting. Overall I do quite like the changes though, just left minor comments.

I'm looking at the run on this branch that you linked above. I was curious about the new formatting of the command output, but couldn't find anything. Some things also caught my attention and I have the feeling they are not expected:

  • the .failed log file seems to be missing.
  • The jepsen failure has an empty log file for the jepsen command.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @herkolategan and @smg260)


pkg/cmd/roachtest/cluster.go line 589 at r2 (raw file):

		err:    err,
		stdout: stdoutString,
		stderr: stderrString,

Nice, 🔥


pkg/cmd/roachtest/cluster.go line 2207 at r2 (raw file):

	logFileFull := l.File.Name()
	if err != nil {
		createFailedFile(logFileFull)

In RunE, we call createFailedFile(logFile). Any reason this needs to be createFailedFile(logFileFull) in this function?


pkg/cmd/roachtest/cluster.go line 2226 at r2 (raw file):

func createFailedFile(logFile string) {
	if logFile == "" {
		return

Why the check? Can this actually be empty in any of the callers?


pkg/roachprod/roachprod.go line 1760 at r2 (raw file):

		res := &install.RunResultDetails{Node: node}

		//TODO: this usage may not always be correct, if the target nodes are not sequential

The current implementation of TargetNodes already guarantees ordering. I'd either remove the TODO to avoid confusion (same for TODO below)


pkg/roachprod/install/cluster_synced.go line 876 at r2 (raw file):

		res = newRunResultDetails(node, cmdErr)
		res.CombinedOut = string(out)
		//output = fmtOut(res.CombinedOut)

Can be removed?


pkg/roachprod/install/cluster_synced.go line 1134 at r2 (raw file):

		runOpts.combinedOut = false
		return c.runCmdOnSingleNode(ctx, l, n, cmd, runOpts)
		//sess := c.newSession(l, 1, cmd, withDebugName("ssh-gen-key"))

Delete?


pkg/roachprod/install/cluster_synced.go line 1148 at r2 (raw file):

		runOpts.stdin = bytes.NewReader(sshTar)
		return c.runCmdOnSingleNode(ctx, l, node, `tar xf -`, runOpts)
		//sess := c.newSession(l, node, cmd, withDebugName("ssh-dist-key"))

Delete.


pkg/roachprod/install/cluster_synced.go line 2452 at r2 (raw file):

// on the cluster.
//
// ParallelE  runs at most `concurrency` (or  `config.MaxConcurrency` if it is lower) in parallel.

Nit: extra space.

@smg260 smg260 force-pushed the roachprod_refactor_parallel branch from 9a7d06b to 829dab0 Compare August 3, 2023 15:30
Copy link
Contributor Author

@smg260 smg260 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thankyou for the review:

the .failed log file seems to be missing.
Addressed in comments below

The jepsen failure has an empty log file for the jepsen command.
The cmd log files show stdout/err only, and the jepsen command redirects both to invoke.log, hence no output.

However, I've made an update to show some info in the command logs regardless:

  • the full command being run
  • the final status ( or error)

I'll run another roachtest at 0.05 select to showcase the changes, and will update this thread when it's ready.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @herkolategan and @renatolabs)


pkg/cmd/roachtest/cluster.go line 2207 at r2 (raw file):

Previously, renatolabs (Renato Costa) wrote…

In RunE, we call createFailedFile(logFile). Any reason this needs to be createFailedFile(logFileFull) in this function?

Good catch! Explains why there was no failed file for the jepsen test, but there was for knex. They should both be logFileFull


pkg/cmd/roachtest/cluster.go line 2226 at r2 (raw file):

Previously, renatolabs (Renato Costa) wrote…

Why the check? Can this actually be empty in any of the callers?

I saw it occur at some point, but is now defensive.


pkg/roachprod/roachprod.go line 1760 at r2 (raw file):

Previously, renatolabs (Renato Costa) wrote…

The current implementation of TargetNodes already guarantees ordering. I'd either remove the TODO to avoid confusion (same for TODO below)

You're right - there is a comment to that effect on c.TargetNodes() that I missed.


pkg/roachprod/install/cluster_synced.go line 876 at r2 (raw file):

Previously, renatolabs (Renato Costa) wrote…

Can be removed?

Done.


pkg/roachprod/install/cluster_synced.go line 1134 at r2 (raw file):

Previously, renatolabs (Renato Costa) wrote…

Delete?

Done.


pkg/roachprod/install/cluster_synced.go line 1148 at r2 (raw file):

Previously, renatolabs (Renato Costa) wrote…

Delete.

Done.

@smg260 smg260 force-pushed the roachprod_refactor_parallel branch from dbdc845 to 8a3205f Compare August 4, 2023 15:33
@smg260
Copy link
Contributor Author

smg260 commented Aug 4, 2023

Latest TC run

roachtest examples

jepsen run log with no command output:
image

failure_1.log with truncated output
image

excerpt of test.log showing first ~30chars of cmd being run, and where to find full details
image

command log for command being run on multiple nodes
image

roachprod cli examples

successful command on all nodes
image

successful command on 2/3 nodes
image

@smg260 smg260 requested a review from srosenberg August 4, 2023 15:57
Miral Gadani added 2 commits August 4, 2023 18:34
roachtest
- Remove `execCmd/Ex` pipes and buffers, and instead dispatch to `roachprod.Run` directly.
- `createFailedFile` fn checks for name length
- `RunE` wraps `err` with location of full output
- `c.RunWithDetails` creates a failed file if any node returns an error
 (previously just for a roachprod error)
- `test.log` shows where a command is logging to
- `c.runCmdOnSingleNode` includes command and output in the error
- fix `fileExistsOnFirstNode` (was returning true if file not exists instead of vice versa)
- update `c.{startNode,initializeCluster,setClusterSettings}` to return (*RunResultDetails, error)
- simplify `c.Start`

roachprod
- `RunResultDetails.Output()` convenience to print cmd output
- `c.runCmdOnSingleNode` includes truncated stdout/err in error object;
the full command output will still be available in the cmd run log.
- `processResults` formatting; use tabs, show `<ok>` or `<err>` prefix instead of blank output
@smg260 smg260 force-pushed the roachprod_refactor_parallel branch from 8a3205f to 3c64af2 Compare August 4, 2023 18:35
@smg260 smg260 requested a review from renatolabs August 10, 2023 10:58
Copy link
Collaborator

@renatolabs renatolabs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm: 🎉

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @herkolategan, @smg260, and @srosenberg)

Copy link
Collaborator

@herkolategan herkolategan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 5 of 8 files at r4, 1 of 3 files at r5, 2 of 2 files at r6, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @renatolabs, @smg260, and @srosenberg)


pkg/roachprod/roachprod.go line 1080 at r6 (raw file):

	httpClient := httputil.NewClientWithTimeout(timeout)
	startTime := timeutil.Now().Unix()
	err = c.Parallel(ctx, l, c.TargetNodes(), func(ctx context.Context, node install.Node) (*install.RunResultDetails, error) {

Nice!

@smg260
Copy link
Contributor Author

smg260 commented Aug 14, 2023

TFTR!

I will hold off merging until I am completely back from PTO so I can handle any unexpected consequences.

@smg260
Copy link
Contributor Author

smg260 commented Aug 17, 2023

bors r=renatolabs,herkolategan

@craig
Copy link
Contributor

craig bot commented Aug 17, 2023

Build succeeded:

@craig craig bot merged commit 8d5ff3f into cockroachdb:master Aug 17, 2023
7 checks passed
@smg260
Copy link
Contributor Author

smg260 commented Aug 21, 2023

blathers backport 23.1

@blathers-crl
Copy link

blathers-crl bot commented Aug 21, 2023

Encountered an error creating backports. Some common things that can go wrong:

  1. The backport branch might have already existed.
  2. There was a merge conflict.
  3. The backport branch contained merge commits.

You might need to create your backport manually using the backport tool.


error creating merge commit from 76860d6 to blathers/backport-release-23.1-104915: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch 23.1 failed. See errors above.


🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants