Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

roachprod: c.ParallelE goroutine leak #104316

Closed
srosenberg opened this issue Jun 5, 2023 · 2 comments
Closed

roachprod: c.ParallelE goroutine leak #104316

srosenberg opened this issue Jun 5, 2023 · 2 comments
Assignees
Labels
A-testing Testing tools and infrastructure C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. T-testeng TestEng Team

Comments

@srosenberg
Copy link
Member

srosenberg commented Jun 5, 2023

I happened to debug an unrelated issue which required SIGQUIT. Upon examining the goroutine dump, I noticed a large number of goroutines inside ParallelE which appear to be leaked. E.g., the goroutine below is waiting indefinitely to send err from runWithMaybeRetry on the unbuffered channel [1],

1: chan send [1615 minutes] [Created by install.(*SyncedCluster).ParallelE.func1 @ cluster_synced.go:2595]
    runtime      proc.go:363              gopark(func(0x1f4), Pointer(0x0), waitReason(0x0))
    runtime      chan.go:259              chansend(*hchan(#8397), Pointer(#748), true)
    runtime      chan.go:145              chansend1(*hchan(#96), Pointer(#5697))
    install      cluster_synced.go:2601   (*SyncedCluster).ParallelE.func1.1(0)
    install      cluster_synced.go:2605   (*SyncedCluster).ParallelE.func1.2()
    runtime      asm_amd64.s:1594         goexit()

The above goroutine will never return if the following case [2] is taken first. (Note that it's non-deterministic owing to the semantics of select.)

[1]


[2]

Jira issue: CRDB-28457

@srosenberg srosenberg added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. A-testing Testing tools and infrastructure T-testeng TestEng Team labels Jun 5, 2023
@blathers-crl
Copy link

blathers-crl bot commented Jun 5, 2023

cc @cockroachdb/test-eng

srosenberg added a commit to srosenberg/cockroach that referenced this issue Jun 14, 2023
A number of leaking goroutines are suspected, both
user-defined and inside roachprod. To investigate
further, we dump all stacks on exit, including
SIGINT. These are appended to stderr, so they
should end up in the (CI) build log.

Informs: cockroachdb#104314
Informs: cockroachdb#104316

Epic: none

Release note: None
craig bot pushed a commit that referenced this issue Jun 16, 2023
104685: privilege: don't include NOSQLLOGIN in ALL check r=rafiss a=rafiss

fixes #101292

Release note (bug fix): GRANT SYSTEM ALL ... no longer causes the grantee to be unable to login. This was due to a UX oversight/bug where ALL would include the NOSQLLOGIN system privilege. Since NOSQLLOGIN is the only "negative" privilege, it is now excluded from the ALL shorthand, and must be granted explicitly in order to restrict logins.

104901: roachtest: dump all stacks upon (test_runner) exit r=renatolabs a=srosenberg

A number of leaking goroutines are suspected, both user-defined and inside roachprod. To investigate
further, we dump all stacks on exit, including
SIGINT. These are appended to stderr, so they
should end up in the (CI) build log.

Informs: #104314
Informs: #104316

Epic: none

Release note: None

105033: ci: update bazel builder image r=rickystewart a=cockroach-teamcity

Release note: None
Epic: None


105060: lint: re-enable `composite` linter on test files r=rail a=rickystewart

Epic: none
Release note: None

Co-authored-by: Rafi Shamim <rafi@cockroachlabs.com>
Co-authored-by: Stan Rosenberg <stan.rosenberg@gmail.com>
Co-authored-by: cockroach-teamcity <teamcity@cockroachlabs.com>
Co-authored-by: Ricky Stewart <ricky@cockroachlabs.com>
blathers-crl bot pushed a commit that referenced this issue Jun 16, 2023
A number of leaking goroutines are suspected, both
user-defined and inside roachprod. To investigate
further, we dump all stacks on exit, including
SIGINT. These are appended to stderr, so they
should end up in the (CI) build log.

Informs: #104314
Informs: #104316

Epic: none

Release note: None
craig bot pushed a commit that referenced this issue Aug 17, 2023
104915: roachprod: roachtest: refactoring r=renatolabs,herkolategan a=smg260

This commit addresses some usability issues within roachprod/roachtest, and does some minor clean up.

Summary:

`roachprod`

- `c.Parallel` now accepts  `nodes Nodes` and `func(ctx context.Context, node Node)`. Previously, these was an `int n` representing the first `n` nodes to operate on, and `func(ctx context.Context, int n)`, where second arg would confusingly represented the index of the node being operated on, which was wholly dependent on a captured-by-reference slice at the call site.  This change makes it explicit to the caller and the function, which exact node is targeted. 

- `c.Parallel` now returns `[]*RunResultDetails, boolean, error`. The slice will contain results, successful and failed, for each node (or thus far, depending on fail-fast behaviour). The boolean is a signal to the caller that at least one of the results in the slice suffered from a command error, which can be inspected in the result itself. The `error` now more idiomatically represents an unrecoverable error encountered during `Parallel`; i.e. an un retryable error in `roachprod`. In the latter case, results will be `nil`.

- All calls to execute a remote command on a node are all funnelled to `c.runCmdOnSingleNode`, ensuring consistency, and reducing duplicated code. This function accepts a `RunCmdOptions`  struct to configure it's behaviour, including `stdin/out`

- `runCmdOnSingleNode` : in error cases, include stdout/err output (truncated after n bytes - effectively `tail`). This output will show in the `failure_n.log` and give convenient insight. Full output is always retained in the run log.

- `RunResultDetails.{Stdout,Stderr,CombinedOut}` are now all `string` to represent the most common use case, and a convenience `r.Output()` added to print out the contents of `Stderr,Stdout,CombinedOut`

- Remove complex legacy pipe processing logic from `roachtest` (`execCmd,execCmdEx`), previously used when `roachprod` was not a library of `roachtest`

- When showing the result of a command on each node on the CLI, explicitly display `<ok>` or `<err>` followed by output/error


`roachtest`

- skip post-test assertions on a test failure

- removal of `execCmd/execCmdRes` which were doing some complex pipe processing before calling roachprod.Run (not required since roachprod is a lib of roachtest)

Minor

- Clean up various redundant/unused variables
- Typos
- Command formatting

Informs: #104312 
Informs: #104316 (TBD)

Epic: none
Release note: none

107354: asim: add randomized range generation  r=kvoli a=wenyihu6

This patch enables random range configuration to be generated.

TestRandomized can now take another setting parameter rangeGen (default: uniform
rangeGenType, uniform keySpaceGenType, empty weightedRand).

These generators are part of the framework fields which persist across
iterations. The numbers produced by the generator shape the distribution across
iterations.

- rangeKeyGenType: determines range generator type across iterations (default:
uniformGenerator, min = 1, max = 1000)

- keySpaceGenType: determines key space generator type across iterations
(default: uniformGenerator, min = 1000, max = 200000)

- weightedRand: if non-empty, enables weighted randomization for range
distribution

This provides three modes for range generation:
1. Default: currently set to uniform distribution
2. Random: randomly generates range distribution across stores
3. Weighted Randomization: enables weighted randomization for range distribution
if and only if given weightedRand is non-empty

Part of: #106311

Release note: None

108906: changefeedccl: make tests work offline r=miretskiy a=jayshrivastava

This change updates two unit tests to pass when
running offline.

Fixes: #108867
Release note: None
Epic: None

Co-authored-by: Miral Gadani <miral@cockroachlabs.com>
Co-authored-by: wenyihu6 <wenyi.hu@cockroachlabs.com>
Co-authored-by: Jayant Shrivastava <jayants@cockroachlabs.com>
@smg260 smg260 self-assigned this Sep 8, 2023
@smg260
Copy link
Contributor

smg260 commented Sep 8, 2023

Leaked goroutines are no longer showing. Marking as fixed by #104915

@smg260 smg260 closed this as completed Sep 8, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-testing Testing tools and infrastructure C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. T-testeng TestEng Team
Projects
Status: Done
Development

No branches or pull requests

2 participants