Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cli: include collected goroutine dumps in debug zip #33318

Closed
tbg opened this issue Dec 21, 2018 · 12 comments

Comments

3 participants
@tbg
Copy link
Member

commented Dec 21, 2018

This should follow a pattern similar to #26797.

@tbg tbg added the C-enhancement label Dec 21, 2018

@tbg tbg added this to the 2.2 milestone Dec 21, 2018

@tbg tbg added this to Incoming in Core via automation Dec 21, 2018

@petermattis

This comment has been minimized.

Copy link
Contributor

commented Dec 21, 2018

I like this idea. Note for implementor: be mildly careful on how frequently this is done as generating a goroutine dump requires temporarily stopping all goroutines. We've previously seen this cause hiccups in performance if done on a regular cadence.

@tbg

This comment has been minimized.

Copy link
Member Author

commented Dec 21, 2018

I was thinking something like

  • poll runtime.NumGoroutine() at some regular enough interval (10s?) and keep the last minute worth of values
  • if the current value is at least double the max seen over the last minute, trigger a dump
  • delete all but the most recent 10 dumps
@petermattis

This comment has been minimized.

Copy link
Contributor

commented Dec 21, 2018

Sounds reasonable.

delete all but the most recent 10 dumps

The goroutine dumps are not large. We could probably keep more than 10.

@tbg

This comment has been minimized.

Copy link
Member Author

commented Feb 18, 2019

The goroutine dumps are not large. We could probably keep more than 10.

Now that we're dealing with goroutine leaks where these dumps can end up in the range of 10^6 goroutines, I'm not convinced this is true. Perhaps better to budget by file size, i.e. keep any number of dumps, as long as their aggregate size is below X (10mb?), then delete smaller files first.

@tbg

This comment has been minimized.

Copy link
Member Author

commented Feb 27, 2019

@stevenliang16 is working on this.

stevenliang16 added a commit to stevenliang16/cockroach that referenced this issue Mar 5, 2019

server: collect goroutine dump when abnormal change detected
We have encountered multiple issues in the field where Cockroach ran
out of memory. Now we only take heap profiles when these issues are
encountered. This change adds the ability to dump goroutines when
there is an abnormal change in number of goroutines `N`:

1. N goes over numGoroutinesThreshold
2. N increases to growthRateThreshold times N(last check),
   and N is larger than lowerLimitForNumGoroutines

Addresses cockroachdb#33318

Release note (general change): add automatic goroutine dump to help
debug OOM issues.

stevenliang16 added a commit to stevenliang16/cockroach that referenced this issue Mar 13, 2019

server: collect goroutine dump when abnormal change detected
We have encountered multiple issues in the field where Cockroach ran
out of memory. Now we only take heap profiles when these issues are
encountered. This change adds the ability to dump goroutines when
there is an abnormal change in number of goroutines `N`:

- N goes over numGoroutinesThreshold
AND
- N has doubled the max dumped number of goroutines.

Note that the user can reset the max dumped number of goroutines by
setting server.goroutine_dump.num_goroutines_threshold to a different
value.

Addresses cockroachdb#33318

Release note (general change): add automatic goroutine dump to help
debug OOM issues.

stevenliang16 added a commit to stevenliang16/cockroach that referenced this issue Mar 13, 2019

server: collect goroutine dump when abnormal change detected
We have encountered multiple issues in the field where Cockroach ran
out of memory. Now we only take heap profiles when these issues are
encountered. This change adds the ability to dump goroutines when
there is an abnormal change in number of goroutines `N`:

- N goes over numGoroutinesThreshold
AND
- N has doubled the max dumped number of goroutines.

Note that the user can reset the max dumped number of goroutines by
setting server.goroutine_dump.num_goroutines_threshold to a different
value.

Addresses cockroachdb#33318

Release note (general change): add automatic goroutine dump to help
debug OOM issues.

stevenliang16 added a commit to stevenliang16/cockroach that referenced this issue Mar 13, 2019

server: collect goroutine dump when abnormal change detected
We have encountered multiple issues in the field where Cockroach ran
out of memory. Now we only take heap profiles when these issues are
encountered. This change adds the ability to dump goroutines when
there is an abnormal change in number of goroutines `N`:

- N goes over numGoroutinesThreshold
AND
- N has doubled the max dumped number of goroutines.

Note that the user can reset the max dumped number of goroutines by
setting server.goroutine_dump.num_goroutines_threshold to a different
value.

Addresses cockroachdb#33318

Release note (general change): add automatic goroutine dump to help
debug OOM issues.

stevenliang16 added a commit to stevenliang16/cockroach that referenced this issue Mar 13, 2019

server: collect goroutine dump when abnormal change detected
We have encountered multiple issues in the field where Cockroach ran
out of memory. Now we only take heap profiles when these issues are
encountered. This change adds the ability to dump goroutines when
there is an abnormal change in number of goroutines `N`:

- N goes over numGoroutinesThreshold
AND
- N has doubled the max dumped number of goroutines.

Note that the user can reset the max dumped number of goroutines by
setting server.goroutine_dump.num_goroutines_threshold to a different
value.

Addresses cockroachdb#33318

Release note (general change): add automatic goroutine dump to help
debug OOM issues.

stevenliang16 added a commit to stevenliang16/cockroach that referenced this issue Mar 14, 2019

server: collect goroutine dump when abnormal change detected
We have encountered multiple issues in the field where Cockroach ran
out of memory. Now we only take heap profiles when these issues are
encountered. This change adds the ability to dump goroutines when
there is an abnormal change in number of goroutines `N`:

- N goes over numGoroutinesThreshold
AND
- N has doubled the max dumped number of goroutines.

Note that the user can reset the max dumped number of goroutines by
setting server.goroutine_dump.num_goroutines_threshold to a different
value.

Addresses cockroachdb#33318

Release note (general change): add automatic goroutine dump to help
debug OOM issues.

craig bot pushed a commit that referenced this issue Mar 18, 2019

Merge #35440
35440: server: collect goroutine dump when abnormal change detected r=tbg a=stevenliang16

We have encountered multiple issues in the field where Cockroach ran
out of memory. Now we only take heap profiles when these issues are
encountered. This change adds the ability to dump goroutines when
there is an abnormal change in number of goroutines `N`:

- N goes over numGoroutinesThreshold
AND
- N has doubled the max dumped number of goroutines.

Note that the user can reset the max dumped number of goroutines by 
setting server.goroutine_dump.num_goroutines_threshold to a different   
value.

Addresses #33318

Release note (general change): add automatic goroutine dump to help
debug OOM issues.


Co-authored-by: Dong Liang <dong.liang@rubrik.com>

craig bot pushed a commit that referenced this issue Mar 18, 2019

Merge #35440
35440: server: collect goroutine dump when abnormal change detected r=tbg a=stevenliang16

We have encountered multiple issues in the field where Cockroach ran
out of memory. Now we only take heap profiles when these issues are
encountered. This change adds the ability to dump goroutines when
there is an abnormal change in number of goroutines `N`:

- N goes over numGoroutinesThreshold
AND
- N has doubled the max dumped number of goroutines.

Note that the user can reset the max dumped number of goroutines by 
setting server.goroutine_dump.num_goroutines_threshold to a different   
value.

Addresses #33318

Release note (general change): add automatic goroutine dump to help
debug OOM issues.


Co-authored-by: Dong Liang <dong.liang@rubrik.com>

craig bot pushed a commit that referenced this issue Mar 18, 2019

Merge #35440
35440: server: collect goroutine dump when abnormal change detected r=tbg a=stevenliang16

We have encountered multiple issues in the field where Cockroach ran
out of memory. Now we only take heap profiles when these issues are
encountered. This change adds the ability to dump goroutines when
there is an abnormal change in number of goroutines `N`:

- N goes over numGoroutinesThreshold
AND
- N has doubled the max dumped number of goroutines.

Note that the user can reset the max dumped number of goroutines by 
setting server.goroutine_dump.num_goroutines_threshold to a different   
value.

Addresses #33318

Release note (general change): add automatic goroutine dump to help
debug OOM issues.


Co-authored-by: Dong Liang <dong.liang@rubrik.com>

craig bot pushed a commit that referenced this issue Mar 18, 2019

Merge #35440
35440: server: collect goroutine dump when abnormal change detected r=tbg a=stevenliang16

We have encountered multiple issues in the field where Cockroach ran
out of memory. Now we only take heap profiles when these issues are
encountered. This change adds the ability to dump goroutines when
there is an abnormal change in number of goroutines `N`:

- N goes over numGoroutinesThreshold
AND
- N has doubled the max dumped number of goroutines.

Note that the user can reset the max dumped number of goroutines by 
setting server.goroutine_dump.num_goroutines_threshold to a different   
value.

Addresses #33318

Release note (general change): add automatic goroutine dump to help
debug OOM issues.


Co-authored-by: Dong Liang <dong.liang@rubrik.com>

@tbg tbg changed the title server: collect goroutine dump when number of goroutines is high cli: include collected goroutine dumps in debug zip Mar 19, 2019

@tbg

This comment has been minimized.

Copy link
Member Author

commented Mar 19, 2019

Re-titling so that this is now about also including the dumps in debug zip. This is completely analogous to heap profiles, see

// GetFiles returns a list of files of type defined in the request.
func (s *statusServer) GetFiles(
ctx context.Context, req *serverpb.GetFilesRequest,
) (*serverpb.GetFilesResponse, error) {
if !debug.GatewayRemoteAllowed(ctx, s.st) {
return nil, remoteDebuggingErr
}
ctx = propagateGatewayMetadata(ctx)
ctx = s.AnnotateCtx(ctx)
nodeID, local, err := s.parseNodeID(req.NodeId)
if err != nil {
return nil, grpcstatus.Errorf(codes.InvalidArgument, err.Error())
}
if !local {
status, err := s.dialNode(ctx, nodeID)
if err != nil {
return nil, err
}
return status.GetFiles(ctx, req)
}
var dir string
switch req.Type {
//TODO(ridwanmsharif): Serve logfiles so debug-zip can fetch them
// intead of reading indididual entries.
case serverpb.FileType_HEAP: // Requesting for saved Heap Profiles.
dir = filepath.Join(s.admin.server.cfg.HeapProfileDirName, heapDir)
default:
return nil, grpcstatus.Errorf(codes.InvalidArgument, "unknown file type: %s", req.Type)
}
var resp serverpb.GetFilesResponse
for _, pattern := range req.Patterns {
if err := checkFilePattern(pattern); err != nil {
return nil, grpcstatus.Errorf(codes.InvalidArgument, err.Error())
}
filepaths, err := filepath.Glob(filepath.Join(dir, pattern))
if err != nil {
return nil, grpcstatus.Errorf(codes.InvalidArgument, "bad pattern: %s", pattern)
}
for _, path := range filepaths {
fileinfo, _ := os.Stat(path)
var contents []byte
if !req.ListOnly {
contents, err = ioutil.ReadFile(path)
if err != nil {
return nil, grpcstatus.Errorf(codes.Internal, err.Error())
}
}
resp.Files = append(resp.Files,
&serverpb.File{Name: fileinfo.Name(), FileSize: fileinfo.Size(), Contents: contents})
}
}
return &resp, err
}

and

cockroach/pkg/cli/zip.go

Lines 303 to 323 in 63448fb

var profiles *serverpb.GetFilesResponse
if err := contextutil.RunWithTimeout(baseCtx, "request files", timeout,
func(ctx context.Context) error {
profiles, err = status.GetFiles(ctx, &serverpb.GetFilesRequest{
NodeId: id,
Type: serverpb.FileType_HEAP,
Patterns: []string{"*"},
})
return err
}); err != nil {
if err := z.createError(prefix+"/heapfiles", err); err != nil {
return err
}
} else {
for _, file := range profiles.Files {
name := prefix + "/heapprof/" + file.Name
if err := z.createRaw(name, file.Contents); err != nil {
return err
}
}
}

@giorgosp

This comment has been minimized.

Copy link
Collaborator

commented Apr 1, 2019

Hi, I have started looking into this. Seems like I just have to use the GoroutineDumper in a new switch case in server/status.go#GetFiles and add a request for the goroutines dump in cli/zip.go#runDebugZip.

@tbg

This comment has been minimized.

Copy link
Member Author

commented Apr 1, 2019

Yep, that's right! You'll need to add the new status type in serverpb/status.proto. Thanks for working on this.

@giorgosp

This comment has been minimized.

Copy link
Collaborator

commented Apr 8, 2019

Hi, I am looking how to test the goroutine dump fetching. I see that the pkg/server/status_test.go#TestStatusLocalFileRetrieval only tests that no error is returned and is expecting 0 heap profile files to be returned. Also, pkg/cmd/roachtest/cluster.go#FetchDebugZip doesn't check the contents of the downloaded zip file. Is there some test that I am missing that checks the contents of the downloaded debug zip? Thanks!

@tbg

This comment has been minimized.

Copy link
Member Author

commented Apr 8, 2019

Hi @giorgosp, I think you're right that the testing here is a little poor. Could you do adapt TestStatusLocalFileRetrieval as follows:

  1. don't call startServer, instead just lift the serverutils.StartServer call from that method into the test directly, because
  2. you want to use StoreSpecs that are not in-memory, so basically you want to give each node a StoreSpec that has Path set
  3. you get the path from testutils.TempDir(),
  4. now you can test that the endpoints actually read files by putting marker files (i.e. just files containing some string "hey i'm a heap profile" into the right place, and checking that the endpoints return them.

Let me know if you have any questions!

@giorgosp

This comment has been minimized.

Copy link
Collaborator

commented Apr 12, 2019

Hi @tbg, thank you for your quick reply!

I have opened a pull request against my fork, giorgosp#1, looking forward to your comments.

Also, I notice that the debug zip command explicitly asks for a heap profile run and also downloads any other heap profile files created by the heap profiler. Do we want a similar behaviour for the goroutines dump?

@tbg

This comment has been minimized.

Copy link
Member Author

commented Apr 12, 2019

looking forward to your comments.

Just open the PR against the main forkrepo, please!

Also, I notice that the debug zip command explicitly asks for a heap profile run

We already have code in debug zip that fetches the current stacks, so that's already present. Just fetching the previously saved goroutine dumper files will need to be added.

Thanks!

craig bot pushed a commit that referenced this issue Apr 16, 2019

Merge #36813
36813: cli: Include collected goroutine dumps in debug zip r=tbg a=giorgosp

This PR makes the debug zip command include the goroutine dump files in the final zip, if the
GoroutineDumper has run until the command is triggered.

Also enhances the testing of the GetFiles endpoint by testing the names and the contents of the downloaded files.

Fixes #33318 

cc @tbg 

Co-authored-by: George Papadrosou <gpapadrosou@gmail.com>

@craig craig bot closed this in #36813 Apr 16, 2019

Core automation moved this from Incoming to Closed Apr 16, 2019

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.