Use createdump to collect crash dumps where possible in runtime #65422

Open
elinor-fung opened this issue Feb 16, 2022 · 12 comments

Comments

@elinor-fung
Member

elinor-fung commented Feb 16, 2022

System dumps on macOS are large - uploading them has been taking down helix queues. Using the runtime's coredump features should allow for configuration such that we can get smaller and still useful dumps.

Since createdump is part of the test/payload, I think we should just be able to update the libraries runner template to set the dump configuration environment variables for DbgEnableMiniDump, DbgMiniDumpName, and DbgMiniDumpType (cc @mikem8361 @hoyosjs) instead of using ulimit:

if [[ "$(uname -s)" == "Darwin" ]]; then
  # On OS X, we will enable core dump generation only if there are no core
  # files already in /cores/ at this point. This is being done to prevent
  # inadvertently flooding the CI machines with dumps.
  if [[ ! -d "/cores" || ! "$(ls -A /cores)" ]]; then
    ulimit -c unlimited
  fi
fi

See also:
#65405 (comment)
https://github.com/dotnet/core-eng/issues/15333

cc @danmoseley

@dotnet-issue-labeler dotnet-issue-labeler bot added the untriaged New issue has not been triaged by the area owner label Feb 16, 2022
@dotnet-issue-labeler

I couldn't figure out the best area label to add to this issue. If you have write-permissions please help me learn by adding exactly one area label.

@ghost

ghost commented Feb 16, 2022

Tagging subscribers to this area: @dotnet/area-infrastructure-libraries
See info in area-owners.md if you want to be subscribed.

Author: elinor-fung
Assignees: -
Labels:

area-Infrastructure-libraries, untriaged

Milestone: -

@mikem8361
Member

I recommend setting the following env vars:

export COMPlus_DbgEnableMiniDump=1
export COMPlus_DbgMiniDumpName=/path/to/coredump.%p
export COMPlus_DbgMiniDumpType=2

This enables a heap dump which has everything needed to diagnose most managed and native problems. The path can also contain these special name formatting chars:

%p  PID of dumped process.
%e  The process executable filename.
%h  Hostname returned by gethostname().
%t  Time of dump, expressed as seconds since the Epoch, 1970-01-01 00:00:00 +0000 (UTC).
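
To preview what a given name template would produce, the substitutions listed above can be imitated in a few lines of bash. This is only a sketch: `expand_dump_name` is a hypothetical helper, not part of createdump, and the runtime performs the real expansion itself when it writes the dump.

```shell
#!/usr/bin/env bash
# Hypothetical helper: imitates the %p/%e/%h/%t substitutions for preview only.
# Arguments: template, pid, executable name, [hostname], [epoch seconds].
expand_dump_name() {
  local template="$1" pid="$2" exe="$3"
  local host="${4:-$(hostname)}"
  local epoch="${5:-$(date +%s)}"
  local out="" i ch next
  for (( i = 0; i < ${#template}; i++ )); do
    ch="${template:i:1}"
    if [[ "$ch" == "%" && $(( i + 1 )) -lt ${#template} ]]; then
      next="${template:i+1:1}"
      case "$next" in
        p) out+="$pid";   i=$(( i + 1 )); continue ;;
        e) out+="$exe";   i=$(( i + 1 )); continue ;;
        h) out+="$host";  i=$(( i + 1 )); continue ;;
        t) out+="$epoch"; i=$(( i + 1 )); continue ;;
      esac
    fi
    # Anything else, including an unknown %x sequence, is copied through as-is.
    out+="$ch"
  done
  printf '%s\n' "$out"
}
```

For example, `expand_dump_name '/path/to/coredump.%p' 1234 dotnet` prints `/path/to/coredump.1234`.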

@safern safern removed the untriaged New issue has not been triaged by the area owner label Feb 18, 2022
@safern safern added this to the 7.0.0 milestone Feb 18, 2022
@ghost ghost moved this from Untriaged to 7.0.0 in Infrastructure Backlog Feb 18, 2022
@hoyosjs
Member

hoyosjs commented Feb 18, 2022

Type 2 was still somewhat big when we looked at it. It's the best fidelity, but it definitely can get big and we need to improve the doc on debugging coredumps that we include.

@mikem8361
Member

mikem8361 commented Feb 18, 2022 via email

@jkotas
Member

jkotas commented Feb 19, 2022

We often need full dumps to investigate crashes that only happen intermittently in the CI.

Should the problem rather be solved by throttling the dump uploads? If a PR generates many dumps, or if many PRs generate the same dump, skip uploading them.

The system should be designed to handle and gracefully recover from situations where we suddenly end up with a large volume of crash dumps. There are many ways we can end up in a situation like that. It may even be worth creating a weekly chaos monkey job that tries to flood the system with many big dumps to validate that it does not kill the system.

@hoyosjs
Member

hoyosjs commented Feb 19, 2022

> We often need full dumps to investigate crashes that only happen intermittently in the CI.
>
> Should the problem rather be solved by throttling the dump uploads? If a PR generates many dumps, or if many PRs generate the same dump, skip uploading them.

That change is upcoming. They are capping it in two ways: upload time is capped, and total dump size is capped at 6 GB. They have telemetry showing that 6 GB is what can safely be uploaded in Helix while still being able to do the work and report results without timing out. There are still two issues. The first is that the disk can still fill up if many tests in a work item crash. The more concerning one is that 6 GB is big, but not crazy for a macOS system dump. CreateDump is a little better here, but its dumps are often still too big; mini with private memory seems like what we want from a diagnosability perspective. I just don't know yet if that is going to cap us. I guess the best thing to do here is run a few experiments.
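
As a rough illustration of a total-size cap, the sketch below picks dump files smallest-first until a byte budget is used up. `select_dumps_within_budget` is a hypothetical helper and not how Helix implements its cap; the smallest-first policy and the directory layout are assumptions made for the example, and file names containing tabs or newlines are not handled.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the files in a directory that fit within a total
# byte budget, taking the smallest files first.
# Arguments: directory, budget in bytes.
select_dumps_within_budget() {
  local dir="$1" budget="$2"
  local total=0 size f
  for f in "$dir"/*; do
    [[ -f "$f" ]] || continue
    # wc -c gives a byte count on both macOS and Linux; the arithmetic
    # expansion below strips the leading spaces that BSD wc emits.
    size="$(wc -c < "$f")"
    printf '%s\t%s\n' "$(( size ))" "$f"
  done | sort -n | while IFS=$'\t' read -r size f; do
    if (( total + size <= budget )); then
      total=$(( total + size ))
      printf '%s\n' "$f"
    fi
  done
}
```

With the 6 GB figure mentioned above, a call could look like `select_dumps_within_budget "$dump_dir" $(( 6 * 1024 * 1024 * 1024 ))`, where `dump_dir` is whatever folder the dumps were written to.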

@ghost

ghost commented Feb 23, 2022

Tagging subscribers to this area: @hoyosjs
See info in area-owners.md if you want to be subscribed.

Author: elinor-fung
Assignees: -
Labels:

area-Infrastructure-coreclr

Milestone: 7.0.0

@hoyosjs hoyosjs changed the title Use createdump to collect crash dumps for libraries tests Use createdump to collect crash dumps where possible in runtime Jul 16, 2022
@hoyosjs hoyosjs self-assigned this Jul 16, 2022
@ghost

ghost commented Jul 16, 2022

Tagging subscribers to this area: @dotnet/runtime-infrastructure
See info in area-owners.md if you want to be subscribed.

Author: elinor-fung
Assignees: hoyosjs
Labels:

area-Infrastructure

Milestone: 7.0.0

@ericstj
Member

ericstj commented Oct 9, 2023

@hoyosjs would it make sense to try to do this with the libraries crash symbolization effort? What's involved?

@hoyosjs
Member

hoyosjs commented Oct 10, 2023

For crashes of libraries? It would mean setting these variables (#65422 (comment)) in the wrapper, if they are not already present, so that the dumps are stored in the folder that Helix uploads. You can then symbolize all the different dumps. For macOS, Jeremy is already staging work for it: https://github.com/dotnet/runtime/pull/92967/files
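
A minimal sketch of what "set these variables only if not present" could look like in a wrapper script. `configure_createdump` and its `dump_dir` argument are hypothetical; the actual upload folder and the real wrapper change are not shown in this thread. The variable names and the type value 2 are the ones recommended earlier in the issue.

```shell
#!/usr/bin/env bash
# Hypothetical helper: default the createdump settings without overriding
# anything the caller has already exported.
# Argument: directory the dumps should be written to (assumed to be the
# folder that gets uploaded).
configure_createdump() {
  local dump_dir="$1"
  mkdir -p "$dump_dir"
  # ${VAR:=default} assigns only when VAR is unset or empty.
  : "${COMPlus_DbgEnableMiniDump:=1}"
  : "${COMPlus_DbgMiniDumpType:=2}"   # heap dump, as recommended in this thread
  : "${COMPlus_DbgMiniDumpName:=$dump_dir/coredump.%p}"
  export COMPlus_DbgEnableMiniDump COMPlus_DbgMiniDumpType COMPlus_DbgMiniDumpName
}
```

Because the assignments are conditional, a job that already exports `COMPlus_DbgMiniDumpType=4` keeps its value, while jobs that set nothing get the defaults.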

@ericstj
Member

ericstj commented Oct 25, 2023

I meant: are you planning on re-enabling dumps in the places where they were disabled? Perhaps by adding these settings. In cases where it might still be too expensive to pull the dumps off the machine, maybe @ivdiazsa's tool could be used to just dump the relevant info to the log.


8 participants