Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Various createdump fixes. Smaller MacOS dump size, better logging/stats, misc cleanup, etc. #71569

Merged
merged 13 commits into from
Jul 12, 2022

Conversation

mikem8361
Copy link
Member

Fix where CombineMemoryRegions is called and perf fix for PAGE_SIZE on MacOS M1.

Remove MEMORY_REGION_FLAG_MEMORY_BACKED flags because it was always set now.

Add better memory tracing and memory region stats.

Fix MacOS native module regions when overlapping with existing.

Fix MacOS adding the managed modules to the module mapping list before the "other mappings" is built.

Don't add share_mode == SM_EMPTY regions for MacOS coredumps. Fixes the large M1 dumps.

@ghost
Copy link

ghost commented Jul 1, 2022

Tagging subscribers to this area: @tommcdon
See info in area-owners.md if you want to be subscribed.

Issue Details

Fix where CombineMemoryRegions is called and perf fix for PAGE_SIZE on MacOS M1.

Remove MEMORY_REGION_FLAG_MEMORY_BACKED flags because it was always set now.

Add better memory tracing and memory region stats.

Fix MacOS native module regions when overlapping with existing.

Fix MacOS adding the managed modules to the module mapping list before the "other mappings" is built.

Don't add share_mode == SM_EMPTY regions for MacOS coredumps. Fixes the large M1 dumps.

Author: mikem8361
Assignees: mikem8361
Labels:

area-Diagnostics-coreclr

Milestone: -

@mikem8361
Copy link
Member Author

Large MacOS dump issue: #68604

@hoyosjs hoyosjs linked an issue Jul 2, 2022 that may be closed by this pull request
Copy link
Member

@hoyosjs hoyosjs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will continue next week

src/coreclr/debug/createdump/crashinfo.cpp Outdated Show resolved Hide resolved
src/coreclr/debug/createdump/crashinfo.cpp Outdated Show resolved Hide resolved
src/coreclr/debug/createdump/crashinfo.cpp Outdated Show resolved Hide resolved
src/coreclr/debug/createdump/crashinfo.cpp Show resolved Hide resolved
src/coreclr/debug/createdump/crashinfo.cpp Show resolved Hide resolved
@mikem8361
Copy link
Member Author

I'm hoping we can get this approved for Preview 7 before July 12th.

@mikem8361 mikem8361 requested a review from hoyosjs July 7, 2022 17:30
Copy link
Member

@janvorli janvorli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thank you!

@danmoseley
Copy link
Member

@janvorli's original comment in #68604

Just FYI, I am getting large dump sizes (over 3GB at least) even for a trivial C app with main containing only a dereference of NULL and nothing else.

What is the size of the dump from such an experiment now? I am wondering whether we now have dumps comparable in size to what we get on Linux.

Also I recall compression was hypothesized but if you've now eliminated all the blank pages presumably that's not helpful/not needed.

@mikem8361
Copy link
Member Author

MacOS dump sizes are at least 50% to 75% smaller. For the same smallest test app, the MacOS and Linux sizes are comparable.

@danmoseley
Copy link
Member

That's great to hear. I'm not sure what the state of dump gathering on helix for Mac is right now @MattGal is it switched on and working but size capped or something?

@tommcdon
Copy link
Member

That's great to hear. I'm not sure what the state of dump gathering on helix for Mac is right now @MattGal is it switched on and working but size capped or something?

@MattGal I believe helix is using os dumps (enabled via ulimit). We should consider switch it to use createdump so we can take advantage of the reduced size. This will require setting the correct environment variables to enable coreclr-generated crash dumps: https://docs.microsoft.com/en-us/dotnet/core/diagnostics/dumps#collect-dumps-on-crash.

@mikem8361 mikem8361 merged commit f767163 into dotnet:main Jul 12, 2022
@mikem8361 mikem8361 deleted the dumpfixes branch July 12, 2022 06:43
@MattGal
Copy link
Member

MattGal commented Jul 12, 2022

That's great to hear. I'm not sure what the state of dump gathering on helix for Mac is right now @MattGal is it switched on and working but size capped or something?

Correct. On-premises machines, due to the more limited network connection, are set to upload up to 2 GB of dumps and a max of 1.5 GB per dump. As always if you see issues where you think one should have been uploaded but wasn't, we can investigate.

@danmoseley
Copy link
Member

That's great to hear. I'm not sure what the state of dump gathering on helix for Mac is right now @MattGal is it switched on and working but size capped or something?

@MattGal I believe helix is using os dumps (enabled via ulimit). We should consider switch it to use createdump so we can take advantage of the reduced size. This will require setting the correct environment variables to enable coreclr-generated crash dumps: https://docs.microsoft.com/en-us/dotnet/core/diagnostics/dumps#collect-dumps-on-crash.

This is already tracked for dotnet/runtime by #65422. However is there any reason it wouldn't be strictly better to have the Helix infrastructure set these environment variables globally? In the same place it already sets ulimit -c unlimited globally. Wouldn't the right thing then just happen?

In other words I'm unclear why (aside from historical reasons) we also do it in

if [[ "$(uname -s)" == "Darwin" ]]; then
# On OS X, we will enable core dump generation only if there are no core
# files already in /cores/ at this point. This is being done to prevent
# inadvertently flooding the CI machines with dumps.
if [[ ! -d "/cores" || ! "$(ls -A /cores)" ]]; then
ulimit -c unlimited
fi
I couldn't find in our past discussions (eg) the reasoning.

If I'm right then I'll create an issue in dotnet/arcade and everything will be better in all repos. And that code in dotnet/runtime can just be deleted. @elinor-fung @MattGal @hoyosjs

@hoyosjs
Copy link
Member

hoyosjs commented Jul 12, 2022

I can think of a few:

  • You assume that the runtime is coreclr by doing that. NativeAOT doesn't use managed code in such manner and Mono runtime doesn't respect these. Also, WASM has very different runtime characteristics.
  • ulimit dumps help diagnose early process issues where the handler hasn't been installed. It's very rare, but it's a use case we have but our customers don't.
  • If you set both, you'll have 2 dumps per crash.

@hoyosjs
Copy link
Member

hoyosjs commented Jul 12, 2022

@MattGal All macOS system dumps are huge. Have we ever had a successful upload? If not, then we might as well disable the system support and use the environment variables to get the small chunk of actionable dumps. Or we could try to compress them (we realized a lot of pages are zeros, so they are easily compressible).

@danmoseley
Copy link
Member

I was assuming we'd keep ulimit - so the only one of those is the last one. I suppose you could de duplicate mostly, disregarding PID reuse, but it would be code.

What do you recommend? We change the dotnet/runtime code above to un-set ulimit and set createdump iff it's not WASM, NativeAOT and Mono?

@hoyosjs
Copy link
Member

hoyosjs commented Jul 12, 2022

What do you recommend? We change the dotnet/runtime code above to un-set ulimit and set createdump iff it's not WASM, NativeAOT and Mono?

Yeah, I am looking at doing this to at least get a section of actionable data.

@danmoseley
Copy link
Member

we realized a lot of pages are zeros, so they are easily compressible

That was'nt solved by this change to skip SM_EMPTY ?

@hoyosjs
Copy link
Member

hoyosjs commented Jul 12, 2022

I mean for system dumps. Createdump dumps are not reasonably smaller

@danmoseley
Copy link
Member

Is there a way that createdump could terminate in such a way that system dumps were not triggered? this seems arguably a customer scenario -- it's messy if when createdump is triggered JIT it always leads to two dumps on many machines?

@hoyosjs
Copy link
Member

hoyosjs commented Jul 12, 2022

Is there a way that createdump could terminate in such a way that system dumps were not triggered?

Haven't been able to find such a provision for macOS, and it's a fair amount of work for Linux I believe (I'd have to check how apport does it).

this seems arguably a customer scenario -- it's messy if when createdump is triggered JIT it always leads to two dumps on many machines?

It sort of is to some extent - the user configured both to be collected though. On top of that, system dumps are fairly hard to use for managed code as they don't have information to map native to managed threads. the createdump dumps can deal with native crashes after PAL initialization, so I'd say they give us a lot better UX for crash diagnostics and solves a lot of our size problems for uploads.

@hoyosjs
Copy link
Member

hoyosjs commented Jul 13, 2022

Hmm, thinking about this a bit more. I'll need to take a look at the full dumps createdump generates. There are cases where full dumps are necessary for runtime bugs, but I don't know if they are that much lighter than system dumps. @jkotas, do you know how frequently a full dump is needed over a heap dump in runtime bugs? Right now, we really get no dumps, so a heap dump is better than nothing, I want to see how much we are missing in that intermediate setting

@jkotas
Copy link
Member

jkotas commented Jul 13, 2022

Good percentage of crashes can be diagnosed from heap dumps. I agree that having some dumps is better than no dumps.

@danmoseley
Copy link
Member

If the issue is that enabling createdump would mean 2 dumps, and that's too much, and it's not easy to prevent it, and the alternative is nothing at all, can we set the script so that if a createdump dump is present (identified using DOTNET_DbgMiniDumpName pattern) it only uploads the createdump dump, otherwise it uploads the system dump?

@danmoseley
Copy link
Member

In fact, if that scheme is reasonable, it could be done at the Arcade level for everyone?

@hoyosjs
Copy link
Member

hoyosjs commented Jul 13, 2022

Might have some collisions on PID, though may be more rare than I think. Also, if a process enables one I'd disable the other - they tend to fill the disks fast and make it really hard for subsequent tests to do much. I'd say lets experiment with runtime where we can revise policy quickly and then move the change to arcade (I am not sure helix should always set the env vars during provisioning, but the helix SDK might be a good place to add the support).

@danmoseley
Copy link
Member

Sounds good - do you plan to take care of that @hoyosjs ?

@hoyosjs
Copy link
Member

hoyosjs commented Jul 16, 2022

Won't get to this right now, but I've self assigned this. I'll investigate a bit on dump size and determine what the best approach is for runtime. Then I'll sync with Matt who has more context on the Helix upload policy.

@danmoseley
Copy link
Member

Sounds great, thanks @hoyosjs .

@ghost ghost locked as resolved and limited conversation to collaborators Aug 15, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Investigate why createdump generated dumps are so large on MacOS
7 participants