-
Notifications
You must be signed in to change notification settings - Fork 4.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Various createdump fixes. Smaller MacOS dump size, better logging/stats, misc cleanup, etc. #71569
Conversation
Tagging subscribers to this area: @tommcdon Issue DetailsFix where CombineMemoryRegions is called and perf fix for PAGE_SIZE on MacOS M1. Remove MEMORY_REGION_FLAG_MEMORY_BACKED flags because it was always set now. Add better memory tracing and memory region stats. Fix MacOS native module regions when overlapping with existing. Fix MacOS adding the managed modules to the module mapping list before the "other mappings" is built. Don't add share_mode == SM_EMPTY regions for MacOS coredumps. Fixes the large M1 dumps.
|
Large MacOS dump issue: #68604 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will continue next week
Fix MacOS native module regions when overlapping with existing. Fix MacOS adding the managed modules to the module mapping list before the "other mappings" is built.
… takes to get the crash report/dump for VS4Mac
I'm hoping we can get this approved for Preview 7 before July 12th. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thank you!
@janvorli's original comment in #68604
What is the size of the dump from such an experiment now? I am wondering whether we now have dumps comparable in size to what we get on Linux. Also I recall compression was hypothesized but if you've now eliminated all the blank pages presumably that's not helpful/not needed. |
MacOS dump sizes are at least 50% to 75% smaller. For the same smallest test app, the MacOS and Linux sizes are comparable. |
That's great to hear. I'm not sure what the state of dump gathering on helix for Mac is right now @MattGal is it switched on and working but size capped or something? |
@MattGal I believe helix is using os dumps (enabled via ulimit). We should consider switch it to use createdump so we can take advantage of the reduced size. This will require setting the correct environment variables to enable coreclr-generated crash dumps: https://docs.microsoft.com/en-us/dotnet/core/diagnostics/dumps#collect-dumps-on-crash. |
Correct. On-premises machines, due to the more limited network connection, are set to upload up to 2 GB of dumps and a max of 1.5 GB per dump. As always if you see issues where you think one should have been uploaded but wasn't, we can investigate. |
This is already tracked for dotnet/runtime by #65422. However is there any reason it wouldn't be strictly better to have the Helix infrastructure set these environment variables globally? In the same place it already sets In other words I'm unclear why (aside from historical reasons) we also do it in runtime/eng/testing/RunnerTemplate.sh Lines 130 to 136 in 8cd701a
If I'm right then I'll create an issue in dotnet/arcade and everything will be better in all repos. And that code in dotnet/runtime can just be deleted. @elinor-fung @MattGal @hoyosjs |
I can think of a few:
|
@MattGal All macOS system dumps are huge. Have we ever had a successful upload? If not, then we might as well disable the system support and use the environment variables to get the small chunk of actionable dumps. Or we could try to compress them (we realized a lot of pages are zeros, so they are easily compressible). |
I was assuming we'd keep ulimit - so the only one of those is the last one. I suppose you could de duplicate mostly, disregarding PID reuse, but it would be code. What do you recommend? We change the dotnet/runtime code above to un-set ulimit and set createdump iff it's not WASM, NativeAOT and Mono? |
Yeah, I am looking at doing this to at least get a section of actionable data. |
That was'nt solved by this change to skip SM_EMPTY ? |
I mean for system dumps. Createdump dumps are not reasonably smaller |
Is there a way that createdump could terminate in such a way that system dumps were not triggered? this seems arguably a customer scenario -- it's messy if when createdump is triggered JIT it always leads to two dumps on many machines? |
Haven't been able to find such a provision for macOS, and it's a fair amount of work for Linux I believe (I'd have to check how apport does it).
It sort of is to some extent - the user configured both to be collected though. On top of that, system dumps are fairly hard to use for managed code as they don't have information to map native to managed threads. the createdump dumps can deal with native crashes after PAL initialization, so I'd say they give us a lot better UX for crash diagnostics and solves a lot of our size problems for uploads. |
Hmm, thinking about this a bit more. I'll need to take a look at the full dumps createdump generates. There are cases where full dumps are necessary for runtime bugs, but I don't know if they are that much lighter than system dumps. @jkotas, do you know how frequently a full dump is needed over a heap dump in runtime bugs? Right now, we really get no dumps, so a heap dump is better than nothing, I want to see how much we are missing in that intermediate setting |
Good percentage of crashes can be diagnosed from heap dumps. I agree that having some dumps is better than no dumps. |
If the issue is that enabling createdump would mean 2 dumps, and that's too much, and it's not easy to prevent it, and the alternative is nothing at all, can we set the script so that if a createdump dump is present (identified using DOTNET_DbgMiniDumpName pattern) it only uploads the createdump dump, otherwise it uploads the system dump? |
In fact, if that scheme is reasonable, it could be done at the Arcade level for everyone? |
Might have some collisions on PID, though may be more rare than I think. Also, if a process enables one I'd disable the other - they tend to fill the disks fast and make it really hard for subsequent tests to do much. I'd say lets experiment with runtime where we can revise policy quickly and then move the change to arcade (I am not sure helix should always set the env vars during provisioning, but the helix SDK might be a good place to add the support). |
Sounds good - do you plan to take care of that @hoyosjs ? |
Won't get to this right now, but I've self assigned this. I'll investigate a bit on dump size and determine what the best approach is for runtime. Then I'll sync with Matt who has more context on the Helix upload policy. |
Sounds great, thanks @hoyosjs . |
Fix where CombineMemoryRegions is called and perf fix for PAGE_SIZE on MacOS M1.
Remove MEMORY_REGION_FLAG_MEMORY_BACKED flags because it was always set now.
Add better memory tracing and memory region stats.
Fix MacOS native module regions when overlapping with existing.
Fix MacOS adding the managed modules to the module mapping list before the "other mappings" is built.
Don't add share_mode == SM_EMPTY regions for MacOS coredumps. Fixes the large M1 dumps.