Application created from Worker Service template violates CPU usage limit and consumes all available swap space #48780
This seems very strange since the default worker service doesn't do anything besides print a time-stamped log message once per second. There's a tiny allocation for each one, but it seems unlikely that the allocations would be allowed to build up for so long that it would take 90 seconds of CPU time to GC them.
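For anyone who hasn't looked at it recently, the template's Worker is essentially just this (quoted from memory of the .NET 7 template, so treat it as a sketch rather than the exact generated code):

using System;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Extensions.Hosting;
using Microsoft.Extensions.Logging;

public class Worker : BackgroundService
{
    private readonly ILogger<Worker> _logger;

    public Worker(ILogger<Worker> logger)
    {
        _logger = logger;
    }

    protected override async Task ExecuteAsync(CancellationToken stoppingToken)
    {
        while (!stoppingToken.IsCancellationRequested)
        {
            // The "tiny allocation" mentioned above: one log message per second.
            _logger.LogInformation("Worker running at: {time}", DateTimeOffset.Now);
            await Task.Delay(1000, stoppingToken);
        }
    }
}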
I tried this over the weekend with dotnet 7.0.302 on macOS 12.6.4 (Intel). After an initial working set increase (over the first minute or so), things stabilized and stayed flat for the remainder of the run. Same on Windows. Disclaimer: I didn't use a daemon in either repro attempt and it's possible that that matters. From the internal ticket, the original OS was probably 13.2, which might matter. I should be able to test with launchd, but getting a Mac with an arbitrary OS is trickier.
Since it took me a few tries to cobble it together, here's my launchd plist:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
    <key>Label</key>
    <string>gh48548</string>
    <key>Program</key>
    <string>/Users/asplab/repros/gh48548/bin/Debug/net7.0/gh48548</string>
    <key>RunAtLoad</key>
    <false/>
    <key>EnvironmentVariables</key>
    <dict>
        <key>DOTNET_ROOT</key>
        <string>/Users/asplab/dotnet/</string>
        <key>DOTNET_DbgEnableMiniDump</key>
        <string>1</string>
        <key>DOTNET_EnableCrashReport</key>
        <string>1</string>
    </dict>
    <key>StandardOutPath</key>
    <string>/tmp/gh48548.stdout</string>
    <key>StandardErrorPath</key>
    <string>/tmp/gh48548.stderr</string>
</dict>
</plist>
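Installing and starting it as a global daemon looks roughly like this (assuming the label and paths in the plist above; launchctl syntax varies a bit across macOS versions):

# Global daemons live in /Library/LaunchDaemons and must be root-owned.
sudo cp gh48548.plist /Library/LaunchDaemons/gh48548.plist
sudo chown root:wheel /Library/LaunchDaemons/gh48548.plist
sudo launchctl load /Library/LaunchDaemons/gh48548.plist
# RunAtLoad is false above, so kick it off explicitly.
sudo launchctl start gh48548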
Hey, thanks for looking at this. I want to say it is necessary to be running as a global daemon in order to observe the problem. Also, it is unpredictable how long it has to run before the problem occurs: sometimes it is within a day; other times it can run for up to a week. I have not tried running with your DOTNET_ flags; perhaps I will do that, but my impression is that when this occurs we get no warning or opportunity to take crash dumps.
@parnasm Thanks for the update! I tried again with a global daemon, but it was still using a reasonable amount of memory and CPU a day later. You said you've seen this on both ARM and x64? Also, I notice that the original error message mentioned a cpu_resource.diag report.
Worker-04-20-2023_2023-06-02-111037_Marcels-MacBook-Pro-2.cpu_resource.diag.txt
@amcasey I no longer have the one from that run in April, but I have the most recent one, which is similar. The thing that I believe is constant in these reports is the presence of an FSEvents callback. And yes, for the most part, the sample will run fine. Then something in the OS seems to trigger. And arm64 or x64 makes no appreciable difference. Notice in the attached case, no action was taken. It appears sometimes the app is able to weather the storm!
Yeah, sorry about that delay - your report got misrouted. 😞 Do I understand correctly that the attached file does not show a failure?
It shows a sustained period of above-average CPU usage, but it was not enough to trigger killing the app. In cases where the app has been killed, it is usually (always?) due to low swap space.
It's just so unclear what it could be doing with CPU cycles, given that all it does is log once per second...
My original post on MS Developer Community included more detail than is shown here, including the .plist and the project folder, which I no longer see attached to that post or this one. Completely agree: it makes no sense that this app itself is doing anything with CPU cycles. My intuition is that it is something deep in the dotnet runtime that is watching for file system changes (for hot-swaps, perhaps?) and somehow runs into an infinite loop or self-generated feedback loop.
It looks like maybe it was trying to start a thread? So (even more tenuously) maybe the thread pool is getting exhausted? Somehow?
That's frustrating. I did find the zip file of your project and some console output, but not the plist, which would have been very helpful. I think attachments expire for privacy (or other) reasons.
That's going to be extremely difficult to debug without a trace or something. I'm sure we could rig up a way to collect such a trace, but it does tend to change the timing of the running program, which seems likely to be important here. Did you say you'd tried this with a global agent (i.e. running as you, rather than root) and it hadn't repro'd?
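If we do go down that road, the collection side is simple enough; something along these lines (dotnet-trace options from memory, so double-check against the docs):

# Attach to the running daemon and record events until Enter/Ctrl+C is pressed;
# this writes a .nettrace file that can be opened in PerfView or converted with
# "dotnet-trace convert --format Speedscope". sudo is needed because the daemon runs as root.
sudo dotnet-trace collect -p <pid>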
I don't see where it is trying to start a thread; the main thread has already started and is processing a file system event, in the course of which it is doing all kinds of system-level calls into libsystem_kernel.dylib, etc. In the past I have been able to attach a debugger to the daemon process just before it was killed, and was able to see a very deep call stack, again involving the file system event callback. I will try to either dig that up or catch it in the act again. Yes, I have tried running as the logged-on user, and while I will claim to have seen the same thing happen, it was a long time ago at this point, and many revs of Visual Studio and macOS ago. Running as root is the surest way to see this happen.
My mistake - I was reading the stack upside down. 😆
This is on .NET 7, correct? Can you try experimentally setting the environment variable
I can try that, but as I mentioned in my original post, we've been experiencing this problem for over a year, going back to when we were still on .NET Core 3.1. We subsequently moved to .NET 6.0 and then to 7.0, with no improvement.
Ah, don't bother then. Out of curiosity, given that it repros with such a simple toy app, have you tried an even simpler app - e.g. a console app that just logs in a loop? That would eliminate a lot of moving parts.
I set up a console app yesterday and left it running side by side with our production app. The console app is fine so far, but I just managed to catch the production app going CPU nuts, and I took a sample and a screenshot of the Activity Monitor details. Note the 117% CPU usage; the 2.5GB real memory size is way out of proportion to normal usage, which is usually <1% and ~100MB when idle. The call graphs in the sample are insane.
I also managed to attach the Visual Studio debugger and break into the Phd process. Attached is a screenshot showing our main thread doing nothing, and none of our other threads doing anything, but note the .NET File Watcher thread is calling the system LStat() function, and also note the Console log processing thread is in TryDequeue(). I have observed this before, and again, just intuitively, it seems to me there may be an interaction between the File Watcher and the Logger that is causing this raucous behavior.
Hey @amcasey, don't want to pester you, just want to make sure you get these updates. Have a good weekend!
@parnasm I did - thanks! I'm slammed today, so I'll have to dig in next week. Enjoy your weekend!
I'm not specifically familiar with the function of the I'm more interested in why the stacks are so deep. I don't suppose you can share managed stacks corresponding to the ones in your sample? Feel free to anonymize the frames from your app.
Oh, and did the console app eventually break?
Hey, no, the console app is still running fine. I still have our app trapped in the VS debugger, but I cannot seem to get any stack traces except for our main thread and two worker threads, all of which are simply waiting for something:
311188 0x0 in System.Threading.ManualResetEventSlim.Wait(int millisecondsTimeout, System.Threading.CancellationToken cancellationToken) // timeout is infinite
311211 .NET ThreadPool Worker 0x0 in System.Threading.WaitHandle.WaitMultiple(System.ReadOnlySpan<System.Threading.WaitHandle> waitHandles, bool waitAll, int millisecondsTimeout) // timeout is infinite
311215 .NET Long Running Task 0x0 in System.Threading.WaitHandle.WaitMultiple(System.ReadOnlySpan<System.Threading.WaitHandle> waitHandles, bool waitAll, int millisecondsTimeout) // timeout is 150 seconds
This is typical of our app in its quiescent state, where it's basically waiting for a network connection. All other threads appear to be in .NET code, also generally waiting for something, but if I double-click any of them, the Stack window is empty and the source window says "Source not available." The only other interesting thread is:
311208 Console logger queue processing thread 0x0 in Microsoft.Extensions.Logging.Console.ConsoleLoggerProcessor.TryDequeue(out Microsoft.Extensions.Logging.Console.LogMessageEntry item)
Any advice on how to dig deeper?
If you really wanted, you could enable Microsoft Symbol Servers and SourceLink support to get a look at those source files, but it's probably not that interesting at the moment. Does VS for Mac have the Parallel Stacks window? Sometimes that gives a clearer overview of what's going on. It may also offer Parallel Tasks, which would show what's on hold. If that's not an option, you may have better luck inspecting the stacks outside VS. You can try saving a dump and analyzing it with dotnet-dump. (Caveat: I'm not 100% sure they use compatible dump formats.) It's a lot less convenient than VS, but sometimes it exposes different information (i.e. they're complementary).
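Roughly the workflow I have in mind, from memory (verify the exact options against your installed dotnet-dump):

# Capture a dump of the running process (Heap is usually a good size/detail
# trade-off; Full also works but can be enormous).
dotnet-dump collect -p <pid> --type Heap

# Open the dump and look at all managed stacks at once.
dotnet-dump analyze <dump-file>
> clrstack -all
> parallelstacks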
I tried to run dotnet-dump with --type Mini or Heap and it said: Writing dump with heap to /Users/marcel/Work/Host4Mac/Phd/core_20230612_144117
So I ran it with --type Full, and 10 minutes later I had a 36GB file, but I had to detach the VS debugger in order for it to do its thing. I will not attempt to attach a 36GB file here. I was able to run dotnet-dump analyze and get a parallelstacks display (VS for Mac does not have this ability).
Unfortunately, the parallelstacks display does not line up with what I first saw when I took an Activity Monitor sample of this process and looked at the same thread (i.e. 311188 or 0x4bf94). That showed this crazy looping:
Basically, by the time I had detached the VS debugger and taken the dump, the process was in a strange state (Activity Monitor showed memory usage fluctuating but no CPU usage) and it was very shortly terminated by the OS. I still have the dump file, and can try to look at whatever you suggest. Meanwhile, I have stopped the Console app and restarted the WorkerTest, which is the barebones app that demonstrates the bad behavior. If I can catch that in the act, I will try to dotnet-dump it without first attaching the VS debugger.
Huh, I wonder why Heap didn't work - that's usually the best choice. Maybe a macOS restriction? Sorry about the 36 GB file. 😆 I noticed that long sequence of repeated frames too, but my concern was that it might just be the native code that triggers a dotnet method call, in which case it's not very interesting on its own. This thread looked kind of interesting:
Maybe something is timing out? Some other random thoughts:
Looks like I was too hasty (chiefly because I thought we had a repro in a vanilla console app). In my non-reproing service, these are the gcroots of the only
As a point of interest,
I grabbed dumps of the worker running as a service and running locally. Each has a single FileSystemWatcher, and the one in the service is watching /.
Edit: I did another experiment and it looks like the local process is watching the working directory.
Obviously, triald is running on my machine. How did you determine that the service is watching /?
In the dump, you can use
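For anyone following along, the dotnet-dump analyze sequence for this is roughly the following (SOS command names from memory; exact field names may differ by runtime version):

dotnet-dump analyze <dump-file>
> dumpheap -type FileSystemWatcher    (lists instances and their addresses)
> dumpobj <address>                   (shows the instance's fields, including the watched directory)
> gcroot <address>                    (shows what is keeping the instance alive)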
FWIW, I can see the same thing: it is watching /. Also, at the risk of supplying data we don't need, I'll add that I've long suspected the FileSystemWatcher to be the source of the problem, so at one time (many months ago) I explicitly added a new FileSystemWatcher to the Worker app and had it watch "/" as well. However, without knowing what the trigger was for the bad behavior, this alone did not cause any problem. I think it pretty clearly comes down to the triald cyclical file system links.
It's going to take some time to disentangle this, but I think you can probably mitigate the issue by setting your content root explicitly so that it doesn't try to watch the whole world. Normally, I'd validate that before suggesting it, but I have to run and you seem both capable and motivated. 😉
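Something along these lines in Program.cs, as a sketch (the path is just an example; anything that isn't the drive root should do):

using System;
using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.Hosting;

IHost host = Host.CreateDefaultBuilder(args)
    // Point the content root at the app's own directory (or any harmless folder)
    // instead of the process CWD, which is "/" for a global daemon.
    .UseContentRoot(AppContext.BaseDirectory)
    .ConfigureServices(services =>
    {
        services.AddHostedService<Worker>();
    })
    .Build();

host.Run();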
I'll give it a try, thanks for your help!
@parnasm Did that work? I'm going to have to shift my focus pretty soon and I'd like to make sure you're unblocked.
Hi Andrew, sorry - I've had a detour today and have been unable to give it a try yet. Will get back to you tomorrow.
I know the feeling. 😆
Good Morning, and Good News!
The work-around of setting the content root appears to solve the problem. I wanted to have a reliable way to verify that it was working, and discovered I could force the CPU/Memory explosion by zipping up the ~/Library/Trial folder, sending the original to the trash, and then unzipping the zipped folder. As soon as I unzipped the zip, the Worker process went nuts. Then I added the code to set the content root to a harmless folder elsewhere on the drive, rebuilt and reran the test. With the content root set, the process was oblivious of the Trial folder and held steady after the unzip. I took a dump of the process and examined the one and only FileSystemWatcher and verified its directory_ property was set to the harmless folder. Next, I'll be adding similar code to our production app.
Thanks again for your help.
Excellent! This has been a fun collaboration. 😄
Since it may be a little while until the underlying issue is fixed, here's my best attempt at a summary. When you run an aspnetcore app as a global daemon on macOS (and probably anywhere else the CWD is /), the content root defaults to the CWD, so the host's configuration-reload file watcher ends up recursively watching the entire drive; the triald folder's cyclical file system links appear to be what tips it into runaway CPU and memory usage.
Mitigation: Specify a content root that isn't the FS/drive root.
Issue 1: Why is it watching recursively? It seems to only care about two particular settings files.
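For context, the watcher comes from the default app-configuration wiring; paraphrasing what CreateDefaultBuilder sets up (a sketch, not the literal framework source):

using Microsoft.Extensions.Configuration;
using Microsoft.Extensions.Hosting;

IHostBuilder builder = new HostBuilder()
    .ConfigureAppConfiguration((ctx, config) =>
    {
        // Both files are registered with reloadOnChange: true, which creates a
        // PhysicalFileProvider rooted at the content root (the CWD by default)
        // and a FileSystemWatcher underneath it. When the content root is "/",
        // that watcher effectively covers the whole drive.
        config.AddJsonFile("appsettings.json", optional: true, reloadOnChange: true);
        config.AddJsonFile($"appsettings.{ctx.HostingEnvironment.EnvironmentName}.json",
                           optional: true, reloadOnChange: true);
    });

CreateDefaultBuilder also honors a hostBuilder:reloadConfigOnChange host configuration setting that turns the reload watchers off entirely, which may be another way out - worth verifying on the runtime version in use.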
One other question! I mentioned we also have a user agent running, which talks to the service via gRPC. The agent also uses Kestrel, and I want to make sure its content root is never set to "/", as it might be when running as root on the login window. I can use the same code in the agent as I have in the service to redirect the content root. However, as the agent is basically a Xamarin (Microsoft.macOS) project, I cannot seem to use dotnet-dump on it to verify that there's only the one FileSystemWatcher and that it's looking in the right place. The error message I get is "Process xxxx not running compatible .NET runtime." Is there another way to search managed memory for objects of a given type? I've tried some suggestions of things in the Immediate window, but they seem not to be supported on the Mac.
Unfortunately, I don't use VS for Mac often enough to have clear guidance for you. It sounds like it probably doesn't have heap exploration tools, but I don't know one way or the other. However, given that you can modify the code, I'd probably start with the gcroots above to see who has references to the file watcher. If none of them are in scope, you can inject one of the DI components with indirect access and then follow references in the Locals/Watch window. Either that, or look in the service collection for something that would have a path to the file watcher.
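If it helps, one way to check without heap tools is to log the effective content root from inside the agent itself at startup; a minimal sketch, assuming the agent builds a standard generic host (the host variable here is hypothetical):

using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.Hosting;
using Microsoft.Extensions.Logging;

// "host" stands in for however the agent builds its IHost / WebApplication.
IHostEnvironment env = host.Services.GetRequiredService<IHostEnvironment>();
ILogger logger = host.Services.GetRequiredService<ILoggerFactory>().CreateLogger("Startup");

// If this ever logs "/", the default configuration watcher is rooted at the whole drive.
logger.LogInformation("Content root: {ContentRoot}", env.ContentRootPath);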
This issue has been moved from a ticket on Developer Community.
Using Visual Studio Professional 2022 for Mac Version 17.5.1, I created a blank Worker Service using the project template of that name. I then built the project without modification and ran it as a launchd global daemon. After letting it run for less than 24 hours, macOS killed the process with the following report:
23:19:21.591511 process Worker-04-10-202[54918] thread 16118204 caught burning CPU! It used more than 50% CPU over 180 seconds
23:19:21.593692 com.apple.symptomsd Received CPU usage trigger:
Worker-04-10-2023[54918] () used 90.00s of CPU over 169.64 seconds (averaging 53%), violating a CPU usage limit of 90.00s over 180 seconds.
23:19:21.596745 com.apple.symptomsd RESOURCE_NOTIFY trigger for Worker-04-10-2023 [54918] (90000066750 nanoseconds of CPU usage over 169.00s seconds, violating limit of 90000000000 nanoseconds of CPU usage over 180.00s seconds)
23:19:22.032062 Saved cpu_resource.diag report for Worker-04-10-2023 version ??? to Worker-04-10-2023_2023-04-10-231921_Marcels-MacBook-Pro-2.cpu_resource.diag
05:36:33.248530 low swap: killing largest compressed process with pid 54918 (Worker-04-10-202) and size 39587 MB
This same behavior has been plaguing our full-scale application for over six months. The only work-around has been to set the daemon to auto-restart. I have tried Release builds vs. Debug builds, and x64 builds vs. arm64 builds, and tried running on x64 hardware and M1 hardware, all with the same result, the only difference being the amount of time before the application "blows up" and is killed by the OS.
Original Comments
Feedback Bot on 4/13/2023, 07:43 PM:
(private comment, text removed)
Original Solutions
(no solutions)