
Application created from Worker Service template violates CPU usage limit and consumes all available swap space #48780

Open
vsfeedback opened this issue May 31, 2023 · 58 comments
Labels
area-networking - Includes servers, yarp, json patch, bedrock, websockets, http client factory, and http abstractions
Author: Migration Bot 🤖 - The issue was created by an issue mover bot. The author may not be the actual author.

Comments

@vsfeedback

This issue has been moved from a ticket on Developer Community.


Using Visual Studio Professional 2022 for Mac, version 17.5.1, I created a blank Worker Service using the project template of that name. I then built the project without modification, and ran it as a launchd global daemon. After letting it run for less than 24 hours, macOS killed the process with the following report:

23:19:21.591511 process Worker-04-10-202[54918] thread 16118204 caught burning CPU! It used more than 50% CPU over 180 seconds
23:19:21.593692 com.apple.symptomsd Received CPU usage trigger:
Worker-04-10-2023[54918] () used 90.00s of CPU over 169.64 seconds (averaging 53%), violating a CPU usage limit of 90.00s over 180 seconds.
23:19:21.596745 com.apple.symptomsd RESOURCE_NOTIFY trigger for Worker-04-10-2023 [54918] (90000066750 nanoseconds of CPU usage over 169.00s seconds, violating limit of 90000000000 nanoseconds of CPU usage over 180.00s seconds)
23:19:22.032062 Saved cpu_resource.diag report for Worker-04-10-2023 version ??? to Worker-04-10-2023_2023-04-10-231921_Marcels-MacBook-Pro-2.cpu_resource.diag
05:36:33.248530 low swap: killing largest compressed process with pid 54918 (Worker-04-10-202) and size 39587 MB

This same behavior has been plaguing our full-scale application for over six months. The only workaround has been to set the daemon to auto-restart. I have tried Release builds vs. Debug builds, and x64 builds vs. arm64 builds, and tried running on x64 hardware and M1 hardware, all with the same result; the only difference is the amount of time before the application “blows up” and is killed by the OS.


Original Comments

Feedback Bot on 4/13/2023, 07:43 PM:

(private comment, text removed)


Original Solutions

(no solutions)

@ghost ghost added the Author: Migration Bot 🤖 The issue was created by a issue mover bot. The author may not be the actual author. label May 31, 2023

amcasey commented Jun 2, 2023

This seems very strange since the default worker service doesn't do anything besides print a time-stamped log message once per second. There's a tiny allocation for each one, but it seems unlikely that the allocations would be allowed to build up for so long that it would take 90 seconds of CPU time to GC them.
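
For context, the template's hosted service is essentially just the following (paraphrased from the .NET 7 Worker template, so details may differ slightly):

using System;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Extensions.Hosting;
using Microsoft.Extensions.Logging;

public class Worker : BackgroundService
{
    private readonly ILogger<Worker> _logger;

    public Worker(ILogger<Worker> logger) => _logger = logger;

    protected override async Task ExecuteAsync(CancellationToken stoppingToken)
    {
        // Log a timestamp once per second until the host signals shutdown.
        while (!stoppingToken.IsCancellationRequested)
        {
            _logger.LogInformation("Worker running at: {time}", DateTimeOffset.Now);
            await Task.Delay(1000, stoppingToken);
        }
    }
}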


amcasey commented Jun 5, 2023

I tried this over the weekend with dotnet 7.0.302 on macOS 12.6.4 (Intel). After an initial working set increase (over the first minute or so), things stabilized and stayed flat for the remainder of the run. Same on Windows.

Disclaimer: I didn't use a daemon in either repro attempt and it's possible that that matters.

From the internal ticket, the original OS was probably 13.2, which might matter. I should be able to test with launchd, but getting a mac with an arbitrary OS is trickier.

@amcasey amcasey self-assigned this Jun 5, 2023

amcasey commented Jun 7, 2023

Since it took me a few tries to cobble it together, here's my /Library/LaunchDaemons/gh48548.plist

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
	<dict>
		<key>Label</key>
		<string>gh48548</string>
		<key>Program</key>
		<string>/Users/asplab/repros/gh48548/bin/Debug/net7.0/gh48548</string>
		<key>RunAtLoad</key>
		<false/>
		<key>EnvironmentVariables</key>
		<dict>
			<key>DOTNET_ROOT</key>
			<string>/Users/asplab/dotnet/</string>
			<key>DOTNET_DbgEnableMiniDump</key>
			<string>1</string>
			<key>DOTNET_EnableCrashReport</key>
			<string>1</string>
		</dict>
		<key>StandardOutPath</key>
		<string>/tmp/gh48548.stdout</string>
		<key>StandardErrorPath</key>
		<string>/tmp/gh48548.stderr</string>
	</dict>
</plist>


parnasm commented Jun 8, 2023

Hey, thanks for looking at this. I want to say it is necessary to be running as a global daemon in order to observe the problem. Also, it is unpredictable how long it has to run before the problem occurs: sometimes it is within a day, other times it can run for up to a week. I have not tried running with your DOTNET_ flags; perhaps I will do that, but my impression is that when this occurs we get no warning or opportunity to take crash dumps.


amcasey commented Jun 8, 2023

@parnasm Thanks for the update! I tried again with a global daemon, but it was still using a reasonable amount of memory and CPU a day later. You said you've seen this on both ARM and x64?

Also, I notice that the original error message mentioned a .cpu_resource.diag file. Did you have a chance to review that?


parnasm commented Jun 8, 2023

Worker-04-20-2023_2023-06-02-111037_Marcels-MacBook-Pro-2.cpu_resource.diag.txt

@amcasey I no longer have the one from that run in April, but I have the most recent one which is similar. The thing that I believe is constant in these reports is the presence of an FSEvents callback. And yes, for the most part, the sample will run fine. Then something in the OS seems to trigger. And arm64 or x64 makes no appreciable difference.

Notice in the attached case, no action was taken. It appears sometimes the app is able to weather the storm!


amcasey commented Jun 8, 2023

Yeah, sorry about that delay - your report got misrouted. 😞

Do I understand correctly that the attached file does not show a failure?


parnasm commented Jun 8, 2023

It shows a sustained period of above average cpu usage, but it was not enough to trigger killing the app. In cases where the app has been killed, it is usually (always?) due to low swap space.


amcasey commented Jun 8, 2023

It's just so unclear what it could be doing with CPU cycles, given that all it does is log once per second...


parnasm commented Jun 8, 2023

My original post on MS Developer Community included more detail than is shown here, including the .plist and the project folder which I no longer see attached to that post or this.

Completely agree, it makes no sense that this app itself is doing anything with CPU cycles. My intuition is it is something deep in the dotnet runtime that is watching for file system changes (for hot-swaps perhaps?) and somehow runs into an infinite loop or self-generated feedback loop.


amcasey commented Jun 8, 2023

It looks like maybe it was trying to start a thread? So (even more tenuously) maybe the thread pool is getting exhausted? Somehow?


amcasey commented Jun 8, 2023

My original post on MS Developer Community included more detail than is shown here, including the .plist and the project folder which I no longer see attached to that post or this.

That's frustrating. I did find the zip file of your project and some console output, but not the plist, which would have been very helpful. I think attachments expire for privacy (or other) reasons.

Completely agree, it makes no sense that this app itself is doing anything with CPU cycles. My intuition is it is something deep in the dotnet runtime that is watching for file system changes (for hot-swaps perhaps?) and somehow runs into an infinite loop or self-generated feedback loop.

That's going to be extremely difficult to debug without a trace or something. I'm sure we could rig up a way to collect such a trace, but it does tend to change the timing of the running program, which seems likely to be important here.

Did you say you'd tried this with a global agent (i.e. running as you, rather than root) and it hadn't repro'd?


parnasm commented Jun 8, 2023

I don't see where it is trying to start a thread; the main thread has already started and is processing a file system event, in the course of which it is doing all kinds of system-level calls into libsystem_kernel.dylib, etc.

In the past I have been able to attach a debugger to the daemon process just before it was killed, and was able to see a very deep callstack, again involving the file system event callback. I will try to either dig that up or catch it in the act again.

Yes, I have tried running as the logged-on user, and while I will claim to have seen the same thing happen, it was a long time ago at this point, many revs of Visual Studio and macOS ago. Running as root is the surest way to see this happen.


amcasey commented Jun 8, 2023

I don't see where it is trying to start a thread, the main thread has already started, and is processing a file system event, in the course of which it is doing all kinds of system level calls into libsystem_kernel.dylib, etc.

My mistake - I was reading the stack upside down. 😆


amcasey commented Jun 8, 2023

This is on .NET 7, correct? Can you try experimentally setting the environment variable COMPlus_GCName=clrgc.dll? That'll revert to an older version of the GC, which will help us determine whether changes in that area are affecting your scenario.


parnasm commented Jun 8, 2023

I can try that but as I mentioned in my original post, we've been experiencing this problem for over a year, going back to when we were still on .NET Core 3.1. We subsequently moved to .NET 6.0 and then to 7.0, with no improvement.


amcasey commented Jun 8, 2023

Ah, don't bother then. Out of curiosity, given that it works with such a simple toy app, have you tried an even simpler app? e.g. a console app that just logs in a loop? That would eliminate a lot of moving parts.
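
Something as minimal as this (illustrative only, not prescribing exact code) would take the generic host, DI, logging infrastructure, and configuration reload out of the picture:

using System;
using System.Threading.Tasks;

// Bare console app: one log line per second, no Host, no DI, no configuration,
// and therefore no FileSystemWatcher anywhere in the app's own code.
while (true)
{
    Console.WriteLine($"{DateTimeOffset.Now:O} heartbeat");
    await Task.Delay(TimeSpan.FromSeconds(1));
}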


parnasm commented Jun 9, 2023

I set up a console app yesterday and left it running side by side with our production app. The console app is fine so far, but I just managed to catch the production app going CPU nuts, and I took a sample and a screenshot of the Activity Monitor details. Note the 117% CPU usage; the 2.5 GB real memory size is also way out of proportion to normal usage, which is usually <1% CPU and ~100 MB when idle. The call graphs in the sample are insane.

Sample of Phd.txt

Screenshot 2023-06-09 at 10 30 30


parnasm commented Jun 9, 2023

I also managed to attach the Visual Studio debugger and break into the Phd process. Attached is a screenshot showing our main thread doing nothing and none of our other threads doing anything, but note that the .NET File Watcher thread is calling the system lstat() function, and that the Console log processing thread is in TryDequeue(). I have observed this before, and again, purely intuitively, it seems to me there may be an interaction between the File Watcher and the Logger that is causing this raucous behavior.

Screenshot 2023-06-09 at 10 58 27


parnasm commented Jun 9, 2023

And just for fun, here's Activity Monitor showing how our Phd process is about to get axed due to consuming all available swap space.

Screenshot 2023-06-09 at 11 11 51


parnasm commented Jun 9, 2023

Hey @amcasey, don't want to pester you, just want to make sure you get these updates. Have a good weekend!


amcasey commented Jun 9, 2023

@parnasm I did - thanks! I'm slammed today, so I'll have to dig in next week. Enjoy your weekend!


amcasey commented Jun 12, 2023

I'm not specifically familiar with the function of the .NET File Watcher thread, but my guess would be that it's just polling and basically any dump you could grab would show it in lstat. My intuition is that that's benign.

I'm more interested in why the stacks are so deep. I don't suppose you can share managed stacks corresponding to the ones in your sample? Feel free to anonymize the frames from your app.


amcasey commented Jun 12, 2023

Oh, and did the console app eventually break?


parnasm commented Jun 12, 2023

Hey, no, the console app is still running fine. I still have our app trapped in the VS debugger, but I cannot seem to get any stack traces except for our main thread and two worker threads, all of which are simply waiting for something:

311188 0x0 in System.Threading.ManualResetEventSlim.Wait(int millisecondsTimeout, System.Threading.CancellationToken cancellationToken) // timeout is infinite

311211 .NET ThreadPool Worker 0x0 in System.Threading.WaitHandle.WaitMultiple(System.ReadOnlySpan<System.Threading.WaitHandle> waitHandles, bool waitAll, int millisecondsTimeout) // timeout is infinite

311215 .NET Long Running Task 0x0 in System.Threading.WaitHandle.WaitMultiple(System.ReadOnlySpan<System.Threading.WaitHandle> waitHandles, bool waitAll, int millisecondsTimeout) // timeout is 150 seconds

This is typical of our app in its quiescent state, where it's basically waiting for a network connection.

All other threads appear to be in .NET code, also generally waiting for something, but if I double click any of them, the Stack window is empty and the source window says "Source not available."

Screenshot 2023-06-12 at 13 22 28

The only other interesting thread is:

311208 Console logger queue processing thread 0x0 in Microsoft.Extensions.Logging.Console.ConsoleLoggerProcessor.TryDequeue(out Microsoft.Extensions.Logging.Console.LogMessageEntry item)

Any advice how to dig deeper?


amcasey commented Jun 12, 2023

If you really wanted, you could enable Microsoft Symbol Servers and SourceLink support to get a look at those source files, but it's probably not that interesting at the moment.

Does VS for Mac have the Parallel Stacks window? Sometimes, that gives a clearer overview of what's going on. It may also offer Parallel Tasks, which would show what's on hold.

If that's not an option, you may have better luck inspecting the stack outside VS. You can try saving a dump and analyzing it with dotnet-dump. (Caveat: I'm not 100% sure they use compatible dump formats.) It's a lot less convenient than VS, but sometimes it exposes different information (i.e. they're complementary).


parnasm commented Jun 12, 2023

I tried to run dotnet-dump --type Mini or Heap and it said:

Writing dump with heap to /Users/marcel/Work/Host4Mac/Phd/core_20230612_144117
Process 2444 not running compatible .NET runtime.

So I ran it "--Full" and 10 minutes later I had a 36GB file, but I had to detach the VS Debugger in order for it to do its thing. I will not attempt to attach a 36GB file here. I was able to run dotnet-dump analyze and get a parallelstacks display (VS for Mac does not have this ability):

parallelstacks.txt

Unfortunately, the parallelstacks display does not line up with what I first saw when I took an Activity Monitor Sample of this process and looked at the same thread (i.e. 311188 or 0x4bf94). That showed this crazy looping:
Call graph:
1422 Thread_311188 DispatchQueue_1: com.apple.main-thread (serial)
+ 1422 MethodDescCallSite::CallTargetWorker(unsigned long const*, unsigned long*, int) (in libcoreclr.dylib) + 852 [0x105221fa4]
+ 1422 MethodDescCallSite::CallTargetWorker(unsigned long const*, unsigned long*, int) (in libcoreclr.dylib) + 852 [0x105221fa4]
+ 1422 MethodDescCallSite::CallTargetWorker(unsigned long const*, unsigned long*, int) (in libcoreclr.dylib) + 852 [0x105221fa4]
+ 1422 MethodDescCallSite::CallTargetWorker(unsigned long const*, unsigned long*, int) (in libcoreclr.dylib) + 852 [0x105221fa4]
[etc., ad infinitum]

Basically, by the time I had detached the VS debugger and taken the dump, the process was in a strange state (Activity Monitor showed memory usage fluctuating but no CPU usage) and it was very shortly terminated by the OS.

I still have the dump file, and can try to look at whatever you suggest. Meanwhile, I have stopped the Console app and restarted the WorkerTest, which is the barebones app that demonstrates the bad behavior. If I can catch that in the act, I will try to dotnet-dump it without first attaching the VS debugger.


amcasey commented Jun 12, 2023

Huh, I wonder why Heap didn't work - that's usually the best choice. Maybe a macOS restriction? Sorry about the 36 GB file. 😆

I noticed that long sequence of repeated frames too, but my concern was that it might just be the native code that triggers a .NET method call, in which case it's not very interesting on its own.

This thread looked kind of interesting:

                   4 Grpc.Net.Client.Internal.GrpcCall<System.__Canon, System.__Canon>.IsCancellationOrDeadlineException(Exception)
                   4 Grpc.Net.Client.Internal.GrpcCall<System.__Canon, System.__Canon>+<RunCall>d__73.MoveNext()

Maybe something is timing out?

Some other random thoughts:

  1. Are you logging to a temp directory? If so, does that directory periodically get wiped? (Sorry, I never found your plist file from the original report.)
  2. I'd be interested to know what dumpheap -stat produces, but there's a pretty good chance it will choke on 36 GB. 😆 Knowing what it's doing with all that space might help us narrow down what's going wrong.

@amcasey amcasey removed the needs-area-label Used by the dotnet-issue-labeler to label those issues which couldn't be triaged automatically label Jun 13, 2023
@amcasey amcasey self-assigned this Jun 13, 2023
@amcasey amcasey transferred this issue from dotnet/runtime Jun 13, 2023

amcasey commented Jun 13, 2023

Looks like I was too hasty (chiefly because I thought we had a repro in a vanilla console app). In my non-reproing service, these are the gcroots of the only FileSystemWatcher instance:

HandleTable:
    0000000106fe13a8 (strong handle)
          -> 00019f7d7420     System.Object[] 
          -> 0001877e9468     Microsoft.Extensions.Logging.EventSource.LoggingEventSource (static variable: Microsoft.Extensions.Logging.EventSource.ExceptionInfo.<Empty>k__BackingField)
          -> 000187802c58     System.Threading.CancellationTokenSource 
          -> 000187802c88     System.Threading.CancellationTokenSource+Registrations 
          -> 000187802cc8     System.Threading.CancellationTokenSource+CallbackNode 
          -> 000187802b78     Microsoft.Extensions.Primitives.ChangeToken+ChangeTokenRegistration<System.String> 
          -> 000187802b38     System.Action<System.String> 
          -> 000187802950     Microsoft.Extensions.Options.OptionsMonitor<Microsoft.Extensions.Logging.LoggerFilterOptions> 
          -> 0001878026a0     Microsoft.Extensions.Options.OptionsFactory<Microsoft.Extensions.Logging.LoggerFilterOptions> 
          -> 000187802560     Microsoft.Extensions.Options.IConfigureOptions<Microsoft.Extensions.Logging.LoggerFilterOptions>[] 
          -> 0001877e8cb0     Microsoft.Extensions.Logging.LoggerFilterConfigureOptions 
          -> 0001877e8b78     Microsoft.Extensions.Configuration.ConfigurationSection 
          -> 0001877e6a60     Microsoft.Extensions.Configuration.ConfigurationRoot 
          -> 0001877e4e28     System.Collections.Generic.List<Microsoft.Extensions.Configuration.IConfigurationProvider> 
          -> 0001877e4e68     Microsoft.Extensions.Configuration.IConfigurationProvider[] 
          -> 0001877e4ea0     Microsoft.Extensions.Configuration.Json.JsonConfigurationProvider 
          -> 0001877e4d40     Microsoft.Extensions.Configuration.Json.JsonConfigurationSource 
          -> 0001877e4a18     Microsoft.Extensions.FileProviders.PhysicalFileProvider 
          -> 0001877e6eb0     Microsoft.Extensions.FileProviders.Physical.PhysicalFilesWatcher 
          -> 0001877e50a8     System.IO.FileSystemWatcher 

    0000000106fe13e8 (strong handle)
          -> 00019f7d3018     System.Object[] 
          -> 000187814228     System.Threading.TimerQueue[] (static variable: System.ValueTuple<System.Int64, System.DateTime>.s_tickCountToTimeMap)
          -> 000187814570     System.Threading.TimerQueue 
          -> 000187bc6668     System.Threading.TimerQueueTimer 
          -> 000187bc6610     System.Threading.Tasks.Task+DelayPromiseWithCancellation 
          -> 000187818850     System.Runtime.CompilerServices.AsyncTaskMethodBuilder<System.Threading.Tasks.VoidTaskResult>+AsyncStateMachineBox<gh48548.Worker+<ExecuteAsync>d__2> 
          -> 0001878188b0     System.Runtime.CompilerServices.AsyncTaskMethodBuilder<System.Threading.Tasks.VoidTaskResult>+AsyncStateMachineBox<Microsoft.Extensions.Hosting.Internal.Host+<TryExecuteBackgroundServiceAsync>d__13> 
          -> 000187808878     gh48548.Worker 
          -> 000187809b70     System.Threading.CancellationTokenSource+Linked1CancellationTokenSource 
          -> 000187808910     System.Threading.CancellationTokenSource+CallbackNode 
          -> 000187809bb0     System.Threading.CancellationTokenSource+Registrations 
          -> 0001878075a0     System.Threading.CancellationTokenSource+Linked1CancellationTokenSource 
          -> 000187807678     System.Threading.CancellationTokenSource+CallbackNode 
          -> 0001878077f0     System.Threading.CancellationTokenSource+CallbackNode 
          -> 000187806f48     Microsoft.Extensions.Hosting.Internal.ConsoleLifetime 
          -> 000187803ff0     Microsoft.Extensions.Hosting.Internal.ApplicationLifetime 
          -> 000187803378     Microsoft.Extensions.Logging.Logger<Microsoft.Extensions.Hosting.Internal.ApplicationLifetime> 
          -> 000187803c48     Microsoft.Extensions.Logging.Logger 
          -> 000187803c70     Microsoft.Extensions.Logging.LoggerInformation[] 
          -> 000187803ce8     Microsoft.Extensions.Logging.Console.ConsoleLogger 
          -> 0001877fda78     Microsoft.Extensions.Logging.Console.SimpleConsoleFormatter 
          -> 0001877fdc88     Microsoft.Extensions.Options.OptionsMonitor<Microsoft.Extensions.Logging.Console.SimpleConsoleFormatterOptions>+ChangeTrackerDisposable 
          -> 0001877fd8a8     Microsoft.Extensions.Options.OptionsMonitor<Microsoft.Extensions.Logging.Console.SimpleConsoleFormatterOptions> 
          -> 0001877fd5b0     Microsoft.Extensions.Options.OptionsFactory<Microsoft.Extensions.Logging.Console.SimpleConsoleFormatterOptions> 
          -> 0001877fd400     Microsoft.Extensions.Options.IConfigureOptions<Microsoft.Extensions.Logging.Console.SimpleConsoleFormatterOptions>[] 
          -> 0001877fd448     Microsoft.Extensions.Logging.ConsoleLoggerFormatterConfigureOptions<Microsoft.Extensions.Logging.Console.SimpleConsoleFormatter, Microsoft.Extensions.Logging.Console.SimpleConsoleFormatterOptions> 
          -> 0001877fd4a0     System.Action<Microsoft.Extensions.Logging.Console.SimpleConsoleFormatterOptions> 
          -> 0001877fd488     Microsoft.Extensions.Options.ConfigureFromConfigurationOptions<Microsoft.Extensions.Logging.Console.SimpleConsoleFormatterOptions>+<>c__DisplayClass0_0 
          -> 0001877fd460     Microsoft.Extensions.Configuration.ConfigurationSection 
          -> 0001877fb5c8     Microsoft.Extensions.Configuration.ConfigurationRoot 
          -> 0001877fb530     System.Collections.Generic.List<Microsoft.Extensions.Configuration.IConfigurationProvider> 
          -> 0001877fb570     Microsoft.Extensions.Configuration.IConfigurationProvider[] 
          -> 0001877fb550     Microsoft.Extensions.Configuration.ChainedConfigurationProvider 
          -> 0001877fb4a8     Microsoft.Extensions.Configuration.ConfigurationSection 
          -> 0001877e6a60     Microsoft.Extensions.Configuration.ConfigurationRoot 
          -> 0001877e4e28     System.Collections.Generic.List<Microsoft.Extensions.Configuration.IConfigurationProvider> 
          -> 0001877e4e68     Microsoft.Extensions.Configuration.IConfigurationProvider[] 
          -> 0001877e4ea0     Microsoft.Extensions.Configuration.Json.JsonConfigurationProvider 
          -> 0001877e4d40     Microsoft.Extensions.Configuration.Json.JsonConfigurationSource 
          -> 0001877e4a18     Microsoft.Extensions.FileProviders.PhysicalFileProvider 
          -> 0001877e6eb0     Microsoft.Extensions.FileProviders.Physical.PhysicalFilesWatcher 
          -> 0001877e50a8     System.IO.FileSystemWatcher 


amcasey commented Jun 13, 2023

As a point of interest, triald does not appear to be running on the mac I've been using for repro attempts.


amcasey commented Jun 13, 2023

I grabbed dumps of the worker running as a service and running locally. Each has a single FileSystemWatcher. The local process is watching the directory from which I launched the process (or possibly the cwd), whereas the service is watching /. 😲

Edit: I did another experiment and it looks like the local process is watching the working directory.
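
If it helps, a quick way to confirm what a given process resolved (assuming you can edit its Program.cs; Worker here is just the template's hosted service) is to log the content root at startup. The host defaults it to the current working directory, which is "/" for a launchd daemon:

using System;
using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.Hosting;

IHost host = Host.CreateDefaultBuilder(args)
    .ConfigureServices(services => services.AddHostedService<Worker>())
    .Build();

// The JSON configuration providers (and their file watcher) are rooted here.
var env = host.Services.GetRequiredService<IHostEnvironment>();
Console.WriteLine($"ContentRootPath = {env.ContentRootPath}");

host.Run();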


parnasm commented Jun 13, 2023

Obviously, triald is running on my machine.

How did you determine that the service is watching "/"?


amcasey commented Jun 13, 2023

In the dump, you can use dumpheap -type FileSystemWatcher to find the MT, then dumpheap -mt XXXX to find the address, then dumpobj YYYY to find the _directory field, and then dumpobj ZZZZ to see the string value.


parnasm commented Jun 13, 2023

FWIW, I can see the same thing: it is watching "/".

Also, at the risk of supplying data we don't need, I'll add that I've long suspected the FileSystemWatcher to be the source of the problem, so at one time (many months ago) I explicitly added a new FileSystemWatcher to the Worker app and had it watch "/" as well. However, without knowing what the trigger was for the bad behavior, this alone did not cause any problem. I think it pretty clearly comes down to the triald cyclical file system links.
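
For reference, that old experiment was roughly this shape (illustrative code, not the original):

using System;
using System.IO;

// An explicit recursive watcher rooted at "/"; on its own, this did not reproduce the problem.
using var watcher = new FileSystemWatcher("/")
{
    IncludeSubdirectories = true,
    EnableRaisingEvents = true
};
watcher.Changed += (_, e) => Console.WriteLine($"Changed: {e.FullPath}");
Console.ReadLine(); // keep the process alive while the watcher runs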


amcasey commented Jun 13, 2023

It's going to take some time to disentangle this, but I think you can probably mitigate the issue by setting your content root explicitly so that it doesn't try to watch the whole world. Normally, I'd validate that before suggesting it, but I have to run and you seem both capable and motivated. 😉
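
For reference, the kind of change I have in mind is roughly this (untested sketch against the worker template; adjust to your setup):

using System;
using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.Hosting;

IHost host = Host.CreateDefaultBuilder(args)
    // Point the content root (and therefore the settings-file watcher) at the app's
    // install directory instead of the daemon's working directory ("/").
    .UseContentRoot(AppContext.BaseDirectory)
    .ConfigureServices(services => services.AddHostedService<Worker>())
    .Build();

host.Run();

I believe setting the contentRoot host setting through configuration (e.g. a DOTNET_CONTENTROOT environment variable in your plist) should have the same effect if editing code is inconvenient, but I haven't verified that here.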


parnasm commented Jun 13, 2023

I'll give it a try, thanks for your help!


amcasey commented Jun 14, 2023

@parnasm Did that work? I'm going to have to shift my focus pretty soon and I'd like to make sure you're unblocked.


parnasm commented Jun 14, 2023 via email


amcasey commented Jun 14, 2023

Hi Andrew, Sorry I've had a detour today and unable to give it a try yet. Will get back to you tomorrow.

I know the feeling. 😆


parnasm commented Jun 15, 2023 via email


amcasey commented Jun 15, 2023

Excellent! This has been a fun collaboration. 😄


amcasey commented Jun 15, 2023

Since it may be a little while until the underlying issue is fixed, here's my best attempt at a summary.

When you run an ASP.NET Core app as a global daemon on macOS (and probably anywhere else the CWD is /), it can end up watching the entire file system for changes. On macOS in particular, there's a daemon called triald (apparently only present after a GUI login, vs. ssh) that creates a ~/Library/Trial directory full of symlinks, some of which may form a cycle. When this happens, the file watcher goes nuts trying to watch an infinite graph of directories.

Mitigation: Specify a content root that isn't the FS/drive root (sketched above); a second option that avoids the config file watcher entirely is sketched below.

Issue 1: Why is it watching recursively? It seems to only care about two particular settings files.
Issue 2: Why did it choose to watch the root directory? (This may end up having a good answer.)
Issue 3: Someone is missing cycle detection code (FileSystemWatcher or our layer on top of it).
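
For completeness: if you don't depend on configuration reload at all, another (unverified) way to sidestep the watcher entirely is to re-add the JSON sources with reloadOnChange: false, since as far as I can tell the watcher only exists to service those two settings files:

using Microsoft.Extensions.Configuration;
using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.Hosting;

IHost host = Host.CreateDefaultBuilder(args)
    .ConfigureAppConfiguration((context, config) =>
    {
        // Replace the default JSON sources with non-reloading ones so the config
        // system never asks the PhysicalFileProvider for a change token and so,
        // as far as I can tell, never starts a FileSystemWatcher for them.
        // Note: this also drops the template's other default sources (user secrets,
        // command line); add those back if you rely on them.
        config.Sources.Clear();
        config.AddJsonFile("appsettings.json", optional: true, reloadOnChange: false)
              .AddJsonFile($"appsettings.{context.HostingEnvironment.EnvironmentName}.json",
                           optional: true, reloadOnChange: false)
              .AddEnvironmentVariables();
    })
    .ConfigureServices(services => services.AddHostedService<Worker>())
    .Build();

host.Run();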


parnasm commented Jun 15, 2023

One other question! I mentioned we also have a user agent running, which talks to the service via gRPC. The agent also uses Kestrel, and I want to make sure its content root is never set to "/", as it might be when running as root at the login window. I can use the same code in the agent as I have in the service to redirect the content root. However, as the agent is basically a Xamarin (Microsoft.macOS) project, I cannot seem to use dotnet-dump on it to verify that there's only the one FileSystemWatcher and that it's looking in the right place. The error message I get is "Process xxxx not running compatible .NET runtime." Is there another way to search managed memory for objects of a given type? I've tried some suggestions of things in the Immediate window, but they seem not to be supported on the Mac.


amcasey commented Jun 15, 2023

Unfortunately, I don't use VS for Mac often enough to have clear guidance for you. It sounds like it probably doesn't have heap exploration tools, but I don't know one way or the other.

However, given that you can modify the code, I'd probably start with the gcroots above to see who has references to the file watcher. If none of them are in scope, you can inject one of the DI components with indirect access and then follow references in the Locals/Watch window. Either that, or look in the service collection for something that would have a path to the file watcher.
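
To make that concrete, something along these lines (a rough sketch, assuming the agent uses the default file-based configuration sources) would log where each file-based provider is rooted without needing a dump:

using System;
using Microsoft.Extensions.Configuration;
using Microsoft.Extensions.FileProviders;

public static class ConfigDiagnostics
{
    // Inject IConfiguration anywhere in the agent and call this to see which
    // directory each file-based configuration provider is rooted at.
    public static void DumpConfigFileRoots(IConfiguration configuration)
    {
        if (configuration is not IConfigurationRoot root)
            return;

        foreach (var provider in root.Providers)
        {
            if (provider is FileConfigurationProvider fileProvider &&
                fileProvider.Source.FileProvider is PhysicalFileProvider physical)
            {
                Console.WriteLine($"{provider.GetType().Name}: {physical.Root}");
            }
        }
    }
}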

@amcasey amcasey added area-networking Includes servers, yarp, json patch, bedrock, websockets, http client factory, and http abstractions and removed area-runtime labels Dec 19, 2023
@amcasey amcasey removed their assignment Oct 25, 2024