OutOfMemoryException in Thread.StartInternal() on Unix with plenty of mem free #13062

Closed
ThomasHjorslevFcn opened this issue Jul 11, 2019 · 22 comments
Labels: area-System.Threading, question

@ThomasHjorslevFcn (Author)

I'm experiencing an ASP.NET Core application that crashes regularly (every 30-60 minutes) with System.OutOfMemoryException. It seems to happen only on the Unix platform.

The stack trace is always:

Unhandled Exception: System.Threading.Tasks.TaskSchedulerException: An exception was thrown by a TaskScheduler. ---> System.OutOfMemoryException: Exception of type 'System.OutOfMemoryException' was thrown.
   at System.Threading.Thread.StartInternal()
   at System.Threading.Tasks.Task.ScheduleAndStart(Boolean needsProtection)
   --- End of inner exception stack trace ---
   at System.Threading.Tasks.Task.ScheduleAndStart(Boolean needsProtection)
   at System.Threading.Tasks.Task.InternalStartNew(Task creatingTask, Delegate action, Object state, CancellationToken cancellationToken, TaskScheduler scheduler, TaskCreationOptions options, InternalTaskOptions internalOptions)
   at System.Runtime.IOThreadScheduler.ScheduleCallbackHelper(SendOrPostCallback callback, Object state)
   at System.Runtime.IOThreadScheduler.ScheduleCallbackNoFlow(SendOrPostCallback callback, Object state)
   at System.Runtime.CompilerServices.YieldAwaitable.YieldAwaiter.System.Runtime.CompilerServices.IStateMachineBoxAwareAwaiter.AwaitUnsafeOnCompleted(IAsyncStateMachineBox box)
   at System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1.AwaitUnsafeOnCompleted[TAwaiter,TStateMachine](TAwaiter& awaiter, TStateMachine& stateMachine)
--- End of stack trace from previous location where exception was thrown ---
   at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state)
--- End of stack trace from previous location where exception was thrown ---
   at System.Threading.ThreadPoolWorkQueue.Dispatch()
  • In Thread.CoreRT.cs:370 we can see that StartInternal() can throw OOM when CreateThread() returns false.
  • In Thread.CoreRT.Unix.cs:97 we can see that CreateThread() can only return false if RuntimeThread_CreateThread() does.
  • As far as I can see, that refers to pal_threading.cpp:170, which can return false if any of pthread_attr_init(&attrs), pthread_attr_setstacksize(&attrs, stackSize), or pthread_create(&threadId, &attrs, startAddress, parameter) returns a non-zero (non-SUCCESS) code.

Since the server is far from out of memory, I suspect some other limit has been reached, but since StartInternal() throws OOM no matter what non-success code is returned, it's hard to know what and why.

The application in question is running in a Docker environment with 16 GB of memory. Because of the crashes, I have added per-5-second logging of some key process figures. I never see Process.WorkingSet64 go above 2-3 GB.

Here is an example logged a few seconds before a crash (the fields are explained below, followed by a sketch of the logging code):

2019-07-10T18:30:16.741Z concurrency:   4 proc thread cnt:   29 avail worker threads: 32,766 avail compl port threads:  1,000 ws: 1,733,996,544
  • concurrency is a counter I increment/decrement on web service call start/end (it varies between a few and around 100)
  • proc thread cnt is the length of Process.GetCurrentProcess().Threads
  • avail worker threads and avail compl port threads are from ThreadPool.GetAvailableThreads()
  • ws is Process.GetCurrentProcess().WorkingSet64
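
For reference, a minimal sketch of the kind of per-5-second logging described above; the Timer, the getConcurrency delegate, and Console.WriteLine as the log sink are illustrative placeholders, not the actual application code:

using System;
using System.Diagnostics;
using System.Threading;

static class MetricsLogger
{
    // Logs a snapshot of process and thread-pool metrics every 5 seconds.
    // getConcurrency stands in for the app-level counter incremented/decremented
    // around each web service call.
    public static Timer StartMetricsLogging( Func<int> getConcurrency )
    {
        return new Timer( _ =>
        {
            using( var proc = Process.GetCurrentProcess( ) )
            {
                ThreadPool.GetAvailableThreads( out int workerThreads, out int completionPortThreads );
                Console.WriteLine(
                    $"{DateTime.UtcNow:o} concurrency: {getConcurrency( )} " +
                    $"proc thread cnt: {proc.Threads.Count} " +
                    $"avail worker threads: {workerThreads:N0} " +
                    $"avail compl port threads: {completionPortThreads:N0} " +
                    $"ws: {proc.WorkingSet64:N0}" );
            }
        }, null, TimeSpan.Zero, TimeSpan.FromSeconds( 5 ) );
    }
}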

I have been able to reproduce it in our test environment a few times under synthetic load, but only after about an hour of heavy concurrent load.

Some facts that may be relevant:

  • The base docker image is microsoft/dotnet:2.2-aspnetcore-runtime
  • The app is running on AWS ECS using FARGATE (managed Docker host)
  • Environment.Is64BitOperatingSystem is true
  • Environment.Is64BitProcess is true
  • IntPtrSize is 8
  • uname -a returns Linux ip-172-xxx-xxx-xxx.eu-central-1.compute.internal 4.14.123-111.109.amzn2.x86_64 #1 SMP Mon Jun 10 19:37:57 UTC 2019 x86_64 GNU/Linux
  • The primary operation of the service is to serve binary (PDF) content, which comes from a MongoDB database and/or an HTTP or FTP fetch via WebRequest.
  • All API actions are marked async, and all MongoDB access and HTTP/FTP requests make full use of async/await.

@kevingosse (Contributor) commented Jul 11, 2019

System.Threading.Thread.StartInternal is called from System.Threading.Tasks.Task.ScheduleAndStart. That makes me think the IOThreadScheduler (where is that coming from?) is using the LongRunning flag when scheduling the task. In that case, it will start a new dedicated thread, and you won't see it when querying ThreadPool.GetAvailableThreads. Instead you should try logging the value of Process.GetCurrentProcess().Threads.
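
For illustration (not the WCF code itself), a minimal sketch of the difference: with TaskCreationOptions.LongRunning the default scheduler starts a dedicated thread via Thread.Start rather than queuing to the thread pool, which is why such threads don't show up in ThreadPool.GetAvailableThreads but do count toward Process.GetCurrentProcess().Threads:

using System;
using System.Threading;
using System.Threading.Tasks;

class LongRunningDemo
{
    static void Main( )
    {
        // Regular task: queued to the thread pool.
        Task.Factory.StartNew( ( ) =>
            Console.WriteLine( $"pool? {Thread.CurrentThread.IsThreadPoolThread}" ) );

        // LongRunning task: the default scheduler creates a dedicated (non-pool)
        // thread and starts it, which is the Thread.StartInternal path that can fail.
        Task.Factory.StartNew( ( ) =>
            Console.WriteLine( $"pool? {Thread.CurrentThread.IsThreadPoolThread}" ),
            TaskCreationOptions.LongRunning );

        Thread.Sleep( 1000 ); // crude wait so both callbacks get a chance to run
    }
}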

@kevingosse (Contributor) commented Jul 11, 2019

Assuming you're using https://github.com/dotnet/wcf, it confirms that the IOThreadScheduler uses the LongRunning flag:

https://github.com/dotnet/wcf/blob/master/src/System.Private.ServiceModel/src/Internals/System/Runtime/IOThreadScheduler.cs#L167

It's very likely that those threads pile up for some reason until you reach a system limit.

@mattwarren (Contributor)

  • In Thread.CoreRT.cs:370 we can see that StartInternal() can throw OOM when CreateThread() returns false.
  • In Thread.CoreRT.Unix.cs:97 we can see that CreateThread() can only return false if RuntimeThread_CreateThread() does.
    ...

I don't know if it makes a difference to this issue or not, but you linked/referred to the CoreRT source code; did you mean to refer to the CoreCLR source instead, e.g. Thread.CoreCLR.cs?

Or did you mean to open this issue on the CoreRT repo?

@ThomasHjorslevFcn (Author) commented Jul 11, 2019

Instead you should try logging the value of Process.GetCurrentProcess().Threads.

I am; in the case above, the length of Process.GetCurrentProcess().Threads was 29 a few seconds before the crash.

@ThomasHjorslevFcn (Author)

I don't know if it makes a difference to this issue or not, but you linked/referred to the CoreRT source code; did you mean to refer to the CoreCLR source instead, e.g. Thread.CoreCLR.cs?

Or did you mean to open this issue on the CoreRT repo?

I'm embarrassed to say that I didn't notice; sorry about that. I'm not using CoreRT (or at least I don't think so? The application is started using dotnet AppName.dll, which means I'm using CoreCLR, right?).

The CoreCLR repo does not seem to contain the source code of Thread.StartInternal(), so I'm not really sure how to dig deeper.

However, I still think, as kevingosse says, that I'm likely hitting some limit other than free system memory.

My reason for posting here is that I think it could be a bug, or at least misleading, that an OutOfMemoryException is thrown.

@ThomasHjorslevFcn (Author)

Assuming you're using https://github.com/dotnet/wcf, it confirms that the IOThreadScheduler uses the LongRunning flag:

I am actually calling a WCF service once every minute to maintain an internal lookup table of user id => name mappings.
Because of some legacy wrapping code, I'm calling it like this:

var users = await Task.Run( ( ) => client.Proxy.ListUsers( new ListUsersInput { } ) );

Where client is wrapping a ClientBase<T> instance.

I may try rewriting this as a REST service to remove WCF from the project entirely, or remove the wrapping code so I can call ListUsersAsync(...) instead.
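
For example, removing the wrapper would presumably reduce the call to a plain await (ListUsersAsync and the proxy shape here are assumptions about the generated WCF client, not verified code):

// Hypothetical: assumes the generated proxy exposes a Task-returning ListUsersAsync operation.
var users = await client.Proxy.ListUsersAsync( new ListUsersInput { } );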

https://github.com/dotnet/wcf/blob/master/src/System.Private.ServiceModel/src/Internals/System/Runtime/IOThreadScheduler.cs#L167

It's very likely that those threads pile up for some reason until you reach a system limit.

The length of Process.GetCurrentProcess().Threads is very stable, between 27 and 30, throughout the life of the process. I'm not sure what else I can monitor.

I'm considering regularly logging the contents of some of the "files" under the /proc filesystem, to see whether some values change over time.
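
One inexpensive value to log alongside that would be the open file-descriptor count, e.g. by counting the entries under /proc/self/fd (a Linux-only sketch; the helper name is made up):

using System.IO;
using System.Linq;

static class FdCounter
{
    // Each entry in /proc/self/fd is a symlink for one open descriptor
    // (file, socket, pipe, ...). The count includes the descriptor opened
    // to enumerate the directory itself, so treat it as approximate.
    public static int CountOpenFileDescriptors( ) =>
        Directory.EnumerateFileSystemEntries( "/proc/self/fd" ).Count( );
}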

@ThomasHjorslevFcn (Author)

If I've got it right, that means the OutOfMemoryException is then thrown from here:

https://github.com/dotnet/coreclr/blob/9bd2787a9dd2aa4d2b7d4f72afebc3dbe896e896/src/vm/comsynchronizable.cpp#L465-L473

Digging even deeper, we end up in Thread::CreateNewThread():

https://github.com/dotnet/coreclr/blob/9bd2787a9dd2aa4d2b7d4f72afebc3dbe896e896/src/vm/threads.cpp#L2007-L2039

And then into Thread::CreateNewOSThread():

https://github.com/dotnet/coreclr/blob/9bd2787a9dd2aa4d2b7d4f72afebc3dbe896e896/src/vm/threads.cpp#L2207-L2296

Here I'm starting to lose the thread (pun intended); either AllocHandles() or PAL_CreateThread64() could fail here.

At some point, I think, we end up in CorUnix::InternalCreateThread:

https://github.com/dotnet/coreclr/blob/835836c9d34409af0f31529201dfd57cb2bd053c/src/pal/src/thread/thread.cpp#L596

It's late here; I'll have to dig into that tomorrow :-)

In any case, I am beginning to see why it's hard to report, all the way up to the application code, exactly which resource is exhausted :-)

@janvorli (Member)

Is it possible that your system has a limit imposed on the amount of virtual memory space it is allowed to use? You can use ulimit -v to check. That could explain OOMs when the working set is much smaller than the available memory.

@ThomasHjorslevFcn (Author)

Is it possible that your system has a limit imposed on the amount of virtual memory space it is allowed to use? You can use ulimit -v to check. That could explain OOMs when the working set is much smaller than the available memory.

It doesn't seem so; this is the output of ulimit -a:

> /bin/bash -c "ulimit -a"
core file size          (blocks, -c) unlimited
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 29781
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 4096
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) unlimited
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

@janvorli (Member)

@ThomasHjorslevFcn strace output from the run could show us where the issue stems from. Would you be able to capture and share it? Please note that it will be really huge after 30..60 minutes of running.
You can get it using strace -f -o logfile.txt your_command_line where your_command_line is the command line you use to launch the app.
I would like to see the last, say, 1000 lines from the log; you could share them using a gist.

@ThomasHjorslevFcn (Author)

@ThomasHjorslevFcn strace output from the run could show us where the issue stems from. Would you be able to capture and share it? Please note that it will be really huge after 30..60 minutes of running.
You can get it using strace -f -o logfile.txt your_command_line where your_command_line is the command line you use to launch the app.

The ECS Docker instances and attached storage are de-allocated as soon as the instance crashes.
But maybe I could try booting up a regular Linux instance with Docker, running the app there, and seeing whether I can reproduce it.

@ThomasHjorslevFcn (Author) commented Jul 15, 2019

We don't have a Linux server with the 30-40 GB of free storage required, so I'm currently running it like this on our Linux build server, hoping to see it crash:

>  (strace -f  dotnet MyApp.dll  --urls=http://172.31.43.121:5000 > output2.log ) 2>&1 | tail -n 5000 > output.log

In other news, I noticed that our logging system often logs this error shortly before a crash:

System.AggregateException: One or more errors occurred. (Resource temporarily unavailable) ---> System.Net.Http.HttpRequestException: Resource temporarily unavailable ---> System.Net.Sockets.SocketException: Resource temporarily unavailable
   at System.Net.Http.ConnectHelper.ConnectAsync(String host, Int32 port, CancellationToken cancellationToken)
   --- End of inner exception stack trace ---
   at System.Net.Http.ConnectHelper.ConnectAsync(String host, Int32 port, CancellationToken cancellationToken)
   at System.Threading.Tasks.ValueTask`1.get_Result()
   at System.Net.Http.HttpConnectionPool.CreateConnectionAsync(HttpRequestMessage request, CancellationToken cancellationToken)
   at System.Threading.Tasks.ValueTask`1.get_Result()
   at System.Net.Http.HttpConnectionPool.WaitForCreatedConnectionAsync(ValueTask`1 creationTask)
   at System.Threading.Tasks.ValueTask`1.get_Result()
   at System.Net.Http.HttpConnectionPool.SendWithRetryAsync(HttpRequestMessage request, Boolean doRequestAuth, CancellationToken cancellationToken)
   at System.Net.Http.RedirectHandler.SendAsync(HttpRequestMessage request, CancellationToken cancellationToken)
   at System.Net.Http.HttpClient.FinishSendAsyncBuffered(Task`1 sendTask, HttpRequestMessage request, CancellationTokenSource cts, Boolean disposeCts)
   --- End of inner exception stack trace ---
   at System.Threading.Tasks.Task`1.GetResultCore(Boolean waitCompletionNotification)
   at FundConnect.Logging.RestLogListener.PostAsJson[T](HttpClient httpClient, String url, T data)
   at FundConnect.Logging.RestLogListener.WriteEntries(List`1 entries)
   at FundConnect.Logging.BufferedLogListener.WriteEntries()

BufferedLogListener.WriteEntries() is called every second from a System.Threading.Timer and is protected by if (!Monitor.TryEnter(_writeLock, 0)) return;
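
In other words, the timer callback is guarded roughly like this (a sketch; the method and field names are taken from the description above and may not match the real code exactly):

private readonly object _writeLock = new object( );

// Called every second by a System.Threading.Timer. Timer callbacks can overlap,
// so skip this tick entirely if the previous WriteEntries call is still running.
private void OnTimerTick( object state )
{
    if( !Monitor.TryEnter( _writeLock, 0 ) )
        return;
    try
    {
        WriteEntries( );
    }
    finally
    {
        Monitor.Exit( _writeLock );
    }
}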

Sometimes the exception message is this instead:

System.AggregateException: One or more errors occurred. (Unknown error -1) ---> System.Net.Http.HttpRequestException: Unknown error -1 ---> System.Net.Sockets.SocketException: Unknown error -1

@ThomasHjorslevFcn (Author) commented Jul 15, 2019

It's still running, but now that it's on a plain Linux server rather than in Docker, I have more opportunity to examine it.
It seems that the application is leaking "sockets".

$ lsof -p 15613 | grep " sock " | wc -l
3040

The number 3040 is rising steadily and roughly corresponds to the number of requests received so far by the application (3288). I'm not sure whether the last part is a coincidence.

$ lsof -p 15613 | grep " sock " | more
dotnet  15613 ec2-user  215u     sock                0,8      0t0  4968805 protocol: TCP
dotnet  15613 ec2-user  219u     sock                0,8      0t0  4968844 protocol: TCP
dotnet  15613 ec2-user  220u     sock                0,8      0t0  4968236 protocol: TCP
dotnet  15613 ec2-user  221u     sock                0,8      0t0  4968247 protocol: TCP
dotnet  15613 ec2-user  222u     sock                0,8      0t0  4969038 protocol: TCP
dotnet  15613 ec2-user  225u     sock                0,8      0t0  4969044 protocol: TCP
dotnet  15613 ec2-user  226u     sock                0,8      0t0  4968420 protocol: TCP
dotnet  15613 ec2-user  286u     sock                0,8      0t0  5002368 protocol: TCP
dotnet  15613 ec2-user  287u     sock                0,8      0t0  4985546 protocol: TCP
dotnet  15613 ec2-user  288u     sock                0,8      0t0  4968291 protocol: TCP
dotnet  15613 ec2-user  289u     sock                0,8      0t0  4987177 protocol: TCP
dotnet  15613 ec2-user  290u     sock                0,8      0t0  4968321 protocol: TCP
dotnet  15613 ec2-user  293u     sock                0,8      0t0  4970561 protocol: TCP
... 3040 lines and counting ...

I'm guessing that at some point it runs up against some limit, like max_map_count.
Now I just have to figure out why it's leaking sockets...

@ThomasHjorslevFcn (Author)

I was able to provoke the OOM by executing

$ sudo sysctl -w vm.max_map_count=3700

This confirms that max_map_count was the limit being hit.
Here are the last 5000 lines from strace

@danmoseley (Member)

The sockets leak part of this issue should probably move to corefx. @davidsh @stephentoub

@janvorli (Member)

@ThomasHjorslevFcn the trace log confirms that the OOM exception stems from this:

[pid 29554] fcntl(4094, F_DUPFD_CLOEXEC, 0 <unfinished ...>
[pid 29605] <... sched_yield resumed> ) = 0
[pid 29554] <... fcntl resumed> )       = -1 EMFILE (Too many open files)

@ThomasHjorslevFcn (Author)

After finding a way to monitor and reproduce, I think I have isolated the problem to our home-grown logging system, specifically the one that ships logs every second to an HTTPS endpoint. After disabling it, lsof -p 30737 | grep " sock " | wc -l returns 0 and lsof -p 30737 | wc -l remains stable under load.
The IO part of the code is this:

public static HttpResponseMessage PostAsJson<T>( HttpClient httpClient, string url, T data )
{
    var dataAsString = JsonConvert.SerializeObject( data );
    var content = new StringContent( dataAsString );
    content.Headers.ContentType = new MediaTypeHeaderValue( "application/json" );
    return httpClient.PostAsync( url, content ).Result;
}

I'm guessing the problem is using .Result here?
Or maybe the issue is that content is not being Disposed?
I dismissed the logging system earlier because we have been using it in other production systems for a while, but I'm guessing they log far less, so it takes much longer for the problem to show up.
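
For comparison, a sketch of what an async, disposing variant could look like (assuming callers can be made async; this changes the signature to return Task and checks the status inline instead of handing the response back):

using System.Net.Http;
using System.Text;
using System.Threading.Tasks;
using Newtonsoft.Json;

public static async Task PostAsJsonAsync<T>( HttpClient httpClient, string url, T data )
{
    var dataAsString = JsonConvert.SerializeObject( data );

    // Both StringContent and the response are IDisposable; disposing them promptly
    // releases the underlying buffers and lets the connection return to the pool.
    using( var content = new StringContent( dataAsString, Encoding.UTF8, "application/json" ) )
    using( var response = await httpClient.PostAsync( url, content ) )
    {
        response.EnsureSuccessStatusCode( );
    }
}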

@stephentoub (Member)

@ThomasHjorslevFcn, what does the caller of PostAsJson look like? Is it creating a new HttpClient for every request and then not disposing of it? What is it doing with the HttpResponseMessage it gets back?

@ThomasHjorslevFcn (Author)

Ok, this is embarrassing.

protected override void WriteEntries( List<LogEntry> entries )
{
    var client = new HttpClient( );
    client.DefaultRequestHeaders.Accept.Clear( );
    client.DefaultRequestHeaders.Accept.Add(
        new MediaTypeWithQualityHeaderValue( "application/json" ) );

    foreach( var entries2 in Batch( entries, 100 ) )
    {
        PostAsJson( client, EndpointUrl, entries2 );
    }
}

This library was recently ported to .NET Standard, and I think I even code reviewed this part :-(

So it seems it all boils down to a missing using/Dispose(). I'm sorry for wasting your time here!
I got thrown off by the OOM and where it happened. It would be nice if the OOM could be more specific, but I don't know whether the OS provides sufficient information for that.

@stephentoub (Member) commented Jul 15, 2019

So it seems it all boils down to a missing using/Dispose()

It should definitely be disposed... or better yet, rather than disposing it, storing it as a singleton that all requests use (and then definitely not disposing it :). As currently written, each WriteEntries call is going to need to establish a new connection to the server, since connection pooling is done at the level of the HttpClient (or, more specifically, at the level of the handler it wraps and instantiates in its default ctor if one isn't explicitly provided). If the client/handler were shared across WriteEntries calls, then the connection pooling would then enable connections to be reused as well. By not disposing of the handler, each WriteEntries call is creating a new connection and then stranding it in the HttpClient's connection pool. It'll eventually be cleaned up, but it could take some non-trivial period of time to happen.
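
A minimal sketch of the singleton shape being suggested (the class and field names here are placeholders, not actual library code):

using System.Net.Http;
using System.Net.Http.Headers;

internal static class LoggingHttpClient
{
    // One HttpClient for the lifetime of the process; its handler owns the
    // connection pool, so posts to the same endpoint reuse connections instead
    // of opening (and stranding) a new one on every WriteEntries call.
    public static readonly HttpClient Instance = CreateClient( );

    private static HttpClient CreateClient( )
    {
        var client = new HttpClient( );
        client.DefaultRequestHeaders.Accept.Add(
            new MediaTypeWithQualityHeaderValue( "application/json" ) );
        return client;
    }
}

WriteEntries would then call PostAsJson( LoggingHttpClient.Instance, EndpointUrl, entries2 ) instead of constructing a new client per call.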

Note that there was a related bug here, which could be impacting you further: https://github.com/dotnet/corefx/issues/37044. We rely on Socket's finalizer to close the connection in a case like this, but for various reasons that wasn't working. It was fixed for 3.0 in dotnet/corefx#38499.

@ThomasHjorslevFcn (Author)

With the HttpClient now converted to a singleton, the application has been running for 24 hours without crashing. Thanks, everyone!
I'm not sure whether I should close this issue. While the problem was entirely our own fault, improving the information in the OutOfMemoryException might still be relevant?

@msftgits transferred this issue from dotnet/coreclr Jan 31, 2020
@msftgits added this to the 5.0 milestone Jan 31, 2020
@ghost locked as resolved and limited conversation to collaborators Dec 12, 2020