Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Dns.GetHostAddressesAsync is blocking a ThreadPool thread #17224

Closed
JeffCyr opened this issue May 7, 2016 · 25 comments
Closed

Dns.GetHostAddressesAsync is blocking a ThreadPool thread #17224

JeffCyr opened this issue May 7, 2016 · 25 comments
Assignees
Labels
area-System.Net.Sockets bug help wanted [up-for-grabs] Good issue for external contributors
Milestone

Comments

@JeffCyr
Copy link
Contributor

JeffCyr commented May 7, 2016

The implementation of Dns.GetHostAddressesAsync is just a wrapper around the blocking implementation and is queued on the ThreadPool.

https://github.com/dotnet/corefx/blob/1baa9ad06a466a5d6399e1e0a7a7a02564ab51b0/src/System.Net.NameResolution/src/System/Net/DNS.cs#L274

This can lead to a ThreadPool starvation when there are many concurrent calls of GetHostAddressAsync in an environment where the DNS is slow.

The code in .Net 4.6 is slightly different but suffers from the same issue:
http://referencesource.microsoft.com/#System/net/System/Net/DNS.cs,744

Windows and Linux both provides real asynchronous calls to resolve a DNS. The asynchronous implementation could use these.

@davidsh
Copy link
Contributor

davidsh commented May 7, 2016

cc: @CIPop

@roji
Copy link
Member

roji commented May 8, 2016

@JeffCyr, you're right that if Windows and Linux provide real async DNS options, these should be used to implement Dns.GetHostAddressesAsync.

However, how would that fix thread pool starvation? A real async API would also execute its callback in a threadpool thread, leading to starvation in cases where there are many concurrent calls, slow DNS and a pool without many threads (e.g. ramp-up time). In other words, the threadpool starvation issue seems like an inherent issue with any async call...

@GSPP
Copy link

GSPP commented May 8, 2016

The callbacks have almost no work to do. They basically need to execute the user-provided callback and update some state (maybe complete a task). Whereas the IO might run for 100ms the callbacks run for 0.001ms and are purely CPU work.

@roji
Copy link
Member

roji commented May 8, 2016

@GSPP how do you know that the callbacks have almost no work to do? The user-provided callback my be anything at all, including some lengthy operation, computational or otherwise?

@GSPP
Copy link

GSPP commented May 8, 2016

That's not the frameworks problem. If users want non-blocking IO they should not execute Sleep(1000) in their callbacks. The framework enables non-blocking IO for callers where that was not possible before.

Even if users block they will at least save the lengthy DNS IO (here assumed to be 100ms). Alleviating thread pool starvation is not about solving all blocking. You can make do with solving the IO that has the highest (frequency times latency) product. That's the code that actually uses the pool just waiting there.

@svick
Copy link
Contributor

svick commented May 8, 2016

@roji I don't understand why you think proper async calls make thread pool starvation a problem, where it's not a problem for sync calls.

If you make those calls from thread pool threads, then using async clearly decreases thread pool pressure.

If you make those calls from many non-thread pool threads, then you can (and probably should) use TaskScheduler or SynchronizationContext to execute your long callbacks back on those threads and not on the thread pool. This still leaves library callbacks on the thread pool, but those are going to be very fast. So this does increase thread pool pressure, but only by a small amount.

@roji
Copy link
Member

roji commented May 8, 2016

I never anything here was the framework's problem... Rereading the original post I agree that the current framework implementation of shoving a blocking I/O operation to the pool is highly problematic.

@GSPP and @svick I admit I'm continuing the discussion in #15963 - this isn't really relevant to this issue. @svick, the point is a sync blocking operation has the (small!) advantage of not relying on the thread pool in any way - the resource needed (the thread) is "reserved" during the I/O operation and will be there in a 100% reliable way one the operation completes to continue processing. With async I/O, on the other hand, we're not sure what the TP state will be when the I/O operation completes, and cases of load coupled with a currently small TP could lead to starvation etc.

@svick I agree there are ways to deal with this, i.e. schedule long operations in your own threads etc. but this is more complicated and requires a good understanding of what's running where. All I'm saying is that async programming requires a lot more attention from the programmer (compared to sync I/O) and is not always justified by the scalability it brings to the table. My problem is that the current approach seems to be to eliminate sync I/O operations as if to say "always use async no matter what".

But this is me hijacking this thread, better to continue this in #15963.

@benaadams
Copy link
Member

benaadams commented May 8, 2016

Assuming you are not blocking in the threadpool threads (e.g. paused thread not executing anything) and only calling other async methods for pauses, then it should be fairly hard to starve it without maxing CPU before hand.

However if you are going to block in the continuation then you can fire it off to a thread:

await Task.Factory.StartNew( ... , TaskCreationOptions.LongRunning);

From https://github.com/dotnet/corefx/issues/5044#issuecomment-212883399

I'm the maintainer of Npgsql, the PostgreSQL driver for .NET. We've frequently received complaints that under load, and frequently at program startup, DNS queries performed by Npgsql time out. Now, the current stable Npgsql does DNS resolution with Dns.{Begin/End}GetHostAddresses. What seems to be happening, is that when the program starts the thread pool has relatively few threads. If a large amount of requests come in and a lot of async operations are in progress, the thread pool starts allocating new threads - but this process takes time; and during these time it would seem that the operation times out.

@roji the dns issue you have have encountered is as a result of you doing the right thing and the framework doing the wrong thing, which is unfortunate and will undoubtedly give a poor impression of using async methods 😞

@roji
Copy link
Member

roji commented May 8, 2016

@benaadams you're probably right about this specific case. My more general point (in #15963) is that I don't necessarily want to possible TP starvation state to determine when my callback runs (regardless of what it's going to do).

@benaadams
Copy link
Member

benaadams commented May 8, 2016

@roji this is a slightly trolly response, so apologies but you could add to Main

#if net451
    int workerThreads, ioThreads;
    ThreadPool.GetMaxThreads(out workerThreads, out ioThreads);
    ThreadPool.SetMinThreads(workerThreads, ioThreads);
#endif 

for dnx based coreclr set the environment variable

ComPlus_ThreadPool_ForceMaxWorkerThreads=10000

and for dotnet cli add to [yourappname].runtimeconfig.json

{
    "runtimeOptions": {
        "configProperties": {
            "System.GC.Server": true,
            "System.GC.Concurrent": true,
            "System.Threading.ThreadPool.MinThreads": 10000,
            "System.Threading.ThreadPool.MaxThreads": 10000
        },
    }
}

It will ramp up the thread count really really fast when any are blocked

@roji
Copy link
Member

roji commented May 8, 2016

@benaadams that's not trolly at all :) I've recommended this in some cases. But having to do this kind of thing does demonstrate this one problematic aspect of async.

Again, I'm not saying async is bad or problematic in itself, just that there are situations where sync might make more sense. That's all. In other words, don't remove sync APIs.

@CIPop
Copy link
Member

CIPop commented May 9, 2016

@JeffCyr Thanks for reporting this. I agree that we should provide true async wrappers where the OS provides the support for such implementations.

Unfortunately there are probably other APIs in the same situation so we should probably revisit the implementations and search for other cases.

@roji @benaadams While I like the brainstorming regarding sync vs async, I think it would be more useful to have it all in the original thread. Otherwise some of the good ideas and concerns will be lost in multiple places. I'll try to answer on the original thread.

@roji
Copy link
Member

roji commented May 9, 2016

@CIPop you're absolutely right, sorry for hijacking this. For the record I totally agree that the framework should provide truly async DNS where that's possible, regardless of the discussion in #15963. Will be looking forward to see your answer there.

@cnblogs-dudu
Copy link
Contributor

Suffer by this issue. See dotnet/coreclr#8383

@JeffCyr
Copy link
Contributor Author

JeffCyr commented Nov 27, 2017

I investigated the fix for this issue.

Windows

Windows 7

There is no windows API to make async dns resolving.

Windows 8 & up

There is GetAddrInfoEx and DnsQueryEx. These methods can be used for async dns resolving, but the API is a bit complex, an implementation in System.Native might be more robust than straight PInvoke.

Linux

There is getaddrinfo_a, but it's implemented by calling the synchronous getaddrinfo on another thread. c-ares seems to be a popular alternative.

Proposal

The main issue with the current implementation is that blocking calls are queued on the default ThreadPool and this cause starvation. A simple fix could be to implement a custom thread pool specialized for blocking operations.

This specialized ThreadPool could have these properties:

  • Only one permanent thread in the pool
  • New threads are added to the pool immediately if no threads are available when a workitem is enqueued
  • Except the permanent thread, other threads will close if they are unused for 1 second
  • The pool is bounded (e.g. max 64 thread)
  • Threads are started with a small stack (e.g. 128KB)

A custom TaskScheduler could wrap this ThreadPool and minimal code changes would be required to fix this issue. We would just need to replace the TaskScheduler.Default usage in DNS.cs.

This solution is less efficient than true async operations, but it might be good enough. A specialized external dns library could be implemented outside of the framework for the small percentage of applications that needs massive dns query concurrency.

Thoughts?

@benaadams
Copy link
Member

benaadams commented Nov 27, 2017

Could formalize it?

// Queues to secondary thread-pool so as not to cause starvation in main pool by holding threads
Task Task.RunBlockingAsAsync(Action action);
Task Task.RunBlockingAsAsync<TState>(Action<TState> action, TState state);
Task<T> Task.RunBlockingAsAsync<T>(Func<T> func)
Task<T> Task.RunBlockingAsAsync<TState, T>(Func<TState, T> func, TState state)

@stephentoub
Copy link
Member

The main issue with the current implementation is that blocking calls are queued on the default ThreadPool and this cause starvation

The right answer to address that is to a) queue less things that do I/O blocking and/or have them do less I/O blocking, and b) for what remains, help the ThreadPool better deal with blocking workloads, e.g. similar to how pools based on I/O completion ports are able to better adjust to blocking when blocking actually occurs.

A simple fix could be to implement a custom thread pool

This is a very complicated road. If the pool has a fixed upper bound, you have an even greater risk of starvation than you do today. If it's unbounded, such policies could result in potentially huge thread growth that would cause significant problems for the process and potentially the machine.

@stephentoub
Copy link
Member

ps There's already support for using the ThreadPool to ammortize the cost of blocking on something: ThreadPool.RegisterWaitForSingleObject. It uses as few threads as possible to wait for as many objects as are registered, and queues a work item to the pool when the object is signaled.

@JeffCyr
Copy link
Contributor Author

JeffCyr commented Nov 27, 2017

@stephentoub
I agree this isn't a general solution for background blocking I/O, like you said it's a complicated road.

But the current implementation of Dns.GetHostEntryAsync is quite disastrous for a high-performance app that rely on the ThreadPool. Any usage of Dns.GetHostEntryAsync or even Socket.BeginConnect(string host, int port, AsyncCallback requestCallback, object state) will queue a blocking I/O workload on the ThreadPool. At best this will add random [0-500]ms latency to the thread pool, at worst it will completely starve it.

The ThreadPool max thread is bounded at 32767, but thread spawning is throttled at 1 per second, so in practice the bound will be much lower than 32767.

There is only a short/mid-term fix for Windows 8+ because other platform don't have proper async dns resolving built-in. Even with Windows 8+, GetAddrInfoEx and DnsQueryEx requires complex native interop with managed callback.

Having a separate ThreadPool exlusively used by DNS.cs (or potentially other internal stuff) seems to be a quick win.

We can set lower stack size to reduce resources, because we know they threads will only use a small stack. We can set a sensible upper bound to not be worst than the current implementation. With current impl, if there were 100 concurrent async dns resolve, it would have taken more than 1 minute to create all the threads, so it's already impracticable.

@stephentoub
Copy link
Member

With current impl, if there were 100 concurrent async dns resolve, it would have taken more than 1 minute to create all the threads, so it's already impracticable.

You're talking about at the very beginning of a process, before the thread pool has ramped up. Let's say we set the limit on this special pool you're talking about. Pick a number, say 100. Now a few minutes in to a fully loaded server, the thread pool is likely already at more than that, say 200 threads. At that point using this custom pool is now actually slowing things down.

I agree there's an issue here to be addressed, but I don't believe creating a dedicated thread pool is as quick a win as you suggest. We would need real data from real workloads to demonstrate that it actually is a net win.

but thread spawning is throttled at 1 per second

That's the starvation mechanism. There's also the unrelated hill climbing mechanism.
cc: @kouvel

There is only a short/mid-term fix for Windows 8+ because other platform don't have proper async dns resolving built-in. Even with Windows 8+, GetAddrInfoEx and DnsQueryEx requires complex native interop with managed callback.

Complex interop isn't a problem. We do it all the time, especially in these networking libraries. I'm also not clear on why you say it's a short/mid-term fix for Win8+... it seems like it's the fix for Win8+.

@JeffCyr
Copy link
Contributor Author

JeffCyr commented Nov 28, 2017

I'm also not clear on why you say it's a short/mid-term fix for Win8+... it seems like it's the fix for Win8+.

I mispoke there, I meant the fix can be done in the short/mid term, this is indeed the long term fix.

Now a few minutes in to a fully loaded server, the thread pool is likely already at more than that, say 200 threads. At that point using this custom pool is now actually slowing things down.

This special thread pool would be implemented to shrink back quickly to one thread if there is no request. The .Net ThreadPool shrinks too but it takes minutes to readjusts.

Even with the optimal Win8+ fix, there still need a fallback solution for unsupported platforms.

This issue has been opened for a while and it's still flagged as "future", I hoped to make some progress on it. What is the next step to make this happen? Are you more comfortable with doing only the Win8+ fix now and leave the current implementation as fallback?

@stephentoub
Copy link
Member

What is the next step to make this happen? Are you more comfortable with doing only the Win8+ fix now and leave the current implementation as fallback?

Yes, using the better platform-provided API on platforms where it's available is a good thing to do.

@stephentoub
Copy link
Member

@JeffCyr, would you want to submit a PR for using the async APIs on Windows when they're available?

@JeffCyr
Copy link
Contributor Author

JeffCyr commented Feb 5, 2018

@stephentoub I created the PR, waiting for review

@JeffCyr
Copy link
Contributor Author

JeffCyr commented Feb 12, 2018

Fixed on Windows by PR dotnet/corefx#26850

@JeffCyr JeffCyr closed this as completed Feb 12, 2018
@msftgits msftgits transferred this issue from dotnet/corefx Jan 31, 2020
@msftgits msftgits added this to the 2.1.0 milestone Jan 31, 2020
@dotnet dotnet locked as resolved and limited conversation to collaborators Dec 31, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-System.Net.Sockets bug help wanted [up-for-grabs] Good issue for external contributors
Projects
None yet
Development

No branches or pull requests