
ParallelOptions.MaxDegreeOfParallelism documentation, is the given advice correct? #72981

Closed
theodorzoulias opened this issue Jul 28, 2022 · 15 comments

@theodorzoulias
Contributor

Hi! The documentation of the property ParallelOptions.MaxDegreeOfParallelism gives this advice in the Remarks section:

Generally, you do not need to modify this setting.

Then it goes on to describe three specific scenarios where this general advice should not be followed:

  • When you know that a particular algorithm you're using won't scale beyond a certain number of cores.
  • When you're running multiple algorithms concurrently and want to manually define how much of the system each algorithm can utilize.
  • When the thread pool's heuristics is unable to determine the right number of threads to use and could end up injecting too many threads.

My question is: Does this general advice still stand in the modern era of .NET applications, which are increasingly async-enabled and concurrent? Or is it a relic from the early days of Task-based parallelism (around 2010, if I am not mistaken) that might need to be revised?

I am asking because, personally, I have been advising people for quite a long time against relying on the default, and instead to specify the MaxDegreeOfParallelism explicitly in all cases (recent example). Here is my reasoning for giving this advice:

  1. Nowadays the exceptions are more common than the rule.
  2. Parallel APIs that were introduced after the initial release of the TPL library, namely PLINQ and Parallel.ForEachAsync, use Environment.ProcessorCount as the default, not unlimited (-1).
  3. It's hard to go wrong with MaxDegreeOfParallelism = Environment.ProcessorCount as the general advice.
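
For concreteness, the pattern I am recommending (item 3) looks like this — a minimal sketch, where source and Process are placeholders:

ParallelOptions options = new() { MaxDegreeOfParallelism = Environment.ProcessorCount };
Parallel.ForEach(source, options, item => Process(item));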

Nevertheless, not everyone agrees with me. There are experienced .NET professionals (including MVPs) who think that it's wiser to let the Parallel.For/Parallel.ForEach/Parallel.Invoke methods adjust the number of worker tasks based on load, using their internal heuristics, and that meddling with the MaxDegreeOfParallelism option is more likely to cause harm than good.

So I thought that asking for Microsoft's opinion on this subject couldn't be a bad idea. Which is closer to the truth?

  1. The general advice offered in the documentation is as valid as ever.
  2. The general advice offered in the documentation is not as valid as it was 12 years ago, but still holds.
  3. The general advice offered in the documentation is outdated.

Thanks!

@danmoseley
Member

@stephentoub

@stephentoub
Member

Nowadays the exceptions are more common than the rule

Even if that's true, that doesn't make the guidance incorrect.

Parallel APIs that were introduced after the initial release of the TPL library, namely PLINQ and Parallel.ForEachAsync

PLINQ was introduced at the same time, not later, and in general it requires a fixed number of workers because of operations that need to synchronize and exchange with each other. It's not overriding the degree because doing so is better for scale, but rather because doing so is necessary for its algorithms.

Similarly, ForEachAsync has to use an explicit value because, by its nature, it's generally not throttled based on CPU; if it had an unlimited degree by default, it would very frequently end up processing every input element concurrently. We could have kept the unlimited default, but then we would have had to come up with another heuristic and mechanism for automatic throttling.

MaxDegreeOfParallelism = Environment.ProcessorCount as the general advice

That'd be like saying it'd be hard to go wrong by setting the thread pool's min and max to ProcessorCount by default. And that's not the case.
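
Spelled out, the analogy is the following (illustrative only, not a recommendation):

int n = Environment.ProcessorCount;
ThreadPool.SetMinThreads(n, n); // pin the floor to the CPU count...
ThreadPool.SetMaxThreads(n, n); // ...and the ceiling too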

The general advice offered in the documentation is as valid as ever.

I've not seen evidence to say otherwise. As always, if setting it works well for your needs, go for it. If not, don't.

@theodorzoulias
Contributor Author

theodorzoulias commented Jul 28, 2022

MaxDegreeOfParallelism = Environment.ProcessorCount as the general advice

That'd be like saying it'd be hard to go wrong by setting the thread pool's min and max to ProcessorCount by default. And that's not the case.

The general advice offered in the documentation is as valid as ever.

I've not seen evidence to say otherwise. As always, if setting it works well for your needs, go for it. If not, don't.

@stephentoub thanks for the quick feedback, as usual! The reason that MaxDegreeOfParallelism = Environment.ProcessorCount looks fine to me as general advice is that:

  • For CPU-bound workloads it utilizes all the available CPUs, without inducing the ThreadPool to create more threads. Adding more threads than CPUs won't help finish the calculation faster anyway.
  • For I/O-bound workloads, in case it's beneficial to parallelize with more threads than Environment.ProcessorCount, it discourages the developer from being lazy and depending on the slow injection of new threads for a gradual increase in the degree of parallelism. The code is ready to be configured with a better value for the MaxDegreeOfParallelism. Furthermore, the developer will likely soon notice that the higher value is not respected immediately, and after some digging they'll realize that they have to meddle with ThreadPool.SetMinThreads as well, or, even better, switch to the asynchronous Parallel.ForEachAsync, in case that option is open to them. (A sketch of the blocking pattern follows this list.)
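
A sketch of that blocking pattern, with invented values (the 64, items and DownloadSync are placeholders):

int dop = 64; // tuned to the remote service, not to the CPU count
ThreadPool.SetMinThreads(dop, dop); // avoid waiting on the pool's slow thread injection
ParallelOptions options = new() { MaxDegreeOfParallelism = dop };
Parallel.ForEach(items, options, item => DownloadSync(item)); // synchronous, blocking I/O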

Following the current general advice in the documentation, and not configuring the MaxDegreeOfParallelism, has basically two problems:

  1. It saturates the ThreadPool. While the unconfigured Parallel.ForEach loop is running, a concurrent timer callback or asynchronous continuation might not be served immediately, because the ThreadPool will be depleted of available threads. The Parallel.ForEach takes all the threads that the ThreadPool has to offer, and continuously asks for more. Essentially it owns the ThreadPool. That's not a good use of a shared resource, IMHO.
  2. The degree of parallelism of the parallel operation becomes dependent on the current state of the ThreadPool. At one point the ThreadPool might have 4 threads, and the degree of parallelism will be 4. At another point the ThreadPool might have 100 threads, and the degree of parallelism will be 100. This makes for a program with inconsistent and unpredictable behavior.

Here is a minimal demonstration that an unconfigured Parallel.ForEach loop does indeed consume all the available threads of the ThreadPool. The program below completes in 4 seconds, and reports a maximum concurrency of 103:

// .NET 6 top-level program (implicit usings assumed)
using System.Runtime.InteropServices; // for RuntimeInformation

Console.WriteLine(RuntimeInformation.FrameworkDescription);
ThreadPool.SetMinThreads(100, 100);
object locker = new();
int concurrencyCounter = 0;
int maxConcurrency = 0;
Parallel.ForEach(Enumerable.Range(1, 200), n =>
{
    int concurrency = Interlocked.Increment(ref concurrencyCounter);
    lock (locker) maxConcurrency = Math.Max(maxConcurrency, concurrency);
    Thread.Sleep(2000); // Simulate a blocking I/O-bound operation
    Interlocked.Decrement(ref concurrencyCounter);
});
Console.WriteLine($"Maximum concurrency: {maxConcurrency}");

Output:

.NET 6.0.0-rtm.21522.10
Maximum concurrency: 103

Online demo.

The parallel loop starts by using the 100 available ThreadPool threads, as well as the main thread of the console application (101 threads in total). During its execution it induces the ThreadPool to create 2 additional threads, for a final total of 103 threads.

@stephentoub
Member

stephentoub commented Jul 28, 2022

Here is a minimal demonstration that an unconfigured Parallel.ForEach loop does indeed consume all the available threads of the ThreadPool.

It doesn't require a repro: that is the design goal and purpose when there's no other work to be done. But note that the additional work items ForEach uses are queued locally, not globally, which means other threads would need to exhaust the global queue before assisting this loop.

I'm not sure what the purpose of this Issue is. If you want to set the value in your code to ProcessorCount, go for it.

@theodorzoulias
Contributor Author

But note that the additional work items ForEach uses are queued locally, not globally, which means other threads would need to exhaust the global queue before assisting this loop.

@stephentoub could you expand on this a bit more? Maybe this is what I am missing. How can this difference between local and global queuing be demonstrated in practice?

I'm not sure what the purpose of this Issue is. If you want to set the value in your code to ProcessorCount, go for it.

The issue is whether the general advice offered in the documentation is valid or outdated. It's not about my code. It's about the advice that I am giving to other people, which is not aligned with the official documentation. I might have to reconsider my advice, or I might have to amend it with an explicit warning that it deviates from the official guidelines.

@stephentoub
Member

could you expand on this a bit more?

When thread pool worker threads go in search of the next thing to process, they first look at their own queue (which only they can queue to), then to the global queue, and only if that's empty do they try to steal from other threads' queues. As long as there's work to be processed in a thread's local queue or in the global queue, it won't pick up the extra work item created by the loop in another thread's local queue.
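
In rough code form, with invented container types (the real ThreadPool differs in detail; for example, thieves take from the opposite end of a work-stealing deque), the search order is:

object? GetNextWorkItem(
    Stack<object> myLocalQueue,               // this thread's own queue (LIFO)
    Queue<object> globalQueue,                // the shared global queue (FIFO)
    IEnumerable<Stack<object>> otherQueues)   // other threads' local queues
{
    if (myLocalQueue.TryPop(out var item)) return item;  // 1. own local queue first
    if (globalQueue.TryDequeue(out item)) return item;   // 2. then the global queue
    foreach (var q in otherQueues)                       // 3. steal only when both are empty
        if (q.TryPop(out item)) return item;
    return null; // nothing to do
}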

@theodorzoulias
Contributor Author

When thread pool worker threads go in search of the next thing to process, they first look at their own queue (which only they can queue to), then to the global queue, and only if that's empty do they try to steal from other threads' queues. As long as there's work to be processed in a thread's local queue or in the global queue, it won't pick up the extra work item created by the loop in another thread's local queue.

@stephentoub does this have any practical effect on the behavior of a .NET program? What would be the visible behavioral difference if the worker threads didn't have their own local queues, and instead used only a shared global queue?

Btw I think I should demonstrate the effect of an unconfigured Parallel.ForEach loop on an unrelated asynchronous method and on an unrelated System.Threading.Timer component:

// .NET 6 top-level program (implicit usings assumed)
using System.Diagnostics;             // for Stopwatch
using System.Runtime.InteropServices; // for RuntimeInformation

Console.WriteLine(RuntimeInformation.FrameworkDescription);
ThreadPool.SetMinThreads(100, 100);
Task task = Task.Run(async () =>
{
    Stopwatch stopwatch = new();
    for (int i = 0; i < 10; i++)
    {
        stopwatch.Restart();
        await Task.Delay(100);
        stopwatch.Stop();
        Console.WriteLine($"Duration of await Task.Delay(100): {stopwatch.ElapsedMilliseconds} msec");
    }
});

Stopwatch timerStopwatch = Stopwatch.StartNew();
var timer = new Timer(_ =>
{
    timerStopwatch.Stop();
    Console.WriteLine($"Timer of 500 msec ticked after: {timerStopwatch.ElapsedMilliseconds} msec");
}, null, 500, -1);

Parallel.ForEach(Enumerable.Range(1, 200), n =>
{
    Thread.Sleep(2000); // Simulate a blocking I/O-bound operation
});
GC.KeepAlive(timer);

Output:

.NET 6.0.0-rtm.21522.10
Duration of await Task.Delay(100): 2018 msec
Timer of 500 msec ticked after: 2021 msec
Duration of await Task.Delay(100): 100 msec
Duration of await Task.Delay(100): 100 msec
Duration of await Task.Delay(100): 100 msec
Duration of await Task.Delay(100): 100 msec
Duration of await Task.Delay(100): 100 msec
Duration of await Task.Delay(100): 100 msec
Duration of await Task.Delay(100): 100 msec
Duration of await Task.Delay(100): 100 msec
Duration of await Task.Delay(100): 100 msec

Online demo.

This is not good behavior, IMHO. It's the kind of observation that makes me reluctant to follow and promote the general advice offered in the documentation.

@myblindy

This is not good behavior, IMHO

That code does what you tell it to do. So when you say it's not good behavior, you mean your code isn't written correctly.

The way I'd write (and have written) this kind of code is to use separate thread pools: one thread pool for I/O-bound operations (your timer would get queued here), one thread pool for CPU-bound operations (your PLINQ thing would get queued here), and the system thread pool, which is used by system components like ASP.NET Core listeners.
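
Using only built-in APIs, one way to approximate that isolation (a sketch, not necessarily the setup described above) is to give the CPU-bound loop its own concurrency-capped scheduler, so it cannot monopolize the shared ThreadPool:

var cpuScheduler = new ConcurrentExclusiveSchedulerPair(
    TaskScheduler.Default, maxConcurrencyLevel: Environment.ProcessorCount).ConcurrentScheduler;
var options = new ParallelOptions { TaskScheduler = cpuScheduler };
Parallel.ForEach(Enumerable.Range(1, 200), options, n => CrunchNumbers(n)); // CrunchNumbers is hypothetical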

Also this:

Parallel.ForEach(Enumerable.Range(1, 200), n =>
{
    Thread.Sleep(2000); // Simulate a blocking I/O-bound operation
});
GC.KeepAlive(timer);

If that is indeed simulating an I/O-bound operation, you should have used async operations, and then Task.WhenAll or Parallel.ForEachAsync. So again, your examples are fundamentally broken.

@stephentoub
Member

does this have any practical effect on the behavior of a .NET program?

Yes. If it didn't, we wouldn't have the mechanism.

What would be the visible behavioral difference if the worker threads didn't have their own local queues, and instead used only a shared global queue?

The primary purpose of the local queue is to allow a work item to be decomposed into multiple pieces such that other threads can help out if available. It enables a thread to receive assistance if others are available but not force itself onto them.

the effect of an unconfigured Parallel.ForEach loop on an unrelated asynchronous method and on an unrelated System.Threading.Timer component

This is an unrepresentative repro in multiple ways. First, Thread.Sleep is an inexhaustible resource, provides zero backpressure, and gives zero information to the pool's scaling heuristics. Second, it's a heterogeneous workload that combines asynchronous I/O that's apparently latency-sensitive with a Parallel.ForEach that's not only long-running but also involves individual long iterations that are purely sleeping; this is not the predominant use of ForEach. Third, this falls exactly into the second exemption already included in the docs: "when you're running multiple algorithms concurrently".

If the default isn't good for your workload, please choose a better value; it's been configurable since the day it was exposed. If the default doesn't work for customers you're consulting with and their workloads, by all means, recommend whatever value you determine is better based on sufficiently profiling and measuring their app (or maybe they shouldn't be using ForEach in the first place). If you believe there's a category of exemption missing from the docs, please feel free to open a suggestion in the docs repo. If you're asking for the default to be changed, I don't see that happening: not only is that guaranteed to result in performance regressions (and not just for purely I/O-bound operations), it also has a reasonable chance of introducing deadlocks, if one iteration ends up depending on another executing.
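
A minimal sketch of that deadlock class (the semaphore and the two-element source are invented for illustration): if the waiting iteration happens to run first, the capped loop never completes.

var gate = new SemaphoreSlim(0);
var options = new ParallelOptions { MaxDegreeOfParallelism = 1 };
Parallel.ForEach(new[] { 0, 1 }, options, i =>
{
    if (i == 0) gate.Wait();  // iteration 0 waits for a signal from iteration 1...
    else gate.Release();      // ...but iteration 1 cannot run while iteration 0 holds the only worker
});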

I'm on vacation and am going to stop responding to this thread now.

Thanks.

@filipnavara
Member

For CPU-bound workloads it utilizes all the available CPUs, without inducing the ThreadPool to create more threads. Adding more threads than CPUs won't help finish the calculation faster anyway.

The logical flaw in that thinking is the assumption that all the individual operations take approximately the same time and start executing at the same time. That's not necessarily the case.

@theodorzoulias
Contributor Author

If you're asking for the default to be changed, I don't see that happening: not only is that guaranteed to result in performance regressions (and not just for purely I/O-bound operations), it also has a reasonable chance of introducing deadlocks, if one iteration ends up depending on another executing.

@stephentoub thanks for stealing time from your vacation to resolve this issue! In the future, whenever I give advice about configuring the synchronous Parallel APIs, I will definitely include the above arguments as a footnote.

@theodorzoulias
Contributor Author

theodorzoulias commented Jul 28, 2022

That code does what you tell it to do. So when you say it's not good behavior, you mean your code isn't written correctly.

Hi Blindy! I'm happy to see you here. We agree that my code isn't written correctly, but we probably disagree (as we usually do) on how it can be fixed. IMHO the correct way to fix my code is simply to configure the MaxDegreeOfParallelism, so that the Parallel.ForEach loop does not saturate the ThreadPool. Just adding this line:

ParallelOptions options = new() { MaxDegreeOfParallelism = 100 };
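
(and passing the options into the loop, i.e. Parallel.ForEach(Enumerable.Range(1, 200), options, n => ...))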

...and look how the output of my program is immediately fixed:

.NET 6.0.0-rtm.21522.10
Duration of await Task.Delay(100): 100 msec
Duration of await Task.Delay(100): 100 msec
Duration of await Task.Delay(100): 100 msec
Duration of await Task.Delay(100): 102 msec
Timer of 500 msec ticked after: 502 msec
Duration of await Task.Delay(100): 100 msec
Duration of await Task.Delay(100): 100 msec
Duration of await Task.Delay(100): 104 msec
Duration of await Task.Delay(100): 104 msec
Duration of await Task.Delay(100): 104 msec
Duration of await Task.Delay(100): 100 msec

Leaving the ThreadPool just one available thread is enough to ensure that the asynchronous continuations and the timer callback are invoked in a timely manner.

If that is indeed simulating an I/O-bound operation, you should have used async operations, and then Task.WhenAll or Parallel.ForEachAsync. So again, your examples are fundamentally broken.

Yes, the Thread.Sleep(2000) simulates an I/O-bound operation. People do these things all the time, either because they don't know better or because they just don't have an asynchronous API available for whatever they are trying to do. But even if we assume that using the synchronous Parallel APIs with I/O-bound workloads can be dismissed as incorrect usage of the APIs, you still have to make a case for configuring CPU-bound workloads with MaxDegreeOfParallelism = -1 (the default). This configuration has the effect of saturating the ThreadPool (correct me if I am wrong). Don't you think that a saturated ThreadPool is a problem for a modern async-enabled application?

@theodorzoulias
Contributor Author

theodorzoulias commented Jul 29, 2022

For CPU-bound workloads it utilizes all the available CPUs, without inducing the ThreadPool to create more threads. Adding more threads than CPUs won't help finish the calculation faster anyway.

The logical flaw in that thinking is the assumption that all the individual operations take approximately the same time and start executing at the same time. That's not necessarily the case.

@filipnavara sure, one could easily devise scenarios where adding more threads than CPUs could help speed up CPU-bound calculations. For example, you have 4 CPUs and 5 equally sized items to process. In this case I would expect it to be beneficial to use 5 threads instead of 4: with 4 threads the fifth item waits for a free slot and the batch takes roughly twice the duration of a single item, while with 5 time-sliced threads everything can finish in roughly 1.25 times that duration. So I should have prefixed my statement with "In most cases".

How about PLINQ queries, though? The same train of thought could induce you to increase PLINQ's degree of parallelism to something larger than the default Environment.ProcessorCount. For example:

.WithDegreeOfParallelism(Environment.ProcessorCount * 3)

This should also speed up the CPU-bound processing of 5 items on 4 CPUs. Would you advise people to use the above configuration as their new default? If not, why not?
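
In full form, that configuration would look something like this (a sketch; source and Process are placeholders):

var results = source
    .AsParallel()
    .WithDegreeOfParallelism(Environment.ProcessorCount * 3)
    .Select(Process)
    .ToArray();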

@theodorzoulias
Contributor Author

Btw although we don't agree on the suitability of the general advice offered in the MaxDegreeOfParallelism documentation (which is, in general, not to modify this setting), I think we will agree that this advice does not consider the new asynchronous Parallel APIs that were added in .NET 6 (the ForEachAsync method and its overloads). These APIs are typically used with I/O-bound workloads, and the optimal parallelization level for them is bound to the capabilities of the asynchronous service provider, not to the capabilities of the local CPU. People who intend to use the Parallel.ForEachAsync method, and who search the official documentation for guidance on configuring the ParallelOptions, are currently not getting the best possible advice, IMHO.
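
For example, a sketch of what ForEachAsync-specific guidance could look like (the endpoint and the value 32 are invented; the cap reflects what the remote service can sustain, not the CPU count):

using var http = new HttpClient();
var urls = Enumerable.Range(1, 200).Select(i => $"https://example.com/items/{i}");
var options = new ParallelOptions { MaxDegreeOfParallelism = 32 };
await Parallel.ForEachAsync(urls, options, async (url, ct) =>
{
    string payload = await http.GetStringAsync(url, ct); // asynchronous I/O, no blocked threads
    // ... process payload ...
});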

@ghost ghost locked as resolved and limited conversation to collaborators Sep 16, 2022