
Add a guaranteed DOP to Parallel and a max-DOP to PLINQ #4727

Closed
GSPP opened this issue Nov 29, 2015 · 17 comments
@GSPP

GSPP commented Nov 29, 2015

The Parallel class and PLINQ are inconsistent in the way the degree of parallelism can be specified. For PLINQ the DOP is always exact. For Parallel the DOP is the maximum that will be used.

Whether you want an exact or a max DOP depends on the problem. For example, when doing IO we often want an exact DOP that was empirically determined to be optimal. The TPL cannot possibly know the correct DOP: for magnetic disks the best DOP might be 1-2, for SSDs 4-16, and for web services it might be extremely high.

This means that we effectively cannot use Parallel for IO. Depending on the number of CPUs and the thread-pool state, wildly different actual DOPs might be chosen. The app might work fast on one machine and slow on another, or slow at a certain time of day depending on CPU load. Or it might crash the web service being called when run on a client with many CPU cores.

I mean the previous paragraph literally. If you disagree, please explain how Parallel can be used effectively for IO given the challenges detailed here.
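To make the IO case concrete, here is a minimal sketch of the kind of code I have in mind (the DOP of 8 and the Thread.Sleep body are placeholders for an empirically determined optimum and real blocking IO); PLINQ's ability to request a fixed DOP is what makes this pattern usable:

```csharp
using System;
using System.Linq;
using System.Threading;

class IoDopSketch
{
    static void Main()
    {
        // Hypothetical work items; the DOP of 8 stands in for a value
        // measured to be optimal for the target IO resource.
        var items = Enumerable.Range(0, 50).ToArray();
        items.AsParallel()
             .WithDegreeOfParallelism(8)      // requested DOP, independent of CPU count
             .ForAll(i => Thread.Sleep(100)); // placeholder for a blocking IO call
    }
}
```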

PLINQ, conversely, does not support a max DOP, which can be an efficiency issue for CPU-bound workloads. As far as I can tell, PLINQ cooperates with other PLINQ queries through the thread pool to avoid oversubscription. That's good, but it does not work either, due to the related issue https://github.com/dotnet/coreclr/issues/1754 (unbounded number of threads). Try this:

Task.WaitAll(Enumerable.Range(0, 100).Select(i => Task.Run(() =>
{
    ParallelEnumerable.Range(0, 1000000).ForAll(_ => Thread.Sleep(1000));
})).ToArray());

Also, it is strange from an API perspective for two closely analogous APIs, Parallel and PLINQ, to behave differently in this respect. I see no fundamental reason it has to be so.

Another problem is that PLINQ has a maximum DOP of 512. This has been raised from 64 (as I just noticed), which is great. Can we make it essentially unlimited?

You might say, "Who in their right mind wants to start more than 512 threads?" Well, if there is enough RAM and this is a 64-bit process, it can be a very convenient and reliable way to increase the DOP. Often it is not the right choice for production code, but for lower-quality dev tools it can be totally fine.

Another problem is that PLINQ does not support using a custom TaskScheduler. I have needed that recently to globally limit the parallelism induced by certain PLINQ queries. I wanted each query to use a DOP of N but globally in the whole app I wanted to also enforce a max DOP of N. The solution for that would have been to use a custom TaskScheduler that limits the DOP to N and make PLINQ use that.

After a cursory look at the PLINQ source it seems that basic support for custom task schedulers is architecturally present (e.g. SpoolPipeline). I hope this would be reasonably easy to enable.
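For illustration, the global cap I describe doesn't even need a hand-written scheduler: the built-in ConcurrentExclusiveSchedulerPair can limit concurrency, and Parallel already accepts such a scheduler through ParallelOptions. PLINQ is missing exactly this hook. A rough sketch (the cap of 4 is an arbitrary example value):

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

class GlobalCapSketch
{
    // One shared scheduler capping concurrency at 4 across the whole app.
    static readonly ConcurrentExclusiveSchedulerPair Pair =
        new ConcurrentExclusiveSchedulerPair(TaskScheduler.Default, maxConcurrencyLevel: 4);

    static void Main()
    {
        // Parallel accepts the scheduler via ParallelOptions; PLINQ has no
        // equivalent public hook today, which is the gap described above.
        var options = new ParallelOptions
        {
            TaskScheduler = Pair.ConcurrentScheduler,
            MaxDegreeOfParallelism = 4
        };
        Parallel.For(0, 100, options, i => Thread.Sleep(10));
        Console.WriteLine("done");
    }
}
```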

I propose:

  1. PLINQ should be able to have a max DOP
  2. Parallel should be able to have an exact DOP
  3. The PLINQ DOP limit should be removed.
  4. Make PLINQ be able to execute on a given TaskScheduler.

In this issue I have written up a few challenges that I run into frequently. I also actively answer TPL-related questions on Stack Overflow, and these issues come up all the time.

@stephentoub
Member

Thanks for the issue.

Make PLINQ be able to execute on a given TaskScheduler

On the surface, this is easy: just make this WithTaskScheduler method public:
https://github.com/dotnet/corefx/blob/c02d33b18398199f6acc17d375dab154e9a1df66/src/System.Linq.Parallel/src/System/Linq/ParallelEnumerable.cs#L297

As is often the case, it's more complicated than that, though. The reason we didn't make this public initially and then haven't each time the question has come up is due to fear of deadlocks. Some PLINQ operators use barriers, such that all tasks/partitions involved in processing the query need to join before any of them can make further progress. If the underlying scheduler can't guarantee that all of the tasks queued as part of PLINQ's processing will be able to run at the same time, e.g. if it's a scheduler created with a fixed number of threads, then we risk deadlock. We've explored additional surface area for TaskScheduler to be able to let the scheduler provide such a guarantee, but it got complicated quickly and didn't seem worth it. Hence restricting PLINQ to the default scheduler.

Of course, while we pretend that the default scheduler has an unlimited number of threads available to it, that's obviously not true. A developer can change the maximum number of threads available, potentially lowering it to the point where there would be problems. But even with the defaults, while on 64-bit the default max is 32K, on 32-bit the default is 1K. Which is why the PLINQ DOP max limit exists...
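For reference, the pool's limits can be inspected (and lowered) like this; the actual numbers vary by runtime, platform, and bitness:

```csharp
using System;
using System.Threading;

class PoolLimits
{
    static void Main()
    {
        int maxWorker, maxIo;
        ThreadPool.GetMaxThreads(out maxWorker, out maxIo);
        Console.WriteLine("max worker threads: " + maxWorker + ", max IO threads: " + maxIo);

        // Lowering the limit is the scenario that increases the deadlock risk:
        // ThreadPool.SetMaxThreads(4, maxIo);
    }
}
```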

The PLINQ DOP limit should be removed.

The original 64 limit was there simply due to a deficiency in the implementation. That concrete limitation was removed, and it was replaced with a fairly arbitrary one of 512, as a value meant to be one that should rarely need to be exceeded while at the same time being unlikely to cause problems with the default maximums in the thread pool. I'd be ok seeing this artificial limitation raised or removed, but we'd need to think through how to mitigate the resulting increased deadlock potential. Suggestions?

PLINQ should be able to have a max DOP

I don't think the difference between DOP and max DOP is as big as you're implying, or as simple to change. For queries that involve only operators that don't need to do the kind of barriers I previously mentioned, PLINQ effectively does have a max rather than a hard DOP. It'll create DOP tasks to participate in the query, but one or more of those tasks may simply get canceled when there's no work for them to do, e.g. the following code:

using System;
using System.Linq;
using System.Threading.Tasks;

class Repro
{
    static void Main()
    {
        var source = Enumerable.Range(0, 1000);
        foreach (int tid in source.AsParallel().WithDegreeOfParallelism(512).Select(_ => Task.CurrentId).AsSequential().Distinct())
        {
            Console.WriteLine(tid);
        }
    }
}

just printed out "1 3 2" when I ran it, highlighting that only three tasks were used even though I'd set a DOP of 512 and there were 1000 inputs. In other words, the benefit of exposing a max DOP would simply be a change to the task creation scheme, plus any performance benefits that might come from not queueing further tasks when the extras aren't needed. That, however, could cause problems for queries (and the performance of queries) that did actually require all of the tasks to be executing concurrently. Given the relatively little value to be gained, I don't believe it's worth also exposing a WithMaxDegreeOfParallelism.

Parallel should be able to have an exact DOP

Similar to the above, I'm not sure what the value here would be. You wouldn't want to block one of the participating tasks from running until all the others were running, as that would simply slow it down. And the scheme employed by Parallel is to have every task that starts running queue a copy of itself, which means that if the underlying scheduler gives it N threads, it'll immediately use them all (up to the requested DOP): each thread picks up a task, which immediately creates a copy of itself that's picked up by another thread, and so on. For example, the following code will print out 512 threads being used, and will complete as quickly as the system is able to provide them all:

using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

class Repro
{
    static void Main()
    {
        var threadIds = new ConcurrentDictionary<int, bool>();
        int dop = 512;
        var b = new Barrier(dop);
        Parallel.For(0, dop, new ParallelOptions { MaxDegreeOfParallelism = dop, TaskScheduler = new DedicatedThreadScheduler() }, i =>
        {
            b.SignalAndWait();
            threadIds[Environment.CurrentManagedThreadId] = true;
        });
        Console.WriteLine(threadIds.Count);
    }

    class DedicatedThreadScheduler : TaskScheduler
    {
        protected override void QueueTask(Task task)
        {
            new Thread(_ => TryExecuteTask(task)) { IsBackground = true }.Start();
        }

        protected override bool TryExecuteTaskInline(Task task, bool taskWasPreviouslyQueued)
        {
            return TryExecuteTask(task);
        }

        protected override IEnumerable<Task> GetScheduledTasks() { return null; }
    }
}

This highlights that the limitation here isn't with DegreeOfParallelism or MaxDegreeOfParallelism, but with the underlying scheduler's implementation. We use "DegreeOfParallelism" with PLINQ to highlight that some queries may actually require that DOP to be successful, and we use "MaxDegreeOfParallelism" with Parallel because there's nothing inherent to Parallel that would require that many to run concurrently.

This means that we effectively cannot use Parallel for IO

I'm not understanding this. First, for I/O at the scales you're talking about, it'd typically be more efficient to use async instead of blocking pool threads. But if you've measured that blocking threads is preferable for your cases, I still don't understand the limitation. The concern seems really to be about whether the thread pool has threads to give, so your concern isn't with Parallel, but with the thread pool. In other words, even if we were to change Parallel to queue N tasks all at once rather than using the scheme I outlined where one task queues another, it would still be at the mercy of the underlying scheduler to provide the threads to run those tasks.
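As a sketch of the async alternative (the endpoint and the DOP of 8 are placeholders), throttling with SemaphoreSlim keeps an exact cap on in-flight operations without dedicating a blocked thread to each one:

```csharp
using System.Linq;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

class AsyncThrottleSketch
{
    static async Task RunAsync()
    {
        // Cap in-flight requests at an empirically chosen DOP of 8.
        var gate = new SemaphoreSlim(8);
        var client = new HttpClient();
        var tasks = Enumerable.Range(0, 100).Select(async i =>
        {
            await gate.WaitAsync();
            try
            {
                // Placeholder endpoint; no thread is blocked while awaiting.
                await client.GetStringAsync("https://example.org/item/" + i);
            }
            finally
            {
                gate.Release();
            }
        }).ToArray();
        await Task.WhenAll(tasks);
    }
}
```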

@GSPP
Author

GSPP commented Dec 7, 2015

You made some very good points here. I will reply in the next few days; some of my points no longer stand and need to be re-analyzed.

@stephentoub
Member

Thanks. Sounds good.

@stephentoub
Member

@GSPP, can this be closed? Thanks.

@GSPP
Author

GSPP commented Jan 20, 2016

@stephentoub I'll make another attempt at finding time this week. Please leave this open for a while. Thanks.

I didn't immediately respond because I need to experiment and think this through.

@stephentoub
Member

Ok.

@GSPP
Author

GSPP commented Jan 23, 2016

Regarding WithTaskScheduler deadlocking: Parallel does not have this problem. My understanding is that Parallel allows all threads to participate and process the remaining work. The internal self-replicating infrastructure is used to queue tasks if execution slots appear to be available. That way even if just one thread is available the operation will always complete. This is a great solution. (If no thread can be made available Parallel might also deadlock according to my understanding of the code.)

Why wouldn't this work for PLINQ? I realize that there are situations that require streams to be merged, but couldn't this be something in the spirit of Parallel.ForEach(streams, s => ProcessStream(s))?

As an alternative solution, all waiting and blocking could be moved to async waiting. That way threads are never a scarce resource and the existing code structure would not need to change much. I see CountdownEvent being used in the code. Maybe just make that async? Hopefully the overhead doesn't matter too much in the context of PLINQ.

For queries that involve only operators that don't need to do the kind of barriers I previously mentioned, PLINQ effectively does have a max

I did not realize this but it's true. With a custom scheduler this could be made into an exact DOP with no other API surface area required. That would be a great capability.

In fact the same thing could be done with Parallel with no changes to it at all.

me: This means that we effectively cannot use Parallel for IO
you: so your concern isn't with Parallel, but rather with the thread pool

I now realize that this is true. Parallel can be made to provide an exact DOP with a custom TaskScheduler (and only with a custom TaskScheduler). This seems like the right solution to me.

To summarize I now think Parallel is in perfect shape :) If PLINQ got the ability to run on a custom TaskScheduler in the same way Parallel does it also would provide both max as well as exact DOPs. That would be great!

Would one of the solutions for PLINQ given above work?

Regarding IO, a clarification: if the system is free to reduce the DOP applied to the IO system, throughput can be lost; if the DOP can grow too much, the IO resource can become overloaded. Therefore I consider it very important to use an exact DOP for IO-bound work. In particular, it is wrong to derive the DOP used for IO from any value based on Environment.ProcessorCount. This explanation is not directly material to the issue; it's meant as a motivating use case.

@stephentoub

@GSPP
Author

GSPP commented Jan 28, 2016

I just ran into a scenario where a PLINQ query blocked unexpectedly.

SpoolForAll has this (decompiled) code:

internal static void SpoolForAll<TInputOutput, TIgnoreKey>(QueryTaskGroupState groupState, PartitionedStream<TInputOutput, TIgnoreKey> partitions, TaskScheduler taskScheduler)
{
    Task rootTask = new Task(delegate {
       ...
    });
    ...
    rootTask.RunSynchronously(taskScheduler);
    ...
}

Isn't it strange that the call to RunSynchronously is waiting on an event instead of running this task synchronously?! The task clearly is unstarted and we are targeting the default scheduler. This task should be eligible for inline execution. The stack depth was low, too.

I'm posting this finding because it fits my point from above: Maybe PLINQ queries don't need to block as much as they are doing right now.

This particular case of blocking seems pointless. It's either a bug, or if this is by design then the blocking could be removed by using the "Parallel.ForEach(streams, s => ProcessStream(s))" idea that I posted above.

My code was:

        ParallelEnumerable.Range(0, 50)
            .WithDegreeOfParallelism(20)
            .ForAll(_ => SynchronousHttpRequest());

I did not mess with thread-pool settings.

@stephentoub
Member

The task clearly is unstarted

How do you know the task is unstarted? Could it not have already finished the work assigned to its partition and it's now blocked waiting on the other tasks to finish? (ForAll, after all, is a synchronous call, and can't return until all of the work is done.) There is some subtlety to the mechanisms PLINQ chooses for partitioning, and IIRC, ParallelEnumerable.Range does a strict partitioning that doesn't share between partitions, by design (http://blogs.msdn.com/b/pfxteam/archive/2009/05/28/9648672.aspx). I expect you wouldn't see this, or at least not as much, if you instead did Enumerable.Range(0, 50).AsParallel()...

@GSPP
Author

GSPP commented Jan 28, 2016

The task is created using the Task constructor and nothing after that seems to start that task, no? Am I just not seeing the place where it's started?

@stephentoub
Member

RunSynchronously starts it, trying to run it on current thread.

@GSPP
Author

GSPP commented Jan 28, 2016

Right, but why did that not work? The task did not run inline. I found this interesting because trying to eliminate this kind of blocking seems important to eliminating deadlocks and enabling custom task schedulers.

Shouldn't new Task(() => { }).RunSynchronously(); pretty much always inline the task onto the current thread if the scheduler allows for that?

Indeed, with Enumerable.Range this did not happen, but it should not happen with ParallelEnumerable, either, I think.
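For reference, this is the inlining behavior I was expecting; on the default scheduler, an unstarted task passed to RunSynchronously normally executes on the calling thread:

```csharp
using System;
using System.Threading.Tasks;

class InlineSketch
{
    static void Main()
    {
        int callerThread = Environment.CurrentManagedThreadId;
        int taskThread = 0;
        var t = new Task(() => taskThread = Environment.CurrentManagedThreadId);
        t.RunSynchronously(); // default scheduler: typically inlines on this thread
        Console.WriteLine(taskThread == callerThread); // True on the default scheduler
    }
}
```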

@stephentoub
Member

Right, but why did that not work?

How do you know it didn't work? That's what I was trying to ask. Can you help me to understand why you believe it's blocking waiting for that task to even start running? I'm not sitting in front of the debugger, but I think it's much more likely that it did run, that it finished all of its work, and now it's waiting for all of its child tasks to complete.

@GSPP
Author

GSPP commented Jan 28, 2016

Ahhh, waiting for child tasks! That explains that.

But back to the main issue: How to enable custom task schedulers with PLINQ?

And you said:

Some PLINQ operators use barriers, such that all tasks/partitions involved in processing the query need to join before any of them can make further progress. If the underlying scheduler can't guarantee that all of the tasks queued as part of PLINQ's processing will be able to run at the same time

Which is correct! But Parallel does not have that issue. So I would suggest using a Parallel-like model for these fork-join points in PLINQ queries.

@stephentoub
Member

But Parallel does not have that issue.

Parallel doesn't have that issue for the same reason that most of the PLINQ operators don't: no coordination is required between the operations being performed other than the initial partitioning and the final merging/joining. Things become more complicated when, part way through the operation, you need information from all of the other threads participating in it, which is why barriers come into play in some of these operators (for examples, search the PLINQ code base for uses of Barrier). If what you're suggesting is that, when one of these threads reaches such a barrier, rather than waiting for the other threads processing their partitions it could somehow steal part of their work away in order to help them progress faster, then that would be a monumental undertaking, and it's not clear it would actually have a positive reward.

@GSPP
Author

GSPP commented Feb 3, 2016

I have now searched the sources for "barrier" and experimented a bit with First and TakeOrSkip scenarios. I see what you mean: sometimes there's a Barrier midway through the enumeration.

Making that barrier async would work but would require MoveNext to return Task (or ValueTask). Wouldn't the overhead be acceptable in case of ValueTask? That refactoring would consist of mostly rote changes because almost all operators would return a completed ValueTask and not use any await at all. The First and TakeOrSkip MoveNext methods would need to be gutted.
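To sketch what an awaitable barrier could look like (this is just an illustration, not production-hardened, and the real PLINQ refactoring would be far more involved): the last participant to arrive releases everyone, and nobody blocks a thread while waiting.

```csharp
using System.Threading;
using System.Threading.Tasks;

// Illustration only: participants await a Task instead of blocking,
// which is the shape an async MoveNext would require.
class AsyncBarrier
{
    private readonly int _participants;
    private int _remaining;
    private TaskCompletionSource<bool> _tcs = new TaskCompletionSource<bool>();

    public AsyncBarrier(int participants)
    {
        _participants = participants;
        _remaining = participants;
    }

    public Task SignalAndWaitAsync()
    {
        var tcs = _tcs;
        if (Interlocked.Decrement(ref _remaining) == 0)
        {
            // Reset for the next phase before releasing the waiters.
            _remaining = _participants;
            _tcs = new TaskCompletionSource<bool>();
            tcs.SetResult(true);
        }
        return tcs.Task;
    }
}
```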

In fact this would lay the foundation to enable PLINQ to execute asynchronous functions which surely is a desirable feature.

Besides that I have no idea how to refactor the barrier away... So if that idea from above is not feasible I must give up.

I'd still like to have the ability to use a custom task scheduler. Maybe call the method WithTaskSchedulerDangerous? :) The only fundamental problem here is users misusing the library. Technically, I think it's sound to require the task scheduler to support at least as many threads as PLINQ requires. Possible API design: WithTaskScheduler(TaskScheduler ts, int exactCountOfTasks). That way the caller is responsible for defining a DOP that works well with that particular TaskScheduler.

I feel I've exhausted my ability to contribute. Feel free to close this issue if there is no possible way to make progress.

@stephentoub
Member

Thanks for your thoughts, @GSPP.

I'm going to close this out. We can certainly revisit the PLINQ max again if it really starts to be a blocker for someone.

@msftgits msftgits transferred this issue from dotnet/coreclr Jan 30, 2020
@msftgits msftgits added this to the Future milestone Jan 30, 2020
@ghost ghost locked as resolved and limited conversation to collaborators Jan 4, 2021