
[CURA-8573] Replace OpenMP with native async for Support Generation #1524

Merged
merged 4 commits into master from CURA-8573_support_generation_threaded on Nov 12, 2021

Conversation

@rburema (Member) commented Nov 3, 2021

Needed for multi-threading support on Mac.

Not all compilers implement the full C++17 standard library yet; specifically, the parallel execution policies (Parallelism TS) haven't been added to their respective standard libraries.

done as part of CURA-8573
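
To illustrate the direction (a minimal sketch only, not the exact code in this PR), replacing an OpenMP parallel-for with native async boils down to something like this:

    #include <cstddef>
    #include <future>
    #include <vector>

    // Minimal sketch: run body(i) for every i in [first, last), one std::async
    // task per index, then wait for all of them. Illustrative only.
    template<typename Body>
    void run_parallel(std::size_t first, std::size_t last, Body body)
    {
        std::vector<std::future<void>> futures;
        futures.reserve(last - first);
        for (std::size_t i = first; i < last; ++i)
        {
            futures.emplace_back(std::async(std::launch::async, body, i));
        }
        for (std::future<void>& future : futures)
        {
            future.get(); // Rethrows any exception thrown inside the task.
        }
    }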
@Piezoid (Contributor) commented Nov 3, 2021

Similarly, I've re-implemented GcodeLayerThreader as I wasn't satisfied with the 500ms wait for work. I used std::condition_variable and different logic to produce out of order and consume in order. I got better CPU utilization and wall-clock time.
I kept the omp parallel used for spawning the workers, allowing nested omp parallelism to behave correctly. On the other hand, it is not recommended in principle to use std::thread synchronization primitives with the OpenMP runtime (but there is no condition wait in OpenMP without tasks). In practice, I found that it's not problematic on gcc/Linux, but YMMV.
If you are interested in this, what thread runtime would you recommend using at this level?

@rburema (Member, Author) commented Nov 4, 2021

> Similarly, I've re-implemented GcodeLayerThreader as I wasn't satisfied with the 500ms wait for work. I used std::condition_variable and different logic to produce out of order and consume in order. I got better CPU utilization and wall-clock time. I kept the omp parallel used for spawning the workers, allowing nested omp parallelism to behave correctly. On the other hand, it is not recommended in principle to use std::thread synchronization primitives with the OpenMP runtime (but there is no condition wait in OpenMP without tasks). In practice, I found that it's not problematic on gcc/Linux, but YMMV. If you are interested in this, what thread runtime would you recommend using at this level?

Hi @Piezoid ,

We're going to replace all OpenMP instances with native threading, as it's a world of hassle to get OpenMP working on Mac (a sizeable chunk of our user-base). As a consequence, none of our Mac builds were using threading in the engine.

In order not to be completely overwhelmed, to split the programming into more predictable chunks of work, and to keep it reviewable, we decided to replace a small part first, to test the waters.

I suppose that does mean we're interested though! :-)

So I think that answers your question: we're eventually just going to use native threading over OpenMP.

@jellespijker merged commit 1ba64cf into master on Nov 12, 2021
@jellespijker deleted the CURA-8573_support_generation_threaded branch on November 12, 2021 at 08:33
@Piezoid (Contributor) commented Jan 12, 2022

Hi @rburema,

I have some reservations about using std::async. With many of the current implementations of the C++ standard library, parallel_for immediately spawns one thread per layer. (One exception is MSVC's, where async calls _Task_async_state, which starts the task in what appears to be a thread pool.) Besides the overhead of thread creation, it will create more runnable threads than there are available CPU cores, potentially resulting in scheduling overhead. This could cause scaling issues with bigger sliced objects.

I made a branch that uses a minimalistic thread pool owned by the Application singleton. I have re-implemented both parallel_for and GcodeLayerThreader using it. OpenMP is entirely stripped from the code base.

Edit: Quick benchmark: on a 50s slice with 24 threads on Linux (average of 4 runs):

  • -3.4% in wall clock run time (52.1s to 50.3s)
  • +5% in parallelism (Percent of CPU this job got: 1234% to 1296%)
  • -48% system time (3.63s to 3.93s)
  • -9% in max resident set size (680MiB to 612MiB)
  • -35% in minor page faults

If you're OK with this approach, I'll do a bit more cleanup and make a PR.

@rburema (Member, Author) commented Jan 19, 2022

Hi @Piezoid !

Sorry for the slow response.

> With many of the current implementations of the C++ standard library

Thanks for the link to the analysis!

That's on me for just assuming these sort-of higher-level calls have all that sort of stuff sorted out under the hood (like MSVC apparently already does -- though reading up on that, this may make them technically not standard-compliant).

The only part that uses async, however, is hopefully temporary. That said, as we always say in the team, "there is nothing more permanent than a temporary solution", so that's not a real argument.

Since there are no thread pools used internally, I think the general idea of limiting the number of threads is a good one.

The general problem I have with this is:

This (as in, my original PR) was tested against the code the way it was before, and it already seems to be a slight improvement over what the (ancient version of) OpenMP was doing; at the very least not worse. Possibly because each of the many tasks that are launched is quite large in and of itself (even though there may be many of them). I suspect the overhead of spawning a new thread, compared to what the tasks are actually doing, is negligible in most cases.

My point here is: Is the increase in complexity worth it, especially since C++ threading is a subject where lots of things are still changing? (Though part of what makes the linked file larger is a function that doesn't seem to be used directly by the linked code.)

A more specific question is about this part of the code:

        else
        {   // Too many tasks queued: do work on the main thread
            func(index, lock);
        }

This is placed in a for-loop that may run to full completion before any thread has had a chance to get going. That being the case: if one thread finishes before the entire list of parallel-for-each tasks is done, how will that thread ever pick up a new one? Won't this just result in the majority of the tasks (all except the first 4n, where n is the number of threads active in the pool) being executed on the main thread in typical scenarios? (Number of layers, or tree-support contact points, etc.)

Given our use cases (since the tasks are large, but expected to have roughly the same run-time if averaged over a large number of tasks), I think a better approach might be to divide the tasks into chunks of n, then just link all of those together, so that if task i * n finishes, it starts (i + 1) * n on the same thread. Then wait until those n threads have finished. I think that might be simpler to implement as well.
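
To make that concrete, here is a rough sketch of the chunked/chained scheme I have in mind (illustrative only, names made up):

    #include <cstddef>
    #include <thread>
    #include <vector>

    // Sketch of the static 'chained chunks' idea: n worker threads, where worker k
    // runs tasks k, k + n, k + 2n, ... and the caller joins all of them afterwards.
    // Note: no work redistribution happens if one stride finishes early.
    template<typename Body>
    void parallel_for_strided(std::size_t task_count, std::size_t n, Body body)
    {
        std::vector<std::thread> workers;
        workers.reserve(n);
        for (std::size_t k = 0; k < n; ++k)
        {
            workers.emplace_back([body, k, n, task_count]()
            {
                for (std::size_t i = k; i < task_count; i += n)
                {
                    body(i); // When task i finishes, the same thread picks up task i + n.
                }
            });
        }
        for (std::thread& worker : workers)
        {
            worker.join();
        }
    }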

@rburema (Member, Author) commented Jan 19, 2022

> I agree that at least for the support code that was converted to async, the size of the works completed was enough to mitigate

> Looking at the rest of the code, I don't think we have that many small work items.

> No, because the loop is synchronized with a mutex on the queue. It is held

Ah yes, you're right. (If we go through with this) it's probably good to document that, because the connection to the max-in-queue limit isn't explicit, since the extra optimization of running on the main thread when the thread-pool queue is full 'takes care' of that part as well.

OK, I think the added complexity will be worth it if the speedup is significant (at least a few percent or so) compared to my original approach. Ideally, I'd like to have that tested, because of the (hopefully negligible) overhead you added.

@Piezoid (Contributor) commented Jan 19, 2022

Hi @rburema !

Sorry, I misfired a reply. Apparently you got the email 😉

> The general problem I have with this is:

> This (as in, my original PR) was tested against the code the way it was before, and it already seems to be a slight improvement over what the (ancient version of) OpenMP was doing; at the very least not worse. Possibly because each of the many tasks that are launched is quite large in and of itself (even though there may be many of them). I suspect the overhead of spawning a new thread, compared to what the tasks are actually doing, is negligible in most cases.

Looking at the rest of the code, I don't think we have that many small work items.

I agree that, at least for the support code converted to async, the run-time of the tasks is enough to mitigate the overhead of spawning threads. However, this might no longer be the case with repeated and future uses in the code base.

My opinion is that it is wasteful to spawn as many threads as there are layers, multiple times during a slice, and then let the OS do the M:N scheduling. I confess that this is not backed by much data for this specific use case.

> OK, I think the added complexity will be worth it if the speedup is significant (at least a few percent or so) compared to my original approach. Ideally, I'd like to have that tested, because of the (hopefully negligible) overhead you added.

I can try to quantify the difference by comparing the async implementation of parallel_for and its ThreadPool replacement, after 0f30863a, which democratizes parallel_for in the code base. This could tell whether the observed page thrashing is caused by OpenMP or by repeated std::async calls.

> My point here is: Is the increase in complexity worth it, especially since C++ threading is a subject where lots of things are still changing?

I implemented ThreadPool with that complexity balance in mind, hoping to get the simplest code one could write without using higher-level APIs, although it is entirely possible that I missed a simpler design. It's very basic and not intended for direct use. It only uses std::thread, a single mutex, and a condition_variable, which have been mostly set in stone since C++11.
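
For reference, the worker side of such a pool essentially boils down to a loop like the following (a generic sketch of the pattern, not the actual ThreadPool code in the branch):

    #include <condition_variable>
    #include <functional>
    #include <mutex>
    #include <queue>

    // Generic sketch of a pool worker: sleep on the condition_variable until there
    // is work or the pool is shutting down, then run tasks with the lock released.
    void worker_loop(std::queue<std::function<void()>>& tasks,
                     std::mutex& mutex,
                     std::condition_variable& condition,
                     const bool& stopping)
    {
        std::unique_lock<std::mutex> lock(mutex);
        while (true)
        {
            condition.wait(lock, [&]() { return stopping || ! tasks.empty(); });
            if (tasks.empty())
            {
                return; // 'stopping' was set and no work is left.
            }
            std::function<void()> task = std::move(tasks.front());
            tasks.pop();
            lock.unlock(); // Don't hold the queue lock while the task runs.
            task();
            lock.lock();
        }
    }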

On the other hand, I don't see the std::async implementations evolving quickly. They handle many concerns, like exception safety, TLS, etc., and thus have their own level of complexity and inertia.

> (Though part of what makes the linked file larger is a function that doesn't seem to be used directly by the linked code.)

I don't understand the part about linked code size. The compiler won't emit code for inline functions and methods in translation units that don't use them. To optimize code size, the setup and dismantling of the thread pool could be instantiated in its own unit.
Code using parallel_for will cause an instantiation to be emitted, along with a closure inlining the task and wrappers. If the monomorphization of parallel_for is unwanted, a second std::function indirection could be added to instantiate parallel_for in a single unit.
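
For illustration, that second indirection could look roughly like this (hypothetical names, not code from the branch):

    #include <cstddef>
    #include <functional>

    // Header side: a thin template wrapper that type-erases the loop body, so only
    // one real instantiation of the implementation exists.
    void parallel_for_impl(std::size_t first, std::size_t last,
                           const std::function<void(std::size_t)>& body);

    template<typename Body>
    void parallel_for(std::size_t first, std::size_t last, Body&& body)
    {
        parallel_for_impl(first, last, std::function<void(std::size_t)>(std::forward<Body>(body)));
    }

    // Single .cpp unit: the one place where the loop/pool machinery gets emitted.
    void parallel_for_impl(std::size_t first, std::size_t last,
                           const std::function<void(std::size_t)>& body)
    {
        for (std::size_t index = first; index < last; ++index) // Stand-in: hand off to the pool here.
        {
            body(index);
        }
    }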

> A more specific question is about this part of the code:
>
>         else
>         {   // Too many tasks queued: do work on the main thread
>             func(index, lock);
>         }
>
> This is placed in a for-loop that may run to full completion before any thread has had a chance to get going. That being the case: if one thread finishes before the entire list of parallel-for-each tasks is done, how will that thread ever pick up a new one? Won't this just result in the majority of the tasks (all except the first 4n, where n is the number of threads active in the pool) being executed on the main thread in typical scenarios? (Number of layers, or tree-support contact points, etc.)

The synchronizations and signaling ensure some level of fairness.

When there are more than 4n tasks, the loop will execute some of them on the main thread. For the duration of the task, func releases the main thread's lock on the queue's mutex, allowing workers to pop tasks from the queue. When func exits, the main thread reacquires the lock, enabling it to push new jobs. The workers waiting on an empty queue are notified by ThreadPool::push and should get priority on the mutex when the main thread releases the lock. If all workers are busy, they will very likely execute at least n tasks during two func executions of the main-thread loop (assuming similar run-times of the tasks).

In fact, this conditional branch is entirely optional. When removed, all the tasks are queued in one go, then the main thread becomes a worker and starts dequeuing tasks like any other thread. The only purpose of the branch is to avoid allocating too many closures.
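
Schematically, the submission side described above looks like this (a paraphrase with hypothetical names, not the code as it is in the branch):

    #include <condition_variable>
    #include <cstddef>
    #include <functional>
    #include <mutex>
    #include <queue>

    // Paraphrased sketch of the submission loop: enqueue closures while the queue
    // is short, otherwise run the task inline on the main thread with the queue
    // lock released, so workers can keep popping in the meantime.
    void submit_all(std::size_t task_count, std::size_t worker_count,
                    const std::function<void(std::size_t)>& body,
                    std::queue<std::function<void()>>& tasks,
                    std::mutex& mutex,
                    std::condition_variable& condition)
    {
        const std::size_t max_queued = 4 * worker_count;
        std::unique_lock<std::mutex> lock(mutex);
        for (std::size_t index = 0; index < task_count; ++index)
        {
            if (tasks.size() < max_queued)
            {
                tasks.push([body, index]() { body(index); });
                condition.notify_one(); // Wake a worker waiting on an empty queue.
            }
            else
            {   // Too many tasks queued: do work on the main thread.
                lock.unlock(); // Workers can pop from the queue while this runs.
                body(index);
                lock.lock();
            }
        }
        // (Not shown: the main thread then helps drain the queue / waits for completion.)
    }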

> It's probably good to document that, because the connection to the max-in-queue limit isn't explicit, since the extra optimization of running on the main thread when the thread-pool queue is full 'takes care' of that part as well.

Yes, I agree that comments should be more detailed on these points.

> Given our use cases (since the tasks are large, but expected to have roughly the same run-time if averaged over a large number of tasks), I think a better approach might be to divide the tasks into chunks of n, then just link all of those together

I considered implementing chunking as a way to queue fewer closures. I found that the added complexity wasn't really warranted, since I couldn't detect any overhead or contention on the mutex.
Chunking could also be used to limit the number of threads with the async implementation. This would still require over-provisioning threads to compensate for the loss of granularity.

> if task i * n finishes, it starts (i + 1) * n on the same thread

Maybe I don't get it, but as I understand it, this doesn't allow work redistribution between threads. If one stride of chunks finishes faster, that worker will have nothing left to do.

@rburema (Member, Author) commented Jan 19, 2022

Hi @Piezoid

There are three main things here I think:

  • I was originally misreading the code slightly, but your misfired email already set me straight, so the parts where you explain the code can be considered done now, I think.
  • Your PR is written with a certain future in mind, where we'd use this generally throughout the code, for completely different purposes. The currently proposed solution is written with the assumption that, when it comes to threading, any future use cases within the code base will be similar or small enough. (Though there's a chance your implementation will work better even under the latter assumption.) -- I was a bit on the fence here. You seem to have done solid work and it would have been a shame to throw away potential optimizations; on the other hand, this feels close to the 'premature optimization' that we're always warned about, in a way.
  • With 'complexity' I didn't mean so much big-O or other asymptotic-notation-related things -- I meant the complexity of the code, w.r.t. maintainability / readability. Though that shouldn't be that much of an issue with some added comments.

Also, keep in mind: ideally, I'd just like to use the 'parallel for' and other such constructs that conforming C++17 implementations should provide. That was in fact the original plan for this ticket. We'd probably have very little reason to use any self-rolled thread pool if that were the case. I bring this up because we might switch 'back' to that once it's provided.
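
For reference, this is the standard construct I mean, which conforming C++17 standard libraries should eventually provide (simplified, with a stand-in Layer type):

    #include <algorithm>
    #include <execution>
    #include <vector>

    struct Layer { /* stand-in for the engine's per-layer data */ };

    // The original plan: let the standard library decide how to parallelize the loop.
    void process_all_layers(std::vector<Layer>& layers)
    {
        std::for_each(std::execution::par, layers.begin(), layers.end(),
                      [](Layer& layer) { /* per-layer work, e.g. support generation */ });
    }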

All of that said, however (I had a quick think about how to do it differently, but you always end up with at least some kind of thread-safe queue or vector, and at that point, why not go the full thread-pool approach?) -- I think the potential benefits of your solution outweigh any negatives I can bring to bear. -- Especially since you said you already parallelized the rest? That would take care of a ticket for us 😁

So, please make the PR, so we can see if it works for all platforms 🙂

P.S.

> Maybe I don't get it, but as I understand it, this doesn't allow work redistribution between threads.

No, you do get it. That is a downside of that approach; I just thought it might be an acceptable one. (Again, given our current use cases...)

P.P.S. I know it's really small and a bit silly to point out, but it's been bothering me every time I reply to this: 'until' is spelled with only one 'l'.
