[CURA-8573] Replace OpenMP with native async for Support Generation #1524
Conversation
Since not all compilers implement the full C++17 standard yet (specifically, the Parallel Execution TS hasn't been added to their respective standard libraries), this replaces OpenMP with native async for the support-generation code. Done as part of CURA-8573.
Similarly, I've re-implemented […]
Hi @Piezoid, We're going to replace all OpenMP instances with native threading, as it's a world of hassle to get OpenMP working on Mac (a sizeable chunk of our user-base). As a consequence, none of our Mac builds were using threading in the engine. In order not to be completely overwhelmed, and to split the programming into more predictable, reviewable chunks of work, we decided to replace a small chunk first, to test the waters. I suppose that does mean we're interested, though! :-) So I think that answers your question: we're eventually just going to use native threading over OpenMP.
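For readers skimming the thread: the kind of replacement being discussed looks roughly like the following. This is a minimal sketch, not the actual CuraEngine code; `Layer`, `process`, and `processLayers` are stand-in names.

```cpp
#include <future>
#include <vector>

struct Layer { /* stand-in for per-layer data */ };
void process(Layer&) { /* stand-in for the per-layer support work */ }

void processLayers(std::vector<Layer>& layers)
{
    // Before (OpenMP): `#pragma omp parallel for` over the layer index.
    // After: one std::async task per layer, joined through futures.
    std::vector<std::future<void>> futures;
    futures.reserve(layers.size());
    for (Layer& layer : layers)
    {
        futures.push_back(std::async(std::launch::async, [&layer]() { process(layer); }));
    }
    for (std::future<void>& future : futures)
    {
        future.get(); // rethrows any exception raised inside the task
    }
}
```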
Hi @rburema, I have some reservations about using std::async for this. I made a branch that uses a minimalistic thread pool owned by the […]. Edit: quick benchmark on a 50s slice with 24 threads on Linux (average of 4 runs): […]
If you're OK with this approach, I will make a PR and do a bit more cleanup.
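For context, a minimalistic mutex-and-condition-variable pool along the general lines being proposed might look like this. This is a generic sketch under assumptions, not the code in the linked branch; all names are illustrative.

```cpp
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

class ThreadPool
{
public:
    explicit ThreadPool(size_t thread_count)
    {
        for (size_t i = 0; i < thread_count; ++i)
        {
            workers.emplace_back([this]() { work(); });
        }
    }

    ~ThreadPool()
    {
        {
            std::lock_guard<std::mutex> lock(mutex);
            stopping = true;
        }
        condition.notify_all();
        for (std::thread& worker : workers)
        {
            worker.join();
        }
    }

    void push(std::function<void()> task)
    {
        {
            std::lock_guard<std::mutex> lock(mutex);
            tasks.push(std::move(task));
        }
        condition.notify_one();
    }

private:
    void work() // each worker loops: wait for a task, run it, repeat
    {
        while (true)
        {
            std::function<void()> task;
            {
                std::unique_lock<std::mutex> lock(mutex);
                condition.wait(lock, [this]() { return stopping || !tasks.empty(); });
                if (stopping && tasks.empty())
                {
                    return; // queue drained and pool shutting down
                }
                task = std::move(tasks.front());
                tasks.pop();
            }
            task(); // run outside the lock
        }
    }

    std::vector<std::thread> workers;
    std::queue<std::function<void()>> tasks;
    std::mutex mutex;
    std::condition_variable condition;
    bool stopping = false;
};
```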
Hi @Piezoid! Sorry for the slow response.
Thanks for the link to the analysis! That's on me for just assuming these sort-of higher-level calls have all of that sorted out under the hood (like MSVC apparently already does -- though reading up on that, this may make them technically not standard-compliant). The only part that used […].

Since there are no thread pools used (internally), I think the general idea of limiting the number of threads is a decent/good one.

The general problem I have with this is: this (as in, my original PR) was tested against the code the way it was before, and it already seems to be a little improvement over what the (ancient version of) OpenMP was doing. At the very least, not worse. Possibly because each of the many tasks that are launched is quite large in and of itself (even though there may be many of them). I suspect the overhead of spawning a new thread, compared to what the tasks are actually doing, is in most cases negligible.

My point here is: is the increase in complexity worth it, especially since C++ threading is a subject where lots of things are still changing? (Though part of what makes the linked file larger is a function that doesn't seem to be used directly by the linked code.)

A more specific question is about this part of the code: […]
This is placed in a for-loop that may run to full completion before any thread has had a chance to get going. That being the case: if one thread finishes before the entire list of parallel-for-each tasks is done, how will that thread ever pick up a new one? Won't this just result in the majority of the tasks (all except the first […]) being run on the main thread? Given our use-cases (the tasks are large, but expected to have roughly the same run-time when averaged over a large number of tasks), I think a better approach might be to divide the tasks into chunks of […].
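The chunked scheme described above could look roughly like the following sketch (illustrative names only; `num_threads` is assumed to be at least 1):

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Each thread gets one contiguous chunk, computed up front.
void runInChunks(const std::vector<std::function<void()>>& tasks, size_t num_threads)
{
    const size_t chunk_size = (tasks.size() + num_threads - 1) / num_threads;
    std::vector<std::thread> threads;
    for (size_t begin = 0; begin < tasks.size(); begin += chunk_size)
    {
        const size_t end = std::min(begin + chunk_size, tasks.size());
        threads.emplace_back([&tasks, begin, end]()
        {
            for (size_t i = begin; i < end; ++i)
            {
                tasks[i](); // a thread that finishes its chunk early sits idle
            }
        });
    }
    for (std::thread& thread : threads)
    {
        thread.join();
    }
}
```

This trades scheduling flexibility for simplicity, which is the trade-off debated below.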
Looking at the rest of the code, I think we don't have that many small work items.
Ah yes, you're right. (If we go through with this) it's probably good to document that, because the connection to the max-in-queue isn't explicit, since the extra optimization of running on the main thread when the thread-pool queue is full 'takes care' of that part as well. OK, I think the added complexity will be worth it if the speedup is significant (at least a few percent or so) compared to my original approach. Ideally, I'd like to have that tested, because of the (hopefully negligible) overhead you added.
Hi @rburema! Sorry, I misfired a reply. Apparently you got the email 😉
I agree that, at least for the support code converted to std::async, […]. My opinion is that it is wasteful to spawn as many threads as there are layers, multiple times during a slice, and then let the OS do the M:N scheduling. I confess it is not backed by much data for this specific use case.
I can try to quantify the difference by comparing the […].
I implemented […]. On the other hand, I don't see the […].
I don't understand the part about the linked code size. The compiler won't emit code for inline functions and methods in translation units that don't use them. To optimize code size, the setup and dismantling of the thread pool could be instantiated in its own translation unit.
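Concretely, that could mean keeping only declarations in the header and defining the constructor and destructor out-of-line, so the pool machinery is emitted once. A hypothetical layout, reusing the generic `ThreadPool` sketch from earlier:

```cpp
// ThreadPool.h -- declarations only; no worker-loop code to inline here.
#include <cstddef>
#include <functional>

class ThreadPool
{
public:
    explicit ThreadPool(size_t thread_count); // defined in ThreadPool.cpp
    ~ThreadPool();                            // defined in ThreadPool.cpp
    void push(std::function<void()> task);
    // ... queue, mutex, workers as in the earlier sketch ...
};

// ThreadPool.cpp -- the only translation unit that emits setup/teardown.
ThreadPool::ThreadPool(size_t thread_count) { /* spawn the workers */ }
ThreadPool::~ThreadPool() { /* signal stop, then join the workers */ }
```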
The synchronization and signaling ensure some level of fairness. When there are more than […]. In fact, this conditional branch is entirely optional: when removed, all the tasks are queued in one go, then the main thread becomes a worker and starts dequeuing tasks like any other thread. The only purpose of the branch is to avoid allocating too many closures.
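Sketched in code (this builds on the generic pool from earlier; `queuedTaskCount()` is an assumed accessor, and the whole snippet is illustrative rather than the branch's actual code):

```cpp
#include <cstddef>
#include <utility>

template<typename Task>
void pushOrRunInline(ThreadPool& pool, Task&& task, size_t max_in_queue)
{
    // Heuristic: if the queue is already saturated, the producing (main)
    // thread runs the closure itself rather than allocating and queuing
    // another one. This throttles the producer and bounds the number of
    // closures alive at any time.
    if (pool.queuedTaskCount() >= max_in_queue)
    {
        task();
    }
    else
    {
        pool.push(std::forward<Task>(task));
    }
}
```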
Yes, I agree that comments should be more detailed on these points.
I considered implementing chunking as a way to queue fewer closures, but I found that the added complexity wasn't really warranted, since I couldn't detect any overhead or contention on the mutex.
Maybe I don't get it, but as I understand it, this doesn't allow work redistribution between threads: if one stride of chunks finishes faster, its worker will have nothing left to do.
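For contrast, a dynamic scheme redistributes work automatically, for example by having the threads pull indices from a shared atomic counter (an illustrative sketch, not the branch's code):

```cpp
#include <atomic>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

void runDynamically(const std::vector<std::function<void()>>& tasks, size_t num_threads)
{
    std::atomic<size_t> next{0};
    std::vector<std::thread> threads;
    for (size_t t = 0; t < num_threads; ++t)
    {
        threads.emplace_back([&tasks, &next]()
        {
            // Whichever thread is free claims the next task, so no thread
            // sits idle while work remains.
            for (size_t i = next.fetch_add(1); i < tasks.size(); i = next.fetch_add(1))
            {
                tasks[i]();
            }
        });
    }
    for (std::thread& thread : threads)
    {
        thread.join();
    }
}
```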
Hi @Piezoid, There are three main things here, I think: […]
Also, keep in mind: ideally, I'd just like to use the 'parallel for' and other such constructs that conforming C++17 implementations should provide. That was in fact the original plan for this ticket. We'd probably have very little reason to use any self-rolled thread pool if that were the case. I bring this up because we might switch 'back' to that once it's provided (see the sketch after this message). All of that said, however (I had a quick think about how to do it differently, but you always end up with at least some kind of thread-safe queue or vector, and at that point, why not go the full thread-pool approach?) -- I think the potential benefits of your solution outweigh any negatives I can bring to bear. Especially since you said you already parallelized the rest? That would take care of a ticket for us 😁 So, please make the PR, so we can see if it works for all platforms 🙂

P.S.
No, you do get it. That is a downside of that approach; I just thought it might be an acceptable one. (Again, given our current use cases...)

P.P.S. I know it's really small and a bit silly to point out, but it's been bothering me every time I reply to this: 'until' is spelled with only one 'l'.
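For reference, the conforming-C++17 route mentioned in the message above (the ticket's original plan) would reduce to a call to the standard parallel algorithms once standard libraries ship them; `Layer` and `process` are stand-ins as before:

```cpp
#include <algorithm>
#include <execution>
#include <vector>

struct Layer { /* stand-in for per-layer data */ };
void process(Layer&) { /* stand-in for the per-layer work */ }

void processLayers(std::vector<Layer>& layers)
{
    // No self-rolled pool: the standard library schedules the work.
    std::for_each(std::execution::par, layers.begin(), layers.end(),
                  [](Layer& layer) { process(layer); });
}
```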
Needed for Mac support of multi-threading.