parallel some more #522
It seems to me that the CI somehow stalled, but I have no idea why (I don't think the way I use the mutex can cause a deadlock?).
Maybe enqueue can cause a deadlock if it is itself the sole worker thread... I guess I should use task_group instead.
It seems that face reordering causes
Thinking about it, it should be possible to parallelize this without reordering. Will try it tomorrow.
Codecov Report
Patch coverage:
Additional details and impacted files
@@           Coverage Diff           @@
##           master     #522   +/-   ##
=======================================
  Coverage   90.36%   90.37%
=======================================
  Files          35       35
  Lines        4433     4456    +23
=======================================
+ Hits         4006     4027    +21
- Misses        427      429     +2
View full report in Codecov by Sentry.
this avoids potential stack overflow and reduces allocation calls
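The commit note above is about replacing recursion with an explicit stack and a reusable scratch buffer. A minimal sketch of that general idea in standard C++ follows. The Node type and the SumLeaves function are hypothetical names made up for illustration; they are not taken from the Manifold sources, and the real traversal in the PR may look quite different.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical tree node for illustration only (not a Manifold type).
// Children are stored as indices into one flat node array.
struct Node {
  int value = 0;
  std::vector<std::size_t> children;
};

// Sums the values of all leaves reachable from `root` without recursion.
// `stack` is a caller-owned scratch buffer: reusing it across calls keeps its
// capacity, so repeated traversals do not allocate again, and the traversal
// depth is limited by heap memory rather than by the thread's call stack.
long SumLeaves(const std::vector<Node>& nodes, std::size_t root,
               std::vector<std::size_t>& stack) {
  assert(root < nodes.size());
  stack.clear();  // clear() keeps the capacity from earlier calls
  stack.push_back(root);
  long sum = 0;
  while (!stack.empty()) {
    const std::size_t index = stack.back();
    stack.pop_back();
    const Node& node = nodes[index];
    if (node.children.empty()) {
      sum += node.value;  // leaf: accumulate its value
      continue;
    }
    // Inner node: defer its children instead of recursing into them.
    for (std::size_t child : node.children) stack.push_back(child);
  }
  return sum;
}
```

A caller would keep one scratch vector alive and pass it to every SumLeaves call, which is where the reduction in allocation calls would come from in this sketch.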
Admittedly I am getting a bit crazy now: I was seeing so many optimization opportunities that I could hardly sleep without implementing them. Samples.Sponge4 now runs in 2600ms with tbb and MANIFOLD_DEBUG=off (it took 4000ms previously), and the entire test suite completes in 4.5s. The perfTest with tbb:
The segfault is weird: it doesn't seem to me that the collider update does anything that would cause a segfault.
@elalish is it required that the collider output order agrees with the query input order? E.g., say the query is [a, b], a collides with [1, 2], and b collides with [3, 4]; can we output [3, 4, 1, 2]?
I found the issue: the CsgOpNode children are somehow empty in some runs, but I cannot find why (I tried adding checks everywhere and only see them empty when we call GetChildren). Perhaps there is some issue with the way I use thread local. Anyway, I am removing the collider optimization, which seems to be causing the issue (but I have no idea why...).
@elalish this should be ready now.
This looks great! Remind me, which of our platforms is TBB not available on? If it's only WASM and that's in progress, then we should put in a TODO to remove the old code when that's ready.
It is only unavailable on WASM. Yes, we can mark the old code as about to be removed.
Thanks! If the WASM compiles with TBB and isn't worse than without it, I'd say we can go ahead and remove the alternate code paths. Take a look at the effect on binary size too, just in case.
I will check that. My main concern is that the PR is still under review and tbb with WASM is not well tested, so I am afraid that it may not be stable.
Weird, it seems that this somehow causes the Windows build to fail (when tbb is enabled).
* parallel some more
* use std::mutex
* fix compilation error
* check if max concurrency > 1
* use task_group
* disable face reordering
* preserve face order
* comments
* fix cuda build
* fix meshid not found situation
* use explicit stack and scratch buffer
  this avoids potential stack overflow and reduces allocation calls
* faster collider
* please cuda
* missing commit
* remove collider optimization
* dedup face_op.cpp
* dedup boolean_result.cpp
* remove ambiguous comment
* include array
Use tbb for some optimizations directly:
* std::async will create a thread for every invocation, which is too much overhead, so tbb thread_arena is needed.
* AddNewEdgeVerts. This requires concurrent_map as we will be obtaining elements concurrently.
Combining these two optimizations, we can cut down the running time for Samples.Sponge4 from ~4s to ~3s. When MANIFOLD_DEBUG=on, the running time for Samples.Sponge4 is reduced from ~7s to ~3.7s. The main bottleneck for now should be simplification, which takes about 1s.
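The two points in the description (avoiding one thread per std::async call, and needing a map that tolerates concurrent access) can be sketched with the standard library alone. This is not the TBB-based code from the PR. It assumes a fixed set of std::thread workers as a rough stand-in for a TBB arena and a mutex-guarded std::map as a rough stand-in for a concurrent map. LockedMap and ParallelSquares are hypothetical names introduced only for this example.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <map>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical stand-in for a concurrent map: every access takes one lock.
// A real concurrent container would typically allow more parallelism.
class LockedMap {
 public:
  void Insert(int key, long value) {
    std::lock_guard<std::mutex> lock(mutex_);
    map_[key] = value;
  }
  // Returns true and writes the value to `out` if `key` is present.
  bool Find(int key, long& out) const {
    std::lock_guard<std::mutex> lock(mutex_);
    auto it = map_.find(key);
    if (it == map_.end()) return false;
    out = it->second;
    return true;
  }
  std::size_t Size() const {
    std::lock_guard<std::mutex> lock(mutex_);
    return map_.size();
  }

 private:
  mutable std::mutex mutex_;
  std::map<int, long> map_;
};

// Stores key*key for every key in [0, n) using a fixed number of workers.
// Workers claim keys from a shared atomic counter, so the number of threads
// stays bounded however many items there are (unlike one thread per item).
void ParallelSquares(int n, unsigned numWorkers, LockedMap& result) {
  assert(numWorkers > 0);
  std::atomic<int> next{0};
  std::vector<std::thread> workers;
  workers.reserve(numWorkers);
  for (unsigned w = 0; w < numWorkers; ++w) {
    workers.emplace_back([&] {
      for (int key = next.fetch_add(1); key < n; key = next.fetch_add(1)) {
        result.Insert(key, static_cast<long>(key) * key);
      }
    });
  }
  for (auto& worker : workers) worker.join();
}
```

Calling ParallelSquares(1000, 4, map) fills the map with 1000 entries from four threads. The single lock keeps the sketch simple but serializes every insert, which is one reason a purpose-built concurrent container may be preferable in practice.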