API Example / walkthrough for Parallelism #413
The `rtcOccluded1M` and `rtcOccluded1` functions both execute on a single core, thus `rtcOccluded1M` does not parallelize by itself.
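Since each stream call is single-threaded, the usual pattern is to split the batch into chunks and distribute the chunks across cores yourself. A minimal sketch of that pattern, with the Embree call replaced by a hypothetical stub `occludedStream` (a toy rule: origin y < 0 counts as occluded) so the snippet is self-contained; in real code this call would be `rtcOccluded1M` on your scene:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Stub standing in for rtcOccluded1M, which processes one stream of rays on
// a single core. Toy rule: a ray is "occluded" if its origin has y < 0.
static void occludedStream(const float* origins, int* occluded, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        occluded[i] = origins[3 * i + 1] < 0.0f ? 1 : 0;
}

// Parallelize across chunks of the batch; each chunk is one stream call.
std::vector<int> traceAll(const std::vector<float>& origins) {
    const std::size_t numRays = origins.size() / 3;
    const std::size_t chunk = 256;  // stream size per call; tune per machine
    std::vector<int> occluded(numRays, 0);
    const long long numChunks = (long long)((numRays + chunk - 1) / chunk);

    // Chunks are independent, so this loop can be spread across cores even
    // though each rtcOccluded1M-style call is single-threaded.
    #pragma omp parallel for schedule(dynamic)
    for (long long c = 0; c < numChunks; ++c) {
        const std::size_t begin = (std::size_t)c * chunk;
        const std::size_t n = std::min(chunk, numRays - begin);
        occludedStream(&origins[3 * begin], &occluded[begin], n);
    }
    return occluded;
}
```

The chunk size of 256 here is only a guess carried over from the discussion, not a recommendation from the Embree documentation.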
Hi @svenwoop, thanks for the fast reply. It was hard to recognize what 'parallelizes properly' meant, since all cores are doing work but it is slower. I found the chunk size was one issue, changing
Perhaps it's not possible to know, but part of what I was hoping to understand is what the expectation is for overhead and scheduling. For example, I just guessed at a 128/256 chunk size, but didn't want to hardcode something that ends up behaving very poorly in cases I haven't tested. Similarly, what is the expected performance increase when scaling? I am finding the challenge is a lack of understanding of the expected performance differences when scaling, and of how to refine the parallelization and the code itself (finding whether I am doing something inefficient in C++).
The parallelization should scale linearly with the number of cores when hyperthreading is disabled. When hyperthreading is enabled, the last half of the threads will give lower performance gains; you should expect about a 0.8 * #threads speedup in that case. A problem in your case could be that the threads have not started yet. Best run the benchmark in a loop: the first execution could be slow if the threads did not yet start. `rtcOccluded1M` will give about the same performance as the packet traversal algorithms, `rtcIntersect4/8/16`. These packet algorithms are about 2x faster for coherent workloads (primary rays) but perform similarly to `rtcIntersect1` for incoherent rays.
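The "run the benchmark in a loop" advice can be captured in a small timing harness: repeat the same workload and compare iterations, since the first run often pays for thread-pool start-up. This is a generic sketch; `traceBatch` and `benchmarkMs` are hypothetical names, and the workload is a placeholder you would replace with your chunked occlusion loop:

```cpp
#include <chrono>
#include <cstddef>
#include <vector>

// Placeholder workload; substitute your chunked parallel occlusion loop here.
static void traceBatch(std::vector<float>& rays) {
    #pragma omp parallel for
    for (long long i = 0; i < (long long)rays.size(); ++i)
        rays[i] = rays[i] * 0.5f + 1.0f;
}

// Time the same workload several times; the first iteration often includes
// thread start-up cost, so judge performance from the later runs.
std::vector<double> benchmarkMs(int iters, std::size_t numRays) {
    std::vector<float> rays(numRays, 1.0f);
    std::vector<double> timings;
    for (int it = 0; it < iters; ++it) {
        auto t0 = std::chrono::steady_clock::now();
        traceBatch(rays);
        auto t1 = std::chrono::steady_clock::now();
        timings.push_back(
            std::chrono::duration<double, std::milli>(t1 - t0).count());
    }
    return timings;
}
```

Comparing `timings[0]` against the median of the rest separates one-time start-up cost from steady-state throughput.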
Hi @svenwoop, I am having some issues optimizing the parallel loop for the stream method. The performance test is set up the same as the use case, which is calling it once on a batch of rays. Also, I apologize if this is some really simple C++ issue. In trying to reduce the overhead of vector resizing per core, I ran into a memory access issue with the parallel version using
I then modified this to use
However, that code is about the same speed for me (500,000 rays): where the first code block runs in 2.1 ms, this stream one takes 2.5 ms. I thought it may be an issue with creating a new vector and resizing it for each chunk, so I pulled the ray vector out of the loop, since each index is independent. However, this is 10x slower (24.6 ms):
So the issue does seem to be related to how the rays are stored and accessed in each thread. I tried both creating a ray vector inside the parallel loop and simply indexing the chunks. Both are still slower than using
How many rays are you tracing in total? The chunk size of 256 is sufficiently large. The second version creates a probably large array, thus you might just be benchmarking memory allocation times. The first case is not optimal either, as it does the allocation inside the loop. You can try to use a simple array inside the loop to store the ray stream, i.e. just `RTCRay rays[256]`, which will get allocated on the stack.
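The stack-array suggestion looks roughly like the following sketch. `Ray`, `occludedStreamStub`, and `traceWithStackChunks` are stand-ins I introduce for illustration (the real code would use Embree's `RTCRay` and `rtcOccluded1M`); the point is the fixed-size per-chunk buffer that lives on each thread's stack, avoiding both per-chunk heap allocation and a shared vector:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Stand-in for Embree's RTCRay with only the fields this sketch needs.
struct Ray { float org[3]; float dir[3]; float tfar; };

// Stub for the single-core stream call; toy rule: a ray whose origin has
// y < 0 is marked occluded by setting tfar negative.
static void occludedStreamStub(Ray* rays, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        if (rays[i].org[1] < 0.0f) rays[i].tfar = -1.0f;
}

std::vector<int> traceWithStackChunks(const std::vector<float>& origins,
                                      const float dir[3]) {
    const std::size_t numRays = origins.size() / 3;
    const std::size_t chunk = 256;
    std::vector<int> result(numRays, 0);
    const long long numChunks = (long long)((numRays + chunk - 1) / chunk);

    #pragma omp parallel for schedule(dynamic)
    for (long long c = 0; c < numChunks; ++c) {
        Ray rays[256];  // per-thread stack buffer: no heap traffic, no sharing
        const std::size_t begin = (std::size_t)c * chunk;
        const std::size_t n = std::min(chunk, numRays - begin);
        for (std::size_t i = 0; i < n; ++i) {
            const std::size_t r = begin + i;
            rays[i] = Ray{{origins[3*r], origins[3*r+1], origins[3*r+2]},
                          {dir[0], dir[1], dir[2]}, 1e30f};
        }
        occludedStreamStub(rays, n);
        for (std::size_t i = 0; i < n; ++i)
            result[begin + i] = rays[i].tfar < 0.0f ? 1 : 0;
    }
    return result;
}
```

Because each thread writes only its own stack buffer and its own disjoint slice of `result`, this also avoids the false sharing that a vector pulled outside the loop can suffer from.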
I am casting 500k rays (I'm not tracing anything after the occlusion check). The end use is anywhere from 10 to a few million rays; I think typical use is 10k-1M. I replaced the vector with
Any other ideas?
Reaching only the same performance as the non-streamed version is likely because your rays are not sufficiently coherent. The stream API only gives performance benefits when rays start at a similar location and go in a similar direction, e.g. primary camera rays. It looks like in your use case you are best off just going with the non-stream approach.
Maybe I misunderstood coherent rays; I thought coherent rays have the same direction but can have different origins? In my code above, you can see it is the same direction that is always being passed (it is determined only once outside the loop); only the origin changes.
The rays should best trace a similar region in space, thus ideally they have both similar origins and similar directions.
Ah! Okay, well perhaps that is the real issue. Thanks for all the quick feedback. I hope this thread will at least help some people understand it, and provide some different sample code for OpenMP.
Is there, or would it be possible to make, an explanation walkthrough for the API examples (similar to what oneAPI has now) for parallel intersect/occlusion?
I saw this post about OpenMP (#207) and this one on TBB (#301).
However, I am a little confused about what is parallelized across cores and what is SIMD on a single core. In my own project I have 4 cases (2x2): two using `rtcOccluded1M` and two with `rtcOccluded1`; in both groups there is a version with `pragma omp` and one without. I am finding that not using `omp` is faster in both cases, with 10 million rays. I have (x2) Xeon E5-2630 CPUs (20 cores) @ 2.20 GHz. I assume there is some basic understanding missing from what is going on, so I was trying to find a nice walkthrough explanation that could guide how to properly balance these things (especially as the other posts about parallelism didn't mention anything about this performance).