API Example / walkthrough for Parallelism #413

Closed
cadop opened this issue Nov 29, 2022 · 10 comments

cadop commented Nov 29, 2022

Is there, or would it be possible to make, a walkthrough explaining the API examples (similar to what oneAPI has now) for parallel intersect/occlusion?

I saw this post about OpenMP (#207) and this one on TBB (#301).

However, I am a little confused about what is parallelized across cores and what is SIMD on a single core. In my own project I have 4 cases (2x2): two using rtcOccluded1M and two using rtcOccluded1, and in each group there is a version with a #pragma omp and one without. I am finding that not using OpenMP is faster in both cases, with 10 million rays. I have 2x Xeon E5-2630 CPUs (20 cores) @ 2.20 GHz.

I assume I am missing some basic understanding of what is going on, so I was trying to find a walkthrough that explains how to properly balance these things (especially as the other posts about parallelism didn't mention anything about this performance behavior).

@svenwoop (Contributor)

The rtcOccluded1M and rtcOccluded1 functions both execute on a single core, so rtcOccluded1M does not parallelize by itself.
The version parallelized using OpenMP should of course be faster, so I assume the pragmas are being used incorrectly. Please first check whether your code parallelizes properly without Embree. Do you mean #pragma omp parallel instead of #pragma omp?
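
A minimal sketch of what that looks like, assuming Embree 3, a committed scene, and rays built elsewhere (function and variable names here are illustrative, not from this project):

#include <embree3/rtcore.h>
#include <omp.h>
#include <vector>
#include <cmath>

// Sketch: parallelize single-ray occlusion queries across cores with OpenMP.
// Embree itself does not spawn threads for rtcOccluded1/rtcOccluded1M.
void occluded_parallel(RTCScene scene, std::vector<RTCRay>& rays, std::vector<char>& hits)
{
  hits.resize(rays.size());
#pragma omp parallel for schedule(dynamic, 256)
  for (int i = 0; i < (int)rays.size(); i++) {
    RTCIntersectContext context;
    rtcInitIntersectContext(&context);       // per-thread context
    rtcOccluded1(scene, &context, &rays[i]); // runs entirely on the calling thread
    hits[i] = (rays[i].tfar == -INFINITY);   // tfar is set to -inf when the ray is occluded
  }
}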

cadop (Author) commented Nov 30, 2022

Hi @svenwoop, thanks for the fast reply.

It was hard to tell what 'parallelizes properly' meant, since all cores are doing work, but it is slower. I found the chunk size was one issue: changing
#pragma omp parallel for if(use_parallel) schedule(dynamic) to #pragma omp parallel for if(use_parallel) schedule(dynamic, 256) went from ~1400 ms to 4 ms for my test case.

{
  out_array.resize(origins.size());
  const auto& direction = directions[0];
  context.flags = RTC_INTERSECT_CONTEXT_FLAG_COHERENT;

#pragma omp parallel for if(use_parallel) schedule(dynamic, 256)
  for (int i = 0; i < (int)origins.size(); i++)
  {
    const auto& origin = origins[i];
    out_array[i] = Occluded_IMPL(origin[0], origin[1], origin[2],
                                 direction[0], direction[1], direction[2],
                                 max_distance, -1);
  }
  context.flags = RTC_INTERSECT_CONTEXT_FLAG_NONE; // reset context
}

bool EmbreeRayTracer::Occluded_IMPL(float x, float y, float z, float dx, float dy, float dz, float distance, int mesh_id)
{
  auto ray = ConstructRay(x, y, z, dx, dy, dz, distance);
  rtcOccluded1(scene, &context, &ray);
  return ray.tfar == -INFINITY; // Embree sets tfar to -inf when the ray is occluded
}

Perhaps it's not possible to know, but part of what I was hoping to understand is what the expectation is for overhead and scheduling. For example, I just guessed at a chunk size of 128/256, but didn't want to hardcode something that ends up behaving very poorly in cases I haven't tested.
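
One possible way to avoid hardcoding the chunk size is to derive it from the ray count and thread count; the constants below are only an assumed heuristic, not an Embree recommendation:

#include <omp.h>
#include <algorithm>
#include <cstddef>

// Assumed heuristic: aim for several chunks per thread so dynamic scheduling
// can balance load, but keep chunks large enough that loop overhead stays small.
static int pick_chunk_size(std::size_t n_rays)
{
  const std::size_t threads = (std::size_t)omp_get_max_threads();
  const std::size_t target  = n_rays / (threads * 8);      // roughly 8 chunks per thread
  return (int)std::clamp<std::size_t>(target, 64, 4096);   // clamp to a sane range
}

// usage: int chunk = pick_chunk_size(origins.size());
//        #pragma omp parallel for schedule(dynamic, chunk)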

Similarly, what is the expected performance increase for rtcOccluded1M, and when should it be seen? e.g. if I use rtcOccluded1 on 128 rays in a loop, I would assume rtcOccluded1M is faster, but by how much? What I mean is, does performance improve as the batch size increases? If I have 1000 rays, is it that rtcOccluded1 takes X ms and rtcOccluded1M takes X/2 ms, but if I parallelize across 10 cores, is the difference between rtcOccluded1 with 100 rays and rtcOccluded1M with 100 rays more negligible?

The challenge I am finding is a lack of understanding of the expected performance differences when scaling, combined with refining the parallelization and the code itself (working out whether I am doing something inefficient in C++).

svenwoop (Contributor) commented Dec 1, 2022

The parallelization should scale linearly with the number of cores when hyperthreading is disabled. When hyperthreading is enabled, the second half of the threads will give lower performance gains; you should expect about a 0.8 * #threads speedup. A problem in your case could be that the worker threads have not started yet. It is best to run the benchmark in a loop, as the first execution can be slow if the threads have not yet started.
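
A minimal timing harness along those lines, where BatchOccluded is a placeholder for whatever batched call is being measured:

#include <chrono>
#include <cstdio>

// Run the benchmark several times; the first iteration often pays for OpenMP
// thread start-up and first-touch page faults, so treat it as a warm-up.
for (int run = 0; run < 5; run++) {
  auto t0 = std::chrono::steady_clock::now();
  BatchOccluded(origins, directions, max_distance, out_array, /*use_parallel=*/true);
  auto t1 = std::chrono::steady_clock::now();
  double ms = std::chrono::duration<double, std::milli>(t1 - t0).count();
  std::printf("run %d: %.2f ms%s\n", run, ms, run == 0 ? " (warm-up)" : "");
}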

rtcOccluded1M will give about the same performance as the packet traversal functions rtcIntersect4/8/16. These packet algorithms are about 2x faster for coherent workloads (e.g. primary rays) but perform similarly to rtcIntersect1 for incoherent rays.
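
For reference, a minimal sketch of the packet path mentioned here, assuming Embree 3 and a committed scene (rtcOccluded4 takes a structure-of-arrays ray packet plus a validity mask):

#include <embree3/rtcore.h>
#include <cmath>

// Trace 4 occlusion rays at once with the packet API.
void occluded_packet4(RTCScene scene, const float org[4][3], const float dir[3],
                      float max_dist, bool out_hit[4])
{
  RTCIntersectContext context;
  rtcInitIntersectContext(&context);

  alignas(16) int valid[4] = { -1, -1, -1, -1 }; // -1 = lane active, 0 = lane disabled
  RTCRay4 ray;                                   // structure-of-arrays packet of 4 rays
  for (int i = 0; i < 4; i++) {
    ray.org_x[i] = org[i][0]; ray.org_y[i] = org[i][1]; ray.org_z[i] = org[i][2];
    ray.dir_x[i] = dir[0];    ray.dir_y[i] = dir[1];    ray.dir_z[i] = dir[2];
    ray.tnear[i] = 0.0f;      ray.tfar[i]  = max_dist;
    ray.time[i]  = 0.0f;      ray.mask[i]  = (unsigned)-1;
    ray.id[i]    = i;         ray.flags[i] = 0;
  }

  rtcOccluded4(valid, scene, &context, &ray);

  for (int i = 0; i < 4; i++)
    out_hit[i] = (ray.tfar[i] == -INFINITY);     // tfar set to -inf on occlusion
}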

cadop (Author) commented Dec 6, 2022

Hi @svenwoop, I am having some trouble optimizing the parallel loop for the stream method.

The performance test is set up the same as the use case, which is calling it once on a batch of rays. Also, I apologize if this is some really simple C++ issue.

In trying to reduce the overhead of vector resizing per core, I ran into a memory access issue with rtcOccluded1M.

Parallel version using rtcOccluded1:

	out_array.resize(origins.size());
	const auto& direction = directions[0];

#pragma omp parallel for if(use_parallel) schedule(dynamic, 256)
	for (int i = 0; i < (int)origins.size(); i++) {
		const auto& origin = origins[i];
		auto ray = ConstructRay(origin[0], origin[1], origin[2],
		                        direction[0], direction[1], direction[2], max_distance);
		rtcOccluded1(scene, &context, &ray);
		out_array[i] = (ray.tfar == -INFINITY);
	}

I then modified this to use rtcOccluded1M, which, as I understood it, needs at least as many rays as a packet, but 256 should give roughly 2x performance?

	out_array.resize(origins.size());
	const auto& direction = directions[0];

	const int chunks = 256;

#pragma omp parallel for if(use_parallel) schedule(dynamic)
	for (int start = 0; start < (int)origins.size(); start += chunks) {

		const int end = std::min(start + chunks, (int)origins.size());
		std::vector<RTCRay> rays(end - start); // per-chunk ray buffer, allocated inside the loop

		// build the ray stream for this chunk
		for (int i = 0; i < end - start; i++) {
			const auto& origin = origins[start + i];
			rays[i] = ConstructRay(origin[0], origin[1], origin[2],
			                       direction[0], direction[1], direction[2], max_distance);
		}

		// trace the whole chunk with the stream API
		rtcOccluded1M(scene, &context, rays.data(), (unsigned)rays.size(), sizeof(RTCRay));

		for (int i = 0; i < (int)rays.size(); i++) {
			out_array[start + i] = (rays[i].tfar == -INFINITY);
		}
	}

However, that code is about the same speed for me (500,000 rays): the first code block runs in 2.1 ms, while this stream version takes 2.5 ms.

I thought it may be an issue with creating a new vector and resizing it for each chunk, so I pulled the ray vector out of the loop since each index is independent. However, this is 10x slower (24.6 ms):

	out_array.resize(origins.size());
	const auto& direction = directions[0];

	// one large ray buffer shared by all threads; each chunk writes a disjoint range
	std::vector<RTCRay> rays(origins.size());

	const int chunks = 256;

#pragma omp parallel for if(use_parallel) schedule(dynamic)
	for (int start = 0; start < (int)origins.size(); start += chunks) {

		const int end = std::min(start + chunks, (int)origins.size());

		for (int i = start; i < end; i++) {
			const auto& origin = origins[i];
			rays[i] = ConstructRay(origin[0], origin[1], origin[2],
			                       direction[0], direction[1], direction[2], max_distance);
		}

		// trace this chunk's slice of the shared buffer
		rtcOccluded1M(scene, &context, &rays[start], end - start, sizeof(RTCRay));

		for (int i = start; i < end; i++) {
			out_array[i] = (rays[i].tfar == -INFINITY);
		}
	}

So the issue does seem to be related to how the rays are stored and accessed in each thread. I tried both creating a ray vector inside the parallel loop and simply indexing into chunks of a shared vector; both are still slower than using rtcOccluded1. I would really appreciate some pointers on an efficient way to pass rays to rtcOccluded1M from separate threads. (When I send all the rays without parallelism, the streaming version is faster.)

svenwoop (Contributor) commented Dec 7, 2022

How many rays are you tracing in total? The chunk size of 256 is sufficiently large. The second version creates a (probably large) array, so you might just be benchmarking memory allocation time. The first case is not optimal either, as it does the allocation inside the loop. You can try using a simple array inside the loop to store the ray stream, just RTCRay rays[256], which will get allocated on the stack.
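
A sketch of that suggestion applied to the chunked loop above, reusing the same assumed helpers and members as the earlier snippets (ConstructRay, scene, context, use_parallel):

	out_array.resize(origins.size());
	const auto& direction = directions[0];
	constexpr int CHUNK = 256;

#pragma omp parallel for if(use_parallel) schedule(dynamic)
	for (int start = 0; start < (int)origins.size(); start += CHUNK) {
		const int count = std::min(CHUNK, (int)origins.size() - start);

		RTCRay rays[CHUNK]; // stack allocation, no heap allocation per chunk
		for (int i = 0; i < count; i++) {
			const auto& origin = origins[start + i];
			rays[i] = ConstructRay(origin[0], origin[1], origin[2],
			                       direction[0], direction[1], direction[2], max_distance);
		}

		rtcOccluded1M(scene, &context, rays, count, sizeof(RTCRay)); // trace the chunk as a stream

		for (int i = 0; i < count; i++)
			out_array[start + i] = (rays[i].tfar == -INFINITY);
	}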

cadop (Author) commented Dec 7, 2022

I am casting 500k rays (I'm not tracing anything after the occlusion check). The end use is anywhere from 10 to a few million rays; I think typical use is 10k–1M.

I replaced the vector with RTCRay rays[256] inside the loop, which helped slightly, but only to the point that it's about as fast as the non-stream version (500k rays with chunks of 256). 3 trials (in ms):

rtcOccluded1: 1.5, 2.0, 1.76
rtcOccluded1M: 1.9, 1.7, 1.79

Any other ideas?

svenwoop (Contributor) commented Dec 7, 2022

Reaching the same performance as the non-streamed version is likely because your rays are not sufficiently coherent. The stream API only gives performance benefits when rays start at a similar location and go in a similar direction, e.g. primary camera rays. It looks like, for your use case, you are best off just going with the non-stream approach.

cadop (Author) commented Dec 7, 2022

Maybe I misunderstood coherent rays; I thought coherent rays have the same direction but can have different origins?

In my code above, you can see it is the same direction that is always being passed (it is determined only once, outside the loop); only the origin changes.

svenwoop (Contributor) commented Dec 7, 2022

The rays should ideally traverse a similar region in space, so ideally they have similar origins and similar directions.

cadop (Author) commented Dec 7, 2022

Ah, okay, well perhaps that is the real issue. Thanks for all the quick feedback.

I hope this thread will at least help some people understand it, and provide some different sample code for OpenMP.
