Add support for stream of rays in Embree #29

mariuszhermansdorfer · 2022-11-22T07:43:25Z

Currently, DHART only shoots single rays with the rtcOccluded1 method:

dhart/src/Cpp/raytracer/src/embree_raytracer.cpp

Lines 621 to 626 in d7eefe6

    
    bool EmbreeRayTracer::Occluded_IMPL(float x, float y, float z, float dx, float dy, float dz, float distance, int mesh_id)
    {
        auto ray = ConstructRay(x, y, z, dx, dy, dz, distance);
        rtcOccluded1(scene, &context, &ray);
        return ray.tfar == -INFINITY;
    }

Embree, however, supports shooting streams of rays in various configurations:
rtcOccluded1M
rtcOccluded1Mp
rtcOccludedNp

Furthermore, 3 flags can be passed to the intersection context to speed up ray traversal:
https://spec.oneapi.io/oneart/latest/embree-spec.html#rtcinitintersectcontext

enum RTCIntersectContextFlags
{
  RTC_INTERSECT_CONTEXT_FLAG_NONE,
  RTC_INTERSECT_CONTEXT_FLAG_INCOHERENT,
  RTC_INTERSECT_CONTEXT_FLAG_COHERENT,
};

It would be great if these could be added to the C# wrapper as well.


cadop · 2022-11-22T14:29:26Z

@mariuszhermansdorfer thanks for taking the time to provide the links.

If you could let me know which configuration you think would give the most beneficial speedup for your application, I can make a specific function/interface for it (assuming you don't want to add one in the C++ source yourself). If it does give a meaningful performance difference, I'll try to roll out the interface you suggested as its own function (explanation below).

There are a few aspects to adding these to the core API:

Speed:

The streaming rays don't guarantee order, so the user would need to do their own sorting and ID checks (inside Python or C#), which, especially for the Python interface, will remove any performance benefit.

The implementation of the stream ray query functions may re-order rays arbitrarily and re-pack rays into ray packets of different size. For this reason, callback functions may be invoked with an arbitrary packet size (of size 1, 4, 8, or 16) and different ordering as specified initially. For this reason, one may have to use the rayID component of the ray to identify the original ray, e.g. to access a per-ray payload.
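The reordering issue described above can be sketched without Embree. This is a minimal illustration, assuming each result carries the ID that was assigned to its ray before submission. `StreamResult` and `RestoreOrder` are hypothetical names, not part of DHART or Embree.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical result record: the ID assigned to a ray before submission,
// plus the occlusion flag reported for that ray.
struct StreamResult {
    unsigned int ray_id;
    bool occluded;
};

// Scatter results that arrive in arbitrary order back into submission order.
// Assumes IDs are the indices 0..ray_count-1 of the submitted rays.
std::vector<char> RestoreOrder(const std::vector<StreamResult>& results,
                               std::size_t ray_count) {
    std::vector<char> ordered(ray_count, 0);
    for (const StreamResult& r : results) {
        assert(r.ray_id < ray_count);  // every ID must map to a submitted ray
        ordered[r.ray_id] = r.occluded ? 1 : 0;
    }
    return ordered;
}
```

The scatter itself is a single linear pass in C++. The concern raised above is about doing the equivalent bookkeeping on the Python or C# side of the interface.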

It was a while ago, but we did test the speed differences and it was difficult to see a performance difference. In particular, users should be using:

dhart/src/Cpp/raytracer/src/embree_raytracer.cpp

Lines 564 to 568 in d7eefe6

    
    std::vector<char> EmbreeRayTracer::Occlusions(
        const std::vector<std::array<float, 3>>& origins,
        const std::vector<std::array<float, 3>>& directions,
        float max_distance, bool use_parallel)
    {

Because this parallelizes the rays into chunks and scales to the number of cores on the user's computer, we found it to be faster than trying to use the streaming rays.
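As a rough sketch of what chunked, per-core parallelization of single-ray queries can look like (not DHART's actual implementation): the loop below hands out chunks of rays to threads with an OpenMP `schedule(dynamic)` clause, which matches the wording of the linked PR title but is otherwise an assumption. `OcclusionsParallel`, `OcclusionQuery`, and the chunk size of 1024 are hypothetical. Without OpenMP enabled at compile time the pragma is ignored and the loop runs serially.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

using Vec3 = std::array<float, 3>;

// Hypothetical stand-in for a single-ray occlusion query.
// It must be safe to call from several threads at once.
using OcclusionQuery = std::function<bool(const Vec3& origin, const Vec3& direction)>;

std::vector<char> OcclusionsParallel(const std::vector<Vec3>& origins,
                                     const std::vector<Vec3>& directions,
                                     const OcclusionQuery& occluded) {
    assert(origins.size() == directions.size());
    std::vector<char> out(origins.size(), 0);
    const long long n = static_cast<long long>(origins.size());

    // Threads grab chunks of 1024 rays as they become free. Each iteration
    // writes only to its own slot, so results stay in input order.
    #pragma omp parallel for schedule(dynamic, 1024)
    for (long long i = 0; i < n; ++i) {
        const std::size_t k = static_cast<std::size_t>(i);
        out[k] = occluded(origins[k], directions[k]) ? 1 : 0;
    }
    return out;
}
```

Because every ray writes to a fixed index, no sorting or ID matching is needed afterwards, unlike with the stream functions.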

Explaining the above two items is difficult for most users and can end up in confusion or worse performance. From your other issue I see the application to sun study, so if there is a meaningful speedup for that type of thing we could also make a specific extension/function for it.

Clarity/Flexibility:

(Perhaps not your concern.) Although the example code keeps using the term EmbreeRaytracer, we use other raytracers as well. For example, if you use the double precision flag, we use NanoRT, since Embree doesn't support double precision. Naturally, we don't want to confuse users by exposing all these flags in a complex function parameter when some of them won't work depending on previous decisions/contexts of the BVH.

mariuszhermansdorfer · 2022-11-22T15:27:46Z

Thanks for your detailed answer @cadop.

As you might imagine, the reason I asked for these additional modes is speed. If it turns out that using ray streams doesn't come with any speed benefit at all, I would be the first one to remove it.

Currently, I'm working on a sunlight analysis. It takes the following inputs:

analysis plane as a mesh with around 500,000 cells
context buildings/trees/shading structures etc. joined into a mesh
date & time range

From the context a bvh is created.
For each date & time value (in 1 hour steps) I calculate a sun vector.
Then, for each cell, I shoot a ray with the origin in the cell center and direction opposite to the sun vector. If the ray hits the context mesh, the cell is marked as occluded, otherwise it gets direct sunlight.
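The procedure above can be sketched in a few lines. This is only an illustration under stated assumptions: a single sphere stands in for the BVH of the context mesh, and `Sphere`, `RayHitsSphere`, and `SunlitHours` are hypothetical names. The sun vector is taken to point from the sun toward the ground, so each ray is cast along its negation.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

using Vec3 = std::array<float, 3>;

// Hypothetical stand-in for the context mesh: a single sphere.
struct Sphere {
    Vec3 center;
    float radius;
};

static float Dot(const Vec3& a, const Vec3& b) {
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2];
}

// True if the ray origin + t * direction meets the sphere for some t > 0.
bool RayHitsSphere(const Vec3& origin, const Vec3& direction, const Sphere& s) {
    const Vec3 oc = {origin[0] - s.center[0], origin[1] - s.center[1],
                     origin[2] - s.center[2]};
    const float a = Dot(direction, direction);
    assert(a > 0.0f);  // direction must be non-zero
    const float b = 2.0f * Dot(oc, direction);
    const float c = Dot(oc, oc) - s.radius * s.radius;
    const float disc = b * b - 4.0f * a * c;
    if (disc < 0.0f) return false;  // the ray's line misses the sphere
    const float t_far = (-b + std::sqrt(disc)) / (2.0f * a);
    return t_far > 0.0f;  // the sphere is not entirely behind the origin
}

// For each cell, count the sun positions from which it gets direct sunlight.
std::vector<int> SunlitHours(const std::vector<Vec3>& cell_centers,
                             const std::vector<Vec3>& sun_vectors,
                             const Sphere& occluder) {
    std::vector<int> hours(cell_centers.size(), 0);
    for (const Vec3& sun : sun_vectors) {
        const Vec3 to_sun = {-sun[0], -sun[1], -sun[2]};  // opposite to the sun vector
        for (std::size_t i = 0; i < cell_centers.size(); ++i) {
            if (!RayHitsSphere(cell_centers[i], to_sun, occluder)) ++hours[i];
        }
    }
    return hours;
}
```

Within one time step every ray shares the same direction and the origins lie on one plane, which is the kind of similarity between rays that the coherent flag is meant to exploit.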

From my understanding, this scenario could benefit from shooting rays with the RTC_INTERSECT_CONTEXT_FLAG_COHERENT flag. Also, I'm hoping that grouping rays into packets could speed this up as well.

Again, it'd need to be benchmarked to know for sure.

cadop · 2022-11-22T17:20:36Z

Just to make sure, you are using this function? https://cadop.github.io/dhart/C%23%20Public%20Docs/html/class_d_h_a_r_t_a_p_i_1_1_ray_tracing_1_1_embree_raytracer.html#a1a96d5b61f43fe87e2649ef612dd63ff

and only passing one direction (inverse sun vector) based on:

One direction, multiple origins: Cast a ray in the given direction from each origin point in origins.

You could also try to duplicate the origins into one large vector to use this version:

Equal amount of directions/origins: Cast a ray for every pair of origin/direction in order, i.e. (origin[0], direction[0]), (origin[1], direction[1]), etc.

This would take more time in generating data in C#, but would mean there is only one call to dhart and all rays would be parallelized. I'm not sure if it would be faster.
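A sketch of the duplication step, written in C++ for consistency with the DHART source quoted in this thread; the same loop carries over to C#. `RayBatch` and `ExpandToPairs` are hypothetical names. The result for origin `i` under direction `j` ends up at index `j * origins.size() + i`.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

using Vec3 = std::array<float, 3>;

// Hypothetical container for the paired arrays passed to a single call.
struct RayBatch {
    std::vector<Vec3> origins;
    std::vector<Vec3> directions;
};

// Repeat every origin once per direction so that origins[k] and
// directions[k] always form one ray.
RayBatch ExpandToPairs(const std::vector<Vec3>& origins,
                       const std::vector<Vec3>& directions) {
    RayBatch batch;
    const std::size_t total = origins.size() * directions.size();
    batch.origins.reserve(total);
    batch.directions.reserve(total);
    for (const Vec3& d : directions) {
        for (const Vec3& o : origins) {
            batch.origins.push_back(o);
            batch.directions.push_back(d);
        }
    }
    assert(batch.origins.size() == batch.directions.size());
    return batch;
}
```

With 500k origins and 11 directions this produces 5.5 million pairs, roughly 132 MB for the two float arrays at 12 bytes per vector, so the copy cost is part of what a benchmark would need to measure.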

Could you also make sure that the system monitor shows multiple cores being used and that it's not single-threaded? And how much time is the raycasting of 500k cells taking?

The example you give does seem to be the ideal case for coherent rays. I will make an example within the next week or so to test this out. So the plan is:

Change the current occlusion method to use rtcOccluded1M, which seems to decide on its own how to repacket rays, and use RTC_INTERSECT_CONTEXT_FLAG_COHERENT. (I will probably make a temporary hack where setting the max ray distance to 99999 switches the occlusion method.)
If this new method is meaningfully faster, change the case where there are multiple rays and a single direction to always use the stream.

mariuszhermansdorfer · 2022-11-22T18:26:39Z

Yes, I'm using this function and pass an array of origins (500k points) and one reverse sun vector at a time. This code runs in parallel, as all CPU cores are at 100% load.

            _analysisPoints = new List<Point3d>();
            _sunDirections = new List<Vector3d>();
            DA.GetDataList(0, _analysisPoints);
            DA.GetData(1, ref _context);
            DA.GetDataList(2, _sunDirections);

            System.Diagnostics.Stopwatch sw = new System.Diagnostics.Stopwatch();
            sw.Start();

            MeshInfo contextMesh = new MeshInfo(_context.Faces.ToIntArray(true), _context.Vertices.ToFloatArray());

            LogTime(sw, "Initial setup");

            EmbreeBVH bvh = new EmbreeBVH(contextMesh);
            LogTime(sw, "Setup bvh");

            Vector3D[] sunDirections = ConvertListOfVectors(_sunDirections);
            Vector3D[] analysisPoints = ConvertListOfVectors(_analysisPoints);

            LogTime(sw, "Data conversion");

            for (int i = 0; i < sunDirections.Length; i++)
                EmbreeRaytracer.IntersectOccluded(bvh, analysisPoints, new Vector3D[] { sunDirections[i] });

            LogTime(sw, "Embree Ray casting");

The code is quite fast already - here is a test with 100k points and only one sun vector. Embree is 5x faster than native Rhino ray cast:

Do you have Rhino on your machine? I will put this test file into the dedicated branch so that we have a common base for benchmarks.

mariuszhermansdorfer · 2022-11-23T11:06:35Z

Update, here are the results for 500K points and one ray:

Casting 11 rays corresponding to 11 hours of sunlight for a chosen day, gives me this:

EDIT:
Benchmark of the release version:
500K origin points, one ray

500K origin points, 11 rays

Performance scales nearly linearly and compute is definitely multi-threaded. Ideally, I'd like this to run at 30FPS.
Let's see how far we can push it :)

cadop · 2022-11-23T14:55:51Z

Hey @mariuszhermansdorfer, yes I have Rhino, and thanks for showing these results. It seems the first one is 50x faster than Rhino.

Would you mind making a Discussion on the performance and just mentioning this issue? I'd like to keep the conversation going in a more long-term format instead of within the issue on ray streaming.

I am not sure about getting to 30 FPS, which would be ~33 ms per frame. I'll follow up more in the discussion post with some ways to check bottlenecks, since there is some data transfer between C#, the C interface, and C++.
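The ~33 ms figure is simply 1000 ms divided by 30 frames. The small sketch below turns that into a required ray throughput. It assumes every frame recasts one ray per cell for the 500k cells mentioned in the thread, which is an assumption about the target workload, not something stated above. The function names are hypothetical.

```cpp
#include <cassert>

// Time available per frame, in milliseconds, at a target frame rate.
double FrameBudgetMs(double frames_per_second) {
    assert(frames_per_second > 0.0);
    return 1000.0 / frames_per_second;
}

// Rays per second the whole pipeline must sustain to hit the frame rate.
double RequiredRaysPerSecond(double rays_per_frame, double frames_per_second) {
    assert(rays_per_frame >= 0.0 && frames_per_second > 0.0);
    return rays_per_frame * frames_per_second;
}
```

Under that assumption, `RequiredRaysPerSecond(500000.0, 30.0)` is 15 million rays per second for a single sun direction, and eleven times that if all 11 directions have to fit in one frame, before counting any data transfer between layers.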

cadop added the enhancement New feature or request label Nov 22, 2022

cadop self-assigned this Nov 22, 2022

cadop mentioned this issue Dec 1, 2022

Update dynamic scheduling of openmp to improve occlusion speed #42

Merged

cadop linked a pull request Dec 1, 2022 that will close this issue

Update dynamic scheduling of openmp to improve occlusion speed #42

Merged

cadop added the priority Should deal with this ASAP label Dec 1, 2022

cadop pinned this issue Dec 1, 2022

cadop closed this as completed in #42 Dec 9, 2022

cadop unpinned this issue Dec 9, 2022
