Add support for stream of rays in Embree #29

Closed
mariuszhermansdorfer opened this issue Nov 22, 2022 · 6 comments · Fixed by #42

@mariuszhermansdorfer
Contributor

Currently, DHART only shoots single rays with the rtcOccluded1 method:

bool EmbreeRayTracer::Occluded_IMPL(float x, float y, float z, float dx, float dy, float dz, float distance, int mesh_id)
{
    auto ray = ConstructRay(x, y, z, dx, dy, dz, distance);
    rtcOccluded1(scene, &context, &ray);
    return ray.tfar == -INFINITY;
}

Embree, however, supports shooting streams of rays in various configurations:
rtcOccluded1M
rtcOccluded1Mp
rtcOccludedNp

Furthermore, 3 flags can be passed to the intersection context to speed up ray traversal:
https://spec.oneapi.io/oneart/latest/embree-spec.html#rtcinitintersectcontext

enum RTCIntersectContextFlags
{
  RTC_INTERSECT_CONTEXT_FLAG_NONE,
  RTC_INTERSECT_CONTEXT_FLAG_INCOHERENT,
  RTC_INTERSECT_CONTEXT_FLAG_COHERENT,
}; 

It would be great if these could be added to the C# wrapper as well.

@cadop cadop added the enhancement New feature or request label Nov 22, 2022
@cadop
Owner

cadop commented Nov 22, 2022

@mariuszhermansdorfer thanks for taking the time to provide the links.

If you could let me know which configuration you think gives the most beneficial speedup for your application, I can make a specific function/interface for it (assuming you don't want to add one in the C++ source yourself). If it does give some meaningful performance difference, I'll try to roll out the interface as you suggested into its own function (explanation below).

There are a few aspects to adding these to the core API:

Speed:

  1. The streaming rays don't guarantee order, so the user would need to do their own sorting and ID checks (inside Python or C#), which, especially for the Python interface, would remove any performance benefit.

The implementation of the stream ray query functions may re-order rays arbitrarily and re-pack rays into ray packets of different size. For this reason, callback functions may be invoked with an arbitrary packet size (of size 1, 4, 8, or 16) and different ordering as specified initially. For this reason, one may have to use the rayID component of the ray to identify the original ray, e.g. to access a per-ray payload.

  2. It was a while ago, but we did test the speed differences and it was difficult to see a performance difference. In particular, users should be using:
    std::vector<char> EmbreeRayTracer::Occlusions(
        const std::vector<std::array<float, 3>>& origins,
        const std::vector<std::array<float, 3>>& directions,
        float max_distance, bool use_parallel)
    Because this parallelizes the rays into chunks and scales to the number of cores on the user's computer, we found it to be faster than trying to use the streaming rays.
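The chunked parallelism described in this item can be sketched as follows. This is a minimal illustration and not DHART's actual implementation: `OcclusionFn` and `OcclusionsChunked` are hypothetical names, the single-ray query is passed in as a callback instead of calling Embree, and plain `std::thread` is an assumption about the threading mechanism.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

using Vec3 = std::array<float, 3>;

// Hypothetical stand-in for a single-ray occlusion query (for example a
// wrapper around rtcOccluded1). Returns true when the ray is blocked.
using OcclusionFn = std::function<bool(const Vec3& origin, const Vec3& direction)>;

// Split the rays into one contiguous chunk per hardware thread and test each
// chunk on its own thread. Every thread writes to a disjoint slice of `hits`,
// so no locking is needed and the results stay in input order.
std::vector<char> OcclusionsChunked(const std::vector<Vec3>& origins,
                                    const std::vector<Vec3>& directions,
                                    const OcclusionFn& is_occluded) {
    assert(origins.size() == directions.size());
    const std::size_t n = origins.size();
    std::vector<char> hits(n, 0);
    if (n == 0) return hits;

    // hardware_concurrency() may return 0, so fall back to one worker.
    const std::size_t workers = std::min<std::size_t>(
        n, std::max(1u, std::thread::hardware_concurrency()));
    const std::size_t chunk = (n + workers - 1) / workers;  // ceiling division

    std::vector<std::thread> pool;
    for (std::size_t begin = 0; begin < n; begin += chunk) {
        const std::size_t end = std::min(n, begin + chunk);
        pool.emplace_back([&, begin, end] {
            for (std::size_t i = begin; i < end; ++i)
                hits[i] = is_occluded(origins[i], directions[i]) ? 1 : 0;
        });
    }
    for (auto& t : pool) t.join();
    return hits;
}
```

Because each result is written at the index of its input ray, the caller never has to re-sort anything, which is the property the stream functions do not guarantee.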

Explaining the above two items is difficult for most users and can end in confusion or worse performance. From your other issue I see the application is a sun study, so if there is a meaningful speedup for that type of thing we could also make a specific extension/function for it.
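The re-ordering caveat from the first item can be illustrated with a small sketch of the bookkeeping a caller would need. `StreamResult` and `RestoreOrder` are hypothetical names; the `ray_id` field only mirrors the role of the ray ID mentioned in the quoted Embree documentation and is not an Embree type.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical per-ray result as a stream query might hand it back.
struct StreamResult {
    unsigned int ray_id;  // index of the ray in the caller's original order
    bool occluded;        // true when the ray hit something
};

// Scatter results that arrive in arbitrary order back into input order,
// using the ray ID as the destination index.
std::vector<char> RestoreOrder(const std::vector<StreamResult>& results,
                               std::size_t ray_count) {
    std::vector<char> ordered(ray_count, 0);
    for (const StreamResult& r : results) {
        assert(r.ray_id < ray_count);
        ordered[r.ray_id] = r.occluded ? 1 : 0;
    }
    return ordered;
}
```

Doing this scatter in C++ is a single linear pass; the concern raised above is that doing the equivalent per-ray work in Python or C# could cost more than the stream call saves.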

Clarity/Flexibility:

  1. (Perhaps not of your concern.) Although the example code keeps using the term EmbreeRaytracer, we use other raytracers as well. For example, if you use the double-precision flag, we use NanoRT, since Embree doesn't support double precision. Naturally, we don't want to confuse users by exposing all these flags in a complex function parameter when they don't all work, depending on previous decisions or the context of the BVH.

@mariuszhermansdorfer
Contributor Author

Thanks for your detailed answer @cadop.

As you might imagine, the reason I asked for these additional modes is speed. If it turns out that using ray streams doesn't come with any speed benefit at all, I would be the first one to remove it.

Currently, I'm working on a sunlight analysis. It takes the following inputs:

  • analysis plane as mesh with around 500,000 cells
  • context buildings/trees/shading structures etc. joined into a mesh
  • date & time range

From the context, a BVH is created.
For each date & time value (in 1 hour steps) I calculate a sun vector.
Then, for each cell, I shoot a ray with the origin in the cell center and direction opposite to the sun vector. If the ray hits the context mesh, the cell is marked as occluded, otherwise it gets direct sunlight.
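The loop described above can be sketched as follows. This is only an illustration under stated assumptions: a single sphere stands in for the context mesh and its BVH, ray origins are assumed to lie outside the sphere, and `Sphere`, `Occluded` and `SunlitHours` are hypothetical names, not DHART or Embree API.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

using Vec3 = std::array<float, 3>;

// Hypothetical stand-in for the context mesh: one sphere. A real analysis
// would query the BVH built from the context geometry instead.
struct Sphere {
    Vec3 center;
    float radius;
};

static float Dot(const Vec3& a, const Vec3& b) {
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2];
}

// True when the ray origin + t * dir (t > 0, dir normalized) hits the sphere.
// Assumes the origin lies outside the sphere.
bool Occluded(const Vec3& origin, const Vec3& dir, const Sphere& s) {
    const Vec3 oc = {s.center[0] - origin[0], s.center[1] - origin[1],
                     s.center[2] - origin[2]};
    const float t_closest = Dot(oc, dir);  // closest approach along the ray
    if (t_closest <= 0.0f) return false;   // sphere is behind the origin
    const float dist2 = Dot(oc, oc) - t_closest * t_closest;
    return dist2 <= s.radius * s.radius;
}

// For every cell, count the time steps in which it gets direct sunlight.
// Each sun vector points from the sun towards the scene, so the ray cast
// from a cell center uses the opposite direction.
std::vector<int> SunlitHours(const std::vector<Vec3>& cell_centers,
                             const std::vector<Vec3>& sun_vectors,
                             const Sphere& context) {
    assert(context.radius > 0.0f);
    std::vector<int> hours(cell_centers.size(), 0);
    for (const Vec3& sun : sun_vectors) {
        const Vec3 to_sun = {-sun[0], -sun[1], -sun[2]};
        for (std::size_t i = 0; i < cell_centers.size(); ++i)
            if (!Occluded(cell_centers[i], to_sun, context)) ++hours[i];
    }
    return hours;
}
```

Within one time step every ray shares the same direction and neighbouring cells have nearby origins, which is the kind of access pattern the coherent flag is meant for.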

From my understanding, this scenario could benefit from shooting rays with the RTC_INTERSECT_CONTEXT_FLAG_COHERENT flag. Also, I'm hoping that grouping rays into packets could speed this up as well.

Again, it'd need to be benchmarked to know for sure.

@cadop cadop self-assigned this Nov 22, 2022
@cadop
Owner

cadop commented Nov 22, 2022

Just to make sure, you are using this function? https://cadop.github.io/dhart/C%23%20Public%20Docs/html/class_d_h_a_r_t_a_p_i_1_1_ray_tracing_1_1_embree_raytracer.html#a1a96d5b61f43fe87e2649ef612dd63ff

and only passing one direction (inverse sun vector) based on:

One direction, multiple origins: Cast a ray in the given direction from each origin point in origins.

You could also try duplicating the origins into one large vector to use this version:

Equal amount of directions/origins: Cast a ray for every pair of origin/direction in order, i.e. (origin[0], direction[0]), (origin[1], direction[1]), etc.

This would take more time in generating data in C#, but would mean there is only one call to dhart and all rays would be parallelized. I'm not sure if it would be faster.
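A sketch of that data expansion, written in C++ for consistency with the core library although the caller here is C#. `RayBatch`, `ExpandPairs` and `PairIndex` are hypothetical helpers, and the direction-major ordering is an assumption chosen for the example.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

using Vec3 = std::array<float, 3>;

// Paired arrays suitable for an "equal amount of directions/origins" call.
struct RayBatch {
    std::vector<Vec3> origins;
    std::vector<Vec3> directions;
};

// Repeat every origin once per direction so that origins[k] and directions[k]
// always describe one ray. Rays are laid out direction-major: all origins for
// direction 0 first, then all origins for direction 1, and so on.
RayBatch ExpandPairs(const std::vector<Vec3>& origins,
                     const std::vector<Vec3>& directions) {
    RayBatch batch;
    batch.origins.reserve(origins.size() * directions.size());
    batch.directions.reserve(origins.size() * directions.size());
    for (const Vec3& d : directions) {
        for (const Vec3& o : origins) {
            batch.origins.push_back(o);
            batch.directions.push_back(d);
        }
    }
    return batch;
}

// Position of the ray (direction_index, origin_index) in the flat batch.
std::size_t PairIndex(std::size_t direction_index, std::size_t origin_index,
                      std::size_t origin_count) {
    assert(origin_index < origin_count);
    return direction_index * origin_count + origin_index;
}
```

For 500,000 origins and 11 directions this produces 5.5 million ray pairs; at three 4-byte floats per vector, the two arrays together take about 132 MB, which is part of the extra data-generation cost mentioned above.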

Could you also check that the system monitor shows multiple cores being used (i.e. it's not single-threaded), and let me know how long raycasting the 500k cells takes?

The example you give does seem to be the ideal case for coherent rays. I will make an example within the next week or so to test this out. So the plan is:

  1. Change the current occlusion method to use rtcOccluded1M, which seems to decide on its own how to repacket rays, and use RTC_INTERSECT_CONTEXT_FLAG_COHERENT. (I will probably make a temporary hack where setting the max ray distance to 99999 switches the occlusion method.)
  2. If this new method is meaningfully faster, change the case where there are multiple rays and a single direction to always use the stream.

@mariuszhermansdorfer
Contributor Author

Yes, I’m using this function and passing an array of origins (500k points) and one reverse sun vector at a time. This code runs in parallel: all CPU cores are at 100% load.

_analysisPoints = new List<Point3d>();
_sunDirections = new List<Vector3d>();
DA.GetDataList(0, _analysisPoints);
DA.GetData(1, ref _context);
DA.GetDataList(2, _sunDirections);

System.Diagnostics.Stopwatch sw = new System.Diagnostics.Stopwatch();
sw.Start();

MeshInfo contextMesh = new MeshInfo(_context.Faces.ToIntArray(true), _context.Vertices.ToFloatArray());

LogTime(sw, "Initial setup");

EmbreeBVH bvh = new EmbreeBVH(contextMesh);
LogTime(sw, "Setup bvh");

Vector3D[] sunDirections = ConvertListOfVectors(_sunDirections);
Vector3D[] analysisPoints = ConvertListOfVectors(_analysisPoints);

LogTime(sw, "Data conversion");

for (int i = 0; i < sunDirections.Length; i++)
    EmbreeRaytracer.IntersectOccluded(bvh, analysisPoints, new Vector3D[] { sunDirections[i] });

LogTime(sw, "Embree Ray casting");

The code is quite fast already - here is a test with 100k points and only one sun vector. Embree is 5x faster than native Rhino ray cast:

Do you have Rhino on your machine? I will put this test file into the dedicated branch so that we have a common base for benchmarks.

@mariuszhermansdorfer
Contributor Author

mariuszhermansdorfer commented Nov 23, 2022

Update, here are the results for 500K points and one ray:
[image]

Casting 11 rays corresponding to 11 hours of sunlight for a chosen day gives me this:
[image]

EDIT:
Benchmark of the release version:
500K origin points, one ray
[image]

500K origin points, 11 rays
[image]

Performance scales nearly linearly and compute is definitely multi-threaded. Ideally, I'd like this to run at 30FPS.
Let's see how far we can push it :)

@cadop
Owner

cadop commented Nov 23, 2022

Hey @mariuszhermansdorfer, yes I have Rhino, and thanks for showing these results. It seems the first one is 50x faster than Rhino.

Would you mind making a Discussion on the performance and just mentioning this issue? I'd like to keep the conversation going in a more long-term format instead of within the issue on ray streaming.

I am not sure about getting to 30 FPS, which would be ~33 ms per frame. I'll follow up more on the discussion post with some ways to check bottlenecks, since there is some data transfer between C#, the C interface, and C++.
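As a rough worked estimate of what the 30 FPS target implies, assuming (this is not stated in the thread) that all 11 hourly sun vectors have to be re-evaluated for all 500,000 cells in every frame:

```latex
\begin{align*}
t_\text{frame} &= \frac{1}{30}\,\text{s} \approx 33.3\,\text{ms} \\
N_\text{rays} &= 500{,}000 \times 11 = 5.5 \times 10^{6} \\
\text{throughput} &= \frac{N_\text{rays}}{t_\text{frame}}
  = 5.5 \times 10^{6} \times 30\,\text{s}^{-1}
  = 1.65 \times 10^{8}\ \text{rays/s}
\end{align*}
```

With a single sun vector per frame the same calculation gives 1.5 × 10^7 rays/s. Either figure covers the whole round trip, including the data transfer between the language layers, not just the ray traversal.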

@cadop cadop linked a pull request Dec 1, 2022 that will close this issue
@cadop cadop added the priority Should deal with this ASAP label Dec 1, 2022
@cadop cadop pinned this issue Dec 1, 2022
@cadop cadop closed this as completed in #42 Dec 9, 2022
@cadop cadop unpinned this issue Dec 9, 2022