
Conversation

@jsukpark
Contributor

Overview

The current C++ code for pytorch3d.ops.ball_query() performs a floating-point multiplication for every coordinate of every pair of points (until the maximum number of neighbor points is reached). This PR modifies the code (both the CPU and CUDA versions) to implement the idea presented here: a D-cube around the D-ball is constructed first, and any point pair falling outside the cube is skipped without explicitly computing the squared distance. This change is especially useful when the dimension D and the number of points P2 are large and the radius is much smaller than the overall volume of space occupied by the point clouds; speedups of up to ~2.5x on CPU (~1.8x on CUDA) are observed when D = 10 and radius = 0.01. In all benchmark cases, points were uniformly randomly distributed inside a unit D-cube.
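To illustrate the idea, here is a minimal Python sketch of the pre-filter; the actual change lives in the C++/CUDA kernels, so the function name, the brute-force loop structure, and the `use_cube_filter` argument below are illustrative only:

```python
import torch

def ball_query_sketch(p1, p2, radius, K, use_cube_filter=True):
    # p1: (P1, D), p2: (P2, D). Returns up to K indices into p2 per p1 point,
    # padded with -1, mirroring the brute-force loop in the kernels.
    P1, D = p1.shape
    idx = torch.full((P1, K), -1, dtype=torch.long)
    radius2 = radius * radius
    for i in range(P1):
        found = 0
        for j in range(p2.shape[0]):
            if found == K:  # stop once the neighbor budget is filled
                break
            diff = p1[i] - p2[j]
            # Cube pre-filter: if the points are >= radius apart in any one
            # dimension, the pair cannot lie within the D-ball, so skip the
            # D-multiply squared-distance computation entirely.
            if use_cube_filter and (diff.abs() >= radius).any():
                continue
            if diff.dot(diff) < radius2:
                idx[i, found] = j
                found += 1
    return idx
```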

The benchmark code differs from tests/benchmarks/bm_ball_query.py (only the forward pass is benchmarked, and larger input sizes are used); it is stored in tests/benchmarks/bm_ball_query_large.py.
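For reference, a forward-only timing harness of this shape (a rough sketch, not the actual bm_ball_query_large.py) might look like:

```python
import time
import torch
from pytorch3d.ops import ball_query

def bench_forward(P1=10000, P2=100000, D=10, K=500, radius=0.01, device="cpu"):
    # Uniform random points in the unit D-cube, matching the benchmark setup.
    p1 = torch.rand(1, P1, D, device=device)
    p2 = torch.rand(1, P2, D, device=device)
    if device == "cuda":
        torch.cuda.synchronize()  # exclude async launch overhead from timing
    t0 = time.perf_counter()
    ball_query(p1, p2, K=K, radius=radius)
    if device == "cuda":
        torch.cuda.synchronize()
    return time.perf_counter() - t0
```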

Average time comparisons

[Benchmark plots: average times, CPU and CUDA, D ∈ {3, 10}, radius ∈ {0.01, 0.10}]

Peak time comparisons

[Benchmark plots: peak times, CPU and CUDA, D ∈ {3, 10}, radius ∈ {0.01, 0.10}]

Full benchmark logs

benchmark-before-change.txt
benchmark-after-change.txt

- Skip explicit distance calc if the points are >= radius apart in any one dim
@meta-cla bot added the CLA Signed label on Oct 17, 2025.
Contributor

@bottler left a comment

Looks like a good idea. Can you also run the original bm_ball_query.py before and after, please, to check that it doesn't slow down?

@jsukpark
Contributor Author

Looks like a good idea. Can you also run the original bm_ball_query.py before and after, please, to check that it doesn't slow down?

Sure, here are the original benchmark results:

SQUARE (average and peak time comparisons)

[Benchmark plots: CPU and CUDA average and peak times, SQUARE cases]

RAGGED (average and peak time comparisons)

[Benchmark plots: CPU and CUDA average and peak times, RAGGED cases]

Full benchmark logs

benchmark-original-before-change.txt
benchmark-original-after-change.txt

The runtime changes are generally mixed (some get better, some get worse), though many average runtimes do get worse for the RAGGED benchmark cases with CUDA. These are all cases with smaller or equal P2, smaller K, and larger or equal radii compared to the new benchmark cases; I suppose this makes sense, since the filtering doesn't help much when each point has enough points in the second cloud within the radius to fill up K (in the new benchmark cases, K was fixed at 500 and radii as small as 0.01 were used). In particular, when radius = 5 in the ragged benchmark cases, every point in the second cloud is within the radius (since the overall space is the unit D-cube), so no performance gain is expected there.
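A back-of-envelope estimate (mine, not from the PR) of how many pairs survive the cube pre-filter for points uniform in the unit D-cube:

```python
# For x, y uniform on [0, 1], P(|x - y| < r) = 1 - max(0, 1 - r)**2, so the
# fraction of pairs passing the per-dimension filter is that probability to
# the D-th power. Small radii reject nearly every pair; radius >= 1 rejects
# none, which is why radius = 5 shows no gain.
for D in (3, 10):
    for r in (0.01, 0.1, 5.0):
        p = (1.0 - max(0.0, 1.0 - r) ** 2) ** D
        print(f"D={D:2d}, radius={r:<4}: ~{p:.2e} of pairs reach the distance check")
```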

Any thoughts or any suggestions on how to proceed? @bottler

@bottler
Contributor

bottler commented Oct 21, 2025

Do you have a practical application for which this change helps significantly?

@jsukpark
Contributor Author

jsukpark commented Oct 21, 2025

An example from my own field: using ball_query() to support graph neural nets that predict properties of Moiré materials, which involves constructing large graphs with atoms as nodes and edges connecting atoms less than a certain distance apart (so in this case, the two point clouds are the same). Here, the radius is much smaller than the range of coordinates of the entire graph, and the number of atoms is large (O(10^4) for a particularly interesting example, twisted bilayer graphene). Existing tools, AFAIK, can only find neighbors on the CPU side, which involves GPU -> CPU -> GPU data transfers; I was thinking PyTorch3D could spare the need for that.

Besides applications in my field, I'd imagine anyone using PointNet++ for a large-scale point cloud classification/segmentation task (O(10^6) points?) would also benefit.

@bottler
Contributor

bottler commented Oct 22, 2025

Maybe we should go with the boring, conservative solution: add a boolean flag to let the user pick the algorithm.

- False by default for backwards compatibility
@jsukpark
Contributor Author

Done! I've added a boolean flag and set it to False by default so the new code only runs if the user sets it to True.
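For illustration, opting in would look something like this (the keyword name use_cube_filter below is hypothetical; see the diff for the actual flag name and signature):

```python
import torch
from pytorch3d.ops import ball_query

p1 = torch.rand(1, 10_000, 10)   # (N, P1, D), uniform in the unit 10-cube
p2 = torch.rand(1, 100_000, 10)  # (N, P2, D)

# Default call: behavior is unchanged, since the flag defaults to False
# for backwards compatibility.
out = ball_query(p1, p2, K=500, radius=0.01)

# Opting in to the cube pre-filter; "use_cube_filter" is a hypothetical
# name for the flag added in this PR.
out = ball_query(p1, p2, K=500, radius=0.01, use_cube_filter=True)
```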

Let me know if you have any additional thoughts @bottler

@jsukpark
Contributor Author

Personally I want to try implementing a k-d tree to further improve the asymptotic complexity (per here), but I think that's for another PR.

@bottler
Contributor

bottler commented Oct 23, 2025

Personally I want to try implementing a k-d tree to further improve the asymptotic complexity (per here), but I think that's for another PR.

That question looks like it isn't thinking about GPUs. I would expect there to be existing work extending those types of ideas to GPUs, and for there to already be specialized OSS libraries for it?

@meta-codesync

meta-codesync bot commented Oct 23, 2025

@bottler has imported this pull request. If you are a Meta employee, you can view this in D85356394.

@meta-codesync

meta-codesync bot commented Oct 30, 2025

@bottler merged this pull request in 2d4d345.
