
Conversation

@jsukpark
Contributor

Overview

The current C++ code for pytorch3d.ops.ball_query() performs a floating-point multiplication for every coordinate of every pair of points (until the maximum number of neighbor points is reached). This PR modifies the code (both the CPU and CUDA versions) to implement the idea presented here: a D-cube around the D-ball is constructed first, and any point pair falling outside the cube is skipped without explicitly computing the squared distance. This change is especially useful when the dimension D and the number of points P2 are large and the radius is much smaller than the overall volume of space occupied by the point clouds; speedups of up to ~2.5x on CPU (~1.8x on CUDA) are observed when D = 10 and radius = 0.01. In all benchmark cases, points were uniformly randomly distributed inside a unit D-cube.
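To illustrate the idea, here is a minimal Python sketch of the pre-filter; the actual change lives in the C++/CUDA kernels, so the function name, the brute-force loop structure, and the `use_cube_filter` argument below are illustrative only:

```python
import torch

def ball_query_sketch(p1, p2, radius, K, use_cube_filter=True):
    # p1: (P1, D), p2: (P2, D). Returns up to K indices into p2 per p1 point,
    # padded with -1, mirroring the brute-force loop in the kernels.
    P1, D = p1.shape
    idx = torch.full((P1, K), -1, dtype=torch.long)
    radius2 = radius * radius
    for i in range(P1):
        found = 0
        for j in range(p2.shape[0]):
            if found == K:  # stop once the neighbor budget is filled
                break
            diff = p1[i] - p2[j]
            # Cube pre-filter: if the points are >= radius apart in any one
            # dimension, the pair cannot lie within the D-ball, so skip the
            # D-multiply squared-distance computation entirely.
            if use_cube_filter and (diff.abs() >= radius).any():
                continue
            if diff.dot(diff) < radius2:
                idx[i, found] = j
                found += 1
    return idx
```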

The benchmark code differs from tests/benchmarks/bm_ball_query.py (only the forward pass is benchmarked, and larger input sizes are used); it is stored in tests/benchmarks/bm_ball_query_large.py.
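For reference, a forward-only timing harness of this shape (a rough sketch, not the actual bm_ball_query_large.py) might look like:

```python
import time
import torch
from pytorch3d.ops import ball_query

def bench_forward(P1=10000, P2=100000, D=10, K=500, radius=0.01, device="cpu"):
    # Uniform random points in the unit D-cube, matching the benchmark setup.
    p1 = torch.rand(1, P1, D, device=device)
    p2 = torch.rand(1, P2, D, device=device)
    if device == "cuda":
        torch.cuda.synchronize()  # exclude async launch overhead from timing
    t0 = time.perf_counter()
    ball_query(p1, p2, K=K, radius=radius)
    if device == "cuda":
        torch.cuda.synchronize()
    return time.perf_counter() - t0
```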

Average time comparisons

[Benchmark plots: average times, CPU and CUDA, D ∈ {3, 10}, radius ∈ {0.01, 0.10}]

Peak time comparisons

[Benchmark plots: peak times, CPU and CUDA, D ∈ {3, 10}, radius ∈ {0.01, 0.10}]

Full benchmark logs

benchmark-before-change.txt
benchmark-after-change.txt

- Skip explicit distance calc if the points are >= radius apart in any one dim
@meta-cla bot added the CLA Signed label on Oct 17, 2025.
Contributor

@bottler left a comment

Looks like a good idea. Can you also run the original bm_ball_query.py before and after, please, to check that it doesn't slow down?

@jsukpark
Contributor Author

Looks like a good idea. Can you also run the original bm_ball_query.py before and after, please, to check that it doesn't slow down?

Sure, here are the original benchmark results:

SQUARE (average and peak time comparisons)

[Benchmark plots: CPU and CUDA average and peak times, SQUARE cases]

RAGGED (average and peak time comparisons)

[Benchmark plots: CPU and CUDA average and peak times, RAGGED cases]

Full benchmark logs

benchmark-original-before-change.txt
benchmark-original-after-change.txt

The runtime changes are generally mixed (some get better, some get worse), though many average runtimes do get worse for the RAGGED benchmark cases with CUDA. These are all cases with smaller or equal P2, smaller K, and larger or equal radii compared to the new benchmark cases; I suppose this makes sense, since the filtering doesn't help much when each point has enough points in the second cloud within the radius to fill up K (in the new benchmark cases, K was fixed at 500 and radii as small as 0.01 were used). In particular, when radius = 5 in the ragged benchmark cases, every point in the second cloud is within the radius (since the overall space is the unit D-cube), so no performance gain is expected there.
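A back-of-envelope estimate (mine, not from the PR) of how many pairs survive the cube pre-filter for points uniform in the unit D-cube:

```python
# For x, y uniform on [0, 1], P(|x - y| < r) = 1 - max(0, 1 - r)**2, so the
# fraction of pairs passing the per-dimension filter is that probability to
# the D-th power. Small radii reject nearly every pair; radius >= 1 rejects
# none, which is why radius = 5 shows no gain.
for D in (3, 10):
    for r in (0.01, 0.1, 5.0):
        p = (1.0 - max(0.0, 1.0 - r) ** 2) ** D
        print(f"D={D:2d}, radius={r:<4}: ~{p:.2e} of pairs reach the distance check")
```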

Any thoughts or any suggestions on how to proceed? @bottler

@bottler
Contributor

bottler commented Oct 21, 2025

Do you have a practical application for which this change helps significantly?

@jsukpark
Contributor Author

jsukpark commented Oct 21, 2025

An example from my own field: using ball_query() to support graph neural nets that predict properties of Moiré materials, which involves constructing large graphs with atoms as nodes and edges connecting atoms less than a certain distance apart (so in this case, the two point clouds are the same). Here, the radius is much smaller than the range of coordinates of the entire graph, and the number of atoms is large (O(10^4) for a particularly interesting example, twisted bilayer graphene). Existing tools, AFAIK, can only find neighbors on the CPU side, which involves GPU -> CPU -> GPU data transfers; I was thinking PyTorch3D could spare the need for that.

Besides applications in my field, I'd imagine anyone using PointNet++ for a large-scale point cloud classification/segmentation task (O(10^6) points?) would also benefit.

@bottler
Contributor

bottler commented Oct 22, 2025

Maybe we should go with the boring, conservative solution: add a boolean flag to let the user pick the algorithm.

- False by default for backwards compatibility
@jsukpark
Contributor Author

Done! I've added a boolean flag and set it to False by default so the new code only runs if the user sets it to True.
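For illustration, opting in would look something like this (the keyword name use_cube_filter below is hypothetical; see the diff for the actual flag name and signature):

```python
import torch
from pytorch3d.ops import ball_query

p1 = torch.rand(1, 10_000, 10)   # (N, P1, D), uniform in the unit 10-cube
p2 = torch.rand(1, 100_000, 10)  # (N, P2, D)

# Default call: behavior is unchanged, since the flag defaults to False
# for backwards compatibility.
out = ball_query(p1, p2, K=500, radius=0.01)

# Opting in to the cube pre-filter; "use_cube_filter" is a hypothetical
# name for the flag added in this PR.
out = ball_query(p1, p2, K=500, radius=0.01, use_cube_filter=True)
```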

Let me know if you have any additional thoughts @bottler

@jsukpark
Contributor Author

Personally I want to try implementing a k-d tree to further improve the asymptotic complexity (per here), but I think that's for another PR.

@bottler
Contributor

bottler commented Oct 23, 2025

Personally I want to try implementing a k-d tree to further improve the asymptotic complexity (per here), but I think that's for another PR.

That question looks like it isn't thinking about GPUs. I would expect there to be existing work extending those types of ideas to GPUs, and for there to already be specialized OSS libraries for it?

@meta-codesync

meta-codesync bot commented Oct 23, 2025

@bottler has imported this pull request. If you are a Meta employee, you can view this in D85356394.

@meta-codesync

meta-codesync bot commented Oct 30, 2025

@bottler merged this pull request in 2d4d345.
