Improve ball_query() runtime for large-scale cases #2006
Conversation
- Skip explicit distance calc if the points are >= radius apart in any one dim
Looks like a good idea. Can you run the original bm_ball_query.py before and after as well, please, to check that it doesn't slow down?
Sure, here are the original benchmark results:
SQUARE (average and peak time comparisons)
RAGGED (average and peak time comparisons)
Full benchmark logs: benchmark-original-before-change.txt
The runtime changes are generally mixed (some get better, some get worse), though I do see a lot of average runtimes getting worse for ... Any thoughts or suggestions on how to proceed? @bottler
Do you have a practical application for which this change helps significantly?
An example in my own field could be using ... Besides application needs in my field, I'd imagine anyone using PointNet++ for a large-scale point cloud classification/segmentation task (O(10^6) points?) would also benefit?
Maybe we should go with the boring, conservative solution: add a boolean flag to let the user pick the algorithm.
- False by default for backwards compatibility
Done! I've added a boolean flag and set it to False by default so the new code only runs if the user sets it to True. Let me know if you have any additional thoughts @bottler
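As a sketch of how the opt-in could look from Python (the keyword name below is a placeholder for illustration, not necessarily the argument added in this PR):

```python
import torch
from pytorch3d.ops import ball_query

p1 = torch.rand(1, 2048, 3)
p2 = torch.rand(1, 100_000, 3)

# Existing call: behaviour is unchanged, the new code path is off by default.
dists, idx, nn = ball_query(p1, p2, K=32, radius=0.05)

# With the new boolean flag set to True, the cube pre-check path would be used.
# NOTE: `use_cube_check` is a placeholder name for illustration only.
# dists, idx, nn = ball_query(p1, p2, K=32, radius=0.05, use_cube_check=True)
```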
Personally I want to try implementing a k-d tree to further improve the asymptotic complexity (per here), but I think that's for another PR.
That question looks like it isn't thinking about GPUs. I would expect there to be existing work extending those types of ideas to GPUs, and for there to already be specialized OSS libraries for it?

Overview
The current C++ code for `pytorch3d.ops.ball_query()` performs floating point multiplications for every coordinate of every pair of points (up until the maximum number of neighbor points is reached). This PR modifies the code (for both the CPU and CUDA versions) to implement the idea presented here: a `D`-cube around the `D`-ball is first constructed, and any point pair falling outside the cube is skipped without explicitly computing the squared distance. This change is especially useful when the dimension `D` and the number of points `P2` are large and the radius is much smaller than the overall volume of space occupied by the point clouds; as much as a ~2.5x speedup (CPU case; ~1.8x in the CUDA case) is observed when `D = 10` and `radius = 0.01`. In all benchmark cases, points were distributed uniformly at random inside a unit `D`-cube.

The benchmark code used differs from `tests/benchmarks/bm_ball_query.py` (only the forward pass is benchmarked, and larger input sizes are used) and is stored in `tests/benchmarks/bm_ball_query_large.py`.

Average time comparisons
Peak time comparisons
Full benchmark logs
benchmark-before-change.txt
benchmark-after-change.txt
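To make the idea concrete, here is a minimal pure-PyTorch sketch of the cube pre-check; it is illustrative only, not the PR's C++/CUDA kernel code, and details such as output format and the exact comparison operators may differ from the kernels.

```python
import torch

def ball_query_cube_check(p1, p2, K, radius):
    """Reference sketch of ball query with the D-cube pre-check.

    For each query point, a candidate is rejected as soon as any single
    coordinate differs by >= radius, since such a pair cannot lie inside
    the D-ball of that radius; only surviving candidates get the full
    squared-distance test. (Illustrative only; not the kernel code.)
    """
    N, P1, D = p1.shape
    _, P2, _ = p2.shape
    idx = torch.full((N, P1, K), -1, dtype=torch.long)
    dists = torch.zeros(N, P1, K)
    radius2 = radius * radius
    for n in range(N):
        for i in range(P1):
            found = 0
            for j in range(P2):
                diff = p1[n, i] - p2[n, j]
                # Cube pre-check: skip the explicit distance computation if
                # the points are >= radius apart along any one dimension.
                if (diff.abs() >= radius).any():
                    continue
                d2 = (diff * diff).sum()
                if d2 < radius2:
                    idx[n, i, found] = j
                    dists[n, i, found] = d2
                    found += 1
                    if found == K:
                        break
    return dists, idx
```

Because the pre-check uses only per-coordinate comparisons, it avoids the `D` multiplications of the squared-distance computation for any pair that is far apart along a single axis, which is why the gain grows as `D` increases and `radius` shrinks.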