Official PyTorch implementation for the DAC'26 paper:
FlashFPS: Efficient Farthest Point Sampling for Large-Scale Point Clouds via Pruning and Caching
by Yuzhe Fu, Hancheng Ye, Cong Guo, Junyao Zhang, Qinsi Wang, Yueqian Lin, Changchun Zhou, Hai "Helen" Li, Yiran Chen.
FlashFPS-demo.mp4
FlashFPS is a hardware-agnostic, plug-and-play framework for efficient Farthest Point Sampling (FPS) in point cloud networks. It achieves an average end-to-end speedup of 5.16× over the standard CUDA baseline on GPUs, with negligible accuracy loss.
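For readers unfamiliar with FPS: it greedily selects points so that each new point is as far as possible from the already-selected set, giving an O(N·S) algorithm for N input points and S samples. The sketch below is a plain-Python reference of this baseline algorithm (not the repo's CUDA kernel, and not FlashFPS's pruned/cached variant) to illustrate the computation being accelerated:

```python
# Reference farthest point sampling (FPS): the O(N*S) baseline algorithm
# that FlashFPS accelerates. Pure-Python illustration only; the actual
# implementations in this repo are CUDA kernels.
import math

def farthest_point_sampling(points, num_samples):
    """Greedily pick num_samples indices, each maximizing its distance
    to the already-selected set. points: list of (x, y, z) tuples."""
    n = len(points)
    selected = [0]                  # conventionally start from point 0
    min_dist = [math.inf] * n       # squared dist to nearest selected point
    for _ in range(num_samples - 1):
        last = points[selected[-1]]
        # update each point's distance to the selected set
        for i, p in enumerate(points):
            d = sum((a - b) ** 2 for a, b in zip(p, last))
            if d < min_dist[i]:
                min_dist[i] = d
        # pick the point farthest from the selected set
        selected.append(max(range(n), key=lambda i: min_dist[i]))
    return selected
```

The per-iteration distance update over all N points is the dominant cost; FlashFPS reduces it via pruning and caching as described in the paper.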
This repository reproduces the network accuracy and speedup results reported in the paper. It currently supports FPS-CUDA, FlashFPS, and the state-of-the-art QuickFPS on the following workloads:
| Network Models | Main Library | Datasets | Supported Methods |
|---|---|---|---|
| PointNeXt-L, PointVector-L | openpoints | S3DIS, ScanNet | FPS-CUDA, FlashFPS, QuickFPS |
Detailed setup and experiment instructions are in the sub-folders below:
- `FlashFPS-Openpoints/` — PointNeXt / PointVector on the openpoints backbone. Ready to use.
- `FlashFPS-PointTransformer/` — Point Transformer backbone. To be released.
Hardware note. We recommend TITAN-class, RTX 6000, RTX 3090, or A100 GPUs (all tested successfully). Hopper-architecture GPUs (e.g., H100) are not recommended. All reported numbers in this repo were obtained on TITAN GPUs for consistency.
Minor accuracy variations may occur across GPU architectures due to GPU-dependent numerical behavior; they do not affect the overall conclusions.
- Support FlashFPS and FPS-CUDA for PointNeXt-L and PointVector-L.
- Add QuickFPS for PointNeXt-L and PointVector-L.
- Support FlashFPS on Point Transformer.
- Support FlashFPS performance breakdown.
```bibtex
@article{fu2026flashfps,
  title={FlashFPS: Efficient Farthest Point Sampling for Large-Scale Point Clouds via Pruning and Caching},
  author={Fu, Yuzhe and Ye, Hancheng and Guo, Cong and Zhang, Junyao and Wang, Qinsi and Lin, Yueqian and Zhou, Changchun and Li, Hai Helen and Chen, Yiran},
  journal={arXiv preprint arXiv:2604.17720},
  year={2026},
  doi={10.48550/arXiv.2604.17720},
}
```

FlashFPS optimizes Farthest Point Sampling, delivering an average 5.16× end-to-end speedup on GPUs with no hardware changes required. If you are interested in full-stack hardware–software co-design of point neural networks (PNNs), please also check out our related work:
FractalCloud: A Fractal-Inspired Architecture for Efficient Large-Scale Point Cloud Processing, which achieves an average 21.7× speedup on PNN inference through a co-designed accelerator.
Repository: FractalCloud
Tip: FlashFPS and FractalCloud share the same environment. If you've already set up one, the other runs out of the box ^_^
This repository builds upon FractalCloud, PointNeXt and OpenPoints. The QuickFPS implementation is adapted from QuickFPS and FastPoint. We thank the authors for their open-source contributions.
