A C++ and CUDA implementation of the Iterative Closest Point (ICP) algorithm, both with brute force approach and KD-Tree implementation. This project demonstrates point cloud alignment and compares CPU vs GPU performance using parallel computing.
- CPU Implementation: Uses Eigen for SVD-based rigid transformations
- CUDA Implementation: GPU-accelerated version with parallel nearest neighbor search, centroid computation, and transformation (Tested on remote GPU server in Linux only)
- Visualization: Open3D-based replay tool (Tested on Windows only)
- Performance Comparison: Demonstrates 300x speedup with GPU acceleration on large point clouds
- Nearest Neighbor Search: Find closest target point for each source point using K-d Tree
- Centroid Computation: Calculate centers of mass for both point sets
- SVD Transformation: Compute optimal rigid transformation (rotation + translation)
- Apply Transform: Update source points
- Iterate: Repeat until convergence (50 iterations)
Each thread is assigned to process one source point, enabling massive parallelism across the point cloud.
To minimize slow global memory access, target points are cooperatively loaded into fast, on-chip shared memory in blocks (tiles). This reduces global memory reads by a factor equal to the TILE_SIZE.
Data structures were shifted from float3 to float4 to ensure 128-bit memory alignment. This allows the GPU to perform perfectly coalesced memory transactions, maximizing effective bandwidth.
Centroid computation utilizes a parallel tree-reduction pattern in shared memory to avoid race conditions and safely sum point coordinates.
The 3X3 covariance matrix is calculated using a hybrid approach where the GPU performs the heavy summation of outer products via atomicAdd and the CPU handles the final SVD.
Once the optimal rotation and translation are found, a specialized kernel applies the 4X4 transformation matrix to the entire source cloud in a single parallel pass.
Computerphile has videos on both ICP and KD-Tree:
- Windows 10/11
- Visual Studio 2022 (C++ Desktop Development workload)
- CMake 3.18+
- Open3D libraries (e.g.,
D:/Open3D/open3d-devel-windows-amd64-0.19.0/CMake)
- CUDA Toolkit 11.0+ with compatible NVIDIA GPU
- CMake 3.18+
- Eigen3 library
sudo apt-get install libeigen3-dev
sudo apt-get install libopen3d-devIf the sudo command does not work, download pre-built release (https://github.com/isl-org/Open3D/releases)
cd ~
wget https://github.com/isl-org/Open3D/releases/download/v0.18.0/open3d-devel-linux-x86_64-cxx11-abi-0.18.0.tar.xz
tar -xf open3d-devel-linux-x86_64-cxx11-abi-0.18.0.tar.xzUpdate CMakeLists.txt to point to it, e.g: set(Open3D_DIR "$ENV{HOME}/open3d-devel-linux-x86_64-cxx11-abi-0.18.0/lib/cmake/Open3D")
Follow the guide here https://www.open3d.org/docs/release/getting_started.html#c
(Note: Eigen is also installed with Open3D, no need for Eigen installation if you are on Windows)
Update CMakeLists.txt to point to it, e.g: set(Open3D_DIR "D:/Open3D/open3d-devel-windows-amd64-0.19.0/CMake")
# Open x64 Native Tools Command Prompt for VS 2022
mkdir build
cd build
cmake ..
cmake --build . --config Releasemkdir build && cd build
cmake ..
make -j$(nproc)Create a transformed target point cloud:
Windows:
.\build\Release\generate_data.exe data/bun000.plyLinux:
./build/generate_data data/bun000.plyOutput: data/bun000_target.ply
CPU Version: Remove kd_tree argument for brute force nearest neighbor search
Windows:
.\build\Release\icp_engine.exe data/bun000.ply data/bun000_target.ply kd_tree
Linux:
./build/icp_engine data/bun000.ply data/bun000_target.ply kd_treeCUDA Version:
Not tested on Windows.
Linux:
./build/icp_cuda data/bun000.ply data/bun000_target.ply./build/icp_cuda_kdtree data/bun000.ply data/bun000_target.plyOutput: Creates frames/ directory with intermediate alignment steps (iter_0.ply, iter_5.ply, etc.)
.\build\Release\icp_vis.exe data/bun000_target.plyNot tested on Linux.
Opens an interactive Open3D window showing the alignment animation.



