Improve performance scaling with host threads #967

esseivaju · 2023-10-05T09:49:56Z

I was comparing Nsight Systems profiling output for the GeoModel-Tilecal integration. With a single thread, GPU utilization is low because of all the processing done on the CPU to report to Geant4 after each step. In addition, the action that takes the most time shifts as the number of threads increases.

The top picture has 1 thread; the bottom has 32 threads. With 32 threads, most of the time is spent in `cudaMemcpyAsync` calls in the extend-from-secondaries (and step-gather) actions. These copies are not actually asynchronous and require a device synchronization, which prevents kernels on other streams from executing and leads to many small gaps in GPU utilization. The copies come from the return values of Thrust calls, e.g., in remove_if_alive. I suspect that refactoring to use asynchronous memcpy to pinned memory would help GPU utilization scale with many host threads. I don't know of a way to tell Thrust where to copy the return value of a function, so to verify that this leads to performance improvements we might have to implement these functions ourselves.
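As a rough illustration of the idea above, the following sketch counts "alive" flags with a hand-written kernel and copies the result into a pinned host buffer on a given stream, so that only that stream is synchronized. The names `count_alive_kernel` and `PinnedCounter` are hypothetical and are not part of Celeritas or Thrust; error checking is omitted for brevity.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: counts nonzero "alive" flags into *d_count.
__global__ void count_alive_kernel(int const* alive, int n, int* d_count)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && alive[i] != 0)
    {
        atomicAdd(d_count, 1);
    }
}

// Hypothetical helper that owns a device counter and a pinned host mirror.
// Both buffers are allocated once so the per-step path does no allocation.
struct PinnedCounter
{
    int* d_count = nullptr;
    int* h_count = nullptr;

    PinnedCounter()
    {
        cudaMalloc(&d_count, sizeof(int));
        cudaMallocHost(&h_count, sizeof(int));  // page-locked host memory
    }
    ~PinnedCounter()
    {
        cudaFreeHost(h_count);
        cudaFree(d_count);
    }
    PinnedCounter(PinnedCounter const&) = delete;
    PinnedCounter& operator=(PinnedCounter const&) = delete;

    // Count nonzero flags in d_alive[0, n) using only work queued on `stream`.
    int count_alive(int const* d_alive, int n, cudaStream_t stream)
    {
        cudaMemsetAsync(d_count, 0, sizeof(int), stream);
        if (n > 0)
        {
            int const threads = 256;
            int const blocks = (n + threads - 1) / threads;
            count_alive_kernel<<<blocks, threads, 0, stream>>>(
                d_alive, n, d_count);
        }
        // The destination is pinned, so this copy is queued on the stream
        // instead of falling back to a staged, blocking transfer.
        cudaMemcpyAsync(h_count,
                        d_count,
                        sizeof(int),
                        cudaMemcpyDeviceToHost,
                        stream);
        // Wait for this stream only; other streams keep running.
        cudaStreamSynchronize(stream);
        return *h_count;
    }
};
```

Each host thread would hold its own `PinnedCounter` and stream under this design, so reading back the count never forces a device-wide synchronization.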

sethrj · 2023-10-30T13:21:39Z

@esseivaju I think you've done a great job addressing this. Could you please add links to the relevant PRs, take a new snapshot of the timing, and close this out?

esseivaju · 2023-11-01T05:43:39Z

Comparison of scaling with host threads between Celeritas v0.3.2 and 2332351, measured on Perlmutter using the ATLAS Tilecal example, VecGeom 1.2.5, and Geant4 11.0.1. Relevant PRs:

Use par_nosync execution policy to execute thrust algorithms #908
Add pinned allocator and asynchronous memory operations #910
Implement asynchronous DeviceAllocation #953
Add unified memory support #965 (as of this comment, only helps if track sorting is enabled, which was turned off in this benchmark)
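For readers unfamiliar with the first item in this list, here is a minimal sketch of running a Thrust algorithm with the `par_nosync` policy on a user-provided stream. It assumes a Thrust version that ships `thrust::cuda::par_nosync` (1.16 or newer); the `Doubler` functor and `scale_on_stream` function are hypothetical and are not taken from the Celeritas PRs.

```cuda
#include <cuda_runtime.h>
#include <thrust/system/cuda/execution_policy.h>
#include <thrust/transform.h>

// Hypothetical functor used only for this illustration.
struct Doubler
{
    __host__ __device__ float operator()(float x) const { return 2.0f * x; }
};

// Double d_data[0, n) in place on `stream`.
// With par_nosync, Thrust may return before the work has completed,
// so the caller decides when (and whether) to synchronize the stream.
void scale_on_stream(float* d_data, int n, cudaStream_t stream)
{
    thrust::transform(thrust::cuda::par_nosync.on(stream),
                      d_data,
                      d_data + n,
                      d_data,
                      Doubler{});
}
```

The caller is expected to call `cudaStreamSynchronize(stream)` before reading the results on the host. Algorithms that return a value computed on the device still have to wait for that value, which is the case discussed in the next paragraph.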

As mentioned in the first comment, the only remaining async copies to pageable memory are the return values of Thrust calls. There is no way to change that (see this discussion), and the gains would most likely be small since we have to synchronize each stream at the end of each step anyway, so fixing it is not a top priority.

esseivaju added the core Software engineering infrastructure label Oct 5, 2023

sethrj changed the title ~~Performance scaling with host threads~~ Improve performance scaling with host threads Oct 30, 2023

sethrj added the enhancement New feature or request label Oct 30, 2023

esseivaju closed this as completed Nov 1, 2023

sethrj added performance Changes for performance optimization and removed core Software engineering infrastructure labels Nov 14, 2023
