Improve performance scaling with host threads #967

Closed
esseivaju opened this issue Oct 5, 2023 · 2 comments
Labels
enhancement New feature or request performance Changes for performance optimization

Comments

@esseivaju
Contributor

I was comparing Nsight Systems profiling output for the GeoModel-Tilecal integration. GPU utilization is low with a single thread because of all the processing done on the CPU to report to Geant4 after each step. In addition, there is a shift in which action takes the most time.

The top picture has 1 thread; the bottom has 32 threads. With 32 threads, most of the time is spent in memcpyasync in the extend-from-secondaries (and step-gather) actions. These copies are not actually asynchronous and require a device synchronization, which prevents kernels on other streams from executing and leads to many small gaps in GPU utilization. The copies come from the return values of Thrust calls, e.g., in remove_if_alive. I suspect that refactoring to use asynchronous memcpy to pinned memory would help GPU utilization scale with many host threads. I don't know of a way to tell Thrust where to copy a function's return value, so if we want to verify that this leads to performance improvements, we might have to implement these functions ourselves.
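To make the pinned-memory idea concrete, here is a minimal sketch of what a hand-written replacement could look like. It is not the Celeritas implementation: `count_alive_kernel`, `count_alive_on_stream`, and the `alive` flag array are hypothetical names used only for illustration, and a simple atomic counter stands in for whatever Thrust computes internally. The point is only that the result lands in page-locked host memory via a stream-ordered copy, so the host waits on its own stream rather than on the whole device.

```cuda
#include <cassert>
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Abort with a message if a CUDA runtime call fails.
#define CUDA_CHECK(call)                                           \
    do {                                                           \
        cudaError_t err_ = (call);                                 \
        if (err_ != cudaSuccess) {                                 \
            std::fprintf(stderr, "CUDA error: %s\n",               \
                         cudaGetErrorString(err_));                \
            std::abort();                                          \
        }                                                          \
    } while (0)

// Hypothetical kernel: count nonzero "alive" flags into a device counter.
__global__ void count_alive_kernel(const int* alive, int n,
                                   unsigned int* count)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && alive[i] != 0) {
        atomicAdd(count, 1u);
    }
}

// Count alive flags using only operations ordered on `stream`.
// Assumption: `pinned_result` was allocated with cudaMallocHost, so the
// device-to-host copy can be queued on the stream like any other work.
unsigned int count_alive_on_stream(const int* d_alive, int n,
                                   unsigned int* d_count,
                                   unsigned int* pinned_result,
                                   cudaStream_t stream)
{
    assert(pinned_result != nullptr);

    // Reset the device-side counter on this stream.
    CUDA_CHECK(cudaMemsetAsync(d_count, 0, sizeof(unsigned int), stream));

    if (n > 0) {
        const int threads = 256;
        const int blocks = (n + threads - 1) / threads;
        count_alive_kernel<<<blocks, threads, 0, stream>>>(d_alive, n,
                                                           d_count);
        CUDA_CHECK(cudaGetLastError());
    }

    // Stream-ordered copy of the result into page-locked host memory.
    CUDA_CHECK(cudaMemcpyAsync(pinned_result, d_count, sizeof(unsigned int),
                               cudaMemcpyDeviceToHost, stream));

    // Wait for this stream only; other streams keep running.
    CUDA_CHECK(cudaStreamSynchronize(stream));
    return *pinned_result;
}
```

In a setup like this, each host thread would presumably allocate its `d_count` and `pinned_result` buffers once per stream and reuse them at every step, since repeated page-locked allocations are themselves expensive.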
[Images: profiling screenshots]
@esseivaju added the "core" (Software engineering infrastructure) label on Oct 5, 2023
@sethrj
Member

sethrj commented Oct 30, 2023

@esseivaju I think you've done a great job addressing this. Could you please add links to the relevant PRs, take a new snapshot of the timing, and close this out?

@sethrj changed the title from "Performance scaling with host threads" to "Improve performance scaling with host threads" on Oct 30, 2023
@sethrj added the "enhancement" (New feature or request) label on Oct 30, 2023
@esseivaju
Contributor Author

Comparison of scaling with host threads between Celeritas v0.3.2 and commit 2332351. Measured on Perlmutter using the ATLAS Tilecal example, VecGeom 1.2.5, and Geant4 11.0.1. Relevant PRs:

As mentioned in the first comment, the only remaining async copies to pageable memory are the return values of Thrust calls. There is no way to change that (see this discussion), and the gains would most likely be small since we have to synchronize each stream at the end of each step anyway, so it's not a top priority to fix.

[Image: scaling_64p_celer]

@sethrj added the "performance" (Changes for performance optimization) label and removed the "core" (Software engineering infrastructure) label on Nov 14, 2023