Add helpers to measure execution times #2740
Conversation
Just a few quick comments...
cupy/testing/time.py (outdated):

```python
func(*args)

for i in range(n):
    ev1.synchronize()
```
I don't think you need this synchronization.
The GPU time before function entry may be counted if this synchronization is missing.
Only the ongoing work in the warmup loop might be counted in this case. How about moving this sync outside the current loop and putting it right after the warmup loop?
Branch updated from c48daf1 to 733a212.
I will support a context manager interface for time measurement in another PR.
cupyx/time.py (outdated):

```python
func(*args)

for i in range(n):
    ev1.synchronize()
```
I think we have to record `ev1` first (or synchronize the entire device instead of `ev1`) to avoid effects from the warmup.
Why? I think we should not record `ev1` until all GPU threads finish their computations.
My understanding of CUDA event semantics is that:

- `ev1.record()` queues the event to the current stream
- `ev1.synchronize()` blocks until the event completes

https://devblogs.nvidia.com/how-implement-performance-metrics-cuda-cc/
PTAL.
pfnCI, test this please.
Successfully created a job for commit 36e9058.
LGTM!
Jenkins CI test (for commit 36e9058, target branch master) succeeded!
I want to officially support helpers to measure CPU/GPU execution times.
The CuPy organization currently has cupy-benchmark, but it is intended for maintainers to show performance results to users. I want to make this feature available to external contributors and users as well.
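For the CPU-time side, the kind of helper being proposed could look roughly like this pure-Python sketch; the name `repeat_cpu` and its parameters are hypothetical, not the API this PR adds:

```python
import statistics
import time

def repeat_cpu(func, args=(), n_repeat=10, n_warmup=3):
    # Hypothetical sketch: warm up first, then time n_repeat runs
    # with a monotonic high-resolution clock.
    for _ in range(n_warmup):
        func(*args)
    times = []
    for _ in range(n_repeat):
        t0 = time.perf_counter()
        func(*args)
        times.append(time.perf_counter() - t0)
    return statistics.mean(times), statistics.stdev(times)

mean, std = repeat_cpu(lambda: sum(range(100_000)))
print(f"mean={mean:.2e}s  std={std:.2e}s")
```

Reporting a mean together with a standard deviation over repeated runs is what makes such a helper more useful than a single ad-hoc `time.time()` pair.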