
GPU memory utilization, operator execution modes #47

Closed

Conversation

louisfeng
Contributor

Summary:

New features:

  • Collects GPU memory utilization in addition to latency. This is done using PyTorch's own memory tracker (see the memory sketch after this list).
  • Added new modes of operator execution (see the timing sketch after the note below):
    • DISCRETE: this is the original method. Each operator execution is timed individually. A synchronization point is inserted before and after each GPU operator launch, and the cache is cleared between executions. However, this introduces various overheads that are not observed when multiple operators execute sequentially, due to CUDA's asynchronous execution model.
    • CONTINUOUS: in this mode, the operator is executed in a tight loop. Only the overall loop time is measured, then divided by the loop count to get the average operator execution time. The overheads are amortized over many iterations, so the result is closer to the actual kernel execution time.
    • CONTINUOUS_EVENTS: similar to CONTINUOUS mode, but a torch.cuda.Event is also recorded before and after each execution in the loop. The operator time is calculated from the delta between these events. This gives individual measurements, but may introduce minor overhead that is not observed in CONTINUOUS mode.
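
For illustration, here is a minimal sketch of collecting peak GPU memory around an operator call with PyTorch's built-in memory tracker. The helper name and the matmul workload are placeholders for this example, not the benchmark's actual API.

```python
import torch

def measure_peak_gpu_memory(op, *args, device="cuda"):
    # Make sure earlier work is done, then reset the peak-memory counter.
    torch.cuda.synchronize(device)
    torch.cuda.reset_peak_memory_stats(device)
    baseline = torch.cuda.memory_allocated(device)  # memory already held by inputs, etc.

    out = op(*args)

    # Wait for the async kernel to finish before reading the tracker.
    torch.cuda.synchronize(device)
    peak = torch.cuda.max_memory_allocated(device)  # high-water mark since the reset
    return out, peak - baseline                     # extra memory used by this call

# Placeholder workload for the example.
a = torch.randn(4096, 4096, device="cuda")
b = torch.randn(4096, 4096, device="cuda")
_, extra_bytes = measure_peak_gpu_memory(torch.matmul, a, b)
print(f"peak extra GPU memory: {extra_bytes / 2**20:.1f} MiB")
```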

Note that CONTINUOUS mode measures the same way as the compute/pt/ benchmarks do. As shown below (2nd run), they produce the same result.
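
For illustration only, the sketch below shows one way the three modes can be timed with plain PyTorch. The function names, iteration count, and the relu workload are assumptions for the example, not the benchmark's actual implementation.

```python
import time
import torch

def time_discrete(op, args, iters=100):
    """DISCRETE: time each execution individually, with syncs and a cache clear."""
    times = []
    for _ in range(iters):
        torch.cuda.empty_cache()      # clear the caching allocator between runs
        torch.cuda.synchronize()      # sync before the launch
        start = time.perf_counter()
        op(*args)
        torch.cuda.synchronize()      # sync after the launch
        times.append(time.perf_counter() - start)
    return times

def time_continuous(op, args, iters=100):
    """CONTINUOUS: time one tight loop and report the average per iteration."""
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        op(*args)
    torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters

def time_continuous_events(op, args, iters=100):
    """CONTINUOUS_EVENTS: same tight loop, but bracket each iteration with CUDA events."""
    starts = [torch.cuda.Event(enable_timing=True) for _ in range(iters)]
    ends = [torch.cuda.Event(enable_timing=True) for _ in range(iters)]
    for i in range(iters):
        starts[i].record()
        op(*args)
        ends[i].record()
    torch.cuda.synchronize()
    return [s.elapsed_time(e) for s, e in zip(starts, ends)]  # per-iteration times in ms

# Placeholder workload for the example.
x = torch.randn(2048, 2048, device="cuda")
print(time_continuous(torch.relu, (x,)))
```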

A few to-do items for the coming diffs:

  • convert to use the new torch nvtx (D33155244)
  • be consistent in using buck build for the run_batch script
  • nsys support

Differential Revision: D34071165

@facebook-github-bot added the CLA Signed and fb-exported labels on Mar 17, 2022
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D34071165
