
GPU memory utilization, operator execution modes #47

Closed

Conversation

louisfeng
Contributor

Summary:

New features:

  • Collects GPU memory utilization in addition to latency. This is done using PyTorch's own memory tracker (see the memory sketch after this list).
  • Added new modes of operator execution (see the timing sketch after the note below):
    • DISCRETE: this is the original method. Each operator execution is timed individually. A synchronization point is inserted before and after each GPU operator launch, and the cache is cleared between executions. However, this introduces various overheads that are not observed when multiple operators execute sequentially, due to CUDA's asynchronous execution model.
    • CONTINUOUS: in this mode, the operator is executed in a tight loop. Only the overall loop time is measured, then divided by the loop count to get the average operator execution time. The overheads are amortized over many iterations, so the result is closer to the actual kernel execution time.
    • CONTINUOUS_EVENTS: similar to CONTINUOUS mode, but a torch.cuda.Event is also recorded before and after each execution in the loop. The operator time is calculated from the delta between these events. This gives individual measurements, but may introduce minor overhead that is not observed in CONTINUOUS mode.
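
For illustration, here is a minimal sketch of collecting peak GPU memory around an operator call with PyTorch's built-in memory tracker. The helper name and the matmul workload are placeholders for this example, not the benchmark's actual API.

```python
import torch

def measure_peak_gpu_memory(op, *args, device="cuda"):
    # Make sure earlier work is done, then reset the peak-memory counter.
    torch.cuda.synchronize(device)
    torch.cuda.reset_peak_memory_stats(device)
    baseline = torch.cuda.memory_allocated(device)  # memory already held by inputs, etc.

    out = op(*args)

    # Wait for the async kernel to finish before reading the tracker.
    torch.cuda.synchronize(device)
    peak = torch.cuda.max_memory_allocated(device)  # high-water mark since the reset
    return out, peak - baseline                     # extra memory used by this call

# Placeholder workload for the example.
a = torch.randn(4096, 4096, device="cuda")
b = torch.randn(4096, 4096, device="cuda")
_, extra_bytes = measure_peak_gpu_memory(torch.matmul, a, b)
print(f"peak extra GPU memory: {extra_bytes / 2**20:.1f} MiB")
```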

Note that CONTINUOUS mode measures the same way as the compute/pt/ benchmarks do. As shown below (2nd run), they produce the same result.
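
For illustration only, the sketch below shows one way the three modes can be timed with plain PyTorch. The function names, iteration count, and the relu workload are assumptions for the example, not the benchmark's actual implementation.

```python
import time
import torch

def time_discrete(op, args, iters=100):
    """DISCRETE: time each execution individually, with syncs and a cache clear."""
    times = []
    for _ in range(iters):
        torch.cuda.empty_cache()      # clear the caching allocator between runs
        torch.cuda.synchronize()      # sync before the launch
        start = time.perf_counter()
        op(*args)
        torch.cuda.synchronize()      # sync after the launch
        times.append(time.perf_counter() - start)
    return times

def time_continuous(op, args, iters=100):
    """CONTINUOUS: time one tight loop and report the average per iteration."""
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        op(*args)
    torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters

def time_continuous_events(op, args, iters=100):
    """CONTINUOUS_EVENTS: same tight loop, but bracket each iteration with CUDA events."""
    starts = [torch.cuda.Event(enable_timing=True) for _ in range(iters)]
    ends = [torch.cuda.Event(enable_timing=True) for _ in range(iters)]
    for i in range(iters):
        starts[i].record()
        op(*args)
        ends[i].record()
    torch.cuda.synchronize()
    return [s.elapsed_time(e) for s, e in zip(starts, ends)]  # per-iteration times in ms

# Placeholder workload for the example.
x = torch.randn(2048, 2048, device="cuda")
print(time_continuous(torch.relu, (x,)))
```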

A few to-do items for the coming diffs:

  • convert to use the new torch nvtx (D33155244)
  • be consistent in using buck build for the run_batch script
  • nsys support

Differential Revision: D34071165

@facebook-github-bot added the CLA Signed and fb-exported labels on Mar 17, 2022
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D34071165
