This repository was archived by the owner on Jan 22, 2025. It is now read-only.

Conversation

sf-wind commented on Mar 4, 2023

Summary:
Currently the EMA computation runs in the after-step hook, which sits on the critical path where no other work is available and therefore increases the training iteration time. This diff moves the EMA computation to after the backward pass but before the optimizer step. That way, most of the EMA computation time on the CPU can be hidden, because the CPU is otherwise idle waiting for the GPU to finish the backward pass. The change can hide the EMA CPU time entirely: it reduces the EMA time from 20 ms to 4 ms, where the 4 ms is GPU time.

However, with this change the EMA reads parameter values from the previous iteration (since it runs before the optimizer step). Since training runs for many epochs, a one-iteration lag should not be significant.

Reviewed By: tglik

Differential Revision: D43527552
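
The new hook ordering amounts to the following training-loop sketch. This is a minimal illustration of the placement described above, not the actual d2go EMA hook; the model, data, and decay value are made-up stand-ins.

```python
import copy
import torch

# Minimal sketch of the hook ordering described in this PR; names and
# shapes are illustrative only.
model = torch.nn.Linear(128, 128)
ema_model = copy.deepcopy(model)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
decay = 0.999

def update_ema():
    # EMA update; with the new placement it sees the parameters from the
    # previous optimizer step (the one-iteration lag mentioned above).
    with torch.no_grad():
        for p, ema_p in zip(model.parameters(), ema_model.parameters()):
            ema_p.lerp_(p, 1.0 - decay)

for _ in range(10):
    data, target = torch.randn(32, 128), torch.randn(32, 128)
    optimizer.zero_grad()
    loss = torch.nn.functional.mse_loss(model(data), target)
    loss.backward()
    update_ema()      # new placement: in GPU training, the CPU issues this EMA
                      # work while the backward kernels are still running
    optimizer.step()  # old placement ran the EMA in an after-step hook here,
                      # on the critical path
```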

sf-wind and others added 2 commits March 4, 2023 14:47
Summary: Currently the EMA implementation does the multiplication first and then the addition, which requires two round trips to HBM. With the lerp operator, a single kernel does both. This change uses lerp to compute the EMA instead, reducing the GPU EMA computation time by 40% (see the sketch after the commit list below).

Differential Revision: https://www.internalfb.com/diff/D43525938?entry_point=27

fbshipit-source-id: cc8389a5d93f52bfa472b1533ea52bb8c19834cd
Summary: Moves the EMA computation to after the backward pass but before the optimizer step, as described in the pull request description above.

Reviewed By: tglik

Differential Revision: D43527552

fbshipit-source-id: 4eea88a935befad3cf6f8f20e6198f6b3a3169b6
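
The lerp-based update from the first commit amounts to the following before/after comparison. This is an illustrative sketch (tensor names and shapes are made up), not the exact d2go code.

```python
import torch

decay = 0.999
param = torch.randn(1024, 1024)
ema_param = torch.randn(1024, 1024)

# Old form: multiply then add -- two elementwise kernels, two HBM round trips.
ema_param.mul_(decay).add_(param, alpha=1.0 - decay)

# New form: a single lerp kernel computes the algebraically identical update
#   ema_param = ema_param + (1 - decay) * (param - ema_param)
# in one pass over memory.
ema_param.lerp_(param, 1.0 - decay)
```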
facebook-github-bot added the CLA Signed and fb-exported labels on Mar 4, 2023
facebook-github-bot (Contributor) commented:

This pull request was exported from Phabricator. Differential Revision: D43527552

facebook-github-bot (Contributor) commented:

This pull request has been merged in a7dc757.

facebook-github-bot pushed a commit to facebookresearch/detectron2 that referenced this pull request Mar 5, 2023
Summary:
X-link: facebookresearch/d2go#494

Moves the EMA computation to after the backward pass but before the optimizer step, as described in the pull request description above.

Reviewed By: tglik

Differential Revision: D43527552

fbshipit-source-id: 1faa9d910b20cae0fc77da541bc0ad176bce18a8