BUGs about amp #64

Closed
wjn1996 opened this issue Apr 28, 2022 · 0 comments

wjn1996 commented Apr 28, 2022

When I train the models with amp, I find that the model cannot converge.
(screenshot attached)

I found some bugs in /easynlp/core/trainer.py, as shown below:
(screenshot attached)

My analysis: the code in the red box clears the gradients, but when amp is enabled this code is never executed, which causes the problem.

I have now resolved it and will submit a pull request later.

We recommend calling the optimizer, with and without amp, in the following four settings (a minimal sketch follows the list):

  • bertadam: self._optimizer.step() + self._optimizer.zero_grad()
  • bertadam + amp: self._scaler.step(self._optimizer) + self._scaler.update() + self._optimizer.zero_grad()
  • adamw: torch.nn.utils.clip_grad_norm_() + self._optimizer.step() + self._lr_scheduler.step() + self._optimizer.zero_grad()
  • adamw + amp: torch.nn.utils.clip_grad_norm_() + self._scaler.step(self._optimizer) + self._scaler.update() + self._lr_scheduler.step() + self._optimizer.zero_grad()
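For reference, here is a minimal sketch of the adamw / adamw + amp update order described above. It is not the actual trainer.py code; the method name `optimizer_step`, the `self._model` attribute, and the `max_grad_norm` value are illustrative assumptions. It only assumes `self._scaler` is a `torch.cuda.amp.GradScaler`, `self._optimizer` is AdamW, and `self._lr_scheduler` is a learning-rate scheduler:

```python
import torch

def optimizer_step(self, loss, use_amp, max_grad_norm=1.0):
    # Sketch of one update step; self._model and max_grad_norm are assumptions.
    if use_amp:
        # Scale the loss so small gradients do not underflow in fp16.
        self._scaler.scale(loss).backward()
        # Unscale before clipping so the norm is computed on the real gradients.
        self._scaler.unscale_(self._optimizer)
        torch.nn.utils.clip_grad_norm_(self._model.parameters(), max_grad_norm)
        self._scaler.step(self._optimizer)  # skips the step if inf/nan gradients are found
        self._scaler.update()
        self._lr_scheduler.step()
        self._optimizer.zero_grad()  # must also run on the amp path (the original bug)
    else:
        loss.backward()
        torch.nn.utils.clip_grad_norm_(self._model.parameters(), max_grad_norm)
        self._optimizer.step()
        self._lr_scheduler.step()
        self._optimizer.zero_grad()
```

The key point is that `self._optimizer.zero_grad()` is reached on both branches; in the buggy version it was skipped when amp was enabled, so gradients accumulated across steps and the model could not converge.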