Hi
I have a quick question. For your transformer or any other application, have you used FP16 when getting gradients from a backward call? In the model I'm working with, for every loss-scale factor I've tried, backward gives reasonable gradients as long as I don't set create_graph=True. But when I do set it to True, some of the gradients match the create_graph=False case, while many others come out as NaNs. Everything works fine when I use FP32 operations, but I'd like to keep FP16's GPU memory and speed advantages.
Any suggestions you can provide would be appreciated!
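Here is a minimal sketch of what I'm seeing, using a hypothetical toy model in place of my actual one (the Linear layer, shapes, and scale value below are just placeholders for illustration):

```python
import torch

# Minimal repro sketch: a toy FP16 model (hypothetical stand-in for my
# actual model), a scaled loss, and gradients compared with
# create_graph=False vs True. Assumes a CUDA device, since some FP16
# kernels may not be supported on CPU.
torch.manual_seed(0)
device = torch.device("cuda")

model = torch.nn.Linear(16, 1).to(device).half()  # FP16 parameters
x = torch.randn(8, 16, device=device).half()
scale = 128.0  # example loss-scale factor

def get_grads(create_graph):
    model.zero_grad()
    loss = (model(x) ** 2).mean() * scale  # scaled FP16 loss
    loss.backward(create_graph=create_graph)
    # Unscale before inspecting; detach because with create_graph=True
    # the .grad tensors stay attached to the autograd graph.
    return [p.grad.detach().float() / scale for p in model.parameters()]

g_plain = get_grads(create_graph=False)
g_graph = get_grads(create_graph=True)

for gp, gg in zip(g_plain, g_graph):
    print("max |diff|:", (gp - gg).abs().max().item(),
          "| NaNs with create_graph=True:", bool(torch.isnan(gg).any()))
```

In my real model, the create_graph=True pass is where the NaNs appear, regardless of the scale factor.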