I wonder why alphas is detach()ed before being saved to self.alphas in the Attention class. I tried self.alphas = alphas, that is, without detach, and trained the model. There was no difference in performance, so I believe the reason lies elsewhere.
Thank you for your great teaching in your great book!
Thank you for supporting my work, and for your kind words :-)
Regarding the "detachment" of the alphas, the main idea is to prevent unintentional changes to the dynamic computation graph.
If you don't detach the alphas, it shouldn't change anything in the training process, as you already noticed.
But let's say you pause training and decide to take a peek at the alphas. You may end up performing an operation on them, and, since the graph keeps track of every operation performed on gradient-requiring tensors and their dependencies, it will impact the graph. That may be an issue if you resume training afterward.
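To make that concrete, here is a minimal sketch (not the book's code) contrasting an attached tensor with a detached one. The names `kept` and `peek` are just for illustration:

```python
import torch

w = torch.ones(3, requires_grad=True)
alphas = torch.softmax(w, dim=0)

kept = alphas           # still attached to the computation graph
peek = alphas.detach()  # shares the same data, but lives outside the graph

# Any operation on `kept` gets recorded in the graph;
# operations on `peek` are invisible to autograd
print(kept.requires_grad)   # True
print(peek.requires_grad)   # False
```

So poking at `peek` between training sessions cannot alter what backpropagation later sees.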
In other circumstances, like the validation loop, we wrap the operations with a no_grad context manager to prevent potential problems.
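A minimal sketch of that pattern, using a stand-in linear model rather than any model from the book:

```python
import torch

model = torch.nn.Linear(2, 1)
x = torch.randn(4, 2)

# Inside no_grad, no operations are recorded, so the
# validation pass cannot affect the training graph
with torch.no_grad():
    preds = model(x)

print(preds.requires_grad)  # False
```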
The same goes for the detachment of the alphas - it's there as a safeguard, to make sure that it's totally safe to play with the values in self.alphas. It's also convenient, because you'd need to detach them anyway if you wanted to convert the alphas to NumPy arrays.
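The NumPy point is easy to demonstrate. In this sketch, `scores` simply stands in for attention weights; it is not the Attention class from the book:

```python
import torch

x = torch.randn(1, 3, requires_grad=True)
scores = torch.softmax(x, dim=-1)  # stand-in for attention weights

# scores.numpy() would raise a RuntimeError, because scores requires grad;
# detaching first makes the conversion (and any further tinkering) safe
alphas = scores.detach()
arr = alphas.numpy()
print(arr.shape)  # (1, 3)
```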