Ye and I just looked at the wiki. Thanks for the new method! We have a few questions about it.
You mentioned that "α_n and D_n need to approximate the inverse of nI(θ)". We were wondering why the inverse of nI(θ) would be the optimal learning rate.
It would also be helpful if you could point us to the literature on the method for approximating the inverse of nI(θ). (BTW, there might be a typo in "Take the inverse-square of all components Gi <- Gi^2": should we take the inverse of Gi before squaring it, i.e. Gi <- Gi^-2?)
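To make sure we are reading that step correctly, here is a minimal Python sketch of our understanding (the function and variable names are ours, not from the wiki): since I(θ) = E[g gᵀ] for gradients g of the log-likelihood, accumulating squared gradient components over iterates gives a diagonal estimate of nI(θ), whose component-wise inverse would then play the role of D_n.

```python
import numpy as np

def diag_inverse_fisher(grads):
    """Sketch of our reading (not the wiki's code): `grads` is an (n, p)
    array whose t-th row is the gradient of the log-likelihood at iterate t.
    Since I(theta) = E[g g^T], the accumulated squared components
    G_i = sum_t g_{t,i}^2 estimate the i-th diagonal entry of n I(theta)."""
    G = np.sum(grads ** 2, axis=0)  # G_i ~ n * I_ii(theta)
    return 1.0 / G                  # component-wise inverse: diagonal approx of (n I(theta))^{-1}
```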
Is the iterative method for calculating the learning rate also applicable to a 1-dim learning rate?
It is a theoretical result that if one uses the inverse of nI(θ*) as the learning rate, then SGD is optimal, i.e. it achieves the same asymptotic variance as the MLE. I just added two papers about this to the "literature" Dropbox folder.
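For reference, the statement is roughly the following (my paraphrase of those papers, with θ* denoting the true parameter):

```latex
% With learning-rate matrix D_n = n^{-1} I(\theta^*)^{-1}, the SGD iterate
% \theta_n attains the Cramer-Rao lower bound asymptotically:
\sqrt{n}\,(\theta_n - \theta^*) \;\xrightarrow{d}\; \mathcal{N}\!\bigl(0,\, I(\theta^*)^{-1}\bigr),
% which is the same limiting distribution as the MLE; other choices of
% D_n inflate the asymptotic variance.
```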
It is, but the method with multiple learning rates will be more efficient. In the experiments we can try simply using the norms of the gradients of the log-likelihood as a 1-dim learning rate; see the sketch below.
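A minimal sketch of what I have in mind (names are placeholders): collapse the per-component statistic into a single scalar by accumulating squared gradient norms, which estimates n · tr I(θ) rather than the full matrix, so its inverse is only a crude scalar proxy for the optimal rate.

```python
import numpy as np

def scalar_learning_rate(grads):
    """Sketch of the 1-dim variant: accumulate squared norms of the
    log-likelihood gradients, sum_t ||g_t||^2 ~ n * trace(I(theta)),
    and use the inverse as a single scalar learning rate."""
    G = sum(float(np.dot(g, g)) for g in grads)
    return 1.0 / G
```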
I believe the user should have the following options for the learning rate.
I suggest we work on options 2, 3, and 4 for now.
We can add the rest as we go. Any thoughts?