AdaGrad vs d-dim #50
I generalized the one-dimensional learning rates to $d$-dimensional ones. The `d-dim` rate scales each coordinate using the observed Fisher information, and the `d-dim-weight` variant adds a discount on how past terms are weighted. See the learning rate wiki for more. |
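For concreteness, here is a rough sketch of the kind of per-coordinate scheme being described, assuming the rate divides each coordinate's step by an accumulated (optionally discounted) diagonal of the observed Fisher information; `ddim_sgd`, `alpha`, and `discount` are hypothetical names, not the package internals:

```r
# Sketch only: diagonal SGD for a linear model, with a per-coordinate
# learning rate built from the observed Fisher information diagonal.
# For the Gaussian linear model, the per-observation observed Fisher
# information is x_n x_n^T, so its diagonal is x_n^2.
ddim_sgd <- function(X, y, alpha = 1, discount = 1) {
  d <- ncol(X)
  theta <- rep(0, d)
  G <- rep(0, d)                          # running (discounted) Fisher diagonal
  for (n in seq_len(nrow(X))) {
    xn <- X[n, ]
    grad <- -(y[n] - sum(xn * theta)) * xn  # gradient of squared-error loss
    G <- discount * G + xn^2                # discount < 1 down-weights history
    theta <- theta - (alpha / (1 + G)) * grad
  }
  theta
}
```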
Nice, do you see the sqrt being better even in the normal model? |
Yup.

```r
library(sgd)

# Dimensions
N <- 1e5
d <- 1e2

# Generate data.
set.seed(42)
X <- matrix(rnorm(N*d), ncol=d)
theta <- rep(5, d+1)
eps <- rnorm(N)
y <- cbind(1, X) %*% theta + eps
dat <- data.frame(y=y, x=X)

sgd.theta <- sgd(y ~ ., data=dat, model="lm",
                 sgd.control=list(lr="d-dim"))
mean((sgd.theta$coefficients - theta)^2) # MSE
## [1] 0.008434492

sgd.theta <- sgd(y ~ ., data=dat, model="lm",
                 sgd.control=list(lr="d-dim-weight"))
mean((sgd.theta$coefficients - theta)^2) # MSE
## [1] 24.48697

sgd.theta <- sgd(y ~ ., data=dat, model="lm",
                 sgd.control=list(lr="adagrad"))
mean((sgd.theta$coefficients - theta)^2) # MSE
## [1] 0.0003773736
```

AdaGrad clearly wins here. Intuitively, AdaGrad is somehow able to use the information stored in the previous history of the gradients, but intelligently, so that the accumulated history doesn't just blow up as it does with `d-dim-weight`. |
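For reference, a minimal self-contained sketch of the two updates being contrasted, assuming AdaGrad divides each coordinate by the square root of the accumulated squared gradients while the non-square-rooted version divides by the accumulated sum itself (`run_diag_sgd` is a hypothetical helper, not the package's code):

```r
run_diag_sgd <- function(X, y, sqrt_rate = TRUE, alpha = 1) {
  theta <- rep(0, ncol(X))
  G <- rep(1e-8, ncol(X))            # accumulated squared gradients
  for (n in seq_len(nrow(X))) {
    xn <- X[n, ]
    grad <- -(y[n] - sum(xn * theta)) * xn
    G <- G + grad^2
    rate <- if (sqrt_rate) alpha / sqrt(G) else alpha / G
    theta <- theta - rate * grad
  }
  theta
}

Xi <- cbind(1, X)                    # include the intercept, as in the fits above
mean((run_diag_sgd(Xi, y, sqrt_rate = TRUE)  - theta)^2)  # AdaGrad-style
mean((run_diag_sgd(Xi, y, sqrt_rate = FALSE) - theta)^2)  # 1/sum(g^2): rate decays fast
```

With the non-square-rooted rate, the denominator grows like the running sum of squared gradients, so the steps shrink roughly like $1/t$ per coordinate and the iterate barely moves from its starting point; the square root slows that decay.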
I've been trying to dig into the theory and am thoroughly perplexed. The paper looks at minimizing the regret function using the Mahalanobis norm, which generalizes L2. That is, we move from the standard projected gradient stochastic update

$$x_{t+1} = \Pi_{\mathcal{X}}\left(x_t - \eta\, g_t\right)$$

to

$$x_{t+1} = \Pi_{\mathcal{X}}^{A}\left(x_t - \eta\, A^{-1} g_t\right),$$

where $\Pi_{\mathcal{X}}^{A}(y) = \arg\min_{x \in \mathcal{X}} \lVert x - y \rVert_A$ is the projection in the Mahalanobis norm. One can then bound the regret function, which leads to solving

$$\min_{A \succeq 0}\ \sum_{t=1}^{T} \langle g_t, A^{-1} g_t \rangle \quad \text{subject to}\quad \mathrm{tr}(A) \le c.$$

The minimal such $A$ is proportional to $\big(\sum_t g_t g_t^\top\big)^{1/2}$, the square root of the summed gradient outer products. This seems contradictory to the fact that the minimum should really be the Fisher information and not the square root of it? Or perhaps it's different because it's working under the Mahalanobis norm? |
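To sanity-check that minimizer numerically in the diagonal case (a hypothetical check, not from the paper): with $s_i = \sum_t g_{t,i}^2$, minimizing $\sum_i s_i / a_i$ subject to $\sum_i a_i = c$ gives $a_i \propto \sqrt{s_i}$ by the Lagrange condition, which is exactly where the square root appears.

```r
# Numeric check: the diagonal minimizer of sum(s/a) under sum(a) = c0
# matches the closed form a_i = c0 * sqrt(s_i) / sum(sqrt(s)).
set.seed(1)
s  <- rexp(5)    # stand-in for accumulated squared gradients per coordinate
c0 <- 1          # trace budget

obj <- function(z) {
  a <- c0 * exp(z) / sum(exp(z))   # softmax keeps a > 0 and sum(a) = c0
  sum(s / a)
}
fit   <- optim(rep(0, 5), obj, method = "BFGS")
a_hat <- c0 * exp(fit$par) / sum(exp(fit$par))

rbind(numeric     = a_hat,
      closed_form = c0 * sqrt(s) / sum(sqrt(s)))  # rows should agree
```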
Nice! Yes, exactly: the inverse of the Fisher information is the Cramér–Rao lower bound on the variance of an unbiased estimator, which the minimum-variance unbiased estimator (MVUE) attains. That's pretty cool actually. Could we try to validate this in experiments? |
Yup, would definitely be interesting to see. That is, we check the variance of the two estimates as the number of observations grows. |
yes exactly |
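A sketch of that experiment using the interface from the fits above (replication count and problem sizes are arbitrary choices here; `var_of` is a hypothetical helper):

```r
library(sgd)
set.seed(42)
reps <- 50; N <- 1e4; d <- 10
theta <- rep(5, d + 1)

# Refit each learning rate over many replications and compare the
# empirical variance of the resulting estimates.
var_of <- function(lr) {
  ests <- replicate(reps, {
    X <- matrix(rnorm(N * d), ncol = d)
    y <- cbind(1, X) %*% theta + rnorm(N)
    dat <- data.frame(y = y, x = X)
    fit <- sgd(y ~ ., data = dat, model = "lm",
               sgd.control = list(lr = lr))
    as.numeric(fit$coefficients)
  })
  mean(apply(ests, 1, var))        # average per-coordinate variance
}

var_of("adagrad")
var_of("d-dim")
```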
As a reminder (to self), this was looked at and briefly mentioned in the current draft of the NIPS submission. The intuition behind why AdaGrad leads to better empirical performance in practice than the non-square-rooted version is still a mystery. |
Why is the square root in AdaGrad empirically getting better performance? ... or is it? To be analyzed!