
AdaGrad vs d-dim #50

Open · dustinvtran opened this issue May 11, 2015 · 8 comments
@dustinvtran (Member)

Why is the square root in AdaGrad empirically getting better performance? ... or is it? To be analyzed!

dustinvtran added a commit that referenced this issue May 11, 2015
@dustinvtran (Member, Author)

I generalized the d-dimensional learning rate D_n to have hyperparameters α and c:

I_hat = α*I_hat + diag(I_hat_new)
D_n = 1/(I_hat)^c

The estimate I_hat of the observed Fisher information combines two terms: the discounted I_hat from the previous iteration, and the diagonal of I_hat_new, the outer product of the score function at the current data point with itself, i.e., \nabla l(θ_{n-1}; y_n) * (\nabla l(θ_{n-1}; y_n))^T.

The discount α governs how much weight to give the previous history of gradients, and c is the exponent. Two special cases are:

  • "adagrad": α=1, c=1/2
  • "d-dim": α=0, c=1

See the learning rate wiki for more.
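
For concreteness, here's a minimal base-R sketch of one step of this update; it is just an illustration of the formulas above (with `cc` standing in for the exponent c), not the package's internal implementation:

# One step of the generalized d-dimensional learning rate.
# grad is the score \nabla l(theta_{n-1}; y_n); I_hat accumulates the diagonal
# of the observed Fisher information across iterations.
update_lr <- function(I_hat, grad, alpha, cc) {
  I_hat_new <- grad %o% grad                # outer product of the score with itself
  I_hat <- alpha * I_hat + diag(I_hat_new)  # discount the history, add the new diagonal
  D_n <- 1 / I_hat^cc                       # element-wise learning rate
  list(I_hat = I_hat, D_n = D_n)
}
# "adagrad": alpha = 1, cc = 1/2;  "d-dim": alpha = 0, cc = 1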

@ptoulis (Contributor) commented May 11, 2015

Nice, do you see the sqrt being better even in the normal model?

@dustinvtran (Member, Author)

Yup.

library(sgd)

# Dimensions
N <- 1e5
d <- 1e2

# Generate data.
set.seed(42)
X <- matrix(rnorm(N*d), ncol=d)
theta <- rep(5, d+1)
eps <- rnorm(N)
y <- cbind(1, X) %*% theta + eps
dat <- data.frame(y=y, x=X)

# Fit the same linear model with implicit SGD under three learning rates
# and compare the MSE of the estimates against the true theta.
sgd.theta <- sgd(y ~ ., data=dat, model="lm",
                 sgd.control=list(lr="d-dim"))
mean((sgd.theta$coefficients - theta)^2) # MSE
## [1] 0.008434492
sgd.theta <- sgd(y ~ ., data=dat, model="lm",
                 sgd.control=list(lr="d-dim-weight"))
mean((sgd.theta$coefficients - theta)^2) # MSE
## [1] 24.48697
sgd.theta <- sgd(y ~ ., data=dat, model="lm",
                 sgd.control=list(lr="adagrad"))
mean((sgd.theta$coefficients - theta)^2) # MSE
## [1] 0.0003773736

Here d-dim-weight has α=1, c=1; that is, it uses the gradient history as strongly as AdaGrad does but is not as conservative in reducing the learning rate. The same pattern held for a bunch of other seeds: AdaGrad beats d-dim by one or two orders of magnitude, and d-dim-weight never converges. It's also worth noting that this is done with the default SGD method (implicit).

Intuitively, AdaGrad is somehow able to use the information stored in the previous history of the gradients, but intelligently so that it doesn't just blow up as with d-dim-weight.
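
As a rough illustration (not package code) of how differently the settings scale with the accumulated gradient history, here is the rate D_n computed from a stream of simulated squared gradient components of roughly constant magnitude:

set.seed(42)
n_iter <- 1000
g2 <- rnorm(n_iter, sd = 2)^2            # stand-in squared gradient components
s <- cumsum(g2)                          # accumulated squared-gradient history

rate_adagrad     <- 1 / sqrt(s)          # alpha=1, c=1/2: decays roughly like 1/sqrt(n)
rate_ddim_weight <- 1 / s                # alpha=1, c=1:   decays roughly like 1/n
rate_ddim        <- 1 / g2               # alpha=0, c=1:   no history, stays noisy

matplot(cbind(rate_adagrad, rate_ddim_weight, rate_ddim), type = "l", lty = 1,
        log = "y", xlab = "iteration", ylab = "learning rate (log scale)")
legend("topright", c("adagrad", "d-dim-weight", "d-dim"), col = 1:3, lty = 1)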

@dustinvtran (Member, Author)

I've been trying to dig into the theory and am thoroughly perplexed. The paper looks at minimizing the regret function using the Mahalanobis norm, which generalizes the L2 norm. That is, we move from the standard projected stochastic gradient update

θ_n = arg min_θ || θ - (θ_{n-1} + α_n g_{n-1}) ||_2^2

to

θ_n = arg min_θ || θ - (θ_{n-1} + α_n A^{-1} g_{n-1}) ||_A^2

One can then bound the regret function which leads to solving

min_A \sum_{t=1}^T g_t^T A^{-1} g_t

The minimizing A is exactly the square root of the Fisher information, so under a diagonal approximation the learning rate in each coordinate should be the inverse square root of the accumulated squared gradients. See these slides, roughly pages 9-11.
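
For the diagonal case, the calculation behind that claim is essentially a short Lagrange-multiplier argument (writing s_i for the accumulated squared gradients in coordinate i, and using a trace budget on A):

min over A = diag(a_1, ..., a_d), a_i > 0, \sum_i a_i ≤ C of
  \sum_{t=1}^T g_t^T A^{-1} g_t = \sum_i s_i / a_i,   where s_i = \sum_{t=1}^T g_{t,i}^2.

Setting the derivative of \sum_i s_i / a_i + λ(\sum_i a_i - C) with respect to a_i to zero gives
  -s_i / a_i^2 + λ = 0  =>  a_i ∝ \sqrt{s_i},

so the per-coordinate step size 1/a_i ∝ 1/\sqrt{s_i}, i.e., the AdaGrad rate.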

This seems to contradict the fact that the minimizer should really be the Fisher information itself and not its square root. Or perhaps it's different because it's working under the Mahalanobis norm?

@ptoulis (Contributor) commented May 21, 2015

Nice! Yes, exactly. The inverse of the Fisher information gives the minimum-variance unbiased estimator (MVUE), but the square root gives minimum regret (an upper bound on it, at least). So AdaGrad is an inefficient estimator but should have lower regret. I suppose this roughly means that in small samples it does better than the MVUE, but in the limit the MVUE uses more information.

That's pretty cool actually. Could we try to validate this in experiments?

@dustinvtran (Member, Author)

Yup, it would definitely be interesting to see. That is, we'd check the variance of the two estimates as n -> infty through a plot.
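
Something like the following would do it (a rough sketch reusing only the sgd() call already shown above; the helper mse_at_n and the grid of sample sizes are just for illustration):

library(sgd)

# MSE of the SGD estimate for one simulated dataset of size N under a given learning rate.
mse_at_n <- function(N, lr, d = 1e2, theta = rep(5, d + 1)) {
  X <- matrix(rnorm(N * d), ncol = d)
  y <- cbind(1, X) %*% theta + rnorm(N)
  dat <- data.frame(y = y, x = X)
  fit <- sgd(y ~ ., data = dat, model = "lm", sgd.control = list(lr = lr))
  mean((fit$coefficients - theta)^2)
}

set.seed(42)
Ns <- c(1e3, 1e4, 1e5)
mse_adagrad <- sapply(Ns, mse_at_n, lr = "adagrad")
mse_ddim    <- sapply(Ns, mse_at_n, lr = "d-dim")
matplot(Ns, cbind(mse_adagrad, mse_ddim), type = "b", log = "xy", lty = 1, pch = 1,
        xlab = "n", ylab = "MSE of the estimate")
legend("topright", c("adagrad", "d-dim"), col = 1:2, lty = 1)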

@ptoulis (Contributor) commented May 21, 2015

yes exactly

@dustinvtran dustinvtran added this to the v1.0 milestone May 30, 2015
@dustinvtran dustinvtran self-assigned this May 31, 2015
@dustinvtran (Member, Author)

As a reminder (to self), this was looked at and briefly mentioned in the current draft of the NIPS submission. The intuition behind why AdaGrad achieves better empirical performance than the non-square-rooted version is still a mystery.

@dustinvtran dustinvtran removed this from the v1.0 milestone Sep 22, 2015