
AdaGrad vs d-dim #50

Open · dustinvtran opened this issue May 11, 2015 · 8 comments
@dustinvtran (Member)

Why is the square root in AdaGrad empirically getting better performance? ... or is it? To be analyzed!

dustinvtran added a commit that referenced this issue May 11, 2015
@dustinvtran (Member, Author)

I generalized the d-dimensional learning rate D_n to have hyperparameters α and c:

I_hat = α*I_hat + diag(I_hat_new)
D_n = 1/(I_hat)^c

The estimate I_hat of the observed Fisher information combines two terms: the discounted I_hat from the previous iteration, and the diagonal of I_hat_new, the outer product of the score function at the current data point with itself, i.e., \nabla l(θ_{n-1}; y_n) * (\nabla l(θ_{n-1}; y_n))^T.

The discount α governs how much weight to give the previous history of gradients, and c is the exponent. Two special cases are:

  • "adagrad": α=1, c=1/2
  • "d-dim": α=0, c=1

See the learning rate wiki for more.
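
For concreteness, here's a minimal base-R sketch of one step of this update; it is just an illustration of the formulas above (with `cc` standing in for the exponent c), not the package's internal implementation:

# One step of the generalized d-dimensional learning rate.
# grad is the score \nabla l(theta_{n-1}; y_n); I_hat accumulates the diagonal
# of the observed Fisher information across iterations.
update_lr <- function(I_hat, grad, alpha, cc) {
  I_hat_new <- grad %o% grad                # outer product of the score with itself
  I_hat <- alpha * I_hat + diag(I_hat_new)  # discount the history, add the new diagonal
  D_n <- 1 / I_hat^cc                       # element-wise learning rate
  list(I_hat = I_hat, D_n = D_n)
}
# "adagrad": alpha = 1, cc = 1/2;  "d-dim": alpha = 0, cc = 1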

@ptoulis (Contributor) commented May 11, 2015

Nice, do you see the sqrt being better even in the normal model?

@dustinvtran (Member, Author)

Yup.

library(sgd)

# Dimensions
N <- 1e5
d <- 1e2

# Generate data.
set.seed(42)
X <- matrix(rnorm(N*d), ncol=d)
theta <- rep(5, d+1)
eps <- rnorm(N)
y <- cbind(1, X) %*% theta + eps
dat <- data.frame(y=y, x=X)

# Fit the same linear model with implicit SGD under three learning rates
# and compare the MSE of the estimates against the true theta.
sgd.theta <- sgd(y ~ ., data=dat, model="lm",
                 sgd.control=list(lr="d-dim"))
mean((sgd.theta$coefficients - theta)^2) # MSE
## [1] 0.008434492
sgd.theta <- sgd(y ~ ., data=dat, model="lm",
                 sgd.control=list(lr="d-dim-weight"))
mean((sgd.theta$coefficients - theta)^2) # MSE
## [1] 24.48697
sgd.theta <- sgd(y ~ ., data=dat, model="lm",
                 sgd.control=list(lr="adagrad"))
mean((sgd.theta$coefficients - theta)^2) # MSE
## [1] 0.0003773736

Here d-dim-weight has α=1, c=1; that is, it uses the gradient history as strongly as AdaGrad does but is not as conservative in reducing the learning rate. The same pattern held for a bunch of other seeds: AdaGrad beats d-dim by one or two orders of magnitude, and d-dim-weight never converges. It's also worth noting that this is done with the default SGD method (implicit).

Intuitively, AdaGrad is somehow able to use the information stored in the previous history of the gradients, but intelligently so that it doesn't just blow up as with d-dim-weight.
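
As a rough illustration (not package code) of how differently the settings scale with the accumulated gradient history, here is the rate D_n computed from a stream of simulated squared gradient components of roughly constant magnitude:

set.seed(42)
n_iter <- 1000
g2 <- rnorm(n_iter, sd = 2)^2            # stand-in squared gradient components
s <- cumsum(g2)                          # accumulated squared-gradient history

rate_adagrad     <- 1 / sqrt(s)          # alpha=1, c=1/2: decays roughly like 1/sqrt(n)
rate_ddim_weight <- 1 / s                # alpha=1, c=1:   decays roughly like 1/n
rate_ddim        <- 1 / g2               # alpha=0, c=1:   no history, stays noisy

matplot(cbind(rate_adagrad, rate_ddim_weight, rate_ddim), type = "l", lty = 1,
        log = "y", xlab = "iteration", ylab = "learning rate (log scale)")
legend("topright", c("adagrad", "d-dim-weight", "d-dim"), col = 1:3, lty = 1)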

@dustinvtran (Member, Author)

I've been trying to dig into the theory and am thoroughly perplexed. The paper looks at minimizing the regret function using the Mahalanobis norm, which generalizes the L2 norm. That is, we move from the standard projected stochastic gradient update

θ_n = arg min_θ || θ - (θ_{n-1} + α_n g_{n-1}) ||_2^2

to

θ_n = arg min_θ || θ - (θ_{n-1} + α_n A^{-1} g_{n-1}) ||_A^2

One can then bound the regret function which leads to solving

min_A \sum_{t=1}^T g_t^T A^{-1} g_t

The minimizing A is exactly the square root of the Fisher information, so under a diagonal approximation the learning rate in each coordinate should be the inverse square root of the accumulated squared gradients. See these slides, roughly pages 9-11.
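
For the diagonal case, the calculation behind that claim is essentially a short Lagrange-multiplier argument (writing s_i for the accumulated squared gradients in coordinate i, and using a trace budget on A):

min over A = diag(a_1, ..., a_d), a_i > 0, \sum_i a_i ≤ C of
  \sum_{t=1}^T g_t^T A^{-1} g_t = \sum_i s_i / a_i,   where s_i = \sum_{t=1}^T g_{t,i}^2.

Setting the derivative of \sum_i s_i / a_i + λ(\sum_i a_i - C) with respect to a_i to zero gives
  -s_i / a_i^2 + λ = 0  =>  a_i ∝ \sqrt{s_i},

so the per-coordinate step size 1/a_i ∝ 1/\sqrt{s_i}, i.e., the AdaGrad rate.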

This seems to contradict the fact that the minimizer should really be the Fisher information itself and not its square root. Or perhaps it's different because it's working under the Mahalanobis norm?

@ptoulis (Contributor) commented May 21, 2015

Nice! Yes, exactly. The inverse of the Fisher information gives the minimum-variance unbiased estimator (MVUE), but the square root gives minimum regret (an upper bound on it, at least). So AdaGrad is an inefficient estimator but should have lower regret. I suppose this roughly means that in small samples it does better than the MVUE, but in the limit the MVUE uses more information.

That's pretty cool actually. Could we try to validate this in experiments?

@dustinvtran (Member, Author)

Yup, it would definitely be interesting to see. That is, we'd check the variance of the two estimates as n -> infty through a plot.
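
Something like the following would do it (a rough sketch reusing only the sgd() call already shown above; the helper mse_at_n and the grid of sample sizes are just for illustration):

library(sgd)

# MSE of the SGD estimate for one simulated dataset of size N under a given learning rate.
mse_at_n <- function(N, lr, d = 1e2, theta = rep(5, d + 1)) {
  X <- matrix(rnorm(N * d), ncol = d)
  y <- cbind(1, X) %*% theta + rnorm(N)
  dat <- data.frame(y = y, x = X)
  fit <- sgd(y ~ ., data = dat, model = "lm", sgd.control = list(lr = lr))
  mean((fit$coefficients - theta)^2)
}

set.seed(42)
Ns <- c(1e3, 1e4, 1e5)
mse_adagrad <- sapply(Ns, mse_at_n, lr = "adagrad")
mse_ddim    <- sapply(Ns, mse_at_n, lr = "d-dim")
matplot(Ns, cbind(mse_adagrad, mse_ddim), type = "b", log = "xy", lty = 1, pch = 1,
        xlab = "n", ylab = "MSE of the estimate")
legend("topright", c("adagrad", "d-dim"), col = 1:2, lty = 1)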

@ptoulis (Contributor) commented May 21, 2015

yes exactly

@dustinvtran dustinvtran added this to the v1.0 milestone May 30, 2015
@dustinvtran dustinvtran self-assigned this May 31, 2015
@dustinvtran (Member, Author)

As a reminder (to self), this was looked at and briefly mentioned in the current draft of the NIPS submission. The intuition behind why AdaGrad achieves better empirical performance than the non-square-rooted version is still a mystery.

@dustinvtran dustinvtran removed this from the v1.0 milestone Sep 22, 2015