
DL4J: L2 regularization coefficient should be scaled by learning rate? #7079

Closed · AlexDBlack opened this issue Jan 26, 2019 · 17 comments

@AlexDBlack (Member) commented Jan 26, 2019

#5843 (comment)
https://www.fast.ai/2018/07/02/adam-weight-decay/

On the other hand, weight decay’s update will look like

moving_avg = alpha * moving_avg + (1-alpha) * w.grad
w = w - lr * moving_avg - lr * wd * w

Currently DL4J implements:

w = w - updater(gradient) - wd * w

Note the absence of learning-rate scaling on the L2 (weight decay) coefficient in the current implementation; the variants are contrasted in the sketch after the snippet:

```java
/**
 * Apply L1 and L2 regularization, if necessary. Note that L1/L2 may differ for different layers in the same block
 *
 * @param layer        The layer to apply L1/L2 to
 * @param paramName    Parameter name in the given layer
 * @param gradientView Gradient view array for the layer + param
 * @param paramsView   Parameter view array for the layer + param
 */
public void postApply(Trainable layer, String paramName, INDArray gradientView, INDArray paramsView) {
    if (layer instanceof FrozenLayer) {
        //TODO this is a quick hack to fix https://github.com/deeplearning4j/deeplearning4j/issues/4250
        //The underlying cause seems to be the whole NeuralNetConfiguration l1/l2ByParam maps and layer config
        // separation, which is being resolved in the upcoming PR here:
        // https://github.com/deeplearning4j/deeplearning4j/pull/4050
        return;
    }
    //TODO: do this for multiple contiguous params/layers (fewer, larger ops)
    double l2 = layer.getConfig().getL2ByParam(paramName);
    if (l2 > 0) {
        //This can be an axpy op, saving an allocation...
        //gradientView += params * l2, i.e., dC/dW = dC0/dW + lambda/n * w, where C0 is the pre-L2 cost function
        //Equivalent to gradientView.addi(paramsView.mul(layerConf().getL2ByParam(paramName)));
        val length = gradientView.length();
        Nd4j.getBlasWrapper().level1().axpy(length, l2, paramsView, gradientView);
    }
    if (layer.getConfig().getL1ByParam(paramName) > 0) {
        gradientView.addi(Transforms.sign(paramsView, true).muli(layer.getConfig().getL1ByParam(paramName)));
    }
}
```
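
For concreteness, here is a toy scalar sketch (an editorial illustration, not DL4J code; plain SGD with momentum stands in for the updater, and all names are made up) contrasting the three update rules in question:

```java
public class WeightUpdateVariants {
    static final double LR = 0.1, MOMENTUM = 0.9, WD = 1e-4;
    static double velocity = 0.0; // each variant assumes it owns this state

    // (1) "Classic" L2: the penalty gradient wd*w is folded into the gradient
    //     BEFORE the updater, so it flows through the momentum (or Adam) state.
    static double classicL2(double w, double grad) {
        velocity = MOMENTUM * velocity + (grad + WD * w);
        return w - LR * velocity;
    }

    // (2) Decoupled weight decay, scaled by the learning rate (the fast.ai form
    //     quoted above): the decay term bypasses the updater state entirely.
    static double decayWithLr(double w, double grad) {
        velocity = MOMENTUM * velocity + grad;
        return w - LR * velocity - LR * WD * w;
    }

    // (3) The behaviour described in this issue for DL4J at the time: decay is
    //     applied after the updater, with no learning-rate scaling.
    static double decayWithoutLr(double w, double grad) {
        velocity = MOMENTUM * velocity + grad;
        return w - LR * velocity - WD * w;
    }
}
```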

More generally, we should perhaps break this out into a class: default to weight decay, but give people the option of "classic" L2.

Aha! Link: https://skymindai.aha.io/features/ND4J-65

AlexDBlack added the DL4J label on Jan 26, 2019

@stolsvik (Contributor) commented Jan 26, 2019

For some more reference: #7076 (comment)

@stolsvik (Contributor) commented Jan 26, 2019

I want to highlight some of Ilya's comments:

"Note that when you schedule alpha in the original SGD with L2, you effectively schedule weight decay.
Therefore, when we decouple the two, we use eta to set a schedule
x' = x - eta * alpha * gradient - eta * w * x
where alpha is the initial learning rate used primarily as a scaling factor."

As I understand this, the eta is then effectively the scheduled (learning) rate, and alpha is a (constant) scaling factor for how much each of the two components (gradient and wd) should affect the update.

Also:

"Note that by multiplying your w*x by alpha, you will couple w and alpha again and have some problems of hyperparameter selection that we discussed in the paper. You can consider to use eta if it does not break your existing codebase."

@mbkennel commented Jan 29, 2019

The principle of least surprise should apply. What do other packages do? What does Keras do? What are assumptions in the typical textbook? I would use notation and API as similar to well known things as possible when expressing the same idea.

As I understand regularization, it is usually a penalty-function approach: a term added to the loss function. Optimization of that combined loss can then happen in many ways, including but not limited to SGD, and the tradeoff is between the size of the base loss and the size of the regularization penalty.

The referenced website on AdamW still multiplies the regularization hyperparameter by the learning rate in its SGD-like update step, but the long-term exponential moving average of the gradient does not include the regularization term in its computation. That makes sense to me: the point of Adam is to renormalize gradient magnitudes across weights so they are more uniform on average, and adding a regularization term that is the same for all weights blunts the differentiation between the magnitudes of the base-loss gradients.
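
As a sanity check on that reading, here is a minimal AdamW-style step for a single weight (an editorial sketch: bias correction is omitted and the names are illustrative, not any library's API). Note that the decay term never enters the moving averages:

```java
class AdamWStepSketch {
    double b1 = 0.9, b2 = 0.999, eps = 1e-8, lr = 1e-3, wd = 1e-2;
    double m = 0.0, v = 0.0; // per-weight moment estimates

    double step(double w, double grad) {
        m = b1 * m + (1 - b1) * grad;        // EMA of the raw gradient: no decay term here
        v = b2 * v + (1 - b2) * grad * grad; // EMA of the squared gradient
        return w - lr * m / (Math.sqrt(v) + eps) // adaptive, per-weight step size
                 - lr * wd * w;                  // decay applied outside the averages
    }
}
```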

In any case, the assumptions ought to be documented well, with the reasons for the choice provided, even if it's "XXX package names it this way even though it's a bit misleading to us, as it means ABC; if you want DEF, do it the other way."

I honestly also would love constraint-based regularization, as it's much easier to tune: e.g., limit the average L2 or L1 size of the weight matrix elements to be less than a certain value.

@AlexDBlack (Member, Author) commented Jan 29, 2019

@mbkennel I don't disagree with any of that. There are really only three issues here:
(a) Communicating the behaviour clearly and unambiguously to users (via both API and docs)
(b) Selecting an appropriate default
(c) Providing the ability to customize/alter the behaviour if the default is not suitable

> I honestly also would love constraint based regularization as it's much easier to tune. E.g. limit average L2 or L1 size of weight matrix elements to be less than a certain value.

FYI, we have constraints, and have had them for many releases: https://github.com/deeplearning4j/deeplearning4j/tree/master/deeplearning4j/deeplearning4j-nn/src/main/java/org/deeplearning4j/nn/conf/constraint

@treo (Member) commented Jan 30, 2019

My opinion on this is the same as the last time it was brought up (citing https://arxiv.org/pdf/1803.09820.pdf):

> Our experiments show that weight decay is not like learning rates or momentum and the best value should remain constant through the training (i.e., cyclical weight decay is not useful).

@AlexDBlack (Member, Author) commented Jan 30, 2019

@treo Unfortunately that's a little ambiguous... is it referring to a constant regularization coefficient, or to a constant magnitude of the regularization effect (for a given set of weights)?

w = w - lr * moving_avg - lr * wd * w

In that formulation, a constant weight decay coefficient does not imply a constant regularization effect due to the multiplication with the (possibly changing) learning rate.

@treo (Member) commented Jan 30, 2019

Maybe we can get input from the paper's author, @lnsmith54?

For what it's worth, a fixed weight decay has worked well with a cyclical learning rate for me.

@mbkennel commented Jan 30, 2019

> @mbkennel I don't disagree with any of that. There are really only three issues here:
> (a) Communicating the behaviour clearly and unambiguously to users (via both API and docs)
> (b) Selecting an appropriate default
> (c) Providing the ability to customize/alter the behaviour if the default is not suitable
>
> FYI, we have constraints, and have had them for many releases: https://github.com/deeplearning4j/deeplearning4j/tree/master/deeplearning4j/deeplearning4j-nn/src/main/java/org/deeplearning4j/nn/conf/constraint

Great. When are they applied? After a weight update? I didn't understand the code representation too well in the examples (what is what). Could we make a constraint that would be a near drop-in (conceptual) replacement for L1 and L2 regularization on a weight layer?

That means, for a fully connected matrix of N_i inputs and N_h hidden units, apply the constraint (1/(N_i*N_h)) * sum_jk |w_jk|^p <= C^p (i.e., the average size of a weight is <= C), for p = 1 and p = 2. For L2 that constraint is easy to apply: rescale by the radius ratio computed in Euclidean space (see the sketch below). For L1 it's harder (it needs some sorting and truncation), but the algorithm is known and I once implemented it.
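
For the p = 2 case, a minimal ND4J-flavoured sketch of that projection (a hypothetical helper for illustration, not an existing DL4J constraint class):

```java
import org.nd4j.linalg.api.ndarray.INDArray;

public class MeanL2CapSketch {
    /**
     * If the mean squared weight exceeds c^2, rescale the whole array by the
     * Euclidean radius ratio so that (1/n) * sum(w^2) == c^2 afterwards.
     */
    public static void project(INDArray w, double c) {
        double meanSq = w.mul(w).meanNumber().doubleValue(); // (1/(N_i*N_h)) * sum |w_jk|^2
        if (meanSq > c * c) {
            w.muli(c / Math.sqrt(meanSq)); // in-place rescale onto the constraint boundary
        }
    }
}
```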

I wrote a (now legacy) custom MLP trainer in my organization that does this and it's easier to tune than penalty function regularization.

Use of this would obviate this discussion as the concept and effect of the regularization is decoupled conceptually from the update algorithm.

Ideally the updater itself would be knowledgeable about the constraint for optimization purposes (e.g., not pointing the update in a direction that is guaranteed to violate the constraint upon stepping) and be intelligent like the professional classic optimizer packages, but brute-force clipping can work OK.

@AlexDBlack (Member, Author) commented Jan 31, 2019

@mbkennel

> when are they applied? After a weight update?

They are applied after all other updater steps, yes. So it's basically the final step of a "fit" operation.

> Could we make a constraint that would be a near drop-in (conceptual) replacement for L1 and L2 regularization on a weight layer?

I don't see why not. Anything implementing the LayerConstraint API doesn't have access to the learning rate or updater or anything, but you've got access to the parameters and can do what you like (including reducing them in magnitude, as in L1/L2).
That said, constraints are conceptually used to deterministically enforce a specific condition (NonNegativeConstraint, UnitNormConstraint, etc.).

@AlexDBlack (Member, Author) commented Jan 31, 2019

OK, first steps towards implementing a solution have been started here - feedback welcome:
#7097
Essentially just the API and the L1, L2 and WeightDecay implementations... not yet plugged into DL4J.

That should give us the capacity to support all variants:

  • Weight decay with LR product
  • Weight decay without LR product
  • L2 regularization

There are still some open questions around API and defaults, however.
What I think I'll do:

  • Add .regularization(Regularization...) method on net/layer builders
  • Keep .l2(double); internally this calls .regularization(new L2Regularization(double))
  • Keep .l1(double); internally this calls .regularization(new L1Regularization(double))
  • Add .weightDecay(double, boolean applyLR) and .weightDecay(double); the latter defaults to applyLR=true (rough usage sketched below)
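
If it lands as described, configuration might look roughly like this (a sketch using the names from the list above; check the merged PR for the final signatures):

```java
import org.deeplearning4j.nn.conf.NeuralNetConfiguration;
import org.nd4j.linalg.learning.regularization.WeightDecay;

public class RegularizationConfigSketch {
    public static void main(String[] args) {
        // Pick ONE variant per network/layer; the coefficients are arbitrary examples.
        NeuralNetConfiguration.Builder builder = new NeuralNetConfiguration.Builder()
                .weightDecay(1e-4, true);                        // decoupled decay, LR-scaled
                //.weightDecay(1e-4)                             // same: applyLR defaults to true
                //.l2(1e-4)                                      // "classic" L2, added to the gradient
                //.regularization(new WeightDecay(1e-4, false)); // decay without the LR product
    }
}
```
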
@AlexDBlack (Member, Author) commented Feb 1, 2019

Fix is merged, implemented as per my earlier comment:
.l2 gives "classic" L2 regularization.
.weightDecay gives (optionally LR-scaled) weight decay, applied post-updater.

@lnsmith54 commented Feb 1, 2019

I just noticed this thread and that @treo suggested I comment. I will make a few brief remarks.

First, I am trying to find the time to rewrite my tech report because it is not accurate about L2/weight decay (WD). Without going into details, it is important to set the WD coefficient properly (I am not talking about the magnitude of the weights, because that is a whole different topic). In my many experiments I found that a good rule of thumb for setting the hyper-parameters is:
LR * WD / (TBS * (1-m)) = 10^6
where LR is the learning rate, TBS is the total batch size (batch size times the number of GPUs or otherwise distributed nodes), and m is the momentum coefficient. I've played with a dynamic WD coefficient and it improves performance, but it complicates training, so I am not recommending it. However, I use this approximate rule of thumb constantly to make picking LR, WD, BS, and m easy.

In addition, training can be simplified by keeping the scale of the weights constant and eliminating WD altogether. This is the topic of the paper I am currently working on - please wait for the paper for details. I appreciate the patience.

Thanks.

@treo (Member) commented Feb 2, 2019

@lnsmith54 thanks a lot for taking the time to give your input on this 👍

@treo (Member) commented Feb 2, 2019

@lnsmith54 in your rule of thumb, I guess it should have been 10^-6?

@lnsmith54 commented Feb 3, 2019

Yes, 10^-6. Sorry for the mistake.
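
(Editorial note: with the corrected exponent, a quick worked example of the rule of thumb, using hypothetical values:)

```java
public class WdRuleOfThumbExample {
    public static void main(String[] args) {
        double lr = 0.1, tbs = 128, m = 0.9; // hypothetical LR, total batch size, momentum
        // LR * WD / (TBS * (1 - m)) = 1e-6, solved for WD:
        double wd = 1e-6 * tbs * (1 - m) / lr;
        System.out.println(wd); // ~1.28e-4, up to floating-point rounding
    }
}
```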

@stolsvik (Contributor) commented Feb 5, 2019

> Anything implementing the LayerConstraint API doesn't have access to the learning rate or updater or anything, but you've got access to the parameters and can do what you like (including reducing them in magnitude as in L1/L2).

Would it not make sense to have access to variables like the learning rate and similar, basically the entire context of the current minibatch? I figure that could give the developer more flexibility, and room to be more creative and experimental in testing out different constraints. That is, some kind of "minibatch context" would be nice for the different minibatch-related APIs. I've mentioned something similar in #6277 (there requesting the DataSet, another piece of minibatch context), and for the IEvaluator interface, which runs after the minibatch is finished, the score would also be nice, since it is available (#6278).

@lock (bot) commented Mar 7, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

The lock bot locked and limited the conversation to collaborators on Mar 7, 2019.
