# DL4J: L2 regularization coefficient should be scaled by learning rate? #7079

Closed
opened this issue Jan 26, 2019 · 17 comments

Contributor

### AlexDBlack commented Jan 26, 2019 • edited by SkymindBot

 On the other hand, weight decay's update will look like:

```
moving_avg = alpha * moving_avg + (1 - alpha) * w.grad
w = w - lr * moving_avg - lr * wd * w
```

Currently DL4J implements:

```
w = w - updater(gradient) - wd * w
```

Note the absence of the learning-rate scaling of the L2 (weight decay) coefficient: https://github.com/deeplearning4j/deeplearning4j/blob/af7155d61dc810d3e7139f15f98810e0255b2e17/deeplearning4j/deeplearning4j-nn/src/main/java/org/deeplearning4j/nn/updater/UpdaterBlock.java#L179-L208

More generally, we should perhaps break this out into a class: default to weight decay, but give people the option of "classic" L2.
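The three update rules under discussion can be sketched side by side. This is a minimal numpy illustration (not DL4J code), assuming plain SGD with no momentum or updater state:

```python
import numpy as np

def sgd_l2(w, grad, lr, wd):
    # "Classic" L2: the penalty gradient wd * w is folded into the
    # loss gradient, so it passes through the learning rate (and,
    # in general, through whatever updater sits in between).
    return w - lr * (grad + wd * w)

def sgd_decoupled_wd(w, grad, lr, wd):
    # Decoupled weight decay (AdamW-style): decay is applied outside
    # the updater, but still scaled by the learning rate.
    return w - lr * grad - lr * wd * w

def sgd_unscaled_wd(w, grad, lr, wd):
    # The behaviour this issue describes in DL4J at the time:
    # decay applied after the updater, WITHOUT learning-rate scaling.
    return w - lr * grad - wd * w

w = np.array([1.0, -2.0])
g = np.array([0.1, 0.1])
print(sgd_l2(w, g, lr=0.01, wd=0.1))
print(sgd_decoupled_wd(w, g, lr=0.01, wd=0.1))
print(sgd_unscaled_wd(w, g, lr=0.01, wd=0.1))
```

For plain SGD the first two forms are algebraically identical (`lr * (grad + wd * w) = lr * grad + lr * wd * w`); they only diverge once momentum or an adaptive updater sits between the gradient and the step, which is exactly the case this issue is about. Note how the unscaled form shrinks the weights far more aggressively for small learning rates.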

### stolsvik commented Jan 26, 2019

 For some more reference: #7076 (comment)

### stolsvik commented Jan 26, 2019

 I want to highlight some of Ilya's comments:

> Note that when you schedule alpha in the original SGD with L2, you effectively schedule weight decay. Therefore, when we decouple the two, we use eta to set a schedule `x' = x - eta * alpha * gradient - eta * w * x` where alpha is the initial learning rate used primarily as a scaling factor.

As I understand this, eta is then effectively the scheduled (learning) rate, and alpha is a (constant) scaling factor for how much each of the two components (gradient and wd) should affect the update.

Also:

> Note that by multiplying your w*x by alpha, you will couple w and alpha again and have some problems of hyperparameter selection that we discussed in the paper. You can consider to use eta if it does not break your existing codebase.
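Ilya's decoupled formulation can be sketched numerically. The names `eta`, `alpha`, and `w` (the decay coefficient) follow his notation; the schedule values below are made up for illustration:

```python
def decoupled_step(x, grad, eta, alpha, w):
    # eta carries the schedule, alpha is a constant scaling factor
    # (the initial learning rate). Both the gradient step and the
    # decay shrink together as eta decays.
    return x - eta * alpha * grad - eta * w * x

x = 1.0
for eta in (1.0, 0.5, 0.1):  # a made-up decaying schedule
    x = decoupled_step(x, grad=0.5, eta=eta, alpha=0.1, w=0.01)
print(x)
```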
referenced this issue Jan 26, 2019

### mbkennel commented Jan 29, 2019

 The principle of least surprise should apply. What do other packages do? What does Keras do? What are the assumptions in the typical textbook? I would use notation and an API as similar to well-known things as possible when expressing the same idea.

As I understand regularization, it's usually a penalty-function approach added to the loss function. Optimization of that can then happen in many ways, including but not limited to SGD, but the tradeoff is between the size of the base loss and the size of the regularization.

The referenced website on AdamW, for the weight decay part, still multiplied the regularization hyperparameter by the learning rate in its SGD-like update step, but the long-term exponential moving average of the gradient did not include the regularization term in its computation, which makes sense to me: the point of Adam is to effectively renormalize the size of gradients across weights to be more uniform in the loss function on average; adding in the regularization term, which is the same for all, blunts the differentiation between the magnitudes of the base loss gradients.

In any case, the assumptions ought to be documented well and the reasons for the choice provided therein, even if it's "XXX package names it this way even though it's a bit misleading to us as it means ABC; if you want DEF, do it the other way."

I honestly also would love constraint-based regularization as it's much easier to tune. E.g. limit the average L2 or L1 size of the weight matrix elements to be less than a certain value.
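The decoupling mbkennel describes can be sketched as a one-step AdamW-style update (following Loshchilov & Hutter's formulation; the hyperparameter values are illustrative, not from this thread):

```python
import numpy as np

def adamw_step(w, grad, state, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, wd=0.01):
    # The moving averages use only the raw loss gradient, so the
    # regularization term does not get renormalized away by Adam's
    # per-weight scaling.
    state["t"] += 1
    state["m"] = beta1 * state["m"] + (1 - beta1) * grad
    state["v"] = beta2 * state["v"] + (1 - beta2) * grad ** 2
    m_hat = state["m"] / (1 - beta1 ** state["t"])
    v_hat = state["v"] / (1 - beta2 ** state["t"])
    # Decay is applied directly to the weights, scaled by lr, outside
    # the adaptive part of the update.
    return w - lr * m_hat / (np.sqrt(v_hat) + eps) - lr * wd * w

state = {"t": 0, "m": np.zeros(2), "v": np.zeros(2)}
w = adamw_step(np.array([1.0, -1.0]), np.array([0.5, -0.5]), state)
print(w)
```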
Contributor Author

### AlexDBlack commented Jan 29, 2019

 @mbkennel I don't disagree with any of that. There are really only 3 issues here:
(a) Communicating the behaviour clearly and unambiguously to users (via both API and docs)
(b) Selecting an appropriate default
(c) Providing the ability to customize/alter the behaviour if the default is not suitable

> I honestly also would love constraint based regularization as it's much easier to tune. E.g. limit average L2 or L1 size of weight matrix elements to be less than a certain value.

FYI we have constraints, and have had them for many releases: https://github.com/deeplearning4j/deeplearning4j/tree/master/deeplearning4j/deeplearning4j-nn/src/main/java/org/deeplearning4j/nn/conf/constraint

### treo commented Jan 30, 2019

 My opinion on this is the same as the last time it was brought up (citing https://arxiv.org/pdf/1803.09820.pdf):

> Our experiments show that weight decay is not like learning rates or momentum and the best value should remain constant through the training (i.e., cyclical weight decay is not useful).
Contributor Author

### AlexDBlack commented Jan 30, 2019

 @treo Unfortunately that's a little ambiguous... is it referring to a constant regularization coefficient, or to the magnitude of the regularization effect (for a given set of weights)?

```
w = w - lr * moving_avg - lr * wd * w
```

In that formulation, a constant weight decay coefficient does not imply a constant regularization effect, due to the multiplication with the (possibly changing) learning rate.
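The distinction can be made concrete with a toy schedule: in the LR-scaled form the per-step shrink factor `1 - lr * wd` follows the schedule, while in the unscaled form the shrink factor `1 - wd` stays fixed regardless (schedule values made up for illustration):

```python
wd = 0.01
for lr in (0.1, 0.05, 0.01):  # a made-up decaying LR schedule
    # Scaled decay: the effective shrink changes as the LR changes.
    # Unscaled decay: the shrink is the same at every step.
    print(f"lr={lr}: scaled shrink={1 - lr * wd}, unscaled shrink={1 - wd}")
```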

### treo commented Jan 30, 2019

 Maybe we can get input from the paper's author, @lnsmith54? For what it's worth, a fixed weight decay has worked well with a cyclical learning rate for me.

### mbkennel commented Jan 30, 2019

 > FYI we have constraints, have had them for many releases: https://github.com/deeplearning4j/deeplearning4j/tree/master/deeplearning4j/deeplearning4j-nn/src/main/java/org/deeplearning4j/nn/conf/constraint

Great---when are they applied? After a weight update? I didn't understand the code representation too well in the examples (what is what).

Could we make a constraint that would be a near drop-in (conceptual) replacement for L1 and L2 regularization on a weight layer? That means, for a fully connected matrix of N_i inputs and N_h hiddens, apply the constraint `1/(N_i*N_h) * \sum |w_jk|^p <= C^p` (i.e. the average size of a weight is <= C), for p = 1 and 2. For L2 that constraint is easy to apply: rescale by the radius ratio computed in Euclidean space. For L1 it's harder (it needs some sorting and truncation) but the algorithm is known and I once implemented it.

I wrote a (now legacy) custom MLP trainer in my organization that does this and it's easier to tune than penalty-function regularization. Use of this would obviate this discussion, as the concept and effect of the regularization is decoupled conceptually from the update algorithm. Ideally the updater itself would be knowledgeable about the constraint for optimization purposes (e.g. don't point the update in a direction that is guaranteed to violate the constraint upon stepping) and be intelligent like the professional classic optimizer packages, but brute-force clipping can work OK.
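The p=2 case mbkennel describes (rescale by the radius ratio) can be sketched in a few lines. This is a hypothetical standalone projection, not a DL4J `LayerConstraint` implementation:

```python
import numpy as np

def project_mean_l2(W, C):
    # Enforce (1/N) * sum_ij W_ij^2 <= C^2 by rescaling toward the
    # origin: Euclidean projection onto the ball ||W||_2 <= C * sqrt(N).
    n = W.size
    norm = np.linalg.norm(W)
    bound = C * np.sqrt(n)
    if norm > bound:
        W = W * (bound / norm)   # the "radius ratio" rescale
    return W

W = np.full((3, 4), 2.0)       # mean squared weight is 4.0
W = project_mean_l2(W, C=1.0)  # projected: mean squared weight is 1.0
print(np.mean(W ** 2))
```

The p=1 case is indeed harder: Euclidean projection onto an L1 ball requires a sort-and-threshold step (the algorithm mbkennel alludes to), rather than a single rescale.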
Contributor Author

### AlexDBlack commented Jan 31, 2019

 @mbkennel

> when are they applied? After a weight update?

They are applied after all other updater steps, yes. So it's basically the final step of a "fit" operation.

> Could we make a constraint that would be a near drop-in (conceptual) replacement for L1 and L2 regularization on a weight layer?

I don't see why not. Anything implementing the LayerConstraint API doesn't have access to the learning rate or updater or anything, but you've got access to the parameters and can do what you like (including reducing them in magnitude as in L1/L2). That said, constraints are conceptually used to deterministically enforce a specific condition (NonNegativeConstraint, UnitNormConstraint, etc).
Contributor Author

### AlexDBlack commented Jan 31, 2019 • edited

 OK, first steps towards implementing a solution have been started here - feedback welcome: #7097

Essentially just the API plus the L1, L2 and WeightDecay implementations... not yet plugged into DL4J. That should give us the capacity to support all variants:
- Weight decay with LR product
- Weight decay without LR product
- L2 regularization

There are still some open questions around API and defaults, however. What I think I'll do:
- Add a .regularization(Regularization...) method on net/layer builders
- Keep .l2(double); internally this calls .regularization(new L2Regularization(double))
- Keep .l1(double); internally this calls .regularization(new L1Regularization(double))
- Add .weightDecay(double, boolean applyLR) and .weightDecay(double); the latter defaults to applyLR=true
referenced this issue Feb 1, 2019

### AlexDBlack closed this in #7097 Feb 1, 2019

Contributor Author

### AlexDBlack commented Feb 1, 2019

 Fix is merged, implemented as per my earlier comment:
- .l2 gives "classic" L2 regularization
- .weightDecay gives (optionally LR-scaled) weight decay (applied post-updater)

### lnsmith54 commented Feb 1, 2019

 I just noticed this thread and that @treo suggested I comment. I will make a few brief remarks.

First, I am trying to find the time to rewrite my tech report because it is not accurate about L2/weight decay (WD). Without going into details, it is important to set the WD coefficient properly (I am not talking about the magnitude of the weights because that is a whole different topic). In my many experiments I found that a good rule of thumb for setting the hyper-parameters is: LR * WD / (TBS * (1-m)) = 10^6, where LR is the learning rate, TBS is the total batch size (batch size times number of GPUs or otherwise distributed nodes), and m is the momentum coefficient. I've played with a dynamic WD coefficient and it improves the performance but complicates training, so I am not recommending it. However, I use this approximate rule of thumb constantly to make picking LR, WD, BS, and m easy.

In addition, training can be simplified by keeping the scale of the weights constant and eliminating WD altogether. This is the topic of the paper I am currently working on - please wait for the paper for details. I appreciate the patience. Thanks.

### treo commented Feb 2, 2019

 @lnsmith54 thanks a lot for taking the time to give your input on this 👍

### treo commented Feb 2, 2019

 @lnsmith54 in your rule of thumb, I guess it should have been 10^-6?

### lnsmith54 commented Feb 3, 2019

 Yes, 10^-6. Sorry for the mistake.
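Rearranging the corrected rule of thumb, LR * WD / (TBS * (1-m)) ~= 10^-6, to solve for WD. The function name and example values below are illustrative, not from the thread:

```python
def wd_from_rule_of_thumb(lr, tbs, momentum, target=1e-6):
    # LR * WD / (TBS * (1 - m)) ~= target
    # => WD ~= target * TBS * (1 - m) / LR
    return target * tbs * (1 - momentum) / lr

# e.g. lr=0.1, total batch size 128, momentum 0.9 suggests WD ~= 1.28e-4
print(wd_from_rule_of_thumb(lr=0.1, tbs=128, momentum=0.9))
```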

### stolsvik commented Feb 5, 2019 • edited

 > Anything implementing the LayerConstraint API doesn't have access to the learning rate or updater or anything, but you've got access to the parameters and can do what you like (including reducing them in magnitude as in L1/L2).

Would it not make some sense to have access to variables like the learning rate and similar, basically the entire context of the current minibatch? I figure that could give the developer more flexibility, and room to be more creative and experimental in testing out different constraints. That is, some kind of "minibatch context" would be nice to provide for the different minibatch-related APIs. I've mentioned something similar in #6277 (here requesting the DataSet, another minibatch contextual element), and for the IEvaluator interface, which is run after the minibatch is finished, the score would also be nice, since it is available (#6278).

### lock bot commented Mar 7, 2019

 This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.