Currently, the L1/L2 regularization coefficients are fixed values.

From gitter, @stolsvik:

What I see is that when using a schedule for the lr, and the lr becomes very low and stays there, the L2 term seems to "overwhelm" the parameters (another observation is that the parameters:updates ratio goes up, and that the network "collapses"). I've speculated that the result is basically that all of the weights get zeroed out.
Though fixed L1/L2 is most common in practice, that's a reasonable observation, and I can see how that could occur in some cases. Adding L1/L2 schedules would provide extra flexibility for situations like this.
(Side note: it would be nice to report in the UI the separate contributions of the loss function and the L2 term.)
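For illustration, such a schedule could follow the same pattern as a learning-rate schedule: the regularization strength becomes a function of the training iteration rather than a fixed double. The interface and class names below are hypothetical, a minimal sketch of the idea rather than the actual DL4J API.

```java
// Hypothetical sketch of a scheduled L2 coefficient -- NOT the actual DL4J API.
// The idea: instead of a fixed double, the regularization strength is a
// function of the training iteration (mirroring how lr schedules work).
interface CoefficientSchedule {
    double valueAt(int iteration);
}

// Example: exponential decay of the L2 coefficient, so that it shrinks
// along with a decaying learning rate instead of overwhelming it.
class ExponentialDecaySchedule implements CoefficientSchedule {
    private final double initialValue;
    private final double decayRate;

    ExponentialDecaySchedule(double initialValue, double decayRate) {
        this.initialValue = initialValue;
        this.decayRate = decayRate;
    }

    @Override
    public double valueAt(int iteration) {
        return initialValue * Math.pow(decayRate, iteration);
    }
}

class ScheduleDemo {
    public static void main(String[] args) {
        CoefficientSchedule l2Schedule = new ExponentialDecaySchedule(1e-4, 0.999);
        for (int i = 0; i <= 5000; i += 1000) {
            System.out.printf("iteration %5d: l2 = %.3e%n", i, l2Schedule.valueAt(i));
        }
    }
}
```

With something along these lines, the L2 coefficient could be decayed on the same schedule as the lr, keeping the two terms in proportion.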
Nice. But I also have a question of whether DL4J actually does this wrong. As far as I understand, the more common setup is that L2 is affected by the learning rate: the L2 decay term is added "inside the parentheses", i.e. to the gradient, before the whole thing is multiplied by the lr, so it is scaled proportionally with the lr. But since you apply the L2 correction as a separate step, the L2 effect is fixed regardless of the lr. (It just hit me: if the lr goes below the L2 coefficient, you'd actually negate the gradient?!)
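To make the distinction concrete, here is a minimal numeric sketch (plain Java, not DL4J code) of the two update rules being discussed: a "coupled" form where the L2 gradient is added to the loss gradient before the learning rate is applied, and a "decoupled" form where the decay is a separate step that the lr does not scale. The constants are made up purely for illustration.

```java
// Minimal numeric sketch of the two L2/weight-decay variants discussed above.
// This is NOT DL4J's actual implementation -- just plain SGD on a single weight,
// to show how the two forms behave when the learning rate becomes very small.
class WeightDecayComparison {

    // "Coupled" L2: the decay gradient is added to the loss gradient
    // *before* multiplying by the learning rate, so its effect shrinks
    // proportionally as the lr is scheduled down.
    static double coupledStep(double w, double lossGrad, double lr, double l2) {
        return w - lr * (lossGrad + l2 * w);
    }

    // "Decoupled" decay: the loss gradient is scaled by the lr, but the
    // decay is applied as a separate, fixed-size step that the lr does not touch.
    static double decoupledStep(double w, double lossGrad, double lr, double l2) {
        return w - lr * lossGrad - l2 * w;
    }

    public static void main(String[] args) {
        double w = 1.0;
        double lossGrad = -0.5;   // loss gradient pushing the weight upwards
        double l2 = 1e-3;

        for (double lr : new double[]{1e-1, 1e-2, 1e-3, 1e-4, 1e-5}) {
            double coupled = coupledStep(w, lossGrad, lr, l2) - w;
            double decoupled = decoupledStep(w, lossGrad, lr, l2) - w;
            System.out.printf("lr=%.0e  coupled update=%+.2e  decoupled update=%+.2e%n",
                    lr, coupled, decoupled);
        }
        // Once lr * |lossGrad| drops below l2 * w (here around lr = 2e-3), the
        // decoupled step flips sign: the fixed decay term outweighs what the
        // loss gradient asks for, while the coupled step keeps its direction.
    }
}
```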
Here's a comment from another issue, which raises the same question: #5843 (comment)
It might be interesting to read a couple of comments upstream of that one as well.
Aha! Link: https://skymindai.aha.io/features/ND4J-37