
L2 regularization seems to be reduplicated for FTRL optimization #223

Closed
matricer opened this issue Feb 22, 2019 · 7 comments

@matricer

L2 regularization seems to be duplicated in the FTRL optimizer. Take LR as an example.
[screenshot: xLearn FTRL update code for LR, with an L2 term added to the gradient]
The proximal operator in FTRL already covers the L2 regularization, so the earlier L2 term in the gradient appears to be redundant. FM and FFM have the same problem.
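To make the concern concrete, here is a minimal sketch of the questioned pattern (not the actual xLearn source; the names `lr_gradient_with_weight_decay` and `l2_reg` are my own):

```python
import numpy as np

def lr_gradient_with_weight_decay(w, x, y, l2_reg):
    """Logistic-regression gradient with a shrinkage-type L2 term folded in.

    This mirrors the questioned pattern: an L2 term is added to the gradient
    here, even though the FTRL proximal step later applies an L2 penalty again.
    """
    p = 1.0 / (1.0 + np.exp(-np.dot(w, x)))  # predicted probability
    return (p - y) * x + l2_reg * w          # data gradient + weight-decay L2

# Tiny usage example with made-up numbers.
w = np.array([0.5, -0.2])
x = np.array([1.0, 2.0])
print(lr_gradient_with_weight_decay(w, x, y=1.0, l2_reg=0.1))
```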

@aksnzhy
Owner

aksnzhy commented Feb 22, 2019

@matricer Thanks for reporting this issue. I will check it as soon as possible.

@etveritas
Collaborator

@matricer
This is the pseudocode in the paper Ad Click Prediction: a View from the Trenches:
[screenshot: FTRL-Proximal pseudocode from the paper]
I think line 141 calculates the gradient with L2 regularization, and lines 150-152 update the parameter w.
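For reference, here is a minimal Python sketch of the per-coordinate FTRL-Proximal update in the style of that paper's pseudocode (an illustration only, not the xLearn code; the hyper-parameter names follow the paper):

```python
import math

def ftrl_proximal_update(z, n, g, alpha, beta, lambda1, lambda2):
    """Per-coordinate FTRL-Proximal update in the style of Algorithm 1.

    z, n : accumulated state for this coordinate
    g    : plain loss gradient for this coordinate, e.g. (p_t - y_t) * x_{t,i};
           note that no L2 term is added to it here.
    Returns the updated (z, n, w).
    """
    # Closed-form proximal step: both the L1 and the L2 penalties enter here.
    if abs(z) <= lambda1:
        w = 0.0
    else:
        w = -(z - math.copysign(lambda1, z)) / ((beta + math.sqrt(n)) / alpha + lambda2)

    # Accumulator updates (sigma implements the per-coordinate learning rate).
    sigma = (math.sqrt(n + g * g) - math.sqrt(n)) / alpha
    z += g - sigma * w
    n += g * g
    return z, n, w

# Example call with made-up values.
print(ftrl_proximal_update(z=0.0, n=0.0, g=0.3, alpha=0.1, beta=1.0, lambda1=0.0, lambda2=0.1))
```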

@matricer
Author

@etveritas,
you can see in the pseudocode:
p_t = \sigma(x_t * w)
g_i = (p_t - y_t) x_t
There is no L2 regularization attached to the gradient computation, because the following proximal operator (the update of w_{t,i}) already covers both the L1 and the L2 regularization.
In fact, there are two ways to implement L2 regularization: weight decay and the proximal operator. Weight decay is also called shrinkage-type L2 (the addition of an L2 penalty to the loss function); the proximal operator is called online L2 (the L2 penalty given in the paper above).
You can also refer to the MXNet or TensorFlow FTRL implementations.
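As a side-by-side illustration of the two flavors, here is a minimal sketch that keeps them as separate hyper-parameters (my own naming, loosely following the TensorFlow knobs; not the xLearn or TensorFlow source, and the factor of 2 depends on how the penalty is written):

```python
import math

def ftrl_step_two_l2(z, n, w, g_loss, alpha, beta, lambda1, l2_online, l2_shrinkage):
    """FTRL step with the two L2 flavors kept separate (illustrative sketch).

    g_loss       : gradient of the data loss only
    l2_shrinkage : weight-decay style L2, folded into the gradient
    l2_online    : proximal-operator style L2, applied in the closed-form solve
    """
    g = g_loss + 2.0 * l2_shrinkage * w  # shrinkage-type L2 (weight decay)
    sigma = (math.sqrt(n + g * g) - math.sqrt(n)) / alpha
    z += g - sigma * w
    n += g * g
    if abs(z) <= lambda1:
        w_new = 0.0
    else:
        w_new = -(z - math.copysign(lambda1, z)) / ((beta + math.sqrt(n)) / alpha + 2.0 * l2_online)
    return z, n, w_new

# Setting l2_shrinkage equal to l2_online reproduces the duplication discussed here.
print(ftrl_step_two_l2(z=0.0, n=0.0, w=0.5, g_loss=0.3, alpha=0.1, beta=1.0,
                       lambda1=0.0, l2_online=0.1, l2_shrinkage=0.1))
```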

@etveritas
Collaborator

@matricer I see what you mean. I found this comment in TensorFlow, which says:

See this paper. This version has support for both online L2 (the L2 penalty given in the paper above) and shrinkage-type L2 (which is the addition of an L2 penalty to the loss function).

I guess xLearn does the same as TensorFlow: both use online L2 and shrinkage-type L2, but in xLearn these two L2 terms share the same value.

@matricer
Author

matricer commented Feb 26, 2019

@etveritas I get your idea. In TensorFlow, in the absence of L1 regularization, the FTRL update gives:
w_{t+1} = w_t - lr_t / (1 + 2 * L2 * lr_t) * g_t - 2 * L2_shrinkage * lr_t / (1 + 2 * L2 * lr_t) * w_t
where L2 is the online L2 and L2_shrinkage is the shrinkage-type L2. In xLearn, without L1 regularization, the FTRL update gives:
w_{t+1} = w_t - lr_t / (1 + 2 * L2 * lr_t) * g_t - 2 * L2 * lr_t / (1 + 2 * L2 * lr_t) * w_t
i.e. the same L2 value is used for both terms.
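A quick numeric check of the two formulas (a sketch with made-up values for lr_t, g_t, w_t) shows that the xLearn update is exactly the TensorFlow update with L2_shrinkage implicitly set equal to L2:

```python
def tf_style_update(w, g, lr, l2, l2_shrinkage):
    # w_{t+1} = w_t - lr/(1 + 2*L2*lr) * g - 2*L2_shrinkage*lr/(1 + 2*L2*lr) * w
    denom = 1.0 + 2.0 * l2 * lr
    return w - lr / denom * g - 2.0 * l2_shrinkage * lr / denom * w

def xlearn_style_update(w, g, lr, l2):
    # Same formula, but the single L2 value plays both roles.
    return tf_style_update(w, g, lr, l2, l2_shrinkage=l2)

w_t, g_t, lr_t, L2 = 0.5, 0.3, 0.1, 0.1

print(tf_style_update(w_t, g_t, lr_t, L2, l2_shrinkage=0.0))  # online L2 only
print(xlearn_style_update(w_t, g_t, lr_t, L2))                # L2 applied in both places
```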

@etveritas
Collaborator

@matricer yep.

@matricer
Author

@aksnzhy @etveritas thanks~
