Role of centering in preventing collapse #101

Open

pmgautam opened this issue Aug 5, 2021 · 4 comments

Comments

@pmgautam

pmgautam commented Aug 5, 2021

I am not able to interpret the statement that centering prevents one dimension from dominating but encourages collapse to the uniform distribution. Since we are subtracting a number and then applying softmax, the distribution stays the same, which is similar to the usual trick for stabilizing the softmax function. Can someone help me? I feel I am missing something here. Thanks in advance!

@woctezuma

woctezuma commented Aug 5, 2021

For the information of others, this is the paragraph you are referring to:

[quoted paragraph from the DINO article]

@mathildecaron31
Contributor

Hi @pmgautam

You are right that subtracting a number from the logits before applying softmax does not change the distribution (it is the classical trick used to stabilize softmax): exp(t_i - c) / sum_j exp(t_j - c) = exp(t_i) / sum_j exp(t_j)

However, here the center is a vector, not a scalar, so the operation we are doing is: exp(t_i - c_i) / sum_j exp(t_j - c_j). Because each dimension is shifted by a different amount, this does change the resulting distribution.
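
For example, here is a minimal numerical sketch (the tensor values are made up for illustration) showing that subtracting the same scalar from every logit leaves the softmax unchanged, while subtracting a per-dimension center vector does not:

```python
import torch

logits = torch.tensor([2.0, 1.0, 0.5])  # toy logits, values chosen arbitrarily

# Subtracting the same scalar from every logit: softmax is unchanged
print(torch.softmax(logits, dim=0))
print(torch.softmax(logits - 1.5, dim=0))      # identical to the line above

# Subtracting a per-dimension center (a vector): softmax changes
center = torch.tensor([1.8, 0.4, 0.1])         # e.g. a running mean of past logits
print(torch.softmax(logits - center, dim=0))   # the dimension with the largest center
                                               # is pushed down, flattening the output
```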

Hope that helps to clarify the centering operation.

@anniepank

anniepank commented Jan 31, 2022

Hi @mathildecaron31,

But even if one has a multidimensional distribution and shifts each dimension by subtracting some number before doing softmax, the overall distribution will still remain the same. So I still don't see how it would become a uniform distribution.

@ratthachat

ratthachat commented Dec 24, 2023

Is this an empirical concept rather than a theoretical one, and is that why there is no clear logical explanation here?

Edit: the closest explanation I could find is that BYOL needs batch normalization in place of negative contrasting:

https://imbue.com/research/2020-08-24-understanding-self-supervised-contrastive-learning/

And the centering technique used in this paper is a direct simplification of what is essential in batch norm.
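
For reference, here is a rough sketch of the centering operation as described in the DINO paper: the center is an exponential moving average of the batch mean of the teacher outputs, and it is subtracted before the sharpening softmax (function names and hyper-parameter values below are illustrative, not the repo's exact code):

```python
import torch

def update_center(center, teacher_logits, momentum=0.9):
    # EMA over the batch mean of the teacher outputs (center update from the DINO paper)
    batch_mean = teacher_logits.mean(dim=0)
    return momentum * center + (1.0 - momentum) * batch_mean

def teacher_probs(teacher_logits, center, temperature=0.04):
    # Centering (subtract the EMA center) followed by sharpening (low temperature)
    return torch.softmax((teacher_logits - center) / temperature, dim=-1)
```

Since the center tracks batch statistics of the teacher outputs, subtracting it plays a role similar to the mean subtraction in batch norm, which is presumably what the linked post is pointing at.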
