
Barlow Twins loss on identical vector #16

Closed

ChanLIM opened this issue Apr 20, 2021 · 6 comments

Comments

@ChanLIM

ChanLIM commented Apr 20, 2021

Hello, I really enjoyed reading the paper and have been thinking about the intention behind the loss.

However, I was wondering whether setting the target matrix to the identity matrix is appropriate.

As far as I understand, each element of the cross-correlation matrix is the product of two feature dimensions summed over the batch.
The Barlow Twins loss aims for a correlation of 1 on the diagonal and 0 (no correlation) on the off-diagonal elements.

So, if two identical representation vectors were fed to the loss, I thought it should give a loss of zero, but it didn't.

For the sake of simplicity, let's say we have 2 pairs of representation vectors with identical values (that's 4 vectors).
When I took two identical 1-D vectors for the 2 samples, applied batch norm, and computed the Barlow Twins loss on them,
I got 1 on the diagonal but not 0 on the off-diagonal elements.

The same thing happens when the batch size is 1 (although batch norm normalizes the values to zero in that case).

I'm not sure how the loss can learn invariance and redundancy reduction with the identity matrix as the target, especially through the redundancy term.
Could you please elaborate on how the representation vector is learned through the redundancy term?

Here's a simple example I tried (following the code implementation).

[screenshot: example calculation of the cross-correlation matrix and loss]
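
In code, the check looks roughly like this (a minimal sketch of how I read the loss computation, with illustrative values rather than the exact ones in the screenshot; `lambd` is just a placeholder weight):

```python
import torch

def barlow_twins_loss(z_a, z_b, lambd=5e-3):
    n = z_a.size(0)
    # batch-norm step: normalize each feature dimension over the batch (biased std, no eps)
    z_a = (z_a - z_a.mean(0)) / z_a.std(0, unbiased=False)
    z_b = (z_b - z_b.mean(0)) / z_b.std(0, unbiased=False)
    c = (z_a.T @ z_b) / n                           # D x D cross-correlation matrix
    d = torch.diagonal(c)
    invariance = ((d - 1) ** 2).sum()               # wants the diagonal to be 1
    redundancy = (c ** 2).sum() - (d ** 2).sum()    # wants the off-diagonal to be 0
    return invariance + lambd * redundancy, c

# two pairs of representation vectors with identical values (batch size 2, dim 3)
z = torch.tensor([[0.2, 0.3, 0.5],
                  [0.1, 0.8, 0.1]])
loss, c = barlow_twins_loss(z, z.clone())
print(c)     # diagonal is 1, but the off-diagonal entries are +/-1, not 0
print(loss)  # so the loss is not zero even though the two views are identical
```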

Thank you!

@sallymmx

sallymmx commented Apr 20, 2021

Same question!!
When I train with different augmentations but validate with the same augmentation, the training loss gets lower while the validation loss gets larger and larger, much larger than the training loss.

This really confuses me. Could someone explain the reason?

@jingli9111
Contributor

@ChanLIM
Thanks for the example. This is very insightful!
This is why our method is fundamentally different from contrastive learning like SimCLR.

> So, if two identical representation vectors were fed to the loss, I thought it should give a loss of zero, but it didn't.

It shouldn't. Identical vectors are not enough.

Our loss function wants features to be decorrelated. But in your example, they are correlated. Feature 3 can be completely determined by feature 1 and feature 2.

If you add sample 3 so that there's no redundant dimension, I believe you can get zero loss.

Let me know if you have further questions.

@jingli9111
Contributor

@sallymmx
See the discussion above regarding the question of "same augmentations".
The validation loss going up may have several different causes. Can you specify your experiment setting?

@ChanLIM
Author

ChanLIM commented Apr 21, 2021

@jingli9111
Thanks for your kind explanation.
I think I got the gist of it, but I'm not sure I understood the idea completely.

> But in your example, they are correlated. Feature 3 can be completely determined by feature 1 and feature 2.

The part I'm most confused about is how we determine whether the features are decorrelated from each other.

As you suggested, I tried recalculating the loss after adding another sample (img3), but I still don't get the results I expected.
It seems to me that simply adding another pair of identical vectors to the batch does not solve the problem from the previous example.
I also tried changing the vector values so that the elements no longer sum to 1 (to ensure no correlation between img1, img2, and img3).

Here's another example for you.
[screenshot: second example calculation with three samples]
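
Roughly what I ran for this second attempt (again a sketch with illustrative values, not the exact ones in the screenshot):

```python
import torch

# three identical pairs this time, with rows that no longer sum to 1
z = torch.tensor([[0.2, 0.7, 0.5],
                  [0.9, 0.1, 0.3],
                  [0.4, 0.8, 0.6]])
z_norm = (z - z.mean(0)) / z.std(0, unbiased=False)  # batch-norm step
c = (z_norm.T @ z_norm) / z.size(0)                  # both views are identical
print(c)  # the diagonal is 1, but the off-diagonal correlations are still not 0
```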

It would help me understand the problem if you could provide an example of a batch where the features are not correlated, and explain what it means for them to be uncorrelated.

Thanks in advance!

@jingli9111
Contributor

@ChanLIM
One example:
[[ 0.3927,  0.0957,  0.9147],
 [-0.9112, -0.0944,  0.4011],
 [-0.1247,  0.9909, -0.0501]]
You can generate a random orthogonal matrix. It should give zero loss.
By definition, if z_a^T @ z_b = I and z_a = z_b, then z_a is an orthogonal matrix.

Sorry, my previous statement "If you add sample 3 so that there's no redundant dimension, I believe you can get zero loss." was wrong.
The samples cannot be completely random. "Determined" means correlation = 1, and merely not being determined is not enough.
Decorrelated means correlation = 0 (sum_i x_i y_i = 0).
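
A quick way to check it (a rough sketch; here the cross-correlation is taken directly as z_a^T @ z_b, i.e. without the batch-norm step, matching the definition above):

```python
import torch

q, _ = torch.linalg.qr(torch.randn(3, 3))    # random 3x3 orthogonal matrix
c = q.T @ q                                   # cross-correlation of two identical views
d = torch.diagonal(c)
invariance = ((d - 1) ** 2).sum()             # on-diagonal term
redundancy = (c ** 2).sum() - (d ** 2).sum()  # off-diagonal term
print(c)                                      # ~ identity matrix
print(invariance, redundancy)                 # both ~ 0, so the loss is ~ 0
```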

Let me know if you have more questions.

@ChanLIM
Author

ChanLIM commented Apr 22, 2021

Then, I guess the objectives of the Barlow Twins loss function are the following two (written out below):

1) making the two representation vectors similar to each other (the relationship between different views of the same image)
and
2) making the (batch size x representation dim) data matrix semi-orthogonal (the relationship between different data points within the batch) at the same time.
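
Written out the way I now read the loss from the paper (with C the cross-correlation matrix between the two batch-normalized views):

L = sum_i (1 - C_ii)^2 + lambda * sum_i sum_{j != i} C_ij^2

The first term pushes each feature to agree across the two views (invariance), and the second pushes different features to be decorrelated from each other (redundancy reduction).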

Thanks for the kind replies. They were a great help to me.
