This repository has been archived by the owner on Mar 19, 2024. It is now read-only.

Barlow Twins implementation #230

Closed

Conversation

@OlivierDehaene (Contributor) commented Mar 10, 2021

Required (TBC)

  • BarlowTwinsLoss and Criterion
  • Documentation
    • Loss
    • SSL Approaches + Index
    • Model Zoo
    • Project
  • Default configs
    • pretrain
    • test/integration
    • debugging/pretrain
  • Benchmarks
    • ImageNet: 70.75 Top1 accuracy for 300 epochs
    • Imagenette 160: 88.8 Top1 accuracy

closes #229

@facebook-github-bot (Contributor)

Hi @OlivierDehaene!

Thank you for your pull request and welcome to our community.

Action Required

In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.

Process

In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (e.g. your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA signed. The tagging process may take up to 1 hour after signing. Please give it that time before contacting us about it.

If you have received this in error or have any questions, please contact us at cla@fb.com. Thanks!

@facebook-github-bot added the CLA Signed label on Mar 10, 2021
@facebook-github-bot (Contributor)

Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Facebook open source project. Thanks!

1 similar comment

@OlivierDehaene mentioned this pull request on Mar 11, 2021
@OlivierDehaene (Contributor Author)

@prigoyal, @jzbontar,

Using the linked configuration for pretraining, and the linked one for evaluation, I obtain 85.7 Top-1 accuracy on Imagenette 160.
Is it in the ballpark of what you would expect on this dataset?

If so, I think the BarlowTwinsLoss and Criterion are ready for review!

@QuentinDuval (Contributor) commented Mar 15, 2021

@prigoyal, @jzbontar,

Using the linked configuration for pretraining, and the linked one for evaluation, I obtain 85.7 Top-1 accuracy on Imagenette 160.
Is it in the ballpark of what you would expect on this dataset?

If so, I think the BarlowTwinsLoss and Criterion are ready for review!

I do not know for Barlow Twins specifically (@jzbontar can answer that more precisely), but this is definitely in the same ballpark as my experiments with SimCLR pre-trained and then evaluated on Imagenette 160 (where the typical results I got were between 85% and 88% top-1 accuracy) 👍

@prigoyal (Contributor) left a comment

Thank you so much @OlivierDehaene, this is looking great. I have left some inline comments. Please take a look at them :) Also, as a next step:
Can you follow https://github.com/facebookresearch/vissl/blob/master/.github/CONTRIBUTING.md#coding-style and run dev/linter.sh on this PR?

Afterwards, I think the open items in the PR checklist are the next steps. We should aim to provide a config file for at least ImageNet. Further, a model in the model_zoo would be great.

(10 inline review comments on vissl/losses/barlow_twins_loss.py, all since resolved)

@OlivierDehaene (Contributor Author) commented Mar 16, 2021

@prigoyal,

My implementation is based on the SimCLR loss implementation. A lot of the comments you added are relevant for both (for example the style comments). Do you want me to patch both at the same time?

@prigoyal (Contributor)

@prigoyal,

My implementation is based on the SimCLR loss implementation. A lot of the comments you added are relevant for both (for example the style comments). Do you want me to patch both at the same time?

Indeed. I certainly welcome the improvements to SimCLR as well, but in a separate PR :) Thank you so much for this.

@prigoyal (Contributor)

@OlivierDehaene, I am pretty excited about this PR. I think we are extremely close to getting this merged, and I really appreciate your work on this. :)

@OlivierDehaene (Contributor Author) commented Mar 22, 2021

@prigoyal, @QuentinDuval,

The 2nd iteration of the criterion is done. The criterion now normalises the embeddings and combines the cross-correlation matrices across all workers.
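
For context, a minimal single-process sketch of what such a criterion computes (illustrative only: the function name and the lambda value are placeholders, and the cross-worker combination of the matrix, discussed later in this thread, is omitted):

import torch

def barlow_twins_loss_sketch(z_a, z_b, lambd=5e-3):
    # z_a, z_b: (batch_size, dim) embeddings of the two augmented views
    N = z_a.shape[0]

    # normalise each embedding dimension to zero mean / unit variance over the batch
    z_a = (z_a - z_a.mean(dim=0)) / z_a.std(dim=0)
    z_b = (z_b - z_b.mean(dim=0)) / z_b.std(dim=0)

    # empirical cross-correlation matrix (dim x dim); in the distributed case this is
    # the quantity that gets combined across workers
    c = (z_a.T @ z_b) / N

    # invariance term: push the diagonal towards 1
    on_diag = (torch.diagonal(c) - 1).pow(2).sum()
    # redundancy-reduction term: push the off-diagonal entries towards 0
    off_diag = (c - torch.diag(torch.diagonal(c))).pow(2).sum()
    return on_diag + lambd * off_diag

The diagonal term encourages invariance between the two views, while the off-diagonal term reduces redundancy between embedding dimensions, which is the structure described in the Barlow Twins paper.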

@prigoyal, should I resolve the comments that have been addressed, or do you do it? What is the process here between maintainers and contributors?

@prigoyal (Contributor)

@prigoyal, @QuentinDuval,

The 2nd iteration of the criterion is done. The criterion now normalises the embeddings and combines the cross-correlation matrices across all workers.

I implemented the normalisation in _sync_normalise instead of using torch.nn.SyncBatchNorm(..., affine=False) so that the criterion does not carry an unnecessary train/eval state (since BatchNorm changes its behaviour between the two modes).

@prigoyal, should I resolve the comments that have been addressed, or do you do it? What is the process here between maintainers and contributors?

Thank you @OlivierDehaene. For the comments that pertain to the code changes here, contributors should resolve the comments once they have been addressed. If the comments require further discussion, they can stay open and we can discuss; maintainers can close them :)

@OlivierDehaene (Contributor Author) commented Mar 22, 2021

@jzbontar,

Excuse me if I'm wrong, but I thought that torch.distributed.all_reduce cut gradients. However, you use it here to sum the cross-correlation matrices. Wouldn't that cause issues in the backward pass?

@jzbontar commented Mar 22, 2021

@jzbontar,

Excuse me if I'm wrong, but I thought that torch.distributed.all_reduce cut gradients. However, you use it here to sum the cross-correlation matrices. Wouldn't that cause issues in the backward pass?

Hm, good point. I checked whether torch.distributed.all_reduce cuts the gradients when I was writing the code for Barlow Twins, and I found that it doesn't and that the gradients were computed correctly. For example, when I run the following code on a machine with 2 GPUs

import torch
import torch.distributed
import torch.multiprocessing

def main_worker(gpu):
    torch.distributed.init_process_group(backend='nccl', init_method='tcp://localhost:12345', world_size=2, rank=gpu)
    torch.cuda.set_device(gpu)
    x = torch.ones(1).cuda().requires_grad_()
    xx = (gpu + 2) * x
    torch.distributed.all_reduce(xx)
    xx.backward()
    print(f'gpu={gpu}, grad={x.grad}')

if __name__ == '__main__':
    torch.multiprocessing.spawn(main_worker, nprocs=2)

it outputs

gpu=1, grad=tensor([3.], device='cuda:1')
gpu=0, grad=tensor([2.], device='cuda:0')

which seems okay to me.

The PyTorch documentation, however, suggests that the autograd-enabled communication primitives reside in torch.distributed.nn, so I probably should have used (and we should use in the VISSL implementation) torch.distributed.nn.all_reduce instead of torch.distributed.all_reduce.
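
For illustration, the only change relative to the snippet above would be the reduction call; a minimal sketch, assuming a PyTorch build that provides torch.distributed.nn:

import torch
import torch.distributed
import torch.distributed.nn
import torch.multiprocessing

def main_worker(gpu):
    torch.distributed.init_process_group(backend='nccl', init_method='tcp://localhost:12345', world_size=2, rank=gpu)
    torch.cuda.set_device(gpu)
    x = torch.ones(1).cuda().requires_grad_()
    xx = (gpu + 2) * x
    # autograd-aware all_reduce: the backward pass also sums gradients across workers,
    # which is where the 1/world_size scaling discussed later in this thread comes from
    xx = torch.distributed.nn.all_reduce(xx)
    xx.backward()
    print(f'gpu={gpu}, grad={x.grad}')

if __name__ == '__main__':
    torch.multiprocessing.spawn(main_worker, nprocs=2)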

Good catch, @odelalleau!

EDIT: whoops! I meant @OlivierDehaene. Sorry.

@odelalleau

Good catch, @odelalleau!

Giving credit where it's due: good catch @OlivierDehaene! ;)

@prigoyal (Contributor) commented Mar 23, 2021

@OlivierDehaene, thank you for your continued awesome work on this PR :) Just wanted to chime in with an early recommendation for a next step (EDIT: seems like you have planned this already?): in order for us to ensure reproducibility, we should aim for an entry for Barlow Twins in the VISSL Model Zoo. This entry should have a pre-trained model on ImageNet (cc @jzbontar) and we should list the evaluation accuracy of this model in the table.

For training on ImageNet, @jzbontar mentioned he might be able to help by running your code so it would be good to coordinate a plan for that. For the linear evaluation on ImageNet, please do reach out if you run into any issues or have any questions :)

@prigoyal (Contributor) left a comment

Looking really good. Small inline comments; thank you @OlivierDehaene for your hard work on this :)

(Inline review comments on docs/source/ssl_approaches/barlow_twins.rst, tests/test_losses.py, and vissl/losses/barlow_twins_loss.py, all since resolved)

class SyncNormalizeFunction(Function):
"""
Adapted from: https://github.com/NVIDIA/apex/blob/master/apex/parallel/sync_batchnorm.py
Contributor:

Curious question: why can't we use the function directly from NVIDIA? If the reason is that we don't want to introduce a dependency on apex, that's valid; in that case we could introduce vissl/utils/apex_helpers.py and put this function there. This will allow usage beyond this loss function. :)

Contributor Author (@OlivierDehaene):

For me there are several advantages to this (a minimal sketch of the idea follows this list):

  • I think it is important that the criterion is not dependent on train() and eval() cycles. Using a BatchNorm layer would add hidden running statistics that could lead to weird behaviour. It is possible to disable said statistics using track_running_stats=False; however, this argument is bugged in the Apex implementation of SyncBatchNorm. See my issue here: [BUG] Apex otimized SyncBatchnormFunction crashes with track_running_stats=False NVIDIA/apex#1071.
  • We do not need to add additional logic to switch between PyTorch SyncBatchNorm and the faster Apex SyncBatchNorm depending on whether Apex was installed by the user.
  • One could argue that it makes it clearer to the end user what is actually happening here.
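
As mentioned above, a minimal sketch of this kind of stateless synchronized normalization (the function name is made up, and for simplicity the shared statistics are treated as constants in the backward pass, whereas the SyncNormalizeFunction in this PR defines its own backward):

import torch
import torch.distributed as dist

def sync_normalize_sketch(x, eps=1e-5):
    # zero-mean / unit-variance normalization with batch statistics combined across
    # workers, and no running statistics (hence no train/eval state)
    with torch.no_grad():
        count = torch.tensor([float(x.shape[0])], device=x.device)
        total = x.sum(dim=0)
        total_sq = (x * x).sum(dim=0)
        if dist.is_available() and dist.is_initialized():
            for t in (count, total, total_sq):
                dist.all_reduce(t)  # SUM across workers
        mean = total / count
        var = total_sq / count - mean.pow(2)
    return (x - mean) / torch.sqrt(var + eps)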

Contributor Author (@OlivierDehaene):

we could introduce vissl/utils/apex_helpers.py and put this function there. This will allow usage beyond this loss function. :)

100% agree. I wasn't sure where to add it. It also needs to be tested.

(Two more inline review comments on vissl/losses/barlow_twins_loss.py, since resolved)

@OlivierDehaene (Contributor Author) commented Mar 23, 2021

@jzbontar, @QuentinDuval,

I did a rerun of the integration test on Imagenette 160 to check if the new version of the loss was working. The rerun achieved 88.8% Top1 accuracy.
I think we are getting close. :)

@OlivierDehaene (Contributor Author) commented Mar 23, 2021

@jzbontar,

I saw your post on facebookresearch/barlowtwins#8 and I think there is still a weird interaction between torch.distributed.all_reduce and gradients.

For example, the following code doesn't yield the same gradients when using torch.distributed.all_reduce as when using GatherLayer, a gather op that doesn't cut gradients:

import torch
import torch.distributed
import torch.multiprocessing

from vissl.utils.distributed_gradients import GatherLayer

n, m = 8, 4

def main_worker(i):
    torch.distributed.init_process_group(backend='gloo', init_method='tcp://localhost:12345', world_size=4, rank=i)
    x = torch.full((n, m), i+1, dtype=torch.float32).requires_grad_()

    # compute xTx using all_reduce
    xTx = x.T @ x
    torch.distributed.all_reduce(xTx)
    xTx.sum().backward()
    if i == 0:
        print(x.grad)
        # tensor([[8., 8., 8., 8.],
        #         [8., 8., 8., 8.],
        #         [8., 8., 8., 8.],
        #         [8., 8., 8., 8.],
        #         [8., 8., 8., 8.],
        #         [8., 8., 8., 8.],
        #         [8., 8., 8., 8.],
        #         [8., 8., 8., 8.]])

    x.grad.zero_()

    # compute xTx using GatherLayer
    z = torch.cat(GatherLayer.apply(x))
    zTz = z.T @ z
    zTz.sum().backward()
    if i == 0:
        print(x.grad)
        # tensor([[32., 32., 32., 32.],
        #         [32., 32., 32., 32.],
        #         [32., 32., 32., 32.],
        #         [32., 32., 32., 32.],
        #         [32., 32., 32., 32.],
        #         [32., 32., 32., 32.],
        #         [32., 32., 32., 32.],
        #         [32., 32., 32., 32.]])

# xTx == zTz
# tensor([[240., 240., 240., 240.],
#         [240., 240., 240., 240.],
#         [240., 240., 240., 240.],
#         [240., 240., 240., 240.]], grad_fn=<MmBackward>)


if __name__ == '__main__':
    torch.multiprocessing.spawn(main_worker, nprocs=4)

Could this also be related to your scaling issue?

EDIT: it may actually be an error in GatherLayer.
@prigoyal, @QuentinDuval, shouldn't there be a rescaling by 1/world_size here? Since the default reduce op is SUM. https://github.com/facebookresearch/vissl/blob/master/vissl/utils/distributed_gradients.py#L27

@jzbontar commented Mar 24, 2021

EDIT: it may actually be an error in GatherLayer. @prigoyal, @QuentinDuval, shouldn't there be a rescaling by 1/world_size here?

Yes, right? I think so too. The following example shows that the torch.distributed.all_reduce solution computes gradients that match the baseline one-process solution, whereas the loss in the torch.distributed.nn.all_reduce solution needs to be scaled by 1/world_size to match.

import torch
import torch.distributed.nn
import torch.multiprocessing

# compute gradient of xTy**2
# the gradient wrt x is 2 * xTy * y
# the gradient wrt y is 2 * xTy * x

def main_worker(rank):
    torch.distributed.init_process_group(backend='gloo', init_method='tcp://localhost:12345', world_size=4, rank=rank)

    # torch.distributed.all_reduce solution
    x = torch.randn(4).requires_grad_()
    y = torch.randn(4).requires_grad_()
    xTy = torch.dot(x, y)
    torch.distributed.all_reduce(xTy)
    loss = xTy.pow(2)
    loss.backward()
    assert torch.equal(x.grad, 2 * xTy * y) # this is correct and matches the baseline
    assert torch.equal(y.grad, 2 * xTy * x)

    # torch.distributed.nn.all_reduce solution
    x = torch.randn(4).requires_grad_()
    y = torch.randn(4).requires_grad_()
    xTy_1gpu = torch.dot(x, y)
    xTy = torch.distributed.nn.all_reduce(xTy_1gpu)
    loss = xTy.pow(2)
    loss /= 4  # have to divide loss by world_size to match the baseline solution
    loss.backward()
    assert torch.equal(x.grad, 2 * xTy * y)
    assert torch.equal(y.grad, 2 * xTy * x)

if __name__ == '__main__':
    # single process baseline
    x = torch.randn(16).requires_grad_()
    y = torch.randn(16).requires_grad_()
    xTy = torch.dot(x, y)
    loss = xTy.pow(2)
    loss.backward()
    assert torch.equal(x.grad, 2 * xTy * y)
    assert torch.equal(y.grad, 2 * xTy * x)

    torch.multiprocessing.spawn(main_worker, nprocs=4)

Maybe we should really be using reduce instead of all_reduce?

@prigoyal (Contributor)

@prigoyal, @QuentinDuval, shouldn't there be a rescaling by 1/world_size here? Since the default reduce op is SUM. https://github.com/facebookresearch/vissl/blob/master/vissl/utils/distributed_gradients.py#L27

@OlivierDehaene, in VISSL, https://github.com/facebookresearch/vissl/blob/master/vissl/trainer/train_steps/standard_train_step.py#L158 is where we call all_reduce_mean from ClassyVision, which divides by the world size: https://github.com/facebookresearch/ClassyVision/blob/master/classy_vision/generic/distributed_util.py#L69-L73
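
For reference, the linked helper essentially does the following (a paraphrased sketch, not a verbatim copy of the ClassyVision code):

import torch.distributed as dist

def all_reduce_mean_sketch(tensor):
    # sum across all workers, then divide by the world size to get the mean
    if dist.is_available() and dist.is_initialized() and dist.get_world_size() > 1:
        dist.all_reduce(tensor)
        tensor /= dist.get_world_size()
    return tensor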

@facebook-github-bot (Contributor)

@OlivierDehaene has updated the pull request. You must reimport the pull request before landing.

@OlivierDehaene (Contributor Author)

@prigoyal,

I forgot I was on a fork and rebased on my fork instead of this one... Should be good now!

@facebook-github-bot (Contributor)

@prigoyal has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@OlivierDehaene (Contributor Author) commented Apr 30, 2021

@prigoyal,

I can't see the details of the failing tests so IDK what to patch.

@prigoyal (Contributor)

@prigoyal,

I can't see the details of the failing tests so IDK what to patch.

Hi @OlivierDehaene , I will comment on relevant places in the PR shortly :)

@prigoyal (Contributor) left a comment

I left some last remaining inline comments.

Additionally, we need to run the linter on the PR. Follow the instructions at https://github.com/facebookresearch/vissl/blob/master/dev/README.md#practices-for-coding-quality to run the linter provided by VISSL and update the PR :)

@facebook-github-bot (Contributor)

@OlivierDehaene has updated the pull request. You must reimport the pull request before landing.

@facebook-github-bot (Contributor)

@OlivierDehaene has updated the pull request. You must reimport the pull request before landing.

@OlivierDehaene (Contributor Author)

@prigoyal,

Some lints fail on files that were not touched by this PR. I left them as is.

@prigoyal (Contributor)

@prigoyal,

Some lints fail on files that were not touched by this PR. I left them as is.

@OlivierDehaene, that sounds great. I'll work on linting the other files :)

@OlivierDehaene (Contributor Author)

Sorry, I misunderstood what you meant with TRUNK and TRUNK_PARAMS. It should be good now!

@facebook-github-bot (Contributor)

@prigoyal has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

facebook-github-bot pushed a commit that referenced this pull request Apr 30, 2021
Summary:
## Required (TBC)

- [X] BarlowTwinsLoss and Criterion
- [x] Documentation
  - [X] Loss
  - [x] SSL Approaches + Index
  - [x] Model Zoo
  - [x] Project
- [x] Default configs
    - [x] pretrain
    - [X] test/integration
    - [X] debugging/pretrain
- [x] Benchmarks
  - [x] ImageNet: 70.75 for 300 epochs
  - [x] Imagenette 160: 88.8 Top1 accuracy

closes #229

Pull Request resolved: #230

Reviewed By: iseessel

Differential Revision: D28118605

Pulled By: prigoyal

fbshipit-source-id: 4436d6fd9d115b80ef5c5396318caa3cb26faadb
Labels: CLA Signed
Projects: None yet
Development: Successfully merging this pull request may close these issues: Implement Barlow Twins
7 participants