
Gradient accumulation of several samples #7964

Closed
edmBernard opened this issue Sep 20, 2017 · 7 comments

Comments

@edmBernard

I'm trying to compute a triplet ranking loss.
To achieve this, I need to accumulate gradients over several examples and update the weights with the resulting gradient.

In Chainer we can easily do that because the backward function accumulates gradients by default:

We need to clear the gradients first because the backward() method accumulates gradients instead of overwriting the previous values.

Is there a way to do this with MXNet Gluon and the autograd API?

It's similar to gradient accumulation inside a batch.
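
For reference, what I'd like to write looks roughly like the sketch below. The `grad_req='add'` setting and the manual `zero_grad()` calls are just my guess at how Gluon might expose Chainer-style accumulation, so this is a sketch rather than working code:

```python
# Sketch only: accumulate gradients over several forward/backward passes,
# then apply one update. Assumes setting grad_req='add' makes backward()
# sum gradients into param.grad() instead of overwriting them.
import mxnet as mx
from mxnet import autograd, gluon

net = gluon.nn.Dense(1)                                # placeholder model
net.initialize()
net.collect_params().setattr('grad_req', 'add')        # accumulate instead of overwrite
trainer = gluon.Trainer(net.collect_params(), 'sgd', {'learning_rate': 0.01})
loss_fn = gluon.loss.L2Loss()                          # placeholder loss

accum_steps, batch_size = 4, 8
for _ in range(accum_steps):
    x = mx.nd.ones((batch_size, 16))                   # dummy data
    y = mx.nd.ones((batch_size, 1))
    with autograd.record():
        loss = loss_fn(net(x), y)
    loss.backward()                                    # gradients are summed across passes

trainer.step(batch_size * accum_steps)                 # single update with the accumulated gradient
for param in net.collect_params().values():
    param.zero_grad()                                  # clear before the next accumulation round
```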

@ZiyueHuang
Member

Do you mean collecting only some examples' gradients in a batch to update the weights? You can create a 0-1 mask and use it as the outermost out_grad to backward.
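
Roughly like this (an untested sketch with dummy data; the model and loss are placeholders):

```python
# Sketch: compute a per-sample loss for the whole batch, build a 0-1 mask
# that keeps only the samples we care about, and pass it as out_grad so the
# other samples contribute nothing to the gradients.
import mxnet as mx
from mxnet import autograd, gluon

net = gluon.nn.Dense(1)                          # placeholder model
net.initialize()
loss_fn = gluon.loss.L2Loss()                    # placeholder loss

batch_size, k = 10, 3
x = mx.nd.ones((batch_size, 16))                 # dummy batch
y = mx.nd.ones((batch_size, 1))

with autograd.record():
    per_sample_loss = loss_fn(net(x), y)         # shape (batch_size,)

topk_idx = per_sample_loss.topk(k=k)             # indices of the k largest losses
mask = mx.nd.one_hot(topk_idx, depth=batch_size).sum(axis=0)  # 0-1 mask over the batch
per_sample_loss.backward(out_grad=mask)          # masked-out samples get zero gradient
```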

@edmBernard
Author

A bit more description of the process (it's for an image-retrieval-like task):

  • compute the triplet loss on 10000 (query, relevant, non-relevant) image triplets
  • sort these triplets and keep the 100 with the highest loss
  • compute gradients on these 100 samples
  • aggregate the gradients
  • update the weights

I can't pack these 100 samples into one batch and run the learning process on it, because a batch of 100*3 images takes too much GPU memory.

@ZiyueHuang
Member

If you want to do this at the whole-dataset level, then you should run an extra forward pass over the whole dataset each time to get the indices of the top-100, then create a batch from them, run forward-backward, and update the weights.
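
Roughly like this (a sketch; `triplet_loss`, `iter_batches`, and `take_batch` are hypothetical helpers standing in for your data pipeline):

```python
# Sketch of the two-stage approach: a forward-only pass over the whole
# dataset to rank the triplets, then one forward-backward on the top-100.
import mxnet as mx
from mxnet import autograd

def select_hard_triplets(net, dataset, k=100, batch_size=256):
    """Forward-only pass to find the indices of the k hardest triplets."""
    losses = []
    for batch in iter_batches(dataset, batch_size):   # hypothetical batching helper
        losses.append(triplet_loss(net, batch))       # no autograd.record(): forward only
    losses = mx.nd.concat(*losses, dim=0)
    return losses.topk(k=k)                           # indices of the k largest losses

def train_step(net, trainer, dataset, k=100):
    hard_idx = select_hard_triplets(net, dataset, k=k)
    hard_batch = take_batch(dataset, hard_idx)        # hypothetical gather helper
    with autograd.record():
        loss = triplet_loss(net, hard_batch)
    loss.backward()
    trainer.step(k)
```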

@edmBernard
Author

I tried, but I can't run forward-backward on a batch of 100+ images; it consumes too much memory and crashes.
That's why I'm asking whether there is a way to accumulate gradients outside of a single batch.

@ZiyueHuang
Member

I think you can hack update in the optimizer and store the gradients into its states across batches. For example, to accumulate over 10 batches: in the first 9 batches only aggregate the gradients into the states, and in the 10th batch aggregate the gradients and then update the weights.
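
Something along these lines (just a sketch, untested; I'm subclassing SGD and adding an extra state buffer, and the counter and averaging details are up to you):

```python
# Sketch: an optimizer that sums gradients into its state and only touches
# the weights every `accum` calls to update().
import mxnet as mx

class AccumSGD(mx.optimizer.SGD):
    def __init__(self, accum=10, **kwargs):
        super(AccumSGD, self).__init__(**kwargs)
        self.accum = accum
        self._counts = {}                               # per-parameter call counter

    def create_state(self, index, weight):
        # keep the parent state (momentum, ...) plus a buffer for summed gradients
        return (super(AccumSGD, self).create_state(index, weight),
                mx.nd.zeros_like(weight))

    def update(self, index, weight, grad, state):
        parent_state, grad_buf = state
        grad_buf += grad                                # accumulate this batch's gradient
        self._counts[index] = self._counts.get(index, 0) + 1
        if self._counts[index] % self.accum == 0:
            # every `accum` batches, apply the averaged gradient and reset the buffer
            super(AccumSGD, self).update(index, weight, grad_buf / self.accum, parent_state)
            grad_buf[:] = 0
```

You should then be able to pass an instance of it to gluon.Trainer in place of the 'sgd' string.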

@edmBernard
Author

I was looking for a more standard way :(
I will try hacking the optimizer, thanks.

@ZiyueHuang
Member

ZiyueHuang commented Sep 23, 2017 via email
