Loss stuck, not decreasing #27

Closed
mesnico opened this issue Feb 24, 2020 · 2 comments
mesnico commented Feb 24, 2020

Hi, I'm noticing a very strange loss behavior during the training phase.
Initially, the loss decreases as it should. At a certain point, it reaches a plateau from which, most of the time, it cannot escape.
In particular, if I use pre-extracted features without fine-tuning the image encoder, the plateau is overcome fairly quickly, as shown in the following plot:
[plot: training loss briefly plateaus, then keeps decreasing]

However, if I try to fine-tune, the loss gets stuck indefinitely:
[plot: training loss flattens at a constant value and never recovers]

I noticed that the loss gets stuck at a very specific value, namely 2 * (batch_size * loss_margin).
It seems that the loss is collapsing to a state where the difference between negative and positive pair similarities is always 0:
$s(i, c') - s(i, c) = 0$
and
$s(i', c) - s(i, c) = 0$

I'm using margin = 0.2. For the pre-extracted features I used batch size = 128, while for fine-tuning the batch size is 32. The configuration is otherwise the same as yours.
In general, I noticed this behavior when the network is more complex.
Maybe the reason is that good hard negatives cannot be found with batch sizes smaller than 128; however, I have hardware constraints.
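As a quick sanity check of that arithmetic (my own sketch, not code from this repository): if every negative similarity equals its positive, each of the two hinge terms contributes exactly the margin per example, so the batch loss sits at 2 * batch_size * margin, i.e. 12.8 for batch size 32 and margin 0.2.

```python
import torch

# Hedged sketch (not the repo's code): MAX-violation hinge loss evaluated on a
# degenerate similarity matrix where every pair gets the same score.
batch_size, margin = 32, 0.2
scores = torch.full((batch_size, batch_size), 0.5)    # s(i, c') == s(i, c) for all pairs
diag = scores.diag().view(batch_size, 1)

# hinge terms for the two retrieval directions
cost_cap = (margin + scores - diag).clamp(min=0)      # image i vs. wrong captions
cost_img = (margin + scores - diag.t()).clamp(min=0)  # caption c vs. wrong images

# mask out the positive pairs before taking the hardest negative
mask = torch.eye(batch_size, dtype=torch.bool)
cost_cap = cost_cap.masked_fill(mask, 0)
cost_img = cost_img.masked_fill(mask, 0)

loss = cost_cap.max(1)[0].sum() + cost_img.max(0)[0].sum()
print(loss.item())                                    # prints ~12.8, i.e. 2 * 32 * 0.2
```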

Did you notice a similar behavior in your experiments? If so, how did you solve it?
Thank you very much.

fartashf (Owner) commented

Thanks for reporting.
Generally, we observed that the MAX loss can be harder to optimize. There are a few ways to reduce the difficulty:

  • Train with the SUM loss and make sure it achieves reasonable performance. This is mostly a sanity check for the rest of your code (see the loss sketch after this list for the SUM/MAX distinction).
  • Stage-wise optimization: first start from pretrained encoders and train only the embedding, then switch to fine-tuning and train end-to-end.
  • Tune the batch size. I expect the optimal batch size for the MAX loss to be dataset-dependent. If the batch size is too large, hard negatives can be outliers; if it is too small, training can take longer, or the best result you end up with is bounded by what the SUM loss would give.
  • We have unpublished studies on the effect of a separate negative set. On MSCOCO, it happened that the optimal negative-set size was the same as the batch size used (128). Depending on your setting, a large negative set could be cheaper: you can trade off speed for a larger negative set. Choose a large negative set, compute all of its embeddings, discard the activations to save memory, match your mini-batch against the negative set, then do another forward pass only for the selected pairs.
    Here is a plot varying the negative-set size on MSCOCO: [figure: retrieval performance vs. negative-set size on MSCOCO]
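To make the SUM/MAX distinction above concrete, here is a minimal, hedged sketch of a bidirectional hinge loss with a max_violation switch; the class and argument names are illustrative, not necessarily the exact API in this repository.

```python
import torch
import torch.nn as nn


class ContrastiveLoss(nn.Module):
    """Hedged sketch of a bidirectional hinge-based triplet loss.

    max_violation=False sums over all negatives in the batch (SUM loss);
    max_violation=True keeps only the hardest negative per query (MAX loss).
    """

    def __init__(self, margin=0.2, max_violation=False):
        super().__init__()
        self.margin = margin
        self.max_violation = max_violation

    def forward(self, scores):
        # scores[i, j]: similarity of image i and caption j; the diagonal holds the positive pairs
        n = scores.size(0)
        diag = scores.diag().view(n, 1)

        cost_cap = (self.margin + scores - diag).clamp(min=0)      # image i vs. wrong captions
        cost_img = (self.margin + scores - diag.t()).clamp(min=0)  # caption j vs. wrong images

        # ignore the positive pairs themselves
        mask = torch.eye(n, dtype=torch.bool, device=scores.device)
        cost_cap = cost_cap.masked_fill(mask, 0)
        cost_img = cost_img.masked_fill(mask, 0)

        if self.max_violation:
            cost_cap = cost_cap.max(1)[0]
            cost_img = cost_img.max(0)[0]
        return cost_cap.sum() + cost_img.sum()
```

Keeping the switch as an attribute makes it easy to start with the SUM loss and flip to the MAX loss mid-training, which is what the stage-wise suggestion amounts to in code.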


mesnico commented Feb 27, 2020

Thank you very much for these hints. I think stage-wise optimization is the way to go: if I first optimize with the SUM loss and then resume after 10 epochs with the MAX loss, the problem disappears and the validation metrics keep increasing smoothly.
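In code, that schedule amounts to something like the following rough sketch; it assumes a ContrastiveLoss with a max_violation flag like the one sketched above, plus generic model, optimizer, and train_loader objects that are not spelled out here.

```python
# Hedged sketch of the stage-wise schedule: warm up with the SUM loss,
# then continue with the MAX loss for the remaining epochs.
warmup_epochs, num_epochs = 10, 30
criterion = ContrastiveLoss(margin=0.2, max_violation=False)

for epoch in range(num_epochs):
    criterion.max_violation = epoch >= warmup_epochs  # switch SUM -> MAX after warm-up
    for images, captions in train_loader:             # assumed loader of paired batches
        scores = model(images, captions)              # assumed: returns an (n, n) similarity matrix
        loss = criterion(scores)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```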

However, I will also pay attention to the batch size, as you suggested.
Thanks again.
