Add one-cycle policy #1776

Merged: 1 commit merged into flairNLP:master on Aug 3, 2020
Conversation

lucaventurini2 (Contributor)

This PR adds support for the one-cycle policy described by Smith in https://arxiv.org/pdf/1803.09820.pdf.

It is based on the PyTorch implementation: https://pytorch.org/docs/stable/optim.html#torch.optim.lr_scheduler.OneCycleLR.
I added only the strictly needed parameters to the trainer (max_lr, total_steps, cycle_momentum) and left the others at their default values, as they usually work well; if one wants to experiment with the rest, we would need to add them all to the trainer interface.
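For reference, a minimal sketch of what these three parameters mean at the PyTorch level; the model and optimizer below are placeholders, and the way the trainer forwards the arguments is this PR's internal detail, but the constructor itself is the standard torch.optim.lr_scheduler.OneCycleLR API:

import torch
from torch.optim.lr_scheduler import OneCycleLR

# Placeholder model and optimizer, just to show the scheduler construction.
model = torch.nn.Linear(10, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

# The three parameters exposed by this PR; everything else keeps its PyTorch default.
scheduler = OneCycleLR(
    optimizer,
    max_lr=0.1,           # peak learning rate of the cycle
    total_steps=1000,     # total number of scheduler.step() calls over the run
    cycle_momentum=True,  # also anneal momentum, inversely to the learning rate
)

for _ in range(1000):
    optimizer.step()
    scheduler.step()      # one-cycle steps once per batch, not once per epoch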

In experiments on private datasets, I have seen behavior similar to what is described in the paper (very fast convergence and good regularization).

I also tried to reproduce this example https://github.com/flairNLP/flair/blob/master/resources/docs/EXPERIMENTS.md#wnut-17-emerging-entity-detection-english and it reached

- F1-score (micro) 0.4925
- F1-score (macro) 0.4034

in 20 epochs.

If this PR is accepted, I'd also like to add some automated tests; maybe @alanakbik you could guide me here (e.g. point me to an existing training test that I can adapt to one-cycle)?

@alanakbik (Collaborator)

@lucaventurini2 thanks for adding this! Could you paste a quick example of a training script that uses this? (and sorry for the late reply - I've been offline)

@lucaventurini2 (Contributor, Author)

Yes, the minimum needed change to a script is something like:

from torch.optim.lr_scheduler import OneCycleLR

trainer.train(out_folder, scheduler=OneCycleLR, max_epochs=20)
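For context, here is a sketch of how that change slots into a full flair training script; the corpus/tagger setup below just follows the usual flair recipe from the docs and is only illustrative, not part of this PR:

from torch.optim.lr_scheduler import OneCycleLR

from flair.datasets import WNUT_17
from flair.embeddings import WordEmbeddings
from flair.models import SequenceTagger
from flair.trainers import ModelTrainer

# Standard flair setup: corpus, tag dictionary, tagger, trainer.
corpus = WNUT_17()
tag_dictionary = corpus.make_tag_dictionary(tag_type='ner')
tagger = SequenceTagger(hidden_size=256,
                        embeddings=WordEmbeddings('glove'),
                        tag_dictionary=tag_dictionary,
                        tag_type='ner')
trainer = ModelTrainer(tagger, corpus)

# The only one-cycle-specific change: pass the scheduler class to train().
trainer.train('resources/taggers/example-ner',
              scheduler=OneCycleLR,
              max_epochs=20)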

@alanakbik (Collaborator)

Ah thanks. So 20 epochs is the recommendation? What about cycle_momentum, should this be set as well? And how did you get the F-score above?

@lucaventurini2 (Contributor, Author)

No, I chose 20 just to try a number considerably smaller than what I saw in the wnut-17 example; it's not a recommendation.

The example above is exactly how I got that result; I left the default lr. cycle_momentum should be set when we would normally want to use momentum, but I haven't yet seen a scenario where setting it was worthwhile (if you have some experiments where you use momentum, please try it!). The most critical parameters are the batch size and the lr, as explained in the paper, but values that work well with annealing should work well with one-cycle too, since they presumably don't make the loss diverge.
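To make the cycle_momentum point concrete, a sketch at the PyTorch level (placeholder model, and SGD chosen only because it actually has a momentum term; how the flag is wired through the flair trainer is this PR's detail): with cycle_momentum=True the scheduler anneals momentum in the opposite direction to the learning rate, so the flag only matters when the optimizer uses momentum in the first place.

import torch
from torch.optim.lr_scheduler import OneCycleLR

model = torch.nn.Linear(10, 2)  # placeholder model

# cycle_momentum is only meaningful with a momentum-based optimizer such as SGD(momentum=...).
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

scheduler = OneCycleLR(optimizer,
                       max_lr=0.1,
                       total_steps=500,
                       cycle_momentum=True)  # momentum goes down while the lr goes up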

@lucaventurini2 (Contributor, Author)

Just for clarity, in wnut-17 I launched the training with

trainer.train('resources/taggers/example-ner',
              train_with_dev=True, 
              scheduler=OneCycleLR, max_epochs=20)

@alanakbik (Collaborator)

@lucaventurini2 thanks again for adding this! I've tested a bit and everything looks good!

alanakbik merged commit 928a168 into flairNLP:master on Aug 3, 2020