
v7.3.0: Mish activation and experimental optimizers

@ines released this 28 Oct 13:11

✨ New features and improvements

  • Add the Mish activation, where mish(x) = x · tanh(softplus(x)). Use it via the thinc.v2v.Mish layer, which computes f(X) = mish(W @ X + b); CUDA and Cython kernels are included to keep the activation efficient. A usage sketch follows this list.
  • Add experimental support for RAdam to the optimizer. Enable it by setting the keyword argument use_radam to True. In preliminary testing, it's a small improvement and seems worth enabling.
  • Add experimental support for Lookahead to the optimizer. Enable it by setting the keyword argument lookahead_k to a positive integer. In preliminary testing, it helps if you're not using parameter averaging, but with averaging it's a bit worse.
  • Add experimental support for LARS to the optimizer. Enable it by setting use_lars to True. In preliminary testing, this hasn't worked well at all – possibly our implementation is broken.
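
The snippet below is a minimal sketch of how these pieces fit together, assuming the thinc.v2v / thinc.neural APIs of the 7.x series. The layer sizes, batch shape and the lookahead_k value are illustrative only, not recommendations from the release.

```python
# Minimal sketch: a small feed-forward model using the new Mish layer,
# with an Adam optimizer that has the experimental flags enabled.
import numpy

from thinc.v2v import Mish, Softmax
from thinc.api import chain
from thinc.neural.optimizers import Adam
from thinc.neural.ops import NumpyOps

n_features, n_hidden, n_classes = 64, 128, 10

# Mish(nO, nI) computes mish(W @ X + b), analogous to Affine with a Mish activation.
model = chain(Mish(n_hidden, n_features), Softmax(n_classes, n_hidden))

optimizer = Adam(
    NumpyOps(),
    0.001,
    use_radam=True,    # experimental: rectified Adam
    lookahead_k=6,     # experimental: Lookahead; 6 is an arbitrary example value
    # use_lars=True,   # experimental: LARS; reported as not working well yet
)

# One dummy update, just to show the begin_update / backprop call pattern.
X = numpy.zeros((32, n_features), dtype="f")
truth = numpy.zeros((32, n_classes), dtype="f")
truth[:, 0] = 1.0

y_pred, backprop = model.begin_update(X)
backprop((y_pred - truth) / truth.shape[0], sgd=optimizer)
```

The experimental options are plain keyword arguments on the optimizer, so they can be combined or left at their defaults independently; only the flags named in the list above are taken from the release notes.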

🙏 Acknowledgements

Big thanks to @digantamisra98 for the Mish activation, especially the extensive experiments and simple gradient calculation. We expect to be using the activation in the next round of spaCy models.

Gratitude to the fast.ai community for their crowd-sourced experiments, and especially to users @lessw2020, @mgrankin and others for their optimizer implementations, which we referenced heavily when implementing the optimizers for Thinc. More importantly, it's super helpful to have a community filtering the deluge of papers for techniques that work on a few different datasets. This thread on optimization research was particularly helpful.