- word2vec (Google): Tomas Mikolov, Kai Chen, Greg Corrado and Jeffrey Dean released Efficient Estimation of Word Representations in Vector Space
- GloVe (Stanford): Jeffrey Pennington, Richard Socher, and Christopher D. Manning released GloVe: Global Vectors for Word Representation
- fastText (Facebook): Tomas Mikolov, Edouard Grave, Piotr Bojanowski, Christian Puhrsch and Armand Joulin released Advances in Pre-Training Distributed Word Representations
- BERT (Google): Jacob Devlin, Ming-Wei Chang, Kenton Lee and Kristina Toutanova released BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. Used Hugging Face PyTorch version.
- RoBERTa (UW/Facebook): Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov released RoBERTa: A Robustly Optimized BERT Pretraining Approach. Used Hugging Face PyTorch version.
- DistilBERT (Hugging Face): Used Hugging Face PyTorch version.
- GPT2 (OpenAI): Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei and Ilya Sutskever released Language Models are Unsupervised Multitask Learners. Used Hugging Face PyTorch version.
- DistilGPT2 (Hugging Face): Used Hugging Face PyTorch version.
- XLNet (Google/CMU): Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov and Quoc V. Le released XLNet: Generalized Autoregressive Pretraining for Language Understanding. Used Hugging Face PyTorch version.
- Fairseq WMT19 (Facebook): Nathan Ng, Kyra Yee, Alexei Baevski, Myle Ott, Michael Auli and Sergey Edunov released Facebook FAIR’s WMT19 News Translation Task Submission
Some of the above augmenters are inspired by the following research papers. However, they do not always follow the original implementations, for various reasons. If you need the original implementation, please refer to the original source code.
- J. Salamon and J. P. Bello. Deep Convolutional Neural Networks and Data Augmentation for Environmental Sound Classification. 2016
- Y. Belinkov and Y. Bisk. Synthetic and Natural Noise Both Break Neural Machine Translation. 2017
- J. Ebrahimi, A. Rao, D. Lowd and D. Dou. HotFlip: White-Box Adversarial Examples for Text Classification. 2018
- J. Ebrahimi, D. Lowd and D. Dou. On Adversarial Examples for Character-Level Neural Machine Translation. 2018
- D. Pruthi, B. Dhingra and Z. C. Lipton. Combating Adversarial Misspellings with Robust Word Recognition. 2019
- T. Niu and M. Bansal. Adversarial Over-Sensitivity and Over-Stability Strategies for Dialogue Models. 2018
- P. Minervini and S. Riedel. Adversarially Regularising Neural NLI Models to Integrate Logical Background Knowledge. 2018
- X. Zhang, J. Zhao and Y. LeCun. Character-level Convolutional Networks for Text Classification. 2015
- C. Coulombe. Text Data Augmentation Made Simple By Leveraging NLP Cloud APIs. 2018
- Q. Xie, Z. Dai, E. Hovy, M. T. Luong and Q. V. Le. Unsupervised Data Augmentation. 2019
- W. Y. Wang and D. Yang. That’s So Annoying!!!: A Lexical and Frame-Semantic Embedding Based Data Augmentation Approach to Automatic Categorization of Annoying Behaviors using #petpeeve Tweets. 2015
- S. Kobayashi. Contextual Augmentation: Data Augmentation by Words with Paradigmatic Relation. 2018
- D. S. Park, W. Chan, Y. Zhang, C. C. Chiu, B. Zoph, E. D. Cubuk and Q. V. Le. SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition. 2019
- R. Jia and P. Liang. Adversarial Examples for Evaluating Reading Comprehension Systems. 2017
- M. Alzantot, Y. Sharma, A. Elgohary, B. Ho, M. B. Srivastava and K. Chang. Generating Natural Language Adversarial Examples. 2018
- Z. Xie, S. I. Wang, J. Li, D. Levy, A. Nie, D. Jurafsky and A. Y. Ng. Data Noising as Smoothing in Neural Network Language Models. 2017
- N. Jaitly and G. E. Hinton. Vocal Tract Length Perturbation (VTLP) improves speech recognition. 2013
- N. Ng, K. Yee, A. Baevski, M. Ott, M. Auli and S. Edunov. Facebook FAIR’s WMT19 News Translation Task Submission. 2019
- V. Kumar, A. Choudhary and E. Cho. Data Augmentation using Pre-trained Transformer Models. 2020
- Y. Hwang, H. Cho, H. Yang, D. Won, I. Oh and S. Lee. Mel-spectrogram augmentation for sequence-to-sequence voice conversion. 2020
- G. G. Sahin and M. Steedman. Data Augmentation via Dependency Tree Morphing for Low-Resource Languages. 2019
- M. Regina, M. Meyer and S. Goutal. Text Data Augmentation: Towards better detection of spear-phishing emails. 2020