- A Neural Probabilistic Language Model
- Efficient Estimation of Word Representations in Vector Space
- Distributed Representations of Words and Phrases and their Compositionality
- GloVe: Global Vectors for Word Representation
- Enriching Word Vectors with Subword Information
- Extensions of recurrent neural network language model
- Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation
- Sequence to Sequence Learning with Neural Networks
- Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling
- Convolutional Neural Networks for Sentence Classification
- Convolutional, Long Short-Term Memory, fully connected Deep Neural Networks
- Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation
- Neural Machine Translation of Rare Words with Subword Units
- Attention Is All You Need
- Nyströmformer: A Nyström-Based Algorithm for Approximating Self-Attention
- Linformer: Self-Attention with Linear Complexity
- You May Not Need Attention
- Attention is not not Explanation
- BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
- RoBERTa: A Robustly Optimized BERT Pretraining Approach
- DeBERTa: Decoding-enhanced BERT with Disentangled Attention
- BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension
- DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter
- Improving Language Understanding by Generative Pre-Training
- An overview of gradient descent optimization algorithms
- Understanding the difficulty of training deep feedforward neural networks
- Cyclical Learning Rates for Training Neural Networks
- Don't Decay the Learning Rate, Increase the Batch Size
- Large Batch Optimization for Deep Learning: Training BERT in 76 minutes