Just helping myself keep track of LLM papers that I'm reading, with an emphasis on inference and model compression.

Transformer Architectures

Attention Is All You Need
Fast Transformer Decoding: One Write-Head is All You Need - Multi-Query Attention
Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context
Augmenting Self-attention with Persistent Memory (Meta 2019)
MEGABYTE: Predicting Million-byte Sequences with Multiscale Transformers (Meta 2023)
Hyena Hierarchy: Towards Larger Convolutional Language Models

Foundation Models

LLaMA: Open and Efficient Foundation Language Models
PaLM: Scaling Language Modeling with Pathways
GPT-NeoX-20B: An Open-Source Autoregressive Language Model
Language Models are Unsupervised Multitask Learners (OpenAI) - GPT-2
BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension
OpenLLaMA: An Open Reproduction of LLaMA
Llama 2: Open Foundation and Fine-Tuned Chat Models
DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter
Mamba: Linear-Time Sequence Modeling with Selective State Spaces

Position Encoding

Self-Attention with Relative Position Representations
RoFormer: Enhanced Transformer with Rotary Position Embedding - RoPE
Transformer Language Models without Positional Encodings Still Learn Positional Information - NoPE
Rectified Rotary Position Embeddings - ReRoPE

KV Cache

H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models (Jun. 2023)
vLLM: Easy, Fast, and Cheap LLM Serving with PagedAttention
FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness

Activation

Pruning

Optimal Brain Damage (1990)
Optimal Brain Surgeon (1993)
Optimal Brain Compression: A Framework for Accurate Post-Training Quantization and Pruning (Jan. 2023) - Introduces Optimal Brain Quantization based on the Optimal Brain Surgeon
Learning to Prune Deep Neural Networks via Layer-wise Optimal Brain Surgeon
SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot
A Simple and Effective Pruning Approach for Large Language Models - Introduces Wanda (pruning with Weights and Activations)
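
As a reminder of what the Wanda note above refers to, here is a minimal sketch of the scoring idea: each weight is scored by its magnitude times the L2 norm of the matching input feature over some calibration data, and the lowest-scoring weights in each output row are zeroed. The function names, the toy matrices, and the 50% sparsity are all made up for illustration; this is not the paper's reference code.

```python
import math

def wanda_scores(weights, activations):
    """Score each weight as |W[i][j]| * ||X[:, j]||_2.

    weights: list of rows, shape (out_features, in_features)
    activations: calibration samples, shape (n_samples, in_features)
    """
    n_in = len(weights[0])
    # L2 norm of each input feature across the calibration samples
    norms = [math.sqrt(sum(x[j] ** 2 for x in activations)) for j in range(n_in)]
    return [[abs(w) * norms[j] for j, w in enumerate(row)] for row in weights]

def prune_per_row(weights, scores, sparsity):
    """Zero the lowest-scoring fraction of weights within each output row."""
    pruned = []
    for row, row_scores in zip(weights, scores):
        n_drop = int(len(row) * sparsity)
        # indices of the n_drop smallest scores in this row
        drop = set(sorted(range(len(row)), key=lambda j: row_scores[j])[:n_drop])
        pruned.append([0.0 if j in drop else w for j, w in enumerate(row)])
    return pruned

# Toy example (hypothetical numbers): the last input feature never activates.
W = [[0.5, -0.1, 0.3, 2.0],
     [-1.0, 0.2, 0.05, 0.4]]
X = [[1.0, 10.0, 0.1, 0.0],
     [2.0, 10.0, 0.1, 0.0]]
pruned = prune_per_row(W, wanda_scores(W, X), sparsity=0.5)
```

In this toy case the largest weight (2.0) gets pruned because its input feature is always zero, while the small weight -0.1 survives because its input is large. Plain magnitude pruning would do the opposite.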

Quantization

LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale - Quantization with outlier handling. Might be solving the wrong problem - see "Quantizable Transformers" below.
SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models - Another approach to quantization with outliers
Up or Down? Adaptive Rounding for Post-Training Quantization (Qualcomm 2020) - Introduces AdaRound
Understanding and Overcoming the Challenges of Efficient Transformer Quantization (Qualcomm 2021)
QuIP: 2-Bit Quantization of Large Language Models With Guarantees (Cornell Jul. 2023) - Introduces incoherence processing
SqueezeLLM: Dense-and-Sparse Quantization (Berkeley Jun. 2023)
Intriguing Properties of Quantization at Scale (Cohere May 2023)
Pruning vs Quantization: Which is Better? (Qualcomm Jul. 2023)
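
To make the outlier problem in the notes above concrete, here is a small sketch using plain absmax int8 quantization. It shows how one large value inflates the scale and wipes out the small values, and how keeping the outlier in full precision avoids that. This is only loosely in the spirit of the outlier handling in LLM.int8(), which works on feature dimensions inside the matrix multiplication. The helper names, the numbers, and the threshold are assumptions for illustration.

```python
def absmax_quantize(values, bits=8):
    """Symmetric absmax quantization; assumes at least one nonzero value."""
    qmax = 2 ** (bits - 1) - 1  # 127 for int8
    scale = max(abs(v) for v in values) / qmax
    return [round(v / scale) for v in values], scale

def dequantize(q, scale):
    return [x * scale for x in q]

def max_error(original, reconstructed):
    return max(abs(a - b) for a, b in zip(original, reconstructed))

def split_reconstruct(values, threshold):
    """Keep |v| >= threshold in full precision, absmax-quantize the rest.

    Assumes at least one nonzero value falls below the threshold.
    """
    inliers = [v for v in values if abs(v) < threshold]
    q, scale = absmax_quantize(inliers)
    deq = iter(dequantize(q, scale))
    return [v if abs(v) >= threshold else next(deq) for v in values]

plain = [0.1, -0.2, 0.3, 0.25]
with_outlier = plain + [60.0]  # one hypothetical outlier

err_plain = max_error(plain, dequantize(*absmax_quantize(plain)))
err_outlier = max_error(with_outlier, dequantize(*absmax_quantize(with_outlier)))
err_split = max_error(with_outlier, split_reconstruct(with_outlier, threshold=1.0))
```

With the outlier included, the quantization step is roughly 0.47, so the small values round to 0 or to one step and `err_outlier` is far larger than `err_plain`. Splitting the outlier off brings the error back to the no-outlier level.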

Normalization

Root Mean Square Layer Normalization
Quantizable Transformers: Removing Outliers by Helping Attention Heads Do Nothing - Introduces gated attention and argues that outliers are a consequence of normalization

Sparsity and rank compression

Compressing Pre-trained Language Models by Matrix Decomposition - vanilla SVD decomposition to reduce matrix sizes
Language model compression with weighted low-rank factorization - Fisher information-weighted SVD
Numerical Optimizations for Weighted Low-rank Estimation on Language Model - Iterative implementation for the above
Weighted Low-Rank Approximation (2003)
Transformers learn through gradual rank increase
Pixelated Butterfly: Simple and Efficient Sparse training for Neural Network Models
Scatterbrain: Unifying Sparse and Low-rank Attention Approximation
LoSparse: Structured Compression of Large Language Models based on Low-Rank and Sparse Approximation
LadaBERT: Lightweight Adaptation of BERT through Hybrid Model Compression
KroneckerBERT: Learning Kronecker Decomposition for Pre-trained Language Models via Knowledge Distillation
TRP: Trained Rank Pruning for Efficient Deep Neural Networks - Introduces energy-pruning ratio

Fine-tuning

LoRA: Low-Rank Adaptation of Large Language Models
QLoRA: Efficient Finetuning of Quantized LLMs
DyLoRA: Parameter Efficient Tuning of Pre-trained Models using Dynamic Search-Free Low-Rank Adaptation - works over a range of ranks
Full Parameter Fine-tuning for Large Language Models with Limited Resources

Sampling

Scaling

Efficiently Scaling Transformer Inference (Google Nov. 2022) - Pipeline and tensor parallelization for inference
Megatron-LM (Nvidia Mar. 2020) - Intra-layer parallelism for training

Mixture of Experts

Adaptive Mixtures of Local Experts (1991, remastered PDF)
Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer (Google 2017)
Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity (Google 2022)
Go Wider Instead of Deeper

Watermarking

More
