# LLM Context Extension

## Introduction

LLM context length, or context window, is the maximum input length (in tokens) that the model can process in a single call. It is the model's "attention span": the range of tokens the model can attend to when generating the next token. In attention implementations, it usually appears as a parameter named `seq_len` (or `max_seq_len`).

LLMs are stateless, so the context window determines how much information we can give the model in each call, which in turn determines what tasks the model can do for us. For example, can it process a few sentences or paragraphs, a large set of documents, or even entire books and codebases?

The standard attention mechanism has quadratic complexity in the context length (O(N^2)): every token attends to every other token. Extending the context length N therefore rapidly increases computation and memory costs, which makes efficient context extension essential for both model quality and system performance.
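To make the quadratic growth concrete, here is a rough back-of-the-envelope sketch. The helper `attention_score_cost` and its defaults (32 heads, fp16 scores, batch size 1, the full score matrix materialized with no tiling or fused kernels) are illustrative assumptions, not figures for any particular model.

```python
def attention_score_cost(seq_len: int, n_heads: int = 32, bytes_per_elem: int = 2) -> dict:
    """Rough size of the attention score matrices for one layer.

    Assumptions (illustrative only): 32 heads, fp16 scores (2 bytes),
    batch size 1, and the full N x N matrix held in memory at once.
    """
    entries = n_heads * seq_len * seq_len  # one N x N score matrix per head
    return {"entries": entries, "gib": entries * bytes_per_elem / 2**30}


for n in (512, 2048, 8192, 32768):
    cost = attention_score_cost(n)
    print(f"N={n:>6}: {cost['entries']:.3e} scores, {cost['gib']:.3f} GiB per layer")
```

Under these assumptions, doubling N quadruples the number of scores, and a single layer at N = 8192 already needs 4 GiB for the score matrices alone.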

## Growing Context Length

Early Transformers had limited context lengths. BERT (2018) and T5 (2020) both had a maximum context length of 512. GPT-2 (2019) doubled that to 1024. GPT-3 (2020) doubled GPT-2's limit to 2048, GPT-3.5 (2022) doubled it again to 4096, and GPT-4 (2023) doubled it once more to 8192.

As the context window grew from 512 to 8K tokens, the complexity and diversity of tasks that an LLM can handle evolved from sentence-level tasks such as sentiment classification, NER, and translation to complex multi-turn dialogue and multi-page document processing.




### Landscape

- Early Transformers: 512-1024 tokens (BERT, GPT-2)
- Each generation pushed further: GPT-3 and GPT-3.5 (2K-4K), open source long-context models (8K-32K), and now commercial LLMs offering 128K-1M+ tokens (GPT-4 Turbo, Claude 3.5/4, Gemini 1.5)

## Encoding-Based Context Extension Techniques

### RoPE Variants

- Position Interpolation
- NTK-Aware Scaling
- xPos
- YaRN
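As a minimal sketch of the first two entries above: Position Interpolation rescales positions so that a longer sequence maps back into the position range seen during training, while NTK-aware scaling enlarges the RoPE base instead. The function names, the default base of 10000, and the example lengths are assumptions for illustration; real implementations apply these angles as rotations to query/key pairs, which is omitted here.

```python
def rope_frequencies(head_dim: int, base: float = 10000.0) -> list[float]:
    """Per-pair rotation frequencies: base^(-2i/d) for i = 0 .. d/2 - 1."""
    return [base ** (-2 * i / head_dim) for i in range(head_dim // 2)]


def rope_angles(position: float, head_dim: int, base: float = 10000.0) -> list[float]:
    """Rotation angle of each dimension pair at a given position."""
    return [position * f for f in rope_frequencies(head_dim, base)]


def pi_angles(position: int, head_dim: int, trained_len: int, target_len: int) -> list[float]:
    """Position Interpolation: squeeze [0, target_len) into [0, trained_len)."""
    scale = trained_len / target_len
    return rope_angles(position * scale, head_dim)


def ntk_scaled_base(head_dim: int, scale_factor: float, base: float = 10000.0) -> float:
    """NTK-aware scaling: enlarge the base so the lowest frequency stretches
    by scale_factor while the highest frequency is left unchanged."""
    return base * scale_factor ** (head_dim / (head_dim - 2))


# Hypothetical setup: model trained at 2048 tokens, extended to 8192.
print(pi_angles(8191, 64, 2048, 8192)[0])  # stays below the trained range of 2048
print(ntk_scaled_base(64, 4.0))            # enlarged base for a 4x extension
```

With interpolation, position 4096 in the extended window receives the same angles as position 1024 did during training, so the model never sees an out-of-range position, at the cost of packing neighboring tokens closer together.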

### Non-RoPE Alternatives

- ALiBi
- Relative Position Representations (Shaw et al., Transformer-XL, T5)
- Recurrence & Memory (Transformer-XL, Compressive Transformer)
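To illustrate the ALiBi entry above: rather than encoding positions in the embeddings, ALiBi adds a per-head linear penalty on query-key distance directly to the attention scores. The sketch below is a plain-Python illustration with hypothetical helper names; it assumes the number of heads is a power of two and a causal (decoder-style) mask.

```python
def alibi_slopes(n_heads: int) -> list[float]:
    """Geometric slopes 2^(-8/n), 2^(-16/n), ..., 2^(-8).

    Assumes n_heads is a power of two.
    """
    return [2 ** (-8 * (h + 1) / n_heads) for h in range(n_heads)]


def alibi_bias(seq_len: int, slope: float) -> list[list[float]]:
    """Bias added to the scores of one head.

    Entry [i][j] is -slope * (i - j) for keys at or before the query,
    and -inf for future keys (causal masking).
    """
    return [
        [-slope * (i - j) if j <= i else float("-inf") for j in range(seq_len)]
        for i in range(seq_len)
    ]


slopes = alibi_slopes(8)
bias = alibi_bias(4, slopes[0])
for row in bias:
    print(row)
```

Because the penalty depends only on distance, the same rule applies unchanged to positions beyond the training length, which is what lets this family of methods extrapolate.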

## Beyond Encodings

- Sparse Attention (Longformer, BigBird)
- Sliding Window Attention (MPT, GPT-OSS)
- Recurrence & Memory (Transformer-XL, Compressive Transformer, RWKV)
- Hybrid Layer
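As a small sketch of the sliding window idea listed above: each query attends only to the most recent `window` tokens, so the number of attended pairs grows linearly with sequence length instead of quadratically. The helper name and the tiny sizes are illustrative assumptions; production kernels avoid building the mask explicitly.

```python
def sliding_window_mask(seq_len: int, window: int) -> list[list[bool]]:
    """mask[i][j] is True when query i may attend to key j.

    Causal, and limited to the last `window` tokens (including the query itself).
    """
    return [[0 <= i - j < window for j in range(seq_len)] for i in range(seq_len)]


def attended_pairs(mask: list[list[bool]]) -> int:
    """Count the query-key pairs that are actually computed."""
    return sum(sum(row) for row in mask)


full_causal = sliding_window_mask(8, window=8)  # window covers everything
windowed = sliding_window_mask(8, window=3)
print(attended_pairs(full_causal), attended_pairs(windowed))
```

For a fixed window w, the pair count is roughly N * w, which is why this approach is often combined with a few full-attention layers in hybrid designs.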