# vLLM: OpenAI REST API to LLM Inference Code Flow Trace

This notebook provides a comprehensive trace of the code flow from OpenAI REST API calls to actual LLM inference in vLLM v0.10.0.

## Overview
This document traces the complete journey of an API request through the vLLM codebase, including:
- API endpoint handling
- Request preprocessing and validation
- Engine scheduling
- Model execution
- Output post-processing

---

## Layer 1: OpenAI REST API Entry Point

### File: `vllm/entrypoints/openai/api_server.py`

When a client makes a request to the OpenAI-compatible API endpoint (e.g., `/v1/chat/completions`), the request is handled by FastAPI routes defined in this file.

**Key Classes & Functions:**
- `ChatCompletionRequest` - Request protocol/schema
- `create_chat_completion()` - Main endpoint handler for chat completions
- `app` - FastAPI application instance

**Flow:**
```
HTTP Request
    ↓
FastAPI Route Handler (e.g., @app.post("/v1/chat/completions"))
    ↓
parse_request() / validate_request()
    ↓
create_chat_completion(request: ChatCompletionRequest)
```

**Code Reference:**
```python
# vllm/entrypoints/openai/api_server.py:1857 lines
# Contains OpenAI-compatible API server implementation
# Uses FastAPI for HTTP handling
# Routes requests to appropriate serving handlers
```
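As a client-side illustration of what reaches this layer, the following minimal sketch builds (but does not send) a chat-completions request using only the Python standard library. The helper name `build_chat_request` and the server address `http://localhost:8000` are assumptions made for this example, not part of vLLM.

```python
import json
import urllib.request


def build_chat_request(base_url, model, messages, **params):
    """Build (but do not send) a POST request for /v1/chat/completions.

    Hypothetical helper for illustration; not part of vLLM.
    """
    body = {"model": model, "messages": messages, **params}
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_chat_request(
    "http://localhost:8000",  # assumed address of a locally running server
    "meta-llama/Llama-2-7b",
    [{"role": "user", "content": "Hello, how are you?"}],
    temperature=0.7,
    max_tokens=100,
)
print(req.get_method(), req.full_url)
```

Passing `req` to `urllib.request.urlopen` would send it to a running server; the JSON body is what FastAPI parses into a `ChatCompletionRequest`.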

---

## Layer 2: Request Serving & Processing

### File: `vllm/entrypoints/openai/serving_chat.py`

The `OpenAIServingChat` class handles chat-specific request processing.

**Key Classes:**
- `OpenAIServingChat(OpenAIServing)` - Handles chat completion logic
- `OpenAIServing` - Base serving class with common functionality

**Main Method: `create_chat_completion()`**

Steps:
1. **Check model validity** - Verify model exists and is loaded
2. **Get tokenizer** - Retrieve the tokenizer for this request
3. **Preprocess chat messages** - Convert chat messages to prompt format using chat template
4. **Create sampling parameters** - Set generation parameters (temperature, top_p, etc.)
5. **Prepare engine prompts** - Convert preprocessed messages to format engine expects
6. **Generate completion** - Call engine to generate tokens

**Code Flow:**
```python
create_chat_completion(request: ChatCompletionRequest)
    ├── _check_model(request)
    ├── engine_client.get_tokenizer(lora_request)
    ├── _preprocess_chat(
    │   ├── messages → conversation
    │   ├── chat_template application
    │   └── return: (conversation, request_prompts, engine_prompts)
    ├── Create SamplingParams from request
    └── _generate_completions(engine_prompts, sampling_params)
```
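Step 3 (applying a chat template) can be sketched in a few lines. The template format below (`<|role|>` markers) is invented for illustration; real models ship their own templates, and vLLM applies those through the tokenizer rather than through a function like this one.

```python
def apply_toy_chat_template(messages, add_generation_prompt=True):
    """Flatten chat messages into one prompt string.

    The "<|role|>" marker format is a made-up example, not a real model template.
    """
    parts = [f"<|{m['role']}|>\n{m['content']}\n" for m in messages]
    if add_generation_prompt:
        # Cue the model that the assistant's turn comes next.
        parts.append("<|assistant|>\n")
    return "".join(parts)


prompt = apply_toy_chat_template(
    [{"role": "user", "content": "Hello, how are you?"}]
)
print(prompt)
```

The resulting string is what gets tokenized and handed to the engine as the prompt.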

---

## Layer 3: Engine Interface (Async Layer)

### File: `vllm/entrypoints/openai/serving_engine.py`

**Key Class: `OpenAIServing`**

This is the base class handling:
- Request validation
- Input/output format conversion
- Engine client communication

**Critical Methods:**
```python
def _generate_completions(
    engine_prompts, 
    sampling_params, 
    request_id
)
    ├── engine_client.add_request(request_id, prompt, sampling_params)
    └── engine_client.generate(request_id)  # Returns async generator
```

**What happens:**
1. Request ID is generated (unique identifier for this request)
2. Prompts are passed to the engine
3. Sampling parameters are attached
4. Engine returns an async generator of results

---

## Layer 4: Async LLM Engine

### File: `vllm/engine/async_llm_engine.py`

**Key Class: `AsyncLLMEngine`**

This wraps the synchronous `LLMEngine` for async operation.

**Key Methods:**
```python
async def add_request(
    request_id: str,
    prompt: PromptType,
    sampling_params: SamplingParams,
    lora_request: Optional[LoRARequest] = None
)
    ├── RequestTracker.add_request() - Track new request
    └── _request_streams[request_id] = AsyncStream()  # Create result stream

async def generate(request_id: str)
    ├── while request not finished:
    ├── yield await request_stream.get_next_output()
    └── return final output
```

**Request Tracking:**
- `RequestTracker` maintains queues of new/pending requests
- `AsyncStream` provides an async generator for each request
- Engine loop processes requests in background

**Important: The actual scheduling & execution happens in the background engine loop:**
```python
async def run_engine_loop()  # Background task
    ├── while True:
    ├── process_new_requests()  # From RequestTracker
    ├── step()  # Call engine.step()
    └── update_request_streams()  # Send outputs to streams
```
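The pattern above (a per-request stream fed by one background loop) can be sketched with plain `asyncio`. This is a toy model under simplifying assumptions: `ToyAsyncEngine` is hypothetical, an `asyncio.Queue` stands in for `AsyncStream`, and "inference" just counts down the tokens left to produce.

```python
import asyncio


class ToyAsyncEngine:
    """Hypothetical stand-in for the async engine; not vLLM code."""

    def __init__(self):
        self.new_requests = []  # stands in for the tracker's new-request queue
        self.streams = {}       # request_id -> asyncio.Queue (toy AsyncStream)
        self.running = {}       # request_id -> tokens still to generate

    def add_request(self, request_id, num_tokens):
        self.streams[request_id] = asyncio.Queue()
        self.new_requests.append((request_id, num_tokens))

    async def generate(self, request_id):
        stream = self.streams[request_id]
        while True:
            item = await stream.get()
            if item is None:  # sentinel: request finished
                return
            yield item

    def step(self):
        """One engine iteration: every running request emits one token."""
        outputs = []
        for rid in list(self.running):
            self.running[rid] -= 1
            finished = self.running[rid] == 0
            outputs.append((rid, finished))
            if finished:
                del self.running[rid]
        return outputs

    async def run_engine_loop(self):
        while self.new_requests or self.running:
            while self.new_requests:  # admit newly added requests
                rid, n = self.new_requests.pop(0)
                self.running[rid] = n
            for rid, finished in self.step():
                self.streams[rid].put_nowait(f"{rid}-token")
                if finished:
                    self.streams[rid].put_nowait(None)
            await asyncio.sleep(0)  # let consumers run between iterations


async def demo():
    engine = ToyAsyncEngine()
    engine.add_request("req-a", 2)
    engine.add_request("req-b", 3)
    loop_task = asyncio.create_task(engine.run_engine_loop())

    async def collect(rid):
        return [tok async for tok in engine.generate(rid)]

    results = await asyncio.gather(collect("req-a"), collect("req-b"))
    await loop_task
    return results


results = asyncio.run(demo())
print(results)
```

Both requests advance in the same loop iterations, yet each caller only sees its own stream.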

---

## Layer 5: Core LLM Engine (Synchronous)

### File: `vllm/engine/llm_engine.py`

**Key Class: `LLMEngine`**

This is the core engine orchestrating the entire inference pipeline.

### Phase 1: Request Addition

**Method: `add_request()`** (Lines 619-769)

```python
def add_request(
    request_id: str,
    prompt: PromptType,
    params: Union[SamplingParams, PoolingParams],
    arrival_time: Optional[float] = None,
    lora_request: Optional[LoRARequest] = None,
    ...
)
    ├── Input validation
    ├── Preprocess prompt
    │   └── input_preprocessor.preprocess()
    ├── Tokenize prompt (if needed)
    ├── Create Sequence objects
    │   └── One per output (n=sampling_params.n)
    ├── Create SequenceGroup (batching unit)
    └── Add to scheduler
        └── self.scheduler.add_seq_group()
```

**Key Points:**
- Prompt is converted to token IDs via tokenizer
- `Sequence` objects track individual output sequences
- `SequenceGroup` represents all outputs for a single request
- Request goes into the scheduler's waiting queue

---

## Phase 2: Scheduling & Model Execution

### Method: `step()` (Lines 1194-1494)

This is called repeatedly in a loop and is the **core inference iteration**.

**Overview:**
```python
def step() -> List[Union[RequestOutput, PoolingRequestOutput]]:
    # STEP 1: Schedule
    ├── scheduler.schedule()  # Decide which sequences to run next
    │   ├── Select sequences from pending queue
    │   ├── Determine KV cache operations (swap in/out/copy)
    │   ├── Return: seq_group_metadata_list, scheduler_outputs
    │   
    # STEP 2: Execute Model
    ├── model_executor.execute_model(execute_model_req)
    │   └── Returns: List[SamplerOutput]
    │
    # STEP 3: Post-Process Outputs
    ├── _process_model_outputs()
    │   ├── Decode tokens to text
    │   ├── Update sequences with new tokens
    │   ├── Check stop conditions
    │   ├── Process logprobs (if requested)
    │   └── Mark finished sequences
    │
    # STEP 4: Return Results
    └── return request_outputs
```

**Key Variables:**
- `seq_group_metadata_list` - Metadata about sequences being executed
- `scheduler_outputs` - Blocks to swap, cache operations
- `execute_model_req` - Request passed to executor

---

## Layer 6: Scheduler

### File: `vllm/core/scheduler.py`

**Key Class: `Scheduler`**

**Method: `schedule()` returns `SchedulerOutputs`**

**What it does:**
```
Scheduler Input:
├── Pending requests queue
├── Running sequences (need more tokens)
└── KV cache status

Process:
├── Priority-based selection
├── Determine:
│   ├── Which sequences to run (fit in GPU memory)
│   ├── Which KV cache blocks to swap in/out
│   ├── Which blocks to copy (for beam search)
│   └── Prefill vs decode phase decision
└── Build seq_group_metadata_list

Output (SchedulerOutputs):
├── scheduled_seq_groups - Sequences to run
├── blocks_to_swap_in - KV cache blocks to load from CPU
├── blocks_to_swap_out - KV cache blocks to save to CPU
├── blocks_to_copy - KV cache blocks to duplicate
└── num_lookahead_slots - For speculative decoding
```

**Key Decision:** The scheduler packs as many sequences as possible into the batch while respecting:
- Maximum batch size
- GPU memory constraints
- KV cache limits
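The packing decision can be illustrated with a toy first-come-first-served scheduler. This is a sketch under simplifying assumptions (no preemption, swapping, or priorities); `toy_schedule` and its parameters are hypothetical names, not vLLM's API.

```python
from collections import deque


def toy_schedule(running, waiting, max_num_seqs, free_blocks, block_size=16):
    """Admit waiting requests while the sequence and block budgets allow.

    Running requests keep their slots. Returns (scheduled_ids, free_blocks_left).
    """
    scheduled = list(running)
    while waiting and len(scheduled) < max_num_seqs:
        req_id, prompt_len = waiting[0]
        blocks_needed = -(-prompt_len // block_size)  # ceiling division
        if blocks_needed > free_blocks:
            break  # head of the queue does not fit; stop admitting (FCFS)
        waiting.popleft()
        free_blocks -= blocks_needed
        scheduled.append(req_id)
    return scheduled, free_blocks


waiting = deque([("req-2", 40), ("req-3", 100), ("req-4", 8)])
scheduled, free_left = toy_schedule(
    ["req-1"], waiting, max_num_seqs=4, free_blocks=5
)
print(scheduled, free_left, list(waiting))
```

Here `req-2` needs 3 blocks and is admitted, while `req-3` needs 7 and blocks the queue even though `req-4` would fit; strict FCFS is one possible policy choice.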

---

## Layer 7: Model Executor

### File: `vllm/executor/executor_base.py` (Base) & `vllm/executor/uniproc_executor.py` (Single GPU)

**Key Class: `ExecutorBase`** and implementations like `UniProcExecutor`

**Method: `execute_model(execute_model_req: ExecuteModelRequest)`**

For single-GPU setup:
```python
def execute_model(execute_model_req: ExecuteModelRequest):
    ├── Call driver_worker.execute_model(execute_model_req)
    └── return driver_outputs  # List[SamplerOutput]
```

For distributed setups:
```python
def execute_model(execute_model_req):
    ├── Start worker execution loop (if not running)
    ├── Execute model in driver worker
    ├── Broadcast metadata to other workers
    ├── Wait for all workers to complete
    └── Return outputs from driver
```

**What ExecuteModelRequest contains:**
```python
ExecuteModelRequest:
├── seq_group_metadata_list - Metadata for each sequence
├── blocks_to_swap_in/out/copy - KV cache operations
├── num_lookahead_slots - For speculative decoding
└── async_callback - For async output processing
```

---

## Layer 8: Worker

### File: `vllm/worker/worker_base.py`

**Key Class: `LocalOrDistributedWorkerBase`**

**Method: `execute_model(execute_model_req: ExecuteModelRequest)`**

```python
def execute_model(execute_model_req):
    ├── If distributed: Broadcast seq_group_metadata to other workers
    ├── prepare_worker_input(execute_model_req)
    │   └── Convert to WorkerInput format
    ├── Handle KV cache operations
    │   ├── Swap in blocks from CPU
    │   ├── Swap out blocks to CPU
    │   └── Copy blocks (for beam search)
    ├── execute_worker(worker_input)
    │   └── Calls model_runner.execute_model()
    └── return outputs
```

**What happens in worker:**
1. Prepare input tensors from metadata
2. Load KV cache if needed (from CPU swap space)
3. Call the model runner's execute_model
4. Save outputs and KV cache

---

## Layer 9: Model Runner

### File: `vllm/worker/model_runner.py`

**Key Class: `ModelRunner`** (inherits from `ModelRunnerBase`)

**Method: `execute_model(model_input, kv_caches, ...)`** (Line 234+)

This is where the actual neural network forward pass happens!

```python
def execute_model(model_input, kv_caches, num_steps=1):
    ├── prepare_inputs_for_generation()
    │   ├── _prepare_model_input_tensors()
    │   │   ├── Prepare input_ids (token IDs)
    │   │   ├── Create attention masks
    │   │   ├── Position IDs
    │   │   ├── Multi-modal data (if applicable)
    │   │   └── Any LoRA configuration
    │   │
    │   └── model_input = ModelInputForGeneration()
    │
    ├── Apply guided decoding logits processor
    ├── Apply custom logits processors
    │
    ├── forward() - **NEURAL NETWORK INFERENCE**
    │   ├── model.forward(
    │   │   input_ids,
    │   │   position_ids,
    │   │   attention_mask,
    │   │   kv_caches,
    │   │   ...
    │   │ )
    │   └── output = ModelOutput (logits, cache)
    │
    ├── Sampler - Token selection
    │   ├── Apply temperature scaling
    │   ├── Apply top-k filtering
    │   ├── Apply top-p (nucleus) sampling
    │   ├── Apply logits processors
    │   ├── Sample next token
    │   └── output = SamplerOutput
    │
    └── return SamplerOutput (logits, sampled_tokens, etc.)
```

**Critical Variables:**
- `input_ids` - Token IDs for current batch
- `position_ids` - Position in sequence
- `attention_mask` - Which tokens to attend to
- `kv_caches` - Cached key-value tensors for efficiency

---

## Layer 10: LLM Model Forward Pass

### File: Depends on model architecture (e.g., `vllm/model_executor/models/llama.py`)

**This is where the actual transformer model lives!**

```python
def forward(
    self,
    input_ids: torch.Tensor,      # Shape: [num_tokens] (vLLM flattens the batch into one token dimension)
    positions: torch.Tensor,      # Position of each token within its sequence
    attn_metadata,                # Attention metadata
    past_key_values,              # KV cache (for efficiency)
    ...
) -> torch.Tensor:               # Conceptually: one row of vocabulary scores per token
    
    ├── Embed tokens
    │   └── output = embedding(input_ids)
    │
    ├── For each transformer block:
    │   ├── Self-attention with KV cache
    │   │   ├── Query = input @ W_q
    │   │   ├── Key/Value from cache or compute
    │   │   ├── Attention weights = softmax(Q @ K^T / sqrt(d_k))
    │   │   ├── Update KV cache for next iteration
    │   │   └── output = attention_weights @ V
    │   │
    │   ├── Feed-forward network
    │   │   ├── output = MLP(attention_output)
    │   │   └── return + residual connection
    │   │
    │   └── Layer normalization & residuals
    │
    ├── Final layer norm
    │
    ├── Project to vocabulary
    │   └── logits = output @ W_lm_head
    │
    └── return logits  # Scores for each token
```

**Key Insight: KV Caching**
- Instead of recomputing attention for all previous tokens, we cache their K,V
- Only compute attention for the new token against cached values
- This is the main speedup in autoregressive generation
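To make the caching idea concrete, here is a minimal pure-Python sketch of single-head attention computed with and without a KV cache. The `project` function is a toy stand-in for learned projection matrices, and all names are hypothetical; the point is only that both paths give the same outputs while the cached path computes each token's K and V once.

```python
import math


def attend(q, keys, values):
    """Scaled dot-product attention for one query vector (single head)."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    m = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [
        sum(w * v[d] for w, v in zip(weights, values)) / total
        for d in range(dim)
    ]


def project(x, scale):
    """Toy stand-in for a learned projection: scales the vector."""
    return [scale * xi for xi in x]


tokens = [[1.0, 0.0], [0.5, 0.5], [0.0, 2.0]]  # toy token embeddings


def last_output_no_cache(prefix):
    """Without a cache: recompute K and V for the whole prefix."""
    keys = [project(t, 0.5) for t in prefix]
    values = [project(t, 2.0) for t in prefix]
    return attend(project(prefix[-1], 1.5), keys, values)


uncached_outputs = [
    last_output_no_cache(tokens[: i + 1]) for i in range(len(tokens))
]

# With a cache: each step appends only the new token's K and V.
k_cache, v_cache, cached_outputs = [], [], []
for t in tokens:
    k_cache.append(project(t, 0.5))
    v_cache.append(project(t, 2.0))
    cached_outputs.append(attend(project(t, 1.5), k_cache, v_cache))

print(cached_outputs == uncached_outputs)
```

The uncached path performs 1 + 2 + 3 key/value projections for three tokens, the cached path performs 3; that gap widens quadratically as the sequence grows.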

---

## Layer 11: Sampling

### File: `vllm/model_executor/layers/sampler.py`

**After the model produces logits, we need to sample the next token**

```python
class Sampler:
    
    def forward(
        self,
        logits: torch.Tensor,          # [batch_size, vocab_size]
        sampling_params: SamplingParams,
        ...
    ) -> SamplerOutput:
        
        ├── 1. Apply Logits Processors (custom constraints)
        │   └── logits = apply_processors(logits)
        │
        ├── 2. Frequency Penalty
        │   └── Reduce scores in proportion to how often a token was generated
        │
        ├── 3. Presence Penalty
        │   └── Reduce scores for tokens that have appeared at all
        │
        ├── 4. Temperature Scaling (applied after the penalties)
        │   └── scaled_logits = logits / temperature
        │
        ├── 5. Top-K Filtering
        │   └── Keep only top-k highest probability tokens
        │
        ├── 6. Top-P (Nucleus) Sampling
        │   └── Keep tokens until cumulative probability ≥ p
        │
        ├── 7. Convert Logits to Probabilities
        │   └── probs = softmax(logits)
        │
        ├── 8. Sample Token ID
        │   ├── Greedy: argmax(probs)
        │   ├── Stochastic: multinomial(probs)
        │   └── Beam search: keep top-b tokens
        │
        ├── 9. Compute Log Probabilities (if requested)
        │   └── log_probs = log(probs)
        │
        └── return SamplerOutput
           ├── sampled_token_ids
           ├── logprobs
           └── other metadata
```

**Sampling Parameters Control:**
- `temperature` - Randomness (0=greedy, >1=more random)
- `top_k` - Keep top-k tokens
- `top_p` - Nucleus sampling parameter
- `frequency_penalty`, `presence_penalty` - Repetition control
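- `use_beam_search` - Enable beam search (recent vLLM releases configure beam search separately through `BeamSearchParams` rather than `SamplingParams`)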
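How `temperature`, `top_k`, and `top_p` interact can be sketched for a single sequence in pure Python. This is an illustrative approximation, not vLLM's batched GPU sampler; `sample_token` is a hypothetical name, and penalties are left out for brevity.

```python
import math
import random


def sample_token(logits, temperature=1.0, top_k=-1, top_p=1.0, rng=random):
    """Pick one token index from raw logits (toy single-sequence sampler)."""
    if temperature == 0:
        # Greedy decoding: take the highest-scoring token.
        return max(range(len(logits)), key=lambda i: logits[i])

    scaled = [x / temperature for x in logits]
    # Candidate token ids, best first.
    order = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)
    if top_k > 0:
        order = order[:top_k]

    # Softmax over the remaining candidates.
    m = scaled[order[0]]
    weights = [math.exp(scaled[i] - m) for i in order]
    total = sum(weights)
    probs = [w / total for w in weights]

    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for idx, p in zip(order, probs):
        kept.append((idx, p))
        cumulative += p
        if cumulative >= top_p:
            break

    ids, ps = zip(*kept)
    return rng.choices(ids, weights=ps, k=1)[0]


logits = [1.0, 3.0, 2.0]
print(sample_token(logits, temperature=0))           # greedy
print(sample_token(logits, top_k=2, rng=random.Random(0)))
```

With `top_k=2` the lowest-scoring token (index 0) can never be chosen, whatever the random seed.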

---

## Layer 12: Output Processing

### Files: `vllm/engine/llm_engine.py` (Method: `_process_model_outputs()`)

Back in the engine, after getting sampler output:

```python
def _process_model_outputs(ctx: SchedulerContext):
    
    ├── For each SamplerOutput:
    │
    ├── Decode sampled tokens to text
    │   ├── Get token IDs from sampler output
    │   ├── Convert to token strings
    │   └── output_text = detokenizer.decode(tokens)
    │
    ├── Update Sequences with new tokens
    │   ├── sequence.append_token_id(token_id)
    │   ├── sequence.append_output_token(token, logprob)
    │   └── Update sequence length
    │
    ├── Check Stop Conditions
    │   ├── Reached max_tokens?
    │   ├── Generated stop_token?
    │   ├── Stop string detected?
    │   └── If yes: mark sequence as finished
    │
    ├── Process Logprobs (if requested)
    │   ├── Return top-k token probabilities
    │   └── Track log probabilities for analysis
    │
    ├── Create RequestOutput objects
    │   ├── request_id
    │   ├── prompt (original input)
    │   ├── outputs (list of CompletionOutput)
    │   │   ├── text (generated text)
    │   │   ├── finish_reason (stop, length, etc.)
    │   │   ├── cumulative_logprob
    │   │   └── logprobs
    │   ├── prompt_token_ids
    │   ├── finished (boolean)
    │   └── metadata
    │
    └── return List[RequestOutput]
```
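The stop-condition check can be sketched as a small function that returns a finish reason. The name `check_stop` and its signature are assumptions for this example; vLLM's own checker handles more cases (for example, stop token ID lists and minimum lengths).

```python
def check_stop(output_token_ids, output_text, max_tokens, eos_token_id,
               stop_strings=()):
    """Return "stop", "length", or None if generation should continue."""
    # End-of-sequence token generated.
    if output_token_ids and output_token_ids[-1] == eos_token_id:
        return "stop"
    # A user-supplied stop string appeared in the decoded text.
    if any(s and s in output_text for s in stop_strings):
        return "stop"
    # Token budget exhausted.
    if max_tokens is not None and len(output_token_ids) >= max_tokens:
        return "length"
    return None


print(check_stop([5, 2], "hi", max_tokens=10, eos_token_id=2))
print(check_stop([5, 6, 7], "abc", max_tokens=3, eos_token_id=2))
```

The returned value maps directly onto the `finish_reason` field of the OpenAI-style response.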

---

## Layer 13: Response Formatting

### File: `vllm/entrypoints/openai/serving_chat.py` (continues after engine calls)

After outputs are received from the engine, they're formatted as OpenAI API responses:

```python
# In create_chat_completion():

async for request_output in results_generator:
    ├── Extract completion outputs
    ├── Format as OpenAI ChatCompletion format:
    │
    ├── ChatCompletionResponse:
    │   ├── id - unique response ID
    │   ├── object - "chat.completion"
    │   ├── created - timestamp
    │   ├── model - model name
    │   ├── choices - list of choices:
    │   │   ├── message - ChatMessage
    │   │   │   ├── role - "assistant"
    │   │   │   └── content - generated text
    │   │   ├── finish_reason - "stop", "length", etc.
    │   │   └── index - choice index
    │   ├── usage - UsageInfo
    │   │   ├── prompt_tokens - input token count
    │   │   ├── completion_tokens - output token count
    │   │   └── total_tokens - sum
    │   └── system_fingerprint - for reproducibility
    │
    └── For streaming: yield formatted chunks
       └── ChatCompletionStreamResponse
          └── Contains delta (incremental content)
```
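For the non-streaming case, the response assembly can be sketched as a plain dictionary builder. `build_chat_completion` is a hypothetical helper; the field names follow the OpenAI chat completion format shown above.

```python
import time
import uuid


def build_chat_completion(model, text, finish_reason,
                          prompt_tokens, completion_tokens):
    """Assemble an OpenAI-style chat completion payload (illustrative only)."""
    return {
        "id": f"chatcmpl-{uuid.uuid4().hex}",
        "object": "chat.completion",
        "created": int(time.time()),
        "model": model,
        "choices": [
            {
                "index": 0,
                "message": {"role": "assistant", "content": text},
                "finish_reason": finish_reason,
            }
        ],
        "usage": {
            "prompt_tokens": prompt_tokens,
            "completion_tokens": completion_tokens,
            "total_tokens": prompt_tokens + completion_tokens,
        },
    }


response = build_chat_completion(
    "meta-llama/Llama-2-7b",
    "Hello, I'm doing great! How are you?",
    "stop",
    prompt_tokens=6,
    completion_tokens=12,
)
print(response["usage"])
```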

---

## Complete Code Flow Diagram

```
┌─────────────────────────────────────────────────────────────────────────┐
│                    HTTP REST API Request                                 │
│                 POST /v1/chat/completions                                │
│              (JSON with messages, model, parameters)                      │
└──────────────────────────────┬──────────────────────────────────────────┘
                               │
                 ┌─────────────▼─────────────┐
                 │   FastAPI Endpoint        │
                 │  api_server.py            │
                 └─────────────┬─────────────┘
                               │
                 ┌─────────────▼──────────────────┐
                 │  OpenAIServingChat             │
                 │  serving_chat.py               │
                 │  - Validate request            │
                 │  - Preprocess messages        │
                 │  - Apply chat template        │
                 └─────────────┬──────────────────┘
                               │
                 ┌─────────────▼──────────────────┐
                 │  AsyncLLMEngine               │
                 │  async_llm_engine.py          │
                 │  - add_request()              │
                 │  - Queue for processing       │
                 └─────────────┬──────────────────┘
                               │
            ┌──────────────────┴───────────────────────┐
            │    Background Engine Loop (Continuous)   │
            │           run_engine_loop()              │
            └──────────────────┬───────────────────────┘
                               │
        ┌──────────────────────▼──────────────────────┐
        │  LLMEngine.step()                          │
        │  - Scheduler: Select sequences             │
        │  - Executor: Run model                     │
        │  - Post-process: Decode & update          │
        │  - Return outputs                          │
        └──────┬─────────────────────────┬───────────┘
               │                         │
       ┌───────▼──────┐         ┌───────▼──────────┐
       │  Scheduler   │         │  ModelExecutor   │
       │              │         │                  │
       │ - Batch seqs │         │ - Distribute work│
       │ - KV cache   │         │ - Call workers   │
       │   management │         │                  │
       └──────────────┘         └────────┬─────────┘
                                         │
                              ┌──────────▼────────────┐
                              │  Worker               │
                              │  worker_base.py      │
                              │                       │
                              │ - Prepare inputs     │
                              │ - KV cache ops       │
                              │ - Call model_runner  │
                              └──────────┬───────────┘
                                         │
                              ┌──────────▼─────────────┐
                              │  ModelRunner          │
                              │  model_runner.py      │
                              │                        │
                              │ - Build input tensors │
                              │ - Setup attention     │
                              │ - Call model.forward()│
                              │ - Sampler: pick token │
                              └──────────┬────────────┘
                                         │
                              ┌──────────▼───────────────┐
                              │  LLM Model              │
                              │  Transformer            │
                              │                          │
                              │ - Embeddings            │
                              │ - Attention layers      │
                              │ - FFN layers            │
                              │ - Output logits         │
                              └──────────┬──────────────┘
                                         │
                              ┌──────────▼──────────┐
                              │  Sampler            │
                              │                      │
                              │ - Temperature        │
                              │ - Top-K              │
                              │ - Top-P              │
                              │ - Sample token ID    │
                              └──────────┬──────────┘
                                         │
                              ┌──────────▼───────────────┐
                              │  Output Processing       │
                              │                          │
                              │ - Detokenize            │
                              │ - Check stop conditions  │
                              │ - Create RequestOutput   │
                              └──────────┬───────────────┘
                                         │
                ┌────────────────────────▼──────────────────┐
                │  Response Formatting                      │
                │  (Back in OpenAIServingChat)              │
                │                                            │
                │ - Convert to ChatCompletionResponse        │
                │ - Add usage statistics                     │
                │ - Handle streaming vs non-streaming       │
                └────────────────┬───────────────────────────┘
                                 │
                ┌────────────────▼────────────────┐
                │  HTTP Response (JSON)           │
                │  Sent back to client             │
                └─────────────────────────────────┘
```

---

## Key Data Structures

### 1. Sequence & SequenceGroup

```python
Sequence:
  ├── request_id - Links to original request
  ├── prompt_token_ids - Tokenized input
  ├── output_token_ids - Generated tokens so far
  ├── logprobs - Log probabilities per token
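  ├── status - WAITING, RUNNING, SWAPPED, or one of the FINISHED_* states
  ├── block table - Maps the sequence to its KV cache blocks (tracked by the block manager, not stored as tensors on the sequence)
  └── generation_config - Temperature, top_k, etc.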

SequenceGroup:
  ├── request_id - Unique request identifier
  ├── seqs - List[Sequence] (usually n=1, but more for beam search)
  ├── arrival_time - When request arrived
  ├── sampling_params - SamplingParams
  ├── lora_request - LoRA adapter (optional)
  └── status - WAITING, RUNNING, FINISHED
```

### 2. SamplingParams

```python
SamplingParams:
  ├── temperature - Randomness (default: 1.0)
  ├── top_p - Nucleus sampling (default: 1.0)
  ├── top_k - Top-k filtering (default: -1, disabled)
  ├── max_tokens - Maximum length (default: None)
  ├── frequency_penalty - Penalize repeated tokens
  ├── presence_penalty - Penalize seen tokens
  ├── use_beam_search - Enable beam search
  ├── best_of - Number of beams
  ├── repetition_penalty - Another repetition control
  ├── stop - Stop strings/tokens
  ├── logprobs - Number of logprobs to return
  └── prompt_logprobs - Include prompt logprobs
```

### 3. RequestOutput

```python
RequestOutput:
  ├── request_id - Matches the request
  ├── prompt - Original prompt text
  ├── prompt_token_ids - Tokenized input
  ├── prompt_logprobs - Token probabilities from prompt (optional)
  ├── outputs - List[CompletionOutput]
  │   ├── text - Generated text
  │   ├── finish_reason - "stop", "length", etc.
  │   ├── cumulative_logprob - Sum of log probabilities
  │   └── logprobs - Per-token probabilities
  ├── finished - Whether generation is complete
  └── metadata - Request timing and stats
```

---

## Important Concepts

### KV Cache (Key-Value Cache)

**Why it matters:** The main performance optimization in LLM serving

In standard transformer attention:
```
For each generation step at sequence length n (no cache):
  - Recompute K,V for ALL n tokens
  - Recompute attention for the whole sequence
  - Per-step attention cost grows as O(n²)
```

With KV caching:
```
On first pass (prefill):
  - Compute all Q,K,V
  - Store K,V for reuse
  
On subsequent passes (decode):
  - Only compute Q,K,V for the new token
  - Append the new K,V and reuse cached K,V from previous tokens
  - Per-step attention cost grows as O(n)
  
Result: per-step cost drops from O(n²) to O(n), so the saving grows with sequence length
```

**vLLM manages KV cache:**
- Allocates blocks of memory
- Tracks which blocks are in use
- Swaps blocks between GPU/CPU
- Copies blocks for beam search

### Batching Strategy

vLLM uses **continuous batching** (also called iteration-level scheduling):

```
Traditional batching:
  - Wait for full batch before running
  - All sequences must finish together
  - Wasted GPU time if some finish early

Continuous batching (vLLM):
  - Add new requests anytime
  - Remove finished sequences
  - Pack GPU with running sequences
  - Much higher utilization
```

This is done by the Scheduler in `engine.step()`.
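The difference between the two strategies can be quantified with a small simulation that counts model iterations. It is a simplified sketch: every request needs a fixed number of steps, each step advances every running request by one token, and the function names are hypothetical.

```python
def static_batching_steps(lengths, batch_size):
    """Each batch runs until its longest request finishes."""
    return sum(
        max(lengths[i:i + batch_size])
        for i in range(0, len(lengths), batch_size)
    )


def continuous_batching_steps(lengths, batch_size):
    """A finished request frees its slot for the next waiting request."""
    waiting = list(lengths)
    running = []
    steps = 0
    while waiting or running:
        # Fill any free slots before the next iteration.
        while waiting and len(running) < batch_size:
            running.append(waiting.pop(0))
        steps += 1
        # Requests with one step left finish now; the rest count down.
        running = [r - 1 for r in running if r > 1]
    return steps


lengths = [2, 8, 3, 8, 2, 2]  # steps needed by each request
print(static_batching_steps(lengths, batch_size=2))      # 18
print(continuous_batching_steps(lengths, batch_size=2))  # 13
```

With two slots, static batching spends 18 iterations because short requests wait for long ones; continuous batching needs 13, which is the minimum for 25 total request-steps.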

---

## File Structure Summary

```
vllm/
├── entrypoints/
│   ├── openai/
│   │   ├── api_server.py              [Layer 1] HTTP entry point
│   │   ├── serving_chat.py            [Layer 2] Chat request handling
│   │   ├── serving_engine.py          [Layer 3] Base serving logic
│   │   └── protocol.py                Request/response schemas
│   ├── llm.py                         High-level LLM class
│   └── utils.py                       Utility functions
│
├── engine/
│   ├── llm_engine.py                  [Layer 5] Core engine
│   ├── async_llm_engine.py            [Layer 4] Async wrapper
│   ├── protocol.py                    Engine interface
│   └── ...
│
├── core/
│   └── scheduler.py                   [Layer 6] Scheduling logic
│
├── executor/
│   ├── executor_base.py               [Layer 7] Base executor
│   ├── uniproc_executor.py            Single GPU executor
│   └── ...
│
├── worker/
│   ├── worker_base.py                 [Layer 8] Worker interface
│   ├── model_runner.py                [Layer 9] Model execution
│   └── ...
│
├── model_executor/
│   ├── models/                        [Layer 10] Model implementations
│   │   ├── llama.py
│   │   ├── mistral.py
│   │   └── ...
│   └── layers/
│       ├── sampler.py                 [Layer 11] Token sampling
│       ├── attention.py
│       └── ...
│
├── config.py                          Configuration classes
├── sampling_params.py                 Sampling parameters
├── sequence.py                        Sequence/SequenceGroup
└── outputs.py                         Output classes
```

---

## Execution Timeline for a Single Request

### Time 0: Request Arrives

```python
POST /v1/chat/completions
{
  "model": "meta-llama/Llama-2-7b",
  "messages": [{"role": "user", "content": "Hello, how are you?"}],
  "temperature": 0.7,
  "max_tokens": 100
}
```

### Time 1: Request Processing (FastAPI Handler)

- Parse JSON
- Validate request format
- Pass to OpenAIServingChat.create_chat_completion()

### Time 2: Preprocessing (OpenAIServingChat)

- Get tokenizer
- Apply chat template
- Convert to prompt format (the template's role markers are omitted here for brevity): `"Hello, how are you?"`
- Tokenize (illustrative token IDs): `[13881, 29892, 1920, 526, 366, 29973]`
- Create SamplingParams(temperature=0.7, max_tokens=100)

### Time 3: Queue Request (AsyncLLMEngine)

- Generate request_id: e.g., `"req-001"`
- Create AsyncStream for tracking output
- Call `add_request()` on background engine
- Immediately return async generator

### Time 4: Scheduler Picks Request (Engine.step())

**Iteration 1:**
- Scheduler sees new request in queue
- Checks: Can it fit on GPU?
- Allocates: KV cache blocks, attention memory
- Creates SequenceGroupMetadata
- Marks as ready to run

### Time 5: Prefill Pass (Prompt Tokens → Attention)

```python
ModelRunner.execute_model()
  ├── input_ids = [13881, 29892, 1920, 526, 366, 29973]
  ├── Model forward pass
  │   ├── Embed all input tokens
  │   ├── Self-attention over ALL input tokens
  │   ├── Store full KV cache
  │   └── Output logits for last position
  ├── Sampler picks next token (e.g., ID=29892 for ",")
  └── Append to sequence
```

**Result:** sequence_tokens = `[13881, 29892, 1920, 526, 366, 29973, 29892]`

### Time 6-N: Decode Loop (Repeated)

**For each new token (iterations 2-100):**

```python
Iteration 2:
  ├── input_ids = [29892]  ← Only new token!
  ├── Model forward pass
  │   ├── Embed new token
  │   ├── Attention: Query against CACHED K,V
  │   ├── Update KV cache with new K,V
  │   └── Output logits for this position
  ├── Sampler picks next token
  └── Append to sequence
```

**Performance:** Each decode step processes one new token and reuses the cached K,V, so it costs far less than re-running the whole sequence.

### Time N+1: Stop Condition Met

```python
- Generated "<|endoftext|>" token OR
- Reached max_tokens=100 OR
- Stop string found
└── Mark sequence as FINISHED
```

### Time N+2: Post-Processing & Response

```python
- Detokenize the generated token IDs → "Hello, I'm doing great! How are you?"
- Create RequestOutput
- Format as ChatCompletionResponse
- Return to client

Response:
{
  "id": "chatcmpl-001",
  "object": "chat.completion",
  "created": 1234567890,
  "model": "meta-llama/Llama-2-7b",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Hello, I'm doing great! How are you?"
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 6,
    "completion_tokens": 12,
    "total_tokens": 18
  }
}
```

---

## Performance Optimizations

### 1. Prefill vs Decode

| Phase | Input | Compute | KV Cache | Speed |
|-------|-------|---------|----------|-------|
| **Prefill** | Full prompt | Attention over all prompt tokens | Build full cache | Slower per step (compute-bound) |
| **Decode** | 1 token | New token attends to all cached tokens | Read cached KV, append one entry | Faster per step (memory-bandwidth-bound) |

### 2. Continuous Batching

```python
# Without continuous batching:
Batch 1: [Req1, Req2, Req3] → Wait for all to finish
Batch 2: [Req4, Req5] → Then run these

# With continuous batching (vLLM):
Iteration 1: [Req1-prefill, Req2-prefill, Req3-prefill]
Iteration 2: [Req1-decode, Req2-decode, Req3-decode, Req4-prefill]
Iteration 3: [Req2-decode, Req3-decode, Req4-decode, Req5-prefill]  # Req1 finished; Req5 takes its slot
→ The GPU stays busy; no slot waits for the slowest request
```

### 3. Block-wise KV Cache Management

```
GPU KV Cache:
┌─────────────────────────┐
│ Block 0: Req1 tokens    │
├─────────────────────────┤
│ Block 1: Req2 tokens    │
├─────────────────────────┤
│ Block 2: Req3 tokens    │
├─────────────────────────┤
│ Block 3: Req4 tokens    │
├─────────────────────────┤
│ Block 4: FREE           │
└─────────────────────────┘

- Allocate blocks on demand
- Swap to CPU when GPU full
- Copy blocks for beam search
```
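On-demand block allocation can be sketched with a free list and a per-request block table. This toy allocator leaves out swapping and copy-on-write, and `ToyBlockAllocator` is a hypothetical class, not vLLM's block manager.

```python
class ToyBlockAllocator:
    """Hands out fixed-size KV cache blocks as sequences grow (illustrative)."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free = list(range(num_blocks))  # physical block ids not in use
        self.tables = {}                     # request_id -> list of block ids
        self.lengths = {}                    # request_id -> tokens stored

    def append_tokens(self, request_id, num_tokens):
        table = self.tables.setdefault(request_id, [])
        new_len = self.lengths.get(request_id, 0) + num_tokens
        # Blocks required for the new length, minus those already held.
        needed = -(-new_len // self.block_size) - len(table)
        if needed > len(self.free):
            raise MemoryError("no free KV cache blocks")
        for _ in range(needed):
            table.append(self.free.pop(0))
        self.lengths[request_id] = new_len

    def release(self, request_id):
        """Return all blocks of a finished request to the free list."""
        self.free.extend(self.tables.pop(request_id, []))
        self.lengths.pop(request_id, None)


alloc = ToyBlockAllocator(num_blocks=5, block_size=4)
alloc.append_tokens("req-1", 6)  # prefill: 6 tokens need 2 blocks
alloc.append_tokens("req-2", 3)  # 3 tokens need 1 block
alloc.append_tokens("req-1", 2)  # 8 tokens still fit in 2 blocks
alloc.append_tokens("req-1", 1)  # 9th token needs a third block
print(alloc.tables, alloc.free)
```

Note that `req-1` ends up with non-contiguous blocks `[0, 1, 3]`; the block table is what lets attention treat them as one logical sequence.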

### 4. Flash Attention

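- **Standard:** Attention needs O(n²) memory because the full attention matrix is materialized
- **Flash Attention:** Computes attention in tiles that fit in fast on-chip memory, so the full n×n matrix is never written out and memory I/O drops
- Result: a faster, more memory-efficient attention kernel; the exact gain depends on sequence length and hardware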
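The tiling idea rests on an "online softmax": partial results are rescaled whenever a new tile raises the running maximum, so the output matches the untiled computation exactly. The pure-Python sketch below shows this for a single query vector; it illustrates the arithmetic only and makes no claim about how the GPU kernels are organized.

```python
import math


def attention_full(q, keys, values):
    """Reference: softmax over all scores at once."""
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(len(q)) for k in keys]
    m = max(scores)
    w = [math.exp(s - m) for s in scores]
    z = sum(w)
    return [
        sum(wi * v[d] for wi, v in zip(w, values)) / z
        for d in range(len(values[0]))
    ]


def attention_tiled(q, keys, values, tile=2):
    """Process keys/values tile by tile, keeping only running statistics."""
    m, z = float("-inf"), 0.0          # running max and softmax denominator
    acc = [0.0] * len(values[0])       # running weighted sum of values
    for start in range(0, len(keys), tile):
        k_tile = keys[start:start + tile]
        v_tile = values[start:start + tile]
        scores = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(len(q)) for k in k_tile
        ]
        m_new = max(m, max(scores))
        correction = math.exp(m - m_new)  # rescale earlier partial sums
        z *= correction
        acc = [a * correction for a in acc]
        for s, v in zip(scores, v_tile):
            w = math.exp(s - m_new)
            z += w
            acc = [a + w * vd for a, vd in zip(acc, v)]
        m = m_new
    return [a / z for a in acc]


q = [0.3, -1.2]
keys = [[1.0, 0.0], [0.5, 0.5], [0.0, 2.0], [-1.0, 1.0], [2.0, -0.5]]
values = [[1.0, 2.0], [0.0, 1.0], [3.0, -1.0], [0.5, 0.5], [-2.0, 4.0]]
full = attention_full(q, keys, values)
tiled = attention_tiled(q, keys, values, tile=2)
print(full, tiled)
```

Only one tile of scores exists at any time, which is what removes the need to store the full attention matrix.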

---

## Quick Reference: Finding Code

### To understand each layer:

| Layer | Key File | Key Class/Function | Line | Purpose |
|-------|----------|-------------------|------|---------|
| 1. HTTP API | `api_server.py` | FastAPI routes | - | Accept HTTP requests |
| 2. Chat Serving | `serving_chat.py` | `OpenAIServingChat.create_chat_completion()` | ~200 | Parse and preprocess chat |
| 3. Serving Logic | `serving_engine.py` | `OpenAIServing._generate_completions()` | ~150 | Interface to engine |
| 4. Async Engine | `async_llm_engine.py` | `AsyncLLMEngine.add_request()`, `.generate()` | - | Async wrapper |
| 5. Core Engine | `llm_engine.py` | `LLMEngine.step()` | 1194 | Main inference loop |
| 6. Scheduling | `scheduler.py` | `Scheduler.schedule()` | - | Batch scheduling |
| 7. Executor | `executor_base.py` | `ExecutorBase.execute_model()` | 143 | Executor abstraction |
| 8. Worker | `worker_base.py` | `LocalOrDistributedWorkerBase.execute_model()` | 385 | Actual worker |
| 9. Model Runner | `model_runner.py` | `ModelRunner.execute_model()` | 234 | Prepare and run model |
| 10. Model Forward | `llama.py` (etc) | `LlamaForCausalLM.forward()` | - | Transformer logic |
| 11. Sampling | `sampler.py` | `Sampler.forward()` | - | Token selection |
| 12. Output Processing | `llm_engine.py` | `LLMEngine._process_model_outputs()` | - | Decode & finalize |
| 13. Response | `serving_chat.py` | Format response | - | Format as JSON |

---

## Summary

### The vLLM Request Journey

1. **API Request Arrives** → FastAPI handler in `api_server.py`
2. **Request Validated & Preprocessed** → `serving_chat.py` applies chat template
3. **Queued for Processing** → `async_llm_engine.py` adds to request tracker
4. **Scheduled in Engine Loop** → `llm_engine.py.step()` runs continuously
5. **Sequences Batched** → `scheduler.py` decides which sequences to execute
6. **Model Executed** → Passed through `executor_base.py` → `worker_base.py` → `model_runner.py`
7. **Transformer Forward Pass** → Model computes logits using attention & FFN
8. **Token Sampled** → `sampler.py` picks next token with temperature/top-k
9. **Sequence Updated** → Token appended, KV cache updated
10. **Loop Repeats** → Steps 4-9 until stop condition
11. **Output Post-Processed** → Tokens detokenized to text
12. **Response Formatted** → Converted to OpenAI API format
13. **Response Sent** → HTTP response returned to client

### Key Insight

The entire system is built on:
- **Continuous batching** - Maximize GPU utilization
- **KV caching** - Avoid recomputing K,V for earlier tokens at every decode step
- **Block-wise cache management** - Enable swapping and sharing
- **Asynchronous I/O** - Handle multiple requests simultaneously
- **Modular architecture** - Support multiple hardware backends

Together, these techniques account for vLLM's throughput advantage over naive one-request-at-a-time inference; the size of the gain depends on the model, hardware, and workload.

---