# vLLM From Scratch
Building my own version of vLLM to learn inference system architecture (scaling, visualization, etc.).

## Goal
1. Loads an LLM
2. Streams generated tokens over HTTP
3. Tracks and visualizes KV cache memory usage per request

In [5]:
# %pip install torch torchvision torchaudio transformers accelerate fastapi uvicorn


In [6]:
%pip install --quiet torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cpu
%pip install --quiet transformers accelerate


In [7]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

In [8]:
# Use Apple's Metal (MPS) backend when available, otherwise fall back to CPU
device = "mps" if torch.backends.mps.is_available() else "cpu"
# device = "cpu"  # Force CPU for simplicity
print(f"Using device: {device}")

Using device: mps


In [9]:
# Load tokenizer and model
model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16 if device != "cpu" else torch.float32)

In [10]:
model = model.to(device)


In [11]:
model.eval()

LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(32000, 2048)
    (layers): ModuleList(
      (0-21): 22 x LlamaDecoderLayer(
        (self_attn): LlamaAttention(
          (q_proj): Linear(in_features=2048, out_features=2048, bias=False)
          (k_proj): Linear(in_features=2048, out_features=256, bias=False)
          (v_proj): Linear(in_features=2048, out_features=256, bias=False)
          (o_proj): Linear(in_features=2048, out_features=2048, bias=False)
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear(in_features=2048, out_features=5632, bias=False)
          (up_proj): Linear(in_features=2048, out_features=5632, bias=False)
          (down_proj): Linear(in_features=5632, out_features=2048, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): LlamaRMSNorm((2048,), eps=1e-05)
        (post_attention_layernorm): LlamaRMSNorm((2048,), eps=1e-05)
      )
    )
    (norm): LlamaRMSNorm((2048,), eps=1e-05)
    (rotary_emb): 

In [20]:
prompt = "what is the capital of Virginia? answer"
inputs = tokenizer(prompt, return_tensors="pt").to(device)
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=100)
    
response = tokenizer.decode(output[0], skip_special_tokens=True)
print(response)

what is the capital of Virginia? answer: Richmond


## KV Caching

In [25]:
# TODO: implement inference pipeline with KV cache tracking

## Batching 

In [26]:
# TODO: implement inference pipeline with batching

## Streaming

Save the server code below as `app.py`, then start it with:

```bash
uvicorn app:app --reload --port 8000
```

In [23]:

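from threading import Thread

from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from transformers import AutoTokenizer, AutoModelForCausalLM, TextIteratorStreamer
import torch

device = "mps" if torch.backends.mps.is_available() else "cpu"

model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
tokenizer = AutoTokenizer.from_pretrained(model_id)
# Half precision on MPS, full precision on CPU (same choice as the loading cell above)
dtype = torch.float16 if device != "cpu" else torch.float32
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=dtype)
model = model.to(device)
model.eval()

app = FastAPI()

@app.get("/generate")
def generate(prompt: str = "Why is the sky blue?"):
    inputs = tokenizer(prompt, return_tensors="pt").to(device)
    # The streamer yields decoded text chunks while generate() is still running
    streamer = TextIteratorStreamer(tokenizer, skip_special_tokens=True)

    def run_generation():
        with torch.no_grad():
            model.generate(**inputs, max_new_tokens=30, streamer=streamer)

    # generate() blocks, so run it in a background thread and stream from the iterator
    Thread(target=run_generation, daemon=True).start()
    return StreamingResponse(streamer, media_type="text/plain")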

## Debugging

In [13]:
if device == "mps":
    print("Allocated (MB):", torch.mps.current_allocated_memory() / 1e6)

Allocated (MB): 2200.09856
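As a sanity check, the allocation above can be compared with the weight count implied by the layer shapes in the model printout. This is a rough sketch, not an exact accounting: the printout is cut off before the output head, so the `lm_head` shape (32000 x 2048, mirroring the embedding table and not tied to it) is an assumption, and small buffers are ignored.

```python
# Shapes copied from the model printout above.
# ASSUMPTION: an untied lm_head of shape (vocab, hidden); the printout is truncated before it.
hidden, inter, vocab, layers, kv_dim = 2048, 5632, 32000, 22, 256

attn = 2 * hidden * hidden + 2 * hidden * kv_dim  # q_proj + o_proj, then k_proj + v_proj
mlp = 3 * hidden * inter                          # gate_proj, up_proj, down_proj
norms = 2 * hidden                                # two RMSNorm weight vectors per layer
per_layer = attn + mlp + norms

# embedding + decoder layers + final norm + assumed lm_head
total_params = vocab * hidden + layers * per_layer + hidden + vocab * hidden
weight_mb = total_params * 2 / 1e6                # fp16 = 2 bytes per parameter

print(f"{total_params:,} parameters -> {weight_mb:.2f} MB in fp16")
```

Under these assumptions the estimate is about 1.1 billion parameters and roughly 2200.10 MB, very close to the measured value, which suggests nearly all of the allocated memory at this point is model weights.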


In [14]:
# Print shape of key/value projection weights
count = 0
for name, param in model.named_parameters():
    if "k_proj" in name or "v_proj" in name:
        count += 1
        print(f"{name}: {param.shape}")
print(f"Total key/value projection weights: {count}")

# K/V projection shapes, per layer (every layer has its own k_proj and v_proj):
# 2048 - hidden size, the input to the attention layer. Each token, e.g. [The], [sky], [is], [blue], is a 2048-D vector coming from the previous layer (or from the embedding table for layer 0).
# 256 - size of the key and of the value for one token. Only K and V are 256-D; the query stays 2048-D (see q_proj in the model printout), because TinyLlama uses grouped-query attention: 32 query heads share 4 key/value heads of 64 dims each.
# Q and K decide how much attention each token pays to the earlier tokens in the sequence; V carries the content that gets mixed.
# 2048 - size of the attention output after o_proj, which is the same as the input size of the next layer.
# The KV cache stores the 256-D key and the 256-D value of every past token, in every layer.

model.layers.0.self_attn.k_proj.weight: torch.Size([256, 2048])
model.layers.0.self_attn.v_proj.weight: torch.Size([256, 2048])
model.layers.1.self_attn.k_proj.weight: torch.Size([256, 2048])
model.layers.1.self_attn.v_proj.weight: torch.Size([256, 2048])
model.layers.2.self_attn.k_proj.weight: torch.Size([256, 2048])
model.layers.2.self_attn.v_proj.weight: torch.Size([256, 2048])
model.layers.3.self_attn.k_proj.weight: torch.Size([256, 2048])
model.layers.3.self_attn.v_proj.weight: torch.Size([256, 2048])
model.layers.4.self_attn.k_proj.weight: torch.Size([256, 2048])
model.layers.4.self_attn.v_proj.weight: torch.Size([256, 2048])
model.layers.5.self_attn.k_proj.weight: torch.Size([256, 2048])
model.layers.5.self_attn.v_proj.weight: torch.Size([256, 2048])
model.layers.6.self_attn.k_proj.weight: torch.Size([256, 2048])
model.layers.6.self_attn.v_proj.weight: torch.Size([256, 2048])
model.layers.7.self_attn.k_proj.weight: torch.Size([256, 2048])
model.layers.7.self_attn.v_proj.weight: 
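
The shapes printed above are enough to estimate the KV cache footprint by hand: one 256-D key and one 256-D value per token, per layer. A minimal sketch, assuming fp16 storage (2 bytes per element) and the 22 layers shown in the model printout; `kv_cache_bytes` is a hypothetical helper name, not part of any library.

```python
def kv_cache_bytes(num_tokens, num_layers=22, kv_dim=256, bytes_per_elem=2, batch_size=1):
    """Estimate KV cache size: one key and one value vector per token per layer."""
    per_token = 2 * num_layers * kv_dim * bytes_per_elem  # 2 = key + value
    return batch_size * num_tokens * per_token

per_token = kv_cache_bytes(1)
long_sequence = kv_cache_bytes(2048)

print(f"Per token:   {per_token} bytes ({per_token / 1024:.0f} KiB)")
print(f"2048 tokens: {long_sequence / 1e6:.1f} MB")
```

With these defaults each token costs 22,528 bytes (22 KiB), so a single 2048-token request holds about 46.1 MB of cache. The number grows linearly with both sequence length and the number of concurrent requests, which is why per-request tracking is worth having.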

# Inference Pipelines and Scheduling

## Continuous Batching
Continuous batching, widely used in modern inference systems, raises throughput by prioritizing prefill and batching the decode stages of concurrent requests [18, 19]. In Fig. 2(b) of the source paper, Requests 2 and 3 preempt Request 1's decode; once their prefill finishes, all three decode together, improving throughput over static batching (Fig. 2(a)).
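
The scheduling behavior described above can be sketched as a toy simulation. This is an illustration only, not how vLLM or the paper implements it: it assumes every prefill takes exactly one step regardless of prompt length, and `Request`, `continuous_batching`, and `max_batch` are hypothetical names.

```python
from collections import deque
from dataclasses import dataclass

@dataclass
class Request:
    rid: int
    arrival: int       # step at which the request arrives
    decode_len: int    # tokens to generate after prefill
    generated: int = 0

def continuous_batching(requests, max_batch=4):
    """Toy iteration-level scheduler: prefill new arrivals first, else decode all running."""
    pending = deque(sorted(requests, key=lambda r: r.arrival))
    running, trace, step = [], [], 0
    while pending or running:
        # Admit requests that have arrived, as long as the batch has room
        admitted = []
        while pending and pending[0].arrival <= step and len(running) + len(admitted) < max_batch:
            admitted.append(pending.popleft())
        if admitted:
            # Prefill has priority: running requests skip their decode this step
            trace.append(("prefill", [r.rid for r in admitted]))
            running.extend(admitted)
        elif running:
            # One decode step for every running request, batched together
            for r in running:
                r.generated += 1
            trace.append(("decode", [r.rid for r in running]))
            # Finished requests leave right away, freeing their slot
            running = [r for r in running if r.generated < r.decode_len]
        else:
            trace.append(("idle", []))
        step += 1
    return trace

requests = [
    Request(1, arrival=0, decode_len=3),
    Request(2, arrival=2, decode_len=2),
    Request(3, arrival=2, decode_len=2),
]
for kind, rids in continuous_batching(requests):
    print(kind, rids)
```

With these inputs the trace is prefill [1], decode [1], prefill [2, 3], decode [1, 2, 3], decode [1, 2, 3]: Request 1's decode is paused for one step while Requests 2 and 3 prefill, then all three decode as one batch.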

## Chunked Batching
Chunked batching improves the latency-throughput balance by splitting long sequences into smaller, fixed-size chunks. In Fig. 2(c) of the source paper, this lets prefill (e.g., Request 2) run alongside decode (e.g., Request 1), avoiding the decode stalls seen in prefill-prioritized strategies such as continuous batching.
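
A minimal sketch of the chunking idea, under simplifying assumptions: one request is being prefilled, the decoding requests do not finish during the prefill, and each step carries one prefill chunk plus one decode token per decoding request. `chunked_schedule` and its parameters are hypothetical names chosen for illustration.

```python
def chunked_schedule(prefill_tokens, decoding_ids, chunk_size=4):
    """Split one long prefill into fixed-size chunks and piggyback decodes on each step.

    prefill_tokens: prompt length of the request being prefilled
    decoding_ids:   ids of requests already in the decode stage
    """
    steps = []
    for start in range(0, prefill_tokens, chunk_size):
        end = min(start + chunk_size, prefill_tokens)
        steps.append({
            "prefill_range": (start, end),  # prompt tokens processed in this step
            "decode": list(decoding_ids),   # each decoding request emits one token
        })
    return steps

# A 10-token prompt prefilled in chunks of 4 while request 1 keeps decoding
for step in chunked_schedule(10, decoding_ids=[1], chunk_size=4):
    print(step)
```

Here the 10-token prompt is processed as the ranges (0, 4), (4, 8), and (8, 10), and Request 1 produces a token in every one of those steps instead of waiting for the whole prefill to finish.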

## Disaggregated Batching 
Disaggregated serving decouples the prefill and decode stages by assigning them to independently scaled hardware instances, enabling flexible resource allocation for heterogeneous workloads. For example, in Fig. 2(d) of the source paper, Request 1 begins decoding while Requests 2 and 3 are still in prefill, because the two stages run on separate hardware. Decode stages are then batched as resources become available, improving throughput. The cost is that the KV cache must be transferred from the prefill instance to the decode instance.

The paper defines two disaggregation types: global, as in Splitwise [17], which uses a shared GPU pool without locality constraints and offers better load balancing; and local, which restricts requests to fixed, physically co-located GPU groups and so reduces KV cache transfer overhead. The paper uses global disaggregated batching by default unless otherwise noted.

Source: https://arxiv.org/html/2504.09775v1
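
The KV cache transfer cost mentioned for disaggregated batching can be estimated with back-of-the-envelope arithmetic. The sketch below uses TinyLlama's shapes from this notebook (22 layers, 256-D keys and values, fp16); the two link bandwidths are made-up placeholder values to contrast a co-located link with a slower cross-node link, not measurements from the paper, and `kv_transfer_seconds` is a hypothetical helper.

```python
def kv_transfer_seconds(prompt_tokens, bytes_per_token, link_gbytes_per_s):
    """Time to ship one request's KV cache from a prefill instance to a decode instance."""
    return prompt_tokens * bytes_per_token / (link_gbytes_per_s * 1e9)

# TinyLlama in fp16: 2 (key and value) * 22 layers * 256 dims * 2 bytes per element
BYTES_PER_TOKEN = 2 * 22 * 256 * 2

# ASSUMPTION: placeholder bandwidths in GB/s, chosen only for illustration
links = [("co-located link (hypothetical)", 100.0), ("cross-node link (hypothetical)", 10.0)]

for name, bandwidth in links:
    ms = kv_transfer_seconds(2048, BYTES_PER_TOKEN, bandwidth) * 1e3
    print(f"{name}: {ms:.2f} ms for a 2048-token prompt")
```

For a model this small the transfer is cheap either way; the cost scales with layer count, K/V width, and prompt length, so it matters far more for large models and long prompts.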


In [None]:
# TODO: implement inference pipeline