# LMCache + vLLM Testing: Gemma 3 270M

**Model**: `google/gemma-3-270m-it` (270M parameters)

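**Quantization**: none in the run recorded here. The notebook was written with 4-bit weights (W4A16) in mind, but the `LLM(...)` call in Cell 5 passes no `quantization` argument, and the load log shows the weights upcast from bfloat16 to float32.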

**Hardware**: T4 GPU (16 GB VRAM nominal; PyTorch reports 14.7 GB)

## Cell 1: Install Packages

In [1]:
%%time
!pip install -q --upgrade pip jedi vllm lmcache
!pip show transformers
!pip show torch
print("✓ Packages installed")


## Cell 2: Verify GPU

In [2]:
import torch

if torch.cuda.is_available():
    device_name = torch.cuda.get_device_name(0)
    cc = torch.cuda.get_device_capability(0)
    mem_gb = torch.cuda.get_device_properties(0).total_memory / (1024**3)

    print(f"GPU: {device_name}")
    print(f"Compute Capability: {cc[0]}.{cc[1]}")
    print(f"Memory: {mem_gb:.1f} GB")

    if cc >= (7, 5):  # tuple comparison, so 8.0 and newer devices also pass
        print("✓ T4 or better - W4A16 quantization supported")
    else:
        print("⚠️  GPU may not support quantization kernels")
else:
    print("✗ No CUDA GPU detected!")
    raise RuntimeError("GPU required")

GPU: Tesla T4
Compute Capability: 7.5
Memory: 14.7 GB
✓ T4 or better - W4A16 quantization supported
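
Compute capability is a `(major, minor)` pair, so a minimum-version check has to compare the pair as a whole. Testing the two fields independently (`major >= 7 and minor >= 5`) would reject an 8.0 device because its minor version is 0. A minimal sketch, where `meets_min_capability` is a hypothetical helper name and not part of PyTorch:

```python
def meets_min_capability(cc, minimum=(7, 5)):
    """Return True if compute capability `cc` is at least `minimum`.

    Python compares tuples lexicographically, which matches how
    (major, minor) versions are ordered.
    """
    return tuple(cc) >= tuple(minimum)


# A few example capabilities: 7.0 (V100), 7.5 (T4), 8.0 (A100), 8.6
for cc in [(7, 0), (7, 5), (8, 0), (8, 6)]:
    print(cc, meets_min_capability(cc))
```

In the notebook, the value returned by `torch.cuda.get_device_capability(0)` could be passed in directly.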


## Cell 3: Check HuggingFace Access

Gemma models require accepting Google's license on HuggingFace.

In [3]:
from huggingface_hub import login
import os

# Option 1: Use Colab secrets (recommended)
try:
    from google.colab import userdata
    hf_token = userdata.get('HF_TOKEN')
    login(token=hf_token)
    print("✓ Logged in via Colab secrets")
except Exception:
    # Option 2: Manual login
    print("Colab secrets not found. Logging in manually...")
    login()

# Verify model access
model_id = "google/gemma-3-270m-it"
print(f"\nModel: {model_id}")
print("Ensure you've accepted the license at:")
print(f"https://huggingface.co/{model_id}")

✓ Logged in via Colab secrets

Model: google/gemma-3-270m-it
Ensure you've accepted the license at:
https://huggingface.co/google/gemma-3-270m-it


## Cell 4: Download Model (One-Time)

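Downloads the model to `/content/models/` so later cells in the same runtime can load it from disk. Colab's `/content` directory is not kept once the runtime is recycled, so a new session downloads the model again.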

In [None]:
%%time
from huggingface_hub import snapshot_download
import os

model_id = "google/gemma-3-270m-it"
local_dir = "/content/models/gemma-3-270m-it"

if os.path.exists(local_dir):
    print(f"✓ Model already cached at {local_dir}")
else:
    print(f"Downloading {model_id}...")
    snapshot_download(
        repo_id=model_id,
        local_dir=local_dir,  # files are written directly here; no symlink flag needed
    )
    print(f"✓ Downloaded to {local_dir}")

# Verify download
!ls -lh {local_dir}

Downloading google/gemma-3-270m-it...


✓ Downloaded to /content/models/gemma-3-270m-it
total 549M
-rw-r--r-- 1 root root   35 Nov  9 22:41 added_tokens.json
-rw-r--r-- 1 root root 1.5K Nov  9 22:41 chat_template.jinja
-rw-r--r-- 1 root root 1.4K Nov  9 22:41 config.json
-rw-r--r-- 1 root root  173 Nov  9 22:41 generation_config.json
-rw-r--r-- 1 root root 512M Nov  9 22:41 model.safetensors
-rw-r--r-- 1 root root  28K Nov  9 22:41 README.md
-rw-r--r-- 1 root root  662 Nov  9 22:41 special_tokens_map.json
-rw-r--r-- 1 root root 1.2M Nov  9 22:41 tokenizer_config.json
-rw-r--r-- 1 root root  32M Nov  9 22:41 tokenizer.json
-rw-r--r-- 1 root root 4.5M Nov  9 22:41 tokenizer.model
CPU times: user 2.06 s, sys: 1.77 s, total: 3.83 s
Wall time: 4.58 s
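
The size of `model.safetensors` is a quick sanity check on what the checkpoint stores. Assuming the file is almost entirely bfloat16 weights at 2 bytes per parameter (the load log in Cell 5 reports `torch.bfloat16`), the 512M reported by `ls -lh` corresponds to roughly the model's nominal 270M parameters, which suggests the checkpoint on disk is unquantized. A sketch of that arithmetic:

```python
MIB = 1024 ** 2

# `ls -lh` reports 512M for model.safetensors; the figure is rounded,
# so treat everything derived from it as approximate.
file_size_bytes = 512 * MIB
bytes_per_param_bf16 = 2  # assumption: weights stored as bfloat16

approx_params = file_size_bytes / bytes_per_param_bf16
print(f"~{approx_params / 1e6:.0f}M parameters")
```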


## Cell 5: Load Model with vLLM + LMCache

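**Loading took 1 min 46 s on the T4 in the run below**, longer than the 30-60 s estimate printed by the cell. The call passes no `quantization` argument, so vLLM does not quantize the weights here; the log shows them upcast from bfloat16 to float32.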

In [None]:
%%time
from vllm import LLM, SamplingParams

model_path = "/content/models/gemma-3-270m-it"

print("Loading model with vLLM + LMCache...")
print("(This takes 30-60s on T4)\n")

# LMCache configuration
kv_cache_config = {
    "kv_connector": "LMCacheConnectorV1",
    "kv_role": "kv_both"
}

try:
    llm = LLM(
        model=model_path,
        dtype="auto",
        gpu_memory_utilization=0.8,
        max_model_len=2048,
        kv_transfer_config=kv_cache_config,
        enforce_eager=True  # Disable CUDA graphs for compatibility
    )
    print("\n✓ Model loaded successfully")
except Exception as e:
    print(f"\n✗ Model loading failed: {e}")
    raise

INFO 11-09 22:41:23 [__init__.py:216] Automatically detected platform cuda.
Loading model with vLLM + LMCache...
(This takes 30-60s on T4)

INFO 11-09 22:41:37 [utils.py:233] non-default args: {'max_model_len': 2048, 'gpu_memory_utilization': 0.8, 'disable_log_stats': True, 'enforce_eager': True, 'kv_transfer_config': KVTransferConfig(kv_connector='LMCacheConnectorV1', engine_id='c18d142d-831d-4c80-a1ce-1f649ec0d2e6', kv_buffer_device='cuda', kv_buffer_size=1000000000.0, kv_role='kv_both', kv_rank=None, kv_parallel_size=1, kv_ip='127.0.0.1', kv_port=14579, kv_connector_extra_config={}, kv_connector_module_path=None), 'model': '/content/models/gemma-3-270m-it'}
INFO 11-09 22:42:03 [model.py:547] Resolved architecture: Gemma3ForCausalLM


`torch_dtype` is deprecated! Use `dtype` instead!


INFO 11-09 22:42:03 [model.py:1727] Upcasting torch.bfloat16 to torch.float32.
INFO 11-09 22:42:03 [model.py:1510] Using max model len 2048
INFO 11-09 22:42:05 [scheduler.py:205] Chunked prefill is enabled with max_num_batched_tokens=8192.
INFO 11-09 22:42:05 [__init__.py:381] Cudagraph is disabled under eager mode
INFO 11-09 22:43:04 [llm.py:306] Supported_tasks: ['generate']

✓ Model loaded successfully
CPU times: user 19.9 s, sys: 2.04 s, total: 21.9 s
Wall time: 1min 46s
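
The call above passes only the connector name and role, and the log shows `kv_connector_extra_config={}`, so LMCache runs with its defaults. LMCache can also be configured through environment variables, which would need to be set before `LLM(...)` is constructed. The names below follow LMCache's published CPU-offloading example, but treat both the names and the values as assumptions and check them against the installed version:

```python
import os

# Assumed LMCache settings, read from the environment when the connector starts.
# Verify the variable names against the LMCache version you installed.
lmcache_env = {
    "LMCACHE_CHUNK_SIZE": "256",          # tokens per cached KV chunk
    "LMCACHE_LOCAL_CPU": "True",          # keep KV chunks in CPU RAM
    "LMCACHE_MAX_LOCAL_CPU_SIZE": "5.0",  # CPU cache budget in GB
}
os.environ.update(lmcache_env)

for name, value in lmcache_env.items():
    print(f"{name}={value}")
```

If the chunk size really is a few hundred tokens, prompts much shorter than one chunk may never produce a reusable entry, which matters when reading the timing tests in the later cells.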


## Cell 6: Test Basic Generation

In [6]:
from vllm import SamplingParams

sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.9,
    max_tokens=50
)

prompt = "Explain recursion in one sentence."
outputs = llm.generate([prompt], sampling_params)

print(f"Prompt: {prompt}")
print(f"Output: {outputs[0].outputs[0].text}")


Prompt: Explain recursion in one sentence.
Output: 

The concept of recursion is a fundamental principle in programming that allows a function to call itself multiple times within a single function, without needing to explicitly write a function to call each function. This is particularly useful for problems that involve a large number of inputs


## Cell 7: Test Cache Performance

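Run the identical prompt twice and compare wall-clock times. Two caveats apply when reading the result. The prompt is only a handful of tokens, so prefill is a small part of each call; a gap as large as the one recorded below most likely includes one-time overhead in the first call rather than KV reuse alone. And because `sampling_params` uses `temperature=0.7`, the two completions are sampled independently and will differ.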

In [7]:
import time

prompt = "What is the capital of France?"

# First run (populate cache)
print("=" * 70)
print("RUN 1: Populating cache")
print("=" * 70)
start = time.perf_counter()  # monotonic, high-resolution timer
outputs = llm.generate([prompt], sampling_params)
time1 = time.perf_counter() - start
print(f"Output: {outputs[0].outputs[0].text}")
print(f"Time: {time1:.3f}s\n")

# Second run (use cache)
print("=" * 70)
print("RUN 2: Using cached KV")
print("=" * 70)
start = time.perf_counter()  # monotonic, high-resolution timer
outputs = llm.generate([prompt], sampling_params)
time2 = time.perf_counter() - start
print(f"Output: {outputs[0].outputs[0].text}")
print(f"Time: {time2:.3f}s\n")

# Results
speedup = time1 / time2 if time2 > 0 else 0
print("=" * 70)
print(f"First run:  {time1:.3f}s")
print(f"Second run: {time2:.3f}s")
print(f"Speedup:    {speedup:.2f}x")
print("=" * 70)

if speedup > 1.5:
    print("\n✓ Cache working effectively!")
elif speedup > 1.1:
    print("\n~ Cache shows some improvement")
else:
    print("\n⚠️  Cache may not be working (speedup should be >1.5x)")

RUN 1: Populating cache



Output: 

A) Paris
B) Lyon
C) Marseille
D) Rome

**Answer: B) Lyon**

Time: 17.151s

RUN 2: Using cached KV



Output: 

A) Paris
B) Rome
C) Berlin
D) Lyon

**Answer:** B) Rome

Time: 1.162s

First run:  17.151s
Second run: 1.162s
Speedup:    14.76x

✓ Cache working effectively!
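
A single pair of timings is noisy and can fold one-time start-up cost into the first measurement. One way to reduce both effects is to run untimed warm-up calls and take the median of several timed runs. The helper below is a hypothetical sketch (`time_call` is not part of vLLM or LMCache) and uses a stand-in workload so it runs anywhere:

```python
import statistics
import time


def time_call(fn, warmup=1, repeats=3):
    """Return the median wall-clock time of `fn()` over `repeats` runs.

    `warmup` untimed calls run first so that one-time costs do not
    land in the measurement.
    """
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


# Stand-in workload; in the notebook this would be something like
# lambda: llm.generate([prompt], sampling_params)
elapsed = time_call(lambda: sum(range(100_000)))
print(f"median: {elapsed:.6f}s")
```

To isolate the cache effect with a helper like this, one option is to warm up on an unrelated prompt, then time a fresh prompt once (cold) and again (warm), so that both measurements are taken after start-up costs have been paid.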


## Cell 8: Test with Shared Prefix

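A more realistic test: several prompts that share a common prefix. The system prompt used here is short (a few dozen tokens), so the reusable prefix is small; a longer shared prefix would make any prefill savings easier to see.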

In [8]:
import time

# Long shared system prompt
system_prompt = """You are a helpful coding assistant. Guidelines:
1. Write clean, readable code
2. Add explanatory comments
3. Use descriptive variable names
4. Handle edge cases

Task: """

tasks = [
    "Sort a list of integers",
    "Reverse a string",
    "Find prime numbers up to N"
]

print("Testing cache with shared prefix...\n")
times = []

for i, task in enumerate(tasks, 1):
    prompt = system_prompt + task

    start = time.perf_counter()  # monotonic, high-resolution timer
    outputs = llm.generate([prompt], sampling_params)
    elapsed = time.perf_counter() - start
    times.append(elapsed)

    marker = "(populating)" if i == 1 else "(cached)"
    print(f"Run {i} {marker}: {elapsed:.3f}s")
    print(f"Task: {task}")
    print(f"Output: {outputs[0].outputs[0].text[:80]}...\n")

# Analysis
print("=" * 70)
if len(times) > 1:
    avg_cached = sum(times[1:]) / len(times[1:])
    improvement = times[0] / avg_cached
    print(f"First run:         {times[0]:.3f}s")
    print(f"Avg cached runs:   {avg_cached:.3f}s")
    print(f"Cache speedup:     {improvement:.2f}x")
print("=" * 70)

Testing cache with shared prefix...




Run 1 (populating): 7.773s
Task: Sort a list of integers
Output:  from 1 to 100.

```python
def sort_numbers(numbers):
    """
    Sorts a list o...




Run 2 (cached): 2.984s
Task: Reverse a string
Output: .

Given a string "hello world", reverse the string.

Input: hello world
Output:...




Run 3 (cached): 2.322s
Task: Find prime numbers up to N
Output: .

Given a number `n` and an integer `k`, find all prime numbers less than or eq...

First run:         7.773s
Avg cached runs:   2.653s
Cache speedup:     2.93x
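
Prefix reuse typically happens at the granularity of fixed-size token blocks: only whole leading blocks that match between two prompts can be reused, so a shared prefix shorter than one block yields no reuse at all. The sketch below illustrates that counting. `shared_full_blocks` is a hypothetical helper, the whitespace split stands in for the real tokenizer, and the block sizes are illustrative rather than the defaults of vLLM or LMCache:

```python
def shared_full_blocks(tokens_a, tokens_b, block_size):
    """Count leading blocks of `block_size` tokens identical in both sequences."""
    common = 0
    for a, b in zip(tokens_a, tokens_b):
        if a != b:
            break
        common += 1
    # Only complete blocks count; a partial trailing block is not reusable.
    return common // block_size


# Whitespace split stands in for the real tokenizer (an assumption for illustration).
prefix = "You are a helpful coding assistant . Task :".split()
prompt_a = prefix + "Sort a list of integers".split()
prompt_b = prefix + "Reverse a string".split()

print(shared_full_blocks(prompt_a, prompt_b, block_size=4))
print(shared_full_blocks(prompt_a, prompt_b, block_size=16))
```

Running the same count on real token IDs, with the block or chunk size of the caches actually in use, would show how much of the system prompt is eligible for reuse in this test.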


## Cell 9: Cleanup (Optional)

In [9]:
# Free GPU memory
import gc
import torch

del llm
gc.collect()
torch.cuda.empty_cache()

print("✓ GPU memory released")

✓ GPU memory released
