llmlab

A hobbyist's notebook for running local LLMs on consumer GPUs — focused on what actually works, not what benchmarks promise.

What this is

We run agentic LLM workloads on 3× RTX 3060 12 GB (36 GB VRAM) and document everything: configs that work, models that don't, performance numbers from real serving (not just llama-bench), and the weird edge cases you only find by actually using these things.

We started with a dual 3060 setup (24 GB) and upgraded to triple in early March 2026 — unlocking dense 27B models and 3-way parallel serving that wasn't possible before.

Focus areas:

  • MoE and hybrid models — the sweet spot for interactive use on limited VRAM. Small active parameters = fast generation, large total parameters = good quality. Dense models are on the table too now that we have 36 GB to play with.
  • Agentic tool-calling — not just chat, but models driving multi-step tool chains (web search → fetch → analyze → file ops). We test with OpenClaw, which is demanding enough that if a model works here, it'll work in any general-purpose agentic setup.
  • Real serving metrics — llama-bench numbers are a starting point. Real-world serving with prompt caches, thinking tokens, and growing context tells a different story. We measure both.
  • Multi-GPU optimization — tensor-split tuning, compute buffer analysis, and practical guides for squeezing maximum context and parallelism out of consumer GPUs without NVLink.

Hardware and serving setup

Hardware

  • GPU server: 3× RTX 3060 12 GB (36 GB total), Intel i5-7400, PCIe x16 + x4 + x4 — detailed hardware profile
  • CPU fallback: Intel i5-8400T, 64 GB DDR4-2667, llama.cpp

Runtime profiles

  • llama.cpp remains the reference path for GGUF-based serving, tensor-split tuning, and long-context fitment work.
  • vLLM is now part of the documented production story as well, especially for GPTQ-based serving and higher aggregate concurrency with native prefix caching.

Validated highlights

Key findings

-sm layer is everything. On PCIe multi-GPU without NVLink, split-mode matters more than the model itself. -sm layer gives 2.5× the throughput of -sm row. If you have multiple consumer GPUs, this is the single most important flag.

output.weight lands on the last GPU. In llama.cpp's split-mode layer, the output projection (~1+ GB) is hardcoded to the last GPU. This creates asymmetric VRAM pressure that must be compensated with tensor-split ratios. See our multi-GPU tensor-split guide.
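The compensation is simple arithmetic: weight each GPU by its usable VRAM after fixed costs, with the last GPU additionally paying for output.weight. A minimal sketch — the reserve and output.weight sizes here are illustrative assumptions, not measured values:

```python
# Sketch: derive --tensor-split ratios that compensate for llama.cpp
# pinning output.weight to the last GPU in -sm layer mode.
# reserve_gib and output_weight_gib are illustrative assumptions.

def tensor_split(vram_per_gpu_gib, output_weight_gib=1.0, reserve_gib=1.5):
    """Weight each GPU by the VRAM left after fixed costs;
    the last GPU also pays for output.weight."""
    budgets = []
    for i, vram in enumerate(vram_per_gpu_gib):
        usable = vram - reserve_gib
        if i == len(vram_per_gpu_gib) - 1:  # output.weight lands here
            usable -= output_weight_gib
        budgets.append(usable)
    total = sum(budgets)
    return [round(b / total, 3) for b in budgets]

print(tensor_split([12, 12, 12]))  # → [0.344, 0.344, 0.311]
```

With three identical 12 GB cards, the last GPU ends up with a slightly smaller share to absorb the extra tensor.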

--parallel N shrinks compute buffers. More slots = smaller per-slot compute buffers, which frees VRAM for KV cache. On our 3×3060 setup, going from parallel 1→3 freed enough headroom for 3× the concurrent sessions at 131K each. The tradeoff: per-slot context shrinks proportionally.
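The tradeoff is plain arithmetic: llama.cpp's server carves the total context (-c) evenly into slots, so per-slot context is total context divided by --parallel. A minimal sketch of the budget:

```python
# Sketch: llama.cpp's server splits the total context (-c) evenly
# across slots (--parallel); each slot gets n_ctx // n_parallel tokens.

def per_slot_ctx(n_ctx, n_parallel):
    return n_ctx // n_parallel

# At fixed -c, raising --parallel shrinks each slot proportionally;
# to keep three full 131K slots, the total context must triple.
assert per_slot_ctx(131072, 1) == 131072
assert per_slot_ctx(131072, 3) == 43690
assert per_slot_ctx(393216, 3) == 131072
```

The VRAM win comes from the compute buffers scaling with per-slot context rather than total context, which is what frees headroom for the extra KV cache.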

What we've learned

Models tested

| Model | Main validated serving path | Arch | Verdict | Notes |
|---|---|---|---|---|
| Qwen3.5-27B | llama.cpp | Hybrid (DeltaNet+Attn) | ✅ Production | 3×131K parallel, strong fit for the 3×3060 GGUF path |
| Qwen3.5-35B-A3B | vLLM + llama.cpp | MoE | ✅/🟡 Mixed by backend | Important current model: validated on vLLM PP=3; historically tighter and riskier on 24 GB llama.cpp configs |
| GLM-4.7-Flash | llama.cpp | MoE ~4B | ✅ Production | Best tool-calling quality in earlier 24 GB-era work |
| Nemotron-3-Nano-30B | llama.cpp / CPU fallback | MoE (Mamba-2) | ✅ Production | Excellent speed retention at depth; useful fallback profile |
| LFM2-24B-A2B | llama.cpp | MoE | ❌ Failed | Extremely fast but unreliable for agentic work |
| Nanbeige4.1-3B | llama.cpp | Dense | ❌ Failed | Leaks `<think>` blocks, can't disable reasoning |
| ZwZ-4B | llama.cpp | Dense | 🟡 Parked | Multimodal arch, untested for agentic |
| Qwen3-Coder-REAP | llama.cpp | MoE | 🟡 Mixed | Good code, context degradation issues |

Context degradation (the number that actually matters)

llama-bench at empty context is marketing. Here's what happens as context fills:

| Model | tok/s @0 | @16K | @32K | @64K | Degradation pattern |
|---|---|---|---|---|---|
| Nemotron (Mamba-2) | 96 | 85 | 72 | 55 | -42% @64K — best |
| GLM (MLA) | 71 | 54 | 45 | 33 | -53% @64K |
| Qwen3 (GQA) | 99 | 39 | 24 | 13 | -87% @64K — worst |

Nemotron's Mamba-2 architecture genuinely delivers on the "constant-time attention" promise. Qwen3's traditional GQA falls off a cliff. Qwen3.5-27B's hybrid approach (mostly recurrent) should sit closer to Nemotron — benchmarks pending.
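As a sanity check, the degradation column can be recomputed from the raw tok/s figures (it matches the table within rounding):

```python
# Recompute the degradation column from the table above:
# relative drop in tok/s between empty context and 64K.
speeds = {
    "Nemotron (Mamba-2)": (96, 55),
    "GLM (MLA)":          (71, 33),
    "Qwen3 (GQA)":        (99, 13),
}
for model, (t0, t64k) in speeds.items():
    drop = (t64k - t0) / t0 * 100
    print(f"{model}: {drop:.1f}% @64K")
```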

Real serving vs benchmarks

From a 79-request GLM production session:

  • Benchmark says: 71 tok/s at empty context
  • Real serving: 30 tok/s at 37K context, 14.4 tok/s at 113K
  • Gap: 28-36% slower than llama-bench (server overhead, prompt cache, thinking tokens, KV pressure)
  • Compaction helps: speed recovered 14.4 → 30.0 tok/s (2.1×) after context compaction
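The recovery factor is easy to verify from the numbers above:

```python
# Compaction recovery from the GLM session: tok/s before vs after
# context compaction.
before, after = 14.4, 30.0
print(f"{after / before:.1f}x")  # prints 2.1x
```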

Docs & guides

| Guide | Description |
|---|---|
| Multi-GPU Tensor-Split | How to optimize layer distribution across GPUs for llama.cpp — ceiling testing, output.weight gotcha, --parallel effects |
| Hardware: Triple 3060 | Our specific 3×3060 setup — validated llama.cpp configs, vLLM PP=3 note, VRAM budgets, capacity planning |
| Architecture | System architecture overview |
| Runbook | Start/stop servers, common operations |
| Troubleshooting | Common issues and fixes |
| Qwen3.5-35B-A3B vLLM PP=3 experiment | Dedicated write-up for the validated vLLM deployment profile and concurrency behavior |

Repo structure

models/          Model profiles (GLM, Nemotron, Qwen3.5-27B, Qwen3.5-35B, LFM2, ...)
experiments/     Experiment logs (context sweeps, quant comparisons, speed tests)
benchmarks/      Agentic benchmark suite (L0-L4: read/write → config → tool chains)
docs/            Guides, runbook, troubleshooting, hardware profiles
scripts/         Context sweep benchmarking, model info fetcher, server start scripts
web/             Web UI for running and monitoring benchmarks

Web UI (web/)

Interactive benchmark dashboard built with FastAPI + htmx, styled to match llama.cpp's web interface:

  • 🤖 Model and server status monitoring
  • 🚀 Run context ladder benchmarks with live streaming output
  • 📈 Browse historical test results
  • ⚙️ View current server configuration

Quick start:

cd web
python3 -m venv venv
./venv/bin/pip install -r requirements.txt
./start.sh

Then open http://localhost:8000 — see web/README.md for details.

Model profiles (models/)

Each model gets a full write-up: architecture, quant selection rationale, speed at various context depths, real-world serving numbers, agentic capability test results, known issues, and recommended config. Not just "it works" — how it works, where it breaks, and why.

Benchmark suite (benchmarks/)

Five-level agentic benchmark:

  • L0: Basic file read/write
  • L1: Config summarization
  • L2: Config patching (structured edit)
  • L3: Benchmark output parsing (complex extraction)
  • L4: Multi-step tool chain (search → fetch → analyze → write)

Designed to stress real agentic capabilities, not trivia or chat fluency.

Quant selection philosophy

We pick the highest-quality quant that fits with 100K+ context headroom. Every model profile includes a transparent comparison table showing all viable quants with exact sizes and math. No "just use Q4" hand-waving.

KV cache per token varies wildly between architectures (Nemotron: 2.25 KiB, Qwen3.5-27B: ~12 KiB for 16/64 attn layers, GLM: ~54 KiB, Qwen3: 36 KiB) and dominates fitment at large context more than model size itself.
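A back-of-the-envelope sketch using those per-token figures shows why: at 131K context the KV cache alone spans a roughly 24× range across architectures:

```python
# Sketch: KV-cache size at a given context depth, from the per-token
# figures quoted above (KiB/token). Model weights aside, this is what
# dominates fitment at long context.
KV_KIB_PER_TOKEN = {
    "Nemotron":    2.25,
    "Qwen3.5-27B": 12.0,  # ~12 KiB for 16/64 attn layers
    "Qwen3":       36.0,
    "GLM":         54.0,
}

def kv_gib(model, n_ctx):
    # KiB -> GiB: divide by 1024 * 1024
    return KV_KIB_PER_TOKEN[model] * n_ctx / (1024 * 1024)

for m in KV_KIB_PER_TOKEN:
    print(f"{m}: {kv_gib(m, 131072):.2f} GiB @131K")
```

At 131K that works out to roughly 0.28 GiB for Nemotron versus 6.75 GiB for GLM, which is why the same quant budget buys very different context headroom per model.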

Hardware history

| Period | Setup | VRAM | Key models |
|---|---|---|---|
| Jan–Feb 2026 | 2× RTX 3060 12 GB | 24 GB | GLM-4.7-Flash, Nemotron-30B, Qwen3.5-35B-A3B |
| Mar 2026+ | 3× RTX 3060 12 GB | 36 GB | Qwen3.5-27B (dense), 3-slot parallel serving |

The third GPU opened up dense models and multi-session serving that wasn't feasible at 24 GB. See hardware profile for the full story.

Safety note

This repo is public. Do not commit:

  • Private hostnames, IPs, or infrastructure details
  • Credentials, tokens, or SSH keys
  • Personal transcripts or sensitive logs

Keep examples generic and reproducible.

Contributing

This is primarily a personal lab notebook, but issues and discussions are welcome if you're running similar hardware and have findings to share. The more data points on consumer GPU setups, the better.

About

All tests were run on my homelab's GPU server, in its dual and later triple RTX 3060 setups, conducted by the Labmaster agent.
