llmlab

A hobbyist's notebook for running local LLMs on consumer GPUs — focused on what actually works, not what benchmarks promise.

What this is

We run agentic LLM workloads on 3× RTX 3060 12 GB (36 GB VRAM) and document everything: configs that work, models that don't, performance numbers from real serving (not just llama-bench), and the weird edge cases you only find by actually using these things.

We started with a dual 3060 setup (24 GB) and upgraded to triple in early March 2026 — unlocking dense 27B models and 3-way parallel serving that wasn't possible before.

Focus areas:

  • MoE and hybrid models — the sweet spot for interactive use on limited VRAM. Small active parameters = fast generation, large total parameters = good quality. Dense models are on the table too now that we have 36 GB to play with.
  • Agentic tool-calling — not just chat, but models driving multi-step tool chains (web search → fetch → analyze → file ops). We test with OpenClaw, which is demanding enough that if a model works here, it'll work in any general-purpose agentic setup.
  • Real serving metrics — llama-bench numbers are a starting point. Real-world serving with prompt caches, thinking tokens, and growing context tells a different story. We measure both.
  • Multi-GPU optimization — tensor-split tuning, compute buffer analysis, and practical guides for squeezing maximum context and parallelism out of consumer GPUs without NVLink.

Hardware and serving setup

Hardware

  • GPU server: 3× RTX 3060 12 GB (36 GB total), Intel i5-7400, PCIe x16 + x4 + x4 — detailed hardware profile
  • CPU fallback: Intel i5-8400T, 64 GB DDR4-2667, llama.cpp

Runtime profiles

  • llama.cpp remains the reference path for GGUF-based serving, tensor-split tuning, and long-context fitment work.
  • vLLM is now part of the documented production story as well, especially for GPTQ-based serving and higher aggregate concurrency with native prefix caching.

Validated highlights

Key findings

-sm layer is everything. On PCIe multi-GPU without NVLink, split-mode matters more than the model itself. -sm layer gives 2.5× the throughput of -sm row. If you have multiple consumer GPUs, this is the single most important flag.

output.weight lands on the last GPU. In llama.cpp's split-mode layer, the output projection (~1+ GB) is hardcoded to the last GPU. This creates asymmetric VRAM pressure that must be compensated with tensor-split ratios. See our multi-GPU tensor-split guide.
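The compensation is simple arithmetic: weight each GPU by its usable VRAM after fixed costs, with the last GPU additionally paying for output.weight. A minimal sketch — the reserve and output.weight sizes here are illustrative assumptions, not measured values:

```python
# Sketch: derive --tensor-split ratios that compensate for llama.cpp
# pinning output.weight to the last GPU in -sm layer mode.
# reserve_gib and output_weight_gib are illustrative assumptions.

def tensor_split(vram_per_gpu_gib, output_weight_gib=1.0, reserve_gib=1.5):
    """Weight each GPU by the VRAM left after fixed costs;
    the last GPU also pays for output.weight."""
    budgets = []
    for i, vram in enumerate(vram_per_gpu_gib):
        usable = vram - reserve_gib
        if i == len(vram_per_gpu_gib) - 1:  # output.weight lands here
            usable -= output_weight_gib
        budgets.append(usable)
    total = sum(budgets)
    return [round(b / total, 3) for b in budgets]

print(tensor_split([12, 12, 12]))  # → [0.344, 0.344, 0.311]
```

With three identical 12 GB cards, the last GPU ends up with a slightly smaller share to absorb the extra tensor.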

--parallel N shrinks compute buffers. More slots = smaller per-slot compute buffers, which frees VRAM for KV cache. On our 3×3060 setup, going from parallel 1→3 freed enough headroom for 3× the concurrent sessions at 131K each. The tradeoff: per-slot context shrinks proportionally.
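The tradeoff is plain arithmetic: llama.cpp's server carves the total context (-c) evenly into slots, so per-slot context is total context divided by --parallel. A minimal sketch of the budget:

```python
# Sketch: llama.cpp's server splits the total context (-c) evenly
# across slots (--parallel); each slot gets n_ctx // n_parallel tokens.

def per_slot_ctx(n_ctx, n_parallel):
    return n_ctx // n_parallel

# At fixed -c, raising --parallel shrinks each slot proportionally;
# to keep three full 131K slots, the total context must triple.
assert per_slot_ctx(131072, 1) == 131072
assert per_slot_ctx(131072, 3) == 43690
assert per_slot_ctx(393216, 3) == 131072
```

The VRAM win comes from the compute buffers scaling with per-slot context rather than total context, which is what frees headroom for the extra KV cache.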

What we've learned

Models tested

| Model | Main validated serving path | Arch | Verdict | Notes |
|---|---|---|---|---|
| Qwen3.5-27B | llama.cpp | Hybrid (DeltaNet+Attn) | ✅ Production | 3×131K parallel, strong fit for the 3×3060 GGUF path |
| Qwen3.5-35B-A3B | vLLM + llama.cpp | MoE | ✅/🟡 Mixed by backend | Important current model: validated on vLLM PP=3; historically tighter and riskier on 24 GB llama.cpp configs |
| GLM-4.7-Flash | llama.cpp | MoE ~4B | ✅ Production | Best tool-calling quality in earlier 24 GB-era work |
| Nemotron-3-Nano-30B | llama.cpp / CPU fallback | MoE (Mamba-2) | ✅ Production | Excellent speed retention at depth; useful fallback profile |
| LFM2-24B-A2B | llama.cpp | MoE | ❌ Failed | Extremely fast but unreliable for agentic work |
| Nanbeige4.1-3B | llama.cpp | Dense | ❌ Failed | Leaks `<think>` blocks, can't disable reasoning |
| ZwZ-4B | llama.cpp | Dense | 🟡 Parked | Multimodal arch, untested for agentic |
| Qwen3-Coder-REAP | llama.cpp | MoE | 🟡 Mixed | Good code, context degradation issues |

Context degradation (the number that actually matters)

llama-bench at empty context is marketing. Here's what happens as context fills:

| Model | tok/s @0 | @16K | @32K | @64K | Degradation pattern |
|---|---|---|---|---|---|
| Nemotron (Mamba-2) | 96 | 85 | 72 | 55 | -42% @64K — best |
| GLM (MLA) | 71 | 54 | 45 | 33 | -53% @64K |
| Qwen3 (GQA) | 99 | 39 | 24 | 13 | -87% @64K — worst |

Nemotron's Mamba-2 architecture genuinely delivers on the "constant-time attention" promise. Qwen3's traditional GQA falls off a cliff. Qwen3.5-27B's hybrid approach (mostly recurrent) should sit closer to Nemotron — benchmarks pending.
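As a sanity check, the degradation column can be recomputed from the raw tok/s figures (it matches the table within rounding):

```python
# Recompute the degradation column from the table above:
# relative drop in tok/s between empty context and 64K.
speeds = {
    "Nemotron (Mamba-2)": (96, 55),
    "GLM (MLA)":          (71, 33),
    "Qwen3 (GQA)":        (99, 13),
}
for model, (t0, t64k) in speeds.items():
    drop = (t64k - t0) / t0 * 100
    print(f"{model}: {drop:.1f}% @64K")
```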

Real serving vs benchmarks

From a 79-request GLM production session:

  • Benchmark says: 71 tok/s at empty context
  • Real serving: 30 tok/s at 37K context, 14.4 tok/s at 113K
  • Gap: 28-36% slower than llama-bench (server overhead, prompt cache, thinking tokens, KV pressure)
  • Compaction helps: speed recovered 14.4 → 30.0 tok/s (2.1×) after context compaction
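The recovery factor is easy to verify from the numbers above:

```python
# Compaction recovery from the GLM session: tok/s before vs after
# context compaction.
before, after = 14.4, 30.0
print(f"{after / before:.1f}x")  # prints 2.1x
```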

Docs & guides

| Guide | Description |
|---|---|
| Multi-GPU Tensor-Split | How to optimize layer distribution across GPUs for llama.cpp — ceiling testing, output.weight gotcha, --parallel effects |
| Hardware: Triple 3060 | Our specific 3×3060 setup — validated llama.cpp configs, vLLM PP=3 note, VRAM budgets, capacity planning |
| Architecture | System architecture overview |
| Runbook | Start/stop servers, common operations |
| Troubleshooting | Common issues and fixes |
| Qwen3.5-35B-A3B vLLM PP=3 experiment | Dedicated write-up for the validated vLLM deployment profile and concurrency behavior |

Repo structure

models/          Model profiles (GLM, Nemotron, Qwen3.5-27B, Qwen3.5-35B, LFM2, ...)
experiments/     Experiment logs (context sweeps, quant comparisons, speed tests)
benchmarks/      Agentic benchmark suite (L0-L4: read/write → config → tool chains)
docs/            Guides, runbook, troubleshooting, hardware profiles
scripts/         Context sweep benchmarking, model info fetcher, server start scripts
web/             Web UI for running and monitoring benchmarks

Web UI (web/)

Interactive benchmark dashboard built with FastAPI + htmx, styled to match llama.cpp's web interface:

  • 🤖 Model and server status monitoring
  • 🚀 Run context ladder benchmarks with live streaming output
  • 📈 Browse historical test results
  • ⚙️ View current server configuration

Quick start:

cd web
python3 -m venv venv
./venv/bin/pip install -r requirements.txt
./start.sh

Then open http://localhost:8000 — see web/README.md for details.

Model profiles (models/)

Each model gets a full write-up: architecture, quant selection rationale, speed at various context depths, real-world serving numbers, agentic capability test results, known issues, and recommended config. Not just "it works" — how it works, where it breaks, and why.

Benchmark suite (benchmarks/)

Five-level agentic benchmark:

  • L0: Basic file read/write
  • L1: Config summarization
  • L2: Config patching (structured edit)
  • L3: Benchmark output parsing (complex extraction)
  • L4: Multi-step tool chain (search → fetch → analyze → write)

Designed to stress real agentic capabilities, not trivia or chat fluency.

Quant selection philosophy

We pick the highest-quality quant that fits with 100K+ context headroom. Every model profile includes a transparent comparison table showing all viable quants with exact sizes and math. No "just use Q4" hand-waving.

KV cache per token varies wildly between architectures (Nemotron: 2.25 KiB, Qwen3.5-27B: ~12 KiB for 16/64 attn layers, GLM: ~54 KiB, Qwen3: 36 KiB) and dominates fitment at large context more than model size itself.
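A back-of-the-envelope sketch using those per-token figures shows why: at 131K context the KV cache alone spans a roughly 24× range across architectures:

```python
# Sketch: KV-cache size at a given context depth, from the per-token
# figures quoted above (KiB/token). Model weights aside, this is what
# dominates fitment at long context.
KV_KIB_PER_TOKEN = {
    "Nemotron":    2.25,
    "Qwen3.5-27B": 12.0,  # ~12 KiB for 16/64 attn layers
    "Qwen3":       36.0,
    "GLM":         54.0,
}

def kv_gib(model, n_ctx):
    # KiB -> GiB: divide by 1024 * 1024
    return KV_KIB_PER_TOKEN[model] * n_ctx / (1024 * 1024)

for m in KV_KIB_PER_TOKEN:
    print(f"{m}: {kv_gib(m, 131072):.2f} GiB @131K")
```

At 131K that works out to roughly 0.28 GiB for Nemotron versus 6.75 GiB for GLM, which is why the same quant budget buys very different context headroom per model.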

Hardware history

| Period | Setup | VRAM | Key models |
|---|---|---|---|
| Jan–Feb 2026 | 2× RTX 3060 12 GB | 24 GB | GLM-4.7-Flash, Nemotron-30B, Qwen3.5-35B-A3B |
| Mar 2026+ | 3× RTX 3060 12 GB | 36 GB | Qwen3.5-27B (dense), 3-slot parallel serving |

The third GPU opened up dense models and multi-session serving that wasn't feasible at 24 GB. See hardware profile for the full story.

Safety note

This repo is public. Do not commit:

  • Private hostnames, IPs, or infrastructure details
  • Credentials, tokens, or SSH keys
  • Personal transcripts or sensitive logs

Keep examples generic and reproducible.

Contributing

This is primarily a personal lab notebook, but issues and discussions are welcome if you're running similar hardware and have findings to share. The more data points on consumer GPU setups, the better.

About

All tests were run on my homelab's GPU server, in its dual and later triple RTX 3060 setups, conducted by the Labmaster agent.
