Delegate grunt work from Claude Code (or any coding agent) to a local LLM running in LM Studio. File contents enter the local model's context, not your paid Claude session — so you can explore big repos, triage logs, or extract content without burning the cloud token meter.
Built on the pattern originally shared by Ok_Significance_9109 on r/LocalLLaMA, extended with a read budget, read cache, grep tool, streaming, and model preflight.
If you use a Qwen3 model (including Qwen3.6) and want the fast non-thinking mode, you must edit the model's Jinja template in LM Studio, or everything will silently run in reasoning mode and burn your token budget.
- Open LM Studio → My Models
- Find your Qwen3 model → click the edit/gear icon
- Expand Prompt Template (Jinja)
- Add this line at the very top of the template:
  ```jinja
  {%- set enable_thinking = false %}
  ```
- Save and reload the model
Other models (e.g. Gemma 4) don't need this.
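To confirm the template edit actually took effect, you can check whether replies still open with a `<think>...</think>` block, which is how Qwen3-family models emit their reasoning. A small illustrative check (not part of ask-local):

```python
import re

def reasoning_leaked(reply: str) -> bool:
    """Return True if the reply still contains a Qwen-style <think> block."""
    return bool(re.search(r"<think>.*?</think>", reply, flags=re.DOTALL))
```

If this returns True after you edited the template, the model was likely reloaded without the change, or a different model variant is loaded.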
- `/ask-local` slash command for Claude Code — delegates a task to the local model with file reading, listing, and grepping.
- Two Python scripts, stdlib-only, no pip dependencies:
  - `agent_lm.py` — tool-calling agent loop (the main event)
  - `query_lm.py` — simple prompt-only helper
- Read budget + cache so the model can't spiral into unbounded exploration.
- Streaming final answer so you see output flowing instead of staring at a spinner.
- Token-usage footer printed after every run so you can see exactly how much context stayed local.
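The read budget and cache can be pictured as a thin guard around the file-reading tool. A minimal sketch of the idea, with hypothetical names rather than the actual `agent_lm.py` internals:

```python
# Illustrative sketch of the read-budget + cache idea, not the real agent_lm.py.
class ReadGuard:
    def __init__(self, budget: int = 15):
        self.budget = budget               # max uncached read_file calls
        self.used = 0
        self.cache: dict[str, str] = {}    # path -> file contents

    def read(self, path: str) -> str:
        if path in self.cache:             # cached re-reads are free
            return self.cache[path]
        if self.used >= self.budget:       # budget hit: refuse loudly, never silently
            raise RuntimeError("[WARNING: read budget exhausted]")
        self.used += 1
        with open(path, encoding="utf-8", errors="replace") as f:
            text = f.read()
        self.cache[path] = text
        return text
```

The point of the cache is that the model re-reading a file it already saw costs nothing, so only genuinely new exploration draws down the budget.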
```
git clone https://github.com/alisorcorp/ask-local.git
cd ask-local
./install.sh
```

The install script:
- Checks Python 3.9+ is available
- Pings LM Studio on `http://localhost:1234` (warns, doesn't fail, if unreachable)
- Copies `scripts/agent_lm.py` and `scripts/query_lm.py` into `~/.claude/scripts/`
- Copies `commands/ask-local.md` into `~/.claude/commands/`
- Does NOT edit your LM Studio Jinja templates — that step is on you
If you prefer symlinks (so `git pull` updates your installed version), pass `--link`:

```
./install.sh --link
```

From any Claude Code session, invoke `/ask-local <task>`. The model reads files, lists directories, and greps for patterns on your behalf. Don't paste file content into the task description — describe the task and let the local model do the reading.
```
/ask-local summarize this repo in 6 bullets: purpose, tech stack, entry points, how to run
/ask-local build a mental model: top 5 directories, what each contains, the most important file in each
/ask-local find every TODO and FIXME, group by file
/ask-local list every env var read via process.env / os.getenv / os.environ — include file:line
/ask-local inventory every API route under app/api: method, path, one-line purpose
/ask-local find every hardcoded string that looks like an API key or secret — file:line
/ask-local list every import from 'lodash' — I want to remove this dep
/ask-local build a page inventory: for each route, H1, primary CTA, disclaimer yes/no
/ask-local extract every user-facing error string, flag any that sound rude or cryptic
/ask-local find every reference to the old /v1/users endpoint that should move to /v2/users
/ask-local find every place we build SQL queries via string concatenation
/ask-local list every component still using class syntax instead of hooks
```
Most tasks don't need an explicit --read-budget — the default 15 covers triage, audit, and typical inventory work. Only raise it for jobs that legitimately want to read >15 files (e.g. full-site page inventories on a large site). If you underspec it, you'll see a loud [WARNING: read budget exhausted...] line at the end telling you exactly what to raise.
```
/ask-local find the three most error-prone paths — unhandled rejections, swallowed exceptions, missing validation. Skip tests.
/ask-local review middleware.ts and lib/auth.ts for permission gaps — cite line numbers
/ask-local --think check lib/db.ts for N+1 queries or missing transaction boundaries
```
```
cat error.log | python3 ~/.claude/scripts/query_lm.py "classify these errors into buckets, count each, show one example per"
```

```
tail -5000 build.log | python3 ~/.claude/scripts/query_lm.py "which 3 errors are blocking the build?"
```

```
python3 ~/.claude/scripts/agent_lm.py \
  --dir ~/Code/my-project \
  "find every environment variable read from os.getenv"
```

If the answer is "a list, inventory, or count," it'll crush it. If the answer is "a nuanced judgment call," use it for a first pass and spot-check the top findings yourself.
| Flag | Default | What it does |
|---|---|---|
| `--dir DIR` | cwd | Working directory the model can read. The model is sandboxed to this directory. |
| `--model MODEL` | `qwen3.6-35b-a3b` | Which loaded LM Studio model to use. |
| `--max-turns N` | 15 | Max agent loop iterations. |
| `--max-tokens N` | 6000 | Max tokens per model response. Default is sized for 64k windows with comfortable headroom. On 96k+ windows you can push to 10000-12000 for longer inventories. |
| `--read-budget N` | 15 | Max `read_file` calls before tools are force-disabled. `list_dir` and `grep` are free. If the budget is hit, a clear warning is printed so incomplete answers are never silent. |
| `--max-read-chars N` | 12000 | Per-file truncation cap (head + tail, middle discarded). |
| `--max-file-bytes N` | 500000 | Refuse to read files bigger than this. |
| `--think` | off | Enable reasoning mode (slower but better for hard problems). |
| `--url URL` | `http://localhost:1234` | LM Studio base URL. |
| `--no-stream` | (streaming on) | Disable streaming of the final answer. |
| `--quiet` | off | Suppress turn markers and tool-call logs. |
- `read_file(path)` — read a text file. Binaries, oversized files, and escapes outside `--dir` are refused. Reads are cached (second read of the same path is free).
- `list_dir(path)` — list entries in a directory. Free — doesn't count against the read budget.
- `grep(pattern, path='.', glob=None)` — regex search across files. Free. Skips binaries and standard build dirs (`node_modules`, `.git`, `dist`, `.venv`, etc.). Returns up to 50 matches. Use this instead of reading many files blindly.
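A budget-free grep tool along these lines can be written in a few dozen lines of stdlib Python. A minimal sketch under stated assumptions (the real tool's skip list and match format may differ):

```python
# Illustrative grep tool: regex search across a tree, pruning common build
# dirs and skipping binaries, capped at a fixed number of matches.
import os
import re

SKIP_DIRS = {"node_modules", ".git", "dist", ".venv"}  # assumed skip list

def grep(pattern: str, path: str = ".", limit: int = 50) -> list[str]:
    """Return up to `limit` matches as 'file:line:text' strings."""
    rx = re.compile(pattern)
    hits: list[str] = []
    for root, dirs, files in os.walk(path):
        dirs[:] = [d for d in dirs if d not in SKIP_DIRS]  # prune in place
        for name in files:
            full = os.path.join(root, name)
            try:
                # Binaries are skipped implicitly: they fail UTF-8 decoding.
                with open(full, encoding="utf-8") as f:
                    for i, line in enumerate(f, 1):
                        if rx.search(line):
                            hits.append(f"{full}:{i}:{line.rstrip()}")
                            if len(hits) >= limit:
                                return hits
            except (UnicodeDecodeError, OSError):
                continue
    return hits
```

The hard cap on matches is what keeps grep "free": a pathological pattern can't flood the model's context with thousands of hits.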
| Task type | Fit |
|---|---|
| Triage across many files ("find the 3 files that touch auth") | Good |
| Log / stack-trace summarization | Good |
| Content extraction or inventory | Good |
| Quick bug-finding in isolated files | Good |
| Privacy-sensitive code you don't want leaving the machine | Good |
| Multi-file reasoning where relationships matter | Mixed |
| Anything accuracy-critical (security review, data migration review) | Use as first pass, then verify |
| Tasks needing current conversation context | Avoid |
Rough benchmarks from testing on real projects. "Marginal" is session total minus a 49k fresh-session baseline (system prompt, skill descriptions, CLAUDE.md — the overhead a Claude Code session starts with before any task work happens).
| Task | Files involved | Opus direct | Ask-local | Per-task ratio |
|---|---|---|---|---|
| Inventory every route under `app/api/admin`: method, path, auth check, purpose, DB tables | 23 route files | 13k marginal (62k total) | 0.4k marginal (49.4k total) | ~30× |
| Full page inventory of an Astro site: H1, H2s, meta, CTA, disclaimer per page + layout details + consistency review | 18 files (14 pages + 4 layouts) | 89k marginal (138k total) | 3k marginal (52k total) | ~30× |
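The per-task ratio is just Opus-direct marginal divided by ask-local marginal, computed from the table's numbers:

```python
# Ratios from the benchmark table above, in thousands of tokens.
routes_ratio = 13 / 0.4   # first row: 32.5, reported as ~30x
pages_ratio = 89 / 3      # second row: ~29.7, reported as ~30x
```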
What this means in practice: in a working session with 3-5 such tasks, the difference is hitting compaction pressure mid-afternoon versus staying cool all day. With Opus direct, each inventory task adds 15-90k to your session. With ask-local, each adds ~1-3k — essentially just the size of the answer coming back.
What this doesn't mean: these are extraction-heavy tasks in the local model's sweet spot. Tasks that need multi-file reasoning, subtle correctness, or cross-cutting judgment will narrow the gap — the model that produces the right answer wins regardless of token cost. Treat ~30× as the upper end for inventory work, not a universal claim. And these are single-run measurements, not averages — expect some variance.
Output-quality side note: on the second benchmark above, Qwen and Opus produced different but overlapping consistency observations. Qwen caught an architectural issue Opus missed (one page bypassing the standard layout); Opus caught a heading-hierarchy issue Qwen missed. Neither was strictly better — they noticed different things. Use ask-local's output as a strong first pass, verify anything load-bearing yourself.
- Output quality on local 30B-class models lags Claude/GPT-class models, especially on subtle correctness. Spot-check security or correctness findings — in testing, Qwen3.6 claimed a Next.js middleware constant was "exposed to the client bundle," which was wrong (edge runtime is server-side).
- Quality degrades toward the tail of the advertised context window on most open-weight models. Don't push past ~40–60k tokens of context even if the model advertises 128k+.
- Dense enumeration tasks (e.g. inventory 20+ items with 5-6 attributes each) need `--max-tokens 10000+` or they'll truncate. The script now warns when truncation happens — don't ignore the warning.
- Model must support OpenAI-style tool calling. Most recent instruction-tuned models do; older ones may not.
- One query at a time on RAM-constrained machines. LM Studio will happily accept concurrent requests and OOM your laptop.
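A crude way to stay under the ~40-60k guidance is the common "about 4 characters per token" heuristic for English prose and code. This is an approximation, not the model's real tokenizer:

```python
# Rough context-size check; ~4 chars/token is a heuristic, not exact.
def estimate_tokens(text: str) -> int:
    return len(text) // 4

def fits_context(text: str, budget_tokens: int = 40_000) -> bool:
    return estimate_tokens(text) <= budget_tokens
```

If a planned batch of files fails this check, split the task or lean on `grep` instead of whole-file reads.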
- Original pattern + base scripts: ClassicalDude via r/LocalLLaMA.
MIT.