Delegate grunt work from Claude Code (or any coding agent) to a local LLM running in LM Studio. File contents enter the local model's context, not your paid Claude session — so you can explore big repos, triage logs, or extract content without burning the cloud token meter.
Built on the pattern originally shared by Ok_Significance_9109 on r/LocalLLaMA, extended with a read budget, read cache, grep tool, streaming, and model preflight.
If you use a Qwen3 model (including Qwen3.6) and want the fast non-thinking mode, you must edit the model's Jinja template in LM Studio, or everything will silently run in reasoning mode and burn your token budget.
- Open LM Studio → My Models
- Find your Qwen3 model → click the edit/gear icon
- Expand Prompt Template (Jinja)
- Add this line at the very top of the template:
  ```jinja
  {%- set enable_thinking = false %}
  ```
- Save and reload the model
Other models (e.g. Gemma 4) don't need this.
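To confirm the template edit actually took effect, you can check whether replies still open with a `<think>...</think>` block, which is how Qwen3-family models emit their reasoning. A small illustrative check (not part of ask-local):

```python
import re

def reasoning_leaked(reply: str) -> bool:
    """Return True if the reply still contains a Qwen-style <think> block."""
    return bool(re.search(r"<think>.*?</think>", reply, flags=re.DOTALL))
```

If this returns True after you edited the template, the model was likely reloaded without the change, or a different model variant is loaded.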
- `/ask-local` slash command for Claude Code — delegates a task to the local model with file reading, listing, and grepping.
- Two Python scripts, stdlib-only, no pip dependencies:
  - `agent_lm.py` — tool-calling agent loop (the main event)
  - `query_lm.py` — simple prompt-only helper
- Read budget + cache so the model can't spiral into unbounded exploration.
- Streaming final answer so you see output flowing instead of staring at a spinner.
- Token-usage footer printed after every run so you can see exactly how much context stayed local.
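The read budget and cache can be pictured as a thin guard around the file-reading tool. A minimal sketch of the idea, with hypothetical names rather than the actual `agent_lm.py` internals:

```python
# Illustrative sketch of the read-budget + cache idea, not the real agent_lm.py.
class ReadGuard:
    def __init__(self, budget: int = 15):
        self.budget = budget               # max uncached read_file calls
        self.used = 0
        self.cache: dict[str, str] = {}    # path -> file contents

    def read(self, path: str) -> str:
        if path in self.cache:             # cached re-reads are free
            return self.cache[path]
        if self.used >= self.budget:       # budget hit: refuse loudly, never silently
            raise RuntimeError("[WARNING: read budget exhausted]")
        self.used += 1
        with open(path, encoding="utf-8", errors="replace") as f:
            text = f.read()
        self.cache[path] = text
        return text
```

The point of the cache is that the model re-reading a file it already saw costs nothing, so only genuinely new exploration draws down the budget.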
```
git clone https://github.com/alisorcorp/ask-local.git
cd ask-local
./install.sh
```

The install script:
- Checks Python 3.9+ is available
- Pings LM Studio on `http://localhost:1234` (warns, doesn't fail, if unreachable)
- Copies `scripts/agent_lm.py` and `scripts/query_lm.py` into `~/.claude/scripts/`
- Copies `commands/ask-local.md` into `~/.claude/commands/`
- Does NOT edit your LM Studio Jinja templates — that step is on you
If you prefer symlinks (so `git pull` updates your installed version), pass `--link`:

```
./install.sh --link
```

From any Claude Code session, invoke `/ask-local <task>`. The model reads files, lists directories, and greps for patterns on your behalf. Don't paste file content into the task description — describe the task and let the local model do the reading.
```
/ask-local summarize this repo in 6 bullets: purpose, tech stack, entry points, how to run
/ask-local build a mental model: top 5 directories, what each contains, the most important file in each
/ask-local find every TODO and FIXME, group by file
/ask-local list every env var read via process.env / os.getenv / os.environ — include file:line
/ask-local inventory every API route under app/api: method, path, one-line purpose
/ask-local find every hardcoded string that looks like an API key or secret — file:line
/ask-local list every import from 'lodash' — I want to remove this dep
/ask-local build a page inventory: for each route, H1, primary CTA, disclaimer yes/no
/ask-local extract every user-facing error string, flag any that sound rude or cryptic
/ask-local find every reference to the old /v1/users endpoint that should move to /v2/users
/ask-local find every place we build SQL queries via string concatenation
/ask-local list every component still using class syntax instead of hooks
```
Most tasks don't need an explicit --read-budget — the default 15 covers triage, audit, and typical inventory work. Only raise it for jobs that legitimately want to read >15 files (e.g. full-site page inventories on a large site). If you underspec it, you'll see a loud [WARNING: read budget exhausted...] line at the end telling you exactly what to raise.
```
/ask-local find the three most error-prone paths — unhandled rejections, swallowed exceptions, missing validation. Skip tests.
/ask-local review middleware.ts and lib/auth.ts for permission gaps — cite line numbers
/ask-local --think check lib/db.ts for N+1 queries or missing transaction boundaries
```
```
cat error.log | python3 ~/.claude/scripts/query_lm.py "classify these errors into buckets, count each, show one example per"
```

```
tail -5000 build.log | python3 ~/.claude/scripts/query_lm.py "which 3 errors are blocking the build?"
```

```
python3 ~/.claude/scripts/agent_lm.py \
  --dir ~/Code/my-project \
  "find every environment variable read from os.getenv"
```

If the answer is "a list, inventory, or count," it'll crush it. If the answer is "a nuanced judgment call," use it for a first pass and spot-check the top findings yourself.
| Flag | Default | What it does |
|---|---|---|
| `--dir DIR` | cwd | Working directory the model can read. The model is sandboxed to this directory. |
| `--model MODEL` | `qwen3.6-35b-a3b` | Which loaded LM Studio model to use. |
| `--max-turns N` | 15 | Max agent loop iterations. |
| `--max-tokens N` | 6000 | Max tokens per model response. Default is sized for 64k windows with comfortable headroom. On 96k+ windows you can push to 10000-12000 for longer inventories. |
| `--read-budget N` | 15 | Max `read_file` calls before tools are force-disabled. `list_dir` and `grep` are free. If the budget is hit, a clear warning is printed so incomplete answers are never silent. |
| `--max-read-chars N` | 12000 | Per-file truncation cap (head + tail, middle discarded). |
| `--max-file-bytes N` | 500000 | Refuse to read files bigger than this. |
| `--think` | off | Enable reasoning mode (slower but better for hard problems). |
| `--url URL` | `http://localhost:1234` | LM Studio base URL. |
| `--no-stream` | (streaming on) | Disable streaming of the final answer. |
| `--quiet` | off | Suppress turn markers and tool-call logs. |
- `read_file(path)` — read a text file. Binaries, oversized files, and escapes outside `--dir` are refused. Reads are cached (second read of the same path is free).
- `list_dir(path)` — list entries in a directory. Free — doesn't count against the read budget.
- `grep(pattern, path='.', glob=None)` — regex search across files. Free. Skips binaries and standard build dirs (`node_modules`, `.git`, `dist`, `.venv`, etc.). Returns up to 50 matches. Use this instead of reading many files blindly.
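A budget-free grep tool along these lines can be written in a few dozen lines of stdlib Python. A minimal sketch under stated assumptions (the real tool's skip list and match format may differ):

```python
# Illustrative grep tool: regex search across a tree, pruning common build
# dirs and skipping binaries, capped at a fixed number of matches.
import os
import re

SKIP_DIRS = {"node_modules", ".git", "dist", ".venv"}  # assumed skip list

def grep(pattern: str, path: str = ".", limit: int = 50) -> list[str]:
    """Return up to `limit` matches as 'file:line:text' strings."""
    rx = re.compile(pattern)
    hits: list[str] = []
    for root, dirs, files in os.walk(path):
        dirs[:] = [d for d in dirs if d not in SKIP_DIRS]  # prune in place
        for name in files:
            full = os.path.join(root, name)
            try:
                # Binaries are skipped implicitly: they fail UTF-8 decoding.
                with open(full, encoding="utf-8") as f:
                    for i, line in enumerate(f, 1):
                        if rx.search(line):
                            hits.append(f"{full}:{i}:{line.rstrip()}")
                            if len(hits) >= limit:
                                return hits
            except (UnicodeDecodeError, OSError):
                continue
    return hits
```

The hard cap on matches is what keeps grep "free": a pathological pattern can't flood the model's context with thousands of hits.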
| Task type | Fit |
|---|---|
| Triage across many files ("find the 3 files that touch auth") | Good |
| Log / stack-trace summarization | Good |
| Content extraction or inventory | Good |
| Quick bug-finding in isolated files | Good |
| Privacy-sensitive code you don't want leaving the machine | Good |
| Multi-file reasoning where relationships matter | Mixed |
| Anything accuracy-critical (security review, data migration review) | Use as first pass, then verify |
| Tasks needing current conversation context | Avoid |
Rough benchmarks from testing on real projects. "Marginal" is session total minus a 49k fresh-session baseline (system prompt, skill descriptions, CLAUDE.md — the overhead a Claude Code session starts with before any task work happens).
| Task | Files involved | Opus direct | Ask-local | Per-task ratio |
|---|---|---|---|---|
| Inventory every route under `app/api/admin`: method, path, auth check, purpose, DB tables | 23 route files | 13k marginal (62k total) | 0.4k marginal (49.4k total) | ~30× |
| Full page inventory of an Astro site: H1, H2s, meta, CTA, disclaimer per page + layout details + consistency review | 18 files (14 pages + 4 layouts) | 89k marginal (138k total) | 3k marginal (52k total) | ~30× |
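The per-task ratio is just Opus-direct marginal divided by ask-local marginal, computed from the table's numbers:

```python
# Ratios from the benchmark table above, in thousands of tokens.
routes_ratio = 13 / 0.4   # first row: 32.5, reported as ~30x
pages_ratio = 89 / 3      # second row: ~29.7, reported as ~30x
```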
What this means in practice: in a working session with 3-5 such tasks, the difference is hitting compaction pressure mid-afternoon versus staying cool all day. With Opus direct, each inventory task adds 15-90k to your session. With ask-local, each adds ~1-3k — essentially just the size of the answer coming back.
What this doesn't mean: these are extraction-heavy tasks in the local model's sweet spot. Tasks that need multi-file reasoning, subtle correctness, or cross-cutting judgment will narrow the gap — the model that produces the right answer wins regardless of token cost. Treat ~30× as the upper end for inventory work, not a universal claim. And these are single-run measurements, not averages — expect some variance.
Output-quality side note: on the second benchmark above, Qwen and Opus produced different but overlapping consistency observations. Qwen caught an architectural issue Opus missed (one page bypassing the standard layout); Opus caught a heading-hierarchy issue Qwen missed. Neither was strictly better — they noticed different things. Use ask-local's output as a strong first pass, verify anything load-bearing yourself.
- Output quality on local 30B-class models lags Claude/GPT-class models, especially on subtle correctness. Spot-check security or correctness findings — in testing, Qwen3.6 claimed a Next.js middleware constant was "exposed to the client bundle," which was wrong (edge runtime is server-side).
- Quality degrades toward the tail of the advertised context window on most open-weight models. Don't push past ~40–60k tokens of context even if the model advertises 128k+.
- Dense enumeration tasks (e.g. inventory 20+ items with 5-6 attributes each) need `--max-tokens 10000+` or they'll truncate. The script now warns when truncation happens — don't ignore the warning.
- Model must support OpenAI-style tool calling. Most recent instruction-tuned models do; older ones may not.
- One query at a time on RAM-constrained machines. LM Studio will happily accept concurrent requests and OOM your laptop.
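A crude way to stay under the ~40-60k guidance is the common "about 4 characters per token" heuristic for English prose and code. This is an approximation, not the model's real tokenizer:

```python
# Rough context-size check; ~4 chars/token is a heuristic, not exact.
def estimate_tokens(text: str) -> int:
    return len(text) // 4

def fits_context(text: str, budget_tokens: int = 40_000) -> bool:
    return estimate_tokens(text) <= budget_tokens
```

If a planned batch of files fails this check, split the task or lean on `grep` instead of whole-file reads.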
- Original pattern + base scripts: ClassicalDude via r/LocalLLaMA.
MIT.