ask-local

Delegate grunt work from Claude Code (or any coding agent) to a local LLM running in LM Studio. File contents enter the local model's context, not your paid Claude session — so you can explore big repos, triage logs, or extract content without burning the cloud token meter.

Built on the pattern originally shared by Ok_Significance_9109 on r/LocalLLaMA, extended with a read budget, read cache, grep tool, streaming, and model preflight.

Required one-time setup for Qwen3 models

If you use a Qwen3 model (including Qwen3.6) and want the fast non-thinking mode, you must edit the model's Jinja template in LM Studio, or everything will silently run in reasoning mode and burn your token budget.

  1. Open LM Studio → My Models
  2. Find your Qwen3 model → click the edit/gear icon
  3. Expand Prompt Template (Jinja)
  4. Add this line at the very top of the template:
    {%- set enable_thinking = false %}
  5. Save and reload the model

Other models (e.g. Gemma 4) don't need this.
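If you want to confirm the template edit took, a quick probe against LM Studio's OpenAI-compatible endpoint will show whether replies still carry a `<think>` block (how Qwen3-family models emit their reasoning). This is a minimal stdlib sketch, not part of the repo's scripts — `build_probe`, `looks_like_thinking`, and `probe` are illustrative names:

```python
import json
import urllib.request

API = "http://localhost:1234/v1/chat/completions"  # LM Studio's OpenAI-compatible endpoint

def build_probe(model: str) -> dict:
    """Minimal chat payload; a tiny prompt is enough to see whether <think> comes back."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": "Reply with the single word: ok"}],
        "max_tokens": 200,
    }

def looks_like_thinking(reply: str) -> bool:
    """Qwen3-family reasoning mode wraps its chain of thought in <think>...</think> tags."""
    return "<think>" in reply

def probe(model: str = "qwen3.6-35b-a3b") -> str:
    """Call with LM Studio running to check which mode the current template produces."""
    req = urllib.request.Request(
        API,
        data=json.dumps(build_probe(model)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        reply = json.load(resp)["choices"][0]["message"]["content"]
    return "thinking still ON" if looks_like_thinking(reply) else "non-thinking mode OK"
```

Run `probe()` once after saving and reloading the model; if it reports thinking is still on, the template line didn't stick.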

What you get

  • /ask-local slash command for Claude Code — delegates a task to the local model with file reading, listing, and grepping.
  • Two Python scripts, stdlib-only, no pip dependencies:
    • agent_lm.py — tool-calling agent loop (the main event)
    • query_lm.py — simple prompt-only helper
  • Read budget + cache so the model can't spiral into unbounded exploration.
  • Streaming final answer so you see output flowing instead of staring at a spinner.
  • Token-usage footer printed after every run so you can see exactly how much context stayed local.

Install

git clone https://github.com/alisorcorp/ask-local.git
cd ask-local
./install.sh

The install script:

  • Checks Python 3.9+ is available
  • Pings LM Studio on http://localhost:1234 (warns but doesn't fail if unreachable)
  • Copies scripts/agent_lm.py and scripts/query_lm.py into ~/.claude/scripts/
  • Copies commands/ask-local.md into ~/.claude/commands/
  • Does NOT edit your LM Studio Jinja templates — that step is on you

If you prefer symlinks (so git pull updates your installed version), pass --link:

./install.sh --link

Usage

From any Claude Code session, invoke /ask-local <task>. The model reads files, lists directories, and greps for patterns on your behalf. Don't paste file content into the task description — describe the task and let the local model do the reading.

Repo orientation

/ask-local summarize this repo in 6 bullets: purpose, tech stack, entry points, how to run
/ask-local build a mental model: top 5 directories, what each contains, the most important file in each

Pattern inventory (grep-heavy, very fast)

/ask-local find every TODO and FIXME, group by file
/ask-local list every env var read via process.env / os.getenv / os.environ — include file:line
/ask-local inventory every API route under app/api: method, path, one-line purpose
/ask-local find every hardcoded string that looks like an API key or secret — file:line
/ask-local list every import from 'lodash' — I want to remove this dep

Content audits

/ask-local build a page inventory: for each route, H1, primary CTA, disclaimer yes/no
/ask-local extract every user-facing error string, flag any that sound rude or cryptic

Migration and refactor prep

/ask-local find every reference to the old /v1/users endpoint that should move to /v2/users
/ask-local find every place we build SQL queries via string concatenation
/ask-local list every component still using class syntax instead of hooks

Most tasks don't need an explicit --read-budget — the default 15 covers triage, audit, and typical inventory work. Only raise it for jobs that legitimately want to read >15 files (e.g. full-site page inventories on a large site). If you underspec it, you'll see a loud [WARNING: read budget exhausted...] line at the end telling you exactly what to raise.
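The budget mechanics are simple: only `read_file` is charged, cached re-reads are free, and exhaustion is surfaced loudly rather than silently. A minimal sketch of that accounting (illustrative only — not the actual `agent_lm.py` code):

```python
class ReadBudget:
    """Counts read_file calls; list_dir and grep never charge it."""

    def __init__(self, limit: int = 15):
        self.limit = limit
        self.used = 0
        self.cache: set[str] = set()  # paths already read — re-reads are free

    def charge(self, path: str) -> bool:
        """Return True if the read may proceed."""
        if path in self.cache:
            return True  # cached: second read of the same path costs nothing
        if self.used >= self.limit:
            return False  # caller prints the loud budget-exhausted warning
        self.used += 1
        self.cache.add(path)
        return True
```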

Triage (always spot-check the findings yourself)

/ask-local find the three most error-prone paths — unhandled rejections, swallowed exceptions, missing validation. Skip tests.
/ask-local review middleware.ts and lib/auth.ts for permission gaps — cite line numbers
/ask-local --think check lib/db.ts for N+1 queries or missing transaction boundaries

Log / output triage (pipe through query_lm.py — no file tree needed)

cat error.log | python3 ~/.claude/scripts/query_lm.py "classify these errors into buckets, count each, show one example per"
tail -5000 build.log | python3 ~/.claude/scripts/query_lm.py "which 3 errors are blocking the build?"

Directly from the CLI

python3 ~/.claude/scripts/agent_lm.py \
  --dir ~/Code/my-project \
  "find every environment variable read from os.getenv"

Rule of thumb for picking tasks

If the answer is "a list, inventory, or count," it'll crush it. If the answer is "a nuanced judgment call," use it for a first pass and spot-check the top findings yourself.

Flags

| Flag | Default | What it does |
| --- | --- | --- |
| `--dir DIR` | cwd | Working directory the model can read. The model is sandboxed to this directory. |
| `--model MODEL` | `qwen3.6-35b-a3b` | Which loaded LM Studio model to use. |
| `--max-turns N` | 15 | Max agent loop iterations. |
| `--max-tokens N` | 6000 | Max tokens per model response. Default is sized for 64k windows with comfortable headroom. On 96k+ windows you can push to 10000-12000 for longer inventories. |
| `--read-budget N` | 15 | Max `read_file` calls before tools are force-disabled. `list_dir` and `grep` are free. If the budget is hit, a clear warning is printed so incomplete answers are never silent. |
| `--max-read-chars N` | 12000 | Per-file truncation cap (head + tail, middle discarded). |
| `--max-file-bytes N` | 500000 | Refuse to read files bigger than this. |
| `--think` | off | Enable reasoning mode (slower but better for hard problems). |
| `--url URL` | `http://localhost:1234` | LM Studio base URL. |
| `--no-stream` | (streaming on) | Disable streaming of the final answer. |
| `--quiet` | off | Suppress turn markers and tool-call logs. |
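The `--max-read-chars` head-plus-tail scheme keeps the parts of a file most likely to matter (imports and top-level structure at the head, exports and trailing definitions at the tail) while dropping the middle. A sketch of that truncation, with an illustrative function name:

```python
def head_tail(text: str, cap: int = 12000) -> str:
    """Cap file content at `cap` chars: keep the first and last halves, drop the middle."""
    if len(text) <= cap:
        return text  # small files pass through untouched
    half = cap // 2
    return text[:half] + "\n...[middle truncated]...\n" + text[-half:]
```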

Tools the local model has

  • read_file(path) — read a text file. Binaries, oversized files, and escapes outside --dir are refused. Reads are cached (second read of same path is free).
  • list_dir(path) — list entries in a directory. Free — doesn't count against the read budget.
  • grep(pattern, path='.', glob=None) — regex search across files. Free. Skips binaries and standard build dirs (node_modules, .git, dist, .venv, etc.). Returns up to 50 matches. Use this instead of reading many files blindly.
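The "escapes outside `--dir` are refused" guarantee comes down to resolving the requested path and checking it still sits under the sandbox root. A stdlib sketch of that check (illustrative, not the repo's exact code — note `realpath` also neutralizes symlink escapes):

```python
import os

def inside_sandbox(root: str, requested: str) -> bool:
    """True only if `requested` resolves to a path under `root`."""
    root = os.path.realpath(root)
    target = os.path.realpath(os.path.join(root, requested))
    # commonpath equals root only when target is root itself or nested beneath it
    return os.path.commonpath([root, target]) == root
```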

When this helps

| Task type | Fit |
| --- | --- |
| Triage across many files ("find the 3 files that touch auth") | Good |
| Log / stack-trace summarization | Good |
| Content extraction or inventory | Good |
| Quick bug-finding in isolated files | Good |
| Privacy-sensitive code you don't want leaving the machine | Good |
| Multi-file reasoning where relationships matter | Mixed |
| Anything accuracy-critical (security review, data migration review) | Use as first pass, then verify |
| Tasks needing current conversation context | Avoid |

Measured results

Rough benchmarks from testing on real projects. "Marginal" is session total minus a 49k fresh-session baseline (system prompt, skill descriptions, CLAUDE.md — the overhead a Claude Code session starts with before any task work happens).

| Task | Files involved | Opus direct | Ask-local | Per-task ratio |
| --- | --- | --- | --- | --- |
| Inventory every route under app/api/admin: method, path, auth check, purpose, DB tables | 23 route files | 13k marginal (62k total) | 0.4k marginal (49.4k total) | ~30× |
| Full page inventory of an Astro site: H1, H2s, meta, CTA, disclaimer per page + layout details + consistency review | 18 files (14 pages + 4 layouts) | 89k marginal (138k total) | 3k marginal (52k total) | ~30× |
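The "marginal" figures are just subtraction against the 49k fresh-session baseline; spelled out for the first benchmark row:

```python
BASELINE = 49_000  # fresh Claude Code session: system prompt, skills, CLAUDE.md

def marginal(session_total: int, baseline: int = BASELINE) -> int:
    """Tokens attributable to the task itself."""
    return session_total - baseline

# First row: 62k total direct vs 49.4k total via ask-local
opus_cost = marginal(62_000)   # 13_000
local_cost = marginal(49_400)  # 400
ratio = opus_cost / local_cost  # ≈ 32, reported conservatively as ~30×
```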

What this means in practice: in a working session with 3-5 such tasks, the difference is hitting compaction pressure mid-afternoon versus staying cool all day. With Opus direct, each inventory task adds 15-90k to your session. With ask-local, each adds ~1-3k — essentially just the size of the answer coming back.

What this doesn't mean: these are extraction-heavy tasks in the local model's sweet spot. Tasks that need multi-file reasoning, subtle correctness, or cross-cutting judgment will narrow the gap — the model that produces the right answer wins regardless of token cost. Treat ~30× as the upper end for inventory work, not a universal claim. And these are single-run measurements, not averages — expect some variance.

Output-quality side note: on the second benchmark above, Qwen and Opus produced different but overlapping consistency observations. Qwen caught an architectural issue Opus missed (one page bypassing the standard layout); Opus caught a heading-hierarchy issue Qwen missed. Neither was strictly better — they noticed different things. Use ask-local's output as a strong first pass, verify anything load-bearing yourself.

Known limits

  • Output quality on local 30B-class models lags Claude/GPT-class models, especially on subtle correctness. Spot-check security or correctness findings — in testing, Qwen3.6 claimed a Next.js middleware constant was "exposed to the client bundle," which was wrong (edge runtime is server-side).
  • Quality degrades toward the tail of the advertised context window on most open-weight models. Don't push past ~40–60k tokens of context even if the model advertises 128k+.
  • Dense enumeration tasks (e.g. inventory 20+ items with 5-6 attributes each) need --max-tokens 10000+ or they'll truncate. The script now warns when truncation happens — don't ignore the warning.
  • Model must support OpenAI-style tool calling. Most recent instruction-tuned models do; older ones may not.
  • One query at a time on RAM-constrained machines. LM Studio will happily accept concurrent requests and OOM your laptop.

Credits

Original pattern by Ok_Significance_9109 on r/LocalLLaMA; this repo adds the read budget, cache, grep tool, streaming, and preflight.

License

MIT.
