-
Notifications
You must be signed in to change notification settings - Fork 0
Self Hosting LLMs
ReWrite's cleanup step works with any server that speaks the OpenAI /chat/completions dialect, through the OpenAI-compatible LLM provider. That covers Ollama, llama.cpp's llama-server, LM Studio, and most self-hosted gateways, whether on your own machine or a remote box.
Pair this with local whisper.cpp so neither audio nor text leaves your machine.
Ollama is the easiest starting point on macOS, Linux, and Windows.
- Install Ollama and start it (
ollama serve, or it runs as a service). - Pull a model:
ollama pull llama3.2(see the model picks below). - In a ReWrite profile, set LLM provider to "OpenAI-compatible (cloud or local)".
-
LLM base URL:
http://localhost:11434/v1 -
LLM model: the pulled name, for example
llama3.2. - API key: Ollama ignores it, so any non-empty placeholder works.
Note the /v1: for the LLM side the base URL must include the version path, because the adapter appends /chat/completions to whatever you enter. (The transcription side is different; see the base-URL asymmetry in Providers.)
Run llama-server -m <model.gguf>. Base URL http://localhost:8080/v1, model = whatever name the server reports.
The setup is the same; you just point the base URL at the remote host. The important part is not exposing the raw model port to the internet. Ollama's API has no authentication, so a public 11434 lets anyone use (and abuse) your GPU.
Recommended approaches, safest first:
-
Private tunnel (best). Keep the server bound to localhost and reach it over WireGuard, Tailscale, or an SSH tunnel (
ssh -L 11434:localhost:11434 user@host). Then usehttp://localhost:11434/v1as if it were local. No public port at all. -
Reverse proxy with TLS and auth. Put nginx or Caddy in front, terminate HTTPS, and require a token (bearer header) or basic auth. Bind Ollama to localhost and let only the proxy talk to it. Then the base URL is
https://your-host/v1and you put the token in the API key field (for bearer) or in the URL/proxy config. -
Bind Ollama to the LAN only (
OLLAMA_HOST=0.0.0.0:11434) behind a firewall, for a trusted home network. Acceptable on an isolated LAN; never do this on a public IP without a proxy in front.
Do not expose Ollama's port directly to the public internet.
Running a model locally is more demanding than calling a cloud API: you need enough RAM (and ideally a GPU) to hold the model. The good news is that ReWrite's job (cleaning and lightly structuring a transcript) is undemanding, so small instruct-tuned models do it well. You do not need a 70B model to fix grammar and add headings.
Use 4-bit quantization (Ollama's default Q4_K_M) as a starting point and move to Q5_K_M only if quality falls short. RAM figures below are rough 4-bit estimates; a GPU with that much VRAM is faster, but CPU + system RAM works for short notes.
| Model | Params | ~RAM (4-bit) | Pull | Good for |
|---|---|---|---|---|
| Llama 3.2 1B | 1B | ~2-3 GB | ollama pull llama3.2:1b |
Lightest option; fine for cleanup on low-end laptops. |
| SmolLM2 1.7B | 1.7B | ~4 GB | ollama pull smollm2:1.7b |
Fast, tiny, good for short notes. |
| Phi-4-mini | 3.8B | ~3-4 GB | ollama pull phi4-mini |
Strong instruction-following in a small footprint; long context. |
| Llama 3.2 3B | 3B | ~6 GB | ollama pull llama3.2 |
A solid all-rounder; good default if unsure. |
| Gemma 3 4B | 4B | ~6 GB | ollama pull gemma3:4b |
Good instruction-following from Google. |
| Qwen3 4B | 4B | ~6 GB | ollama pull qwen3:4b |
Capable small model; strong at structure. |
| Qwen3 8B | 8B | ~8-10 GB | ollama pull qwen3:8b |
Step up when you have the headroom. |
| Ministral 8B | 8B | ~10 GB | ollama pull mistral:8b |
Competitive with larger models; needs more RAM. |
These are guidance, not gospel; model availability and tags move quickly, so check the Ollama library for current names. If a model feels too terse or ignores your template structure, step up a size or try Q5_K_M. (This list refreshes the older MachineLearningMastery "Top 7" roundup, which still cited Phi-3.5, Llama 3.2, Qwen 2.5, and Gemma 2.)
For a rule of thumb: 8 GB RAM comfortably runs a 3B-4B model; 16 GB opens up 7B-8B; below 8 GB, stick to 1B-2B.
-
Connection refused / network error: Ollama (or your server) is not running, or the host/port is wrong. Confirm
ollama serveis up and thathttp://localhost:11434/v1/modelsresponds in a browser orcurl. -
404 Not Found: almost always the
/v1path. The LLM base URL must end in/v1(Ollama) or the right version path; the adapter adds/chat/completions. A doubled/v1/v1also 404s. -
Model not found: the model id does not match a pulled model. Run
ollama listand use the exact name (including the tag, for examplellama3.2:1b). -
Works locally but not from another device: Ollama binds to localhost by default. Set
OLLAMA_HOST=0.0.0.0:11434to listen on the LAN (behind a firewall), or use a tunnel/proxy as above. -
CORS errors: ReWrite calls go through Obsidian's
requestUrl, which bypasses browser CORS, so this is rare. If a proxy enforces origins, setOLLAMA_ORIGINSappropriately on the server. - Very slow first response: the model is loading into memory on the first call. Subsequent calls are faster while it stays resident.
- Output cut off / truncated: lower the "Maximum note length" if your model has a small output limit, or pick a model with a larger context. Some local models silently truncate rather than erroring.
- Garbled or off-task output: the model is too small for the template's structure. Step up a size, or simplify the template prompt.
See also the general Troubleshooting page and Providers for the base-URL conventions.
These pages are generated from the wiki/ folder in the ReWrite-Voice-Notes repo. Edits made directly in the wiki are overwritten on the next sync. To fix or improve a page, edit the matching file in wiki/ and open a pull request.
Getting started
Reference
Self-hosting
Help