A basic llama.cpp wrapper to explore LLMs in an interactive chat session with channel support and reasoning.
Disclosure: coded partially using ChatGPT.
While this is 'pure', unopinionated Python, I've used uv's inline script metadata (a TOML block at the top of the code) to get the best of both worlds and declare dependencies. Run it wrapped in uv as intended (after installing uv, of course) using:
uv run llama-cpp-wrapper.py [options]
or run it any other way you'd like with your existing environment.
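For reference, the inline metadata block that uv reads looks roughly like this (a sketch only; the dependency list here assumes llama-cpp-python and may not match the script's actual header):

```python
# /// script
# requires-python = ">=3.10"
# dependencies = [
#     "llama-cpp-python",  # assumed dependency for this sketch; check the script header for the real list
# ]
# ///
```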
The impetus for this work was to let me run models locally that Ollama doesn't support. For example, while OpenAI's open-weights GPT model, gpt-oss, is available on Ollama (apparently via a partnership) as a relatively small 20B model, at 16-bit precision it is still too large to run on my Apple M1 at a tolerable speed. The same applies to lots of other models.
To get around this, you can quantize the weights from 16-bit down to 4-bit or 3-bit precision (a 4x or ~5.3x size reduction) to make such models usable on low-spec, CPU-only hardware. There are a number of good quantized derivatives of gpt-oss (and of many other models, of course) on Hugging Face to choose from, or you can quantize the model yourself.
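If you do quantize it yourself, llama.cpp ships a converter script and a quantize tool for this; script and binary names have changed between releases, so treat these two commands as a rough sketch rather than exact syntax (the input directory name is illustrative):

python convert_hf_to_gguf.py ./gpt-oss-20b --outfile gpt-oss-20b-f16.gguf
./llama-quantize gpt-oss-20b-f16.gguf gpt-oss-20b-Q4_K_M.gguf Q4_K_M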
However, Ollama doesn't support many GGML/GGUF models, including many of the gpt-oss quantized derivatives, so I created this project as a simple way to try them. You can download the models, almost always in GGUF format, and run them using:
uv run llama-cpp-wrapper.py -m your-downloaded-model.gguf
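For example, one way to fetch a GGUF file is with the huggingface-cli tool from huggingface_hub (the repo and file names below are placeholders to fill in yourself):

huggingface-cli download <org>/<gguf-repo> <model-file>.gguf --local-dir .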
There are probably better simple chat interfaces wrapping llama.cpp, but this one supports the features I want, particularly letting reasoning models show their "thinking" as a separate stream, colored similarly to Ollama's output, which can be enabled or disabled at will.
- primary: chat interface to try a GGML format (bin or GGUF) model
- supports command line options, especially the model file, system prompt (inline or from a text file), context window size, CPU threads to use, and various flags
- some options are exposed as interactive chat-session commands, including a `/help` chat command
- parses llama.cpp multi-channel output markup, with optional suppression of non-main output via the 'silence' option (see the parsing sketch after this list)
- each channel's output stream is separated and shown
- if silenced, a waiting spinner indicates that a channel stream is in progress until it closes
- suppresses llama.cpp's noisy model-loading output; you may wish to see this output the first time you run a model, via the `--no-quiet-llama` flag
- create tool integration from the `tool` channel: this is currently unsupported. You can see the tool channel output with simple output disabled, and it would be possible to loop it back in, with user approvals etc., without too much additional work; consider leveraging standards such as MCP for this
- use it as a starting point for integration into your own project
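As a point of reference for the channel handling, here is a minimal parsing sketch, assuming gpt-oss-style 'Harmony' markup with `<|channel|>NAME<|message|>...` markers; the exact tokens and channel names your model emits may differ, and the script's real parser works on a streamed response, so it will look different:

```python
import re

# Hypothetical sketch: split Harmony-style channel markup into per-channel text.
# The marker and channel names ("analysis", "final") follow the gpt-oss Harmony
# format and are assumptions; adjust them for whatever your model actually emits.
CHANNEL_RE = re.compile(
    r"<\|channel\|>(?P<name>\w+)<\|message\|>(?P<body>.*?)"
    r"(?=<\|channel\|>|<\|end\|>|<\|return\|>|$)",
    re.DOTALL,
)

def split_channels(raw: str) -> dict[str, str]:
    """Return {channel_name: concatenated text} for each channel found in raw output."""
    channels: dict[str, str] = {}
    for match in CHANNEL_RE.finditer(raw):
        name = match.group("name")
        channels[name] = channels.get(name, "") + match.group("body")
    return channels

if __name__ == "__main__":
    sample = (
        "<|channel|>analysis<|message|>Thinking about the question...<|end|>"
        "<|channel|>final<|message|>Here is the answer."
    )
    parsed = split_channels(sample)
    print(parsed.get("analysis", ""))  # the "thinking" stream, optionally silenced
    print(parsed.get("final", ""))     # the main answer shown to the user
```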