A basic llama.cpp wrapper to explore LLMs in an interactive chat session with channel support and reasoning.
Disclosure: coded partially using ChatGPT.
While this is 'pure', unopinionated Python, I've used uv's inline script metadata (a TOML block at the top of the code) to get the best of both worlds and declare dependencies. Run it wrapped in uv as intended (after installing uv, of course) using:
uv run llama-cpp-wrapper.py [options]
or run it any other way you'd like with your existing environment.
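For reference, the inline metadata block that uv reads looks roughly like this (a sketch only; the dependency list here assumes llama-cpp-python and may not match the script's actual header):

```python
# /// script
# requires-python = ">=3.10"
# dependencies = [
#     "llama-cpp-python",  # assumed dependency for this sketch; check the script header for the real list
# ]
# ///
```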
The impetus for this work was to let me run models locally that Ollama doesn't support. For example, while OpenAI's open-weights GPT model, gpt-oss, is available on Ollama (apparently via a partnership) as a relatively small 20B model, at 16-bit precision it is still too large to run on my Apple M1 at a tolerable speed. The same applies to lots of other models.
To get around this, you can quantize the weights from 16-bit down to 4-bit or 3-bit precision (a 4x or ~5.3x size reduction) to make such models usable on low-spec, CPU-only hardware. There are a number of good quantized derivatives of gpt-oss (and of many other models, of course) on Hugging Face to choose from, or you can quantize the model yourself.
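If you do quantize it yourself, llama.cpp ships a converter script and a quantize tool for this; script and binary names have changed between releases, so treat these two commands as a rough sketch rather than exact syntax (the input directory name is illustrative):

python convert_hf_to_gguf.py ./gpt-oss-20b --outfile gpt-oss-20b-f16.gguf
./llama-quantize gpt-oss-20b-f16.gguf gpt-oss-20b-Q4_K_M.gguf Q4_K_M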
However, Ollama doesn't support many GGML/GGUF models, including many of the gpt-oss quantized derivatives, so I created this project as a simple way to try them. You can download the models, almost always in GGUF format, and run them using:
uv run llama-cpp-wrapper.py -m your-downloaded-model.gguf
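For example, one way to fetch a GGUF file is with the huggingface-cli tool from huggingface_hub (the repo and file names below are placeholders to fill in yourself):

huggingface-cli download <org>/<gguf-repo> <model-file>.gguf --local-dir .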
There are probably better simple chat interfaces wrapping llama.cpp, but this one supports the features I want, particularly letting reasoning models show their "thinking" as a separate stream, colored similarly to Ollama's output, which can be enabled or disabled at will.
- primary: chat interface to try a GGML format (bin or GGUF) model
- supports command line options, especially the model file, system prompt (inline or from a text file), context window size, CPU threads to use, and various flags
- some options are exposed as interactive chat-session commands, including a `/help` chat command
- parses llama.cpp multi-channel output markup, with optional suppression of non-main output via the 'silence' option (see the parsing sketch after this list)
- each channel's output stream is separated and shown
- if silenced, a waiting spinner indicates that a channel stream is in progress until it closes
- suppresses llama.cpp's noisy model-loading output; you may wish to see this output the first time you run a model, via the `--no-quiet-llama` flag
- create tool integration from the `tool` channel: this is currently unsupported. You can see the tool channel output with simple output disabled, and it would be possible to loop it back in, with user approvals etc., without too much additional work; consider leveraging standards such as MCP for this
- use it as a starting point for integration into your own project
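As a point of reference for the channel handling, here is a minimal parsing sketch, assuming gpt-oss-style 'Harmony' markup with `<|channel|>NAME<|message|>...` markers; the exact tokens and channel names your model emits may differ, and the script's real parser works on a streamed response, so it will look different:

```python
import re

# Hypothetical sketch: split Harmony-style channel markup into per-channel text.
# The marker and channel names ("analysis", "final") follow the gpt-oss Harmony
# format and are assumptions; adjust them for whatever your model actually emits.
CHANNEL_RE = re.compile(
    r"<\|channel\|>(?P<name>\w+)<\|message\|>(?P<body>.*?)"
    r"(?=<\|channel\|>|<\|end\|>|<\|return\|>|$)",
    re.DOTALL,
)

def split_channels(raw: str) -> dict[str, str]:
    """Return {channel_name: concatenated text} for each channel found in raw output."""
    channels: dict[str, str] = {}
    for match in CHANNEL_RE.finditer(raw):
        name = match.group("name")
        channels[name] = channels.get(name, "") + match.group("body")
    return channels

if __name__ == "__main__":
    sample = (
        "<|channel|>analysis<|message|>Thinking about the question...<|end|>"
        "<|channel|>final<|message|>Here is the answer."
    )
    parsed = split_channels(sample)
    print(parsed.get("analysis", ""))  # the "thinking" stream, optionally silenced
    print(parsed.get("final", ""))     # the main answer shown to the user
```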