llama.cpp Python Standalone

A simple Python wrapper for running llama.cpp server binaries directly, bypassing outdated Python bindings.

The Problem

llama-cpp-python often lags behind the official llama.cpp C++ implementation by weeks or months. New model architectures (like Qwen3-VL, Gemma3, etc.) are already supported in llama.cpp but fail in the Python bindings with:

unknown model architecture: 'qwen3vl'

The Solution

Use the official llama.cpp server binary directly from Python. This wrapper provides:

  • OpenAI-compatible API endpoints
  • Simple Python interface
  • Full control over server lifecycle
  • Support for ANY model architecture llama.cpp supports

Quick Start

1. Build llama.cpp (one-time setup)

# Run the build script
./scripts/build_llama_cpp.sh

# Or manually:
git clone https://github.com/ggerganov/llama.cpp /tmp/llama-cpp-standalone
cd /tmp/llama-cpp-standalone
mkdir build && cd build

# With CUDA (recommended)
cmake .. -DGGML_CUDA=ON -DCMAKE_CUDA_COMPILER=/usr/local/cuda/bin/nvcc -DBUILD_SHARED_LIBS=ON
cmake --build . --config Release -j

# Without CUDA (CPU only)
cmake .. -DBUILD_SHARED_LIBS=ON
cmake --build . --config Release -j
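
After building, the server binary typically ends up in build/bin/llama-server. As a quick sanity check you can launch it once from Python before wiring it into the wrapper; the path below assumes the manual /tmp build shown above, so adjust it for your setup:

# Verify the freshly built binary runs before using it with the wrapper.
import subprocess

binary = "/tmp/llama-cpp-standalone/build/bin/llama-server"  # adjust to your build path
result = subprocess.run([binary, "--help"], capture_output=True, text=True)
print("llama-server OK" if result.returncode == 0 else result.stderr)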

2. Use in Python

from llama_cpp_standalone import LlamaCppServer

# Start server
server = LlamaCppServer("/path/to/llama-server")
server.start(
    model_path="/path/to/model.gguf",
    port=8080,
    n_gpu_layers=-1,  # Full GPU offload
    n_ctx=4096
)

# Use with OpenAI Python client
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="local",
    messages=[{"role": "user", "content": "Hello!"}]
)

print(response.choices[0].message.content)

# Stop when done
server.stop()
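
Because the wrapper manages a real server process, it is good practice to guard start()/stop() so the process is not left running if your code raises. A minimal sketch using only the calls shown above:

server = LlamaCppServer("/path/to/llama-server")
server.start(model_path="/path/to/model.gguf", port=8080, n_gpu_layers=-1, n_ctx=4096)
try:
    # ... issue requests via the OpenAI client as above ...
    pass
finally:
    server.stop()  # always free the port and GPU memory, even on errors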

Features

  • ✅ Works with ANY model architecture llama.cpp supports
  • ✅ OpenAI-compatible API (drop-in replacement)
  • ✅ Full CUDA/Metal/OpenCL support
  • ✅ Multimodal support (vision models with mmproj)
  • ✅ Context management and health checks (see the readiness check below)
  • ✅ Automatic process lifecycle management
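
The llama.cpp server exposes a /health endpoint that returns HTTP 200 once the model is loaded. If you prefer not to rely on the wrapper's own helpers, a minimal readiness poll with the requests library (already listed under Requirements) looks like this:

import time
import requests

def wait_until_ready(port=8080, timeout=60):
    """Poll the llama.cpp server's /health endpoint until it reports ready."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            if requests.get(f"http://localhost:{port}/health", timeout=2).status_code == 200:
                return True
        except requests.exceptions.ConnectionError:
            pass  # server process is still starting or loading the model
        time.sleep(1)
    return False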

Advanced Usage

Vision Models (Qwen-VL, LLaVA, etc.)

server.start(
    model_path="/path/to/qwen-vl.gguf",
    mmproj_path="/path/to/mmproj.gguf",  # Vision projector
    port=8080,
    n_gpu_layers=-1
)
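
Once the server is running with a vision projector, images can be sent through the same OpenAI-compatible chat endpoint as base64 data URIs. A sketch (cat.jpg is a placeholder for any local image):

import base64
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

with open("cat.jpg", "rb") as f:  # placeholder: any local image file
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="local",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
        ],
    }],
)
print(response.choices[0].message.content)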

Multiple GPUs

import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"

server.start(
    model_path="/path/to/model.gguf",
    port=8080
)
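
CUDA_VISIBLE_DEVICES controls which GPUs the server can see. To control how the model is divided between them, llama-server also accepts a --tensor-split flag with relative per-GPU weights, which can be passed through extra_args (described in the next section):

# Split the model roughly 3:1 across the two visible GPUs.
server.start(
    model_path="/path/to/model.gguf",
    port=8080,
    n_gpu_layers=-1,
    extra_args=["--tensor-split", "3,1"]
)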

Custom Server Arguments

server.start(
    model_path="/path/to/model.gguf",
    port=8080,
    extra_args=["--rope-freq-base", "10000", "--rope-freq-scale", "0.5"]
)

Requirements

  • Python 3.8+
  • requests library (pip install requests)
  • Built llama.cpp server binary

Comparison with llama-cpp-python

| Feature       | llama-cpp-python     | This Wrapper            |
|---------------|----------------------|-------------------------|
| Model support | Lags behind C++      | Same day as llama.cpp   |
| Installation  | pip (large download) | Just copy files         |
| GPU support   | Version dependent    | Full llama.cpp support  |
| Architecture  | Python bindings      | Direct binary           |
| Updates       | Wait for PyPI        | Build llama.cpp anytime |

Troubleshooting

Server won't start

  • Check binary path: which llama-server
  • Test manually: /path/to/llama-server --help
  • Check model path exists
  • Verify CUDA setup (if using GPU)

"Unknown model architecture" error

  • Rebuild llama.cpp from latest master
  • Verify the model file is valid GGUF format (see the quick check below)
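
A valid GGUF file starts with the four ASCII bytes "GGUF". A quick check to rule out a truncated or mislabeled download:

def looks_like_gguf(path):
    """Return True if the file starts with the GGUF magic bytes."""
    with open(path, "rb") as f:
        return f.read(4) == b"GGUF"

print(looks_like_gguf("/path/to/model.gguf"))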

Port already in use

  • Use a different port in start() (or let the OS pick one, as sketched below)
  • Or kill the existing server: lsof -ti:8080 | xargs kill
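
If you would rather not hard-code a port, you can let the operating system pick a free one and pass it to start():

import socket

def find_free_port():
    """Ask the OS for an unused TCP port."""
    with socket.socket() as s:
        s.bind(("", 0))  # port 0 = let the OS choose
        return s.getsockname()[1]

server.start(model_path="/path/to/model.gguf", port=find_free_port())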

Contributing

We welcome contributions! Please:

  1. Test with different model architectures
  2. Add support for new llama.cpp features
  3. Improve documentation
  4. Share your use cases

License

MIT License - See LICENSE file

Author

Created by Gregor Koch (@cronos3k)

Acknowledgments

  • llama.cpp - The amazing C++ implementation
  • Community members who identified the Python bindings lag issue

Why This Exists

The llama.cpp project moves fast, and its Python bindings take time to catch up. This wrapper bridges the gap, letting you use cutting-edge features immediately while keeping a clean Python interface.

Perfect for:

  • Testing new model architectures
  • Production deployments needing stability
  • Projects requiring specific llama.cpp versions
  • Anyone frustrated with binding version mismatches
