I have explored running language models locally using LM Studio and Ollama.
Under the hood, both of these tools use the Llama.cpp runtime.
I am now exploring using Llama.cpp directly to run local LLMs, primarily as a server for integration into applications and CLIs.
- Configure the Llama.cpp environment.
- Create a script to execute Llama.cpp with predefined parameters.
- Add support for multiple LLMs.
- Support per-model parameters, so settings can be tweaked based on each model's performance (a sketch follows this list).
- Integrate into OpenCode.
- Identify a method for verifying GPU offload.
- Apply AI-generated optimal llama-server parameter values.
- Design a method of benchmarking performance and automate repeatable tests (a benchmarking sketch also follows this list).
- Apply improvements to each model's execution using the benchmarking results.
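As a rough illustration of the per-model parameters idea, a batch script could select different llama-server settings for each model, along the lines of the sketch below. The model filenames, the CHOICE variable and the specific values are placeholders for illustration, not the ones this repository uses; only -m, -c, -ngl and --port are standard llama-server flags.

```bat
:: Hypothetical per-model settings; filenames and values are placeholders.
set MODEL_DIR=models

if "%CHOICE%"=="1" (
    set MODEL=%MODEL_DIR%\example-8b-q4_k_m.gguf
    set CTX=8192
    set NGL=99
) else (
    set MODEL=%MODEL_DIR%\example-3b-q4_k_m.gguf
    set CTX=16384
    set NGL=99
)

:: -m = model path, -c = context size, -ngl = layers offloaded to the GPU.
llama-server -m %MODEL% -c %CTX% -ngl %NGL% --port 8080
```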
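For the benchmarking item, Llama.cpp ships a llama-bench tool that measures prompt-processing and token-generation throughput. A minimal invocation might look like the following; the model filename is a placeholder and the prompt/generation sizes are example values only.

```bat
:: -p = prompt tokens, -n = generated tokens, -r = repetitions per test.
llama-bench -m models\example-8b-q4_k_m.gguf -p 512 -n 128 -ngl 99 -r 5
```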
No known defects.
GitHub Copilot was used to assist in the development of this software.
Note
Other operating systems and versions will work; where versions are specified, treat them as minimums.
Note
TechPowerUp's GPU-Z is optional. This application provides a simple method of verifying GPU offload.
A system capable of running Llama.cpp is required.
Details of my personal system are below.
Note
The hardware in use on my PC includes an Accelerated Processing Unit (APU), which combines the CPU and GPU on a single chip. Llama.cpp is focused on supporting a wide range of hardware. Performance will depend upon your hardware, the use of CPU vs. GPU, the models you choose to run and other operational factors.
Install Llama.cpp via Winget; no other configuration is needed.
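For reference, the Winget command is along the lines of the one below; the package name comes from the upstream Llama.cpp install instructions, so verify it with `winget search llama.cpp` if it has changed.

```bat
winget install llama.cpp
```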
Note
Works on my machine!
Clone the repository.
Download the supported models and place them within the models directory.
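A minimal sketch of these two steps; the repository URL, repository name and model filename are placeholders to be substituted with the real values.

```bat
:: <repository-url>, <repository-name> and the GGUF filename are placeholders.
git clone <repository-url>
cd <repository-name>

:: Place the downloaded GGUF files in the models directory.
move %USERPROFILE%\Downloads\example-model-q4_k_m.gguf models\
```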
Note
The repository shows which models I am currently experimenting with. The script currently hardcodes their values.
Scripts can be executed within the VS Code terminal window or via any other supported terminal, e.g. Windows Terminal.
Note
The scripts are opinionated; they are hardcoded to use Windows Terminal when launching new Llama.cpp servers.
- Asks the user on each execution whether they wish to update Llama.cpp.
- Asks the user which model they wish to run.
- Runs the model in a new Windows Terminal window (see the sketch below).
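The exact command in the scripts may differ, but launching a server in a new Windows Terminal window can be done roughly like this; the model path, window title and port are illustrative.

```bat
:: wt new-tab opens a Windows Terminal tab/window running the given command line.
wt new-tab --title "Llama.cpp server" cmd /k llama-server -m models\example-model.gguf -ngl 99 --port 8080
```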
Run start-llama-cpp.bat in your preferred terminal.
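For example, from the repository root, then checking that the server is up. llama-server exposes an HTTP API with an OpenAI-compatible surface; the commands below assume the scripts keep llama-server's default port of 8080.

```bat
.\start-llama-cpp.bat

:: Once a model is loaded, the server should respond on its HTTP endpoints.
curl http://localhost:8080/health
curl http://localhost:8080/v1/models
```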
Run GPU-Z to verify GPU offload:
Thanks to Nico Domino, who shared his GLM-4.7-Flash Strix Halo Docker setup; I used it as a basis for running my own local Llama.cpp server.
Thanks also to the open source contributors of Llama.cpp.
This repository was created primarily for my own exploration of the technologies involved.
I have selected an appropriate license using this tool.
This software is licensed under the MIT license.
More detailed information can be found in the documentation:
