I have explored running language models locally using LM Studio and Ollama.
Under the hood, both of these tools use the Llama.cpp runtime.
I am now exploring using Llama.cpp directly to run local LLMs, primarily as a server for integration into applications and CLIs.
- Configure the Llama.cpp environment.
- Create a script to execute Llama.cpp with predefined parameters.
- Add support for multiple LLMs.
- Support per-model parameters, so settings can be tweaked based on each model's performance (a sketch follows this list).
- Integrate into OpenCode.
- Identify a method for verifying GPU offload.
- Apply AI-generated optimal llama-server parameter values.
- Design a method of benchmarking performance and automate repeatable tests (a benchmarking sketch also follows this list).
- Apply improvements to each model's execution using the benchmarking results.
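As a rough illustration of the per-model parameters idea, a batch script could select different llama-server settings for each model, along the lines of the sketch below. The model filenames, the CHOICE variable and the specific values are placeholders for illustration, not the ones this repository uses; only -m, -c, -ngl and --port are standard llama-server flags.

```bat
:: Hypothetical per-model settings; filenames and values are placeholders.
set MODEL_DIR=models

if "%CHOICE%"=="1" (
    set MODEL=%MODEL_DIR%\example-8b-q4_k_m.gguf
    set CTX=8192
    set NGL=99
) else (
    set MODEL=%MODEL_DIR%\example-3b-q4_k_m.gguf
    set CTX=16384
    set NGL=99
)

:: -m = model path, -c = context size, -ngl = layers offloaded to the GPU.
llama-server -m %MODEL% -c %CTX% -ngl %NGL% --port 8080
```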
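For the benchmarking item, Llama.cpp ships a llama-bench tool that measures prompt-processing and token-generation throughput. A minimal invocation might look like the following; the model filename is a placeholder and the prompt/generation sizes are example values only.

```bat
:: -p = prompt tokens, -n = generated tokens, -r = repetitions per test.
llama-bench -m models\example-8b-q4_k_m.gguf -p 512 -n 128 -ngl 99 -r 5
```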
No known defects.
GitHub Copilot was used to assist in the development of this software.
Note
Other operating systems and versions will work; where versions are specified, treat them as minimums.
Note
TechPowerUp's GPU-Z is optional. This application provides a simple method of verifying GPU offload.
A system capable of running Llama.cpp is required.
Details of my personal system are below.
Note
The hardware in use on my PC includes an Accelerated Processing Unit (APU), which combines the CPU and GPU on a single chip. Llama.cpp is focused on supporting a wide range of hardware. Performance will depend upon your hardware, the use of CPU vs. GPU, the models you choose to run and other operational factors.
Install Llama.cpp via Winget; no other configuration is needed.
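For reference, the Winget command is along the lines of the one below; the package name comes from the upstream Llama.cpp install instructions, so verify it with `winget search llama.cpp` if it has changed.

```bat
winget install llama.cpp
```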
Note
Works on my machine!
Clone the repository.
Download the supported models and place them within the models directory.
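A minimal sketch of these two steps; the repository URL, repository name and model filename are placeholders to be substituted with the real values.

```bat
:: <repository-url>, <repository-name> and the GGUF filename are placeholders.
git clone <repository-url>
cd <repository-name>

:: Place the downloaded GGUF files in the models directory.
move %USERPROFILE%\Downloads\example-model-q4_k_m.gguf models\
```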
Note
The repository shows which models I am currently experimenting with. The script currently hardcodes their values.
Scripts can be executed within the VS Code terminal window or via any other supported terminal, e.g. Windows Terminal.
Note
The scripts are opinionated; they are hardcoded to use Windows Terminal when launching new Llama.cpp servers.
- Asks the user on each execution whether they wish to update Llama.cpp.
- Asks the user which model they wish to run.
- Runs the model in a new Windows Terminal window (see the sketch below).
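The exact command in the scripts may differ, but launching a server in a new Windows Terminal window can be done roughly like this; the model path, window title and port are illustrative.

```bat
:: wt new-tab opens a Windows Terminal tab/window running the given command line.
wt new-tab --title "Llama.cpp server" cmd /k llama-server -m models\example-model.gguf -ngl 99 --port 8080
```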
Run start-llama-cpp.bat in your preferred terminal.
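For example, from the repository root, then checking that the server is up. llama-server exposes an HTTP API with an OpenAI-compatible surface; the commands below assume the scripts keep llama-server's default port of 8080.

```bat
.\start-llama-cpp.bat

:: Once a model is loaded, the server should respond on its HTTP endpoints.
curl http://localhost:8080/health
curl http://localhost:8080/v1/models
```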
Run GPU-Z to verify GPU offload:
Thanks to Nico Domino, who shared his GLM-4.7-Flash Strix Halo Docker setup; I used it as a basis for running my own local Llama.cpp server.
Thanks also to the open source contributors of Llama.cpp.
This repository was created primarily for my own exploration of the technologies involved.
I have selected an appropriate license using this tool.
This software is licensed under the MIT license.
More detailed information can be found in the documentation:
