This project provides an Ollama API-compatible server that uses the llama-cpp-python library to run local LLM inference. It allows you to use your own GGUF models with an API that's compatible with Ollama's endpoints, making it easy to integrate with existing tools and applications designed to work with Ollama.
## Features

- Ollama API Compatibility: Implements endpoints that match Ollama's API structure
- Local Inference: Uses `llama-cpp-python` to run inference on local GGUF model files
- Model Management: Implements a model caching system to avoid reloading models
- Configurable Parameters: Supports various inference parameters (temperature, max tokens, etc.)
- Web UI: Includes a simple web interface for chatting with the model and generating text
## Prerequisites

- Python 3.9+ with SSL support
- A GGUF model file (e.g., Llama 3.2)
## Setup (macOS)

1. Install Python with SSL support:

   Using Homebrew (recommended):

   ```bash
   brew install python
   ```

   Or download it from the official Python website.

2. Create a virtual environment:

   ```bash
   python3 -m venv venv
   source venv/bin/activate
   ```

3. Install required packages:

   ```bash
   pip install fastapi uvicorn llama-cpp-python pydantic
   ```

4. Update the model path:

   Edit the `ollama-api-compatible.py` file and update the `MODEL_PATH` variable to point to your GGUF model file:

   ```python
   MODEL_PATH = "/path/to/your/model.gguf"
   ```

5. Run the server:

   ```bash
   python ollama-api-compatible.py
   ```

   The server will run on `http://127.0.0.1:11435` by default.
## Setup (Windows)

1. Install Python with SSL support:

   Download and install Python from the official Python website. During installation, make sure to check the box that says "Add Python to PATH".

2. Create a virtual environment:

   ```bash
   python -m venv venv
   venv\Scripts\activate
   ```

3. Install required packages:

   ```bash
   pip install fastapi uvicorn llama-cpp-python pydantic
   ```

4. Update the model path:

   Edit the `ollama-api-compatible.py` file and update the `MODEL_PATH` variable to point to your GGUF model file:

   ```python
   MODEL_PATH = "C:\\path\\to\\your\\model.gguf"
   ```

5. Run the server:

   ```bash
   python ollama-api-compatible.py
   ```

   The server will run on `http://127.0.0.1:11435` by default.
## API Endpoints

The server implements the following Ollama-compatible endpoints:

- `GET /api/tags`: List available models
- `POST /api/generate`: Generate text completions
- `POST /api/chat`: Handle chat-based interactions
- `GET /api/version`: Get the server version
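When `stream` is true, Ollama's generate endpoint returns newline-delimited JSON chunks, each carrying a piece of the output in its `response` field, with `done: true` on the final chunk. A sketch of assembling such a stream, assuming this server mirrors Ollama's streaming format:

```python
import json

def assemble_stream(lines):
    # Concatenate the "response" pieces from newline-delimited JSON
    # chunks, stopping at the chunk marked done.
    text = []
    for line in lines:
        chunk = json.loads(line)
        text.append(chunk.get("response", ""))
        if chunk.get("done"):
            break
    return "".join(text)
```

With the `requests` library, you would feed this function `response.iter_lines(decode_unicode=True)` from a request made with `"stream": true` and `stream=True`.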
## Web UI

The project includes a simple web interface for interacting with your models:

- Access: Simply navigate to `http://localhost:11435/` in your browser after starting the server
- Features:
  - Chat Tab: Have conversational interactions with the model
  - Generate Tab: Create text completions with adjustable parameters
  - Model Info: View information about the loaded model

The web UI automatically connects to the API endpoints and provides a user-friendly way to interact with your models without needing to use command-line tools or write code.
## Usage Examples

### curl

List available models:

```bash
curl -s http://localhost:11435/api/tags
```

Generate a text completion:

```bash
curl -s -X POST http://localhost:11435/api/generate \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.2",
    "prompt": "Hello, how are you today?",
    "stream": false
  }'
```

Chat with the model:

```bash
curl -s -X POST http://localhost:11435/api/chat \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.2",
    "messages": [
      {"role": "user", "content": "What are the three laws of robotics?"}
    ],
    "stream": false
  }'
```

### Python

```python
import requests

# For text generation
def generate_text(prompt, model="llama3.2"):
    response = requests.post(
        "http://localhost:11435/api/generate",
        json={
            "model": model,
            "prompt": prompt,
            "stream": False
        }
    )
    return response.json()["response"]

# For chat interaction
def chat(messages, model="llama3.2"):
    response = requests.post(
        "http://localhost:11435/api/chat",
        json={
            "model": model,
            "messages": messages,
            "stream": False
        }
    )
    return response.json()["message"]["content"]

# Example usage
if __name__ == "__main__":
    # Text generation
    result = generate_text("Explain quantum computing in simple terms.")
    print(f"Generated text: {result}")

    # Chat interaction
    messages = [
        {"role": "user", "content": "What is the capital of France?"}
    ]
    result = chat(messages)
    print(f"Chat response: {result}")
```

### JavaScript

```javascript
// Using fetch API
async function generateText(prompt, model = "llama3.2") {
  const response = await fetch("http://localhost:11435/api/generate", {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      model: model,
      prompt: prompt,
      stream: false,
    }),
  });
  const data = await response.json();
  return data.response;
}

async function chat(messages, model = "llama3.2") {
  const response = await fetch("http://localhost:11435/api/chat", {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      model: model,
      messages: messages,
      stream: false,
    }),
  });
  const data = await response.json();
  return data.message.content;
}

// Example usage
async function example() {
  // Text generation
  const generatedText = await generateText("Write a short poem about coding.");
  console.log("Generated text:", generatedText);

  // Chat interaction
  const chatResponse = await chat([
    { role: "user", content: "Explain how to make a sandwich." }
  ]);
  console.log("Chat response:", chatResponse);
}

example();
```

## Configuration

You can modify the following parameters in the code to customize the server:
- Port: Change the port number in the `uvicorn.run()` call
- Model Parameters: Adjust `n_ctx`, `n_gpu_layers`, and other parameters in the `get_or_load_model()` function
- Response Format: Customize the response format in the API endpoint handlers
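As an illustration, the relevant spots in the server code look roughly like this; the exact call sites in `ollama-api-compatible.py` may differ, and the parameter values shown are placeholders, not recommendations:

```python
# Inside get_or_load_model(): model-loading parameters (illustrative values).
model = Llama(
    model_path=MODEL_PATH,
    n_ctx=4096,        # context window size, in tokens
    n_gpu_layers=0,    # raise to offload that many layers to the GPU
)

# At the bottom of ollama-api-compatible.py: change the port here.
uvicorn.run(app, host="127.0.0.1", port=11435)
```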
## Troubleshooting

If you encounter SSL errors when installing packages with pip, try:

```bash
pip install --trusted-host pypi.org --trusted-host files.pythonhosted.org <package-name>
```

- Ensure the model path is correct and the file exists
- Check that you have sufficient RAM to load the model
- For large models, consider enabling GPU acceleration by setting `n_gpu_layers` to a higher value
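To confirm that your Python build actually includes SSL support (a separate issue from pip's trusted-host workaround), you can check directly; if the import fails, reinstall Python as described in the setup steps:

```shell
python3 -c "import ssl; print(ssl.OPENSSL_VERSION)"
```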
## License

This project is open source and available under the MIT License.