
Conversation

@noname22
Contributor

claude-code

Summary

This PR adds Anthropic Messages API compatibility to llama-server. The implementation converts Anthropic's format to the OpenAI-compatible internal format, reusing the existing inference pipeline.

Motivation

  • Enables llama.cpp to serve as a local/self-hosted alternative to Anthropic's Claude API
  • Allows Claude Code and other Anthropic-compatible clients to work with llama-server

Features Implemented

Endpoints:

  • POST /v1/messages - Chat completions with streaming support
  • POST /v1/messages/count_tokens - Token counting for prompts
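For reference, a minimal sketch of the request shapes these two endpoints accept, following Anthropic's published Messages API field names (the model name below is a placeholder, not something from this PR):

```python
import json

# POST /v1/messages - chat completion (Anthropic Messages API shape)
messages_request = {
    "model": "local-model",              # placeholder model name
    "max_tokens": 1024,
    "system": "You are a helpful assistant.",
    "messages": [
        {"role": "user", "content": "Hello!"},
    ],
}

# POST /v1/messages/count_tokens - token counting for the same prompt
count_tokens_request = {
    "model": "local-model",
    "system": "You are a helpful assistant.",
    "messages": messages_request["messages"],
}

print(json.dumps(messages_request, indent=2))
```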

Functionality:

  • Streaming with proper Anthropic SSE event types (message_start, content_block_delta, etc.)
  • Tool use (function calling) with tool_use/tool_result content blocks
  • Vision support with image content blocks (base64 and URL)
  • System prompts and multi-turn conversations
  • Extended thinking parameter support
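As the summary notes, the implementation works by translating Anthropic-format requests into the OpenAI-compatible format llama-server already handles. A rough sketch of that translation idea (not the actual code from this PR; only text content blocks are handled here):

```python
def anthropic_to_openai(req: dict) -> dict:
    """Translate an Anthropic Messages request into an OpenAI-style
    chat-completions request (simplified illustrative sketch)."""
    messages = []
    # Anthropic carries the system prompt as a top-level field;
    # OpenAI expects it as the first message in the list.
    if "system" in req:
        messages.append({"role": "system", "content": req["system"]})
    for msg in req.get("messages", []):
        content = msg["content"]
        if isinstance(content, list):
            # Flatten Anthropic content blocks; this sketch keeps
            # only "text" blocks (tool_use/image blocks omitted).
            content = "".join(
                block["text"] for block in content
                if block.get("type") == "text"
            )
        messages.append({"role": msg["role"], "content": content})
    return {
        "model": req.get("model", ""),
        "max_tokens": req.get("max_tokens"),
        "messages": messages,
        "stream": req.get("stream", False),
    }

converted = anthropic_to_openai({
    "model": "local-model",
    "max_tokens": 256,
    "system": "Be brief.",
    "messages": [
        {"role": "user", "content": [{"type": "text", "text": "Hi"}]},
    ],
})
print(converted["messages"])
```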

Testing

  • Tests in test_anthropic_api.py
  • Tests cover: basic messages, streaming, tools, vision, token counting, parameters, error handling, content block indices
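For context on the streaming tests: a successful streamed text response follows Anthropic's documented SSE event ordering. A small checker for that ordering (a hypothetical helper for illustration, not code from the PR):

```python
# Expected Anthropic SSE event ordering for a single text block,
# per Anthropic's streaming documentation.
EXPECTED_ORDER = [
    "message_start",
    "content_block_start",
    "content_block_delta",
    "content_block_stop",
    "message_delta",
    "message_stop",
]

def valid_stream(events: list[str]) -> bool:
    """Return True if the event names follow the expected order,
    allowing repeated consecutive content_block_delta events."""
    collapsed = [
        ev for i, ev in enumerate(events)
        if i == 0 or events[i - 1] != ev
    ]
    return collapsed == EXPECTED_ORDER

ok = valid_stream([
    "message_start", "content_block_start",
    "content_block_delta", "content_block_delta",
    "content_block_stop", "message_delta", "message_stop",
])
print(ok)
```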

@noname22
Contributor Author

New PR to allow maintainers to edit.
Old PR here: #17425

@github-actions github-actions bot added examples python python script changes server labels Nov 28, 2025
@noname22
Contributor Author

The RISCV test is getting

The self-hosted runner lost communication with the server. Verify the machine is running and has a healthy network connection. Anything in your workflow that terminates the runner process, starves it for CPU/Memory, or blocks its network access can cause this error.

I'm guessing it's not related to the PR? Any way to retry?

@ngxson
Collaborator

ngxson commented Nov 28, 2025

This PR can be merged once the server CI passes. The other CI jobs are not important.

@ngxson ngxson merged commit ddf9f94 into ggml-org:master Nov 28, 2025
65 of 69 checks passed
@ericcurtin
Collaborator

ericcurtin commented Nov 28, 2025

I stumbled across this as it hit conflicts with my PR. I am curious: what models does this work with? With sufficient hardware, is this capable of beating Claude's cloud models?

@noname22
Contributor Author

Technically it works with pretty much any model, but to get anywhere near Claude Sonnet you'd probably need a large, agentic model like MiniMax M2, Kimi K2, Qwen3 Coder 480B-A35B, etc.

That being said, I've had decent results for simple tasks with Qwen3 Coder 30B-A3B and gpt-oss-20b on a single 4090.

In my very subjective experience, the same models tend to perform a lot better with the Claude Code CLI app than with alternatives such as Open Code or gemini-cli and its clones, like Qwen3-Coder (the CLI app).

@ericcurtin
Collaborator

ericcurtin commented Nov 28, 2025

Interesting... If you want to take a quick peek, I fixed the conflicts here:

#17554

They weren't major conflicts; it was just moving code from one place to another.
