llama-server: fix duplicate HTTP headers in multiple models mode #17698

ServeurpersoCom · 2025-12-02T18:10:50Z

Approach: Filter at source

This patch filters headers before forwarding them to avoid duplication.

Why headers get duplicated:
When the router proxies child process responses, both the router (via
set_default_headers) and the child send the same headers (Server,
Transfer-Encoding, Keep-Alive, CORS). The proxy was forwarding everything,
resulting in duplicates.

Solution:

Skip headers that will be added by the router or httplib:
- server, transfer-encoding, keep-alive (duplicated by router defaults)
- access-control-* (CORS headers injected by router)
Handle Content-Type separately via msg_t.content_type to avoid duplication
when httplib calls set_chunked_content_provider() or set_content().
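The filtering rule above can be sketched roughly as follows. This is a minimal illustration, not the code merged in this PR: `to_lower_ascii`, `should_forward_header`, `proxied_response` and `filter_child_headers` are hypothetical names, and the child headers are modelled as a plain `std::multimap` rather than the server's real types.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <map>
#include <string>

// Hypothetical helper: lower-case an ASCII header name so comparisons are case-insensitive.
inline std::string to_lower_ascii(std::string s) {
    std::transform(s.begin(), s.end(), s.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    return s;
}

// Hypothetical helper: true if the proxy should pass this child header through unchanged.
inline bool should_forward_header(const std::string & name) {
    const std::string key = to_lower_ascii(name);
    if (key == "server" || key == "transfer-encoding" || key == "keep-alive") {
        return false; // added again by the router defaults or by the HTTP library
    }
    if (key == "content-type") {
        return false; // carried separately, see proxied_response::content_type
    }
    if (key.rfind("access-control-", 0) == 0) {
        return false; // CORS headers are injected by the router
    }
    return true;
}

// Hypothetical container for what the router keeps from the child response.
struct proxied_response {
    std::string content_type;
    std::map<std::string, std::string> headers;
};

// Split the child headers into "content type" and "everything worth forwarding".
inline proxied_response filter_child_headers(const std::multimap<std::string, std::string> & child_headers) {
    proxied_response out;
    for (const auto & kv : child_headers) {
        if (to_lower_ascii(kv.first) == "content-type") {
            out.content_type = kv.second;
            continue;
        }
        if (should_forward_header(kv.first)) {
            out.headers[kv.first] = kv.second;
        }
    }
    return out;
}

// Example: only the custom header survives, Content-Type is kept on the side.
inline void example_filter_usage() {
    const std::multimap<std::string, std::string> child = {
        {"Server", "llama.cpp"},
        {"Content-Type", "application/json"},
        {"X-Example", "1"},
    };
    const proxied_response res = filter_child_headers(child);
    assert(res.headers.size() == 1);
    assert(res.content_type == "application/json");
}
```

Keeping the check as a few direct comparisons plus one prefix test avoids a lookup table for such a short list of names.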

Tested with:

curl -v http://127.0.0.1:8082/v1/chat/completions   -H "Content-Type: application/json"   -d '{
    "model": "Codestral-22B-v0.1-i1-GGUF",
    "messages": [{"role": "user", "content": "hi"}],
    "max_tokens": 5
  }'
*   Trying 127.0.0.1:8082...
* Connected to 127.0.0.1 (127.0.0.1) port 8082 (#0)
> POST /v1/chat/completions HTTP/1.1
> Host: 127.0.0.1:8082
> User-Agent: curl/7.88.1
> Accept: */*
> Content-Type: application/json
> Content-Length: 121
>
< HTTP/1.1 200 OK
< Keep-Alive: timeout=5, max=100
< Transfer-Encoding: chunked
< Access-Control-Allow-Origin:
< Server: llama.cpp
< Content-Length: 591
< Content-Type: application/json; charset=utf-8
< Connection: close
<
* Closing connection 0
{"choices":[{"finish_reason":"length","index":0,"message":{"role":"assistant","content":" Hello! I'm"}}],"created":1764699011,"model":"Codestral-22B-v0.1-i1-GGUF","system_fingerprint":"b7274-f57a165eb","object":"chat.completion","usage":{"completion_tokens":5,"prompt_tokens":5,"total_tokens":10},"id":"chatcmpl-6WGgrUU4EWxLfKAXfzUq9hVDLrbybfJm","timings":{"cache_n":4,"prompt_n":1,"prompt_ms":21.409,"prompt_per_token_ms":21.409,"prompt_per_second":46.7093278527722,"predicted_n":5,"predicted_ms":51.717,"predicted_per_token_ms":10.343399999999999,"predicted_per_second":96.68000850784075}}

Before: duplicate Server, Transfer-Encoding, Keep-Alive, Access-Control-Allow-Origin, Content-Type
After: all headers appear exactly once
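For anyone who wants to automate the before/after comparison, here is a small sketch of a duplicate-header check over a raw response header block. It is not part of this PR; `duplicated_headers` is a hypothetical helper, and it assumes one `Name: value` pair per line with the curl `< ` prefixes already stripped.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <cstddef>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: returns the lower-cased header names that occur more
// than once in a raw header block, in alphabetical order.
inline std::vector<std::string> duplicated_headers(const std::string & raw) {
    std::map<std::string, int> counts;
    std::istringstream in(raw);
    std::string line;
    while (std::getline(in, line)) {
        const std::size_t colon = line.find(':');
        if (colon == std::string::npos) {
            continue; // status line or blank line, not a header
        }
        std::string name = line.substr(0, colon);
        std::transform(name.begin(), name.end(), name.begin(),
                       [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
        ++counts[name];
    }
    std::vector<std::string> dups;
    for (const auto & kv : counts) {
        if (kv.second > 1) {
            dups.push_back(kv.first);
        }
    }
    return dups;
}

// Example: a response with two Server headers reports exactly one duplicate.
inline void example_duplicate_check() {
    const std::string raw =
        "HTTP/1.1 200 OK\r\n"
        "Server: llama.cpp\r\n"
        "Server: llama.cpp\r\n"
        "Content-Type: application/json\r\n"
        "\r\n";
    const std::vector<std::string> dups = duplicated_headers(raw);
    assert(dups.size() == 1 && dups[0] == "server");
}
```

Some headers, such as Set-Cookie, may legitimately appear more than once, so a real check would need an allow-list for those.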

Fixes #17693

ServeurpersoCom · 2025-12-02T18:16:02Z

Note: I first tried an is_proxied flag approach, but it required more code

struct server_http_res {
     int status = 200;
     std::string data;
     std::map<std::string, std::string> headers;
+    bool is_proxied = false;

with logic split between modules. Filtering at source is simpler.

ngxson

looks good overall! would appreciate it if you could address some small comments

tools/server/server-models.cpp

arkamar · 2025-12-02T18:36:47Z

I just want to confirm that this PR solves my issue, nginx errors are gone and llm works as well.

ngxson

nice, thanks! (merging once the CI passes)

llama-server: fix duplicate HTTP headers in multiple models mode (ggm…

eee7d59

…l-org#17693)

ServeurpersoCom requested review from ggerganov and ngxson as code owners December 2, 2025 18:10

ngxson reviewed Dec 2, 2025

tools/server/server-models.cpp (outdated, resolved)

tools/server/server-models.cpp (outdated, resolved)

llama-server: address review feedback from ngxson

d76c7c2

- restrict scope of header after std::move
- simplify header check (remove unordered_set)

ngxson approved these changes Dec 2, 2025

github-actions bot added examples server labels Dec 2, 2025

EZForever mentioned this pull request Dec 3, 2025

Misc. bug: Server multi-model mode add invalid Transfer-Encoding: chunked header to response #17710

Open

ngxson merged commit 5ceed62 into ggml-org:master Dec 3, 2025
65 of 74 checks passed
