
Misc. bug: Server multi-model mode adds invalid Transfer-Encoding: chunked header to response #17710

@EZForever

Description

Name and Version

Tested on latest build (b7240). Also tried #17698 but it does not seem to fix this issue.

C:\llama-b7240-bin-win-cpu-x64>llama-server.exe --version
load_backend: loaded RPC backend from C:\llama-b7240-bin-win-cpu-x64\ggml-rpc.dll
load_backend: loaded CPU backend from C:\llama-b7240-bin-win-cpu-x64\ggml-cpu-haswell.dll
version: 7240 (61bde8e21)
built with clang version 19.1.5 for x86_64-pc-windows-msvc

Operating systems

Linux, Windows

Which llama.cpp modules do you know to be affected?

llama-server

Command line

llama-server.exe --models-dir "C:\models"

curl -v http://127.0.0.1:8080/completion -d '{"model":"stories260K","prompt":"hello","n_predict":10}'

Problem description & steps to reproduce

When running in multi-model mode, the proxy server adds an unnecessary and invalid Transfer-Encoding: chunked HTTP header to certain endpoints' responses, even though the mutually exclusive Content-Length header is already present. This does not affect the WebUI or other ordinary usage, but it breaks reverse proxies with strict header checking: Nginx, for example, errors out on every completion request with upstream sent "Content-Length" and "Transfer-Encoding" headers at the same time while reading response header from upstream.
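Note that in the b7240 log below several headers also appear twice (Server, Access-Control-Allow-Origin, Content-Type), which suggests the proxy copies the upstream response headers on top of its own defaults instead of filtering them. For illustration only, here is a minimal sketch of the kind of header filtering an HTTP intermediary is expected to do, assuming the headers are kept in a std::multimap; the names and helpers are hypothetical and not the actual llama-server code:

// Hypothetical sketch (not llama-server code): when forwarding an upstream
// response, drop hop-by-hop headers such as Transfer-Encoding instead of
// copying them verbatim, since the proxy re-frames the body itself.
#include <map>
#include <string>

using Headers = std::multimap<std::string, std::string>;

static bool is_hop_by_hop(const std::string & name) {
    // Per RFC 9110/9112 these headers describe a single connection and
    // must not be forwarded by intermediaries.
    static const char * hop_by_hop[] = {
        "Connection", "Keep-Alive", "Transfer-Encoding", "TE",
        "Trailer", "Upgrade", "Proxy-Authenticate", "Proxy-Authorization",
    };
    for (const char * h : hop_by_hop) {
        if (name == h) { // real code should compare case-insensitively
            return true;
        }
    }
    return false;
}

// Copy only end-to-end headers from the upstream response, skipping anything
// the proxy's own HTTP layer is responsible for (framing, connection reuse).
static Headers filter_upstream_headers(const Headers & upstream) {
    Headers out;
    for (const auto & kv : upstream) {
        if (!is_hop_by_hop(kv.first) && kv.first != "Content-Length") {
            out.insert(kv);
        }
    }
    return out; // the proxy's HTTP library then sets Content-Length itself
}

With filtering like this the proxy's own HTTP layer stays the single source of truth for framing headers, so Content-Length and Transfer-Encoding cannot end up in the same response, and headers set by both sides are not duplicated.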

First Bad Commit

ec18edf, i.e. the commit that introduced the multi-model mode feature.

Relevant log output

Single-model mode (--model C:\models\stories260K.gguf):

$ curl -v http://127.0.0.1:8080/completion -d '{"model":"stories260K","prompt":"hello","n_predict":10}'
*   Trying 127.0.0.1:8080...
* Connected to 127.0.0.1 (127.0.0.1) port 8080
* using HTTP/1.x
> POST /completion HTTP/1.1
> Host: 127.0.0.1:8080
> User-Agent: curl/8.14.1
> Accept: */*
> Content-Length: 55
> Content-Type: application/x-www-form-urlencoded
>
* upload completely sent off: 55 bytes
< HTTP/1.1 200 OK
< Server: llama.cpp
< Access-Control-Allow-Origin:
< Content-Type: application/json; charset=utf-8
< Content-Length: 1634
< Keep-Alive: timeout=5, max=100
<

Multi-model mode, b7240:

$ curl -v http://127.0.0.1:8080/completion -d '{"model":"stories260K","prompt":"hello","n_predict":10}'
*   Trying 127.0.0.1:8080...
* Connected to 127.0.0.1 (127.0.0.1) port 8080
* using HTTP/1.x
> POST /completion HTTP/1.1
> Host: 127.0.0.1:8080
> User-Agent: curl/8.14.1
> Accept: */*
> Content-Length: 55
> Content-Type: application/x-www-form-urlencoded
>
* upload completely sent off: 55 bytes
< HTTP/1.1 200 OK
< Server: llama.cpp
< Server: llama.cpp
< Access-Control-Allow-Origin:
< Access-Control-Allow-Origin:
< Connection: close
< Content-Length: 1630
< Content-Type: application/json; charset=utf-8
< Content-Type: application/json; charset=utf-8
< Transfer-Encoding: chunked
< Keep-Alive: timeout=5, max=100
<

Multi-model mode, #17698:

$ curl -v http://127.0.0.1:8080/completion -d '{"model":"stories260K","prompt":"hello","n_predict":10}'
*   Trying 127.0.0.1:8080...
* Connected to 127.0.0.1 (127.0.0.1) port 8080
* using HTTP/1.x
> POST /completion HTTP/1.1
> Host: 127.0.0.1:8080
> User-Agent: curl/8.14.1
> Accept: */*
> Content-Length: 55
> Content-Type: application/x-www-form-urlencoded
>
* upload completely sent off: 55 bytes
< HTTP/1.1 200 OK
< Transfer-Encoding: chunked
< Server: llama.cpp
< Access-Control-Allow-Origin:
< Connection: close
< Content-Length: 1636
< Content-Type: application/json; charset=utf-8
< Keep-Alive: timeout=5, max=100
<
