Name and Version
Tested on latest build (b7240). Also tried #17698 but it does not seem to fix this issue.
C:\llama-b7240-bin-win-cpu-x64>llama-server.exe --version
load_backend: loaded RPC backend from C:\llama-b7240-bin-win-cpu-x64\ggml-rpc.dll
load_backend: loaded CPU backend from C:\llama-b7240-bin-win-cpu-x64\ggml-cpu-haswell.dll
version: 7240 (61bde8e21)
built with clang version 19.1.5 for x86_64-pc-windows-msvc
Operating systems
Linux, Windows
Which llama.cpp modules do you know to be affected?
llama-server
Command line
llama-server.exe --models-dir "C:\models"
curl -v http://127.0.0.1:8080/completion -d '{"model":"stories260K","prompt":"hello","n_predict":10}'
Problem description & steps to reproduce
When running in multi-model mode, the proxy server adds an unnecessary and invalid Transfer-Encoding: chunked HTTP header to certain endpoints' responses, even though the mutually exclusive Content-Length header is already present. This does not affect the WebUI or other ordinary usage, but it causes reverse proxies with strict header checking to fail (e.g. Nginx rejects every completion request with upstream sent "Content-Length" and "Transfer-Encoding" headers at the same time while reading response header from upstream).
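For reference, RFC 7230 §3.3.3 forbids sending Content-Length and Transfer-Encoding together in the same message. Below is a minimal, self-contained sketch of the kind of header sanitization the proxy path could apply when forwarding an upstream response; the sanitize_headers helper and the std::multimap representation are illustrative assumptions, not the actual llama-server code:

```cpp
#include <cctype>
#include <iostream>
#include <map>
#include <string>

// Case-insensitive comparison for HTTP header names.
static bool iequals(const std::string & a, const std::string & b) {
    if (a.size() != b.size()) return false;
    for (size_t i = 0; i < a.size(); ++i) {
        if (std::tolower((unsigned char) a[i]) != std::tolower((unsigned char) b[i])) return false;
    }
    return true;
}

using headers_t = std::multimap<std::string, std::string>;

// Hypothetical sanitization step: when the upstream response already carries
// Content-Length, drop any Transfer-Encoding header and de-duplicate repeated
// header names, so strict reverse proxies (e.g. Nginx) accept the response.
static headers_t sanitize_headers(const headers_t & upstream) {
    bool has_content_length = false;
    for (const auto & kv : upstream) {
        if (iequals(kv.first, "Content-Length")) { has_content_length = true; break; }
    }

    headers_t out;
    for (const auto & kv : upstream) {
        if (has_content_length && iequals(kv.first, "Transfer-Encoding")) {
            continue; // mutually exclusive with Content-Length (RFC 7230 §3.3.3)
        }
        bool dup = false; // keep only the first occurrence of each header name
        for (const auto & okv : out) {
            if (iequals(okv.first, kv.first)) { dup = true; break; }
        }
        if (!dup) out.insert(kv);
    }
    return out;
}

int main() {
    // Headers resembling the multi-model response observed below.
    headers_t upstream = {
        {"Server", "llama.cpp"}, {"Server", "llama.cpp"},
        {"Content-Length", "1630"},
        {"Transfer-Encoding", "chunked"},
        {"Content-Type", "application/json; charset=utf-8"},
    };
    for (const auto & kv : sanitize_headers(upstream)) {
        std::cout << kv.first << ": " << kv.second << "\n";
    }
}
```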
First Bad Commit
ec18edf, i.e. since the introduction of the multi-model mode feature.
Relevant log output
Single-model mode (--model C:\models\stories260K.gguf):
$ curl -v http://127.0.0.1:8080/completion -d '{"model":"stories260K","prompt":"hello","n_predict":10}'
* Trying 127.0.0.1:8080...
* Connected to 127.0.0.1 (127.0.0.1) port 8080
* using HTTP/1.x
> POST /completion HTTP/1.1
> Host: 127.0.0.1:8080
> User-Agent: curl/8.14.1
> Accept: */*
> Content-Length: 55
> Content-Type: application/x-www-form-urlencoded
>
* upload completely sent off: 55 bytes
< HTTP/1.1 200 OK
< Server: llama.cpp
< Access-Control-Allow-Origin:
< Content-Type: application/json; charset=utf-8
< Content-Length: 1634
< Keep-Alive: timeout=5, max=100
<
Multi-model mode, b7240:
$ curl -v http://127.0.0.1:8080/completion -d '{"model":"stories260K","prompt":"hello","n_predict":10}'
* Trying 127.0.0.1:8080...
* Connected to 127.0.0.1 (127.0.0.1) port 8080
* using HTTP/1.x
> POST /completion HTTP/1.1
> Host: 127.0.0.1:8080
> User-Agent: curl/8.14.1
> Accept: */*
> Content-Length: 55
> Content-Type: application/x-www-form-urlencoded
>
* upload completely sent off: 55 bytes
< HTTP/1.1 200 OK
< Server: llama.cpp
< Server: llama.cpp
< Access-Control-Allow-Origin:
< Access-Control-Allow-Origin:
< Connection: close
< Content-Length: 1630
< Content-Type: application/json; charset=utf-8
< Content-Type: application/json; charset=utf-8
< Transfer-Encoding: chunked
< Keep-Alive: timeout=5, max=100
<
Multi-model mode, #17698:
$ curl -v http://127.0.0.1:8080/completion -d '{"model":"stories260K","prompt":"hello","n_predict":10}'
* Trying 127.0.0.1:8080...
* Connected to 127.0.0.1 (127.0.0.1) port 8080
* using HTTP/1.x
> POST /completion HTTP/1.1
> Host: 127.0.0.1:8080
> User-Agent: curl/8.14.1
> Accept: */*
> Content-Length: 55
> Content-Type: application/x-www-form-urlencoded
>
* upload completely sent off: 55 bytes
< HTTP/1.1 200 OK
< Transfer-Encoding: chunked
< Server: llama.cpp
< Access-Control-Allow-Origin:
< Connection: close
< Content-Length: 1636
< Content-Type: application/json; charset=utf-8
< Keep-Alive: timeout=5, max=100
<