
Conversation

@angt angt commented Nov 26, 2025

  • Update the common/download interface to be directly usable by tools/run (removing duplicated code).
  • Fix ollama downloads by implementing manual redirect handling (addressing issues with cpp-httplib).
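cpp-httplib can follow redirects on its own (`set_follow_location(true)`), but the CDN redirects used by ollama downloads trip it up, hence the manual handling. The two pieces such handling needs are detecting a redirect status and resolving the `Location` header against the current URL. A minimal sketch, with hypothetical helper names rather than the PR's actual code:

```cpp
#include <cassert>
#include <string>

// Redirect statuses worth following manually (hypothetical helper).
static bool is_redirect(int status) {
    return status == 301 || status == 302 || status == 303 || status == 307 || status == 308;
}

// Resolve a Location header against the current URL. Handles absolute
// URLs and host-relative paths; hypothetical sketch, not the PR's code.
static std::string resolve_location(const std::string & base, const std::string & location) {
    if (location.rfind("http://", 0) == 0 || location.rfind("https://", 0) == 0) {
        return location; // absolute redirect, use as-is
    }
    // host-relative: keep scheme://host from the base URL
    size_t scheme_end = base.find("://");
    size_t host_end   = base.find('/', scheme_end == std::string::npos ? 0 : scheme_end + 3);
    std::string origin = host_end == std::string::npos ? base : base.substr(0, host_end);
    if (!location.empty() && location[0] == '/') {
        return origin + location;
    }
    return origin + "/" + location;
}
```

The caller would loop: issue the request with automatic redirects disabled, and while `is_redirect()` holds, re-issue against `resolve_location()` up to some maximum hop count.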

angt added 4 commits November 26, 2025 22:26
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
@github-actions bot added the testing, examples and server labels Nov 26, 2025
angt commented Nov 26, 2025

With this change, we can deprecate the cURL dependency and start shipping releases without it.
The unified HTTP stack will become a reality :)

ericcurtin commented Nov 27, 2025

My reviews are somewhat irrelevant now since I don't have merge rights. I took a quick skim rather than reading every line, and at a glance everything seems reasonably OK. I recommend doing a quick test via:

llama-server -dr gemma3

to be sure...

angt commented Nov 27, 2025

> My reviews are somewhat irrelevant now since I don't have merge rights. I took a quick skim rather than reading every line, and at a glance everything seems reasonably OK. I recommend doing a quick test via:
>
> llama-server -dr gemma3
>
> to be sure...

Here are some runs:

via Docker:

$ ./build/bin/llama-server -dr gemma3
ggml_metal_device_init: tensor API disabled for pre-M5 and pre-A19 devices
ggml_metal_library_init: using embedded metal library
ggml_metal_library_init: loaded in 0.021 sec
ggml_metal_device_init: GPU name:   Apple M3
ggml_metal_device_init: GPU family: MTLGPUFamilyApple9  (1009)
ggml_metal_device_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_device_init: GPU family: MTLGPUFamilyMetal4  (5002)
ggml_metal_device_init: simdgroup reduction   = true
ggml_metal_device_init: simdgroup matrix mul. = true
ggml_metal_device_init: has unified memory    = true
ggml_metal_device_init: has bfloat            = true
ggml_metal_device_init: has tensor            = false
ggml_metal_device_init: use residency sets    = true
ggml_metal_device_init: use shared buffers    = true
ggml_metal_device_init: recommendedMaxWorkingSetSize  = 19069.67 MB
common_docker_resolve_model: Downloading Docker Model: ai/gemma3:latest
common_download_file_single_online: no previous model file found /Users/angt/Library/Caches/llama.cpp/ai_gemma3_latest.gguf
common_download_file_single_online: trying to download model from https://production.cloudflare.docker.com/registry-v2/docker/registry/v2/blobs/sha256/04/04a43a22e8d2003deda5acc262f68ec1005fa76c735a9962a8c77042a74a7d19/data?expires=1764234374&signature=m2%2BBuw6sCMTEH4cNizZDs6fsLC8%3D&version=2 to /Users/angt/Library/Caches/llama.cpp/ai_gemma3_latest.gguf.downloadInProgress (etag:"562fa5cb63ae8b96836b09a658443c01-25")...
^C==============================>                   ]  63%  (1503 MB / 2374 MB)

(resume works)
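Resume works because the size of the existing `*.downloadInProgress` file can be turned into a ranged request. A sketch of the idea, with hypothetical helpers rather than the actual download.cpp code:

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Build the Range header value used to resume a partial download, given
// the size of the existing *.downloadInProgress file (hypothetical helper).
// "bytes=N-" asks the server for everything from offset N onward.
static std::string make_resume_range(uint64_t existing_bytes) {
    return "bytes=" + std::to_string(existing_bytes) + "-";
}

// A 206 Partial Content reply means the server honored the range and we
// can append to the partial file; a plain 200 means it ignored the range
// and the download must restart from zero.
static bool server_honored_range(int status) {
    return status == 206;
}
```

The etag printed in the logs above guards this: if the server's etag no longer matches the one recorded for the partial file, the remote file changed and resuming would corrupt the result, so the client starts over.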

via HF:

$ ./build/bin/llama-server -hf unsloth/gpt-oss-120b-GGUF
ggml_metal_device_init: tensor API disabled for pre-M5 and pre-A19 devices
ggml_metal_library_init: using embedded metal library
ggml_metal_library_init: loaded in 0.006 sec
ggml_metal_device_init: GPU name:   Apple M3
ggml_metal_device_init: GPU family: MTLGPUFamilyApple9  (1009)
ggml_metal_device_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_device_init: GPU family: MTLGPUFamilyMetal4  (5002)
ggml_metal_device_init: simdgroup reduction   = true
ggml_metal_device_init: simdgroup matrix mul. = true
ggml_metal_device_init: has unified memory    = true
ggml_metal_device_init: has bfloat            = true
ggml_metal_device_init: has tensor            = false
ggml_metal_device_init: use residency sets    = true
ggml_metal_device_init: use shared buffers    = true
ggml_metal_device_init: recommendedMaxWorkingSetSize  = 19069.67 MB
common_download_file_single_online: no previous model file found /Users/angt/Library/Caches/llama.cpp/unsloth_gpt-oss-120b-GGUF_gpt-oss-120b-F16.gguf
common_download_file_single_online: trying to download model from https://cas-bridge.xethub.hf.co/xet-bridge-us/68923b51e5822b89fab7a1e7/f7c8b3cdb2bacb3cef00372f4fa3070f250a2d8838cec29653b9b9db8238a583?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Content-Sha256=UNSIGNED-PAYLOAD&X-Amz-Credential=cas%2F20251127%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20251127T081900Z&X-Amz-Expires=3600&X-Amz-Signature=555418210eb5445a4ca2376a0a0b4c15c8b5cef28cc67e4894397e8c41243075&X-Amz-SignedHeaders=host&X-Xet-Cas-Uid=public&response-content-disposition=inline%3B+filename*%3DUTF-8%27%27gpt-oss-120b-F16.gguf%3B+filename%3D%22gpt-oss-120b-F16.gguf%22%3B&x-id=GetObject&Expires=1764235140&Policy=eyJTdGF0ZW1lbnQiOlt7IkNvbmRpdGlvbiI6eyJEYXRlTGVzc1RoYW4iOnsiQVdTOkVwb2NoVGltZSI6MTc2NDIzNTE0MH19LCJSZXNvdXJjZSI6Imh0dHBzOi8vY2FzLWJyaWRnZS54ZXRodWIuaGYuY28veGV0LWJyaWRnZS11cy82ODkyM2I1MWU1ODIyYjg5ZmFiN2ExZTcvZjdjOGIzY2RiMmJhY2IzY2VmMDAzNzJmNGZhMzA3MGYyNTBhMmQ4ODM4Y2VjMjk2NTNiOWI5ZGI4MjM4YTU4MyoifV19&Signature=v02cqmRkLDiiyjkhUGUmvIzg9woZZbAsLWo9RCrqcYg6FB-wZboc4gLBqSvMjHzz8lCEVskrLy3ZrPPV8j%7E%7Ep%7EqqB04dzNIy608mlJUIUXm51Ux9%7EHYqjo9oZGi0ZJgoxNvnH5TU5Nn%7ELyftUrpB53b6BobQyzS66myRjsnEmIkYOyzXbZUBKRfpK9Hu65PUFvfAYEq0rC%7E6x4y8CLHu0eH8oX41UiKykrtZGiRGluYXztE0sBJpWGJERgp2Wv5L1qSX1J2YpgM7CJre0QPsk8xa8ZbP-lgRtvSj%7ELPkR3pwys4NxYillvJdB%7EI1uF-vURScMFi3COObp3OAqn-T6Q__&Key-Pair-Id=K2L8F4GPSG1IFC to /Users/angt/Library/Caches/llama.cpp/unsloth_gpt-oss-120b-GGUF_gpt-oss-120b-F16.gguf.downloadInProgress (etag:"f7c8b3cdb2bacb3cef00372f4fa3070f250a2d8838cec29653b9b9db8238a583")...
^C=========>                                        ]  21%  (13202 MB / 62340 MB)

and for tools/run:

$ ./build/bin/llama-run llama3
common_docker_resolve_model: Downloading Docker Model: library/llama3:latest
common_download_file_single_online: no previous model file found /Users/angt/Library/Caches/llama.cpp/library_llama3_latest.gguf
common_download_file_single_online: 403 on HEAD, assuming GET/Resume is allowed
common_download_file_single_online: trying to download model from https://dd20bb891979d25aebc8bec07b2b3bbc.r2.cloudflarestorage.com/ollama/docker/registry/v2/blobs/sha256/6a/6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa/data?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=66040c77ac1b787c3af820529859349a%2F20251127%2Fauto%2Fs3%2Faws4_request&X-Amz-Date=20251127T082042Z&X-Amz-Expires=86400&X-Amz-SignedHeaders=host&X-Amz-Signature=891427f6f6e2b09cafa8869ca3d84993b1c17352b11f342e2b8c033578a77ca0 to /Users/angt/Library/Caches/llama.cpp/library_llama3_latest.gguf.downloadInProgress (etag:)...
^C=>                                                ]   4%  (185 MB / 4445 MB)
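The `403 on HEAD` line above shows a fallback: some blob stores reject HEAD requests while still serving ranged GETs, so the client proceeds anyway instead of failing. Sketched as a tiny predicate (hypothetical, not the PR's code):

```cpp
#include <cassert>

// Whether the HEAD response can be trusted for metadata (etag, size).
// On 403 and other failures we skip it and assume a ranged GET will
// still work, matching the "403 on HEAD, assuming GET/Resume is
// allowed" log line above (hypothetical helper).
static bool head_is_usable(int head_status) {
    return head_status >= 200 && head_status < 300;
}
```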

The output is excessively verbose, and the progress bar is broken when doing multiple downloads. However, it will be easier to improve the code from now on.
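For what it's worth, the single-download bar seen in the logs can be reproduced with a small renderer; with concurrent downloads each one would need its own terminal line, which is roughly the breakage being described. A hypothetical sketch, not the actual code:

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Render a fixed-width progress bar like the one in the logs above,
// e.g. "=>        ] 21%" for 21% at width 10 (hypothetical helper).
static std::string render_bar(uint64_t done, uint64_t total, int width = 50) {
    int pct  = total ? int(done * 100 / total) : 0;
    int fill = width * pct / 100;
    std::string bar(fill > 0 ? fill - 1 : 0, '='); // filled portion
    if (fill > 0) bar += '>';                      // arrow head
    bar += std::string(width - fill, ' ');         // remaining space
    return bar + "] " + std::to_string(pct) + "%";
}
```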

CISC commented Nov 27, 2025

Looks like the thread-safety test fails across the board.

angt commented Nov 27, 2025

> Looks like the thread-safety test fails across the board.

Yes, I’m going to check all the red alerts. 😬

@ericcurtin commented:

@angt I don't know if you are interested in this, but it would be preferable if llama-run ran a client and a llama-server in parallel; llama-run/llama-server would benefit from more re-usable code.

angt commented Nov 27, 2025

> @angt I don't know if you are interested in this, but it would be preferable if llama-run ran a client and a llama-server in parallel; llama-run/llama-server would benefit from more re-usable code.

I tried to preserve llama-run’s current behavior, as my main goal was to move toward removing the cURL dependency and to fix the tool when cURL is disabled. But I fully agree that rethinking llama-run could be useful :)

ericcurtin commented Nov 27, 2025

> @angt I don't know if you are interested in this, but it would be preferable if llama-run ran a client and a llama-server in parallel; llama-run/llama-server would benefit from more re-usable code.
>
> I tried to preserve llama-run’s current behavior, as my main goal was to move toward removing the cURL dependency and to fix the tool when cURL is disabled. But I fully agree that rethinking llama-run could be useful :)

A CVE in linenoise ended up scratching my itch, but I'd appreciate it if you built this branch, gave the new llama-run experience a shot, and provided feedback for future PRs:

#17554

@ericcurtin commented:

You might want to abandon the run.cpp changes here (I don't know if the other parts should stay in this PR). The new PR is a much better experience and CVE-free (with the removal of linenoise).

angt commented Nov 28, 2025

@ericcurtin, this PR still resolves some issues with cpp-httplib and improves download.cpp. I believe it can be merged before the complete rewrite of llama-run to address current issues more efficiently.

ericcurtin commented Nov 28, 2025

I've noticed a pattern where my PRs are often asked to wait for your changes to merge first. While I understand the need to avoid conflicts, constantly deferring my work can be discouraging. Could we try prioritizing based on readiness this time?

#16196 (comment)

I have no further changes to make to my pull request at this time:

#17554

In this case there isn't any significant conflict; the above PR makes no changes to the download code.

angt commented Nov 28, 2025

I apologize if you felt that way, but I must point out that the PR you mentioned was opened before yours. 😅

Anyway, I was only suggesting we merge this one along with yours, but more importantly, they don’t solve the same problem.

@angt angt closed this Nov 28, 2025