
Conversation

ai-fonsi (Contributor) commented Sep 28, 2025

"Integrated" CUDA devices appear to be buggy and produce incorrect output in specific cases. Since disabling the integrated flag seems to affect neither performance nor memory usage on Jetson, I propose disabling the option until the underlying issue is fixed.

Fixes #15034 and probably also #15923.
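
For context, a minimal sketch of where the "integrated" flag comes from, assuming the backend keys off cudaDeviceProp::integrated as reported by the CUDA runtime; the helper name and the hard-coded override below are illustrative, not the actual patch:

```cpp
// Illustrative sketch only -- not the actual ggml-cuda change.
#include <cuda_runtime.h>

// On Jetson-class devices the runtime reports prop.integrated == 1 because the
// GPU shares physical memory with the CPU; that flag is what selects the
// "integrated" host-buffer path this PR proposes to disable.
static bool cuda_device_reports_integrated(int device) {
    cudaDeviceProp prop{};
    if (cudaGetDeviceProperties(&prop, device) != cudaSuccess) {
        return false;
    }
    return prop.integrated != 0;
}

// Disabling the option then amounts to ignoring that flag for now, e.g.
//   const bool integrated = false; // instead of cuda_device_reports_integrated(dev)
// so the backend falls back to the regular host-buffer path.
```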

ai-fonsi requested a review from slaren as a code owner September 28, 2025 15:07
github-actions bot added the Nvidia GPU (Issues specific to Nvidia GPUs) and ggml (changes relating to the ggml tensor library for machine learning) labels Sep 28, 2025
slaren (Member) commented Sep 28, 2025

This is probably the same synchronization issue with the scheduler that @ggerganov found when making the Metal backend async. To confirm this, can you verify if it works (without this change) by launching with the env variable CUDA_LAUNCH_BLOCKING=1 (effectively disabling async compute)?
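
For reference, CUDA_LAUNCH_BLOCKING=1 makes the runtime wait for each kernel to finish before the launch call returns, which removes asynchronous scheduling effects from the picture. A minimal, self-contained illustration (placeholder kernel, not llama.cpp code):

```cpp
// Toy example: with CUDA_LAUNCH_BLOCKING=1 in the environment, the launch
// below already blocks until the kernel finishes; without it, the explicit
// cudaDeviceSynchronize() is what provides the same ordering guarantee.
#include <cuda_runtime.h>
#include <cstdio>

__global__ void noop_kernel() {}

int main() {
    noop_kernel<<<1, 1>>>();
    cudaDeviceSynchronize();
    std::printf("kernel completed\n");
    return 0;
}
```

In practice the test just means starting the server as usual with the variable set, e.g. `CUDA_LAUNCH_BLOCKING=1 ./llama-server ...`, leaving all other arguments unchanged.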

ai-fonsi (Contributor, Author) commented

I tried starting llama-server with CUDA_LAUNCH_BLOCKING=1, but it didn't fix the issue.

slaren merged commit 9d08828 into ggml-org:master Oct 8, 2025
64 of 67 checks passed
anyshu pushed a commit to anyshu/llama.cpp that referenced this pull request Oct 10, 2025
* master: (113 commits)
  webui: updated the chat service to only include max_tokens in the req… (ggml-org#16489)
  cpu : optimize the ggml NORM operation (ggml-org#15953)
  server : host-memory prompt caching (ggml-org#16391)
  No markdown in cot (ggml-org#16483)
  model-conversion : add support for SentenceTransformers (ggml-org#16387)
  ci: add ARM64 Kleidiai build and test support (ggml-org#16462)
  CANN: Improve ACL graph matching (ggml-org#16166)
  kleidiai: kernel interface refactoring (ggml-org#16460)
  [SYCL] refactor soft_max, add soft_max_back (ggml-org#16472)
  model: EmbeddingGemma Adding Support for SentenceTransformers Dense Modules (ggml-org#16367)
  refactor: centralize CoT parsing in backend for streaming mode (ggml-org#16394)
  Disable CUDA host buffers on integrated GPUs (ggml-org#16308)
  server : fix cancel pending task (ggml-org#16467)
  metal : mark FA blocks (ggml-org#16372)
  server : improve context checkpoint logic (ggml-org#16440)
  ggml webgpu: profiling, CI updates, reworking of command submission (ggml-org#16452)
  llama : support LiquidAI LFM2-MoE hybrid model (ggml-org#16464)
  server : add `/v1/health` endpoint (ggml-org#16461)
  webui : added download action (ggml-org#13552) (ggml-org#16282)
  presets : fix pooling param for embedding models (ggml-org#16455)
  ...
Successfully merging this pull request may close these issues.

Eval bug: Broken/no Gemma 3n output on CUDA (Nvidia Jetson Orin Nano)