

Prerequisites
- I am running the latest code.
- I carefully followed the README.md.
- I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
- I reviewed the Discussions, and have a new bug to share.
Expected Behavior
When loading a large PyTorch-based model (e.g., an SDXL checkpoint >10GB) within a host application (ComfyUI), I expect the GPU's dedicated memory (VRAM) usage to increase smoothly and stabilize at the required level for the model. This process should be fast and efficient.
Current Behavior
The VRAM allocation is extremely erratic and inefficient. The memory usage graph shows a severe sawtooth/spiky pattern—it spikes up, drops slightly, spikes again, and repeats this cycle until the model is finally loaded. This process is significantly slower than normal and suggests a fundamental conflict in memory management at the CUDA level.
This behavior only occurs when a component that imports llama-cpp-python is loaded by the host application at startup. The issue manifests before any llama-cpp-python functions are actively called; the mere act of importing the library is enough to trigger this conflict with subsequent PyTorch operations.
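For clarity, the triggering pattern reduces to the following import order. This is a minimal sketch, not the actual ComfyUI code path; the tensor sizes and chunked allocation are illustrative stand-ins for a checkpoint load.

```python
# Minimal sketch of the triggering import order (illustrative only).
# No llama-cpp-python API is ever called.
import llama_cpp  # noqa: F401  -- the import alone appears to be enough

import torch

# Stand-in for loading a large checkpoint: allocate ~512 MB chunks on the
# GPU, roughly 10 GB in total, while watching dedicated VRAM usage.
chunks = [
    torch.empty(256 * 1024 * 1024, dtype=torch.float16, device="cuda")
    for _ in range(20)
]
torch.cuda.synchronize()
```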
Environment and Context
- Hardware:
  - GPU: NVIDIA RTX PRO 6000 Blackwell Workstation Edition (94.5 GB VRAM)
- Operating System:
  - Windows 11
- SDK and Software Versions:
  - Python Version: 3.12.11
  - Compiler: MSVC / Visual Studio 2022 (used to compile for CUDA)
  - CUDA Toolkit: 12.8 (locally compiled)
  - NVIDIA Driver Version: 580.97
  - Host Application: ComfyUI (Release 2025-08-27 or newer)
- Key Python Packages:
  - llama-cpp-python: 0.3.16
  - torch: 2.8.0+cu128
  - diskcache: 5.6.3
Failure Information (for bugs)
This report details a deep-seated conflict between llama-cpp-python and torch regarding CUDA context and/or VRAM management.
Investigation Summary & Key Findings
We conducted an extensive, multi-day investigation to isolate this issue, and the results were conclusive:
- The problem is triggered by the import of llama-cpp-python. Any host application that loads this library at startup will cause subsequent, unrelated torch VRAM allocations to become unstable. Disabling the components that import llama-cpp-python completely resolves the issue.
- The diskcache dependency is a red herring. We initially suspected the diskcache library was the cause. To test this, we performed a definitive experiment: we hid the real diskcache library and replaced it with a completely inert, non-functional fake Python package that contained only an empty Cache class (see the sketch after this list).
  - Result: The VRAM instability problem persisted even with the fake library.
  - Log Confirmation: The application log confirmed our fake library was being used:
    --- WARNING: a FAKE diskcache library is in use --- Set up system temp directory: C:\Users\Win\AppData\Local\Temp\latentsync_16880d62
  - Conclusion: This proves that diskcache's code is not the active cause. The problem is llama-cpp-python itself. The initial correlation arose because diskcache is a mandatory dependency; uninstalling it simply prevented llama-cpp-python from being imported successfully.
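For reference, the inert stand-in package was along these lines (a sketch of the approach, not the exact file we used):

```python
# diskcache/__init__.py -- inert stand-in used for the experiment (sketch).
# It prints a warning so the application log proves the fake is in use,
# and exposes an empty Cache class so "import diskcache" and
# "from diskcache import Cache" both succeed without doing anything.
print("--- WARNING: a FAKE diskcache library is in use ---")


class Cache:
    """Non-functional placeholder; accepts and ignores all arguments."""

    def __init__(self, *args, **kwargs):
        pass
```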
Final Hypothesis
The evidence overwhelmingly suggests a low-level conflict between how llama-cpp-python (especially a version compiled against a modern CUDA 12.8 toolkit) initializes its CUDA context and how PyTorch expects to manage the VRAM. The initialization of llama-cpp-python appears to "pollute" or "fragment" the VRAM state, making subsequent large-scale allocations by PyTorch highly inefficient.
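One way to probe this hypothesis is to check whether the import by itself already touches GPU memory, before torch is ever loaded. The sketch below uses the nvidia-ml-py (pynvml) package, which is our own addition for measurement and not part of the affected setup:

```python
# Diagnostic sketch: does merely importing llama_cpp change the GPU's
# used memory, before torch is ever imported? Requires nvidia-ml-py.
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

before = pynvml.nvmlDeviceGetMemoryInfo(handle).used
import llama_cpp  # noqa: F401  -- import only, no API calls
after = pynvml.nvmlDeviceGetMemoryInfo(handle).used

print(f"VRAM used before import: {before / 1024**2:.1f} MiB")
print(f"VRAM used after import:  {after / 1024**2:.1f} MiB")
pynvml.nvmlShutdown()
```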
Steps to Reproduce
- Set up a ComfyUI environment with PyTorch and the system components listed above.
- Compile and install llama-cpp-python with CUDA 12.8 support.
- Install any ComfyUI custom node that imports llama-cpp-python at startup (e.g., ComfyUI-JoyCaption or ComfyUI-Searge-LLM-Node).
- Launch ComfyUI. The server will start, and llama-cpp-python will be imported into the process.
- Load a standard workflow designed to load a large PyTorch-based checkpoint model (e.g., an SDXL checkpoint, >10GB).
- Observe the "Dedicated GPU Memory" graph in Windows Task Manager. You will see the erratic, spiky allocation pattern instead of a smooth ramp-up (a programmatic way to capture the same pattern is sketched below).
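As an alternative to watching Task Manager, the sawtooth can be recorded numerically. This is a minimal sketch under stated assumptions: it uses nvidia-ml-py for sampling and a hypothetical checkpoint path, neither of which comes from the original setup:

```python
# Sketch: sample dedicated GPU memory in a background thread while a large
# checkpoint loads, to record the spiky allocation pattern numerically.
import threading
import time

import pynvml
import torch

import llama_cpp  # noqa: F401  -- the suspected trigger, import only

samples = []
stop = threading.Event()


def sample_vram(interval: float = 0.05) -> None:
    """Append the GPU's used-memory counter to `samples` every `interval` s."""
    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(0)
    while not stop.is_set():
        samples.append(pynvml.nvmlDeviceGetMemoryInfo(handle).used)
        time.sleep(interval)
    pynvml.nvmlShutdown()


t = threading.Thread(target=sample_vram, daemon=True)
t.start()

# Placeholder for the large PyTorch checkpoint load (hypothetical path).
state_dict = torch.load("large_sdxl_checkpoint.pt", map_location="cuda")

stop.set()
t.join()
print(f"{len(samples)} samples, peak {max(samples) / 1024**3:.2f} GiB used")
```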
Note on llama.cpp comparison: This issue is not about the performance or functionality of llama-cpp-python's inference, but rather about a resource conflict it creates with another major library (torch). Therefore, comparing behavior against llama.cpp's ./main executable is not applicable, as the issue requires both runtimes to be active in the same Python process.