
[Bug Report] Severe VRAM Allocation Instability in PyTorch after llama-cpp-python is Imported #2060

@rookiestar28

Description


Prerequisites

Expected Behavior

When loading a large PyTorch-based model (e.g., an SDXL checkpoint >10GB) within a host application (ComfyUI), I expect the GPU's dedicated memory (VRAM) usage to increase smoothly and stabilize at the required level for the model. This process should be fast and efficient.

Current Behavior

The VRAM allocation is extremely erratic and inefficient. The memory usage graph shows a severe sawtooth/spiky pattern—it spikes up, drops slightly, spikes again, and repeats this cycle until the model is finally loaded. This process is significantly slower than normal and suggests a fundamental conflict in memory management at the CUDA level.

This behavior only occurs when a component that imports llama-cpp-python is loaded by the host application at startup. The issue manifests before any llama-cpp-python functions are actively called; the mere act of importing the library is enough to trigger this conflict with subsequent PyTorch operations.
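
A minimal standalone test, outside ComfyUI, isolates the effect of the import alone. The tensor sizes, the IMPORT_LLAMA environment flag, and the timing approach below are illustrative stand-ins for the real checkpoint load, not the exact ComfyUI code path:

```python
# ab_import_test.py -- run once with IMPORT_LLAMA=1 and once without, then
# compare the timings and the allocator summary between the two runs.
import os
import time

import torch

if os.environ.get("IMPORT_LLAMA") == "1":
    # Importing the library is enough to trigger the problem in our setup;
    # no Llama() object is ever constructed.
    import llama_cpp  # noqa: F401

torch.cuda.init()
torch.cuda.synchronize()

start = time.perf_counter()
# Stand in for loading a >10 GB checkpoint: many medium-sized device
# allocations rather than one big block (40 x 256 MiB = 10 GiB).
chunks = [
    torch.empty(256 * 1024**2, dtype=torch.uint8, device="cuda")
    for _ in range(40)
]
torch.cuda.synchronize()
elapsed = time.perf_counter() - start

print(f"allocated {sum(c.numel() for c in chunks) / 1024**3:.1f} GiB "
      f"in {elapsed:.2f} s")
print(torch.cuda.memory_summary(abbreviated=True))
```

If the conflict described above is real, the IMPORT_LLAMA=1 run should be noticeably slower and show more allocator activity than the clean run.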

Environment and Context

  • Hardware:

    • GPU: NVIDIA RTX PRO 6000 Blackwell Workstation Edition (94.5 GB VRAM)
  • Operating System:

    • Windows 11
  • SDK and Software Versions:

    • Python Version: 3.12.11
    • Compiler: MSVC (Visual Studio 2022), used to build llama-cpp-python with CUDA support
    • CUDA Toolkit: 12.8 (llama-cpp-python was compiled locally against this toolkit)
    • NVIDIA Driver Version: 580.97
    • Host Application: ComfyUI (Release 2025-08-27 or newer)
  • Key Python Packages:

    • llama-cpp-python: 0.3.16
    • torch: 2.8.0+cu128
    • diskcache: 5.6.3

Failure Information (for bugs)

This report details a deep-seated conflict between llama-cpp-python and torch regarding CUDA context and/or VRAM management.

Investigation Summary & Key Findings

We conducted an extensive, multi-day investigation to isolate this issue, and the results were conclusive:

  1. The problem is triggered by the import of llama-cpp-python. Any host application that loads this library at startup will cause subsequent, unrelated torch VRAM allocations to become unstable. Disabling the components that import llama-cpp-python completely resolves the issue.

  2. The diskcache dependency is a red herring. We initially suspected the diskcache library was the cause. To test this, we performed a definitive experiment: we hid the real diskcache library and replaced it with a completely inert, non-functional fake Python package that contained only an empty Cache class (a sketch of such a stub follows this list).

    • Result: The VRAM instability problem persisted even with the fake library.
    • Log Confirmation: The application log confirmed our fake library was being used:
      --- 警告:正在使用一個偽造的 (FAKE) diskcache 函式庫 --- (WARNING: a FAKE diskcache library is in use)
      Set up system temp directory: C:\Users\Win\AppData\Local\Temp\latentsync_16880d62
      
    • Conclusion: This proves that diskcache's code is not the active cause. The problem is llama-cpp-python itself. The initial correlation was due to diskcache being a mandatory dependency; uninstalling it simply prevented llama-cpp-python from being imported successfully.
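
For reference, a stub along these lines is enough to run the experiment; this is a reconstruction of the approach, not the exact file we used. The package only has to shadow the real diskcache on sys.path and print the warning seen in the log:

```python
# fake "diskcache" package: an inert diskcache/__init__.py placed ahead of
# the real library on sys.path. Reconstruction of the approach, not the
# exact file.
print("--- WARNING: a FAKE diskcache library is in use ---")


class Cache:
    """Empty stand-in for diskcache.Cache; stores nothing, touches no disk."""

    def __init__(self, *args, **kwargs):
        pass
```

With this stub importable as diskcache, llama_cpp still imports cleanly, yet the spiky allocation pattern persists, which is what rules diskcache out as the cause.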

Final Hypothesis

The evidence overwhelmingly suggests a low-level conflict between how llama-cpp-python (especially a version compiled against a modern CUDA 12.8 toolkit) initializes its CUDA context and how PyTorch expects to manage the VRAM. The initialization of llama-cpp-python appears to "pollute" or "fragment" the VRAM state, making subsequent large-scale allocations by PyTorch highly inefficient.
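
To make the "pollution/fragmentation" claim measurable from inside the process, the PyTorch caching-allocator counters can be compared between a clean run and a run where llama_cpp was imported first. The counters below are standard torch.cuda statistics; reading a high retry count and a large reserved-vs-allocated gap as fragmentation is part of the hypothesis, not an established fact:

```python
# Dump a few caching-allocator counters after the large model has been
# loaded; compare the numbers with and without the llama_cpp import.
import torch

stats = torch.cuda.memory_stats()

# How often the allocator had to flush its cached blocks and retry cudaMalloc.
print("num_alloc_retries:", stats.get("num_alloc_retries", 0))
# Raw device allocations made so far (more calls = choppier allocation).
print("num_device_alloc: ", stats.get("num_device_alloc", 0))
# Bytes actually handed to tensors vs. bytes reserved from the driver.
print("allocated bytes:  ", stats["allocated_bytes.all.current"])
print("reserved bytes:   ", stats["reserved_bytes.all.current"])
```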

Steps to Reproduce

  1. Set up a ComfyUI environment with PyTorch and the system components listed above.
  2. Compile and install llama-cpp-python with CUDA 12.8 support.
  3. Install any ComfyUI custom node that imports llama-cpp-python at startup (e.g., ComfyUI-JoyCaption or ComfyUI-Searge-LLM-Node).
  4. Launch ComfyUI. The server will start, and llama-cpp-python will be imported into the process.
  5. Load a standard workflow designed to load a large PyTorch-based checkpoint model (e.g., an SDXL checkpoint, >10GB).
  6. Observe the "Dedicated GPU Memory" graph in Windows Task Manager. You will see the erratic, spiky allocation pattern instead of a smooth ramp-up (a polling sketch that captures the same pattern as numbers follows this list).
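
Instead of eyeballing Task Manager, the pattern can be captured as numbers with a small nvidia-smi polling script (standard nvidia-smi query fields; assumes a single GPU and an arbitrary 0.5 s interval):

```python
# vram_poll.py -- print dedicated VRAM usage twice per second while the
# workflow loads; the sawtooth shows up as repeated rises and drops.
import subprocess
import time

while True:
    out = subprocess.run(
        ["nvidia-smi",
         "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout.strip()
    used_mib, total_mib = (int(x) for x in out.split(","))
    print(f"{time.strftime('%H:%M:%S')}  {used_mib} / {total_mib} MiB used")
    time.sleep(0.5)
```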

Note on llama.cpp comparison: This issue is not about the performance or functionality of llama-cpp-python's inference, but rather a resource conflict it creates with another major library (torch). Therefore, comparing behavior against llama.cpp's ./main executable is not applicable, as the issue requires both runtimes to be active in the same Python process.
