

Prerequisites
- I am running the latest code.
- I carefully followed the README.md.
- I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
- I reviewed the Discussions, and have a new bug to share.
Expected Behavior
When loading a large PyTorch-based model (e.g., an SDXL checkpoint >10GB) within a host application (ComfyUI), I expect the GPU's dedicated memory (VRAM) usage to increase smoothly and stabilize at the required level for the model. This process should be fast and efficient.
Current Behavior
The VRAM allocation is extremely erratic and inefficient. The memory usage graph shows a severe sawtooth/spiky pattern—it spikes up, drops slightly, spikes again, and repeats this cycle until the model is finally loaded. This process is significantly slower than normal and suggests a fundamental conflict in memory management at the CUDA level.
This behavior only occurs when a component that imports llama-cpp-python is loaded by the host application at startup. The issue manifests before any llama-cpp-python functions are actively called; the mere act of importing the library is enough to trigger this conflict with subsequent PyTorch operations.
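For clarity, the triggering pattern reduces to the following import order. This is a minimal sketch, not the actual ComfyUI code path; the tensor sizes and chunked allocation are illustrative stand-ins for a checkpoint load.

```python
# Minimal sketch of the triggering import order (illustrative only).
# No llama-cpp-python API is ever called.
import llama_cpp  # noqa: F401  -- the import alone appears to be enough

import torch

# Stand-in for loading a large checkpoint: allocate ~512 MB chunks on the
# GPU, roughly 10 GB in total, while watching dedicated VRAM usage.
chunks = [
    torch.empty(256 * 1024 * 1024, dtype=torch.float16, device="cuda")
    for _ in range(20)
]
torch.cuda.synchronize()
```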
Environment and Context
- Hardware:
  - GPU: NVIDIA RTX PRO 6000 Blackwell Workstation Edition (94.5 GB VRAM)
- Operating System:
  - Windows 11
- SDK and Software Versions:
  - Python Version: 3.12.11
  - Compiler: MSVC / Visual Studio 2022 (used to compile for CUDA)
  - CUDA Toolkit: 12.8 (locally compiled)
  - NVIDIA Driver Version: 580.97
  - Host Application: ComfyUI (Release 2025-08-27 or newer)
- Key Python Packages:
  - llama-cpp-python: 0.3.16
  - torch: 2.8.0+cu128
  - diskcache: 5.6.3
Failure Information (for bugs)
This report details a deep-seated conflict between llama-cpp-python and torch regarding CUDA context and/or VRAM management.
Investigation Summary & Key Findings
We conducted an extensive, multi-day investigation to isolate this issue, and the results were conclusive:
- The problem is triggered by the import of llama-cpp-python. Any host application that loads this library at startup will cause subsequent, unrelated torch VRAM allocations to become unstable. Disabling the components that import llama-cpp-python completely resolves the issue.
- The diskcache dependency is a red herring. We initially suspected the diskcache library was the cause. To test this, we performed a definitive experiment: we hid the real diskcache library and replaced it with a completely inert, non-functional fake Python package that contained only an empty Cache class (see the sketch after this list).
  - Result: The VRAM instability problem persisted even with the fake library.
  - Log Confirmation: The application log confirmed our fake library was being used:
    --- WARNING: a FAKE diskcache library is in use --- Set up system temp directory: C:\Users\Win\AppData\Local\Temp\latentsync_16880d62
  - Conclusion: This proves that diskcache's code is not the active cause. The problem is llama-cpp-python itself. The initial correlation arose because diskcache is a mandatory dependency; uninstalling it simply prevented llama-cpp-python from being imported successfully.
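For reference, the inert stand-in package was along these lines (a sketch of the approach, not the exact file we used):

```python
# diskcache/__init__.py -- inert stand-in used for the experiment (sketch).
# It prints a warning so the application log proves the fake is in use,
# and exposes an empty Cache class so "import diskcache" and
# "from diskcache import Cache" both succeed without doing anything.
print("--- WARNING: a FAKE diskcache library is in use ---")


class Cache:
    """Non-functional placeholder; accepts and ignores all arguments."""

    def __init__(self, *args, **kwargs):
        pass
```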
Final Hypothesis
The evidence overwhelmingly suggests a low-level conflict between how llama-cpp-python (especially a version compiled against a modern CUDA 12.8 toolkit) initializes its CUDA context and how PyTorch expects to manage the VRAM. The initialization of llama-cpp-python appears to "pollute" or "fragment" the VRAM state, making subsequent large-scale allocations by PyTorch highly inefficient.
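One way to probe this hypothesis is to check whether the import by itself already touches GPU memory, before torch is ever loaded. The sketch below uses the nvidia-ml-py (pynvml) package, which is our own addition for measurement and not part of the affected setup:

```python
# Diagnostic sketch: does merely importing llama_cpp change the GPU's
# used memory, before torch is ever imported? Requires nvidia-ml-py.
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

before = pynvml.nvmlDeviceGetMemoryInfo(handle).used
import llama_cpp  # noqa: F401  -- import only, no API calls
after = pynvml.nvmlDeviceGetMemoryInfo(handle).used

print(f"VRAM used before import: {before / 1024**2:.1f} MiB")
print(f"VRAM used after import:  {after / 1024**2:.1f} MiB")
pynvml.nvmlShutdown()
```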
Steps to Reproduce
- Set up a ComfyUI environment with PyTorch and the system components listed above.
- Compile and install llama-cpp-python with CUDA 12.8 support.
- Install any ComfyUI custom node that imports llama-cpp-python at startup (e.g., ComfyUI-JoyCaption or ComfyUI-Searge-LLM-Node).
- Launch ComfyUI. The server will start, and llama-cpp-python will be imported into the process.
- Load a standard workflow designed to load a large PyTorch-based checkpoint model (e.g., an SDXL checkpoint, >10GB).
- Observe the "Dedicated GPU Memory" graph in Windows Task Manager. You will see the erratic, spiky allocation pattern instead of a smooth ramp-up (a programmatic way to capture the same pattern is sketched below).
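As an alternative to watching Task Manager, the sawtooth can be recorded numerically. This is a minimal sketch under stated assumptions: it uses nvidia-ml-py for sampling and a hypothetical checkpoint path, neither of which comes from the original setup:

```python
# Sketch: sample dedicated GPU memory in a background thread while a large
# checkpoint loads, to record the spiky allocation pattern numerically.
import threading
import time

import pynvml
import torch

import llama_cpp  # noqa: F401  -- the suspected trigger, import only

samples = []
stop = threading.Event()


def sample_vram(interval: float = 0.05) -> None:
    """Append the GPU's used-memory counter to `samples` every `interval` s."""
    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(0)
    while not stop.is_set():
        samples.append(pynvml.nvmlDeviceGetMemoryInfo(handle).used)
        time.sleep(interval)
    pynvml.nvmlShutdown()


t = threading.Thread(target=sample_vram, daemon=True)
t.start()

# Placeholder for the large PyTorch checkpoint load (hypothetical path).
state_dict = torch.load("large_sdxl_checkpoint.pt", map_location="cuda")

stop.set()
t.join()
print(f"{len(samples)} samples, peak {max(samples) / 1024**3:.2f} GiB used")
```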
Note on llama.cpp comparison: This issue is not about the performance or functionality of llama-cpp-python's inference, but rather about a resource conflict it creates with another major library (torch). Therefore, comparing behavior against llama.cpp's ./main executable is not applicable, as the issue requires both runtimes to be active in the same Python process.