
Conversation

anavp-nvidia (Contributor):

As discussed in PR #16471, this PR removes the legacy copy-op pointer indirection code. With the indirection gone, cudaMemcpyAsync can be used instead of a dedicated CUDA copy kernel for contiguous F32 tensors, yielding a ~4% performance improvement for the Nemotron Nano v2 (NemotronH) model on an RTX 5090.
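To illustrate the fast path the description refers to, here is a minimal host-side sketch of the gating condition: a copy can be issued as a single cudaMemcpyAsync only when both tensors have the same element size and are laid out contiguously, so the whole buffer is one linear byte range. The struct and function names below are illustrative, not ggml's actual definitions.

```cpp
#include <cassert>
#include <cstdint>
#include <cstddef>

// Hypothetical miniature of a ggml-style tensor descriptor (4 dims,
// element counts in ne[], byte strides in nb[]).
struct tensor {
    int64_t ne[4];      // number of elements per dimension
    size_t  nb[4];      // stride in bytes per dimension
    size_t  type_size;  // bytes per element (4 for F32)
};

// A tensor is contiguous if each stride equals the product of the
// preceding dimension sizes times the element size.
static bool is_contiguous(const tensor & t) {
    size_t expected = t.type_size;
    for (int i = 0; i < 4; ++i) {
        if (t.nb[i] != expected) {
            return false;
        }
        expected *= (size_t) t.ne[i];
    }
    return true;
}

// Fast-path condition: same element size and both sides contiguous.
// In the real code this would select cudaMemcpyAsync over a copy kernel.
static bool can_use_memcpy(const tensor & src, const tensor & dst) {
    return src.type_size == dst.type_size &&
           is_contiguous(src) && is_contiguous(dst);
}
```

A transposed or strided view fails the check and would still fall back to the copy kernel.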

Results:

Weights: bartowski/nvidia_NVIDIA-Nemotron-Nano-9B-v2-GGUF
Quantization: Q4_K_M

Performance before:

  Device 0: NVIDIA GeForce RTX 5090, compute capability 12.0, VMM: yes
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| nemotron_h 9B Q4_K - Medium    |   6.07 GiB |     8.89 B | CUDA       |  99 |  1 |    tg200 @ d100 |        165.50 ± 0.19 |
| nemotron_h 9B Q4_K - Medium    |   6.07 GiB |     8.89 B | CUDA       |  99 |  1 | pp100+tg200 @ d100 |        174.14 ± 2.02 |

Performance after:

  Device 0: NVIDIA GeForce RTX 5090, compute capability 12.0, VMM: yes
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| nemotron_h 9B Q4_K - Medium    |   6.07 GiB |     8.89 B | CUDA       |  99 |  1 |    tg200 @ d100 |        172.22 ± 0.16 |
| nemotron_h 9B Q4_K - Medium    |   6.07 GiB |     8.89 B | CUDA       |  99 |  1 | pp100+tg200 @ d100 |        180.91 ± 0.15 |

@anavp-nvidia anavp-nvidia requested a review from slaren as a code owner October 9, 2025 12:43
@github-actions github-actions bot added the labels "Nvidia GPU" (issues specific to Nvidia GPUs) and "ggml" (changes relating to the ggml tensor library for machine learning) Oct 9, 2025
slaren (Member) left a comment:


Since changing addresses of cpy operations in CUDA graphs is no longer supported, the exception for GGML_OP_CPY in ggml_graph_node_has_matching_properties should also be removed.

The indirections in cpy ops should also be removed, since their only purpose was to allow this; the same goes for ggml_cuda_cpy_dest_ptrs_copy and ggml_cuda_graph::cpy_dest_ptrs.
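The property check slaren mentions can be sketched as follows. Previously, CPY nodes were exempt from the data-pointer comparison because their destination pointers were patched through an indirection table; with that mechanism removed, a changed pointer must now report a mismatch so the captured CUDA graph gets re-instantiated. The enum, struct, and field names here are illustrative, not ggml's exact definitions.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical node descriptor with just enough state to compare.
enum op_type { OP_CPY, OP_MUL_MAT };

struct node {
    op_type      op;
    const void * data;      // output buffer address
    const void * src0_data; // first input buffer address
};

// With the indirection gone there is no more exception for OP_CPY:
// any pointer change counts as a property mismatch, which forces the
// CUDA graph to be re-captured instead of patched in place.
static bool has_matching_properties(const node & a, const node & b) {
    if (a.op != b.op) {
        return false;
    }
    return a.data == b.data && a.src0_data == b.src0_data;
}
```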

anavp-nvidia (Contributor, Author):

@CISC, I've applied the function rename in the latest commit as suggested. Could you please take a look and let me know if the changes look good, or if there's anything else you'd recommend updating before merge?

@CISC CISC merged commit 5b6913c into ggml-org:master Oct 14, 2025
70 checks passed
ddh0 added a commit to ddh0/llama.cpp that referenced this pull request Oct 14, 2025
* cuda : remove legacy copy-op pointer indirection code (ggml-org#16485)

* remove legacy copy-op pointer indirection code

* further removal of copy-op indirection code

* renamed check_node_graph_compatibility_and_refresh_copy_ops function

* CUDA: add fp kernel for larger batch size MoE (ggml-org#16512)

* CUDA: kernel for larger batch sizes for MoE

* WIP

* WIP

* WIP

* WIP

* WIP

* WIP

* fixup

* tests

* Move mmq_ids_helper to mmid

* cleanup

* Remove redundant checks

* CUDA: use fastdiv + ggml_cuda_mad for mmvf (ggml-org#16557)

* CUDA: use fastdiv + ggml_cuda_mad for mmvf

* use bf16 directly + fix formatting

* Add exception for HIP code

* CUDA: enable FA for FP32 KV cache (ggml-org#16546)

* vulkan: Improve build time for MSVC (ggml-org#16545)

Enable CMP0147 so custom build steps (invoking vulkan-shader-gen) are run in parallel.

Enable /MP so source files are compiled in parallel.

* vulkan: Support FA with K/V in F32 (ggml-org#16543)

* CUDA + openCL: fix bug in accessing rms_norm->src while doing fusion (ggml-org#16577)

* vulkan: Add ACC_TYPE_VEC2 implementation (ggml-org#16203)

Signed-off-by: Stefan Savic <stefan.savic@huawei.com>
Co-authored-by: Stefan Savic <stefan.savic@huawei.com>

* metal : avoid using Metal's gpuAddress property (ggml-org#16576)

* metal : avoid using Metal's gpuAddress property

* metal : fix rope kernels buffer check

---------

Signed-off-by: Stefan Savic <stefan.savic@huawei.com>
Co-authored-by: Anav Prasad <anavp@nvidia.com>
Co-authored-by: Aman Gupta <amangupta052@gmail.com>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
Co-authored-by: Jeff Bolz <jbolz@nvidia.com>
Co-authored-by: SavicStefan <50296686+SavicStefan@users.noreply.github.com>
Co-authored-by: Stefan Savic <stefan.savic@huawei.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
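One of the commits folded in above mentions "fastdiv": replacing integer division by a divisor that is constant at kernel-launch time with a precomputed multiply-high, add, and shift. The sketch below shows one standard formulation of the technique (Granlund–Montgomery style); ggml's actual device helper may differ in detail, and on the GPU the multiply-high would be a single __umulhi.

```cpp
#include <cassert>
#include <cstdint>

// Precomputed constants for dividing by a fixed divisor d.
struct fastdiv_vals {
    uint32_t mp; // magic multiplier
    uint32_t l;  // shift amount, ceil(log2(d))
};

static fastdiv_vals fastdiv_init(uint32_t d) {
    uint32_t l = 0;
    while (l < 32 && (uint32_t(1) << l) < d) {
        ++l;
    }
    // mp = floor(2^32 * (2^l - d) / d) + 1
    uint32_t mp = (uint32_t) (((uint64_t(1) << 32) *
                               ((uint64_t(1) << l) - d)) / d + 1);
    return { mp, l };
}

// n / d via multiply-high + add + shift, no hardware division.
static uint32_t fastdiv_u32(uint32_t n, fastdiv_vals f) {
    // 64-bit intermediate keeps the add from overflowing on the host.
    uint64_t hi = ((uint64_t) n * f.mp) >> 32;
    return (uint32_t) ((hi + n) >> f.l);
}
```

This pays off inside hot kernels such as mmvf, where the same divisor (a stride or row count) is reused across many threads while hardware integer division is comparatively slow.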