[Fix] Eliminate unnecessary vkb CPU allocation on GPU path #7296

Merged
mohanchen merged 5 commits into deepmodeling:develop from Cstandardlib:fix/vkb-gpu-memory on Apr 29, 2026
@Cstandardlib
Collaborator

Summary

  • Problem: vkb.create(nkb, npwx) in pseudopot_cell_vnl::init() always allocates a CPU-side ComplexMatrix, even on the GPU path, where it is never populated. The GPU compute routine (getvnl()) writes directly to the GPU buffers c_vkb/z_vkb. The only part of vkb still used on that path was .nc (the column dimension, equal to npwx), which serves as the leading dimension in gemm/gemv.
  • Impact: Wastes nkb × npwx × 16 bytes of CPU memory (~3.2 GB for large systems).
  • Fix: Skip vkb.create() on the GPU path, store the column dimension in a new vkbnc member, and add a lazy-allocation guard for the GPU Velocity path.
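The fix described above can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: `ComplexMatrixSketch` and `VnlSketch` are hypothetical stand-ins, and passing `use_gpu` as a parameter is an assumption made for the example (the PR description only says the allocation is guarded behind `!use_gpu_`).

```cpp
#include <cassert>
#include <complex>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a dense complex matrix with nr rows and nc columns.
struct ComplexMatrixSketch {
    int nr = 0;
    int nc = 0;
    std::vector<std::complex<double>> data;

    void create(int rows, int cols) {
        assert(rows >= 0 && cols >= 0);
        nr = rows;
        nc = cols;
        data.assign(static_cast<std::size_t>(rows) * static_cast<std::size_t>(cols),
                    std::complex<double>(0.0, 0.0));
    }
};

// Hypothetical reduction of the class touched by this PR.
struct VnlSketch {
    ComplexMatrixSketch vkb; // CPU-side projector table
    int vkbnc = 0;           // column dimension, valid even when vkb is empty

    void init(int nkb, int npwx, bool use_gpu) {
        vkbnc = npwx; // always record the leading dimension
        if (!use_gpu) {
            vkb.create(nkb, npwx); // only the CPU path needs the host buffer
        }
    }
};
```

With this shape, callers that only need the leading dimension read `vkbnc` and never touch `vkb`, so the host buffer can stay empty on the GPU path.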

Changes

| File | Change |
|---|---|
| vnl_pw.h | Add int vkbnc = 0 public member |
| vnl_pw.cpp | Guard vkb.create() behind !use_gpu_, set vkbnc = npwx |
| op_pw_nl.cpp | vkb.nc → vkbnc |
| hamilt_pw.cpp | vkb.nc → vkbnc |
| vnl_pw_grad.cpp | Lazy-allocate vkb in getgradq_vnl() for GPU Velocity path |

Test Plan

  • ✅ GPU build (buniverse.sh --cuda --test): success
  • ✅ Kernel unit tests: 28/28 passed (incl. cal_vnl_op_gpu, cal_vkb1_nl_op_gpu)
  • ✅ GPU integration tests: 38/40 passed (2 pre-existing failures: scf_bpcg, scf_out_wf — identical on clean develop)

Memory Savings

For typical large systems (nkb≈2000, npwx≈100000): ~3.2 GB CPU memory saved on GPU path.
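The estimate can be checked with a short calculation. The helper name `vkb_bytes` is hypothetical, and the sketch assumes double-precision complex entries of 16 bytes each, as stated in the summary.

```cpp
#include <cassert>
#include <complex>
#include <cstdint>

// Bytes needed for an nkb x npwx matrix of std::complex<double> entries.
constexpr std::uint64_t vkb_bytes(std::uint64_t nkb, std::uint64_t npwx) {
    return nkb * npwx * sizeof(std::complex<double>);
}
```

For nkb = 2000 and npwx = 100000 this gives 3,200,000,000 bytes, i.e. about 3.2 GB in decimal units (roughly 2.98 GiB).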

…ll_vnl

On GPU path, vkb.create(nkb, npwx) allocates CPU ComplexMatrix memory
that is never used — getvnl() writes directly to GPU buffers (c_vkb/z_vkb).
The only consumer of vkb.nc metadata is the leading dimension in gemm/gemv.

This wastes nkb*npwx*16 bytes of CPU memory (~3.2 GB for large systems).

Changes:
- Add vkbnc member to store column dimension independently
- Guard vkb.create() behind !use_gpu_ in init()
- Replace all ppcell->vkb.nc with ppcell->vkbnc (op_pw_nl.cpp, hamilt_pw.cpp)
- Add lazy-allocation guard in getgradq_vnl() for GPU Velocity path
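To illustrate why the column dimension alone is enough for the GEMM/GEMV calls, here is a hand-written matrix-vector product that takes only a raw pointer and a leading dimension. It is a sketch under assumptions: `gemv_sketch` is a hypothetical helper, the row-major layout is chosen for the example, and the real code calls library gemm/gemv routines rather than a loop like this.

```cpp
#include <cassert>
#include <complex>
#include <cstddef>
#include <vector>

using cplx = std::complex<double>;

// y[i] = sum_j a[i * lda + j] * x[j] for a rows x cols block stored row-major
// with row stride lda (lda >= cols). Only the pointer and lda are required;
// no matrix object has to exist on the caller's side.
std::vector<cplx> gemv_sketch(const cplx* a, int rows, int cols, int lda,
                              const std::vector<cplx>& x) {
    assert(lda >= cols);
    assert(static_cast<int>(x.size()) == cols);
    std::vector<cplx> y(static_cast<std::size_t>(rows), cplx(0.0, 0.0));
    for (int i = 0; i < rows; ++i) {
        for (int j = 0; j < cols; ++j) {
            y[static_cast<std::size_t>(i)] +=
                a[static_cast<std::size_t>(i) * lda + j] * x[static_cast<std::size_t>(j)];
        }
    }
    return y;
}
```

Because the routine never inspects a matrix object, the leading dimension can come from any integer that holds the right value, which is the role vkbnc plays after this change.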

Tested: GPU build + 28/28 kernel UTs + 38/40 GPU integration tests
(2 pre-existing failures: scf_bpcg, scf_out_wf)
Copilot AI review requested due to automatic review settings April 28, 2026 15:55

Copilot AI left a comment

Pull request overview

This PR reduces CPU memory usage in the plane-wave nonlocal pseudopotential (VNL) GPU execution path by avoiding allocation of the large CPU-side vkb ComplexMatrix when it isn’t populated, while preserving the needed leading-dimension metadata for GEMM/GEMV.

Changes:

  • Add vkbnc to store vkb’s intended column dimension (npwk_max) even when vkb is not allocated.
  • Skip vkb.create(nkb, npwx) in pseudopot_cell_vnl::init() when running on GPU, and use vkbnc where vkb.nc was previously used as GEMM/GEMV leading dimension.
  • Add lazy CPU allocation of vkb in getgradq_vnl() for the GPU Velocity path.
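A lazy-allocation guard of the kind mentioned in the last bullet might look like the sketch below. The exact condition used in getgradq_vnl() is not shown in this conversation, so the shape check here is an assumption, and `HostMatrixSketch` and `ensure_host_buffer` are hypothetical names.

```cpp
#include <cassert>
#include <complex>
#include <cstddef>
#include <vector>

// Hypothetical host-side matrix; create_calls counts allocations for illustration.
struct HostMatrixSketch {
    int nr = 0;
    int nc = 0;
    int create_calls = 0;
    std::vector<std::complex<double>> data;

    void create(int rows, int cols) {
        assert(rows >= 0 && cols >= 0);
        nr = rows;
        nc = cols;
        data.assign(static_cast<std::size_t>(rows) * static_cast<std::size_t>(cols),
                    std::complex<double>(0.0, 0.0));
        ++create_calls;
    }
};

// Assumed guard: allocate only when the buffer does not already have the
// requested shape, so repeated calls do not reallocate.
inline void ensure_host_buffer(HostMatrixSketch& m, int rows, int cols) {
    if (m.nr != rows || m.nc != cols) {
        m.create(rows, cols);
    }
}
```

The point of the pattern is that runs which never reach the gradient code never pay for the host buffer, while runs that do reach it allocate once and reuse the buffer afterwards.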

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.

| File | Description |
|---|---|
| source/source_pw/module_pwdft/vnl_pw.h | Introduces vkbnc to retain vkb column dimension without allocating CPU vkb on GPU runs. |
| source/source_pw/module_pwdft/vnl_pw.cpp | Sets vkbnc and guards CPU vkb allocation behind !use_gpu_. |
| source/source_pw/module_pwdft/op_pw_nl.cpp | Switches GEMM/GEMV leading-dimension argument from vkb.nc to vkbnc. |
| source/source_pw/module_pwdft/hamilt_pw.cpp | Switches GEMM/GEMV leading-dimension argument from vkb.nc to vkbnc. |
| source/source_pw/module_pwdft/vnl_pw_grad.cpp | Lazily allocates CPU vkb when needed for gradient/Velocity workflows on GPU path. |

Comment thread source/source_pw/module_pwdft/vnl_pw_grad.cpp
Comment thread source/source_pw/module_pwdft/vnl_pw_grad.cpp
Cstandardlib and others added 4 commits April 29, 2026 01:06
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@mohanchen added the Memory (Memory issues) and Refactor (Refactor ABACUS codes) labels Apr 29, 2026
Collaborator

@mohanchen mohanchen left a comment

LGTM

@mohanchen mohanchen merged commit 4b847b6 into deepmodeling:develop Apr 29, 2026
15 checks passed