[Fix] Eliminate unnecessary vkb CPU allocation on GPU path #7296
Merged
…ll_vnl
On the GPU path, vkb.create(nkb, npwx) allocates CPU ComplexMatrix memory that is never used: getvnl() writes directly to GPU buffers (c_vkb/z_vkb). The only consumer of the vkb.nc metadata is the leading dimension in gemm/gemv. This wastes nkb*npwx*16 bytes of CPU memory (~3.2 GB for large systems).
Changes:
- Add vkbnc member to store the column dimension independently
- Guard vkb.create() behind !use_gpu_ in init()
- Replace all ppcell->vkb.nc with ppcell->vkbnc (op_pw_nl.cpp, hamilt_pw.cpp)
- Add lazy-allocation guard in getgradq_vnl() for the GPU Velocity path
Tested: GPU build + 28/28 kernel UTs + 38/40 GPU integration tests (2 pre-existing failures: scf_bpcg, scf_out_wf)
Pull request overview
This PR reduces CPU memory usage in the plane-wave nonlocal pseudopotential (VNL) GPU execution path by avoiding allocation of the large CPU-side vkb ComplexMatrix when it isn’t populated, while preserving the needed leading-dimension metadata for GEMM/GEMV.
Changes:
- Add vkbnc to store vkb’s intended column dimension (npwk_max) even when vkb is not allocated.
- Skip vkb.create(nkb, npwx) in pseudopot_cell_vnl::init() when running on GPU, and use vkbnc where vkb.nc was previously used as GEMM/GEMV leading dimension.
- Add lazy CPU allocation of vkb in getgradq_vnl() for the GPU Velocity path.
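The three changes above can be sketched in a reduced form. This is a minimal illustration, not the ABACUS code: ToyComplexMatrix, ToyVnl, and ensure_cpu_vkb() are hypothetical stand-ins, and the condition used for the lazy allocation is an assumption about how such a guard might look.

```cpp
#include <cassert>
#include <complex>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a dense CPU matrix of complex<double>.
struct ToyComplexMatrix {
    int nr = 0;  // rows
    int nc = 0;  // columns
    std::vector<std::complex<double>> data;

    void create(int rows, int cols) {
        assert(rows >= 0 && cols >= 0);
        nr = rows;
        nc = cols;
        data.assign(static_cast<std::size_t>(rows) * cols,
                    std::complex<double>(0.0, 0.0));
    }

    std::size_t bytes() const {
        return data.size() * sizeof(std::complex<double>);
    }
};

// Hypothetical stand-in for the class that owns vkb, reduced to the
// members discussed in this PR.
struct ToyVnl {
    ToyComplexMatrix vkb;  // CPU-side matrix
    int vkbnc = 0;         // column dimension, valid even if vkb is empty
    int nkb = 0;
    bool use_gpu = false;

    void init(int nkb_in, int npwx, bool gpu) {
        nkb = nkb_in;
        use_gpu = gpu;
        vkbnc = npwx;             // always record the leading dimension
        if (!use_gpu) {
            vkb.create(nkb, npwx);  // CPU path: allocate as before
        }
    }

    // Assumed form of a lazy-allocation guard: allocate the CPU matrix
    // only when a CPU-side consumer actually asks for it.
    void ensure_cpu_vkb() {
        if (vkb.nr == 0 && nkb > 0 && vkbnc > 0) {
            vkb.create(nkb, vkbnc);
        }
    }
};
```

In this sketch a GPU-path object reports vkb.bytes() == 0 after init() while vkbnc still holds npwx, and the CPU matrix only appears once ensure_cpu_vkb() is called.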
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| source/source_pw/module_pwdft/vnl_pw.h | Introduces vkbnc to retain vkb column dimension without allocating CPU vkb on GPU runs. |
| source/source_pw/module_pwdft/vnl_pw.cpp | Sets vkbnc and guards CPU vkb allocation behind !use_gpu_. |
| source/source_pw/module_pwdft/op_pw_nl.cpp | Switches GEMM/GEMV leading-dimension argument from vkb.nc to vkbnc. |
| source/source_pw/module_pwdft/hamilt_pw.cpp | Switches GEMM/GEMV leading-dimension argument from vkb.nc to vkbnc. |
| source/source_pw/module_pwdft/vnl_pw_grad.cpp | Lazily allocates CPU vkb when needed for gradient/Velocity workflows on GPU path. |
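To illustrate why op_pw_nl.cpp and hamilt_pw.cpp need only an integer from vkb, here is a hand-written projection loop that takes the leading dimension as a plain argument. It is a sketch under assumptions: toy_project is a hypothetical helper, the layout (each projector occupying a contiguous run of ld elements) is assumed for illustration, and the real code calls library gemm/gemv routines on GPU buffers instead.

```cpp
#include <cassert>
#include <complex>
#include <cstddef>
#include <vector>

using cplx = std::complex<double>;

// Computes out[i] = sum_j conj(vkb[i * ld + j]) * psi[j]
// for i in [0, nkb) and j in [0, npw).
// ld is the stride between consecutive projectors; it may exceed npw,
// in which case the trailing padding in each run is never read.
std::vector<cplx> toy_project(const cplx* vkb, int ld, int nkb, int npw,
                              const cplx* psi) {
    assert(ld >= npw && nkb >= 0 && npw >= 0);
    std::vector<cplx> out(static_cast<std::size_t>(nkb), cplx(0.0, 0.0));
    for (int i = 0; i < nkb; ++i) {
        for (int j = 0; j < npw; ++j) {
            out[i] += std::conj(vkb[static_cast<std::size_t>(i) * ld + j]) * psi[j];
        }
    }
    return out;
}
```

The function never touches a matrix object: a raw buffer pointer plus ld is enough, which is the role vkbnc plays once the CPU-side vkb is no longer allocated.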
Summary
- vkb.create(nkb, npwx) in pseudopot_cell_vnl::init() always allocates a CPU-side ComplexMatrix, even on the GPU path where it is never populated. GPU compute (getvnl()) writes directly to c_vkb/z_vkb (GPU buffers). The only useful artifact was .nc (column dimension = npwx), used as the leading dimension in gemm/gemv.
- This wastes nkb × npwx × 16 bytes of CPU memory (~3.2 GB for large systems).
- The fix skips vkb.create() on the GPU path, stores the dimension in the vkbnc member, and adds a lazy-allocation guard for the GPU Velocity path.

Changes
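| File | Change |
|---|---|
| vnl_pw.h | Add public member int vkbnc = 0 |
| vnl_pw.cpp | Guard vkb.create() behind !use_gpu_, set vkbnc = npwx |
| op_pw_nl.cpp | vkb.nc → vkbnc |
| hamilt_pw.cpp | vkb.nc → vkbnc |
| vnl_pw_grad.cpp | Lazily allocate vkb in getgradq_vnl() for GPU Velocity path |

Test Plan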
- GPU build (buniverse.sh --cuda --test): success
- 28/28 kernel UTs (cal_vnl_op_gpu, cal_vkb1_nl_op_gpu)
- 38/40 GPU integration tests (2 pre-existing failures: scf_bpcg, scf_out_wf; identical on clean develop)

Memory Savings
For typical large systems (nkb≈2000, npwx≈100000): ~3.2 GB CPU memory saved on GPU path.
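The estimate follows from 16 bytes per element (two 8-byte doubles per complex value). A small compile-time check, where vkb_cpu_bytes is a hypothetical helper written only for this illustration:

```cpp
#include <cstdint>

// Bytes occupied by an nkb x npwx matrix of complex<double>,
// assuming 16 bytes per element.
constexpr std::uint64_t vkb_cpu_bytes(std::uint64_t nkb, std::uint64_t npwx) {
    return nkb * npwx * 16u;
}

// 2000 * 100000 * 16 = 3.2e9 bytes, i.e. about 3.2 GB (roughly 2.98 GiB).
static_assert(vkb_cpu_bytes(2000, 100000) == 3200000000ULL,
              "matches the ~3.2 GB figure quoted above");
```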