Running a calculation with K-point parallelism on a GPU may crash. In the run below, the job aborts with a cuBLAS assertion failure:
[denghui@LuDh-4090:pw_Si2]$ /usr/bin/mpirun -n 2 /home/denghui/abacus-develop/build/abacus
hwloc/linux: Ignoring PCI device with non-16bit domain.
Pass --enable-32bits-pci-domain to configure to support such devices
(warning: it would break the library ABI, don't enable unless really needed).
WARNING: Total thread number on this node mismatches with hardware availability. This may cause poor performance.
Info: Local MPI proc number: 2,OpenMP thread number: 2,Total thread number: 4,Local thread limit: 56
ABACUS v3.5.0
Atomic-orbital Based Ab-initio Computation at UStc
Website: http://abacus.ustc.edu.cn/
Documentation: https://abacus.deepmodeling.com/
Repository: https://github.com/abacusmodeling/abacus-develop
https://github.com/deepmodeling/abacus-develop
Commit: e135c71cb (Sat Jan 13 11:32:38 2024 +0000)
Sun Jan 14 11:21:56 2024
MAKE THE DIR : OUT.ABACUS/
RUNNING WITH DEVICE : GPU / NVIDIA GeForce RTX 4090
UNIFORM GRID DIM : 36 * 36 * 36
UNIFORM GRID DIM(BIG) : 36 * 36 * 36
DONE(1.49074 SEC) : SETUP UNITCELL
DONE(1.54515 SEC) : SYMMETRY
DONE(1.7073 SEC) : INIT K-POINTS
---------------------------------------------------------
Self-consistent calculations for electrons
---------------------------------------------------------
SPIN KPOINTS PROCESSORS
1 8 2
---------------------------------------------------------
Use plane wave basis
---------------------------------------------------------
ELEMENT NATOM XC
Si 2
---------------------------------------------------------
Initial plane wave basis and FFT box
---------------------------------------------------------
DONE(1.71664 SEC) : INIT PLANEWAVE
MEMORY FOR PSI (MB) : 1.78162
DONE(1.72678 SEC) : LOCAL POTENTIAL
DONE(1.75068 SEC) : NON-LOCAL POTENTIAL
DONE(1.7729 SEC) : INIT BASIS
-------------------------------------------
SELF-CONSISTENT :
-------------------------------------------
START CHARGE : atomic
DONE(1.80318 SEC) : INIT SCF
ITER ETOT(eV) EDIFF(eV) DRHO TIME(s)
cuBLAS Assert: CUBLAS_STATUS_INVALID_VALUE /home/denghui/abacus-develop/source/module_hsolver/kernels/cuda/math_kernel_op.cu 855
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
Process name: [[53215,1],1]
Exit code: 7
--------------------------------------------------------------------------
This crash should be fixed so that K-point parallel runs complete on GPU.
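Describe the bug
With K-point parallelism on a GPU, the run in the log above stops right after the SCF iteration header is printed. It reports "cuBLAS Assert: CUBLAS_STATUS_INVALID_VALUE" at line 855 of source/module_hsolver/kernels/cuda/math_kernel_op.cu, and process [[53215,1],1] exits with code 7. The failure was originally described as a segmentation fault.
Expected behavior
The SCF calculation should run to completion when the 8 k-points are shared by 2 MPI processes on the GPU.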
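For context, the log reports 8 k-points shared by 2 processes. The sketch below shows one common way to split k-point indices into contiguous blocks per MPI rank. It is an illustration only: the block distribution is an assumption, `split_kpoints` is a hypothetical helper, and none of this is taken from the ABACUS source.

```python
def split_kpoints(nks: int, nproc: int) -> list[range]:
    """Split k-point indices 0..nks-1 into nproc contiguous blocks.

    Hypothetical helper: assumes a plain block distribution where the
    first (nks % nproc) ranks receive one extra k-point.
    """
    if nks < 0 or nproc <= 0:
        raise ValueError("nks must be >= 0 and nproc must be > 0")
    base, extra = divmod(nks, nproc)
    blocks = []
    start = 0
    for rank in range(nproc):
        count = base + (1 if rank < extra else 0)
        blocks.append(range(start, start + count))
        start += count
    return blocks


if __name__ == "__main__":
    # Values from the log above: 8 k-points, 2 MPI processes.
    for rank, block in enumerate(split_kpoints(8, 2)):
        print(f"rank {rank}: k-points {list(block)}")
```

Under this assumption each rank handles 4 k-points. The report does not say how ranks map to GPUs or which rank's data triggers the cuBLAS error, so the sketch does not attempt to explain the failure itself.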
To Reproduce
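From the pw_Si2 directory, run the command shown in the log above with two MPI processes:
/usr/bin/mpirun -n 2 /home/denghui/abacus-develop/build/abacus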
Environment
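Taken from the log above: ABACUS v3.5.0, commit e135c71cb (Sat Jan 13 11:32:38 2024 +0000); GPU: NVIDIA GeForce RTX 4090; 2 MPI processes with 2 OpenMP threads each.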
Additional Context
No response