
Bug: nspin=4 not working with device=gpu #5306

@AsTonyshment


Describe the bug

With device=gpu, nspin=4 calculations crash with a segmentation fault right after the SCF iteration starts (see log below).

 << Start SCF iteration.
[Workstation:842863] *** Process received signal ***
[Workstation:842863] Signal: Segmentation fault (11)
[Workstation:842863] Signal code: Address not mapped (1)
[Workstation:842863] Failing at address: 0x8
[Workstation:842863] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0xebbad842520]
[Workstation:842863] [ 1] /home/abacus-develop/build/abacus(+0x75d2a5)[0x60d5e747f2a5]
[Workstation:842863] [ 2] /home/abacus-develop/build/abacus(+0x75724d)[0x60d5e747924d]
[Workstation:842863] [ 3] /home/abacus-develop/build/abacus(+0x73270b)[0x60d5e745470b]
[Workstation:842863] [ 4] /home/abacus-develop/build/abacus(+0x683b58)[0x60d5e73a5b58]
[Workstation:842863] [ 5] /home/abacus-develop/build/abacus(+0x682811)[0x60d5e73a4811]
[Workstation:842863] [ 6] /home/abacus-develop/build/abacus(+0x67b1a5)[0x60d5e739d1a5]
[Workstation:842863] [ 7] /home/abacus-develop/build/abacus(+0x3e54b9)[0x60d5e71074b9]
[Workstation:842863] [ 8] /home/abacus-develop/build/abacus(+0x5b37b5)[0x60d5e72d57b5]
[Workstation:842863] [ 9] /home/abacus-develop/build/abacus(+0x56994d)[0x60d5e728b94d]
[Workstation:842863] [10] /home/abacus-develop/build/abacus(+0x34ef5c)[0x60d5e7070f5c]
[Workstation:842863] [11] /home/abacus-develop/build/abacus(+0x36416b)[0x60d5e708616b]
[Workstation:842863] [12] /home/abacus-develop/build/abacus(+0x3621c1)[0x60d5e70841c1]
[Workstation:842863] [13] /home/abacus-develop/build/abacus(+0x3638b7)[0x60d5e70858b7]
[Workstation:842863] [14] /home/abacus-develop/build/abacus(+0x99b64)[0x60d5e6dbbb64]
[Workstation:842863] [15] /lib/x86_64-linux-gnu/libc.so.6(+0x29d90)[0xebbad829d90]
[Workstation:842863] [16] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80)[0xebbad829e40]
[Workstation:842863] [17] /home/abacus-develop/build/abacus(+0x99a05)[0x60d5e6dbba05]
[Workstation:842863] *** End of error message ***
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node Workstation exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------

Further testing suggests that the issue is not in ks_solver itself: device=cpu with ks_solver=cusolver works correctly. The problem seems to stem from the GPU grid integration rather than from the solver.
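The backtrace above only shows raw offsets into the abacus binary, so it does not by itself confirm which routine crashes. A minimal sketch of how the frames could be turned into `addr2line` commands follows. It assumes the binary was built with debug symbols and that the `+0x...` offsets map directly to addresses in the executable, which is typical for position-independent builds but not guaranteed. The helper name `addr2line_commands` is hypothetical.

```python
import re

# Matches frames such as "[ 1] /path/to/abacus(+0x75d2a5)[0x60d5e747f2a5]".
FRAME_RE = re.compile(r"\[\s*(\d+)\]\s+(\S+?)\(\+(0x[0-9a-fA-F]+)\)")


def addr2line_commands(log_text, binary_suffix="/abacus"):
    """Return one addr2line command per backtrace frame inside the given binary."""
    commands = []
    for line in log_text.splitlines():
        match = FRAME_RE.search(line)
        if match and match.group(2).endswith(binary_suffix):
            index, path, offset = match.groups()
            # -f prints the function name, -C demangles C++ symbols.
            commands.append(f"addr2line -f -C -e {path} {offset}  # frame {index}")
    return commands


# First three frames copied from the log in this report.
sample = """\
[Workstation:842863] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0xebbad842520]
[Workstation:842863] [ 1] /home/abacus-develop/build/abacus(+0x75d2a5)[0x60d5e747f2a5]
[Workstation:842863] [ 2] /home/abacus-develop/build/abacus(+0x75724d)[0x60d5e747924d]
"""

for command in addr2line_commands(sample):
    print(command)
```

Running the printed commands on the machine that produced the crash should show whether the top frames fall inside the grid integration code or elsewhere.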

Expected behavior

nspin=4 calculations should run with device=gpu just as they do with device=cpu, which works well.

To Reproduce

  1. Set device=gpu and nspin=4 for any SCF calculation.
  2. Run the calculation.
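The steps above can be sketched as a pair of INPUT files that differ only in `device`, which makes the CPU/GPU comparison easy to repeat. This is an illustrative sketch, not the reporter's actual input: `basis_type lcao` is an assumption (the report does not state the basis, but it mentions ks_solver=cusolver and grid integration), the helper `make_input` is hypothetical, and the STRU, KPT, pseudopotential, and orbital files still have to be supplied separately.

```python
def make_input(device, extra=None):
    """Build the text of a minimal ABACUS INPUT file for an nspin=4 SCF run."""
    params = {
        "calculation": "scf",
        "basis_type": "lcao",  # assumption: not stated in the report
        "nspin": "4",
        "device": device,
    }
    params.update(extra or {})
    lines = ["INPUT_PARAMETERS"]
    lines += [f"{key:<14}{value}" for key, value in params.items()]
    return "\n".join(lines) + "\n"


# Same settings for both runs; only the device differs.
for device in ("cpu", "gpu"):
    print(f"# INPUT for device={device}")
    print(make_input(device, {"ks_solver": "cusolver"}))
```

According to the report, the `cpu` variant completes while the `gpu` variant crashes at the start of the SCF iteration.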

Environment

No response

Additional Context

No response

Task list for Issue attackers (only for developers)

  • Verify the issue is not a duplicate.
  • Describe the bug.
  • Steps to reproduce.
  • Expected behavior.
  • Error message.
  • Environment details.
  • Additional context.
  • Assign a priority level (low, medium, high, urgent).
  • Assign the issue to a team member.
  • Label the issue with relevant tags.
  • Identify possible related issues.
  • Create a unit test or automated test to reproduce the bug (if applicable).
  • Fix the bug.
  • Test the fix.
  • Update documentation (if necessary).
  • Close the issue and inform the reporter (if applicable).

Labels

  • Bugs (Bugs that only solvable with sufficient knowledge of DFT)
  • GPU & DCU & HPC (GPU and DCU and HPC related any issues)
  • collinear/non-collinear/SOC (Issues related to SOC)
