LTS has problem in SCF of large system via multiple V100 #6525

@QuantumMisaka

Describe the bug

When running SCF on large MAPbI3 systems (~2500 atoms) on 8 or 16 V100 GPUs, ABACUS LTS crashes about halfway through the SCF steps (the processes abort with an uncaught 'std::length_error', as shown in the log below), while the ABACUS develop version has no problem. The crash does not appear to be related to OOM.

 EL15    -6.75299159e+05  -1.04934151e-08   3.7399e-08  53.21
terminate called after throwing an instance of 'std::length_error'
  what():  cannot create std::vector larger than max_size()
[4v100pxn10:2094864] *** Process received signal ***
[4v100pxn10:2094864] Signal: Aborted (6)
[4v100pxn10:2094864] Signal code:  (-6)
[4v100pxn10:2094864] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x45330)[0x13067d245330]
[4v100pxn10:2094864] [ 1] /lib/x86_64-linux-gnu/libc.so.6(pthread_kill+0x11c)[0x13067d29eb2c]
[4v100pxn10:2094864] [ 2] /lib/x86_64-linux-gnu/libc.so.6(gsignal+0x1e)[0x13067d24527e]
[4v100pxn10:2094864] [ 3] /lib/x86_64-linux-gnu/libc.so.6(abort+0xdf)[0x13067d2288ff]
[4v100pxn10:2094864] [ 4] /lib/x86_64-linux-gnu/libstdc++.so.6(+0xa5ff5)[0x13067d6a5ff5]
[4v100pxn10:2094864] [ 5] /lib/x86_64-linux-gnu/libstdc++.so.6(+0xbb0da)[0x13067d6bb0da]
[4v100pxn10:2094864] [ 6] /lib/x86_64-linux-gnu/libstdc++.so.6(_ZSt10unexpectedv+0x0)[0x13067d6a5a55]
[4v100pxn10:2094864] [ 7] /lib/x86_64-linux-gnu/libstdc++.so.6(+0xbb391)[0x13067d6bb391]
[4v100pxn10:2094864] [ 8] /lib/x86_64-linux-gnu/libstdc++.so.6(_ZSt20__throw_length_errorPKc+0x44)[0x13067d6a92d2]
[4v100pxn10:2094864] [ 9] abacus(+0x555c91)[0x630b2855ac91]
[4v100pxn10:2094864] [10] abacus(+0x8b29b9)[0x630b288b79b9]
[4v100pxn10:2094864] [11] abacus(+0x8cc244)[0x630b288d1244]
[4v100pxn10:2094864] [12] abacus(+0x6a889f)[0x630b286ad89f]
[4v100pxn10:2094864] [13] abacus(+0x3f82f5)[0x630b283fd2f5]
[4v100pxn10:2094864] [14] abacus(+0x412a94)[0x630b28417a94]
[4v100pxn10:2094864] [15] abacus(+0x4107e1)[0x630b284157e1]
[4v100pxn10:2094864] [16] abacus(+0x411f27)[0x630b28416f27]
[4v100pxn10:2094864] [17] abacus(+0xd08c6)[0x630b280d58c6]
[4v100pxn10:2094864] [18] /lib/x86_64-linux-gnu/libc.so.6(+0x2a1ca)[0x13067d22a1ca]
[4v100pxn10:2094864] [19] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x8b)[0x13067d22a28b]
[4v100pxn10:2094864] [20] abacus(+0xd0755)[0x630b280d5755]
[4v100pxn10:2094864] *** End of error message ***
terminate called after throwing an instance of 'std::length_error'
  what():  cannot create std::vector larger than max_size()
[4v100pxn10:2094863] *** Process received signal ***
[4v100pxn10:2094863] Signal: Aborted (6)
[4v100pxn10:2094863] Signal code:  (-6)
terminate called after throwing an instance of 'std::length_error'
  what():  cannot create std::vector larger than max_size()
[4v100pxn10:2094863] [ 0] [4v100pxn10:2094866] *** Process received signal ***
/lib/x86_64-linux-gnu/libc.so.6(+0x45330)[0x85221045330]
[4v100pxn10:2094863] [ 1] [4v100pxn10:2094866] Signal: Aborted (6)
[4v100pxn10:2094866] Signal code:  (-6)
/lib/x86_64-linux-gnu/libc.so.6(pthread_kill+0x11c)[0x8522109eb2c]
[4v100pxn10:2094863] [ 2] /lib/x86_64-linux-gnu/libc.so.6(gsignal+0x1e)[0x8522104527e]
[4v100pxn10:2094863] [ 3] [4v100pxn10:2094866] [ 0] /lib/x86_64-linux-gnu/libc.so.6(abort+0xdf)[0x852210288ff]
[4v100pxn10:2094863] [ 4] /lib/x86_64-linux-gnu/libc.so.6(+0x45330)[0x132c8c045330]
[4v100pxn10:2094866] [ 1] /lib/x86_64-linux-gnu/libstdc++.so.6(+0xa5ff5)[0x852214a5ff5]
[4v100pxn10:2094863] [ 5] /lib/x86_64-linux-gnu/libc.so.6(pthread_kill+0x11c)[0x132c8c09eb2c]
[4v100pxn10:2094866] [ 2] /lib/x86_64-linux-gnu/libc.so.6(gsignal+0x1e)[0x132c8c04527e]
[4v100pxn10:2094866] [ 3] /lib/x86_64-linux-gnu/libstdc++.so.6(+0xbb0da)[0x852214bb0da]
[4v100pxn10:2094863] [ 6] /lib/x86_64-linux-gnu/libc.so.6(abort+0xdf)[0x132c8c0288ff]
[4v100pxn10:2094866] [ 4] /lib/x86_64-linux-gnu/libstdc++.so.6(_ZSt10unexpectedv+0x0)[0x852214a5a55]
[4v100pxn10:2094863] [ 7] /lib/x86_64-linux-gnu/libstdc++.so.6(+0xbb391)[0x852214bb391]
[4v100pxn10:2094863] [ 8] /lib/x86_64-linux-gnu/libstdc++.so.6(+0xa5ff5)[0x132c8c4a5ff5]
[4v100pxn10:2094866] [ 5] /lib/x86_64-linux-gnu/libstdc++.so.6(_ZSt20__throw_length_errorPKc+0x44)[0x852214a92d2]
[4v100pxn10:2094863] [ 9] abacus(+0x555c91)[0x55e269c16c91]
[4v100pxn10:2094863] [10] abacus(+0x8b29b9)[0x55e269f739b9]
[4v100pxn10:2094863] [11] abacus(+0x8cc244)[0x55e269f8d244]
[4v100pxn10:2094863] [12] abacus(+0x6a889f)[0x55e269d6989f]
[4v100pxn10:2094863] [13] abacus(+0x3f82f5)[0x55e269ab92f5]
[4v100pxn10:2094863] [14] abacus(+0x412a94)[0x55e269ad3a94]
[4v100pxn10:2094863] [15] abacus(+0x4107e1)[0x55e269ad17e1]
[4v100pxn10:2094863] [16] abacus(+0x411f27)[0x55e269ad2f27]
[4v100pxn10:2094863] [17] abacus(+0xd08c6)[0x55e2697918c6]
[4v100pxn10:2094863] [18] /lib/x86_64-linux-gnu/libstdc++.so.6(+0xbb0da)[0x132c8c4bb0da]
[4v100pxn10:2094866] [ 6] /lib/x86_64-linux-gnu/libc.so.6(+0x2a1ca)[0x8522102a1ca]
[4v100pxn10:2094863] [19] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x8b)[0x8522102a28b]
[4v100pxn10:2094863] [20] abacus(+0xd0755)[0x55e269791755]
[4v100pxn10:2094863] *** End of error message ***
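
For context on the message in the log: libstdc++ throws 'std::length_error' when a std::vector is asked for more elements than max_size(). The log does not show where in ABACUS the size comes from, so the following is only a minimal sketch of one common way to hit this error, assuming (not confirmed here) that a negative or overflowed 32-bit count is converted to an unsigned size. Both function names are hypothetical and are not part of ABACUS.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Returns true if constructing a std::vector<double> with n elements
// throws std::length_error (the exception type reported in the log).
bool throws_length_error(std::size_t n) {
    try {
        std::vector<double> v(n);
        (void)v;  // silence unused-variable warnings
        return false;
    } catch (const std::length_error&) {
        return true;
    }
}

// Hypothetical helper (not an ABACUS function): multiplies two dimensions
// in std::size_t (64-bit on x86_64 Linux) so the product is not truncated
// to a 32-bit int, and rejects negative inputs instead of letting them
// convert to huge unsigned values.
std::size_t checked_element_count(std::int64_t rows, std::int64_t cols) {
    if (rows < 0 || cols < 0) {
        throw std::invalid_argument("negative dimension");
    }
    return static_cast<std::size_t>(rows) * static_cast<std::size_t>(cols);
}
```

Calling throws_length_error(static_cast<std::size_t>(-1)) stands in for a count that went negative before being used as a vector size, and it throws. A product such as 50000 * 50000 does not fit in a 32-bit int, but checked_element_count(50000, 50000) returns it intact.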

I wonder which PR in the develop version fixed this problem?

Expected behavior

The LTS version should also work correctly.

To Reproduce

MAPbI3_8gpu_test.tar.gz

Environment

SAI-1 HPC computer

  • CPU: AMD 9950X3D
  • GPU: V100 (8 or 16)
  • Each GPU has 8 CPU threads.

Additional Context

No response

Task list for Issue attackers (only for developers)

  • Verify the issue is not a duplicate.
  • Describe the bug.
  • Steps to reproduce.
  • Expected behavior.
  • Error message.
  • Environment details.
  • Additional context.
  • Assign a priority level (low, medium, high, urgent).
  • Assign the issue to a team member.
  • Label the issue with relevant tags.
  • Identify possible related issues.
  • Create a unit test or automated test to reproduce the bug (if applicable).
  • Fix the bug.
  • Test the fix.
  • Update documentation (if necessary).
  • Close the issue and inform the reporter (if applicable).

Metadata

Assignees

No one assigned

    Labels

      • GPU & DCU & HPC: any issues related to GPU, DCU, and HPC
      • Large Systems: issues related to large-size systems
      • Long-Time Support (LTS): issues related to the LTS version