forked from abacusmodeling/abacus-develop
Open
Labels
- GPU & DCU & HPC: any issues related to GPU, DCU, and HPC
- Large Systems: issues related to large-size systems
- Long-Time Support (LTS): issues related to the LTS version
Description
Describe the bug
When running SCF for a large MAPbI3 system (~2500 atoms) on 8 or 16 V100 GPUs, ABACUS LTS crashes about halfway through the SCF steps. I first took this for a segfault, but the log below shows each process aborting on an uncaught std::length_error. The ABACUS develop version runs the same case without problems. The crash does not appear to be related to OOM.
EL15 -6.75299159e+05 -1.04934151e-08 3.7399e-08 53.21
terminate called after throwing an instance of 'std::length_error'
what(): cannot create std::vector larger than max_size()
[4v100pxn10:2094864] *** Process received signal ***
[4v100pxn10:2094864] Signal: Aborted (6)
[4v100pxn10:2094864] Signal code: (-6)
[4v100pxn10:2094864] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x45330)[0x13067d245330]
[4v100pxn10:2094864] [ 1] /lib/x86_64-linux-gnu/libc.so.6(pthread_kill+0x11c)[0x13067d29eb2c]
[4v100pxn10:2094864] [ 2] /lib/x86_64-linux-gnu/libc.so.6(gsignal+0x1e)[0x13067d24527e]
[4v100pxn10:2094864] [ 3] /lib/x86_64-linux-gnu/libc.so.6(abort+0xdf)[0x13067d2288ff]
[4v100pxn10:2094864] [ 4] /lib/x86_64-linux-gnu/libstdc++.so.6(+0xa5ff5)[0x13067d6a5ff5]
[4v100pxn10:2094864] [ 5] /lib/x86_64-linux-gnu/libstdc++.so.6(+0xbb0da)[0x13067d6bb0da]
[4v100pxn10:2094864] [ 6] /lib/x86_64-linux-gnu/libstdc++.so.6(_ZSt10unexpectedv+0x0)[0x13067d6a5a55]
[4v100pxn10:2094864] [ 7] /lib/x86_64-linux-gnu/libstdc++.so.6(+0xbb391)[0x13067d6bb391]
[4v100pxn10:2094864] [ 8] /lib/x86_64-linux-gnu/libstdc++.so.6(_ZSt20__throw_length_errorPKc+0x44)[0x13067d6a92d2]
[4v100pxn10:2094864] [ 9] abacus(+0x555c91)[0x630b2855ac91]
[4v100pxn10:2094864] [10] abacus(+0x8b29b9)[0x630b288b79b9]
[4v100pxn10:2094864] [11] abacus(+0x8cc244)[0x630b288d1244]
[4v100pxn10:2094864] [12] abacus(+0x6a889f)[0x630b286ad89f]
[4v100pxn10:2094864] [13] abacus(+0x3f82f5)[0x630b283fd2f5]
[4v100pxn10:2094864] [14] abacus(+0x412a94)[0x630b28417a94]
[4v100pxn10:2094864] [15] abacus(+0x4107e1)[0x630b284157e1]
[4v100pxn10:2094864] [16] abacus(+0x411f27)[0x630b28416f27]
[4v100pxn10:2094864] [17] abacus(+0xd08c6)[0x630b280d58c6]
[4v100pxn10:2094864] [18] /lib/x86_64-linux-gnu/libc.so.6(+0x2a1ca)[0x13067d22a1ca]
[4v100pxn10:2094864] [19] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x8b)[0x13067d22a28b]
[4v100pxn10:2094864] [20] abacus(+0xd0755)[0x630b280d5755]
[4v100pxn10:2094864] *** End of error message ***
terminate called after throwing an instance of 'std::length_error'
what(): cannot create std::vector larger than max_size()
[4v100pxn10:2094863] *** Process received signal ***
[4v100pxn10:2094863] Signal: Aborted (6)
[4v100pxn10:2094863] Signal code: (-6)
terminate called after throwing an instance of 'std::length_error'
what(): cannot create std::vector larger than max_size()
[4v100pxn10:2094863] [ 0] [4v100pxn10:2094866] *** Process received signal ***
/lib/x86_64-linux-gnu/libc.so.6(+0x45330)[0x85221045330]
[4v100pxn10:2094863] [ 1] [4v100pxn10:2094866] Signal: Aborted (6)
[4v100pxn10:2094866] Signal code: (-6)
/lib/x86_64-linux-gnu/libc.so.6(pthread_kill+0x11c)[0x8522109eb2c]
[4v100pxn10:2094863] [ 2] /lib/x86_64-linux-gnu/libc.so.6(gsignal+0x1e)[0x8522104527e]
[4v100pxn10:2094863] [ 3] [4v100pxn10:2094866] [ 0] /lib/x86_64-linux-gnu/libc.so.6(abort+0xdf)[0x852210288ff]
[4v100pxn10:2094863] [ 4] /lib/x86_64-linux-gnu/libc.so.6(+0x45330)[0x132c8c045330]
[4v100pxn10:2094866] [ 1] /lib/x86_64-linux-gnu/libstdc++.so.6(+0xa5ff5)[0x852214a5ff5]
[4v100pxn10:2094863] [ 5] /lib/x86_64-linux-gnu/libc.so.6(pthread_kill+0x11c)[0x132c8c09eb2c]
[4v100pxn10:2094866] [ 2] /lib/x86_64-linux-gnu/libc.so.6(gsignal+0x1e)[0x132c8c04527e]
[4v100pxn10:2094866] [ 3] /lib/x86_64-linux-gnu/libstdc++.so.6(+0xbb0da)[0x852214bb0da]
[4v100pxn10:2094863] [ 6] /lib/x86_64-linux-gnu/libc.so.6(abort+0xdf)[0x132c8c0288ff]
[4v100pxn10:2094866] [ 4] /lib/x86_64-linux-gnu/libstdc++.so.6(_ZSt10unexpectedv+0x0)[0x852214a5a55]
[4v100pxn10:2094863] [ 7] /lib/x86_64-linux-gnu/libstdc++.so.6(+0xbb391)[0x852214bb391]
[4v100pxn10:2094863] [ 8] /lib/x86_64-linux-gnu/libstdc++.so.6(+0xa5ff5)[0x132c8c4a5ff5]
[4v100pxn10:2094866] [ 5] /lib/x86_64-linux-gnu/libstdc++.so.6(_ZSt20__throw_length_errorPKc+0x44)[0x852214a92d2]
[4v100pxn10:2094863] [ 9] abacus(+0x555c91)[0x55e269c16c91]
[4v100pxn10:2094863] [10] abacus(+0x8b29b9)[0x55e269f739b9]
[4v100pxn10:2094863] [11] abacus(+0x8cc244)[0x55e269f8d244]
[4v100pxn10:2094863] [12] abacus(+0x6a889f)[0x55e269d6989f]
[4v100pxn10:2094863] [13] abacus(+0x3f82f5)[0x55e269ab92f5]
[4v100pxn10:2094863] [14] abacus(+0x412a94)[0x55e269ad3a94]
[4v100pxn10:2094863] [15] abacus(+0x4107e1)[0x55e269ad17e1]
[4v100pxn10:2094863] [16] abacus(+0x411f27)[0x55e269ad2f27]
[4v100pxn10:2094863] [17] abacus(+0xd08c6)[0x55e2697918c6]
[4v100pxn10:2094863] [18] /lib/x86_64-linux-gnu/libstdc++.so.6(+0xbb0da)[0x132c8c4bb0da]
[4v100pxn10:2094866] [ 6] /lib/x86_64-linux-gnu/libc.so.6(+0x2a1ca)[0x8522102a1ca]
[4v100pxn10:2094863] [19] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x8b)[0x8522102a28b]
[4v100pxn10:2094863] [20] abacus(+0xd0755)[0x55e269791755]
[4v100pxn10:2094863] *** End of error message ***
I wonder which PR in the develop version fixed this problem?
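To narrow down which call constructs the oversized vector, the abacus(+0x...) frames in the backtrace above could be mapped to function names and source lines with addr2line. The following is only a minimal sketch: it assumes the exact `abacus` binary that produced the log is available locally and was built with debug info, and that the offsets printed in the frames are relative to the binary's load base. The helper names `abacus_offsets` and `addr2line_command` are hypothetical, introduced here for illustration.

```python
import re

# Matches frames such as "[host:pid] [ 9] abacus(+0x555c91)[0x630b2855ac91]"
# and captures the frame index and the binary-relative offset.
FRAME_RE = re.compile(r"\[\s*(\d+)\]\s+abacus\(\+(0x[0-9a-fA-F]+)\)")


def abacus_offsets(log_text):
    """Return the unique offsets of abacus frames, in first-seen order."""
    seen = []
    for _, offset in FRAME_RE.findall(log_text):
        if offset not in seen:
            seen.append(offset)
    return seen


def addr2line_command(offsets, binary="abacus"):
    """Build one addr2line call.

    -C demangles C++ names, -f prints function names, -e selects the binary.
    The default binary path is an assumption; point it at the real executable.
    """
    return ["addr2line", "-C", "-f", "-e", binary, *offsets]


# A few lines copied from the log above; two ranks report the same offset.
sample = """\
[4v100pxn10:2094864] [ 8] /lib/x86_64-linux-gnu/libstdc++.so.6(_ZSt20__throw_length_errorPKc+0x44)[0x13067d6a92d2]
[4v100pxn10:2094864] [ 9] abacus(+0x555c91)[0x630b2855ac91]
[4v100pxn10:2094864] [10] abacus(+0x8b29b9)[0x630b288b79b9]
[4v100pxn10:2094863] [ 9] abacus(+0x555c91)[0x55e269c16c91]
"""

cmd = addr2line_command(abacus_offsets(sample))
print(" ".join(cmd))
```

Because all ranks report the same offsets (only the absolute addresses differ), the duplicates are dropped and a single addr2line call covers every process. Frame 9, the first abacus frame after __throw_length_error, is probably the most informative one to resolve first.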
Expected behavior
The LTS version should also complete this calculation without crashing.
To Reproduce
Environment
SAI-1 HPC computer
- CPU: AMD 9950X3D
- GPU: V100 (8 or 16)
- Each GPU has 8 CPU threads.
Additional Context
No response
Task list for Issue attackers (only for developers)
- Verify the issue is not a duplicate.
- Describe the bug.
- Steps to reproduce.
- Expected behavior.
- Error message.
- Environment details.
- Additional context.
- Assign a priority level (low, medium, high, urgent).
- Assign the issue to a team member.
- Label the issue with relevant tags.
- Identify possible related issues.
- Create a unit test or automated test to reproduce the bug (if applicable).
- Fix the bug.
- Test the fix.
- Update documentation (if necessary).
- Close the issue and inform the reporter (if applicable).