Description
Describe the bug
I tried to run an SCF calculation for Fe16 with nspin set to 4 using the LCAO basis, and ABACUS throws an error at the beginning of the SCF.
On the Bohrium machine c32_m64_cpu (64 GB memory) with 16 cores in parallel, the SCF succeeds with the 6/7/8 au orbitals, while the calculations with the 9/10 au orbitals fail.
On the machine c32_m128_cpu (128 GB memory) with 32 cores in parallel, the calculations fail for all orbitals.
It seems that combining large orbitals with high parallelism triggers the error.
I also tested the 9 au calculation on a c32_m256_cpu machine and monitored memory during the SCF with htop. With 16 cores in parallel, memory usage is about 35 GB most of the time, with a peak of about 50 GB. With 32 cores in parallel, memory usage rises to about 47 GB most of the time, with a peak of about 74 GB, but after the peak (once usage has dropped back to about 47 GB) ABACUS terminates abnormally.
* * * * * *
<< Start SCF iteration.
===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 0 PID 39 RUNNING AT dp-lbg-471-15043057
= KILLED BY SIGNAL: 9 (Killed)
===================================================================================
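The memory figures above were read off htop by eye. To log them instead, a minimal Python sketch along the following lines could sample the summed resident memory of all ranks on a Linux node. It assumes the executable's process name contains "abacus" (adjust `name` if the binary is called something else), and the function names are hypothetical helpers, not part of ABACUS.

```python
import os
import time


def rss_kib(status_text):
    """Return VmRSS in KiB from the text of /proc/<pid>/status, or 0 if absent."""
    for line in status_text.splitlines():
        if line.startswith("VmRSS:"):
            return int(line.split()[1])  # line looks like "VmRSS:   2048 kB"
    return 0


def total_rss_kib(name="abacus", proc_root="/proc"):
    """Sum VmRSS over all processes whose name contains `name` (assumed name)."""
    total = 0
    for entry in os.listdir(proc_root):
        if not entry.isdigit():  # only numeric entries are process directories
            continue
        try:
            with open(os.path.join(proc_root, entry, "status")) as f:
                text = f.read()
        except OSError:  # the process may have exited between listdir and open
            continue
        lines = text.splitlines()
        if not lines or not lines[0].startswith("Name:"):
            continue
        comm = lines[0].split(":", 1)[1].strip()
        if name in comm:
            # Summing RSS counts pages shared between ranks more than once,
            # so this is an upper estimate of the real footprint.
            total += rss_kib(text)
    return total


def peak_rss_kib(name="abacus", interval=1.0, samples=60, proc_root="/proc"):
    """Sample the summed RSS `samples` times and return the peak in KiB."""
    peak = 0
    for _ in range(samples):
        current = total_rss_kib(name, proc_root)
        peak = max(peak, current)
        print(f"current={current / 1048576:.2f} GiB  peak={peak / 1048576:.2f} GiB")
        time.sleep(interval)
    return peak
```

Running `peak_rss_kib("abacus", interval=1.0, samples=600)` in a second shell while the job runs would print one line per second for ten minutes. Comparing the last line before the kill with the machine's memory limit would help show whether the SIGKILL coincides with memory exhaustion.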
Expected behavior
No response
To Reproduce
No response
Environment
No response
Additional Context
No response
Task list for Issue attackers (only for developers)
- Verify the issue is not a duplicate.
- Describe the bug.
- Steps to reproduce.
- Expected behavior.
- Error message.
- Environment details.
- Additional context.
- Assign a priority level (low, medium, high, urgent).
- Assign the issue to a team member.
- Label the issue with relevant tags.
- Identify possible related issues.
- Create a unit test or automated test to reproduce the bug (if applicable).
- Fix the bug.
- Test the fix.
- Update documentation (if necessary).
- Close the issue and inform the reporter (if applicable).