ABACUS LTS 3.10.0 Numerical Instability (NaN Error) with AOCC/AOCL Toolchain, while GCC/OpenBLAS is Stable. #6420

@swanwu1996

Describe the bug

I am experiencing a numerical stability issue when running ABACUS compiled with the AMD AOCC/AOCL toolchain. For a specific input file, the calculation fails during the SCF cycle with NaN (Not a Number) values appearing in the charge density, leading to a Charge_Mixing factorization error.

Crucially, the exact same input file and MPI configuration run successfully when using an ABACUS executable compiled with the standard GCC and OpenBLAS toolchain. This suggests the issue is related to subtle differences in floating-point behavior or optimization between the two compiler/library ecosystems.
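As a generic illustration of why two toolchains can disagree numerically (not a diagnosis of this particular failure), floating-point addition is not associative, so a compiler or BLAS library that reorders a reduction can return a different result from the same inputs. The sketch below uses made-up values and a hypothetical helper, `sum_left_to_right`, to show the effect in Python:

```python
def sum_left_to_right(values):
    """Accumulate in the given order, like a naive serial reduction."""
    total = 0.0
    for v in values:
        total += v
    return total

# Same three numbers, two different summation orders (illustrative values only).
original_order = [1e16, 1.0, -1e16]
reordered = [1e16, -1e16, 1.0]

a = sum_left_to_right(original_order)  # 1.0 is lost when added to 1e16
b = sum_left_to_right(reordered)       # the large terms cancel first, so 1.0 survives

print(a, b, a == b)
```

A discrepancy of this kind is normally tiny, so it would only be a starting point for explaining how one build ends up with NaN while the other converges.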

The error message from the AOCC/AOCL build is:

...
charge before normalized = -nan
charge after normalized = -nan
...
Charge_Mixing warning : Error when factorizing beta.
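To narrow down where the NaN first appears, one option is to scan the run log for the earliest line containing a `nan` token. This is a minimal sketch: the helper name `first_nan_line` is hypothetical, and the log path is whatever file the failing run wrote (it is not specified in this report).

```python
import re

# Matches "nan" or "-nan" as a standalone token, case-insensitive,
# so that words such as "finance" are not flagged.
NAN_TOKEN = re.compile(r"(?<![A-Za-z0-9_])-?nan(?![A-Za-z0-9_])", re.IGNORECASE)

def first_nan_line(lines):
    """Return (line_number, text) for the first line with a NaN token, else None."""
    for number, text in enumerate(lines, start=1):
        if NAN_TOKEN.search(text):
            return number, text.rstrip("\n")
    return None

# Example using lines shaped like the error output quoted above.
sample = [
    "some earlier SCF output\n",
    "charge before normalized = -nan\n",
    "charge after normalized = -nan\n",
]
print(first_nan_line(sample))
```

For a real run, the function can be passed an open file handle in place of `sample`, since it only iterates over lines.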

=========Environment Details===========

ABACUS Version: 3.10.0-LTS
CPU Architecture: AMD EPYC (Zen architecture)

Toolchain 1 (Fails):
C/C++ Compiler: AOCC 5.0.0 (clang)
Fortran Compiler: gfortran 11.5.0
MPI: OpenMPI 5.0.3 (compiled with AOCC)
Math Libraries: AOCL 5.0.0 (libblis, libflame, libscalapack)
ELPA: 2025.01.001 (compiled with AOCC/gfortran + AOCL)

Toolchain 2 (Succeeds):
Compiler: GCC 13.2.0
MPI: OpenMPI (compiled with GCC)
Math Libraries: OpenBLAS, ScaLAPACK, Libxc, FFTW3, ELPA (all built from source with GCC)

Expected behavior

No response

To Reproduce

No response

Environment

No response

Additional Context

No response

Task list for Issue attackers (only for developers)

  • Verify the issue is not a duplicate.
  • Describe the bug.
  • Steps to reproduce.
  • Expected behavior.
  • Error message.
  • Environment details.
  • Additional context.
  • Assign a priority level (low, medium, high, urgent).
  • Assign the issue to a team member.
  • Label the issue with relevant tags.
  • Identify possible related issues.
  • Create a unit test or automated test to reproduce the bug (if applicable).
  • Fix the bug.
  • Test the fix.
  • Update documentation (if necessary).
  • Close the issue and inform the reporter (if applicable).
