[Bug] Abnormal performance when running the IrO test case with 64 processes #7297

@Searcher97

Describe the bug

I ran verification tests with the provided test cases and reached the following conclusions:
1. For the IrO test case, setting nx, ny, nz to 64, 64, 64 (the default is 54, 54, 40) makes the run with 64 processes behave normally.
2. For the PW group's 001_4GaAs test case with nx, ny, nz set to 54, 54, 40, the results deviate significantly from the reference values: TOTAL-PRESSURE 66.242796 kbar (-9.861349) and #TOTAL ENERGY# -7837.8815489 eV (-979.589643548583). Performance is also abnormal.
3. I verified this on both the x86 and FT platforms, and the conclusions are consistent.
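
To put numbers on the deviation in item 2, here is a minimal sketch. It assumes the values in parentheses are the reference values and the values before them are the computed ones; the dictionary keys and the `deviation` helper are names chosen for illustration only.

```python
# Computed vs. reference values from item 2.
# Assumption: the number in parentheses is the reference value.
results = {
    "TOTAL-PRESSURE (kbar)": (66.242796, -9.861349),
    "TOTAL ENERGY (eV)": (-7837.8815489, -979.589643548583),
}


def deviation(computed: float, reference: float) -> tuple[float, float]:
    """Return (absolute difference, relative difference) of computed vs. reference."""
    diff = computed - reference
    return diff, abs(diff) / abs(reference)


for name, (computed, reference) in results.items():
    diff, rel = deviation(computed, reference)
    print(f"{name}: diff = {diff:.6f}, relative = {rel:.2%}")
```

Both quantities differ from the reference by several times the reference magnitude, far more than any reasonable numerical tolerance.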

The x86 platform used version 26 + fftw 3.3.10, and the FT platform used version 21 + fftw 3.3.7. I currently suspect the problem is related to MPI process partitioning. Could it be that when nz=40, partitioning the FFT calculation across 64 processes introduces errors in the final result? I noticed that the ABACUS code parallelizes the FFT by partitioning the grid along nz across processes. How does it handle the case where some processes are not assigned any data?
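
The situation I am asking about can be illustrated with a small sketch. It is not the actual ABACUS partitioning code; it assumes a simple block distribution in which the nz planes are handed out as evenly as possible, and `split_planes` is a hypothetical helper written only for this illustration.

```python
def split_planes(nz: int, nproc: int) -> list[int]:
    """Block-distribute nz planes over nproc ranks.

    Assumed scheme (not taken from ABACUS): every rank gets nz // nproc
    planes, and the first nz % nproc ranks get one extra plane.
    """
    base, extra = divmod(nz, nproc)
    return [base + (1 if rank < extra else 0) for rank in range(nproc)]


for nz, nproc in [(40, 64), (64, 64), (40, 32)]:
    counts = split_planes(nz, nproc)
    idle = counts.count(0)  # ranks that receive no plane at all
    print(f"nz={nz}, nproc={nproc}: max planes per rank={max(counts)}, idle ranks={idle}")
```

Under this assumed scheme, nz=40 with 64 processes leaves 24 ranks without any plane, while nz=64 with 64 processes, or nz=40 with 32 processes, gives every rank at least one. That matches the pattern I observe, but whether empty ranks are the actual cause still needs to be confirmed in the code.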

Expected behavior

No response

To Reproduce

ABACUS-天河复现和Intel对比.zip
1. Unzip the example mentioned above and enter its directory.
2. Run the ABACUS calculation using 64 processes.
3. When the grid dimensions are 54, 54, 40, a performance anomaly occurs.
4. Performance returns to normal when modifying the grid to 64, 64, 64 or reducing the number of processes to below 40.

Environment

No response

Additional Context

No response

Task list for Issue attackers (only for developers)

  • Verify the issue is not a duplicate.
  • Describe the bug.
  • Steps to reproduce.
  • Expected behavior.
  • Error message.
  • Environment details.
  • Additional context.
  • Assign a priority level (low, medium, high, urgent).
  • Assign the issue to a team member.
  • Label the issue with relevant tags.
  • Identify possible related issues.
  • Create a unit test or automated test to reproduce the bug (if applicable).
  • Fix the bug.
  • Test the fix.
  • Update documentation (if necessary).
  • Close the issue and inform the reporter (if applicable).

Metadata

Assignees

No one assigned

    Labels

    Performance (Issues related to fail running ABACUS), Questions (Raise your question! We will answer it.)
