abnormal stopped of DCU jobs (Device & Memory) #4026

Closed
16 tasks
pxlxingliang opened this issue Apr 19, 2024 · 3 comments · Fixed by #4047
Labels: GPU & DCU & HPC (GPU and DCU and HPC related any issues)

Comments

@pxlxingliang
Collaborator

Describe the bug

Some jobs on DCU stop abnormally.

  1. Stopped before SCF
    beforescf.zip

The last lines of the screen output are:

 START CHARGE      : atomic
 DONE(12.4994    SEC) : INIT SCF
 ITER   ETOT(eV)       EDIFF(eV)      DRHO       TIME(s)

The last lines of running_scf.log are:

 -------------------------------------------
 SELF-CONSISTENT
 -------------------------------------------
                                 init_chg = atomic
 DONE : INIT SCF Time : 12.4993 (SEC)

  2. Stopped when calculating stress
    stress.zip

  3. Stopped at the beginning
    start.zip

 Init Non-Local PseudoPotential table :
 Init Non-Local-Pseudopotential done.
 DONE : NON-LOCAL POTENTIAL Time : 10.011598924 (SEC)


 Make real space PAO into reciprocal space.
       max mesh points in Pseudopotential = 1001
     dq(describe PAO in reciprocal space) = 0.01
                                    max q = 1204

 number of pseudo atomic orbitals for Sr is 0

 number of pseudo atomic orbitals for Al is 2

 Warning_Memory_Consuming allocated:  PW_B_K::ig2ixyz 8.63247299194 MB

Expected behavior

No response

To Reproduce

No response

Environment

No response

Additional Context

No response

Task list for Issue attackers (only for developers)

  • Verify the issue is not a duplicate.
  • Describe the bug.
  • Steps to reproduce.
  • Expected behavior.
  • Error message.
  • Environment details.
  • Additional context.
  • Assign a priority level (low, medium, high, urgent).
  • Assign the issue to a team member.
  • Label the issue with relevant tags.
  • Identify possible related issues.
  • Create a unit test or automated test to reproduce the bug (if applicable).
  • Fix the bug.
  • Test the fix.
  • Update documentation (if necessary).
  • Close the issue and inform the reporter (if applicable).
@pxlxingliang added the "Bugs (Exclude input and output)" label (Bugs that only solvable with sufficient knowledge of DFT) on Apr 19, 2024
@WHUweiqingzhou
Collaborator

@denghuilu, could you have a look?

@denghuilu
Member

I have reviewed each STDOUTER.log file and found that the abnormal stops were caused by an Out of Memory error.

COMMAND: echo ks_solver cg >> INPUT; bash run.sh -o 1 -n 4 -d 1 -s 0
WARNING: Total thread number on this node mismatches with hardware availability. This may cause poor performance.
Info: Local MPI proc number: 4,OpenMP thread number: 1,Total thread number: 4,Local thread limit: 32
 Unexpected Device Error /public/home/abacus/abacus-dcu/source/module_psi/kernels/rocm/memory_op.hip.cu:48: hipErrorOutOfMemory, out of memory
 Unexpected Device Error /public/home/abacus/abacus-dcu/source/module_psi/kernels/rocm/memory_op.hip.cu:48: hipErrorOutOfMemory, out of memory
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
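
For anyone triaging similar stops, the log review described above can be roughly automated. The following is a minimal sketch, not part of ABACUS: the function names `find_device_errors` and `summarize` are hypothetical, and the regular expression is only modeled on the "Unexpected Device Error" lines quoted in this comment, so other error formats may not match.

```python
import re
from collections import Counter

# Pattern modeled on the error lines quoted in this thread (an assumption):
#   "Unexpected Device Error <file>:<line>: <error name>, <message>"
DEVICE_ERROR = re.compile(
    r"Unexpected Device Error\s+(?P<file>\S+):(?P<line>\d+):\s*"
    r"(?P<name>\w+),\s*(?P<message>.*)"
)


def find_device_errors(log_text):
    """Return one dict (file, line, name, message) per device-error line."""
    matches = (DEVICE_ERROR.search(row) for row in log_text.splitlines())
    return [m.groupdict() for m in matches if m]


def summarize(log_text):
    """Count device errors by error name, e.g. hipErrorOutOfMemory."""
    return Counter(err["name"] for err in find_device_errors(log_text))


if __name__ == "__main__":
    sample = (
        "Info: Local MPI proc number: 4,OpenMP thread number: 1\n"
        " Unexpected Device Error /public/home/abacus/abacus-dcu/source/"
        "module_psi/kernels/rocm/memory_op.hip.cu:48: "
        "hipErrorOutOfMemory, out of memory\n"
    )
    for name, count in summarize(sample).items():
        print(name, count)
```

To check a real run, pass the contents of a STDOUTER.log file to `summarize`; an empty result means no line matched this pattern, not that the job ran cleanly.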

@WHUweiqingzhou
Collaborator

@dyzheng, we need to check the memory usage in these cases.

@mohanchen added the "GPU & DCU & HPC" label and removed the "Bugs (Exclude input and output)" label on May 5, 2024
@WHUweiqingzhou changed the title from "abnormal stopped of DCU jobs" to "abnormal stopped of DCU jobs (Device & Memory)" on May 10, 2024