ZC API crashes while running some tests/examples with gni-crayxe builds on bluewaters #2589

@nitbhat

Description

Tests of the zero-copy (ZC) API, especially the point-to-point entry method (EM) API and the EM Post API, occasionally crash on Blue Waters with gni-crayxe builds.

Specifically, the following tests are crashing:

  1. tests/charm++/zerocopy/zerocopy_with_qd:
../../../../bin/testrun  +p4 ./zerocopy_with_qd 100

Running on 4 processors:  ./zerocopy_with_qd 100
aprun -n 4 -d 2 ./zerocopy_with_qd 100
Charm++> Running on Gemini (GNI) with 4 processes
Charm++> static SMSG
Charm++> SMSG memory: 19.8KB
Charm++> memory pool init block size: 8MB, total memory pool limit 0MB (0 means no limit)
Charm++> memory pool registered memory limit: 200000MB, send limit: 100000MB
Charm++> Cray TLB page size: 8192K
Charm++> Running in non-SMP mode: 4 processes (PEs)
Converse/Charm++ Commit ID: v6.10.0-rc2-29-gd0f064260
CharmLB> Load balancer assumes all CPUs are same.
Charm++> Running on 1 hosts (2 sockets x 16 cores x 1 PUs = 32-way SMP)
[0][0][0] Test 1: QD has been reached for RO Variable Bcast
[0][0][0] Test 2: QD has been reached for Direct API
_pmiu_daemon(SIGCHLD): [NID 01160] [c2-1c2s4n2] [Wed Oct 30 15:41:54 2019] PE RANK 0 exit signal Killed
[NID 01160] 2019-10-30 15:41:54 Apid 81740147: initiated application termination
[NID 01160] 2019-10-30 15:41:54 Apid 81740147: Cray HSN detected critical error 0x40c[ptag 32]. Please contact admin for details. Killing pid 31266(zerocopy_with_q)
Application 81740147 exit codes: 137
Application 81740147 resources: utime ~2s, stime ~1s, Rss ~24300, inblocks ~14699, outblocks ~339091
  2. examples/charm++/zerocopy/entry_method_post_api/unreg/simpleZeroCopy:
../../../../../../bin/testrun  +p4 ./simpleZeroCopy 32 +balancer RotateLB

Running on 4 processors:  ./simpleZeroCopy 32 +balancer RotateLB
aprun -n 4 -d 2 ./simpleZeroCopy 32 +balancer RotateLB
Charm++> Running on Gemini (GNI) with 4 processes
Charm++> static SMSG
Charm++> SMSG memory: 19.8KB
Charm++> memory pool init block size: 8MB, total memory pool limit 0MB (0 means no limit)
Charm++> memory pool registered memory limit: 200000MB, send limit: 100000MB
Charm++> Cray TLB page size: 8192K
Charm++> Running in non-SMP mode: 4 processes (PEs)
Converse/Charm++ Commit ID: v6.10.0-rc2-26-g92bc4acf9
CharmLB> Load balancer assumes all CPUs are same.
Charm++> Running on 1 hosts (2 sockets x 16 cores x 1 PUs = 32-way SMP)
CharmLB> RotateLB created.
send: completed
------------- Processor 0 Exiting: Called CmiAbort ------------
Reason: CmiFree reference count was zero-- is this a duplicate free?
[0] Stack Traceback:
  [0:0] simpleZeroCopy 0x203e8f1c CmiAbortHelper(char const*, char const*, char const*, int, int)
  [0:1] simpleZeroCopy 0x203e9001 CmiGetNonLocal
  [0:2] simpleZeroCopy 0x203f4c61 CmiFree
  [0:3] simpleZeroCopy 0x203ecff8
  [0:4] simpleZeroCopy 0x203ed130
  [0:5] simpleZeroCopy 0x203ed282 LrtsAdvanceCommunication(int)
------------- Processor 1 Exiting: Called CmiAbort ------------
Reason: CmiFree reference count was zero-- is this a duplicate free?
[1] Stack Traceback:
  [0:6] simpleZeroCopy 0x203e8d59
  [0:7] simpleZeroCopy 0x203e913a
  [1:0] simpleZeroCopy 0x203e8f1c CmiAbortHelper(char const*, char const*, char const*, int, int)
  [0:8] simpleZeroCopy 0x203f6591
  [1:1] simpleZeroCopy 0x203e9001 CmiGetNonLocal
  [0:9] simpleZeroCopy 0x203f72db CcdRaiseCondition
  [1:2] simpleZeroCopy 0x203f4c61 CmiFree
  [0:10] simpleZeroCopy 0x203f23b9 CsdStillIdle
  [1:3] simpleZeroCopy 0x203ecff8
  [0:11] simpleZeroCopy 0x203f2736 CsdScheduleForever
  [1:4] simpleZeroCopy 0x203ed130
  [0:12] simpleZeroCopy 0x203f2641 CsdScheduler
  [1:5] simpleZeroCopy 0x203ed282 LrtsAdvanceCommunication(int)
  [0:13] simpleZeroCopy 0x203e8d1c
  [1:6] simpleZeroCopy 0x203e8d59
  [0:14] simpleZeroCopy 0x203e8c23 ConverseInit
  [1:7] simpleZeroCopy 0x203e9076 CmiGetNonLocal
  [0:15] simpleZeroCopy 0x202a42e7 charm_main
  [0:16] simpleZeroCopy 0x2029cda2 main
  [1:8] simpleZeroCopy 0x203f258b CsdNextMessage
  [0:17] libc.so.6 0x2aaaac80ac36 __libc_start_main
aborting job:
CmiFree reference count was zero-- is this a duplicate free?
  [0:18] simpleZeroCopy 0x20234fc9
  [1:9] simpleZeroCopy 0x203f26e7 CsdScheduleForever
  [1:10] simpleZeroCopy 0x203f2641 CsdScheduler
  [1:11] simpleZeroCopy 0x203e8d1c
  [1:12] simpleZeroCopy 0x203e8c23 ConverseInit
  [1:13] simpleZeroCopy 0x202a42e7 charm_main
  [1:14] simpleZeroCopy 0x2029cda2 main
  [1:15] libc.so.6 0x2aaaac80ac36 __libc_start_main
aborting job:
CmiFree reference count was zero-- is this a duplicate free?
  [1:16] simpleZeroCopy 0x20234fc9
[NID 07961] 2019-10-29 16:27:16 Apid 81629184: initiated application termination
Application 81629184 exit codes: 255
Application 81629184 resources: utime ~1s, stime ~1s, Rss ~26112, inblocks ~23996, outblocks ~55758

real    0m7.946s
user    0m2.828s
sys    0m0.532s
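
Because the crashes are intermittent, a retry loop along the following lines may help reproduce them. This is only a sketch: `run_until_failure` is a hypothetical helper, not part of Charm++, and the attempt limit in the usage comment is an arbitrary choice. It reruns the given command until it exits with a non-zero status or the limit is reached, then returns that status.

```shell
#!/bin/sh
# Hypothetical helper (not part of Charm++): rerun a command until it
# fails or the attempt limit is reached.
# Usage: run_until_failure MAX_ATTEMPTS COMMAND [ARGS...]
run_until_failure() {
    max_attempts=$1
    shift
    attempt=1
    while [ "$attempt" -le "$max_attempts" ]; do
        status=0
        # Run the command; record its exit status if it is non-zero.
        "$@" || status=$?
        if [ "$status" -ne 0 ]; then
            echo "attempt $attempt failed with exit status $status"
            return "$status"
        fi
        attempt=$((attempt + 1))
    done
    echo "no failure in $max_attempts attempts"
    return 0
}

# Example, run from the test directory (50 is an arbitrary limit):
# run_until_failure 50 ../../../../bin/testrun +p4 ./zerocopy_with_qd 100
```

The helper passes the command's output through unchanged, so the log of the failing run ends up last in the terminal or in whatever file the loop is redirected to.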

Metadata

Labels

Bug, Cray
