Description
Zero-copy (ZC) API tests, especially those for the point-to-point (p2p) Entry Method (EM) API and the p2p EM Post API, occasionally crash on Blue Waters with gni-crayxe builds.
Specifically, the following tests are crashing:
- tests/charm++/zerocopy/zerocopy_with_qd:
../../../../bin/testrun +p4 ./zerocopy_with_qd 100
Running on 4 processors: ./zerocopy_with_qd 100
aprun -n 4 -d 2 ./zerocopy_with_qd 100
Charm++> Running on Gemini (GNI) with 4 processes
Charm++> static SMSG
Charm++> SMSG memory: 19.8KB
Charm++> memory pool init block size: 8MB, total memory pool limit 0MB (0 means no limit)
Charm++> memory pool registered memory limit: 200000MB, send limit: 100000MB
Charm++> Cray TLB page size: 8192K
Charm++> Running in non-SMP mode: 4 processes (PEs)
Converse/Charm++ Commit ID: v6.10.0-rc2-29-gd0f064260
CharmLB> Load balancer assumes all CPUs are same.
Charm++> Running on 1 hosts (2 sockets x 16 cores x 1 PUs = 32-way SMP)
[0][0][0] Test 1: QD has been reached for RO Variable Bcast
[0][0][0] Test 2: QD has been reached for Direct API
_pmiu_daemon(SIGCHLD): [NID 01160] [c2-1c2s4n2] [Wed Oct 30 15:41:54 2019] PE RANK 0 exit signal Killed
[NID 01160] 2019-10-30 15:41:54 Apid 81740147: initiated application termination
[NID 01160] 2019-10-30 15:41:54 Apid 81740147: Cray HSN detected critical error 0x40c[ptag 32]. Please contact admin for details. Killing pid 31266(zerocopy_with_q)
Application 81740147 exit codes: 137
Application 81740147 resources: utime ~2s, stime ~1s, Rss ~24300, inblocks ~14699, outblocks ~339091
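For reading the exit code in the log above: under the usual shell convention, a status of 128 + N means the process was terminated by signal N, so 137 would correspond to SIGKILL (9), which matches the "exit signal Killed" line. The helper below is only a sketch of that arithmetic. The function name `signal_from_exit_status` is hypothetical, and it is an assumption that aprun reports exit codes using this convention.

```cpp
#include <cassert>
#include <csignal>

// Hypothetical helper, not part of Charm++ or aprun.
// Returns the signal number implied by a shell-style exit status, or 0 if the
// status does not look like "128 + signal".
int signal_from_exit_status(int status) {
    // Exit statuses reported by the launcher are expected to fit in one byte.
    assert(status >= 0 && status <= 255);

    const int kMaxSignal = 64;  // assumed upper bound for Linux signal numbers
    if (status > 128 && status <= 128 + kMaxSignal) {
        return status - 128;
    }
    return 0;  // ordinary exit status, no signal implied
}
```

With this convention, `signal_from_exit_status(137)` equals `SIGKILL` on Linux, which would be consistent with the job being killed externally after the HSN critical error rather than aborting on its own.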
- examples/charm++/zerocopy/entry_method_post_api/unreg/simpleZeroCopy:
../../../../../../bin/testrun +p4 ./simpleZeroCopy 32 +balancer RotateLB
Running on 4 processors: ./simpleZeroCopy 32 +balancer RotateLB
aprun -n 4 -d 2 ./simpleZeroCopy 32 +balancer RotateLB
Charm++> Running on Gemini (GNI) with 4 processes
Charm++> static SMSG
Charm++> SMSG memory: 19.8KB
Charm++> memory pool init block size: 8MB, total memory pool limit 0MB (0 means no limit)
Charm++> memory pool registered memory limit: 200000MB, send limit: 100000MB
Charm++> Cray TLB page size: 8192K
Charm++> Running in non-SMP mode: 4 processes (PEs)
Converse/Charm++ Commit ID: v6.10.0-rc2-26-g92bc4acf9
CharmLB> Load balancer assumes all CPUs are same.
Charm++> Running on 1 hosts (2 sockets x 16 cores x 1 PUs = 32-way SMP)
CharmLB> RotateLB created.
send: completed
------------- Processor 0 Exiting: Called CmiAbort ------------
Reason: CmiFree reference count was zero-- is this a duplicate free?
[0] Stack Traceback:
[0:0] simpleZeroCopy 0x203e8f1c CmiAbortHelper(char const*, char const*, char const*, int, int)
[0:1] simpleZeroCopy 0x203e9001 CmiGetNonLocal
[0:2] simpleZeroCopy 0x203f4c61 CmiFree
[0:3] simpleZeroCopy 0x203ecff8
[0:4] simpleZeroCopy 0x203ed130
[0:5] simpleZeroCopy 0x203ed282 LrtsAdvanceCommunication(int)
------------- Processor 1 Exiting: Called CmiAbort ------------
Reason: CmiFree reference count was zero-- is this a duplicate free?
[1] Stack Traceback:
[0:6] simpleZeroCopy 0x203e8d59
[0:7] simpleZeroCopy 0x203e913a
[1:0] simpleZeroCopy 0x203e8f1c CmiAbortHelper(char const*, char const*, char const*, int, int)
[0:8] simpleZeroCopy 0x203f6591
[1:1] simpleZeroCopy 0x203e9001 CmiGetNonLocal
[0:9] simpleZeroCopy 0x203f72db CcdRaiseCondition
[1:2] simpleZeroCopy 0x203f4c61 CmiFree
[0:10] simpleZeroCopy 0x203f23b9 CsdStillIdle
[1:3] simpleZeroCopy 0x203ecff8
[0:11] simpleZeroCopy 0x203f2736 CsdScheduleForever
[1:4] simpleZeroCopy 0x203ed130
[0:12] simpleZeroCopy 0x203f2641 CsdScheduler
[1:5] simpleZeroCopy 0x203ed282 LrtsAdvanceCommunication(int)
[0:13] simpleZeroCopy 0x203e8d1c
[1:6] simpleZeroCopy 0x203e8d59
[0:14] simpleZeroCopy 0x203e8c23 ConverseInit
[1:7] simpleZeroCopy 0x203e9076 CmiGetNonLocal
[0:15] simpleZeroCopy 0x202a42e7 charm_main
[0:16] simpleZeroCopy 0x2029cda2 main
[1:8] simpleZeroCopy 0x203f258b CsdNextMessage
[0:17] libc.so.6 0x2aaaac80ac36 __libc_start_main
aborting job:
CmiFree reference count was zero-- is this a duplicate free?
[0:18] simpleZeroCopy 0x20234fc9
[1:9] simpleZeroCopy 0x203f26e7 CsdScheduleForever
[1:10] simpleZeroCopy 0x203f2641 CsdScheduler
[1:11] simpleZeroCopy 0x203e8d1c
[1:12] simpleZeroCopy 0x203e8c23 ConverseInit
[1:13] simpleZeroCopy 0x202a42e7 charm_main
[1:14] simpleZeroCopy 0x2029cda2 main
[1:15] libc.so.6 0x2aaaac80ac36 __libc_start_main
aborting job:
CmiFree reference count was zero-- is this a duplicate free?
[1:16] simpleZeroCopy 0x20234fc9
[NID 07961] 2019-10-29 16:27:16 Apid 81629184: initiated application termination
Application 81629184 exit codes: 255
Application 81629184 resources: utime ~1s, stime ~1s, Rss ~26112, inblocks ~23996, outblocks ~55758
real 0m7.946s
user 0m2.828s
sys 0m0.532s
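For context on the "CmiFree reference count was zero-- is this a duplicate free?" abort in the second log, the sketch below shows the general kind of check that produces such a message: a reference-counted buffer whose release path refuses to run once the count has already reached zero. This is a simplified illustration only. The class `PooledBuffer` and its methods are hypothetical and do not reflect how CmiFree or the GNI memory pool are actually implemented.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical reference-counted buffer, NOT Charm++'s implementation.
// The object outlives its storage so that a second release can be detected
// without touching freed memory.
class PooledBuffer {
public:
    explicit PooledBuffer(std::size_t bytes) : data_(bytes), ref_count_(1) {}

    // Take an additional reference, e.g. when a second owner holds the buffer.
    void retain() {
        if (ref_count_ <= 0) {
            throw std::logic_error("retain on an already released buffer");
        }
        ++ref_count_;
    }

    // Drop one reference. Returns true if this call released the storage.
    bool release() {
        assert(ref_count_ >= 0);  // invariant: the count never goes negative
        if (ref_count_ == 0) {
            // Same situation the abort message describes: the count is
            // already zero, so this release has no matching reference.
            throw std::logic_error("reference count was zero: duplicate free?");
        }
        if (--ref_count_ == 0) {
            data_.clear();
            data_.shrink_to_fit();  // give the storage back
            return true;
        }
        return false;
    }

    int ref_count() const { return ref_count_; }

private:
    std::vector<char> data_;
    int ref_count_;
};
```

In this model, the error appears when two code paths each believe they own the last reference to the same buffer, for example a completion handler and a cleanup path both calling `release()`. Whether that is what happens in the zero-copy Post API path here is not established by the logs alone.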