
ChaNGa crashes/hangs with UCX machine layer (in SMP mode) on Frontera #2636

Open
nitbhat opened this issue Dec 4, 2019 · 45 comments

Labels: Bug (Something isn't working), UCX (The UCX machine layer)

nitbhat commented Dec 4, 2019

The 64-node run with the h148 cosmo dataset caused assertion failures in different places. ChaNGa_6.10_debug_64nodes_h148_run.txt is the full run output.

The most common assertion failure was:

[3536] Stack Traceback:
[1574806880.828857] [c104-124:313563:0]          ib_md.c:478  UCX  ERROR ibv_reg_mr(address=0x2b4d043eaaf0, length=7472, access=0xf) failed: Cannot allocate memory
[1574806880.829289] [c104-124:313563:0]         ucp_mm.c:111  UCX  ERROR failed to register address 0x2b4d043eaaf0 length 7472 on md[4]=ib/mlx5_0: Input/output error
[1574806880.829292] [c104-124:313563:0]    ucp_request.c:264  UCX  ERROR failed to register user buffer datatype 0x8 address 0x2b4d043eaaf0 len 7472: Input/output error
------------- Processor 3535 Exiting: Called CmiAbort ------------
Reason: [3535] Assertion "!(((uintptr_t)(status_ptr)) >= ((uintptr_t)UCS_ERR_LAST))" failed in file machine.C line 570.

Others were:

Reason: Converse zero handler executed-- was a message corrupted?

[c104-122:253786:0:253845] ib_mlx5_log.c:139  Remote access on mlx5_0:1/IB (synd 0x13 vend 0x88 hw_synd 0/0)
[c104-122:253786:0:253845] ib_mlx5_log.c:139  DCI QP 0x29e38 wqe[347]: RDMA_READ s-- [rqpn 0x2944c rlid 7737] [rva 0x2b82a5ee8ab0 rkey 0x744099] [va 0x2acf49804ee0 len 3872 lkey 0x14f64ed]
[715] Stack Traceback:
  [715:0] ChaNGa.smp 0x9ee9aa CmiAbortHelper(char const*, char const*, char const*, int, int)
  [715:1] ChaNGa.smp 0x9eea82 CmiGetNonLocal
  [715:2] ChaNGa.smp 0x9f6202
  [715:3] ChaNGa.smp 0x9f6c26 CmiHandleMessage
  [715:4] ChaNGa.smp 0x9f6feb CsdScheduleForever
  [715:5] ChaNGa.smp 0x9f6f31 CsdScheduler
  [715:6] ChaNGa.smp 0x9ee6f2
  [715:7] ChaNGa.smp 0x9ee2e5 ConverseInit
  [715:8] ChaNGa.smp 0x8abbaa charm_main
  [715:9] ChaNGa.smp 0x8a24b4 main
  [715:10] libc.so.6 0x2acd899df495 __libc_start_main
  [715:11] ChaNGa.smp 0x6c3d8f

------------- Processor 3562 Exiting: Called CmiAbort ------------
Reason: [3562] Assertion "status == UCS_OK" failed in file machine.C line 474.

[3562] Stack Traceback:
  [3562:0] ChaNGa.smp 0x9ee9aa CmiAbortHelper(char const*, char const*, char const*, int, int)
  [3562:1] ChaNGa.smp 0x9eea82 CmiGetNonLocal
  [3562:2] ChaNGa.smp 0x9fc899 CmiCopyMsg
  [3562:3] ChaNGa.smp 0x9f3585 UcxTxReqCompleted(void*, ucs_status_t)
  [3562:4] libucp.so.0 0x2b00f1674f5f ucp_proto_am_zcopy_req_complete
  [3562:5] libuct_ib.so.0 0x2b00f3c218bf uct_rc_txqp_purge_outstanding
  [3562:6] libuct_ib.so.0 0x2b00f3c3c402
  [3562:7] libuct_ib.so.0 0x2b00f3c3bc25
  [3562:8] libucp.so.0 0x2b00f1674122 ucp_worker_progress
  [3562:9] ChaNGa.smp 0x9f37b6 LrtsAdvanceCommunication(int)

@brminich mentioned that the assertion failure (Assertion "!(((uintptr_t)(status_ptr)) >= ((uintptr_t)UCS_ERR_LAST))" failed in file machine.C line 570) happens because of a memory allocation failure.
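For context, that expression is UCX's UCS_PTR_IS_ERR() macro, expanded, applied to the ucs_status_ptr_t returned by a non-blocking send, so the abort fires whenever the send call itself returns an encoded error such as the failed registration above. A minimal sketch of the pattern (a hypothetical helper, not the actual machine.C code):

#include <stdio.h>
#include <ucp/api/ucp.h>

/* Hypothetical send helper showing the check the assertion expands to.
 * ucp_tag_send_nb() returns NULL (completed inline), a request pointer,
 * or an error encoded in the pointer value; UCS_PTR_IS_ERR() is exactly
 * ((uintptr_t)ptr >= (uintptr_t)UCS_ERR_LAST). */
static void *send_msg(ucp_ep_h ep, void *msg, size_t size,
                      ucp_send_callback_t cb)
{
    ucs_status_ptr_t status_ptr =
        ucp_tag_send_nb(ep, msg, size, ucp_dt_make_contig(1),
                        /*tag=*/0, cb);

    if (UCS_PTR_IS_ERR(status_ptr)) {
        /* A failed ibv_reg_mr() surfaces here; the logs above show
         * "Input/output error", i.e. UCS_ERR_IO_ERROR. */
        fprintf(stderr, "send failed: %s\n",
                ucs_status_string(UCS_PTR_STATUS(status_ptr)));
        return NULL;
    }
    return status_ptr;   /* NULL means the send completed immediately */
}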

@nitbhat nitbhat added this to the 6.10.0 milestone Dec 4, 2019
@nitbhat
Copy link
Author

nitbhat commented Dec 4, 2019

With just export UCX_ZCOPY_THRESH=-1, I saw a hang during the initial domain decomposition:

Charm++> Running in SMP mode: 64 processes, 55 worker threads (PEs) + 1 comm threads per process, 3520 PEs total
Charm++> The comm. thread both sends and receives messages
Converse/Charm++ Commit ID: v6.10.0-rc2-20-g55984a468
Warning> Randomization of virtual memory (ASLR) is turned on in the kernel, thread migration may not work! Run 'echo 0 > /proc/sys/kernel/randomize_va_space' as root to disable it, or try running with '+isomalloc_sync'.
CharmLB> Load balancer assumes all CPUs are same.
Charm++> cpu affinity enabled.
Charm++> cpuaffinity PE-core map : 1-55
Charm++> Running on 64 hosts (2 sockets x 28 cores x 1 PUs = 56-way SMP)
Charm++> cpu topology info is gathered in 0.079 seconds.
[0] MultistepLB_notopo created
WARNING: bKDK parameter ignored; KDK is always used.
WARNING: bStandard parameter ignored; Output is always standard.
Not Using CkLoop 0
WARNING: bCannonical parameter ignored; integration is always cannonical
WARNING: bOverwrite parameter ignored.
WARNING: bGeometric parameter ignored.
WARNING: star formation set without enabling SPH
Enabling SPH
ChaNGa version 3.4, commit v3.4-10-gf74cbe8
Running on 3520 processors/ 64 nodes with 262144 TreePieces
yieldPeriod set to 5
Prefetching...ON
Number of chunks for remote tree walk set to 1
Chunk Randomization...ON
cache 1
cacheLineDepth 4
Verbosity level 1
Domain decomposition...SFC Peano-Hilbert
CkLoopLib is used in SMP with simple dynamic scheduling (converse-level notification)
Created 262144 pieces of tree
Loading particles ... trying Tipsy ... took 11.011082 seconds.
N: 546538574
Input file, Time:0.286600 Redshift:0.140216 Expansion factor:0.877027
Simulation to Time:0.287039 Redshift:0.138666 Expansion factor:0.878220
Reading coolontime
Restarting Gas Simulation with array files.
dDeltaStarForm (set): 2.39444e-05, effectively: 8.06294e-07 = 33673.5 yrs, iStarFormRung: 0
SNII feedback: 1.49313e+49 ergs/solar mass
dDeltaSink (set): 0, effectively: 8.06294e-07 = 33673.5 yrs, iSinkRung: 0
Identified 15 sink particles
Initial Domain decomposition ... Sorter: Histograms balanced after 25 iterations

With both export UCX_ZCOPY_THRESH=-1 and +ucx_rndv_thresh=2048, ChaNGa ran 2 big steps (3553 and 3554), but didn't complete within 30 minutes and was hence killed by the scheduler.

@trquinn was seeing another hang during the load balancing phase but mentioned that setting UCX_IB_RX_MAX_BUFS=32768 helped in getting past that hang on 256 nodes.

brminich commented Dec 4, 2019

@nitbhat, can you please try UCX_IB_RX_MAX_BUFS=32768 without any other settings?
If it does not help, a smaller value is worth trying (say, 8192).

@nitbhat nitbhat added Bug Something isn't working UCX The UCX machine layer labels Dec 4, 2019
nitbhat commented Dec 4, 2019

@nitbhat, can you please try UCX_IB_RX_MAX_BUFS=32768 without any other settings?
If it does not help, a smaller value is worth trying (say, 8192).

Okay, I'll try that.

nitbhat commented Dec 5, 2019

@brminich: I haven't been able to test that setting yet. (Frontera was down for maintenance on Tuesday, and now, for some reason, I'm getting weird errors while launching the MPI job; I'm in contact with TACC about it.)

I'll test it as soon as I can.

nitbhat commented Dec 5, 2019

@brminich: I tried different values for UCX_IB_RX_MAX_BUFS from 32k to 2k, and I got the same error.

For the case when I set UCX_IB_RX_MAX_BUFS to 2048, I saw this warning: [1575585217.410997] [c101-081:277019:0] uct_iface.c:139 UCX WARN Memory pool rc_recv_desc is empty

trquinn commented May 17, 2020

Following the suggestion in #2635, I tried running ChaNGa with the master branch of ucx. With dwf1b running on 2 nodes/4 processes, I get the failure:

[1589667781.056885] [c161-001:28740:0]          ib_md.c:329  UCX  ERROR ibv_exp_reg_mr(address=0x2ab3f788a0a0, length=18960, access=0xf) failed: Cannot allocate memory
[1589667781.056924] [c161-001:28740:0]         ucp_mm.c:131  UCX  ERROR failed to register address 0x2ab3f788a0a0 mem_type bit 0x1 length 18948 on md[5]=mlx5_0: Input/output error (md reg_mem_types 0x15)
[1589667781.056927] [c161-001:28740:0]    ucp_request.c:275  UCX  ERROR failed to register user buffer datatype 0x8 address 0x2ab3f788a0a0 len 18948: Input/output error
[c161-001:28740:0:28805]        rndv.c:457  Assertion `status == UCS_OK' failed

/home1/00333/tg456090/src/ucx/src/ucp/tag/rndv.c: [ ucp_rndv_progress_rma_get_zcopy() ]

This run works with ucx releases 1.6.1, 1.7, and 1.8.0. Bisection says the failure starts at ucx git hash 35f6d1189c410aa06a3c8f5fb18805527da91cf7 (although that commit fails with an earlier seg fault; the change to the registered-memory failure happens at 896d76b8762bc5d54f8f74fbc805a25ed404d055).
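For reference, the bisection can be driven with the standard git bisect flow, rebuilding with the script below at each step (v1.8.0 taken as the last known-good point):

cd ucx
git bisect start
git bisect bad master       # fails as above
git bisect good v1.8.0      # last release that works
# at each step: rebuild ucx + charm + ChaNGa with the script below,
# run the dwf1b benchmark, then mark the commit:
git bisect good             # or: git bisect bad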
My build script (starting in the ucx directory) is:

./autogen.sh
./contrib/configure-release-mt --prefix=$HOME/ucx/build_master
make clean
make -j16 install
cd ../charm
rm -rf ucx-linux-x86_64-smp
./build ChaNGa ucx-linux-x86_64 smp --with-production --basedir=$HOME/ucx/build_master -j16
cd ../changa
make clean
make -j 16

yosefe commented May 18, 2020

@trquinn are there any errors in dmesg on the machine which failed to register memory?

nitbhat commented Jun 30, 2020

@trquinn: How can I get access to the dwf1b benchmark? Is it the same as dwf1.6144 as listed in https://github.com/N-BodyShop/changa/wiki/ChaNGa-Benchmarks?

nitbhat commented Jun 30, 2020

I tried the h148.cosmo50PLK.6144g3HbwK1BH.param benchmark on 64 nodes with 2 processes/node, with ChaNGa built on a Charm++ that was itself built against ucx master, and I see the same crash as the original one reported back in 2019.


dDeltaStarForm (set): 2.39444e-05, effectively: 8.06294e-07 = 33673.5 yrs, iStarFormRung: 0
SNII feedback: 1.49313e+49 ergs/solar mass
dDeltaSink (set): 0, effectively: 8.06294e-07 = 33673.5 yrs, iSinkRung: 0
Identified 15 sink particles
Initial Domain decomposition ... Sorter: Histograms balanced after 25 iterations.
[1593463314.947122] [c186-092:226510:0]          ib_md.c:329  UCX  ERROR ibv_exp_reg_mr(address=0x2ac99492d8c0, length=2096, access=0xf) failed: Cannot allocate memory
[1593463314.947795] [c186-092:226510:0]         ucp_mm.c:131  UCX  ERROR failed to register address 0x2ac99492d8c0 mem_type bit 0x1 length 2096 on md[5]=mlx5_0: Input/output error (md reg_mem_types 0x15)
[1593463314.947799] [c186-092:226510:0]    ucp_request.c:275  UCX  ERROR failed to register user buffer datatype 0x8 address 0x2ac99492d8c0 len 2096: Input/output error
------------- Processor 3580 Exiting: Called CmiAbort ------------
Reason: [3580] Assertion "!(((uintptr_t)(status_ptr)) >= ((uintptr_t)UCS_ERR_LAST))" failed in file machine.C line 583.

[3580] Stack Traceback:
  [3580:0] ChaNGa.smp 0x9ff860 CmiAbortHelper(char const*, char const*, char const*, int, int)
  [3580:1] ChaNGa.smp 0x9ff938 CmiGetNonLocal
  [3580:2] ChaNGa.smp 0xa0d779 CmiCopyMsg
  [3580:3] ChaNGa.smp 0xa045bb
  [3580:4] ChaNGa.smp 0xa046cb LrtsAdvanceCommunication(int)
  [3580:5] ChaNGa.smp 0x9ff5e0
  [3580:6] ChaNGa.smp 0x9ff5f8
  [3580:7] ChaNGa.smp 0x9ff663 CommunicationServerThread(int)
  [3580:8] ChaNGa.smp 0x9ff56c
  [3580:9] ChaNGa.smp 0x9fc39c
  [3580:10] libpthread.so.0 0x2ac853994dd5
  [3580:11] libc.so.6 0x2ac854b9002d clone

@trquinn: Have you tried running ChaNGa based on ucx master to reproduce the occasional hangs that you saw during load balancing?

nitbhat commented Jun 30, 2020

I get past the registration error when I run the non-SMP version. However, the run crashes after step 3553.875.


Step: 3553.875000 Time: 0.286602 Rungs 3 to 4. Gravity Active: 65718, Gas Active: 65703
Domain decomposition ... total 0.49349 seconds.
Skipped DD
Load balancer ... Orb3dLB_notopo: Step 16
numActiveObjects: 12604, numInactiveObjects: 249540
active PROC range: 43 to 3563
[Orb3dLB_notopo] sorting
***************************
Orb3dLB_notopo stats: maxObjLoad 1.318802
Orb3dLB_notopo stats: minWall 0.000735 maxWall 12.320339 avgWall 11.987582 maxWall/avgWall 1.027758
Orb3dLB_notopo stats: minIdle 0.000001 maxIdle 11.885347 avgIdle 10.641879 minIdle/avgIdle 0.000000
Orb3dLB_notopo stats: minPred 0.000000 maxPred 1.318946 avgPred 0.115882 maxPred/avgPred 11.381776
Orb3dLB_notopo stats: minPiece 0.000000 maxPiece 594.000000 avgPiece 73.142857 maxPiece/avgPiece 8.121094
Orb3dLB_notopo stats: minBg 0.000734 maxBg 0.457804 avgBg 0.112647 maxBg/avgBg 4.064063
Orb3dLB_notopo stats: orb migrated 12597 refine migrated 0 objects
took 0.260109 seconds.
Building trees ... took 0.801881 seconds.
Calculating gravity (tree bucket, theta = 0.900000) ... Calculating densities/divv on Actives ... took 0.381756 seconds.
Marking Neighbors ... took 0.529504 seconds.
------------- Processor 257 Exiting: Called CmiAbort ------------
Reason: [257] Assertion "status == UCS_OK" failed in file machine.C line 487.

[257] Stack Traceback:
  [257:0] ChaNGa.smp 0x9e5600 CmiAbortHelper(char const*, char const*, char const*, int, int)
  [257:1] ChaNGa.smp 0x9e56d8 CmiGetNonLocal
  [257:2] ChaNGa.smp 0x9efc2c CmiCopyMsg
  [257:3] ChaNGa.smp 0x9ea0b0 UcxTxReqCompleted(void*, ucs_status_t)
  [257:4] libucp.so.0 0x2afea291bf82 ucp_proto_am_zcopy_req_complete
  [257:5] libuct_ib.so.0 0x2afea2a3bc36 uct_rc_txqp_purge_outstanding
  [257:6] libuct_ib.so.0 0x2afea2a5812d uct_dc_mlx5_ep_handle_failure
  [257:7] libuct_ib.so.0 0x2afea2a59162
  [257:8] libucp.so.0 0x2afea291a63a ucp_worker_progress
  [257:9] ChaNGa.smp 0x9ea18e LrtsAdvanceCommunication(int)
  [257:10] ChaNGa.smp 0x9e5441
  [257:11] ChaNGa.smp 0x9e5811
  [257:12] ChaNGa.smp 0x9f0a0a
  [257:13] ChaNGa.smp 0x9f174a CcdRaiseCondition
  [257:14] ChaNGa.smp 0x9ec55a CsdStillIdle
  [257:15] ChaNGa.smp 0x9ec8d9 CsdScheduleForever
  [257:16] ChaNGa.smp 0x9ec7e4 CsdScheduler
  [257:17] ChaNGa.smp 0x9e5409
  [257:18] ChaNGa.smp 0x9e5323 ConverseInit
  [257:19] ChaNGa.smp 0x89edd6 charm_main
  [257:20] ChaNGa.smp 0x897804 main
  [257:21] libc.so.6 0x2afea3e4b495 __libc_start_main
  [257:22] ChaNGa.smp 0x6c19af

The assertion failure happens in the send-completion callback; the failure indicates that one of the sends didn't complete successfully.

483 void UcxTxReqCompleted(void *request, ucs_status_t status)
484 {
485     UcxRequest *req = (UcxRequest*)request;
486
487     CmiEnforce(status == UCS_OK);
488     CmiEnforce(req->msgBuf);
489
490     UCX_LOG(3, "TX req %p completed, free msg %p", req, req->msgBuf);
491     CmiFree(req->msgBuf);
492     UCX_REQUEST_FREE(req);
493 }

I'm guessing that the status value can be queried to understand more about why the send failed.
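For example, a minimal sketch of reporting the status before aborting (not an actual machine.C change; ucs_status_string() is UCX's standard helper for turning a ucs_status_t into readable text, and CmiPrintf()/CmiAbort() are the usual Converse routines):

void UcxTxReqCompleted(void *request, ucs_status_t status)
{
    UcxRequest *req = (UcxRequest*)request;

    if (status != UCS_OK) {
        /* e.g. "Input/output error" or "Endpoint timeout" */
        CmiPrintf("[%d] UCX send failed: %s (%d)\n", CmiMyPe(),
                  ucs_status_string(status), (int)status);
        CmiAbort("UCX send completion returned an error");
    }
    CmiEnforce(req->msgBuf);

    UCX_LOG(3, "TX req %p completed, free msg %p", req, req->msgBuf);
    CmiFree(req->msgBuf);
    UCX_REQUEST_FREE(req);
}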

trquinn commented Jun 30, 2020

@trquinn: How can I get access to the dwf1b benchmark? Is it the same as dwf1.6144 as listed in https://github.com/N-BodyShop/changa/wiki/ChaNGa-Benchmarks?

Correct: that benchmark can be downloaded from Google Drive.

trquinn commented Jun 30, 2020

@trquinn: Have you tried running ChaNGa based on ucx master to reproduce the occasional hangs that you saw during load balancing?

Yes, and I got errors similar to yours.

nitbhat commented Jul 1, 2020

@trquinn I was able to reproduce the memory registration error that you were seeing on 2 nodes/4 processes while running the dwf1b benchmark.

[Orb3dLB_notopo] sorting
***************************
Orb3dLB_notopo stats: maxObjLoad 0.000342
Orb3dLB_notopo stats: minWall 0.002322 maxWall 0.004742 avgWall 0.002872 maxWall/avgWall 1.650815
Orb3dLB_notopo stats: minIdle 0.000000 maxIdle 0.000440 avgIdle 0.000033 minIdle/avgIdle 0.000000
Orb3dLB_notopo stats: minPred 0.000737 maxPred 0.001203 avgPred 0.000920 maxPred/avgPred 1.308291
Orb3dLB_notopo stats: minPiece 7.000000 maxPiece 9.000000 avgPiece 8.000000 maxPiece/avgPiece 1.125000
Orb3dLB_notopo stats: minBg 0.001020 maxBg 0.004558 avgBg 0.001920 maxBg/avgBg 2.374152
Orb3dLB_notopo stats: orb migrated 862 refine migrated 0 objects
Building trees ... took 0.302196 seconds.
Calculating gravity (tree bucket, theta = 0.700000) ... [1593618510.434801] [c191-041:155490:0]          ib_md.c:329  UCX  ERROR ibv_exp_reg_mr(address=0x2ad52577de20, length=9536, access=0xf) failed: Cannot allocate memory
[1593618510.434888] [c191-041:155490:0]         ucp_mm.c:131  UCX  ERROR failed to register address 0x2ad52577de20 mem_type bit 0x1 length 9524 on md[5]=mlx5_0: Input/output error (md reg_mem_types 0x15)
[1593618510.434900] [c191-041:155490:0]    ucp_request.c:275  UCX  ERROR failed to register user buffer datatype 0x8 address 0x2ad52577de20 len 9524: Input/output error
[c191-041:155490:0:155554]        rndv.c:523  Assertion `status == UCS_OK' failed

/scratch1/03808/nbhat4/ucx/src/ucp/tag/rndv.c: [ ucp_rndv_progress_rma_get_zcopy() ]
      ...
      520             }
      521             return UCS_OK;
      522         } else if (!UCS_STATUS_IS_ERR(status)) {
==>   523             /* in case if not all chunks are transmitted - return in_progress
      524              * status */
      525             return UCS_INPROGRESS;
      526         } else {

==== backtrace (tid: 155554) ====
 0 0x0000000000052563 ucs_debug_print_backtrace()  /scratch1/03808/nbhat4/ucx/src/ucs/debug/debug.c:656
 1 0x0000000000035f81 ucp_rndv_progress_rma_get_zcopy()  /scratch1/03808/nbhat4/ucx/src/ucp/tag/rndv.c:523
 2 0x0000000000036420 ucp_request_try_send()  /scratch1/03808/nbhat4/ucx/src/ucp/core/ucp_request.inl:213
 3 0x0000000000036420 ucp_request_send()  /scratch1/03808/nbhat4/ucx/src/ucp/core/ucp_request.inl:248
 4 0x0000000000036420 ucp_rndv_req_send_rma_get()  /scratch1/03808/nbhat4/ucx/src/ucp/tag/rndv.c:652
 5 0x0000000000037d5e ucp_rndv_matched()  /scratch1/03808/nbhat4/ucx/src/ucp/tag/rndv.c:1131
 6 0x0000000000038155 ucp_rndv_process_rts()  /scratch1/03808/nbhat4/ucx/src/ucp/tag/rndv.c:1185
 7 0x0000000000038155 ucp_rndv_process_rts()  /scratch1/03808/nbhat4/ucx/src/ucp/tag/rndv.c:1189
 8 0x0000000000014715 uct_iface_invoke_am()  /scratch1/03808/nbhat4/ucx/src/uct/base/uct_iface.h:635
 9 0x0000000000014715 uct_mm_iface_process_recv()  /scratch1/03808/nbhat4/ucx/src/uct/sm/mm/base/mm_iface.c:232
10 0x0000000000014715 uct_mm_iface_poll_fifo()  /scratch1/03808/nbhat4/ucx/src/uct/sm/mm/base/mm_iface.c:280
11 0x0000000000014715 uct_mm_iface_progress()  /scratch1/03808/nbhat4/ucx/src/uct/sm/mm/base/mm_iface.c:333
12 0x000000000002663a ucs_callbackq_dispatch()  /scratch1/03808/nbhat4/ucx/src/ucs/datastruct/callbackq.h:211
13 0x000000000002663a uct_worker_progress()  /scratch1/03808/nbhat4/ucx/src/uct/api/uct.h:2342
14 0x000000000002663a ucp_worker_progress()  /scratch1/03808/nbhat4/ucx/src/ucp/core/ucp_worker.c:2037
15 0x00000000009e571e LrtsAdvanceCommunication()  ???:0
16 0x00000000009e0680 AdvanceCommunication()  machine.C:0
17 0x00000000009e0698 CommunicationServer()  machine.C:0
18 0x00000000009e0703 CommunicationServerThread()  ???:0
19 0x00000000009e060c ConverseRunPE()  machine.C:0
20 0x00000000009dd43c call_startfn()  machine.C:0
21 0x0000000000007dd5 start_thread()  pthread_create.c:0
22 0x00000000000fe02d __clone()  ???:0
=================================

@yosefe: Looking at the dmesg output, I don't see any errors related to memory registration. I'm attaching the dmesg output from both nodes after the crash occurred.

dmesg_output_1009912_c191-034.txt
dmesg_output_1009912_c191-041.txt

brminich commented Jul 2, 2020

@nitbhat, are you running on Frontera?
Can you please share the details of running the benchmark on 2 nodes? I'll try to reproduce locally, since I do not have access to Frontera.

nitbhat commented Jul 2, 2020

@brminich: Yes, I was running that on Frontera.

Sure.

  1. Build charm (ChaNGa target) using ./build ChaNGa ucx-linux-x86_64 smp --enable-error-checking --suffix=debug --basedir=<path-to-ucx> -j24 -g -O0
  2. Download ChaNGa from https://github.com/N-BodyShop/changa
  3. Download utility from https://github.com/N-BodyShop/utility and clone it into the ChaNGa repo's parent directory.
  4. Export CHARM_DIR to point to your charm build arch (export CHARM_DIR=/work/03808/nbhat4/frontera/charm_3/ucx-linux-x86_64-smp-debug)
  5. Inside the ChaNGa directory, run ./configure (make sure that the Charm path printed at the end of configure points to the correct charm directory).
  6. Build by running make -j<num procs>. This should create ChaNGa.smp.
  7. Get the benchmarking files (dwf1.6144.param and dwf1.6144.01472) from https://github.com/N-BodyShop/changa/wiki/ChaNGa-Benchmarks.
  8. On Frontera, the bug shows up when we run on 2 nodes (with 2 processes on each node). Each Frontera node has 56 cores; you can try something matching that configuration on your machine. The run script used on Frontera is as follows:
#!/bin/bash
#SBATCH -J changa_2nodes_4procs
#SBATCH -p normal
#SBATCH -t 00:30:00
#SBATCH -A ASC20007
#SBATCH -N 2
#SBATCH -n 4
#SBATCH --ntasks-per-node=2
#SBATCH -o /scratch1/03808/nbhat4/changa/results/changa-output-2nodes-ucx-prod-2procs_6.10.1-master-%A.out

cd /scratch1/03808/nbhat4/changa

ibrun ./ChaNGa.smp +ppn 27 +setcpuaffinity +commap 0,1 +pemap 2-54:2,3-55:2 dwf1.6144.param

Let me know if you have any questions. (and if you are/aren't able to reproduce the crash).

brminich commented Jul 7, 2020

@nitbhat, thanks for the instructions.
Is it 100% reproducible on Frontera?
How long does it typically take to fail? I managed to run it on a local system with 28 threads per node, but it seemed to run for quite a long time (until my reservation ended).
Is it reproducible in non-SMP mode?

nitbhat commented Jul 7, 2020

@brminich

Yes, it crashes every time I run on Frontera.
It takes about 14 mins to crash.

How many nodes did you run it on? 4 nodes?

When trying non-SMP, it seems like there is a memory issue, since I see this error:


Initial Domain decomposition ... total 0.685047 seconds.
Initial load balancing ... Orb3dLB_notopo: Step 0
numActiveObjects: 896, numInactiveObjects: 0
active PROC range: 0 to 111
Migrating all: numActiveObjects: 896, numInactiveObjects: 0
[Orb3dLB_notopo] sorting
***************************
Orb3dLB_notopo stats: maxObjLoad 0.000079
Orb3dLB_notopo stats: minWall 0.000279 maxWall 0.003832 avgWall 0.000472 maxWall/avgWall 8.112413
Orb3dLB_notopo stats: minIdle 0.000000 maxIdle 0.000242 avgIdle 0.000032 minIdle/avgIdle 0.000000
Orb3dLB_notopo stats: minPred 0.000017 maxPred 0.000128 avgPred 0.000051 maxPred/avgPred 2.523048
Orb3dLB_notopo stats: minPiece 7.000000 maxPiece 9.000000 avgPiece 8.000000 maxPiece/avgPiece 1.125000
Orb3dLB_notopo stats: minBg 0.000237 maxBg 0.003799 avgBg 0.000389 maxBg/avgBg 9.754210
Orb3dLB_notopo stats: orb migrated 893 refine migrated 0 objects
Building trees ... took 0.12406 seconds.
Calculating gravity (tree bucket, theta = 0.700000) ... ------------- Processor 70 Exiting: Called CmiAbort ------------
Reason: Unhandled C++ exception in user code.

------------- Processor 87 Exiting: Called CmiAbort ------------
Reason: Unhandled C++ exception in user code.

[87] Stack Traceback:
  [87:0] ChaNGa_ucx_nonsmp 0x7f64ae _Z14CmiAbortHelperPKcS0_S0_ii
  [87:1] ChaNGa_ucx_nonsmp 0x7f65c6
  [87:2] ChaNGa_ucx_nonsmp 0x701c67
  [87:3] libstdc++.so.6 0x2b39bdf7f106
  [87:4] libstdc++.so.6 0x2b39bdf7f151
  [87:5] libstdc++.so.6 0x2b39bdf7f385
  [87:6] libstdc++.so.6 0x2b39bdf73301
------------- Processor 78 Exiting: Called CmiAbort ------------
Reason: Could not malloc()--are we out of memory? (used: 2074.969MB)
[78] Stack Traceback:
  [78:0] ChaNGa_ucx_nonsmp 0x7f64ae CmiAbortHelper(char const*, char const*, char const*, int, int)
  [78:1] ChaNGa_ucx_nonsmp 0x7f65c6

brminich commented Jul 7, 2020

I was running on 2 nodes with 4 processes.
I'm moving to Thor; maybe I can catch it there.
Could it be that lack of memory is also an issue with SMP? Is there a memory consumption estimate for this example?

nitbhat commented Jul 8, 2020

Okay, I think you can run it on 4 nodes (with 28 cores each) to better match the 2 Frontera nodes (with 56 cores each).

Yes, in some runs I saw similar 'Could not malloc - are we out of memory' errors from an SMP 2-node run as well. In the non-SMP case, it's always that error; in the SMP runs, I sometimes see that error and other times the error related to memory registration.

However, running on 2 nodes with the MPI layer (both SMP and non-SMP) doesn't crash, though it takes longer to complete.

Interestingly, when I increase the number of nodes (to 4 and 8), I still see "out of memory" errors with UCX (and MPI runs successfully in those cases as well). I'll try to determine the exact memory usage for UCX runs on 2/4/8 nodes.

trquinn commented Jul 8, 2020

I checked on expected memory use: when running on a single SMP process, this benchmark uses 16.3GB.
Using netlrts with 4 SMP processes, the benchmark uses 5.3GB/process (i.e., ~22GB total).

brminich commented:

@nitbhat, maybe we can have a joint debug session on Frontera?

trquinn commented Jul 18, 2020

They just upgraded the OFED libraries (and the system-installed UCX) on Frontera. We should see if that makes a difference first.
I'm in meetings until 14:30 PDT all next week.

nitbhat commented Aug 6, 2020

@brminich Yes, let's schedule a debugging session sometime next week, if that works for you.

@trquinn Okay, I'll check whether that makes a difference.

brminich commented Aug 6, 2020

@nitbhat, next week is ok

trquinn commented Aug 22, 2020

Note that a similar issue is reported in the UCX repository:
openucx/ucx#5291

trquinn commented Aug 22, 2020

I've done a little more investigation on Frontera, using the master branches of ucx and charm (as of Aug. 18) and the dwf1b benchmark, running 8 processes on 4 nodes.
I instrumented uct_ib_reg_mr() to see how much memory was being registered, using the following patch:

diff --git a/src/uct/ib/base/ib_md.c b/src/uct/ib/base/ib_md.c
index 08443d0b0..592fe839f 100644
--- a/src/uct/ib/base/ib_md.c
+++ b/src/uct/ib/base/ib_md.c
@@ -516,6 +516,9 @@ static ucs_status_t uct_ib_md_reg_mr(uct_ib_md_t *md, void *address,
                             silent);
 }
 
+static size_t ib_total_reg = 0;
+static size_t ib_total_reg_segs = 0;
+
 ucs_status_t uct_ib_reg_mr(struct ibv_pd *pd, void *addr, size_t length,
                            uint64_t access_flags, struct ibv_mr **mr_p,
                            int silent)
@@ -532,7 +535,11 @@ ucs_status_t uct_ib_reg_mr(struct ibv_pd *pd, void *addr, size_t length,
 #else
     mr = UCS_PROFILE_CALL(ibv_reg_mr, pd, addr, length, access_flags);
 #endif
+    ib_total_reg += length;
+    ib_total_reg_segs++;
+    fprintf(stderr, "ibv_reg %ld in %ld\n", ib_total_reg, ib_total_reg_segs);
     if (mr == NULL) {
+       fprintf(stderr, "ibv_reg failed errno %d\n", errno);
         uct_ib_md_print_mem_reg_err_msg(addr, length, access_flags,
                                         errno, silent);
         return UCS_ERR_IO_ERROR;
@@ -550,6 +557,9 @@ ucs_status_t uct_ib_dereg_mr(struct ibv_mr *mr)
         return UCS_OK;
     }
 
+    ib_total_reg -= mr->length;
+    ib_total_reg_segs--;
+    fprintf(stderr, "ibv_dereg %ld in %ld\n", ib_total_reg, ib_total_reg_segs);
     ret = UCS_PROFILE_CALL(ibv_dereg_mr, mr);
     if (ret != 0) {
         ucs_error("ibv_dereg_mr() failed: %m");

and the output at the time of crash typically looks like this:

ibv_reg 4850063904 in 41809
ibv_reg 3614325824 in 43902
ibv_reg 4810457792 in 56123
ibv_reg 4850074352 in 41810
ibv_reg 4059565248 in 43322
ibv_dereg 5032840000 in 45050
ibv_reg 4810486480 in 56124
ibv_reg 4850089664 in 41811
ibv_reg 5032978160 in 45051
ibv_reg failed errno 12

So: ucx is registering a large number of memory segments. The actual amount of memory is large but (I think) not excessive: the total memory used by each process is about 32GB (again, 2 procs/node). But the total number of memory segments seems very large: each node has on the order of 100,000 memory segments registered with the IB interface. I'm wondering if there is some fragmentation in the ucx memory pool.

Trying to reduce the number of receive buffers with export UCX_IB_RX_MAX_BUFS=25000
causes a hang with:
uct_iface.c:152 UCX WARN Memory pool rc_recv_desc is empty

trquinn commented Oct 21, 2020

Any chance this will be fixed in 6.11?

evan-charmworks commented:

@trquinn Does the issue still occur with UCX 1.9.0?

trquinn commented Oct 24, 2020

I just tried with UCX v1.9.0 and Charm v6.11.0-beta. The issue still occurs.

nitbhat commented Oct 29, 2020

@brminich: Do you have any insights as to what might be happening here? (Or the linked issue on the UCX repo openucx/ucx#5291)

brminich commented Nov 3, 2020

@nitbhat, no, the internal issue is not fixed yet, but according to the symptoms the receiver is overwhelmed by a huge number of sends.
@trquinn, is it possible to set sudo sysctl -w vm.max_map_count=95530 on your cluster and try to reproduce the issue?

trquinn commented Nov 8, 2020

I doubt I have su privileges on Frontera; the current setting is vm.max_map_count = 65530.
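Without root, the limit is at least readable from /proc/sys/vm/max_map_count, and a process's live mapping count can be watched against it (registering many small buffers can split VMAs, so the number of lines in /proc/<pid>/maps grows). A small hypothetical helper:

#include <stdio.h>

/* Count this process's memory mappings: /proc/self/maps has one
 * mapping per line, to be compared against vm.max_map_count
 * (65530 here). */
static long count_memory_maps(void)
{
    FILE *f = fopen("/proc/self/maps", "r");
    long n = 0;
    int c;

    if (f == NULL)
        return -1;
    while ((c = fgetc(f)) != EOF)
        if (c == '\n')
            n++;
    fclose(f);
    return n;
}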

brminich commented:

@trquinn, is it possible to try UCX master and Charm from this branch: https://github.com/brminich/charm/tree/topic/ucx_van_using_am ?
In this branch I updated the UCX machine layer to use the AM API instead of TAGs. This API better matches charm++ and has better flow-control capabilities (which will hopefully help avoid flooding the receiver). You need to use the latest UCX from master, because UCX 1.9 does not have full AM support (it will be included in the next release).
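For reference, a minimal sketch of the AM style versus the tag style (not the branch's actual code; MY_AM_ID and the handler names are made up): with tags, every incoming message must be matched against a posted receive, while with AM the receiver registers one handler per message id and UCX invokes it on arrival.

#include <ucp/api/ucp.h>

#define MY_AM_ID 7   /* hypothetical AM id for Converse messages */

/* Receive side: one handler registered up front, no pre-posted
 * receives. Returning UCS_OK lets UCX release 'data' on return. */
static ucs_status_t on_am_recv(void *arg, const void *header,
                               size_t header_length, void *data,
                               size_t length,
                               const ucp_am_recv_param_t *param)
{
    /* ... copy/enqueue the message for the Converse scheduler ... */
    return UCS_OK;
}

static void setup_am(ucp_worker_h worker)
{
    ucp_am_handler_param_t hp;

    hp.field_mask = UCP_AM_HANDLER_PARAM_FIELD_ID |
                    UCP_AM_HANDLER_PARAM_FIELD_CB;
    hp.id         = MY_AM_ID;
    hp.cb         = on_am_recv;
    ucp_worker_set_am_recv_handler(worker, &hp);
}

/* Send side: ucp_am_send_nbx() takes the place of ucp_tag_send_nb(). */
static ucs_status_ptr_t send_am(ucp_ep_h ep, void *msg, size_t len)
{
    ucp_request_param_t param;

    param.op_attr_mask = 0;   /* no completion callback; poll instead */
    return ucp_am_send_nbx(ep, MY_AM_ID, NULL, 0, msg, len, &param);
}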

trquinn commented Nov 21, 2020

I get compile errors when building that branch of charm. I'm using gcc 9.1.0. Here are the first few:

/home1/00333/tg456090/src/charm/src/arch/ucx/machine.C: In function ‘int ProcessTxQueue()’:
/home1/00333/tg456090/src/charm/src/arch/ucx/machine.C:546:17: error: variable or field ‘req’ declared void
  546 |            void req = UcxSendAm(req->dNode, req->size, req->msgBuf, req->id,
      |                 ^~~
/home1/00333/tg456090/src/charm/src/arch/ucx/machine.C:546:61: error: invalid conversion from ‘void*’ to ‘char*’ [-fpermissive]
  546 |            void req = UcxSendAm(req->dNode, req->size, req->msgBuf, req->id,
      |                                                        ~~~~~^~~~~~
      |                                                             |
      |                                                             void*
/home1/00333/tg456090/src/charm/src/arch/ucx/machine.C:456:61: note:   initializing argument 3 of ‘void* UcxSendAm(int, int, char*, unsigned int, unsigned int, ucp_send_nbx_callback_t)’
  456 | static inline void* UcxSendAm(int destNode, int size, char *msg, unsigned amId,
      |                                                       ~~~~~~^~~
/home1/00333/tg456090/src/charm/src/arch/ucx/machine.C:556:15: error: expected ‘}’ before ‘else’
  556 |             } else {
      |               ^~~~
/home1/00333/tg456090/src/charm/src/arch/ucx/machine.C:545:36: note: to match this ‘{’
  545 |         if(req->op == UCX_SEND_OP) { // Regular Message
      |                                    ^
/home1/00333/tg456090/src/charm/src/arch/ucx/machine.C:557:31: error: ‘status_ptr’ was not declared in this scope
  557 |                 ((UcxRequest*)status_ptr)->msgBuf = req->msgBuf;
      |                               ^~~~~~~~~~
/home1/00333/tg456090/src/charm/src/arch/ucx/machine.C: At global scope:
/home1/00333/tg456090/src/charm/src/arch/ucx/machine.C:573:5: error: expected unqualified-id before ‘return’

brminich commented Nov 23, 2020

@trquinn, sorry, I forgot to test the SMP version. Can you please check now (I updated the same branch)? It can be used with UCX master only.

trquinn commented Nov 29, 2020

I ran some tests with this on Frontera. The good news is that my standard benchmark runs very well. It used to fail at around four nodes, and now scales up to 24 nodes.
However, if I run it with only two nodes, I get:

[1606624679.201120] [c136-063:91491:0]          ib_md.c:348  UCX  ERROR ibv_reg_mr(address=0x2b61d404bb40, length=9232, access=0xf) failed: Cannot allocate memory
[1606624679.201161] [c136-063:91491:0]         ucp_mm.c:137  UCX  ERROR failed to register address 0x2b61d404bb40 mem_type bit 0x1 length 9220 on md[5]=mlx5_0: Input/output error (md reg_mem_types 0x1)
[1606624679.201165] [c136-063:91491:0]    ucp_request.c:280  UCX  ERROR failed to register user buffer datatype 0x8 address 0x2b61d404bb40 len 9220: Input/output error
[c136-063:91491:0:91559]        rndv.c:495  Assertion `status == UCS_OK' failed

/home1/00333/tg456090/src/ucx/src/ucp/rndv/rndv.c: [ ucp_rndv_progress_rma_get_zcopy() ]
      ...
      492     if (!rndv_req->send.mdesc) {
      493         status = ucp_send_request_add_reg_lane(rndv_req, lane);
      494         ucs_assert_always(status == UCS_OK);
==>   495     }
      496 
      497     rsc_index = ucp_ep_get_rsc_index(ep, lane);
      498     attrs     = ucp_worker_iface_get_attr(ep->worker, rsc_index);

==== backtrace (tid:  91559) ====
 0 0x0000000000054833 ucs_debug_print_backtrace()  /home1/00333/tg456090/src/ucx/src/ucs/debug/debug.c:656
 1 0x0000000000040da8 ucp_rndv_progress_rma_get_zcopy()  /home1/00333/tg456090/src/ucx/src/ucp/rndv/rndv.c:495
 2 0x000000000004130a ucp_request_try_send()  /home1/00333/tg456090/src/ucx/src/ucp/core/ucp_request.inl:239
 3 0x000000000004130a ucp_request_send()  /home1/00333/tg456090/src/ucx/src/ucp/core/ucp_request.inl:264
 4 0x000000000004130a ucp_rndv_req_send_rma_get()  /home1/00333/tg456090/src/ucx/src/ucp/rndv/rndv.c:713
 5 0x0000000000043615 ucp_rndv_receive()  /home1/00333/tg456090/src/ucx/src/ucp/rndv/rndv.c:1226
 6 0x000000000001cdba ucp_am_recv_data_nbx()  /home1/00333/tg456090/src/ucx/src/ucp/core/ucp_am.c:1060
 7 0x0000000000840566 UcxAmRxDataCb()  machine.C:0
 8 0x000000000001e2a8 ucp_am_rndv_process_rts()  /home1/00333/tg456090/src/ucx/src/ucp/core/ucp_am.c:1430
 9 0x0000000000015a05 uct_iface_invoke_am()  /home1/00333/tg456090/src/ucx/src/uct/base/uct_iface.h:662
10 0x0000000000015a05 uct_mm_iface_process_recv()  /home1/00333/tg456090/src/ucx/src/uct/sm/mm/base/mm_iface.c:233
11 0x0000000000015a05 uct_mm_iface_poll_fifo()  /home1/00333/tg456090/src/ucx/src/uct/sm/mm/base/mm_iface.c:282
12 0x0000000000015a05 uct_mm_iface_progress()  /home1/00333/tg456090/src/ucx/src/uct/sm/mm/base/mm_iface.c:335
13 0x000000000002eb92 ucs_callbackq_dispatch()  /home1/00333/tg456090/src/ucx/src/ucs/datastruct/callbackq.h:211
14 0x000000000002eb92 uct_worker_progress()  /home1/00333/tg456090/src/ucx/src/uct/api/uct.h:2429
15 0x000000000002eb92 ucp_worker_progress()  /home1/00333/tg456090/src/ucx/src/ucp/core/ucp_worker.c:2408
16 0x0000000000840ff3 LrtsAdvanceCommunication()  ???:0

brminich commented:

@trquinn, thanks for the info! I'll debug the issue with 2 nodes.
@nitbhat, what is the best way to run the charm tests? Currently I just run a couple of tests manually (mega pingpong, zero copy, etc.). Is there any automation that runs a bunch of tests on various configurations?

evan-charmworks commented:

what is the best way to run charm tests? Currently I just run a couple of tests manually (mega pingpong, zero copy, etc). Is there any automation which runs a bunch of tests on various configurations?

The various .yml files in the repository are examples of how we run our test suite as part of continuous integration. For example:

UCX non-SMP: ./build all-test ucx-linux-x86_64 ompipmix -g3 -j4 --with-production --enable-error-checking && make -C ucx-linux-x86_64-ompipmix/tmp test && make -C ucx-linux-x86_64-ompipmix/tmp testp P=2
UCX SMP: ./build all-test ucx-linux-x86_64 smp ompipmix -g3 -j4 --with-production --enable-error-checking && make -C ucx-linux-x86_64-smp-ompipmix/tmp test TESTOPTS="+setcpuaffinity +CmiSleepOnIdle" && make -C ucx-linux-x86_64-smp-ompipmix/tmp testp P=4 TESTOPTS="+setcpuaffinity +CmiSleepOnIdle ++ppn 2"

@ericjbohm ericjbohm modified the milestones: 7.0, 7.1 Jul 1, 2021
evan-charmworks commented:

Has this been fixed?

trquinn commented Mar 22, 2022

No. This problem has become more widespread since OpenMPI version 4 is now built on top of UCX. A new example is from SDSC Expanse: building charm with
./build ChaNGa mpi-linux-x86_64 smp -j8 --with-production -march=znver2
and running ChaNGa with
srun --mpi=pmi2 --cpus-per-task=64 -n 16 ./ChaNGa.smp ++ppn 63 +setcpuaffinity +commap 0,64 +pemap 1-63,65-127 -p 32768 -v 1 +useckloop +balancer MultistepLB_notopo new-cal-IV-f.param
gives:
ib_md.c:325 UCX ERROR ibv_reg_mr(address=0x1554cb0e1260, length=8080, access=0xf) failed: Cannot allocate memory

nilsdeppe commented:

We are seeing the same thing with SpECTRE; we've so far only tested on Frontera. Specifically, we also get UCX ERROR ibv_reg_mr(address=0x2b280d07a9f0, length=12816, access=0xf) failed: Cannot allocate memory. This seems to take ~1 hour on 4 nodes on Frontera to show up for us.

We're testing OpenMPI 4.1.2 on our own cluster using UCX 1.12.1; I'll report back if really long runs work fine. Frontera uses UCX 1.11.0, for what it's worth.

rbuch commented Apr 14, 2022

Slightly off topic, but if you all are using the MPI layer while the issues with UCX are being diagnosed, it might be worth trying to use the preposted MPI receives option. You can enable it via compiling with the -DMPI_POST_RECV=1 flag while using the mpi-post-recv-define branch from PR #3596. At runtime, you can specify the size bounds for the preposted receives by using the +postRecvLowerSize <lower size> and +postRecvUpperSize <upper size> flags. In our own internal testing, we've seen that this can significantly improve performance for certain applications and networks.
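For example, a hypothetical invocation (the size bounds below are illustrative values, not recommendations; the build line assumes the mpi-post-recv-define branch is checked out):

./build ChaNGa mpi-linux-x86_64 smp -j8 --with-production -DMPI_POST_RECV=1
./ChaNGa.smp +postRecvLowerSize 2048 +postRecvUpperSize 16384 dwf1.6144.param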

If you do try this out, we'd appreciate it if you'd share some data on how it affected performance for you.

jszaday commented Apr 20, 2022

I ran some tests with this on Frontera. The good news is that my standard benchmark runs very well. It used to fail at around four nodes, and now scales up to 24 nodes. However, if I run it with only two nodes, I get: [the ibv_reg_mr "Cannot allocate memory" failure and rndv.c backtrace quoted in full above]

I am testing a rebased version of the UCX AM branch from @brminich, and I was able to reproduce this crash on Bridges2. It occurs shortly after ChaNGa's startup with small node-count runs. I will start testing other configurations to see whether I can make any headway 👍

[jszaday@r179 runs]$ srun --mpi=pmi2 ./ChaNGa.smp ++ppn 63 +setcpuaffinity +commap 0,64 +pemap 1-63,65-127 -p 32768 -v 1 +useckloop +balancer MultistepLB_notopo ./dwf1.6144.param 
Charm++> Running in SMP mode: 2 processes, 63 worker threads (PEs) + 1 comm threads per process, 126 PEs total
Charm++> The comm. thread both sends and receives messages
Converse/Charm++ Commit ID: v7.1.0-devel-154-g35b71e259
Isomalloc> Synchronized global address space.
CharmLB> Load balancer assumes all CPUs are same.
Charm++> cpu affinity enabled. 
Charm++> cpuaffinity PE-core map (OS indices): 1-63,65-127
Charm++> Running on 1 hosts (2 sockets x 64 cores x 1 PUs = 128-way SMP)
Charm++> cpu topology info is gathered in 0.008 seconds.
WARNING: +useckloop is a command line argument beginning with a '+' but was not parsed by the RTS.
If any of the above arguments were intended for the RTS you may need to recompile Charm++ with different options.
CharmLB> Load balancing instrumentation for communication is off.
[0] MultistepLB_notopo created
WARNING: bStandard parameter ignored; Output is always standard.
Using CkLoop 1
ChaNGa version 3.4, commit v3.4-167-g29e0a16
Running on 126 processors/ 2 nodes with 32768 TreePieces
yieldPeriod set to 5
Prefetching...ON
Number of chunks for remote tree walk set to 1
Chunk Randomization...ON
cache 1
cacheLineDepth 4
Verbosity level 1
Domain decomposition...SFC Peano-Hilbert
CkLoopLib is used in SMP with simple dynamic scheduling (converse-level notification)
Created 32768 pieces of tree
Loading particles ... trying Tipsy ... took 3.751485 seconds.
N: 51794908
Input file, Time:0.239443 Redshift:0.341010 Expansion factor:0.745706
Simulation to Time:0.239543 Redshift:0.340547 Expansion factor:0.745964
WARNING: Could not open redshift input file: dwf1.6144.bench.red
Initial Domain decomposition ... Sorter: Histograms balanced after 19 iterations.
[1650475974.560995] [r179:59318:0]           ib_md.c:349  UCX  ERROR ibv_reg_mr(address=0x1535f68bc330, length=10368, access=0xf) failed: Cannot allocate memory
[1650475974.561048] [r179:59318:0]          ucp_mm.c:153  UCX  ERROR failed to register address 0x1535f68bc330 mem_type bit 0x1 length 10368 on md[4]=mlx5_0: Input/output error (md reg_mem_types 0x1)
[1650475974.561051] [r179:59318:0]     ucp_request.c:501  UCX  ERROR failed to register user buffer datatype 0x8 address 0x1535f68bc330 len 10368: Input/output error
[r179:59318:0:59439]        rndv.c:454  Assertion `status == UCS_OK' failed
==== backtrace (tid:  59439) ====
 0 0x000000000005b858 ucp_rndv_progress_rma_zcopy_common()  /jet/home/wozniak/ucx/src/ucp/rndv/rndv.c:454
 1 0x000000000005b858 ucp_rndv_progress_rma_zcopy_common()  /jet/home/wozniak/ucx/src/ucp/rndv/rndv.c:513
 2 0x000000000005c255 ucp_request_try_send()  /jet/home/wozniak/ucx/src/ucp/core/ucp_request.inl:327
 3 0x000000000005c255 ucp_request_send()  /jet/home/wozniak/ucx/src/ucp/core/ucp_request.inl:350
 4 0x000000000005c255 ucp_rndv_req_send_rma_get()  /jet/home/wozniak/ucx/src/ucp/rndv/rndv.c:790
 5 0x000000000005dcc1 ucp_rndv_receive()  /jet/home/wozniak/ucx/src/ucp/rndv/rndv.c:1386
 6 0x0000000000022821 ucp_am_recv_data_nbx()  /jet/home/wozniak/ucx/src/ucp/core/ucp_am.c:1116
 7 0x000000000086b620 UcxAmRxDataCb()  machine.C:0
 8 0x00000000000244bd ucp_am_rndv_process_rts()  /jet/home/wozniak/ucx/src/ucp/core/ucp_am.c:1630
 9 0x00000000000244bd ucp_am_rndv_process_rts()  /jet/home/wozniak/ucx/src/ucp/core/ucp_am.c:1632
10 0x0000000000016595 uct_iface_invoke_am()  /jet/home/wozniak/ucx/src/uct/base/uct_iface.h:774
11 0x0000000000016595 uct_mm_iface_process_recv()  /jet/home/wozniak/ucx/src/uct/sm/mm/base/mm_iface.c:251
12 0x0000000000016595 uct_mm_iface_poll_fifo()  /jet/home/wozniak/ucx/src/uct/sm/mm/base/mm_iface.c:299
13 0x0000000000016595 uct_mm_iface_progress()  /jet/home/wozniak/ucx/src/uct/sm/mm/base/mm_iface.c:352
14 0x0000000000039dfa ucs_callbackq_dispatch()  /jet/home/wozniak/ucx/src/ucs/datastruct/callbackq.h:211
15 0x0000000000039dfa uct_worker_progress()  /jet/home/wozniak/ucx/src/uct/api/uct.h:2591
16 0x0000000000039dfa ucp_worker_progress()  /jet/home/wozniak/ucx/src/ucp/core/ucp_worker.c:2574
17 0x000000000086b997 LrtsAdvanceCommunication()  ???:0
18 0x0000000000866bb6 CommunicationServerThread()  ???:0
19 0x0000000000866b15 ConverseRunPE()  machine.C:0
20 0x000000000086c457 call_startfn()  machine.C:0
21 0x00000000000082de start_thread()  pthread_create.c:0
22 0x00000000000fbe83 __GI___clone()  :0
=================================
srun: error: r179: task 0: Aborted (core dumped)

brminich commented:

@jszaday, is it possible to try increasing the map count with sudo sysctl -w vm.max_map_count=95530 (this needs to be done on all nodes)?

jszaday commented Apr 22, 2022

vm.max_map_count

Unfortunately, since this is running on an externally managed machine, it doesn't seem like I can change that option. Is there anything else I can adjust that doesn't require sudo permissions?

@ericjbohm ericjbohm removed this from the 7.1 milestone Jun 5, 2024