
Add ucx use case test and driver script #46

Closed
wants to merge 2 commits

Conversation


@jdaaph jdaaph commented Nov 9, 2020

Add a use case test for the UCX van with an example usage pattern; see L251 in tests/test_test_benchmark_ucx.cc. Main points:

  1. Push / Pull src and dst can be either GPU or CPU addresses. We will update the test with the proposed API (e.g. labeling an SArray's src/dst device id) after the API is implemented.
  2. More than one concurrent worker session runs in the same worker process (a rough sketch follows the excerpt below).

To run the tests from the driver script, first run $ make test from the root dir, then run $ tests/ucx_multi_node.sh on node 1 and $ tests/ucx_multi_node.sh remote on node 2.
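For reference, the full sequence looks like this (which node hosts the scheduler is my reading of the script, so treat that detail as an assumption):

make test                          # from the repository root
tests/ucx_multi_node.sh            # on node 1 (the node co-located with the scheduler)
tests/ucx_multi_node.sh remote     # on node 2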

//
// UCX will enable all the following src-dst memory location combinations.
//
// DataScatter: ZPush (src local CPU, dst remote GPU). A session calls a ZPush for every GPU dst on every remote node.
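A rough sketch of the intended usage pattern (a minimal sketch only: RunSessions and the sizes/keys are hypothetical, the device-id annotation is still the proposed API, and ZPush/Wait follow the existing ps-lite KVWorker interface):

#include <thread>
#include <vector>
#include "ps/ps.h"

void RunSessions(int num_sessions) {
  // One singleton PSWorker per process, shared by all worker sessions (threads).
  static ps::KVWorker<char> kv(0, 0);
  std::vector<std::thread> sessions;
  for (int s = 0; s < num_sessions; ++s) {
    sessions.emplace_back([s]() {
      const size_t kLen = 1024;
      std::vector<ps::Key> key_vec{static_cast<ps::Key>(s)};
      std::vector<int> len_vec{static_cast<int>(kLen)};
      ps::SArray<ps::Key> keys(key_vec);
      ps::SArray<char> vals(kLen);          // may wrap a CPU or a GPU address;
                                            // the proposed API would label its device id
      ps::SArray<int> lens(len_vec);
      // e.g. DataScatter: one ZPush per remote GPU dst, issued from each session
      kv.Wait(kv.ZPush(keys, vals, lens));
    });
  }
  for (auto& t : sessions) t.join();
}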


Just to clarify:
Does a worker session mean a separate worker performing some part of the job?
And every worker is supposed to work with a single local GPU, but all remote GPUs, right?

Collaborator Author

(1) In our use case, all worker sessions (threads) in the same process will use the same singleton PSWorker for all communication jobs; they will annotate src / dst with the newly proposed SArray API. (2) And yes, each worker will work with a single local GPU and all remote GPUs.


Thanks.
Will every server also work with a single local GPU (so workers will push/pull to many servers, one for each GPU)?
Or will it be a single server instance?

Collaborator Author

We have only one server per node, working with multiple GPUs. We may move the PSServer and PSWorker on the same node into the same process, so the server might not be its own process in the future (in 2 months or so). I hope this won't cause an issue for UCX context handling.

Collaborator Author

@brminich When I'm running this usage pattern test with UCX (simply add export DMLC_ENABLE_UCX=1 to tests/ucx_multi_node.sh and run it on two machines), I found that one of the machines (the one not colocated with the ps-lite scheduler) fails at the CHECK:
CHECK(addr != w_pool_.end()); in ucx_van.h L:563

Could you check if you can reproduce it and propose a fix? Thanks!

Stack trace:

#0  ps::UCXVan::GetRxBuffer (this=0x6eacb0, key=31, size=size@entry=30000000, push=<optimized out>) at ps-lite/src/./ucx_van.h:563
#1  0x0000000000457084 in ps::UCXVan::PostRecvData (this=this@entry=0x6eacb0, meta_req=meta_req@entry=0xa0c9e8) at ps-lite/src/./ucx_van.h:682
#2  0x0000000000457bb2 in ps::UCXVan::PollUCX (this=0x6eacb0) at ps-lite/src/./ucx_van.h:612
#3  0x00007ffff5ce3eb0 in std::execute_native_thread_routine (__p=<optimized out>) at ../../../../../gcc-4.9.4/libstdc++-v3/src/c++11/thread.cc:84
#4  0x00007ffff6ea51d3 in start_thread (arg=0x7fffe5476700) at pthread_create.c:309
#5  0x00007ffff4ddfd6d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

It seems like ucx_van assumes we never pull a key before pushing the same key. This is okay for the allreduce use case, but we implemented some APIs like GatherV in a newer version of byteps.
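For context, my reading of the failing path, as a simplified illustration only (not the actual ucx_van.h code; everything except w_pool_ and GetRxBuffer is paraphrased):

#include <cstdint>
#include <unordered_map>

// w_pool_ maps a key to the receive buffer that was registered when the key was first pushed.
std::unordered_map<uint64_t, char*> w_pool_;

char* GetRxBuffer(uint64_t key, size_t size, bool push) {
  // size/push are used by the real implementation to size and select the buffer
  auto addr = w_pool_.find(key);
  // CHECK(addr != w_pool_.end()) fires here: a pull on a key that was never
  // pushed finds no pre-registered buffer.
  if (addr == w_pool_.end()) return nullptr;
  return addr->second;
}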

@brminich brminich Dec 5, 2020

OK, what is the desired behavior?
Should I use memory registered with the new RegisterRecvBuffer() API or allocate the buffer on the CPU?


brminich commented Dec 5, 2020

@jdaaph, please use the latest master; #49 and #50 should fix the issue


jdaaph commented Dec 12, 2020

@brminich Some updates on this unit test:

  1. Added a correctness test in tests/test_correctness.cc; however, I couldn't get it up and running to debug it (more info later).
  2. To reproduce the issue that occurs when passing a gpu_ptr to ZPush / ZPull, I simply modified test_benchmark.cc to add a gpu_ptr. With the UCX van, UCX_TLS=tcp works (even with the gpu ptr passed in), but with UCX_TLS=all I see almost the same error message as in openucx/ucx#4707 (PyTorch+OpenMPI+UCX broadcast giving invalid device context).
     On the main node:
cuda_ipc_md.c:102  UCX  ERROR cuIpcGetMemHandle(&(key->ph), (CUdeviceptr) addr) is failed. ret:invalid device context

And on the remote node (the one not running the scheduler): [screenshot Snip20201211_1]

Running lsmod | grep nv_peer gives normal output, but we have not installed gdrcopy or compiled UCX with it, as I understand that's for device-to-Host copies. Thanks!
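For reference, the exact checks (treating gdrdrv as the gdrcopy kernel module name, which is my assumption):

lsmod | grep nv_peer      # nv_peer_mem shows up, so GPUDirect RDMA kernel support is present
lsmod | grep gdrdrv       # would show the gdrcopy kernel module; not present in our containers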

@brminich

@jdaaph, what UCX version are you using? Also, can you please provide:

  • all UCX env variables you used
  • modified test_benchmark.cc


jdaaph commented Dec 12, 2020

@jdaaph, what UCX version are you using? Also can you please provide:

  • all UCX env variables you used
  • modified test_benchmark.cc

Thanks for the prompt reply!
UCX version: master branch, commit hash d31041f01
UCX env variables: already included in tests/ucx_multi_node.sh (this PR has already been updated):

export DMLC_ENABLE_UCX=1          # test ucx
export UCX_TLS=all                # not working
export UCX_IB_GPU_DIRECT_RDMA=no

The modified test_benchmark.cc is also included here in the second commit: basically we change aligned_memory_alloc to also allocate a gpu_ptr (rough sketch below).
40f2094#diff-12ae770ac5336b36135685644b3e80c7c4ae2e07f1a35ac4d80d3f55b3a4e5c0R32
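Roughly, the change looks like the sketch below (my paraphrase; the real signature and names are in the linked commit and may differ):

#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>

// Sketch: allocate a page-aligned host buffer plus a device buffer of the same size,
// so the benchmark can hand either a CPU or a GPU address to ZPush / ZPull.
void aligned_memory_alloc(void** host_ptr, void** gpu_ptr, size_t size) {
  if (posix_memalign(host_ptr, 4096, size) != 0) {
    std::fprintf(stderr, "posix_memalign failed\n");
    std::abort();
  }
  cudaError_t err = cudaMalloc(gpu_ptr, size);
  if (err != cudaSuccess) {
    std::fprintf(stderr, "cudaMalloc failed: %s\n", cudaGetErrorString(err));
    std::abort();
  }
}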

@brminich

cuda_ipc_md.c:102  UCX  ERROR cuIpcGetMemHandle(&(key->ph), (CUdeviceptr) addr) is failed. ret:invalid device context

@Akshay-Venkatesh, can you please advise?

@brminich
@jdaaph, you may try to set UCX_TLS=tcp,ib,cuda_copy,gdr_copy as a workaround


jdaaph commented Dec 14, 2020

@jdaaph, you may try to set UCX_TLS=tcp,ib,cuda_copy,gdr_copy as a workaround

After setting UCX_TLS=tcp,ib,cuda_copy (we didn't install gdr_copy in our container), test_benchmark works, thank you!

@Akshay-Venkatesh

cuda_ipc_md.c:102  UCX  ERROR cuIpcGetMemHandle(&(key->ph), (CUdeviceptr) addr) is failed. ret:invalid device context

@brminich does the calling thread have a device context associated with it? This is normally done by calling cudaSetDevice or the driver API cuCtxSetCurrent.
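In ps-lite terms that would mean something like the following at the start of each thread that touches GPU buffers (a sketch; BindThreadToDevice and gpu_id are illustrative names, not part of this PR):

#include <cuda_runtime.h>
#include <cstdio>

// Bind the calling thread to the device that owns the buffers it sends/receives,
// so cuda_ipc has a valid current device context when it creates IPC handles.
void BindThreadToDevice(int gpu_id) {
  cudaError_t err = cudaSetDevice(gpu_id);  // driver API alternative: cuCtxSetCurrent(ctx)
  if (err != cudaSuccess) {
    std::fprintf(stderr, "cudaSetDevice(%d) failed: %s\n", gpu_id,
                 cudaGetErrorString(err));
  }
}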

@pleasantrabbit

@brminich the test tests/test_benchmark.cc in this PR only reports ~8 Gbps between two nodes, each with a 100G NIC. Do you have performance numbers for UCX with GDR that we can compare against?

@brminich

Can you please share the details of your configuration? Are you sure GDR is enabled?

@Akshay-Venkatesh, @bureddy

@pleasantrabbit

@brminich

Can you please share the details of your configuration?

I used the script tests/ucx_multi_node.sh in this PR to launch the scheduler and server on one node, and one worker on a second node. I changed UCX_TLS to export UCX_TLS=ib,tcp,cuda_ipc,cuda_copy.

Are you sure GDR is enabled?

How do I verify GDR is actually being used? I checked nv_peer_mem; it's loaded. UCX is configured using:

./contrib/configure-release --enable-mt --with-cuda=/path/to/cuda

There must be something misconfigured in my environment; 8 Gbps is too low even if the data goes through main memory.
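One way I can think of to check (an assumption on my part, not something confirmed in this thread) is to list the transports UCX was built with and raise the log level to see what it actually selects at runtime:

ucx_info -d | grep -i -e cuda -e gdr           # cuda_copy / cuda_ipc / gdr_copy show up only if built in
UCX_LOG_LEVEL=info tests/ucx_multi_node.sh     # info-level logs should include the transports chosen per endpoint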

@brminich

@pleasantrabbit, can you please try to run the test with UCX_RNDV_SCHEME=get_zcopy and UCX_RNDV_SCHEME=put_zcopy?
Btw, since cuda_ipc is not failing anymore, did you fix the test by setting the proper device id?
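For example (assuming the same launch script as above, with the variable set on both nodes):

UCX_RNDV_SCHEME=get_zcopy tests/ucx_multi_node.sh      # run 1 (add `remote` on the second node as before)
UCX_RNDV_SCHEME=put_zcopy tests/ucx_multi_node.sh      # run 2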

@pleasantrabbit

@brminich With UCX_RNDV_SCHEME=get_zcopy the speed didn't change. With UCX_RNDV_SCHEME=put_zcopy the speed dropped to ~3 Gbps. However, after I switched to 2 GPUs attached to the same PCIe switch as the NIC, and used a 4096000-byte message size, the test reports ~88 Gbps. Does GDR require the GPU and the NIC to be under the same PCIe switch?

btw, since cuda_ipc is not failing anymore, did you fix the test by setting proper device id?

This is the first time I ran the test; the cuda_ipc error didn't happen to me.

@brminich

Does GDR require the GPU and the NIC to be under the same PCIe switch?

For good performance, yes. Please try to set UCX_IB_GPU_DIRECT_RDMA=no and run the test with the GPU residing on a different PCIe switch.

@Akshay-Venkatesh

@pleasantrabbit If you don't mind, can you point to the setup details used for the experiments?

Also, please refer to the GPUDirect RDMA supported-systems notes here: https://docs.nvidia.com/cuda/gpudirect-rdma/index.html#supported-systems for guidance on NIC/GPU configuration. Generally, good performance can be expected when the NIC and GPU are accessible through a PCIe switch.

@eric-haibin-lin

Hi @brminich @Akshay-Venkatesh I got the following error when launching test_benchmark:

[15:49:07] ps-lite/tests/test_benchmark.cc:328: number of threads for the same worker = 1

[15:49:07] ps-lite/src/postoffice.cc:22: enable UCX for networking

[15:49:07] ps-lite/include/dmlc/logging.h:284: [15:49:07] ps-lite/src/./ucx_van.h:373: Check failed: (status) == UCS_OK ucp_init failed: Unsupported operation

Stack trace returned 8 entries:
[bt] (0) ./test_benchmark() [0x410b53]
[bt] (1) ./test_benchmark() [0x410edd]
[bt] (2) ./test_benchmark() [0x44e1ad]
[bt] (3) ./test_benchmark() [0x42382e]
[bt] (4) ./test_benchmark() [0x40aae5]
[bt] (5) bundle/libc.so.6(__libc_start_main+0xf0) [0x7f2b204b9010]
[bt] (6) ./test_benchmark() [0x40b7b1]
[bt] (7) [(nil)]

My ucx build info:

#define UCX_MODULE_SUBDIR         "ucx"
#define VERSION                   "1.10"
#define restrict                  __restrict
#define test_MODULES              ":module"
#define ucm_MODULES               ":cuda"
#define uct_MODULES               ":cuda:ib:rdmacm:cma"
#define uct_cuda_MODULES          ":gdrcopy"
#define uct_ib_MODULES            ""
#define uct_rocm_MODULES          ""
#define ucx_perftest_MODULES      ":cuda"

Any suggestions?

@eric-haibin-lin

Just to provide an update: I found there was an issue with how the UCX build was linked as a dynamic library on my side. It is fixed now.
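For anyone hitting the same thing, a quick generic check (not specific to this repo) is to confirm which UCX shared objects the test binary actually resolves:

ldd ./test_benchmark | grep -E 'ucp|uct|ucs'    # shows which libucp/libuct/libucs get picked up
echo $LD_LIBRARY_PATH                           # make sure it points at the intended UCX build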

pleasantrabbit added a commit that referenced this pull request Jan 25, 2021
* Gather scatter pattern test and fixes

This commit builds on top of #46, with an update to the tests and scripts. It also makes a few changes to UCXVan:

* ZPull does not require a preceding ZPush
* pull requests still pass the correct value of val_len in msg.meta
* added a thread to ensure the order of push/pull request messages upon reception
* adds a USE_CUDA=1 compilation option

- fix w_pool_ for pull
- fix val len bug
- add USE_CUDA=1 compilation option
- unset export UCX_TLS=ib,cuda
- support 48 bit keys

Co-authored-by: Chengyu Dai <chengyu.dai@bytedance.com>
Co-authored-by: Yulu Jia <yulu.jia@bytedance.com>
Co-authored-by: haibin.lin <haibin.lin@bytedance.com>
@eric-haibin-lin

superseded by #57
