
Add ucx use case test and driver script #46

Closed
wants to merge 2 commits

Conversation


@jdaaph jdaaph commented Nov 9, 2020

Add a use case test for the UCX van with an example usage pattern; see L251 in tests/test_test_benchmark_ucx.cc. Main points:

  1. Push / Pull src and dst can be either GPU or CPU addresses. We will update the test with the proposed API (e.g. labeling an SArray's src/dst device id) after the API is implemented.
  2. More than one concurrent worker session runs in the same worker process (a rough sketch follows the excerpt below).

To run the tests from the driver script, first run $ make test from the root dir, then run $ tests/ucx_multi_node.sh on node 1 and $ tests/ucx_multi_node.sh remote on node 2.
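For reference, the full sequence looks like this (which node hosts the scheduler is my reading of the script, so treat that detail as an assumption):

make test                          # from the repository root
tests/ucx_multi_node.sh            # on node 1 (the node co-located with the scheduler)
tests/ucx_multi_node.sh remote     # on node 2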

//
// UCX will enable all the following src-dst memory location combinations.
//
// DataScatter: ZPush (src local CPU, dst remote GPU). A session calls a ZPush for every GPU dst on every remote node.
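A rough sketch of the intended usage pattern (a minimal sketch only: RunSessions and the sizes/keys are hypothetical, the device-id annotation is still the proposed API, and ZPush/Wait follow the existing ps-lite KVWorker interface):

#include <thread>
#include <vector>
#include "ps/ps.h"

void RunSessions(int num_sessions) {
  // One singleton PSWorker per process, shared by all worker sessions (threads).
  static ps::KVWorker<char> kv(0, 0);
  std::vector<std::thread> sessions;
  for (int s = 0; s < num_sessions; ++s) {
    sessions.emplace_back([s]() {
      const size_t kLen = 1024;
      std::vector<ps::Key> key_vec{static_cast<ps::Key>(s)};
      std::vector<int> len_vec{static_cast<int>(kLen)};
      ps::SArray<ps::Key> keys(key_vec);
      ps::SArray<char> vals(kLen);          // may wrap a CPU or a GPU address;
                                            // the proposed API would label its device id
      ps::SArray<int> lens(len_vec);
      // e.g. DataScatter: one ZPush per remote GPU dst, issued from each session
      kv.Wait(kv.ZPush(keys, vals, lens));
    });
  }
  for (auto& t : sessions) t.join();
}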


Just to clarify:
Does a worker session mean a separate worker performing some part of the job?
And every worker is supposed to work with a single local GPU, but all remote GPUs, right?

Collaborator Author

(1) In our use case, all worker sessions (threads) in the same process will use the same singleton PSWorker for all communication jobs; they will annotate src / dst with the newly proposed SArray API. (2) And yes, each worker will work with a single local GPU and all remote GPUs.


Thanks.
Will every server also work with a single local GPU (so workers will push/pull to many servers, one for each GPU)?
Or will it be a single server instance?

Collaborator Author

We have only one server per node, working with multiple GPUs. We may move the PSServer and PSWorker on the same node into the same process, so the server might not be its own process in the future (in 2 months or so). I hope this won't cause an issue for UCX context handling.

Collaborator Author

@brminich When I'm running this usage pattern test with UCX (simply add export DMLC_ENABLE_UCX=1 to tests/ucx_multi_node.sh and run it on two machines), I found that one of the machines (the one not colocated with the ps-lite scheduler) fails at the CHECK:
CHECK(addr != w_pool_.end()); in ucx_van.h L:563

Could you check if you can reproduce it and propose a fix? Thanks!

Stack trace:

#0  ps::UCXVan::GetRxBuffer (this=0x6eacb0, key=31, size=size@entry=30000000, push=<optimized out>) at ps-lite/src/./ucx_van.h:563
#1  0x0000000000457084 in ps::UCXVan::PostRecvData (this=this@entry=0x6eacb0, meta_req=meta_req@entry=0xa0c9e8) at ps-lite/src/./ucx_van.h:682
#2  0x0000000000457bb2 in ps::UCXVan::PollUCX (this=0x6eacb0) at ps-lite/src/./ucx_van.h:612
#3  0x00007ffff5ce3eb0 in std::execute_native_thread_routine (__p=<optimized out>) at ../../../../../gcc-4.9.4/libstdc++-v3/src/c++11/thread.cc:84
#4  0x00007ffff6ea51d3 in start_thread (arg=0x7fffe5476700) at pthread_create.c:309
#5  0x00007ffff4ddfd6d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

It seems like ucx_van assumes we never pull a key before pushing the same key. This is okay for the allreduce use case, but we implemented some APIs like GatherV in a newer version of byteps.
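For context, my reading of the failing path, as a simplified illustration only (not the actual ucx_van.h code; everything except w_pool_ and GetRxBuffer is paraphrased):

#include <cstdint>
#include <unordered_map>

// w_pool_ maps a key to the receive buffer that was registered when the key was first pushed.
std::unordered_map<uint64_t, char*> w_pool_;

char* GetRxBuffer(uint64_t key, size_t size, bool push) {
  // size/push are used by the real implementation to size and select the buffer
  auto addr = w_pool_.find(key);
  // CHECK(addr != w_pool_.end()) fires here: a pull on a key that was never
  // pushed finds no pre-registered buffer.
  if (addr == w_pool_.end()) return nullptr;
  return addr->second;
}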

@brminich brminich Dec 5, 2020

OK, what is the desired behavior?
Should I use memory registered with the new RegisterRecvBuffer() API or allocate the buffer on the CPU?


brminich commented Dec 5, 2020

@jdaaph, please use the latest master; #49 and #50 should fix the issue


jdaaph commented Dec 12, 2020

@brminich Some updates on this unit test:

  1. Added a correctness test in tests/test_correctness.cc; however, I couldn't get it up and running to debug it (more info later).
  2. To reproduce the issue that occurs when passing a gpu_ptr to ZPush / ZPull, I simply modified test_benchmark.cc to add a gpu_ptr. With the UCX van, UCX_TLS=tcp works (even with the gpu ptr passed in), but with UCX_TLS=all I see almost the same error message as in openucx/ucx#4707 (PyTorch+OpenMPI+UCX broadcast giving invalid device context).
     On the main node:
cuda_ipc_md.c:102  UCX  ERROR cuIpcGetMemHandle(&(key->ph), (CUdeviceptr) addr) is failed. ret:invalid device context

And on the remote node (the one not running the scheduler): [screenshot Snip20201211_1]

Running lsmod | grep nv_peer gives normal output, but we have not installed gdrcopy or compiled UCX with it, as I understand that's for device-to-Host copies. Thanks!
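For reference, the exact checks (treating gdrdrv as the gdrcopy kernel module name, which is my assumption):

lsmod | grep nv_peer      # nv_peer_mem shows up, so GPUDirect RDMA kernel support is present
lsmod | grep gdrdrv       # would show the gdrcopy kernel module; not present in our containers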

@brminich

@jdaaph, what UCX version are you using? Also, can you please provide:

  • all UCX env variables you used
  • modified test_benchmark.cc


jdaaph commented Dec 12, 2020

@jdaaph, what UCX version are you using? Also can you please provide:

  • all UCX env variables you used
  • modified test_benchmark.cc

Thanks for the prompt reply!
UCX version: master branch, commit hash d31041f01
UCX env variables: already included in tests/ucx_multi_node.sh (this PR has already been updated):

export DMLC_ENABLE_UCX=1          # test ucx
export UCX_TLS=all                # not working
export UCX_IB_GPU_DIRECT_RDMA=no

The modified test_benchmark.cc is also included here in the second commit: basically we change aligned_memory_alloc to also allocate a gpu_ptr (rough sketch below).
40f2094#diff-12ae770ac5336b36135685644b3e80c7c4ae2e07f1a35ac4d80d3f55b3a4e5c0R32
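Roughly, the change looks like the sketch below (my paraphrase; the real signature and names are in the linked commit and may differ):

#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>

// Sketch: allocate a page-aligned host buffer plus a device buffer of the same size,
// so the benchmark can hand either a CPU or a GPU address to ZPush / ZPull.
void aligned_memory_alloc(void** host_ptr, void** gpu_ptr, size_t size) {
  if (posix_memalign(host_ptr, 4096, size) != 0) {
    std::fprintf(stderr, "posix_memalign failed\n");
    std::abort();
  }
  cudaError_t err = cudaMalloc(gpu_ptr, size);
  if (err != cudaSuccess) {
    std::fprintf(stderr, "cudaMalloc failed: %s\n", cudaGetErrorString(err));
    std::abort();
  }
}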

@brminich

cuda_ipc_md.c:102  UCX  ERROR cuIpcGetMemHandle(&(key->ph), (CUdeviceptr) addr) is failed. ret:invalid device context

@Akshay-Venkatesh, can you please advise?

@brminich
@jdaaph, you may try to set UCX_TLS=tcp,ib,cuda_copy,gdr_copy as a workaround


jdaaph commented Dec 14, 2020

@jdaaph, you may try to set UCX_TLS=tcp,ib,cuda_copy,gdr_copy as a workaround

After setting UCX_TLS=tcp,ib,cuda_copy (we didn't install gdr_copy in our container), test_benchmark works, thank you!

@Akshay-Venkatesh

cuda_ipc_md.c:102  UCX  ERROR cuIpcGetMemHandle(&(key->ph), (CUdeviceptr) addr) is failed. ret:invalid device context

@brminich does the calling thread have a device context associated with it? This is normally done by calling cudaSetDevice or the driver API cuCtxSetCurrent.
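In ps-lite terms that would mean something like the following at the start of each thread that touches GPU buffers (a sketch; BindThreadToDevice and gpu_id are illustrative names, not part of this PR):

#include <cuda_runtime.h>
#include <cstdio>

// Bind the calling thread to the device that owns the buffers it sends/receives,
// so cuda_ipc has a valid current device context when it creates IPC handles.
void BindThreadToDevice(int gpu_id) {
  cudaError_t err = cudaSetDevice(gpu_id);  // driver API alternative: cuCtxSetCurrent(ctx)
  if (err != cudaSuccess) {
    std::fprintf(stderr, "cudaSetDevice(%d) failed: %s\n", gpu_id,
                 cudaGetErrorString(err));
  }
}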

@pleasantrabbit

@brminich the test tests/test_benchmark.cc in this PR only reports ~8 Gbps between two nodes, each with a 100G NIC. Do you have performance numbers for UCX with GDR that we can compare against?

@brminich

Can you please share the details of your configuration? Are you sure GDR is enabled?

@Akshay-Venkatesh, @bureddy

@pleasantrabbit

@brminich

Can you please share the details of your configuration?

I used the script tests/ucx_multi_node.sh in this PR to launch the scheduler and server on one node, and one worker on a second node. I changed UCX_TLS to export UCX_TLS=ib,tcp,cuda_ipc,cuda_copy.

Are you sure GDR is enabled?

How do I verify GDR is actually being used? I checked nv_peer_mem; it's loaded. UCX is configured using:

./contrib/configure-release --enable-mt --with-cuda=/path/to/cuda

There must be something misconfigured in my environment; 8 Gbps is too low even if the data goes through main memory.
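One way I can think of to check (an assumption on my part, not something confirmed in this thread) is to list the transports UCX was built with and raise the log level to see what it actually selects at runtime:

ucx_info -d | grep -i -e cuda -e gdr           # cuda_copy / cuda_ipc / gdr_copy show up only if built in
UCX_LOG_LEVEL=info tests/ucx_multi_node.sh     # info-level logs should include the transports chosen per endpoint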

@brminich

@pleasantrabbit, can you please try to run the test with UCX_RNDV_SCHEME=get_zcopy and UCX_RNDV_SCHEME=put_zcopy?
Btw, since cuda_ipc is not failing anymore, did you fix the test by setting the proper device id?
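For example (assuming the same launch script as above, with the variable set on both nodes):

UCX_RNDV_SCHEME=get_zcopy tests/ucx_multi_node.sh      # run 1 (add `remote` on the second node as before)
UCX_RNDV_SCHEME=put_zcopy tests/ucx_multi_node.sh      # run 2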

@pleasantrabbit

@brminich With UCX_RNDV_SCHEME=get_zcopy the speed didn't change. With UCX_RNDV_SCHEME=put_zcopy the speed dropped to ~3 Gbps. However, after I switched to 2 GPUs attached to the same PCIe switch as the NIC, and used a 4096000-byte message size, the test reports ~88 Gbps. Does GDR require the GPU and the NIC to be under the same PCIe switch?

btw, since cuda_ipc is not failing anymore, did you fix the test by setting proper device id?

This is the first time I ran the test; the cuda_ipc error didn't happen to me.

@brminich

Does GDR require the GPU and the NIC to be under the same PCIe switch?

For good performance, yes. Please try to set UCX_IB_GPU_DIRECT_RDMA=no and run the test with the GPU residing on a different PCIe switch.

@Akshay-Venkatesh

@pleasantrabbit If you don't mind, can you point to the setup details used for the experiments?

Also, please refer to the GPUDirect RDMA supported-systems notes here: https://docs.nvidia.com/cuda/gpudirect-rdma/index.html#supported-systems for guidance on NIC/GPU configuration. Generally, good performance can be expected when the NIC and GPU are accessible through a PCIe switch.

@eric-haibin-lin

Hi @brminich @Akshay-Venkatesh I got the following error when launching test_benchmark:

[15:49:07] ps-lite/tests/test_benchmark.cc:328: number of threads for the same worker = 1

[15:49:07] ps-lite/src/postoffice.cc:22: enable UCX for networking

[15:49:07] ps-lite/include/dmlc/logging.h:284: [15:49:07] ps-lite/src/./ucx_van.h:373: Check failed: (status) == UCS_OK ucp_init failed: Unsupported operation

Stack trace returned 8 entries:
[bt] (0) ./test_benchmark() [0x410b53]
[bt] (1) ./test_benchmark() [0x410edd]
[bt] (2) ./test_benchmark() [0x44e1ad]
[bt] (3) ./test_benchmark() [0x42382e]
[bt] (4) ./test_benchmark() [0x40aae5]
[bt] (5) bundle/libc.so.6(__libc_start_main+0xf0) [0x7f2b204b9010]
[bt] (6) ./test_benchmark() [0x40b7b1]
[bt] (7) [(nil)]

My ucx build info:

#define UCX_MODULE_SUBDIR         "ucx"
#define VERSION                   "1.10"
#define restrict                  __restrict
#define test_MODULES              ":module"
#define ucm_MODULES               ":cuda"
#define uct_MODULES               ":cuda:ib:rdmacm:cma"
#define uct_cuda_MODULES          ":gdrcopy"
#define uct_ib_MODULES            ""
#define uct_rocm_MODULES          ""
#define ucx_perftest_MODULES      ":cuda"

Any suggestions?

@eric-haibin-lin

Just to provide an update: I found there was an issue with how the UCX build was linked as a dynamic library on my side. It is fixed now.
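For anyone hitting the same thing, a quick generic check (not specific to this repo) is to confirm which UCX shared objects the test binary actually resolves:

ldd ./test_benchmark | grep -E 'ucp|uct|ucs'    # shows which libucp/libuct/libucs get picked up
echo $LD_LIBRARY_PATH                           # make sure it points at the intended UCX build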

pleasantrabbit added a commit that referenced this pull request Jan 25, 2021
* Gather scatter pattern test and fixes

This commit builds on top of #46, with an update to the tests and scripts. It also makes a few changes to UCXVan:

* ZPull does not require a preceding ZPush
* pull requests still pass the correct value of val_len in msg.meta
* added a thread to ensure the order of push/pull request messages upon reception
* adds a USE_CUDA=1 compilation option

- fix w_pool_ for pull
- fix val len bug
- add USE_CUDA=1 compilation option
- unset export UCX_TLS=ib,cuda
- support 48 bit keys

Co-authored-by: Chengyu Dai <chengyu.dai@bytedance.com>
Co-authored-by: Yulu Jia <yulu.jia@bytedance.com>
Co-authored-by: haibin.lin <haibin.lin@bytedance.com>
@eric-haibin-lin

superseded by #57
