Add ucx use case test and driver script #46
Conversation
//
// UCX will enable all the following src-dst memory location combinations.
//
// DataScatter: ZPush (src local CPU, dst remote GPU). A session calls a ZPush for every GPU dst on every remote node.
Just to clarify:
Does a worker session define a separate worker performing some part of the job?
And every worker is supposed to work with a single local GPU, but all remote GPUs, right?
(1) In our use case, all worker sessions (threads) in the same process will use the same singleton PSWorker for all communication jobs; they will annotate src/dst with the newly proposed SArray API. (2) And yes, each worker will work with a single local GPU and all remote GPUs.
Thanks.
Will every server also work with a single local GPU (so workers will push/pull to many servers, one for each GPU)?
Or will it be a single server instance?
We have only one server per node, working with multiple GPUs. We may move the PSServer and PSWorker on the same node into the same process, so the server might not be its own process in the future (in 2 months or so). I hope this won't cause an issue for UCX context handling.
@brminich When I'm running this usage pattern test with UCX (simply add
export DMLC_ENABLE_UCX=1
to tests/ucx_multi_node.sh and run it on two machines), I found that one of the machines (the one not colocated with the ps-lite scheduler) fails at the CHECK:
CHECK(addr != w_pool_.end());
in ucx_van.h at line 563. Could you check if you can reproduce it and propose a fix? Thanks!
Stack trace:
#0  ps::UCXVan::GetRxBuffer (this=0x6eacb0, key=31, size=size@entry=30000000, push=<optimized out>) at ps-lite/src/./ucx_van.h:563
#1  0x0000000000457084 in ps::UCXVan::PostRecvData (this=this@entry=0x6eacb0, meta_req=meta_req@entry=0xa0c9e8) at ps-lite/src/./ucx_van.h:682
#2  0x0000000000457bb2 in ps::UCXVan::PollUCX (this=0x6eacb0) at ps-lite/src/./ucx_van.h:612
#3  0x00007ffff5ce3eb0 in std::execute_native_thread_routine (__p=<optimized out>) at ../../../../../gcc-4.9.4/libstdc++-v3/src/c++11/thread.cc:84
#4  0x00007ffff6ea51d3 in start_thread (arg=0x7fffe5476700) at pthread_create.c:309
#5  0x00007ffff4ddfd6d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
It seems like ucx_van assumes we never pull a key before pushing the same key. That is okay for the allreduce use case, but we implemented APIs like GatherV in a newer version of BytePS.
Ok, what is the desired behavior?
Should I use memory registered with the new RegisterRecvBuffer() API, or allocate the buffer on CPU?
@brminich Some updates on this unit test,
And on the remote node (the one not running the scheduler), running
@jdaaph, what UCX version are you using? Also, can you please provide:
Thanks for the prompt reply!
The modified test_benchmark.cc is also already included here in the second commit: basically, we changed aligned_memory_alloc to also allocate gpu_ptr.
@Akshay-Venkatesh, can you please advise?
@jdaaph, you may try to set
After setting
@brminich does the calling thread have a device context associated with it? Normally this is done by calling cudaSetDevice, or by calling the driver API cuCtxSetCurrent.
@brminich the test
Can you please share the details of your configuration? Are you sure GDR is enabled?
I used the script
How do I verify GDR is actually being used? I checked
There must be something misconfigured in my environment; 8 Gbps is too low even if the data goes through main memory.
@pleasantrabbit, can you please try to run the test with
@brminich Using
This is the first time I ran the test; the
For good performance, yes. Please try to set
@pleasantrabbit If you don't mind, can you point me to the setup details used for the experiments? Also, please refer to the GPUDirect RDMA supported-systems guidance at https://docs.nvidia.com/cuda/gpudirect-rdma/index.html#supported-systems for NIC/GPU configuration. Generally, good performance can be expected when the NIC and GPU are accessible through a PCIe switch.
Hi @brminich @Akshay-Venkatesh, I got the following error when launching test_benchmark:
My UCX build info:
Any suggestions?
Just to provide an update, I found there was some issue linking the ucx build as a dynamic library on my side. Now it is fixed. |
* Gather scatter pattern test and fixes

This commit builds on top of #46, with an update on tests and scripts. It also makes two minor changes to UCXVan:

* ZPull does not require a preceding ZPush
* pull requests still pass the correct value of val_len in msg.meta
* added a thread to ensure the order of push/pull request messages upon reception
* It also adds a USE_CUDA=1 compilation option.

- fix w_pool_ for pull
- fix val_len bug
- add USE_CUDA=1 compilation option
- unset export UCX_TLS=ib,cuda
- support 48-bit keys

Co-authored-by: Chengyu Dai <chengyu.dai@bytedance.com>
Co-authored-by: Yulu Jia <yulu.jia@bytedance.com>
Co-authored-by: haibin.lin <haibin.lin@bytedance.com>
Superseded by #57.
Add a use case test for the UCX van with an example usage pattern; see L251 in tests/test_test_benchmark_ucx.cc.

Main points: To run the tests from the driver scripts, first run
$ make test
from the root dir, then run
$ tests/ucx_multi_node.sh
on node 1 and
$ tests/ucx_multi_node.sh remote
on node 2.