Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Disco] Add MSCCLPP initialization along side NCCL #16842

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

csullivan
Copy link
Contributor

  • Use CCL type traits to share common code between NCCL and MSCCLPP API invocations in disco
  • Add bench to validate results and compare various supported CCL approaches for cuda.

Aggregated profiling results over the sweep of transfer sizes introduced in the above mentioned bench 2**(12 -> 24).

 Time (%)  Total Time (ns)  Instances  Avg (ns)  Med (ns)  Min (ns)  Max (ns)  StdDev (ns)                                                  Name
 --------  ---------------  ---------  --------  --------  --------  --------  -----------  ----------------------------------------------------------------------------------------------------
     33.2        277809887       4048   68628.9   45504.0     15104  25069360     545729.8  ncclDevKernel_AllReduce_Sum_f16_RING_LL(ncclDevComm *, unsigned long, ncclWork *)
     24.3        203015590       4040   50251.4   48256.0     12992    315934      29470.2  void tensorrt_llm::twoShotAllReduceKernel<__half, (int)8>(tensorrt_llm::AllReduceParams)
     20.9        174549284       4040   43205.3   39440.0      3711   1085846      52387.0  void tensorrt_llm::oneShotAllReduceKernel<__half, (int)8>(tensorrt_llm::AllReduceParams)
     20.7        173275472       4040   42890.0   40112.0      5375    750653      42864.2  void tvm::runtime::allreduce_simple<__half>(mscclpp::SmChannelDeviceHandle *, const T1 *, T1 *, voi…
      0.8          6985121        120   58209.3   55871.5      9695    158239      35854.8  ncclDevKernel_AllGather_RING_LL(ncclDevComm *, unsigned long, ncclWork *)

I noted significant variance between runs, so e2e or use of cuda graph launch for synchronization could help give a clearer picture.

@csullivan csullivan force-pushed the feature/2024-03-28/mscclpp-allreduce-disco branch 2 times, most recently from b033802 to 89c937a Compare April 3, 2024 18:19
* Use CCL type traits to share common code between
NCCL and MSCCLPP API invocations in disco
@csullivan csullivan force-pushed the feature/2024-03-28/mscclpp-allreduce-disco branch from 89c937a to 092fc93 Compare April 3, 2024 18:20
@tqchen
Copy link
Member

tqchen commented Apr 9, 2024

@csullivan please fix the lint error

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants