
Analysis of Large-Scale Multi-Tenant GPU Clusters for DNN Training Workloads #125

Open
gaocegege opened this issue Jan 27, 2019 · 6 comments

Comments

@gaocegege
Member

https://arxiv.org/pdf/1901.05758.pdf

@gaocegege
Member Author

Work from Microsoft. The paper studies how the following three factors affect the scheduling of DNN training workloads:

  • the impact of gang scheduling and locality constraints on queueing (related work: kube-batch, etc.; a sketch follows below)
  • the impact of locality on GPU utilization
  • failures during training

Based on these observations, the authors propose some design guidelines to inform the next generation of schedulers built for DNN training.
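Not from the paper, just a minimal Python sketch of the first point: a gang scheduler that prefers to place a whole job on one machine and only relaxes the locality constraint after the job has queued past a timeout. `Job`, `Machine`, and `LOCALITY_RELAX_TIMEOUT` are all made up for illustration.

```python
from dataclasses import dataclass

@dataclass
class Machine:
    name: str
    free_gpus: int

@dataclass
class Job:
    name: str
    gpus_needed: int
    wait_time: float = 0.0  # seconds this job has spent queued

# Hypothetical knob: after this long in the queue, give up on strict
# locality and accept a placement spread across machines.
LOCALITY_RELAX_TIMEOUT = 600.0

def try_place(job: Job, machines: list[Machine]) -> list[Machine] | None:
    """Gang scheduling: either every GPU of the job is placed, or none."""
    # Preferred placement: all GPUs on a single machine (good locality).
    for m in machines:
        if m.free_gpus >= job.gpus_needed:
            return [m]
    # Relaxed placement: spread across machines once the job waited too long.
    if job.wait_time >= LOCALITY_RELAX_TIMEOUT:
        chosen, remaining = [], job.gpus_needed
        for m in sorted(machines, key=lambda m: -m.free_gpus):
            if remaining <= 0:
                break
            if m.free_gpus > 0:
                chosen.append(m)
                remaining -= min(m.free_gpus, remaining)
        if remaining == 0:
            return chosen
    # Otherwise keep the whole gang in the queue.
    return None
```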

@gaocegege
Member Author

Based on their own experience, the authors highlight three points worth noting (honestly, the kind of points I feel I could have come up with myself):

  • locality matters a lot
  • different jobs sharing GPUs on the same machine may interfere with each other
  • many failures should be caught earlier, for example via profiling (see the sketch below)

We plan to release traces used for our study and hope that insights and data from our study inform the burgeoning work of scheduling research for machine learning workloads. (please hurry up)
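For the third point, a rough sketch of what "catch failures early via profiling" could look like in practice (my own illustration, not from the paper): run a handful of iterations as a pre-flight check before the real job takes the GPUs for hours. `build_model`, `get_batch`, and `train_step` are hypothetical placeholders.

```python
def preflight_check(build_model, get_batch, train_step, num_iters=3):
    """Run a few training iterations up front so cheap-to-catch failures
    (bad hyperparameters, OOM, shape mismatches) surface before the job
    occupies the cluster for hours. All callables are placeholders."""
    try:
        model = build_model()
        for _ in range(num_iters):
            train_step(model, get_batch())
    except Exception as exc:
        # Fail fast so the scheduler can reclaim the GPUs immediately.
        raise RuntimeError(f"pre-flight check failed: {exc}") from exc
```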

@gaocegege
Member Author

The workloads studied here are training jobs for models such as LSTMs and CNNs, built with frameworks like TF, PyTorch, Caffe, and MXNet. In the distributed setting, data parallelism is used; both AllReduce and parameter-server style updates are supported.
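For reference, a minimal data-parallel sketch using PyTorch DistributedDataParallel, which averages gradients across workers with AllReduce; the toy model and training loop are just placeholders, not the paper's setup.

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # Assumes the launcher (e.g. torchrun) sets RANK / WORLD_SIZE / MASTER_ADDR.
    dist.init_process_group(backend="nccl")
    device = torch.device("cuda", dist.get_rank() % torch.cuda.device_count())

    model = torch.nn.Linear(128, 10).to(device)   # placeholder model
    ddp_model = DDP(model, device_ids=[device.index])
    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)

    for _ in range(10):  # placeholder training loop
        x = torch.randn(32, 128, device=device)
        y = torch.randint(0, 10, (32,), device=device)
        loss = torch.nn.functional.cross_entropy(ddp_model(x), y)
        optimizer.zero_grad()
        loss.backward()   # gradients are averaged across workers via AllReduce
        optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```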


The scheduler in this paper is built on YARN; the comparison with other schedulers is shown in the figure:

[figure: comparison with other schedulers]

@gaocegege
Member Author

The rest of the paper validates the three points above through experiments and proposes some guidelines. I won't go into them here; see the paper for details.

@at15
Member

at15 commented Jan 27, 2019

@gaocegege If you're so good, go do it yourself then

@gaocegege
Member Author

Well, I've already quit research and gone to industry, haven't I?
