Gandiva: Introspective Cluster Scheduling for Deep Learning #109

gaocegege · 2018-10-12T08:27:53Z

OSDI'18

https://www.usenix.org/system/files/osdi18-xiao.pdf

https://www.usenix.org/sites/default/files/conference/protected-files/osdi18_slides_sivathanu.pdf

The prototype is also built on k8s.

From Microsoft.

Jack47 · 2018-10-25T04:12:30Z

I'm reading this paper too; maybe we can share notes and discuss it together.

gaocegege · 2018-10-25T04:13:25Z

@Jack47

Sure, I hope to write up some reading notes soon.

gaocegege · 2018-10-25T04:13:42Z

I'd also recommend #86, which is similar work.

gaocegege · 2018-11-08T09:10:50Z

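The first author of this paper is a senior PhD alumnus from BUPT (Beijing University of Posts and Telecommunications). I've heard there are plans to open-source it, so that's something to look forward to.

The paper builds on the following key insights:

1. Deep learning is feedback-driven exploration: users often launch a batch of training jobs and keep the one with the best result. Think of scenarios such as hyperparameter search or model architecture search.
2. Jobs are heterogeneous in their resource usage, which makes an optimal solution hard to reach.
3. Intra-job predictability, a key concept throughout the paper: as the figure from the paper shows, GPU memory usage is periodic to some degree.

The paper exploits the third key insight in several ways and optimizes for these scenarios. Its main contribution is a set of new primitives for scheduling machine learning workloads, including time-sharing of GPUs, job migration, and dynamic GPU allocation. However, it does not explain how many of these primitives are actually implemented, for example Grow-Shrink and time-sharing, or how it preserves resource isolation while still delivering high performance.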
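Since the paper leaves the mechanics open, here is only a toy illustration of what time-sharing built on intra-job predictability could look like, not Gandiva's actual implementation. It assumes a job can be switched out only at mini-batch boundaries and that its memory trace repeats with a known period. The names `Job`, `time_slice`, and `lowest_memory_step` are hypothetical.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    remaining_minibatches: int


def lowest_memory_step(memory_trace, period):
    """Return the offset inside one cycle where memory usage is lowest.

    Assumption: the trace repeats every `period` samples, so looking at
    the first cycle is enough to pick a cheap point to suspend at.
    """
    one_cycle = memory_trace[:period]
    return min(range(len(one_cycle)), key=one_cycle.__getitem__)


def time_slice(jobs, quantum):
    """Round-robin a single GPU among jobs, `quantum` mini-batches at a time.

    Returns the schedule as a list of (job name, mini-batches run) tuples.
    """
    if quantum <= 0:
        raise ValueError("quantum must be positive")
    queue = deque(jobs)
    schedule = []
    while queue:
        job = queue.popleft()
        ran = min(quantum, job.remaining_minibatches)
        job.remaining_minibatches -= ran
        schedule.append((job.name, ran))
        if job.remaining_minibatches > 0:
            queue.append(job)  # suspended; waits for its next turn
    return schedule


suspend_offset = lowest_memory_step([6, 9, 12, 3, 6, 9, 12, 3], period=4)
schedule = time_slice([Job("a", 3), Job("b", 1)], quantum=2)
```

In this sketch the scheduler only decides who runs next; the open questions raised above (isolation, and the real cost of switching) are exactly what it leaves out.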

gaocegege · 2018-11-27T02:55:52Z

I still don't understand how Grow-Shrink is implemented.

gaocegege · 2018-12-03T06:07:34Z

The paper's coverage of the AutoML scenario is rather thin.

huyutuo · 2019-02-22T01:49:23Z

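I don't quite understand how suspend-resume is implemented. The paper mentions that some state on the GPU has to be moved to the CPU before the job is suspended. My question is: how is that state transferred from the GPU to the CPU? Thanks.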

huyutuo · 2019-02-22T06:50:35Z

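I don't quite understand how suspend-resume is implemented. The paper mentions that some state on the GPU has to be moved to the CPU before the job is suspended. How is the communication between the GPU and the CPU carried out in this step? I'd appreciate your help. Thanks!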

gaocegege · 2019-12-30T04:10:31Z

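@huyutuo Sorry, I only just saw this.

The paper doesn't go into much detail, so I'm not sure either. My guess is that it dumps the objects in GPU memory entirely to host memory, but I don't know exactly how that is done. CUDA probably has some APIs for this.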

JiangShanCode · 2022-03-15T12:35:37Z

Has this been open-sourced yet? I searched and couldn't find anything.

gaocegege · 2022-03-16T01:25:59Z

I don't think so, if I remember correctly.

Jack47 · 2022-03-18T02:10:45Z

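> I don't quite understand how suspend-resume is implemented. The paper mentions that some state on the GPU has to be moved to the CPU before the job is suspended. How is the communication between the GPU and the CPU carried out in this step? I'd appreciate your help. Thanks!

This is just the ordinary checkpoint mechanism, much like a save file in a game. Search Google for "pytorch checkpoint" and you'll find it. https://pytorch.org/docs/stable/checkpoint.html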
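To make the "save file" analogy concrete, here is a minimal, framework-agnostic sketch of suspending and resuming a training loop through a checkpoint on disk. It is an illustration under assumptions, not what Gandiva does: the training state is a plain dict, `train` is a stand-in loop, and all function names are hypothetical. In PyTorch the analogous calls would be `torch.save` and `torch.load` applied to `model.state_dict()` and `optimizer.state_dict()`; note that `torch.utils.checkpoint` is a different technique (activation checkpointing during the backward pass).

```python
import os
import pickle
import tempfile


def save_checkpoint(path, state):
    """Write the training state to disk, replacing any older file atomically."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Read back a state previously written by save_checkpoint."""
    with open(path, "rb") as f:
        return pickle.load(f)


def train(state, steps):
    """Stand-in for a training loop: each step nudges a single weight."""
    for _ in range(steps):
        state["weight"] += 0.5
        state["step"] += 1
    return state


def run_with_suspend(total_steps, suspend_at, path):
    """Train, suspend via a checkpoint, then resume from disk and finish."""
    state = train({"step": 0, "weight": 0.0}, suspend_at)
    save_checkpoint(path, state)
    del state  # the suspended job no longer holds its state in memory
    state = load_checkpoint(path)
    return train(state, total_steps - state["step"])


with tempfile.TemporaryDirectory() as workdir:
    final = run_with_suspend(10, 4, os.path.join(workdir, "job.ckpt"))
```

The resumed run ends in the same state as an uninterrupted one, which is the property any suspend-resume scheme has to preserve, whatever it uses underneath to move state off the GPU.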

gaocegege · 2022-03-18T02:15:22Z

It's probably not that simple; a checkpoint is a complete dump to memory or disk.

lambda7xx · 2022-04-01T00:06:06Z

Isn't this a collaboration between Beihang University and MSRA? It was published several years ago, so I don't think it will be open-sourced.

gaocegege added area/scheduler TODO-未读 type/paper area/large-scale-ml area/ml star labels Oct 12, 2018

gaocegege removed TODO-未读 star labels Dec 3, 2018

yylin1 added the yylin1/未讀 label Dec 6, 2018

gaocegege mentioned this issue Nov 19, 2019

Characterizing and Synthesizing Task Dependencies of Data-Parallel Jobs in Alibaba Cloud #193

Open
