[SPARK-14082][MESOS] Enable GPU support with Mesos #14644
Conversation
Does Mesos have a general node labeling mechanism like YARN that one could use to choose machines with a certain property? It seems OK to also support a Mesos-specific property under spark.mesos.* if this is broken out separately in Mesos, but just wondering if that provides a simpler and more generic solution.
Test build #63776 has finished for PR 14644 at commit
Test build #63777 has finished for PR 14644 at commit
Test build #63792 has finished for PR 14644 at commit
@srowen Mesos supports node labels as well (which is how constraints are implemented in the Spark framework). However, GPUs are implemented as a resource, since we want to account for the number of GPUs instead of just placing a task there. As for the config name, I just picked that to begin with. I was also thinking we should consider a generic config name (spark.gpus?) as I believe it could be reused. But I wasn't sure how we want to account for this yet, as GPUs are quite different from CPUs (Mesos currently just does an integer number of GPUs, with no sharing or topology information yet). Do you have suggestions?
OK, that makes sense. I think a property under spark.mesos makes sense right now.
@tnachen does this resolve SPARK-14082? /cc @mgummelt
@@ -103,6 +103,7 @@ private[spark] class MesosCoarseGrainedSchedulerBackend(
   private val stateLock = new ReentrantLock

   val extraCoresPerExecutor = conf.getInt("spark.mesos.extra.cores", 0)
+  val maxGpus = conf.getInt("spark.mesos.gpus.max", 0)
I know in a previous iteration of this patch we were using the config variable spark.mesos.gpu.enabled=true instead of spark.mesos.gpu.max=XX. Why did this change? If I need to set a max, how do I know how big to make the value? Do the Spark configuration scripts allow me to run bash commands to programmatically count the total number of GPUs on the machine, if that's my desired max?
My thought was that with only a boolean flag, a Spark job either uses all GPUs from a host or none, which means different GPU devices can't be shared by different jobs. By specifying a limit, a job at least has the ability to say how many GPUs it should grab per node. Thoughts?
Yeah, it makes sense. The only additional question I had was whether there's a way to auto-discover the number of GPUs on the machine and do some math on it inside the config, i.e. does the config allow bash syntax or something, so I can inspect /dev/ and count the number of GPUs installed? I'm picturing a scenario where I'd like to deploy the same config to a bunch of hosts, but have the hosts autodiscover this value themselves.
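For illustration, here is a minimal Scala sketch of the kind of autodiscovery being asked about, assuming NVIDIA device nodes named nvidia0, nvidia1, ... under /dev (this helper is hypothetical and not part of the patch). Spark properties files are plain key-value pairs and don't evaluate shell, so a value like this would have to be computed by a wrapper script before submission:

```scala
import java.io.File

object GpuCount {
  // Count device nodes matching nvidia<N> under /dev; returns 0 if the
  // directory can't be listed. Purely illustrative autodiscovery.
  def countGpus(devDir: String = "/dev"): Int =
    Option(new File(devDir).listFiles())
      .map(_.count(_.getName.matches("nvidia\\d+")))
      .getOrElse(0)

  def main(args: Array[String]): Unit =
    println(countGpus()) // e.g. feed this into a generated Spark config
}
```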
@klueska I think you do not need to autodiscover anything; the concept is similar to cores.max in the scheduler. Thoughts?
@tnachen I think there should be some logic checking the current total against the configured max GPUs, as in the case of cpusMax, and I don't see any. I expect offers to be split; in that case we need to check the sum of the assigned GPUs against the max, right?
Am I missing something?
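For concreteness, a minimal sketch of the accounting being suggested here, with illustrative names that may not match the final patch: keep a running total across offers so the cap holds however the offers are split.

```scala
// Illustrative GPU accounting across (possibly split) resource offers.
class GpuAccounting(maxGpus: Int) {
  private var totalGpusAcquired = 0

  // How many GPUs to take from a single offer so that the running total
  // never exceeds the configured aggregate maximum.
  def gpusToTake(offeredGpus: Int): Int = {
    val remaining = math.max(0, maxGpus - totalGpusAcquired)
    val taken = math.min(remaining, offeredGpus)
    totalGpusAcquired += taken
    taken
  }
}
```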
I just looked up what the semantics of "cores.max" are, and they seem to be slightly different from what this patch includes:
http://spark.apache.org/docs/latest/running-on-mesos.html#coarse-grained
It says that if "cores.max" is not set, Spark will just blindly accept all CPU offers it is given. In the current patch, no GPU resources will be accepted if "gpus.max" is not set.
Which sounds sensible to me, since GPUs are not usually required to run your Spark job. Also, cores.max is an aggregate max, whereas gpus.max in the current patch is a per-node max. I think I will change this to work the way cores.max does, but default to 0.
I'm not saying it's not sensible. I'm just trying to figure out what I can do to tell it to accept all GPUs in an offer (which is what I want in my setup). Some offers have more than others, and it feels weird to just pick a really big number to ensure that I get them all.
I see; in this case it's the same semantics as cores.max, so I think using a really big number seems right to me.
@tnachen I think it is only a threshold for each individual offer, not truly per node. You may get multiple offers for GPUs from the same node, correct? From what I see, this PR does not do any counting of how many GPUs have been assigned per node so far.
I just tested this against the newest GPU support in Mesos 1.0.1 and everything seems to work as expected. My only question is the use of
Tim, please file a JIRA!
Test build #65700 has finished for PR 14644 at commit
Test build #65708 has finished for PR 14644 at commit
Test build #65775 has finished for PR 14644 at commit
@klueska Just updated the patch, and I think it's using the right semantics now: it has a global GPU max, just like cores. Can you try it out?
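For reference, a hedged sketch of what setting the new aggregate limit might look like from application code (the master URL, app name, and value of 8 are placeholders; spark.mesos.gpus.max is the property this patch introduces):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object GpuJob {
  def main(args: Array[String]): Unit = {
    // Cap this job at 8 GPUs in aggregate across the cluster, mirroring
    // how spark.cores.max caps CPU cores; the default of 0 requests none.
    val conf = new SparkConf()
      .setMaster("mesos://zk://zk-host:2181/mesos") // placeholder master URL
      .setAppName("gpu-example")
      .set("spark.mesos.gpus.max", "8")
    val sc = new SparkContext(conf)
    // ... job body ...
    sc.stop()
  }
}
```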
2,3: we don't fail if you ask for more GPUs, since it's not a hard requirement but simply a max, just like how cores.max works. I didn't add a required-amount setting, but we can certainly add it in the future.
...so if you ask for 1 GPU, you may only get 0?
Yeah, the default GPU requirement I have is 0 (cores per executor/node is 1).
Test build #66160 has finished for PR 14644 at commit
Test build #66613 has finished for PR 14644 at commit
Merged to master
## What changes were proposed in this pull request?
Enable GPU resources to be used when running coarse grain mode with Mesos.
## How was this patch tested?
Manual test with GPU.
Author: Timothy Chen <tnachen@gmail.com>
Closes apache#14644 from tnachen/gpu_mesos.
We have some servers running 8 GPUs on Mesos. I would like to run Spark on them, but I need to be able to allocate a GPU per map phase from Spark. On Hadoop 3.0 you can do spark.yarn.executor.resource.yarn.io/gpu. I have a Spark job that receives a list of files to process; each map in Spark should call a C program that reads a chunk of the list and processes it on the GPU. For this I need Spark to recognize the GPU allocated by Mesos (i.e. "GPU0 is yours"), and of course Mesos needs to mark that GPU as used. With gpus.max this is not possible.