[SPARK-27725][EXAMPLES] Add an example discovery Script for GPU resources #24731

tgravescs · 2019-05-28T16:13:19Z

What changes were proposed in this pull request?

This PR adds an example GPU resource discovery script for NVIDIA GPUs. It can be passed to Spark via spark.{driver/executor}.resource.gpu.discoveryScript.

For example:
./bin/spark-shell --master yarn --deploy-mode client \
  --driver-memory 1g \
  --conf spark.yarn.am.memory=3g \
  --num-executors 1 \
  --executor-memory 1g \
  --conf spark.driver.resource.gpu.count=2 \
  --executor-cores 1 \
  --conf spark.driver.resource.gpu.discoveryScript=/home/tgraves/workspace/tgravescs-spark/examples/src/main/resources/getGpusResources.sh \
  --conf spark.executor.resource.gpu.count=1 \
  --conf spark.task.resource.gpu.count=1 \
  --conf spark.executor.resource.gpu.discoveryScript=/home/tgraves/workspace/tgravescs-spark/examples/src/main/resources/getGpusResources.sh
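For readers who have not seen a discovery script: judging by the output quoted later in this thread, it only needs to print a single JSON object with a resource name and a list of addresses. The following is a minimal sketch of that idea, not necessarily the script added by this PR. The helper name format_gpu_addresses is hypothetical, and the sketch assumes nvidia-smi is on the PATH of the machine where it runs.

```shell
#!/bin/sh
# Sketch only: format_gpu_addresses is a hypothetical helper, not code from this PR.
# It reads one GPU index per line on stdin and prints a JSON object in the
# {"name": ..., "addresses": [...]} shape shown in this thread.
format_gpu_addresses() {
  addrs=""
  # `read` trims surrounding whitespace; the `|| [ -n ... ]` part keeps a
  # final line that has no trailing newline.
  while read -r idx || [ -n "$idx" ]; do
    [ -z "$idx" ] && continue          # skip blank lines
    addrs="${addrs:+$addrs,}\"$idx\""  # append a quoted index, comma-separated
  done
  printf '{"name": "gpu", "addresses":[%s]}\n' "$addrs"
}

# Query the hardware only when nvidia-smi is actually installed.
if command -v nvidia-smi >/dev/null 2>&1; then
  nvidia-smi --query-gpu=index --format=csv,noheader | format_gpu_addresses
fi
```

Keeping the formatting in a function that reads stdin makes it possible to try the output format on a machine without GPUs, for example with printf '0\n1\n' | format_gpu_addresses.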

How was this patch tested?

Manually tested in local-cluster mode and YARN mode, on a node with 8 GPUs and on a node with 2 GPUs.

SparkQA · 2019-05-28T16:29:54Z

Test build #105874 has finished for PR 24731 at commit 1e41b2e.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

abellina

Ran on a box with 1 GPU and got the expected result. LGTM

felixcheung

Shouldn't this be under spark/examples/src/main/ rather than under resources?

dongjoon-hyun

+1, LGTM. Merged to master. Thank you, @tgravescs and @abellina.

On an EC2 g3.8xlarge instance, I tested this PR in the following ways.

Compare the output of the script itself with the nvidia-smi output.

$ examples/src/main/resources/getGpusResources.sh
{"name": "gpu", "addresses":["0","1"]}

$ nvidia-smi
Thu May 30 05:30:36 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.40.04    Driver Version: 418.40.04    CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla M60           On   | 00000000:00:1D.0 Off |                    0 |
| N/A   30C    P8    23W / 150W |      0MiB /  7618MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla M60           On   | 00000000:00:1E.0 Off |                    0 |
| N/A   37C    P8    22W / 150W |      0MiB /  7618MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Check the success case in the SparkContext output (spark.driver.resource.gpu.count <= the actual resources).

19/05/30 05:23:50 INFO SparkContext: Running Spark version 3.0.0-SNAPSHOT
19/05/30 05:23:51 INFO SparkContext: ===============================================================================
19/05/30 05:23:51 INFO SparkContext: Driver Resources:
19/05/30 05:23:51 INFO SparkContext: gpu -> [name: gpu, addresses: 0,1]
19/05/30 05:23:51 INFO SparkContext: ===============================================================================

Check the failure case in the SparkContext output (spark.driver.resource.gpu.count > the actual resources).

19/05/30 05:29:37 ERROR SparkContext: Error initializing SparkContext.
org.apache.spark.SparkException: Resource: gpu, with addresses: 0,1 is less than what the user requested: 3)
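The success and failure cases above come down to comparing the requested count with the number of addresses the discovery script reports. As a rough illustration only, the same comparison can be done as a pre-flight check before submitting a job. This is not Spark's own validation code. The helper names count_addresses and check_gpu_request are hypothetical, and the sed-based parsing assumes the exact single-line JSON shape shown earlier in this comment.

```shell
#!/bin/sh
# Sketch only: both helpers are hypothetical and are not part of Spark.

# Count the entries in the "addresses" array of a discovery-script JSON line.
count_addresses() {
  printf '%s\n' "$1" \
    | sed -n 's/.*"addresses": *\[\(.*\)].*/\1/p' \
    | tr ',' '\n' \
    | grep -c .
}

# Return 0 if the requested GPU count fits within the discovered addresses,
# otherwise print a message to stderr and return 1.
check_gpu_request() {
  req=$1
  found=$(count_addresses "$2")
  if [ "$req" -le "$found" ]; then
    echo "ok: requested $req, discovered $found"
  else
    echo "error: requested $req, but only $found discovered" >&2
    return 1
  fi
}
```

With the two-GPU output from this test, check_gpu_request 2 '{"name": "gpu", "addresses":["0","1"]}' succeeds, while a request for 3 returns a non-zero status, mirroring the two SparkContext cases.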

dongjoon-hyun · 2019-05-30T05:36:07Z

@felixcheung, we could create src/main/sh (or scripts?), but isn't it more natural to treat this as a resource or configuration?

tgravescs · 2019-05-30T12:36:01Z

Thanks for the reviews.

Yeah, I didn't want to put it directly in the top-level main directory, since everything else there lives in a subdirectory. We already have python/r/scala, but those are Spark programming languages and this isn't, which is why I put it under resources.
I could create a scripts directory there. Thoughts?

tgravescs · 2019-05-30T21:26:15Z

@felixcheung, are you OK with the scripts directory? If so, I'll file another JIRA and move it.

felixcheung · 2019-05-31T04:00:22Z

script sounds good

tgravescs · 2019-05-31T13:48:22Z

Opened #24754 to move it.

tgravescs added 2 commits May 28, 2019 10:54

[SPARK-27725] GPU Scheduling - add an example discovery Script

97287c0

remove extra docs

1e41b2e

abellina approved these changes May 29, 2019

dongjoon-hyun changed the title ~~[SPARK-27725][Core] Add an example discovery Script for GPU resources~~ [SPARK-27725][EXAMPLES] Add an example discovery Script for GPU resources May 30, 2019

felixcheung reviewed May 30, 2019

dongjoon-hyun approved these changes May 30, 2019

dongjoon-hyun closed this in d0a5aea May 30, 2019

tgravescs mentioned this pull request May 31, 2019

[SPARK-27897] [EXAMPLES] Move the get Gpu resources script to a scripts directory #24754

Closed
