
[SPARK-27725][EXAMPLES] Add an example discovery Script for GPU resources #24731

Closed
wants to merge 2 commits

Conversation

tgravescs
Contributor

What changes were proposed in this pull request?

An example GPU resource discovery script that can be used with NVIDIA GPUs and passed to Spark via spark.{driver/executor}.resource.gpu.discoveryScript.

For example:
./bin/spark-shell --master yarn --deploy-mode client \
  --driver-memory 1g --conf spark.yarn.am.memory=3g \
  --num-executors 1 --executor-memory 1g \
  --conf spark.driver.resource.gpu.count=2 --executor-cores 1 \
  --conf spark.driver.resource.gpu.discoveryScript=/home/tgraves/workspace/tgravescs-spark/examples/src/main/resources/getGpusResources.sh \
  --conf spark.executor.resource.gpu.count=1 \
  --conf spark.task.resource.gpu.count=1 \
  --conf spark.executor.resource.gpu.discoveryScript=/home/tgraves/workspace/tgravescs-spark/examples/src/main/resources/getGpusResources.sh
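For reference, a minimal sketch of how such a discovery script could work (the actual getGpusResources.sh added in this PR may differ; this assumes nvidia-smi is available on the host). It lists the GPU indices and prints them in the JSON format Spark expects from a discovery script:

#!/usr/bin/env bash
# List GPU indices (one per line), wrap each in quotes, and join them with commas.
ADDRS=$(nvidia-smi --query-gpu=index --format=csv,noheader \
  | awk '{printf "%s\"%s\"", sep, $1; sep=","}')
# Emit the JSON resource description, e.g. {"name": "gpu", "addresses": ["0","1"]}.
echo "{\"name\": \"gpu\", \"addresses\": [${ADDRS}]}"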

How was this patch tested?

Manually tested in local-cluster mode and YARN mode. Tested on a node with 8 GPUs and one with 2 GPUs.

@SparkQA

SparkQA commented May 28, 2019

Test build #105874 has finished for PR 24731 at commit 1e41b2e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

@abellina left a comment


Ran on a box with 1 gpu and got the expected result. LGTM

@dongjoon-hyun dongjoon-hyun changed the title [SPARK-27725][Core] Add an example discovery Script for GPU resources [SPARK-27725][EXAMPLES] Add an example discovery Script for GPU resources May 30, 2019
Member

@felixcheung left a comment


shouldn't this be under spark/examples/src/main/ and not resources?

Member

@dongjoon-hyun left a comment


+1, LGTM. Merged to master. Thank you, @tgravescs and @abellina.

On an EC2 g3.8xlarge instance, I tested this PR in the following ways.

  1. Compare the result of the script itself with the nvidia-smi result.
$ examples/src/main/resources/getGpusResources.sh
{"name": "gpu", "addresses":["0","1"]}

$ nvidia-smi
Thu May 30 05:30:36 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.40.04    Driver Version: 418.40.04    CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla M60           On   | 00000000:00:1D.0 Off |                    0 |
| N/A   30C    P8    23W / 150W |      0MiB /  7618MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla M60           On   | 00000000:00:1E.0 Off |                    0 |
| N/A   37C    P8    22W / 150W |      0MiB /  7618MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
  2. Check the success case with the output of SparkContext (spark.driver.resource.gpu.count <= the actual resources)
19/05/30 05:23:50 INFO SparkContext: Running Spark version 3.0.0-SNAPSHOT
19/05/30 05:23:51 INFO SparkContext: ===============================================================================
19/05/30 05:23:51 INFO SparkContext: Driver Resources:
19/05/30 05:23:51 INFO SparkContext: gpu -> [name: gpu, addresses: 0,1]
19/05/30 05:23:51 INFO SparkContext: ===============================================================================
  3. Check the failure case with the output of SparkContext (spark.driver.resource.gpu.count > the actual resources)
19/05/30 05:29:37 ERROR SparkContext: Error initializing SparkContext.
org.apache.spark.SparkException: Resource: gpu, with addresses: 0,1 is less than what the user requested: 3)
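
The exact command for this failure case isn't shown above; presumably it was the same kind of launch but requesting more GPUs than the two the script reports on this box, roughly:

./bin/spark-shell --master yarn \
  --conf spark.driver.resource.gpu.count=3 \
  --conf spark.driver.resource.gpu.discoveryScript=examples/src/main/resources/getGpusResources.sh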

@dongjoon-hyun
Member

dongjoon-hyun commented May 30, 2019

@felixcheung. We could make src/main/sh (or scripts?), but it's more natural to consider this a resource or configuration, isn't it?

@tgravescs
Contributor Author

tgravescs commented May 30, 2019

Thanks for the reviews.

Yeah, I didn't want to put it in the top-level main directory since everything else there has a subdirectory. We already have python/r/scala, but those are really Spark programming languages and this isn't, so that's why I put it under resources.
I could create a scripts directory there; thoughts on that?

@tgravescs
Contributor Author

@felixcheung, are you ok with the scripts directory? If so, I'll file another JIRA and move it.

@felixcheung
Member

script sounds good

@tgravescs
Contributor Author

#24754 to move
