[SPARK-27725][EXAMPLES] Add an example discovery Script for GPU resources #24731
Test build #105874 has finished for PR 24731 at commit
Ran on a box with 1 gpu and got the expected result. LGTM
Shouldn't this be under spark/examples/src/main/ and not resources?
+1, LGTM. Merged to master. Thank you, @tgravescs and @abellina.
On an EC2 g3.8xlarge instance, I tested this PR in the following ways.
- Compare the result of the script itself with the nvidia-smi result.
$ examples/src/main/resources/getGpusResources.sh
{"name": "gpu", "addresses":["0","1"]}
$ nvidia-smi
Thu May 30 05:30:36 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.40.04 Driver Version: 418.40.04 CUDA Version: 10.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla M60 On | 00000000:00:1D.0 Off | 0 |
| N/A 30C P8 23W / 150W | 0MiB / 7618MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla M60 On | 00000000:00:1E.0 Off | 0 |
| N/A 37C P8 22W / 150W | 0MiB / 7618MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
- Check the success case with the output of SparkContext (spark.driver.resource.gpu.count <= the actual resources)
19/05/30 05:23:50 INFO SparkContext: Running Spark version 3.0.0-SNAPSHOT
19/05/30 05:23:51 INFO SparkContext: ===============================================================================
19/05/30 05:23:51 INFO SparkContext: Driver Resources:
19/05/30 05:23:51 INFO SparkContext: gpu -> [name: gpu, addresses: 0,1]
19/05/30 05:23:51 INFO SparkContext: ===============================================================================
- Check the failure case with the output of SparkContext (spark.driver.resource.gpu.count > the actual resources)
19/05/30 05:29:37 ERROR SparkContext: Error initializing SparkContext.
org.apache.spark.SparkException: Resource: gpu, with addresses: 0,1 is less than what the user requested: 3)
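The success and failure cases above come down to comparing the requested GPU count with the number of addresses the discovery script reports. The following is a minimal sketch of that comparison, assuming the JSON shape printed by the script earlier in this comment. `count_addresses` and `check_requested` are hypothetical helper names for illustration, not Spark APIs; Spark performs its own check internally when the SparkContext starts.

```shell
#!/bin/sh
# Hypothetical helpers, not part of Spark.

# Print how many addresses appear in a discovery-script JSON line such as
# {"name": "gpu", "addresses":["0","1"]}.
count_addresses() {
  json="$1"
  # Keep only the text between the brackets of the "addresses" array.
  inner=$(printf '%s' "$json" |
    sed -n 's/.*"addresses"[[:space:]]*:[[:space:]]*\[\([^]]*\)].*/\1/p')
  if [ -z "$inner" ]; then
    echo 0
    return 0
  fi
  # Each address is one quoted string; count them.
  printf '%s' "$inner" | grep -o '"[^"]*"' | wc -l | tr -d ' '
}

# Succeed when the requested count fits the discovered addresses,
# otherwise print an error to stderr and return non-zero.
check_requested() {
  available=$(count_addresses "$1")
  requested="$2"
  if [ "$requested" -le "$available" ]; then
    echo "ok: requested $requested, discovered $available"
  else
    echo "error: requested $requested, discovered only $available" >&2
    return 1
  fi
}
```

Calling `check_requested '{"name": "gpu", "addresses":["0","1"]}' 3` fails in the same situation as the failure case above, where three GPUs are requested on a two-GPU instance.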
@felixcheung. We can make
Thanks for the reviews. I didn't want to put it in the top-level main directory, since everything else there has a subdirectory. We already have python/r/scala, but those are really Spark programming languages and this isn't, which is why I put it under resources.
@felixcheung, are you OK with the scripts directory? If so, I'll file another JIRA and move it.
script sounds good
Opened #24754 to move it.
What changes were proposed in this pull request?
Adds an example GPU resource discovery script that can be used with Nvidia GPUs and passed to Spark via spark.{driver/executor}.resource.gpu.discoveryScript.
For example:
./bin/spark-shell --master yarn --deploy-mode client \
  --driver-memory 1g --conf spark.yarn.am.memory=3g \
  --num-executors 1 --executor-memory 1g \
  --conf spark.driver.resource.gpu.count=2 \
  --executor-cores 1 \
  --conf spark.driver.resource.gpu.discoveryScript=/home/tgraves/workspace/tgravescs-spark/examples/src/main/resources/getGpusResources.sh \
  --conf spark.executor.resource.gpu.count=1 \
  --conf spark.task.resource.gpu.count=1 \
  --conf spark.executor.resource.gpu.discoveryScript=/home/tgraves/workspace/tgravescs-spark/examples/src/main/resources/getGpusResources.sh
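For readers who want a feel for what such a discovery script does, here is a minimal sketch. It is not necessarily the script added in this PR. It assumes that `nvidia-smi --query-gpu=index --format=csv,noheader` prints one GPU index per line and that Spark expects the single JSON line shown in the review thread. The function names `format_gpu_resources` and `discover_gpus` are hypothetical.

```shell
#!/bin/sh
# Sketch only: turn GPU indices (one per line on stdin) into the JSON
# line that a discovery script prints on stdout.
format_gpu_resources() {
  addrs=""
  # The "|| [ -n ... ]" part keeps a final line that lacks a newline.
  while IFS= read -r idx || [ -n "$idx" ]; do
    # Strip whitespace and skip blank lines.
    idx=$(printf '%s' "$idx" | tr -d '[:space:]')
    [ -z "$idx" ] && continue
    if [ -z "$addrs" ]; then
      addrs="\"$idx\""
    else
      addrs="$addrs,\"$idx\""
    fi
  done
  printf '{"name": "gpu", "addresses":[%s]}\n' "$addrs"
}

# Assumption: nvidia-smi is installed and on the PATH of the driver or
# executor host. A real script would end by calling this function.
discover_gpus() {
  nvidia-smi --query-gpu=index --format=csv,noheader | format_gpu_resources
}
```

Keeping the formatting separate from the `nvidia-smi` call makes it possible to exercise the JSON output on a machine without GPUs, for example with `printf '0\n1\n' | format_gpu_resources`.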
How was this patch tested?
Manually tested local cluster mode and yarn mode. Tested on a node with 8 GPUs and one with 2 GPUs.