From 0fb3a40d8bc6d9186328348881a0bd0a28895124 Mon Sep 17 00:00:00 2001
From: felixcheung
Date: Sun, 27 Dec 2015 19:21:42 -0800
Subject: [PATCH 1/6] update doc

---
 docs/configuration.md   | 18 +++++++++++-------
 docs/running-on-yarn.md |  7 ++++++-
 2 files changed, 17 insertions(+), 8 deletions(-)

diff --git a/docs/configuration.md b/docs/configuration.md
index a9ef37a9b1cd9..70b5028b0b5d9 100644
--- a/docs/configuration.md
+++ b/docs/configuration.md
@@ -120,7 +120,8 @@ of the most common options to set are:
   spark.driver.cores
   1
-  Number of cores to use for the driver process, only in cluster mode.
+  Number of cores to use for the driver process, only in cluster mode. This can be set through
+  --driver-cores command line option.
   spark.driver.maxResultSize
@@ -151,7 +152,8 @@ of the most common options to set are:
   spark.executor.memory
   1g
-  Amount of memory to use per executor process (e.g. 2g, 8g).
+  Amount of memory to use per executor process (e.g. 2g, 8g). This can
+  be set through the --executor-memory command line option.
@@ -173,7 +175,7 @@ of the most common options to set are:
   stored on disk. This should be on a fast, local disk in your system. It can also be a comma-separated list of multiple directories on different disks.
-  NOTE: In Spark 1.0 and later this will be overriden by SPARK_LOCAL_DIRS (Standalone, Mesos) or
+  NOTE: In Spark 1.0 and later this will be overridden by SPARK_LOCAL_DIRS (Standalone, Mesos) or
   LOCAL_DIRS (YARN) environment variables set by the cluster manager.
@@ -687,10 +689,10 @@ Apart from these, the following properties are also available, and may be useful
   spark.rdd.compress
   false
-  Whether to compress serialized RDD partitions (e.g. for
-  StorageLevel.MEMORY_ONLY_SER in Java
-  and Scala or StorageLevel.MEMORY_ONLY in Python).
-  Can save substantial space at the cost of some extra CPU time.
+  Whether to compress serialized RDD partitions (e.g. for
+  StorageLevel.MEMORY_ONLY_SER in Java
+  and Scala or StorageLevel.MEMORY_ONLY in Python).
+  Can save substantial space at the cost of some extra CPU time.
@@ -850,6 +852,8 @@ Apart from these, the following properties are also available, and may be useful
   In standalone mode, setting this parameter allows an application to run multiple executors on
   the same worker, provided that there are enough cores on that worker. Otherwise, only one
   executor per application will run on each worker.
+
+  This can be set through --executor-cores in the command line.
diff --git a/docs/running-on-yarn.md b/docs/running-on-yarn.md
index 06413f83c3a71..3d1035db3329a 100644
--- a/docs/running-on-yarn.md
+++ b/docs/running-on-yarn.md
@@ -193,6 +193,7 @@ If you need a reference to the proper location to put log files in the YARN so t
   (none)
   Comma separated list of archives to be extracted into the working directory of each executor.
+  This can be set through --archives command line option.
@@ -207,6 +208,7 @@ If you need a reference to the proper location to put log files in the YARN so t
   2
   The number of executors. Note that this property is incompatible with spark.dynamicAllocation.enabled. If both spark.dynamicAllocation.enabled and spark.executor.instances are specified, dynamic allocation is turned off and the specified number of spark.executor.instances is used.
+  This can be set through --num-executors command line option.
@@ -241,7 +243,8 @@ If you need a reference to the proper location to put log files in the YARN so t
   spark.yarn.queue
   default
-  The name of the YARN queue to which the application is submitted.
+  The name of the YARN queue to which the application is submitted. This can be set through
+  --queue command line option.
@@ -359,6 +362,7 @@ If you need a reference to the proper location to put log files in the YARN so t
   The full path to the file that contains the keytab for the principal specified above. This keytab will be copied to the node running the YARN Application Master via the Secure Distributed Cache, for renewing the login tickets and the delegation tokens periodically. (Works also with the "local" master)
+  This can be set through --keytab command line option.
@@ -366,6 +370,7 @@ If you need a reference to the proper location to put log files in the YARN so t
   (none)
   Principal to be used to login to KDC, while running on secure HDFS.
   (Works also with the "local" master)
+  This can be set through --principal command line option.
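For context, each of the flags documented in this patch is a standard spark-submit option. A minimal sketch of a YARN cluster-mode submission exercising them; the class name and JAR are hypothetical placeholders, not part of the patch:

```bash
# Sketch: flags documented in PATCH 1/6 and the properties they set.
#   --driver-cores    -> spark.driver.cores
#   --executor-cores  -> spark.executor.cores
#   --executor-memory -> spark.executor.memory
#   --num-executors   -> spark.executor.instances
#   --queue           -> spark.yarn.queue
./bin/spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --driver-cores 2 \
  --executor-cores 4 \
  --executor-memory 8g \
  --num-executors 8 \
  --queue default \
  --class com.example.MyApp \
  my-app.jar
```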
From 27c6976cb33c8a418635a46255301b027db8615c Mon Sep 17 00:00:00 2001
From: felixcheung
Date: Mon, 28 Dec 2015 17:35:43 -0800
Subject: [PATCH 2/6] move command line option to yarn doc

---
 docs/configuration.md   |  8 ++------
 docs/running-on-yarn.md | 22 ++++++++++++++++++++++
 2 files changed, 24 insertions(+), 6 deletions(-)

diff --git a/docs/configuration.md b/docs/configuration.md
index 70b5028b0b5d9..2b2ac76f4dd45 100644
--- a/docs/configuration.md
+++ b/docs/configuration.md
@@ -120,8 +120,7 @@ of the most common options to set are:
   spark.driver.cores
   1
-  Number of cores to use for the driver process, only in cluster mode. This can be set through
-  --driver-cores command line option.
+  Number of cores to use for the driver process, only in cluster mode.
   spark.driver.maxResultSize
@@ -152,8 +151,7 @@ of the most common options to set are:
   spark.executor.memory
   1g
-  Amount of memory to use per executor process (e.g. 2g, 8g). This can
-  be set through the --executor-memory command line option.
+  Amount of memory to use per executor process (e.g. 2g, 8g).
@@ -852,8 +850,6 @@ Apart from these, the following properties are also available, and may be useful
   In standalone mode, setting this parameter allows an application to run multiple executors on
   the same worker, provided that there are enough cores on that worker. Otherwise, only one
   executor per application will run on each worker.
-
-  This can be set through --executor-cores in the command line.
diff --git a/docs/running-on-yarn.md b/docs/running-on-yarn.md
index 3d1035db3329a..5316631efa27f 100644
--- a/docs/running-on-yarn.md
+++ b/docs/running-on-yarn.md
@@ -120,6 +120,7 @@ If you need a reference to the proper location to put log files in the YARN so t
   Number of cores used by the driver in YARN cluster mode. Since the driver is run in the same JVM as the YARN Application Master in cluster mode, this also controls the cores used by the YARN Application Master. In client mode, use spark.yarn.am.cores to control the number of cores used by the YARN Application Master instead.
+  This can be set through --driver-cores command line option.
@@ -203,6 +204,19 @@ If you need a reference to the proper location to put log files in the YARN so t
   Comma-separated list of files to be placed in the working directory of each executor.
+
+  spark.executor.cores
+  1 in YARN mode, all the available cores on the worker in standalone mode.
+
+  The number of cores to use on each executor. For YARN and standalone mode only.
+
+  In standalone mode, setting this parameter allows an application to run multiple executors on
+  the same worker, provided that there are enough cores on that worker. Otherwise, only one
+  executor per application will run on each worker.
+
+  This can be set through --executor-cores in the command line.
+
+
   spark.executor.instances
   2
   The number of executors. Note that this property is incompatible with spark.dynamicAllocation.enabled. If both spark.dynamicAllocation.enabled and spark.executor.instances are specified, dynamic allocation is turned off and the specified number of spark.executor.instances is used.
   This can be set through --num-executors command line option.
+
+  spark.executor.memory
+  1g
+
+  Amount of memory to use per executor process (e.g. 2g, 8g). This can
+  be set through the --executor-memory command line option.
+
+
   spark.yarn.executor.memoryOverhead
   executorMemory * 0.10, with minimum of 384
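Since this patch documents the command-line flag alongside its configuration property, note that the two spellings are interchangeable at submit time. A sketch, with an arbitrary value and a placeholder JAR:

```bash
# Equivalent submissions: the dedicated flag on the first line, the generic
# --conf property spelling on the second (my-app.jar is a placeholder).
./bin/spark-submit --master yarn --executor-cores 4 my-app.jar
./bin/spark-submit --master yarn --conf spark.executor.cores=4 my-app.jar
```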
From a934bf9d1f47038808c149443e9f56cedc015e40 Mon Sep 17 00:00:00 2001
From: felixcheung
Date: Mon, 18 Jan 2016 20:33:55 -0800
Subject: [PATCH 3/6] moved config properties to job-scheduling

---
 docs/job-scheduling.md  |  5 ++++-
 docs/running-on-yarn.md | 13 ++-----------
 2 files changed, 6 insertions(+), 12 deletions(-)

diff --git a/docs/job-scheduling.md b/docs/job-scheduling.md
index 36327c6efeaf3..04d0d4c27f507 100644
--- a/docs/job-scheduling.md
+++ b/docs/job-scheduling.md
@@ -39,7 +39,10 @@ Resource allocation can be configured as follows, based on the cluster type:
   and optionally set `spark.cores.max` to limit each application's resource share as in the standalone mode.
   You should also set `spark.executor.memory` to control the executor memory.
 * **YARN:** The `--num-executors` option to the Spark YARN client controls how many executors it will allocate
-  on the cluster, while `--executor-memory` and `--executor-cores` control the resources per executor.
+  on the cluster (`spark.executor.instances` as configuration property), while `--executor-memory`
+  (`spark.executor.memory` configuration property) and `--executor-cores` (`spark.executor.cores`) configuration
+  property) control the resources per executor. For more information, see the
+  [YARN Spark Properties](running-on-yarn.html).

 A second option available on Mesos is _dynamic sharing_ of CPU cores. In this mode, each Spark application
 still has a fixed and independent memory allocation (set by `spark.executor.memory`), but when the
diff --git a/docs/running-on-yarn.md b/docs/running-on-yarn.md
index 5316631efa27f..1dd35b008e338 100644
--- a/docs/running-on-yarn.md
+++ b/docs/running-on-yarn.md
@@ -120,7 +120,6 @@ If you need a reference to the proper location to put log files in the YARN so t
   Number of cores used by the driver in YARN cluster mode. Since the driver is run in the same JVM as the YARN Application Master in cluster mode, this also controls the cores used by the YARN Application Master. In client mode, use spark.yarn.am.cores to control the number of cores used by the YARN Application Master instead.
-  This can be set through --driver-cores command line option.
@@ -194,7 +193,6 @@ If you need a reference to the proper location to put log files in the YARN so t
   (none)
   Comma separated list of archives to be extracted into the working directory of each executor.
-  This can be set through --archives command line option.
@@ -213,8 +211,6 @@ If you need a reference to the proper location to put log files in the YARN so t
   In standalone mode, setting this parameter allows an application to run multiple executors on
   the same worker, provided that there are enough cores on that worker. Otherwise, only one
   executor per application will run on each worker.
-
-  This can be set through --executor-cores in the command line.
@@ -222,15 +218,13 @@ If you need a reference to the proper location to put log files in the YARN so t
   2
   The number of executors. Note that this property is incompatible with spark.dynamicAllocation.enabled. If both spark.dynamicAllocation.enabled and spark.executor.instances are specified, dynamic allocation is turned off and the specified number of spark.executor.instances is used.
-  This can be set through --num-executors command line option.

   spark.executor.memory
   1g
-  Amount of memory to use per executor process (e.g. 2g, 8g). This can
-  be set through the --executor-memory command line option.
+  Amount of memory to use per executor process (e.g. 2g, 8g).
@@ -265,8 +259,7 @@ If you need a reference to the proper location to put log files in the YARN so t
   spark.yarn.queue
   default
-  The name of the YARN queue to which the application is submitted. This can be set through
-  --queue command line option.
+  The name of the YARN queue to which the application is submitted.
@@ -384,7 +377,6 @@ If you need a reference to the proper location to put log files in the YARN so t
   The full path to the file that contains the keytab for the principal specified above. This keytab will be copied to the node running the YARN Application Master via the Secure Distributed Cache, for renewing the login tickets and the delegation tokens periodically. (Works also with the "local" master)
-  This can be set through --keytab command line option.
@@ -392,7 +384,6 @@ If you need a reference to the proper location to put log files in the YARN so t
   (none)
   Principal to be used to login to KDC, while running on secure HDFS.
   (Works also with the "local" master)
-  This can be set through --principal command line option.
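The job-scheduling text now names the property behind each flag. For completeness, a sketch of the same three settings expressed as properties in the defaults file instead of per submission (path and values are illustrative assumptions):

```bash
# Same allocation as --num-executors/--executor-memory/--executor-cores,
# persisted in the defaults file rather than passed on each spark-submit:
cat >> conf/spark-defaults.conf <<'EOF'
spark.executor.instances  8
spark.executor.memory     4g
spark.executor.cores      2
EOF
```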
From 664bc70a2e05d76b1619345a0b51c1198dfca045 Mon Sep 17 00:00:00 2001
From: felixcheung
Date: Mon, 18 Jan 2016 20:37:33 -0800
Subject: [PATCH 4/6] fix format

---
 docs/job-scheduling.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/job-scheduling.md b/docs/job-scheduling.md
index 04d0d4c27f507..71701475b622b 100644
--- a/docs/job-scheduling.md
+++ b/docs/job-scheduling.md
@@ -40,7 +40,7 @@ Resource allocation can be configured as follows, based on the cluster type:
   You should also set `spark.executor.memory` to control the executor memory.
 * **YARN:** The `--num-executors` option to the Spark YARN client controls how many executors it will allocate
   on the cluster (`spark.executor.instances` as configuration property), while `--executor-memory`
-  (`spark.executor.memory` configuration property) and `--executor-cores` (`spark.executor.cores`) configuration
+  (`spark.executor.memory` configuration property) and `--executor-cores` (`spark.executor.cores` configuration
   property) control the resources per executor. For more information, see the
   [YARN Spark Properties](running-on-yarn.html).
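One interaction worth keeping in mind from the spark.executor.instances entry touched above: the property conflicts with dynamic allocation. A sketch of the documented behavior (my-app.jar is a placeholder):

```bash
# Per the documented behavior: when both settings are supplied, dynamic
# allocation is turned off and the fixed executor count wins.
./bin/spark-submit --master yarn \
  --conf spark.dynamicAllocation.enabled=true \
  --num-executors 8 \
  my-app.jar   # runs with 8 fixed executors; dynamic allocation is disabled
```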
From 6983d15524d260ae49574b27e47221b902165f30 Mon Sep 17 00:00:00 2001
From: felixcheung
Date: Mon, 18 Jan 2016 20:39:07 -0800
Subject: [PATCH 5/6] remove standalone mode text

---
 docs/running-on-yarn.md | 4 ----
 1 file changed, 4 deletions(-)

diff --git a/docs/running-on-yarn.md b/docs/running-on-yarn.md
index 1dd35b008e338..088f3a68777cc 100644
--- a/docs/running-on-yarn.md
+++ b/docs/running-on-yarn.md
@@ -207,10 +207,6 @@ If you need a reference to the proper location to put log files in the YARN so t
   1 in YARN mode, all the available cores on the worker in standalone mode.
   The number of cores to use on each executor. For YARN and standalone mode only.
-
-  In standalone mode, setting this parameter allows an application to run multiple executors on
-  the same worker, provided that there are enough cores on that worker. Otherwise, only one
-  executor per application will run on each worker.
From 94001ddf7eb8a366091b37efab68249a7eed043d Mon Sep 17 00:00:00 2001
From: felixcheung
Date: Tue, 19 Jan 2016 13:53:32 -0800
Subject: [PATCH 6/6] add spark.driver.memory

---
 docs/running-on-yarn.md | 13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/docs/running-on-yarn.md b/docs/running-on-yarn.md
index 088f3a68777cc..569e36d9eb2d7 100644
--- a/docs/running-on-yarn.md
+++ b/docs/running-on-yarn.md
@@ -113,6 +113,19 @@ If you need a reference to the proper location to put log files in the YARN so t
   Use lower-case suffixes, e.g. k, m, g, t, and p, for kibi-, mebi-, gibi-, tebi-, and pebibytes, respectively.
+
+  spark.driver.memory
+  1g
+
+  Amount of memory to use for the driver process, i.e. where SparkContext is initialized.
+  (e.g. 1g, 2g).
+
+  Note: In client mode, this config must not be set through the SparkConf
+  directly in your application, because the driver JVM has already started at that point.
+  Instead, please set this through the --driver-memory command line option
+  or in your default properties file.
+
+
   spark.driver.cores
   1
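The client-mode caveat in the new spark.driver.memory entry is the one case above where the flag and SparkConf are not interchangeable. A sketch of both sides of the note (my-app.jar is a placeholder):

```bash
# Effective in client mode: the heap is sized before the driver JVM starts.
./bin/spark-submit --master yarn --deploy-mode client \
  --driver-memory 4g \
  my-app.jar

# Ineffective in client mode, per the note above: by the time application
# code runs, the driver JVM has already started with its default heap.
#   new SparkConf().set("spark.driver.memory", "4g")
```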