diff --git a/docs/configuration.md b/docs/configuration.md
index a7a1477b35628..a8fddbc084568 100644
--- a/docs/configuration.md
+++ b/docs/configuration.md
@@ -143,6 +143,7 @@ of the most common options to set are:
 The name of your application. This will appear in the UI and in log data.
+ 0.9.0
 spark.driver.cores
@@ -206,6 +207,7 @@ of the most common options to set are:
 spark.driver.resource.{resourceName}.discoveryScript for the driver to find the resource on startup.
+ 3.0.0
 spark.driver.resource.{resourceName}.discoveryScript
@@ -216,6 +218,7 @@ of the most common options to set are:
 name and an array of addresses. For a client-submitted driver, discovery script must assign different resource addresses to this driver comparing to other drivers on the same host.
+ 3.0.0
 spark.driver.resource.{resourceName}.vendor
@@ -226,6 +229,7 @@ of the most common options to set are:
 the Kubernetes device plugin naming convention. (e.g. For GPUs on Kubernetes this config would be set to nvidia.com or amd.com)
+ 3.0.0
 spark.resources.discoveryPlugin
@@ -293,6 +297,7 @@ of the most common options to set are:
 spark.executor.resource.{resourceName}.discoveryScript for the executor to find the resource on startup.
+ 3.0.0
 spark.executor.resource.{resourceName}.discoveryScript
@@ -302,6 +307,7 @@ of the most common options to set are:
 write to STDOUT a JSON string in the format of the ResourceInformation class. This has a name and an array of addresses.
+ 3.0.0
 spark.executor.resource.{resourceName}.vendor
@@ -312,6 +318,7 @@ of the most common options to set are:
 the Kubernetes device plugin naming convention. (e.g. For GPUs on Kubernetes this config would be set to nvidia.com or amd.com)
+ 3.0.0
 spark.extraListeners
@@ -337,6 +344,7 @@ of the most common options to set are:
 Note: This will be overridden by SPARK_LOCAL_DIRS (Standalone), MESOS_SANDBOX (Mesos) or LOCAL_DIRS (YARN) environment variables set by the cluster manager.
+ 0.5.0
 spark.logConf
@@ -344,6 +352,7 @@ of the most common options to set are:
 Logs the effective SparkConf as INFO when a SparkContext is started.
+ 0.9.0
 spark.master
@@ -352,6 +361,7 @@ of the most common options to set are:
 The cluster manager to connect to. See the list of allowed master URL's.
+ 0.9.0
 spark.submit.deployMode
@@ -467,6 +477,7 @@ Apart from these, the following properties are also available, and may be useful
 Instead, please set this through the --driver-java-options command line option or in your default properties file.
+ 3.0.0
 spark.driver.extraJavaOptions
@@ -540,6 +551,7 @@ Apart from these, the following properties are also available, and may be useful
 verbose gc logging to a file named for the executor ID of the app in /tmp, pass a 'value' of: -verbose:gc -Xloggc:/tmp/{{APP_ID}}-{{EXECUTOR_ID}}.gc
+ 3.0.0
 spark.executor.extraJavaOptions
@@ -636,6 +648,7 @@ Apart from these, the following properties are also available, and may be useful
 Add the environment variable specified by EnvironmentVariableName to the Executor process. The user can specify multiple of these to set multiple environment variables.
+ 0.9.0
 spark.redaction.regex
@@ -659,7 +672,7 @@ Apart from these, the following properties are also available, and may be useful
 By default the pyspark.profiler.BasicProfiler will be used, but this can be overridden by passing a profiler class in as a parameter to the SparkContext constructor.
-
+ 1.2.0
 spark.python.profile.dump
@@ -670,6 +683,7 @@ Apart from these, the following properties are also available, and may be useful
 by pstats.Stats().
 If this is specified, the profile result will not be displayed automatically.
+ 1.2.0
 spark.python.worker.memory
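The hunks above cover the basic application options (spark.app.name, spark.master) and the resource discovery properties. As an editorial illustration, not part of the patch, here is a minimal Scala sketch that sets a few of them; the master URL, script path and GPU count are placeholder values:

```scala
import org.apache.spark.sql.SparkSession

// Minimal sketch; master URL, script path and GPU count are placeholders.
val spark = SparkSession.builder()
  .appName("my-app")                                      // spark.app.name (since 0.9.0)
  .master("yarn")                                         // spark.master (since 0.9.0)
  .config("spark.executor.resource.gpu.amount", "1")
  .config("spark.executor.resource.gpu.discoveryScript",  // script must print ResourceInformation JSON,
          "/opt/spark/scripts/getGpus.sh")                // e.g. {"name": "gpu", "addresses": ["0"]}
  .getOrCreate()
```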
@@ -680,6 +694,7 @@ Apart from these, the following properties are also available, and may be useful
 (e.g. 512m, 2g). If the memory used during aggregation goes above this amount, it will spill the data into disks.
+ 1.1.0
 spark.python.worker.reuse
@@ -727,6 +742,7 @@ Apart from these, the following properties are also available, and may be useful
 repositories given by the command-line option --repositories. For more details, see Advanced Dependency Management.
+ 1.5.0
 spark.jars.excludes
@@ -735,6 +751,7 @@ Apart from these, the following properties are also available, and may be useful
 Comma-separated list of groupId:artifactId, to exclude while resolving the dependencies provided in spark.jars.packages to avoid dependency conflicts.
+ 1.5.0
 spark.jars.ivy
@@ -744,6 +761,7 @@ Apart from these, the following properties are also available, and may be useful
 spark.jars.packages. This will override the Ivy property ivy.default.ivy.user.dir which defaults to ~/.ivy2.
+ 1.3.0
 spark.jars.ivySettings
@@ -756,6 +774,7 @@ Apart from these, the following properties are also available, and may be useful
 artifact server like Artifactory. Details on the settings file format can be found at Settings Files
+ 2.2.0
 spark.jars.repositories
@@ -764,6 +783,7 @@ Apart from these, the following properties are also available, and may be useful
 Comma-separated list of additional remote repositories to search for the maven coordinates given with --packages or spark.jars.packages.
+ 2.3.0
 spark.pyspark.driver.python
@@ -849,6 +869,7 @@ Apart from these, the following properties are also available, and may be useful
 set to a non-zero value. This retry logic helps stabilize large shuffles in the face of long GC pauses or transient network connectivity issues.
+ 1.2.0
 spark.shuffle.io.numConnectionsPerPeer
@@ -858,6 +879,7 @@ Apart from these, the following properties are also available, and may be useful
 large clusters. For clusters with many hard disks and few hosts, this may result in insufficient concurrency to saturate all disks, and so users may consider increasing this value.
+ 1.2.1
 spark.shuffle.io.preferDirectBufs
@@ -867,6 +889,7 @@ Apart from these, the following properties are also available, and may be useful
 block transfer. For environments where off-heap memory is tightly limited, users may wish to turn this off to force all allocations from Netty to be on-heap.
+ 1.2.0
 spark.shuffle.io.retryWait
@@ -875,6 +898,7 @@ Apart from these, the following properties are also available, and may be useful
 (Netty only) How long to wait between retries of fetches. The maximum delay caused by retrying is 15 seconds by default, calculated as maxRetries * retryWait.
+ 1.2.1
 spark.shuffle.io.backLog
@@ -887,6 +911,7 @@ Apart from these, the following properties are also available, and may be useful
 application (see spark.shuffle.service.enabled option below). If set below 1, will fallback to OS default defined by Netty's io.netty.util.NetUtil#SOMAXCONN.
+ 1.1.1
 spark.shuffle.service.enabled
@@ -915,6 +940,7 @@ Apart from these, the following properties are also available, and may be useful
 Cache entries limited to the specified memory footprint, in bytes unless otherwise specified.
+ 2.3.0
 spark.shuffle.maxChunksBeingTransferred
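For the dependency and shuffle-retry properties in the hunks above, a short editorial sketch (not part of the patch); the Maven coordinate and the tuning values are arbitrary examples:

```scala
import org.apache.spark.SparkConf

// Sketch only; the coordinate and the tuning values are arbitrary examples.
val conf = new SparkConf()
  .set("spark.jars.packages", "org.apache.kafka:kafka-clients:2.4.0") // resolved via Ivy; see spark.jars.ivy / spark.jars.repositories
  .set("spark.shuffle.io.maxRetries", "6")                            // (Netty only) retries of failed shuffle fetches
  .set("spark.shuffle.io.retryWait", "10s")                           // max extra delay is roughly maxRetries * retryWait
```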
@@ -926,6 +952,7 @@ Apart from these, the following properties are also available, and may be useful
 spark.shuffle.io.retryWait), if those limits are reached the task will fail with fetch failure.
+ 2.3.0
 spark.shuffle.sort.bypassMergeThreshold
@@ -1233,6 +1260,7 @@ Apart from these, the following properties are also available, and may be useful
 How many finished executions the Spark UI and status APIs remember before garbage collecting.
+ 1.5.0
 spark.streaming.ui.retainedBatches
@@ -1240,6 +1268,7 @@ Apart from these, the following properties are also available, and may be useful
 How many finished batches the Spark UI and status APIs remember before garbage collecting.
+ 1.0.0
 spark.ui.retainedDeadExecutors
@@ -1633,6 +1662,7 @@ Apart from these, the following properties are also available, and may be useful
 Default number of partitions in RDDs returned by transformations like join, reduceByKey, and parallelize when not set by user.
+ 0.5.0
 spark.executor.heartbeatInterval
@@ -1652,6 +1682,7 @@ Apart from these, the following properties are also available, and may be useful
 Communication timeout to use when fetching files added through SparkContext.addFile() from the driver.
+ 1.0.0
 spark.files.useFetchCache
@@ -1664,6 +1695,7 @@ Apart from these, the following properties are also available, and may be useful
 disabled in order to use Spark local directories that reside on NFS filesystems (see SPARK-6313 for more details).
+ 1.2.2
 spark.files.overwrite
@@ -1672,6 +1704,7 @@ Apart from these, the following properties are also available, and may be useful
 Whether to overwrite files added through SparkContext.addFile() when the target file exists and its contents do not match those of the source.
+ 1.0.0
 spark.files.maxPartitionBytes
@@ -1692,23 +1725,29 @@ Apart from these, the following properties are also available, and may be useful
 2.1.0
- spark.hadoop.cloneConf
- false
- If set to true, clones a new Hadoop Configuration object for each task. This
+ spark.hadoop.cloneConf
+ false
+
+ If set to true, clones a new Hadoop Configuration object for each task. This
 option should be enabled to work around Configuration thread-safety issues (see SPARK-2546 for more details). This is disabled by default in order to avoid unexpected performance regressions for jobs that
- are not affected by these issues.
+ are not affected by these issues.
+
+ 1.0.3
- spark.hadoop.validateOutputSpecs
- true
- If set to true, validates the output specification (e.g. checking if the output directory already exists)
+ spark.hadoop.validateOutputSpecs
+ true
+
+ If set to true, validates the output specification (e.g. checking if the output directory already exists)
 used in saveAsHadoopFile and other variants. This can be disabled to silence exceptions due to pre-existing
- output directories. We recommend that users do not disable this except if trying to achieve compatibility with
- previous versions of Spark. Simply use Hadoop's FileSystem API to delete output directories by hand.
- This setting is ignored for jobs generated through Spark Streaming's StreamingContext, since
- data may need to be rewritten to pre-existing output directories during checkpoint recovery.
+ output directories. We recommend that users do not disable this except if trying to achieve compatibility
+ with previous versions of Spark. Simply use Hadoop's FileSystem API to delete output directories by hand.
+ This setting is ignored for jobs generated through Spark Streaming's StreamingContext, since data may
+ need to be rewritten to pre-existing output directories during checkpoint recovery.
+
+ 1.0.1
 spark.storage.memoryMapThreshold
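The spark.hadoop.* rows rewritten above are set like any other SparkConf entry; an editorial sketch follows (values are illustrative only, and disabling output-spec validation is generally discouraged by the doc itself):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Illustrative values only; not part of the patch.
val sc = new SparkContext(
  new SparkConf()
    .setAppName("hadoop-conf-demo")
    .setMaster("local[*]")
    .set("spark.default.parallelism", "8")             // partitions for join/reduceByKey/parallelize when unset
    .set("spark.hadoop.validateOutputSpecs", "false")  // skip "output directory already exists" checks
    .set("spark.hadoop.cloneConf", "true")             // clone a Hadoop Configuration per task (thread-safety workaround)
)
```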
@@ -1728,6 +1767,7 @@ Apart from these, the following properties are also available, and may be useful
 Version 2 may have better performance, but version 1 may handle failures better in certain situations, as per MAPREDUCE-4815.
+ 2.2.0
@@ -1842,7 +1882,7 @@ Apart from these, the following properties are also available, and may be useful
 need to be increased, so that incoming connections are not dropped when a large number of connections arrives in a short period of time.
-
+ 3.0.0
 spark.network.timeout
@@ -1865,7 +1905,7 @@ Apart from these, the following properties are also available, and may be useful
 block transfer. For environments where off-heap memory is tightly limited, users may wish to turn this off to force all allocations to be on-heap.
-
+ 3.0.0
 spark.port.maxRetries
@@ -1877,7 +1917,7 @@ Apart from these, the following properties are also available, and may be useful
 essentially allows it to try a range of ports from the start port specified to port + maxRetries.
-
+ 1.1.1
 spark.rpc.numRetries
@@ -1920,7 +1960,7 @@ Apart from these, the following properties are also available, and may be useful
 out and giving up. To avoid unwilling timeout caused by long pause like GC, you can set larger value.
-
+ 1.1.1
 spark.network.maxRemoteBlockSizeFetchToMem
@@ -2053,6 +2093,7 @@ Apart from these, the following properties are also available, and may be useful
 that register to the listener bus. Consider increasing value, if the listener events corresponding to shared queue are dropped. Increasing this value may result in the driver using more memory.
+ 3.0.0
 spark.scheduler.listenerbus.eventqueue.appStatus.capacity
@@ -2062,6 +2103,7 @@ Apart from these, the following properties are also available, and may be useful
 Consider increasing value, if the listener events corresponding to appStatus queue are dropped. Increasing this value may result in the driver using more memory.
+ 3.0.0
 spark.scheduler.listenerbus.eventqueue.executorManagement.capacity
@@ -2071,6 +2113,7 @@ Apart from these, the following properties are also available, and may be useful
 executor management listeners. Consider increasing value if the listener events corresponding to executorManagement queue are dropped. Increasing this value may result in the driver using more memory.
+ 3.0.0
 spark.scheduler.listenerbus.eventqueue.eventLog.capacity
@@ -2080,6 +2123,7 @@ Apart from these, the following properties are also available, and may be useful
 that write events to eventLogs. Consider increasing value if the listener events corresponding to eventLog queue are dropped. Increasing this value may result in the driver using more memory.
+ 3.0.0
 spark.scheduler.listenerbus.eventqueue.streams.capacity
@@ -2089,6 +2133,7 @@ Apart from these, the following properties are also available, and may be useful
 Consider increasing value if the listener events corresponding to streams queue are dropped. Increasing this value may result in the driver using more memory.
+ 3.0.0
 spark.scheduler.blacklist.unschedulableTaskSetTimeout
@@ -2271,6 +2316,7 @@ Apart from these, the following properties are also available, and may be useful
 in order to assign resource slots (e.g. a 0.2222 configuration, or 1/0.2222 slots will become 4 tasks/resource, not 5).
+ 3.0.0
 spark.task.maxFailures
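The task-resource hunk above notes that fractional amounts are rounded down to whole task slots per resource address. A brief editorial sketch of the common case, with placeholder numbers:

```scala
import org.apache.spark.SparkConf

// Sketch: one GPU per executor shared by four concurrent tasks (1 / 0.25 = 4 slots).
// A value like 0.2222 would likewise yield floor(1 / 0.2222) = 4 slots, not 5.
val conf = new SparkConf()
  .set("spark.executor.resource.gpu.amount", "1")
  .set("spark.task.resource.gpu.amount", "0.25")
```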
@@ -2335,6 +2381,7 @@ Apart from these, the following properties are also available, and may be useful
 Number of consecutive stage attempts allowed before a stage is aborted.
+ 2.2.0
@@ -2526,13 +2573,14 @@ like shuffle, just replace "rpc" with "shuffle" in the property names except
 spark.{driver|executor}.rpc.netty.dispatcher.numThreads, which is only for RPC module.
- Property Name | Default | Meaning
+ Property Name | Default | Meaning | Since Version
 spark.{driver|executor}.rpc.io.serverThreads
 Fall back on spark.rpc.io.serverThreads
 Number of threads used in the server thread pool
+ 1.6.0
@@ -2540,6 +2588,7 @@ like shuffle, just replace "rpc" with "shuffle" in the property names except
 spark.{driver|executor}.rpc.io.clientThreads
 Fall back on spark.rpc.io.clientThreads
 Number of threads used in the client thread pool
+ 1.6.0
@@ -2547,6 +2596,7 @@ like shuffle, just replace "rpc" with "shuffle" in the property names except
 spark.{driver|executor}.rpc.netty.dispatcher.numThreads
 Fall back on spark.rpc.netty.dispatcher.numThreads
 Number of threads used in RPC message dispatcher thread pool
+ 3.0.0
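The table above documents the per-role RPC thread pools; as an editorial sketch (not part of the patch), here is how the driver-side instances of those patterns could be set, with placeholder thread counts rather than tuning advice:

```scala
import org.apache.spark.SparkConf

// Placeholder values; unset, these fall back on the corresponding spark.rpc.* settings.
val conf = new SparkConf()
  .set("spark.driver.rpc.io.serverThreads", "16")            // instance of spark.{driver|executor}.rpc.io.serverThreads
  .set("spark.driver.rpc.io.clientThreads", "16")
  .set("spark.driver.rpc.netty.dispatcher.numThreads", "8")
```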
@@ -2728,7 +2778,7 @@ Spark subsystems.
 Executable for executing R scripts in client modes for driver. Ignored in cluster modes.
-
+ 1.5.3
 spark.r.shell.command
@@ -2737,7 +2787,7 @@ Spark subsystems.
 Executable for executing sparkR shell in client modes for driver. Ignored in cluster modes. It is the same as environment variable SPARKR_DRIVER_R, but take precedence over it. spark.r.shell.command is used for sparkR shell while spark.r.driver.command is used for running R script.
-
+ 2.1.0
 spark.r.backendConnectionTimeout
@@ -2769,6 +2819,7 @@ Spark subsystems.
 Checkpoint interval for graph and message in Pregel. It used to avoid stackOverflowError due to long lineage chains after lots of iterations. The checkpoint is disabled by default.
+ 2.2.0
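The last hunk adds the version for spark.graphx.pregel.checkpointInterval. As a closing editorial sketch of how that interval is used together with a checkpoint directory (the interval and path are placeholders):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Sketch: checkpoint Pregel graph and messages every 10 iterations to cut long lineage chains.
// The interval and checkpoint directory are placeholders.
val sc = new SparkContext(
  new SparkConf()
    .setAppName("pregel-checkpoint-demo")
    .setMaster("local[*]")
    .set("spark.graphx.pregel.checkpointInterval", "10")
)
sc.setCheckpointDir("/tmp/spark-checkpoints")
```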