[SPARK-28639][CORE][DOC] Configuration doc for Barrier Execution Mode
## What changes were proposed in this pull request?

SPARK-24817 and SPARK-24819 introduced 3 new non-internal properties for barrier execution mode, but they are not documented.
So I've added a section to configuration.md for barrier execution mode.

## How was this patch tested?
Built the docs with Jekyll and confirmed the layout in a browser.

Closes #25370 from sarutak/barrier-exec-mode-conf-doc.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
sarutak authored and srowen committed Aug 11, 2019
1 parent 58cc0df commit 31ef268
Showing 2 changed files with 45 additions and 1 deletion.
@@ -1058,7 +1058,7 @@ package object config {
ConfigBuilder("spark.barrier.sync.timeout")
.doc("The timeout in seconds for each barrier() call from a barrier task. If the " +
"coordinator didn't receive all the sync messages from barrier tasks within the " +
"configed time, throw a SparkException to fail all the tasks. The default value is set " +
"configured time, throw a SparkException to fail all the tasks. The default value is set " +
"to 31536000(3600 * 24 * 365) so the barrier() call shall wait for one year.")
.timeConf(TimeUnit.SECONDS)
.checkValue(v => v > 0, "The value should be a positive time value.")
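
For context (editor's illustration, not part of this patch): the timeout above bounds how long each `barrier()` call in a barrier stage waits for all tasks to sync. A minimal Scala sketch of such a job is below; the app name, master, and timeout value are illustrative assumptions, not recommendations.

```scala
import org.apache.spark.{BarrierTaskContext, SparkConf}
import org.apache.spark.sql.SparkSession

object BarrierTimeoutSketch {
  def main(args: Array[String]): Unit = {
    // Illustrative: lower the barrier() wait from the one-year default to 10 minutes.
    val conf = new SparkConf()
      .setAppName("barrier-timeout-sketch")
      .setMaster("local[4]") // a barrier stage needs at least as many free slots as tasks
      .set("spark.barrier.sync.timeout", "600")

    val spark = SparkSession.builder().config(conf).getOrCreate()
    val sc = spark.sparkContext

    // All tasks in a barrier stage must reach barrier() before any of them proceeds;
    // if they do not sync within spark.barrier.sync.timeout, a SparkException fails the tasks.
    val result = sc.parallelize(1 to 8, numSlices = 4)
      .barrier()
      .mapPartitions { iter =>
        val ctx = BarrierTaskContext.get()
        ctx.barrier() // global sync point governed by the timeout above
        iter.map(_ * 2)
      }
      .collect()

    println(result.mkString(","))
    spark.stop()
  }
}
```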
44 changes: 44 additions & 0 deletions docs/configuration.md
@@ -2039,6 +2039,50 @@ Apart from these, the following properties are also available, and may be useful
</tr>
</table>

### Barrier Execution Mode

<table class="table">
<tr><th>Property Name</th><th>Default</th><th>Meaning</th></tr>
<tr>
<td><code>spark.barrier.sync.timeout</code></td>
<td>365d</td>
<td>
    The timeout in seconds for each <code>barrier()</code> call from a barrier task. If the
    coordinator does not receive all the sync messages from barrier tasks within the
    configured time, a SparkException is thrown to fail all the tasks. The default value is set
    to 31536000 (3600 * 24 * 365), so a <code>barrier()</code> call will wait for up to one year.
</td>
</tr>
<tr>
<td><code>spark.scheduler.barrier.maxConcurrentTasksCheck.interval</code></td>
<td>15s</td>
<td>
Time in seconds to wait between a max concurrent tasks check failure and the next
check. A max concurrent tasks check ensures the cluster can launch more concurrent
tasks than required by a barrier stage on job submitted. The check can fail in case
a cluster has just started and not enough executors have registered, so we wait for a
little while and try to perform the check again. If the check fails more than a
configured max failure times for a job then fail current job submission. Note this
config only applies to jobs that contain one or more barrier stages, we won't perform
the check on non-barrier jobs.
</td>
</tr>
<tr>
<td><code>spark.scheduler.barrier.maxConcurrentTasksCheck.maxFailures</code></td>
<td>40</td>
<td>
Number of max concurrent tasks check failures allowed before fail a job submission.
A max concurrent tasks check ensures the cluster can launch more concurrent tasks than
required by a barrier stage on job submitted. The check can fail in case a cluster
has just started and not enough executors have registered, so we wait for a little
while and try to perform the check again. If the check fails more than a configured
max failure times for a job then fail current job submission. Note this config only
applies to jobs that contain one or more barrier stages, we won't perform the check on
non-barrier jobs.
</td>
</tr>
</table>
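
Editor's note, not part of the committed docs: as a hedged sketch of how these three properties might be set together from application code, the Scala snippet below puts them on a `SparkConf` before building a session. The values shown are arbitrary examples, not recommendations; the same settings can equivalently be passed with `--conf` on `spark-submit`.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

// Illustrative values only: fail barrier() syncs after one hour, re-check
// slot availability every 30 seconds, and give up after 20 failed checks.
val conf = new SparkConf()
  .set("spark.barrier.sync.timeout", "3600")
  .set("spark.scheduler.barrier.maxConcurrentTasksCheck.interval", "30s")
  .set("spark.scheduler.barrier.maxConcurrentTasksCheck.maxFailures", "20")

val spark = SparkSession.builder()
  .appName("barrier-conf-example")
  .config(conf)
  .getOrCreate()
```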

### Dynamic Allocation

<table class="table">
