[SPARK-5836] [DOCS] [STREAMING] Clarify what may cause long-running Spark apps to preserve shuffle files #6901

srowen · 2015-06-19T10:15:29Z

Clarify what may cause long-running Spark apps to preserve shuffle files

SparkQA · 2015-06-19T12:07:38Z

Test build #35260 has finished for PR 6901 at commit a9faef0.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

andrewor14 · 2015-06-19T18:02:54Z

LGTM. Eventually we want to address this behavior by forcing a periodic GC (once every 30 minutes or so should be inexpensive). For now this is a better description to have. Merging into master, 1.4, and 1.3.
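For readers wondering what "forcing a periodic GC" could look like on the driver side, here is a minimal sketch in plain Java. It is not Spark's implementation and uses no Spark API; the class name `PeriodicGc`, the `start` and `runFor` helpers, and the counter are all hypothetical names introduced for illustration. The assumption is that shuffle files are cleaned up only after the objects referencing them are garbage collected in the driver JVM, so a scheduled `System.gc()` hint may trigger that cleanup sooner.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

public class PeriodicGc {
    // Counts how many GC hints have been issued (illustration only).
    static final AtomicLong gcRequests = new AtomicLong();

    // Starts a daemon thread that requests a JVM garbage collection at a fixed interval.
    static ScheduledExecutorService start(long interval, TimeUnit unit) {
        ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor(r -> {
                Thread t = new Thread(r, "periodic-gc");
                t.setDaemon(true); // do not keep the driver JVM alive
                return t;
            });
        scheduler.scheduleAtFixedRate(() -> {
            System.gc(); // only a hint; the JVM is free to ignore it
            gcRequests.incrementAndGet();
        }, interval, interval, unit);
        return scheduler;
    }

    // Runs the scheduler for a fixed duration and returns the number of GC hints issued.
    static long runFor(long intervalMs, long durationMs) {
        ScheduledExecutorService scheduler = start(intervalMs, TimeUnit.MILLISECONDS);
        try {
            Thread.sleep(durationMs);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            scheduler.shutdownNow();
        }
        return gcRequests.get();
    }

    public static void main(String[] args) {
        // A long-running driver would call start(30, TimeUnit.MINUTES) instead;
        // short intervals are used here only so the demo finishes quickly.
        System.out.println("GC hints issued: " + runFor(100, 350));
    }
}
```

In a long-running application the scheduler would be started once, next to context creation, with an interval such as the 30 minutes suggested above. Whether a GC hint actually frees disk space depends on whether the application still holds references to the relevant RDDs.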

…park apps to preserve shuffle files

Clarify what may cause long-running Spark apps to preserve shuffle files

Author: Sean Owen <sowen@cloudera.com>

Closes #6901 from srowen/SPARK-5836 and squashes the following commits:

a9faef0 [Sean Owen] Clarify what may cause long-running Spark apps to preserve shuffle files

(cherry picked from commit 4be53d0)
Signed-off-by: Andrew Or <andrew@databricks.com>

tdas · 2015-06-19T18:22:30Z

docs/programming-guide.md

@@ -1144,9 +1144,11 @@ generate these on the reduce side. When data does not fit in memory Spark will s
 to disk, incurring the additional overhead of disk I/O and increased garbage collection.

 Shuffle also generates a large number of intermediate files on disk. As of Spark 1.3, these files


I know this has been merged, but an annoying issue that I have found in the docs (including mine, so I am guilty too) is the use of the phrase "as of Spark X". No one remembers to search for this pattern, so it never gets updated. Instead we should use markdown variables: "as of Spark {{site.SPARK_VERSION_SHORT}}".

srowen · 2015-06-19

In this case I think the sense was '... in 1.3 and not before', so it can stay as is. Yes, in cases where the meaning is '... as of the latest version, which is currently 1.3, and maybe beyond', it makes sense to introduce a replacement, or just remove the text altogether.

tdas · 2015-06-19

Oh! I thought you meant it as the latter ... "as of the latest version". This is a little confusing. :/
Maybe it makes sense to remove it completely. The GC-based behavior has been present for 4 versions now, since Spark 1.0, and it's not going to change in the foreseeable future. So it's best to remove it. The only thing that may change in Spark 1.5 is that we induce GC periodically ourselves.

srowen · 2015-06-19

I agree it could be removed too, even if it probably doesn't matter at this point since we are well beyond 1.3.
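The "no one remembers to search for this pattern" problem raised in this thread could be checked mechanically. The sketch below is one hypothetical way to find hard-coded "as of Spark X.Y" phrases in a piece of documentation text. The class name `VersionPhraseFinder`, the method `findVersions`, and the regular expression are assumptions made for illustration and are not part of the Spark docs build.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class VersionPhraseFinder {
    // Matches phrases such as "As of Spark 1.3" or "as of Spark 1.4.1" (case-insensitive).
    // A templated version like {{site.SPARK_VERSION_SHORT}} does not match.
    static final Pattern AS_OF =
        Pattern.compile("as of Spark (\\d+(?:\\.\\d+)+)", Pattern.CASE_INSENSITIVE);

    // Returns every hard-coded version number that follows "as of Spark" in the text.
    static List<String> findVersions(String text) {
        List<String> versions = new ArrayList<>();
        Matcher m = AS_OF.matcher(text);
        while (m.find()) {
            versions.add(m.group(1));
        }
        return versions;
    }

    public static void main(String[] args) {
        String line = "Shuffle also generates a large number of intermediate files on disk. "
            + "As of Spark 1.3, these files";
        System.out.println(findVersions(line));
    }
}
```

Each hit still needs a human decision, as the discussion above shows: a phrase meaning "in 1.3 and not before" is correct as written, while one meaning "as of the latest version" is a candidate for the variable or for removal.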


Clarify what may cause long-running Spark apps to preserve shuffle files

a9faef0

srowen mentioned this pull request Jun 19, 2015

[SPARK-5750][SPARK-3441][SPARK-5836][CORE] Added documentation explaining shuffle #5074

Closed

asfgit closed this in 4be53d0 Jun 19, 2015

tdas reviewed Jun 19, 2015

srowen deleted the SPARK-5836 branch June 26, 2015 08:08
