[SPARK-5836] [DOCS] [STREAMING] Clarify what may cause long-running Spark apps to preserve shuffle files #6901

Closed
wants to merge 1 commit into from

Conversation

srowen
Member

@srowen srowen commented Jun 19, 2015

Clarify what may cause long-running Spark apps to preserve shuffle files

@SparkQA

SparkQA commented Jun 19, 2015

Test build #35260 has finished for PR 6901 at commit a9faef0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@andrewor14
Contributor

LGTM. Eventually we want to address this behavior by forcing a periodic GC (once every 30 minutes or so should be inexpensive). For now this is a better description to have. Merging into master, 1.4, and 1.3.
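
To illustrate the periodic-GC idea mentioned above, here is a minimal sketch of a timer that runs an action at a fixed interval on a background thread. It is not Spark's implementation; the `PeriodicTrigger` class and its parameter names are hypothetical, and the PySpark wiring shown in the trailing comment is an assumption that relies on the private `sc._jvm` attribute.

```python
import threading


class PeriodicTrigger:
    """Calls `action` every `interval_s` seconds on a daemon thread until stopped."""

    def __init__(self, action, interval_s):
        self._action = action
        self._interval_s = interval_s
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def _run(self):
        # Event.wait returns False on timeout (time to fire the action)
        # and True once stop() has been called, which ends the loop.
        while not self._stop.wait(self._interval_s):
            self._action()

    def start(self):
        self._thread.start()

    def stop(self):
        self._stop.set()
        self._thread.join()


# Hypothetical wiring in a PySpark driver (assumption, not taken from this PR):
#   trigger = PeriodicTrigger(lambda: sc._jvm.java.lang.System.gc(), 30 * 60)
#   trigger.start()
```

A driver-side GC of this kind only matters because shuffle files are cleaned up once the objects referencing them are garbage collected; the interval is a trade-off between GC cost and how long stale files linger.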

asfgit pushed a commit that referenced this pull request Jun 19, 2015
…park apps to preserve shuffle files

Clarify what may cause long-running Spark apps to preserve shuffle files

Author: Sean Owen <sowen@cloudera.com>

Closes #6901 from srowen/SPARK-5836 and squashes the following commits:

a9faef0 [Sean Owen] Clarify what may cause long-running Spark apps to preserve shuffle files

(cherry picked from commit 4be53d0)
Signed-off-by: Andrew Or <andrew@databricks.com>
@asfgit asfgit closed this in 4be53d0 Jun 19, 2015
@@ -1144,9 +1144,11 @@ generate these on the reduce side. When data does not fit in memory Spark will s
to disk, incurring the additional overhead of disk I/O and increased garbage collection.

Shuffle also generates a large number of intermediate files on disk. As of Spark 1.3, these files
Contributor

I know this has been merged, but an annoying issue that I have found in docs (including mine, so I am guilty too) is the use of "as of Spark X". No one remembers to search for this pattern, so it never gets updated. Rather, we should use markdown variables: "as of Spark {{site.SPARK_VERSION_SHORT}}".
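
As a rough illustration of searching for the pattern described above, the following sketch scans documentation text for hard-coded "as of Spark X" phrases. The function name and regular expression are hypothetical, and the check works line by line, so a phrase split across two lines would be missed.

```python
import re

# Matches hard-coded phrases such as "As of Spark 1.3" (case-insensitive).
# A templated version like "as of Spark {{site.SPARK_VERSION_SHORT}}" does not match.
HARD_CODED_VERSION = re.compile(r"\bas of Spark\s+\d+(?:\.\d+)*", re.IGNORECASE)


def find_hard_coded_versions(text):
    """Return (line_number, matched_phrase) pairs for each hard-coded version phrase."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for match in HARD_CODED_VERSION.finditer(line):
            hits.append((lineno, match.group(0)))
    return hits
```

Run against a docs directory, a check like this would surface the phrases that need either a variable or a deliberate decision to leave the version fixed.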

Member Author

In this case I think the sense was '... in 1.3 and not before', so it can stay as is. Yes, in cases where the meaning is '... as of the latest version, which is currently 1.3, and maybe beyond' then it makes sense to introduce a replacement, or just remove the text altogether.

Contributor

Oh! I thought you meant it as the latter ... "as of the latest version". This is a little confusing. :/
Maybe it makes sense to remove it completely. The GC-based behavior has been present for 4 versions now, since Spark 1.0, and it's not going to change in the foreseeable future. So it's best to remove it. The only thing that may change in Spark 1.5 is that we induce GC periodically ourselves.

Member Author

I agree it could be removed too, even if it probably doesn't matter at this point since we are well beyond 1.3.

nemccarthy pushed a commit to nemccarthy/spark that referenced this pull request Jun 22, 2015
…park apps to preserve shuffle files

Clarify what may cause long-running Spark apps to preserve shuffle files

Author: Sean Owen <sowen@cloudera.com>

Closes apache#6901 from srowen/SPARK-5836 and squashes the following commits:

a9faef0 [Sean Owen] Clarify what may cause long-running Spark apps to preserve shuffle files

(cherry picked from commit 4be53d0)
Signed-off-by: Andrew Or <andrew@databricks.com>
@srowen srowen deleted the SPARK-5836 branch June 26, 2015 08:08