Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-733] Add documentation on use of accumulators in lazy transformation #4022

Closed
wants to merge 10 commits into from

Conversation

ilganeli
Copy link

I've added documentation clarifying the particular lack of clarity highlighted in the relevant JIRA. I've also added code examples for this issue to clarify the explanation.

@SparkQA
Copy link

SparkQA commented Jan 13, 2015

Test build #25474 has started for PR 4022 at commit df3afd7.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Jan 13, 2015

Test build #25474 has finished for PR 4022 at commit df3afd7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • val classServer = new HttpServer(conf, outputDir, new SecurityManager(conf), classServerPort, "HTTP class server")

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25474/
Test PASSed.

@squito
Copy link
Contributor

squito commented Jan 15, 2015

lgtm

@@ -1316,7 +1316,35 @@ For accumulator updates performed inside <b>actions only</b>, Spark guarantees t
will only be applied once, i.e. restarted tasks will not update the value. In transformations, users should be aware
of that each task's update may be applied more than once if tasks or job stages are re-executed.

In addition, accumulators do not maintain lineage for the operations that use them. Consequently, accumulator updates are not guaranteed to be executed when made within a lazy transformation like `map()`. Unless something has triggered the evaluation of the lazy transformation that updates the value of the accumlator, subsequent operations will not themselves trigger that evaluation and the value of the accumulator will remain unchanged. The below code fragment demonstrates this issue:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found this is worded a bit confusingly: what would it mean for an accumulator to "maintain lineage"? I think this is from @JoshRosen's PR description, but IMO it might be better to remove that particular phrasing. What about a slight re-wording:

Accumulators do not change the lazy evaluation model of Spark. Their value is only
updated once the RDD in which they are being modified is computed as part of an
action. The below code fragment demonstrates this property:

I also didn't call it an "issue" because it's just a property of how they work, I don't think it's necessarily a bug.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the suggestion - I've updated the doc.

@SparkQA
Copy link

SparkQA commented Jan 16, 2015

Test build #25668 has started for PR 4022 at commit 587def5.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Jan 16, 2015

Test build #25668 has finished for PR 4022 at commit 587def5.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25668/
Test FAILed.

@pwendell
Copy link
Contributor

Okay new version LGTM! Jenkins, test this please.

@SparkQA
Copy link

SparkQA commented Jan 16, 2015

Test build #25671 has started for PR 4022 at commit 587def5.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Jan 16, 2015

Test build #25671 has finished for PR 4022 at commit 587def5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25671/
Test PASSed.

asfgit pushed a commit that referenced this pull request Jan 16, 2015
…mation

I've added documentation clarifying the particular lack of clarity highlighted in the relevant JIRA. I've also added code examples for this issue to clarify the explanation.

Author: Ilya Ganelin <ilya.ganelin@capitalone.com>

Closes #4022 from ilganeli/SPARK-733 and squashes the following commits:

587def5 [Ilya Ganelin] Updated to clarify verbage
df3afd7 [Ilya Ganelin] Revert "Partially updated task metrics to make some vars private"
3f6c512 [Ilya Ganelin] Revert "Completed refactoring to make vars in TaskMetrics class private"
58034fb [Ilya Ganelin] Merge remote-tracking branch 'upstream/master' into SPARK-733
4dc2cdb [Ilya Ganelin] Merge remote-tracking branch 'upstream/master' into SPARK-733
3a38db1 [Ilya Ganelin] Verified documentation update by building via jekyll
33b5a2d [Ilya Ganelin] Added code examples for java and python
1fd59b2 [Ilya Ganelin] Updated documentation for accumulators to highlight lazy evaluation issue
5525c20 [Ilya Ganelin] Completed refactoring to make vars in TaskMetrics class private
c64da4f [Ilya Ganelin] Partially updated task metrics to make some vars private

(cherry picked from commit fd3a8a1)
Signed-off-by: Imran Rashid <irashid@cloudera.com>
@asfgit asfgit closed this in fd3a8a1 Jan 16, 2015
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
5 participants