Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SPARK-3642. Document the nuances of shared variables. #2490

Closed
wants to merge 1 commit into from

Conversation

sryza
Copy link
Contributor

@sryza sryza commented Sep 22, 2014

No description provided.

@SparkQA
Copy link

SparkQA commented Sep 22, 2014

QA tests have started for PR 2490 at commit cb9ffad.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Sep 22, 2014

QA tests have finished for PR 2490 at commit cb9ffad.

  • This patch passes unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -1183,6 +1188,10 @@ running on the cluster can then add to it using the `add` method or the `+=` ope
However, they cannot read its value.
Only the driver program can read the accumulator's value, using its `value` method.

The same task may run multiple times, either when its output data becomes lost or when multiple
actions make use of the same stage. In these cases, only the additions reported by the first
successful task contribute to the accumulator's value.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sryza Actually this is not true....

scala> val acc = sc.accumulator(0)

scala> val data = sc.parallelize(List(1, 2, 3))
data: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at :12

scala> val a = data.map(x => acc += 1)

scala> a.count

scala> acc.value
res5: Int = 3

scala> a.count
scala> acc.value
res7: Int = 6

I will resubmit #228 tonight or tomorrow,

@sryza
Copy link
Contributor Author

sryza commented Nov 15, 2014

Thanks for the review @Ishiihara . Updated the PR to clarify these points.

@SparkQA
Copy link

SparkQA commented Nov 15, 2014

Test build #23400 has started for PR 2490 at commit 2a81019.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Nov 15, 2014

Test build #23400 has finished for PR 2490 at commit 2a81019.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23400/
Test PASSed.

@@ -1183,6 +1189,10 @@ running on the cluster can then add to it using the `add` method or the `+=` ope
However, they cannot read its value.
Only the driver program can read the accumulator's value, using its `value` method.

An operation referencing an accumulator may run multiple times, either when parts of its output
data become lost or when multiple actions make use of the RDD it produces. In these cases, only the
additions reported by the first successful execution contribute to the accumulator's value.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was going to suggest that this might not be true, but it looks like this is partially-true now that #2524 has been merged. That PR added a paragraph to the programming guide which clarifies that we guard against duplicate updates only for updates performed inside of actions and not for ones performed in transformations: 66cc243?diff=unified#diff-3

In light of this, do we still need this paragraph?

@JoshRosen
Copy link
Contributor

Sorry for letting this sit for so long; I'm working my way through the backlog now, though.

I think that the second addition, RE: accumulator updates, may no longer be necessary / may be subsumed by changes in other PRs, but the first paragraph RE: broadcast variables still looks good.

@srowen
Copy link
Member

srowen commented Feb 11, 2015

Sandy do you want to remove or update the 2nd paragraph? can be merged then, it looks like.

@SparkQA
Copy link

SparkQA commented Mar 10, 2015

Test build #28433 has started for PR 2490 at commit aae3340.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Mar 10, 2015

Test build #28433 has finished for PR 2490 at commit aae3340.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/28433/
Test PASSed.

@srowen
Copy link
Member

srowen commented Mar 10, 2015

LGTM since the PR now represents the just the first paragraph that Josh alludes to above, and he approved that much. I'll leave it open a little while for more comments just in case.

@asfgit asfgit closed this in 2d87a41 Mar 11, 2015
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants