SPARK-3642. Document the nuances of shared variables. #2490

sryza · 2014-09-22T16:55:45Z

No description provided.

SparkQA · 2014-09-22T16:59:22Z

QA tests have started for PR 2490 at commit cb9ffad.

This patch merges cleanly.

SparkQA · 2014-09-22T18:04:56Z

QA tests have finished for PR 2490 at commit cb9ffad.

This patch passes unit tests.
This patch merges cleanly.
This patch adds no public classes.

CodingCat · 2014-09-24T01:24:59Z

docs/programming-guide.md

@@ -1183,6 +1188,10 @@ running on the cluster can then add to it using the `add` method or the `+=` ope
 However, they cannot read its value.
 Only the driver program can read the accumulator's value, using its `value` method.

+The same task may run multiple times, either when its output data becomes lost or when multiple
+actions make use of the same stage. In these cases, only the additions reported by the first
+successful task contribute to the accumulator's value.


@sryza Actually this is not true....

scala> val acc = sc.accumulator(0)

scala> val data = sc.parallelize(List(1, 2, 3))
data: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:12

scala> val a = data.map(x => acc += 1)

scala> a.count

scala> acc.value
res5: Int = 3

scala> a.count
scala> acc.value
res7: Int = 6

I will resubmit #228 tonight or tomorrow.

sryza · 2014-11-15T00:06:17Z

Thanks for the review, @Ishiihara. Updated the PR to clarify these points.

SparkQA · 2014-11-15T00:10:22Z

Test build #23400 has started for PR 2490 at commit 2a81019.

This patch merges cleanly.

SparkQA · 2014-11-15T01:40:45Z

Test build #23400 has finished for PR 2490 at commit 2a81019.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

AmplabJenkins · 2014-11-15T01:40:49Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23400/
Test PASSed.

JoshRosen · 2014-12-24T02:31:35Z

docs/programming-guide.md

@@ -1183,6 +1189,10 @@ running on the cluster can then add to it using the `add` method or the `+=` ope
 However, they cannot read its value.
 Only the driver program can read the accumulator's value, using its `value` method.

+An operation referencing an accumulator may run multiple times, either when parts of its output
+data become lost or when multiple actions make use of the RDD it produces. In these cases, only the
+additions reported by the first successful execution contribute to the accumulator's value.


I was going to suggest that this might not be true, but it looks like it is partially true now that #2524 has been merged. That PR added a paragraph to the programming guide clarifying that we guard against duplicate updates only for updates performed inside actions, not for ones performed in transformations: 66cc243?diff=unified#diff-3

In light of this, do we still need this paragraph?
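
For illustration, the action-versus-transformation distinction discussed above can be sketched roughly as follows. This is a minimal sketch rather than anything taken from the PR: it assumes the Spark 1.x accumulator API used earlier in this thread (`sc.accumulator`, `+=`, `value`) and a local master, and the object and variable names (`AccumulatorUpdateSketch`, `inTransformation`, `inAction`) are hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._ // implicit AccumulatorParam[Int] on older 1.x releases

// Hypothetical object name; not part of the PR.
object AccumulatorUpdateSketch {

  /** Returns (value after first count, value after second count, value after foreach). */
  def run(sc: SparkContext): (Int, Int, Int) = {
    val data = sc.parallelize(List(1, 2, 3))

    // Update inside a transformation: the map function runs again every time
    // an action recomputes the (uncached) RDD, so each count adds 3 more,
    // matching the 3-then-6 session shown earlier in the thread.
    val inTransformation = sc.accumulator(0)
    val mapped = data.map { x => inTransformation += 1; x }
    mapped.count()
    val afterFirstCount = inTransformation.value
    mapped.count()
    val afterSecondCount = inTransformation.value

    // Update inside an action: this is the case that the programming guide
    // paragraph from #2524 describes as guarded against duplicate updates.
    val inAction = sc.accumulator(0)
    data.foreach(x => inAction += 1)

    (afterFirstCount, afterSecondCount, inAction.value)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("accumulator-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try println(run(sc)) finally sc.stop()
  }
}
```

On a healthy local run no task is retried, so the sketch only reproduces the recomputation case. The guard for updates inside actions matters when tasks are re-executed after a failure, which this example does not simulate.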

JoshRosen · 2014-12-24T02:32:43Z

Sorry for letting this sit for so long; I'm working my way through the backlog now, though.

I think that the second addition, RE: accumulator updates, may no longer be necessary / may be subsumed by changes in other PRs, but the first paragraph RE: broadcast variables still looks good.

srowen · 2015-02-11T10:01:29Z

Sandy, do you want to remove or update the 2nd paragraph? It looks like this can be merged after that.

SparkQA · 2015-03-10T18:32:53Z

Test build #28433 has started for PR 2490 at commit aae3340.

This patch merges cleanly.

SparkQA · 2015-03-10T19:52:26Z

Test build #28433 has finished for PR 2490 at commit aae3340.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

AmplabJenkins · 2015-03-10T19:52:29Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/28433/
Test PASSed.

srowen · 2015-03-10T22:04:11Z

LGTM, since the PR now contains just the first paragraph that Josh alludes to above, and he approved that much. I'll leave it open a little while for more comments just in case.

CodingCat reviewed Sep 24, 2014
sryza force-pushed the sandy-spark-3642 branch from cb9ffad to 2a81019 Compare November 15, 2014 00:05

JoshRosen reviewed Dec 24, 2014
SPARK-3642. Document the nuances of broadcast variables

aae3340

sryza force-pushed the sandy-spark-3642 branch from 2a81019 to aae3340 Compare March 10, 2015 18:30

asfgit closed this in 2d87a41 Mar 11, 2015
