-
Notifications
You must be signed in to change notification settings - Fork 28.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
SPARK-3642. Document the nuances of shared variables. #2490
Conversation
QA tests have started for PR 2490 at commit
|
QA tests have finished for PR 2490 at commit
|
@@ -1183,6 +1188,10 @@ running on the cluster can then add to it using the `add` method or the `+=` ope | |||
However, they cannot read its value. | |||
Only the driver program can read the accumulator's value, using its `value` method. | |||
|
|||
The same task may run multiple times, either when its output data becomes lost or when multiple | |||
actions make use of the same stage. In these cases, only the additions reported by the first | |||
successful task contribute to the accumulator's value. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@sryza Actually this is not true....
scala> val acc = sc.accumulator(0)
scala> val data = sc.parallelize(List(1, 2, 3))
data: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at :12
scala> val a = data.map(x => acc += 1)
scala> a.count
scala> acc.value
res5: Int = 3
scala> a.count
scala> acc.value
res7: Int = 6
I will resubmit #228 tonight or tomorrow,
cb9ffad
to
2a81019
Compare
Thanks for the review @Ishiihara . Updated the PR to clarify these points. |
Test build #23400 has started for PR 2490 at commit
|
Test build #23400 has finished for PR 2490 at commit
|
Test PASSed. |
@@ -1183,6 +1189,10 @@ running on the cluster can then add to it using the `add` method or the `+=` ope | |||
However, they cannot read its value. | |||
Only the driver program can read the accumulator's value, using its `value` method. | |||
|
|||
An operation referencing an accumulator may run multiple times, either when parts of its output | |||
data become lost or when multiple actions make use of the RDD it produces. In these cases, only the | |||
additions reported by the first successful execution contribute to the accumulator's value. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was going to suggest that this might not be true, but it looks like this is partially-true now that #2524 has been merged. That PR added a paragraph to the programming guide which clarifies that we guard against duplicate updates only for updates performed inside of actions and not for ones performed in transformations: 66cc243?diff=unified#diff-3
In light of this, do we still need this paragraph?
Sorry for letting this sit for so long; I'm working my way through the backlog now, though. I think that the second addition, RE: accumulator updates, may no longer be necessary / may be subsumed by changes in other PRs, but the first paragraph RE: broadcast variables still looks good. |
Sandy do you want to remove or update the 2nd paragraph? can be merged then, it looks like. |
Test build #28433 has started for PR 2490 at commit
|
Test build #28433 has finished for PR 2490 at commit
|
Test PASSed. |
LGTM since the PR now represents the just the first paragraph that Josh alludes to above, and he approved that much. I'll leave it open a little while for more comments just in case. |
No description provided.