[SPARK-30666][Core][WIP] Reliable single-stage accumulators #27377
Conversation
Can one of the admins verify this patch?
Inviting some contributors of `AccumulatorV2`: @cloud-fan @rxin @HyukjinKwon @zsxwing
I think this is a more specific solution to a more general problem, and I'm not sure it works in all cases.

`AccumulatorMode.Max` doesn't seem to have an analogue for all `Accumulator` types. I don't think it's appropriate for this API.

`AccumulatorMode.Last` doesn't actually know which value was used, does it? It assumes that the last task was the one used by the partition, but I'm not sure that's the case.

I think the core issue is that we send a `CompletionEvent` for tasks even if we don't end up recalculating that partition.

I can see the use case for `ALL` when counting the total number of bytes written out, but for any sort of statistics based on the data, `LAST` seems like the only idea that makes sense. And to make that work properly, I think we need a solution that isn't implemented specially for each `Accumulator`.
```scala
fragmentId.foreach(_fragments(_) = (maxSum, maxCount))
case AccumulatorMode.Last =>
  val (fragmentSum, fragmentCount) =
    fragmentId
```
When is the `fragmentId` `None`?
If `AccumulatorV2.merge` is only used in `DAGScheduler.scala`, then `fragmentId` can be an `Int` rather than an `Option[Int]`.
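To make the suggestion concrete, here is a sketch of the two signatures under discussion. The trait names are hypothetical; only `fragmentId` and the `Option[Int]`-versus-`Int` choice come from this thread:

```scala
import org.apache.spark.util.AccumulatorV2

// Current form in the PR: callers without a partition id pass None.
trait MergeWithOptionalFragment[IN, OUT] {
  def merge(other: AccumulatorV2[IN, OUT], fragmentId: Option[Int]): Unit
}

// Suggested form: viable if DAGScheduler.scala is the only caller, since the
// partition id is always known there.
trait MergeWithFragment[IN, OUT] {
  def merge(other: AccumulatorV2[IN, OUT], fragmentId: Int): Unit
}
```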
Is this mostly for speculative tasks, where we can get duplicated accumulator updates for the same partition?
@cloud-fan This is not restricted to speculative tasks; there is a multitude of reasons why we can get multiple accumulator updates per partition. This should handle any of them (as long as the accumulator updates are deterministic). @databricks-david-lewis can you give an example of a (deterministic) accumulator type that does not work with `Max`?
@EnricoMi Let's say we are aggregating a column of longs, but the values are not all positive. In that case, taking the "max" would not give the right value.
@databricks-david-lewis You are right, in that case `Max` would not give the right value. When an …
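A minimal numeric sketch of why a max-based merge breaks down once values can be negative; the concrete numbers are invented for illustration:

```scala
// A partition holds the rows (3, -5); the true partial sum is -2.
val completeAttempt = 3L + -5L                          // a finished task reports -2
val partialAttempt  = 3L                                // an attempt that only saw the first row
val merged = math.max(completeAttempt, partialAttempt)  // picks 3, but the truth is -2
// With non-negative values a complete sum always dominates any prefix of it,
// so a max-based merge happens to work for counters -- but not for general sums.
```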
Force-pushed from `3001d82` to `6b89ecc`.
@EnricoMi Thank you for continuing to work on this! I appreciate all the time and thought you've put into it. I worry that your solution will lead to lots of duplicated work. Is there some way to move the …

The only exception I can think of is counting the total number of bytes read or written, which is unreliable anyway, because certain failures mean that information never makes it back to the driver.
- removed `Option[]` from the `merge` `fragmentId` argument
- introduced the `First` `AccumulatorMode`
- moved merge mode implementations into methods (`Long` and `Double` accumulators)
- provides traits that implement the modes
- case classes calling into the traits
Implements first accumulator mode only. Provides reliable Long, Double and Collection accumulator implementations.
Force-pushed from `837efaa` to `575f104`.
@databricks-david-lewis I have stripped out the notion of …
I have had a quick chat with @holdenk and we found two use cases where this approach will not work: …

I will look into these, so changing this to WIP. More feedback welcome.
We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
What changes were proposed in this pull request?
This PR introduces a reliable accumulator which only merges the first accumulator update from each partition. Together with `AccumulatorMetadata.countFailedValues == false`, this makes accumulator values reliable in the presence of reprocessed partitions.

The current implementation has no means to identify which partition a remote accumulator value is merged from. When partitions are executed multiple times (e.g. re-runs on failure or cache eviction, rerunning actions, or use of a stage in multiple actions), accumulators currently merge the values from all of these executions.
The `ReliableAccumulator` keeps only the first accumulator value per partition. For accumulators registered with `countFailedValues == false`, this yields the accumulator value of a single successful stage, which does not change any more.

The reliable accumulator on the driver uses additional memory on the order of the number of partitions.
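To make the idea concrete, here is a minimal sketch of a "first value per partition wins" accumulator. The class and method names are invented for illustration; this is not the PR's actual `ReliableAccumulator`:

```scala
import scala.collection.mutable
import org.apache.spark.util.AccumulatorV2

// Sketch: a Long-summing accumulator that records at most one merged value
// per partition on the driver, so re-executions cannot over-count.
class FirstPerPartitionSum extends AccumulatorV2[Long, Long] {
  private var partial = 0L                                     // task-side running sum
  private val firstByPartition = mutable.Map.empty[Int, Long]  // driver-side state

  override def isZero: Boolean = partial == 0L && firstByPartition.isEmpty
  override def copy(): FirstPerPartitionSum = {
    val acc = new FirstPerPartitionSum
    acc.partial = partial
    acc.firstByPartition ++= firstByPartition
    acc
  }
  override def reset(): Unit = { partial = 0L; firstByPartition.clear() }
  override def add(v: Long): Unit = partial += v

  // The plain merge has no partition id and therefore cannot deduplicate;
  // merging is routed through the id-aware overload below instead.
  override def merge(other: AccumulatorV2[Long, Long]): Unit =
    throw new UnsupportedOperationException("merge requires a partition id")

  // Only the first update per partition is recorded; later updates for the
  // same partition (re-runs, repeated actions) are ignored.
  def merge(other: AccumulatorV2[Long, Long], partitionId: Int): Unit = other match {
    case o: FirstPerPartitionSum => firstByPartition.getOrElseUpdate(partitionId, o.partial)
    case o => throw new UnsupportedOperationException(s"cannot merge ${o.getClass}")
  }

  // Driver-side memory grows with the number of partitions, as noted above.
  override def value: Long = firstByPartition.values.sum
}
```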
Why are the changes needed?
With the current behaviour, only a very limited class of accumulator use cases can be implemented: those that deliberately count across partition executions. Counting read and written bytes is a good example where this behaviour is desired. Counting the number of rows in a dataset, or the rows that meet a certain condition, cannot be implemented with this behaviour: the accumulator over-counts and thus only provides a pessimistic upper bound on the true value. With this PR, exact, reliable numbers can be extracted.
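The over-counting is easy to reproduce with a plain `LongAccumulator` by running the same action twice; the following spark-shell sketch assumes a local master and invented names:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[2]").appName("overcount").getOrCreate()
val sc = spark.sparkContext

val rows = sc.longAccumulator("rows")  // counts processed rows
val rdd = sc.parallelize(1 to 1000, numSlices = 4).map { x => rows.add(1); x }

rdd.count()
println(rows.value)  // 1000, assuming no task was retried
rdd.count()          // the second action recomputes every partition...
println(rows.value)  // ...so the accumulator now reports 2000
```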
Does this PR introduce any user-facing change?
Yes, it introduces the method `ReliableAccumulator.merge(AccumulatorV2, int)`, which is called by the DAGScheduler to provide the partition id.

How was this patch tested?
Unit tests in the `ReliableAccumulatorSuite`.
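For illustration, this is the kind of property such a suite would check. The snippet reuses the hypothetical `FirstPerPartitionSum` sketched above, not the PR's real classes:

```scala
val driverAcc = new FirstPerPartitionSum
val update = new FirstPerPartitionSum
update.add(5L)

driverAcc.merge(update, partitionId = 0)
driverAcc.merge(update, partitionId = 0)  // duplicate update, e.g. from a re-run task
driverAcc.merge(update, partitionId = 1)

assert(driverAcc.value == 10L)            // each partition counted exactly once
```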