
[SPARK-41513][SQL] Implement an accumulator to collect per mapper row count metrics #39057

Closed
wants to merge 8 commits

Conversation

amaliujia
Contributor

@amaliujia amaliujia commented Dec 14, 2022

What changes were proposed in this pull request?

In the current Spark optimizer, a single-partition shuffle may be created for a limit when the limit is not the last non-action operation (e.g. a filter follows the limit and the data size exceeds a threshold). The output partitions feeding into this limit may already be sorted. The single-partition shuffle approach has a correctness bug in that case: shuffle read partitions can arrive out of partition order, and the limit exec simply takes the first `limit` rows, which can break the ordering and produce a wrong result. The shuffle itself is also relatively costly. Meanwhile, a naive fix for the bug is to sort all the data fed into the limit again, which adds yet more overhead.

So we propose a row-count-based AQE algorithm that addresses this problem in two ways:

  1. Avoid the extra sort on the shuffle read side (or in the limit exec) that would otherwise be needed for a correct result.
  2. Avoid reading all shuffle data from the mappers for this single-partition shuffle, to reduce shuffle cost.

Note that 1. only applies to the sorted-partition case, while 2. applies to the general single-partition shuffle + limit case.

The algorithm works as follows:

  1. Each mapper records a row count when writing shuffle data.
  2. Since this is the single-shuffle-partition case, there is only one partition but N mappers.
  3. An AccumulatorV2 is implemented to collect a list of tuples recording the mapping between mapper id and the number of rows written by that mapper (the row count metrics).
  4. The AQE framework detects a plan shape of a shuffle followed by a global limit.
  5. The AQE framework reads only the necessary data from the mappers based on the limit. For example, if mapper 1 writes 200 rows, mapper 2 writes 300 rows, and the limit is 500, AQE creates a shuffle read node that reads from mappers 1 and 2 only, skipping the remaining mappers.
  6. This is correct for limits over both sorted and non-sorted partitions.
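Step 5 above can be sketched as a small prefix-selection routine. This is an illustrative sketch only (the names are hypothetical, not the actual Spark internals): given per-mapper row counts in mapper order and a limit, pick the minimal prefix of mappers whose combined row count covers the limit.

```scala
// Hypothetical sketch of step 5; not the real Spark implementation.
object MapperSelection {
  // rowCounts: (mapIndex, rowsWritten) pairs in mapper order.
  // Returns the map indexes the shuffle read node needs to read from.
  def mappersToRead(rowCounts: Seq[(Int, Long)], limit: Long): Seq[Int] = {
    var remaining = limit
    val selected = scala.collection.mutable.ArrayBuffer.empty[Int]
    val it = rowCounts.iterator
    // Take mappers until their cumulative row count covers the limit.
    while (remaining > 0 && it.hasNext) {
      val (mapIndex, rows) = it.next()
      selected += mapIndex
      remaining -= rows
    }
    selected.toSeq
  }
}
```

With the PR's example (mapper 1 writes 200 rows, mapper 2 writes 300, limit 500), only mappers 1 and 2 are selected and the remaining mappers are skipped.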

This PR is the first step toward implementing the idea in https://issues.apache.org/jira/browse/SPARK-41512: it implements a row count accumulator that will be used to collect row count metrics.

Why are the changes needed?

It is needed for an optimization algorithm for a global limit over a single-partition shuffle.

Does this PR introduce any user-facing change?

NO

How was this patch tested?

UT

@github-actions github-actions bot added the CORE label Dec 14, 2022
@amaliujia
Contributor Author

@cloud-fan

*
* @since 3.4.0
*/
class MapperRowCounter extends AccumulatorV2[jl.Long, java.util.List[java.util.List[jl.Long]]] {
Contributor


We can put it in the sql module, probably in the same file as the shuffle node.

Contributor Author


Moved to sql module.

@github-actions github-actions bot added SQL and removed CORE labels Dec 15, 2022
@AmplabJenkins

Can one of the admins verify this patch?


  def setPartitionId(id: Long): Unit = {
    this.synchronized {
      val p = id
Contributor


what does this do?

Contributor Author


hmm I can remove this assignment and use id directly.

*
* @since 3.4.0
*/
class MapperRowCounter extends AccumulatorV2[jl.Long, java.util.List[(jl.Long, jl.Long)]] {
Contributor


TaskContext.partitionId is int, do we really need long here?

Contributor Author


The first Long here (aka IN) comes from:

  /**
   * Takes the inputs and accumulates.
   */
  def add(v: IN): Unit

So this should be a long for the row count?

Contributor


row count should be long, map index should be int.

Contributor Author


I think I misunderstood the type parameter.

Changed to Integer for the map index.

amaliujia and others added 3 commits December 20, 2022 13:47
Co-authored-by: Wenchen Fan <cloud0fan@gmail.com>

  override def add(v: jl.Long): Unit = {
    this.synchronized {
      assert(!isZero, "agg must have been initialized")
Contributor


assert(getOrCreate.size == 1)?

Contributor Author


done

      if (isZero) {
        getOrCreate.add((id, 0))
      } else {
        val n = getOrCreate.get(0)._2
Contributor


when can we hit this branch?

Contributor Author

@amaliujia amaliujia Dec 22, 2022


This is because I don't know the invocation sequence of the accumulator APIs on the executor side, so I added this branch to be safe.

If setPartitionId is always called before any add, then we can assert on isZero and remove this branch.

What do you think?

Contributor


let's add an assert

Contributor Author


done
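Putting the review outcomes together (Int map index, Long row count, and an assert that setPartitionId runs before any add), the accumulator's semantics can be sketched in a self-contained way. This is a hypothetical simplification, not the actual AccumulatorV2 subclass from the PR: it drops the Spark base class and java.util types to show just the per-task register/add and driver-side merge behavior.

```scala
// Hypothetical sketch of MapperRowCounter's semantics (not the real Spark class).
class MapperRowCounterSketch {
  // One (mapIndex, rowCount) entry per map task; merged lists hold all mappers.
  private val counts = scala.collection.mutable.ListBuffer.empty[(Int, Long)]

  def isZero: Boolean = this.synchronized { counts.isEmpty }

  // Called exactly once per map task, before any add().
  def setPartitionId(id: Int): Unit = this.synchronized {
    assert(isZero, "setPartitionId must be called before any add")
    counts += ((id, 0L))
  }

  // Bump this task's row count as shuffle rows are written.
  def add(v: Long): Unit = this.synchronized {
    assert(counts.size == 1, "expect exactly one (mapIndex, rowCount) entry per task")
    val (id, n) = counts(0)
    counts(0) = (id, n + v)
  }

  // Driver side: concatenate entries collected by different tasks.
  def merge(other: MapperRowCounterSketch): Unit = this.synchronized {
    counts ++= other.value
  }

  def value: List[(Int, Long)] = this.synchronized { counts.toList }
}
```

After merging, the driver holds the full (mapIndex, rowCount) list that the AQE rule uses to decide which mappers' output to read.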

@cloud-fan
Contributor

thanks, merging to master!

@cloud-fan cloud-fan closed this in 887f831 Dec 22, 2022