
[SPARK-32096][SQL] Improve sorting performance of Spark SQL window function by removing window partition key from sort order #29725

Closed
wants to merge 8 commits

Conversation

@xuzikun2003 (Author):

What changes were proposed in this pull request?

Spark SQL's rank window function needs to sort the data in each window partition, and it relies on the SortExec operator to do the sort. During sorting, the window partition key is placed at the front of the sort order, which brings unnecessary comparisons on the partition key. Instead, we can group the rows by the partition key first, and inside each group sort the rows without comparing the partition key.
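For illustration, consider a hypothetical rank query (the data, the column names i and j, and the plan sketch in the comments are illustrative, not taken from the benchmark):

```scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.rank

// Assuming `spark` is an active SparkSession.
val df = spark.range(1000000L).selectExpr("id % 100 AS i", "id AS j")
val w = Window.partitionBy("i").orderBy("j")
val ranked = df.withColumn("rnk", rank().over(w))
// Plan today:      WindowExec <- SortExec(sortOrder = [i ASC, j ASC])
// With this patch: WindowExec <- WindowSortExec(partition = [i], order = [j ASC])
```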

We use a HashMap to store the mapping from a partition key to a sorter. All rows with the same partition key are inserted into the same sorter, and each sorter sorts its own rows. The partition keys stored in the HashMap are also sorted at the end. When the sort operator is ready to return rows to the window operator, it visits the sorters in partition-key order, and each sorter returns its rows in the window order specified by the SQL ORDER BY clause.
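A minimal, self-contained sketch of this grouping in plain Scala, with an ArrayBuffer standing in for each UnsafeExternalRowSorter (Row and its fields are made-up stand-ins, not the patch's types):

```scala
import scala.collection.mutable

case class Row(partitionKey: Int, orderKey: Long)

// One "sorter" (here just a buffer) per distinct partition key.
val windowSorterMap = mutable.HashMap.empty[Int, mutable.ArrayBuffer[Row]]

def insert(row: Row): Unit =
  windowSorterMap.getOrElseUpdate(row.partitionKey, mutable.ArrayBuffer.empty) += row

// Visit the groups in partition-key order; within a group, sort by the window
// ORDER BY alone, so the partition key never enters a per-row comparison.
def mapOutput(): Iterator[Row] =
  windowSorterMap.keys.toSeq.sorted.iterator
    .flatMap(k => windowSorterMap(k).sortBy(_.orderKey))
```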

Since we cannot store an unlimited number of key-value pairs in the HashMap, we set an upper bound on the number of pairs. Once the number of distinct keys in the HashMap reaches this limit, new rows with unseen keys are inserted into a main sorter, which sorts its rows by the partition key followed by the window order. As long as the number of distinct keys stays under the limit, the main sorter remains empty.
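Continuing the sketch above, the overflow rule might look like this; windowSorterMapMaxSize = 1 matches the constant visible in the patch (see the review below), and the helper names are mine:

```scala
val windowSorterMapMaxSize = 1
val mainSorter = mutable.ArrayBuffer.empty[Row]

def insertWithBound(row: Row): Unit =
  if (windowSorterMap.contains(row.partitionKey) ||
      windowSorterMap.size < windowSorterMapMaxSize) insert(row)
  else mainSorter += row // overflow for keys beyond the limit

// The main sorter orders by the partition key first, then the window order.
def mainOutput(): Iterator[Row] =
  mainSorter.sortBy(r => (r.partitionKey, r.orderKey)).iterator
```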

When there are two sequences of sorted rows, one in the HashMap and one in the main sorter, we return the rows with a merge sort: we compare the next row ready to be returned from the HashMap with the next row ready to be returned from the main sorter, and always return the one that comes first in the sort order.
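A sketch of that merge over the two iterators from the snippets above, both of which already yield rows in (partition key, window order):

```scala
def key(r: Row): (Int, Long) = (r.partitionKey, r.orderKey)

def merge(a: Iterator[Row], b: Iterator[Row]): Iterator[Row] = {
  val (l, r) = (a.buffered, b.buffered)
  new Iterator[Row] {
    def hasNext: Boolean = l.hasNext || r.hasNext
    def next(): Row =
      if (!r.hasNext) l.next()
      else if (!l.hasNext) r.next()
      else if (Ordering[(Int, Long)].lteq(key(l.head), key(r.head))) l.next()
      else r.next()
  }
}

val output = merge(mapOutput(), mainOutput())
```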

Why are the changes needed?

This is the related JIRA
https://issues.apache.org/jira/browse/SPARK-32096?page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel&focusedCommentId=17147504

This change brings a performance improvement for window functions. Running q67 of the TPCDS 1TB benchmark:

| Query | Time in seconds (master) | Time in seconds (perf patch) |
| --- | --- | --- |
| 67-v2.4 | 450.515 | 226.124 |

Running q67 of the TPCDS 10TB benchmark:

| Query | Time in seconds (master) | Time in seconds (perf patch) |
| --- | --- | --- |
| 67-v2.4 | 2486.404 | 1168.709 |

While this change improves query 67, it brings no performance regression to the other queries of TPCDS 1TB or 10TB.

Does this PR introduce any user-facing change?

No

How was this patch tested?

  1. Existing unit tests
  2. Newly added unit tests

@xuzikun2003 xuzikun2003 changed the title Improve sorting performance of Spark SQL window function by removing window partition key from sort order [SPARK-32096] [SQL] Improve sorting performance of Spark SQL window function by removing window partition key from sort order Sep 11, 2020
@xuzikun2003 xuzikun2003 changed the title [SPARK-32096] [SQL] Improve sorting performance of Spark SQL window function by removing window partition key from sort order [SPARK-32096][SQL] Improve sorting performance of Spark SQL window function by removing window partition key from sort order Sep 11, 2020
@HyukjinKwon (Member):

cc @hvanhovell FYI

@hvanhovell (Contributor):

ok to test

@SparkQA commented Sep 28, 2020:

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/33798/

@SparkQA commented Sep 28, 2020:

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/33798/

@SparkQA commented Sep 28, 2020:

Test build #129184 has finished for PR 29725 at commit 77d615a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Oct 5, 2020:

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34014/

@SparkQA commented Oct 5, 2020:

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34014/

@SparkQA commented Oct 5, 2020:

Test build #129407 has finished for PR 29725 at commit 6b8ca20.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Oct 11, 2020:

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34240/

@SparkQA commented Oct 11, 2020:

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34240/

@SparkQA commented Oct 11, 2020:

Test build #129636 has finished for PR 29725 at commit 7604f53.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Oct 11, 2020:

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34245/

@SparkQA commented Oct 11, 2020:

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34245/

@SparkQA commented Oct 11, 2020:

Test build #129641 has finished for PR 29725 at commit 32a6714.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@xuzikun2003 (Author):

@hvanhovell, have you had a chance to review this pull request?

(Supplier<RecordComparator>)null,
prefixComparatorInWindow,
prefixComputerInWindow,
pageSizeBytes/totalNumSorters,
Contributor:

Why is this page size related to totalNumSorters?
Consider the situation where there are 100 different partition keys in one task: the page size will be 1/100 of the original, which will lead to 100 times as many page allocations.
Could you please explain why you want to reduce the page size here?

private final UnsafeExternalRowSorter.PrefixComputer prefixComputerInWindow;
private final boolean canUseRadixSortInWindow;
private final long pageSizeBytes;
private static final int windowSorterMapMaxSize = 1;
Contributor:

Why is windowSorterMapMaxSize 1?
So windowSorterMap can only hold one sorter?

Member:

I have the same question. Could you parameterize it via SQLConf?

@xuzikun2003 (Author) commented Nov 18, 2020:

@maropu, @opensky142857, here are the reasons why we set windowSorterMapMaxSize to 1 and why we reduce the page size of each sorter.

Each UnsafeExternalRowSorter uses its own memory consumer. When the first row is inserted into an UnsafeExternalRowSorter, the sorter's memory consumer allocates a whole page to it. In our perf run of TPCDS 100TB, the default page size is 64MB. If we insert only a few rows into a sorter corresponding to a window, a lot of memory is wasted, and the unnecessary memory allocation also brings significant performance overhead. That is why we do two things in this PR (a rough cost sketch follows the list):

  1. Keep the number of window sorters small
  2. Decrease the page size of each window sorter.
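To make the first point concrete, here is a rough, illustrative calculation (the 100-key count is hypothetical; 64 MB is the default page size from our run):

```scala
// First-page cost if every distinct partition key got its own sorter.
val pageSizeBytes = 64L << 20      // 64 MB default page size
val numWindowSorters = 100         // hypothetical distinct keys in one task
val firstPagesBytes = numWindowSorters * pageSizeBytes
println(s"${firstPagesBytes >> 20} MB pinned before any data is sorted")  // 6400 MB
```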

To address this problem, there are actually two directions we could take.

One direction is to let the window sorters share the same memory consumer, so we would not allocate many big pages that receive very few rows. But this direction requires a lot of engineering effort to refactor UnsafeExternalSorter.

The second direction is to keep only one window sorter for each physical partition.

Here is why we chose the second direction. When we run TPCDS 100TB, we do not see the Spark engine being slow at sorting many windows in a physical partition. We see it being slow at sorting a single window in a single physical partition (q67 is such a case), where the executor does a lot of unnecessary comparisons on the window partition key. To address the slowness we observe, we follow the second direction and keep only one window sorter for each physical partition. This single window sorter does not need to compare the window partition key, so it runs almost twice as fast.

Perhaps I can rename these parameters to avoid confusion. What do you think?

@opensky142857 (Contributor) commented Nov 18, 2020:

Did you consider reducing the page size by a fixed ratio instead of optimizing only for one partition per task?
In my understanding, it is not a rare case that one task handles several partition keys.

@xuzikun2003 (Author) commented Nov 18, 2020:

In our current setting, we have one main sorter and one window sorter. If there is only one window partition key in a physical partition, all rows go to the window sorter and the main sorter stays empty; if there is more than one window partition key in a physical partition, one partition key goes to the window sorter and the remaining keys go to the main sorter.

We only reduce the original page size by half across these two sorters. We observe that halving the page size makes no performance difference in the overall TPCDS 100TB run. The advantage is that if very few rows are inserted into the window sorter, less memory is wasted in its first allocated page, and hence less overhead is incurred by allocating that page.

@opensky142857, you are right that it is not rare for one task to handle several partition keys, but reducing the page size by half would not make much difference: the default page size is 64MB, and there is no performance difference between 64MB and 32MB pages. We could also keep the page size of the main sorter unchanged.

Contributor:

This PR looks to me like a TPCDS-specific optimization if we make windowSorterMapMaxSize = 1, since it works best when the number of partition keys is smaller than the number of tasks.

But the code itself looks like a general optimization.

Have you considered setting windowSorterMapMaxSize to 10 or 100 and reducing the page size at the same time? Then we could accept some limited memory waste in exchange for the ability to handle more general cases.

Member:

In my current impression, we need to try the first direction to achieve this optimization without the high memory pressure you pointed out above. Probably, we need to implement (or extend?) a BytesToBytesMap-like data structure (values in the map need to be sorted in an output iterator) instead of using a HashMap.


private final long pageSizeBytes;
private static final int windowSorterMapMaxSize = 1;
private static final int totalNumSorters = windowSorterMapMaxSize + 1;
private final HashMap<UnsafeRow,AbstractUnsafeExternalRowSorter> windowSorterMap;
Member:

nit: HashMap<UnsafeRow, AbstractUnsafeExternalRowSorter>

private UnsafeExternalRowSorter createUnsafeExternalRowSorterForWindow() throws IOException {
UnsafeExternalRowSorter sorter = null;
try {
if (this.orderingInWindow == null) {
Member:

When orderingInWindow == null, do we need WindowSortExec?

prefixComparatorInWindow,
prefixComputerInWindow,
pageSizeBytes/totalNumSorters,
false);
Member:

nit: pageSizeBytes/totalNumSorters, -> pageSizeBytes / totalNumSorters,

}

/**
* Returns an UnsafeExternalRowWindowSorter object.
Member:

nit: wrong format.

partitionSpec: Seq[Expression],
sortOrderInWindow: Seq[SortOrder],
sortOrderAcrossWindows: Seq[SortOrder],
global: Boolean,
Member:

It seems we don't need global for this node.

Author:

You are right, we can remove this parameter.

override def compare(prefix1: Long, prefix2: Long): Int = 0
}

if (sortOrderInWindow == null || sortOrderInWindow.size == 0) {
Member:

When sortOrderInWindow == null or sortOrderInWindow.size == 0, do we need WindowSortExec?

Author:

We don't run WindowSortExec when sortOrderInWindow == null or sortOrderInWindow.size == 0. We run the original SortExec when there is no need to sort within each group.

Member:

If so, it seems we don't need this if section?

Comment on lines +192 to +207
val enableRadixSort = sqlContext.conf.enableRadixSort

lazy val boundSortExpression = BindReferences.bindReference(sortOrder.head, output)
lazy val ordering = RowOrdering.create(sortOrder, output)
lazy val sortPrefixExpr = SortPrefix(boundSortExpression)

// The comparator for comparing prefix
lazy val prefixComparator = SortPrefixUtils.getPrefixComparator(boundSortExpression)

// The generator for prefix
lazy val prefixComputer = createPrefixComputer(sortPrefixExpr)

lazy val canUseRadixSort = enableRadixSort && sortOrder.length == 1 &&
SortPrefixUtils.canSortFullyWithPrefix(boundSortExpression)

lazy val pageSize = SparkEnv.get.memoryManager.pageSizeBytes
Member:

Please add protected for the variables above.

Author:

Sure, I will add it.

}

/**
* Performs (external) sorting for multiple windows.
Member:

Could you leave some comments about the differences from SortExec?

import org.apache.spark.util.collection.unsafe.sort.PrefixComparator;
import org.apache.spark.util.collection.unsafe.sort.RecordComparator;

public final class UnsafeExternalRowWindowSorter extends AbstractUnsafeExternalRowSorter {
Member:

Could you leave some comments about the differences from UnsafeExternalRowSorter?

SortExec(requiredOrdering, global = false, child = child)
operator match {
case WindowExec(_, partitionSpec, orderSpec, _)
if (!partitionSpec.isEmpty && !orderSpec.isEmpty) =>
Member:

nit: isEmpty -> nonEmpty

Author:

Thanks, will fix it.

global,
child,
testSpillFrequency) {

Member:

Could you add assert(partitionSpec.nonEmpty && sortOrderInWindow.nonEmpty, "XXX") here?

AbstractUnsafeExternalRowSorter sorter = createUnsafeExternalRowSorterForWindow();

if (sorter == null) {
this.mainSorter.spill();
Contributor:

If we fail to create the new sorter, why do we need to spill the main sorter?

@github-actions (bot):

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

@github-actions github-actions bot added the Stale label Feb 28, 2021
@github-actions github-actions bot closed this Mar 1, 2021