[SPARK-30536][CORE][SQL] Sort-merge join operator spilling performance improvements #27246
Conversation
@siknezevic Thank you for your contribution. Could you please reformat the PR description?
@kiszk Thank you for the questions. I hope that I reformatted the PR description correctly.
Does this change have a good effect only on large data? For example, how about a small dataset, e.g., the TPCDS sf=10 and cores=1 case?
A small data set will fit into executor memory, so there is no need to spill. I tested with just a few records and it works fine. This solution will also be faster for small data sets when compared with the current Spark spilling. My focus is bigger data sets, and this solution will greatly improve spilling performance in those cases.
Regarding the "Implement lazy initialization of UnsafeSorterSpillReader" portion of the changes: it looks like @liutang123 previously proposed a similar change in 2018 in #20184 / SPARK-22987. I stumbled across that PR earlier this year and left some comments: overall, I think that lazy initialization of the individual spill readers makes sense. To enable incremental review and merge of these changes, do you think it could make sense to split the lazy spill reader initialization changes into a separate PR? I haven't compared the implementations in detail, so I haven't formed an opinion on the approach here vs. the one in #20184.
Regarding the ExternalAppendOnlyUnsafeRowArray change:
- It looks like there are some benchmarks for this class in ExternalAppendOnlyUnsafeRowArrayBenchmark; we should probably re-run those after this change.
- It seems like a lot of the complexity in ExternalAppendOnlyUnsafeRowArray comes from the fact that we want to define a "do not spill" threshold in terms of an absolute row count. If it wasn't for this, I think we could just directly use an UnsafeExternalSorter here (the performance overhead of inserting a record when you don't spill should be roughly the same AFAIK, since both paths are more-or-less just copying the row's memory onto the end of a buffer). It looks like this was previously discussed at [SPARK-13450] Introduce ExternalAppendOnlyUnsafeRowArray. Change CartesianProductExec, SortMergeJoin, WindowExec to use it #16909 (comment), where it sounds like the current design was motivated by performance considerations.
- Your idea of having a "merged iterator" between the in-memory and spill iterators might make sense; I left some feedback / comments on the current implementation of that idea (I think I've spotted a couple of bugs).
Regarding decreasing the initial memory read buffer size in UnsafeSorterSpillReader: intuitively this also makes sense to me. AFAIK this buffer only needs to be large enough to hold a single record, and records >= 1MB are probably relatively rare. I'm not sure what the optimal threshold is, though, since I don't really have data on median record sizes for representative real-world workloads.
Thank you for the comments.
I just want to mention that having a 1 MB buffer has a 10X negative performance impact: the constructor of UnsafeSorterSpillReader takes extra time to create the 1 MB buffer. The benchmark tests with a 1 KB buffer showed good performance.
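The grow-to-fit logic discussed above can be sketched in a few lines. This is a simplified model with hypothetical names, not Spark's actual UnsafeSorterSpillReader code:

```scala
// Sketch (hypothetical, not Spark's implementation) of a read buffer that
// starts small and doubles until it can hold a single record.
object ReadBufferSketch {
  val InitialSize = 1024 // start at 1 KB instead of 1 MB

  // Return a buffer large enough for `recordSize` bytes, doubling as needed.
  def ensureCapacity(buffer: Array[Byte], recordSize: Int): Array[Byte] = {
    var size = buffer.length
    while (size < recordSize) size *= 2
    if (size == buffer.length) buffer else new Array[Byte](size)
  }
}
```

The point is that with a 1 KB starting size most records fit with no reallocation at all, while a rare large record only costs a few doublings on first read.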
I agree it would be good if we could use UnsafeExternalSorter only. The ExternalAppendOnlyUnsafeRowArray class has the issue that it does not have protection from OutOfMemory (OOM) errors like UnsafeExternalSorter does. The OOM issue (and Spark executor loss) would happen at the point of doubling the capacity of the ExternalAppendOnlyUnsafeRowArray object: the object gets full and needs to expand, and the OOM happens when trying to allocate memory for the new "double in capacity" object (ArrayBuffer).
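The doubling problem described above can be illustrated with a back-of-the-envelope sketch (hypothetical names; this only models the allocation pattern, it is not Spark code):

```scala
// At the moment an ArrayBuffer-style container doubles its capacity, the old
// backing array is still referenced while the new 2x array is being allocated,
// so peak memory is roughly 3x the current data size.
object GrowthPeakSketch {
  def peakBytesDuringGrowth(currentCapacity: Long, bytesPerRow: Long): Long = {
    val oldArray = currentCapacity * bytesPerRow     // still live during the copy
    val newArray = 2 * currentCapacity * bytesPerRow // freshly allocated
    oldArray + newArray
  }
}
```

For example, 1 million buffered rows of ~100 bytes (~100 MB of data) would transiently need on the order of 300 MB at the moment of expansion, which is where the executor-loss OOM can hit.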
I fixed the issues in ExternalAppendOnlyUnsafeRowArray. In the coming days I will push a new PR for lazy spill reader initialization.
I did a test to see the impact on spilling performance without the changes in the ExternalAppendOnlyUnsafeRowArray class. The design of MergerSorterIterator.hasNext() in ExternalAppendOnlyUnsafeRowArray is critical for good performance. Below is the stack of the code execution which has lazy initialization of the spill reader but no changes in ExternalAppendOnlyUnsafeRowArray. As you can see, invocation of SpillableArrayIterator.hasNext() will cause initialization of UnsafeSorterSpillReader, and it will produce I/O. Invocation of MergerSorterIterator.hasNext() will not cause initialization of UnsafeSorterSpillReader, so I/O will not happen. The startIndex that is passed to MergerSorterIterator is 0; startIndex initializes currentIndex, so we have currentIndex < numRowBufferedInMemory (0 < numRowBufferedInMemory). This means that MergerSorterIterator.hasNext() will not invoke the iterator.hasNext() method, which is the case for SpillableArrayIterator.hasNext(), and it is the invocation of iterator.hasNext() that produces I/O. Could you please let me know if I should still submit a PR for lazy spill reader initialization? sun.misc.Unsafe.park(Native Method)
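The behavior described above can be modeled with a self-contained merged-iterator sketch. All names here are hypothetical and the spill side is simulated with a callback, so this is only a model of the design, not the PR's code:

```scala
// Sketch of a merged iterator: hasNext over in-memory rows never touches the
// spill side; the (expensive, I/O-producing) spill iterator is created lazily,
// only once the in-memory rows are exhausted.
class MergedIteratorSketch[T](
    inMemory: IndexedSeq[T],
    startIndex: Int,
    createSpillIterator: () => Iterator[T]) extends Iterator[T] {

  private var currentIndex = startIndex
  private var spillIterator: Iterator[T] = null

  private def spill(): Iterator[T] = {
    if (spillIterator == null) spillIterator = createSpillIterator() // I/O would happen here
    spillIterator
  }

  override def hasNext: Boolean =
    currentIndex < inMemory.length || spill().hasNext

  override def next(): T =
    if (currentIndex < inMemory.length) {
      val row = inMemory(currentIndex)
      currentIndex += 1
      row
    } else {
      spill().next()
    }
}
```

With this shape, a join that only ever reads the in-memory prefix (as in the Left Semi Join case above) never constructs the spill reader and never pays its I/O cost.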
Can we get this PR in please?
OK to test
We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
Could you please reopen this PR and remove the Stale tag? This PR brings significant performance improvements when spilling is enabled. Thank you
ok to test
Test build #123264 has finished for PR 27246 at commit
retest this please
Why was this closed?
Could you reformat the description to be easy to read instead of just copying the benchmark result? The description will be used as the commit log message.
Sorry, I added info about what I did and I thought the task was done. Apparently I was wrong.
I changed the description. Could you please review it? Thank you for the help!
Looks okay to me. Could anyone check this? @cloud-fan @dongjoon-hyun @JoshRosen @jiangxb1987
Just checking to see if there is anything on my side that I should do to move this PR forward. Thank you
kindly ping: @cloud-fan @dongjoon-hyun @JoshRosen @jiangxb1987
ok to test
Test build #126860 has finished for PR 27246 at commit
Test build #126900 has started for PR 27246 at commit
@siknezevic Seems like a valid test failure, so could you check it? https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/126900/testReport/org.apache.spark.sql/JoinSuite/test_SortMergeJoin__with_spill_/
Test build #126922 has finished for PR 27246 at commit
retest this please
Test build #127776 has finished for PR 27246 at commit
retest this please
Test build #128448 has finished for PR 27246 at commit
We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
@siknezevic Hi, is there any progress on this PR?
No.
What changes were proposed in this pull request?
The following changes improve SQL execution performance when data is spilled to disk:
1) Implement lazy initialization of UnsafeSorterSpillReader, the iterator on top of spilled rows:
- During SortMergeJoin (Left Semi Join) execution, the iterator over the spilled data is created but no iteration over the data is done.
- Lazy initialization of UnsafeSorterSpillReader enables efficient processing of SortMergeJoin even if data is spilled to disk, because unnecessary I/O is avoided.
2) Decrease the initial memory read buffer size in UnsafeSorterSpillReader from 1 MB to 1 KB:
- The UnsafeSorterSpillReader constructor takes a lot of time due to the default 1 MB memory read buffer.
- The code already has logic to grow the memory read buffer if it cannot fit the data, so decreasing the initial size to 1 KB is safe and has a positive performance impact.
3) Improve memory utilization when spilling is enabled in ExternalAppendOnlyUnsafeRowArray:
- In the current implementation, when spilling is enabled, an UnsafeExternalSorter object is created, the data is moved from the ExternalAppendOnlyUnsafeRowArray object into the UnsafeExternalSorter, and the ExternalAppendOnlyUnsafeRowArray object is then emptied. Just before it is emptied, both objects hold the same data in memory; that requires double the memory and duplicates the data, which can be avoided.
- In the proposed solution, when spark.sql.sortMergeJoinExec.buffer.in.memory.threshold is reached, adding new rows to the ExternalAppendOnlyUnsafeRowArray object stops. An UnsafeExternalSorter object is created and new rows are added to it, while the ExternalAppendOnlyUnsafeRowArray object retains all rows already added to it. This approach enables better memory utilization and avoids unnecessary movement of data from one object to another.
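The proposed behavior in (3) can be modeled with a small self-contained sketch. The names are hypothetical and a plain buffer stands in for UnsafeExternalSorter, so this only illustrates the row placement and iteration order, not the PR's implementation:

```scala
import scala.collection.mutable.ArrayBuffer

// Sketch of the hybrid buffer: rows up to the threshold stay in the in-memory
// array and are retained there; once the threshold is reached, further rows go
// to the spillable side, with no copy-and-clear of the in-memory rows.
class HybridRowBufferSketch[T](threshold: Int) {
  private val inMemory = ArrayBuffer.empty[T]
  private val spillable = ArrayBuffer.empty[T] // stands in for UnsafeExternalSorter

  def add(row: T): Unit =
    if (inMemory.length < threshold) inMemory += row
    else spillable += row

  def numRowsInMemory: Int = inMemory.length

  // Read in-memory rows first, then the spilled rows, in insertion order.
  def iterator: Iterator[T] = inMemory.iterator ++ spillable.iterator
}
```

Because the in-memory rows are never duplicated into the sorter, peak memory stays close to the data size instead of briefly doubling at the handover point.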
Why are the changes needed?
Testing with the TPC-DS 100 TB benchmark data set showed that some of the SQLs (for example, query 14) are not able to run even with extremely large Spark executor memory. The Spark spilling feature has to be enabled in order to process these SQLs, but processing becomes extremely slow when spilling is enabled. The test of this solution with query 14 and spilling to disk enabled showed a 500X performance improvement, and it didn't degrade the performance of the other SQLs from the TPC-DS benchmark.
The testing was also done with a micro-benchmark that simulates a typical join scenario for 100000 rows when spilling is enabled. In this micro-benchmark the spilling creates three spill files. The test covered two scenarios: the existing Spark code without the patch, and the code with the patch applied. The average processing time of 100000 join rows without the patch is 943989 milliseconds; with the patch applied it is 426 milliseconds. Comparison of these two micro-benchmarks shows a more than 2000X (943989 / 426 ≈ 2216X) difference in performance. If the spilling produces more than three spill files, the performance difference would be even greater.
Does this PR introduce any user-facing change?
No
How was this patch tested?
By running TPC-DS SQLs with different data sets (10 TB and 100 TB).
By running all Spark tests.