[SPARK-13450] Introduce ExternalAppendOnlyUnsafeRowArray. Change CartesianProductExec, SortMergeJoin, WindowExec to use it #16909
Conversation
ok to test
@rxin : can you please recommend someone who could review this PR?
Test build #72806 has started for PR 16909 at commit
@tejasapatil I have taken a quick pass. It looks good overall. I think it might be a good idea to do some benchmarking to get an idea of potential performance implications.
}

def add(entry: InternalRow): Unit = {
  val unsafeRow = entry.asInstanceOf[UnsafeRow]
This seems tricky. Let's move this cast to the call site (and preferably avoid it).
Moved cast to client
 * [[add()]] or [[clear()]] calls made on this array after creation of the iterator,
 * then the iterator is invalidated thus saving clients from getting inconsistent data.
 */
def generateIterator(): Iterator[UnsafeRow] = {
Can we also have a version that skips `n` elements? This will be a little more efficient for the unbounded following window case.
Done
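The "skip `n` elements" variant can be sketched in plain Scala. This is a hypothetical simplified illustration (`SimpleRowArray`, its parameters, and the element type are my stand-ins, not the actual Spark code): the iterator starts its cursor at `startIndex` instead of materializing and dropping a prefix.

```scala
// Hypothetical sketch of generateIterator(startIndex): an iterator that skips
// the first `startIndex` elements, useful for the unbounded-following window
// case. Not the actual ExternalAppendOnlyUnsafeRowArray implementation.
class SimpleRowArray[T](elems: IndexedSeq[T]) {
  def generateIterator(startIndex: Int): Iterator[T] = {
    require(startIndex >= 0 && startIndex <= elems.length,
      s"Invalid startIndex=$startIndex for array of size ${elems.length}")
    // Start the cursor at startIndex; nothing before it is ever touched.
    Iterator.range(startIndex, elems.length).map(elems)
  }

  def generateIterator(): Iterator[T] = generateIterator(startIndex = 0)
}
```

Usage: `new SimpleRowArray(Vector(10, 20, 30)).generateIterator(1)` yields only the last two elements.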
var rowBuffer: RowBuffer = null
if (sqlContext == null) {
???
Removed
@@ -285,6 +283,9 @@ case class WindowExec(
    val expressions = windowFrameExpressionFactoryPairs.flatMap(_._1)
    val factories = windowFrameExpressionFactoryPairs.map(_._2).toArray

    var spillThreshold =
      sqlContext.conf.getConfString("spark.sql.windowExec.buffer.spill.threshold", "4096").toInt
Please make an internal configuration in `SQLConf` for this.
created a SQLConf
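For reference, an internal entry in `org.apache.spark.sql.internal.SQLConf` looks roughly like the following. This is a hedged sketch, not the merged code: the key and default come from the hard-coded literal quoted above, while the builder calls follow `SQLConf`'s usual pattern and may differ in detail from what was actually committed.

```scala
// Sketch only: an internal SQLConf entry for the window spill threshold.
val WINDOW_EXEC_BUFFER_SPILL_THRESHOLD =
  buildConf("spark.sql.windowExec.buffer.spill.threshold")
    .internal()
    .doc("Threshold for number of rows buffered in the window operator")
    .intConf
    .createWithDefault(4096)
```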
// A counter to keep track of total additions made to this array since its creation.
// This helps to invalidate iterators when there are changes done to the backing array.
private var modCount: Long = 0
This is only an issue after we have cleared the buffer right?
No, there is more to it than that. Apart from `clear()`, even `add()` would be regarded as a modification. The inbuilt iterators (from both `ArrayBuffer` and `UnsafeExternalSorter`), once created, do not see any new data added. Clients of `ExternalAppendOnlyUnsafeRowArray` may not realise this and think that they have read all the data. To prevent that, I have to have such a mechanism to invalidate existing iterators:
val buffer = ArrayBuffer.empty[Int]
buffer.append(1)
val iterator = buffer.iterator
assert(iterator.hasNext)
assert(iterator.next() == 1)
buffer.append(2)
assert(iterator.hasNext) // <-- THIS FAILS
Also, when `add()` transparently switches the backing storage from `ArrayBuffer` => `UnsafeExternalSorter` and there was an open iterator created over the `ArrayBuffer`, it will lead to `IndexOutOfBoundsException`:
val buffer = ArrayBuffer.empty[Int]
buffer.append(1)
buffer.append(2)
val iterator = buffer.iterator
buffer.clear()
assert(iterator.hasNext) // <-- THIS FAILS
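The invalidation scheme described above can be sketched in plain Scala. This is a simplified illustration, not the actual `ExternalAppendOnlyUnsafeRowArray` implementation; the class name and the choice of `ConcurrentModificationException` are mine.

```scala
import java.util.ConcurrentModificationException
import scala.collection.mutable.ArrayBuffer

// Hypothetical sketch of modCount-based invalidation: any add()/clear() after
// an iterator is created makes that iterator fail fast instead of silently
// returning stale or inconsistent data.
class AppendOnlyArray[T] {
  private val buffer = ArrayBuffer.empty[T]
  private var modCount: Long = 0

  def add(entry: T): Unit = { buffer += entry; modCount += 1 }
  def clear(): Unit = { buffer.clear(); modCount += 1 }

  def generateIterator(): Iterator[T] = new Iterator[T] {
    // Snapshot the counter at creation time; any mismatch later means the
    // backing array was modified after this iterator was handed out.
    private val expectedModCount = modCount
    private var index = 0
    private def checkValid(): Unit = {
      if (modCount != expectedModCount) {
        throw new ConcurrentModificationException(
          "The backing array was modified after this iterator was created")
      }
    }
    override def hasNext: Boolean = { checkValid(); index < buffer.length }
    override def next(): T = { checkValid(); val v = buffer(index); index += 1; v }
  }
}
```

With this, the surprising `assert(iterator.hasNext)` failures shown above become a loud `ConcurrentModificationException`, which is exactly what the `modCount` field buys.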
EDIT: I changed to not use the inbuilt iterator from `ArrayBuffer` and instead use a counter to iterate over the array. However, I still depend on the inbuilt iterator of `UnsafeExternalSorter`.
 * [[ArrayBuffer]] or [[Array]].
 */
private[sql] class ExternalAppendOnlyUnsafeRowArray(numRowsSpillThreshold: Int) extends Logging {
  private val inMemoryBuffer: ArrayBuffer[UnsafeRow] = ArrayBuffer.empty[UnsafeRow]
Perhaps it is better just to allocate an array, but that might be overkill.
My intention behind not using a raw `Array` is to avoid holding that memory (if we go this route, one would have to set the spill threshold to a relatively low value to avoid potential wastage of memory).

Before this PR:
- Initial size:
  - `SortMergeJoin` started off with an `ArrayBuffer` of default size (i.e. 16)
  - `WindowExec` started off with an empty `ArrayBuffer`
- For both cases, there was no shrinking of the array, so memory is not reclaimed until the operator finishes.

Proposed change:
- I am switching to `new ArrayBuffer(128)` for both cases in order to init with a decent size and not start with an empty array. Allocating space for 128 entries upfront is a trivial memory footprint.
- Keeping the "no shrinking" behavior the same. A part of me thinks I could do something smarter by shrinking based on a running average of actual lengths of the array, but it might be over-optimization. I will first focus on getting the basic stuff in.
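The in-memory-then-spill switch discussed above can be sketched in plain Scala. This is a hypothetical simplification: `HybridArray` and its `List`-based "spilled" store are stand-ins for the real `ExternalAppendOnlyUnsafeRowArray` and `UnsafeExternalSorter`, chosen only to show the shape of `add()`.

```scala
import scala.collection.mutable.ArrayBuffer

// Hypothetical sketch: start with a pre-sized in-memory buffer, and once the
// spill threshold is reached, migrate everything to a "spillable" backend and
// release the in-memory buffer. The List here is a toy stand-in for
// UnsafeExternalSorter.
class HybridArray[T](spillThreshold: Int) {
  private var inMemoryBuffer: ArrayBuffer[T] =
    new ArrayBuffer[T](math.min(128, math.max(spillThreshold, 1)))
  private var spilled: List[T] = Nil
  private var usingSpill = false

  def add(entry: T): Unit = {
    if (!usingSpill && inMemoryBuffer.length >= spillThreshold) {
      // Threshold reached: move existing rows to the spillable backend and
      // drop the in-memory buffer so its memory can be reclaimed.
      spilled = inMemoryBuffer.toList
      inMemoryBuffer = null
      usingSpill = true
    }
    if (usingSpill) spilled = spilled :+ entry else inMemoryBuffer += entry
  }

  def length: Int = if (usingSpill) spilled.length else inMemoryBuffer.length
  def toList: List[T] = if (usingSpill) spilled else inMemoryBuffer.toList
}
```

The design point mirrored here is the one from the comment above: the in-memory buffer is pre-sized (128 in the PR) but released entirely once the backend switches, so the threshold governs both performance and memory held.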
@@ -97,6 +98,11 @@ case class SortMergeJoinExec(

  protected override def doExecute(): RDD[InternalRow] = {
    val numOutputRows = longMetric("numOutputRows")
    val spillThreshold =
      sqlContext.conf.getConfString(
        "spark.sql.sortMergeJoinExec.buffer.spill.threshold",
Let's also move this configuration into `SQLConf`.
done
ctx.addMutableState(clsName, matches, s"$matches = new $clsName();")
val clsName = classOf[ExternalAppendOnlyUnsafeRowArray].getName

val spillThreshold =
Place it in a `def`/`lazy val` if we need it more than once?
moved to a method
private[this] val iter: UnsafeSorterIterator = sorter.getIterator

private[this] val currentRow = new UnsafeRow(numFields)
private[window] class RowBuffer(appendOnlyExternalArray: ExternalAppendOnlyUnsafeRowArray) {
Let's just drop `RowBuffer` in favor of `ExternalAppendOnlyUnsafeRowArray`; it doesn't make a lot of sense to keep it around. We just need a `generateIterator(offset)` for the unbounded following case.
Removed RowBuffer
  valueInserted
}

private def checkIfValueExits(iterator: Iterator[UnsafeRow], expectedValue: Long): Unit = {
Hi, @tejasapatil. Should this be `checkIfValueExists`?
fixed the typo
private var spillableArray: UnsafeExternalSorter = null
private var numElements = 0

// A counter to keep track of total additions made to this array since its creation.
According to here, `total additions` -> `total modifications`?
changed
} else {
  if (spillableArray == null) {
    logInfo(s"Reached spill threshold of $numRowsSpillThreshold rows, switching to " +
      s"${classOf[org.apache.spark.sql.execution.ExternalAppendOnlyUnsafeRowArray].getName}")
I understand the intention here, but the log message looks a little bit misleading to me because we are already using `ExternalAppendOnlyUnsafeRowArray`. Also, technically, we are switching to `UnsafeExternalSorter` under the hood.
This was a typo. Thanks for pointing this out. Corrected.
@@ -285,6 +283,9 @@ case class WindowExec(
    val expressions = windowFrameExpressionFactoryPairs.flatMap(_._1)
    val factories = windowFrameExpressionFactoryPairs.map(_._2).toArray

    var spillThreshold =
`var` -> `val`?
changed
// This helps to invalidate iterators when there are changes done to the backing array.
private var modCount: Long = 0

private var numFieldPerRow = 0
`numFieldsPerRow` for consistency?
renamed
Retest this please
Test build #72827 has finished for PR 16909 at commit
@tejasapatil Do you want to fix the BufferedRowIterator for WholeStageCodegenExec as well? As for inner join, the LinkedList currentRows would cause the same issue as it buffers the rows from inner join, and takes more memory (probably double if left and right have similar sizes). Also, they can share a similar iterator data structure.
@hvanhovell @davies Correct me if I am wrong. My understanding is that the following code will go through all matching rows on the right side, and put them into the BufferedRowIterator. If there is an OOM caused by the ArrayList matches in SortMergeJoinExec, the memory usage will be doubled in currentRows in BufferedRowIterator (assuming the left and right have the same size). There are two ways to solve it. One is to consume one row at a time, and the other is to make BufferedRowIterator spillable. BTW, I used to see that the memory consumption in BufferedRowIterator caused OOM along with matches in SortMergeJoinExec. The former took much more heap memory.
The problem pointed out by @zhzhan is a legit problem and I have seen it personally. However, I would scope this PR to 2 use cases and handle other cases separately.
Thanks for the review comments @hvanhovell and @dongjoon-hyun. I have made the changes.
@hvanhovell : Wrt. potential performance implications, the updated PR has a benchmark class with numbers. It's an isolated micro-benchmark comparing the new array impl with `ArrayBuffer`. In order to see how much this actually adds up at a query level, I am doing an integration test over a large dataset by running an SMB join with this PR and comparing it against the version without this PR. Will get back with those numbers in some time.
  }
}

def generateIterator(): Iterator[UnsafeRow] = generateIterator(0)
Self review: will change this to `generateIterator(startIndex = 0)`.
Test build #72918 has finished for PR 16909 at commit
Force-push: a4c4416 to e653171
Test build #72920 has started for PR 16909 at commit
@hvanhovell : Re performance: I ran an SMB join query over two real world tables (2 trillion rows (40 TB) and 6 million rows (120 GB)). Note that this dataset does not have skew, so no spill happened. I saw improvement in CPU time by 2-4% over the version without this PR. This did not add up as I was expecting some regression. I think allocating an array of capacity 128 at the start (instead of starting with default size 16) is the sole reason for the perf gain: https://github.com/tejasapatil/spark/blob/SPARK-13450_smb_buffer_oom/sql/core/src/main/scala/org/apache/spark/sql/execution/ExternalAppendOnlyUnsafeRowArray.scala#L43 . I could remove that and rerun, but effectively the change will be deployed in this form and I wanted to see the effect of it over a large workload. For micro-benchmarking, I took care of making the test case allocate array buffers of the same size to prevent this from interfering with the results. The numbers there look sane as there is some regression from using the spillable array. I see testcases
 * This may lead to a performance regression compared to the normal case of using an
 * [[ArrayBuffer]] or [[Array]].
 */
private[sql] class ExternalAppendOnlyUnsafeRowArray(numRowsSpillThreshold: Int) extends Logging {
How does this compare to use ExternalUnsafeSorter directly? There is a similar use case here: https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/CartesianProductExec.scala#L42
@davies :
- See the PR description with comparison results.
- Updated the PR to include `CartesianProductExec`
From the comparison results, `ExternalUnsafeSorter` performs slightly better? If so, any reason not to use `ExternalUnsafeSorter` directly?
Good point. My original guess before writing the benchmark was that `ArrayBuffer` would be superior to `ExternalUnsafeSorter`, so having an impl which would initially behave like `ArrayBuffer` but later switch to `ExternalUnsafeSorter` was what I went with. I won't make this call solely based on the micro-benchmark results as they might not reflect what actually happens when a query runs, because there are other operations that happen while the buffer is populated and accessed.
Update: I created a test version by removing the `ArrayBuffer` and letting `ExternalUnsafeSorter` alone back the data. Over prod queries, this turned out to be slightly slower, unlike what the micro-benchmark results showed. I will stick with the current approach.
ok to test
retest this please
Test build #72941 has finished for PR 16909 at commit
Force-push: 1494c83 to 5b73e2b
Test build #72963 has finished for PR 16909 at commit
Test build #72964 has finished for PR 16909 at commit
…ernalAppendOnlyUnsafeRowArray` against using raw `UnsafeExternalSorter`
Force-push: 3ae834a to 23acc3f
Test build #74603 has finished for PR 16909 at commit
LGTM - merging to master. Thanks!
// Setup the frames.
var i = 0
while (i < numFrames) {
  frames(i).prepare(rowBuffer.copy())
  frames(i).prepare(buffer)
No copy?
…lling

## What changes were proposed in this pull request?
`WindowExec` currently improperly stores complex objects (UnsafeRow, UnsafeArrayData, UnsafeMapData, UTF8String) during aggregation by keeping a reference in the buffer used by `GeneratedMutableProjections` to the actual input data. Things go wrong when the input object (or the backing bytes) are reused for other things. This could happen in window functions when it starts spilling to disk. When reading back the spill files, the `UnsafeSorterSpillReader` reuses the buffer to which the `UnsafeRow` points, leading to weird corruption scenarios. Note that this only happens for aggregate functions that preserve (parts of) their input, for example `FIRST`, `LAST`, `MIN` & `MAX`.

This was not seen before, because the spilling logic was not doing actual spills as much and actually used an in-memory page. This page was not cleaned up during window processing and made sure unsafe objects point to their own dedicated memory location. This was changed by #16909; after this PR Spark spills more eagerly.

This PR provides a surgical fix because we are close to releasing Spark 2.2. This change just makes sure that there cannot be any object reuse at the expense of a little bit of performance. We will follow up with a more subtle solution at a later point.

## How was this patch tested?
Added a regression test to `DataFrameWindowFunctionsSuite`.

Author: Herman van Hovell <hvanhovell@databricks.com>

Closes #18470 from hvanhovell/SPARK-21258.

(cherry picked from commit e2f32ee) Signed-off-by: Wenchen Fan <wenchen@databricks.com>
…nalAppendOnlyUnsafeRowArray

## What changes were proposed in this pull request?
[SPARK-21595](https://issues.apache.org/jira/browse/SPARK-21595) reported that there is excessive spilling to disk due to the default spill threshold for `ExternalAppendOnlyUnsafeRowArray` being quite small for the WINDOW operator. The old behaviour of the WINDOW operator (pre #16909) would hold data in an array for the first 4096 records, post which it would switch to `UnsafeExternalSorter` and start spilling to disk after reaching `spark.shuffle.spill.numElementsForceSpillThreshold` (or earlier if there was paucity of memory due to excessive consumers).

Currently the (switch from in-memory to `UnsafeExternalSorter`) and (`UnsafeExternalSorter` spilling to disk) for `ExternalAppendOnlyUnsafeRowArray` are controlled by a single threshold. This PR aims to separate that to have more granular control.

## How was this patch tested?
Added unit tests

Author: Tejas Patil <tejasp@fb.com>

Closes #18843 from tejasapatil/SPARK-21595.

(cherry picked from commit 9443999) Signed-off-by: Herman van Hovell <hvanhovell@databricks.com>
I have one question regarding this change. @tejasapatil, thanks.
Hi @sheperdh, the PR author does make a brief note about 'Full Outer Joins' in the PR description.
Apparently, this separate PR is pending.
What issue does this PR address?
Jira: https://issues.apache.org/jira/browse/SPARK-13450
In `SortMergeJoinExec`, rows of the right relation having the same value for a join key are buffered in-memory. In case of skew, this causes OOMs (see comments in SPARK-13450 for more details). A heap dump from a failed job confirms this: https://issues.apache.org/jira/secure/attachment/12846382/heap-dump-analysis.png . While it's possible to increase the heap size as a workaround, Spark should be resilient to such issues as skews can happen arbitrarily.

Change proposed in this pull request
- Introduced `ExternalAppendOnlyUnsafeRowArray`
  - It holds `UnsafeRow`s in-memory up to a certain threshold.
  - After the threshold is hit, it switches to `UnsafeExternalSorter` which enables spilling of the rows to disk. It does NOT sort the data.
  - Any modification (`add` or `clear`) will invalidate the existing iterator(s)
- `WindowExec` was already using `UnsafeExternalSorter` to support spilling. Changed it to use the new array
- Changed `SortMergeJoinExec` to use the new array implementation
- Changed `CartesianProductExec` to use the new array implementation

Note for reviewers
The diff can be divided into 3 parts. My motive behind having all the changes in a single PR was to demonstrate that the API is sane and supports 2 use cases. If reviewing as 3 separate PRs would help, I am happy to make the split.
How was this patch tested?
Unit testing
- Added unit tests for `ExternalAppendOnlyUnsafeRowArray` to validate all its APIs and access patterns
- Added unit tests for `SortMergeExec`. The new array implementation is used in `SortMergeExec` for all joins (except FULL OUTER JOIN); however, I expect existing test cases to cover that there is no regression in correctness.
- Added unit tests for `WindowExec` to check behavior of spilling and correctness of results.

Stress testing
Generating the synthetic data
Ran this against trunk vs. a local build with this PR. OOM repros with trunk, and with the fix this query runs fine.
Performance comparison
Macro-benchmark
I ran an SMB join query over two real world tables (2 trillion rows (40 TB) and 6 million rows (120 GB)). Note that this dataset does not have skew, so no spill happened. I saw improvement in CPU time by 2-4% over the version without this PR. This did not add up as I was expecting some regression. I think allocating an array of capacity 128 at the start (instead of starting with default size 16) is the sole reason for the perf gain: https://github.com/tejasapatil/spark/blob/SPARK-13450_smb_buffer_oom/sql/core/src/main/scala/org/apache/spark/sql/execution/ExternalAppendOnlyUnsafeRowArray.scala#L43 . I could remove that and rerun, but effectively the change will be deployed in this form and I wanted to see the effect of it over a large workload.
Micro-benchmark
Two types of benchmarking can be found in `ExternalAppendOnlyUnsafeRowArrayBenchmark`:

[A] Comparing `ExternalAppendOnlyUnsafeRowArray` against raw `ArrayBuffer` when all rows fit in-memory and there is no spill

[B] Comparing `ExternalAppendOnlyUnsafeRowArray` against raw `UnsafeExternalSorter` when there is spilling of data
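The shape of a comparison like [A] can be sketched in plain Scala. This is a toy illustration, not the actual `ExternalAppendOnlyUnsafeRowArrayBenchmark`: the `timeIt` helper and the workload are my stand-ins, timing only the baseline side (populate a pre-sized `ArrayBuffer`, then scan it once).

```scala
import scala.collection.mutable.ArrayBuffer

// Illustrative micro-benchmark harness: time a populate-then-iterate cycle,
// the access pattern both sides of comparison [A] are measured on.
def timeIt[T](label: String)(body: => T): (T, Long) = {
  val start = System.nanoTime()
  val result = body
  val elapsed = System.nanoTime() - start
  println(s"$label took ${elapsed / 1000000} ms")
  (result, elapsed)
}

val numRows = 100000
val (sum, _) = timeIt("ArrayBuffer populate+scan") {
  val buffer = new ArrayBuffer[Long](128) // pre-sized, mirroring the PR's choice
  var i = 0L
  while (i < numRows) { buffer += i; i += 1 }
  buffer.iterator.sum // scan once through an iterator, as the operators do
}
```

A real benchmark would repeat this for both implementations over several iterations and row counts, discarding warm-up runs before comparing.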