[SPARK-32900][CORE] Allow UnsafeExternalSorter to spill when there are nulls. #29772
Conversation
ok to test
core/src/test/java/org/apache/spark/util/collection/unsafe/sort/UnsafeExternalSorterSuite.java
LGTM pending jenkins
public long spill() throws IOException {
  synchronized (this) {
-   if (!(upstream instanceof UnsafeInMemorySorter.SortedIterator && nextUpstream == null
-       && numRecords > 0)) {
+   if (inMemSorter == null || numRecords <= 0) {
I haven't touched this file for a long time. Do you mind providing more context? Does `inMemSorter == null || numRecords <= 0` mean that this sorter has not had any records inserted yet?
I think it was originally supposed to mean this, but it could also mean that all records have been read. This is actually a bug: during the call to spill it would prevent freeing memory that is no longer necessary. I've kept this bug here to keep the changelist as small as possible (there's a little bit more to this than just removing the check). I'll address this issue in a separate ticket.
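The difference between the old type-based check and the new state-based check discussed above can be sketched as follows. This is an illustrative sketch with hypothetical stand-in types, not the real Spark classes; only the two predicates mirror the code in the diff.

```java
// Hypothetical stand-ins for the iterator types named in the discussion.
interface UnsafeSorterIteratorStub {}
class SortedIteratorStub implements UnsafeSorterIteratorStub {}   // radix sort, no NULLs
class ChainedIteratorStub implements UnsafeSorterIteratorStub {}  // radix sort with NULLs

public class SpillCheckSketch {
    // Old check: infers "already spilled" from the iterator's concrete type.
    static boolean oldCheckSkipsSpill(UnsafeSorterIteratorStub upstream,
                                      Object nextUpstream, int numRecords) {
        return !(upstream instanceof SortedIteratorStub
                 && nextUpstream == null && numRecords > 0);
    }

    // New check: infers "already spilled / nothing to spill" from state instead.
    static boolean newCheckSkipsSpill(Object inMemSorter, int numRecords) {
        return inMemSorter == null || numRecords <= 0;
    }

    public static void main(String[] args) {
        Object inMemSorter = new Object();  // sorter still holds records in memory
        int numRecords = 1_000_000;

        // NULLs in the input + radix sort => upstream is a chained iterator.
        UnsafeSorterIteratorStub upstream = new ChainedIteratorStub();

        // Old check skips the spill even though memory could be freed.
        System.out.println(oldCheckSkipsSpill(upstream, null, numRecords)); // true
        // New check correctly allows the spill.
        System.out.println(newCheckSkipsSpill(inMemSorter, numRecords));    // false
    }
}
```

Keying the decision on `inMemSorter == null` also makes the check independent of which concrete iterator the in-memory sorter happens to return.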
Test build #128768 has finished for PR 29772 at commit
Test build #128769 has finished for PR 29772 at commit
retest this please
Test build #128770 has finished for PR 29772 at commit
Test build #128771 has finished for PR 29772 at commit
Test build #128776 has finished for PR 29772 at commit
Can we update the description a bit? This is reading like Spark thinks:

    if (!(upstream instanceof UnsafeInMemorySorter.SortedIterator && nextUpstream == null
        && numRecords > 0)) {
      return 0L;
    }
Thanks for the review! Reworded the description to be clearer.
Alright, merging to master/3.0/2.4. Thanks!
…e nulls

### What changes were proposed in this pull request?
This PR changes the way `UnsafeExternalSorter.SpillableIterator` checks whether it has spilled already, by checking whether `inMemSorter` is null. It also allows it to spill other `UnsafeSorterIterator`s than `UnsafeInMemorySorter.SortedIterator`.

### Why are the changes needed?
Before this PR `UnsafeExternalSorter.SpillableIterator` could not spill when there are NULLs in the input and radix sorting is used. Currently, Spark determines whether `UnsafeExternalSorter.SpillableIterator` has not spilled yet by checking whether `upstream` is an instance of `UnsafeInMemorySorter.SortedIterator`. When radix sorting is used and there are NULLs in the input however, `upstream` will be an instance of `UnsafeExternalSorter.ChainedIterator` instead, and Spark will assume that the `SpillableIterator` iterator has spilled already, and therefore cannot spill again when it's supposed to spill.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
A test was added to `UnsafeExternalSorterSuite` (and therefore also to `UnsafeExternalSorterRadixSortSuite`). I manually confirmed that the test failed in `UnsafeExternalSorterRadixSortSuite` without this patch.

Closes #29772 from tomvanbussel/SPARK-32900.

Authored-by: Tom van Bussel <tom.vanbussel@databricks.com>
Signed-off-by: herman <herman@databricks.com>
(cherry picked from commit e5e54a3)
Signed-off-by: herman <herman@databricks.com>
Recently, I encountered this problem in the production environment: there were millions of nulls in memory and no way to spill.
### What changes were proposed in this pull request?
This PR changes the way `UnsafeExternalSorter.SpillableIterator` checks whether it has spilled already, by checking whether `inMemSorter` is null. It also allows it to spill other `UnsafeSorterIterator`s than `UnsafeInMemorySorter.SortedIterator`.

### Why are the changes needed?
Before this PR `UnsafeExternalSorter.SpillableIterator` could not spill when there are NULLs in the input and radix sorting is used. Spark determines whether `UnsafeExternalSorter.SpillableIterator` has not spilled yet by checking whether `upstream` is an instance of `UnsafeInMemorySorter.SortedIterator`. When radix sorting is used and there are NULLs in the input, however, `upstream` will be an instance of `UnsafeExternalSorter.ChainedIterator` instead, and Spark will assume that the `SpillableIterator` has spilled already, and therefore cannot spill again when it is supposed to spill.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
A test was added to `UnsafeExternalSorterSuite` (and therefore also to `UnsafeExternalSorterRadixSortSuite`). I manually confirmed that the test failed in `UnsafeExternalSorterRadixSortSuite` without this patch.
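The failure mode described above hinges on the upstream iterator being a chained concatenation of two runs (the NULL-prefix records and the radix-sorted records) rather than a single sorted iterator. A minimal sketch of such chaining, using an illustrative `Chained` class rather than the real `UnsafeExternalSorter.ChainedIterator`:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class ChainedIteratorSketch {
    // Illustrative stand-in for a chained iterator: concatenates the run of
    // NULL-prefix records with the radix-sorted run of non-null records.
    static class Chained<T> implements Iterator<T> {
        private final Iterator<T> first;
        private final Iterator<T> second;
        Chained(Iterator<T> first, Iterator<T> second) {
            this.first = first;
            this.second = second;
        }
        @Override public boolean hasNext() { return first.hasNext() || second.hasNext(); }
        @Override public T next() { return first.hasNext() ? first.next() : second.next(); }
    }

    // Drain an iterator into a list, preserving order.
    static List<Long> drain(Iterator<Long> it) {
        List<Long> out = new ArrayList<>();
        while (it.hasNext()) out.add(it.next());
        return out;
    }

    public static void main(String[] args) {
        Iterator<Long> nullPrefixRun = Arrays.asList(0L, 0L).iterator();  // NULL-prefix records
        Iterator<Long> sortedRun = Arrays.asList(1L, 2L, 3L).iterator();  // radix-sorted records
        Iterator<Long> upstream = new Chained<>(nullPrefixRun, sortedRun);

        // A type test in the style of the old `upstream instanceof SortedIterator`
        // check sees a chained iterator here and would wrongly conclude
        // "already spilled", even though all records are still in memory.
        System.out.println(upstream instanceof Chained);  // true
        System.out.println(drain(upstream));              // [0, 0, 1, 2, 3]
    }
}
```

The chained upstream yields exactly the same records as a single sorted iterator would, which is why a state-based check (is `inMemSorter` still non-null?) is a more robust spill condition than a check on the iterator's concrete type.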