
[SPARK-32900][CORE] Allow UnsafeExternalSorter to spill when there are nulls. #29772

Closed · wants to merge 5 commits

Conversation

@tomvanbussel (Contributor) commented Sep 16, 2020

What changes were proposed in this pull request?

This PR changes the way `UnsafeExternalSorter.SpillableIterator` checks whether it has spilled already, by checking whether `inMemSorter` is null. It also allows it to spill other `UnsafeSorterIterator`s than `UnsafeInMemorySorter.SortedIterator`.

Why are the changes needed?

Before this PR, `UnsafeExternalSorter.SpillableIterator` could not spill when there are NULLs in the input and radix sorting is used. Currently, Spark determines whether `UnsafeExternalSorter.SpillableIterator` has not spilled yet by checking whether `upstream` is an instance of `UnsafeInMemorySorter.SortedIterator`. When radix sorting is used and there are NULLs in the input, however, `upstream` will be an instance of `UnsafeExternalSorter.ChainedIterator` instead, and Spark will assume that the `SpillableIterator` has spilled already, and therefore cannot spill again when it is supposed to spill.
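As a rough illustration (a self-contained toy model, not Spark source; the class names merely mirror the ones mentioned above), the difference between keying the check on the iterator's concrete type versus on `inMemSorter` looks like this:

```java
// Toy model of the spill check, for illustration only (hypothetical stand-ins
// for Spark's internal classes; not the actual UnsafeExternalSorter code).
interface UnsafeSorterIterator {}
class SortedIterator implements UnsafeSorterIterator {}   // stand-in for UnsafeInMemorySorter.SortedIterator
class ChainedIterator implements UnsafeSorterIterator {}  // what radix sort + NULLs produces

public class SpillCheckDemo {
  // Pre-patch heuristic: "not spilled yet" is inferred from the iterator's concrete type.
  static boolean canSpillOld(UnsafeSorterIterator upstream) {
    return upstream instanceof SortedIterator;
  }

  // Post-patch check: "not spilled yet" means the in-memory sorter still exists.
  static boolean canSpillNew(Object inMemSorter) {
    return inMemSorter != null;
  }

  public static void main(String[] args) {
    UnsafeSorterIterator upstream = new ChainedIterator(); // radix sort with NULLs in the input
    Object inMemSorter = new Object();                      // records are still held in memory

    System.out.println(canSpillOld(upstream));    // false: wrongly refuses to spill
    System.out.println(canSpillNew(inMemSorter)); // true: spills as expected
  }
}
```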

Does this PR introduce any user-facing change?

No

How was this patch tested?

A test was added to `UnsafeExternalSorterSuite` (and therefore also to `UnsafeExternalSorterRadixSortSuite`). I manually confirmed that the test failed in `UnsafeExternalSorterRadixSortSuite` without this patch.

@tomvanbussel (Contributor, Author)

cc @hvanhovell @cloud-fan

@hvanhovell (Contributor)

ok to test

@hvanhovell (Contributor) left a review comment:

LGTM pending jenkins

  public long spill() throws IOException {
    synchronized (this) {
-     if (!(upstream instanceof UnsafeInMemorySorter.SortedIterator && nextUpstream == null
-         && numRecords > 0)) {
+     if (inMemSorter == null || numRecords <= 0) {
A contributor commented on this change:

I haven't touched this file for a long time. Do you mind providing more context? Does `inMemSorter == null || numRecords <= 0` mean this sorter has not had any records inserted yet?

@tomvanbussel (Contributor, Author) replied Sep 16, 2020:

I think it was originally supposed to mean that, but it can also mean that all records have already been read. That is actually a bug, since during the call to spill it prevents freeing memory that is no longer needed. I've kept this behavior here to keep the changelist as small as possible (there's a little bit more to this than just removing the check). I'll address the issue in a separate ticket.
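For context, a hypothetical helper that restates the condition under discussion (paraphrasing the explanation above; illustrative only, not the actual Spark source):

```java
// Hypothetical restatement of the early-exit condition in spill()
// (illustrative only; not the actual UnsafeExternalSorter code).
static boolean nothingLeftToSpill(Object inMemSorter, long numRecords) {
  // inMemSorter == null : the in-memory data has already been spilled (or freed),
  //                       so there is nothing left to write out.
  // numRecords <= 0     : either no records were ever inserted, or the iterator has
  //                       already consumed them all. In the "all consumed" case the
  //                       early return also skips freeing the remaining in-memory
  //                       pages -- the pre-existing issue noted above, left for a
  //                       follow-up ticket.
  return inMemSorter == null || numRecords <= 0;
}
```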

@SparkQA commented Sep 16, 2020

Test build #128768 has finished for PR 29772 at commit e6dbe2b.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Sep 16, 2020

Test build #128769 has finished for PR 29772 at commit 6eaf7f0.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan (Contributor)

retest this please

@SparkQA commented Sep 16, 2020

Test build #128770 has finished for PR 29772 at commit 4bfa122.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Sep 16, 2020

Test build #128771 has finished for PR 29772 at commit e514327.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Sep 16, 2020

Test build #128776 has finished for PR 29772 at commit e514327.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya (Member) commented Sep 17, 2020

> Currently, Spark determines whether `UnsafeExternalSorter.SpillableIterator` has spilled already by checking whether `upstream` is an instance of `UnsafeInMemorySorter.SortedIterator`

Can we update the description a bit?

This reads as if Spark thinks `UnsafeExternalSorter.SpillableIterator` has spilled already when `upstream` is an `UnsafeInMemorySorter.SortedIterator`. But actually Spark thinks it has spilled already when `upstream` is *not* an `UnsafeInMemorySorter.SortedIterator`, right?

if (!(upstream instanceof UnsafeInMemorySorter.SortedIterator && nextUpstream == null
    && numRecords > 0)) {
  return 0L;
}

@tomvanbussel (Contributor, Author)

Thanks for the review! Reworded the description to be clearer.

@hvanhovell (Contributor)

Alright merging to master/3.0/2.4. Thanks!

hvanhovell pushed a commit that referenced this pull request Sep 17, 2020
[SPARK-32900][CORE] Allow UnsafeExternalSorter to spill when there are nulls

Closes #29772 from tomvanbussel/SPARK-32900.

Authored-by: Tom van Bussel <tom.vanbussel@databricks.com>
Signed-off-by: herman <herman@databricks.com>
(cherry picked from commit e5e54a3)
Signed-off-by: herman <herman@databricks.com>
hvanhovell pushed a commit that referenced this pull request Sep 17, 2020
[SPARK-32900][CORE] Allow UnsafeExternalSorter to spill when there are nulls
holdenk pushed a commit to holdenk/spark that referenced this pull request Oct 27, 2020
[SPARK-32900][CORE] Allow UnsafeExternalSorter to spill when there are nulls
@cxzl25 (Contributor) commented Jul 7, 2021

Recently I encountered this problem in our production environment: with millions of nulls in memory, there was no way to spill.
I applied your patch and verified that it fixes the issue.
Thank you @tomvanbussel.
Before this, I also raised a related question in #26323.

otterc pushed a commit to linkedin/spark that referenced this pull request Mar 22, 2023
[SPARK-32900][CORE] Allow UnsafeExternalSorter to spill when there are nulls