[SPARK-19882][SQL] Pivot with null as the distinct pivot value throws NPE #17224
What changes were proposed in this pull request?
This PR proposes to fix two problems as below:
Use previous code path to handle `null` in distinct pivot values
An optimisation was introduced to prevent each input from being evaluated on every aggregate, which seems to make it slow when there are too many `pivotValues`. It seems this optimisation tightly assumes that a distinct pivot value can't be `null`.
I could not find a clean and easy workaround to support this, and I wonder if it is worth it. Please guide me if anyone knows a clean and short way to fix it in this two-aggregation path.
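To make the trade-off above concrete, here is a minimal plain-Python sketch of the per-value path, in which every input row is tested against every pivot value. It is not Spark code and not the change in this PR: `sql_eq`, `null_safe_eq`, and `pivot_count` are hypothetical helpers, and modelling the comparison as SQL `=` versus null-safe `<=>` is an assumption made only for illustration.

```python
def sql_eq(a, b):
    """Model of SQL `=`: comparing with NULL yields NULL (None here)."""
    if a is None or b is None:
        return None
    return a == b


def null_safe_eq(a, b):
    """Model of SQL `<=>`: NULL matches NULL and the result is never NULL."""
    if a is None or b is None:
        return a is None and b is None
    return a == b


def pivot_count(rows, pivot_values, eq):
    """Per-value path: each row is evaluated once per pivot value.

    rows is a list of (group_key, pivot_key) pairs (hypothetical shape).
    Returns {group_key: [count per pivot value]}.
    """
    result = {}
    for group_key, pivot_key in rows:
        counts = result.setdefault(group_key, [0] * len(pivot_values))
        for i, value in enumerate(pivot_values):
            # Only a definite True counts; a NULL comparison result does not.
            if eq(pivot_key, value) is True:
                counts[i] += 1
    return result


rows = [("a", None), ("a", 1), ("b", None), ("b", None)]
pivot_values = [None, 1]

print(pivot_count(rows, pivot_values, sql_eq))        # null column never matches
print(pivot_count(rows, pivot_values, null_safe_eq))  # null rows are counted
```

The inner loop is what makes this path cost one evaluation per row per pivot value, which is the slowdown the optimisation was meant to avoid when `pivotValues` is large.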
Fix to count `null`
Before (Spark 1.6),
Before (current master), <- this is currently a regression
After,
It seems we should count null given
How was this patch tested?
Unit tests in `DataFramePivotSuite`.