[SPARK-48019] Fix incorrect behavior in ColumnVector/ColumnarArray with dictionary and nulls #46254
good catch!
thanks, merging to master/3.5!
…th dictionary and nulls

### What changes were proposed in this pull request?
This fixes how `ColumnVector` handles copying arrays when the vector has a dictionary and null values. The possible issues with the previous implementation:
- An `ArrayIndexOutOfBoundsException` may be thrown when the `ColumnVector` has nulls and dictionaries. This is because the dictionary id for `null` entries might be invalid and should not be used for `null` entries.
- Copying a `ColumnarArray` (which contains a `ColumnVector`) is incorrect if it contains `null` entries. This is because copying a primitive array does not take into account the `null` entries, so all the null entries get lost.

### Why are the changes needed?
These changes are needed to avoid `ArrayIndexOutOfBoundsException` and to produce correct results when copying `ColumnarArray`.

### Does this PR introduce _any_ user-facing change?
The only user-facing changes are fixes for existing errors and incorrect results.

### How was this patch tested?
Added new unit tests.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #46254 from gene-db/dictionary-nulls.

Authored-by: Gene Pang <gene.pang@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 76ce6b0)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
…when nulls exist

### What changes were proposed in this pull request?
This is a followup to #46254. Instead of using object arrays when nulls are present, continue to use primitive arrays when appropriate. This PR sets the null bits appropriately for the primitive array copy. Primitive arrays are faster than object arrays and won't create unnecessary objects.

### Why are the changes needed?
This will improve performance and memory usage when nulls are present in the `ColumnarArray`.

### Does this PR introduce _any_ user-facing change?
This is expected to be faster when copying `ColumnarArray`.

### How was this patch tested?
Existing tests.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #46372 from gene-db/primitive-nulls.

Authored-by: Gene Pang <gene.pang@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit bf2e254)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
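The idea in the follow-up (keep primitive storage and carry the null bits along with the copy) can be pictured with a small stand-alone sketch. This is only an illustration under stated assumptions: `PrimitiveCopySketch` and its members are hypothetical names, a `boolean[]` stands in for the null bits, and none of this is Spark's actual implementation.

```java
import java.util.Arrays;

// Hypothetical illustration only; the class and member names are not Spark's.
final class PrimitiveCopySketch {
    final int[] values;     // primitive storage, no boxing
    final boolean[] nulls;  // one flag per slot, standing in for the null bits

    PrimitiveCopySketch(int[] values, boolean[] nulls) {
        this.values = values;
        this.nulls = nulls;
    }

    // Copies the primitive values in bulk and carries the null flags along,
    // so null slots are still null in the copy.
    PrimitiveCopySketch copy() {
        return new PrimitiveCopySketch(
            Arrays.copyOf(values, values.length),
            Arrays.copyOf(nulls, nulls.length));
    }

    boolean isNullAt(int i) {
        return nulls[i];
    }

    int getInt(int i) {
        return values[i];
    }

    public static void main(String[] args) {
        PrimitiveCopySketch original =
            new PrimitiveCopySketch(new int[] {1, 0, 3}, new boolean[] {false, true, false});
        PrimitiveCopySketch copied = original.copy();
        System.out.println(copied.isNullAt(1));
        System.out.println(copied.getInt(2));
    }
}
```

Compared with copying into an `Integer[]`, this layout allocates no per-element objects, which is the performance and memory argument the commit message makes.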
Thank you, @gene-db and @cloud-fan.
SPARK-48019 was filed with Affected Version: 3.4.3. I also verified that the test cases fail on branch-3.4. Since this is a correctness issue, let me backport this and the follow-up to branch-3.4.
What changes were proposed in this pull request?
This fixes how `ColumnVector` handles copying arrays when the vector has a dictionary and null values. The possible issues with the previous implementation:
- An `ArrayIndexOutOfBoundsException` may be thrown when the `ColumnVector` has nulls and dictionaries. This is because the dictionary id for `null` entries might be invalid and should not be used for `null` entries.
- Copying a `ColumnarArray` (which contains a `ColumnVector`) is incorrect if it contains `null` entries. This is because copying a primitive array does not take into account the `null` entries, so all the null entries get lost.
Why are the changes needed?
These changes are needed to avoid `ArrayIndexOutOfBoundsException` and to produce correct results when copying `ColumnarArray`.
Does this PR introduce any user-facing change?
The only user-facing changes are fixes for existing errors and incorrect results.
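To make the two failure modes concrete, here is a minimal stand-alone sketch. It assumes a toy dictionary-encoded column built from plain arrays; `DictionaryNullSketch` and its methods are hypothetical names for illustration and are not the Spark classes touched by this PR.

```java
import java.util.Arrays;

// Hypothetical stand-in for a dictionary-encoded column; not Spark's actual classes.
final class DictionaryNullSketch {
    // Decodes without consulting the null flags, mirroring the problematic pattern:
    // the id stored at a null slot is used as if it were valid.
    static int[] decodeIgnoringNulls(int[] dictionary, int[] ids) {
        int[] out = new int[ids.length];
        for (int i = 0; i < ids.length; i++) {
            out[i] = dictionary[ids[i]]; // throws if a null slot holds an out-of-range id
        }
        return out;
    }

    // Checks the null flag first, so the id stored at a null slot is never used
    // and the null survives in the decoded result.
    static Integer[] decodeRespectingNulls(int[] dictionary, int[] ids, boolean[] isNull) {
        Integer[] out = new Integer[ids.length];
        for (int i = 0; i < ids.length; i++) {
            out[i] = isNull[i] ? null : dictionary[ids[i]];
        }
        return out;
    }

    public static void main(String[] args) {
        int[] dictionary = {10, 20, 30};
        int[] ids = {0, -1, 2};                 // slot 1 is null; its id is arbitrary
        boolean[] isNull = {false, true, false};

        try {
            decodeIgnoringNulls(dictionary, ids);
        } catch (ArrayIndexOutOfBoundsException e) {
            System.out.println("lookup failed on the null slot");
        }
        System.out.println(Arrays.toString(decodeRespectingNulls(dictionary, ids, isNull)));
    }
}
```

Even when the stale id happens to be in range, the first method would return some dictionary value in place of the null, which corresponds to the incorrect-result case rather than the exception case.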
How was this patch tested?
Added new unit tests.
Was this patch authored or co-authored using generative AI tooling?
No.