Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-48019] Fix incorrect behavior in ColumnVector/ColumnarArray with dictionary and nulls #46254

Closed
wants to merge 3 commits into from

Conversation

gene-db
Copy link
Contributor

@gene-db gene-db commented Apr 27, 2024

What changes were proposed in this pull request?

This fixes how ColumnVector handles copying arrays when the vector has a dictionary and null values. The possible issues with the previous implementation:

  • An ArrayIndexOutOfBoundsException may be thrown when the ColumnVector has nulls and dictionaries. This is because the dictionary id for null entries might be invalid and should not be used for null entries.
  • Copying a ColumnarArray (which contains a ColumnVector) is incorrect, if it contains null entries. This is because copying a primitive array does not take into account the null entries, so all the null entries get lost.

Why are the changes needed?

These changes are needed to avoid ArrayIndexOutOfBoundsException and to produce correct results when copying ColumnarArray.

Does this PR introduce any user-facing change?

The only user facing changes are to fix existing errors and incorrect results.

How was this patch tested?

Added new unit tests.

Was this patch authored or co-authored using generative AI tooling?

No.

@github-actions github-actions bot added the SQL label Apr 27, 2024
Copy link
Contributor

@cloud-fan cloud-fan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good catch!

@cloud-fan
Copy link
Contributor

thanks, merging to master/3.5!

@cloud-fan cloud-fan closed this in 76ce6b0 Apr 28, 2024
cloud-fan pushed a commit that referenced this pull request Apr 28, 2024
…th dictionary and nulls

This fixes how `ColumnVector` handles copying arrays when the vector has a dictionary and null values. The possible issues with the previous implementation:
- An `ArrayIndexOutOfBoundsException` may be thrown when the `ColumnVector` has nulls and dictionaries. This is because the dictionary id for `null` entries might be invalid and should not be used for `null` entries.
- Copying a `ColumnarArray` (which contains a `ColumnVector`) is incorrect, if it contains `null` entries. This is because copying a primitive array does not take into account the `null` entries, so all the null entries get lost.

These changes are needed to avoid `ArrayIndexOutOfBoundsException` and to produce correct results when copying `ColumnarArray`.

The only user facing changes are to fix existing errors and incorrect results.

Added new unit tests.

No.

Closes #46254 from gene-db/dictionary-nulls.

Authored-by: Gene Pang <gene.pang@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 76ce6b0)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
cloud-fan pushed a commit that referenced this pull request May 5, 2024
…when nulls exist

### What changes were proposed in this pull request?

This is a followup to #46254 . Instead of using object arrays when nulls are present, continue to use primitive arrays when appropriate. This PR sets the null bits appropriately for the primitive array copy.

Primitive arrays are faster than object arrays and won't create unnecessary objects.

### Why are the changes needed?

This will improve performance and memory usage, when nulls are present in the `ColumnarArray`.

### Does this PR introduce _any_ user-facing change?

This is expected to be faster when copying `ColumnarArray`.

### How was this patch tested?

Existing tests.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #46372 from gene-db/primitive-nulls.

Authored-by: Gene Pang <gene.pang@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
cloud-fan pushed a commit that referenced this pull request May 5, 2024
…when nulls exist

### What changes were proposed in this pull request?

This is a followup to #46254 . Instead of using object arrays when nulls are present, continue to use primitive arrays when appropriate. This PR sets the null bits appropriately for the primitive array copy.

Primitive arrays are faster than object arrays and won't create unnecessary objects.

### Why are the changes needed?

This will improve performance and memory usage, when nulls are present in the `ColumnarArray`.

### Does this PR introduce _any_ user-facing change?

This is expected to be faster when copying `ColumnarArray`.

### How was this patch tested?

Existing tests.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #46372 from gene-db/primitive-nulls.

Authored-by: Gene Pang <gene.pang@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit bf2e254)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
JacobZheng0927 pushed a commit to JacobZheng0927/spark that referenced this pull request May 11, 2024
…th dictionary and nulls

### What changes were proposed in this pull request?

This fixes how `ColumnVector` handles copying arrays when the vector has a dictionary and null values. The possible issues with the previous implementation:
- An `ArrayIndexOutOfBoundsException` may be thrown when the `ColumnVector` has nulls and dictionaries. This is because the dictionary id for `null` entries might be invalid and should not be used for `null` entries.
- Copying a `ColumnarArray` (which contains a `ColumnVector`) is incorrect, if it contains `null` entries. This is because copying a primitive array does not take into account the `null` entries, so all the null entries get lost.

### Why are the changes needed?

These changes are needed to avoid `ArrayIndexOutOfBoundsException` and to produce correct results when copying `ColumnarArray`.

### Does this PR introduce _any_ user-facing change?

The only user facing changes are to fix existing errors and incorrect results.

### How was this patch tested?

Added new unit tests.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes apache#46254 from gene-db/dictionary-nulls.

Authored-by: Gene Pang <gene.pang@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
JacobZheng0927 pushed a commit to JacobZheng0927/spark that referenced this pull request May 11, 2024
…when nulls exist

### What changes were proposed in this pull request?

This is a followup to apache#46254 . Instead of using object arrays when nulls are present, continue to use primitive arrays when appropriate. This PR sets the null bits appropriately for the primitive array copy.

Primitive arrays are faster than object arrays and won't create unnecessary objects.

### Why are the changes needed?

This will improve performance and memory usage, when nulls are present in the `ColumnarArray`.

### Does this PR introduce _any_ user-facing change?

This is expected to be faster when copying `ColumnarArray`.

### How was this patch tested?

Existing tests.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes apache#46372 from gene-db/primitive-nulls.

Authored-by: Gene Pang <gene.pang@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you, @gene-db and @cloud-fan .

SPARK-48019 was filed with Affected Version: 3.4.3. I also verified that the test cases fail on branch-3.4. Since this is a correctness issue, let me backport this and the follow-up to branch-3.4

Screenshot 2024-09-04 at 13 51 34

dongjoon-hyun pushed a commit that referenced this pull request Sep 4, 2024
…th dictionary and nulls

This fixes how `ColumnVector` handles copying arrays when the vector has a dictionary and null values. The possible issues with the previous implementation:
- An `ArrayIndexOutOfBoundsException` may be thrown when the `ColumnVector` has nulls and dictionaries. This is because the dictionary id for `null` entries might be invalid and should not be used for `null` entries.
- Copying a `ColumnarArray` (which contains a `ColumnVector`) is incorrect, if it contains `null` entries. This is because copying a primitive array does not take into account the `null` entries, so all the null entries get lost.

These changes are needed to avoid `ArrayIndexOutOfBoundsException` and to produce correct results when copying `ColumnarArray`.

The only user facing changes are to fix existing errors and incorrect results.

Added new unit tests.

No.

Closes #46254 from gene-db/dictionary-nulls.

Authored-by: Gene Pang <gene.pang@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 76ce6b0)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
dongjoon-hyun pushed a commit that referenced this pull request Sep 4, 2024
…when nulls exist

### What changes were proposed in this pull request?

This is a followup to #46254 . Instead of using object arrays when nulls are present, continue to use primitive arrays when appropriate. This PR sets the null bits appropriately for the primitive array copy.

Primitive arrays are faster than object arrays and won't create unnecessary objects.

### Why are the changes needed?

This will improve performance and memory usage, when nulls are present in the `ColumnarArray`.

### Does this PR introduce _any_ user-facing change?

This is expected to be faster when copying `ColumnarArray`.

### How was this patch tested?

Existing tests.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #46372 from gene-db/primitive-nulls.

Authored-by: Gene Pang <gene.pang@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit bf2e254)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
szehon-ho pushed a commit to szehon-ho/spark that referenced this pull request Sep 24, 2024
…th dictionary and nulls

This fixes how `ColumnVector` handles copying arrays when the vector has a dictionary and null values. The possible issues with the previous implementation:
- An `ArrayIndexOutOfBoundsException` may be thrown when the `ColumnVector` has nulls and dictionaries. This is because the dictionary id for `null` entries might be invalid and should not be used for `null` entries.
- Copying a `ColumnarArray` (which contains a `ColumnVector`) is incorrect, if it contains `null` entries. This is because copying a primitive array does not take into account the `null` entries, so all the null entries get lost.

These changes are needed to avoid `ArrayIndexOutOfBoundsException` and to produce correct results when copying `ColumnarArray`.

The only user facing changes are to fix existing errors and incorrect results.

Added new unit tests.

No.

Closes apache#46254 from gene-db/dictionary-nulls.

Authored-by: Gene Pang <gene.pang@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 76ce6b0)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
szehon-ho pushed a commit to szehon-ho/spark that referenced this pull request Sep 24, 2024
…when nulls exist

### What changes were proposed in this pull request?

This is a followup to apache#46254 . Instead of using object arrays when nulls are present, continue to use primitive arrays when appropriate. This PR sets the null bits appropriately for the primitive array copy.

Primitive arrays are faster than object arrays and won't create unnecessary objects.

### Why are the changes needed?

This will improve performance and memory usage, when nulls are present in the `ColumnarArray`.

### Does this PR introduce _any_ user-facing change?

This is expected to be faster when copying `ColumnarArray`.

### How was this patch tested?

Existing tests.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes apache#46372 from gene-db/primitive-nulls.

Authored-by: Gene Pang <gene.pang@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit bf2e254)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants