Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-48019][SQL][FOLLOWUP] Use primitive arrays over object arrays when nulls exist #46372

Closed
wants to merge 1 commit into from

Conversation

gene-db
Copy link
Contributor

@gene-db gene-db commented May 3, 2024

What changes were proposed in this pull request?

This is a followup to #46254 . Instead of using object arrays when nulls are present, continue to use primitive arrays when appropriate. This PR sets the null bits appropriately for the primitive array copy.

Primitive arrays are faster than object arrays and won't create unnecessary objects.

Why are the changes needed?

This will improve performance and memory usage, when nulls are present in the ColumnarArray.

Does this PR introduce any user-facing change?

This is expected to be faster when copying ColumnarArray.

How was this patch tested?

Existing tests.

Was this patch authored or co-authored using generative AI tooling?

No.

@github-actions github-actions bot added the SQL label May 3, 2024
@gene-db
Copy link
Contributor Author

gene-db commented May 3, 2024

@cloud-fan Could you take a look at this followup to improve the primitive array with nulls issue. Thanks!

@dongjoon-hyun dongjoon-hyun changed the title [SPARK-48019][FOLLOWUP] Set nulls correctly for primitive arrays [SPARK-48019][SQL][FOLLOWUP] Set nulls correctly for primitive arrays May 3, 2024
Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for making a PR, @gene-db .

  • Could you provide a test case to validate your contribution and to prevent any future regression at this area?
  • I'd like to recommend to use [SQL] in SQL module PR title next time.

@gene-db
Copy link
Contributor Author

gene-db commented May 3, 2024

@dongjoon-hyun

  • Could you provide a test case to validate your contribution and to prevent any future regression at this area?

Yes, the previous PR added many tests in this area, to make sure it is correct.

The reason I am updating in this PR is to go back to using primitive arrays (and still produce correct results with the tests), which are faster than the object ones.

  • I'd like to recommend to use [SQL] in SQL module PR title next time.

Ok, I will add the appropriate module name next time. Thanks!

@dongjoon-hyun
Copy link
Member

To @gene-db , for the following purpose, you should not use correctly in the PR title because it's misleading.

faster than the object ones.

@dongjoon-hyun
Copy link
Member

Please revise the PR title and description.

@gene-db gene-db changed the title [SPARK-48019][SQL][FOLLOWUP] Set nulls correctly for primitive arrays [SPARK-48019][SQL][FOLLOWUP] Use primitive arrays over object arrays when nulls exist May 3, 2024
@gene-db
Copy link
Contributor Author

gene-db commented May 3, 2024

Please revise the PR title and description.

@dongjoon-hyun Updated the title and description.

Copy link
Contributor

@cloud-fan cloud-fan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 to change it back to use primitive arrays!

@cloud-fan
Copy link
Contributor

thanks, merging to master/3.5!

@cloud-fan cloud-fan closed this in bf2e254 May 5, 2024
cloud-fan pushed a commit that referenced this pull request May 5, 2024
…when nulls exist

### What changes were proposed in this pull request?

This is a followup to #46254 . Instead of using object arrays when nulls are present, continue to use primitive arrays when appropriate. This PR sets the null bits appropriately for the primitive array copy.

Primitive arrays are faster than object arrays and won't create unnecessary objects.

### Why are the changes needed?

This will improve performance and memory usage, when nulls are present in the `ColumnarArray`.

### Does this PR introduce _any_ user-facing change?

This is expected to be faster when copying `ColumnarArray`.

### How was this patch tested?

Existing tests.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #46372 from gene-db/primitive-nulls.

Authored-by: Gene Pang <gene.pang@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit bf2e254)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
JacobZheng0927 pushed a commit to JacobZheng0927/spark that referenced this pull request May 11, 2024
…when nulls exist

### What changes were proposed in this pull request?

This is a followup to apache#46254 . Instead of using object arrays when nulls are present, continue to use primitive arrays when appropriate. This PR sets the null bits appropriately for the primitive array copy.

Primitive arrays are faster than object arrays and won't create unnecessary objects.

### Why are the changes needed?

This will improve performance and memory usage, when nulls are present in the `ColumnarArray`.

### Does this PR introduce _any_ user-facing change?

This is expected to be faster when copying `ColumnarArray`.

### How was this patch tested?

Existing tests.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes apache#46372 from gene-db/primitive-nulls.

Authored-by: Gene Pang <gene.pang@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
3 participants