perf: directly create projection instead of using DataFrame::with_column #2222

Merged: 2 commits, Feb 27, 2024

@emcake emcake commented Feb 26, 2024

Description

DataFrame::with_column does work that is linear in the number of existing columns each time it appends a column, checking that the new name does not collide with anything. On top of this, once the projection is built, a normalization step (also linear in the number of columns) is performed before the DataFrame is returned.

For a merge that performs a when_matched_update_all type operation on wide tables (100+ columns), this is in effect a 2*N^2 operation, as we were adding the remapped case columns one at a time with with_column and paying both linear costs on every call.
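To make the arithmetic concrete, here is a toy cost model. It is not DataFusion code: the function names and the "two passes per call" accounting are assumptions that mirror the description above, and the constants are only indicative.

```rust
// Toy cost model (hypothetical, not DataFusion internals): count column
// visits when `added` columns are appended to a table that starts with
// `existing` columns.

/// One call per new column. Each call is assumed to scan the current columns
/// for a name collision and then normalize the projection, so it costs about
/// two passes over the current width.
fn incremental_cost(existing: usize, added: usize) -> usize {
    (0..added).map(|k| 2 * (existing + k)).sum()
}

/// One projection built up front: every output expression is visited once.
fn single_projection_cost(existing: usize, added: usize) -> usize {
    existing + added
}

fn main() {
    for n in [10, 100, 1000] {
        println!(
            "n = {n}: incremental = {}, single projection = {}",
            incremental_cost(n, n),
            single_projection_cost(n, n)
        );
    }
}
```

Under these assumptions the incremental cost grows quadratically with table width, while the single projection grows linearly.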

This PR uses project directly to construct the logical plan. We don't need any of the special checking for name clashes or windowing that with_column provides, and the DataFrame is immediately reduced to an unoptimized logical plan anyway, so this produces no change to the schema, just a much more compact logical plan.
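The schema-equivalence claim can be illustrated with a simplified model in which a schema is just an ordered list of column names. The functions below are hypothetical stand-ins, not the DataFusion API, and the column names are made up for the example.

```rust
// Simplified model: a schema is an ordered list of column names.

/// Stand-in for the with_column behaviour described above: scan for a
/// collision, replace in place if the name exists, otherwise append.
fn with_column(mut schema: Vec<String>, name: &str) -> Vec<String> {
    if !schema.iter().any(|c| c.as_str() == name) {
        schema.push(name.to_string());
    }
    // If the name already existed, the column is replaced in place and the
    // list of names is unchanged.
    schema
}

/// Stand-in for building one projection directly: existing columns followed
/// by all new columns, with no collision check.
fn project(existing: &[String], new_names: &[&str]) -> Vec<String> {
    existing
        .iter()
        .cloned()
        .chain(new_names.iter().map(|n| n.to_string()))
        .collect()
}

fn main() {
    let existing: Vec<String> = vec!["id".into(), "a".into(), "b".into()];
    // Hypothetical names for the remapped case columns; known not to clash.
    let new_names = ["case_a", "case_b"];

    let one_at_a_time = new_names
        .iter()
        .fold(existing.clone(), |schema, name| with_column(schema, name));
    let all_at_once = project(&existing, &new_names);

    assert_eq!(one_at_a_time, all_at_once);
    println!("{all_at_once:?}");
}
```

When the new names are known not to clash with existing ones, the collision check never fires, so both routes yield the same column list.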

This reduces an example merge I had from taking 5+ minutes to just optimize the table, down to about 13 seconds including the merge.

@github-actions github-actions bot added the binding/rust Issues for the Rust crate label Feb 26, 2024
@Blajda Blajda left a comment

LGTM! Thanks for looking into the performance issues :)

@Blajda Blajda merged commit 2f2acba into delta-io:main Feb 27, 2024
22 checks passed
ldacey commented Feb 27, 2024

Nice - this will be huge. So this should be in the 0.15.4 release then?

Labels
binding/rust Issues for the Rust crate
3 participants