perf: directly create projection instead of using DataFrame::with_column #2222
Description
`DataFrame::with_column` performs work that is linear in the number of columns each time it appends to or replaces a column, checking that nothing collides. On top of this, once the projection is built, a normalization step (also linear in the number of columns) is performed before the dataframe is returned.

For a merge where we are performing a `when_matched_update_all` type operation on wide tables (100+ columns), this is in effect a `2*N^2` operation, as we were adding the remapped case columns one at a time with `with_column` and then remapping it.

This PR uses `project` directly to construct the logical plan. We don't need any of the special checking for name clashes or windowing that `with_column` provides, and we discard it immediately down to an unoptimized logical plan anyway, so this produces no change to schema, just a much more compact logical plan.

This reduces an example merge I had from taking 5+ minutes just to optimize the table down to about 13 seconds including the merge.
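
To make the `2*N^2` estimate above concrete, here is a toy cost model in plain Rust. It is only a sketch and does not call DataFusion: the function names `visits_one_at_a_time` and `visits_single_projection` are hypothetical, and the model assumes each `with_column`-style call makes exactly one collision pass and one normalization pass over all `N` columns.

```rust
/// Toy cost model (an assumption for illustration, not DataFusion code).
/// Counts column visits when each of `n` columns is replaced one at a time.
pub fn visits_one_at_a_time(n: usize) -> usize {
    let mut visits = 0;
    for _ in 0..n {
        visits += n; // collision check over all existing columns
        visits += n; // normalization pass over the rebuilt projection
    }
    visits
}

/// Counts column visits when the whole projection is built in one pass.
pub fn visits_single_projection(n: usize) -> usize {
    n
}
```

Under these assumptions, a 100-column table costs `visits_one_at_a_time(100) == 20_000` column visits versus `visits_single_projection(100) == 100`. The real gap depends on what the planner does per visit, so treat the numbers as a scaling argument, not a benchmark.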