[Arrow (Dev)] Refactor arrow scan internals #8430

Merged: 24 commits, Aug 7, 2023

Conversation

Tishj
Contributor

@Tishj Tishj commented Jul 31, 2023

Previously we populated ArrowConvertData while scanning the arrow schema and the function returned a DuckDB LogicalType.

This has been removed. Instead, the function now outputs an ArrowType, which contains both the LogicalType and the data that would previously have been put into the ArrowConvertData.

This allows us to simplify the arrow scan a lot, as we no longer need to pass the column index, the map, and the 'arrow_convert_index'.

This arrow_convert_index consisted of a vector of ArrowConvertDataIndices, which was state used to remember which column or column child we were scanning.
It has now also been removed in its entirety.
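
To make the shape of this change concrete, here is a minimal sketch of a recursive type node that carries its own conversion details. It is not the DuckDB implementation: `LogicalTypeSketch`, `ArrowTypeSketch`, `conversion_note`, `MakeNode`, and `Describe` are all hypothetical names invented for illustration, and the real ArrowType holds different fields. The only point is that a walk over such a tree can recurse into children without any external column index or convert-index state.

```cpp
#include <cstddef>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a DuckDB LogicalType; only a name is tracked.
struct LogicalTypeSketch {
    std::string name;
};

// Hypothetical stand-in for ArrowType: the logical type and the conversion
// details that used to live in a separate structure are kept in one node.
struct ArrowTypeSketch {
    LogicalTypeSketch type;
    std::string conversion_note; // illustrative only, e.g. an offset width
    std::vector<std::unique_ptr<ArrowTypeSketch>> children;
    std::unique_ptr<ArrowTypeSketch> dictionary_type; // unused in this sketch
};

// Helper (assumption): builds a node with no children.
std::unique_ptr<ArrowTypeSketch> MakeNode(std::string name, std::string note) {
    auto node = std::make_unique<ArrowTypeSketch>();
    node->type.name = std::move(name);
    node->conversion_note = std::move(note);
    return node;
}

// Walks the tree recursively. Each node already knows its own conversion
// details, so no column index or side-channel state is passed along.
std::string Describe(const ArrowTypeSketch &node) {
    std::string out = node.type.name;
    if (!node.conversion_note.empty()) {
        out += "[" + node.conversion_note + "]";
    }
    if (!node.children.empty()) {
        out += "(";
        for (std::size_t i = 0; i < node.children.size(); i++) {
            if (i > 0) {
                out += ", ";
            }
            out += Describe(*node.children[i]);
        }
        out += ")";
    }
    return out;
}
```

For example, a `LIST` node built with `MakeNode` and given one `VARCHAR` child can be passed to `Describe` directly; the child's details are reached through `children` rather than through a separately maintained index.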

@Tishj Tishj requested a review from pdet July 31, 2023 11:59
@github-actions github-actions bot marked this pull request as draft July 31, 2023 12:06
Contributor

@pdet pdet left a comment

Thanks for picking this refactor up. I think the code looks way cleaner!

I just added some nitpicks.

unique_ptr<ArrowType> dictionary_type;
};

using arrow_column_map_t = unordered_map<idx_t, ArrowType>;
Contributor

I think the type alias is a bit of an excessive abstraction here

Contributor Author

Hmm, yeah, I agree. Before this refactor it was all over the place, but now it's only in a single location.

Contributor Author

Actually, on further inspection I think it makes the prototypes less bulky and clearly defines the relation between ArrowTableType and the arrow scan functions.

It also allows us to change these things in one place. For example, when I made the change to unique_ptr over move semantics, it was really nice to just change the using definition and not have to hunt down its other uses.
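
The single-point-of-change argument can be sketched as follows. This is an illustration under assumptions, not the code from the PR: `ColumnTypeSketch`, `arrow_column_map_sketch_t`, `AddColumn`, and `GetColumn` are hypothetical names, and the exact form of the real alias after the switch to unique_ptr is not shown in this conversation.

```cpp
#include <cstddef>
#include <memory>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical stand-in for the per-column type information.
struct ColumnTypeSketch {
    std::string name;
};

// The alias is the one place that spells out the container. Changing the
// mapped type (for instance from a value to a unique_ptr) is an edit to
// this line; the prototypes below keep the same text.
using arrow_column_map_sketch_t =
    std::unordered_map<std::size_t, std::unique_ptr<ColumnTypeSketch>>;

// Registers a column type under its index.
void AddColumn(arrow_column_map_sketch_t &columns, std::size_t index, std::string name) {
    auto type = std::make_unique<ColumnTypeSketch>();
    type->name = std::move(name);
    columns.emplace(index, std::move(type));
}

// Looks up a column type; throws std::out_of_range if the index is absent.
const ColumnTypeSketch &GetColumn(const arrow_column_map_sketch_t &columns, std::size_t index) {
    return *columns.at(index);
}
```

The function bodies still have to agree with the mapped type, so the alias does not remove every follow-up edit, but the signatures that mention it stay short and stay in sync.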

@Tishj Tishj force-pushed the arrow_conversion_refactor branch from a51e99e to ee66ea7 Compare August 2, 2023 08:45
@Tishj Tishj marked this pull request as ready for review August 5, 2023 17:23
@Mytherin Mytherin merged commit 3ab8d38 into duckdb:master Aug 7, 2023
53 checks passed
@Mytherin
Collaborator

Mytherin commented Aug 7, 2023

Thanks!

Maxxen added a commit to duckdb/duckdb_spatial that referenced this pull request Aug 31, 2023
Apply patch from duckdb/duckdb#8430 and update duckdb to latest main