Support UNION_BY_NAME option in parquet_scan read_parquet #5716
Conversation
Thanks for the PR! Looks good. Some comments below:
```cpp
// union_col_names will exclude generated columns
// like filename, hivepartition etc.
for (idx_t col = 0; col <= reader->last_parquet_col; ++col) {
```
This code looks very similar to the code in read_csv.cpp. Could we extract this into a separate class (e.g. `UnionByName`, similar to how we handle hive partitioning in the `HivePartitioning` class)?
Yes! That makes the code much more reusable.
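A shared helper along the lines the reviewer suggests could look roughly like this. This is a hedged sketch, not DuckDB's actual implementation: the class and method names (`UnionByName`, `CombineColumnNames`) are assumptions modeled on the discussion above. The core idea is that the union schema collects column names across all files by name rather than by position, keeping the first-seen order.

```cpp
#include <cassert>
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical sketch of a shared UnionByName helper (names assumed, not
// DuckDB's real API): merge the column names of several files into one
// combined schema, appending each name the first time it is seen.
struct UnionByName {
	static std::vector<std::string>
	CombineColumnNames(const std::vector<std::vector<std::string>> &per_file_names) {
		std::vector<std::string> union_names;
		std::unordered_set<std::string> seen;
		for (auto &file_names : per_file_names) {
			for (auto &name : file_names) {
				if (seen.insert(name).second) {
					// first time we see this column name: add it to the union
					union_names.push_back(name);
				}
			}
		}
		return union_names;
	}
};
```

Both the CSV and Parquet scan functions could then call this one routine instead of duplicating the loop, which is the re-use the review comment is after.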
extension/parquet/parquet_reader.cpp
Outdated
```cpp
auto child_reader = move(root_struct_reader.child_readers[column_idx]);
auto cast_reader = make_unique<CastColumnReader>(move(child_reader), expected_type);
root_struct_reader.child_readers[column_idx] = move(cast_reader);
if (!parquet_options.union_by_name) {
```
I'm not sure if modifying the Parquet reader itself is necessary here. Can't we handle the `union_by_name` option in the `ParquetScanFunction`, outside of the `ParquetReader` class? That should also allow for more code re-use between the CSV and Parquet readers for this functionality, e.g. by moving `SetNullUnionCols` to a shared `UnionByName` class.
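For illustration, the `SetNullUnionCols` responsibility could be factored into something like the following. This is a toy sketch under assumed names; the real function would operate on a `DataChunk`, while here we only compute, per union column, whether a given file lacks that column and must therefore emit NULL.

```cpp
#include <cassert>
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical helper (name and signature assumed, not DuckDB's API):
// given the union schema and the set of column names actually present in
// one file, return a mask marking which union columns that file must
// fill with NULL.
std::vector<bool> NullUnionColMask(const std::vector<std::string> &union_cols,
                                   const std::unordered_set<std::string> &file_cols) {
	std::vector<bool> is_null(union_cols.size(), false);
	for (size_t i = 0; i < union_cols.size(); i++) {
		if (file_cols.count(union_cols[i]) == 0) {
			// this file has no column with that name: emit NULL for it
			is_null[i] = true;
		}
	}
	return is_null;
}
```

Because the mask depends only on names, the same routine works unchanged for the CSV and Parquet scans.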
Could you also have a look at the failing CI?
parquet_union_by_name.test's glob pattern also matches the files created by parquet_3989.test 😅
Very excited for this!
If a given file does not contain a column of interest (say we do
If you're saying there are two Parquet files, p2.parquet among them: for that query, the current implementation can't do that. But I think we can improve it in another PR.
@douenergy sounds good :D I think it could be a worthwhile enhancement in the future!
@douenergy is there a way to make this behave like
Currently, parquet_scan with union_by_name just works like
Perfect! Does this issue still occur (reading all files for something like a count, even with a filter)? #5790
@danthegoodman1
That'd be AMAZING. I can't get DuckDB to build on my machine or in a GitHub Codespace to save my life, so I will look forward to testing once a release comes out!
Thanks @Mytherin
The failing CodeCov GitHub Action CI is unrelated.
Thanks for the fixes! LGTM
#4837, #4720, #4699 and #3238
We can make these queries much easier!
All Parquet files matched by the glob pattern will be unioned by their column names. If any Parquet file is missing some of the specified columns, NULL will be generated for the missing columns.
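The behaviour described above can be simulated in miniature. The sketch below is illustrative only (the function name and the row representation are invented for this example): each "file" is a list of rows keyed by column name, and any union column absent from a file comes back as `nullopt`, standing in for NULL.

```cpp
#include <cassert>
#include <map>
#include <optional>
#include <string>
#include <vector>

// Illustrative toy model of union-by-name scanning (not DuckDB code):
// a Row maps column names to integer values; columns a file lacks are
// returned as std::nullopt, mirroring the NULL-fill described above.
using Row = std::map<std::string, int>;
using OutRow = std::map<std::string, std::optional<int>>;

std::vector<OutRow> UnionByNameScan(const std::vector<std::string> &union_cols,
                                    const std::vector<std::vector<Row>> &files) {
	std::vector<OutRow> out;
	for (auto &file : files) {
		for (auto &row : file) {
			OutRow result_row;
			for (auto &col : union_cols) {
				auto it = row.find(col);
				if (it != row.end()) {
					result_row[col] = it->second;   // column present in this file
				} else {
					result_row[col] = std::nullopt; // missing column becomes NULL
				}
			}
			out.push_back(result_row);
		}
	}
	return out;
}
```

A scan over one file containing only column `a` and another containing only column `b` yields rows over the union schema `(a, b)`, with NULL in whichever column the source file lacked.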
As described in Add UNION BY NAME to CSV and Parquet import #4720, parquet_scan with union_by_name isn't reversible. If we have p1.parquet and p2.parquet, then p2new.parquet and p2.parquet will have different types. Maybe we should add some warning?
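The irreversibility point can be illustrated with a toy type-promotion rule. This is an assumption-laden sketch, not DuckDB's actual type-unification logic: when two files disagree on a column's type, the union schema promotes to a common wider type, so a file rewritten from the union result no longer records the narrower type the original file had.

```cpp
#include <algorithm>
#include <cassert>

// Toy promotion rule for illustration only (not DuckDB's real logic):
// rank types by "width" and unify to the wider one. If one file stores a
// column as INTEGER and another as DOUBLE, the unioned result is DOUBLE,
// and writing it back out produces a file whose schema differs from the
// original INTEGER file. The original type cannot be recovered, which is
// the non-reversibility noted above.
enum class ToyType { INTEGER = 0, DOUBLE = 1, VARCHAR = 2 };

ToyType UnionColumnType(ToyType a, ToyType b) {
	return static_cast<ToyType>(std::max(static_cast<int>(a), static_cast<int>(b)));
}
```

A warning at scan time, as suggested, would at least make the silent promotion visible to the user.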