
Support UNION_BY_NAME option in parquet_scan / read_parquet #5716

Merged (28 commits) on Jan 13, 2023

Conversation

douenergy (Contributor)

#4837, #4720, #4699 and #3238

We can make queries like this much simpler! Instead of

SELECT * FROM 'p1.parquet' UNION BY NAME
SELECT * FROM 'p2.parquet' UNION BY NAME
SELECT * FROM 'p3.parquet' 

to

SELECT * FROM parquet_scan('p*.parquet', union_by_name=TRUE);
  • All Parquet files matching the glob pattern will be unioned by column name. If a Parquet file is missing some of the specified columns, NULL values will be generated for those columns.

  • As described in Add UNION BY NAME to CSV and Parquet import #4720, parquet_scan with union_by_name isn't reversible.
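The NULL-padding behavior described above can be sketched in plain Python (an illustration of the semantics only, not DuckDB's implementation):

```python
def union_by_name(tables):
    """Union tables (dicts of column name -> list of values) by column
    name, padding missing columns with None (SQL NULL), in the spirit
    of UNION ALL BY NAME."""
    # Collect column names in first-seen order across all tables.
    columns = []
    for t in tables:
        for col in t:
            if col not in columns:
                columns.append(col)
    rows = []
    for t in tables:
        n = len(next(iter(t.values()))) if t else 0
        for i in range(n):
            # Columns absent from this table are padded with None.
            rows.append({c: (t[c][i] if c in t else None) for c in columns})
    return columns, rows

# Two hypothetical files with partially overlapping schemas:
p1 = {"col1": ["a", "b"]}
p2 = {"col1": ["c"], "col2": [1]}
cols, rows = union_by_name([p1, p2])
```

Here `cols` is `["col1", "col2"]` and the rows from `p1` carry `None` for `col2`.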

If we have p1.parquet

col1
a
b
c

and p2.parquet

col1
1
2
3
and we run

INSERT INTO tbl SELECT * FROM read_parquet('*.parquet', FILENAME=TRUE, UNION_BY_NAME=TRUE);

then

COPY (SELECT * FROM tbl WHERE filename='p2.parquet') TO 'p2new.parquet';

p2new.parquet and p2.parquet will have different column types, because union_by_name promotes col1 to a common type (here VARCHAR, since p1's values are strings). Maybe we should add a warning?
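A toy sketch (plain Python, not DuckDB code) of why the round trip changes types: when one file's col1 holds strings and another's holds integers, the unioned column must take a common type, so the p2 rows come back out as strings.

```python
def promote(values):
    """If a column mixes value types across files, fall back to strings,
    mimicking promotion to a common SQL type (e.g. VARCHAR)."""
    types = {type(v) for v in values if v is not None}
    if len(types) > 1:
        return [str(v) if v is not None else None for v in values]
    return values

# col1 from p1.parquet (strings) unioned with col1 from p2.parquet (ints):
unioned = promote(["a", "b", "c", 1, 2, 3])
# Selecting the p2 rows back out now yields strings, not integers.
p2_roundtrip = unioned[3:]
```

Writing `p2_roundtrip` back to a new file therefore produces a VARCHAR column where the original had integers, which is the irreversibility noted above.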

@Mytherin (Collaborator) left a comment

Thanks for the PR! Looks good. Some comments below:


// union_col_names will exclude generated columns
// like filename, hivepartition etc.
for (idx_t col = 0; col <= reader->last_parquet_col; ++col) {
Collaborator:

This code looks very similar to the code in read_csv.cpp - could we extract this into a separate class (e.g. UnionByName - similar to how we handle hive partitioning in the HivePartitioning class)?

Contributor (Author):

Yes! That makes the code much more reusable.

auto child_reader = move(root_struct_reader.child_readers[column_idx]);
auto cast_reader = make_unique<CastColumnReader>(move(child_reader), expected_type);
root_struct_reader.child_readers[column_idx] = move(cast_reader);
if (!parquet_options.union_by_name) {
Collaborator:

I'm not sure if modifying the Parquet Reader itself is necessary here - can't we handle the union_by_name in the ParquetScanFunction outside of the ParquetReader class? That should also allow for more code re-use between the CSV and Parquet readers for this functionality - by e.g. moving SetNullUnionCols to a shared UnionByName class.
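The shared helper suggested here could look roughly like the following (an illustrative Python sketch; the class and method names are hypothetical, not DuckDB's actual API):

```python
class UnionByName:
    """Hypothetical sketch of a helper shared by the CSV and Parquet
    readers: merge per-file schemas and report which columns each file
    must NULL-pad."""

    @staticmethod
    def combine_schemas(schemas):
        # Merge (name, type) pairs from each reader's schema,
        # keeping first-seen order and the first-seen type.
        names, types = [], {}
        for schema in schemas:
            for name, typ in schema:
                if name not in types:
                    names.append(name)
                    types[name] = typ
        return [(n, types[n]) for n in names]

    @staticmethod
    def null_padding_map(union_schema, reader_schema):
        # For one file, list the union columns it lacks
        # (those get NULL values at scan time).
        present = {name for name, _ in reader_schema}
        return [name for name, _ in union_schema if name not in present]

csv_schema = [("col1", "VARCHAR")]
parquet_schema = [("col1", "VARCHAR"), ("col2", "INTEGER")]
union_schema = UnionByName.combine_schemas([csv_schema, parquet_schema])
missing = UnionByName.null_padding_map(union_schema, csv_schema)
```

Keeping this logic outside the readers means each format only exposes its schema and receives a padding list, which is the re-use the review comment asks for.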

@Mytherin (Collaborator)

Could you also have a look at the failing CI?

@douenergy (Contributor, Author)

douenergy commented Dec 16, 2022

> Could you also have a look at the failing CI?

parquet_union_by_name.test's glob pattern also matches the files created by parquet_3989.test 😅

@danthegoodman1

Very excited for this!

@danthegoodman1

If a given file does not contain a column of interest (say we do SELECT col_b FROM read_parquet... WHERE col_b IS NOT NULL), will DuckDB be able to recognize from the metadata that the column does not exist and thus not read the file that does not contain the column?

@douenergy (Contributor, Author)

douenergy commented Dec 19, 2022

> If a given file does not contain a column of interest (say we do SELECT col_b FROM read_parquet... WHERE col_b IS NOT NULL), will DuckDB be able to recognize from the metadata that the column does not exist and thus not read the file that does not contain the column?

If you mean that there are two Parquet files:
p1.parquet

colA  colB
1     a
2     b
3     c

p2.parquet

colA  colC
1     x
2     y
3     z

then for the query SELECT colB FROM read_parquet... WHERE colB IS NOT NULL, could we read only the metadata and skip p2.parquet entirely?

The current implementation can't do that, but I think we can improve it in another PR.
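The future optimization discussed here, skipping files whose footer metadata shows the projected column is absent, could be sketched like this (hypothetical Python, not DuckDB internals):

```python
def files_to_scan(file_columns, needed_column):
    """Given each file's footer metadata (here just its column names),
    keep only the files that actually contain the projected column.
    Files without it would contribute only NULLs under union_by_name,
    which an `IS NOT NULL` filter then eliminates, so they can be
    skipped without reading any row data."""
    return [f for f, cols in file_columns.items() if needed_column in cols]

# Hypothetical footer metadata for the two files above:
metadata = {
    "p1.parquet": ["colA", "colB"],
    "p2.parquet": ["colA", "colC"],
}
scan_list = files_to_scan(metadata, "colB")
```

With this pruning, only p1.parquet would be opened for the colB query.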

@danthegoodman1

@douenergy sounds good :D I think it could be a worthwhile enhancement in the future!

@danthegoodman1

@douenergy is there a way to make this behave like UNION ALL BY NAME?

@douenergy (Contributor, Author)

> @douenergy is there a way to make this behave like UNION ALL BY NAME?

Currently, parquet_scan with union_by_name works like UNION ALL BY NAME: it allows duplicate rows.

@danthegoodman1

> Currently, parquet_scan with union_by_name works like UNION ALL BY NAME: it allows duplicate rows.

Perfect!

Does this issue still occur (reading all files for something like a count, even with a filter)? #5790

@douenergy (Contributor, Author)

@danthegoodman1
I think parquet_scan (with union_by_name) would help with your use case, because there will be no UNION operator in your query plan (even if you select from a VIEW).

@danthegoodman1

That'd be AMAZING. I can't get DuckDB to build on my machine or in a GitHub Codespace to save my life, so I will look forward to testing once a release comes out!

@douenergy (Contributor, Author)

douenergy commented Dec 29, 2022

Thanks @Mytherin, I think it's now ready for another review.

  1. Created a UnionByName template in common/union_by_name.hpp.
  2. Renamed BaseCSVReader's sql_types and col_names to return_types and names, so BaseCSVReader and ParquetReader use the same data member names for the same purpose.
  3. Fixed the test.

@douenergy (Contributor, Author)

An unrelated CodeCov GitHub Action CI failure; see codecov/codecov-action#598.

@Mytherin (Collaborator) left a comment

Thanks for the fixes! LGTM

@Mytherin Mytherin merged commit 53088f9 into duckdb:master Jan 13, 2023
@douenergy douenergy deleted the Parquet-Union-By-Name branch February 15, 2023 22:29