
Support UNION_BY_NAME option in parquet_scan / read_parquet #5716

Merged (28 commits) on Jan 13, 2023

Conversation

douenergy (Contributor)

#4837, #4720, #4699 and #3238

We can make queries like this much simpler! Instead of

SELECT * FROM 'p1.parquet' UNION BY NAME
SELECT * FROM 'p2.parquet' UNION BY NAME
SELECT * FROM 'p3.parquet' 

to

SELECT * FROM parquet_scan('p*.parquet', union_by_name=TRUE);
  • All Parquet files matching the glob pattern will be unioned by column name. If a Parquet file is missing some of the specified columns, NULL values will be generated for those columns.

  • As described in Add UNION BY NAME to CSV and Parquet import #4720, parquet_scan with union_by_name isn't reversible.
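The NULL-padding behavior described above can be sketched in plain Python (an illustration of the semantics only, not DuckDB's implementation):

```python
def union_by_name(tables):
    """Union tables (dicts of column name -> list of values) by column
    name, padding missing columns with None (SQL NULL), in the spirit
    of UNION ALL BY NAME."""
    # Collect column names in first-seen order across all tables.
    columns = []
    for t in tables:
        for col in t:
            if col not in columns:
                columns.append(col)
    rows = []
    for t in tables:
        n = len(next(iter(t.values()))) if t else 0
        for i in range(n):
            # Columns absent from this table are padded with None.
            rows.append({c: (t[c][i] if c in t else None) for c in columns})
    return columns, rows

# Two hypothetical files with partially overlapping schemas:
p1 = {"col1": ["a", "b"]}
p2 = {"col1": ["c"], "col2": [1]}
cols, rows = union_by_name([p1, p2])
```

Here `cols` is `["col1", "col2"]` and the rows from `p1` carry `None` for `col2`.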

If we have p1.parquet

col1
a
b
c

and p2.parquet

col1
1
2
3
and we run

INSERT INTO tbl SELECT * FROM read_parquet('*.parquet', FILENAME=TRUE, UNION_BY_NAME=TRUE);

then

COPY (SELECT * FROM tbl WHERE filename='p2.parquet') TO 'p2new.parquet';

p2new.parquet and p2.parquet will have different column types, because union_by_name promotes col1 to a common type (here VARCHAR, since p1's values are strings). Maybe we should add a warning?
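A toy sketch (plain Python, not DuckDB code) of why the round trip changes types: when one file's col1 holds strings and another's holds integers, the unioned column must take a common type, so the p2 rows come back out as strings.

```python
def promote(values):
    """If a column mixes value types across files, fall back to strings,
    mimicking promotion to a common SQL type (e.g. VARCHAR)."""
    types = {type(v) for v in values if v is not None}
    if len(types) > 1:
        return [str(v) if v is not None else None for v in values]
    return values

# col1 from p1.parquet (strings) unioned with col1 from p2.parquet (ints):
unioned = promote(["a", "b", "c", 1, 2, 3])
# Selecting the p2 rows back out now yields strings, not integers.
p2_roundtrip = unioned[3:]
```

Writing `p2_roundtrip` back to a new file therefore produces a VARCHAR column where the original had integers, which is the irreversibility noted above.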

@Mytherin (Collaborator) left a comment

Thanks for the PR! Looks good. Some comments below:


// union_col_names will exclude generated columns
// like filename, hivepartition etc.
for (idx_t col = 0; col <= reader->last_parquet_col; ++col) {
Collaborator:

This code looks very similar to the code in read_csv.cpp - could we extract this into a separate class (e.g. UnionByName - similar to how we handle hive partitioning in the HivePartitioning class)?

Contributor (Author):

Yes! That makes the code much more reusable.

auto child_reader = move(root_struct_reader.child_readers[column_idx]);
auto cast_reader = make_unique<CastColumnReader>(move(child_reader), expected_type);
root_struct_reader.child_readers[column_idx] = move(cast_reader);
if (!parquet_options.union_by_name) {
Collaborator:

I'm not sure if modifying the Parquet Reader itself is necessary here - can't we handle the union_by_name in the ParquetScanFunction outside of the ParquetReader class? That should also allow for more code re-use between the CSV and Parquet readers for this functionality - by e.g. moving SetNullUnionCols to a shared UnionByName class.
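The shared helper suggested here could look roughly like the following (an illustrative Python sketch; the class and method names are hypothetical, not DuckDB's actual API):

```python
class UnionByName:
    """Hypothetical sketch of a helper shared by the CSV and Parquet
    readers: merge per-file schemas and report which columns each file
    must NULL-pad."""

    @staticmethod
    def combine_schemas(schemas):
        # Merge (name, type) pairs from each reader's schema,
        # keeping first-seen order and the first-seen type.
        names, types = [], {}
        for schema in schemas:
            for name, typ in schema:
                if name not in types:
                    names.append(name)
                    types[name] = typ
        return [(n, types[n]) for n in names]

    @staticmethod
    def null_padding_map(union_schema, reader_schema):
        # For one file, list the union columns it lacks
        # (those get NULL values at scan time).
        present = {name for name, _ in reader_schema}
        return [name for name, _ in union_schema if name not in present]

csv_schema = [("col1", "VARCHAR")]
parquet_schema = [("col1", "VARCHAR"), ("col2", "INTEGER")]
union_schema = UnionByName.combine_schemas([csv_schema, parquet_schema])
missing = UnionByName.null_padding_map(union_schema, csv_schema)
```

Keeping this logic outside the readers means each format only exposes its schema and receives a padding list, which is the re-use the review comment asks for.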

@Mytherin (Collaborator)

Could you also have a look at the failing CI?

@douenergy (Contributor, Author)

douenergy commented Dec 16, 2022

> Could you also have a look at the failing CI?

parquet_union_by_name.test's glob pattern also matches the files created by parquet_3989.test 😅

@danthegoodman1

Very excited for this!

@danthegoodman1

If a given file does not contain a column of interest (say we do SELECT col_b FROM read_parquet... WHERE col_b IS NOT NULL), will DuckDB be able to recognize from the metadata that the column does not exist and thus not read the file that does not contain the column?

@douenergy (Contributor, Author)

douenergy commented Dec 19, 2022

> If a given file does not contain a column of interest (say we do SELECT col_b FROM read_parquet... WHERE col_b IS NOT NULL), will DuckDB be able to recognize from the metadata that the column does not exist and thus not read the file that does not contain the column?

If you mean that there are two Parquet files:
p1.parquet

colA  colB
1     a
2     b
3     c

p2.parquet

colA  colC
1     x
2     y
3     z

then for the query SELECT colB FROM read_parquet... WHERE colB IS NOT NULL, could we read only the metadata and skip p2.parquet entirely?

The current implementation can't do that, but I think we can improve it in another PR.
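The future optimization discussed here, skipping files whose footer metadata shows the projected column is absent, could be sketched like this (hypothetical Python, not DuckDB internals):

```python
def files_to_scan(file_columns, needed_column):
    """Given each file's footer metadata (here just its column names),
    keep only the files that actually contain the projected column.
    Files without it would contribute only NULLs under union_by_name,
    which an `IS NOT NULL` filter then eliminates, so they can be
    skipped without reading any row data."""
    return [f for f, cols in file_columns.items() if needed_column in cols]

# Hypothetical footer metadata for the two files above:
metadata = {
    "p1.parquet": ["colA", "colB"],
    "p2.parquet": ["colA", "colC"],
}
scan_list = files_to_scan(metadata, "colB")
```

With this pruning, only p1.parquet would be opened for the colB query.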

@danthegoodman1

@douenergy sounds good :D I think it could be a worthwhile enhancement in the future!

@danthegoodman1

@douenergy is there a way to make this behave like UNION ALL BY NAME?

@douenergy (Contributor, Author)

> @douenergy is there a way to make this behave like UNION ALL BY NAME?

Currently, parquet_scan with union_by_name works like UNION ALL BY NAME: it allows duplicate rows.

@danthegoodman1

> Currently, parquet_scan with union_by_name works like UNION ALL BY NAME: it allows duplicate rows.

Perfect!

Does this issue still occur (reading all files for something like a count, even with a filter)? #5790

@douenergy (Contributor, Author)

@danthegoodman1
I think parquet_scan (with union_by_name) would help with your use case, because there will be no UNION operator in your query plan (even if you select from a VIEW).

@danthegoodman1

That'd be AMAZING. I can't get DuckDB to build on my machine or in a GitHub Codespace to save my life, so I will look forward to testing once a release comes out!

@douenergy (Contributor, Author)

douenergy commented Dec 29, 2022

Thanks @Mytherin, I think it's now ready for another review.

  1. Created a UnionByName template in common/union_by_name.hpp.
  2. Renamed BaseCSVReader's sql_types and col_names to return_types and names, so BaseCSVReader and ParquetReader use the same data member names for the same purpose.
  3. Fixed the test.

@douenergy (Contributor, Author)

An unrelated CodeCov GitHub Action CI failure; see codecov/codecov-action#598.

@Mytherin (Collaborator) left a comment

Thanks for the fixes! LGTM

@Mytherin Mytherin merged commit 53088f9 into duckdb:master Jan 13, 2023
@douenergy douenergy deleted the Parquet-Union-By-Name branch February 15, 2023 22:29