[CSV Reader] Reorder of Columns for CSV Scans on multiple files. #12288
Conversation
…ror messages and defaulting to true if it's all varchar
@Mytherin is this good to go?
Thanks for the PR! The code looks good to me. I've tried it on the data set in this issue (#11334) and I get a number of errors.
The reason for these errors is not an actual schema mismatch, but rather that the columns have a lot of `NULL` values. In the Parquet reader this would work, and we would just insert a cast. The cast then fails at run-time rather than at sniffing time if there is an actual mismatch. In this case, since there is no mismatch, the script would just work. Perhaps something to look into? Or maybe for a follow-up PR?
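A minimal sketch of that situation (file names, contents, and the sniffed types are hypothetical):

```sql
-- col_b is entirely NULL in part1.csv, so the sniffer presumably types it
-- as VARCHAR, while part2.csv sniffs it as BIGINT.
copy (select 1 as col_a, NULL as col_b) to 'part1.csv';
copy (select 2 as col_a, 42 as col_b) to 'part2.csv';
-- A strict schema comparison reports a mismatch here, even though casting
-- the all-NULL column to BIGINT would succeed for every value:
select * from read_csv(['part1.csv', 'part2.csv']);
```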
I already have a different branch where I'm working on the minimal sniffer; I can add this there, if that's OK.
Playing around with this some more, I think this might have some other issues remaining, e.g. writing lineitem with the columns re-ordered:

```sql
call dbgen(sf=1);
copy (select l_orderkey,l_partkey,l_suppkey,l_linenumber,l_quantity,l_extendedprice,l_discount,l_tax,l_returnflag,l_linestatus,l_shipdate,l_commitdate,l_receiptdate,l_shipinstruct,l_shipmode,l_comment from lineitem limit 100000 offset 0)
to 'lineitem_part1.csv';
copy (select l_orderkey,l_partkey,l_suppkey,l_linenumber,l_quantity,l_extendedprice,l_tax,l_discount,l_returnflag,l_linestatus,l_shipdate,l_commitdate,l_receiptdate,l_shipinstruct,l_shipmode,l_comment from lineitem limit 100000 offset 100000)
to 'lineitem_part2.csv';
copy (select l_comment,l_orderkey,l_partkey,l_suppkey,l_linenumber,l_quantity,l_extendedprice,l_tax,l_discount,l_returnflag,l_linestatus,l_shipdate,l_commitdate,l_receiptdate,l_shipinstruct,l_shipmode from lineitem limit 100000 offset 200000)
to 'lineitem_part3.csv';
copy (select l_comment,l_orderkey,l_returnflag,l_linestatus,l_shipdate,l_commitdate,l_receiptdate,l_shipinstruct,l_partkey,l_suppkey,l_linenumber,l_quantity,l_extendedprice,l_tax,l_discount,l_shipmode from lineitem limit 100000 offset 300000)
to 'lineitem_part4.csv';
copy (select l_comment,l_orderkey,l_returnflag,l_linestatus,l_shipdate,l_commitdate,l_receiptdate,l_shipinstruct,l_partkey,l_suppkey,l_linenumber,l_quantity,l_extendedprice,l_tax,l_discount,l_shipmode,l_shipmode as shipmode2 from lineitem limit 100000 offset 400000)
to 'lineitem_part5.csv';
```

Then reading it back results in an error:

```sql
select * from read_csv(['lineitem_part1.csv', 'lineitem_part2.csv', 'lineitem_part3.csv', 'lineitem_part4.csv', 'lineitem_part5.csv']);
```
I've added:
- https://github.com/duckdb/duckdb/blob/f0944f33814abbfedd63a32434261c10b6438668/test/sql/copy/csv/test_glob_reorder_null.test as a minimal test for the NULL checking
- https://github.com/duckdb/duckdb/blob/f0944f33814abbfedd63a32434261c10b6438668/test/sql/copy/csv/test_glob_reorder_lineitem.test for the lineitem test
Thanks! Looks good - could you just have a look at the merge conflicts?
Thanks! |
Merge pull request duckdb/duckdb#12288 from pdet/auto_glob_matching
This PR adds the ability to reorder columns in a CSV scan over multiple files, if they match the schema.
For example, consider the following files:
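For illustration (the exact contents are assumptions, not the original example), `a1.csv` and `a2.csv` could contain the same columns in different orders:

`a1.csv`:
```csv
a,b,c
1,2,3
```

`a2.csv`:
```csv
c,b,a
3,2,1
```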
We can now scan these files without setting the `union_by_name` option, e.g., `FROM 'a*.csv'`.
In practice, these columns are reordered after sniffing to match the order of the first sniffed file.
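Continuing the hypothetical `a1.csv`/`a2.csv` example above, the result columns follow the order of the first sniffed file:

```sql
-- a2.csv stores its columns as c,b,a, but the scan returns a,b,c,
-- i.e. the column order of a1.csv:
select * from read_csv(['a1.csv', 'a2.csv']);
```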
In case of a schema mismatch, for example a file whose columns cannot be matched to those of the first sniffed file, an appropriate error message will be thrown.
It is important to notice that if `a3.csv` has columns not defined in the first sniffed file, these will just be dropped. For example, if `a3.csv` had an extra `d` column (see the sketch below), `FROM 'a*.csv'` would not consider the `d` column from `a3.csv`.
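A hypothetical `a3.csv` along those lines, whose `d` column would simply be ignored by the scan:

```csv
c,b,a,d
3,2,1,4
```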
In terms of performance, this brings about 30% better performance for these cases. The benchmark runs a `FROM 'file_*.csv'` over 100 files of 5 columns and ~1,000,000 rows each (a sketch of the setup follows below):

- `union_by_name = false`: 2.20
- `union_by_name = true`: 3

I think the next step here is to implement "adaptive sniffing", where instead of sniffing the default or set values of the file, we sniff a much lower number of rows, and if they don't match the expected schema, we adaptively sniff more and more.
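A hypothetical reconstruction of the benchmark shape (file names, column names, and generated values are assumptions; only two files are generated here for brevity, where the benchmark used 100):

```sql
-- Generate CSV files with 5 columns and ~1,000,000 rows each; the benchmark
-- described above uses 100 such files.
copy (select i as c1, i as c2, i as c3, i as c4, i as c5
      from range(1000000) t(i)) to 'file_001.csv';
copy (select i as c1, i as c2, i as c3, i as c4, i as c5
      from range(1000000) t(i)) to 'file_002.csv';

-- New reordering path (union_by_name defaults to false):
select count(*) from 'file_*.csv';

-- For comparison, the same scan with union_by_name:
select count(*) from read_csv('file_*.csv', union_by_name = true);
```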