
[CSV Reader] Reorder of Columns for CSV Scans on multiple files. #12288

Merged: 43 commits into duckdb:main from pdet/auto_glob_matching on Jun 24, 2024

Conversation

@pdet (Contributor) commented May 28, 2024

This PR adds the ability to reorder columns when scanning multiple CSV files, provided their schemas match.

For example, consider the following files:

  • a1.csv
a;b;c
1;2;3
  • a2.csv
c;b;a
1;2;3

We can now scan these files without setting the union_by_name option, e.g., FROM 'a*.csv'.

In practice, these columns are reordered after sniffing to match the order of the first sniffed file.
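
As a rough sketch of the expected behavior with the two files above (illustrative, not output copied from this PR), the values from a2.csv are remapped to a1.csv's column order:

FROM 'a*.csv';
-- a | b | c
-- 1 | 2 | 3    (row from a1.csv, header a;b;c)
-- 3 | 2 | 1    (row from a2.csv, header c;b;a, remapped to a, b, c)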

In case of a schema mismatch, for example by having:

  • a3.csv
c;b
1;2

An appropriate error message will be thrown:

Invalid Input Error: Schema mismatch between globbed files.
Main file schema: a1.csv
Current file: a3.csv
Column with name: "a" is missing
Potential Fix: Since your schema has a mismatch, consider setting union_by_name=true
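
For reference, the fallback suggested by the error message would look roughly like this (a sketch using the files above; union_by_name unifies columns by name and fills missing ones with NULL):

SELECT * FROM read_csv('a*.csv', union_by_name = true);
-- a3.csv's rows get NULL for the missing "a" column instead of raising an error.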

Note that if a3.csv has columns that are not defined in the first sniffed file, those columns are simply dropped.
For example if a3.csv was:

  • a3.csv
c;b;a;d
1;2;3;4

FROM 'a*.csv' would not consider the d column from a3.csv.
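
A quick way to check which columns survive is to describe the scan. A sketch (the types shown are what the sniffer would typically detect, not output from the PR):

DESCRIBE SELECT * FROM 'a*.csv';
-- column_name | column_type
-- a           | BIGINT
-- b           | BIGINT
-- c           | BIGINT
-- no row for "d": it only appears in a3.csv and is dropped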

In terms of performance, this brings roughly 30% faster scans for these cases. The benchmark runs a FROM 'file_*.csv' over 100 files of 5 columns and ~1,000,000 rows each.

union_by_name = false: 2.20
union_by_name = true: 3

I think the next step here is to implement "adaptive sniffing": instead of sniffing the default or user-configured number of rows, we sniff a much smaller number of rows and, if they don't match the expected schema, adaptively sniff more and more.

@pdet changed the base branch from main to feature May 28, 2024 12:48
@duckdb-draftbot marked this pull request as draft May 29, 2024 10:47
@pdet marked this pull request as ready for review June 4, 2024 11:11
@duckdb-draftbot marked this pull request as draft June 4, 2024 13:59
@pdet marked this pull request as ready for review June 4, 2024 14:05
@pdet (Contributor, Author) commented Jun 5, 2024

@Mytherin is this good to go?

@Mytherin (Collaborator) commented Jun 5, 2024

Thanks for the PR! The code looks good to me.

I've tried it on the data set in this issue - #11334

I get a number of errors, e.g.:

select count(*) from 'csvtest/data*/**/*.csv';
Main file schema: csvtest/data_Q1_2022/2022-01-01.csv
Current file: csvtest/data_Q1_2023/2023-01-01.csv
Column with name: "smart_13_normalized" is expected to have type: BIGINT But has type: VARCHAR
Column with name: "smart_13_raw" is expected to have type: BIGINT But has type: VARCHAR

The reason for these errors is not an actual schema mismatch; rather, the columns have a lot of NULL values, so the CSV sniffer defaults to VARCHAR, which causes the apparent mismatch.

In the Parquet reader this would work: we would just insert a cast, and the cast then fails at run time rather than at sniffing time if there is an actual mismatch. In this case, since there is no mismatch, the script would just work.

Perhaps something to look into? Or maybe for a follow-up PR?
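
Until that lands, a possible workaround (a sketch, not tested against that data set) is to pin the affected columns via the types option that the reader's error messages already suggest:

select count(*)
from read_csv('csvtest/data*/**/*.csv',
              types = {'smart_13_normalized': 'BIGINT', 'smart_13_raw': 'BIGINT'});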

@pdet (Contributor, Author) commented Jun 5, 2024

> Perhaps something to look into? Or maybe for a follow-up PR?

I already have a different branch where I'm working on the minimal sniffer; I can add this there, if that's OK.

@Mytherin (Collaborator) commented Jun 5, 2024

Playing around with this some more, I think there might be some other issues remaining, e.g. writing lineitem with the columns re-ordered:

call dbgen(sf=1);
copy (select l_orderkey,l_partkey,l_suppkey,l_linenumber,l_quantity,l_extendedprice,l_discount,l_tax,l_returnflag,l_linestatus,l_shipdate,l_commitdate,l_receiptdate,l_shipinstruct,l_shipmode,l_comment from lineitem limit 100000 offset 0)
  to 'lineitem_part1.csv';
copy (select l_orderkey,l_partkey,l_suppkey,l_linenumber,l_quantity,l_extendedprice,l_tax,l_discount,l_returnflag,l_linestatus,l_shipdate,l_commitdate,l_receiptdate,l_shipinstruct,l_shipmode,l_comment from lineitem limit 100000 offset 100000)
  to 'lineitem_part2.csv';
copy (select l_comment,l_orderkey,l_partkey,l_suppkey,l_linenumber,l_quantity,l_extendedprice,l_tax,l_discount,l_returnflag,l_linestatus,l_shipdate,l_commitdate,l_receiptdate,l_shipinstruct,l_shipmode from lineitem limit 100000 offset 200000)
  to 'lineitem_part3.csv';
copy (select l_comment,l_orderkey,l_returnflag,l_linestatus,l_shipdate,l_commitdate,l_receiptdate,l_shipinstruct,l_partkey,l_suppkey,l_linenumber,l_quantity,l_extendedprice,l_tax,l_discount,l_shipmode from lineitem limit 100000 offset 300000)
  to 'lineitem_part4.csv';
copy (select l_comment,l_orderkey,l_returnflag,l_linestatus,l_shipdate,l_commitdate,l_receiptdate,l_shipinstruct,l_partkey,l_suppkey,l_linenumber,l_quantity,l_extendedprice,l_tax,l_discount,l_shipmode,l_shipmode as shipmode2 from lineitem limit 100000 offset 400000)
  to 'lineitem_part5.csv';

Then reading it back results in the following error:

select * from read_csv(['lineitem_part1.csv', 'lineitem_part2.csv', 'lineitem_part3.csv', 'lineitem_part4.csv', 'lineitem_part5.csv']);
Conversion Error: CSV Error on Line: 2
Original Line: "y even courts. even, special pinto",199652,45864,8369,1,3.00,5429.58,0.07,0.07,R,F,1994-10-06,1994-10-13,1994-10-24,DELIVER IN PERSON,RAIL
Error when converting column "l_orderkey". Could not convert string "y even courts. even, special pinto" to 'BIGINT'

Column l_orderkey is being converted as type BIGINT
This type was auto-detected from the CSV file.
Possible solutions:
* Override the type for this column manually by setting the type explicitly, e.g. types={'l_orderkey': 'VARCHAR'}
* Set the sample size to a larger value to enable the auto-detection to scan more values, e.g. sample_size=-1
* Use a COPY statement to automatically derive types from an existing table.

  file=lineitem_part1.csv
  delimiter = , (Auto-Detected)
  quote = " (Auto-Detected)
  escape = " (Auto-Detected)
  new_line = \n (Auto-Detected)
  header = true (Auto-Detected)
  skip_rows = 0 (Auto-Detected)
  date_format =  (Auto-Detected)
  timestamp_format =  (Auto-Detected)
  null_padding=0
  sample_size=20480
  ignore_errors=false
  all_varchar=0

The file in the error seems to be incorrect as well, since line 2 in lineitem_part1.csv is actually the following:

1,155190,7706,1,17.00,21168.23,0.04,0.02,N,O,1996-03-13,1996-02-12,1996-03-22,DELIVER IN PERSON,TRUCK,to beans x-ray carefull

@duckdb-draftbot marked this pull request as draft June 5, 2024 14:10
@Mytherin marked this pull request as ready for review June 6, 2024 09:55
@Mytherin (Collaborator) commented:

Thanks! Looks good - could you just have a look at the merge conflicts?

@Mytherin changed the base branch from feature to main June 21, 2024 12:37
@duckdb-draftbot marked this pull request as draft June 24, 2024 09:17
@pdet marked this pull request as ready for review June 24, 2024 09:18
@Mytherin merged commit 28bc6f8 into duckdb:main Jun 24, 2024
39 of 40 checks passed
@Mytherin (Collaborator) commented:

Thanks!

github-actions bot pushed a commit to duckdb/duckdb-r that referenced this pull request Jun 24, 2024
Merge pull request duckdb/duckdb#12288 from pdet/auto_glob_matching
@pdet deleted the auto_glob_matching branch June 25, 2024 09:33