
[CSV Reader] Reorder of Columns for CSV Scans on multiple files. #12288

Merged: 43 commits into duckdb:main from pdet/auto_glob_matching on Jun 24, 2024

Conversation

@pdet (Contributor) commented May 28, 2024

This PR adds the ability to reorder columns when scanning multiple CSV files, provided their schemas match.

For example, consider the following files:

  • a1.csv
a;b;c
1;2;3
  • a2.csv
c;b;a
1;2;3

We can now scan these files without setting the union_by_name option, e.g., FROM 'a*.csv'.

In practice, these columns are reordered after sniffing to match the order of the first sniffed file.
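
As a rough sketch of the expected behavior with the two files above (illustrative, not output copied from this PR), the values from a2.csv are remapped to a1.csv's column order:

FROM 'a*.csv';
-- a | b | c
-- 1 | 2 | 3    (row from a1.csv, header a;b;c)
-- 3 | 2 | 1    (row from a2.csv, header c;b;a, remapped to a, b, c)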

In case of a schema mismatch, for example by having:

  • a3.csv
c;b
1;2

An appropriate error message will be thrown:

Invalid Input Error: Schema mismatch between globbed files.
Main file schema: a1.csv
Current file: a3.csv
Column with name: "a" is missing
Potential Fix: Since your schema has a mismatch, consider setting union_by_name=true
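
For reference, the fallback suggested by the error message would look roughly like this (a sketch using the files above; union_by_name unifies columns by name and fills missing ones with NULL):

SELECT * FROM read_csv('a*.csv', union_by_name = true);
-- a3.csv's rows get NULL for the missing "a" column instead of raising an error.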

Note that if a3.csv has columns that are not defined in the first sniffed file, those columns are simply dropped.
For example if a3.csv was:

  • a3.csv
c;b;a;d
1;2;3;4

FROM 'a*.csv' would not consider the d column from a3.csv.
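
A quick way to check which columns survive is to describe the scan. A sketch (the types shown are what the sniffer would typically detect, not output from the PR):

DESCRIBE SELECT * FROM 'a*.csv';
-- column_name | column_type
-- a           | BIGINT
-- b           | BIGINT
-- c           | BIGINT
-- no row for "d": it only appears in a3.csv and is dropped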

In terms of performance, this brings roughly 30% faster scans for these cases. The benchmark runs a FROM 'file_*.csv' over 100 files of 5 columns and ~1,000,000 rows each.

union_by_name = false: 2.20
union_by_name = true: 3

I think the next step here is to implement "adaptive sniffing": instead of sniffing the default or user-configured number of rows, we sniff a much smaller number of rows and, if they don't match the expected schema, adaptively sniff more and more.

@pdet changed the base branch from main to feature May 28, 2024 12:48
@duckdb-draftbot marked this pull request as draft May 29, 2024 10:47
@pdet marked this pull request as ready for review June 4, 2024 11:11
@duckdb-draftbot marked this pull request as draft June 4, 2024 13:59
@pdet marked this pull request as ready for review June 4, 2024 14:05
@pdet (Contributor, Author) commented Jun 5, 2024

@Mytherin is this good to go?

@Mytherin (Collaborator) commented Jun 5, 2024

Thanks for the PR! The code looks good to me.

I've tried it on the data set in this issue - #11334

I get a number of errors, e.g.:

select count(*) from 'csvtest/data*/**/*.csv';
Main file schema: csvtest/data_Q1_2022/2022-01-01.csv
Current file: csvtest/data_Q1_2023/2023-01-01.csv
Column with name: "smart_13_normalized" is expected to have type: BIGINT But has type: VARCHAR
Column with name: "smart_13_raw" is expected to have type: BIGINT But has type: VARCHAR

The reason for these errors is not an actual schema mismatch; rather, the columns have a lot of NULL values, so the CSV sniffer defaults to VARCHAR, which causes the apparent mismatch.

In the Parquet reader this would work: we would just insert a cast, and the cast then fails at run time rather than at sniffing time if there is an actual mismatch. In this case, since there is no mismatch, the script would just work.

Perhaps something to look into? Or maybe for a follow-up PR?
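
Until that lands, a possible workaround (a sketch, not tested against that data set) is to pin the affected columns via the types option that the reader's error messages already suggest:

select count(*)
from read_csv('csvtest/data*/**/*.csv',
              types = {'smart_13_normalized': 'BIGINT', 'smart_13_raw': 'BIGINT'});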

@pdet (Contributor, Author) commented Jun 5, 2024

> Perhaps something to look into? Or maybe for a follow-up PR?

I already have a different branch where I'm working on the minimal sniffer; I can add this there, if that's OK.

@Mytherin (Collaborator) commented Jun 5, 2024

Playing around with this some more, I think there might be some other issues remaining, e.g. writing lineitem with the columns re-ordered:

call dbgen(sf=1);
copy (select l_orderkey,l_partkey,l_suppkey,l_linenumber,l_quantity,l_extendedprice,l_discount,l_tax,l_returnflag,l_linestatus,l_shipdate,l_commitdate,l_receiptdate,l_shipinstruct,l_shipmode,l_comment from lineitem limit 100000 offset 0)
  to 'lineitem_part1.csv';
copy (select l_orderkey,l_partkey,l_suppkey,l_linenumber,l_quantity,l_extendedprice,l_tax,l_discount,l_returnflag,l_linestatus,l_shipdate,l_commitdate,l_receiptdate,l_shipinstruct,l_shipmode,l_comment from lineitem limit 100000 offset 100000)
  to 'lineitem_part2.csv';
copy (select l_comment,l_orderkey,l_partkey,l_suppkey,l_linenumber,l_quantity,l_extendedprice,l_tax,l_discount,l_returnflag,l_linestatus,l_shipdate,l_commitdate,l_receiptdate,l_shipinstruct,l_shipmode from lineitem limit 100000 offset 200000)
  to 'lineitem_part3.csv';
copy (select l_comment,l_orderkey,l_returnflag,l_linestatus,l_shipdate,l_commitdate,l_receiptdate,l_shipinstruct,l_partkey,l_suppkey,l_linenumber,l_quantity,l_extendedprice,l_tax,l_discount,l_shipmode from lineitem limit 100000 offset 300000)
  to 'lineitem_part4.csv';
copy (select l_comment,l_orderkey,l_returnflag,l_linestatus,l_shipdate,l_commitdate,l_receiptdate,l_shipinstruct,l_partkey,l_suppkey,l_linenumber,l_quantity,l_extendedprice,l_tax,l_discount,l_shipmode,l_shipmode as shipmode2 from lineitem limit 100000 offset 400000)
  to 'lineitem_part5.csv';

Then reading it back results in the following error:

select * from read_csv(['lineitem_part1.csv', 'lineitem_part2.csv', 'lineitem_part3.csv', 'lineitem_part4.csv', 'lineitem_part5.csv']);
Conversion Error: CSV Error on Line: 2
Original Line: "y even courts. even, special pinto",199652,45864,8369,1,3.00,5429.58,0.07,0.07,R,F,1994-10-06,1994-10-13,1994-10-24,DELIVER IN PERSON,RAIL
Error when converting column "l_orderkey". Could not convert string "y even courts. even, special pinto" to 'BIGINT'

Column l_orderkey is being converted as type BIGINT
This type was auto-detected from the CSV file.
Possible solutions:
* Override the type for this column manually by setting the type explicitly, e.g. types={'l_orderkey': 'VARCHAR'}
* Set the sample size to a larger value to enable the auto-detection to scan more values, e.g. sample_size=-1
* Use a COPY statement to automatically derive types from an existing table.

  file=lineitem_part1.csv
  delimiter = , (Auto-Detected)
  quote = " (Auto-Detected)
  escape = " (Auto-Detected)
  new_line = \n (Auto-Detected)
  header = true (Auto-Detected)
  skip_rows = 0 (Auto-Detected)
  date_format =  (Auto-Detected)
  timestamp_format =  (Auto-Detected)
  null_padding=0
  sample_size=20480
  ignore_errors=false
  all_varchar=0

The file in the error seems to be incorrect as well, since line 2 in lineitem_part1.csv is actually the following:

1,155190,7706,1,17.00,21168.23,0.04,0.02,N,O,1996-03-13,1996-02-12,1996-03-22,DELIVER IN PERSON,TRUCK,to beans x-ray carefull

@duckdb-draftbot marked this pull request as draft June 5, 2024 14:10
@Mytherin marked this pull request as ready for review June 6, 2024 09:55
@Mytherin (Collaborator) commented:

Thanks! Looks good - could you just have a look at the merge conflicts?

@Mytherin changed the base branch from feature to main June 21, 2024 12:37
@duckdb-draftbot marked this pull request as draft June 24, 2024 09:17
@pdet marked this pull request as ready for review June 24, 2024 09:18
@Mytherin merged commit 28bc6f8 into duckdb:main Jun 24, 2024
39 of 40 checks passed
@Mytherin (Collaborator) commented:

Thanks!

github-actions bot pushed a commit to duckdb/duckdb-r that referenced this pull request Jun 24, 2024
Merge pull request duckdb/duckdb#12288 from pdet/auto_glob_matching
@pdet deleted the auto_glob_matching branch June 25, 2024 09:33