
CSV Sniffer - State Machine #8253

Merged: 189 commits merged into duckdb:main on Sep 4, 2023
Conversation

@pdet (Contributor) commented Jul 14, 2023

I'm genuinely sorry for the massive PR. I swear I've tried to make it as minimal as possible.

Featured Changes:

CSV State Machine.

We now use a CSV state machine (as described in #7213). The state machine is generated based on the options set (e.g., delimiter, quote, ...) for that CSV read. It is currently only used in the Sniffer and implements the operations that parse a CSV file.

FW (future work): There is still significant branching that can be removed from the existing parsing functions of the state machine. We can return more efficient types (than values and data chunks) and have specialized try_cast methods.
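To make the transition-table idea concrete, here is a minimal, self-contained sketch of how such a table could be generated from one dialect candidate. The state set, names, and transitions are illustrative assumptions (edge cases such as quote == escape and \r\n handling are omitted), not DuckDB's actual implementation:

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Parser states; the exact set is an assumption for illustration.
enum class CSVState : uint8_t {
	STANDARD = 0,     // plain field data
	DELIMITER,        // just read a delimiter
	RECORD_SEPARATOR, // just read a newline
	QUOTED,           // inside a quoted field
	UNQUOTED,         // just left a quoted field
	ESCAPE,           // just read an escape character inside quotes
	COUNT
};

using TransitionTable =
    std::array<std::array<CSVState, 256>, static_cast<std::size_t>(CSVState::COUNT)>;

// Build one table per (delimiter, quote, escape) candidate; parsing then
// reduces to a single lookup per character: state = table[state][c].
TransitionTable BuildTransitionTable(unsigned char delimiter, unsigned char quote,
                                     unsigned char escape) {
	TransitionTable table;
	for (auto &row : table) {
		row.fill(CSVState::STANDARD); // by default, characters are field data
	}
	auto idx = [](CSVState s) { return static_cast<std::size_t>(s); };
	// States that behave like the start of a new value.
	for (CSVState from : {CSVState::STANDARD, CSVState::DELIMITER,
	                      CSVState::RECORD_SEPARATOR, CSVState::UNQUOTED}) {
		table[idx(from)][delimiter] = CSVState::DELIMITER;
		table[idx(from)]['\n'] = CSVState::RECORD_SEPARATOR;
		table[idx(from)][quote] = CSVState::QUOTED;
	}
	// Inside quotes, everything stays quoted except the quote/escape characters.
	table[idx(CSVState::QUOTED)].fill(CSVState::QUOTED);
	table[idx(CSVState::QUOTED)][quote] = CSVState::UNQUOTED;
	table[idx(CSVState::QUOTED)][escape] = CSVState::ESCAPE;
	// After an escape, the next character is literal; return to QUOTED.
	table[idx(CSVState::ESCAPE)].fill(CSVState::QUOTED);
	return table;
}
```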

CSV Buffer Manager.

During sniffing and on the initial runs of the actual CSV parsing, one or more CSV buffers will be cached and properly reused. I've created a buffer manager class that manages, caches, and removes buffers accordingly. One thing to note is that the CSVBufferIterator::GetNextChar() function is currently very inefficient; the sketch below illustrates why.
FW: This should be rewritten to avoid all the branching of buffer checking.
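For illustration, here is a minimal sketch of the kind of per-character branching this FW item refers to. The struct layout and names are assumptions, not DuckDB's actual API:

```cpp
#include <cstddef>

struct CSVBufferIterator {
	const char *buffer = nullptr;
	std::size_t position = 0;
	std::size_t buffer_size = 0;

	// Stub: the real code would ask the buffer manager for the next buffer.
	bool LoadNextBuffer() {
		return false;
	}

	// Every call pays for the end-of-buffer check, once per character of the
	// file; the future-work item is hoisting this branch out of the hot loop.
	char GetNextChar() {
		if (position >= buffer_size) {
			if (!LoadNextBuffer()) {
				return '\0'; // end of file
			}
			position = 0;
		}
		return buffer[position++];
	}
};
```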

New Sniffer Code.

The previous sniffer code was tightly coupled to the Buffered CSV Reader (our single-threaded CSV reader implementation). I've separated the two and created a CSVSniffer class that performs the sniffing. Besides that, the Sniffer now runs on options.sample_chunks chunks; previously, dialect detection would only run on one chunk. This should fix errors like #7789.
The main goal of sniffing is to detect column types, column names, and the CSV options used to parse a CSV File.
In summary, it consists of the following steps:

Dialect Detection

Generate the CSV Options (delimiter, quote, escape, etc.)
Dialect Detection consists of four phases:

  1. Generate Dialect Candidates: In this phase, we generate the search space of options for this CSV file. This is based on our default (or user-defined) options for quotes, escapes, and delimiters.
  2. Generate State Machines: After defining our search space, we generate one CSV state machine for each combination of quote, escape, and delimiter.
  3. Analyze Dialect Candidates: For each state machine created in phase 2, we parse the first chunk of the CSV file and determine, based on the number of consistent rows and the number of columns, our top state machine candidates.
  4. Refine Candidates: We repeat step 3 over the remaining chunks set by options.sample_chunks, eliminating candidates with inconsistent rows.

After running the dialect detection method, we have our best candidates (i.e., the ones with the most consistent rows and the maximum number of columns) from our search space. A sketch of the candidate generation in phase 1 is shown below.
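To make phase 1 concrete, here is a minimal sketch of generating the search space as the cross product of the candidate sets. The candidate characters below are illustrative defaults, not DuckDB's exact ones:

```cpp
#include <vector>

struct DialectCandidate {
	char delimiter;
	char quote;
	char escape;
};

std::vector<DialectCandidate> GenerateDialectCandidates() {
	// Illustrative candidate sets; user-defined options would replace these.
	const std::vector<char> delimiters = {',', '|', ';', '\t'};
	const std::vector<char> quotes = {'"', '\''};
	const std::vector<char> escapes = {'"', '\\'};
	std::vector<DialectCandidate> candidates;
	for (char delimiter : delimiters) {
		for (char quote : quotes) {
			for (char escape : escapes) {
				// Phase 2 builds one CSV state machine per candidate.
				candidates.push_back({delimiter, quote, escape});
			}
		}
	}
	return candidates;
}
```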

Type Detection

Figures out the types of columns (for the first chunk).

For each row of the first chunk, it tries to cast the column values to the types defined in our auto_type_candidates variable.
Since we don't know if this file has a header, we set the first row aside and consider it a possible header. This will be verified in the next phase.

```cpp
//! Types considered as candidates for auto-detection ordered by descending specificity (~ from high to low)
vector<LogicalType> auto_type_candidates = {LogicalType::VARCHAR, LogicalType::TIMESTAMP, LogicalType::DATE,
                                            LogicalType::TIME,    LogicalType::DOUBLE,    LogicalType::BIGINT,
                                            LogicalType::BOOLEAN, LogicalType::SQLNULL};
```

Casting starts with the back() type and pops candidates that fail to cast until it finds the first type that works. It then continues the same process for the remaining rows; a sketch of this narrowing loop follows.
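A minimal sketch of the narrowing loop for a single value, assuming a hypothetical TryCast helper standing in for DuckDB's casting machinery:

```cpp
#include <string>
#include <vector>

enum class LogicalTypeId { VARCHAR, TIMESTAMP, DATE, TIME, DOUBLE, BIGINT, BOOLEAN, SQLNULL };

// Stub standing in for DuckDB's casting machinery: the real version attempts
// the actual conversion and reports success or failure.
static bool TryCast(const std::string &value, LogicalTypeId type) {
	(void)value;
	(void)type;
	return true;
}

// Narrow the candidates for one column with one value: back() is the most
// specific type still in play; pop it whenever the cast fails.
void RefineColumnType(const std::string &value, std::vector<LogicalTypeId> &candidates) {
	while (candidates.size() > 1 && !TryCast(value, candidates.back())) {
		candidates.pop_back();
	}
	// VARCHAR at the front always remains as the fallback.
}
```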
For Date and Timestamp types, we also try to detect their format, following similar logic.

```cpp
//! Format Candidates for Date and Timestamp Types
const std::map<LogicalTypeId, vector<const char *>> format_template_candidates = {
    {LogicalTypeId::DATE, {"%m-%d-%Y", "%m-%d-%y", "%d-%m-%Y", "%d-%m-%y", "%Y-%m-%d", "%y-%m-%d"}},
    {LogicalTypeId::TIMESTAMP,
     {"%Y-%m-%d %H:%M:%S.%f", "%m-%d-%Y %I:%M:%S %p", "%m-%d-%y %I:%M:%S %p", "%d-%m-%Y %H:%M:%S",
      "%d-%m-%y %H:%M:%S", "%Y-%m-%d %H:%M:%S", "%y-%m-%d %H:%M:%S"}},
};
```

We run the type detection for each candidate from the previous phase and keep the one that returns the fewest LogicalType::VARCHAR columns as our best candidate; the scoring is sketched below.
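A one-function sketch of the scoring implied here, with an illustrative LogicalTypeId enum; fewer VARCHAR columns means the dialect let more columns cast to something more specific:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

enum class LogicalTypeId { VARCHAR, TIMESTAMP, DATE, TIME, DOUBLE, BIGINT, BOOLEAN, SQLNULL };

// Count how many columns fell back to VARCHAR for one dialect candidate.
std::size_t CountVarcharColumns(const std::vector<LogicalTypeId> &detected_types) {
	return static_cast<std::size_t>(
	    std::count(detected_types.begin(), detected_types.end(), LogicalTypeId::VARCHAR));
}
```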

FW: Value Casting could be more efficient. I imagine we can have specialized code to try to cast char* directly.

Header Detection

Figures out if the CSV file has a header and produces the names of the columns.
We try to cast all values of the possible header row to the types detected in the previous phase. If any of them don't match, we consider the first row a header; a sketch of this check follows.
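A hedged sketch of the header check, again assuming a hypothetical TryCast helper rather than DuckDB's real casting API:

```cpp
#include <cstddef>
#include <string>
#include <vector>

enum class LogicalTypeId { VARCHAR, TIMESTAMP, DATE, TIME, DOUBLE, BIGINT, BOOLEAN, SQLNULL };

// Stub standing in for DuckDB's casting machinery.
static bool TryCast(const std::string &value, LogicalTypeId type) {
	(void)value;
	(void)type;
	return true;
}

// If any value of the possible header row fails to cast to the detected type
// of its column, the row cannot be data, so we treat it as a header.
bool DetectHeader(const std::vector<std::string> &first_row,
                  const std::vector<LogicalTypeId> &detected_types) {
	for (std::size_t col = 0; col < first_row.size(); col++) {
		if (!TryCast(first_row[col], detected_types[col])) {
			return true;
		}
	}
	return false;
}
```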

Type Replacement:

Replaces the types of columns if the user specified them. Since these are supplied as a column_name:column_type map, we must first run header detection to know the column names.

Type Refinement:

Refines the types of columns for the remaining chunks.
We continue the type-detection process for the remaining options.sample_chunks. The main difference between this phase and the Type Detection phase is that we use vector casts instead of value casts, and we no longer attempt date and time format detection. A hedged sketch of the vectorized refinement follows.
FW: The same as for Type Detection should apply.
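A sketch of the vectorized refinement idea, with VectorTryCast as an assumed stand-in for a vectorized cast (not DuckDB's real signature):

```cpp
#include <cstddef>
#include <string>
#include <vector>

enum class LogicalTypeId { VARCHAR, TIMESTAMP, DATE, TIME, DOUBLE, BIGINT, BOOLEAN, SQLNULL };

// Stub standing in for a vectorized cast: one call checks a whole column
// vector instead of one value at a time.
static bool VectorTryCast(const std::vector<std::string> &column, LogicalTypeId type) {
	(void)column;
	(void)type;
	return true;
}

// Refine each column's candidates chunk by chunk, popping candidates that
// fail, exactly as in type detection but with one cast per column vector.
void RefineTypes(const std::vector<std::vector<std::string>> &chunk_columns,
                 std::vector<std::vector<LogicalTypeId>> &candidates) {
	for (std::size_t col = 0; col < chunk_columns.size(); col++) {
		auto &col_candidates = candidates[col];
		while (col_candidates.size() > 1 &&
		       !VectorTryCast(chunk_columns[col], col_candidates.back())) {
			col_candidates.pop_back();
		}
	}
}
```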

Removal of complex CSV Options.

No more multi-character quotes, delimiters, and escapes; most libraries do not support them. I've also removed the tests related to that.

Micro Regression Tests

I've added micro regression tests for the CSV Reader:

| Benchmark | Old timing | New timing |
|---|---:|---:|
| Sniffer | 0.107391 | 0.164979 |
| Small CSV Reader | 0.000259 | 0.000612 |
| CSV Reader | 7.447226 | 6.886744 |

Future Work:

  1. Performance: As described above, there are many steps that can be optimized to ensure faster CSV Parsing.
  2. Removal of Buffered CSV Reader: When executed with a single thread, the Parallel CSV should behave like the Buffered CSV Reader.
  3. Implement Sampling on Sniffer: We currently do sequential access on the CSV for sniffing; however, CSV files may change their column types or dialect further into the file, so it would be beneficial to run the Sniffer over different parts of the file.
  4. Use CSV State Machine in the Parallel CSV Reading parsing.

@Mytherin (Collaborator) left a comment:

Thanks for the massive PR! Looks great - some comments below:

Review comments (all resolved): src/include/duckdb/main/client_data.hpp, test/api/test_pending_query.cpp, test/sql/copy/csv/auto/test_auto_cranlogs.test
@pdet (Author) commented Sep 3, 2023:

@Mytherin, is this good to go?

@Mytherin merged commit 3b58f44 into duckdb:main on Sep 4, 2023. 51 of 53 checks passed.