
CSV Sniffer - State Machine #8253

Merged: 189 commits merged into duckdb:main on Sep 4, 2023
Conversation

@pdet (Contributor) commented Jul 14, 2023

I'm genuinely sorry for the massive PR. I swear I've tried to make it as minimal as possible.

Featured Changes:

CSV State Machine.

We now use a CSV state machine (as described in #7213). The state machine is generated based on the options set (e.g., delimiter, quote, ...) for that CSV read. It is currently only used in the Sniffer and implements the operations that parse a CSV file.

FW (future work): There is still significant branching that can be removed from the existing parsing functions of the state machine. We can return more efficient types (than values and data chunks) and have specialized try_cast methods.
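To make the transition-table idea concrete, here is a minimal, self-contained sketch of how such a table could be generated from one dialect candidate. The state set, names, and transitions are illustrative assumptions (edge cases such as quote == escape and \r\n handling are omitted), not DuckDB's actual implementation:

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Parser states; the exact set is an assumption for illustration.
enum class CSVState : uint8_t {
	STANDARD = 0,     // plain field data
	DELIMITER,        // just read a delimiter
	RECORD_SEPARATOR, // just read a newline
	QUOTED,           // inside a quoted field
	UNQUOTED,         // just left a quoted field
	ESCAPE,           // just read an escape character inside quotes
	COUNT
};

using TransitionTable =
    std::array<std::array<CSVState, 256>, static_cast<std::size_t>(CSVState::COUNT)>;

// Build one table per (delimiter, quote, escape) candidate; parsing then
// reduces to a single lookup per character: state = table[state][c].
TransitionTable BuildTransitionTable(unsigned char delimiter, unsigned char quote,
                                     unsigned char escape) {
	TransitionTable table;
	for (auto &row : table) {
		row.fill(CSVState::STANDARD); // by default, characters are field data
	}
	auto idx = [](CSVState s) { return static_cast<std::size_t>(s); };
	// States that behave like the start of a new value.
	for (CSVState from : {CSVState::STANDARD, CSVState::DELIMITER,
	                      CSVState::RECORD_SEPARATOR, CSVState::UNQUOTED}) {
		table[idx(from)][delimiter] = CSVState::DELIMITER;
		table[idx(from)]['\n'] = CSVState::RECORD_SEPARATOR;
		table[idx(from)][quote] = CSVState::QUOTED;
	}
	// Inside quotes, everything stays quoted except the quote/escape characters.
	table[idx(CSVState::QUOTED)].fill(CSVState::QUOTED);
	table[idx(CSVState::QUOTED)][quote] = CSVState::UNQUOTED;
	table[idx(CSVState::QUOTED)][escape] = CSVState::ESCAPE;
	// After an escape, the next character is literal; return to QUOTED.
	table[idx(CSVState::ESCAPE)].fill(CSVState::QUOTED);
	return table;
}
```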

CSV Buffer Manager.

During sniffing and on the initial runs of the actual CSV parsing, one or more CSV buffers will be cached and properly reused. I've created a buffer manager class that manages, caches, and removes buffers accordingly. One thing to note is that the CSVBufferIterator::GetNextChar() function is currently very inefficient; the sketch below illustrates why.
FW: This should be rewritten to avoid all the branching of buffer checking.
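For illustration, here is a minimal sketch of the kind of per-character branching this FW item refers to. The struct layout and names are assumptions, not DuckDB's actual API:

```cpp
#include <cstddef>

struct CSVBufferIterator {
	const char *buffer = nullptr;
	std::size_t position = 0;
	std::size_t buffer_size = 0;

	// Stub: the real code would ask the buffer manager for the next buffer.
	bool LoadNextBuffer() {
		return false;
	}

	// Every call pays for the end-of-buffer check, once per character of the
	// file; the future-work item is hoisting this branch out of the hot loop.
	char GetNextChar() {
		if (position >= buffer_size) {
			if (!LoadNextBuffer()) {
				return '\0'; // end of file
			}
			position = 0;
		}
		return buffer[position++];
	}
};
```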

New Sniffer Code.

The previous sniffer code was tightly coupled to the Buffered CSV Reader (our single-threaded CSV reader implementation). I've separated the two and created a CSVSniffer class that performs the sniffing. Besides that, the Sniffer now runs on options.sample_chunks chunks; previously, dialect detection would only run on one chunk. This should fix errors like #7789.
The main goal of sniffing is to detect column types, column names, and the CSV options used to parse a CSV File.
In summary, it consists of the following steps:

Dialect Detection

Generate the CSV Options (delimiter, quote, escape, etc.)
Dialect Detection consists of four phases:

  1. Generate Dialect Candidates: In this phase, we generate the search space of options for this CSV file. This is based on our default (or user-defined) options for quotes, escapes, and delimiters.
  2. Generate State Machines: After defining our search space, we generate one CSV state machine for each combination of quote, escape, and delimiter.
  3. Analyze Dialect Candidates: For each state machine created in phase 2, we parse the first chunk of the CSV file and determine, based on the number of consistent rows and the number of columns, our top state machine candidates.
  4. Refine Candidates: We repeat step 3 over the remaining chunks set by options.sample_chunks, eliminating candidates with inconsistent rows.

After running the dialect detection method, we have our best candidates (i.e., the ones with the most consistent rows and the maximum number of columns) from our search space. A sketch of the candidate generation in phase 1 is shown below.
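To make phase 1 concrete, here is a minimal sketch of generating the search space as the cross product of the candidate sets. The candidate characters below are illustrative defaults, not DuckDB's exact ones:

```cpp
#include <vector>

struct DialectCandidate {
	char delimiter;
	char quote;
	char escape;
};

std::vector<DialectCandidate> GenerateDialectCandidates() {
	// Illustrative candidate sets; user-defined options would replace these.
	const std::vector<char> delimiters = {',', '|', ';', '\t'};
	const std::vector<char> quotes = {'"', '\''};
	const std::vector<char> escapes = {'"', '\\'};
	std::vector<DialectCandidate> candidates;
	for (char delimiter : delimiters) {
		for (char quote : quotes) {
			for (char escape : escapes) {
				// Phase 2 builds one CSV state machine per candidate.
				candidates.push_back({delimiter, quote, escape});
			}
		}
	}
	return candidates;
}
```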

Type Detection

Figures out the types of columns (for the first chunk).

For each row of the first chunk, it tries to cast the column values to the types defined in our auto_type_candidates variable.
Since we don't know if this file has a header, we set the first row aside and consider it a possible header. This will be verified in the next phase.

```cpp
//! Types considered as candidates for auto-detection ordered by descending specificity (~ from high to low)
vector<LogicalType> auto_type_candidates = {LogicalType::VARCHAR, LogicalType::TIMESTAMP, LogicalType::DATE,
                                            LogicalType::TIME,    LogicalType::DOUBLE,    LogicalType::BIGINT,
                                            LogicalType::BOOLEAN, LogicalType::SQLNULL};
```

Casting starts with the back() type and pops candidates that fail to cast until it finds the first type that works. It then continues the same process for the remaining rows; a sketch of this narrowing loop follows.
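A minimal sketch of the narrowing loop for a single value, assuming a hypothetical TryCast helper standing in for DuckDB's casting machinery:

```cpp
#include <string>
#include <vector>

enum class LogicalTypeId { VARCHAR, TIMESTAMP, DATE, TIME, DOUBLE, BIGINT, BOOLEAN, SQLNULL };

// Stub standing in for DuckDB's casting machinery: the real version attempts
// the actual conversion and reports success or failure.
static bool TryCast(const std::string &value, LogicalTypeId type) {
	(void)value;
	(void)type;
	return true;
}

// Narrow the candidates for one column with one value: back() is the most
// specific type still in play; pop it whenever the cast fails.
void RefineColumnType(const std::string &value, std::vector<LogicalTypeId> &candidates) {
	while (candidates.size() > 1 && !TryCast(value, candidates.back())) {
		candidates.pop_back();
	}
	// VARCHAR at the front always remains as the fallback.
}
```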
For Date and Timestamp types, we also try to detect their format, following similar logic.

```cpp
//! Format Candidates for Date and Timestamp Types
const std::map<LogicalTypeId, vector<const char *>> format_template_candidates = {
    {LogicalTypeId::DATE, {"%m-%d-%Y", "%m-%d-%y", "%d-%m-%Y", "%d-%m-%y", "%Y-%m-%d", "%y-%m-%d"}},
    {LogicalTypeId::TIMESTAMP,
     {"%Y-%m-%d %H:%M:%S.%f", "%m-%d-%Y %I:%M:%S %p", "%m-%d-%y %I:%M:%S %p", "%d-%m-%Y %H:%M:%S",
      "%d-%m-%y %H:%M:%S", "%Y-%m-%d %H:%M:%S", "%y-%m-%d %H:%M:%S"}},
};
```

We run the type detection for each candidate from the previous phase and keep the one that returns the fewest LogicalType::VARCHAR columns as our best candidate; the scoring is sketched below.
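A one-function sketch of the scoring implied here, with an illustrative LogicalTypeId enum; fewer VARCHAR columns means the dialect let more columns cast to something more specific:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

enum class LogicalTypeId { VARCHAR, TIMESTAMP, DATE, TIME, DOUBLE, BIGINT, BOOLEAN, SQLNULL };

// Count how many columns fell back to VARCHAR for one dialect candidate.
std::size_t CountVarcharColumns(const std::vector<LogicalTypeId> &detected_types) {
	return static_cast<std::size_t>(
	    std::count(detected_types.begin(), detected_types.end(), LogicalTypeId::VARCHAR));
}
```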

FW: Value Casting could be more efficient. I imagine we can have specialized code to try to cast char* directly.

Header Detection

Figures out if the CSV file has a header and produces the names of the columns.
We try to cast all values of the possible header row to the types detected in the previous phase. If any of them don't match, we consider the first row a header; a sketch of this check follows.
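A hedged sketch of the header check, again assuming a hypothetical TryCast helper rather than DuckDB's real casting API:

```cpp
#include <cstddef>
#include <string>
#include <vector>

enum class LogicalTypeId { VARCHAR, TIMESTAMP, DATE, TIME, DOUBLE, BIGINT, BOOLEAN, SQLNULL };

// Stub standing in for DuckDB's casting machinery.
static bool TryCast(const std::string &value, LogicalTypeId type) {
	(void)value;
	(void)type;
	return true;
}

// If any value of the possible header row fails to cast to the detected type
// of its column, the row cannot be data, so we treat it as a header.
bool DetectHeader(const std::vector<std::string> &first_row,
                  const std::vector<LogicalTypeId> &detected_types) {
	for (std::size_t col = 0; col < first_row.size(); col++) {
		if (!TryCast(first_row[col], detected_types[col])) {
			return true;
		}
	}
	return false;
}
```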

Type Replacement:

Replaces the types of columns if the user specified them. Since these are supplied as a column_name:column_type map, we must first run header detection to know the column names.

Type Refinement:

Refines the types of columns for the remaining chunks.
We continue the type-detection process for the remaining options.sample_chunks. The main difference between this phase and the Type Detection phase is that we use vector casts instead of value casts, and we no longer attempt date and time format detection. A hedged sketch of the vectorized refinement follows.
FW: The same as for Type Detection should apply.
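A sketch of the vectorized refinement idea, with VectorTryCast as an assumed stand-in for a vectorized cast (not DuckDB's real signature):

```cpp
#include <cstddef>
#include <string>
#include <vector>

enum class LogicalTypeId { VARCHAR, TIMESTAMP, DATE, TIME, DOUBLE, BIGINT, BOOLEAN, SQLNULL };

// Stub standing in for a vectorized cast: one call checks a whole column
// vector instead of one value at a time.
static bool VectorTryCast(const std::vector<std::string> &column, LogicalTypeId type) {
	(void)column;
	(void)type;
	return true;
}

// Refine each column's candidates chunk by chunk, popping candidates that
// fail, exactly as in type detection but with one cast per column vector.
void RefineTypes(const std::vector<std::vector<std::string>> &chunk_columns,
                 std::vector<std::vector<LogicalTypeId>> &candidates) {
	for (std::size_t col = 0; col < chunk_columns.size(); col++) {
		auto &col_candidates = candidates[col];
		while (col_candidates.size() > 1 &&
		       !VectorTryCast(chunk_columns[col], col_candidates.back())) {
			col_candidates.pop_back();
		}
	}
}
```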

Removal of complex CSV Options.

No more multi-character quotes, delimiters, and escapes; most libraries do not support them. I've also removed the tests related to that.

Micro Regression Tests

I've added micro regression tests for the CSV Reader:

| Benchmark | Old timing | New timing |
|---|---:|---:|
| Sniffer | 0.107391 | 0.164979 |
| Small CSV Reader | 0.000259 | 0.000612 |
| CSV Reader | 7.447226 | 6.886744 |

Future Work:

  1. Performance: As described above, there are many steps that can be optimized to ensure faster CSV Parsing.
  2. Removal of Buffered CSV Reader: When executed with a single thread, the Parallel CSV should behave like the Buffered CSV Reader.
  3. Implement Sampling on Sniffer: We currently do sequential access on the CSV for sniffing; however, CSV files may change their column types or dialect further into the file, so it would be beneficial to run the Sniffer over different parts of the file.
  4. Use CSV State Machine in the Parallel CSV Reading parsing.

@Mytherin (Collaborator) left a comment:

Thanks for the massive PR! Looks great - some comments below:

Review comments (all resolved): src/include/duckdb/main/client_data.hpp, test/api/test_pending_query.cpp, test/sql/copy/csv/auto/test_auto_cranlogs.test
@pdet (Author) commented Sep 3, 2023:

@Mytherin, is this good to go?

@Mytherin merged commit 3b58f44 into duckdb:main on Sep 4, 2023. 51 of 53 checks passed.