CSV Sniffer - State Machine #8253
Merged
…re than one chunk. Which makes it discard all dialect candidates when null padding is set to false
…Dialect Detection
Mytherin
reviewed
Sep 1, 2023
Thanks for the massive PR! Looks great - some comments below:
@Mytherin, is this good to go?
I'm genuinely sorry for the massive PR. I swear I've tried to make it as minimal as possible.
Featured Changes:
CSV State Machine.
We now use a CSV State machine (as described in #7213). The state machine is generated based on the options set (e.g., delimiter, quote, ...) for that CSV Read. It is currently only used in the Sniffer and implements operations that parse a CSV File.
FW: There is still significant branching that can be removed from the existing parsing functions of the state machine. We can return more efficient types (than values and data chunks) and have specialized try_cast methods.
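As a rough illustration of a state machine generated from the configured options, the sketch below builds a transition table indexed by (state, byte) and uses it to count columns per row. All names here (`ToyDialect`, `ToyCsvStateMachine`, the `State` values) are hypothetical and are not the classes added by this PR; it also assumes `\n` as the only record separator.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical parser states; illustrative only.
enum class State : uint8_t { Standard, Delimiter, RecordEnd, Quoted, Unquoted, Escape, Invalid, Count };

// Hypothetical option set that the table is generated from.
struct ToyDialect {
    char delimiter = ',';
    char quote = '"';
    char escape = '\\';
};

class ToyCsvStateMachine {
public:
    explicit ToyCsvStateMachine(const ToyDialect &d) {
        // Outside quotes: plain bytes stay in a value, the delimiter and newline are special.
        for (State s : {State::Standard, State::Delimiter, State::RecordEnd}) {
            Row(s).fill(State::Standard);
            Row(s)[Byte(d.delimiter)] = State::Delimiter;
            Row(s)[Byte('\n')] = State::RecordEnd;
        }
        // A quote only opens a quoted value at the start of a value.
        Row(State::Delimiter)[Byte(d.quote)] = State::Quoted;
        Row(State::RecordEnd)[Byte(d.quote)] = State::Quoted;
        // Inside quotes everything is literal except escape and the closing quote.
        Row(State::Quoted).fill(State::Quoted);
        Row(State::Quoted)[Byte(d.escape)] = State::Escape;
        Row(State::Quoted)[Byte(d.quote)] = State::Unquoted;
        Row(State::Escape).fill(State::Quoted);
        // After a closing quote only a delimiter, newline, or doubled quote is valid.
        Row(State::Unquoted).fill(State::Invalid);
        Row(State::Unquoted)[Byte(d.delimiter)] = State::Delimiter;
        Row(State::Unquoted)[Byte('\n')] = State::RecordEnd;
        Row(State::Unquoted)[Byte(d.quote)] = State::Quoted;
        Row(State::Invalid).fill(State::Invalid);
    }

    // One table lookup per byte, no option-dependent branching.
    State Next(State current, char c) const {
        return table_[static_cast<size_t>(current)][Byte(c)];
    }

    // Returns the column count of every row, or an empty vector on invalid input.
    std::vector<size_t> ColumnsPerRow(const std::string &input) const {
        std::vector<size_t> result;
        State state = State::RecordEnd;
        size_t columns = 1;
        bool row_has_data = false;
        for (char c : input) {
            state = Next(state, c);
            if (state == State::Invalid) {
                return {};
            }
            if (state == State::RecordEnd) {
                result.push_back(columns);
                columns = 1;
                row_has_data = false;
            } else {
                row_has_data = true;
                if (state == State::Delimiter) {
                    columns++;
                }
            }
        }
        if (row_has_data) {
            result.push_back(columns); // last row without trailing newline
        }
        return result;
    }

private:
    static size_t Byte(char c) { return static_cast<unsigned char>(c); }
    std::array<State, 256> &Row(State s) { return table_[static_cast<size_t>(s)]; }

    std::array<std::array<State, 256>, static_cast<size_t>(State::Count)> table_;
};
```

Building one such machine per candidate dialect lets the same input be scanned under different delimiter/quote/escape settings with identical parsing code.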
CSV Buffer Manager.
During the sniffing and on the initial runs of the actual CSV Parsing, one or more CSV Buffers will be cached and properly reused. I've created a buffer manager class that manages, caches and removes buffers accordingly. One thing to notice is that the `char CSVBufferIterator::GetNextChar()` function is very inefficient.
FW: This should be rewritten to avoid all the branching of buffer checking.
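To make the caching idea concrete, here is a minimal sketch of a manager that loads fixed-size buffers lazily and hands the cached copies to any number of iterators. `ToyBufferManager` and `ToyBufferIterator` are hypothetical stand-ins that read from an in-memory string; buffer removal is left out, and the real classes in this PR may be organized quite differently.

```cpp
#include <algorithm>
#include <cstddef>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical manager: splits a source into buffers and caches each one once.
class ToyBufferManager {
public:
    ToyBufferManager(std::string source, size_t buffer_size)
        : source_(std::move(source)), buffer_size_(std::max<size_t>(1, buffer_size)) {}

    // Returns the cached buffer at `idx`, loading it on first use; nullptr past the end.
    std::shared_ptr<const std::string> GetBuffer(size_t idx) {
        while (cache_.size() <= idx) {
            size_t offset = cache_.size() * buffer_size_;
            if (offset >= source_.size()) {
                return nullptr;
            }
            cache_.push_back(std::make_shared<const std::string>(source_.substr(offset, buffer_size_)));
            loads_++;
        }
        return cache_[idx];
    }

    size_t Loads() const { return loads_; }

private:
    std::string source_;
    size_t buffer_size_;
    std::vector<std::shared_ptr<const std::string>> cache_;
    size_t loads_ = 0;
};

// Hypothetical iterator: walks the cached buffers one character at a time.
class ToyBufferIterator {
public:
    explicit ToyBufferIterator(ToyBufferManager &manager) : manager_(manager) {}

    // Returns '\0' once the input is exhausted.
    char GetNextChar() {
        // This boundary check runs for every character, which is the kind of
        // per-character branching the FW note above wants to avoid.
        while (!current_ || pos_ >= current_->size()) {
            current_ = manager_.GetBuffer(next_buffer_++);
            pos_ = 0;
            if (!current_) {
                return '\0';
            }
        }
        return (*current_)[pos_++];
    }

private:
    ToyBufferManager &manager_;
    std::shared_ptr<const std::string> current_;
    size_t next_buffer_ = 0;
    size_t pos_ = 0;
};
```

A second iterator over the same manager reuses the buffers already in the cache, which is the reuse the sniffer and the first parsing runs rely on.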
New Sniffer Code.
The previous sniffer code was tightly coupled to the Buffered CSV Reader (our single-threaded CSV Reader implementation). I've decoupled the two and created a `CSVSniffer` class that performs the sniffing. Besides that, the Sniffer now runs on `options.sample_chunks`. Previously, dialect detection would only run on one chunk; this should fix errors like #7789.
The main goal of sniffing is to detect column types, column names, and the CSV options used to parse a CSV File.
In summary, it consists of the following steps:
Dialect Detection
Generate the CSV Options (delimiter, quote, escape, etc.)
Dialect Detection consists of four phases:
`options.sample_chunks`. Eliminating candidates with inconsistent rows.
After running the dialect detection method, we have our best candidates (i.e., the ones with the most consistent rows and the maximum number of columns) from our search space.
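The selection rule (most consistent rows, then maximum number of columns) could look roughly like the sketch below. It assumes each candidate has already been run over the sample and has produced a column count per row; `DialectCandidate`, `ScoreCandidate`, and `BestCandidates` are hypothetical names, and treating the most frequent column count as the "consistent" one is an assumption of this sketch, not something stated in the PR.

```cpp
#include <cstddef>
#include <map>
#include <vector>

// Hypothetical result of scanning the sample with one candidate dialect.
struct DialectCandidate {
    char delimiter;
    std::vector<size_t> columns_per_row;
};

struct DialectScore {
    size_t consistent_rows = 0; // rows sharing the most frequent column count
    size_t num_columns = 0;     // that column count
};

// Assumption: "consistent" means "has the most frequent column count".
DialectScore ScoreCandidate(const DialectCandidate &candidate) {
    std::map<size_t, size_t> frequency;
    for (size_t n : candidate.columns_per_row) {
        frequency[n]++;
    }
    DialectScore score;
    for (const auto &entry : frequency) {
        bool more_rows = entry.second > score.consistent_rows;
        bool same_rows_more_columns = entry.second == score.consistent_rows && entry.first > score.num_columns;
        if (more_rows || same_rows_more_columns) {
            score.consistent_rows = entry.second;
            score.num_columns = entry.first;
        }
    }
    return score;
}

// More consistent rows wins; ties are broken by the number of columns.
bool IsBetter(const DialectScore &a, const DialectScore &b) {
    if (a.consistent_rows != b.consistent_rows) {
        return a.consistent_rows > b.consistent_rows;
    }
    return a.num_columns > b.num_columns;
}

// Keeps every candidate that ties for the best score.
std::vector<DialectCandidate> BestCandidates(const std::vector<DialectCandidate> &candidates) {
    std::vector<DialectCandidate> best;
    DialectScore best_score;
    for (const auto &candidate : candidates) {
        DialectScore score = ScoreCandidate(candidate);
        if (score.consistent_rows == 0) {
            continue; // candidate produced no rows
        }
        if (best.empty() || IsBetter(score, best_score)) {
            best.clear();
            best.push_back(candidate);
            best_score = score;
        } else if (!IsBetter(best_score, score)) {
            best.push_back(candidate);
        }
    }
    return best;
}
```

The column tie-break matters because a wrong delimiter usually yields a perfectly consistent single-column file.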
Type Detection
Figures out the types of columns (For the first chunk).
For each row of the first chunk, it tries to cast each column to the types defined in our `auto_type_candidates` variable.
Since we don't know if this file has a header, we set the first row aside and consider it a possible header. This will be verified in the next phase.
    //! Types considered as candidates for auto-detection ordered by descending specificity (~ from high to low)
    vector<LogicalType> auto_type_candidates = {LogicalType::VARCHAR, LogicalType::TIMESTAMP, LogicalType::DATE,
                                                LogicalType::TIME,    LogicalType::DOUBLE,    LogicalType::BIGINT,
                                                LogicalType::BOOLEAN, LogicalType::SQLNULL};
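A candidate list like this is consumed from the back: each column starts at the last (most specific) type and falls back toward `VARCHAR` whenever a value fails to cast. The sketch below shows that narrowing with a reduced, hypothetical type set (`SimpleType`, no temporal or boolean types) and a toy `CanCast` built on the C standard library; it does not use DuckDB's casting code, and it assumes an empty string stands for NULL.

```cpp
#include <cerrno>
#include <cstddef>
#include <cstdlib>
#include <string>
#include <vector>

// Reduced, hypothetical type set; least specific first, most specific last.
enum class SimpleType { Varchar, Double, BigInt, SqlNull };

// Toy cast check. Assumption: the empty string represents NULL and casts to anything.
bool CanCast(const std::string &value, SimpleType type) {
    if (value.empty() || type == SimpleType::Varchar) {
        return true;
    }
    const char *begin = value.c_str();
    char *end = nullptr;
    switch (type) {
    case SimpleType::BigInt:
        errno = 0;
        std::strtoll(begin, &end, 10);
        return errno == 0 && end == begin + value.size();
    case SimpleType::Double:
        std::strtod(begin, &end);
        return end == begin + value.size();
    default:
        return false; // SqlNull only accepts the empty string, handled above
    }
}

// Skips row 0 (the possible header) and narrows one candidate list per column.
std::vector<SimpleType> DetectColumnTypes(const std::vector<std::vector<std::string>> &rows, size_t num_columns) {
    const std::vector<SimpleType> all_candidates = {SimpleType::Varchar, SimpleType::Double, SimpleType::BigInt,
                                                    SimpleType::SqlNull};
    std::vector<std::vector<SimpleType>> candidates(num_columns, all_candidates);
    for (size_t row = 1; row < rows.size(); row++) {
        for (size_t col = 0; col < num_columns && col < rows[row].size(); col++) {
            auto &column_candidates = candidates[col];
            // Pop from the back until the value casts; Varchar always remains.
            while (column_candidates.size() > 1 && !CanCast(rows[row][col], column_candidates.back())) {
                column_candidates.pop_back();
            }
        }
    }
    std::vector<SimpleType> result;
    for (const auto &column_candidates : candidates) {
        result.push_back(column_candidates.back());
    }
    return result;
}
```

Because a popped type is never reconsidered, the surviving type only ever becomes less specific as more rows are seen.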
It starts with the `back()` type and pops it if the value can't be cast, until it finds the first type that works. It then continues the same process for the remaining rows.
We also try to detect the format for `Date` and `Timestamp` types, following a similar logic.
We run the type detection for each candidate from the previous phase. We keep the one that returns the fewest `LogicalType::VARCHAR` columns as our best candidate.
FW: Value Casting could be more efficient. I imagine we can have specialized code to try to cast `char*` directly.
Header Detection
Figures out if the CSV file has a header and produces the names of the columns.
We try to cast all columns of our possible header to the types detected in the previous phase. If they don't match, we consider the first row a header.
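A minimal sketch of that check, under stated assumptions: the detected type of each column is represented by a hypothetical `CastCheck` predicate, and generated names of the form `column0`, `column1`, ... are used when no header is found. `DetectHeader`, `HeaderResult`, and `LooksLikeInteger` are illustrative names, not the PR's API.

```cpp
#include <cerrno>
#include <cstddef>
#include <cstdlib>
#include <functional>
#include <string>
#include <vector>

// Hypothetical stand-in for "can this value be cast to the column's detected type?".
using CastCheck = std::function<bool(const std::string &)>;

struct HeaderResult {
    bool has_header;
    std::vector<std::string> names;
};

// Example predicate for an integer column.
bool LooksLikeInteger(const std::string &value) {
    if (value.empty()) {
        return false;
    }
    char *end = nullptr;
    errno = 0;
    std::strtoll(value.c_str(), &end, 10);
    return errno == 0 && end == value.c_str() + value.size();
}

// If any first-row value fails to cast to its column's detected type,
// the first row is treated as a header and supplies the column names.
HeaderResult DetectHeader(const std::vector<std::string> &first_row, const std::vector<CastCheck> &column_casts) {
    HeaderResult result;
    result.has_header = false;
    for (size_t i = 0; i < first_row.size() && i < column_casts.size(); i++) {
        if (!column_casts[i](first_row[i])) {
            result.has_header = true;
            break;
        }
    }
    if (result.has_header) {
        result.names = first_row;
    } else {
        // Assumption: fall back to generated names when there is no header.
        for (size_t i = 0; i < first_row.size(); i++) {
            result.names.push_back("column" + std::to_string(i));
        }
    }
    return result;
}
```

With this rule alone, a file whose columns are all text never fails a cast on its first row, so the sketch reports no header in that case.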
Type Replacement:
Replaces the types of columns if the user specified them. Since these are supplied with a column_name:column_type map, we must first run the header detection to know the column names.
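The dependency on header detection is easy to see in a small sketch: the user's map is keyed by column name, so the names must be known before any override can be applied. Types are plain strings here for brevity, `ReplaceTypes` is a hypothetical helper, and silently ignoring names that match no column is an assumption of the sketch.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Overrides detected types with user-specified ones, matched by column name.
// Assumption: map entries that match no column are ignored.
std::vector<std::string> ReplaceTypes(const std::vector<std::string> &names,
                                      std::vector<std::string> detected_types,
                                      const std::unordered_map<std::string, std::string> &user_types) {
    for (size_t i = 0; i < names.size() && i < detected_types.size(); i++) {
        auto entry = user_types.find(names[i]);
        if (entry != user_types.end()) {
            detected_types[i] = entry->second;
        }
    }
    return detected_types;
}
```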
Type Refinement:
Refines the types of columns for the remaining chunks.
We continue the type-detection process for the remaining `options.sample_chunks`. The main difference between this phase and the Type Detection phase is that we use vector casts instead of value casts. We also don't try to continue `date` and `time` format detection.
FW: The same as Type detection should apply.
Removal of complex CSV Options.
No more multi-character quotes, delimiters, and escapes. Most libraries do not support this. I've also removed tests related to that.
Micro Regression Tests
I've added tests for the CSV Reader:
Sniffer.
Old timing: 0.107391
New timing: 0.164979
Small CSV Reader.
Old timing: 0.000259
New timing: 0.000612
CSV Reader
Old timing: 7.447226
New timing: 6.886744
Future Work: