-
Notifications
You must be signed in to change notification settings - Fork 1.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Decimal separator option for CSV reader #5958
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the PR! Looks good - some comments below.
@@ -27,13 +27,27 @@ struct TryCast { | |||
} | |||
}; | |||
|
|||
struct TryCastCommaSeparated { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This separate struct does not need to exist, no? Only TryCastErrorMessageCommaSeparated
would be sufficient.
|
||
template <> | ||
bool TryCastErrorMessage::Operation(string_t input, float &result, string *error_message, bool strict) { | ||
if (!TryCast::Operation<string_t, float>(input, result, strict)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this need to be added?
auto scale = DecimalType::GetScale(sql_type); | ||
switch (sql_type.InternalType()) { | ||
case PhysicalType::INT8: | ||
return TryCastDecimalOperator::Operation<TryCastToDecimalCommaSeparated, int8_t>(value_str, width, scale); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Decimals cannot be stored in INT8
, only in 16, 32, 64, 128
- so this switch case can be removed.
|
||
switch (type_id) { | ||
case PhysicalType::INT8: | ||
success = TryCastDecimalVector<int8_t>(options, parse_chunk.data[col_idx], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here - INT8
is not a valid decimal type
@@ -343,6 +447,58 @@ bool BaseCSVReader::Flush(DataChunk &insert_chunk, bool try_add_line) { | |||
// use the date format to cast the chunk | |||
success = TryCastTimestampVector(options, parse_chunk.data[col_idx], insert_chunk.data[insert_idx], | |||
parse_chunk.size(), error_message); | |||
} else if ((return_types[col_idx].id() == LogicalTypeId::FLOAT) || | |||
(return_types[col_idx].id() == LogicalTypeId::DOUBLE)) { | |||
auto type_id = return_types[col_idx].InternalType(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we perhaps move these blocks to separate functions similar to the TryCastDateVector
and TryCastTimestampVector
functions?
@@ -343,6 +447,58 @@ bool BaseCSVReader::Flush(DataChunk &insert_chunk, bool try_add_line) { | |||
// use the date format to cast the chunk | |||
success = TryCastTimestampVector(options, parse_chunk.data[col_idx], insert_chunk.data[insert_idx], | |||
parse_chunk.size(), error_message); | |||
} else if ((return_types[col_idx].id() == LogicalTypeId::FLOAT) || |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We only need to do this if the decimal separator is not .
, right?
default: | ||
throw InternalException("Unimplemented physical type for floating"); | ||
} | ||
} else if ((return_types[col_idx].id() == LogicalTypeId::DECIMAL)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here - we only need to do this if the decimal separator is not .
bool TryCastDecimalVector(BufferedCSVReaderOptions &options, Vector &input_vector, Vector &result_vector, idx_t count, | ||
string &error_message, uint8_t width, uint8_t scale, string decimal_separator) { | ||
if (decimal_separator == ".") { | ||
return TemplatedTryCastDecimalVector<TryCastToDecimal, T>(options, input_vector, result_vector, count, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we can ensure that the decimal separator is ,
at a higher level (since we can use the standard cast for a decimal separator of .
) we can remove this if statement and simplify the code
bool TryCastFloatingVector(BufferedCSVReaderOptions &options, Vector &input_vector, Vector &result_vector, idx_t count, | ||
string &error_message, string decimal_separator) { | ||
if (decimal_separator == ".") { | ||
return TemplatedTryCastFloatingVector<TryCastErrorMessage, T>(options, input_vector, result_vector, count, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here - this branch can be removed and the code can be simplified
|
||
template <class T> | ||
bool TryCastDecimalVector(BufferedCSVReaderOptions &options, Vector &input_vector, Vector &result_vector, idx_t count, | ||
string &error_message, uint8_t width, uint8_t scale, string decimal_separator) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Passing in a std::string
to a function invokes a string copy - const string &
would be better here (or avoid passing in the decimal separator entirely by making sure it is a comma first).
dfb7a51
to
f42f7d0
Compare
Thanks for the review, all comments should be addressed now. Also added more informative assertions for the error cases in the unit test. For some reason one of the tests failed in VectorSizes, as if it was reading the rows in a different order, is it expected? |
No, but running with a lower vector size might influence CSV reader auto-detection. You could add |
Checked against master now and got the same failure there, so it's not related to the changes in this PR => added the vector_size restriction to the test |
Thanks for the changes! LGTM |
Add a
decimal_separator
option to the CSV reader, so that numeric data produced with e.g. certain European locales can be parsed. Only comma and period are supported, with period being the default. The implementation bypasses the default casting logic in the CSV reader when comma is specified, and calls newly introduced functions fromcast_operators.cpp
which pass the correct decimal separator to the casting functions.