Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Decimal separator option for CSV reader #5958

Merged
merged 13 commits into from
Jan 26, 2023

Conversation

eeroel
Copy link
Contributor

@eeroel eeroel commented Jan 22, 2023

Add a decimal_separator option to the CSV reader, so that numeric data produced with e.g. certain European locales can be parsed. Only comma and period are supported, with period being the default. The implementation bypasses the default casting logic in the CSV reader when comma is specified, and calls newly introduced functions from cast_operators.cpp which pass the correct decimal separator to the casting functions.

Copy link
Collaborator

@Mytherin Mytherin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR! Looks good - some comments below.

@@ -27,13 +27,27 @@ struct TryCast {
}
};

struct TryCastCommaSeparated {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This separate struct does not need to exist, no? Only TryCastErrorMessageCommaSeparated would be sufficient.


template <>
bool TryCastErrorMessage::Operation(string_t input, float &result, string *error_message, bool strict) {
if (!TryCast::Operation<string_t, float>(input, result, strict)) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this need to be added?

auto scale = DecimalType::GetScale(sql_type);
switch (sql_type.InternalType()) {
case PhysicalType::INT8:
return TryCastDecimalOperator::Operation<TryCastToDecimalCommaSeparated, int8_t>(value_str, width, scale);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Decimals cannot be stored in INT8, only in 16, 32, 64, 128 - so this switch case can be removed.


switch (type_id) {
case PhysicalType::INT8:
success = TryCastDecimalVector<int8_t>(options, parse_chunk.data[col_idx],
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here - INT8 is not a valid decimal type

@@ -343,6 +447,58 @@ bool BaseCSVReader::Flush(DataChunk &insert_chunk, bool try_add_line) {
// use the date format to cast the chunk
success = TryCastTimestampVector(options, parse_chunk.data[col_idx], insert_chunk.data[insert_idx],
parse_chunk.size(), error_message);
} else if ((return_types[col_idx].id() == LogicalTypeId::FLOAT) ||
(return_types[col_idx].id() == LogicalTypeId::DOUBLE)) {
auto type_id = return_types[col_idx].InternalType();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we perhaps move these blocks to separate functions similar to the TryCastDateVector and TryCastTimestampVector functions?

@@ -343,6 +447,58 @@ bool BaseCSVReader::Flush(DataChunk &insert_chunk, bool try_add_line) {
// use the date format to cast the chunk
success = TryCastTimestampVector(options, parse_chunk.data[col_idx], insert_chunk.data[insert_idx],
parse_chunk.size(), error_message);
} else if ((return_types[col_idx].id() == LogicalTypeId::FLOAT) ||
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We only need to do this if the decimal separator is not ., right?

default:
throw InternalException("Unimplemented physical type for floating");
}
} else if ((return_types[col_idx].id() == LogicalTypeId::DECIMAL)) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here - we only need to do this if the decimal separator is not .

bool TryCastDecimalVector(BufferedCSVReaderOptions &options, Vector &input_vector, Vector &result_vector, idx_t count,
string &error_message, uint8_t width, uint8_t scale, string decimal_separator) {
if (decimal_separator == ".") {
return TemplatedTryCastDecimalVector<TryCastToDecimal, T>(options, input_vector, result_vector, count,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we can ensure that the decimal separator is , at a higher level (since we can use the standard cast for a decimal separator of .) we can remove this if statement and simplify the code

bool TryCastFloatingVector(BufferedCSVReaderOptions &options, Vector &input_vector, Vector &result_vector, idx_t count,
string &error_message, string decimal_separator) {
if (decimal_separator == ".") {
return TemplatedTryCastFloatingVector<TryCastErrorMessage, T>(options, input_vector, result_vector, count,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here - this branch can be removed and the code can be simplified


template <class T>
bool TryCastDecimalVector(BufferedCSVReaderOptions &options, Vector &input_vector, Vector &result_vector, idx_t count,
string &error_message, uint8_t width, uint8_t scale, string decimal_separator) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Passing in a std::string to a function invokes a string copy - const string & would be better here (or avoid passing in the decimal separator entirely by making sure it is a comma first).

@eeroel eeroel force-pushed the feature/decimal_separator_v2 branch from dfb7a51 to f42f7d0 Compare January 24, 2023 19:38
@eeroel
Copy link
Contributor Author

eeroel commented Jan 25, 2023

Thanks for the review, all comments should be addressed now. Also added more informative assertions for the error cases in the unit test. For some reason one of the tests failed in VectorSizes, as if it was reading the rows in a different order, is it expected?

@eeroel eeroel requested a review from Mytherin January 25, 2023 07:17
@Mytherin
Copy link
Collaborator

Thanks for the review, all comments should be addressed now. Also added more informative assertions for the error cases in the unit test. For some reason one of the tests failed in VectorSizes, as if it was reading the rows in a different order, is it expected?

No, but running with a lower vector size might influence CSV reader auto-detection. You could add require vector_size 512 to the test to disable the test with low vector sizes if you are certain it is not triggering a bug in your code.

@eeroel
Copy link
Contributor Author

eeroel commented Jan 25, 2023

Thanks for the review, all comments should be addressed now. Also added more informative assertions for the error cases in the unit test. For some reason one of the tests failed in VectorSizes, as if it was reading the rows in a different order, is it expected?

No, but running with a lower vector size might influence CSV reader auto-detection. You could add require vector_size 512 to the test to disable the test with low vector sizes if you are certain it is not triggering a bug in your code.

Checked against master now and got the same failure there, so it's not related to the changes in this PR => added the vector_size restriction to the test

@Mytherin Mytherin merged commit 5b40a5f into duckdb:master Jan 26, 2023
@Mytherin
Copy link
Collaborator

Thanks for the changes! LGTM

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants