Decimal separator option for CSV reader #5958

Merged

Mytherin merged 13 commits into duckdb:master from eeroel:feature/decimal_separator_v2

Jan 26, 2023

Contributor

eeroel commented Jan 22, 2023

Add a decimal_separator option to the CSV reader so that numeric data written with a comma as the decimal separator, as in certain European locales, can be parsed. Only comma and period are supported, and period is the default. When comma is specified, the implementation bypasses the default casting logic in the CSV reader and calls newly introduced functions from cast_operators.cpp, which pass the correct decimal separator to the casting functions.
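To make the described behavior concrete, here is a minimal standalone sketch of parsing a floating-point value with a configurable decimal separator. It is not the code from this PR: `TryParseDouble` is a hypothetical helper, it works on `std::string` rather than DuckDB's `string_t`, and it assumes the process runs in the default "C" locale so that `std::strtod` expects a period.

```cpp
#include <cstdlib>
#include <string>

// Hypothetical helper (not part of DuckDB): parse `input` as a double,
// treating `decimal_separator` as the decimal mark. Only '.' and ',' are
// accepted as separators, mirroring the restriction described in the PR.
bool TryParseDouble(const std::string &input, char decimal_separator, double &result) {
    if (decimal_separator != '.' && decimal_separator != ',') {
        return false;
    }
    std::string normalized;
    normalized.reserve(input.size());
    bool seen_separator = false;
    for (char c : input) {
        if (c == decimal_separator) {
            if (seen_separator) {
                return false; // more than one decimal separator
            }
            seen_separator = true;
            normalized.push_back('.'); // strtod expects '.' in the "C" locale
        } else if (c == '.' || c == ',') {
            return false; // the other separator character is not valid here
        } else {
            normalized.push_back(c);
        }
    }
    if (normalized.empty()) {
        return false;
    }
    char *end = nullptr;
    const double value = std::strtod(normalized.c_str(), &end);
    if (end != normalized.c_str() + normalized.size()) {
        return false; // trailing characters that are not part of a number
    }
    result = value;
    return true;
}
```

With this sketch, `TryParseDouble("1,5", ',', value)` succeeds and stores 1.5, while `TryParseDouble("1.5", ',', value)` is rejected because the period is not the configured separator.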

eeroel added 5 commits

January 21, 2023 19:12


          add tests, decimal separator option

5a7741c


          handle decimal separator in casting

93d8f69


          more tests

db8756e


          update tests

5465a58


          fix format

9bcea42

Mytherin reviewed

Collaborator

Mytherin left a comment

Thanks for the PR! Looks good - some comments below.

src/include/duckdb/common/operator/cast_operators.hpp Outdated

@@ -27,13 +27,27 @@ struct TryCast {
               	}
               };
+              struct TryCastCommaSeparated {

Collaborator

Mytherin Jan 24, 2023

This separate struct does not need to exist, no? Only TryCastErrorMessageCommaSeparated would be sufficient.

src/common/operator/cast_operators.cpp Outdated

+              template <>
+              bool TryCastErrorMessage::Operation(string_t input, float &result, string *error_message, bool strict) {
+              	if (!TryCast::Operation<string_t, float>(input, result, strict)) {

Collaborator

Mytherin Jan 24, 2023

Does this need to be added?

src/execution/operator/persistent/base_csv_reader.cpp Outdated

+              		auto scale = DecimalType::GetScale(sql_type);
+              		switch (sql_type.InternalType()) {
+              		case PhysicalType::INT8:
+              			return TryCastDecimalOperator::Operation<TryCastToDecimalCommaSeparated, int8_t>(value_str, width, scale);

Collaborator

Mytherin Jan 24, 2023

Decimals cannot be stored in INT8, only in 16, 32, 64, 128 - so this switch case can be removed.
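The point about storage widths can be illustrated with a small sketch. The enum and function below are simplified stand-ins, not DuckDB's actual `PhysicalType` or `DecimalType` API, and the width thresholds (4, 9, 18, 38) are stated from my understanding of DuckDB's decimal layout, so treat them as an assumption to verify against the source.

```cpp
#include <cstdint>
#include <stdexcept>

// Simplified stand-in for the physical types a DECIMAL can use.
// Note that there is deliberately no INT8 entry.
enum class DecimalStorage { INT16, INT32, INT64, INT128 };

// Hypothetical helper: pick the smallest integer storage that can hold a
// decimal with the given width (total number of digits).
DecimalStorage DecimalStorageForWidth(uint8_t width) {
    if (width == 0 || width > 38) {
        throw std::invalid_argument("decimal width must be between 1 and 38");
    }
    if (width <= 4) {
        return DecimalStorage::INT16;
    }
    if (width <= 9) {
        return DecimalStorage::INT32;
    }
    if (width <= 18) {
        return DecimalStorage::INT64;
    }
    return DecimalStorage::INT128;
}
```

Because even the narrowest decimal maps to a 16-bit integer in this scheme, a `switch` over the storage type never needs an 8-bit case.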

src/execution/operator/persistent/base_csv_reader.cpp Outdated

+              				switch (type_id) {
+              				case PhysicalType::INT8:
+              					success = TryCastDecimalVector<int8_t>(options, parse_chunk.data[col_idx],

Collaborator

Mytherin Jan 24, 2023

Same here - INT8 is not a valid decimal type

src/execution/operator/persistent/base_csv_reader.cpp Outdated

               				// use the date format to cast the chunk
               				success = TryCastTimestampVector(options, parse_chunk.data[col_idx], insert_chunk.data[insert_idx],
               				                                 parse_chunk.size(), error_message);
+              			} else if ((return_types[col_idx].id() == LogicalTypeId::FLOAT) ||
+              			           (return_types[col_idx].id() == LogicalTypeId::DOUBLE)) {
+              				auto type_id = return_types[col_idx].InternalType();

Collaborator

Mytherin Jan 24, 2023

Can we perhaps move these blocks to separate functions similar to the TryCastDateVector and TryCastTimestampVector functions?

src/execution/operator/persistent/base_csv_reader.cpp Outdated

               				// use the date format to cast the chunk
               				success = TryCastTimestampVector(options, parse_chunk.data[col_idx], insert_chunk.data[insert_idx],
               				                                 parse_chunk.size(), error_message);
+              			} else if ((return_types[col_idx].id() == LogicalTypeId::FLOAT) ||

Collaborator

Mytherin Jan 24, 2023

We only need to do this if the decimal separator is not ".", right?

src/execution/operator/persistent/base_csv_reader.cpp Outdated

+              				default:
+              					throw InternalException("Unimplemented physical type for floating");
+              				}
+              			} else if ((return_types[col_idx].id() == LogicalTypeId::DECIMAL)) {

Collaborator

Mytherin Jan 24, 2023

Same here - we only need to do this if the decimal separator is not "."

src/execution/operator/persistent/base_csv_reader.cpp Outdated

+              bool TryCastDecimalVector(BufferedCSVReaderOptions &options, Vector &input_vector, Vector &result_vector, idx_t count,
+                                        string &error_message, uint8_t width, uint8_t scale, string decimal_separator) {
+              	if (decimal_separator == ".") {
+              		return TemplatedTryCastDecimalVector<TryCastToDecimal, T>(options, input_vector, result_vector, count,

Collaborator

Mytherin Jan 24, 2023

If we can ensure that the decimal separator is "," at a higher level (since we can use the standard cast for a decimal separator of ".") we can remove this if statement and simplify the code
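A rough sketch of the suggested structure: the separator is checked once by the caller, and the per-column cast is instantiated with a fixed operator, so the inner function no longer branches on the separator. All names here (`PeriodCast`, `CommaCast`, `CastColumn`, `CastFloatingColumn`) are hypothetical and use plain `std::vector` instead of DuckDB's `Vector`; this is an illustration of the idea, not the code that was merged.

```cpp
#include <cstdlib>
#include <string>
#include <vector>

// Hypothetical stand-in for the standard cast: expects '.' as the decimal mark.
struct PeriodCast {
    static bool Operation(const std::string &input, double &result) {
        if (input.empty()) {
            return false;
        }
        char *end = nullptr;
        result = std::strtod(input.c_str(), &end);
        return end == input.c_str() + input.size();
    }
};

// Hypothetical comma-separated variant: rejects '.', then rewrites the first
// ',' to '.' and reuses the standard cast. A second ',' makes the cast fail.
struct CommaCast {
    static bool Operation(const std::string &input, double &result) {
        if (input.find('.') != std::string::npos) {
            return false;
        }
        std::string normalized = input;
        const std::string::size_type pos = normalized.find(',');
        if (pos != std::string::npos) {
            normalized[pos] = '.';
        }
        return PeriodCast::Operation(normalized, result);
    }
};

// The column loop is templated on the operator and has no separator branch.
template <class OP>
bool CastColumn(const std::vector<std::string> &input, std::vector<double> &result) {
    result.resize(input.size());
    for (std::size_t i = 0; i < input.size(); i++) {
        if (!OP::Operation(input[i], result[i])) {
            return false;
        }
    }
    return true;
}

// The separator is inspected once, at the higher level.
bool CastFloatingColumn(const std::vector<std::string> &input, std::vector<double> &result,
                        char decimal_separator) {
    if (decimal_separator == ',') {
        return CastColumn<CommaCast>(input, result);
    }
    return CastColumn<PeriodCast>(input, result); // standard path for '.'
}
```

The design choice is that the special-case code only exists for the comma path, and a reader using "." follows exactly the same route as before the option was added.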

src/execution/operator/persistent/base_csv_reader.cpp Outdated

+              bool TryCastFloatingVector(BufferedCSVReaderOptions &options, Vector &input_vector, Vector &result_vector, idx_t count,
+                                         string &error_message, string decimal_separator) {
+              	if (decimal_separator == ".") {
+              		return TemplatedTryCastFloatingVector<TryCastErrorMessage, T>(options, input_vector, result_vector, count,

Collaborator

Mytherin Jan 24, 2023

Same here - this branch can be removed and the code can be simplified

src/execution/operator/persistent/base_csv_reader.cpp Outdated

+              template <class T>
+              bool TryCastDecimalVector(BufferedCSVReaderOptions &options, Vector &input_vector, Vector &result_vector, idx_t count,
+                                        string &error_message, uint8_t width, uint8_t scale, string decimal_separator) {

Collaborator

Mytherin Jan 24, 2023

Passing in a std::string to a function invokes a string copy - const string & would be better here (or avoid passing in the decimal separator entirely by making sure it is a comma first).
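The copy can be made visible with a small experiment. `TrackedString` below is a hypothetical wrapper that counts copy constructions; it only exists for illustration and has nothing to do with DuckDB's types.

```cpp
#include <string>
#include <utility>

// Hypothetical wrapper around std::string that counts copy constructions.
struct TrackedString {
    std::string value;
    static int copies;

    explicit TrackedString(std::string v) : value(std::move(v)) {
    }
    TrackedString(const TrackedString &other) : value(other.value) {
        ++copies;
    }
};
int TrackedString::copies = 0;

// Taking the parameter by value copies the argument when called with an lvalue.
bool IsCommaByValue(TrackedString separator) {
    return separator.value == ",";
}

// Taking it by const reference does not copy.
bool IsCommaByConstRef(const TrackedString &separator) {
    return separator.value == ",";
}
```

Calling `IsCommaByValue` with an existing `TrackedString` increments the counter once per call, while `IsCommaByConstRef` leaves it unchanged. In a function invoked for every chunk of a CSV file, the by-value form repeats that copy each time.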

eeroel added 6 commits

January 24, 2023 17:59


          remove unnecessary branches

58918cc


          move decimal separator check upstream

81db0a2


          simplify cast_operators

1ad273e


          simplify base_csv_reader

b1ceb64


          better test assertions

1b071fc


          fix format

f42f7d0

eeroel force-pushed the feature/decimal_separator_v2 branch from dfb7a51 to f42f7d0 Compare

January 24, 2023 19:38


          remove unused code

095455c

Contributor Author

eeroel commented Jan 25, 2023

Thanks for the review, all comments should be addressed now. I also added more informative assertions for the error cases in the unit test. For some reason one of the tests failed in VectorSizes, as if it was reading the rows in a different order. Is that expected?

eeroel requested a review from Mytherin

January 25, 2023 07:17

Collaborator

Mytherin commented Jan 25, 2023

No, but running with a lower vector size might influence CSV reader auto-detection. You could add require vector_size 512 to the test to disable the test with low vector sizes if you are certain it is not triggering a bug in your code.


          force vector size in test

787fc30

Contributor Author

eeroel commented Jan 25, 2023

Checked against master now and got the same failure there, so it's not related to the changes in this PR => added the vector_size restriction to the test

Mytherin merged commit 5b40a5f into duckdb:master

Collaborator

Mytherin commented Jan 26, 2023

Thanks for the changes! LGTM
