Support read decimal data from csv reader if user provide the schema with decimal data type #941

liukun4515 · 2021-11-11T11:13:57Z

Which issue does this PR close?

Closes #926

Rationale for this change

support read decimal data from csv reader if user provide the schema with decimal data type
add function to convert the string format decimal data type to i128 data type
add test for parse decimal data type

What changes are included in this PR?

add test for parse string to i128 for decimal data
add it for using csv reader to read decimal data

Are there any user-facing changes?

No

codecov-commenter · 2021-11-11T11:28:38Z

Codecov Report

Merging #941 (e93d85d) into master (f5a4341) will increase coverage by 0.03%.
The diff coverage is 92.19%.

@@            Coverage Diff             @@
##           master     #941      +/-   ##
==========================================
+ Coverage   82.36%   82.40%   +0.03%     
==========================================
  Files         168      168              
  Lines       48587    48728     +141     
==========================================
+ Hits        40020    40153     +133     
- Misses       8567     8575       +8

Impacted Files	Coverage Δ
arrow/src/array/builder.rs	`86.64% <ø> (ø)`
arrow/src/csv/reader.rs	`90.39% <92.19%> (+0.33%)`	⬆️
arrow/src/array/transform/mod.rs	`85.33% <0.00%> (ø)`
parquet_derive/src/parquet_field.rs	`65.98% <0.00%> (ø)`
arrow/src/datatypes/field.rs	`53.68% <0.00%> (+0.30%)`	⬆️
arrow/src/datatypes/datatype.rs	`66.38% <0.00%> (+0.42%)`	⬆️
arrow/src/error.rs	`17.77% <0.00%> (+2.22%)`	⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update f5a4341...e93d85d. Read the comment docs.

liukun4515 · 2021-11-12T08:07:51Z

@alamb PTAL

alamb

Thank you for the contribution @liukun4515 -- this is looking great.

I had a question on the handling of decimals when the input doesn't have the same number of values as the declared scale

alamb · 2021-11-12T12:08:28Z

arrow/src/csv/reader.rs

@@ -513,6 +514,9 @@ fn parse(
            let field = &fields[i];
            match field.data_type() {
                DataType::Boolean => build_boolean_array(line_number, rows, i),
+                DataType::Decimal(p, v) => {


I think using DataType::Decimal(precision, scale) rather than DataType::Decimal(p, v) would be more consistent with the rest of the codebase

alamb · 2021-11-12T12:11:45Z

arrow/src/csv/reader.rs

+    precision: usize,
+    scale: usize,
+) -> Result<ArrayRef> {
+    let mut decimal_builder = DecimalBuilder::new(line_number, precision, scale);


the first argument to DecimalBuilder is the capacity (aka the expected size of the array). So I think it would be more performant to use rows.size() rather than line_number here

https://docs.rs/arrow/6.1.0/arrow/array/struct.DecimalBuilder.html#method.new

alamb · 2021-11-12T12:26:13Z

arrow/src/csv/reader.rs

+    Ok(Arc::new(decimal_builder.finish()))
+}
+
+// parse the string format decimal value to i128 format.


I am surprised that we don't already have a &str --> i128 parser, but I also could not find one.

alamb · 2021-11-12T12:26:17Z

arrow/src/csv/reader.rs

+        // each byte is digit、'-' or '.'
+        let bytes = s.as_bytes();
+        let mut negitive = false;
+        let mut result: i128 = 0;


I think you could probably use str::parse here (after first removing all '.') rather than a custom parsing loop

For example:

let i: i128 = "123".parse().unwrap(); println!("i is {}", i);

prints "123"

This is a simple way to convert a string to i128, but we may meet some corner case.
For example, we want to convert the value to decimal(10,2).
The string value (12.0000000000000000000000000000000000....000) (too may 'zero' after the decimal), if we remove the '.' and convert the value to i128, we may meet the overflow error.

alamb · 2021-11-12T12:28:21Z

arrow/src/csv/reader.rs

+        assert_eq!("53.002666", lat.value_as_string(1));
+        assert_eq!("52.412811", lat.value_as_string(2));
+        assert_eq!("51.481583", lat.value_as_string(3))
+    }


Could you also add some tests where the full precision is not supplied? For example, try parsing

57.65

(only 2 places after the decimal)

I think this implementation may end up producing a decimal of 0.005765 rather than 57.65000

Also, we should test

56.3838748284

(or something else that has more data after the decimal than the declared scale)

I have added other test cases to ensure right for above case.

liukun4515 · 2021-11-14T02:44:05Z

Thank you for the contribution @liukun4515 -- this is looking great.

I had a question on the handling of decimals when the input doesn't have the same number of values as the declared scale

thanks for your comments, i also have the same question.
I will do some investigation about decimal in other DB system, and reply the question.

liukun4515 · 2021-11-15T06:26:06Z

Hi @alamb , I have checked the mysql and pg database.
We have two problems which should be discussed.
The first is about scale, if the input string doesn't have the same value for the declared scale, what should we do to handle this?
In the PG:

postgres=# select cast('123' as decimal(10,2));
 numeric
---------
  123.00
(1 row)

postgres=# select cast('123.123213321' as decimal(10,2));
 numeric
---------
  123.12
(1 row)

In the MYSQL:

mysql> select cast('123' as decimal(10,2));
+------------------------------+
| cast('123' as decimal(10,2)) |
+------------------------------+
|                       123.00 |
+------------------------------+
1 row in set (0.00 sec)

mysql> select cast('123.1' as decimal(10,2));
+--------------------------------+
| cast('123.1' as decimal(10,2)) |
+--------------------------------+
|                         123.10 |
+--------------------------------+
1 row in set (0.00 sec)

mysql> select cast('123.12345' as decimal(10,2));
+------------------------------------+
| cast('123.12345' as decimal(10,2)) |
+------------------------------------+
|                             123.12 |
+------------------------------------+
1 row in set (0.00 sec)

From the two database, we can know that the number of value after the decimal must match the scale parameter.
If the number of value is less than the scale, we should fill the 0 in the tail.
If the number of value is greater than the scale, we should truncate the tail to meet the scale.

The second is about precision, if the input string value is out of bounds for the declared precision, what should we do to handle this?

This is the result for PG

postgres=# select cast('123' as decimal(2,2));
ERROR:  numeric field overflow
DETAIL:  A field with precision 2, scale 2 must round to an absolute value less than 1.

This is the result for mysql

mysql> select cast('123.12345' as decimal(2,2));
+-----------------------------------+
| cast('123.12345' as decimal(2,2)) |
+-----------------------------------+
|                              0.99 |
+-----------------------------------+
1 row in set, 1 warning (0.00 sec)

mysql> show warnings;
+---------+------+----------------------------------------------------------------------------+
| Level   | Code | Message                                                                    |
+---------+------+----------------------------------------------------------------------------+
| Warning | 1264 | Out of range value for column 'cast('123.12345' as decimal(2,2))' at row 1 |
+---------+------+----------------------------------------------------------------------------+
1 row in set (0.00 sec)

In the PG and mysql, if the input string value is out of bounds for the precision, we will get some error.

alamb

Thank you @liukun4515 --- this is now looking good to me. I had one suggestion for a test as well as some style comments, but I think we could also merge this PR as is and fix them as a follow on PR

alamb · 2021-11-15T14:08:02Z

arrow/src/csv/reader.rs

+        let mut base = 1;
+
+        // handle the value after the '.' and meet the scale
+        let delimiter_position = s.find('.');


I wonder what will happen if there are two '.' in the string - like 123.456.789 ( I would expect a parsing error in this case)

I think this value 123.123.13 will get an error, because of the regex static ref PARSE_DECIMAL_RE: Regex = Regex::new(r"^-?(\d+\.?\d*|\d*\.?\d+)$").unwrap();.
I will add more test case and change the code style in the next pull request.

alamb · 2021-11-15T14:08:31Z

arrow/src/csv/reader.rs

@@ -1503,6 +1586,54 @@ mod tests {
        }
    }

+    #[test]
+    fn test_parse_decimal_with_parameter() {


alamb · 2021-11-15T14:09:22Z

arrow/src/csv/reader.rs

+            let result = parse_decimal_with_parameter(s, 20, 3);
+            assert_eq!(i, result.unwrap())
+        }
+        let can_not_parse_tests = ["123,123", "."];


I recommend a test with two . in it as well, such as 123.123.123

alamb · 2021-11-15T14:14:12Z

arrow/src/csv/reader.rs

+        let mut negative = false;
+        let mut result: i128 = 0;
+
+        while offset > 0 {


Something that might make this code faster (and more idomatic rust) would be to avoid the bytes[offset - 1] call (which does a bounds check).

It might look something like this (untested):

bytes[0..offset].iter().rev() .try_for_each(|b| { match b { b'-' => negative = true, b'.' => {}, ... } })?;

Great suggestion!!!
I'm a rust learner, some rust code style are relatively new to me.
I think we can fix it in the next pull request.

👍 yeah it took me a while to get into the Rust / functional style

liukun4515 · 2021-11-16T14:15:01Z

@alamb maybe you merge this pr first.
I will add the test cases you mentioned in the next pr with the code style.

liukun4515 · 2021-11-23T01:47:38Z

@alamb don't forget cherry pick this pull request with #952

…with decimal data type (#941) * support decimal data type for csv reader * format code and fix lint check * fix the clippy error * enchance the parse csv to decimal and add more test

…with decimal data type (#941) (#974) * support decimal data type for csv reader * format code and fix lint check * fix the clippy error * enchance the parse csv to decimal and add more test Co-authored-by: Kun Liu <liukun@apache.org>

liukun4515 added 2 commits November 11, 2021 19:02

support decimal data type for csv reader

c9b3b43

Merge remote-tracking branch 'origin/master' into arrow-rs-926

cbac920

github-actions bot added the arrow Changes to the arrow crate label Nov 11, 2021

liukun4515 mentioned this pull request Nov 11, 2021

Add decimal for CSVReader if user provide the schema with decimal #926

Closed

liukun4515 added 2 commits November 11, 2021 22:21

format code and fix lint check

544ffaf

fix the clippy error

0b25e97

alamb changed the title ~~ARROW-RS-926: support read decimal data from csv reader if user provide the schema with decimal data type~~ Support read decimal data from csv reader if user provide the schema with decimal data type Nov 12, 2021

alamb reviewed Nov 12, 2021

View reviewed changes

liukun4515 added 2 commits November 15, 2021 17:25

enchance the parse csv to decimal and add more test

360eb81

Merge remote-tracking branch 'origin/master' into arrow-rs-926

e93d85d

alamb approved these changes Nov 15, 2021

View reviewed changes

alamb merged commit b03b80c into apache:master Nov 16, 2021

liukun4515 mentioned this pull request Nov 17, 2021

add more error test case and change the code style #952

Merged

liukun4515 deleted the arrow-rs-926 branch November 23, 2021 01:51

alamb added the cherry-picked label Nov 23, 2021

alamb mentioned this pull request Nov 23, 2021

Cherry pick Support read decimal data from csv reader if user provide the schema with decimal data type to active_release #974

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Support read decimal data from csv reader if user provide the schema with decimal data type #941

Support read decimal data from csv reader if user provide the schema with decimal data type #941

liukun4515 commented Nov 11, 2021 •

edited

Loading

codecov-commenter commented Nov 11, 2021 •

edited

Loading

liukun4515 commented Nov 12, 2021

alamb left a comment

alamb Nov 12, 2021

alamb Nov 12, 2021

alamb Nov 12, 2021

alamb Nov 12, 2021

liukun4515 Nov 15, 2021

alamb Nov 12, 2021

liukun4515 Nov 15, 2021

liukun4515 commented Nov 14, 2021

liukun4515 commented Nov 15, 2021 •

edited

Loading

alamb left a comment

alamb Nov 15, 2021

liukun4515 Nov 16, 2021

alamb Nov 15, 2021

alamb Nov 15, 2021

alamb Nov 15, 2021

liukun4515 Nov 16, 2021 •

edited

Loading

alamb Nov 16, 2021

liukun4515 commented Nov 16, 2021

liukun4515 commented Nov 23, 2021

Support read decimal data from csv reader if user provide the schema with decimal data type #941

Support read decimal data from csv reader if user provide the schema with decimal data type #941

Conversation

liukun4515 commented Nov 11, 2021 • edited Loading

Which issue does this PR close?

Rationale for this change

What changes are included in this PR?

Are there any user-facing changes?

codecov-commenter commented Nov 11, 2021 • edited Loading

Codecov Report

liukun4515 commented Nov 12, 2021

alamb left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

liukun4515 commented Nov 14, 2021

liukun4515 commented Nov 15, 2021 • edited Loading

alamb left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

liukun4515 Nov 16, 2021 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

liukun4515 commented Nov 16, 2021

liukun4515 commented Nov 23, 2021

liukun4515 commented Nov 11, 2021 •

edited

Loading

codecov-commenter commented Nov 11, 2021 •

edited

Loading

liukun4515 commented Nov 15, 2021 •

edited

Loading

liukun4515 Nov 16, 2021 •

edited

Loading