Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support read decimal data from csv reader if user provide the schema with decimal data type #941

Merged
merged 6 commits into from
Nov 16, 2021

Conversation

liukun4515
Copy link
Contributor

@liukun4515 liukun4515 commented Nov 11, 2021

Which issue does this PR close?

Closes #926

Rationale for this change

  1. support read decimal data from csv reader if user provide the schema with decimal data type
  2. add function to convert the string format decimal data type to i128 data type
  3. add test for parse decimal data type

What changes are included in this PR?

  1. add test for parse string to i128 for decimal data
  2. add it for using csv reader to read decimal data

Are there any user-facing changes?

No

@codecov-commenter
Copy link

codecov-commenter commented Nov 11, 2021

Codecov Report

Merging #941 (e93d85d) into master (f5a4341) will increase coverage by 0.03%.
The diff coverage is 92.19%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #941      +/-   ##
==========================================
+ Coverage   82.36%   82.40%   +0.03%     
==========================================
  Files         168      168              
  Lines       48587    48728     +141     
==========================================
+ Hits        40020    40153     +133     
- Misses       8567     8575       +8     
Impacted Files Coverage Δ
arrow/src/array/builder.rs 86.64% <ø> (ø)
arrow/src/csv/reader.rs 90.39% <92.19%> (+0.33%) ⬆️
arrow/src/array/transform/mod.rs 85.33% <0.00%> (ø)
parquet_derive/src/parquet_field.rs 65.98% <0.00%> (ø)
arrow/src/datatypes/field.rs 53.68% <0.00%> (+0.30%) ⬆️
arrow/src/datatypes/datatype.rs 66.38% <0.00%> (+0.42%) ⬆️
arrow/src/error.rs 17.77% <0.00%> (+2.22%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update f5a4341...e93d85d. Read the comment docs.

@liukun4515
Copy link
Contributor Author

@alamb PTAL

@alamb alamb changed the title ARROW-RS-926: support read decimal data from csv reader if user provide the schema with decimal data type Support read decimal data from csv reader if user provide the schema with decimal data type Nov 12, 2021
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the contribution @liukun4515 -- this is looking great.

I had a question on the handling of decimals when the input doesn't have the same number of values as the declared scale

@@ -513,6 +514,9 @@ fn parse(
let field = &fields[i];
match field.data_type() {
DataType::Boolean => build_boolean_array(line_number, rows, i),
DataType::Decimal(p, v) => {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think using DataType::Decimal(precision, scale) rather than DataType::Decimal(p, v) would be more consistent with the rest of the codebase

precision: usize,
scale: usize,
) -> Result<ArrayRef> {
let mut decimal_builder = DecimalBuilder::new(line_number, precision, scale);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the first argument to DecimalBuilder is the capacity (aka the expected size of the array). So I think it would be more performant to use rows.size() rather than line_number here

https://docs.rs/arrow/6.1.0/arrow/array/struct.DecimalBuilder.html#method.new

Ok(Arc::new(decimal_builder.finish()))
}

// parse the string format decimal value to i128 format.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am surprised that we don't already have a &str --> i128 parser, but I also could not find one.

Comment on lines 777 to 780
// each byte is digit、'-' or '.'
let bytes = s.as_bytes();
let mut negitive = false;
let mut result: i128 = 0;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you could probably use str::parse here (after first removing all '.') rather than a custom parsing loop

For example:

    let i: i128 = "123".parse().unwrap();
    println!("i is {}", i);

prints "123"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a simple way to convert a string to i128, but we may meet some corner case.
For example, we want to convert the value to decimal(10,2).
The string value (12.0000000000000000000000000000000000....000) (too may 'zero' after the decimal), if we remove the '.' and convert the value to i128, we may meet the overflow error.

assert_eq!("53.002666", lat.value_as_string(1));
assert_eq!("52.412811", lat.value_as_string(2));
assert_eq!("51.481583", lat.value_as_string(3))
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you also add some tests where the full precision is not supplied? For example, try parsing

57.65

(only 2 places after the decimal)

I think this implementation may end up producing a decimal of 0.005765 rather than 57.65000

Also, we should test

56.3838748284

(or something else that has more data after the decimal than the declared scale)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have added other test cases to ensure right for above case.

@liukun4515
Copy link
Contributor Author

Thank you for the contribution @liukun4515 -- this is looking great.

I had a question on the handling of decimals when the input doesn't have the same number of values as the declared scale

thanks for your comments, i also have the same question.
I will do some investigation about decimal in other DB system, and reply the question.

@liukun4515
Copy link
Contributor Author

liukun4515 commented Nov 15, 2021

Hi @alamb , I have checked the mysql and pg database.
We have two problems which should be discussed.
The first is about scale, if the input string doesn't have the same value for the declared scale, what should we do to handle this?
In the PG:

postgres=# select cast('123' as decimal(10,2));
 numeric
---------
  123.00
(1 row)

postgres=# select cast('123.123213321' as decimal(10,2));
 numeric
---------
  123.12
(1 row)

In the MYSQL:

mysql> select cast('123' as decimal(10,2));
+------------------------------+
| cast('123' as decimal(10,2)) |
+------------------------------+
|                       123.00 |
+------------------------------+
1 row in set (0.00 sec)

mysql> select cast('123.1' as decimal(10,2));
+--------------------------------+
| cast('123.1' as decimal(10,2)) |
+--------------------------------+
|                         123.10 |
+--------------------------------+
1 row in set (0.00 sec)

mysql> select cast('123.12345' as decimal(10,2));
+------------------------------------+
| cast('123.12345' as decimal(10,2)) |
+------------------------------------+
|                             123.12 |
+------------------------------------+
1 row in set (0.00 sec)

From the two database, we can know that the number of value after the decimal must match the scale parameter.
If the number of value is less than the scale, we should fill the 0 in the tail.
If the number of value is greater than the scale, we should truncate the tail to meet the scale.

The second is about precision, if the input string value is out of bounds for the declared precision, what should we do to handle this?

This is the result for PG

postgres=# select cast('123' as decimal(2,2));
ERROR:  numeric field overflow
DETAIL:  A field with precision 2, scale 2 must round to an absolute value less than 1.

This is the result for mysql

mysql> select cast('123.12345' as decimal(2,2));
+-----------------------------------+
| cast('123.12345' as decimal(2,2)) |
+-----------------------------------+
|                              0.99 |
+-----------------------------------+
1 row in set, 1 warning (0.00 sec)

mysql> show warnings;
+---------+------+----------------------------------------------------------------------------+
| Level   | Code | Message                                                                    |
+---------+------+----------------------------------------------------------------------------+
| Warning | 1264 | Out of range value for column 'cast('123.12345' as decimal(2,2))' at row 1 |
+---------+------+----------------------------------------------------------------------------+
1 row in set (0.00 sec)

In the PG and mysql, if the input string value is out of bounds for the precision, we will get some error.

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @liukun4515 --- this is now looking good to me. I had one suggestion for a test as well as some style comments, but I think we could also merge this PR as is and fix them as a follow on PR

let mut base = 1;

// handle the value after the '.' and meet the scale
let delimiter_position = s.find('.');
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder what will happen if there are two '.' in the string - like 123.456.789 ( I would expect a parsing error in this case)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this value 123.123.13 will get an error, because of the regex static ref PARSE_DECIMAL_RE: Regex = Regex::new(r"^-?(\d+\.?\d*|\d*\.?\d+)$").unwrap();.
I will add more test case and change the code style in the next pull request.

@@ -1503,6 +1586,54 @@ mod tests {
}
}

#[test]
fn test_parse_decimal_with_parameter() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

let result = parse_decimal_with_parameter(s, 20, 3);
assert_eq!(i, result.unwrap())
}
let can_not_parse_tests = ["123,123", "."];
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I recommend a test with two . in it as well, such as 123.123.123

let mut negative = false;
let mut result: i128 = 0;

while offset > 0 {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Something that might make this code faster (and more idomatic rust) would be to avoid the bytes[offset - 1] call (which does a bounds check).

It might look something like this (untested):

bytes[0..offset].iter().rev()
  .try_for_each(|b| {
    match b {
      b'-' => negative = true,
      b'.' => {},
      ...
    }
  })?;

Copy link
Contributor Author

@liukun4515 liukun4515 Nov 16, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great suggestion!!!
I'm a rust learner, some rust code style are relatively new to me.
I think we can fix it in the next pull request.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 yeah it took me a while to get into the Rust / functional style

@liukun4515
Copy link
Contributor Author

@alamb maybe you merge this pr first.
I will add the test cases you mentioned in the next pr with the code style.

@alamb alamb merged commit b03b80c into apache:master Nov 16, 2021
@liukun4515
Copy link
Contributor Author

@alamb don't forget cherry pick this pull request with #952

@liukun4515 liukun4515 deleted the arrow-rs-926 branch November 23, 2021 01:51
alamb pushed a commit that referenced this pull request Nov 23, 2021
…with decimal data type (#941)

* support decimal data type for csv reader

* format code and fix lint check

* fix the clippy error

* enchance the parse csv to decimal and add more test
alamb added a commit that referenced this pull request Nov 24, 2021
…with decimal data type (#941) (#974)

* support decimal data type for csv reader

* format code and fix lint check

* fix the clippy error

* enchance the parse csv to decimal and add more test

Co-authored-by: Kun Liu <liukun@apache.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrow Changes to the arrow crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add decimal for CSVReader if user provide the schema with decimal
3 participants