
feat: Implement Spark-compatible CAST from string to integral types #307

Merged: 49 commits, May 1, 2024

Conversation

andygrove
Member

@andygrove andygrove commented Apr 23, 2024

Which issue does this PR close?

Part of #286
Closes #15

Rationale for this change

Improve compatibility with Apache Spark

What changes are included in this PR?

Add a custom implementation of CAST from string to integral types rather than delegating to DataFusion.
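For context, a rough sketch of the shape of the change (illustrative only; the enum and error type here are simplified stand-ins for the PR's actual EvalMode and CometError, and std's parse does not match Spark's full semantics):

#[derive(Clone, Copy, PartialEq)]
enum EvalMode {
    Legacy,
    Try,
    Ansi,
}

// Sketch only: the point is that the cast parses the string itself so that
// invalid input can map to null (LEGACY/TRY) or an error (ANSI), rather
// than delegating to a generic cast kernel.
fn cast_string_to_i32_sketch(s: &str, eval_mode: EvalMode) -> Result<Option<i32>, String> {
    match s.trim().parse::<i32>() {
        Ok(v) => Ok(Some(v)),
        Err(_) if eval_mode == EvalMode::Ansi => {
            Err(format!("cannot cast STRING '{s}' to INT")) // message simplified
        }
        Err(_) => Ok(None),
    }
}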

How are these changes tested?

@andygrove andygrove marked this pull request as draft April 23, 2024 15:17
@andygrove
Member Author

I am now working on refactoring to reduce code duplication by leveraging macros/generics.

@andygrove andygrove marked this pull request as ready for review April 23, 2024 19:23
(
DataType::Dictionary(key_type, value_type),
DataType::Int8 | DataType::Int16 | DataType::Int32 | DataType::Int64,
) if key_type.as_ref() == &DataType::Int32
Member Author

@viirya do you know if dictionary keys will always be i32?

Contributor

I've been assuming it to be so, though @viirya can give us the definitive answer.

Member

I remember that in many places in the native code we assume dictionary keys are always Int32 type.

But I forget where we make that assumption. 😅

cc @sunchao Do you remember that?

Member

Oh, I see. I think the assumption comes from the native scan side, where the Parquet dictionary indices are always of integer type, so dictionary keys read from the native scan are always Int32.

Member

You can check the DictDecoder in the native scan implementation.

Member

The exception would be if some operator or expression during execution produced a dictionary with keys of a type other than Int32. But I think that should be considered a bug for us to fix, because I don't think it makes sense to change the dictionary key type.

Member

Oh, I see. I think the assumption comes from the native scan side, where the Parquet dictionary indices are always of integer type, so dictionary keys read from the native scan are always Int32.

Yes that is exactly right.

Member

Thanks @sunchao for confirming it.
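For illustration, the Int32-key assumption shows up in downcasts of this shape (a sketch against arrow-rs, not the exact code in this PR):

use arrow::array::{Array, DictionaryArray};
use arrow::datatypes::Int32Type;

// Sketch: relies on the invariant that dictionary keys produced by the
// native Parquet scan are always Int32; any other key width returns None.
fn as_int32_dictionary(array: &dyn Array) -> Option<&DictionaryArray<Int32Type>> {
    array.as_any().downcast_ref::<DictionaryArray<Int32Type>>()
}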

@@ -64,6 +68,25 @@ pub struct Cast {
pub timezone: String,
}

macro_rules! spark_cast_utf8_to_integral {
Contributor

maybe utf8_to_integer?

Spark is not involved in native exec, so I'm not sure why the spark prefix is needed.
Also, integral types include booleans, and this scope is limited to integers AFAIK.

macro_rules! spark_cast_utf8_to_integral {
($string_array:expr, $eval_mode:expr, $array_type:ty, $cast_method:ident) => {{
let mut cast_array = PrimitiveArray::<$array_type>::builder($string_array.len());
for i in 0..$string_array.len() {
Contributor

We could probably use an iterator instead of a for loop?

And let's calculate $string_array.len() once.

Member Author

I think $string_array.len() is already computed only once?

Contributor

I see them on lines 73, 74 🤔

Member Author

I missed that! Thanks

Member Author

Fixed

Ok(spark_cast(cast_result, from_type, to_type))
}

fn spark_cast_string_to_integral(
Contributor

string_to_int?

@comphead
Contributor

Thanks @andygrove. BTW, I'm wondering if this PR should also cover number pattern formatting: https://spark.apache.org/docs/latest/sql-ref-number-pattern.html#the-to_number-function

andygrove and others added 3 commits April 23, 2024 13:47
Co-authored-by: comphead <comphead@users.noreply.github.com>
…datafusion-comet into cast-string-to-integral
}

ignore("cast string to short") {
test("cast string to short") {
Contributor

We should probably have some negative tests with invalid strings.
Also, curious, what does cast(".") yield?

Member Author

The fuzz testing does generate many invalid inputs. I can add some more explicit ones to these tests, though.

cast(".") will yield different results depending on the eval mode:

  • LEGACY -> 0
  • TRY -> null
  • ANSI -> error
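To make the LEGACY case concrete, a toy sketch of the dot-truncation idea (my simplification, not the PR's parser, which also handles signs, whitespace, and overflow):

// Toy sketch: LEGACY parsing truncates at the first '.', so "." has an
// empty integer part and yields 0; TRY maps failures to null and ANSI
// errors instead (see the none_or_err helper later in this review).
fn legacy_parse_i32(s: &str) -> Option<i32> {
    let s = s.trim();
    if s.is_empty() {
        return None;
    }
    let int_part = match s.find('.') {
        Some(idx) => &s[..idx], // discard the fractional part
        None => s,
    };
    if int_part.is_empty() {
        Some(0) // "." and ".5" truncate to 0 (assumed legacy behavior)
    } else {
        int_part.parse::<i32>().ok()
    }
}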

Contributor

+1 for these tests, btw

@andygrove
Member Author

Thanks @andygrove. BTW, I'm wondering if this PR should also cover number pattern formatting: https://spark.apache.org/docs/latest/sql-ref-number-pattern.html#the-to_number-function

Sorry, I'm not sure I understand. Are you referring to the error message formatting?

@comphead
Contributor

Thanks @andygrove. BTW, I'm wondering if this PR should also cover number pattern formatting: https://spark.apache.org/docs/latest/sql-ref-number-pattern.html#the-to_number-function

Sorry, I'm not sure I understand. Are you referring to the error message formatting?

Oh, it covers just casting string to integers. I thought to_number() would also be covered, since it has casting behind the scenes.

Comment on lines 443 to 444
let negative = chars[0] == '-';
if negative || chars[0] == '+' {
Contributor

This seems wrong.
It should be chars[i] == '-' instead? Otherwise, this cast doesn't work for -124

Member Author

Thank you! The code was originally trimming the string before this point and I missed updating this when I removed the trim. I have now fixed this.
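A small test sketch of the fixed behavior (hypothetical test, mirroring the parsing loop's index handling):

#[test]
fn sign_checked_at_parse_position() {
    // Sketch of the fix: look for the sign at the current parse index
    // (after skipping leading whitespace), not at chars[0].
    let chars: Vec<char> = "  -124".chars().collect();
    let mut i = 0;
    while i < chars.len() && chars[i].is_whitespace() {
        i += 1;
    }
    let negative = i < chars.len() && chars[i] == '-';
    assert!(negative);
}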

use super::{cast_string_to_i8, EvalMode};

#[test]
fn test_cast_string_as_i8() {
Contributor

How about adding more tests for i32 and i64 with their min/max and zero inputs?

Member Author

I am going to focus on improving the tests in this PR today

Member Author

I have now added tests for all min/max boundary values in the Scala tests
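For example, a sketch of the kind of boundary coverage added (illustrative; the real assertions live in the Scala tests and this PR's Rust unit tests):

#[test]
fn cast_string_to_i32_boundaries() {
    // In TRY mode, out-of-range input maps to null rather than an error.
    for (input, expected) in [
        ("-2147483648", Some(i32::MIN)),
        ("2147483647", Some(i32::MAX)),
        ("0", Some(0)),
        ("2147483648", None), // overflows i32
    ] {
        assert_eq!(cast_string_to_i32(input, EvalMode::Try).unwrap(), expected);
    }
}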

@@ -103,10 +125,72 @@ impl Cast {
(DataType::LargeUtf8, DataType::Boolean) => {
Self::spark_cast_utf8_to_boolean::<i64>(&array, self.eval_mode)?
Contributor

Not part of this PR, but if we are going to name the added method cast_string_to_int, should this method be renamed to cast_utf8_to_boolean as well in a follow-up PR?

Member Author

I agree. I didn't want to start making unrelated changes in this PR, but we should rename this.

// Note that we are unpacking a dictionary-encoded array and then performing
// the cast. We could potentially improve performance here by casting the
// dictionary values directly without unpacking the array first, although this
// would add more complexity to the code
Contributor

I think we can leave a TODO to cast the dictionary directly?
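For reference, a sketch of the current unpack-then-cast step and where that TODO would apply (arrow-rs usage assumed; not the exact PR code):

use arrow::array::{ArrayRef, DictionaryArray};
use arrow::compute::take;
use arrow::datatypes::Int32Type;
use arrow::error::ArrowError;

// Sketch: unpack by gathering values through the keys, then cast the flat
// array. The TODO optimization would instead cast dict.values() once and
// rewrap the result with the original keys.
fn unpack_dictionary(dict: &DictionaryArray<Int32Type>) -> Result<ArrayRef, ArrowError> {
    take(dict.values(), dict.keys(), None)
}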


ignore("cast string to long") {
castTest(generateStrings(numericPattern, 8).toDF("a"), DataTypes.LongType)
private val castStringToIntegralInputs: Seq[String] = Seq(
Contributor

nit: Since the cast code handles leading and trailing whitespace, I think we can add more inputs with whitespace.

For example:

castStringToIntegralInputs.flatMap { x => Seq("  " + x, x + "   ", "   " + x + "  ") }

Contributor

@advancedxy advancedxy left a comment

Thanks for your effort @andygrove, the new code is well crafted.

@andygrove
Member Author

@viirya @sunchao @parthchandra @comphead I did quite a bit of refactoring and performance tuning over the weekend. Please take another look when you can.

@andygrove
Member Author

Thanks for your effort @andygrove, the new code is well crafted.

Thank you for the thorough review @advancedxy!

let len = $array.len();
let mut cast_array = PrimitiveArray::<$array_type>::builder(len);
for i in 0..len {
if $array.is_null(i) {
Contributor

maybe it can be simplified to

if let Some(cast_value) = $cast_method($array.value(i).trim(), $eval_mode)? {
    cast_array.append_value(cast_value);
} else {
    cast_array.append_null()
}

Member Author

If there is a null input then we will always want a null output and we don't want to add the overhead of calling the cast logic in this case.
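A self-contained sketch of the fast path being described (std's parse stands in for the Spark-compatible parser here):

use arrow::array::{Array, Int32Array, StringArray};

// Sketch: null inputs short-circuit straight to a null output slot; the
// parse function is only invoked for non-null values.
fn cast_utf8_to_i32_sketch(array: &StringArray) -> Int32Array {
    let mut builder = Int32Array::builder(array.len());
    for i in 0..array.len() {
        if array.is_null(i) {
            builder.append_null(); // fast path: no parsing for nulls
        } else if let Ok(v) = array.value(i).trim().parse::<i32>() {
            builder.append_value(v);
        } else {
            builder.append_null();
        }
    }
    builder.finish()
}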

Contributor

@comphead comphead left a comment

LGTM, thanks @andygrove. A couple of minor comments.

Comment on lines 404 to 418
/// Either return Ok(None) or Err(CometError::CastInvalidValue) depending on the evaluation mode
fn none_or_err<T>(eval_mode: EvalMode, type_name: &str, str: &str) -> CometResult<Option<T>> {
match eval_mode {
EvalMode::Ansi => Err(invalid_value(str, "STRING", type_name)),
_ => Ok(None),
}
}

fn invalid_value(value: &str, from_type: &str, to_type: &str) -> CometError {
CometError::CastInvalidValue {
value: value.to_string(),
from_type: from_type.to_string(),
to_type: to_type.to_string(),
}
}
Member

I think these can be inline functions?

Member Author

Thanks. I have updated this.
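For reference, the resulting shape (a sketch: the invalid_value helper from the diff above with the suggested attribute added):

// Sketch of the suggested change: mark the small helper #[inline] so it can
// be inlined into the hot per-row parsing loop; none_or_err gets the same.
#[inline]
fn invalid_value(value: &str, from_type: &str, to_type: &str) -> CometError {
    CometError::CastInvalidValue {
        value: value.to_string(),
        from_type: from_type.to_string(),
        to_type: to_type.to_string(),
    }
}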

Comment on lines 255 to 261
fn cast_string_to_i32(str: &str, eval_mode: EvalMode) -> CometResult<Option<i32>> {
do_cast_string_to_int::<i32>(str, eval_mode, "INT", i32::MIN)
}

fn cast_string_to_i64(str: &str, eval_mode: EvalMode) -> CometResult<Option<i64>> {
do_cast_string_to_int::<i64>(str, eval_mode, "BIGINT", i64::MIN)
}
Member

Why do only i8 and i16 have a range check?

Member Author

The code is ported directly from Spark. This is the approach that is used there.

Member Author

Spark has IntWrapper and LongWrapper which are equivalent to do_cast_string_to_int::<i32> and do_cast_string_to_int::<i64> in this PR.

This is the logic for casting to byte in Spark. It uses IntWrapper then casts to byte.

  public boolean toByte(IntWrapper intWrapper) {
    if (toInt(intWrapper)) {
      int intValue = intWrapper.value;
      byte result = (byte) intValue;
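In Rust terms, the same parse-wide-then-narrow shape might look like this (a sketch reusing this PR's do_cast_string_to_int and none_or_err; the PR's actual helper may differ):

// Sketch: parse into i32 first (Spark's IntWrapper equivalent), then
// range-check when narrowing to i8; out-of-range follows the eval mode.
fn cast_string_to_i8_sketch(str: &str, eval_mode: EvalMode) -> CometResult<Option<i8>> {
    match do_cast_string_to_int::<i32>(str, eval_mode, "TINYINT", i32::MIN)? {
        Some(v) if i8::try_from(v).is_ok() => Ok(Some(v as i8)),
        Some(_) => none_or_err(eval_mode, "TINYINT", str),
        None => Ok(None),
    }
}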

Member Author

I pushed a commit to add some comments referencing the Spark code that this code is based on

Member Author

@viirya Let me know if there is anything else to address. I have upmerged with the latest from the main branch, so this PR is a little smaller now.

Member

I will take another look at this tonight.

Comment on lines 356 to 358
// Since the previous result is less than or equal to stopValue(Integer.MIN_VALUE /
// radix), we can just use `result > 0` to check overflow. If result
// overflows, we should stop
Member

Do you mean "more than or equal to"? I think the above condition (L352) is already for result < stop_value?

Member Author

The comment was copied from the Spark code in org/apache/spark/unsafe/types/UTF8String.java, but I agree that it seems incorrect. I have updated it.

@viirya
Member

viirya commented May 1, 2024

Looks good to me. Thanks @andygrove

@andygrove andygrove merged commit cbf4730 into apache:main May 1, 2024
28 checks passed
Contributor

@parthchandra parthchandra left a comment

LGTM

return none_or_err(eval_mode, type_name, str);
};

// We are going to process the new digit and accumulate the result. However, before
Contributor

A comment to explain why we're using subtraction instead of addition would make it easier to understand this part of the code.
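For readers following along, the reason in a nutshell: i32::MIN has a greater magnitude than i32::MAX, so accumulating in the negative range can represent every valid value (including "-2147483648") before the final sign flip. A standalone sketch, using checked arithmetic rather than the stop-value comparison used in this PR:

// Sketch: accumulate negatively so the minimum value never overflows
// before the final negation for positive inputs.
fn parse_i32_negative_accumulate(s: &str) -> Option<i32> {
    let (negative, digits) = match s.strip_prefix('-') {
        Some(rest) => (true, rest),
        None => (false, s.strip_prefix('+').unwrap_or(s)),
    };
    if digits.is_empty() {
        return None;
    }
    let mut result: i32 = 0;
    for b in digits.bytes() {
        let d = (b as char).to_digit(10)? as i32;
        result = result.checked_mul(10)?.checked_sub(d)?; // accumulate negatively
    }
    if negative {
        Some(result) // i32::MIN is representable directly
    } else {
        result.checked_neg() // fails only at 2147483648 and beyond
    }
}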
