ARROW-10173: [Rust][DataFusion] Implement support for direct comparison to scalar values #8660

yordan-pavlov · 2020-11-13T19:54:07Z

This PR addresses the inefficient comparison to scalar values, where an array is built with the scalar value repeated, by
changing the return value of expressions from Result<ArrayRef> to Result<ColumnarValue> where ColumnarValue is defined as:

pub enum ColumnarValue {
    /// Array of values
    Array(ArrayRef),
    /// A single value 
    Scalar(ScalarValue)
}

This enables scalar values to be used in comparison operators directly, and for the simple query used in the benchmark ("select f32, f64 from t where f32 >= 250 and f64 > 250") shows approximately 10x performance improvement:

before:
filter_scalar time: [35.733 ms 36.613 ms 37.924 ms]

after:
filter_scalar time: [3.5938 ms 3.6450 ms 3.7035 ms]
change: [-90.048% -89.846% -89.625%] (p = 0.00 < 0.05)

I have also added a benchmark to compare the change in performance when comparing two arrays (using query "select f32, f64 from t where f32 >= f64") and it is negligible:

before:
filter_array time: [11.601 ms 11.656 ms 11.718 ms]

after:
filter_array time: [11.854 ms 11.957 ms 12.070 ms]
change: [+1.8032% +3.6391% +5.5671%] (p = 0.00 < 0.05)
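
For readers who want the idea in isolation, here is a minimal sketch of the pattern. It is not the DataFusion code: `Vec<f64>` and `f64` stand in for `ArrayRef` and `ScalarValue`, and the names `ColumnarValueSketch`, `gt_eq` and `gt_eq_via_repeated_array` are made up for illustration.

```rust
/// Simplified stand-in for the PR's `ColumnarValue` (assumption: `Vec<f64>`
/// plays the role of `ArrayRef`, `f64` the role of `ScalarValue`).
#[derive(Debug, Clone, PartialEq)]
pub enum ColumnarValueSketch {
    /// Array of values
    Array(Vec<f64>),
    /// A single value
    Scalar(f64),
}

/// Old approach: materialize the scalar as a repeated array, then compare
/// the two arrays element by element.
pub fn gt_eq_via_repeated_array(left: &[f64], scalar: f64) -> Vec<bool> {
    let repeated = vec![scalar; left.len()]; // extra allocation for every batch
    left.iter().zip(repeated.iter()).map(|(l, r)| l >= r).collect()
}

/// New approach: when the right side is a scalar, compare against it directly.
/// Returns `None` for combinations this sketch does not handle.
pub fn gt_eq(
    left: &ColumnarValueSketch,
    right: &ColumnarValueSketch,
) -> Option<Vec<bool>> {
    use ColumnarValueSketch::*;
    match (left, right) {
        (Array(l), Array(r)) if l.len() == r.len() => {
            Some(l.iter().zip(r).map(|(a, b)| a >= b).collect())
        }
        (Array(l), Scalar(s)) => Some(l.iter().map(|a| a >= s).collect()),
        _ => None,
    }
}
```

Both functions produce the same mask for `array >= scalar`; the second one simply skips building the repeated array.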

@andygrove @alamb let me know what you think

github-actions · 2020-11-13T20:04:59Z

https://issues.apache.org/jira/browse/ARROW-10173

andygrove · 2020-11-13T21:18:08Z

rust/datafusion/src/physical_plan/expressions.rs

+            }
+            (ColumnarValue::Scalar(scalar), ColumnarValue::Array(array)) => {
+                // if right is literal and left is array - reverse operator and parameters
+                let result: Result<ArrayRef> = match &self.op {


It looks like this block is duplicated for the two match arms and could be moved into a function

Good question. The code blocks are not exactly the same; there is a small difference. Notice how under (ColumnarValue::Array(array), ColumnarValue::Scalar(scalar)) we have Operator::Lt => binary_array_op_scalar!(array, scalar.clone(), lt), but under (ColumnarValue::Scalar(scalar), ColumnarValue::Array(array)) we have Operator::Lt => binary_array_op_scalar!(array, scalar.clone(), gt).
This is because the arrow comparison kernels have only one scalar variant, in which the scalar value is always on the right-hand side of the comparison, for example pub fn lt_scalar<T>(left: &PrimitiveArray<T>, right: T::Native) -> Result<BooleanArray>. When the scalar is on the left, the operator therefore has to be reversed.

Ok, that makes sense. I hadn't looked closely enough to see the differences.

Another structure would be to normalize the invocations by finding the array, and the literal and then have a single call site for invoking the comparison

Like turning both array > lit_1 and lit_1 < array into

A = array
lit = lit_1
op = >

However this involves changing the comparison ops and I am not sure I can claim the code would be any simpler / potentially less bug prone.
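
A rough sketch of what that normalization could look like, assuming a simplified `Operand` type over `i64` values. `CmpOp`, `Operand`, `swap_operands` and `compare` are hypothetical names, not code from this PR.

```rust
/// Comparison operators (illustrative subset).
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum CmpOp {
    Lt,
    LtEq,
    Gt,
    GtEq,
    Eq,
}

impl CmpOp {
    /// Operator that gives the same result once the operands are swapped:
    /// `lit < x` is equivalent to `x > lit`.
    pub fn swap_operands(self) -> CmpOp {
        match self {
            CmpOp::Lt => CmpOp::Gt,
            CmpOp::LtEq => CmpOp::GtEq,
            CmpOp::Gt => CmpOp::Lt,
            CmpOp::GtEq => CmpOp::LtEq,
            CmpOp::Eq => CmpOp::Eq,
        }
    }
}

/// Simplified stand-in for an expression result (assumption).
pub enum Operand {
    Array(Vec<i64>),
    Scalar(i64),
}

/// The single call site: array on the left, scalar on the right.
fn array_op_scalar(array: &[i64], op: CmpOp, scalar: i64) -> Vec<bool> {
    array
        .iter()
        .map(|&v| match op {
            CmpOp::Lt => v < scalar,
            CmpOp::LtEq => v <= scalar,
            CmpOp::Gt => v > scalar,
            CmpOp::GtEq => v >= scalar,
            CmpOp::Eq => v == scalar,
        })
        .collect()
}

/// Normalize to (array, op, scalar) first, then dispatch once.
/// Returns `None` for array/array and scalar/scalar, which this sketch skips.
pub fn compare(left: &Operand, op: CmpOp, right: &Operand) -> Option<Vec<bool>> {
    match (left, right) {
        (Operand::Array(a), Operand::Scalar(s)) => Some(array_op_scalar(a, op, *s)),
        (Operand::Scalar(s), Operand::Array(a)) => {
            Some(array_op_scalar(a, op.swap_operands(), *s))
        }
        _ => None,
    }
}
```

With this shape the operator mapping lives in one place (`swap_operands`) instead of being spread over two nearly identical match arms.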

rust/datafusion/src/physical_plan/expressions.rs

andygrove · 2020-11-13T21:19:58Z

rust/datafusion/src/physical_plan/mod.rs

@@ -100,6 +100,30 @@ pub enum Distribution {
    SinglePartition,
 }

+/// Represents the result from an expression
+pub enum ColumnarValue {


Perhaps we could push the new ColumnarValue enum down to the core arrow crate since it isn't specific to DataFusion?

possibly; where / how could you see the ColumnarValue enum used in core arrow? also wouldn't ScalarValue need to move as well?

C++ uses a Datum, which is also an enum over Scalar, Array, and a few other things.
We could have a separate module called arrow::scalar, then in the long run we could convert the compute kernels to take Datum, and push the optimisations of "array vs scalar" there

I agree.

It would also simplify the API of many vertical operations (e.g. aggregates), as their input and result would have a common type. I have a branch on which I am doing that, but I have not finished it yet.

(spoiler alert: it is not so easy ^_^)

I also went down that rabbit hole over a year ago, yeah it's not easy

I think starting with ColumnarValue in DataFusion and then hoisting it out into arrow makes a lot of sense

I agree with @alamb ; I think if we wanted to move ColumnarValue into arrow it would be better to do that in a separate PR after this one has been merged

Note that our use of Datum in C++ is far from perfect. It is actually a relatively complex data structure, and I've contemplated using something simpler, without as many non-trivial C++ objects inside it, in the internals of function execution

andygrove · 2020-11-13T21:20:53Z

@yordan-pavlov I took a quick skim through and this is looking really good! Could you rebase?

rust/datafusion/src/physical_plan/filter.rs

alamb

I love this PR -- thanks @yordan-pavlov -- I think we should merge this once we get this rebased and the tests are passing. Let me know if you need any help -- this optimization is directly relevant to work we are doing in my work project.

I am going to try and fire up my TPCH benchmark locally and see if I can get any more performance data

alamb · 2020-11-14T12:47:12Z

rust/datafusion/src/physical_plan/expressions.rs

+            )?))
+        } else {
+            Err(ExecutionError::General(format!(
+                "compute_utf8_op_scalar failed to cast literal value {}",


Suggested change

"compute_utf8_op_scalar failed to cast literal value {}",

"internal error: compute_utf8_op_scalar failed to cast literal value {}",

The point being that if this code is hit it isn't likely a bug in how someone is using datafusion, it is a bug in datafusion itself.

We have an error for this: Internal
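
To illustrate the distinction being suggested, here is a small sketch with a dedicated internal-error variant. `SketchError`, `Literal` and `expect_utf8` are hypothetical names used only for this example; the real error type in DataFusion may be shaped differently.

```rust
use std::fmt;

/// Hypothetical error type with a separate variant for engine bugs.
#[derive(Debug, PartialEq)]
pub enum SketchError {
    /// Problem caused by the query or the input data.
    General(String),
    /// Invariant violated inside the engine; indicates a bug in the engine.
    Internal(String),
}

impl fmt::Display for SketchError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            SketchError::General(msg) => write!(f, "error: {}", msg),
            SketchError::Internal(msg) => write!(f, "internal error: {}", msg),
        }
    }
}

impl std::error::Error for SketchError {}

/// Simplified stand-in for a scalar literal (assumption).
#[derive(Debug)]
pub enum Literal {
    Utf8(String),
    Int64(i64),
}

/// The planner is expected to pass only Utf8 literals here, so anything else
/// is reported as an internal error rather than a user error.
pub fn expect_utf8(literal: &Literal) -> Result<&str, SketchError> {
    match literal {
        Literal::Utf8(s) => Ok(s.as_str()),
        other => Err(SketchError::Internal(format!(
            "compute_utf8_op_scalar failed to cast literal value {:?}",
            other
        ))),
    }
}
```

The variant carries the "this is our bug" meaning, so the message itself does not need the "internal error:" prefix typed by hand at each call site.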

rust/datafusion/src/physical_plan/expressions.rs

alamb · 2020-11-14T12:52:24Z

rust/datafusion/src/physical_plan/expressions.rs

+            }
+            (ColumnarValue::Scalar(scalar), ColumnarValue::Array(array)) => {
+                // if right is literal and left is array - reverse operator and parameters
+                let result: Result<ArrayRef> = match &self.op {


Another structure would be to normalize the invocations by finding the array, and the literal and then have a single call site for invoking the comparison

Like turning both array > lit_1 and lit_1 < array into

A = array
lit = lit_1
op = >

However this involves changing the comparison ops and I am not sure I can claim the code would be any simpler / potentially less bug prone.

alamb · 2020-11-14T12:54:12Z

rust/datafusion/src/physical_plan/expressions.rs

+        let (left, right) = match (left_value, right_value) {
+            // if both arrays - extract and continue execution
+            (ColumnarValue::Array(left), ColumnarValue::Array(right)) => (left, right),
+            // if both literals - not supported


I think this is fine -- we should handle such scalar op scalar things in the planner / optimizer, in my opinion
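
As a sketch of what handling scalar op scalar in the planner could mean, here is a tiny constant-folding pass over a toy expression tree. `Expr` and `fold_constants` are hypothetical and much simpler than DataFusion's logical expressions.

```rust
/// Toy expression tree (assumption: only columns, integer literals and `+`).
#[derive(Debug, Clone, PartialEq)]
pub enum Expr {
    Column(String),
    Literal(i64),
    Add(Box<Expr>, Box<Expr>),
}

/// Replace `literal + literal` with a single literal, bottom-up, so that
/// execution never sees a scalar-op-scalar expression.
pub fn fold_constants(expr: Expr) -> Expr {
    match expr {
        Expr::Add(left, right) => {
            let left = fold_constants(*left);
            let right = fold_constants(*right);
            if let (Expr::Literal(a), Expr::Literal(b)) = (&left, &right) {
                // Leave the expression untouched if the sum would overflow.
                if let Some(sum) = a.checked_add(*b) {
                    return Expr::Literal(sum);
                }
            }
            Expr::Add(Box::new(left), Box::new(right))
        }
        other => other,
    }
}
```

After such a pass, `a + (1 + 2)` reaches execution as `a + 3`, which is the array op scalar case this PR speeds up.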

alamb · 2020-11-14T12:55:38Z

rust/datafusion/src/physical_plan/expressions.rs

@@ -1571,24 +1754,6 @@ impl Literal {
    }
 }

-/// Build array containing the same literal value repeated. This is necessary because the Arrow


alamb · 2020-11-14T12:56:14Z

rust/datafusion/src/physical_plan/expressions.rs

+        let array_to_sort = match values_to_sort {
+            ColumnarValue::Array(array) => array,
+            ColumnarValue::Scalar(scalar) => {
+                return Err(ExecutionError::General(format!(


again, I like this approach -- we should be removing scalar values out of Sort exprs in the planner, not during execution

alamb · 2020-11-14T12:57:18Z

rust/datafusion/src/physical_plan/mod.rs

@@ -100,6 +100,30 @@ pub enum Distribution {
    SinglePartition,
 }

+/// Represents the result from an expression
+pub enum ColumnarValue {


I think starting with ColumnarValue in DataFusion and then hoisting it out into arrow makes a lot of sense

alamb · 2020-11-14T12:58:37Z

rust/datafusion/src/scalar.rs

+        self.to_array_of_size(1)
+    }
+
+    /// Converts a scalar value into an 1-row array.


Suggested change

/// Converts a scalar value into an 1-row array.

/// Converts a scalar value into an array of `size` rows.

alamb · 2020-11-14T13:10:18Z

I tried to run this code against our simple TPCH Q1 implementation (which is dominated by evaluating expressions against constants), and sadly it hit an error

cd arrow/rust/benchmarks
cargo run --release --bin tpch -- --iterations 3 --path /Users/alamb/Software/tpch_data/SF10-parquet-64 --format parquet --query 1 --batch-size 4096
    Finished release [optimized] target(s) in 0.13s
     Running `/Users/alamb/Software/arrow2/rust/target/release/tpch --iterations 3 --path /Users/alamb/Software/tpch_data/SF10-parquet-64 --format parquet --query 1 --batch-size 4096`
Running benchmarks with the following options: TpchOpt { query: 1, debug: false, iterations: 3, concurrency: 2, batch_size: 4096, path: "/Users/alamb/Software/tpch_data/SF10-parquet-64", file_format: "parquet" }

Error: ArrowError(ExternalError(General("Scalar values on left side of operator - are not supported")))

I plan to figure out what is going on and try to fix it tomorrow morning, US Eastern time. As I said, I think this PR is a major step forward and I want to see it merged!

Andrew

andygrove · 2020-11-14T15:44:39Z

I would recommend that we implement an optimizer rule to "swap" the order of expressions when we see an unsupported combination such as a scalar on the left. The logic for this rule already exists in this PR. The benefit of having it as an optimizer rule is that it would handle invalid plans created from both the SQL and DataFrame APIs.
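
A minimal sketch of such a rewrite rule on a toy plan expression, assuming hypothetical names (`PlanExpr`, `Op`, `mirrored`, `scalar_to_right`). Note that a non-commutative operator such as `-` has no mirrored form, so this sketch leaves it untouched; swapping alone would not cover that case.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Op {
    Lt,
    Gt,
    Eq,
    Minus,
}

/// Toy plan expression (assumption).
#[derive(Debug, Clone, PartialEq)]
pub enum PlanExpr {
    Column(String),
    Literal(i64),
    Binary {
        left: Box<PlanExpr>,
        op: Op,
        right: Box<PlanExpr>,
    },
}

/// Operator to use after swapping the operands, if one exists.
fn mirrored(op: Op) -> Option<Op> {
    match op {
        Op::Lt => Some(Op::Gt),
        Op::Gt => Some(Op::Lt),
        Op::Eq => Some(Op::Eq),
        Op::Minus => None, // `5 - a` cannot be rewritten as `a <op> 5`
    }
}

/// Rewrite `literal <op> expr` into `expr <mirrored op> literal` where possible.
pub fn scalar_to_right(expr: PlanExpr) -> PlanExpr {
    match expr {
        PlanExpr::Binary { left, op, right } => {
            let left = scalar_to_right(*left);
            let right = scalar_to_right(*right);
            let left_is_literal = matches!(left, PlanExpr::Literal(_));
            let right_is_literal = matches!(right, PlanExpr::Literal(_));
            match mirrored(op) {
                Some(new_op) if left_is_literal && !right_is_literal => PlanExpr::Binary {
                    left: Box::new(right),
                    op: new_op,
                    right: Box::new(left),
                },
                _ => PlanExpr::Binary {
                    left: Box::new(left),
                    op,
                    right: Box::new(right),
                },
            }
        }
        other => other,
    }
}
```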

yordan-pavlov · 2020-11-14T17:31:22Z

thanks for all the comments, I will try to rebase this evening

yordan-pavlov · 2020-11-14T22:51:37Z

I have now rebased the branch on the latest changes from master; will try to review comments tomorrow.

yordan-pavlov · 2020-11-15T22:05:11Z

I tried to run this code against our simple TPCH Q1 implementation (which is dominated by evaluating expressions against constants), and sadly it hit an error

@alamb it might be necessary to fall back to generating an array where the scalar value is repeated, for some operations that do not have a version which accepts a scalar argument

alamb · 2020-11-16T13:09:42Z

@alamb it might be necessary to fall back to generating an array where the scalar value is repeated, for some operations that do not have a version which accepts a scalar argument

@yordan-pavlov I agree with this plan. While long term I would like to see the ability to evaluate every function with constant arguments, that is definitely beyond the scope of this PR, and so keeping the fallback to constant arrays is a good idea.

Given how important this functionality is, I suggest we focus on getting this PR into master as soon as possible and then iterating on additional functionality like additional function support, converting to canonical expr _op_ colref form, constant folding, hoisting ColumnarValue into arrow, etc. There is so much great work to do!

yordan-pavlov · 2020-11-16T21:23:07Z

@alamb I have now implemented falling back to scalar arrays for operations where scalar arguments are currently not supported; this should now work for operators such as "-" where previously an error was returned ("Scalar values on left side of operator - are not supported"); I have also added an extra test for this;
could you try running the TPCH test again?
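
The fallback described here can be pictured roughly as follows. This is a simplified sketch with made-up names (`Value`, `BinOp`, `Output`, `evaluate`), not the code in this branch: the one operator with a scalar fast path uses it, and everything else expands scalars into arrays of `num_rows` rows before calling the array kernel.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum BinOp {
    Gt,
    Minus,
}

/// Simplified stand-in for an expression result (assumption).
pub enum Value {
    Array(Vec<i64>),
    Scalar(i64),
}

impl Value {
    /// Expand a scalar into an array of `size` rows; arrays pass through.
    fn into_array(self, size: usize) -> Vec<i64> {
        match self {
            Value::Array(a) => a,
            Value::Scalar(s) => vec![s; size],
        }
    }
}

#[derive(Debug, PartialEq)]
pub enum Output {
    Bool(Vec<bool>),
    Int(Vec<i64>),
}

/// Array/array kernels, available for every operator.
fn array_op_array(left: &[i64], op: BinOp, right: &[i64]) -> Output {
    match op {
        BinOp::Gt => Output::Bool(left.iter().zip(right).map(|(a, b)| a > b).collect()),
        BinOp::Minus => Output::Int(left.iter().zip(right).map(|(a, b)| a - b).collect()),
    }
}

pub fn evaluate(left: Value, op: BinOp, right: Value, num_rows: usize) -> Output {
    // Fast path: only `array > scalar` has a scalar kernel in this sketch.
    if let (Value::Array(a), BinOp::Gt, Value::Scalar(s)) = (&left, op, &right) {
        return Output::Bool(a.iter().map(|v| v > s).collect());
    }
    // Fallback: repeat scalars into arrays and use the array/array kernel.
    let left = left.into_array(num_rows);
    let right = right.into_array(num_rows);
    array_op_array(&left, op, &right)
}
```

With this shape, an expression like `10 - column` no longer errors; it just takes the slower array path.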

alamb · 2020-11-16T23:24:14Z

@yordan-pavlov When I ran the benchmark locally again on my laptop:

cargo run --release --bin tpch -- --iterations 3 --path /Users/alamb/Software/tpch_data/SF10-parquet-64 --format parquet --query 1 --batch-size 4096

benchmarks on this branch (fd77fd6)

Running benchmarks with the following options: TpchOpt { query: 1, debug: false, iterations: 3, concurrency: 2, batch_size: 4096, path: "/Users/alamb/Software/tpch_data/SF10-parquet-64", file_format: "parquet", mem_table: false }
Query 1 iteration 0 took 5371 ms
Query 1 iteration 1 took 6121 ms
Query 1 iteration 2 took 6160 ms
alamb@ip-192-168-0-133 benchmarks %

benchmark on master e5fce7f

Running benchmarks with the following options: TpchOpt { query: 1, debug: false, iterations: 3, concurrency: 2, batch_size: 4096, path: "/Users/alamb/Software/tpch_data/SF10-parquet-64", file_format: "parquet", mem_table: false }
Query 1 iteration 0 took 7640 ms
Query 1 iteration 1 took 7808 ms
Query 1 iteration 2 took 7765 ms

So it seems to me that the performance improved a bit. I am second-guessing my performance setup here: it is on a laptop and I do wonder if my CPU is being throttled.

Regardless, I think these are good enough results!

codecov-io · 2020-11-17T00:47:16Z

Codecov Report

Merging #8660 (e5fce7f) into master (6b910ab) will decrease coverage by 0.01%.
The diff coverage is 75.00%.

@@            Coverage Diff             @@
##           master    #8660      +/-   ##
==========================================
- Coverage   84.56%   84.54%   -0.02%     
==========================================
  Files         177      177              
  Lines       43611    43645      +34     
==========================================
+ Hits        36879    36901      +22     
- Misses       6732     6744      +12

| Impacted Files | Coverage Δ | |
|---|---|---|
| rust/arrow/src/compute/kernels/cast.rs | `96.66% <ø> (ø)` | |
| rust/arrow/src/util/bit_util.rs | `100.00% <ø> (ø)` | |
| ...ust/datafusion/src/physical_plan/hash_aggregate.rs | `86.68% <69.04%> (-2.59%)` | ⬇️ |
| rust/arrow/src/buffer.rs | `95.65% <100.00%> (ø)` | |
| rust/arrow/src/compute/kernels/aggregate.rs | `100.00% <100.00%> (ø)` | |
| rust/datafusion/src/physical_plan/merge.rs | `66.12% <0.00%> (+1.61%)` | ⬆️ |

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 6b910ab...fd77fd6.

yordan-pavlov · 2020-11-17T18:32:51Z

@alamb thanks for running that test again; even if comparisons against scalar values are significantly faster, there is much more to the tests, such as loading data, etc.; I will be looking at that next. Overall though, good result as you said.

jorgecarleitao

LGTM! Super cool!

jorgecarleitao · 2020-11-14T13:52:40Z

rust/datafusion/src/physical_plan/expressions.rs

+            )?))
+        } else {
+            Err(ExecutionError::General(format!(
+                "compute_utf8_op_scalar failed to cast literal value {}",


We have an error for this: Internal

@andygrove

…on to scalar values

Closes apache#8660 from yordan-pavlov/impl_scalar_expr_results

Lead-authored-by: Yordan Pavlov <yordan.pavlov@outlook.com>
Co-authored-by: Yordan Pavlov <64363766+yordan-pavlov@users.noreply.github.com>
Signed-off-by: Jorge C. Leitao <jorgecarleitao@gmail.com>

github-actions bot added Component: Rust - DataFusion Component: Rust labels Nov 13, 2020

andygrove reviewed Nov 13, 2020

rust/datafusion/src/physical_plan/expressions.rs

andygrove reviewed Nov 13, 2020

nevi-me added the needs-rebase A PR that needs to be rebased by the author label Nov 14, 2020

jorgecarleitao reviewed Nov 14, 2020

rust/datafusion/src/physical_plan/filter.rs

alamb approved these changes Nov 14, 2020

yordan-pavlov force-pushed the impl_scalar_expr_results branch 2 times, most recently from 66d8080 to 3e12c10 Compare November 14, 2020 22:39

yordan-pavlov added 5 commits November 15, 2020 21:57

implement filter query benchmark

b955ae7

change PhysicalExpr::evaluate() to return ColumnarValue

eeb280a

fix rustfmt issues

44a5020

add benchmark for comparing values from two arrays

1926366

implement support for LIKE expression with scalar argument

68f7136

fix query_without_from test

305676f

yordan-pavlov force-pushed the impl_scalar_expr_results branch from e08d646 to 305676f Compare November 15, 2020 22:24

nevi-me removed the needs-rebase A PR that needs to be rebased by the author label Nov 16, 2020

yordan-pavlov added 3 commits November 16, 2020 19:03

fix rustfmt issues in scalar.rs

8344695

implement fallback to array implementation for operations that do not…

8077f30

… have a scalar implementation

fix rustfmt issues in tests/sql.rs

fd77fd6

yordan-pavlov and others added 3 commits November 17, 2020 18:37

Update rust/datafusion/src/physical_plan/expressions.rs to add comment

213bac8

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>

fix rustfmt issue in comment

e811015

fix clippy errors

a96bc3b

jorgecarleitao approved these changes Nov 18, 2020

jorgecarleitao closed this in c8c2110 Nov 18, 2020

asfimport mentioned this pull request Dec 24, 2020

[Rust][DataFusion] Improve performance of equality to a constant predicate support #26181

Closed