Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Rust][DataFusion] Improve performance of equality to a constant predicate support #26181

Closed
asfimport opened this issue Oct 5, 2020 · 5 comments

Comments

@asfimport
Copy link

asfimport commented Oct 5, 2020

I noticed this behavior while working on support for DictionaryArrays and wanted to capture it in a ticket in case someone has time to work on it.

In order to implement an equality predicate to a constant such as d1 = 'three', DataFusion effectively creates an array with the same value 'three' repeated over and over again and uses the equality compute kernel. This is ... suboptimal.

Here is what the predicate looks like:

        predicate: BinaryExpr {
            left: CastExpr {
                expr: Column {
                    name: "d1",
                },
                cast_type: Utf8,
            },
            op: Eq,
            right: Literal {
                value: Utf8("three"),
            },
        },

Reporter: Andrew Lamb / @alamb
Assignee: Yordan Pavlov / @yordan-pavlov

Related issues:

PRs and other links:

Note: This issue was originally created as ARROW-10173. Please see the migration documentation for further details.

@asfimport
Copy link
Author

Yordan Pavlov / @yordan-pavlov:
this issue is related to:

https://issues.apache.org/jira/browse/ARROW-8908 

which proposes an optimization to creating literal arrays; 

this, however, is still sub-optimal; ideally comparison to literal values should be performed directly as proposed here:

https://issues.apache.org/jira/browse/ARROW-8907

 

@asfimport
Copy link
Author

Andrew Lamb / @alamb:
I agree that implementing direct comparison of array to scalar is the fastest way to make such predicates fast

@asfimport
Copy link
Author

Yordan Pavlov / @yordan-pavlov:
I have an initial implementation of direct comparison operations to scalar values in datafusion which, for the simple query used in the benchmark ("select f32, f64 from t where f32 >= 250 and f64 > 250") shows approximately 10x performance improvement:

before:
filter_scalar time: [35.733 ms 36.613 ms 37.924 ms]

after:
filter_scalar time: [3.5938 ms 3.6450 ms 3.7035 ms]
change: [-90.048% -89.846% -89.625%] (p = 0.00 < 0.05)

 

I have also added a benchmark to compare the change in performance when comparing two arrays (using query "select f32, f64 from t where f32 >= f64") and it is negligible:

before:
filter_array time: [11.601 ms 11.656 ms 11.718 ms]

after:
filter_array time: [11.854 ms 11.957 ms 12.070 ms]
change: [+1.8032% +3.6391% +5.5671%] (p = 0.00 < 0.05)

I will be submitting a PR for this change soon.

@asfimport
Copy link
Author

Andrew Lamb / @alamb:
@yordan-pavlov this sounds like great news. I can't wait to see it

@asfimport
Copy link
Author

Jorge Leitão / @jorgecarleitao:
Issue resolved by pull request 8660
#8660

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

1 participant