Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[R] Cast scalars to type of field in Expression building #32726

Closed
asfimport opened this issue Aug 18, 2022 · 1 comment
Closed

[R] Cast scalars to type of field in Expression building #32726

asfimport opened this issue Aug 18, 2022 · 1 comment

Comments

@asfimport
Copy link

asfimport commented Aug 18, 2022

After looking at the ExecPlan output of some queries, it jumped out at me how we translate int_field == 5 in R as cast(int_field, float64) == 5 because 5 is a double in R.

This extra work has a noticeable performance impact. Here's a simple query on the taxi dataset, filtering down to 54 out of 1.5 billion rows and selecting a single column. My idea was to make a query that does not much other than evaluate the filter.

> system.time(ds |> select(passenger_count) |> filter(passenger_count > 10) |> compute())
   user  system elapsed 
  0.391   0.024   0.362 

> system.time(ds |> select(passenger_count) |> filter(passenger_count > Scalar$create(10, type = int8())) |> compute())
   user  system elapsed 
  0.206   0.025   0.179 

You can see the difference in the query plans too:

> ds |> select(passenger_count) |> filter(passenger_count > 10) |> explain()
ExecPlan with 4 nodes:
3:SinkNode{}
  2:ProjectNode{projection=[passenger_count]}
    1:FilterNode{filter=(cast(passenger_count, {to_type=double, allow_int_overflow=false, allow_time_truncate=false, allow_time_overflow=false, allow_decimal_truncate=false, allow_float_truncate=false, allow_invalid_utf8=false}) > 10)}
      0:SourceNode{}

> ds |> select(passenger_count) |> filter(passenger_count > Scalar$create(10, type = int8())) |> explain()
ExecPlan with 4 nodes:
3:SinkNode{}
  2:ProjectNode{projection=[passenger_count]}
    1:FilterNode{filter=(passenger_count > 10)}
      0:SourceNode{}

Ideally Acero would do this more intelligently (cf. ARROW-11402), but we should also be able to do smarter things when assembling the Expression in R.

Reporter: Neal Richardson / @nealrichardson
Assignee: Neal Richardson / @nealrichardson

Related issues:

PRs and other links:

Note: This issue was originally created as ARROW-17462. Please see the migration documentation for further details.

@asfimport
Copy link
Author

Neal Richardson / @nealrichardson:
Issue resolved by pull request 13985
#13985

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants