Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[R] Retain schema in new Expressions #28820

Closed
asfimport opened this issue Jun 18, 2021 · 7 comments
Closed

[R] Retain schema in new Expressions #28820

asfimport opened this issue Jun 18, 2021 · 7 comments

Comments

@asfimport
Copy link
Collaborator

asfimport commented Jun 18, 2021

When a new Expression is created, schema should be retained from the expression(s) it was created from. That way, the type() and type_id() methods of the new Expression will work. For example, currently this happens:

> x <- Expression$field_ref("x")
> x$schema <- Schema$create(x = int32())
> 
> y <- Expression$field_ref("y")
> y$schema <- Schema$create(y = int32())
> 
> Expression$create("add_checked", x, y)$type()
Error: !is.null(schema) is not TRUE 

This is what we want to happen:

> Expression$create("add_checked", x, y)$type()
Int32
int32

Reporter: Ian Cook / @ianmcook
Assignee: Ian Cook / @ianmcook

Related issues:

PRs and other links:

Note: This issue was originally created as ARROW-13117. Please see the migration documentation for further details.

@asfimport
Copy link
Collaborator Author

Neal Richardson / @nealrichardson:
I don't think I agree with your expectation. Expression objects should have no type (though $type() shouldn't error) until it is bound to a Schema. So Expression$create("add_checked", x, y)$type() should return NULL IMO.

@asfimport
Copy link
Collaborator Author

Ian Cook / @ianmcook:
In the abstract, I think that makes sense. But in practice, if we don't automatically bind schemas to expressions, then we cannot evaluate dplyr expressions like this:

is.double(x + y) 

If that seems like a silly or contrived example: There are some R functions in which we need to know what the type of the input is in order to know which Arrow kernels to call. For example, to make is.na() to work consistently with base R (ARROW-12055) we will need to call different Arrow compute functions depending on the type of the input passed to it. If we want is.na() to be able to take expressions (not just bare column references) as input, then we need to automatically bind schemas to expressions.

@asfimport
Copy link
Collaborator Author

Neal Richardson / @nealrichardson:
I think we should distinguish two things: (1) is.* functions that operate on the data, as in mutate(); and (2) is.* functions used in dplyr's column predicate across() et al. methods. is.double(x + y) doesn't make much sense to me as case (1) because it evaluates to a constant. It might make sense in the context of a Union type, where each row could be a different type. Either way, if this is something we want to support, it sounds like it should be a C++ compute kernel (a unary scalar one), and it sounds low priority (though maybe I'm missing something). 

In case (2), I would think we could handle this in a more targeted way inside of where() or across() etc. 

Re: ARROW-12055, I see where you're going with it but it feels like we're hacking on ourselves to support that, and we shouldn't have to do that. I'd personally prefer to add is_nan methods for all other types in C++ (always returning false). 

My pushback comes from various past experiences of trying to hack together interfaces that seemingly need to track their state, and trying to get certain APIs to conform to expectations from R. Sometimes that's the right choice, but it's a slippery slope and we should spend some extra time looking for a cleaner solution before going down it.

@asfimport
Copy link
Collaborator Author

Ian Cook / @ianmcook:
The distinction between (1) and (2) is not significant. (1) is just (2) with scalar recycling. The same mechanisms underlie both.

The principle I had in mind with this PR is straightforward: An expression can optionally have a schema bound to it, and once a schema is bound to an expression, it will stay bound to derivative expressions.

Before this PR, as soon as you create an expression inside a dplyr verb, you immediately lose the ability to know what its type is, despite the fact that it is fully knowable. You need to wait until arrow_mask() is run again to know it. In arrow_mask(), this command re-binds the schema to all the new expressions, so their types can be known again:

map(.data$selected_columns, ~(.$schema <- .data$.data$schema))

After this PR, the types of expressions are known at all times. I think it is a clean solution; it's +7, -2 lines of actual package code; everything else is tests, docs, etc.

If the central objection here is that this causes the Expression object in the R package to deviate further from the Expression object in the C++ library—yes I agree; I wish there was a way to avoid that. Unfortunately compute::Expression in the C++ library lets you bind a schema to an expression, but the schema does not remain bound to derivative expressions. We already deviated somewhat by storing the schema in the Expression object in ARROW-12781. This PR takes better advantage of the fact that we're doing that.

I experimented with trying to achieve the changes in this PR through changes in expression.cpp, instead of to the R code, but that did not work because the problem is due to limitations in what is possible with compute::Expression, not limitations in how it is exposed to R.

@asfimport
Copy link
Collaborator Author

Ian Cook / @ianmcook:
ARROW-13167 would eliminate the need for all these gymnastics (but it has been rejected).

@asfimport
Copy link
Collaborator Author

Ian Cook / @ianmcook:
In recognition of the following...

  • Enabling the data type of an Expression to be knowable at all times is important for enabling broader support for expressions in dplyr verbs.

  • The PR here and the the earlier changes in ARROW-12781 enable that, but in a somewhat kludgy.

  • As kludges go, this one is not so bad, and it and would be straightforward to implement this more cleanly in the future.

  • At present, there is no clear way to implement this more cleanly, at least not without doing a major refactor or compromising its functionality.

    ... I created ARROW-13186 for future consideration of ways to implement this more cleanly, and for now I will merge this PR.

@asfimport
Copy link
Collaborator Author

Ian Cook / @ianmcook:
Issue resolved by pull request 10563
#10563

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants