[R] Retain schema in new Expressions #28820

asfimport · 2021-06-18T14:58:25Z

When a new Expression is created, schema should be retained from the expression(s) it was created from. That way, the type() and type_id() methods of the new Expression will work. For example, currently this happens:

> x <- Expression$field_ref("x")
> x$schema <- Schema$create(x = int32())
> 
> y <- Expression$field_ref("y")
> y$schema <- Schema$create(y = int32())
> 
> Expression$create("add_checked", x, y)$type()
Error: !is.null(schema) is not TRUE

This is what we want to happen:

> Expression$create("add_checked", x, y)$type()
Int32
int32

Reporter: Ian Cook / @ianmcook
Assignee: Ian Cook / @ianmcook

Related issues:

[R] Set empty schema in scalar Expressions (is related to)

PRs and other links:

GitHub Pull Request #10563

_{Note: This issue was originally created as ARROW-13117. Please see the migration documentation for further details.}

The text was updated successfully, but these errors were encountered:

asfimport · 2021-06-21T21:00:26Z

Neal Richardson / @nealrichardson:
I don't think I agree with your expectation. Expression objects should have no type (though $type() shouldn't error) until it is bound to a Schema. So Expression$create("add_checked", x, y)$type() should return NULL IMO.

asfimport · 2021-06-21T22:17:12Z

Ian Cook / @ianmcook:
In the abstract, I think that makes sense. But in practice, if we don't automatically bind schemas to expressions, then we cannot evaluate dplyr expressions like this:

is.double(x + y)

If that seems like a silly or contrived example: There are some R functions in which we need to know what the type of the input is in order to know which Arrow kernels to call. For example, to make is.na() to work consistently with base R (ARROW-12055) we will need to call different Arrow compute functions depending on the type of the input passed to it. If we want is.na() to be able to take expressions (not just bare column references) as input, then we need to automatically bind schemas to expressions.

asfimport · 2021-06-22T00:04:55Z

Neal Richardson / @nealrichardson:
I think we should distinguish two things: (1) is.* functions that operate on the data, as in mutate(); and (2) is.* functions used in dplyr's column predicate across() et al. methods. is.double(x + y) doesn't make much sense to me as case (1) because it evaluates to a constant. It might make sense in the context of a Union type, where each row could be a different type. Either way, if this is something we want to support, it sounds like it should be a C++ compute kernel (a unary scalar one), and it sounds low priority (though maybe I'm missing something).

In case (2), I would think we could handle this in a more targeted way inside of where() or across() etc.

Re: ARROW-12055, I see where you're going with it but it feels like we're hacking on ourselves to support that, and we shouldn't have to do that. I'd personally prefer to add is_nan methods for all other types in C++ (always returning false).

My pushback comes from various past experiences of trying to hack together interfaces that seemingly need to track their state, and trying to get certain APIs to conform to expectations from R. Sometimes that's the right choice, but it's a slippery slope and we should spend some extra time looking for a cleaner solution before going down it.

asfimport · 2021-06-22T02:49:43Z

Ian Cook / @ianmcook:
The distinction between (1) and (2) is not significant. (1) is just (2) with scalar recycling. The same mechanisms underlie both.

The principle I had in mind with this PR is straightforward: An expression can optionally have a schema bound to it, and once a schema is bound to an expression, it will stay bound to derivative expressions.

Before this PR, as soon as you create an expression inside a dplyr verb, you immediately lose the ability to know what its type is, despite the fact that it is fully knowable. You need to wait until arrow_mask() is run again to know it. In arrow_mask(), this command re-binds the schema to all the new expressions, so their types can be known again:

map(.data$selected_columns, ~(.$schema <- .data$.data$schema))

After this PR, the types of expressions are known at all times. I think it is a clean solution; it's +7, -2 lines of actual package code; everything else is tests, docs, etc.

If the central objection here is that this causes the Expression object in the R package to deviate further from the Expression object in the C++ library—yes I agree; I wish there was a way to avoid that. Unfortunately compute::Expression in the C++ library lets you bind a schema to an expression, but the schema does not remain bound to derivative expressions. We already deviated somewhat by storing the schema in the Expression object in ARROW-12781. This PR takes better advantage of the fact that we're doing that.

I experimented with trying to achieve the changes in this PR through changes in expression.cpp, instead of to the R code, but that did not work because the problem is due to limitations in what is possible with compute::Expression, not limitations in how it is exposed to R.

asfimport · 2021-06-24T17:01:56Z

Ian Cook / @ianmcook:
ARROW-13167 would eliminate the need for all these gymnastics (but it has been rejected).

asfimport · 2021-06-25T15:42:33Z

Ian Cook / @ianmcook:
In recognition of the following...

Enabling the data type of an Expression to be knowable at all times is important for enabling broader support for expressions in dplyr verbs.
The PR here and the the earlier changes in ARROW-12781 enable that, but in a somewhat kludgy.
As kludges go, this one is not so bad, and it and would be straightforward to implement this more cleanly in the future.
At present, there is no clear way to implement this more cleanly, at least not without doing a major refactor or compromising its functionality.

... I created ARROW-13186 for future consideration of ways to implement this more cleanly, and for now I will merge this PR.

asfimport · 2021-06-25T20:57:21Z

Ian Cook / @ianmcook:
Issue resolved by pull request 10563
#10563

asfimport closed this as completed Jun 25, 2021

asfimport assigned ianmcook Jan 10, 2023

asfimport added this to the 5.0.0 milestone Jan 11, 2023

asfimport mentioned this issue Jan 11, 2023

[R] Set empty schema in scalar Expressions #28821

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[R] Retain schema in new Expressions #28820

[R] Retain schema in new Expressions #28820

asfimport commented Jun 18, 2021 •

edited

Loading

asfimport commented Jun 21, 2021

asfimport commented Jun 21, 2021

asfimport commented Jun 22, 2021

asfimport commented Jun 22, 2021

asfimport commented Jun 24, 2021

asfimport commented Jun 25, 2021

asfimport commented Jun 25, 2021

[R] Retain schema in new Expressions #28820

[R] Retain schema in new Expressions #28820

Comments

asfimport commented Jun 18, 2021 • edited Loading

Related issues:

PRs and other links:

asfimport commented Jun 21, 2021

asfimport commented Jun 21, 2021

asfimport commented Jun 22, 2021

asfimport commented Jun 22, 2021

asfimport commented Jun 24, 2021

asfimport commented Jun 25, 2021

asfimport commented Jun 25, 2021

asfimport commented Jun 18, 2021 •

edited

Loading