-
Notifications
You must be signed in to change notification settings - Fork 3.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
GH-18818: [R] Create a field ref to a field in a struct #19706
Conversation
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Happy to review more as this progresses...just a few observations:
- I think
ls(x)
is not a robust way to check for collisions. You may have to useget()
ormget()
withinherits = TRUE
since the symbol might exist in a parent environment? - Can
Exression$field_ref()
be updated to to accept a vector? Then you could avoid the C++compute___expr__nested_field_ref()
and do it all in R.? - There should probably be a way to get all the components of a field ref (maybe
compute___expr__get_field_ref_name()
should returncpp11::strings()
?
For R6 this works because I'm just looking for a method name, and inheritance isn't handled via nested environments it seems (i.e. We've been using this trick for a while for Tables and RecordBatches: https://github.com/apache/arrow/blob/master/r/R/arrow-tabular.R#L153-L160
It could, and that was my first instinct too, but that's actually not helpful for the
Probably. I'm going to focus on getting the essential dplyr stuff working first and we'll see what's required for that. One further complication that is completely punted here is that FieldRefs can be from strings or integers, so a nested path could contain a mix of integers and strings: https://github.com/apache/arrow/blob/master/cpp/src/arrow/type.h#L1651-L1660 So just switching to |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a few bits!
It looks like integers nestings aren't supported yet...that seems fine for the purposes of arrow--dplyr wrapping.
Was devtools::document()
run after the S3 methods were added? (I didn't see a NAMESPACE file update but maybe GitHub is hiding it from me)
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just one expect_error()
without an expected error to fix and this is good to go! Thank you!
auto field_refs = FieldsInExpression(*x); | ||
for (auto f : field_refs) { | ||
out.push_back(*f.name()); | ||
if (f.IsNested()) { | ||
// We keep the top-level field name. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This might not be used in practice (in a mutate
call where you select the field, you also directly specify the resulting column name), but otherwise it might also make sense to keep the innermost field name?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This function is only used to prune columns in the dataset scanner, and IIRC that interface accepts column names, not FieldRefs, so I need the names of the top-level columns. But if I'm mistaken and we can use FieldRefs there now, we can refactor this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can also specify field refs (well, generic expressions), but then you also need to pass the resulting name for the schema. See the second Project signature at
arrow/cpp/src/arrow/dataset/scanner.h
Lines 463 to 484 in 4e439f6
/// \brief Set the subset of columns to materialize. | |
/// | |
/// Columns which are not referenced may not be read from fragments. | |
/// | |
/// \param[in] columns list of columns to project. Order and duplicates will | |
/// be preserved. | |
/// | |
/// \return Failure if any column name does not exists in the dataset's | |
/// Schema. | |
Status Project(std::vector<std::string> columns); | |
/// \brief Set expressions which will be evaluated to produce the materialized | |
/// columns. | |
/// | |
/// Columns which are not referenced may not be read from fragments. | |
/// | |
/// \param[in] exprs expressions to evaluate to produce columns. | |
/// \param[in] names list of names for the resulting columns. | |
/// | |
/// \return Failure if any referenced column does not exists in the dataset's | |
/// Schema. | |
Status Project(std::vector<compute::Expression> exprs, std::vector<std::string> names); |
which gets translated to ScanOptions.projection. It seems that is also what the R bindings actually do inside ExecNode_Scan
(it will convert the materialized_field_names back to FieldRefs). Now, the scanner itself will also just use the top-level name of a nested field ref to do pruning of what it needs to read, so right now preserving the nested field ref is not useful. But ideally in the future we would optimize that for formats that can do that (like parquet, cfr #33167)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the pointers. I've deferred cleaning this up to #33760 since I see a few places where it could be more involved than just deleting code.
cdb0b24
to
f5217fc
Compare
Benchmark runs are scheduled for baseline = 359f28b and contender = 1d9366f. 1d9366f is a master commit associated with this PR. Results will be available as each benchmark for each run completes. |
['Python', 'R'] benchmarks have high level of regressions. |
This PR implements
$.Expression
and[[.Expression
methods, such that if the Expression is a FieldRef, it returns a nested FieldRef. This required revising some assumptions in a few places, particularly that if an Expression is a FieldRef, it has aname
, and that all FieldRefs correspond to a Field in a Schema. In the case where the Expression is not a FieldRef, it will create an Expression call tostruct_field
to extract the field, iff the Expression has a knowabletype
, the type isStructType
, and the field name exists in the struct.Things not done because they weren't needed to get this working:
Expression$field_ref()
take a vector to construct a nested refNext steps for future PRs:
Wrap this in tidyr::unpack() method (but unfortunately, unpack() is not a generic)
[R] Support making FieldRef from integer #33756
[R] Bindings for list_element and list_slice #33757
[R] Push projection expressions into ScanNode #33760
Closes: [R] Create a field ref to a field in a struct #18818