GH-18818: [R] Create a field ref to a field in a struct #19706

Merged
merged 9 commits into from
Jan 18, 2023

@nealrichardson nealrichardson commented Jan 10, 2023

This PR implements $.Expression and [[.Expression methods, such that if the Expression is a FieldRef, it returns a nested FieldRef. This required revising some assumptions in a few places, particularly that if an Expression is a FieldRef, it has a name, and that all FieldRefs correspond to a Field in a Schema. In the case where the Expression is not a FieldRef, it will create an Expression call to struct_field to extract the field, iff the Expression has a knowable type, the type is StructType, and the field name exists in the struct.
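The two cases described above can be sketched with a small, self-contained C++ model. This is only an illustration under stated assumptions: it does not use the Arrow classes, and `ToyType`, `ToyExpr`, `Extracted`, and `ExtractField` are hypothetical names invented here. The actual R methods and C++ bindings in the PR may be organized differently.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Toy stand-ins for illustration only; these are NOT the Arrow classes.
struct ToyType {
  bool known = false;                    // false when the type cannot be determined
  bool is_struct = false;                // true for a struct type
  std::vector<std::string> field_names;  // child field names when is_struct
};

struct ToyExpr {
  bool is_field_ref = false;
  std::vector<std::string> path;  // field ref: names from the top-level column down
  std::string call;               // call: function name, e.g. "struct_field"
  std::string call_field;         // struct_field call: name of the field to extract
  ToyType type;                   // type of the value the expression produces
};

// Result of a hypothetical `expr$name`; `ok` is false when nothing can be extracted.
struct Extracted {
  bool ok = false;
  ToyExpr expr;
};

Extracted ExtractField(const ToyExpr& expr, const std::string& name) {
  Extracted out;
  if (expr.is_field_ref) {
    // Case 1: a field ref becomes a nested field ref by extending its path.
    // In this sketch the path is not validated against any schema.
    out.ok = true;
    out.expr.is_field_ref = true;
    out.expr.path = expr.path;
    out.expr.path.push_back(name);
    return out;
  }
  // Case 2: any other expression needs a knowable struct type that
  // contains the requested field; only then is a struct_field call built.
  const std::vector<std::string>& names = expr.type.field_names;
  if (expr.type.known && expr.type.is_struct &&
      std::find(names.begin(), names.end(), name) != names.end()) {
    out.ok = true;
    out.expr.call = "struct_field";
    out.expr.call_field = name;
  }
  return out;
}
```

In this model, extracting `b` from a reference to `a` yields a reference with the path `a`, `b`, while extracting from a non-reference expression succeeds only when all three conditions hold; otherwise `ok` stays false.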

Things not done because they weren't needed to get this working:

  • Letting Expression$field_ref() take a vector to construct a nested ref
  • A method to return the vector of nested components of a field ref in R

Next steps for future PRs:

@paleolimbot paleolimbot left a comment

Happy to review more as this progresses...just a few observations:

  • I think ls(x) is not a robust way to check for collisions. You may have to use get() or mget() with inherits = TRUE since the symbol might exist in a parent environment?
  • Can Expression$field_ref() be updated to accept a vector? Then you could avoid the C++ compute___expr__nested_field_ref() and do it all in R?
  • There should probably be a way to get all the components of a field ref (maybe compute___expr__get_field_ref_name() should return cpp11::strings()?)

@nealrichardson

Happy to review more as this progresses...just a few observations:

  • I think ls(x) is not a robust way to check for collisions. You may have to use get() or mget() with inherits = TRUE since the symbol might exist in a parent environment?

For R6 this works because I'm only looking for a method name, and it seems inheritance isn't handled via nested environments (e.g. class_name is defined in the base ArrowObject class but shows up in ls() for Expression).

We've been using this trick for a while for Tables and RecordBatches: https://github.com/apache/arrow/blob/master/r/R/arrow-tabular.R#L153-L160

  • Can Expression$field_ref() be updated to accept a vector? Then you could avoid the C++ compute___expr__nested_field_ref() and do it all in R?

It could, and that was my first instinct too, but that's actually not helpful for the $ method, where you have a FieldRef and the name of the thing you want to nest further. You'd still have to call out to C++ to get the vector of field names in the path of the existing FieldRef.

  • There should probably be a way to get all the components of a field ref (maybe compute___expr__get_field_ref_name() should return cpp11::strings()?)

Probably. I'm going to focus on getting the essential dplyr stuff working first and we'll see what's required for that.

One further complication that is completely punted here is that FieldRefs can be built from strings or integers, so a nested path could contain a mix of integers and strings (see https://github.com/apache/arrow/blob/master/cpp/src/arrow/type.h#L1651-L1660). Just switching to std::vector<std::string> everywhere therefore isn't a simple solution.
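A small sketch may help show why a plain vector of strings falls short when a path mixes names and positions. The types below are hypothetical and self-contained (`PathComponent`, `NestedPath`, `ToString`, and `IsAllNames` are invented for this illustration and are not Arrow's FieldRef or FieldPath), and the printed format is an arbitrary choice.

```cpp
#include <string>
#include <utility>
#include <vector>

// Hypothetical path component: either a field name or a positional index.
struct PathComponent {
  bool is_index;
  std::string name;  // meaningful when !is_index
  int index;         // meaningful when is_index

  static PathComponent Name(std::string n) {
    return PathComponent{false, std::move(n), -1};
  }
  static PathComponent Index(int i) {
    return PathComponent{true, std::string(), i};
  }
};

using NestedPath = std::vector<PathComponent>;

// Render a path for display: names are joined with '.', indices appear as [i].
std::string ToString(const NestedPath& path) {
  std::string out;
  for (const PathComponent& c : path) {
    if (c.is_index) {
      out += "[" + std::to_string(c.index) + "]";
    } else {
      if (!out.empty()) out += ".";
      out += c.name;
    }
  }
  return out;
}

// True only if every component is a name, i.e. the path could be stored
// as a std::vector<std::string> without losing information.
bool IsAllNames(const NestedPath& path) {
  for (const PathComponent& c : path) {
    if (c.is_index) return false;
  }
  return true;
}
```

A path such as name `a`, index 1, name `b` cannot round-trip through a vector of strings without some encoding convention for the index, which is the complication being deferred here.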

@nealrichardson nealrichardson marked this pull request as ready for review January 11, 2023 15:04

@paleolimbot paleolimbot left a comment

Just a few bits!

It looks like integer nesting isn't supported yet...that seems fine for the purposes of arrow-dplyr wrapping.

Was devtools::document() run after the S3 methods were added? (I didn't see a NAMESPACE file update, but maybe GitHub is hiding it from me.)

@assignUser assignUser changed the title ARROW-13858: [R] Create a field ref to a field in a struct GH-18818: [R] Create a field ref to a field in a struct Jan 11, 2023
@github-actions
⚠️ GitHub issue #18818 has been automatically assigned in GitHub to PR creator.


@paleolimbot paleolimbot left a comment

Just one expect_error() without an expected error to fix and this is good to go! Thank you!

auto field_refs = FieldsInExpression(*x);
for (auto f : field_refs) {
  out.push_back(*f.name());
  if (f.IsNested()) {
    // We keep the top-level field name.

Member

This might not be used in practice (in a mutate call where you select the field, you also directly specify the resulting column name), but otherwise it might also make sense to keep the innermost field name?

Member Author

This function is only used to prune columns in the dataset scanner, and IIRC that interface accepts column names, not FieldRefs, so I need the names of the top-level columns. But if I'm mistaken and we can use FieldRefs there now, we can refactor this.
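The pruning idea described above, keeping only the top-level column name of each (possibly nested) reference, can be sketched as follows. This is a self-contained toy, not the code from the PR: `NamePath` and `TopLevelColumns` are hypothetical names, nested references are modeled as plain vectors of names, and the deduplication step is an assumption added for tidiness.

```cpp
#include <set>
#include <string>
#include <vector>

// Hypothetical model: a field reference is a path of names, where the
// first element is the top-level column and later elements are struct fields.
using NamePath = std::vector<std::string>;

// Collect the top-level column name of every reference, in first-seen order.
// A scanner that prunes by column name only needs these names, even when
// the expression refers to a field nested inside the column.
std::vector<std::string> TopLevelColumns(const std::vector<NamePath>& refs) {
  std::vector<std::string> out;
  std::set<std::string> seen;  // deduplication is an assumption of this sketch
  for (const NamePath& path : refs) {
    if (path.empty()) continue;  // nothing to read for an empty path
    if (seen.insert(path.front()).second) {
      out.push_back(path.front());
    }
  }
  return out;
}
```

For references to `a.b`, `c`, and `a.d`, this returns `a` and `c`: the whole struct column `a` is read even though only two of its fields are used.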

Member

You can also specify field refs (well, generic expressions), but then you also need to pass the resulting name for the schema. See the second Project signature at

/// \brief Set the subset of columns to materialize.
///
/// Columns which are not referenced may not be read from fragments.
///
/// \param[in] columns list of columns to project. Order and duplicates will
/// be preserved.
///
/// \return Failure if any column name does not exists in the dataset's
/// Schema.
Status Project(std::vector<std::string> columns);
/// \brief Set expressions which will be evaluated to produce the materialized
/// columns.
///
/// Columns which are not referenced may not be read from fragments.
///
/// \param[in] exprs expressions to evaluate to produce columns.
/// \param[in] names list of names for the resulting columns.
///
/// \return Failure if any referenced column does not exists in the dataset's
/// Schema.
Status Project(std::vector<compute::Expression> exprs, std::vector<std::string> names);

which gets translated to ScanOptions.projection. That also seems to be what the R bindings do inside ExecNode_Scan (it converts the materialized_field_names back to FieldRefs). The scanner itself also uses only the top-level name of a nested field ref to prune what it needs to read, so right now preserving the nested field ref is not useful. Ideally, in the future we would optimize that for formats that support it (like Parquet, cf. #33167).

Member Author

Thanks for the pointers. I've deferred cleaning this up to #33760 since I see a few places where it could be more involved than just deleting code.

ursabot commented Jan 20, 2023

Benchmark runs are scheduled for baseline = 359f28b and contender = 1d9366f. 1d9366f is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Failed] test-mac-arm
[Finished ⬇️2.04% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️0.16% ⬆️0.22%] ursa-thinkcentre-m75q
Buildkite builds:
[Finished] 1d9366f1 ec2-t3-xlarge-us-east-2
[Finished] 1d9366f1 test-mac-arm
[Finished] 1d9366f1 ursa-i9-9960x
[Finished] 1d9366f1 ursa-thinkcentre-m75q
[Finished] 359f28ba ec2-t3-xlarge-us-east-2
[Failed] 359f28ba test-mac-arm
[Finished] 359f28ba ursa-i9-9960x
[Finished] 359f28ba ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

ursabot commented Jan 20, 2023

['Python', 'R'] benchmarks have high level of regressions.
ursa-i9-9960x

Successfully merging this pull request may close these issues.

[R] Create a field ref to a field in a struct
4 participants