Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW-17439: [R] Change behavior of pull to compute instead of collect #14330

Merged
merged 4 commits into from
Oct 12, 2022

Conversation

amoeba
Copy link
Member

@amoeba amoeba commented Oct 5, 2022

I could use some help with a couple of things:

  • devtools::check warns about pull.ArrowTabluar being undocumented. I @exported it to stay consistent with other ArrowTabular generics defined in arrow-tabular.R and don't understand why checking doesn't warn on all of these. Does this one just not need exporting?

    Relevant devtools::check() output
    ❯ checking for missing documentation entries ... WARNING
      Undocumented code objects:
        ‘pull.ArrowTabular’
      All user-level objects in a package should have documentation entries.
      See chapter ‘Writing R documentation files’ in the ‘Writing R
      Extensions’ manual.
    
  • I found an inconsistency with dplyr::pull and Tables: Pulling an ungrouped Table produces a ChunkedArray whereas pulling a grouped Table produces a Table. This makes a subsequent call to as.vector produce an error of Error in as.vector(x, mode) : cannot coerce type 'environment' to vector of type 'any'

    Example of the difference
    > sw_table <- arrow_table(starwars)
    
    > sw_table |>
    +   filter(height > 100) |>
    +   group_by(homeworld) |>
    +   pull(name) |> 
    +   class()
    [1] "Table"        "ArrowTabular" "ArrowObject"  "R6"          
    
    > sw_table |>
    +   filter(height > 100) |>
    +   pull(name) |> 
    +   class()
    [1] "ChunkedArray" "ArrowDatum"   "ArrowObject"  "R6"

@github-actions
Copy link

github-actions bot commented Oct 5, 2022

@github-actions
Copy link

github-actions bot commented Oct 5, 2022

⚠️ Ticket has not been started in JIRA, please click 'Start Progress'.

@eitsupi
Copy link
Contributor

eitsupi commented Oct 6, 2022

2. Pulling an ungrouped Table produces a ChunkedArray whereas pulling a grouped Table produces a Table. This makes a subsequent call to as.vector produce an error of Error in as.vector(x, mode) : cannot coerce type 'environment' to vector of type 'any'

This is due to ARROW-17738.
You can use as_arrow_table instead of compute here I think. #14160 (comment)

@wjones127
Copy link
Member

On the CHECK issue, IIRC we don't @export dplyr S3 methods. We do call s3_register on them here:

for (cl in c("Dataset", "ArrowTabular", "RecordBatchReader", "arrow_dplyr_query")) {
for (m in names(supported_dplyr_methods)) {
s3_register(paste0("dplyr::", m), cl)
}
}

@amoeba
Copy link
Member Author

amoeba commented Oct 8, 2022

Thanks for the link @eitsupi that change fixes the issue I was having.

Re: @wjones127's comment:

On the CHECK issue, IIRC we don't @export dplyr S3 methods. We do call s3_register on them here:

I removed the export directive but I would eventually like to understand why the other exported functions in arrow-tabular.R (and probably elsewhere) don't need specific documentation (but I'll do that elsewhere).

Both the issues I raised are resolved and this can be reviewed.

Copy link
Member

@nealrichardson nealrichardson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One suggestion but otherwise LGTM, thanks!

@@ -259,3 +259,7 @@ na.omit.ArrowTabular <- function(object, ...) {

#' @export
na.exclude.ArrowTabular <- na.omit.ArrowTabular

pull.ArrowTabular <- function(x, var = -1) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd move this to dplyr-collect.R to keep it with the other pull() method definitions

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good. Done in 9c8691b.

@eitsupi
Copy link
Contributor

eitsupi commented Oct 10, 2022

As I mentioned in JIRA, do you implement an argument like as_data_frame?
For consistency with other functions, wouldn't it be better to return an R vector by default and an Arrow object optionally?

@nealrichardson
Copy link
Member

As I mentioned in JIRA, do you implement an argument like as_data_frame? For consistency with other functions, wouldn't it be better to return an R vector by default and an Arrow object optionally?

Personally I don't think so, and as I previously commented on the jira, I don't think the consistency argument is quite so simple because arrow is different from querying remote databases with dbplyr.

@eitsupi
Copy link
Contributor

eitsupi commented Oct 11, 2022

Since the dplyr documentation says that pull returns a R vector, I think the documentation should at least explain that pull to arrow objects does not return a vector and using with as.vector to become a vector.
Users likely know very little about Chuncked Arrays.

@amoeba
Copy link
Member Author

amoeba commented Oct 11, 2022

Thanks the look @nealrichardson. This should be good to merge now.

@nealrichardson
Copy link
Member

Since the dplyr documentation says that pull returns a R vector, I think the documentation should at least explain that pull to arrow objects does not return a vector and using with as.vector to become a vector. Users likely know very little about Chuncked Arrays.

@eitsupi Good call, I'll add a note about this in #14387

@nealrichardson nealrichardson merged commit 20626f8 into apache:master Oct 12, 2022
@ursabot
Copy link

ursabot commented Oct 13, 2022

Benchmark runs are scheduled for baseline = 093a4fe and contender = 20626f8. 20626f8 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Failed ⬇️1.11% ⬆️0.0%] test-mac-arm
[Failed ⬇️2.74% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️0.14% ⬆️0.0%] ursa-thinkcentre-m75q
Buildkite builds:
[Finished] 20626f83 ec2-t3-xlarge-us-east-2
[Failed] 20626f83 test-mac-arm
[Failed] 20626f83 ursa-i9-9960x
[Finished] 20626f83 ursa-thinkcentre-m75q
[Finished] 093a4fe3 ec2-t3-xlarge-us-east-2
[Failed] 093a4fe3 test-mac-arm
[Failed] 093a4fe3 ursa-i9-9960x
[Finished] 093a4fe3 ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

@ursabot
Copy link

ursabot commented Oct 13, 2022

['Python', 'R'] benchmarks have high level of regressions.
ursa-i9-9960x

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants