Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[R] Simple arrange %>% head does not respect ordering #29749

Closed
asfimport opened this issue Sep 29, 2021 · 2 comments
Closed

[R] Simple arrange %>% head does not respect ordering #29749

asfimport opened this issue Sep 29, 2021 · 2 comments
Assignees
Labels
Component: R Priority: Blocker Marks a blocker for the release Type: bug
Milestone

Comments

@asfimport
Copy link

This was originally reported by @jonkeane in ARROW-13893 but that issue was covering a different topic so I am opening a new issue for this specific behavior.

> library(arrow)
> library(dplyr)
> 
> tab <- Table$create(mtcars)
> 
> tab %>% 
+   arrange(mpg) %>% 
+   head(4) %>% 
+   collect()
   mpg cyl disp  hp drat    wt  qsec vs am gear carb
1 21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
2 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
3 22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
4 21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
> 
> mtcars %>% 
+   arrange(mpg) %>% 
+   head(4) %>% 
+   collect()
                     mpg cyl disp  hp drat    wt  qsec vs am gear carb
Cadillac Fleetwood  10.4   8  472 205 2.93 5.250 17.98  0  0    3    4
Lincoln Continental 10.4   8  460 215 3.00 5.424 17.82  0  0    3    4
Camaro Z28          13.3   8  350 245 3.73 3.840 15.41  0  0    3    4
Duster 360          14.3   8  360 245 3.21 3.570 15.84  0  0    3    4

Reporter: Weston Pace / @westonpace
Assignee: Neal Richardson / @nealrichardson

Note: This issue was originally created as ARROW-14162. Please see the migration documentation for further details.

@asfimport
Copy link
Author

Weston Pace / @westonpace:
The call to head is triggering an (immediate?) call to the legacy scanner head method. The resulting dataset is then returned. Then the remaining dplyr execution is resolved against the in-memory data. ExecPlan is not used at all. So it is first fetching the first 4 rows and then sorting instead of sorting and then fetching.

If this is truly a blocker for 6.0.0 then it might be an problem. The head can't be applied in R because it would read in all of the data (presumably you could abort the read partway through but I think this would be overly complex).

If we want to do a proper ordered head in C++ then my recommendation would be the batch index scheme proposed in the sequencing doc here but I'm not sure we want to tackle that as part of 6.0.0.

As a short term solution we can modify the sorting sink node to accept a limit argument. That should be a reasonably quick solution and could maybe fit in 6.0.0 but I'm not sure how much time we want to invest in stop-gap measures.

@asfimport
Copy link
Author

Neal Richardson / @nealrichardson:
All sink nodes should probably take a limit argument, right? (Not saying for 6.0, just in general)

@asfimport asfimport added this to the 6.0.0 milestone Jan 11, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Component: R Priority: Blocker Marks a blocker for the release Type: bug
Projects
None yet
Development

No branches or pull requests

2 participants