Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[R] Implement bindings for cumsum function #35180

Closed
arnaud-feldmann opened this issue Apr 17, 2023 · 8 comments · Fixed by #35339
Closed

[R] Implement bindings for cumsum function #35180

arnaud-feldmann opened this issue Apr 17, 2023 · 8 comments · Fixed by #35339

Comments

@arnaud-feldmann
Copy link
Contributor

arnaud-feldmann commented Apr 17, 2023

Describe the enhancement requested

There exist cumulative functions in cpp, but those aren't linked to cumsum within R.

Could it be possible to add the link ?

The point is that to compute something that is pretty common in stats, a weighted median, one has to make raw function calls.

Thanks

Component(s)

R

@thisisnic
Copy link
Member

Would you be interested in submitting a PR to add this, @arnaud-feldmann ?

@thisisnic thisisnic changed the title r cumsum [R] Implement bindings for cumsum function Apr 24, 2023
@arnaud-feldmann
Copy link
Contributor Author

@thisisnic Ok I'm doing that

@thisisnic
Copy link
Member

Wonderful, thanks! Let us know if you have any issues, or anything we can help with.

@arnaud-feldmann
Copy link
Contributor Author

Tried all the afternoon, seems like it is above my level. Thought it was just a lacking binding, now I understand there's a bigger problem.

Sorry for the bother @thisisnic . Thanks for your work

@thisisnic
Copy link
Member

It's not a bother at all! Happy to guide if you've got a specific bit you were stuck at, or if you'd rather not, that's totally fine, but mind sharing the bigger problem, so if I (or someone else) has time to pick this up, I can use that as a starting point?

I've not had the chance to look into it at all myself yet, so you're ahead of me on this right now.

Honestly, this stuff can be tricky, and that's coming from someone who's been working on this stuff for 2 years and still gets tripped up all the time!

@arnaud-feldmann
Copy link
Contributor Author

arnaud-feldmann commented Apr 24, 2023

Right now I do that ugly thing to compute weighted medians :

library(arrow)
#> Some features are not enabled in this build of Arrow. Run `arrow_info()` for more information.
#> 
#> Attachement du package : 'arrow'
#> L'objet suivant est masqué depuis 'package:utils':
#> 
#>     timestamp
library(dplyr)
#> 
#> Attachement du package : 'dplyr'
#> Les objets suivants sont masqués depuis 'package:stats':
#> 
#>     filter, lag
#> Les objets suivants sont masqués depuis 'package:base':
#> 
#>     intersect, setdiff, setequal, union
tf <- tempfile()
dir.create(tf)
write_dataset(mtcars, tf, partitioning = "cyl")
ds <- open_dataset(tf)

# weighted median of gross horspower by weight
get_median <- function(arrow, var, wt) {
  var <- ensym(var)
  wt <- ensym(wt)
  
  sel <-
    arrow %>%
    filter(! is.na(!! var) & ! is.na(!! wt)) %>%
    mutate(!! var := as.double(!! var),
           !! wt := as.double(!! wt)) %>%
    arrange(!! var)
  sel_wt <- sel %>% pull(!! wt, as_vector = FALSE)
  sel_val <- sel %>% pull(!! var, as_vector = FALSE)
  sel_cum_poids <- call_function("cumulative_sum_checked",sel_wt)
  sel_sum_wt <- Scalar$create(sel_cum_poids[sel_cum_poids$length()])
  sel_val[sel_cum_poids>  sel_sum_wt/2][1L]$as_vector()
}
get_median(ds, hp, wt)
#> [1] 175

This isn't very pretty. That's why I searched for a direct binding.

I tried to find a place for Array functions, but I realise that there are none (for the moment). Only summarising or scalar functions.

And, even, cheating, using the UDF interface :

register_scalar_function("cumsum",
                         function(context, x) call_function("cumulative_sum_checked", x),
                         in_type = schema(field("x", float64())),
                         out_type = float64(),
                         auto_convert = FALSE)
ds %>%
  mutate(cs = cumsum(wt)) %>% collect()
#>     mpg  disp  hp drat    wt  qsec vs am gear carb cyl     cs
#> 1  21.0 160.0 110 3.90 2.620 16.46  0  1    4    4   6  2.620
#> 2  21.0 160.0 110 3.90 2.875 17.02  0  1    4    4   6  5.495
#> 3  21.4 258.0 110 3.08 3.215 19.44  1  0    3    1   6  8.710
#> 4  18.1 225.0 105 2.76 3.460 20.22  1  0    3    1   6 12.170
#> 5  19.2 167.6 123 3.92 3.440 18.30  1  0    4    4   6 15.610
#> 6  17.8 167.6 123 3.92 3.440 18.90  1  0    4    4   6 19.050
#> 7  19.7 145.0 175 3.62 2.770 15.50  0  1    5    6   6 21.820
#> 8  22.8 108.0  93 3.85 2.320 18.61  1  1    4    1   4  2.320
#> 9  24.4 146.7  62 3.69 3.190 20.00  1  0    4    2   4  5.510
#> 10 22.8 140.8  95 3.92 3.150 22.90  1  0    4    2   4  8.660
#> 11 32.4  78.7  66 4.08 2.200 19.47  1  1    4    1   4 10.860
#> 12 30.4  75.7  52 4.93 1.615 18.52  1  1    4    2   4 12.475
#> 13 33.9  71.1  65 4.22 1.835 19.90  1  1    4    1   4 14.310
#> 14 21.5 120.1  97 3.70 2.465 20.01  1  0    3    1   4 16.775
#> 15 27.3  79.0  66 4.08 1.935 18.90  1  1    4    1   4 18.710
#> 16 26.0 120.3  91 4.43 2.140 16.70  0  1    5    2   4 20.850
#> 17 30.4  95.1 113 3.77 1.513 16.90  1  1    5    2   4 22.363
#> 18 21.4 121.0 109 4.11 2.780 18.60  1  1    4    2   4 25.143
#> 19 18.7 360.0 175 3.15 3.440 17.02  0  0    3    2   8  3.440
#> 20 14.3 360.0 245 3.21 3.570 15.84  0  0    3    4   8  7.010
#> 21 16.4 275.8 180 3.07 4.070 17.40  0  0    3    3   8 11.080
#> 22 17.3 275.8 180 3.07 3.730 17.60  0  0    3    3   8 14.810
#> 23 15.2 275.8 180 3.07 3.780 18.00  0  0    3    3   8 18.590
#> 24 10.4 472.0 205 2.93 5.250 17.98  0  0    3    4   8 23.840
#> 25 10.4 460.0 215 3.00 5.424 17.82  0  0    3    4   8 29.264
#> 26 14.7 440.0 230 3.23 5.345 17.42  0  0    3    4   8 34.609
#> 27 15.5 318.0 150 2.76 3.520 16.87  0  0    3    2   8 38.129
#> 28 15.2 304.0 150 3.15 3.435 17.30  0  0    3    2   8 41.564
#> 29 13.3 350.0 245 3.73 3.840 15.41  0  0    3    4   8 45.404
#> 30 19.2 400.0 175 3.08 3.845 17.05  0  0    3    2   8 49.249
#> 31 15.8 351.0 264 4.22 3.170 14.50  0  1    5    4   8 52.419
#> 32 15.0 301.0 335 3.54 3.570 14.60  0  1    5    8   8 55.989

I realised that things starts again at each chunk. Hence that's more complicated than I thought.

But anyway, thanks for your awesome package, I use it everyday.

@thisisnic
Copy link
Member

Thanks for all the effort there trying to make this work, I can see you've delved right into the deep end of things!

I tried to find a place for Array functions, but I realise that there are none

So this isn't super well-documented, which is most likely why you didn't find it, but there's an R6 class in the package called ArrowDatum. It's the base class for Array, ChunkedArray, and Scalar objects, and the point of it is basically to allow developers to write a function e.g. my_fun.ArrowDatum(), which will run on object of this class.

For example, the S3 method for the unique() function (which has an identically named C++ equivalent function) looks like this:

#' @export
unique.ArrowDatum <- function(x, incomparables = FALSE, ...) {
  call_function("unique", x)
}

There's a PR here which shows someone implementing exp and sqrt both for Tables and for ArrowDatum objects.

https://github.com/apache/arrow/pull/13517/files

I reckon you could probably take a similar approach for the cumsum function. It might end up being simpler than you thought (but it's the surrounding complexity, and things being hidden away which make it, reasonably, look complex).

@arnaud-feldmann
Copy link
Contributor Author

@thisisnic I have put the binding for cumsum. As for the datasets, as far as I understand this isn't possible since we use Scalar Expressions.
Thank you for your help.

thisisnic added a commit that referenced this issue Apr 26, 2023
Fixes #35180 
Can't do the binding to dplyr, as dplyr takes Scalar Expressions and cumsum ( #12460 ) isn't a scalar expression.
* Closes: #35180

Lead-authored-by: arnaud-feldmann <arnaud.feldmann@gmail.com>
Co-authored-by: Nic Crane <thisisnic@gmail.com>
Signed-off-by: Nic Crane <thisisnic@gmail.com>
@thisisnic thisisnic added this to the 13.0.0 milestone Apr 26, 2023
liujiacheng777 pushed a commit to LoongArch-Python/arrow that referenced this issue May 11, 2023
)

Fixes apache#35180 
Can't do the binding to dplyr, as dplyr takes Scalar Expressions and cumsum ( apache#12460 ) isn't a scalar expression.
* Closes: apache#35180

Lead-authored-by: arnaud-feldmann <arnaud.feldmann@gmail.com>
Co-authored-by: Nic Crane <thisisnic@gmail.com>
Signed-off-by: Nic Crane <thisisnic@gmail.com>
ArgusLi pushed a commit to Bit-Quill/arrow that referenced this issue May 15, 2023
)

Fixes apache#35180 
Can't do the binding to dplyr, as dplyr takes Scalar Expressions and cumsum ( apache#12460 ) isn't a scalar expression.
* Closes: apache#35180

Lead-authored-by: arnaud-feldmann <arnaud.feldmann@gmail.com>
Co-authored-by: Nic Crane <thisisnic@gmail.com>
Signed-off-by: Nic Crane <thisisnic@gmail.com>
rtpsw pushed a commit to rtpsw/arrow that referenced this issue May 16, 2023
)

Fixes apache#35180 
Can't do the binding to dplyr, as dplyr takes Scalar Expressions and cumsum ( apache#12460 ) isn't a scalar expression.
* Closes: apache#35180

Lead-authored-by: arnaud-feldmann <arnaud.feldmann@gmail.com>
Co-authored-by: Nic Crane <thisisnic@gmail.com>
Signed-off-by: Nic Crane <thisisnic@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants