[R] Implement bindings for cumsum function #35180
Comments
Would you be interested in submitting a PR to add this, @arnaud-feldmann?
@thisisnic OK, I'm working on it.
Wonderful, thanks! Let us know if you have any issues, or anything we can help with.
I tried all afternoon, but it seems to be above my level. I thought it was just a missing binding; now I understand there's a bigger problem. Sorry for the bother, @thisisnic. Thanks for your work.
It's not a bother at all! Happy to guide you if there's a specific bit you were stuck on. If you'd rather not, that's totally fine, but would you mind sharing the bigger problem, so that if I (or someone else) have time to pick this up, we can use it as a starting point? I've not had the chance to look into it at all myself yet, so you're ahead of me on this right now. Honestly, this stuff can be tricky, and that's coming from someone who's been working on it for 2 years and still gets tripped up all the time!
Right now I do this ugly thing to compute weighted medians:
library(arrow)
#> Some features are not enabled in this build of Arrow. Run `arrow_info()` for more information.
#>
#> Attaching package: 'arrow'
#> The following object is masked from 'package:utils':
#>
#>     timestamp
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#>     filter, lag
#> The following objects are masked from 'package:base':
#>
#>     intersect, setdiff, setequal, union
tf <- tempfile()
dir.create(tf)
write_dataset(mtcars, tf, partitioning = "cyl")
ds <- open_dataset(tf)
# weighted median of gross horsepower by weight
get_median <- function(arrow, var, wt) {
  var <- ensym(var)
  wt <- ensym(wt)
  # drop missing values, cast to double, and sort by the value column
  sel <-
    arrow %>%
    filter(!is.na(!!var) & !is.na(!!wt)) %>%
    mutate(!!var := as.double(!!var),
           !!wt := as.double(!!wt)) %>%
    arrange(!!var)
  # pull both columns as Arrow arrays rather than R vectors
  sel_wt <- sel %>% pull(!!wt, as_vector = FALSE)
  sel_val <- sel %>% pull(!!var, as_vector = FALSE)
  # raw compute call: running total of the weights
  sel_cum_poids <- call_function("cumulative_sum_checked", sel_wt)
  # the last running total is the sum of all weights
  sel_sum_wt <- Scalar$create(sel_cum_poids[sel_cum_poids$length()])
  # first value whose cumulative weight exceeds half the total
  sel_val[sel_cum_poids > sel_sum_wt / 2][1L]$as_vector()
}
get_median(ds, hp, wt)
#> [1] 175
This isn't very pretty, which is why I looked for a direct binding. I tried to find a place for Array functions, but I realised that there are none (for the moment), only summarising or scalar functions. And even when cheating by using the UDF interface, I realised that things start again at each chunk. Hence it's more complicated than I thought. But anyway, thanks for your awesome package, I use it every day.
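The chunk problem mentioned above can be illustrated without Arrow at all. What follows is only a minimal base R sketch, not how the package works internally: `cumsum_over_chunks` is a hypothetical helper, and the list of numeric vectors merely stands in for the chunks of a column. It shows why a cumulative sum computed independently per chunk is wrong, and how carrying the running total from one chunk to the next repairs it.

```r
# Hypothetical helper: cumulative sum across a list of chunks,
# carrying the running total from each chunk into the next one.
cumsum_over_chunks <- function(chunks) {
  offset <- 0
  out <- vector("list", length(chunks))
  for (i in seq_along(chunks)) {
    out[[i]] <- offset + cumsum(chunks[[i]])
    offset <- offset + sum(chunks[[i]])
  }
  unlist(out)
}

# Toy data standing in for three chunks of one column
chunks <- list(c(1, 2), c(3, 4), c(5))

# Naive version: the sum restarts at every chunk boundary
naive <- unlist(lapply(chunks, cumsum))

# Carried version: matches a cumulative sum over the whole column
carried <- cumsum_over_chunks(chunks)
```

Here `naive` is 1, 3, 3, 7, 5 while `carried` is 1, 3, 6, 10, 15. The result of each row depends on every earlier row, which is what makes this a poor fit for a function evaluated chunk by chunk.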
Thanks for all the effort there trying to make this work, I can see you've delved right into the deep end of things!
So this isn't super well-documented, which is most likely why you didn't find it, but there's an R6 class in the package called ArrowDatum. It's the base class for Array, ChunkedArray, and Scalar objects, and the point of it is basically to allow developers to write a function e.g. For example, the S3 method for the
There's a PR here which shows this kind of implementation: https://github.com/apache/arrow/pull/13517/files. I reckon you could probably take a similar approach for the cumsum function. It might end up being simpler than you thought (it's the surrounding complexity, and things being hidden away, which reasonably make it look complex).
@thisisnic I have added the binding for cumsum. As for datasets, as far as I understand, this isn't possible since they use Scalar Expressions.
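As a rough illustration of what an eager, array-level call looks like (as opposed to a dataset query), here is a small sketch. It assumes the arrow package is installed and reuses the same raw `call_function("cumulative_sum_checked", ...)` call shown earlier in the thread. The wrapper name `cumsum_datum` is hypothetical and is not the binding that was actually merged.

```r
library(arrow)

# Hypothetical helper (not part of arrow): eager cumulative sum for any
# object inheriting from ArrowDatum, via the raw compute function call.
cumsum_datum <- function(x) {
  stopifnot(inherits(x, "ArrowDatum"))
  call_function("cumulative_sum_checked", x)
}

arr <- Array$create(c(1, 2, 3, 4))
res <- cumsum_datum(arr)

# Convert back to an R vector to inspect the running totals
res_vec <- res$as_vector()
```

This works because the whole array is in memory when the function is called. A dataset query, by contrast, is built from expressions evaluated row by row, which is why the thread concludes that the dplyr side cannot be bound in the same way.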
Fixes #35180
Can't do the binding to dplyr, as dplyr takes Scalar Expressions and cumsum (#12460) isn't a scalar expression.
* Closes: #35180
Lead-authored-by: arnaud-feldmann <arnaud.feldmann@gmail.com>
Co-authored-by: Nic Crane <thisisnic@gmail.com>
Signed-off-by: Nic Crane <thisisnic@gmail.com>
Describe the enhancement requested
There are cumulative functions in C++, but they aren't bound to cumsum within R. Would it be possible to add the binding?
The point is that to compute something that is pretty common in stats, a weighted median, one has to make raw function calls.
Thanks
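For reference, the weighted median in question can be written in a few lines of base R once a cumulative sum is available. This is only a sketch on plain R vectors, using the same rule as the Arrow workaround in the comments (the first sorted value whose cumulative weight exceeds half the total weight). `weighted_median` is a hypothetical name, and other conventions for handling ties exist.

```r
# Hypothetical helper: weighted median of x with weights w,
# defined as the first sorted value whose cumulative weight
# exceeds half of the total weight.
weighted_median <- function(x, w) {
  keep <- !is.na(x) & !is.na(w)   # drop missing values or weights
  x <- x[keep]
  w <- w[keep]
  ord <- order(x)                 # sort values, keep weights aligned
  x <- x[ord]
  w <- w[ord]
  cum_w <- cumsum(w)              # running total of the weights
  x[cum_w > sum(w) / 2][1L]
}

# Toy example: the heavy weight on 30 pulls the median up to it
wm_heavy <- weighted_median(c(30, 10, 20), c(3, 1, 1))

# With equal weights the result is the ordinary middle value
wm_equal <- weighted_median(c(30, 10, 20), c(1, 1, 1))
```

With these inputs `wm_heavy` is 30 and `wm_equal` is 20. The only step that has no direct Arrow counterpart in R at the time of the request is the `cumsum` call, which is why the raw function call was needed.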
Component(s)
R