[R] Using dplyr::tally with an Arrow FileSystemDataset crashes R #33807

Closed
ablack3 opened this issue Jan 20, 2023 · 9 comments · Fixed by #37777

@ablack3

ablack3 commented Jan 20, 2023

Describe the bug, including details regarding any error messages, version, and platform.

The following code snippet crashes R. I'm using arrow 10.0.1.

library(dplyr)
arrow::write_dataset(cars, here::here("cars.feather"), format = "feather")
a <- arrow::open_dataset(here::here("cars.feather"), format = "feather")
a %>% tally()

Platform information

> sessionInfo()
R version 4.2.2 (2022-10-31)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Monterey 12.6

Matrix products: default
LAPACK: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] arrow_10.0.1   testthat_3.1.6

loaded via a namespace (and not attached):
 [1] assertthat_0.2.1 brio_1.1.3       R6_2.5.1         lifecycle_1.0.3  magrittr_2.0.3   rlang_1.0.6     
 [7] cli_3.5.0        rstudioapi_0.14  vctrs_0.5.1      tools_4.2.2      bit64_4.0.5      glue_1.6.2      
[13] purrr_1.0.0      bit_4.0.5        compiler_4.2.2   tidyselect_1.2.0

Component(s)

R

@ablack3
Author

ablack3 commented Jan 20, 2023

This might be a clue:

 *** caught illegal operation ***
   address 0x13d7349a8, cause 'illegal opcode'
   
   Traceback:
    1: Array__GetScalar(Array$create(x, type = type), 0)
    2: Scalar$create(x)
    3: compute___expr__scalar(Scalar$create(x))
    4: Expression$scalar(1L)
    5: n()
    6: eval_tidy(expr, mask)
    7: doTryCatch(return(expr), name, parentenv, handler)
    8: tryCatchOne(expr, names, parentenv, handlers[[1L]])
    9: tryCatchList(expr, classes, parentenv, handlers)
   10: tryCatch(eval_tidy(expr, mask), error = function(e) {    msg <- conditionMessage(e)    if (getOption("arrow.debug", FALSE))         print(msg)    patterns <- .cache$i18ized_error_pattern    if (is.null(patterns)) {        patterns <- i18ize_error_messages()        .cache$i18ized_error_pattern <- patterns    }    if (grepl(patterns, msg)) {        stop(e)    }    out <- structure(msg, class = "try-error", condition = e)    if (grepl("not supported.*Arrow", msg) || getOption("arrow.debug",         FALSE)) {        class(out) <- c("arrow-try-error", class(out))    }    invisible(out)})
   11: arrow_eval(expr, mask)
   12: arrow_eval_or_stop(as_quosure(expr, ctx$quo_env), ctx$mask)
   13: summarize_eval(names(exprs)[i], exprs[[i]], ctx, length(.data$group_by_vars) >     0)
   14: do_arrow_summarize(.data, !!!exprs, .groups = .groups)
   15: doTryCatch(return(expr), name, parentenv, handler)
   16: tryCatchOne(expr, names, parentenv, handlers[[1L]])
   17: tryCatchList(expr, classes, parentenv, handlers)
   18: tryCatch(expr, error = function(e) {    call <- conditionCall(e)    if (!is.null(call)) {        if (identical(call[[1L]], quote(doTryCatch)))             call <- sys.call(-4L)        dcall <- deparse(call, nlines = 1L)        prefix <- paste("Error in", dcall, ": ")        LONG <- 75L        sm <- strsplit(conditionMessage(e), "\n")[[1L]]        w <- 14L + nchar(dcall, type = "w") + nchar(sm[1L], type = "w")        if (is.na(w))             w <- 14L + nchar(dcall, type = "b") + nchar(sm[1L],                 type = "b")        if (w > LONG)             prefix <- paste0(prefix, "\n  ")    }    else prefix <- "Error : "    msg <- paste0(prefix, conditionMessage(e), "\n")    .Internal(seterrmessage(msg[1L]))    if (!silent && isTRUE(getOption("show.error.messages"))) {        cat(msg, file = outFile)        .Internal(printDeferredWarnings())    }    invisible(structure(msg, class = "try-error", condition = e))})
   19: try(do_arrow_summarize(.data, !!!exprs, .groups = .groups), silent = TRUE)
   20: summarise.ArrowTabular(x, `:=`(!!name, n()))
   21: dplyr::summarize(x, `:=`(!!name, n()))
   22: tally.ArrowTabular(.)
   23: tally(.)

@thisisnic
Member

Hi @ablack3, thanks for reporting this! I haven't been able to reproduce this myself, though I am using Ubuntu 22.04 and not macOS. You could get more verbose output by attaching the C++ debugger before running R via the instructions here: https://arrow.apache.org/docs/dev/r/articles/developers/debugging.html

Can you show me the output of running arrow::arrow_info()? There might be some clues there.

@paleolimbot
Member

We've had some problems with macOS and illegal opcodes (#28343, #14826 most recently). It seems that the way we detect the SIMD level sometimes does not work on Intel macOS.

Out of curiosity, are you on M1 running an x86_64 version of R? (Or are you on an Intel-based Mac?)

@ianmcook
Member

It might be useful to see the output of sysctl machdep.cpu from your macOS terminal.
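
For anyone who would rather collect this from inside R, here is a minimal sketch. The helper names parse_sysctl() and cpu_info() are made up for illustration; the code only shells out to sysctl on macOS and splits each "key: value" line, so treat it as a convenience rather than an official diagnostic.

```r
# Hypothetical helper: turn "key: value" lines, as printed by `sysctl`,
# into a named character vector. Lines without ": " are ignored.
parse_sysctl <- function(lines) {
  lines <- lines[grepl(": ", lines, fixed = TRUE)]
  pos <- regexpr(": ", lines, fixed = TRUE)
  keys <- substr(lines, 1, pos - 1)
  values <- substr(lines, pos + 2, nchar(lines))
  stats::setNames(values, keys)
}

# Hypothetical helper: query the CPU description, on macOS only.
cpu_info <- function() {
  if (!identical(Sys.info()[["sysname"]], "Darwin")) {
    return(character(0))
  }
  parse_sysctl(system2("sysctl", "machdep.cpu", stdout = TRUE))
}

info <- cpu_info()
# Show only the entries naming the CPU and its instruction-set features.
print(info[grepl("brand_string|features", names(info))])
```

On a non-macOS system cpu_info() returns an empty vector, so the last line prints nothing useful there.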

@ablack3
Author

ablack3 commented Feb 16, 2023

Sorry for the delay. Here is the output of arrow::arrow_info():

arrow::arrow_info()
#> Arrow package version: 11.0.0.2
#> 
#> Capabilities:
#>                
#> dataset    TRUE
#> substrait FALSE
#> parquet    TRUE
#> json       TRUE
#> s3         TRUE
#> gcs        TRUE
#> utf8proc   TRUE
#> re2        TRUE
#> snappy     TRUE
#> gzip       TRUE
#> brotli     TRUE
#> zstd       TRUE
#> lz4        TRUE
#> lz4_frame  TRUE
#> lzo       FALSE
#> bz2        TRUE
#> jemalloc   TRUE
#> mimalloc   TRUE
#> 
#> Memory:
#>                   
#> Allocator mimalloc
#> Current    0 bytes
#> Max        0 bytes
#> 
#> Runtime:
#>                           
#> SIMD Level          sse4_2
#> Detected SIMD Level sse4_2
#> 
#> Build:
#>                                     
#> C++ Library Version           11.0.0
#> C++ Compiler              AppleClang
#> C++ Compiler Version 10.0.0.10001145

Created on 2023-02-16 with reprex v2.0.2

Thanks for the debugging instructions @thisisnic.

> Out of curiosity, are you on M1 running an x86_64 version of R? (Or are you on an Intel-based Mac?)

I am on an M1 running an x86_64 version of R. I used to run the ARM version of R, but the ODBC drivers did not work with ARM, so I had to move my R installation to x86_64 via Rosetta.

@paleolimbot
Member

Thank you for this! I am guessing that whatever runtime detection mechanism we're using might not be working under Rosetta.

Do we know if there's any way to force Arrow to pretend that SIMD doesn't exist at runtime?

@westonpace
Member

> Do we know if there's any way to force Arrow to pretend that SIMD doesn't exist at runtime?

You can try setting the environment variable ARROW_USER_SIMD_LEVEL to NONE.
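
A rough sketch of how one might set this from R and check whether it took effect. It assumes the variable needs to be in the environment before arrow is loaded, and that arrow::arrow_info() exposes the two "Runtime:" values in a runtime_info element; both are assumptions worth verifying against your installed version.

```r
# Set the override before arrow is loaded so the C++ library can see it
# when it first inspects the CPU (assumption: setting it later may be too late).
Sys.setenv(ARROW_USER_SIMD_LEVEL = "NONE")

if (requireNamespace("arrow", quietly = TRUE)) {
  info <- arrow::arrow_info()
  # runtime_info is assumed to hold the values printed under "Runtime:"
  # (the SIMD level in use and the detected SIMD level).
  print(info$runtime_info)
} else {
  message("arrow is not installed; only the environment variable was set")
}
```

If the SIMD level in use still matches the detected level, the variable was probably set too late in the session.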

@ablack3
Author

ablack3 commented Mar 9, 2023

This is still crashing R on my machine. I'm using arrow v11.0.0.2.

Sys.setenv(ARROW_USER_SIMD_LEVEL="NONE")

library(dplyr)
arrow::write_dataset(cars, here::here("cars.feather"), format = "feather")
a <- arrow::open_dataset(here::here("cars.feather"), format = "feather")
a %>% tally()

@kou changed the title from "Using dplyr::tally with an Arrow FileSystemDataset crashes R" to "[R] Using dplyr::tally with an Arrow FileSystemDataset crashes R" on Mar 10, 2023
@jonkeane self-assigned this on Sep 18, 2023
@jonkeane
Member

We ran into something like this a few times at @thisisnic and @stephhazlitt's workshop. Some folks using Apple ARM-based machines were running R built for x86 (under Rosetta emulation) and therefore received Arrow package binaries intended for x86, which crash with illegal opcode errors.

R has had native ARM builds for a long time now (and there are native ARM builds of arrow that work well), so if people are using ARM-based Macs, we recommend installing native R and native arrow.

I will also send a PR shortly that adds a detection + warning on package load for arrow if we detect this so that folks know that they should run native R and things will work fine.
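
To illustrate the idea, here is a minimal sketch of such a check. It is not the code from the PR: running_under_rosetta() is a hypothetical helper, and it relies on the macOS sysctl key sysctl.proc_translated, which reports 1 for a process translated by Rosetta.

```r
# Hypothetical helper (not the code from the PR): TRUE only when this R
# process is an x86_64 binary translated by Rosetta on an ARM-based Mac.
running_under_rosetta <- function() {
  if (!identical(Sys.info()[["sysname"]], "Darwin")) {
    return(FALSE)
  }
  # The key can be missing on Macs without Rosetta, so any failure is
  # treated as "not translated".
  out <- tryCatch(
    suppressWarnings(
      system2("sysctl", c("-n", "sysctl.proc_translated"),
              stdout = TRUE, stderr = FALSE)
    ),
    error = function(e) character(0)
  )
  identical(out[1], "1")
}

if (running_under_rosetta()) {
  warning(
    "R is running under Rosetta emulation; arrow may crash. ",
    "Consider installing a native ARM build of R and arrow.",
    call. = FALSE
  )
}
```

In a package this check would typically live in an attach hook; it runs at top level here only to keep the sketch self-contained.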

jonkeane added a commit that referenced this issue Sep 19, 2023
)

Resolves #33807 and #37034

### Rationale for this change

If someone is running R under emulation, arrow segfaults without error. We can detect this when we load so can also warn people that this is not recommended. Though the version of R being run is not directly an arrow issue, arrow fails very quickly in this configuration.

### What changes are included in this PR?

Detect when running under rosetta (on macOS only) and warn when the library is attached

### Are these changes tested?

No, given the paucity of ARM-based mac CI, testing this organically would be difficult. But the logic is straightforward.

### Are there any user-facing changes?

Yes, a warning when someone loads arrow under emulation.
* Closes: #33807

Authored-by: Jonathan Keane <jkeane@gmail.com>
Signed-off-by: Jonathan Keane <jkeane@gmail.com>
@jonkeane added this to the 14.0.0 milestone on Sep 19, 2023
loicalleyne pushed a commit to loicalleyne/arrow that referenced this issue Nov 13, 2023
dgreiss pushed a commit to dgreiss/arrow that referenced this issue Feb 19, 2024