[R] Segmentation fault when using write_parquet() #34211

Closed
PMassicotte opened this issue Feb 15, 2023 · 17 comments · Fixed by #34489
Labels: Component: R, Critical Fix (Bugfixes for security vulnerabilities, crashes, or invalid data.), Type: bug

@PMassicotte

PMassicotte commented Feb 15, 2023

Describe the bug, including details regarding any error messages, version, and platform.

I am randomly getting a segfault when using write_parquet() with the latest release (the same code works well with v10.0.1).

 *** caught segfault ***
address 0x18, cause 'memory not mapped'

Traceback:
 1: Table__from_dots(dots, schema, option_use_threads())
 2: Table$create(x, schema = schema)
 3: as_arrow_table.data.frame(x)
 4: as_arrow_table(x)
 5: doTryCatch(return(expr), name, parentenv, handler)
 6: tryCatchOne(expr, names, parentenv, handlers[[1L]])
 7: tryCatchList(expr, classes, parentenv, handlers)
 8: tryCatch(as_arrow_table(x), arrow_no_method_as_arrow_table = function(e) {    abort("Object must be coercible to an Arrow Table using `as_arrow_table()`",         parent = e, call = caller_env(2))})
 9: as_writable_table(x)
10: write_parquet(bioargo_dark_corrected, here("data", "raw", "bioargo",     "bioargo_correction_c.parquet"))
11: eval(ei, envir)
12: eval(ei, envir)
13: withVisible(eval(ei, envir))
14: source(here("R", "001c_bioargo_chla_dark_correction.R"))

Possible actions:
1: abort (with core dump, if enabled)
2: normal R exit
3: exit R without saving workspace
4: exit R saving workspace

Following this guide (https://arrow.apache.org/docs/7.0/r/articles/developers/debugging.html), here is the exact line where the code crashes.

Thread 1 "R" received signal SIGSEGV, Segmentation fault.
0x00007fffdccb0956 in std::__shared_ptr<arrow::DataType, (__gnu_cxx::_Lock_policy)2>::__shared_ptr (this=0x7ffffffc4d50) at /usr/include/c++/12/bits/shared_ptr_base.h:1522
1522          __shared_ptr(const __shared_ptr&) noexcept = default;
$> sessionInfo()
R version 4.2.2 (2022-10-31)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 22.10

Matrix products: default
BLAS/LAPACK: /opt/intel/compilers_and_libraries_2020.4.304/linux/mkl/lib/intel64_lin/libmkl_gf_lp64.so

locale:
 [1] LC_CTYPE=en_CA.UTF-8       LC_NUMERIC=C               LC_TIME=en_CA.UTF-8       
 [4] LC_COLLATE=en_CA.UTF-8     LC_MONETARY=en_CA.UTF-8    LC_MESSAGES=en_CA.UTF-8   
 [7] LC_PAPER=en_CA.UTF-8       LC_NAME=C                  LC_ADDRESS=C              
[10] LC_TELEPHONE=C             LC_MEASUREMENT=en_CA.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] gsw_1.1-1              fishmethods_1.12-0     santoku_0.9.0          arrow_11.0.0.100000089
 [5] yardstick_1.1.0        workflowsets_1.0.0     workflows_1.1.2        tune_1.0.1            
 [9] rsample_1.1.1          recipes_1.0.4          parsnip_1.0.3          modeldata_1.1.0       
[13] infer_1.0.4            dials_1.1.0            scales_1.2.1           broom_1.0.3           
[17] tidymodels_1.0.0       data.table_1.14.6      furrr_0.3.1            future_1.31.0         
[21] pins_1.1.0             tidyterra_0.3.1        terra_1.7-3            sf_1.0-9              
[25] patchwork_1.1.2        tidync_0.3.0           here_1.0.1             glue_1.6.2            
[29] ggpmthemes_0.0.2       lubridate_1.9.2        forcats_1.0.0          stringr_1.5.0         
[33] dplyr_1.1.0            purrr_1.0.1            readr_2.1.4            tidyr_1.3.0           
[37] tibble_3.1.8           ggplot2_3.4.1          tidyverse_1.3.2.9000  

loaded via a namespace (and not attached):
 [1] minqa_1.2.5         colorspace_2.1-0    ellipsis_0.3.2      class_7.3-21       
 [5] rprojroot_2.0.3     fs_1.6.1            rstudioapi_0.14     proxy_0.4-27       
 [9] farver_2.1.1        listenv_0.9.0       bit64_4.0.5         prodlim_2019.11.13 
[13] fansi_1.0.4         codetools_0.2-19    splines_4.2.2       ncdf4_1.21         
[17] extrafont_0.19      jsonlite_1.8.4      nloptr_2.0.3        Rttf2pt1_1.3.12    
[21] compiler_4.2.2      backports_1.4.1     assertthat_0.2.1    Matrix_1.5-3       
[25] cli_3.6.0           tools_4.2.2         gtable_0.3.1        rappdirs_0.3.3     
[29] Rcpp_1.0.10         RNetCDF_2.6-2       DiceDesign_1.9      vctrs_0.5.2        
[33] nlme_3.1-162        extrafontdb_1.0     iterators_1.0.14    timeDate_4022.108  
[37] gower_1.0.1         globals_0.16.2      lme4_1.1-31         timechange_0.2.0   
[41] lifecycle_1.0.3     ncmeta_0.3.5        MASS_7.3-58.2       ipred_0.9-13       
[45] hms_1.1.2           parallel_4.2.2      TMB_1.9.2           rpart_4.1.19       
[49] stringi_1.7.12      foreach_1.5.2       e1071_1.7-13        lhs_1.1.6          
[53] boot_1.3-28.1       hardhat_1.2.0       lava_1.7.1          rlang_1.0.6        
[57] pkgconfig_2.0.3     lattice_0.20-45     labeling_0.4.2      bit_4.0.5          
[61] tidyselect_1.2.0    parallelly_1.34.0   magrittr_2.0.3      R6_2.5.1           
[65] generics_0.1.3      bootstrap_2019.6    DBI_1.1.3           pillar_1.8.1       
[69] withr_2.5.0         units_0.8-1         survival_3.5-3      nnet_7.3-18        
[73] future.apply_1.10.0 crayon_1.5.2        KernSmooth_2.23-20  utf8_1.2.3         
[77] tzdb_0.3.0          grid_4.2.2          digest_0.6.31       classInt_0.4-8     
[81] numDeriv_2016.8-1.1 GPfit_1.0-8         munsell_0.5.0     

Component(s)

R

@thisisnic
Member

Thanks for reporting this @PMassicotte. A couple of things to suggest:

  1. Does uninstalling and reinstalling the package help at all?
  2. Would you mind running arrow::arrow_info() and sharing the output?
  3. Is it possible to narrow things down to a small reproducible example? It'll make it easier to work out exactly what's going on here.

@PMassicotte
Author

Thank you @thisisnic for replying.

  1. I just tried, but no luck.

r$> arrow::arrow_info()
Arrow package version: 11.0.0.2

Capabilities:
               
dataset    TRUE
substrait FALSE
parquet    TRUE
json       TRUE
s3         TRUE
gcs        TRUE
utf8proc   TRUE
re2        TRUE
snappy     TRUE
gzip       TRUE
brotli     TRUE
zstd       TRUE
lz4        TRUE
lz4_frame  TRUE
lzo       FALSE
bz2        TRUE
jemalloc   TRUE
mimalloc   TRUE

Memory:
                   
Allocator  jemalloc
Current     0 bytes
Max       112.63 Mb

Runtime:
                        
SIMD Level          avx2
Detected SIMD Level avx2

Build:
                           
C++ Library Version  11.0.0
C++ Compiler            GNU
C++ Compiler Version  7.5.0
  3. I will try to create a subset of my data that produces the same error.

Thank you very much.

@jiggunjer

I encountered a segfault when writing tables using Python's ParquetWriter; in particular, compression=gzip would segfault every time. I noticed that making the code non-threaded fixed the issue, so perhaps the compression types are using too many resources?

@egillax
Contributor

egillax commented Mar 2, 2023

Just wanted to chime in here since I'm experiencing a very similar error.

It happens when I'm writing a data frame to an arrow/feather file. As for the OP, it works in arrow 10 but not 11, and it originates from the same line.

Here is the top of the stacktrace running R with debugger attached:

* thread #1, name = 'R', stop reason = signal SIGSEGV: invalid address (fault address: 0x18)
  * frame #0: 0x00007fff89fb0516 arrow.so`arrow::r::InferArrowType(SEXPREC*) at shared_ptr_base.h:1522:7
    frame #1: 0x00007fff89fab73e arrow.so`arrow::r::InferSchemaFromDots(SEXPREC*, SEXPREC*, int, std::shared_ptr<arrow::Schema>&)::'lambda'(int, SEXPREC*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>)::operator()(int, SEXPREC*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>) const at table.cpp:179:62
    frame #2: 0x00007fff89fac6c1 arrow.so`arrow::r::InferSchemaFromDots(SEXPREC*, SEXPREC*, int, std::shared_ptr<arrow::Schema>&) at arrow_types.h:211:15
    frame #3: 0x00007fff89f413aa arrow.so`Table__from_dots(SEXPREC*, SEXPREC*, bool) at r_to_arrow.cpp:1461:44
    frame #4: 0x00007fff89e9a690 arrow.so`_arrow_Table__from_dots at arrowExports.cpp:4321:40

It happens during unit testing in my package, so I can reproduce it as often as I want locally; it also happens when running GitHub Actions on Ubuntu, Windows, and macOS. Unfortunately, I have not been able to create a minimal reproducible example. Calling the offending function on its own with the same inputs does not trigger the error, so it seems to depend on something that happens earlier in my unit tests.

I've tried turning off threading using options(arrow.use_threads = FALSE), but that doesn't solve the issue.
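
For anyone who wants to repeat that check without leaving the option changed for the rest of the session, here is a minimal base-R sketch. `with_arrow_threads_off()` is a hypothetical helper name and not part of arrow; only the `arrow.use_threads` option name comes from the comment above.

```r
# Hypothetical helper: evaluate `code` with arrow.use_threads = FALSE,
# then restore whatever value the option had before.
with_arrow_threads_off <- function(code) {
  old <- options(arrow.use_threads = FALSE)
  on.exit(options(old), add = TRUE)
  force(code)  # the lazily passed expression is evaluated here, with the option set
}

# Usage: the option is FALSE only while the expression is evaluated.
inside <- with_arrow_threads_off(getOption("arrow.use_threads"))
print(inside)
```

In a real test, the expression passed in would be the write call that crashes; as noted above, this setting did not prevent the segfault here.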

Attached is my sessionInfo and arrowInfo. If you have any ideas on how to debug further please let me know.

SessionInfo
R version 4.2.2 (2022-10-31)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 22.10

Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.10.1
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.1

locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=nl_NL.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=nl_NL.UTF-8 LC_MESSAGES=en_US.UTF-8 LC_PAPER=nl_NL.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=nl_NL.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] arrow_11.0.0.2

loaded via a namespace (and not attached):
[1] tidyselect_1.2.0 bit_4.0.4 compiler_4.2.2 magrittr_2.0.3 assertthat_0.2.1 R6_2.5.1
[7] cli_3.6.0 tools_4.2.2 glue_1.6.2 rstudioapi_0.14 bit64_4.0.5 vctrs_0.5.1
[13] lifecycle_1.0.3 rlang_1.0.6 purrr_1.0.1

arrowInfo
Arrow package version: 11.0.0.2

Capabilities:

dataset TRUE
substrait FALSE
parquet TRUE
json TRUE
s3 TRUE
gcs TRUE
utf8proc TRUE
re2 TRUE
snappy TRUE
gzip TRUE
brotli TRUE
zstd TRUE
lz4 TRUE
lz4_frame TRUE
lzo FALSE
bz2 TRUE
jemalloc TRUE
mimalloc TRUE

Memory:

Allocator jemalloc
Current 0 bytes
Max 0 bytes

Runtime:

SIMD Level avx2
Detected SIMD Level avx2

Build:

C++ Library Version 11.0.0
C++ Compiler GNU
C++ Compiler Version 11.3.0

@thisisnic
Member

@egillax Thanks for reporting this.

@PMassicotte Would you mind trying again with the debugger attached, but pasting the complete output of when it crashes? Sometimes the top line isn't quite the right one to give us enough clues, and I want to see if your error is the same as the one reported by @egillax .

@paleolimbot Would you mind taking a look at this? If this is a common error between both of the folks above, the line that seems to be where the crash happens was modified in #14277, so you'll know more than me about that bit.

@PMassicotte
Author

@thisisnic Sure. How can I do that? Is there documentation showing how to attach the output of the debugger?

@paleolimbot
Member

@egillax Is the package you're testing available publicly? It would help me a lot to fix this if I can reproduce locally.

@PMassicotte I don't recall if we have instructions but it looks like you're on Ubuntu so you could probably do R -d gdb and then type run, which should drop you into an R prompt. If your session crashes, you should be able to type bt (for "backtrace"). If you don't get there, don't worry about it! The most helpful thing for fixing this will be the ability for me or Nic to reproduce this locally...even a hint (like...what are the column types of the table you're trying to write?)

@paleolimbot
Member

Based on @jiggunjer's observation, you all could also try setting:

arrow:::SetEnableSignalStopSource(FALSE)

(We added some cancellation features in 11.0.0 and that is a way to turn them off)

@PMassicotte
Author

@paleolimbot

Here is what I am getting:

> source("R/001c_bioargo_chla_dark_correction.R")
[New Thread 0x7ffeca9ff640 (LWP 2054027)]
[New Thread 0x7ffeca1fe640 (LWP 2054028)]
[New Thread 0x7ffec99fd640 (LWP 2054029)]
[New Thread 0x7ffec91fc640 (LWP 2054030)]
[New Thread 0x7ffec89fb640 (LWP 2054031)]
[New Thread 0x7ffeb3fff640 (LWP 2054032)]
[New Thread 0x7ffeab7fe640 (LWP 2054033)]
[New Thread 0x7ffeb37fe640 (LWP 2054034)]
[New Thread 0x7ffeb2ffd640 (LWP 2054035)]
[New Thread 0x7ffeb27fc640 (LWP 2054036)]
[New Thread 0x7ffeb1ffb640 (LWP 2054037)]
[New Thread 0x7ffeb17fa640 (LWP 2054038)]
[New Thread 0x7ffeb0ff9640 (LWP 2054039)]
[New Thread 0x7ffeabfff640 (LWP 2054040)]
[New Thread 0x7ffeaaffd640 (LWP 2054041)]
[New Thread 0x7ffeaa7fc640 (LWP 2054042)]
[New Thread 0x7ffea9ffb640 (LWP 2054043)]
[New Thread 0x7ffea8fff640 (LWP 2054044)]
[New Thread 0x7ffe73fff640 (LWP 2054045)]
[New Thread 0x7ffe7bfff640 (LWP 2054046)]
[New Thread 0x7ffe7b7fe640 (LWP 2054047)]
[New Thread 0x7ffe7affd640 (LWP 2054048)]
[New Thread 0x7ffe7a7fc640 (LWP 2054049)]
[New Thread 0x7ffe79ffb640 (LWP 2054050)]
[New Thread 0x7ffe797fa640 (LWP 2054051)]
[New Thread 0x7ffe78ff9640 (LWP 2054052)]
[New Thread 0x7ffe737fe640 (LWP 2054053)]
[New Thread 0x7ffe72ffd640 (LWP 2054054)]
[New Thread 0x7ffe727fc640 (LWP 2054055)]
[New Thread 0x7ffe71ffb640 (LWP 2054056)]
[New Thread 0x7ffe717fa640 (LWP 2054057)]
[New Thread 0x7ffe70ff9640 (LWP 2054058)]
[New Thread 0x7ffe3bfff640 (LWP 2054059)]
[New Thread 0x7ffe3b7fe640 (LWP 2054060)]
[New Thread 0x7ffe3aafd640 (LWP 2054061)]
Joining with `by = join_by(takuse, date_time, n_prof)`
Joining with `by = join_by(takuse, date_time, n_prof)`
[New Thread 0x7ffecb66f640 (LWP 2054218)]
[Thread 0x7ffecb66f640 (LWP 2054218) exited]
[New Thread 0x7ffecb66f640 (LWP 2054219)]
[New Thread 0x7ffdfb1ff640 (LWP 2054220)]
[Thread 0x7ffecb66f640 (LWP 2054219) exited]
[New Thread 0x7ffecb66f640 (LWP 2054221)]
[New Thread 0x7ffdfa9fe640 (LWP 2054222)]
[Thread 0x7ffdfb1ff640 (LWP 2054220) exited]
[Thread 0x7ffecb66f640 (LWP 2054221) exited]
[Thread 0x7ffdfa9fe640 (LWP 2054222) exited]
[New Thread 0x7ffdfa9fe640 (LWP 2054223)]
[New Thread 0x7ffecb66f640 (LWP 2054224)]
[Thread 0x7ffdfa9fe640 (LWP 2054223) exited]
[New Thread 0x7ffdfa9fe640 (LWP 2054225)]
[New Thread 0x7ffdfb1ff640 (LWP 2054226)]
[Thread 0x7ffecb66f640 (LWP 2054224) exited]
[Thread 0x7ffdfa9fe640 (LWP 2054225) exited]
[New Thread 0x7ffdfa9fe640 (LWP 2054249)]
[New Thread 0x7ffecb66f640 (LWP 2054250)]
[Thread 0x7ffdfb1ff640 (LWP 2054226) exited]
[Thread 0x7ffdfa9fe640 (LWP 2054249) exited]
[New Thread 0x7ffdfa9fe640 (LWP 2054251)]
[New Thread 0x7ffdfb1ff640 (LWP 2054252)]
[Thread 0x7ffecb66f640 (LWP 2054250) exited]
[Thread 0x7ffdfa9fe640 (LWP 2054251) exited]
[Thread 0x7ffdfb1ff640 (LWP 2054252) exited]

Thread 1 "R" received signal SIGSEGV, Segmentation fault.
0x00007ffed2b6cffd in arrow::r::InferArrowType(SEXPREC*) () from /home/filoche/R/x86_64-pc-linux-gnu-library/4.2/arrow/libs/arrow.so
(gdb) 

@paleolimbot
Member

Biogeochemical Argo!!!! (Cool to see it here...that's what I did in my previous job!)

Since all these seem related to InferArrowType(), do you mind attempting to print out the column names and types of the data frame you're passing to arrow::write_parquet()? You could do that maybe with str(the_data_frame_right_before_write_parquet[integer(0), ])?
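
As a rough illustration of that suggestion, the sketch below prints the zero-row structure of a data frame and also tabulates each column's class and extra attributes. `describe_columns()` is a hypothetical helper and the toy data frame is made up for the example; neither comes from the issue.

```r
# Hypothetical helper: one row per column with its class and any
# non-class attributes (for example "tzone") attached to the column.
describe_columns <- function(df) {
  data.frame(
    column = names(df),
    class = vapply(df, function(col) paste(class(col), collapse = "/"), character(1)),
    attrs = vapply(
      df,
      function(col) paste(setdiff(names(attributes(col)), "class"), collapse = ","),
      character(1)
    ),
    row.names = NULL,
    stringsAsFactors = FALSE
  )
}

# Toy data standing in for the real table (not from the issue).
example <- data.frame(
  id = 1:2,
  when = as.POSIXct(c(0, 1), origin = "1970-01-01", tz = "UTC")
)

str(example[integer(0), ])          # the zero-row trick suggested above
col_summary <- describe_columns(example)
print(col_summary)
```

The zero-row slice keeps column names, types, and attributes while dropping the data, which makes the output safe to paste into an issue.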

@PMassicotte
Author

Ah, cool to meet a fellow bioargo colleague :)

glimpse(bioargo_dark_corrected)

Rows: 326,727
Columns: 111
$ filename                          <chr> "4902602_Sprof.nc", "4…
$ floatname                         <chr> "4902602", "4902602", …
$ takuse                            <chr> "takuse002b", "takuse0…
$ date_time                         <dttm> 2021-10-26 14:23:06, …
$ juld                              <dbl> 26231.6, 26231.6, 2623…
$ juld_qc                           <chr> "1", "1", "1", "1", "1…
$ juld_location                     <dbl> 26231.62, 26231.62, 26…
$ pres                              <dbl> 0.00, 0.10, 0.13, 0.28…
$ pres_qc                           <chr> "1", "1", "1", "1", "1…
$ pres_adjusted                     <dbl> NA, NA, NA, NA, -0.050…
$ pres_adjusted_qc                  <chr> " ", " ", " ", " ", "1…
$ pres_adjusted_error               <dbl> NA, NA, NA, NA, NA, NA…
$ temp                              <dbl> -0.5600167, -0.5598500…
$ temp_qc                           <chr> "3", "3", "3", "3", "3…
$ temp_d_pres                       <dbl> -0.05, 0.03, 0.00, 0.0…
$ temp_adjusted                     <dbl> -0.5680000, -0.5680000…
$ temp_adjusted_qc                  <chr> "8", "1", "8", "8", "1…
$ temp_adjusted_error               <dbl> NA, NA, NA, NA, NA, NA…
$ psal                              <dbl> 31.58545, 31.58433, 31…
$ psal_qc                           <chr> "3", "3", "3", "3", "3…
$ psal_d_pres                       <dbl> -0.05, 0.03, 0.00, 0.0…
$ psal_adjusted                     <dbl> 31.58900, 31.58900, 31…
$ psal_adjusted_qc                  <chr> "8", "1", "8", "8", "1…
$ psal_adjusted_error               <dbl> NA, NA, NA, NA, NA, NA…
$ doxy                              <dbl> NA, NA, NA, NA, NA, NA…
$ doxy_qc                           <chr> " ", " ", " ", " ", "…
$ doxy_d_pres                       <dbl> NA, NA, NA, NA, NA, NA…
$ doxy_adjusted                     <dbl> NA, NA, NA, NA, NA, NA…
$ doxy_adjusted_qc                  <chr> " ", " ", " ", " ", "…
$ doxy_adjusted_error               <dbl> NA, NA, NA, NA, NA, NA…
$ down_irradiance380                <dbl> 0.004928912, 0.0049844…
$ down_irradiance380_qc             <chr> "1", "1", " ", " ", "8…
$ down_irradiance380_d_pres         <dbl> 0.02, 0.02, NA, NA, -0…
$ down_irradiance380_adjusted       <dbl> NA, NA, NA, NA, NA, NA…
$ down_irradiance380_adjusted_qc    <chr> " ", " ", " ", " ", "…
$ down_irradiance380_adjusted_error <dbl> NA, NA, NA, NA, NA, NA…
$ down_irradiance412                <dbl> 0.007823820, 0.0079019…
$ down_irradiance412_qc             <chr> "1", "1", " ", " ", "8…
$ down_irradiance412_d_pres         <dbl> 0.02, 0.02, NA, NA, -0…
$ down_irradiance412_adjusted       <dbl> NA, NA, NA, NA, NA, NA…
$ down_irradiance412_adjusted_qc    <chr> " ", " ", " ", " ", "…
$ down_irradiance412_adjusted_error <dbl> NA, NA, NA, NA, NA, NA…
$ down_irradiance490                <dbl> 0.007710332, 0.0077336…
$ down_irradiance490_qc             <chr> "1", "1", " ", " ", "8…
$ down_irradiance490_d_pres         <dbl> 0.02, 0.02, NA, NA, -0…
$ down_irradiance490_adjusted       <dbl> NA, NA, NA, NA, NA, NA…
$ down_irradiance490_adjusted_qc    <chr> " ", " ", " ", " ", "…
$ down_irradiance490_adjusted_error <dbl> NA, NA, NA, NA, NA, NA…
$ downwelling_par                   <dbl> 7.388872, 7.413715, NA…
$ downwelling_par_qc                <chr> "1", "1", " ", " ", "8…
$ downwelling_par_d_pres            <dbl> 0.02, 0.02, NA, NA, -0…
$ downwelling_par_adjusted          <dbl> NA, NA, NA, NA, NA, NA…
$ downwelling_par_adjusted_qc       <chr> " ", " ", " ", " ", "…
$ downwelling_par_adjusted_error    <dbl> NA, NA, NA, NA, NA, NA…
$ chla                              <dbl> NA, 0.489100, NA, NA, …
$ chla_qc                           <chr> " ", "3", " ", " ", "3…
$ chla_d_pres                       <dbl> NA, 0.0, NA, NA, 0.0, …
$ chla_adjusted                     <dbl> NA, 0.2993, NA, NA, 0.…
$ chla_adjusted_qc                  <chr> " ", "5", " ", " ", "5…
$ chla_adjusted_error               <dbl> NA, NA, NA, NA, NA, NA…
$ bbp700                            <dbl> NA, 0.0007781871, NA, …
$ bbp700_qc                         <chr> " ", "2", " ", " ", "2…
$ bbp700_d_pres                     <dbl> NA, 0.0, NA, NA, 0.0, …
$ bbp700_adjusted                   <dbl> NA, NA, NA, NA, NA, NA…
$ bbp700_adjusted_qc                <chr> " ", " ", " ", " ", "…
$ bbp700_adjusted_error             <dbl> NA, NA, NA, NA, NA, NA…
$ cdom                              <dbl> NA, 0.7080, NA, NA, 0.…
$ cdom_qc                           <chr> " ", "0", " ", " ", "0…
$ cdom_d_pres                       <dbl> NA, 0.0, NA, NA, 0.0, …
$ cdom_adjusted                     <dbl> NA, NA, NA, NA, NA, NA…
$ cdom_adjusted_qc                  <chr> " ", " ", " ", " ", "…
$ cdom_adjusted_error               <dbl> NA, NA, NA, NA, NA, NA…
$ nitrate                           <dbl> NA, NA, NA, NA, NA, NA…
$ nitrate_qc                        <chr> " ", " ", " ", " ", "…
$ nitrate_d_pres                    <dbl> NA, NA, NA, NA, NA, NA…
$ nitrate_adjusted                  <dbl> NA, NA, NA, NA, NA, NA…
$ nitrate_adjusted_qc               <chr> " ", " ", " ", " ", "…
$ nitrate_adjusted_error            <dbl> NA, NA, NA, NA, NA, NA…
$ n_levels                          <int> 4, 5, 6, 7, 8, 9, 10, …
$ n_prof                            <int> 1, 1, 1, 1, 1, 1, 1, 1…
$ cycle_number                      <int> 1, 1, 1, 1, 1, 1, 1, 1…
$ direction                         <chr> "A", "A", "A", "A", "A…
$ latitude                          <dbl> 72.71641, 72.71641, 72…
$ longitude                         <dbl> -66.70568, -66.70568, …
$ position_qc                       <chr> "1", "1", "1", "1", "1…
$ config_mission_number             <int> 1, 1, 1, 1, 1, 1, 1, 1…
$ profile_pres_qc                   <chr> "A", "A", "A", "A", "A…
$ profile_temp_qc                   <chr> "B", "B", "B", "B", "B…
$ profile_psal_qc                   <chr> "B", "B", "B", "B", "B…
$ profile_doxy_qc                   <chr> "B", "B", "B", "B", "B…
$ profile_down_irradiance380_qc     <chr> "A", "A", "A", "A", "A…
$ profile_down_irradiance412_qc     <chr> "A", "A", "A", "A", "A…
$ profile_down_irradiance490_qc     <chr> "A", "A", "A", "A", "A…
$ profile_downwelling_par_qc        <chr> "A", "A", "A", "A", "A…
$ profile_chla_qc                   <chr> "B", "B", "B", "B", "B…
$ profile_bbp700_qc                 <chr> "A", "A", "A", "A", "A…
$ profile_cdom_qc                   <chr> " ", " ", " ", " ", "…
$ profile_nitrate_qc                <chr> "A", "A", "A", "A", "A…
$ position                          <lgl> NA, NA, NA, NA, NA, NA…
$ profile_pres                      <lgl> NA, NA, NA, NA, NA, NA…
$ profile_temp                      <lgl> NA, NA, NA, NA, NA, NA…
$ profile_psal                      <lgl> NA, NA, NA, NA, NA, NA…
$ profile_doxy                      <lgl> NA, NA, NA, NA, NA, NA…
$ profile_down_irradiance380        <lgl> NA, NA, NA, NA, NA, NA…
$ profile_down_irradiance412        <lgl> NA, NA, NA, NA, NA, NA…
$ profile_down_irradiance490        <lgl> NA, NA, NA, NA, NA, NA…
$ profile_downwelling_par           <lgl> NA, NA, NA, NA, NA, NA…
$ profile_chla                      <lgl> NA, NA, NA, NA, NA, NA…
$ profile_bbp700                    <lgl> NA, NA, NA, NA, NA, NA…
$ profile_cdom                      <lgl> NA, NA, NA, NA, NA, NA…
$ profile_nitrate                   <lgl> NA, NA, NA, NA, NA, NA…
r$> str(bioargo_dark_corrected[integer(0), ])
tibble [0 × 111] (S3: tbl_df/tbl/data.frame)
 $ filename                         : chr(0) 
 $ floatname                        : chr(0) 
 $ takuse                           : Named chr(0) 
  ..- attr(*, "names")= chr(0) 
 $ date_time                        : 'POSIXct' num(0) 
 - attr(*, "tzone")= chr "UTC"
 $ juld                             : num(0) 
 $ juld_qc                          : chr(0) 
 $ juld_location                    : num(0) 
 $ pres                             : num(0) 
 $ pres_qc                          : chr(0) 
 $ pres_adjusted                    : num(0) 
 $ pres_adjusted_qc                 : chr(0) 
 $ pres_adjusted_error              : num(0) 
 $ temp                             : num(0) 
 $ temp_qc                          : chr(0) 
 $ temp_d_pres                      : num(0) 
 $ temp_adjusted                    : num(0) 
 $ temp_adjusted_qc                 : chr(0) 
 $ temp_adjusted_error              : num(0) 
 $ psal                             : num(0) 
 $ psal_qc                          : chr(0) 
 $ psal_d_pres                      : num(0) 
 $ psal_adjusted                    : num(0) 
 $ psal_adjusted_qc                 : chr(0) 
 $ psal_adjusted_error              : num(0) 
 $ doxy                             : num(0) 
 $ doxy_qc                          : chr(0) 
 $ doxy_d_pres                      : num(0) 
 $ doxy_adjusted                    : num(0) 
 $ doxy_adjusted_qc                 : chr(0) 
 $ doxy_adjusted_error              : num(0) 
 $ down_irradiance380               : num(0) 
 $ down_irradiance380_qc            : chr(0) 
 $ down_irradiance380_d_pres        : num(0) 
 $ down_irradiance380_adjusted      : num(0) 
 $ down_irradiance380_adjusted_qc   : chr(0) 
 $ down_irradiance380_adjusted_error: num(0) 
 $ down_irradiance412               : num(0) 
 $ down_irradiance412_qc            : chr(0) 
 $ down_irradiance412_d_pres        : num(0) 
 $ down_irradiance412_adjusted      : num(0) 
 $ down_irradiance412_adjusted_qc   : chr(0) 
 $ down_irradiance412_adjusted_error: num(0) 
 $ down_irradiance490               : num(0) 
 $ down_irradiance490_qc            : chr(0) 
 $ down_irradiance490_d_pres        : num(0) 
 $ down_irradiance490_adjusted      : num(0) 
 $ down_irradiance490_adjusted_qc   : chr(0) 
 $ down_irradiance490_adjusted_error: num(0) 
 $ downwelling_par                  : num(0) 
 $ downwelling_par_qc               : chr(0) 
 $ downwelling_par_d_pres           : num(0) 
 $ downwelling_par_adjusted         : num(0) 
 $ downwelling_par_adjusted_qc      : chr(0) 
 $ downwelling_par_adjusted_error   : num(0) 
 $ chla                             : num(0) 
 $ chla_qc                          : chr(0) 
 $ chla_d_pres                      : num(0) 
 $ chla_adjusted                    : num(0) 
 $ chla_adjusted_qc                 : chr(0) 
 $ chla_adjusted_error              : num(0) 
 $ bbp700                           : num(0) 
 $ bbp700_qc                        : chr(0) 
 $ bbp700_d_pres                    : num(0) 
 $ bbp700_adjusted                  : num(0) 
 $ bbp700_adjusted_qc               : chr(0) 
 $ bbp700_adjusted_error            : num(0) 
 $ cdom                             : num(0) 
 $ cdom_qc                          : chr(0) 
 $ cdom_d_pres                      : num(0) 
 $ cdom_adjusted                    : num(0) 
 $ cdom_adjusted_qc                 : chr(0) 
 $ cdom_adjusted_error              : num(0) 
 $ nitrate                          : num(0) 
 $ nitrate_qc                       : chr(0) 
 $ nitrate_d_pres                   : num(0) 
 $ nitrate_adjusted                 : num(0) 
 $ nitrate_adjusted_qc              : chr(0) 
 $ nitrate_adjusted_error           : num(0) 
 $ n_levels                         : int(0) 
 $ n_prof                           : int(0) 
 $ cycle_number                     : int(0) 
 $ direction                        : chr(0) 
 $ latitude                         : num(0) 
 $ longitude                        : num(0) 
 $ position_qc                      : chr(0) 
 $ config_mission_number            : int(0) 
 $ profile_pres_qc                  : chr(0) 
 $ profile_temp_qc                  : chr(0) 
 $ profile_psal_qc                  : chr(0) 
 $ profile_doxy_qc                  : chr(0) 
 $ profile_down_irradiance380_qc    : chr(0) 
 $ profile_down_irradiance412_qc    : chr(0) 
 $ profile_down_irradiance490_qc    : chr(0) 
 $ profile_downwelling_par_qc       : chr(0) 
 $ profile_chla_qc                  : chr(0) 
 $ profile_bbp700_qc                : chr(0) 
 $ profile_cdom_qc                  : chr(0) 
 $ profile_nitrate_qc               : chr(0) 
 $ position                         : logi(0) 
  [list output truncated]

@paleolimbot
Member

If I'm not mistaken, I believe you're even using a package I wrote to read the NetCDFs!

I don't see any odd column types here but there are certainly a lot of columns and that may be helpful to help us make a reproducer.

@PMassicotte
Author

If I'm not mistaken, I believe you're even using a package I wrote to read the NetCDFs!

Yes!

I just tried:

  1. Write as RDS
  2. Read it back in R
  3. Write it using Arrow

This does not crash. So it looks like the problem (data type?) disappears if I save the data and read it back using another format.
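
The three steps above can be sketched roughly as follows. `roundtrip_rds()` is a hypothetical helper and the toy data frame is invented for illustration. The thread only reports that this round trip avoided the crash for this particular dataset, so it should be read as a diagnostic step rather than a confirmed fix.

```r
# Hypothetical helper: write `df` to a temporary .rds file and read it back.
roundtrip_rds <- function(df) {
  path <- tempfile(fileext = ".rds")
  on.exit(unlink(path), add = TRUE)  # clean up the temporary file
  saveRDS(df, path)
  readRDS(path)
}

# Toy data standing in for the real table (not from the issue).
example <- data.frame(pres = c(0, 0.1, 0.13), chla = c(NA, 0.4891, NA))
restored <- roundtrip_rds(example)

# Step 3: only attempt the Parquet write if arrow is installed.
if (requireNamespace("arrow", quietly = TRUE)) {
  arrow::write_parquet(restored, tempfile(fileext = ".parquet"))
}
```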

@PMassicotte
Author

PMassicotte commented Mar 7, 2023

I think I was able to reproduce it.

library(tidyverse)
library(arrow)


file <- curl::curl_download("https://download849.mediafire.com/r4csstfcwwwgGquvCho4H6GtScoCJac108RL-q6X9MtoWuPDQvZOQAWhxQqlCjLj2RmsyzikhTZ0ijBElIAs5in5whbp-w/7dk60h8gnj4n1qj/bioargo_correction_b.parquet", destfile = tempfile(fileext = ".parquet"))

bioargo <- read_parquet(file)

bioargo

bioargo |>
  group_by(takuse, date_time, n_prof) |>
  filter(pres == max(pres)) |>
  ggplot(aes(x = pres)) +
  geom_histogram(binwidth = 10, color = "white")

bioargo_dark_corrected <- bioargo |>
  group_by(takuse, date_time, n_prof) |>
  mutate(chla = chla - min(chla, na.rm = TRUE)) |>
  ungroup()

write_parquet(bioargo_dark_corrected, tempfile())

Can you confirm?

@paleolimbot
Member

Brilliant! I had to download the file separately but this is fantastic.

library(tidyverse)
library(arrow)

# Download from:
# https://download849.mediafire.com/r4csstfcwwwgGquvCho4H6GtScoCJac108RL-q6X9MtoWuPDQvZOQAWhxQqlCjLj2RmsyzikhTZ0ijBElIAs5in5whbp-w/7dk60h8gnj4n1qj/bioargo_correction_b.parquet
file <- "~/Desktop/bioargo_correction_b.parquet"

bioargo <- read_parquet(file)

bioargo

pdf(tempfile())
bioargo |>
  group_by(takuse, date_time, n_prof) |>
  filter(pres == max(pres)) |>
  ggplot(aes(x = pres)) +
  geom_histogram(binwidth = 10, color = "white")
dev.off()

bioargo_dark_corrected <- bioargo |>
  group_by(takuse, date_time, n_prof) |>
  mutate(chla = chla - min(chla, na.rm = TRUE)) |>
  ungroup()

write_parquet(bioargo_dark_corrected, tempfile())

This reprex appears to crash R.
See standard output and standard error for more details.

Standard output and error

 *** caught segfault ***
address 0x18, cause 'invalid permissions'

Traceback:
 1: Table__from_dots(dots, schema, option_use_threads())
 2: Table$create(x, schema = schema)
 3: as_arrow_table.data.frame(x)
 4: as_arrow_table(x)
 5: doTryCatch(return(expr), name, parentenv, handler)
 6: tryCatchOne(expr, names, parentenv, handlers[[1L]])
 7: tryCatchList(expr, classes, parentenv, handlers)
 8: tryCatch(as_arrow_table(x), arrow_no_method_as_arrow_table = function(e) {    abort("Object must be coercible to an Arrow Table using `as_arrow_table()`",         parent = e, call = caller_env(2))})
 9: as_writable_table(x)
10: write_parquet(bioargo_dark_corrected, tempfile())
11: eval(expr, envir, enclos)
12: eval(expr, envir, enclos)
13: eval_with_user_handlers(expr, envir, enclos, user_handlers)
14: withVisible(eval_with_user_handlers(expr, envir, enclos, user_handlers))
15: withCallingHandlers(withVisible(eval_with_user_handlers(expr,     envir, enclos, user_handlers)), warning = wHandler, error = eHandler,     message = mHandler)
16: doTryCatch(return(expr), name, parentenv, handler)
17: tryCatchOne(expr, names, parentenv, handlers[[1L]])
18: tryCatchList(expr, classes, parentenv, handlers)
19: tryCatch(expr, error = function(e) {    call <- conditionCall(e)    if (!is.null(call)) {        if (identical(call[[1L]], quote(doTryCatch)))             call <- sys.call(-4L)        dcall <- deparse(call, nlines = 1L)        prefix <- paste("Error in", dcall, ": ")        LONG <- 75L        sm <- strsplit(conditionMessage(e), "\n")[[1L]]        w <- 14L + nchar(dcall, type = "w") + nchar(sm[1L], type = "w")        if (is.na(w))             w <- 14L + nchar(dcall, type = "b") + nchar(sm[1L],                 type = "b")        if (w > LONG)             prefix <- paste0(prefix, "\n  ")    }    else prefix <- "Error : "    msg <- paste0(prefix, conditionMessage(e), "\n")    .Internal(seterrmessage(msg[1L]))    if (!silent && isTRUE(getOption("show.error.messages"))) {        cat(msg, file = outFile)        .Internal(printDeferredWarnings())    }    invisible(structure(msg, class = "try-error", condition = e))})
20: try(f, silent = TRUE)
21: handle(ev <- withCallingHandlers(withVisible(eval_with_user_handlers(expr,     envir, enclos, user_handlers)), warning = wHandler, error = eHandler,     message = mHandler))
22: timing_fn(handle(ev <- withCallingHandlers(withVisible(eval_with_user_handlers(expr,     envir, enclos, user_handlers)), warning = wHandler, error = eHandler,     message = mHandler)))
23: evaluate_call(expr, parsed$src[[i]], envir = envir, enclos = enclos,     debug = debug, last = i == length(out), use_try = stop_on_error !=         2L, keep_warning = keep_warning, keep_message = keep_message,     output_handler = output_handler, include_timing = include_timing)
24: evaluate::evaluate(...)
25: evaluate(code, envir = env, new_device = FALSE, keep_warning = if (is.numeric(options$warning)) TRUE else options$warning,     keep_message = if (is.numeric(options$message)) TRUE else options$message,     stop_on_error = if (is.numeric(options$error)) options$error else {        if (options$error && options$include)             0L        else 2L    }, output_handler = knit_handlers(options$render, options))
26: in_dir(input_dir(), expr)
27: in_input_dir(evaluate(code, envir = env, new_device = FALSE,     keep_warning = if (is.numeric(options$warning)) TRUE else options$warning,     keep_message = if (is.numeric(options$message)) TRUE else options$message,     stop_on_error = if (is.numeric(options$error)) options$error else {        if (options$error && options$include)             0L        else 2L    }, output_handler = knit_handlers(options$render, options)))
28: eng_r(options)
29: block_exec(params)
30: call_block(x)
31: process_group.block(group)
32: process_group(group)
33: withCallingHandlers(if (tangle) process_tangle(group) else process_group(group),     error = function(e) {        setwd(wd)        cat(res, sep = "\n", file = output %n% "")        message("Quitting from lines ", paste(current_lines(i),             collapse = "-"), " (", knit_concord$get("infile"),             ") ")    })
34: process_file(text, output)
35: knitr::knit(knit_input, knit_output, envir = envir, quiet = quiet)
36: rmarkdown::render(input, quiet = TRUE, envir = globalenv(), encoding = "UTF-8")
37: (function (input) {    rmarkdown::render(input, quiet = TRUE, envir = globalenv(),         encoding = "UTF-8")})(input = base::quote("loyal-rat_reprex.R"))
38: (function (what, args, quote = FALSE, envir = parent.frame()) {    if (!is.list(args))         stop("second argument must be a list")    if (quote)         args <- lapply(args, enquote)    .Internal(do.call(what, args, envir))})(base::quote(function (input) {    rmarkdown::render(input, quiet = TRUE, envir = globalenv(),         encoding = "UTF-8")}), base::quote(list(input = "loyal-rat_reprex.R")), envir = base::quote(<environment>),     quote = base::quote(TRUE))
39: do.call(do.call, c(readRDS("/var/folders/p5/sxv05ml96sd1n2p3ssfhzzth0000gn/T//Rtmpod6Iee/callr-fun-768e1cc53b50"),     list(envir = .GlobalEnv, quote = TRUE)), envir = .GlobalEnv,     quote = TRUE)
40: saveRDS(do.call(do.call, c(readRDS("/var/folders/p5/sxv05ml96sd1n2p3ssfhzzth0000gn/T//Rtmpod6Iee/callr-fun-768e1cc53b50"),     list(envir = .GlobalEnv, quote = TRUE)), envir = .GlobalEnv,     quote = TRUE), file = "/var/folders/p5/sxv05ml96sd1n2p3ssfhzzth0000gn/T//Rtmpod6Iee/callr-res-768e58b90ff1",     compress = FALSE)
41: withCallingHandlers({    NULL    saveRDS(do.call(do.call, c(readRDS("/var/folders/p5/sxv05ml96sd1n2p3ssfhzzth0000gn/T//Rtmpod6Iee/callr-fun-768e1cc53b50"),         list(envir = .GlobalEnv, quote = TRUE)), envir = .GlobalEnv,         quote = TRUE), file = "/var/folders/p5/sxv05ml96sd1n2p3ssfhzzth0000gn/T//Rtmpod6Iee/callr-res-768e58b90ff1",         compress = FALSE)    flush(stdout())    flush(stderr())    NULL    invisible()}, error = function(e) {    {        callr_data <- as.environment("tools:callr")$`__callr_data__`        err <- callr_data$err        if (FALSE) {            assign(".Traceback", .traceback(4), envir = callr_data)            dump.frames("__callr_dump__")            assign(".Last.dump", .GlobalEnv$`__callr_dump__`,                 envir = callr_data)            rm("__callr_dump__", envir = .GlobalEnv)        }        e <- err$process_call(e)        e2 <- err$new_error("error in callr subprocess")        class(e2) <- c("callr_remote_error", class(e2))        e2 <- err$add_trace_back(e2)        cut <- which(e2$trace$scope == "global")[1]        if (!is.na(cut)) {            e2$trace <- e2$trace[-(1:cut), ]        }        saveRDS(list("error", e2, e), file = paste0("/var/folders/p5/sxv05ml96sd1n2p3ssfhzzth0000gn/T//Rtmpod6Iee/callr-res-768e58b90ff1",             ".error"))    }}, interrupt = function(e) {    {        callr_data <- as.environment("tools:callr")$`__callr_data__`        err <- callr_data$err        if (FALSE) {            assign(".Traceback", .traceback(4), envir = callr_data)            dump.frames("__callr_dump__")            assign(".Last.dump", .GlobalEnv$`__callr_dump__`,                 envir = callr_data)            rm("__callr_dump__", envir = .GlobalEnv)        }        e <- err$process_call(e)        e2 <- err$new_error("error in callr subprocess")        class(e2) <- c("callr_remote_error", class(e2))        e2 <- err$add_trace_back(e2)        cut <- which(e2$trace$scope == "global")[1]        if (!is.na(cut)) {         
   e2$trace <- e2$trace[-(1:cut), ]        }        saveRDS(list("error", e2, e), file = paste0("/var/folders/p5/sxv05ml96sd1n2p3ssfhzzth0000gn/T//Rtmpod6Iee/callr-res-768e58b90ff1",             ".error"))    }}, callr_message = function(e) {    try(signalCondition(e))})
42: doTryCatch(return(expr), name, parentenv, handler)
43: tryCatchOne(expr, names, parentenv, handlers[[1L]])
44: tryCatchList(expr, names[-nh], parentenv, handlers[-nh])
45: doTryCatch(return(expr), name, parentenv, handler)
46: tryCatchOne(tryCatchList(expr, names[-nh], parentenv, handlers[-nh]),     names[nh], parentenv, handlers[[nh]])
47: tryCatchList(expr, classes, parentenv, handlers)
48: tryCatch(withCallingHandlers({    NULL    saveRDS(do.call(do.call, c(readRDS("/var/folders/p5/sxv05ml96sd1n2p3ssfhzzth0000gn/T//Rtmpod6Iee/callr-fun-768e1cc53b50"),         list(envir = .GlobalEnv, quote = TRUE)), envir = .GlobalEnv,         quote = TRUE), file = "/var/folders/p5/sxv05ml96sd1n2p3ssfhzzth0000gn/T//Rtmpod6Iee/callr-res-768e58b90ff1",         compress = FALSE)    flush(stdout())    flush(stderr())    NULL    invisible()}, error = function(e) {    {        callr_data <- as.environment("tools:callr")$`__callr_data__`        err <- callr_data$err        if (FALSE) {            assign(".Traceback", .traceback(4), envir = callr_data)            dump.frames("__callr_dump__")            assign(".Last.dump", .GlobalEnv$`__callr_dump__`,                 envir = callr_data)            rm("__callr_dump__", envir = .GlobalEnv)        }        e <- err$process_call(e)        e2 <- err$new_error("error in callr subprocess")        class(e2) <- c("callr_remote_error", class(e2))        e2 <- err$add_trace_back(e2)        cut <- which(e2$trace$scope == "global")[1]        if (!is.na(cut)) {            e2$trace <- e2$trace[-(1:cut), ]        }        saveRDS(list("error", e2, e), file = paste0("/var/folders/p5/sxv05ml96sd1n2p3ssfhzzth0000gn/T//Rtmpod6Iee/callr-res-768e58b90ff1",             ".error"))    }}, interrupt = function(e) {    {        callr_data <- as.environment("tools:callr")$`__callr_data__`        err <- callr_data$err        if (FALSE) {            assign(".Traceback", .traceback(4), envir = callr_data)            dump.frames("__callr_dump__")            assign(".Last.dump", .GlobalEnv$`__callr_dump__`,                 envir = callr_data)            rm("__callr_dump__", envir = .GlobalEnv)        }        e <- err$process_call(e)        e2 <- err$new_error("error in callr subprocess")        class(e2) <- c("callr_remote_error", class(e2))        e2 <- err$add_trace_back(e2)        cut <- which(e2$trace$scope == "global")[1]        if (!is.na(cut)) 
{            e2$trace <- e2$trace[-(1:cut), ]        }        saveRDS(list("error", e2, e), file = paste0("/var/folders/p5/sxv05ml96sd1n2p3ssfhzzth0000gn/T//Rtmpod6Iee/callr-res-768e58b90ff1",             ".error"))    }}, callr_message = function(e) {    try(signalCondition(e))}), error = function(e) {    NULL    if (TRUE) {        try(stop(e))    }    else {        invisible()    }}, interrupt = function(e) {    NULL    if (TRUE) {        e    }    else {        invisible()    }})
An irrecoverable exception occurred. R is aborting now ...

@paleolimbot
Member

It looks like this is a problem with the ALTREP bypass: the arrays (probably most of them) are already ALTREP vectors coming from Arrow, and the segfault occurs when we try to get the chunked array back:

Console is in 'commands' mode, prefix expressions with '?'.
Attached to process 45901
Stop reason: EXC_BAD_ACCESS (code=1, address=0x18)
bt
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x18)
  * frame #0: 0x00000001128bd8a8 arrow.so`std::__1::shared_ptr<arrow::DataType>::shared_ptr(this=0x000000016fc26390, __r=nullptr) at shared_ptr.h:846:18
    frame #1: 0x00000001128b9224 arrow.so`std::__1::shared_ptr<arrow::DataType>::shared_ptr(this=0x000000016fc26390, __r=nullptr) at shared_ptr.h:848:1
    frame #2: 0x00000001128b9dd8 arrow.so`arrow::r::InferArrowType(x=0x0000000137c42438) at type_infer.cpp:189:12
    frame #3: 0x00000001128b4980 arrow.so`arrow::r::InferSchemaFromDots(this=(size=111), j=0, x=0x0000000137c42438, name="filename")::$_1::operator()(int, SEXPREC*, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>) const at table.cpp:179:38
    frame #4: 0x00000001128a8740 arrow.so`void arrow::r::TraverseDots<arrow::r::InferSchemaFromDots(SEXPREC*, SEXPREC*, int, std::__1::shared_ptr<arrow::Schema>&)::$_1>(dots=cpp11::list @ 0x000000016fc266a0, num_fields=111, lambda=(size=111))::$_1) at arrow_types.h:211:9
    frame #5: 0x00000001128a8450 arrow.so`arrow::r::InferSchemaFromDots(lst=0x0000000150258fa0, schema_sxp=0x000000013500d0e0, num_fields=111, schema=nullptr) at table.cpp:182:3
    frame #6: 0x000000011277aeb0 arrow.so`Table__from_dots(lst=0x0000000150258fa0, schema_sxp=0x000000013500d0e0, use_threads=true) at r_to_arrow.cpp:1461:15
    frame #7: 0x00000001124eb658 arrow.so`::_arrow_Table__from_dots(lst_sexp=0x0000000150258fa0, 

@PMassicotte
Author

Nice! Thank you for your rapid response!

thisisnic pushed a commit that referenced this issue Mar 8, 2023
…ting to access the underlying ChunkedArray (#34489)

### Rationale for this change

When we attempt to re-use an object that Arrow itself created previously by wrapping a chunked array, we will get a crash if this object has been materialized (i.e., R values have been accessed and the ChunkedArray reference deleted). This behaviour changed between 10.0.0 and 11.0.0 because I redid the ALTREP implementation just after the 10.0.0 release.

The following test crashes R on main and 11.0.0 but passes after this PR:

``` r
library(arrow, warn.conflicts = FALSE)
#> Some features are not enabled in this build of Arrow. Run `arrow_info()` for more information.
library(testthat, warn.conflicts = FALSE)
withr::local_namespace("arrow")

test_that("Materialized ALTREP arrays don't cause arrow to crash when attempting to bypass", {
  a_int <- Array$create(c(1L, 2L, 3L))
  b_int <- a_int$as_vector()
  expect_true(is_arrow_altrep(b_int))
  expect_false(test_arrow_altrep_is_materialized(b_int))
  
  # Some operations that use altrep bypass
  expect_equal(infer_type(b_int), int32())
  expect_equal(as_arrow_array(b_int), a_int)
  
  # Still shouldn't have materialized yet
  expect_false(test_arrow_altrep_is_materialized(b_int))
  
  # Force it to materialize and check again
  test_arrow_altrep_force_materialize(b_int)
  expect_true(test_arrow_altrep_is_materialized(b_int))
  expect_equal(infer_type(b_int), int32())
  expect_equal(as_arrow_array(b_int), a_int)
})
#> Test passed 🎉
```

### What changes are included in this PR?

We used a function called `is_arrow_altrep()` to check whether we could safely access the ChunkedArray reference; however, it still returns `true` for *materialized* ALTREP arrays. I added a new function, `is_unmaterialized_arrow_altrep()`, and switched the call sites that depend on the ChunkedArray actually existing to use it.
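
To illustrate the distinction, here is a minimal, self-contained sketch of the guard pattern. It is not the code from the arrow R package: `ChunkedArray`, `AltrepVector`, `IsArrowAltrep()`, `IsUnmaterializedArrowAltrep()` and `Length()` are hypothetical stand-ins, and the assumption is only that materializing drops the reference to the wrapped chunked array while the object keeps its ALTREP class.

```cpp
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical stand-in for an Arrow chunked array (not the real arrow::ChunkedArray).
struct ChunkedArray {
  std::vector<int> values;
};

// Hypothetical stand-in for an ALTREP vector that wraps a chunked array.
struct AltrepVector {
  std::shared_ptr<ChunkedArray> chunked_array;  // released on materialization
  std::vector<int> materialized;                // plain values after materialization

  // Copy the values out and drop the reference to the chunked array.
  void Materialize() {
    if (!chunked_array) return;
    materialized = chunked_array->values;
    chunked_array.reset();
  }
};

// Class-level check: true whether or not the vector has been materialized.
bool IsArrowAltrep(const AltrepVector&) { return true; }

// Stricter check: true only while the chunked array reference still exists.
bool IsUnmaterializedArrowAltrep(const AltrepVector& x) {
  return IsArrowAltrep(x) && x.chunked_array != nullptr;
}

// A "bypass" that touches the chunked array only when it is safe to do so.
std::size_t Length(const AltrepVector& x) {
  if (IsUnmaterializedArrowAltrep(x)) {
    return x.chunked_array->values.size();
  }
  return x.materialized.size();  // fall back to the materialized values
}
```

If `Length()` guarded its fast path with `IsArrowAltrep()` alone, it would dereference a null pointer after `Materialize()`, which is the same class of failure as the null `shared_ptr` seen in the backtrace above.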

### Are these changes tested?

Yes.

### Are there any user-facing changes?

No.
* Closes: #34211

Authored-by: Dewey Dunnington <dewey@voltrondata.com>
Signed-off-by: Nic Crane <thisisnic@gmail.com>
@thisisnic thisisnic added this to the 12.0.0 milestone Mar 8, 2023
@wjones127 wjones127 added the Critical Fix Bugfixes for security vulnerabilities, crashes, or invalid data. label Apr 27, 2023