[R][Python] Segfault on session exit after materializing ALTREP character vectors imported from Python #34897
Comments
Weston pointed me at some code in the Python package that checks if Python is finalizing before attempting to do something that might modify reference counts: https://github.com/apache/arrow/blob/main/python/pyarrow/src/arrow/python/udf.cc#L52-L58 From the R side, our external pointers have an option to not run the finalizer when the session shuts down ( https://github.com/r-lib/cpp11/blob/main/inst/include/cpp11/external_pointer.hpp#L58 ). We could/should pass |
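The guard described in this comment (skip cleanup that could touch reference counts once the interpreter is shutting down) can be sketched in plain Python. This is only an illustration of the idea, not the code in `udf.cc` or in cpp11: `GuardedHandle` and its `release` callback are hypothetical names, and the only real API used is `sys.is_finalizing()` from the standard library.

```python
import sys


class GuardedHandle:
    """Hypothetical wrapper around a resource with a release callback."""

    def __init__(self, release):
        self._release = release  # callable that frees the underlying resource
        self.released = False

    def close(self):
        # Releasing twice would be a bug, so make close() idempotent.
        if self.released:
            return
        # During interpreter shutdown other objects may already be torn down,
        # so skip the release instead of risking a crash.
        if sys.is_finalizing():
            return
        self._release()
        self.released = True


calls = []
handle = GuardedHandle(lambda: calls.append("released"))
handle.close()
handle.close()  # second call is a no-op
```

The trade-off in this sketch is a deliberate leak at shutdown: the resource is never released if the interpreter is already finalizing, on the assumption that the process is about to exit anyway.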
…ot own any Array references (apache#35812)

This was identified and 99% debugged by @lgautier on rpy2/rpy2-arrow#11. Thank you!

I have no idea why this does anything; however, the `RStringViewer` class *was* holding on to an unnecessary Array reference and this seemed to fix the crash for me. Maybe a circular reference?

The reprex I was using (provided by @lgautier) was:

Install fresh deps:

```bash
pip3 install pandas pyarrow rpy2-arrow
R -e 'install.packages("arrow", repos = "https://cloud.r-project.org/")'
```

Run this python script:

```python
import pandas as pd
import pyarrow
from rpy2.robjects.packages import importr
import rpy2.robjects
import rpy2_arrow.arrow as pyra

base = importr('base')
nanoarrow = importr('nanoarrow')

code = """
function(df) {
  # df$col1  # no segfault on exit
  # I(df$col1)  # no segfault on exit
  # df$col2  # no segfault on exit
  I(df$col2)  # segfault on exit
}
"""
rfunction = rpy2.robjects.r(code)

pd_df = pd.DataFrame({
    "col1": range(10),
    "col2": ["a" for num in range(10)]
})
pd_tbl = pyarrow.Table.from_pandas(pd_df)
r_tbl = pyra.pyarrow_table_to_r_table(pd_tbl)
r_df = base.as_data_frame(nanoarrow.as_nanoarrow_array_stream(r_tbl))
output = rfunction(r_df)
print(output)
```

Before this PR (installing R/arrow from main) I get:

```
(.venv) dewey@Deweys-Mac-mini 2023-05-29_rpy % python reprex-arrow.py
[1] "a" "a" "a" "a" "a" "a" "a" "a" "a" "a"
zsh: segmentation fault  python reprex-arrow.py
```

After this PR I get:

```
(.venv) dewey@Deweys-Mac-mini 2023-05-29_rpy % python reprex-arrow.py
[1] "a" "a" "a" "a" "a" "a" "a" "a" "a" "a"
```

(with no segfault)

I wonder if this also will help with apache#35391 since it's also a segfault involving the Python <-> R bridge.

* Closes: apache#34897

Authored-by: Dewey Dunnington <dewey@voltrondata.com>
Signed-off-by: Dewey Dunnington <dewey@fishandwhistle.net>
Describe the bug, including details regarding any error messages, version, and platform.
As described and reproduced nicely in rpy2/rpy2-arrow#11
This is almost certainly a result of #34489, since it seems to involve fully materialized arrays. Otherwise, I'm not sure what would cause this that is specific to shipping arrays over the C Data interface. The traceback reported is below (I will try to reproduce as well):
I suspect this has to do with session shutdown: shutting down may trigger some activity involving the GIL, and R then tries to release that memory at a very inconvenient time.
Component(s)
R