GH-34897: [R] Ensure that the RStringViewer helper class does not own any Array references #35812
@lgautier if you would like to test this, you should be able to
I confirm. It does seem to fix the issue. Thanks. Is there a target version / release date to include this in R's
We are about to do a patch release and so there is a very good chance it will be in the next week or two. If it can't be included there (in the unlikely event it affects nightly builds or additional tests for the release), it would be a few months.
Thank you for testing!
Benchmark runs are scheduled for baseline = e628ca5 and contender = 0adf154. 0adf154 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
…ot own any Array references (apache#35812)

This was identified and 99% debugged by @ lgautier on rpy2/rpy2-arrow#11 . Thank you!

I have no idea why this does anything; however, the `RStringViewer` class *was* holding on to an unnecessary Array reference and this seemed to fix the crash for me. Maybe a circular reference?

The reprex I was using (provided by @ lgautier) was:

Install fresh deps:

```bash
pip3 install pandas pyarrow rpy2-arrow
R -e 'install.packages("arrow", repos = "https://cloud.r-project.org/")'
```

Run this python script:

```python
import pandas as pd
import pyarrow
from rpy2.robjects.packages import importr
import rpy2.robjects
import rpy2_arrow.arrow as pyra

base = importr('base')
nanoarrow = importr('nanoarrow')

code = """
function(df) {
  # df$col1  # no segfault on exit
  # I(df$col1)  # no segfault on exit
  # df$col2  # no segfault on exit
  I(df$col2)  # segfault on exit
}
"""

rfunction = rpy2.robjects.r(code)

pd_df = pd.DataFrame({
    "col1": range(10),
    "col2":["a" for num in range(10)]
})
pd_tbl = pyarrow.Table.from_pandas(pd_df)
r_tbl = pyra.pyarrow_table_to_r_table(pd_tbl)
r_df = base.as_data_frame(nanoarrow.as_nanoarrow_array_stream(r_tbl))
output = rfunction(r_df)
print(output)
```

Before this PR (installing R/arrow from main) I get:

```
(.venv) dewey@ Deweys-Mac-mini 2023-05-29_rpy % python reprex-arrow.py
 [1] "a" "a" "a" "a" "a" "a" "a" "a" "a" "a"
zsh: segmentation fault  python reprex-arrow.py
```

After this PR I get:

```
(.venv) dewey@ Deweys-Mac-mini 2023-05-29_rpy % python reprex-arrow.py
 [1] "a" "a" "a" "a" "a" "a" "a" "a" "a" "a"
```

(with no segfault)

I wonder if this also will help with apache#35391 since it's also a segfault involving the Python <-> R bridge.

* Closes: apache#34897

Authored-by: Dewey Dunnington <dewey@voltrondata.com>
Signed-off-by: Dewey Dunnington <dewey@fishandwhistle.net>
This was identified and 99% debugged by @lgautier on rpy2/rpy2-arrow#11 . Thank you!
I have no idea why this does anything; however, the `RStringViewer` class was holding on to an unnecessary Array reference and this seemed to fix the crash for me. Maybe a circular reference?
The reprex I was using (provided by @lgautier) was:
Install fresh deps:
pip3 install pandas pyarrow rpy2-arrow
R -e 'install.packages("arrow", repos = "https://cloud.r-project.org/")'
Run this python script:
Before this PR (installing R/arrow from main) I get:
After this PR I get:
(with no segfault)
I wonder if this also will help with #35391 since it's also a segfault involving the Python <-> R bridge.
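Because the crash happens on exit, after the output has already been printed, it is easy to miss when the script runs unattended. A minimal sketch of one way to check for it, assuming a POSIX system where `subprocess` reports a signal-killed child as a negative return code; `run_and_report` is a hypothetical helper name and is not part of arrow, rpy2, or rpy2-arrow:

```python
import signal
import subprocess
import sys


def run_and_report(cmd):
    """Run cmd in a child process and describe how it exited.

    Hypothetical helper for illustration only. On POSIX, a negative
    return code from subprocess means the child was killed by a signal.
    """
    result = subprocess.run(cmd, capture_output=True, text=True)
    code = result.returncode
    if code < 0:
        try:
            name = signal.Signals(-code).name
        except ValueError:
            # Signal number without a known name on this platform.
            name = f"signal {-code}"
        return code, f"killed by {name}"
    if code == 0:
        return code, "exited normally"
    return code, f"exited with status {code}"
```

For the reprex this might be called as `run_and_report([sys.executable, "reprex-arrow.py"])`; a result of `(-signal.SIGSEGV, "killed by SIGSEGV")` would correspond to the "before" behaviour and `(0, "exited normally")` to the "after" behaviour.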