Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-34897: [R] Ensure that the RStringViewer helper class does not own any Array references #35812

Merged

Conversation

paleolimbot
Copy link
Member

@paleolimbot paleolimbot commented May 29, 2023

This was identified and 99% debugged by @lgautier on rpy2/rpy2-arrow#11 . Thank you!

I have no idea why this does anything; however, the RStringViewer class was holding on to an unnecessary Array reference and this seemed to fix the crash for me. Maybe a circular reference? The reprex I was using (provided by @lgautier) was:

Install fresh deps:

pip3 install pandas pyarrow rpy2-arrow
R -e 'install.packages("arrow", repos = "https://cloud.r-project.org/")'

Run this python script:

import pandas as pd
import pyarrow
from rpy2.robjects.packages import importr
import rpy2.robjects
import rpy2_arrow.arrow as pyra
base = importr('base')
nanoarrow = importr('nanoarrow')

code = """
    function(df) {
        # df$col1  # no segfault on exit
        # I(df$col1)  # no segfault on exit
        # df$col2  # no segfault on exit
        I(df$col2)  # segfault on exit
    }
"""
rfunction = rpy2.robjects.r(code)

pd_df = pd.DataFrame({
    "col1": range(10),
    "col2":["a" for num in range(10)]
})
pd_tbl = pyarrow.Table.from_pandas(pd_df)
r_tbl = pyra.pyarrow_table_to_r_table(pd_tbl)
r_df = base.as_data_frame(nanoarrow.as_nanoarrow_array_stream(r_tbl))

output = rfunction(r_df)
print(output)

Before this PR (installing R/arrow from main) I get:

(.venv) dewey@Deweys-Mac-mini 2023-05-29_rpy % python reprex-arrow.py
 [1] "a" "a" "a" "a" "a" "a" "a" "a" "a" "a"

zsh: segmentation fault  python reprex-arrow.py

After this PR I get:

(.venv) dewey@Deweys-Mac-mini 2023-05-29_rpy % python reprex-arrow.py
 [1] "a" "a" "a" "a" "a" "a" "a" "a" "a" "a"

(with no segfault)

I wonder if this also will help with #35391 since it's also a segfault involving the Python <-> R bridge.

@paleolimbot
Copy link
Member Author

@lgautier if you would like to test this, you should be able to remotes::install_github("apache/arrow/r#35812", build = FALSE) to test this branch (the install might take a while but it should work).

@lgautier
Copy link

@lgautier if you would like to test this, you should be able to remotes::install_github("apache/arrow/r#35812", build = FALSE) to test this branch (the install might take a while but it should work).

I confirm. It does seem to fix the issue. Thanks. Is there a target version / release date to include this in R's arrow?

@paleolimbot
Copy link
Member Author

We are about to do a patch release and so there is a very good chance it will be in the next week or two. If it can't be included there (in the unlikely event it affects nightly builds or additional tests for the release), it would be a few months.

@paleolimbot
Copy link
Member Author

Thank you for testing!

@paleolimbot paleolimbot merged commit 0adf154 into apache:main May 30, 2023
10 checks passed
@paleolimbot paleolimbot deleted the r-maybe-fix-string-altrep-crash branch May 31, 2023 15:00
@ursabot
Copy link

ursabot commented May 31, 2023

Benchmark runs are scheduled for baseline = e628ca5 and contender = 0adf154. 0adf154 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Failed ⬇️0.97% ⬆️0.18%] test-mac-arm
[Finished ⬇️0.65% ⬆️0.0%] ursa-i9-9960x
[Failed ⬇️0.0% ⬆️0.0%] ursa-thinkcentre-m75q
Buildkite builds:
[Finished] 0adf1543 ec2-t3-xlarge-us-east-2
[Finished] 0adf1543 test-mac-arm
[Finished] 0adf1543 ursa-i9-9960x
[Failed] 0adf1543 ursa-thinkcentre-m75q
[Finished] e628ca50 ec2-t3-xlarge-us-east-2
[Failed] e628ca50 test-mac-arm
[Finished] e628ca50 ursa-i9-9960x
[Finished] e628ca50 ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

thisisnic pushed a commit to thisisnic/arrow that referenced this pull request Jun 6, 2023
…ot own any Array references (apache#35812)

This was identified and 99% debugged by @ lgautier on rpy2/rpy2-arrow#11 . Thank you!

I have no idea why this does anything; however, the `RStringViewer` class *was* holding on to an unnecessary Array reference and this seemed to fix the crash for me. Maybe a circular reference? The reprex I was using (provided by @ lgautier) was:

Install fresh deps:

```bash
pip3 install pandas pyarrow rpy2-arrow
R -e 'install.packages("arrow", repos = "https://cloud.r-project.org/")'
```

Run this python script:

```python
import pandas as pd
import pyarrow
from rpy2.robjects.packages import importr
import rpy2.robjects
import rpy2_arrow.arrow as pyra
base = importr('base')
nanoarrow = importr('nanoarrow')

code = """
    function(df) {
        # df$col1  # no segfault on exit
        # I(df$col1)  # no segfault on exit
        # df$col2  # no segfault on exit
        I(df$col2)  # segfault on exit
    }
"""
rfunction = rpy2.robjects.r(code)

pd_df = pd.DataFrame({
    "col1": range(10),
    "col2":["a" for num in range(10)]
})
pd_tbl = pyarrow.Table.from_pandas(pd_df)
r_tbl = pyra.pyarrow_table_to_r_table(pd_tbl)
r_df = base.as_data_frame(nanoarrow.as_nanoarrow_array_stream(r_tbl))

output = rfunction(r_df)
print(output)
```

Before this PR (installing R/arrow from main) I get:

```
(.venv) dewey@ Deweys-Mac-mini 2023-05-29_rpy % python reprex-arrow.py
 [1] "a" "a" "a" "a" "a" "a" "a" "a" "a" "a"

zsh: segmentation fault  python reprex-arrow.py
```

After this PR I get:

```
(.venv) dewey@ Deweys-Mac-mini 2023-05-29_rpy % python reprex-arrow.py
 [1] "a" "a" "a" "a" "a" "a" "a" "a" "a" "a"
```

(with no segfault)

I wonder if this also will help with apache#35391 since it's also a segfault involving the Python <-> R bridge.
* Closes: apache#34897

Authored-by: Dewey Dunnington <dewey@voltrondata.com>
Signed-off-by: Dewey Dunnington <dewey@fishandwhistle.net>
thisisnic pushed a commit to thisisnic/arrow that referenced this pull request Jun 13, 2023
…ot own any Array references (apache#35812)

This was identified and 99% debugged by @ lgautier on rpy2/rpy2-arrow#11 . Thank you!

I have no idea why this does anything; however, the `RStringViewer` class *was* holding on to an unnecessary Array reference and this seemed to fix the crash for me. Maybe a circular reference? The reprex I was using (provided by @ lgautier) was:

Install fresh deps:

```bash
pip3 install pandas pyarrow rpy2-arrow
R -e 'install.packages("arrow", repos = "https://cloud.r-project.org/")'
```

Run this python script:

```python
import pandas as pd
import pyarrow
from rpy2.robjects.packages import importr
import rpy2.robjects
import rpy2_arrow.arrow as pyra
base = importr('base')
nanoarrow = importr('nanoarrow')

code = """
    function(df) {
        # df$col1  # no segfault on exit
        # I(df$col1)  # no segfault on exit
        # df$col2  # no segfault on exit
        I(df$col2)  # segfault on exit
    }
"""
rfunction = rpy2.robjects.r(code)

pd_df = pd.DataFrame({
    "col1": range(10),
    "col2":["a" for num in range(10)]
})
pd_tbl = pyarrow.Table.from_pandas(pd_df)
r_tbl = pyra.pyarrow_table_to_r_table(pd_tbl)
r_df = base.as_data_frame(nanoarrow.as_nanoarrow_array_stream(r_tbl))

output = rfunction(r_df)
print(output)
```

Before this PR (installing R/arrow from main) I get:

```
(.venv) dewey@ Deweys-Mac-mini 2023-05-29_rpy % python reprex-arrow.py
 [1] "a" "a" "a" "a" "a" "a" "a" "a" "a" "a"

zsh: segmentation fault  python reprex-arrow.py
```

After this PR I get:

```
(.venv) dewey@ Deweys-Mac-mini 2023-05-29_rpy % python reprex-arrow.py
 [1] "a" "a" "a" "a" "a" "a" "a" "a" "a" "a"
```

(with no segfault)

I wonder if this also will help with apache#35391 since it's also a segfault involving the Python <-> R bridge.
* Closes: apache#34897

Authored-by: Dewey Dunnington <dewey@voltrondata.com>
Signed-off-by: Dewey Dunnington <dewey@fishandwhistle.net>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[R][Python] Segfault on session exit after materializing ALTREP character vectors imported from Python
3 participants