Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Python][C++] Crash when reading from closed RecordBatchReader backed by C Stream #34906

Closed
jorisvandenbossche opened this issue Apr 5, 2023 · 4 comments · Fixed by #35016
Closed
Assignees
Labels
Component: C++ Component: Python Critical Fix Bugfixes for security vulnerabilities, crashes, or invalid data. Type: bug
Milestone

Comments

@jorisvandenbossche
Copy link
Member

Describe the bug, including details regarding any error messages, version, and platform.

The following snippet reproduces the issue:

import pyarrow as pa
from pyarrow.cffi import ffi

# Prepare input
schema = pa.schema([('col', pa.int64())])
batches = [
    pa.record_batch([[1, 2, 3]], schema),
    pa.record_batch([[None, 5, 6]], schema),
]
reader = pa.RecordBatchReader.from_batches(schema, batches)

# create RecordBatchReader
c_stream = ffi.new("struct ArrowArrayStream*")
ptr_stream = int(ffi.cast("uintptr_t", c_stream))
reader._export_to_c(ptr_stream)
# Delete and recreate C++ object from exported pointer
del reader, batches
reader_new = pa.RecordBatchReader._import_from_c(ptr_stream)

reader_new.close()
reader_new.read_all()

Backtrace:

Thread 1 "python" received signal SIGSEGV, Segmentation fault.
#0  0xffffffffffffffff in ?? ()
#1  0x00007ffff4924a27 in arrow::(anonymous namespace)::ExportedArrayStream::GetNext (this=0x7fffffffb8f8, out_array=0x7fffffffb970) at /home/joris/scipy/repos/arrow/cpp/src/arrow/c/bridge.cc:1675
#2  0x00007ffff4924e2b in arrow::(anonymous namespace)::ExportedArrayStream::StaticGetNext (stream=0x5555563e9f88, out_array=0x7fffffffb970) at /home/joris/scipy/repos/arrow/cpp/src/arrow/c/bridge.cc:1711
#3  0x00007ffff4925466 in arrow::(anonymous namespace)::ArrayStreamBatchReader::ReadNext (this=0x5555563e9f80, batch=0x7fffffffba20) at /home/joris/scipy/repos/arrow/cpp/src/arrow/c/bridge.cc:1787
#4  0x00007ffff47b07cf in arrow::RecordBatchReader::ToRecordBatches (this=0x5555563e9f80) at /home/joris/scipy/repos/arrow/cpp/src/arrow/record_batch.cc:340
#5  0x00007ffff47b0a55 in arrow::RecordBatchReader::ToTable (this=0x5555563e9f80) at /home/joris/scipy/repos/arrow/cpp/src/arrow/record_batch.cc:354
#6  0x00007ffff710221c in __pyx_pf_7pyarrow_3lib_17RecordBatchReader_10read_all (__pyx_v_self=0x7fffd89d2460) at /home/joris/scipy/repos/arrow/python/build/temp.linux-x86_64-cpython-310/lib.cpp:180605

So either on the C++ side, the ArrayStreamBatchReader::ReadNext should check if the stream was closed or not, or on the Python side we could not try to read if it was closed (although I assume you would run into this when using it from C++ as well, so it should be fixed at that level).

Component(s)

C++, Python

jorisvandenbossche pushed a commit that referenced this issue Apr 11, 2023
…from a closed ArrayStreamBatchReader (#35016)

### Rationale for this change

Segfaults should be avoided, even on improper calling patterns.

### What changes are included in this PR?

Attempting to use a record batch reader, sourced from a C API stream, after it had been closed, will now return an invalid status instead of a segmentation fault.

### Are these changes tested?

Yes, new unit tests were added to regress this usage.

### Are there any user-facing changes?

No
* Closes: #34906

Authored-by: Weston Pace <weston.pace@gmail.com>
Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
@jorisvandenbossche jorisvandenbossche added this to the 12.0.0 milestone Apr 11, 2023
@kylebarron
Copy link
Contributor

It may be the same bug, but a really simple repro of this for me is tab-completion on an empty reader object.

> python
Python 3.11.1 (main, Jan 27 2023, 14:02:47) [Clang 14.0.0 (clang-1400.0.29.202)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import pyarrow as pa
>>> reader = pa.RecordBatchReader()
>>> reader.[1]    84350 segmentation fault  python

@jorisvandenbossche
Copy link
Member Author

That's a different issue I think, since this one was specific to a RecordBatchReader backed by a C Stream. And I can still reproduce this on the latest main.

Do you know a way to trigger this from a script? (without the tab-completion action, to be able to run it in a debugger)
I tried dir(reader), but that is not enough.

@kylebarron
Copy link
Contributor

From https://docs.python.org/3.11/tutorial/interactive.html#tab-completion-and-history-editing

Note that this may execute application-defined code if an object with a __getattr__() method is part of the expression

So it looks like this is because .schema is a property, so it tries to get evaluated.

> python
Python 3.11.1 (main, Jan 27 2023, 14:02:47) [Clang 14.0.0 (clang-1400.0.29.202)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import pyarrow as pa
>>> reader = pa.RecordBatchReader()
>>> reader.schema
[1]    7847 segmentation fault  python

If I mock out that method, it works:

> python
Python 3.11.1 (main, Jan 27 2023, 14:02:47) [Clang 14.0.0 (clang-1400.0.29.202)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import pyarrow as pa
>>> class MyReader(pa.RecordBatchReader):
...     @property
...     def schema(self):
...         return None
...
>>> reader = MyReader()
>>> reader.
reader.close(                                 reader.read_all(                              reader.read_pandas(
reader.from_batches(                          reader.read_next_batch(                       reader.schema
reader.iter_batches_with_custom_metadata(     reader.read_next_batch_with_custom_metadata(

@jorisvandenbossche
Copy link
Member Author

Thanks! Backtrace for this:

Thread 1 "python" received signal SIGSEGV, Segmentation fault.
__pyx_pf_7pyarrow_3lib_17RecordBatchReader_6schema___get__ (__pyx_v_self=0x7ffff7509e00) at /home/joris/scipy/repos/arrow/python/build/temp.linux-x86_64-cpython-310/lib.cpp:181731
181731	                        __pyx_v_c_schema = __pyx_v_self->reader.get()->schema();

#0  __pyx_pf_7pyarrow_3lib_17RecordBatchReader_6schema___get__ (__pyx_v_self=0x7ffff7509e00) at /home/joris/scipy/repos/arrow/python/build/temp.linux-x86_64-cpython-310/lib.cpp:181731
#1  0x00007ffff711d698 in __pyx_pw_7pyarrow_3lib_17RecordBatchReader_6schema_1__get__ (__pyx_v_self=<pyarrow.lib.RecordBatchReader at remote 0x7ffff7509e00>)
    at /home/joris/scipy/repos/arrow/python/build/temp.linux-x86_64-cpython-310/lib.cpp:181682
#2  0x00007ffff71416b5 in __pyx_getprop_7pyarrow_3lib_17RecordBatchReader_schema (o=<pyarrow.lib.RecordBatchReader at remote 0x7ffff7509e00>, x=0x0)
    at /home/joris/scipy/repos/arrow/python/build/temp.linux-x86_64-cpython-310/lib.cpp:206992
#3  0x00005555556982b3 in _PyObject_GenericGetAttrWithDict (obj=<optimized out>, name='schema', dict=<optimized out>, suppress=0) at /usr/local/src/conda/python-3.10.6/Objects/object.c:1253

Now, the docstring of RecordBatchReader also indicates that it shouldn't be called directly ("Do not call this class's constructor directly, use one of the RecordBatchReader.from_* functions instead."). But we don't actually "forbid" doing that, like we do for other of our classes (eg pa.Array() errors).
I am not sure there is a usecase of RecordBatchReader(), so maybe we should already disallow this step (I suppose the segfault comes from accessing the schema on a nullptr reader).

@wjones127 wjones127 added the Critical Fix Bugfixes for security vulnerabilities, crashes, or invalid data. label Apr 26, 2023
liujiacheng777 pushed a commit to LoongArch-Python/arrow that referenced this issue May 11, 2023
…ading from a closed ArrayStreamBatchReader (apache#35016)

### Rationale for this change

Segfaults should be avoided, even on improper calling patterns.

### What changes are included in this PR?

Attempting to use a record batch reader, sourced from a C API stream, after it had been closed, will now return an invalid status instead of a segmentation fault.

### Are these changes tested?

Yes, new unit tests were added to regress this usage.

### Are there any user-facing changes?

No
* Closes: apache#34906

Authored-by: Weston Pace <weston.pace@gmail.com>
Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
ArgusLi pushed a commit to Bit-Quill/arrow that referenced this issue May 15, 2023
…ading from a closed ArrayStreamBatchReader (apache#35016)

### Rationale for this change

Segfaults should be avoided, even on improper calling patterns.

### What changes are included in this PR?

Attempting to use a record batch reader, sourced from a C API stream, after it had been closed, will now return an invalid status instead of a segmentation fault.

### Are these changes tested?

Yes, new unit tests were added to regress this usage.

### Are there any user-facing changes?

No
* Closes: apache#34906

Authored-by: Weston Pace <weston.pace@gmail.com>
Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
rtpsw pushed a commit to rtpsw/arrow that referenced this issue May 16, 2023
…ading from a closed ArrayStreamBatchReader (apache#35016)

### Rationale for this change

Segfaults should be avoided, even on improper calling patterns.

### What changes are included in this PR?

Attempting to use a record batch reader, sourced from a C API stream, after it had been closed, will now return an invalid status instead of a segmentation fault.

### Are these changes tested?

Yes, new unit tests were added to regress this usage.

### Are there any user-facing changes?

No
* Closes: apache#34906

Authored-by: Weston Pace <weston.pace@gmail.com>
Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Component: C++ Component: Python Critical Fix Bugfixes for security vulnerabilities, crashes, or invalid data. Type: bug
Projects
None yet
4 participants