ARROW-421: [Python] PyBytesReader keep refcount on PyBytes object for zero-copy read #278

Closed
wants to merge 1 commit into
from

Projects

None yet

2 participants

@BryanCutler
Contributor

Changed pyarrow::PyBytesReader to hold a reference count on the parent PyBytes object when doing a zero-copy read to prevent premature garbage collection. Without holding a reference count and reading a RecordBatch, for example, the read bytes can be cleaned up before the batch is consumed.

@BryanCutler BryanCutler added an increment to PyBytes object if doing a zero-copy read to pre…
…vent premature garbage collection
9b492b2
@BryanCutler
Contributor

@wesm , I'm not sure it this is the right fix but it seemed to correct the problem I was seeing when reading large record batches during the Spark Dataset conversion. Let me know if I'm way off here, thanks!

@wesm

Let me take a closer look at fixing this tomorrow

@@ -205,12 +205,22 @@ Status PyOutputStream::Write(const uint8_t* data, int64_t nbytes) {
PyBytesReader::PyBytesReader(PyObject* obj)
: arrow::io::BufferReader(reinterpret_cast<const uint8_t*>(PyBytes_AS_STRING(obj)),
PyBytes_GET_SIZE(obj)),
- obj_(obj) {
+ obj_(obj),
+ hasZeroCopyRef_(false) {
@wesm
wesm Jan 11, 2017 Member

Class members are named with lower_with_underscores_. See https://google.github.io/styleguide/cppguide.html#Variable_Names

+Status PyBytesReader::Read(int64_t nbytes, std::shared_ptr<arrow::Buffer>* out) {
+ Status status = arrow::io::BufferReader::Read(nbytes, out);
+ if (!hasZeroCopyRef_) {
+ Py_INCREF(obj_);
@wesm
wesm Jan 11, 2017 Member

This will cause obj_ to never be garbage collected

@wesm
Member
wesm commented Jan 11, 2017

Put up a patch. I had previously added code to allow the C++ BufferReader to handle shared ownership, so needed to utilize that here

@asfgit asfgit pushed a commit that closed this pull request Jan 11, 2017
@wesm wesm ARROW-421: [Python] Retain parent reference in PyBytesReader
Pass Buffer to BufferReader so that zero-copy slices retain reference to PyBytesBuffer, which prevents the bytes object from being garbage collected prematurely. Also added some helper tools for inspecting Arrow Buffer objects in Python.

Close #278

Author: Wes McKinney <wes.mckinney@twosigma.com>

Closes #279 from wesm/ARROW-421 and squashes the following commits:

acf730e [Wes McKinney] Rename method
50c195a [Wes McKinney] Fix accidental typo
ef20185 [Wes McKinney] Pass Buffer to BufferReader so that zero-copy slices retain reference to PyBytesBuffer, which prevents the bytes object from being garbage collected prematurely
7d3e2a3
@asfgit asfgit closed this in 7d3e2a3 Jan 11, 2017
@BryanCutler
Contributor

I see, thanks for the fix. I'll try it out.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment