Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-36614: [MATLAB] Subclass arrow::Buffer to keep MATLAB data backing arrow::Arrays alive #36615

Merged
merged 6 commits into from
Jul 12, 2023

Conversation

sgilmore10
Copy link
Member

@sgilmore10 sgilmore10 commented Jul 11, 2023

Rationale for this change

When building arrow.array.<Numeric>Arrays from native MATLAB arrays, we avoid copying by

  1. Wrapping the MATLAB data inside a non-owning arrow::Buffer
  2. Constructing an arrow::ArrayData object from the arrow::Buffer
  3. Constructing the arrow::Array from the arrow::Data object.

Because the Array's underlying Buffer does not have ownership of its backing data, we have been storing the original MATLAB array as a property called MatlabArray on the MATLAB arrow.array.NumericArray class.

This solution is not ideal because the backing MATLAB array is kept separate from the actual std::shared_ptr<arrow::Array>, which is stored within the C++ proxy objects (e.g. arrow::matlab::array::proxy::NumericArray<float64>). A better solution would be to create a new subclass of arrow::Buffer called MatlabBuffer, which will keep the original MATLAB array alive by taking it in as an input parameter and storing it as a member variable.

What changes are included in this PR?

  1. Removed the MatlabArray property from the MATLAB class matlab.io.NumericArray
  2. Added a private member variable called mda_array to arrow::matlab::array::proxy::NumericArray<CType> and arrow::matlab::array::proxy::TimestampArray. mda_array is a matlab::data::Array object. This member variable stores the MATLAB array that owns the memory the arrow::Array objects are backed by.

Are these changes tested?

Existing tests used.

Are there any user-facing changes?

No.

@pitrou
Copy link
Member

pitrou commented Jul 11, 2023

From the POV of Arrow C++, the idiomatic way to solve this issue would be to define your own Buffer subclass that keeps the required MATLAB object(s) alive. This is what we already do in PyArrow when we want to view a Numpy array as a PyArrow array:

class ARROW_PYTHON_EXPORT NumPyBuffer : public Buffer {
public:
explicit NumPyBuffer(PyObject* arr);
virtual ~NumPyBuffer();
private:
PyObject* arr_;
};

NumPyBuffer::NumPyBuffer(PyObject* ao) : Buffer(nullptr, 0) {
PyAcquireGIL lock;
arr_ = ao;
Py_INCREF(ao);
if (PyArray_Check(ao)) {
PyArrayObject* ndarray = reinterpret_cast<PyArrayObject*>(ao);
auto ptr = reinterpret_cast<uint8_t*>(PyArray_DATA(ndarray));
data_ = const_cast<const uint8_t*>(ptr);
size_ = PyArray_SIZE(ndarray) * PyArray_DESCR(ndarray)->elsize;
capacity_ = size_;
is_mutable_ = !!(PyArray_FLAGS(ndarray) & NPY_ARRAY_WRITEABLE);
}
}
NumPyBuffer::~NumPyBuffer() {
PyAcquireGIL lock;
Py_XDECREF(arr_);
}

@sgilmore10
Copy link
Member Author

Thanks for the suggestion @pitrou! That is definitely a better approach.

@sgilmore10
Copy link
Member Author

Based on @pitrou suggestion, I defined a Buffer subclass called MatlabBuffer that keeps the MATLAB data alive. Thanks again for the feedback!

Copy link
Member

@kou kou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

Could you update the PR title/description before we merge this?

@github-actions github-actions bot added awaiting merge Awaiting merge and removed awaiting review Awaiting review labels Jul 11, 2023
@sgilmore10 sgilmore10 changed the title GH-36614: [MATLAB] Store backing MATLAB array within C++ proxy objects instead of MATLAB classes GH-36614: [MATLAB] Subclass arrow::Buffer to keep MATLAB data backing arrow::Arrays alive Jul 12, 2023
@sgilmore10
Copy link
Member Author

+1

Could you update the PR title/description before we merge this?

Updated!

Copy link
Member

@kevingurney kevingurney left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

Thanks for making this change, @sgilmore10!

return reinterpret_cast<const uint8_t*>(valid_elements_iterator.operator->());
}
} // anonymous namespace
TimestampArray::TimestampArray(std::shared_ptr<arrow::TimestampArray> timestamp_array)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps we should go with the convention of naming these input Array subclass constructor arguments array for consistency across the various subclasses.

@kevingurney kevingurney merged commit d8a3360 into apache:main Jul 12, 2023
8 checks passed
@kevingurney kevingurney removed the awaiting merge Awaiting merge label Jul 12, 2023
chelseajonesr pushed a commit to chelseajonesr/arrow that referenced this pull request Jul 20, 2023
…acking arrow::Arrays alive (apache#36615)

### Rationale for this change

When building `arrow.array.<Numeric>Arrays` from native MATLAB arrays, we avoid copying by

1. Wrapping the MATLAB data inside a non-owning `arrow::Buffer`
2. Constructing an `arrow::ArrayData` object from the `arrow::Buffer`
3. Constructing the `arrow::Array` from the `arrow::Data` object.

Because the `Array`'s underlying `Buffer` does not have ownership of its backing data, we have been storing the original MATLAB array as a property called `MatlabArray` on the MATLAB `arrow.array.NumericArray` class.

This solution is not ideal because the backing MATLAB array is kept separate from the actual `std::shared_ptr<arrow::Array>`, which is stored within the C++ proxy objects (e.g. `arrow::matlab::array::proxy::NumericArray<float64>`).  A better solution would be to create a new subclass of `arrow::Buffer` called `MatlabBuffer`, which will keep the original MATLAB array alive by taking it in as an input parameter and storing it as a member variable. 

### What changes are included in this PR?

1. Removed the `MatlabArray` property from the MATLAB class `matlab.io.NumericArray` 
2. Added a private member variable called `mda_array` to `arrow::matlab::array::proxy::NumericArray<CType>` and `arrow::matlab::array::proxy::TimestampArray`. `mda_array` is a `matlab::data::Array` object. This member variable stores the MATLAB array that owns the memory the `arrow::Array` objects are backed by.

### Are these changes tested?

Existing tests used.

### Are there any user-facing changes?
No.

* Closes: apache#36614

Authored-by: Sarah Gilmore <sgilmore@mathworks.com>
Signed-off-by: Kevin Gurney <kgurney@mathworks.com>
@conbench-apache-arrow
Copy link

After merging your PR, Conbench analyzed the 6 benchmarking runs that have been run so far on merge-commit d8a3360.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about possible false positives for unstable benchmarks that are known to sometimes produce them.

kevingurney pushed a commit that referenced this pull request Aug 4, 2023
…row.tabular.RecordBatch` (#37015)

### Rationale for this change

In our initial implementation of the MATLAB class `arrow.tabular.RecordBatch`, we included a property property called `ArrowArrays`, which is a `cell` array whose elements are scalar `arrow.array.Array` objects.  These arrays correspond to the columns in the `arrow::RecordBatch` that the MATLAB class wraps.  The purpose of `ArrowArrays` was to enable  zero-copy construction of the `arrow::Array` objects backing the `arrow::RecordBatch` from MATLAB arrays. The `ArrowArrays` property was necessary to ensure the MATLAB data from which the `arrow::Array` columns were constructed don't get deallocated before they are done being used.
 
However, we no longer need the `ArrowArrays` property on `arrow.tabular.RecordBatch` because of #36615, in which we implemented `arrow::matlab::buffer::MatlabBuffer`. This class inherits from `arrow::Buffer` and  stores a reference to the MATLAB data it wraps, ensuring that the wrapped MATLAB data is kept alive as long as the buffer is around. 

We now only create `arrow::Array` objects from `arrow::matlab::buffer::MatlabBuffer` objects - instead of `arrow::Buffer` objects. As a result, the `ArrowArrays` property is no longer necessary on `arrow.tabular.RecordBatch` because the `arrow::Array` columns within the `arrow::RecordBatch` are all backed by `arrow::matlab::buffer::MatlabBuffer` objects. The backing MATLAB data is kept alive as long as the arrays and their buffers are kept alive. 

### What changes are included in this PR?

1. Removed the `ArrowArrays` property from `arrow.tabular.RecordBatch`. 
2. Because `ArrowArrays` is no longer a property, we had to add a new method to `arrow::matlab::tabular::proxy::RecordBatch` called `getColumnByIndex()`. This method creates a proxy object around the specified `arrrow::Array`. The MATLAB method `RecordBatch/column(index)` uses `getColumnByIndex()`.
3. Added a new function called `wrap`, which accepts an `arrow::Array` and returns an `arrow::matlab::array::proxy::Array`. While working on `getColumnByIndex()`, we realized there would be multiple places in the interface where we will need to create proxies around `arrow::Array` objects. We wrote the `wrap` utility function to reduce duplicated code in the future. Currently, only `getColumnByIndex()` utilizes `wrap`.

### Are these changes tested?

Yes.

1. The existing unit tests in `tRecordBatch.m` cover these changes.
2. Added a few more test cases to `tRecordBatch.m` to cover error conditions when a bad `index` value is provided to `arrow.tabular.RecordBatch/column(index)`.

### Are there any user-facing changes?

No.

### Future Directions

1. In a followup PR, we will add support for indexing columns by names (similar to `Schema.field(fieldName)` - #37013).
2. We also plan on adding an index validation function within the `arrow.internal.validate` package.  
3. We also plan on adding a convenience constructor `arrow.recordbatch()` and changing the constructor of `arrow.tabular.RecordBatch` to expect a `libmexclass.proxy.Proxy` object. 

* Closes: #37012 

Authored-by: Sarah Gilmore <sgilmore@mathworks.com>
Signed-off-by: Kevin Gurney <kgurney@mathworks.com>
R-JunmingChen pushed a commit to R-JunmingChen/arrow that referenced this pull request Aug 20, 2023
…acking arrow::Arrays alive (apache#36615)

### Rationale for this change

When building `arrow.array.<Numeric>Arrays` from native MATLAB arrays, we avoid copying by

1. Wrapping the MATLAB data inside a non-owning `arrow::Buffer`
2. Constructing an `arrow::ArrayData` object from the `arrow::Buffer`
3. Constructing the `arrow::Array` from the `arrow::Data` object.

Because the `Array`'s underlying `Buffer` does not have ownership of its backing data, we have been storing the original MATLAB array as a property called `MatlabArray` on the MATLAB `arrow.array.NumericArray` class.

This solution is not ideal because the backing MATLAB array is kept separate from the actual `std::shared_ptr<arrow::Array>`, which is stored within the C++ proxy objects (e.g. `arrow::matlab::array::proxy::NumericArray<float64>`).  A better solution would be to create a new subclass of `arrow::Buffer` called `MatlabBuffer`, which will keep the original MATLAB array alive by taking it in as an input parameter and storing it as a member variable. 

### What changes are included in this PR?

1. Removed the `MatlabArray` property from the MATLAB class `matlab.io.NumericArray` 
2. Added a private member variable called `mda_array` to `arrow::matlab::array::proxy::NumericArray<CType>` and `arrow::matlab::array::proxy::TimestampArray`. `mda_array` is a `matlab::data::Array` object. This member variable stores the MATLAB array that owns the memory the `arrow::Array` objects are backed by.

### Are these changes tested?

Existing tests used.

### Are there any user-facing changes?
No.

* Closes: apache#36614

Authored-by: Sarah Gilmore <sgilmore@mathworks.com>
Signed-off-by: Kevin Gurney <kgurney@mathworks.com>
@sgilmore10 sgilmore10 deleted the GH-36614 branch August 21, 2023 18:13
loicalleyne pushed a commit to loicalleyne/arrow that referenced this pull request Nov 13, 2023
…om `arrow.tabular.RecordBatch` (apache#37015)

### Rationale for this change

In our initial implementation of the MATLAB class `arrow.tabular.RecordBatch`, we included a property property called `ArrowArrays`, which is a `cell` array whose elements are scalar `arrow.array.Array` objects.  These arrays correspond to the columns in the `arrow::RecordBatch` that the MATLAB class wraps.  The purpose of `ArrowArrays` was to enable  zero-copy construction of the `arrow::Array` objects backing the `arrow::RecordBatch` from MATLAB arrays. The `ArrowArrays` property was necessary to ensure the MATLAB data from which the `arrow::Array` columns were constructed don't get deallocated before they are done being used.
 
However, we no longer need the `ArrowArrays` property on `arrow.tabular.RecordBatch` because of apache#36615, in which we implemented `arrow::matlab::buffer::MatlabBuffer`. This class inherits from `arrow::Buffer` and  stores a reference to the MATLAB data it wraps, ensuring that the wrapped MATLAB data is kept alive as long as the buffer is around. 

We now only create `arrow::Array` objects from `arrow::matlab::buffer::MatlabBuffer` objects - instead of `arrow::Buffer` objects. As a result, the `ArrowArrays` property is no longer necessary on `arrow.tabular.RecordBatch` because the `arrow::Array` columns within the `arrow::RecordBatch` are all backed by `arrow::matlab::buffer::MatlabBuffer` objects. The backing MATLAB data is kept alive as long as the arrays and their buffers are kept alive. 

### What changes are included in this PR?

1. Removed the `ArrowArrays` property from `arrow.tabular.RecordBatch`. 
2. Because `ArrowArrays` is no longer a property, we had to add a new method to `arrow::matlab::tabular::proxy::RecordBatch` called `getColumnByIndex()`. This method creates a proxy object around the specified `arrrow::Array`. The MATLAB method `RecordBatch/column(index)` uses `getColumnByIndex()`.
3. Added a new function called `wrap`, which accepts an `arrow::Array` and returns an `arrow::matlab::array::proxy::Array`. While working on `getColumnByIndex()`, we realized there would be multiple places in the interface where we will need to create proxies around `arrow::Array` objects. We wrote the `wrap` utility function to reduce duplicated code in the future. Currently, only `getColumnByIndex()` utilizes `wrap`.

### Are these changes tested?

Yes.

1. The existing unit tests in `tRecordBatch.m` cover these changes.
2. Added a few more test cases to `tRecordBatch.m` to cover error conditions when a bad `index` value is provided to `arrow.tabular.RecordBatch/column(index)`.

### Are there any user-facing changes?

No.

### Future Directions

1. In a followup PR, we will add support for indexing columns by names (similar to `Schema.field(fieldName)` - apache#37013).
2. We also plan on adding an index validation function within the `arrow.internal.validate` package.  
3. We also plan on adding a convenience constructor `arrow.recordbatch()` and changing the constructor of `arrow.tabular.RecordBatch` to expect a `libmexclass.proxy.Proxy` object. 

* Closes: apache#37012 

Authored-by: Sarah Gilmore <sgilmore@mathworks.com>
Signed-off-by: Kevin Gurney <kgurney@mathworks.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[MATLAB] Subclass arrow::Buffer to keep MATLAB data backing arrow::Arrays alive
4 participants