-
Notifications
You must be signed in to change notification settings - Fork 3.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
GH-36614: [MATLAB] Subclass arrow::Buffer to keep MATLAB data backing arrow::Arrays alive #36615
Conversation
From the POV of Arrow C++, the idiomatic way to solve this issue would be to define your own arrow/python/pyarrow/src/arrow/python/numpy_convert.h Lines 42 to 49 in c09b8c8
arrow/python/pyarrow/src/arrow/python/numpy_convert.cc Lines 40 to 58 in c09b8c8
|
Thanks for the suggestion @pitrou! That is definitely a better approach. |
Based on @pitrou suggestion, I defined a |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
Could you update the PR title/description before we merge this?
Updated! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
Thanks for making this change, @sgilmore10!
return reinterpret_cast<const uint8_t*>(valid_elements_iterator.operator->()); | ||
} | ||
} // anonymous namespace | ||
TimestampArray::TimestampArray(std::shared_ptr<arrow::TimestampArray> timestamp_array) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps we should go with the convention of naming these input Array
subclass constructor arguments array
for consistency across the various subclasses.
…acking arrow::Arrays alive (apache#36615) ### Rationale for this change When building `arrow.array.<Numeric>Arrays` from native MATLAB arrays, we avoid copying by 1. Wrapping the MATLAB data inside a non-owning `arrow::Buffer` 2. Constructing an `arrow::ArrayData` object from the `arrow::Buffer` 3. Constructing the `arrow::Array` from the `arrow::Data` object. Because the `Array`'s underlying `Buffer` does not have ownership of its backing data, we have been storing the original MATLAB array as a property called `MatlabArray` on the MATLAB `arrow.array.NumericArray` class. This solution is not ideal because the backing MATLAB array is kept separate from the actual `std::shared_ptr<arrow::Array>`, which is stored within the C++ proxy objects (e.g. `arrow::matlab::array::proxy::NumericArray<float64>`). A better solution would be to create a new subclass of `arrow::Buffer` called `MatlabBuffer`, which will keep the original MATLAB array alive by taking it in as an input parameter and storing it as a member variable. ### What changes are included in this PR? 1. Removed the `MatlabArray` property from the MATLAB class `matlab.io.NumericArray` 2. Added a private member variable called `mda_array` to `arrow::matlab::array::proxy::NumericArray<CType>` and `arrow::matlab::array::proxy::TimestampArray`. `mda_array` is a `matlab::data::Array` object. This member variable stores the MATLAB array that owns the memory the `arrow::Array` objects are backed by. ### Are these changes tested? Existing tests used. ### Are there any user-facing changes? No. * Closes: apache#36614 Authored-by: Sarah Gilmore <sgilmore@mathworks.com> Signed-off-by: Kevin Gurney <kgurney@mathworks.com>
After merging your PR, Conbench analyzed the 6 benchmarking runs that have been run so far on merge-commit d8a3360. There were no benchmark performance regressions. 🎉 The full Conbench report has more details. It also includes information about possible false positives for unstable benchmarks that are known to sometimes produce them. |
…row.tabular.RecordBatch` (#37015) ### Rationale for this change In our initial implementation of the MATLAB class `arrow.tabular.RecordBatch`, we included a property property called `ArrowArrays`, which is a `cell` array whose elements are scalar `arrow.array.Array` objects. These arrays correspond to the columns in the `arrow::RecordBatch` that the MATLAB class wraps. The purpose of `ArrowArrays` was to enable zero-copy construction of the `arrow::Array` objects backing the `arrow::RecordBatch` from MATLAB arrays. The `ArrowArrays` property was necessary to ensure the MATLAB data from which the `arrow::Array` columns were constructed don't get deallocated before they are done being used. However, we no longer need the `ArrowArrays` property on `arrow.tabular.RecordBatch` because of #36615, in which we implemented `arrow::matlab::buffer::MatlabBuffer`. This class inherits from `arrow::Buffer` and stores a reference to the MATLAB data it wraps, ensuring that the wrapped MATLAB data is kept alive as long as the buffer is around. We now only create `arrow::Array` objects from `arrow::matlab::buffer::MatlabBuffer` objects - instead of `arrow::Buffer` objects. As a result, the `ArrowArrays` property is no longer necessary on `arrow.tabular.RecordBatch` because the `arrow::Array` columns within the `arrow::RecordBatch` are all backed by `arrow::matlab::buffer::MatlabBuffer` objects. The backing MATLAB data is kept alive as long as the arrays and their buffers are kept alive. ### What changes are included in this PR? 1. Removed the `ArrowArrays` property from `arrow.tabular.RecordBatch`. 2. Because `ArrowArrays` is no longer a property, we had to add a new method to `arrow::matlab::tabular::proxy::RecordBatch` called `getColumnByIndex()`. This method creates a proxy object around the specified `arrrow::Array`. The MATLAB method `RecordBatch/column(index)` uses `getColumnByIndex()`. 3. Added a new function called `wrap`, which accepts an `arrow::Array` and returns an `arrow::matlab::array::proxy::Array`. While working on `getColumnByIndex()`, we realized there would be multiple places in the interface where we will need to create proxies around `arrow::Array` objects. We wrote the `wrap` utility function to reduce duplicated code in the future. Currently, only `getColumnByIndex()` utilizes `wrap`. ### Are these changes tested? Yes. 1. The existing unit tests in `tRecordBatch.m` cover these changes. 2. Added a few more test cases to `tRecordBatch.m` to cover error conditions when a bad `index` value is provided to `arrow.tabular.RecordBatch/column(index)`. ### Are there any user-facing changes? No. ### Future Directions 1. In a followup PR, we will add support for indexing columns by names (similar to `Schema.field(fieldName)` - #37013). 2. We also plan on adding an index validation function within the `arrow.internal.validate` package. 3. We also plan on adding a convenience constructor `arrow.recordbatch()` and changing the constructor of `arrow.tabular.RecordBatch` to expect a `libmexclass.proxy.Proxy` object. * Closes: #37012 Authored-by: Sarah Gilmore <sgilmore@mathworks.com> Signed-off-by: Kevin Gurney <kgurney@mathworks.com>
…acking arrow::Arrays alive (apache#36615) ### Rationale for this change When building `arrow.array.<Numeric>Arrays` from native MATLAB arrays, we avoid copying by 1. Wrapping the MATLAB data inside a non-owning `arrow::Buffer` 2. Constructing an `arrow::ArrayData` object from the `arrow::Buffer` 3. Constructing the `arrow::Array` from the `arrow::Data` object. Because the `Array`'s underlying `Buffer` does not have ownership of its backing data, we have been storing the original MATLAB array as a property called `MatlabArray` on the MATLAB `arrow.array.NumericArray` class. This solution is not ideal because the backing MATLAB array is kept separate from the actual `std::shared_ptr<arrow::Array>`, which is stored within the C++ proxy objects (e.g. `arrow::matlab::array::proxy::NumericArray<float64>`). A better solution would be to create a new subclass of `arrow::Buffer` called `MatlabBuffer`, which will keep the original MATLAB array alive by taking it in as an input parameter and storing it as a member variable. ### What changes are included in this PR? 1. Removed the `MatlabArray` property from the MATLAB class `matlab.io.NumericArray` 2. Added a private member variable called `mda_array` to `arrow::matlab::array::proxy::NumericArray<CType>` and `arrow::matlab::array::proxy::TimestampArray`. `mda_array` is a `matlab::data::Array` object. This member variable stores the MATLAB array that owns the memory the `arrow::Array` objects are backed by. ### Are these changes tested? Existing tests used. ### Are there any user-facing changes? No. * Closes: apache#36614 Authored-by: Sarah Gilmore <sgilmore@mathworks.com> Signed-off-by: Kevin Gurney <kgurney@mathworks.com>
…om `arrow.tabular.RecordBatch` (apache#37015) ### Rationale for this change In our initial implementation of the MATLAB class `arrow.tabular.RecordBatch`, we included a property property called `ArrowArrays`, which is a `cell` array whose elements are scalar `arrow.array.Array` objects. These arrays correspond to the columns in the `arrow::RecordBatch` that the MATLAB class wraps. The purpose of `ArrowArrays` was to enable zero-copy construction of the `arrow::Array` objects backing the `arrow::RecordBatch` from MATLAB arrays. The `ArrowArrays` property was necessary to ensure the MATLAB data from which the `arrow::Array` columns were constructed don't get deallocated before they are done being used. However, we no longer need the `ArrowArrays` property on `arrow.tabular.RecordBatch` because of apache#36615, in which we implemented `arrow::matlab::buffer::MatlabBuffer`. This class inherits from `arrow::Buffer` and stores a reference to the MATLAB data it wraps, ensuring that the wrapped MATLAB data is kept alive as long as the buffer is around. We now only create `arrow::Array` objects from `arrow::matlab::buffer::MatlabBuffer` objects - instead of `arrow::Buffer` objects. As a result, the `ArrowArrays` property is no longer necessary on `arrow.tabular.RecordBatch` because the `arrow::Array` columns within the `arrow::RecordBatch` are all backed by `arrow::matlab::buffer::MatlabBuffer` objects. The backing MATLAB data is kept alive as long as the arrays and their buffers are kept alive. ### What changes are included in this PR? 1. Removed the `ArrowArrays` property from `arrow.tabular.RecordBatch`. 2. Because `ArrowArrays` is no longer a property, we had to add a new method to `arrow::matlab::tabular::proxy::RecordBatch` called `getColumnByIndex()`. This method creates a proxy object around the specified `arrrow::Array`. The MATLAB method `RecordBatch/column(index)` uses `getColumnByIndex()`. 3. Added a new function called `wrap`, which accepts an `arrow::Array` and returns an `arrow::matlab::array::proxy::Array`. While working on `getColumnByIndex()`, we realized there would be multiple places in the interface where we will need to create proxies around `arrow::Array` objects. We wrote the `wrap` utility function to reduce duplicated code in the future. Currently, only `getColumnByIndex()` utilizes `wrap`. ### Are these changes tested? Yes. 1. The existing unit tests in `tRecordBatch.m` cover these changes. 2. Added a few more test cases to `tRecordBatch.m` to cover error conditions when a bad `index` value is provided to `arrow.tabular.RecordBatch/column(index)`. ### Are there any user-facing changes? No. ### Future Directions 1. In a followup PR, we will add support for indexing columns by names (similar to `Schema.field(fieldName)` - apache#37013). 2. We also plan on adding an index validation function within the `arrow.internal.validate` package. 3. We also plan on adding a convenience constructor `arrow.recordbatch()` and changing the constructor of `arrow.tabular.RecordBatch` to expect a `libmexclass.proxy.Proxy` object. * Closes: apache#37012 Authored-by: Sarah Gilmore <sgilmore@mathworks.com> Signed-off-by: Kevin Gurney <kgurney@mathworks.com>
Rationale for this change
When building
arrow.array.<Numeric>Arrays
from native MATLAB arrays, we avoid copying byarrow::Buffer
arrow::ArrayData
object from thearrow::Buffer
arrow::Array
from thearrow::Data
object.Because the
Array
's underlyingBuffer
does not have ownership of its backing data, we have been storing the original MATLAB array as a property calledMatlabArray
on the MATLABarrow.array.NumericArray
class.This solution is not ideal because the backing MATLAB array is kept separate from the actual
std::shared_ptr<arrow::Array>
, which is stored within the C++ proxy objects (e.g.arrow::matlab::array::proxy::NumericArray<float64>
). A better solution would be to create a new subclass ofarrow::Buffer
calledMatlabBuffer
, which will keep the original MATLAB array alive by taking it in as an input parameter and storing it as a member variable.What changes are included in this PR?
MatlabArray
property from the MATLAB classmatlab.io.NumericArray
mda_array
toarrow::matlab::array::proxy::NumericArray<CType>
andarrow::matlab::array::proxy::TimestampArray
.mda_array
is amatlab::data::Array
object. This member variable stores the MATLAB array that owns the memory thearrow::Array
objects are backed by.Are these changes tested?
Existing tests used.
Are there any user-facing changes?
No.
arrow::Buffer
to keep MATLAB data backingarrow::Arrays
alive #36614