Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-37012: [MATLAB] Remove the private property ArrowArrays from arrow.tabular.RecordBatch #37012

Closed
sgilmore10 opened this issue Aug 3, 2023 · 1 comment · Fixed by #37015

Comments

@sgilmore10
Copy link
Member

sgilmore10 commented Aug 3, 2023

Describe the enhancement requested

In our initial implementation of the MATLAB class arrow.tabular.RecordBatch, we included a property property called ArrowArrays, which is a cell array whose elements are scalar arrow.array.Array objects. These arrays are the columns in the arrow::RecordBatch that the MATLAB class wraps. The purpose of ArrowArrays was to enable zero-copy construction of the arrow::Array objects backing the arrow::RecordBatch. The ArrowArrays property was necessary to ensure the MATLAB data from which the arrow::Array columns were constructed didn't go out of scope.

However, we no longer need the ArrowArrays property on arrow.tabular.RecordBatch because of #36615, in which we implemented arrow::matlab::buffer::MatlabBuffer. This class inherits from arrow::Buffer and stores a reference to the MATLAB data it wraps. Doing so ensures the wrapped MATLAB data is kept alive as long as the buffer is around.

We now only create arrow::Array objects from arrow::matlab::buffer::MatlabBuffer objects - instead of arrow::Buffer objects. As a result, the ArrowArrays property is no longer necessary on arrow.tabular.RecordBatch because the arrow::Array columns within the arrow::RecordBatch are all backed by arrow::matlab::buffer::MatlabBuffer objects. The backing MATLAB data is kept alive as long as the arrays and their buffers are alive.

Component(s)

MATLAB

@sgilmore10
Copy link
Member Author

take

@sgilmore10 sgilmore10 changed the title [MATLAB] Add utility to wrap existing arrow::Array objects within proxy objects. [MATLAB] Add utility to create Array proxies from existing arrow::Array objects. Aug 3, 2023
@sgilmore10 sgilmore10 changed the title [MATLAB] Add utility to create Array proxies from existing arrow::Array objects. [MATLAB] Remove the private ArrowArrays property from arrow.tabular.RecordBatch Aug 3, 2023
@sgilmore10 sgilmore10 changed the title [MATLAB] Remove the private ArrowArrays property from arrow.tabular.RecordBatch GH-37012: [MATLAB] Remove the private ArrowArrays property from arrow.tabular.RecordBatch Aug 4, 2023
@sgilmore10 sgilmore10 changed the title GH-37012: [MATLAB] Remove the private ArrowArrays property from arrow.tabular.RecordBatch GH-37012: [MATLAB] Remove the private property ArrowArrays from arrow.tabular.RecordBatch Aug 4, 2023
@kevingurney kevingurney added this to the 14.0.0 milestone Aug 4, 2023
kevingurney pushed a commit that referenced this issue Aug 4, 2023
…row.tabular.RecordBatch` (#37015)

### Rationale for this change

In our initial implementation of the MATLAB class `arrow.tabular.RecordBatch`, we included a property property called `ArrowArrays`, which is a `cell` array whose elements are scalar `arrow.array.Array` objects.  These arrays correspond to the columns in the `arrow::RecordBatch` that the MATLAB class wraps.  The purpose of `ArrowArrays` was to enable  zero-copy construction of the `arrow::Array` objects backing the `arrow::RecordBatch` from MATLAB arrays. The `ArrowArrays` property was necessary to ensure the MATLAB data from which the `arrow::Array` columns were constructed don't get deallocated before they are done being used.
 
However, we no longer need the `ArrowArrays` property on `arrow.tabular.RecordBatch` because of #36615, in which we implemented `arrow::matlab::buffer::MatlabBuffer`. This class inherits from `arrow::Buffer` and  stores a reference to the MATLAB data it wraps, ensuring that the wrapped MATLAB data is kept alive as long as the buffer is around. 

We now only create `arrow::Array` objects from `arrow::matlab::buffer::MatlabBuffer` objects - instead of `arrow::Buffer` objects. As a result, the `ArrowArrays` property is no longer necessary on `arrow.tabular.RecordBatch` because the `arrow::Array` columns within the `arrow::RecordBatch` are all backed by `arrow::matlab::buffer::MatlabBuffer` objects. The backing MATLAB data is kept alive as long as the arrays and their buffers are kept alive. 

### What changes are included in this PR?

1. Removed the `ArrowArrays` property from `arrow.tabular.RecordBatch`. 
2. Because `ArrowArrays` is no longer a property, we had to add a new method to `arrow::matlab::tabular::proxy::RecordBatch` called `getColumnByIndex()`. This method creates a proxy object around the specified `arrrow::Array`. The MATLAB method `RecordBatch/column(index)` uses `getColumnByIndex()`.
3. Added a new function called `wrap`, which accepts an `arrow::Array` and returns an `arrow::matlab::array::proxy::Array`. While working on `getColumnByIndex()`, we realized there would be multiple places in the interface where we will need to create proxies around `arrow::Array` objects. We wrote the `wrap` utility function to reduce duplicated code in the future. Currently, only `getColumnByIndex()` utilizes `wrap`.

### Are these changes tested?

Yes.

1. The existing unit tests in `tRecordBatch.m` cover these changes.
2. Added a few more test cases to `tRecordBatch.m` to cover error conditions when a bad `index` value is provided to `arrow.tabular.RecordBatch/column(index)`.

### Are there any user-facing changes?

No.

### Future Directions

1. In a followup PR, we will add support for indexing columns by names (similar to `Schema.field(fieldName)` - #37013).
2. We also plan on adding an index validation function within the `arrow.internal.validate` package.  
3. We also plan on adding a convenience constructor `arrow.recordbatch()` and changing the constructor of `arrow.tabular.RecordBatch` to expect a `libmexclass.proxy.Proxy` object. 

* Closes: #37012 

Authored-by: Sarah Gilmore <sgilmore@mathworks.com>
Signed-off-by: Kevin Gurney <kgurney@mathworks.com>
loicalleyne pushed a commit to loicalleyne/arrow that referenced this issue Nov 13, 2023
…om `arrow.tabular.RecordBatch` (apache#37015)

### Rationale for this change

In our initial implementation of the MATLAB class `arrow.tabular.RecordBatch`, we included a property property called `ArrowArrays`, which is a `cell` array whose elements are scalar `arrow.array.Array` objects.  These arrays correspond to the columns in the `arrow::RecordBatch` that the MATLAB class wraps.  The purpose of `ArrowArrays` was to enable  zero-copy construction of the `arrow::Array` objects backing the `arrow::RecordBatch` from MATLAB arrays. The `ArrowArrays` property was necessary to ensure the MATLAB data from which the `arrow::Array` columns were constructed don't get deallocated before they are done being used.
 
However, we no longer need the `ArrowArrays` property on `arrow.tabular.RecordBatch` because of apache#36615, in which we implemented `arrow::matlab::buffer::MatlabBuffer`. This class inherits from `arrow::Buffer` and  stores a reference to the MATLAB data it wraps, ensuring that the wrapped MATLAB data is kept alive as long as the buffer is around. 

We now only create `arrow::Array` objects from `arrow::matlab::buffer::MatlabBuffer` objects - instead of `arrow::Buffer` objects. As a result, the `ArrowArrays` property is no longer necessary on `arrow.tabular.RecordBatch` because the `arrow::Array` columns within the `arrow::RecordBatch` are all backed by `arrow::matlab::buffer::MatlabBuffer` objects. The backing MATLAB data is kept alive as long as the arrays and their buffers are kept alive. 

### What changes are included in this PR?

1. Removed the `ArrowArrays` property from `arrow.tabular.RecordBatch`. 
2. Because `ArrowArrays` is no longer a property, we had to add a new method to `arrow::matlab::tabular::proxy::RecordBatch` called `getColumnByIndex()`. This method creates a proxy object around the specified `arrrow::Array`. The MATLAB method `RecordBatch/column(index)` uses `getColumnByIndex()`.
3. Added a new function called `wrap`, which accepts an `arrow::Array` and returns an `arrow::matlab::array::proxy::Array`. While working on `getColumnByIndex()`, we realized there would be multiple places in the interface where we will need to create proxies around `arrow::Array` objects. We wrote the `wrap` utility function to reduce duplicated code in the future. Currently, only `getColumnByIndex()` utilizes `wrap`.

### Are these changes tested?

Yes.

1. The existing unit tests in `tRecordBatch.m` cover these changes.
2. Added a few more test cases to `tRecordBatch.m` to cover error conditions when a bad `index` value is provided to `arrow.tabular.RecordBatch/column(index)`.

### Are there any user-facing changes?

No.

### Future Directions

1. In a followup PR, we will add support for indexing columns by names (similar to `Schema.field(fieldName)` - apache#37013).
2. We also plan on adding an index validation function within the `arrow.internal.validate` package.  
3. We also plan on adding a convenience constructor `arrow.recordbatch()` and changing the constructor of `arrow.tabular.RecordBatch` to expect a `libmexclass.proxy.Proxy` object. 

* Closes: apache#37012 

Authored-by: Sarah Gilmore <sgilmore@mathworks.com>
Signed-off-by: Kevin Gurney <kgurney@mathworks.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
2 participants