[MATLAB] Add MATLAB arrow.tabular.RecordBatch class #36072
Comments
take
kevingurney added a commit to mathworks/arrow that referenced this issue Jun 20, 2023
kou added a commit that referenced this issue Jun 23, 2023
### Rationale for this change

Now that the MATLAB interface supports some basic `arrow.array.Array` types, it would be helpful to start building out the tabular types (e.g. `RecordBatch` and `Table`) in parallel.

This pull request contains a basic implementation of `arrow.tabular.RecordBatch` (name subject to change).

### What changes are included in this PR?

1. Added new `arrow.tabular.RecordBatch` class that can be constructed from a MATLAB `table`.
2. Added new test class `tRecordBatch`.

### Are these changes tested?

Yes.

1. Added new test class `tRecordBatch` containing basic tests for the `arrow.tabular.RecordBatch` class.

### Are there any user-facing changes?

Yes.

1. Added new class `arrow.tabular.RecordBatch`.

**Example**:

```matlab
>> matlabTable = table(uint64([1,2,3]'), [true false true]', [0.1, 0.2, 0.3]', VariableNames=["UInt64", "Boolean", "Float64"])

matlabTable =

  3x3 table

    UInt64    Boolean    Float64
    ______    _______    _______

      1        true        0.1
      2        false       0.2
      3        true        0.3

>> arrowRecordBatch = arrow.tabular.RecordBatch(matlabTable)

arrowRecordBatch =

UInt64:   [ 1, 2, 3 ]
Boolean:   [ true, false, true ]
Float64:   [ 0.1, 0.2, 0.3 ]

>> convertedMatlabTable = table(arrowRecordBatch)

convertedMatlabTable =

  3x3 table

    UInt64    Boolean    Float64
    ______    _______    _______

      1        true        0.1
      2        false       0.2
      3        true        0.3

>> isequal(matlabTable, convertedMatlabTable)

ans =

  logical

   1
```

2. Added properties `NumColumns` and `ColumnNames` to `arrow.tabular.RecordBatch`.

**Example**:

```matlab
>> arrowRecordBatch.NumColumns

ans =

  int32

   3

>> arrowRecordBatch.ColumnNames

ans =

  1x3 string array

    "UInt64"    "Boolean"    "Float64"
```

3. Added `column(i)` method to `arrow.tabular.RecordBatch` to retrieve the `i`th column of a `RecordBatch` as an `arrow.array.Array`.

**Example**:

```matlab
>> arrowUInt64Array = arrowRecordBatch.column(1)

arrowUInt64Array =

[ 1, 2, 3 ]

>> class(arrowUInt64Array)

ans =

    'arrow.array.UInt64Array'

>> arrowBooleanArray = arrowRecordBatch.column(2)

arrowBooleanArray =

[ true, false, true ]

>> class(arrowBooleanArray)

ans =

    'arrow.array.BooleanArray'

>> arrowFloat64Array = arrowRecordBatch.column(3)

arrowFloat64Array =

[ 0.1, 0.2, 0.3 ]

>> class(arrowFloat64Array)

ans =

    'arrow.array.Float64Array'
```

4. Added `toMATLAB` and `table` conversion methods to convert from a `RecordBatch` to a MATLAB `table`.

### Future Directions

1. Implement C++ logic for `toMATLAB` when the Arrow memory for a `RecordBatch` did not originate from a MATLAB array (e.g. read from a Parquet file or somewhere else).
2. Add more supported construction interfaces (e.g. `arrow.tabular.RecordBatch(array1, ..., arrayN)`, `arrow.tabular.RecordBatch.fromArrays(arrays)`, etc.).
3. Create an `arrow.tabular.Schema` class. Expose this as a public property on the `RecordBatch` class. Create related `arrow.type.Field` and `arrow.type.Type` classes.
4. Create an `arrow.tabular.Table` and related `arrow.array.ChunkedArray` class.
5. Add more `arrow.array.Array` types (e.g. `StringArray`, `TimestampArray`, `Time64Array`).
6. Create a basic workflow example of serializing a `RecordBatch` to disk using an I/O function (e.g. Parquet writing).

### Notes

1. Thanks @sgilmore10 for your help with this pull request!
2. While writing the tests for `RecordBatch`, we stumbled upon a set of accidentally committed diff markers in `tUInt64Array.m`. We removed these diff markers in this PR to unblock the `RecordBatch` tests. Unfortunately, this wasn't caught earlier because MATLAB was silently ignoring the test file `tUInt64Array.m`, which had a syntax error in it. We could choose to explicitly list out all test files in the MATLAB CI workflows to try to avoid similar situations in the future, but this might get unwieldy to maintain over time as we add more tests. We are happy to hear any suggestions from other community members related to this topic.

* Closes: #36072

Lead-authored-by: Kevin Gurney <kgurney@mathworks.com>
Co-authored-by: Kevin Gurney <kevin.p.gurney@gmail.com>
Co-authored-by: Sarah Gilmore <sgilmore@mathworks.com>
Co-authored-by: Sutou Kouhei <kou@cozmixng.org>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>
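
One possible alternative to hand-listing every test file, offered only as a rough sketch and not as what the project actually does: compare the `t*.m` files on disk with the test classes that the MATLAB unit testing framework actually loaded, and fail when a file is missing from the suite. The folder name `test` and the variable names are assumptions made for illustration. The sketch also assumes that all tests are class-based, that each file name matches its class name, and that a file with a syntax error is left out of the suite, as the note above describes.

```matlab
% Hypothetical CI guard (a sketch, not part of this pull request).
% Assumption: class-based tests live under "test" in files named t*.m.
testFolder = "test";

% Test files found on disk, with the ".m" extension stripped.
files = dir(fullfile(testFolder, "**", "t*.m"));
namesOnDisk = string(regexprep({files.name}, '\.m$', ''));

% Test classes that the framework actually loaded into the suite.
suite = matlab.unittest.TestSuite.fromFolder(testFolder, IncludingSubfolders=true);
loadedNames = unique(string([suite.TestClass]));

% Any file on disk without a matching loaded class was skipped.
notLoaded = setdiff(namesOnDisk, loadedNames);
if ~isempty(notLoaded)
    error("Test files not loaded by the test runner: %s", strjoin(notLoaded, ", "));
end
```

Running a check like this before the tests themselves would turn a silently skipped file into a CI failure, without a manually maintained file list.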
### Describe the enhancement requested

Now that the MATLAB interface supports some basic `arrow.array.Array` types, it would be helpful to start building out the tabular types (i.e. `RecordBatch` and `Table`) in parallel.

To start, we could create a basic implementation for `arrow.tabular.RecordBatch` (name subject to change).

After that, we could add `arrow.tabular.Table`, which users could construct from multiple `RecordBatch` objects.

### Component(s)

MATLAB