[MATLAB] Add MATLAB arrow.tabular.RecordBatch class #36072

Closed
kevingurney opened this issue Jun 14, 2023 · 1 comment · Fixed by #36190

Comments

@kevingurney
Member

Describe the enhancement requested

Now that the MATLAB interface supports some basic arrow.array.Array types, it would be helpful to start building out the tabular types (i.e. RecordBatch and Table) in parallel.

To start, we could create a basic implementation for arrow.tabular.RecordBatch (name subject to change).

After that, we could add arrow.tabular.Table, which users could construct from multiple RecordBatch objects.

Component(s)

MATLAB

@kevingurney
Member Author

take

kevingurney added a commit to mathworks/arrow that referenced this issue Jun 20, 2023
kou added a commit that referenced this issue Jun 23, 2023
### Rationale for this change

Now that the MATLAB interface supports some basic `arrow.array.Array` types, it would be helpful to start building out the tabular types (e.g. `RecordBatch` and `Table`) in parallel.

This pull request contains a basic implementation of `arrow.tabular.RecordBatch` (name subject to change).

### What changes are included in this PR?

1. Added new `arrow.tabular.RecordBatch` class that can be constructed from a MATLAB `table`.
2. Added new test class `tRecordBatch`.

### Are these changes tested?

Yes.

1. Added new test class `tRecordBatch` containing basic tests for the `arrow.tabular.RecordBatch` class.

### Are there any user-facing changes?

Yes.

1. Added new class `arrow.tabular.RecordBatch`.

**Example**:

```matlab
>> matlabTable = table(uint64([1,2,3]'), [true false true]', [0.1, 0.2, 0.3]', VariableNames=["UInt64", "Boolean", "Float64"])

matlabTable =

  3x3 table

    UInt64    Boolean    Float64
    ______    _______    _______

      1        true        0.1  
      2        false       0.2  
      3        true        0.3  

>> arrowRecordBatch = arrow.tabular.RecordBatch(matlabTable)

arrowRecordBatch = 

UInt64:   [
    1,
    2,
    3
  ]
Boolean:   [
    true,
    false,
    true
  ]
Float64:   [
    0.1,
    0.2,
    0.3
  ]

>> convertedMatlabTable = table(arrowRecordBatch)    

convertedMatlabTable =

  3x3 table

    UInt64    Boolean    Float64
    ______    _______    _______

      1        true        0.1  
      2        false       0.2  
      3        true        0.3  

>> isequal(matlabTable, convertedMatlabTable)

ans =

  logical

   1
```

2. Added properties `NumColumns` and `ColumnNames` to `arrow.tabular.RecordBatch`:

**Example**:

```matlab
>> arrowRecordBatch.NumColumns 

ans =

  int32

   3

>> arrowRecordBatch.ColumnNames

ans = 

  1x3 string array

    "UInt64"    "Boolean"    "Float64"
```

3. Added `column(i)` method to `arrow.tabular.RecordBatch` to retrieve the `i`th column of a `RecordBatch` as an `arrow.array.Array`.

**Example**:

```matlab
>> arrowUInt64Array = arrowRecordBatch.column(1) 

arrowUInt64Array = 

[
  1,
  2,
  3
]

>> class(arrowUInt64Array)

ans =

    'arrow.array.UInt64Array'

>> arrowBooleanArray = arrowRecordBatch.column(2)

arrowBooleanArray = 

[
  true,
  false,
  true
]

>> class(arrowBooleanArray)

ans =

    'arrow.array.BooleanArray'

>> arrowFloat64Array = arrowRecordBatch.column(3)

arrowFloat64Array = 

[
  0.1,
  0.2,
  0.3
]

>> class(arrowFloat64Array)

ans =

    'arrow.array.Float64Array'
```
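
Since `NumColumns`, `ColumnNames`, and `column(i)` are all available, the columns can also be visited in a loop. The following is a minimal sketch, assuming `column` accepts any index from 1 to `NumColumns`; the variable `columnClasses` is introduced here only for illustration.

```matlab
% Build the same RecordBatch used in the examples above.
matlabTable = table(uint64([1,2,3]'), [true false true]', [0.1, 0.2, 0.3]', ...
    VariableNames=["UInt64", "Boolean", "Float64"]);
arrowRecordBatch = arrow.tabular.RecordBatch(matlabTable);

% NumColumns is an int32, so convert it before using it as a size.
numColumns = double(arrowRecordBatch.NumColumns);

% Record the Arrow array class of every column (illustrative variable).
columnClasses = strings(1, numColumns);
for i = 1:numColumns
    columnClasses(i) = class(arrowRecordBatch.column(i));
end

% Show each column name above its Arrow array class.
disp([arrowRecordBatch.ColumnNames; columnClasses]);
```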

4. Added `toMATLAB` and `table` conversion methods to convert from a `RecordBatch` to a MATLAB `table`.
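
**Example**:

A minimal sketch of both conversion paths, assuming `toMATLAB` takes no arguments beyond the `RecordBatch` itself and returns the same MATLAB `table` as the `table` conversion method. The variable names are illustrative.

```matlab
% Build the same RecordBatch used in the examples above.
matlabTable = table(uint64([1,2,3]'), [true false true]', [0.1, 0.2, 0.3]', ...
    VariableNames=["UInt64", "Boolean", "Float64"]);
arrowRecordBatch = arrow.tabular.RecordBatch(matlabTable);

% Convert back with toMATLAB (assumed to need no extra arguments).
tableFromToMATLAB = toMATLAB(arrowRecordBatch);

% Convert back with the table conversion method shown earlier.
tableFromTable = table(arrowRecordBatch);

% Check that both paths round-trip the original data.
tf = isequal(matlabTable, tableFromToMATLAB) && isequal(matlabTable, tableFromTable);
```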

### Future Directions

1. Implement C++ logic for `toMATLAB` when the Arrow memory for a `RecordBatch` did not originate from a MATLAB array (e.g. it was read from a Parquet file or somewhere else).
2. Add more supported construction interfaces (e.g. `arrow.tabular.RecordBatch(array1, ..., arrayN)`, `arrow.tabular.RecordBatch.fromArrays(arrays)`, etc.).
3. Create an `arrow.tabular.Schema` class. Expose this as a public property on the `RecordBatch` class. Create related `arrow.type.Field` and `arrow.type.Type` classes.
4. Create an `arrow.tabular.Table` and related `arrow.array.ChunkedArray` class.
5. Add more `arrow.array.Array` types (e.g. `StringArray`, `TimestampArray`, `Time64Array`).
6. Create a basic workflow example of serializing a `RecordBatch` to disk using an I/O function (e.g. Parquet writing).

### Notes

1. Thanks @ sgilmore10 for your help with this pull request!
2. While writing the tests for `RecordBatch`, we stumbled upon a set of accidentally committed diff markers in `tUInt64Array.m`. We removed these diff markers in this PR to unblock the `RecordBatch` tests. Unfortunately, this wasn't caught earlier because MATLAB was simply ignoring the test file `tUInt64Array.m`, since it contained a syntax error. We could explicitly list all test files in the MATLAB CI workflows to avoid similar situations in the future, but this might become unwieldy to maintain as we add more tests. We are happy to hear any suggestions from other community members on this topic.
* Closes: #36072

Lead-authored-by: Kevin Gurney <kgurney@mathworks.com>
Co-authored-by: Kevin Gurney <kevin.p.gurney@gmail.com>
Co-authored-by: Sarah Gilmore <sgilmore@mathworks.com>
Co-authored-by: Sutou Kouhei <kou@cozmixng.org>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>
@kou kou added this to the 13.0.0 milestone Jun 23, 2023