[MATLAB] Add support for indexing `RecordBatch` columns by `Field` name #37473

kevingurney · 2023-08-30T17:52:15Z

Describe the enhancement requested

Currently, arrow.tabular.Schema supports indexing by Field name. However, arrow.tabular.RecordBatch does not.

The ability to index columns in a RecordBatch by Field name would be a helpful usability improvement.

Example

>> t = array2table(rand(3))

t =

  3×3 table

     Var1       Var2       Var3  
    _______    _______    _______

    0.96489    0.95717    0.14189
    0.15761    0.48538    0.42176
    0.97059    0.80028    0.91574

>> rb = arrow.recordBatch(t)

rb = 

Var1:   [
    0.9648885351992765,
    0.15761308167754828,
    0.9705927817606157
  ]
Var2:   [
    0.9571669482429456,
    0.4853756487228412,
    0.8002804688888001
  ]
Var3:   [
    0.14188633862721534,
    0.421761282626275,
    0.9157355251890671
  ]
 
>> rb.column("Var1")

ans = 

[
  0.9648885351992765,
  0.15761308167754828,
  0.9705927817606157
]
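For releases that do not yet have name-based indexing, one possible workaround is to resolve the name to a numeric position first. This is only a sketch: it assumes `RecordBatch` exposes a `ColumnNames` property as a string array, that `column` accepts a numeric index, and that the returned array has a `toMATLAB` method. The variable names (`name`, `idx`, `col`, `values`) are illustrative.

```matlab
% Workaround sketch: look up the column position by name, then index numerically.
t = array2table(rand(3));
rb = arrow.recordBatch(t);

name = "Var1";

% Assumption: rb.ColumnNames is a string array of the column names.
idx = find(rb.ColumnNames == name, 1);
assert(~isempty(idx), 'No column named %s', name);

% Assumption: column(idx) accepts a numeric index.
col = rb.column(idx);

% Convert the Arrow array back to a MATLAB array for inspection.
values = toMATLAB(col);
```

With the requested enhancement, the lookup and numeric indexing would collapse into the single call `rb.column("Var1")`.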

Component(s)

MATLAB

…`Field` name (#37475)

### Rationale for this change

Currently, `arrow.tabular.Schema` supports indexing by `Field` name. However, `arrow.tabular.RecordBatch` does not. This pull request adds the ability to index columns in a `RecordBatch` by `Field` name.

### What changes are included in this PR?

1. Added support for indexing columns in a `RecordBatch` by `Field` name via the `column` method.

**Example**

```matlab
>> recordBatch = arrow.tabular.RecordBatch.fromArrays(...
       arrow.array([1, 2, 3]), ...
       arrow.array(["A", "B", "C"]), ...
       arrow.array([true, false, true]), ...
       ColumnNames=["A", "B", "C"] ...
   )

recordBatch =

A:   [
    1,
    2,
    3
  ]
B:   [
    "A",
    "B",
    "C"
  ]
C:   [
    true,
    false,
    true
  ]

>> recordBatch.column("B")

ans =

[
  "A",
  "B",
  "C"
]

>> recordBatch.column("C")

ans =

[
  true,
  false,
  true
]
```

2. Removed comments about vectorizing `field` method of `Schema` and `column` method of `RecordBatch`. After further consideration, we believe it would make more sense to only allow these methods to accept scalar inputs. We could revisit support for vectorization if we overload the parenthesis operator (e.g. `recordBatch(rows, columns)`) in the future to return another `RecordBatch`/`Schema` that only includes the selected columns/fields.
3. Fixed typo in `tSchema.m`.

### Are these changes tested?

Yes.

1. Added tests for indexing by column name using the `column` method to `tRecordBatch.m`.

### Are there any user-facing changes?

Yes.

1. Users can now index `RecordBatch` columns by name using the syntax `column(name)`.

### Future Directions

1. Consider overloading parentheses-based indexing on `RecordBatch` and `Schema`.

* Closes: #37473

Lead-authored-by: Kevin Gurney <kgurney@mathworks.com>
Co-authored-by: Sarah Gilmore <sgilmore@mathworks.com>
Signed-off-by: Kevin Gurney <kgurney@mathworks.com>
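Because `column` is described as accepting only scalar inputs, fetching several columns by name takes one call per name. The loop below is a minimal sketch of that pattern, reusing the `fromArrays` example from the commit message; the variable names `names` and `cols` are illustrative and not part of the Arrow API.

```matlab
% Sketch: retrieve several columns by name, one scalar call per name.
recordBatch = arrow.tabular.RecordBatch.fromArrays(...
    arrow.array([1, 2, 3]), ...
    arrow.array(["A", "B", "C"]), ...
    arrow.array([true, false, true]), ...
    ColumnNames=["A", "B", "C"] ...
);

names = ["A", "C"];             % columns to fetch (illustrative)
cols = cell(1, numel(names));   % cell array, since the Arrow array types differ

for k = 1:numel(names)
    % Each call passes a single (scalar) name, as the PR requires.
    cols{k} = recordBatch.column(names(k));
end
```

A cell array is used here because the selected columns may have different Arrow array types and so cannot be concatenated into a single homogeneous array.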

kevingurney added the Type: enhancement label Aug 30, 2023

kevingurney self-assigned this Aug 30, 2023

github-actions bot added the Component: MATLAB label Aug 30, 2023

github-actions bot mentioned this issue Aug 30, 2023

GH-37473: [MATLAB] Add support for indexing RecordBatch columns by Field name #37475

Merged

kevingurney closed this as completed in #37475 Aug 30, 2023

kevingurney added this to the 14.0.0 milestone Aug 30, 2023
