[MATLAB] Add support for indexing RecordBatch columns by Field name #37473

Closed
kevingurney opened this issue Aug 30, 2023 · 0 comments · Fixed by #37475

Describe the enhancement requested

Currently, arrow.tabular.Schema supports indexing by Field name. However, arrow.tabular.RecordBatch does not.

The ability to index columns in a RecordBatch by Field name would be a helpful usability improvement.

Example

>> t = array2table(rand(3))

t =

  3×3 table

     Var1       Var2       Var3  
    _______    _______    _______

    0.96489    0.95717    0.14189
    0.15761    0.48538    0.42176
    0.97059    0.80028    0.91574

>> rb = arrow.recordBatch(t)

rb = 

Var1:   [
    0.9648885351992765,
    0.15761308167754828,
    0.9705927817606157
  ]
Var2:   [
    0.9571669482429456,
    0.4853756487228412,
    0.8002804688888001
  ]
Var3:   [
    0.14188633862721534,
    0.421761282626275,
    0.9157355251890671
  ]
 
>> rb.column("Var1")

ans = 

[
  0.9648885351992765,
  0.15761308167754828,
  0.9705927817606157
]

Component(s)

MATLAB

@kevingurney kevingurney self-assigned this Aug 30, 2023
kevingurney added a commit that referenced this issue Aug 30, 2023
…`Field` name (#37475)

### Rationale for this change

Currently, `arrow.tabular.Schema` supports indexing by `Field` name. However, `arrow.tabular.RecordBatch` does not.

This pull request adds the ability to index columns in a `RecordBatch` by `Field` name.

### What changes are included in this PR?

1. Added support for indexing columns in a `RecordBatch` by `Field` name via the `column` method.

**Example**
```matlab
>> recordBatch = arrow.tabular.RecordBatch.fromArrays(...
       arrow.array([1, 2, 3]), ...
       arrow.array(["A", "B", "C"]), ...
       arrow.array([true, false, true]), ...
       ColumnNames=["A", "B", "C"] ...
   )

recordBatch = 

A:   [
    1,
    2,
    3
  ]
B:   [
    "A",
    "B",
    "C"
  ]
C:   [
    true,
    false,
    true
  ]

>> recordBatch.column("B")

ans = 

[
  "A",
  "B",
  "C"
]

>> recordBatch.column("C")

ans = 

[
  true,
  false,
  true
]
``` 
2. Removed comments about vectorizing the `field` method of `Schema` and the `column` method of `RecordBatch`. After further consideration, we believe it makes more sense for these methods to accept only scalar inputs. We could revisit support for vectorization if we overload the parenthesis operator (e.g. `recordBatch(rows, columns)`) in the future to return another `RecordBatch`/`Schema` that includes only the selected columns/fields.
3. Fixed typo in `tSchema.m`.
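
Since `column` accepts only a scalar input (item 2 above), selecting several columns by name currently means calling it once per name. The following is a minimal sketch of that pattern, not part of the PR itself; the variable names `names` and `selected` are hypothetical, and it assumes `toMATLAB` is available on the returned Arrow arrays.

```matlab
% Build a RecordBatch from a MATLAB table, as in the issue description.
t = array2table(rand(3));
rb = arrow.recordBatch(t);

% Hypothetical selection of column names to fetch.
names = ["Var1", "Var3"];

% column takes one scalar name per call, so loop over the names.
selected = cell(1, numel(names));
for ii = 1:numel(names)
    selected{ii} = rb.column(names(ii));
end

% Convert the first selected Arrow array back to a MATLAB column vector
% (assumes toMATLAB is supported for this array type).
var1 = toMATLAB(selected{1});
```

A cell array is used here because the selected columns can have different Arrow array types.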

### Are these changes tested?

Yes.

1. Added tests for indexing by column name using the `column` method to `tRecordBatch.m`.

### Are there any user-facing changes?

Yes.

1. Users can now index `RecordBatch` columns by name using the syntax `column(name)`.
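
As a rough illustration of the new syntax, the sketch below retrieves the same column once by name and once by numeric position and checks that the two results hold the same values. It assumes that `column` also accepts a scalar numeric index and that `toMATLAB` is available on the returned arrays; the variable names are hypothetical.

```matlab
% Two-column RecordBatch built with fromArrays, as in the example above.
rb = arrow.tabular.RecordBatch.fromArrays(...
    arrow.array([1, 2, 3]), ...
    arrow.array(["A", "B", "C"]), ...
    ColumnNames=["A", "B"]);

% Look up the second column by Field name and by position.
byName  = rb.column("B");
byIndex = rb.column(2);   % assumes scalar numeric indexing is supported

% Compare the values after converting both to MATLAB arrays.
sameValues = isequal(toMATLAB(byName), toMATLAB(byIndex));
```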

### Future Directions

1. Consider overloading parentheses-based indexing on `RecordBatch` and `Schema`. 
* Closes: #37473

Lead-authored-by: Kevin Gurney <kgurney@mathworks.com>
Co-authored-by: Sarah Gilmore <sgilmore@mathworks.com>
Signed-off-by: Kevin Gurney <kgurney@mathworks.com>
@kevingurney kevingurney added this to the 14.0.0 milestone Aug 30, 2023
loicalleyne pushed a commit to loicalleyne/arrow that referenced this issue Nov 13, 2023
dgreiss pushed a commit to dgreiss/arrow that referenced this issue Feb 19, 2024