GH-37041: [MATLAB] Implement Feather V1 Reader using new MATLAB Interface APIs #37044

Merged
merged 11 commits into from
Aug 7, 2023


kevingurney
Member

@kevingurney kevingurney commented Aug 7, 2023

Rationale for this change

Now that we have the basic building blocks for tabular IO in the MATLAB Interface (Array, Schema, RecordBatch), we can implement a Feather V1 reader in terms of the new APIs.

This is a follow-up to #37043, where a new Feather V1 internal Writer object was added.

What changes are included in this PR?

  1. Added a new class called arrow.internal.io.feather.Reader which can be used to read Feather V1 files. It has one public property named Filename and one public method named read.

Example Usage:

>> T = array2table(rand(3))       

T =

  3x3 table

     Var1        Var2       Var3  
    _______    ________    _______

    0.79221    0.035712    0.67874
    0.95949     0.84913    0.75774
    0.65574     0.93399    0.74313

>> filename = "test.feather";

>> featherwrite(filename, T)

>> reader = arrow.internal.io.feather.Reader(filename)

reader = 

  Reader with properties:

    Filename: "test.feather"

>> T = reader.read()

T =

  3x3 table

     Var1        Var2       Var3  
    _______    ________    _______

    0.79221    0.035712    0.67874
    0.95949     0.84913    0.75774
    0.65574     0.93399    0.74313

Are these changes tested?

Yes.

  1. Added Reader to feather/tRoundTrip.m.
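
A round-trip test along these lines might look like the following. This is only a sketch, not the contents of feather/tRoundTrip.m: the class name tReaderRoundTripSketch is hypothetical, and it writes with featherwrite (as in the example usage above) rather than with the internal Writer, whose API is not shown here.

```matlab
classdef tReaderRoundTripSketch < matlab.unittest.TestCase
    % Hypothetical test class (name and structure are assumptions).

    methods (Test)
        function roundTripDoubleTable(testCase)
            % Use a temporary folder so the test leaves no files behind.
            fixture = testCase.applyFixture( ...
                matlab.unittest.fixtures.TemporaryFolderFixture);
            filename = fullfile(fixture.Folder, "test.feather");

            % Write a small table to a Feather V1 file.
            expected = array2table(rand(3));
            featherwrite(filename, expected);

            % Read it back with the new internal Reader.
            reader = arrow.internal.io.feather.Reader(filename);
            actual = reader.read();

            % The table read back should match the table written.
            testCase.verifyEqual(actual, expected);
        end
    end
end
```

Saved as tReaderRoundTripSketch.m on the MATLAB path, it could be run with `runtests("tReaderRoundTripSketch")`.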

Are there any user-facing changes?

No.

These are only internal objects right now.

Future Directions

  1. Re-implement featherread in terms of the new Reader object.
  2. Remove legacy feather code and infrastructure.

Notes

  1. For conciseness, I renamed the C++ Proxy class FeatherWriter to Writer since it is already inside of a feather namespace / "package".

@github-actions github-actions bot added the awaiting changes label and removed the awaiting committer review label Aug 7, 2023
@kevingurney
Member Author

Linting failures appear unrelated to these changes.

@kevingurney
Member Author

+1

@kevingurney kevingurney merged commit 152be67 into apache:main Aug 7, 2023
8 of 9 checks passed
@kevingurney kevingurney removed the awaiting changes label Aug 7, 2023
@kevingurney kevingurney deleted the GH-37041 branch August 7, 2023 20:27
@conbench-apache-arrow

After merging your PR, Conbench analyzed the 5 benchmarking runs that have been run so far on merge-commit 152be67.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details.

kevingurney added a commit that referenced this pull request Aug 14, 2023
…l.io.feather.Reader` (#37163)

### Rationale for this change

Now that #37044 is merged, we can re-implement `featherread` in terms of the new `arrow.internal.io.feather.Reader` class.

Once this change is made, we can delete the legacy build infrastructure and `featherread` MEX code.

### What changes are included in this PR?

1. Reimplemented `featherread` in terms of the new `arrow.internal.io.feather.Reader` class.
2. We tried to maintain compatibility with the old code as much as possible, but since `featherread` is now implemented in terms of `RecordBatch`, there are some minor changes in behavior and support for some new data types (e.g. `Boolean`, `String`, `Timestamp`) that are introduced by these changes.
3. Updated `arrow/matlab/io/feather/proxy/reader.cc` to use `Table::CombineChunksToBatch` rather than a `TableBatchReader`. This prevents a `nullptr` dereference that was occurring when reading a Feather V1 file created from an empty table.
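
The empty-table case from item 3 can be exercised from MATLAB roughly as follows. This is a sketch under assumptions: the description does not say whether "empty" means zero rows, zero variables, or both, so the snippet uses a zero-by-zero table and assumes `featherwrite` accepts it. The file name `empty.feather` is arbitrary.

```matlab
% Write a table with no rows and no variables to a Feather V1 file.
tEmpty = table();
filename = fullfile(tempdir, "empty.feather");
featherwrite(filename, tEmpty);

% Before this change, reading such a file could hit a nullptr
% dereference in the C++ layer. With the fix, the read is expected
% to complete normally.
tRead = featherread(filename);

% Inspect the result rather than assuming its exact size.
disp(size(tRead))
```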

**Example**
```matlab
>> tWrite = table(["A"; "B"; "C"], [true; false; true], [1; 2; 3], VariableNames=["String", "Boolean", "Float64"])

tWrite =

  3x3 table

    String    Boolean    Float64
    ______    _______    _______

     "A"       true         1   
     "B"       false        2   
     "C"       true         3   

>> featherwrite("test.feather", tWrite)

>> tRead = featherread("test.feather")

tRead =

  3x3 table

    String    Boolean    Float64
    ______    _______    _______

     "A"       true         1   
     "B"       false        2   
     "C"       true         3   

>> isequaln(tWrite, tRead)

ans =

  logical

   1
```

### Are these changes tested?

Yes.

1. Updated the existing `tfeather.m` and `tfeathermex.m` tests to reflect the new behavior of `featherread`. This mainly consists of error message ID changes.
2. Added a new test to verify that all MATLAB types supported by `arrow.tabular.RecordBatch` can be round-tripped to a Feather V1 file.
3. Added a new test to verify that a MATLAB `table` with Unicode `VariableNames` can be round-tripped to a Feather V1 file.
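
A Unicode variable name round trip could be checked by hand with something like the sketch below. The variable names and the file name are illustrative choices, not taken from the actual tests.

```matlab
% Table whose variable names contain non-ASCII characters
% (the names here are arbitrary examples).
tWrite = table([1; 2; 3], ["A"; "B"; "C"], ...
    VariableNames=["数値", "Größe"]);

filename = fullfile(tempdir, "unicode.feather");
featherwrite(filename, tWrite);
tRead = featherread(filename);

% Compare the variable names explicitly, then the full tables.
namesMatch = isequal(tWrite.Properties.VariableNames, ...
    tRead.Properties.VariableNames);
tablesMatch = isequaln(tWrite, tRead);
```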

### Are there any user-facing changes?

Yes.

1. Now that `featherread` is implemented in terms of `arrow.internal.io.feather.Reader` and `arrow.tabular.RecordBatch`, it supports reading more types like `Boolean`, `String`, `Timestamp`, etc. **Note**: We updated the code to cast `logical`/`Boolean` type columns containing null values to `double` and substitute null values with `NaN`. This mirrors the existing behavior of `featherread` for integer type columns containing null values. 
2. There are some minor error message ID changes.
3. Cell arrays of strings with a single element (e.g. `{'filename.feather'}`) are now supported as a valid `filename` for `featherread`.
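
The sketch below illustrates two of the points above. The first part models the null-to-`NaN` substitution in plain MATLAB; the `isValid` mask is a hypothetical stand-in for Arrow validity information, not how `featherread` is implemented internally. The second part passes a one-element cell array as the `filename`.

```matlab
% Part 1: the null-to-NaN substitution, modeled in plain MATLAB.
values  = [true; false; true];   % logical column data
isValid = [true; false; true];   % hypothetical mask; false marks a null
asDouble = double(values);       % logical cannot hold NaN, so cast first
asDouble(~isValid) = NaN;        % null entries become NaN
disp(asDouble)                   % displays 1, NaN, 1

% Part 2: a one-element cell array of character vectors as the filename.
filename = fullfile(tempdir, 'test.feather');
featherwrite(filename, table([1; 2; 3]));
tRead = featherread({filename});
```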

### Future Directions

1. In the future, we may want to consider no longer casting columns with integer/logical type containing null values to `double` and substituting null values with `NaN`. This behavior isn't ideal in all cases (it can be lossy for types like `uint64`). This change would break compatibility.
2. Delete legacy Feather V1 code and build infrastructure.
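
To make the lossiness mentioned in item 1 concrete, here is a small plain-MATLAB sketch. It does not involve Feather at all; it only shows that a `uint64` value above `flintmax` does not survive a cast to `double` and back.

```matlab
% 2^53 + 1 is the smallest positive integer a double cannot represent.
x = uint64(flintmax) + uint64(1);

% Casting to double rounds to the nearest representable value (2^53).
asDouble = double(x);

% Casting back does not recover the original value.
roundTripped = uint64(asDouble);
lossless = isequal(x, roundTripped);   % expected to be false
```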

### Notes

1. Thank you @ sgilmore10 for your help with this pull request!

* Closes: #37046

Authored-by: Kevin Gurney <kgurney@mathworks.com>
Signed-off-by: Kevin Gurney <kgurney@mathworks.com>
loicalleyne pushed a commit to loicalleyne/arrow that referenced this pull request Nov 13, 2023
… Interface APIs (apache#37044)

* Closes: apache#37041

Authored-by: Kevin Gurney <kgurney@mathworks.com>
Signed-off-by: Kevin Gurney <kgurney@mathworks.com>
loicalleyne pushed a commit to loicalleyne/arrow that referenced this pull request Nov 13, 2023
…nternal.io.feather.Reader` (apache#37163)
