GH-37042: [MATLAB] Implement Feather V1 Writer using new MATLAB Interface APIs #37043

Merged
merged 17 commits into from
Aug 7, 2023

Conversation

sgilmore10
Member

@sgilmore10 sgilmore10 commented Aug 7, 2023

Rationale for this change

Now that we have the basic building blocks for tabular IO in the MATLAB Interface (Array, Schema, RecordBatch), we can implement a Feather V1 writer in terms of the new APIs.

This is the first in a series of pull requests that will replace the legacy Feather V1 infrastructure with a new implementation that uses the MATLAB Interface APIs. A side effect of this work is that we can eventually delete a lot of legacy build infrastructure and code.

What changes are included in this PR?

  1. Added a new class called arrow.internal.io.feather.Writer, which can be used to write Feather V1 files. It has one public property named Filename and one public method named write.

Below is an example of its usage:

>> T = table([1; 2; 3], single([10; 11; 12]))

T =

  3×2 table

    Var1    Var2
    ____    ____

     1       10 
     2       11 
     3       12 

>> filename = "/tmp/table.feather";
>> writer = arrow.internal.io.feather.Writer(filename)

writer = 

  Writer with properties:

    Filename: "/tmp/table.feather"

>> writer.write(T);

  2. Added an unwrap method to proxy::RecordBatch so that the FeatherWriter::write method can access the underlying RecordBatch from the proxy.
  3. Changed the SetAccess and GetAccess of the Proxy property on arrow.tabular.RecordBatch to private and public, respectively.

Are these changes tested?

Yes, added a new test file called tRoundTrip.m in the matlab/test/arrow/io/feather folder.

Are there any user-facing changes?

No.

Future Directions

  1. Add a new class for reading feather V1 files (See [MATLAB] Implement Feather V1 Reader using new MATLAB Interface APIs #37041).
  2. Integrate this class in the public featherwrite function.
  3. Once this class is integrated with featherwrite, we can delete the legacy build infrastructure and source code.

@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting review Awaiting review labels Aug 7, 2023
@github-actions github-actions bot added awaiting merge Awaiting merge and removed awaiting changes Awaiting changes labels Aug 7, 2023
@github-actions github-actions bot added awaiting changes Awaiting changes awaiting change review Awaiting change review and removed awaiting merge Awaiting merge awaiting changes Awaiting changes labels Aug 7, 2023
@sgilmore10 sgilmore10 changed the title GH-37042: Implement Feather V1 Writer using new MATLAB Interface APIs GH-37042: [MATLAB] Implement Feather V1 Writer using new MATLAB Interface APIs Aug 7, 2023
@sgilmore10
Member Author

The Dev / Lint job failed due to an issue in /arrow/cpp/src/arrow/acero/aggregate_internal.h. This is not related to my changes.

@kevingurney
Member

+1

@kevingurney kevingurney merged commit 71329ce into apache:main Aug 7, 2023
8 of 9 checks passed
@kevingurney kevingurney removed the awaiting change review Awaiting change review label Aug 7, 2023
kevingurney added a commit that referenced this pull request Aug 7, 2023
…face APIs (#37044)

### Rationale for this change

Now that we have the basic building blocks for tabular IO in the MATLAB Interface (`Array`, `Schema`, `RecordBatch`), we can implement a Feather V1 reader in terms of the new APIs.

This is a follow up to #37043, where a new Feather V1 internal `Writer` object was added.

### What changes are included in this PR?

1. Added a new class called `arrow.internal.io.feather.Reader` which can be used to read Feather V1 files. It has one public property named `Filename` and one public method named `read`.

**Example Usage:**

```matlab
>> T = array2table(rand(3))       

T =

  3x3 table

     Var1        Var2       Var3  
    _______    ________    _______

    0.79221    0.035712    0.67874
    0.95949     0.84913    0.75774
    0.65574     0.93399    0.74313

>> filename = "test.feather";

>> featherwrite(filename, T)

>> reader = arrow.internal.io.feather.Reader(filename)

reader = 

  Reader with properties:

    Filename: "test.feather"

>> T = reader.read()

T =

  3x3 table

     Var1        Var2       Var3  
    _______    ________    _______

    0.79221    0.035712    0.67874
    0.95949     0.84913    0.75774
    0.65574     0.93399    0.74313
```

### Are these changes tested?

Yes.

1. Added `Reader` to `feather/tRoundTrip.m`.
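
For illustration, a round trip through the two internal classes might look like the sketch below. The contents of `tRoundTrip.m` are not shown in this conversation, so this is only an assumption about the general shape of such a check, not the actual test. It assumes the Arrow MATLAB interface is built and on the MATLAB path; the temporary file name is arbitrary.

```matlab
% Sketch of a Feather V1 round trip using the internal Writer and Reader.
% Assumes the Arrow MATLAB interface is built and on the MATLAB path.
T = table([1; 2; 3], single([10; 11; 12]));

% Temporary output file; the name itself is arbitrary
filename = string(tempname) + ".feather";

% Write the table with the internal Writer
writer = arrow.internal.io.feather.Writer(filename);
writer.write(T);

% Read it back with the internal Reader
reader = arrow.internal.io.feather.Reader(filename);
T2 = reader.read();

% The table read back should match the table that was written
roundTripOk = isequal(T, T2);

% Clean up the temporary file
delete(filename);
```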

### Are there any user-facing changes?

No.

These are only internal objects right now. 

### Future Directions

1. Re-implement `featherread` in terms of the new `Reader` object.
2. Remove legacy feather code and infrastructure.

### Notes

1. For conciseness, I renamed the C++ Proxy class `FeatherWriter` to `Writer` since it is already inside of a `feather` namespace / "package".
* Closes: #37041

Authored-by: Kevin Gurney <kgurney@mathworks.com>
Signed-off-by: Kevin Gurney <kgurney@mathworks.com>
kevingurney pushed a commit that referenced this pull request Aug 7, 2023
…io.feather.Writer (#37047)

### Rationale for this change

Now that #37043 is merged, we can re-implement `featherwrite` in terms of the new `arrow.internal.io.feather.Writer` class. Once this change is made, we can delete the legacy build infrastructure and `featherwrite` MEX code.

### What changes are included in this PR?

1. Re-implemented `featherwrite` using `arrow.internal.io.feather.Writer`. 

### Are these changes tested?

1. Yes, the existing tests in `tfeather.m` cover these changes.
2. I had to update some of the expected error message IDs in `tfeather.m` because the new implementation throws errors with different IDs. 
3. `featherwrite` used to export the real part of MATLAB complex numeric arrays. The new version of `featherwrite` now errors if the input table contains complex data because feather/Arrow itself does not support complex numeric data. We think this is the right decision. Writing out only the real part is lossy.

### Are there any user-facing changes?

Yes, `featherwrite` no longer supports writing complex numeric arrays.
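
Callers who previously relied on the old behavior may want to check for complex variables before calling `featherwrite`. The sketch below is one possible approach and is not part of this change: it finds complex-valued numeric variables with `isreal` and splits each into separate real and imaginary variables so that nothing is lost. The table contents and the `_real` / `_imag` suffixes are hypothetical choices made for the example.

```matlab
% Hypothetical example table: variable B holds complex data
T = table([1; 2; 3], [1+2i; 3; 4], 'VariableNames', {'A', 'B'});

% Flag numeric variables that are stored as complex
isComplexVar = varfun(@(x) isnumeric(x) && ~isreal(x), T, ...
    'OutputFormat', 'uniform');
complexNames = T.Properties.VariableNames(isComplexVar);

% One possible workaround: store real and imaginary parts separately
% (the "_real" and "_imag" suffixes are an arbitrary naming choice)
for name = string(complexNames)
    T.(name + "_real") = real(T.(name));
    T.(name + "_imag") = imag(T.(name));
    T.(name) = [];   % drop the original complex variable
end
```

After this step every variable in `T` is real-valued, so the table no longer triggers the complex-data error described above.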

### Future Directions

1. Once this PR is merged, we will remove the legacy build infrastructure and MEX code. 
* Closes: #37045

Authored-by: Sarah Gilmore <sgilmore@mathworks.com>
Signed-off-by: Kevin Gurney <kgurney@mathworks.com>
@conbench-apache-arrow

After merging your PR, Conbench analyzed the 5 benchmarking runs that have been run so far on merge-commit 71329ce.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about possible false positives for unstable benchmarks that are known to sometimes produce them.

@sgilmore10 sgilmore10 deleted the GH-37042 branch August 21, 2023 18:12
loicalleyne pushed a commit to loicalleyne/arrow that referenced this pull request Nov 13, 2023
… Interface APIs (apache#37043)

loicalleyne pushed a commit to loicalleyne/arrow that referenced this pull request Nov 13, 2023
… Interface APIs (apache#37044)

loicalleyne pushed a commit to loicalleyne/arrow that referenced this pull request Nov 13, 2023
…ernal.io.feather.Writer (apache#37047)

Development

Successfully merging this pull request may close these issues.

[MATLAB] Implement Feather V1 Writer using new MATLAB Interface APIs
2 participants