GH-37046: [MATLAB] Implement featherread in terms of arrow.internal.io.feather.Reader #37163

Merged: 7 commits merged into apache:main on Aug 14, 2023

@kevingurney kevingurney commented Aug 14, 2023

Rationale for this change

Now that #37044 is merged, we can re-implement featherread in terms of the new arrow.internal.io.feather.Reader class.

Once this change is made, we can delete the legacy build infrastructure and featherread MEX code.

What changes are included in this PR?

  1. Reimplemented featherread in terms of the new arrow.internal.io.feather.Reader class.
  2. We tried to maintain compatibility with the old code as much as possible, but since featherread is now implemented in terms of RecordBatch, there are some minor changes in behavior and support for some new data types (e.g. Boolean, String, Timestamp) that are introduced by these changes.
  3. Updated arrow/matlab/io/feather/proxy/reader.cc to use Table::CombineChunksToBatch rather than a TableBatchReader. This prevents a nullptr dereference that occurred when reading a Feather V1 file created from an empty table.

Example

>> tWrite = table(["A"; "B"; "C"], [true; false; true], [1; 2; 3], VariableNames=["String", "Boolean", "Float64"])

tWrite =

  3x3 table

    String    Boolean    Float64
    ______    _______    _______

     "A"       true         1   
     "B"       false        2   
     "C"       true         3   

>> featherwrite("test.feather", tWrite)

>> tRead = featherread("test.feather")

tRead =

  3x3 table

    String    Boolean    Float64
    ______    _______    _______

     "A"       true         1   
     "B"       false        2   
     "C"       true         3   

>> isequaln(tWrite, tRead)

ans =

  logical

   1

Are these changes tested?

Yes.

  1. Updated the existing tfeather.m and tfeathermex.m tests to reflect the new behavior of featherread. This mainly consists of error message ID changes.
  2. Added a new test to verify that all MATLAB types supported by arrow.tabular.RecordBatch can be round-tripped to a Feather V1 file.
  3. Added a new test to verify that a MATLAB table with Unicode VariableNames can be round-tripped to a Feather V1 file.

Are there any user-facing changes?

Yes.

  1. Now that featherread is implemented in terms of arrow.internal.io.feather.Reader and arrow.tabular.RecordBatch, it supports reading more types like Boolean, String, Timestamp, etc. Note: We updated the code to cast logical/Boolean type columns containing null values to double and substitute null values with NaN. This mirrors the existing behavior of featherread for integer type columns containing null values.
  2. There are some minor error message ID changes.
  3. Cell arrays of strings with a single element (e.g. {'filename.feather'}) are now supported as a valid filename for featherread.
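The null-handling rule in item 1 can be sketched in plain MATLAB. This is only an illustration of the described behavior, not the actual featherread code: the `values` and `valid` variables are hypothetical stand-ins for an Arrow Boolean column and its validity mask.

```matlab
% Hypothetical stand-ins for a Boolean column that contains a null.
values = [true; false; true];   % element values (the null slot's value is arbitrary)
valid  = [true; false; true];   % false marks a null element (assumed representation)

% Cast logical to double, then substitute NaN for the null elements.
result = double(values);
result(~valid) = NaN;

% result is [1; NaN; 1]; a column with no nulls could stay logical.
disp(result)
```

Because `NaN` has no logical or integer representation, the column's type changes from `logical` to `double` whenever nulls are present.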

Future Directions

  1. In the future, we may want to consider no longer casting columns with integer/logical type containing null values to double and substituting null values with NaN. This behavior isn't ideal in all cases (it can be lossy for types like uint64). This change would break compatibility.
  2. Delete legacy Feather V1 code and build infrastructure.
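The lossiness mentioned in item 1 can be seen without any Arrow code. The following is a minimal sketch in plain MATLAB, with a value chosen for illustration: a `uint64` just above 2^53 cannot be represented exactly as a `double`, so the cast that makes room for `NaN` changes the value.

```matlab
% 2^53 + 1 is the smallest positive integer a double cannot represent exactly.
x = bitshift(uint64(1), 53) + 1;   % uint64 value 9007199254740993

% Round trip through double, as happens when a column is cast to hold NaN.
y = uint64(double(x));

% The round trip does not preserve the original value.
disp(isequal(x, y))   % displays 0 (false)
```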

Notes

  1. Thank you @sgilmore10 for your help with this pull request!

@kevingurney kevingurney marked this pull request as ready for review August 14, 2023 20:40
@kevingurney kevingurney requested a review from kou as a code owner August 14, 2023 20:40
@kevingurney

+1

@kevingurney kevingurney merged commit 95db0df into apache:main Aug 14, 2023
9 checks passed
@conbench-apache-arrow

After merging your PR, Conbench analyzed the 5 benchmarking runs that have been run so far on merge-commit 95db0df.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about possible false positives for unstable benchmarks that are known to sometimes produce them.

kevingurney added a commit that referenced this pull request Aug 16, 2023
…de (#37204)

### Rationale for this change

Now that `featherread` and `featherwrite` have been re-implemented in terms of the new MATLAB Interface APIs (#37163 and #37047), we can remove the unused feather V1 MEX infrastructure and code. 

### What changes are included in this PR?

1. Deleted the following source and header files that are specific to the feather V1 MEX implementation: 
    - `arrow/matlab/src/cpp/arrow/matlab/feather/feather_functions.[cc][h]`
    - `arrow/matlab/src/cpp/arrow/matlab/feather/feather_reader.[cc][h]`
    - `arrow/matlab/src/cpp/arrow/matlab/feather/feather_writer.[cc][h]`
    - `arrow/matlab/src/cpp/arrow/matlab/feather/matlab_traits.h`
    - `arrow/matlab/src/cpp/arrow/matlab/feather/util/handle_status.[cc][h]`
    - `arrow/matlab/src/cpp/arrow/matlab/feather/util/unicode_conversion.[cc][h]`
    - `arrow/matlab/src/cpp/arrow/matlab/mex/call.cc`
    - `arrow/matlab/src/cpp/arrow/matlab/mex/mex_functions.h`
    - `arrow/matlab/src/cpp/arrow/matlab/mex/mex_util.[cc][h]`
    - `arrow/matlab/src/cpp/arrow/matlab/mex/mex_util_test.cc`
    - `arrow/matlab/src/cpp/arrow/matlab/api/visibility.h`

2. Deleted the following feather V1 MEX-specific build infrastructure files: 
    - `arrow/matlab/build_support/common_vars.m`
    - `arrow/matlab/build_support/compile.m`
    - `arrow/matlab/build_support/test.m`

3. Removed all feather V1 MEX-specific logic from the `arrow/matlab/CMakeLists.txt` file.

### Are these changes tested?

No tests are needed. The old feather V1 MEX-specific implementation is unused code.

### Are there any user-facing changes?

No.

### Future Directions

1. Review the backlog of stale tasks/issues that are no longer actionable and close them. For example, #27758 has already been implemented and submitted with a different issue attached.
* Closes: #37203

Lead-authored-by: Sarah Gilmore <sgilmore@mathworks.com>
Co-authored-by: Kevin Gurney <kgurney@mathworks.com>
Signed-off-by: Kevin Gurney <kgurney@mathworks.com>
@kevingurney kevingurney deleted the GH-37046 branch August 21, 2023 18:13
Successfully merging this pull request may close these issues.

[MATLAB] Implement featherread in terms of arrow.internal.io.feather.Reader