-
Notifications
You must be signed in to change notification settings - Fork 3.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
GH-37041: [MATLAB] Implement Feather V1 Reader using new MATLAB Interface APIs #37044
Merged
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
2. Error if not Feather V1 file.
github-actions
bot
added
Component: MATLAB
awaiting committer review
Awaiting committer review
labels
Aug 7, 2023
sgilmore10
reviewed
Aug 7, 2023
sgilmore10
approved these changes
Aug 7, 2023
github-actions
bot
added
awaiting changes
Awaiting changes
and removed
awaiting committer review
Awaiting committer review
labels
Aug 7, 2023
github-actions
bot
added
awaiting change review
Awaiting change review
awaiting changes
Awaiting changes
and removed
awaiting changes
Awaiting changes
awaiting change review
Awaiting change review
labels
Aug 7, 2023
Linting failures appear unrelated to these changes. |
+1 |
After merging your PR, Conbench analyzed the 5 benchmarking runs that have been run so far on merge-commit 152be67. There were no benchmark performance regressions. 🎉 The full Conbench report has more details. |
kevingurney
added a commit
that referenced
this pull request
Aug 14, 2023
…l.io.feather.Reader` (#37163) ### Rationale for this change Now that #37044 is merged, we can re-implement `featherread` in terms of the new `arrow.internal.io.feather.Reader` class. Once this change is made, we can delete the legacy build infrastructure and `featherread` MEX code. ### What changes are included in this PR? 1. Reimplemented `featherread` in terms of the new `arrow.internal.io.feather.Reader` class. 2. We tried to maintain compatibility with the old code as much as possible, but since `featherread` is now implemented in terms of `RecordBatch`, there are some minor changes in behavior and support for some new data types (e.g. `Boolean`, `String`, `Timestamp`) that are introduced by these changes. 3. Updated `arrow/matlab/io/feather/proxy/reader.cc` to prevent a `nullptr` dereference that was occurring when reading a Feather V1 file created from an empty table by using `Table::CombineChunksToBatch` rather than a `TableBatchReader`. **Example** ```matlab >> tWrite = table(["A"; "B"; "C"], [true; false; true], [1; 2; 3], VariableNames=["String", "Boolean", "Float64"]) tWrite = 3x3 table String Boolean Float64 ______ _______ _______ "A" true 1 "B" false 2 "C" true 3 >> featherwrite("test.feather", tWrite) >> tRead = featherread("test.feather") tRead = 3x3 table String Boolean Float64 ______ _______ _______ "A" true 1 "B" false 2 "C" true 3 >> isequaln(tWrite, tRead) ans = logical 1 ``` ### Are these changes tested? Yes. 1. Updated the existing `tfeather.m` and `tfeathermex.m` tests to reflect the new behavior of `featherread`. This mainly consists of error message ID changes. 2. Added a new test to verify that all MATLAB types supported by `arrow.tabular.RecordBatch` can be round-tripped to a Feather V1 file. 4. Added a new test to verify that a MATLAB `table` with Unicode `Variablenames` can be round-tripped to a Feather V1 file. ### Are there any user-facing changes? Yes. 1. Now that `featherread` is implemented in terms of `arrow.internal.io.feather.Reader` and `arrow.tabular.RecordBatch`, it supports reading more types like `Boolean`, `String`, `Timestamp`, etc. **Note**: We updated the code to cast `logical`/`Boolean` type columns containing null values to `double` and substitute null values with `NaN`. This mirrors the existing behavior of `featherread` for integer type columns containing null values. 2. There are some minor error message ID changes. 4. Cell arrays of strings with a single element (e.g. `{'filename.feather'}`) are now supported as a valid `filename` for `featherread`. ### Future Directions 1. In the future, we may want to consider no longer casting columns with integer/logical type containing null values to `double` and substituting null values with `NaN`. This behavior isn't ideal in all cases (it can be lossy for types like `uint64`). This change would break compatibility. 2. Delete legacy Feather V1 code and build infrastructure. ### Notes 1. Thank you @ sgilmore10 for your help with this pull request! * Closes: #37046 Authored-by: Kevin Gurney <kgurney@mathworks.com> Signed-off-by: Kevin Gurney <kgurney@mathworks.com>
loicalleyne
pushed a commit
to loicalleyne/arrow
that referenced
this pull request
Nov 13, 2023
… Interface APIs (apache#37044) ### Rationale for this change Now that we've have the basic building blocks for tabular IO in the MATLAB Interface (Array, Schema, RecordBatch), we can implement a Feather V1 reader in terms of the new APIs. This is a follow up to apache#37043, where a new Feather V1 internal `Writer` object was added. ### What changes are included in this PR? 1. Added a new class called arrow.internal.io.feather.Reader which can be used to read Feather V1 files. It has one public property named `Filename` and one public method named `read`. **Example Usage:** ```matlab >> T = array2table(rand(3)) T = 3x3 table Var1 Var2 Var3 _______ ________ _______ 0.79221 0.035712 0.67874 0.95949 0.84913 0.75774 0.65574 0.93399 0.74313 >> filename = "test.feather"; >> featherwrite(filename, T) >> reader = arrow.internal.io.feather.Reader(filename) reader = Reader with properties: Filename: "test.feather" >> T = reader.read() T = 3x3 table Var1 Var2 Var3 _______ ________ _______ 0.79221 0.035712 0.67874 0.95949 0.84913 0.75774 0.65574 0.93399 0.74313 ``` ### Are these changes tested? Yes. 1. Added `Reader` to `feather/tRoundTrip.m`. ### Are there any user-facing changes? No. These are only internal objects right now. ### Future Directions 1. Re-implement `featherread` in terms of the new `Reader` object. 2. Remove legacy feather code and infrastructure. ### Notes 1. For conciseness, I renamed the C++ Proxy class `FeatherWriter` to `Writer` since it is already inside of a `feather` namespace / "package". * Closes: apache#37041 Authored-by: Kevin Gurney <kgurney@mathworks.com> Signed-off-by: Kevin Gurney <kgurney@mathworks.com>
loicalleyne
pushed a commit
to loicalleyne/arrow
that referenced
this pull request
Nov 13, 2023
…nternal.io.feather.Reader` (apache#37163) ### Rationale for this change Now that apache#37044 is merged, we can re-implement `featherread` in terms of the new `arrow.internal.io.feather.Reader` class. Once this change is made, we can delete the legacy build infrastructure and `featherread` MEX code. ### What changes are included in this PR? 1. Reimplemented `featherread` in terms of the new `arrow.internal.io.feather.Reader` class. 2. We tried to maintain compatibility with the old code as much as possible, but since `featherread` is now implemented in terms of `RecordBatch`, there are some minor changes in behavior and support for some new data types (e.g. `Boolean`, `String`, `Timestamp`) that are introduced by these changes. 3. Updated `arrow/matlab/io/feather/proxy/reader.cc` to prevent a `nullptr` dereference that was occurring when reading a Feather V1 file created from an empty table by using `Table::CombineChunksToBatch` rather than a `TableBatchReader`. **Example** ```matlab >> tWrite = table(["A"; "B"; "C"], [true; false; true], [1; 2; 3], VariableNames=["String", "Boolean", "Float64"]) tWrite = 3x3 table String Boolean Float64 ______ _______ _______ "A" true 1 "B" false 2 "C" true 3 >> featherwrite("test.feather", tWrite) >> tRead = featherread("test.feather") tRead = 3x3 table String Boolean Float64 ______ _______ _______ "A" true 1 "B" false 2 "C" true 3 >> isequaln(tWrite, tRead) ans = logical 1 ``` ### Are these changes tested? Yes. 1. Updated the existing `tfeather.m` and `tfeathermex.m` tests to reflect the new behavior of `featherread`. This mainly consists of error message ID changes. 2. Added a new test to verify that all MATLAB types supported by `arrow.tabular.RecordBatch` can be round-tripped to a Feather V1 file. 4. Added a new test to verify that a MATLAB `table` with Unicode `Variablenames` can be round-tripped to a Feather V1 file. ### Are there any user-facing changes? Yes. 1. Now that `featherread` is implemented in terms of `arrow.internal.io.feather.Reader` and `arrow.tabular.RecordBatch`, it supports reading more types like `Boolean`, `String`, `Timestamp`, etc. **Note**: We updated the code to cast `logical`/`Boolean` type columns containing null values to `double` and substitute null values with `NaN`. This mirrors the existing behavior of `featherread` for integer type columns containing null values. 2. There are some minor error message ID changes. 4. Cell arrays of strings with a single element (e.g. `{'filename.feather'}`) are now supported as a valid `filename` for `featherread`. ### Future Directions 1. In the future, we may want to consider no longer casting columns with integer/logical type containing null values to `double` and substituting null values with `NaN`. This behavior isn't ideal in all cases (it can be lossy for types like `uint64`). This change would break compatibility. 2. Delete legacy Feather V1 code and build infrastructure. ### Notes 1. Thank you @ sgilmore10 for your help with this pull request! * Closes: apache#37046 Authored-by: Kevin Gurney <kgurney@mathworks.com> Signed-off-by: Kevin Gurney <kgurney@mathworks.com>
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Rationale for this change
Now that we've have the basic building blocks for tabular IO in the MATLAB Interface (Array, Schema, RecordBatch), we can implement a Feather V1 reader in terms of the new APIs.
This is a follow up to #37043, where a new Feather V1 internal
Writer
object was added.What changes are included in this PR?
Filename
and one public method namedread
.Example Usage:
Are these changes tested?
Yes.
Reader
tofeather/tRoundTrip.m
.Are there any user-facing changes?
No.
These are only internal objects right now.
Future Directions
featherread
in terms of the newReader
object.Notes
FeatherWriter
toWriter
since it is already inside of afeather
namespace / "package".