-
Notifications
You must be signed in to change notification settings - Fork 3.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
GH-37046: [MATLAB] Implement featherread
in terms of arrow.internal.io.feather.Reader
#37163
Merged
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
round-tripping tables with Unicode variable names.
sgilmore10
reviewed
Aug 14, 2023
sgilmore10
approved these changes
Aug 14, 2023
+1 |
After merging your PR, Conbench analyzed the 5 benchmarking runs that have been run so far on merge-commit 95db0df. There were no benchmark performance regressions. 🎉 The full Conbench report has more details. It also includes information about possible false positives for unstable benchmarks that are known to sometimes produce them. |
This was referenced Aug 16, 2023
kevingurney
added a commit
that referenced
this pull request
Aug 16, 2023
…de (#37204) ### Rationale for this change Now that `featherread` and `featherwrite` have been re-implemented in terms of the new MATLAB Interface APIs (#37163 and #37047), we can remove the unused feather V1 MEX infrastructure and code. ### What changes are included in this PR? 1. Deleted the following source and header files that are specific to the feather V1 MEX implementation: - `arrow/matlab/src/cpp/arrow/matlab/feather/feather_functions.[cc][h]` - `arrow/matlab/src/cpp/arrow/matlab/feather/feather_reader.[cc][h]` - `arrow/matlab/src/cpp/arrow/matlab/feather/feather_writer.[cc][h]` - `arrow/matlab/src/cpp/arrow/matlab/feather/matlab_traits.h` - `arrow/matlab/src/cpp/arrow/matlab/feather/util/handle_status.[cc][h]` - `arrow/matlab/src/cpp/arrow/matlab/feather/util/unicode_conversion.[cc][h]` - `arrow/matlab/src/cpp/arrow/matlab/mex/call.cc` - `arrow/matlab/src/cpp/arrow/matlab/mex/mex_functions.h` - `arrow/matlab/src/cpp/arrow/matlab/mex/mex_util.[cc][h]` - `arrow/matlab/src/cpp/arrow/matlab/mex/mex_util_test.cc` - `arrow/matlab/src/cpp/arrow/matlab/api/visibility.h` 2. Deleted the following feather V1 MEX-specific build infrastructure files: - `arrow/matlab/build_support/common_vars.m` - `arrow/matlab/build_support/compile.m` - `arrow/matlab/build_support/test.m` 3. Removed all feather V1 MEX-specific logic from the `arrow/matlab/CMakeLists.txt` file. ### Are these changes tested? No tests are needed. The old feather V1 MEX specific implementation is unused code. ### Are there any user-facing changes? No. ### Future Directions 1. Review the back-log of stale tasks/issues that are no longer actionable and close them. For example, #27758 has already been implemented and submitted with a different issue attached. * Closes: #37203 Lead-authored-by: Sarah Gilmore <sgilmore@mathworks.com> Co-authored-by: Kevin Gurney <kgurney@mathworks.com> Signed-off-by: Kevin Gurney <kgurney@mathworks.com>
loicalleyne
pushed a commit
to loicalleyne/arrow
that referenced
this pull request
Nov 13, 2023
…nternal.io.feather.Reader` (apache#37163) ### Rationale for this change Now that apache#37044 is merged, we can re-implement `featherread` in terms of the new `arrow.internal.io.feather.Reader` class. Once this change is made, we can delete the legacy build infrastructure and `featherread` MEX code. ### What changes are included in this PR? 1. Reimplemented `featherread` in terms of the new `arrow.internal.io.feather.Reader` class. 2. We tried to maintain compatibility with the old code as much as possible, but since `featherread` is now implemented in terms of `RecordBatch`, there are some minor changes in behavior and support for some new data types (e.g. `Boolean`, `String`, `Timestamp`) that are introduced by these changes. 3. Updated `arrow/matlab/io/feather/proxy/reader.cc` to prevent a `nullptr` dereference that was occurring when reading a Feather V1 file created from an empty table by using `Table::CombineChunksToBatch` rather than a `TableBatchReader`. **Example** ```matlab >> tWrite = table(["A"; "B"; "C"], [true; false; true], [1; 2; 3], VariableNames=["String", "Boolean", "Float64"]) tWrite = 3x3 table String Boolean Float64 ______ _______ _______ "A" true 1 "B" false 2 "C" true 3 >> featherwrite("test.feather", tWrite) >> tRead = featherread("test.feather") tRead = 3x3 table String Boolean Float64 ______ _______ _______ "A" true 1 "B" false 2 "C" true 3 >> isequaln(tWrite, tRead) ans = logical 1 ``` ### Are these changes tested? Yes. 1. Updated the existing `tfeather.m` and `tfeathermex.m` tests to reflect the new behavior of `featherread`. This mainly consists of error message ID changes. 2. Added a new test to verify that all MATLAB types supported by `arrow.tabular.RecordBatch` can be round-tripped to a Feather V1 file. 4. Added a new test to verify that a MATLAB `table` with Unicode `Variablenames` can be round-tripped to a Feather V1 file. ### Are there any user-facing changes? Yes. 1. Now that `featherread` is implemented in terms of `arrow.internal.io.feather.Reader` and `arrow.tabular.RecordBatch`, it supports reading more types like `Boolean`, `String`, `Timestamp`, etc. **Note**: We updated the code to cast `logical`/`Boolean` type columns containing null values to `double` and substitute null values with `NaN`. This mirrors the existing behavior of `featherread` for integer type columns containing null values. 2. There are some minor error message ID changes. 4. Cell arrays of strings with a single element (e.g. `{'filename.feather'}`) are now supported as a valid `filename` for `featherread`. ### Future Directions 1. In the future, we may want to consider no longer casting columns with integer/logical type containing null values to `double` and substituting null values with `NaN`. This behavior isn't ideal in all cases (it can be lossy for types like `uint64`). This change would break compatibility. 2. Delete legacy Feather V1 code and build infrastructure. ### Notes 1. Thank you @ sgilmore10 for your help with this pull request! * Closes: apache#37046 Authored-by: Kevin Gurney <kgurney@mathworks.com> Signed-off-by: Kevin Gurney <kgurney@mathworks.com>
loicalleyne
pushed a commit
to loicalleyne/arrow
that referenced
this pull request
Nov 13, 2023
…and code (apache#37204) ### Rationale for this change Now that `featherread` and `featherwrite` have been re-implemented in terms of the new MATLAB Interface APIs (apache#37163 and apache#37047), we can remove the unused feather V1 MEX infrastructure and code. ### What changes are included in this PR? 1. Deleted the following source and header files that are specific to the feather V1 MEX implementation: - `arrow/matlab/src/cpp/arrow/matlab/feather/feather_functions.[cc][h]` - `arrow/matlab/src/cpp/arrow/matlab/feather/feather_reader.[cc][h]` - `arrow/matlab/src/cpp/arrow/matlab/feather/feather_writer.[cc][h]` - `arrow/matlab/src/cpp/arrow/matlab/feather/matlab_traits.h` - `arrow/matlab/src/cpp/arrow/matlab/feather/util/handle_status.[cc][h]` - `arrow/matlab/src/cpp/arrow/matlab/feather/util/unicode_conversion.[cc][h]` - `arrow/matlab/src/cpp/arrow/matlab/mex/call.cc` - `arrow/matlab/src/cpp/arrow/matlab/mex/mex_functions.h` - `arrow/matlab/src/cpp/arrow/matlab/mex/mex_util.[cc][h]` - `arrow/matlab/src/cpp/arrow/matlab/mex/mex_util_test.cc` - `arrow/matlab/src/cpp/arrow/matlab/api/visibility.h` 2. Deleted the following feather V1 MEX-specific build infrastructure files: - `arrow/matlab/build_support/common_vars.m` - `arrow/matlab/build_support/compile.m` - `arrow/matlab/build_support/test.m` 3. Removed all feather V1 MEX-specific logic from the `arrow/matlab/CMakeLists.txt` file. ### Are these changes tested? No tests are needed. The old feather V1 MEX specific implementation is unused code. ### Are there any user-facing changes? No. ### Future Directions 1. Review the back-log of stale tasks/issues that are no longer actionable and close them. For example, apache#27758 has already been implemented and submitted with a different issue attached. * Closes: apache#37203 Lead-authored-by: Sarah Gilmore <sgilmore@mathworks.com> Co-authored-by: Kevin Gurney <kgurney@mathworks.com> Signed-off-by: Kevin Gurney <kgurney@mathworks.com>
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Rationale for this change
Now that #37044 is merged, we can re-implement
featherread
in terms of the newarrow.internal.io.feather.Reader
class.Once this change is made, we can delete the legacy build infrastructure and
featherread
MEX code.What changes are included in this PR?
featherread
in terms of the newarrow.internal.io.feather.Reader
class.featherread
is now implemented in terms ofRecordBatch
, there are some minor changes in behavior and support for some new data types (e.g.Boolean
,String
,Timestamp
) that are introduced by these changes.arrow/matlab/io/feather/proxy/reader.cc
to prevent anullptr
dereference that was occurring when reading a Feather V1 file created from an empty table by usingTable::CombineChunksToBatch
rather than aTableBatchReader
.Example
Are these changes tested?
Yes.
tfeather.m
andtfeathermex.m
tests to reflect the new behavior offeatherread
. This mainly consists of error message ID changes.arrow.tabular.RecordBatch
can be round-tripped to a Feather V1 file.table
with UnicodeVariablenames
can be round-tripped to a Feather V1 file.Are there any user-facing changes?
Yes.
featherread
is implemented in terms ofarrow.internal.io.feather.Reader
andarrow.tabular.RecordBatch
, it supports reading more types likeBoolean
,String
,Timestamp
, etc. Note: We updated the code to castlogical
/Boolean
type columns containing null values todouble
and substitute null values withNaN
. This mirrors the existing behavior offeatherread
for integer type columns containing null values.{'filename.feather'}
) are now supported as a validfilename
forfeatherread
.Future Directions
double
and substituting null values withNaN
. This behavior isn't ideal in all cases (it can be lossy for types likeuint64
). This change would break compatibility.Notes
featherread
in terms ofarrow.internal.io.feather.Reader
#37046