[MATLAB] Add arrow.array.StringArray class #36250
Comments
take
kou added a commit that referenced this issue Jul 7, 2023
### Rationale for this change

Thanks to @sgilmore10's [recent changes to enable UTF-8 <-> UTF-16 string conversions](#36167), we can now add support for creating Arrow `String` arrays (UTF-8 encoded) from MATLAB `string` arrays (UTF-16 encoded).

### What changes are included in this PR?

1. Added new `arrow.array.StringArray` class that can be constructed from MATLAB [`string`](https://www.mathworks.com/help/matlab/ref/string.html?s_tid=doc_ta) and [`cellstr`](https://www.mathworks.com/help/matlab/ref/cellstr.html) types. **Note**: We explicitly decided to *not* support [`char`](https://www.mathworks.com/help/matlab/ref/char.html?s_tid=doc_ta) arrays for the time being.
2. Factored out code for extracting "raw" `const uint8_t*` from a MATLAB `logical` Data Array into a new function `bit::unpacked_as_ptr` so that it can be reused across multiple Array `Proxy` classes. See #36335.
3. Added new `arrow.type.StringType` type class and associated `arrow.type.ID.String` enum value.
4. Enabled support for creating `RecordBatch` objects from MATLAB `table`s containing `string` data.
5. Updated `arrow::matlab::array::proxy::Array::toString` code to convert from UTF-8 to UTF-16 for display in MATLAB.

**Examples**

*Most MATLAB `string` arrays round-trip*

```matlab
>> matlabArray = ["A"; "B"; "C"]

matlabArray =

  3x1 string array

    "A"
    "B"
    "C"

>> arrowArray = arrow.array.StringArray(matlabArray)

arrowArray =

[
  "A",
  "B",
  "C"
]

>> matlabArrayRoundTrip = toMATLAB(arrowArray)

matlabArrayRoundTrip =

  3x1 string array

    "A"
    "B"
    "C"

>> isequal(matlabArray, matlabArrayRoundTrip)

ans =

  logical

   1
```

*MATLAB `string(missing)` values get mapped to `null` by default*

```matlab
>> matlabArray = ["A"; string(missing); "C"]

matlabArray =

  3x1 string array

    "A"
    <missing>
    "C"

>> arrowArray = arrow.array.StringArray(matlabArray)

arrowArray =

[
  "A",
  null,
  "C"
]

>> matlabArrayRoundTrip = toMATLAB(arrowArray)

matlabArrayRoundTrip =

  3x1 string array

    "A"
    <missing>
    "C"

>> isequaln(matlabArray, matlabArrayRoundTrip)

ans =

  logical

   1
```

*Unicode characters round-trip*

```matlab
>> matlabArray = ["😊"; "🌲"; "➞"]

matlabArray =

  3×1 string array

    "😊"
    "🌲"
    "➞"

>> arrowArray = arrow.array.StringArray(matlabArray)

arrowArray =

[
  "😊",
  "🌲",
  "➞"
]

>> matlabArrayRoundTrip = toMATLAB(arrowArray)

matlabArrayRoundTrip =

  3×1 string array

    "😊"
    "🌲"
    "➞"
```

*Create `StringArray` from `cellstr`*

```matlab
>> matlabArray = {'red'; 'green'; 'blue'}

matlabArray =

  3×1 cell array

    {'red'  }
    {'green'}
    {'blue' }

>> arrowArray = arrow.array.StringArray(matlabArray)

arrowArray =

[
  "red",
  "green",
  "blue"
]

>> matlabArrayRoundTrip = toMATLAB(arrowArray)

matlabArrayRoundTrip =

  3×1 string array

    "red"
    "green"
    "blue"
```

*Create `RecordBatch` from MATLAB `string` data*

```matlab
>> matlabTable = table(["😊"; "🌲"; "➞"])

matlabTable =

  3×1 table

    Var1
    ____

    "😊"
    "🌲"
    "➞"

>> arrowRecordBatch = arrow.tabular.RecordBatch(matlabTable)

arrowRecordBatch =

Var1:   [
    "😊",
    "🌲",
    "➞"
  ]

>> matlabTableRoundTrip = toMATLAB(arrowRecordBatch)

matlabTableRoundTrip =

  3×1 table

    Var1
    ____

    "😊"
    "🌲"
    "➞"

>> isequaln(matlabTable, matlabTableRoundTrip)

ans =

  logical

   1
```

### Are these changes tested?

Yes.

1. Added new `tStringArray` test class.
2. Added new `tStringType` test class.
3. Extended `tRecordBatch` test class to verify support for MATLAB `table`s which contain `string` data (see above).

### Are there any user-facing changes?

Yes.

1. Users can now create `arrow.array.StringArray` objects from MATLAB `string` arrays and `cellstr`s.
2. Users can now create `arrow.type.StringType` objects.
3. Users can now construct `RecordBatch` objects from MATLAB `table`s that contain `string` data.

### Future Directions

1. The implementation of this initial version of `StringArray` is relatively simple in that it does not include a `BinaryArray` class hierarchy. In the future, we will likely want to refactor `StringArray` to inherit from a more general abstract `BinaryArray` class hierarchy.
2. Following on from 1., we will ideally want to add support for `LargeStringArray`, `BinaryArray`, `LargeBinaryArray`, and `FixedLengthBinaryArray` by creating common infrastructure for representing binary types. This initial version of `StringArray` helps to solidify the user-facing design and provides a shorter-term solution for working with `string` data, since it is quite common.
3. It may make sense to change the `arrow.type.Type` hierarchy (e.g. `arrow.type.StringType`) in the future to delegate to C++ `Proxy` classes under the hood. See: #36363.
4. Use `bit::unpacked_as_ptr` in other classes. See #36335.
5. Look for more ways to optimize the conversion from MATLAB UTF-16 encoded string data to Arrow UTF-8 encoded string data (e.g. by avoiding unnecessary data copies).

### Notes

1. Thank you @sgilmore10 for your help with this pull request!

* Closes: #36250

Lead-authored-by: Kevin Gurney <kgurney@mathworks.com>
Co-authored-by: Kevin Gurney <kevin.p.gurney@gmail.com>
Co-authored-by: Sarah Gilmore <sgilmore@mathworks.com>
Co-authored-by: Sutou Kouhei <kou@cozmixng.org>
Co-authored-by: Sarah Gilmore <silgmore@mathworks.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>
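For readers unfamiliar with what the UTF-16 to UTF-8 conversion and the `string(missing)` to `null` mapping in the commit message above amount to at the buffer level, here is a rough Python sketch of the Arrow columnar format's `String` layout (a validity bitmap, 32-bit offsets, and one contiguous UTF-8 data buffer). It is not the code from this PR: the function names `build_string_array_buffers` and `utf16_code_units` are hypothetical, Python `None` stands in for MATLAB `string(missing)`, and the offsets are kept as a plain Python list rather than a packed `int32` buffer.

```python
from typing import List, Optional, Tuple


def build_string_array_buffers(
    values: List[Optional[str]],
) -> Tuple[bytes, List[int], bytes]:
    """Return (validity_bitmap, offsets, data) for a list of optional strings.

    Hypothetical helper for illustration only; None plays the role of a
    missing value and becomes a null slot.
    """
    offsets = [0]
    data = bytearray()
    validity = bytearray((len(values) + 7) // 8)  # one bit per element
    for i, value in enumerate(values):
        if value is not None:
            data += value.encode("utf-8")         # Arrow String data is UTF-8
            validity[i // 8] |= 1 << (i % 8)      # least-significant bit first
        # A null slot adds no bytes, so its start and end offsets are equal.
        offsets.append(len(data))
    return bytes(validity), offsets, bytes(data)


def utf16_code_units(text: str) -> int:
    """Number of 16-bit code units needed to store the text as UTF-16."""
    return len(text.encode("utf-16-le")) // 2


validity, offsets, data = build_string_array_buffers(["A", None, "C"])
print(validity, offsets, data)

# The same character can need a different number of units in each encoding.
for ch in ["😊", "🌲", "➞"]:
    print(ch, utf16_code_units(ch), len(ch.encode("utf-8")))
```

The second loop shows why the conversion is more than a byte copy: a character outside the Basic Multilingual Plane, such as "😊", takes two UTF-16 code units (a surrogate pair) but four UTF-8 bytes, so offsets into the Arrow data buffer have to be computed from the converted bytes rather than from the MATLAB string lengths.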
Describe the enhancement requested
Thanks to @sgilmore10's recent changes to enable UTF-8 <-> UTF-16 string conversions, we can now add support for creating Arrow `String` arrays (UTF-8 encoded) from MATLAB `string` arrays (UTF-16 encoded).
We will also want to add support for `arrow.array.LargeStringArray`.
Example:
Component(s)
MATLAB