GH-36166: [C++][MATLAB] Add utility to convert UTF-8 strings to UTF-16 and UTF-16 strings to UTF-8 #36167

Merged
kou merged 2 commits into apache:main on Jun 19, 2023

Conversation

sgilmore10
Member

@sgilmore10 sgilmore10 commented Jun 19, 2023

Rationale for this change

MATLAB uses UTF-16 encoded strings, but Arrow uses UTF-8. We need a way to convert between the two encodings.

What changes are included in this PR?

Added two new utility functions:

  1. std::string UTF16StringToUTF8(const std::basic_string<char16_t>& source)
  2. std::basic_string<char16_t> UTF8StringToUTF16(const std::string& source)
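
For illustration, here is a minimal sketch of the UTF-16 to UTF-8 direction using only the C++ standard library. `Utf16ToUtf8Sketch` is a hypothetical name, and the bool-plus-out-parameter signature is an assumption made to keep the example dependency-free; it is not the code added in this PR.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not the Arrow implementation): converts UTF-16 to
// UTF-8. Returns false if the input contains an unpaired surrogate.
bool Utf16ToUtf8Sketch(const std::u16string& source, std::string* out) {
  out->clear();
  for (std::size_t i = 0; i < source.size(); ++i) {
    char32_t cp = source[i];
    if (cp >= 0xD800 && cp <= 0xDBFF) {
      // High surrogate: a low surrogate must follow immediately.
      if (i + 1 >= source.size()) return false;
      const char32_t low = source[i + 1];
      if (low < 0xDC00 || low > 0xDFFF) return false;
      cp = 0x10000 + ((cp - 0xD800) << 10) + (low - 0xDC00);
      ++i;  // the low surrogate has been consumed
    } else if (cp >= 0xDC00 && cp <= 0xDFFF) {
      return false;  // low surrogate with no preceding high surrogate
    }
    // Encode the code point as 1 to 4 UTF-8 bytes.
    if (cp < 0x80) {
      out->push_back(static_cast<char>(cp));
    } else if (cp < 0x800) {
      out->push_back(static_cast<char>(0xC0 | (cp >> 6)));
      out->push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else if (cp < 0x10000) {
      out->push_back(static_cast<char>(0xE0 | (cp >> 12)));
      out->push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
      out->push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else {
      out->push_back(static_cast<char>(0xF0 | (cp >> 18)));
      out->push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
      out->push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
      out->push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    }
  }
  return true;
}
```

Calling `Utf16ToUtf8Sketch(u"x", &out)` fills `out` with `"x"`, while an input consisting of a single high surrogate such as `0xD83D` makes the function return false.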

Are these changes tested?

Added two test cases to utf8_util_test.cc:

  1. UTF16StringToUTF8
  2. UTF8StringToUTF16

Are there any user-facing changes?

No, these APIs are intended for developers.

Future Directions

In a followup PR, we will update the MATLAB Interface source code to use these utilities when converting between UTF-16 and UTF-8 encoded strings.

2. Add UTF16StringToUTF8 utility
@sgilmore10
Member Author

The CI failures seem unrelated to these changes.

@kou kou changed the title GH-36166: [C++] [MATLAB]: Add utility to convert UTF-8 strings to UTF-16 and UTF-16 strings to UTF-8 GH-36166: [C++][MATLAB] Add utility to convert UTF-8 strings to UTF-16 and UTF-16 strings to UTF-8 Jun 19, 2023
Member

@kou kou left a comment

+1

@github-actions github-actions bot added awaiting merge Awaiting merge and removed awaiting review Awaiting review labels Jun 19, 2023
@kou kou merged commit bd7455f into apache:main Jun 19, 2023
@conbench-apache-arrow

Conbench analyzed the 6 benchmark runs on commit bd7455f0.

There were 4 benchmark results indicating a performance regression.

The full Conbench report has more details.

@@ -164,5 +176,21 @@ ARROW_EXPORT Result<std::string> WideStringToUTF8(const std::wstring& source) {
}
}

ARROW_EXPORT Result<std::string> UTF16StringToUTF8(const std::u16string& source) {
Member

For the record, ARROW_EXPORT is only useful on declarations, not definitions.
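
To illustrate the point, here is a minimal sketch with a stand-in visibility macro. `SKETCH_EXPORT` and `Greet` are hypothetical names, and the macro body is an assumption about how such export macros are commonly defined; it is not Arrow's actual definition of `ARROW_EXPORT`.

```cpp
#include <string>

// Hypothetical stand-in for an export macro such as ARROW_EXPORT.
#if defined(_WIN32)
#define SKETCH_EXPORT __declspec(dllexport)
#else
#define SKETCH_EXPORT __attribute__((visibility("default")))
#endif

// Declaration (normally in the header): this is where the macro belongs.
SKETCH_EXPORT std::string Greet(const std::string& name);

// Definition (normally in the .cc file): the attribute from the earlier
// declaration still applies, so repeating the macro here adds nothing.
std::string Greet(const std::string& name) { return "hello, " + name; }
```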

Member

Oh, sorry. I missed this.

Member Author

Oh that's my bad. I copied the line from the header file but forgot to delete ARROW_EXPORT. I'll submit a followup pull request.

CheckOk({0, 'x'}, {0, u'x'});

CheckInvalid("\xff");
CheckInvalid("h\xc3");
Member

Would have been nice to add a lone surrogate test here as well?

Member Author

Definitely, I can add one in a followup pull request.
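
As a rough sketch of what such a test would exercise: a lone surrogate in UTF-8 is a three-byte sequence such as `"\xED\xA0\x80"` (the encoding of U+D800), which a strict decoder has to reject. The decoder below is a hypothetical, dependency-free illustration named `Utf8ToUtf16Sketch`; it is not the Arrow implementation and does not use the `CheckInvalid` helper from the test file.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not the Arrow implementation): converts UTF-8 to
// UTF-16. Returns false on malformed input, including encoded surrogates.
bool Utf8ToUtf16Sketch(const std::string& source, std::u16string* out) {
  out->clear();
  std::size_t i = 0;
  while (i < source.size()) {
    const unsigned char lead = static_cast<unsigned char>(source[i]);
    char32_t cp = 0;
    char32_t min_cp = 0;    // smallest code point allowed for this length
    std::size_t extra = 0;  // number of continuation bytes expected
    if (lead < 0x80) {
      cp = lead;
    } else if ((lead & 0xE0) == 0xC0) {
      cp = lead & 0x1F; extra = 1; min_cp = 0x80;
    } else if ((lead & 0xF0) == 0xE0) {
      cp = lead & 0x0F; extra = 2; min_cp = 0x800;
    } else if ((lead & 0xF8) == 0xF0) {
      cp = lead & 0x07; extra = 3; min_cp = 0x10000;
    } else {
      return false;  // stray continuation byte or invalid lead such as 0xFF
    }
    if (source.size() - i <= extra) return false;  // truncated, e.g. "h\xc3"
    for (std::size_t k = 1; k <= extra; ++k) {
      const unsigned char c = static_cast<unsigned char>(source[i + k]);
      if ((c & 0xC0) != 0x80) return false;  // not a continuation byte
      cp = (cp << 6) | (c & 0x3F);
    }
    if (cp < min_cp || cp > 0x10FFFF) return false;    // overlong or too large
    if (cp >= 0xD800 && cp <= 0xDFFF) return false;    // encoded surrogate
    if (cp < 0x10000) {
      out->push_back(static_cast<char16_t>(cp));
    } else {
      cp -= 0x10000;  // split into a surrogate pair
      out->push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
      out->push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
    }
    i += extra + 1;
  }
  return true;
}
```

With this sketch, `"\xff"`, `"h\xc3"`, and `"\xED\xA0\x80"` are all rejected, while a well-formed four-byte sequence is converted to a surrogate pair.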

@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting merge Awaiting merge labels Jun 20, 2023
pitrou added a commit that referenced this pull request Jun 29, 2023
…ringToUTF16 (#36383)

### Rationale for this change

This is a followup PR to #36167 that addresses feedback left after the PR was merged.

### What changes are included in this PR?

1. Added a test point verifying `UTF8StringToUTF16` returns an `Invalid` status if given a UTF-8 encoded string that contains a lone high or low surrogate code point.
2. Removed `ARROW_EXPORT` from definitions of `UTF8StringToUTF16` and `UTF16StringToUTF8`.

### Are these changes tested?

Yes.

### Are there any user-facing changes?

No.

* Closes: #36173

Lead-authored-by: Sarah Gilmore <sgilmore@mathworks.com>
Co-authored-by: sgilmore10 <74676073+sgilmore10@users.noreply.github.com>
Co-authored-by: Antoine Pitrou <pitrou@free.fr>
Signed-off-by: Antoine Pitrou <antoine@python.org>
kou added a commit that referenced this pull request Jul 7, 2023
### Rationale for this change

Thanks to @ sgilmore10's [recent changes to enable UTF-8 <-> UTF-16 string conversions](#36167), we can now add support for creating Arrow `String` arrays (UTF-8 encoded) from MATLAB `string` arrays (UTF-16 encoded).

### What changes are included in this PR?

1. Added new `arrow.array.StringArray` class that can be constructed from MATLAB [`string`](https://www.mathworks.com/help/matlab/ref/string.html?s_tid=doc_ta) and [`cellstr`](https://www.mathworks.com/help/matlab/ref/cellstr.html) types. **Note**: We explicitly decided to *not* support [`char`](https://www.mathworks.com/help/matlab/ref/char.html?s_tid=doc_ta) arrays for the time being.
2. Factored out code for extracting "raw" `const uint8_t*` from a MATLAB `logical` Data Array into a new function `bit::unpacked_as_ptr` so that it can be reused across multiple Array `Proxy` classes. See #36335.
3. Added new `arrow.type.StringType` type class and associated `arrow.type.ID.String` enum value.
4. Enabled support for creating `RecordBatch` objects from MATLAB `table`s containing `string` data.
5. Updated `arrow::matlab::array::proxy::Array::toString` code to convert from UTF-8 to UTF-16 for display in MATLAB.

**Examples**

*Most MATLAB `string` arrays round-trip*

```matlab
>> matlabArray = ["A"; "B"; "C"]

matlabArray = 

  3x1 string array

    "A"
    "B"
    "C"

>> arrowArray = arrow.array.StringArray(matlabArray)

arrowArray = 

[
  "A",
  "B",
  "C"
]

>> matlabArrayRoundTrip = toMATLAB(arrowArray)          

matlabArrayRoundTrip = 

  3x1 string array

    "A"
    "B"
    "C"

>> isequal(matlabArray, matlabArrayRoundTrip)

ans =

  logical

   1
```

*MATLAB `string(missing)` values get mapped to `null` by default*

```matlab
>> matlabArray = ["A"; string(missing); "C"]

matlabArray = 

  3x1 string array

    "A"
    <missing>
    "C"

>> arrowArray = arrow.array.StringArray(matlabArray) 

arrowArray = 

[
  "A",
  null,
  "C"
]

>> matlabArrayRoundTrip = toMATLAB(arrowArray) 

matlabArrayRoundTrip = 

  3x1 string array

    "A"
    <missing>
    "C"

>> isequaln(matlabArray, matlabArrayRoundTrip)

ans =

  logical

   1

```

*Unicode characters round-trip*

```matlab
>> matlabArray = ["😊"; "🌲"; "➞"]

matlabArray = 

  3×1 string array

    "😊"
    "🌲"
    "➞"

>> arrowArray = arrow.array.StringArray(matlabArray)

arrowArray = 

[
  "😊",
  "🌲",
  "➞"
]

>> matlabArrayRoundTrip = toMATLAB(arrowArray)

matlabArrayRoundTrip = 

  3×1 string array

    "😊"
    "🌲"
    "➞"
```

*Create `StringArray` from `cellstr`*

```matlab
>> matlabArray = {'red'; 'green'; 'blue'}

matlabArray =

  3×1 cell array

    {'red'  }
    {'green'}
    {'blue' }

>> arrowArray = arrow.array.StringArray(matlabArray)

arrowArray = 

[
  "red",
  "green",
  "blue"
]

>> matlabArrayRoundTrip = toMATLAB(arrowArray)

matlabArrayRoundTrip = 

  3×1 string array

    "red"
    "green"
    "blue"
```

*Create `RecordBatch` from MATLAB `string` data*

```matlab
>> matlabTable = table(["😊"; "🌲"; "➞"])

matlabTable =

  3×1 table

    Var1
    ____

    "😊"
    "🌲"
    "➞" 

>> arrowRecordBatch = arrow.tabular.RecordBatch(matlabTable)

arrowRecordBatch = 

Var1:   [
    "😊",
    "🌲",
    "➞"
  ]

>> matlabTableRoundTrip = toMATLAB(arrowRecordBatch)

matlabTableRoundTrip =

  3×1 table

    Var1
    ____

    "😊"
    "🌲"
    "➞" 

>> isequaln(matlabTable, matlabTableRoundTrip)

ans =

  logical

   1
```

### Are these changes tested?

Yes.

1. Added new `tStringArray` test class.
2. Added new `tStringType` test class.
3. Extended `tRecordBatch` test class to verify support for MATLAB `table`s which contain `string` data (see above).

### Are there any user-facing changes?

Yes.

1. Users can now create `arrow.array.StringArray` objects from MATLAB `string` arrays and `cellstr`s.
2. Users can now create `arrow.type.StringType` objects.
3. Users can now construct `RecordBatch` objects from MATLAB `table`s that contain `string` data.

### Future Directions

1. The implementation of this initial version of `StringArray` is relatively simple in that it does not include a `BinaryArray` class hierarchy. In the future, we will likely want to refactor `StringArray` to inherit from a more general abstract `BinaryArray` class hierarchy.
2. Following on from 1., we will ideally want to add support for `LargeStringArray`, `BinaryArray`, `LargeBinaryArray`, and `FixedLengthBinaryArray` by creating common infrastructure for representing binary types. This initial version of `StringArray` helps to solidify the user-facing design and provides a shorter-term solution for working with `string` data, since it is quite common.
3. It may make sense to change the `arrow.type.Type` hierarchy (e.g. `arrow.type.StringType`) in the future to delegate to C++ `Proxy` classes under the hood. See: #36363.
4. Use `bit::unpacked_as_ptr` in other classes. See #36335.
5. Look for more ways to optimize the conversion from MATLAB UTF-16 encoded string data to Arrow UTF-8 encoded string data (e.g. by avoiding unnecessary data copies).

### Notes

1. Thank you @ sgilmore10 for your help with this pull request!
* Closes: #36250

Lead-authored-by: Kevin Gurney <kgurney@mathworks.com>
Co-authored-by: Kevin Gurney <kevin.p.gurney@gmail.com>
Co-authored-by: Sarah Gilmore <sgilmore@mathworks.com>
Co-authored-by: Sutou Kouhei <kou@cozmixng.org>
Co-authored-by: Sarah Gilmore <silgmore@mathworks.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>
@sgilmore10 sgilmore10 deleted the GH-36166 branch August 21, 2023 18:13
Successfully merging this pull request may close these issues.

[C++][MATLAB] Add utility to convert UTF-8 strings to UTF-16 and UTF-16 strings to UTF-8
3 participants