Support reading fields with non-ASCII names #10796

rui-mo · 2024-08-21T07:44:26Z

Description

Currently, the Tokenizer only handles ASCII characters, so reading fields with non-ASCII names fails with an exception carrying the message 'Invalid subfield path'.

A reproducible unit test is available at rui-mo@8dc1ea4. To make this test work, a temporary change was made to the 'isUnquotedSubscriptCharacter' function by removing 'isalnum'.

To fix this issue, should we support UTF-8 characters in the Tokenizer, for example by replacing 'isalnum' with 'u_isalnum' from the ICU library?
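To illustrate why an ASCII-only check rejects such names, here is a minimal standalone sketch. It is not the Velox Tokenizer code: 'isAsciiFieldChar', 'isUtf8FieldChar' and 'firstRejected' are hypothetical names, and the relaxed check simply accepts any byte with the high bit set instead of calling ICU's 'u_isalnum', so it neither validates UTF-8 nor classifies code points.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical stand-in for an ASCII-only identifier check (not Velox's actual code).
inline bool isAsciiFieldChar(char c) {
  // Cast to unsigned char first: passing a negative value to std::isalnum is undefined behavior.
  return c == '_' || std::isalnum(static_cast<unsigned char>(c)) != 0;
}

// Hypothetical relaxed check: additionally accepts every byte with the high bit set,
// i.e. the lead and continuation bytes of multi-byte UTF-8 sequences.
inline bool isUtf8FieldChar(char c) {
  return (static_cast<unsigned char>(c) & 0x80) != 0 || isAsciiFieldChar(c);
}

// Returns the index of the first byte rejected by pred, or std::string::npos if all bytes pass.
template <typename Pred>
std::size_t firstRejected(const std::string& name, Pred pred) {
  for (std::size_t i = 0; i < name.size(); ++i) {
    if (!pred(name[i])) {
      return i;
    }
  }
  return std::string::npos;
}
```

With the default "C" locale, a Cyrillic name encoded as UTF-8 is rejected at its first byte by the ASCII-only check and accepted in full by the relaxed one. A real fix based on 'u_isalnum' would first need to decode bytes into code points, which this sketch deliberately skips.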


Yuhta · 2024-08-21T14:10:40Z

Gluten should parse the field names into Subfield directly, without going through the Tokenizer (which we will rename to PrestoTokenizer).
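A rough sketch of that idea, assuming the caller already has the field names as separate strings: the path is built element by element, so no character of a name is ever inspected or tokenized. 'FieldPathElement', 'FieldPath' and 'makeFieldPath' are hypothetical stand-ins for illustration, not the actual Subfield API.

```cpp
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for one nested-field path element (not Velox's Subfield type).
struct FieldPathElement {
  std::string name;
};

// Hypothetical stand-in for a full path: an ordered list of nested-field elements.
struct FieldPath {
  std::vector<FieldPathElement> elements;
};

// Builds a path from already-separated field names. Each name is moved in
// byte-for-byte, so non-ASCII characters, dots, or spaces need no escaping.
inline FieldPath makeFieldPath(std::vector<std::string> names) {
  FieldPath path;
  path.elements.reserve(names.size());
  for (auto& name : names) {
    path.elements.push_back(FieldPathElement{std::move(name)});
  }
  return path;
}
```

Because the names never pass through a character-level parser, this route sidesteps the ASCII restriction entirely rather than widening it.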

rui-mo · 2024-08-22T05:08:07Z

@Yuhta Thanks for your feedback. I will take a look.

rui-mo · 2024-09-25T01:55:32Z

This issue can be solved with the solution recommended above. Thanks.

rui-mo added the enhancement (New feature or request) label Aug 21, 2024

rui-mo mentioned this issue Aug 21, 2024

[VL] Column name containing parts of Cyrillic cannot be read correctly apache/incubator-gluten#6843

Open

rui-mo closed this as completed Sep 25, 2024
