Support reading fields with non-ASCII names #10796

Closed
rui-mo opened this issue Aug 21, 2024 · 3 comments
Labels
enhancement New feature or request

Comments

@rui-mo
Collaborator

rui-mo commented Aug 21, 2024

Description

Currently the tokenizer only handles ASCII characters, so reading fields with non-ASCII names fails with an exception whose message is 'Invalid subfield path'.

A reproducible unit test is available at rui-mo@8dc1ea4. To make this test work, a temporary change was made to the 'isUnquotedSubscriptCharacter' function by removing 'isalnum'.

To fix this issue, should we support UTF-8 characters in the Tokenizer, for example by replacing 'isalnum' with 'u_isalnum' from the ICU library?
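
For illustration, here is a minimal sketch of why an ASCII-only character check rejects UTF-8 field names. The functions `isAsciiSubscriptChar` and `firstRejectedByte` are hypothetical stand-ins written for this example; they are not the actual 'isUnquotedSubscriptCharacter' implementation, and treating '_' as valid is an assumption. The sketch also assumes the default "C" locale, in which `std::isalnum` accepts only ASCII letters and digits.

```cpp
#include <cctype>
#include <string_view>

// Hypothetical stand-in for an ASCII-only check; not the actual Velox code.
inline bool isAsciiSubscriptChar(char c) {
  // Cast to unsigned char first: passing a negative value to std::isalnum
  // is undefined behavior.
  return std::isalnum(static_cast<unsigned char>(c)) || c == '_';
}

// Returns the index of the first byte that the ASCII-only check rejects,
// or std::string_view::npos if every byte is accepted.
inline std::size_t firstRejectedByte(std::string_view name) {
  for (std::size_t i = 0; i < name.size(); ++i) {
    if (!isAsciiSubscriptChar(name[i])) {
      return i;
    }
  }
  return std::string_view::npos;
}
```

Every byte of a multi-byte UTF-8 sequence has its high bit set, so a per-byte ASCII check stops at the first non-ASCII character of a name. A Unicode-aware check would instead have to decode code points before classifying them.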

@Yuhta
Contributor

Yuhta commented Aug 21, 2024

Gluten should parse the field names into Subfield directly, without going through the Tokenizer (which we will rename to PrestoTokenizer).
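
A rough sketch of that idea follows, using a simplified, hypothetical `FieldPath` type rather than the real Subfield class. Both `FieldPath` and `makeFieldPath` are names invented for this example. When the caller already knows the field name, it can be stored as-is, so no character classification takes place.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Hypothetical, simplified stand-in for a subfield path:
// an ordered list of nested field names.
struct FieldPath {
  std::vector<std::string> names;
};

// Builds a single-element path from an already-known field name.
// The name is stored byte for byte, so UTF-8 names pass through unchanged.
inline FieldPath makeFieldPath(std::string name) {
  assert(!name.empty());
  FieldPath path;
  path.names.push_back(std::move(name));
  return path;
}
```

Because nothing is tokenized, characters that the tokenizer would reject or treat as separators remain part of the name.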

@rui-mo
Collaborator Author

rui-mo commented Aug 22, 2024

@Yuhta Thanks for your feedback. I will take a look.

@rui-mo
Collaborator Author

rui-mo commented Sep 25, 2024

This issue can be solved with the solution recommended above. Thanks.

@rui-mo rui-mo closed this as completed Sep 25, 2024