Customize field separators #7252
Comments
@Yuhta Could you please take a look at this issue? Thanks. |
I think it can be part of the connector config or the query config. Does it differ from query to query, or does it depend on the connector? |
In Spark, it does not vary from query to query. Field names containing a dot are allowed. |
I see, so it is specific to certain files. I would make it a connector config and set it in all connectors that read these files. |
When you generate subfield pruning, you will still need to use the customized separator, though. |
@Yuhta Thank you. I will check how to add this config. |
Hello @rui-mo |
@zhli1142015 No update from my side recently. I plan to try a config as suggested when I get free time. |
Sure, let me work on this. |
Hello @rui-mo and @Yuhta
Spark output:
Gluten-Velox output:
|
@zhli1142015 Thank you. I agree that we need to allow a customized tokenizer in Velox, though. One question: with SparkTokenizer, how do we decide in the code which one to use? Or should the default one be changed accordingly? |
We should only use SparkTokenizer when running Spark queries, and it should be registered during Gluten |
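The registration idea above could be sketched roughly as follows. This is a minimal, self-contained illustration and not Velox or Gluten code: `PathTokenizer`, `DotTokenizer`, `WholeNameTokenizer`, `registerTokenizerFactory` and `tokenizePath` are all hypothetical names chosen for this example. The assumption is that the engine keeps a process-wide factory with a dot-splitting default, and that an embedding application replaces it once at startup.

```cpp
#include <functional>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical interface; not the actual Velox tokenizer API.
class PathTokenizer {
 public:
  virtual ~PathTokenizer() = default;
  virtual std::vector<std::string> tokenize(const std::string& path) const = 0;
};

// Default behavior in this sketch: every '.' separates nested field names.
class DotTokenizer : public PathTokenizer {
 public:
  std::vector<std::string> tokenize(const std::string& path) const override {
    std::vector<std::string> names;
    std::string current;
    for (char c : path) {
      if (c == '.') {
        names.push_back(current);
        current.clear();
      } else {
        current.push_back(c);
      }
    }
    names.push_back(current);
    return names;
  }
};

// Alternative for engines that allow dots in names: never split.
class WholeNameTokenizer : public PathTokenizer {
 public:
  std::vector<std::string> tokenize(const std::string& path) const override {
    return {path};
  }
};

using TokenizerFactory = std::function<std::unique_ptr<PathTokenizer>()>;

// Process-wide factory, initialized with the dot-splitting default.
TokenizerFactory& tokenizerFactory() {
  static TokenizerFactory factory = [] {
    return std::make_unique<DotTokenizer>();
  };
  return factory;
}

// Called once by the embedding application during its initialization.
void registerTokenizerFactory(TokenizerFactory factory) {
  tokenizerFactory() = std::move(factory);
}

// All call sites go through the registered factory.
std::vector<std::string> tokenizePath(const std::string& path) {
  return tokenizerFactory()()->tokenize(path);
}
```

With a design like this, call sites that cannot be changed individually still pick up the registered tokenizer, because they all go through the same factory. The sketch is not thread-safe; it assumes registration happens before any query runs.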
@zhli1142015 How hard is it to make the current |
Hello @Yuhta ,
If we want to make now |
@zhli1142015 The tokenizer is needed to pass down subfield pruning information, so Spark would still need it if it wants to do subfield pruning. We need to find some characters that can be used as field separators in Spark, and fix the tokenizer to work with these customized separators. I also see that we can take |
@rui-mo @zhli1142015 I have recently been working on similar issues and ran into this in Prestissimo as well (#10785). In my understanding, whether a dot (.) or other special characters may be part of a column name is not a Spark vs. Presto matter. How column names can be defined depends on each table format's specification. Hive should allow all Unicode characters except dot (.) and colon (:), as specified in https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL: Table names and column names are case insensitive but SerDe and property names are case sensitive.
According to https://issues.apache.org/jira/browse/HIVE-10120, column names like
Iceberg normalizes special characters into other ASCII characters (see apache/iceberg#10120). For example, an Iceberg table column named TEST:A1B2.RAW.ABC-GG-1-A is transformed into TEST_x3AA1B2_x2ERAW_x2EABC_x2DGG_x2D1_x2DA. This behavior differs from Hive, so Iceberg should use a specialized Tokenizer. So it seems to me that
If it is really impossible to stop users from using dots in column names, maybe we can require them to quote such names in backticks, while keeping the plain dot as the delimiter, e.g.
This does not conform 100% to the Hive spec, but it is open for discussion. |
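The backtick proposal above could look roughly like the sketch below. It is only an illustration under stated assumptions: `splitQuotedPath` is a hypothetical helper, not a Velox function, and it assumes that backticks only delimit a name (no escaping of a literal backtick) and that subscripts are out of scope.

```cpp
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper: splits a path on '.', except inside backticks.
// Backticks are treated as quoting characters and are not kept in the names.
std::vector<std::string> splitQuotedPath(const std::string& path) {
  std::vector<std::string> names;
  std::string current;
  bool quoted = false;
  for (char c : path) {
    if (c == '`') {
      quoted = !quoted;  // Enter or leave a quoted name.
    } else if (c == '.' && !quoted) {
      names.push_back(current);  // An unquoted dot ends the current name.
      current.clear();
    } else {
      current.push_back(c);  // Quoted dots are ordinary characters.
    }
  }
  if (quoted) {
    throw std::invalid_argument("unterminated backtick in: " + path);
  }
  names.push_back(current);
  return names;
}
```

Under these assumptions, the illustrative input a.`b.c`.d splits into three names: a, b.c and d. An unquoted a.b.c still splits into three single-letter names, so existing paths keep their meaning.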
@yingsu00 Thanks for sharing this knowledge. Spark allows a dot in the column name, but with the approach in #10693 the child scan creation does not use the Tokenizer, and Gluten can pass a Subfield created from the path elements to Velox, which avoids the Tokenizer entirely. |
Description
Previously, field separators were added as a parameter of Subfield in #6014. The only use of Subfield in Gluten is to create SubfieldFilters, as in SubstraitToVeloxPlan.cpp#L718-L726. We found that simply customizing the separators there is not enough to disable tokenizing field names by dot.
Subfield is used in many places in the Velox scan, including HiveDataSource.cpp#L762, TableHandle.cpp#L193-L194 and ScanSpec.cpp#L422. We cannot control these call sites, so the default separators are still used there.
To make it work, we currently have to hard-code a change to the default separators, replacing the dot with '\0'. Would it be possible to add a flag to the Velox flags so that users can change the default separators?
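To illustrate why replacing the dot with '\0' disables splitting, here is a minimal sketch of a tokenizer whose separator set is passed in by the caller. `splitPath` is a hypothetical stand-in written for this issue, not the Velox Tokenizer, and it only handles nested field names.

```cpp
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical, simplified stand-in for a subfield path tokenizer.
// Any character in `separators` ends the current field name.
std::vector<std::string> splitPath(
    const std::string& path,
    const std::unordered_set<char>& separators) {
  std::vector<std::string> names;
  std::string current;
  for (char c : path) {
    if (separators.count(c) > 0) {
      names.push_back(current);
      current.clear();
    } else {
      current.push_back(c);
    }
  }
  names.push_back(current);
  return names;
}
```

With the separator set {'.'}, the path "a.b" yields the two names a and b. With {'\0'}, the same path stays a single name a.b, because the NUL character does not occur in ordinary column names. A configurable default would give every call site this behavior without patching the source.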