[SPARK-47101][SQL] Allow comma to be used in top-level column names and remove check nested type definition in HiveExternalCatalog.verifyDataSchema
#45180
Conversation
dongjoon-hyun left a comment
- I believe this is an improvement rather than a bug fix, since it provides better Hive compatibility, @yaooqinn. If you don't mind, could you fix the PR description?
- Does this comply with other RDBMSes too? I'm curious whether this is another esoteric Hive feature or not.
sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala
| errorClass = "INVALID_HIVE_COLUMN_TYPE", | ||
| parameters = Map( | ||
| "invalidChars" -> "',', ':', ';'", | ||
| "detailMessage" -> msg, |
We should avoid embedding arbitrary text as parameters:

- Clients might reassemble error messages from parameters and show them in different languages.
- If you want to provide more details, just pass the cause exception as `cause` to `AnalysisException`.
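For illustration, a minimal sketch of that suggestion, assuming the `AnalysisException` constructor that takes `errorClass`, `messageParameters`, and `cause` (the name `hiveParseError` is hypothetical):

```scala
import org.apache.spark.sql.AnalysisException

// hiveParseError stands in for the underlying exception caught from Hive;
// the point is to carry it as `cause` rather than flatten its message into
// a translatable parameter.
val hiveParseError = new IllegalArgumentException("Error: ':' expected at position 7")
throw new AnalysisException(
  errorClass = "INVALID_HIVE_COLUMN_TYPE",
  messageParameters = Map("invalidChars" -> "',', ':', ';'"),
  cause = Some(hiveParseError))
```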
Thank you for the information.
This does not change the parser layer, which means we already have the capability to handle special characters in column names. This schema verification happens only in the Hive catalog, while the v1 in-memory, v2 JDBC, and other catalogs are free to use any character in column names. We do not intend to comply with user behavior on Hive, which we already do, but rather with the underlying restrictions when calling HMS APIs.
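For illustration (a hypothetical session, not from the PR; the table name is made up), backquoted identifiers already carry such characters through the parser, and data source tables store them without this verification:

```scala
// Backquoted identifiers let special characters through the parser; only
// the Hive-catalog schema verification used to reject some of them later.
spark.sql("CREATE TABLE mem_t (`a,b` INT, `x;y` STRING) USING parquet")
spark.sql("SELECT `a,b`, `x;y` FROM mem_t").show()
```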
Got it. Thank you for the clarification~
Changed the title: HiveExternalCatalog.verifyDataSchema comply with hive column name rules
Also, cc @cloud-fan
Is this offloading actually good and safe at the Apache Spark layer from a long-term perspective?

`TypeInfoUtils.getTypeInfoFromTypeString(f.dataType.catalogString)`

- Although this is for the data schema, after this PR, are we consistent for partition column names?
- Although Apache Spark already provides slightly different logic for data sources and Hive tables, are we going to become more consistent with Apache Parquet and Apache ORC data source tables after this PR?
Hi @dongjoon-hyun.
It might be necessary to verify that commas can be safely used in partition names, as they are allowed in column names.
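A hypothetical follow-up check, not part of this PR, for how a comma in a partition value lands in the partition spec and the storage path (table and column names are made up):

```scala
// Create a partitioned table, write a partition value containing a comma,
// then inspect the resulting partition spec.
spark.sql("CREATE TABLE pt (id INT) USING parquet PARTITIONED BY (p STRING)")
spark.sql("INSERT INTO pt PARTITION (p = 'a,b') VALUES (1)")
spark.sql("SHOW PARTITIONS pt").show(truncate = false)
```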
Thank you. Could you revise the PR title to specifically narrow down to the following additional contribution instead of saying …
Ur, for the above example, it looks unsafe in URLs (S3 or Web-URL-based Hadoop-compatible file systems). Can we use …
https://docs.aws.amazon.com/AmazonS3/latest/userguide/object-keys.html#object-key-guidelines
I see Comma (",") and Equals ("=") in the same group of characters that might require special handling.
Changed the title: HiveExternalCatalog.verifyDataSchema comply with hive column name rules → TypeInfoUtils.getTypeInfoFromTypeString to check nested type definition in HiveExternalCatalog.verifyDataSchema
Hi @dongjoon-hyun, I updated the title and PR description; please check whether they are clearer now or too wordy.
```scala
// Checks top-level column names
case _ if f.name.contains(",") =>
  try {
    TypeInfoUtils.getTypeInfoFromTypeString(f.dataType.catalogString)
```
what does it do? I can't find it in the previous code.
This tokenizes the input, such as `string` or `struct<ab:int>`, and then parses it into org.apache.hadoop.hive.serde2.typeinfo.TypeInfo.
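For reference, a minimal standalone sketch of how this Hive utility behaves (using the names from Hive's serde2 package as quoted above):

```scala
import org.apache.hadoop.hive.serde2.typeinfo.{TypeInfo, TypeInfoUtils}

// Parses a Hive type string into a TypeInfo tree; malformed type strings
// make the tokenizer/parser throw, which is what the check relies on.
val primitive: TypeInfo = TypeInfoUtils.getTypeInfoFromTypeString("string")
val nested: TypeInfo = TypeInfoUtils.getTypeInfoFromTypeString("struct<ab:int>")
println(nested.getTypeName) // struct<ab:int>
```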
why do we need this new check now?
Ah, you're absolutely right. We don't need this check; then it seems that we can remove verifyDataSchema entirely.
Changed the title: TypeInfoUtils.getTypeInfoFromTypeString to check nested type definition in HiveExternalCatalog.verifyDataSchema → HiveExternalCatalog.verifyDataSchema
sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveDDLSuite.scala
```scala
exception = intercept[SparkException] {
  sql(s"CREATE TABLE t (a $typ) USING hive")
},
errorClass = "CANNOT_RECOGNIZE_HIVE_TYPE",
```
just for my education, where do we throw this error?
In HiveClientImpl.getSparkSQLDataType
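A simplified sketch of what that method does, with the error plumbing reduced to a plain `SparkException` (the real code in HiveClientImpl uses Spark's error-class framework, so the exact wrapping here is assumed):

```scala
import org.apache.spark.SparkException
import org.apache.spark.sql.catalyst.parser.CatalystSqlParser
import org.apache.spark.sql.types.DataType

// Parse the Hive type string back into a Catalyst DataType; on failure,
// surface it under CANNOT_RECOGNIZE_HIVE_TYPE.
def getSparkSQLDataType(typeString: String, fieldName: String): DataType =
  try CatalystSqlParser.parseDataType(typeString)
  catch {
    case e: Throwable =>
      throw new SparkException(
        s"[CANNOT_RECOGNIZE_HIVE_TYPE] Cannot recognize hive type string: " +
        s"$typeString, column: $fieldName", e)
  }
```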
Got it. We can reuse the existing error class in this case.
```scala
// delimiter characters
Seq(",", ":").foreach { c =>
  val typ = s"array<struct<`abc${c}xyz`:int>>"
  val replaced = typ.replaceAll("`", "").replaceAll("(?<=struct<|,)([^,<:]+)(?=:)", "`$1`")
```
Does this replace rule come from Hive? Can we have a link?
OK
I feel it's clearer to write the string literal of the replaced value, instead of using this complex regex.
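For reference, tracing the regex by hand (my own worked trace, not from the PR) gives these literal values for the two delimiters:

```scala
// typ.replaceAll("`", "").replaceAll("(?<=struct<|,)([^,<:]+)(?=:)", "`$1`")
//   c = ","  :  "array<struct<`abc,xyz`:int>>" -> "array<struct<abc,`xyz`:int>>"
//   c = ":"  :  "array<struct<`abc:xyz`:int>>" -> "array<struct<`abc`:xyz:int>>"
val replacedComma = "array<struct<abc,`xyz`:int>>"
val replacedColon = "array<struct<`abc`:xyz:int>>"
```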
| sql(s"CREATE TABLE t (a $typ) USING hive") | ||
| }, | ||
| errorClass = "INVALID_HIVE_COLUMN_NAME", | ||
| errorClass = "_LEGACY_ERROR_TEMP_3065", |
Why does this PR switch from INVALID_HIVE_COLUMN_NAME to _LEGACY_ERROR_TEMP_3065?
Can we exclude the deletion of INVALID_HIVE_COLUMN_NAME from this PR?
- docs/sql-error-conditions.md
- common/utils/src/main/resources/error/error-classes.json
INVALID_HIVE_COLUMN_NAME is not necessary anymore: 1) the restrictions on top-level column names have been removed in this PR, and 2) nested field names belong to the data type part. For these two reasons, INVALID_HIVE_COLUMN_NAME can be removed.
_LEGACY_ERROR_TEMP_3065 is thrown by org.apache.spark.sql.hive.HiveExternalCatalog#withClient. It's hard to distinguish one Hive error from another for metastore API calls.
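A simplified sketch of that wrapping behavior, assuming an `AnalysisException` constructor taking `errorClass` and `messageParameters` (the `isClientException` predicate is a stand-in for the real helper):

```scala
import scala.util.control.NonFatal
import org.apache.spark.sql.AnalysisException

object HiveCallWrapper {
  // Stand-in for the real helper that recognizes Hive client failures.
  private def isClientException(e: Throwable): Boolean =
    e.getClass.getName.startsWith("org.apache.hadoop.hive")

  // Every Hive client failure is rewrapped under the one generic legacy
  // error class, so individual HMS errors are hard to tell apart downstream.
  def withClient[T](body: => T): T = try body catch {
    case NonFatal(e) if isClientException(e) =>
      throw new AnalysisException(
        errorClass = "_LEGACY_ERROR_TEMP_3065",
        messageParameters = Map("clazz" -> e.getClass.getName, "msg" -> e.getMessage))
  }
}
```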
| "tableName" -> "`spark_catalog`.`default`.`t1`", | ||
| "columnName" -> "`DATE '2018-01-01' + make_dt_interval(0, id, 0, 0`.`000000)`") | ||
| ) | ||
| sql("CREATE TABLE t1 STORED AS parquet SELECT id as `a,b` FROM range(1)") |
Thank you for adding this simpler version.
However, if you don't mind, shall we keep the existing test case, too?
SELECT id, DATE'2018-01-01' + MAKE_DT_INTERVAL(0, id) FROM RANGE(0, 10)
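For context, a sketch of that original repro as a full statement (my own reconstruction; the auto-generated column name for the interval expression contains commas and a dot, matching the `columnName` parameter quoted in the diff above, which is what originally hit the Hive column-name verification):

```scala
// The generated column name looks roughly like
// "DATE '2018-01-01' + make_dt_interval(0, id, 0, 0.000000)".
spark.sql(
  "CREATE TABLE t2 STORED AS parquet " +
  "SELECT id, DATE'2018-01-01' + MAKE_DT_INTERVAL(0, id) FROM RANGE(0, 10)")
```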
Hi @dongjoon-hyun, this was changed at @cloud-fan's request: #45180 (comment)
Ah, got it~
dongjoon-hyun left a comment
+1, LGTM. Thank you, @yaooqinn .
Merged to master for Apache Spark 4.0.0.
Thank you @dongjoon-hyun @cloud-fan @MaxGekk
```scala
withTable("t") {
  checkError(
    exception = intercept[SparkException] {
      sql(s"CREATE TABLE t (a $typ) USING hive")
```
for parquet tables, do we still have this error?
Still fine
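A hedged illustration of that exchange (the table name is made up; per the answer above, this is expected to succeed because a data source table's schema is kept in table properties rather than validated against Hive's type parser):

```scala
// The same nested field name that fails for `USING hive` is fine here.
val typ = "struct<`abc,xyz`:int>"
spark.sql(s"CREATE TABLE t_parquet (a $typ) USING parquet") // expected to succeed
```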
[SPARK-47101][SQL] Allow comma to be used in top-level column names and remove check nested type definition in `HiveExternalCatalog.verifyDataSchema`

### What changes were proposed in this pull request?

> In Hive 0.13 and later, column names can contain any [Unicode](http://en.wikipedia.org/wiki/List_of_Unicode_characters) character (see [HIVE-6013](https://issues.apache.org/jira/browse/HIVE-6013)); however, dot (.) and colon (:) yield errors on querying, so they are disallowed in Hive 1.2.0 (see [HIVE-10120](https://issues.apache.org/jira/browse/HIVE-10120)). Any column name that is specified within backticks (`) is treated literally. Within a backtick string, use double backticks (``) to represent a backtick character. Backtick quotation also enables the use of reserved keywords for table and column identifiers.

According to the Hive documentation, column names have the flexibility to contain any character from the Unicode set. This PR makes HiveExternalCatalog.verifyDataSchema:

- Allow comma to be used in top-level column names
- Remove the check of invalid characters in nested type definitions against the hard-coded set ",:;", which turns out to be incomplete; for example, "^%" and others are also not allowed but were not caught. These checks are all delegated to Hive API calls instead.

### Why are the changes needed?

Improvement.

### Does this PR introduce _any_ user-facing change?

Yes, some special characters are now allowed, and errors for some invalid characters now throw Spark errors instead of Hive metastore errors.

### How was this patch tested?

New tests.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes apache#45180 from yaooqinn/SPARK-47101.

Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>

What changes were proposed in this pull request?
According to Hive Doc, the column names have the flexibility to contain any character from the Unicode set.
This PR makes HiveExternalCatalog.verifyDataSchema:
Why are the changes needed?
improvement
Does this PR introduce any user-facing change?
yes, some special characters are now allowed and errors for some invalid characters now throw Spark Errors instead of Hive Meta Errors
How was this patch tested?
new tests
Was this patch authored or co-authored using generative AI tooling?
no
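The user-facing effect in one line, taken from the new test added by this PR: a comma in a top-level column name no longer fails schema verification for Hive-served tables.

```scala
// Before this change, the generated column name `a,b` was rejected by
// HiveExternalCatalog.verifyDataSchema; now the statement succeeds.
spark.sql("CREATE TABLE t1 STORED AS parquet SELECT id AS `a,b` FROM range(1)")
```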