
Conversation

@yaooqinn (Member) commented Feb 20, 2024

What changes were proposed in this pull request?

In Hive 0.13 and later, column names can contain any Unicode character (see HIVE-6013), however, dot (.) and colon (:) yield errors on querying, so they are disallowed in Hive 1.2.0 (see HIVE-10120). Any column name that is specified within backticks (`) is treated literally. Within a backtick string, use double backticks (``) to represent a backtick character. Backtick quotation also enables the use of reserved keywords for table and column identifiers.

According to the Hive docs, column names can contain any character from the Unicode set.

This PR makes HiveExternalCatalog.verifyDataSchema:

  • Allow comma to be used in top-level column names
  • Remove the check for invalid characters in nested type definitions. The check was hard-coded to ",:;" and turned out to be incomplete; characters such as "^" and "%" are also rejected by Hive. Validation is now delayed to the Hive API calls instead (a sketch of the removed pre-check follows below).
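A simplified sketch of the old pre-check that this PR removes (illustrative only, not the literal Spark code; the exception type here is a stand-in for the real Spark error class):

import org.apache.spark.sql.types.{StructField, StructType}

// Nested field names were scanned only for a hard-coded delimiter set, so
// other characters Hive rejects ("^", "%", ...) slipped through this check
// and failed later inside Hive anyway.
val invalidChars = Seq(",", ":", ";")

def verifyNestedColumnNames(schema: StructType): Unit = schema.foreach { f =>
  f.dataType match {
    case st: StructType => verifyNestedColumnNames(st)
    case _ if invalidChars.exists(f.name.contains) =>
      throw new IllegalArgumentException(
        s"Invalid character in nested column name: ${f.name}")
    case _ =>
  }
}

// After this PR: top-level names may contain commas, and invalid nested type
// definitions are left to fail inside the Hive API calls instead.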

Why are the changes needed?

improvement

Does this PR introduce any user-facing change?

Yes. Some special characters are now allowed in column names, and invalid characters now surface as Spark errors instead of Hive metastore errors.

How was this patch tested?

new tests

Was this patch authored or co-authored using generative AI tooling?

no

@dongjoon-hyun (Member) left a comment

  1. I believe this is an improvement instead of a bug fix, providing better Hive compatibility, @yaooqinn. If you don't mind, could you fix the PR description?

[Screenshot 2024-02-20 at 08:16:38]

  2. Does this comply with other RDBMSes, too? I'm curious whether this is another esoteric Hive feature or not.

errorClass = "INVALID_HIVE_COLUMN_TYPE",
parameters = Map(
"invalidChars" -> "',', ':', ';'",
"detailMessage" -> msg,
Member

We should avoid embedding arbitrary text as parameters.

  • If you want to provide more details, just put the cause exception as the cause of the AnalysisException (see the sketch below).
  • Clients might reassemble error messages from parameters and show them in different languages.
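A minimal sketch of the suggested pattern, assuming the AnalysisException constructor that takes an error class, a parameter map, and an optional cause (the constructor shape may vary across Spark versions, and the error class here is illustrative):

import org.apache.spark.sql.AnalysisException

// Sketch only: keep parameters machine-readable and attach the original
// exception as the cause, rather than embedding its message as a parameter.
def wrapHiveFailure(fieldName: String, fieldType: String, e: Throwable): Nothing =
  throw new AnalysisException(
    errorClass = "CANNOT_RECOGNIZE_HIVE_TYPE",  // illustrative choice
    messageParameters = Map(
      "fieldType" -> fieldType,
      "fieldName" -> fieldName),
    cause = Some(e))                            // details travel as the cause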

Member Author

Thank you for the information

@yaooqinn (Member Author) commented Feb 21, 2024

Does this comply with other RDBMSes too? I'm curious if this is another Hive esoteric feature or not.

This does not change the parser layer, which means we already have the capability to handle special characters in the column names. This schema verification happens only in the hive catalog, while v1 in-memory, v2 jdbc, and other catalogs are free to use any character in column names.

We do not aim to match Hive's user-facing behavior, which we already do, but rather the underlying restrictions that apply when calling HMS APIs.

@dongjoon-hyun (Member)

Got it. Thank you for the clarification~

@dongjoon-hyun changed the title: [SPARK-47101][SQL] Make HiveExternalCatalog.verifyDataSchema comply with hive column name rules (Feb 21, 2024)
@dongjoon-hyun (Member)

Also, cc @cloud-fan

@dongjoon-hyun (Member) left a comment

Is this offloading actually good and safe at the Apache Spark layer from a long-term perspective?

TypeInfoUtils.getTypeInfoFromTypeString(f.dataType.catalogString)

  • Although this check is for the data schema, after this PR, are we consistent for partition column names?
  • Although Apache Spark already applies slightly different logic to data sources and Hive tables, do we become more consistent with Apache Parquet and Apache ORC data source tables after this PR?

@yaooqinn (Member Author) commented Feb 21, 2024

Hi @dongjoon-hyun.

  • This PR allows the use of commas in column names.

  • In contrast, we no longer disallow other special characters in nested type definitions. Since invalid characters eventually make the HMS call TypeInfoUtils.getTypeInfoFromTypeString fail, we simply bring that failure point forward instead of pre-checking only ",:;" as before.

It might be necessary to verify that commas can be safely used in partition names, as they are allowed in column names.

create table a(`a,b` int, c int) using hive PARTITIONED BY (`a,b`);
insert into a values(1, 2);
select * from a;
-- output
1	2


spark-sql (default)> !tree spark-warehouse/a;
spark-warehouse/a
└── a,b=2
    ├── part-00000-b75cb28d-3fb0-4858-b93d-3f089d3e63b4.c000
    └── part-00000-e558ae00-dcae-4025-bc6f-819a1debf209.c000

@dongjoon-hyun (Member)

Thank you. Could you revise the PR title to narrow it down specifically to the following additional contribution, instead of saying hive column name rules?

This PR allows the use of commas in column names.

@dongjoon-hyun (Member)

Ur, for the above example, it looks unsafe in URLs (S3 or web-URL-based Hadoop-compatible file systems). Can we use , in the middle of a URI (except the file name part)?

spark-warehouse/a
└── a,b=2
    ├── part-00000-b75cb28d-3fb0-4858-b93d-3f089d3e63b4.c000
    └── part-00000-e558ae00-dcae-4025-bc6f-819a1debf209.c000

@yaooqinn (Member Author)

Ur, for the above example, it looks unsafe in URLs (S3 or web-URL-based Hadoop-compatible file systems). Can we use , in the middle of a URI (except the file name part)?

https://docs.aws.amazon.com/AmazonS3/latest/userguide/object-keys.html#object-key-guidelines

I see Comma (",") and Equals ("=") listed in the same group of "characters that might require special handling". As = is always there for partition keys anyway, it should be safe, right?

@yaooqinn changed the title from "[SPARK-47101][SQL] Make HiveExternalCatalog.verifyDataSchema comply with hive column name rules" to "[SPARK-47101][SQL] Allow comma to be used in top-level column names and use TypeInfoUtils.getTypeInfoFromTypeString to check nested type definition in HiveExternalCatalog.verifyDataSchema" (Feb 21, 2024)
@yaooqinn (Member Author)

Hi @dongjoon-hyun, I updated the title and PR description; please check whether they are clearer now or too wordy.

// Checks top-level column names
case _ if f.name.contains(",") =>
  try {
    TypeInfoUtils.getTypeInfoFromTypeString(f.dataType.catalogString)
@cloud-fan (Contributor) commented Feb 21, 2024

what does it do? I can't find it in the previous code.

Member Author

This tokenizes the input, such as string or struct<ab:int>, and then parses it into an org.apache.hadoop.hive.serde2.typeinfo.TypeInfo.
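For illustration, a small sketch of calling that Hive utility directly; the exception type on failure is an assumption based on Hive's tokenizer:

import org.apache.hadoop.hive.serde2.typeinfo.{TypeInfo, TypeInfoUtils}

// A valid type string parses into a Hive TypeInfo tree.
val ok: TypeInfo = TypeInfoUtils.getTypeInfoFromTypeString("struct<ab:int>")
println(ok.getTypeName)  // struct<ab:int>

// An invalid one, e.g. "struct<a^b:int>", makes the tokenizer throw
// (an IllegalArgumentException in the Hive versions I have seen), which is
// the failure that was previously pre-checked only for ",", ":", ";".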

Contributor

why do we need this new check now?

Member Author

Ah, you're absolutely right. We don't need this check; in that case it seems we can remove verifyDataSchema entirely.

@yaooqinn changed the title from "[SPARK-47101][SQL] Allow comma to be used in top-level column names and use TypeInfoUtils.getTypeInfoFromTypeString to check nested type definition in HiveExternalCatalog.verifyDataSchema" to "[SPARK-47101][SQL] Allow comma to be used in top-level column names and remove check nested type definition in HiveExternalCatalog.verifyDataSchema" (Feb 21, 2024)
exception = intercept[SparkException] {
  sql(s"CREATE TABLE t (a $typ) USING hive")
},
errorClass = "CANNOT_RECOGNIZE_HIVE_TYPE",
Contributor

just for my education, where do we throw this error?

Member Author

In HiveClientImpl.getSparkSQLDataType
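For context, a rough sketch of that conversion path, approximating (not quoting) HiveClientImpl.getSparkSQLDataType; the SparkException wrapping here is simplified:

import org.apache.hadoop.hive.metastore.api.FieldSchema
import org.apache.spark.SparkException
import org.apache.spark.sql.catalyst.parser.{CatalystSqlParser, ParseException}
import org.apache.spark.sql.types.DataType

// Hive hands back a type string; if Spark's parser cannot recognize it, the
// failure is rethrown under the CANNOT_RECOGNIZE_HIVE_TYPE error class.
def getSparkSQLDataType(hc: FieldSchema): DataType =
  try {
    CatalystSqlParser.parseDataType(hc.getType)
  } catch {
    case e: ParseException =>
      throw new SparkException(
        s"[CANNOT_RECOGNIZE_HIVE_TYPE] ${hc.getType} for column ${hc.getName}", e)
  }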

Member

Got it. We can reuse the existing error class in this case.

// delimiter characters
Seq(",", ":").foreach { c =>
  val typ = s"array<struct<`abc${c}xyz`:int>>"
  val replaced = typ.replaceAll("`", "").replaceAll("(?<=struct<|,)([^,<:]+)(?=:)", "`$1`")
Member

Does this replace rule come from Hive? Can we have a link?

Member Author

OK

Contributor

I feel it's clearer to write the string literal of the replaced value, instead of using this complex regex.
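For reference, a hand-expanded version of what that regex produces for the two delimiters above; these literals are my own derivation and worth double-checking against the test:

// Hand-expanded literals for the two delimiters exercised above; an assert
// like this could replace the regex if spelled-out strings are preferred.
val expected = Seq(
  "," -> ("array<struct<`abc,xyz`:int>>", "array<struct<abc,`xyz`:int>>"),
  ":" -> ("array<struct<`abc:xyz`:int>>", "array<struct<`abc`:xyz:int>>"))

expected.foreach { case (c, (typ, replaced)) =>
  val computed = typ.replaceAll("`", "")
    .replaceAll("(?<=struct<|,)([^,<:]+)(?=:)", "`$1`")
  assert(computed == replaced, s"delimiter '$c': got $computed")
}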

sql(s"CREATE TABLE t (a $typ) USING hive")
},
errorClass = "INVALID_HIVE_COLUMN_NAME",
errorClass = "_LEGACY_ERROR_TEMP_3065",
Member

Why does this PR switch from INVALID_HIVE_COLUMN_NAME to _LEGACY_ERROR_TEMP_3065?

Can we exclude the deletion of INVALID_HIVE_COLUMN_NAME from this PR?

  • docs/sql-error-conditions.md
  • common/utils/src/main/resources/error/error-classes.json

Member Author

INVALID_HIVE_COLUMN_NAME is not necessary anymore: 1) the restrictions on column names have been removed in this PR, and 2) nested field names belong to the data type part rather than the column name. For these two reasons, INVALID_HIVE_COLUMN_NAME can be removed.

_LEGACY_ERROR_TEMP_3065 is thrown by org.apache.spark.sql.hive.HiveExternalCatalog#withClient; it's hard to distinguish one Hive error from another for metastore API calls.
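A rough sketch of that wrapper, assuming the general shape of HiveExternalCatalog#withClient rather than copying it:

import java.lang.reflect.InvocationTargetException
import scala.util.control.NonFatal
import org.apache.spark.sql.AnalysisException

// Metastore failures are unwrapped and rethrown under the generic
// _LEGACY_ERROR_TEMP_3065 error class; at this layer one Hive error is hard
// to tell apart from another, hence the single catch-all class.
def withClient[T](body: => T): T =
  try {
    body
  } catch {
    case NonFatal(exception) =>
      val e = exception match {
        case i: InvocationTargetException => i.getCause
        case o => o
      }
      throw new AnalysisException(
        errorClass = "_LEGACY_ERROR_TEMP_3065",
        messageParameters = Map(
          "clazz" -> e.getClass.getCanonicalName,
          "msg" -> Option(e.getMessage).getOrElse("")))
  }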

"tableName" -> "`spark_catalog`.`default`.`t1`",
"columnName" -> "`DATE '2018-01-01' + make_dt_interval(0, id, 0, 0`.`000000)`")
)
sql("CREATE TABLE t1 STORED AS parquet SELECT id as `a,b` FROM range(1)")
Member

Thank you for adding this simpler version.
However, if you don't mind, shall we keep the existing test case, too?

SELECT id, DATE'2018-01-01' + MAKE_DT_INTERVAL(0, id) FROM RANGE(0, 10)

Member Author

Hi @dongjoon-hyun, this was changed at @cloud-fan's request: #45180 (comment)

Member

Ah, got it~

@dongjoon-hyun (Member) left a comment

+1, LGTM. Thank you, @yaooqinn .

@dongjoon-hyun (Member)

Merged to master for Apache Spark 4.0.0.

@yaooqinn (Member Author)

Thank you @dongjoon-hyun @cloud-fan @MaxGekk

withTable("t") {
checkError(
exception = intercept[SparkException] {
sql(s"CREATE TABLE t (a $typ) USING hive")
Contributor

for parquet tables, do we still have this error?

Member Author

Still fine

ericm-db pushed a commit to ericm-db/spark that referenced this pull request Mar 5, 2024
[SPARK-47101][SQL] Allow comma to be used in top-level column names and remove check nested type definition in `HiveExternalCatalog.verifyDataSchema`


Closes apache#45180 from yaooqinn/SPARK-47101.

Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>