
Conversation

@yaooqinn (Member) commented Feb 20, 2024

What changes were proposed in this pull request?

In Hive 0.13 and later, column names can contain any Unicode character (see HIVE-6013), however, dot (.) and colon (:) yield errors on querying, so they are disallowed in Hive 1.2.0 (see HIVE-10120). Any column name that is specified within backticks (`) is treated literally. Within a backtick string, use double backticks (``) to represent a backtick character. Backtick quotation also enables the use of reserved keywords for table and column identifiers.

According to the Hive docs, column names can contain any character from the Unicode set.

This PR makes HiveExternalCatalog.verifyDataSchema:

  • Allow comma to be used in top-level column names
  • Remove the check for invalid characters in nested type definitions. The check was hard-coded to ",:;" and turned out to be incomplete; characters such as "^" and "%" are also rejected by Hive. Validation is now delayed to the Hive API calls instead (a sketch of the removed pre-check follows below).
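A simplified sketch of the old pre-check that this PR removes (illustrative only, not the literal Spark code; the exception type here is a stand-in for the real Spark error class):

import org.apache.spark.sql.types.{StructField, StructType}

// Nested field names were scanned only for a hard-coded delimiter set, so
// other characters Hive rejects ("^", "%", ...) slipped through this check
// and failed later inside Hive anyway.
val invalidChars = Seq(",", ":", ";")

def verifyNestedColumnNames(schema: StructType): Unit = schema.foreach { f =>
  f.dataType match {
    case st: StructType => verifyNestedColumnNames(st)
    case _ if invalidChars.exists(f.name.contains) =>
      throw new IllegalArgumentException(
        s"Invalid character in nested column name: ${f.name}")
    case _ =>
  }
}

// After this PR: top-level names may contain commas, and invalid nested type
// definitions are left to fail inside the Hive API calls instead.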

Why are the changes needed?

improvement

Does this PR introduce any user-facing change?

Yes. Some special characters are now allowed in column names, and invalid characters now surface as Spark errors instead of Hive metastore errors.

How was this patch tested?

new tests

Was this patch authored or co-authored using generative AI tooling?

no

@dongjoon-hyun (Member) left a comment

  1. I believe this is an improvement instead of a bug fix, providing better Hive compatibility, @yaooqinn. If you don't mind, could you fix the PR description?

[Screenshot 2024-02-20 at 08:16:38]

  2. Does this comply with other RDBMSes, too? I'm curious whether this is another esoteric Hive feature or not.

errorClass = "INVALID_HIVE_COLUMN_TYPE",
parameters = Map(
"invalidChars" -> "',', ':', ';'",
"detailMessage" -> msg,
Member

We should avoid embedding arbitrary text as parameters.

  • If you want to provide more details, just put the cause exception as the cause of the AnalysisException (see the sketch below).
  • Clients might reassemble error messages from parameters and show them in different languages.
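A minimal sketch of the suggested pattern, assuming the AnalysisException constructor that takes an error class, a parameter map, and an optional cause (the constructor shape may vary across Spark versions, and the error class here is illustrative):

import org.apache.spark.sql.AnalysisException

// Sketch only: keep parameters machine-readable and attach the original
// exception as the cause, rather than embedding its message as a parameter.
def wrapHiveFailure(fieldName: String, fieldType: String, e: Throwable): Nothing =
  throw new AnalysisException(
    errorClass = "CANNOT_RECOGNIZE_HIVE_TYPE",  // illustrative choice
    messageParameters = Map(
      "fieldType" -> fieldType,
      "fieldName" -> fieldName),
    cause = Some(e))                            // details travel as the cause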

Member Author

Thank you for the information

@yaooqinn (Member Author) commented Feb 21, 2024

Does this comply with other RDBMSes too? I'm curious if this is another Hive esoteric feature or not.

This does not change the parser layer, which means we already have the capability to handle special characters in the column names. This schema verification happens only in the hive catalog, while v1 in-memory, v2 jdbc, and other catalogs are free to use any character in column names.

We do not aim to match Hive's user-facing behavior, which we already do, but rather the underlying restrictions that apply when calling HMS APIs.

@dongjoon-hyun (Member)

Got it. Thank you for the clarification~

@dongjoon-hyun changed the title: [SPARK-47101][SQL] Make HiveExternalCatalog.verifyDataSchema comply with hive column name rules (Feb 21, 2024)
@dongjoon-hyun (Member)

Also, cc @cloud-fan

@dongjoon-hyun (Member) left a comment

Is this offloading actually good and safe at the Apache Spark layer from a long-term perspective?

TypeInfoUtils.getTypeInfoFromTypeString(f.dataType.catalogString)

  • Although this check is for the data schema, after this PR, are we consistent for partition column names?
  • Although Apache Spark already applies slightly different logic to data sources and Hive tables, do we become more consistent with Apache Parquet and Apache ORC data source tables after this PR?

@yaooqinn (Member Author) commented Feb 21, 2024

Hi @dongjoon-hyun.

  • This PR allows the use of commas in column names.

  • In contrast, we no longer disallow other special characters in nested type definitions. Since invalid characters eventually make the HMS call TypeInfoUtils.getTypeInfoFromTypeString fail, we simply bring that failure point forward instead of pre-checking only ",:;" as before.

It might be necessary to verify that commas can be safely used in partition names, as they are allowed in column names.

create table a(`a,b` int, c int) using hive PARTITIONED BY (`a,b`);
insert into a values(1, 2);
select * from a;
-- output
1	2


spark-sql (default)> !tree spark-warehouse/a;
spark-warehouse/a
└── a,b=2
    ├── part-00000-b75cb28d-3fb0-4858-b93d-3f089d3e63b4.c000
    └── part-00000-e558ae00-dcae-4025-bc6f-819a1debf209.c000

@dongjoon-hyun (Member)

Thank you. Could you revise the PR title to narrow it down specifically to the following additional contribution, instead of saying hive column name rules?

This PR allows the use of commas in column names.

@dongjoon-hyun (Member)

Ur, for the above example, it looks unsafe in URLs (S3 or web-URL-based Hadoop-compatible file systems). Can we use , in the middle of a URI (except the file name part)?

spark-warehouse/a
└── a,b=2
    ├── part-00000-b75cb28d-3fb0-4858-b93d-3f089d3e63b4.c000
    └── part-00000-e558ae00-dcae-4025-bc6f-819a1debf209.c000

@yaooqinn (Member Author)

Ur, for the above example, it looks unsafe in URLs (S3 or web-URL-based Hadoop-compatible file systems). Can we use , in the middle of a URI (except the file name part)?

https://docs.aws.amazon.com/AmazonS3/latest/userguide/object-keys.html#object-key-guidelines

I see Comma (",") and Equals ("=") listed in the same group of "characters that might require special handling". As = is always there for partition keys anyway, it should be safe, right?

@yaooqinn changed the title from "[SPARK-47101][SQL] Make HiveExternalCatalog.verifyDataSchema comply with hive column name rules" to "[SPARK-47101][SQL] Allow comma to be used in top-level column names and use TypeInfoUtils.getTypeInfoFromTypeString to check nested type definition in HiveExternalCatalog.verifyDataSchema" (Feb 21, 2024)
@yaooqinn (Member Author)

Hi @dongjoon-hyun, I updated the title and PR description; please check whether they are clearer now or too wordy.

// Checks top-level column names
case _ if f.name.contains(",") =>
  try {
    TypeInfoUtils.getTypeInfoFromTypeString(f.dataType.catalogString)
@cloud-fan (Contributor) commented Feb 21, 2024

what does it do? I can't find it in the previous code.

Member Author

This tokenizes the input, such as string or struct<ab:int>, and then parses it into an org.apache.hadoop.hive.serde2.typeinfo.TypeInfo.
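For illustration, a small sketch of calling that Hive utility directly; the exception type on failure is an assumption based on Hive's tokenizer:

import org.apache.hadoop.hive.serde2.typeinfo.{TypeInfo, TypeInfoUtils}

// A valid type string parses into a Hive TypeInfo tree.
val ok: TypeInfo = TypeInfoUtils.getTypeInfoFromTypeString("struct<ab:int>")
println(ok.getTypeName)  // struct<ab:int>

// An invalid one, e.g. "struct<a^b:int>", makes the tokenizer throw
// (an IllegalArgumentException in the Hive versions I have seen), which is
// the failure that was previously pre-checked only for ",", ":", ";".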

Contributor

why do we need this new check now?

Member Author

Ah, you're absolutely right. We don't need this check; in that case it seems we can remove verifyDataSchema entirely.

@yaooqinn changed the title from "[SPARK-47101][SQL] Allow comma to be used in top-level column names and use TypeInfoUtils.getTypeInfoFromTypeString to check nested type definition in HiveExternalCatalog.verifyDataSchema" to "[SPARK-47101][SQL] Allow comma to be used in top-level column names and remove check nested type definition in HiveExternalCatalog.verifyDataSchema" (Feb 21, 2024)
exception = intercept[SparkException] {
  sql(s"CREATE TABLE t (a $typ) USING hive")
},
errorClass = "CANNOT_RECOGNIZE_HIVE_TYPE",
Contributor

just for my education, where do we throw this error?

Member Author

In HiveClientImpl.getSparkSQLDataType
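For context, a rough sketch of that conversion path, approximating (not quoting) HiveClientImpl.getSparkSQLDataType; the SparkException wrapping here is simplified:

import org.apache.hadoop.hive.metastore.api.FieldSchema
import org.apache.spark.SparkException
import org.apache.spark.sql.catalyst.parser.{CatalystSqlParser, ParseException}
import org.apache.spark.sql.types.DataType

// Hive hands back a type string; if Spark's parser cannot recognize it, the
// failure is rethrown under the CANNOT_RECOGNIZE_HIVE_TYPE error class.
def getSparkSQLDataType(hc: FieldSchema): DataType =
  try {
    CatalystSqlParser.parseDataType(hc.getType)
  } catch {
    case e: ParseException =>
      throw new SparkException(
        s"[CANNOT_RECOGNIZE_HIVE_TYPE] ${hc.getType} for column ${hc.getName}", e)
  }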

Member

Got it. We can reuse the existing error class in this case.

// delimiter characters
Seq(",", ":").foreach { c =>
  val typ = s"array<struct<`abc${c}xyz`:int>>"
  val replaced = typ.replaceAll("`", "").replaceAll("(?<=struct<|,)([^,<:]+)(?=:)", "`$1`")
Member

Does this replace rule come from Hive? Can we have a link?

Member Author

OK

Contributor

I feel it's clearer to write the string literal of the replaced value, instead of using this complex regex.
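For reference, a hand-expanded version of what that regex produces for the two delimiters above; these literals are my own derivation and worth double-checking against the test:

// Hand-expanded literals for the two delimiters exercised above; an assert
// like this could replace the regex if spelled-out strings are preferred.
val expected = Seq(
  "," -> ("array<struct<`abc,xyz`:int>>", "array<struct<abc,`xyz`:int>>"),
  ":" -> ("array<struct<`abc:xyz`:int>>", "array<struct<`abc`:xyz:int>>"))

expected.foreach { case (c, (typ, replaced)) =>
  val computed = typ.replaceAll("`", "")
    .replaceAll("(?<=struct<|,)([^,<:]+)(?=:)", "`$1`")
  assert(computed == replaced, s"delimiter '$c': got $computed")
}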

sql(s"CREATE TABLE t (a $typ) USING hive")
},
errorClass = "INVALID_HIVE_COLUMN_NAME",
errorClass = "_LEGACY_ERROR_TEMP_3065",
Member

Why does this PR switch from INVALID_HIVE_COLUMN_NAME to _LEGACY_ERROR_TEMP_3065?

Can we exclude the deletion of INVALID_HIVE_COLUMN_NAME from this PR?

  • docs/sql-error-conditions.md
  • common/utils/src/main/resources/error/error-classes.json

Member Author

INVALID_HIVE_COLUMN_NAME is not necessary anymore: 1) the restrictions on column names have been removed in this PR, and 2) nested field names belong to the data type part rather than the column name. For these two reasons, INVALID_HIVE_COLUMN_NAME can be removed.

_LEGACY_ERROR_TEMP_3065 is thrown by org.apache.spark.sql.hive.HiveExternalCatalog#withClient; it's hard to distinguish one Hive error from another for metastore API calls.
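A rough sketch of that wrapper, assuming the general shape of HiveExternalCatalog#withClient rather than copying it:

import java.lang.reflect.InvocationTargetException
import scala.util.control.NonFatal
import org.apache.spark.sql.AnalysisException

// Metastore failures are unwrapped and rethrown under the generic
// _LEGACY_ERROR_TEMP_3065 error class; at this layer one Hive error is hard
// to tell apart from another, hence the single catch-all class.
def withClient[T](body: => T): T =
  try {
    body
  } catch {
    case NonFatal(exception) =>
      val e = exception match {
        case i: InvocationTargetException => i.getCause
        case o => o
      }
      throw new AnalysisException(
        errorClass = "_LEGACY_ERROR_TEMP_3065",
        messageParameters = Map(
          "clazz" -> e.getClass.getCanonicalName,
          "msg" -> Option(e.getMessage).getOrElse("")))
  }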

"tableName" -> "`spark_catalog`.`default`.`t1`",
"columnName" -> "`DATE '2018-01-01' + make_dt_interval(0, id, 0, 0`.`000000)`")
)
sql("CREATE TABLE t1 STORED AS parquet SELECT id as `a,b` FROM range(1)")
Member

Thank you for adding this simpler version.
However, if you don't mind, shall we keep the existing test case, too?

SELECT id, DATE'2018-01-01' + MAKE_DT_INTERVAL(0, id) FROM RANGE(0, 10)

Member Author

Hi @dongjoon-hyun, this was changed at @cloud-fan's request: #45180 (comment)

Member

Ah, got it~

@dongjoon-hyun (Member) left a comment

+1, LGTM. Thank you, @yaooqinn .

@dongjoon-hyun (Member)

Merged to master for Apache Spark 4.0.0.

@yaooqinn (Member Author)

Thank you @dongjoon-hyun @cloud-fan @MaxGekk

withTable("t") {
checkError(
exception = intercept[SparkException] {
sql(s"CREATE TABLE t (a $typ) USING hive")
Contributor

for parquet tables, do we still have this error?

Member Author

Still fine

ericm-db pushed a commit to ericm-db/spark that referenced this pull request Mar 5, 2024
[SPARK-47101][SQL] Allow comma to be used in top-level column names and remove check nested type definition in `HiveExternalCatalog.verifyDataSchema`


Closes apache#45180 from yaooqinn/SPARK-47101.

Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>