
[SPARK-36949][SQL] Disallow Hive provider tables with ANSI intervals #34259

Closed

MaxGekk
Member

@MaxGekk MaxGekk commented Oct 12, 2021

What changes were proposed in this pull request?

In the PR, I propose to check the column types of tables that use Hive as the provider, and to disallow tables with ANSI intervals. Currently, Hive Metastore & SerDe don't support creating tables with interval types. So, even though Spark converts Catalyst's ANSI interval types to Hive's types, Hive Metastore fails with the following error (for the parquet SerDe):

Caused by: java.lang.UnsupportedOperationException: Unknown field type: interval_year_month
	at org.apache.hadoop.hive.ql.io.parquet.serde.ArrayWritableObjectInspector.getObjectInspector(ArrayWritableObjectInspector.java:141)
	at org.apache.hadoop.hive.ql.io.parquet.serde.ArrayWritableObjectInspector.<init>(ArrayWritableObjectInspector.java:86)
	at org.apache.hadoop.hive.ql.io.parquet.serde.ArrayWritableObjectInspector.<init>(ArrayWritableObjectInspector.java:59)
	at org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe.initialize(ParquetHiveSerDe.java:129)

at https://github.com/apache/hive/blame/master/ql/src/java/org/apache/hadoop/hive/ql/io/parquet/serde/ArrayWritableObjectInspector.java#L94 .

After the changes, Spark will verify column types before sending unnecessary requests to Hive Metastore.

In more detail, I added the new check to the existing function DDLUtils.checkDataColNames(), and renamed the function to checkTableColumns() because it now checks table column names together with column types.

Note: We still allow tables with non-Hive providers to use ANSI intervals via the Hive external catalog, see #34215.
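The idea behind the check is small. Below is a minimal, self-contained sketch of it (the DataType hierarchy and the checkTableColumns signature here are simplified stand-ins for illustration, not Spark's actual classes): reject any table column whose type is an ANSI interval.

```scala
// Simplified stand-ins for Catalyst's type hierarchy.
sealed trait DataType
case object IntegerType extends DataType
case object StringType extends DataType
sealed trait AnsiIntervalType extends DataType
case object YearMonthIntervalType extends AnsiIntervalType
case object DayTimeIntervalType extends AnsiIntervalType

final case class StructField(name: String, dataType: DataType)

// Sketch of the new check: fail fast before any request reaches
// Hive Metastore if the schema contains an ANSI interval column.
def checkTableColumns(tableName: String, schema: Seq[StructField]): Unit = {
  if (schema.exists(_.dataType.isInstanceOf[AnsiIntervalType])) {
    throw new UnsupportedOperationException(
      s"Hive table $tableName with ANSI intervals is not supported")
  }
}
```

Running the check on the schema up front is what produces the clearer Spark-side error instead of the opaque Hive SerDe exception shown above.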

Why are the changes needed?

To make the error message clearer and independent of Hive Metastore.

Does this PR introduce any user-facing change?

Yes. This PR changes the error message.

Before:

spark-sql> CREATE TABLE TBL STORED AS PARQUET AS SELECT INTERVAL '1-1' YEAR TO MONTH AS YM;
Error in query: org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.IllegalArgumentException: Error: type expected at the position 0 of 'interval year to month' but 'interval year to month' is found.

After:

spark-sql> CREATE TABLE TBL STORED AS PARQUET AS SELECT INTERVAL '1-1' YEAR TO MONTH AS YM;
21/10/12 16:51:14 ERROR SparkSQLDriver: Failed in [CREATE TABLE TBL STORED AS PARQUET AS SELECT INTERVAL '1-1' YEAR TO MONTH AS YM]
java.lang.UnsupportedOperationException: Hive table `default`.`TBL` with ANSI intervals is not supported
	at org.apache.spark.sql.errors.QueryExecutionErrors$.hiveTableWithAnsiIntervalsError(QueryExecutionErrors.scala:1831)

How was this patch tested?

By running new tests:

$ build/sbt -Phive-2.3 "test:testOnly *HiveDDLSuite"

@github-actions github-actions bot added the SQL label Oct 12, 2021
@SparkQA

SparkQA commented Oct 12, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48624/

@SparkQA

SparkQA commented Oct 12, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48624/

@MaxGekk MaxGekk changed the title [WIP][SPARK-36949][SQL] Disallow Hive provider tables with ANSI intervals [SPARK-36949][SQL] Disallow Hive provider tables with ANSI intervals Oct 12, 2021
@MaxGekk MaxGekk marked this pull request as ready for review October 12, 2021 14:11
@MaxGekk
Member Author

MaxGekk commented Oct 12, 2021

@sunchao Could you take a look at this, please?

@SparkQA

SparkQA commented Oct 12, 2021

Test build #144147 has finished for PR 34259 at commit 9a1c538.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

@sunchao sunchao left a comment


LGTM. This also handles ALTER TABLE ADD COLUMN, right? It would be great to also add a test for that.

sql(
s"""
|CREATE TABLE $tbl
|STORED AS PARQUET
Contributor


can we also test CREATE TABLE without AS SELECT?

Contributor


Since the SerDe does not matter, we can put the test in SQLQuerySuite under sql/hive with the parquet SerDe only.

Member Author

@MaxGekk MaxGekk Oct 12, 2021


@cloud-fan Could you give me an example, please? A test was already added, see #34215.

Contributor


If INSERT works, I'm wondering why CTAS does not.

Member Author

@MaxGekk MaxGekk Oct 12, 2021


The issue is in Hive's SerDe/Metastore. When you insert ANSI intervals into a table where Hive SerDe is not involved (the provider is parquet, for instance), we store the schema in Spark's own format in the Hive external catalog, so it's no surprise that INSERT works.

In that case, we use Hive Metastore only as a store for our schema. HMS is not aware of our types, right?

Member Author


we can put the test in SQLQuerySuite under sql/hive with parquet serde only.

Why not HiveDDLSuite, for instance? This PR is mostly about creating/modifying a table (data definition), not about querying.

Contributor


HiveDDLSuite sounds better!

Comment on lines +936 to +938
if (schema.exists(_.dataType.isInstanceOf[AnsiIntervalType])) {
throw hiveTableWithAnsiIntervalsError(table.identifier.toString)
} else if (serde == HiveSerDe.sourceToSerDe("orc").get.serde) {
Member Author

@MaxGekk MaxGekk Oct 12, 2021


I have realized that I have some concerns about the placement of the check. This function is supposed to check column names, not column types.

Member Author


I'm thinking of a new private function checkDataColTypes or checkColumnTypes, WDYT? And call the function from the same places where checkDataColNames() is called.

Member Author


Renamed the private method to checkTableColumns.

@sunchao
Member

sunchao commented Oct 12, 2021

I also wonder if ANSI intervals will work for Parquet data source tables if spark.sql.hive.convertMetastoreParquet is set to false.

Edit: NVM, I see this config only affects Hive tables.

@SparkQA

SparkQA commented Oct 13, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48677/

@SparkQA

SparkQA commented Oct 13, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48677/

@SparkQA

SparkQA commented Oct 13, 2021

Test build #144198 has finished for PR 34259 at commit 88cc112.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

s"""
|CREATE TABLE $tbl
|STORED AS ORC
|AS SELECT
Contributor


Can we also test CREATE TABLE without AS SELECT?

@SparkQA

SparkQA commented Oct 13, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48683/

@MaxGekk
Member Author

MaxGekk commented Oct 13, 2021

Merging to master. Thank you, @sunchao and @cloud-fan for review.

@MaxGekk MaxGekk closed this in 1aa3611 Oct 13, 2021
@SparkQA

SparkQA commented Oct 13, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48683/

@SparkQA

SparkQA commented Oct 13, 2021

Test build #144204 has finished for PR 34259 at commit a855322.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
