[SPARK-36949][SQL] Disallow Hive provider tables with ANSI intervals #34259
Conversation
Kubernetes integration test starting
Kubernetes integration test status failure
@sunchao Could you take a look at this, please?
Test build #144147 has finished for PR 34259 at commit
LGTM. This also handles ALTER TABLE ADD COLUMN as well, right? It would be great to also add a test for that.
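As a sketch of the point above (simplified stand-in types, not Spark's real code): CREATE TABLE and ALTER TABLE ... ADD COLUMNS can funnel their column lists through one validation step, so rejecting ANSI intervals in that step covers both DDL paths. The names `AlterTableSketch`, `validateColumns`, and `addColumns` are hypothetical.

```scala
// Sketch: one validation rule shared by CREATE TABLE and ADD COLUMNS.
// Columns are modeled as (name, typeName) pairs for illustration only.
object AlterTableSketch {
  def isAnsiInterval(dataType: String): Boolean =
    dataType == "interval year to month" || dataType == "interval day to second"

  def validateColumns(cols: Seq[(String, String)]): Either[String, Unit] =
    cols.find { case (_, dt) => isAnsiInterval(dt) } match {
      case Some((name, dt)) => Left(s"column $name has type $dt, unsupported by Hive SerDe")
      case None             => Right(())
    }

  // ADD COLUMNS validates the newly added columns with the same rule
  // that CREATE TABLE applies to the full schema.
  def addColumns(
      existing: Seq[(String, String)],
      added: Seq[(String, String)]): Either[String, Seq[(String, String)]] =
    validateColumns(added).map(_ => existing ++ added)
}
```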
sql(
  s"""
     |CREATE TABLE $tbl
     |STORED AS PARQUET
can we also test CREATE TABLE without AS SELECT?
Since the SerDe does not matter, we can put the test in SQLQuerySuite under sql/hive with the parquet SerDe only.
@cloud-fan Could you give me an example, please? The PR #34215 added a test already.
if INSERT works, I'm wondering why CTAS does not
The issue is in Hive's SerDe/Metastore. So, when you insert ANSI intervals into a table where the Hive SerDe is not involved (the provider is parquet, for instance), and we store the schema in Spark's specific format in the Hive external catalog, why would it be surprising that INSERT works well?
In that case, we use the Hive Metastore only as a store for our schema. HMS is not aware of our types, right?
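The comment above can be illustrated with a toy model (not Spark's actual catalog code): for non-Hive provider tables, Spark serializes its schema into table properties that HMS stores as opaque strings, so the metastore never inspects the column types. The property key `spark.schema.json` below is a hypothetical placeholder.

```scala
// Toy model: HMS stores table properties as opaque strings; Spark keeps its
// real schema (including ANSI interval types) serialized in a property, so
// the metastore never needs to understand the types.
final case class HmsTable(name: String, properties: Map[String, String])

object ExternalCatalogSketch {
  val SchemaKey = "spark.schema.json" // illustrative key, not Spark's real one

  // HMS only sees an opaque string; interval types survive untouched.
  def storeSchema(name: String, sparkSchemaJson: String): HmsTable =
    HmsTable(name, Map(SchemaKey -> sparkSchemaJson))

  def loadSchema(table: HmsTable): Option[String] =
    table.properties.get(SchemaKey)
}
```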
we can put the test in SQLQuerySuite under sql/hive with parquet serde only.
Why not HiveDDLSuite, for instance? This PR is mostly about creating/modifying a table (data definition), not about querying.
HiveDDLSuite sounds better!
if (schema.exists(_.dataType.isInstanceOf[AnsiIntervalType])) {
  throw hiveTableWithAnsiIntervalsError(table.identifier.toString)
} else if (serde == HiveSerDe.sourceToSerDe("orc").get.serde) {
I have realized that I have some concerns about the place of the check. This function is supposed to check column names, not column types.
I'm thinking of a new private function, checkDataColTypes or checkColumnTypes, WDYT? We could call the function from the same places where checkDataColNames() is called.
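A minimal sketch of the helper suggested above (the real check lives in Spark's DDLUtils; the simplified type model and the exception type here are hypothetical):

```scala
// Simplified stand-ins for Catalyst data types, for illustration only.
sealed trait DataType
case object StringType extends DataType
sealed trait AnsiIntervalType extends DataType
case object YearMonthIntervalType extends AnsiIntervalType
case object DayTimeIntervalType extends AnsiIntervalType

final case class StructField(name: String, dataType: DataType)

object DDLUtilsSketch {
  // Verify column types (reject ANSI intervals for Hive provider tables)
  // before any request is sent to the Hive Metastore, alongside the
  // existing column-name checks.
  def checkTableColumns(tableName: String, schema: Seq[StructField]): Unit =
    if (schema.exists(_.dataType.isInstanceOf[AnsiIntervalType])) {
      throw new UnsupportedOperationException(
        s"Hive table $tableName cannot contain ANSI interval columns")
    }
}
```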
Renamed the private method to checkTableColumns.
I also wonder if ANSI intervals will work for Parquet data source tables if ... Edit: NVM, I see this config only affects Hive tables.
Kubernetes integration test starting
Kubernetes integration test status failure
Test build #144198 has finished for PR 34259 at commit
s"""
   |CREATE TABLE $tbl
   |STORED AS ORC
   |AS SELECT
can we also test CREATE TABLE without SELECT?
Kubernetes integration test starting
Merging to master. Thank you, @sunchao and @cloud-fan, for the review.
Kubernetes integration test status failure
Test build #144204 has finished for PR 34259 at commit
What changes were proposed in this pull request?
In the PR, I propose to check the column types of tables that use Hive as the provider, and to disallow tables with ANSI intervals. Currently, Hive Metastore & SerDe don't support creating tables with interval types. So, even though Spark converts Catalyst's ANSI interval types to Hive's types:
- YearMonthIntervalType -> INTERVAL_YEAR_MONTH
- DayTimeIntervalType -> INTERVAL_DAY_TIME

Hive Metastore fails with an error (for the parquet SerDe), see https://github.com/apache/hive/blame/master/ql/src/java/org/apache/hadoop/hive/ql/io/parquet/serde/ArrayWritableObjectInspector.java#L94.

After the changes, Spark will verify column types before sending unnecessary requests to Hive Metastore.

In more detail, I added the new check to the existing function DDLUtils.checkDataColNames(), and renamed the function to checkTableColumns() because it checks table column names together with column types.

Note: We still allow tables of non-Hive providers with ANSI intervals using the Hive external catalog, see #34215.
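The provider gate described above can be sketched as follows (simplified, not Spark's code; the object and field names are hypothetical): only tables whose provider is "hive" are rejected when the schema contains ANSI intervals, while data source tables such as parquet remain allowed because HMS merely stores their schema as opaque metadata.

```scala
// Sketch of the provider gate: reject ANSI interval columns only for
// Hive provider tables; other providers pass through unchanged.
object ProviderGateSketch {
  final case class Table(provider: String, columnTypes: Seq[String])

  private val ansiIntervals =
    Set("interval year to month", "interval day to second")

  def allowed(t: Table): Boolean =
    !(t.provider.equalsIgnoreCase("hive") &&
      t.columnTypes.exists(ansiIntervals.contains))
}
```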
Why are the changes needed?
To make the error message clearer and independent of Hive Metastore.
Does this PR introduce any user-facing change?
Yes. This PR changes the error message.
Before:
After:
How was this patch tested?
By running new tests: