[SPARK-21599][SQL] Collecting column statistics for datasource tables may fail with java.util.NoSuchElementException#18804
Conversation
…fail with java.util.NoSuchElementException
|
Test build #80135 has finished for PR 18804 at commit
|
|
retest this please |
|
Test build #80143 has finished for PR 18804 at commit
|
| |OPTIONS (skipHiveMetadata true) | ||
| """.stripMargin) | ||
| sql(s"insert into $table values (1, 1)") | ||
| sql(s"insert into $table values (2, 1)") |
There was a problem hiding this comment.
nit: minor style issue. INSERT INTO...
|
LGTM |
| // For datasource tables the data schema is stored in the table properties. | ||
| val schema = rawTable.properties.get(DATASOURCE_PROVIDER) match { | ||
| case Some(provider) => getSchemaFromTableProperties(rawTable) | ||
| case _ => rawTable.schema |
There was a problem hiding this comment.
For Hive serde tables that were created by Spark 2.1 or later, we should still restore it from table properties.
There was a problem hiding this comment.
There was a problem hiding this comment.
Maybe call restoreTableMetadata to avoid duplicate logic.
There was a problem hiding this comment.
should we call getTable().schema or you guys think its too verbose ?
val schema = getTable(db, table).schema ?
There was a problem hiding this comment.
@viirya Actually, we do have a raw table here.. so i will just call restoreTableMetadata. Thanks a lot @gatorsmile and @viirya
There was a problem hiding this comment.
You still need rawTable. Call getTable will incur another metastore access.
There was a problem hiding this comment.
@viirya right. I agree. I was saying that we do have a raw table from a prior call. So here we just pass that to restoreTableMetadata like you suggested.
There was a problem hiding this comment.
Yeah, I saw your comment #18804 (comment) after post #18804 (comment). :)
| statsProperties += STATISTICS_NUM_ROWS -> stats.get.rowCount.get.toString() | ||
| } | ||
|
|
||
| // For datasource tables and hive serde tables created by spark 2.1 or higher, |
There was a problem hiding this comment.
Also add a test for hive serde tables?
There was a problem hiding this comment.
@viirya Hive serde tables should already be tested , right ?
https://github.com/dilipbiswal/spark/blob/420be2f28db5f413566c161aa7969db664cd8f3b/sql/hive/src/test/scala/org/apache/spark/sql/hive/StatisticsSuite.scala#L251
Isn't this testing hive serde tables ? Is there anything specific in hive serde tables that we want to test ?
There was a problem hiding this comment.
IIUC, here I think we should test when the schema read by getRawTable (i.e., rawTable) is different than the schema stored in table properties.
Previously because two copies of schema are different, you will get the NoSuchElement too. That's why we should call restoreTableMetadata here for hive serde tables.
There was a problem hiding this comment.
@viirya Thanks .. i see.. so far i always get the same schema for both. I will study this some more to see when they can be different.
There was a problem hiding this comment.
The case they are different I think is case sensitivity issue in Hive schema.
There was a problem hiding this comment.
@viirya Thank you. I have added the new test.
|
Test build #80149 has finished for PR 18804 at commit
|
|
retest this please |
|
Test build #80153 has finished for PR 18804 at commit
|
|
Test build #80178 has started for PR 18804 at commit |
| test("Analyze hive serde tables when schema is not same as schema in table properties") { | ||
| val table = "hive_serde_tab_cols_uppercase" | ||
| withTable(table) { | ||
| sql(s"CREATE TABLE $table (C1 INT, C2 STRING, C3 DOUBLE)") |
There was a problem hiding this comment.
For this test case, we need to add two asserts. One is to assert the schema of raw tables; another is to assert the rebuilt schema from table properties.
There was a problem hiding this comment.
@gatorsmile Sean, is it possible to get hold of the raw table schema from the testcase ?
There was a problem hiding this comment.
@gatorsmile I got it sean. I will follow whats done in HiveExternalCatalogSuite to see if it works. :-)
|
Test build #80192 has finished for PR 18804 at commit
|
|
Test build #80195 has finished for PR 18804 at commit
|
|
retest this please |
|
Test build #80196 has finished for PR 18804 at commit
|
|
Test build #80204 has finished for PR 18804 at commit
|
| } | ||
|
|
||
| test("Analyze hive serde tables when schema is not same as schema in table properties") { | ||
|
|
|
Thanks! LGTM |
|
Merged to master. Could you submit a backport PR to 2.2? |
|
@gatorsmile Thank you very much !! Sure, i will submit a backport to 2.2. |
… may fail with java.util.NoSuchElementException In case of datasource tables (when they are stored in non-hive compatible way) , the schema information is recorded as table properties in hive meta-store. The alterTableStats method needs to get the schema information from table properties for data source tables before recording the column level statistics. Currently, we don't get the correct schema information and fail with java.util.NoSuchElement exception. A new test case is added in StatisticsSuite. Author: Dilip Biswal <dbiswal@us.ibm.com> Closes apache#18804 from dilipbiswal/datasource_stats.
What changes were proposed in this pull request?
In case of datasource tables (when they are stored in non-hive compatible way) , the schema information is recorded as table properties in hive meta-store. The alterTableStats method needs to get the schema information from table properties for data source tables before recording the column level statistics. Currently, we don't get the correct schema information and fail with java.util.NoSuchElement exception.
How was this patch tested?
A new test case is added in StatisticsSuite.