[SPARK-21599][SQL] Collecting column statistics for datasource tables may fail with java.util.NoSuchElementException#18804

Closed
dilipbiswal wants to merge 7 commits into apache:master from dilipbiswal:datasource_stats

@dilipbiswal
Contributor

What changes were proposed in this pull request?

For datasource tables that are stored in a non-Hive-compatible way, the schema information is recorded as table properties in the Hive metastore. The alterTableStats method needs to read the schema from the table properties for such tables before recording column-level statistics. Currently, we don't get the correct schema information and fail with java.util.NoSuchElementException.
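
To illustrate the failure mode described above, here is a minimal, Spark-free sketch. The `Field` class, the `typeOf` helper, and the placeholder raw schema are all hypothetical stand-ins chosen for illustration; they are not the actual catalog code. The sketch only shows that resolving a column's type against the wrong schema fails with `NoSuchElementException`, while the schema restored from table properties resolves it.

```scala
import scala.util.Try

object SchemaLookupSketch {
  // Hypothetical stand-in for a catalog column: a name and a type name.
  final case class Field(name: String, dataType: String)

  // Assumption: what the raw metastore entry might report for a table
  // stored in a non-Hive-compatible way (a placeholder, not the real columns).
  val rawSchema: Seq[Field] = Seq(Field("col", "array<string>"))

  // Assumption: the real schema as recovered from the table properties.
  val restoredSchema: Seq[Field] = Seq(Field("c1", "int"), Field("c2", "int"))

  // Looks up the type of a column by name in the given schema.
  // Map.apply throws NoSuchElementException when the key is missing.
  def typeOf(schema: Seq[Field], column: String): String = {
    val nameToType = schema.map(f => f.name -> f.dataType).toMap
    nameToType(column)
  }

  def main(args: Array[String]): Unit = {
    // Fails: "c1" is not a column of the placeholder schema.
    println(Try(typeOf(rawSchema, "c1")))
    // Succeeds: the restored schema contains "c1".
    println(Try(typeOf(restoredSchema, "c1")))
  }
}
```

The point of the sketch is only that the lookup must run against the schema that actually contains the analyzed columns.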

How was this patch tested?

A new test case is added in StatisticsSuite.

@dilipbiswal dilipbiswal changed the title [SPARK-21599] Collecting column statistics for datasource tables may fail with java.util.NoSuchElementException [SPARK-21599][SQL} Collecting column statistics for datasource tables may fail with java.util.NoSuchElementException Aug 1, 2017
@dilipbiswal dilipbiswal changed the title [SPARK-21599][SQL} Collecting column statistics for datasource tables may fail with java.util.NoSuchElementException [SPARK-21599][SQL] Collecting column statistics for datasource tables may fail with java.util.NoSuchElementException Aug 1, 2017
@dilipbiswal
Contributor Author

cc @wzhfy @gatorsmile

@SparkQA

SparkQA commented Aug 2, 2017

Test build #80135 has finished for PR 18804 at commit 0afefd5.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dilipbiswal
Contributor Author

retest this please

@SparkQA

SparkQA commented Aug 2, 2017

Test build #80143 has finished for PR 18804 at commit 0afefd5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

|OPTIONS (skipHiveMetadata true)
""".stripMargin)
sql(s"insert into $table values (1, 1)")
sql(s"insert into $table values (2, 1)")
Member

nit: minor style issue. INSERT INTO...

@viirya
Member

viirya commented Aug 2, 2017

LGTM

// For datasource tables the data schema is stored in the table properties.
val schema = rawTable.properties.get(DATASOURCE_PROVIDER) match {
  case Some(provider) => getSchemaFromTableProperties(rawTable)
  case _ => rawTable.schema
Member

For Hive serde tables that were created by Spark 2.1 or later, we should still restore the schema from table properties.

Member

Maybe call restoreTableMetadata to avoid duplicate logic.

Contributor Author

Should we call getTable().schema, or do you think that is too verbose? For example:
val schema = getTable(db, table).schema

Contributor Author

@viirya Actually, we already have a raw table here, so I will just call restoreTableMetadata. Thanks a lot, @gatorsmile and @viirya.

Member

You still need rawTable. Calling getTable will incur another metastore access.

Contributor Author

@viirya Right, I agree. I was saying that we already have a raw table from a prior call, so here we just pass it to restoreTableMetadata as you suggested.

Member

Yeah, I saw your comment #18804 (comment) after posting #18804 (comment). :)

statsProperties += STATISTICS_NUM_ROWS -> stats.get.rowCount.get.toString()
}

// For datasource tables and hive serde tables created by spark 2.1 or higher,
Member

Also add a test for hive serde tables?

Contributor Author

@viirya Hive serde tables should already be tested, right?
https://github.com/dilipbiswal/spark/blob/420be2f28db5f413566c161aa7969db664cd8f3b/sql/hive/src/test/scala/org/apache/spark/sql/hive/StatisticsSuite.scala#L251
Isn't this testing Hive serde tables? Is there anything specific to Hive serde tables that we want to test?

Member

IIUC, I think we should test the case where the schema read by getRawTable (i.e., rawTable) differs from the schema stored in table properties.

Previously, because the two copies of the schema differed, you would get the NoSuchElementException too. That's why we should call restoreTableMetadata here for Hive serde tables.

Contributor Author

@viirya Thanks, I see. So far I have always gotten the same schema for both. I will study this some more to see when they can differ.

Member

I think the case where they differ is the case-sensitivity issue in the Hive schema.
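
As a rough illustration of the case-sensitivity mismatch mentioned here, the following Spark-free sketch compares two copies of the same column list. It assumes, for illustration only, that the raw metastore copy holds lowercased names while the copy rebuilt from table properties preserves the original case; `unresolved` is a hypothetical helper, not a Spark API.

```scala
object CaseSensitivitySketch {
  // Assumption: column names as a case-insensitive metastore might return them.
  val rawColumns: Seq[String] = Seq("c1", "c2", "c3")

  // Assumption: column names as written in CREATE TABLE and kept in table properties.
  val restoredColumns: Seq[String] = Seq("C1", "C2", "C3")

  // Returns the statistics columns that have no exact-name match in the schema.
  def unresolved(schemaColumns: Seq[String], statColumns: Seq[String]): Seq[String] = {
    val known = schemaColumns.toSet
    statColumns.filterNot(known.contains)
  }

  def main(args: Array[String]): Unit = {
    val statColumns = Seq("C1", "C3")
    // Against the lowercased copy, neither column resolves.
    println(unresolved(rawColumns, statColumns))
    // Against the case-preserving copy, every column resolves.
    println(unresolved(restoredColumns, statColumns))
  }
}
```

An exact-name lookup against the lowercased copy would miss `C1` and `C3`, which is the kind of mismatch a test with uppercase column names can exercise.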

Contributor Author

@viirya Thank you. I have added the new test.

@SparkQA

SparkQA commented Aug 2, 2017

Test build #80149 has finished for PR 18804 at commit 420be2f.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dilipbiswal
Contributor Author

retest this please

@SparkQA

SparkQA commented Aug 2, 2017

Test build #80153 has finished for PR 18804 at commit 420be2f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 2, 2017

Test build #80178 has started for PR 18804 at commit 80b0073.

test("Analyze hive serde tables when schema is not same as schema in table properties") {
  val table = "hive_serde_tab_cols_uppercase"
  withTable(table) {
    sql(s"CREATE TABLE $table (C1 INT, C2 STRING, C3 DOUBLE)")
Member

For this test case, we need to add two asserts. One is to assert the schema of raw tables; another is to assert the rebuilt schema from table properties.

Contributor Author

@gatorsmile Sean, is it possible to get hold of the raw table schema from the test case?

Contributor Author

@gatorsmile I got it, Sean. I will follow what's done in HiveExternalCatalogSuite to see if it works. :-)

@SparkQA

SparkQA commented Aug 3, 2017

Test build #80192 has finished for PR 18804 at commit 7a8fa2c.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 3, 2017

Test build #80195 has finished for PR 18804 at commit c1ab569.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dilipbiswal
Contributor Author

retest this please

@SparkQA

SparkQA commented Aug 3, 2017

Test build #80196 has finished for PR 18804 at commit c1ab569.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 3, 2017

Test build #80204 has finished for PR 18804 at commit a120357.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

}

test("Analyze hive serde tables when schema is not same as schema in table properties") {

Member

Nit: This line is useless.

@gatorsmile
Member

Thanks! LGTM

@asfgit asfgit closed this in 13785da Aug 3, 2017
@gatorsmile
Member

Merged to master. Could you submit a backport PR to 2.2?

@dilipbiswal
Contributor Author

@gatorsmile Thank you very much! Sure, I will submit a backport to 2.2.

dilipbiswal added a commit to dilipbiswal/spark that referenced this pull request Aug 4, 2017
… may fail with java.util.NoSuchElementException

For datasource tables that are stored in a non-Hive-compatible way, the schema information is recorded as table properties in the Hive metastore. The alterTableStats method needs to read the schema from the table properties for such tables before recording column-level statistics. Currently, we don't get the correct schema information and fail with java.util.NoSuchElementException.

A new test case is added in StatisticsSuite.

Author: Dilip Biswal <dbiswal@us.ibm.com>

Closes apache#18804 from dilipbiswal/datasource_stats.