[SPARK-32546][SQL] Get table names directly from Hive tables #29363

MaxGekk · 2020-08-05T16:29:27Z

What changes were proposed in this pull request?

Get table names directly from a sequence of Hive tables in HiveClientImpl.listTablesByType(), skipping the conversion of Hive tables to Catalog tables.

Why are the changes needed?

A Hive metastore can be shared across many clients. A client can create tables using a SerDe which is not available on other clients, for instance ROW FORMAT SERDE "com.ibm.spss.hive.serde2.xml.XmlSerDe". In the current implementation, other clients get the following exception while getting views:

java.lang.RuntimeException: MetaException(message:java.lang.ClassNotFoundException Class com.ibm.spss.hive.serde2.xml.XmlSerDe not found)

when com.ibm.spss.hive.serde2.xml.XmlSerDe is not available.
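The failure mode can be illustrated with a small, self-contained sketch. It is not the code in HiveClientImpl: `RawTable`, `CatalogEntry`, `convert`, `namesViaConversion` and `namesDirectly` are hypothetical names, and the reflective SerDe lookup is reduced to a plain `Class.forName` call, so the exception is a bare ClassNotFoundException rather than the wrapped MetaException shown above.

```scala
object ListNamesSketch {
  // Hypothetical, simplified stand-in for a Hive table: a name plus a SerDe class name.
  final case class RawTable(name: String, serdeClass: String)

  // Hypothetical stand-in for the converted catalog representation.
  final case class CatalogEntry(name: String, serde: Class[_])

  // Models the old path: the full conversion loads the SerDe class reflectively,
  // so it throws ClassNotFoundException when the class is not on the classpath.
  def convert(t: RawTable): CatalogEntry =
    CatalogEntry(t.name, Class.forName(t.serdeClass))

  // Old behaviour (modelled): convert every table, then keep only the name.
  def namesViaConversion(tables: Seq[RawTable]): Seq[String] =
    tables.map(convert).map(_.name)

  // New behaviour (modelled): read the names directly, with no class loading.
  def namesDirectly(tables: Seq[RawTable]): Seq[String] =
    tables.map(_.name)
}
```

With a table whose SerDe class is missing, `namesViaConversion` fails while `namesDirectly` still returns the name, which is the effect this PR aims for.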

Does this PR introduce any user-facing change?

Yes. For example, SHOW VIEWS returns a list of views instead of throwing an exception.

How was this patch tested?

By existing test suites like:

$ build/sbt -Phive-2.3 "test:testOnly org.apache.spark.sql.hive.client.VersionsSuite"

And manually:

Build Spark with Hive 1.2: ./build/sbt package -Phive-1.2 -Phive -Dhadoop.version=2.8.5
Run spark-shell with a custom Hive SerDe, for instance download json-serde-1.3.8-jar-with-dependencies.jar from https://github.com/cdamak/Twitter-Hive:

$ ./bin/spark-shell --jars ../Downloads/json-serde-1.3.8-jar-with-dependencies.jar

Create a Hive table using this SerDe:

scala> :paste
// Entering paste mode (ctrl-D to finish)

sql(s"""
  |CREATE TABLE json_table2(page_id INT NOT NULL)
  |ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
  |""".stripMargin)

// Exiting paste mode, now interpreting.
res0: org.apache.spark.sql.DataFrame = []

scala> sql("SHOW TABLES").show
+--------+-----------+-----------+
|database|  tableName|isTemporary|
+--------+-----------+-----------+
| default|json_table2|      false|
+--------+-----------+-----------+

scala> sql("SHOW VIEWS").show
+---------+--------+-----------+
|namespace|viewName|isTemporary|
+---------+--------+-----------+
+---------+--------+-----------+

Quit from the current spark-shell and run it without jars:

$ ./bin/spark-shell

Show views. Without the fix, it throws the exception:

scala> sql("SHOW VIEWS").show
20/08/06 10:53:36 ERROR log: error in initSerDe: java.lang.ClassNotFoundException Class org.openx.data.jsonserde.JsonSerDe not found
java.lang.ClassNotFoundException: Class org.openx.data.jsonserde.JsonSerDe not found
	at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2273)
	at org.apache.hadoop.hive.metastore.MetaStoreUtils.getDeserializer(MetaStoreUtils.java:385)
	at org.apache.hadoop.hive.ql.metadata.Table.getDeserializerFromMetaStore(Table.java:276)
	at org.apache.hadoop.hive.ql.metadata.Table.getDeserializer(Table.java:258)
	at org.apache.hadoop.hive.ql.metadata.Table.getCols(Table.java:605)

After the fix:

scala> sql("SHOW VIEWS").show
+---------+--------+-----------+
|namespace|viewName|isTemporary|
+---------+--------+-----------+
+---------+--------+-----------+
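When reproducing the steps above, it may help to confirm that the SerDe class is really absent from the second spark-shell session before running SHOW VIEWS. A minimal sketch, assuming a hypothetical helper named `SerdeCheck.isOnClasspath` (not part of Spark), that can be pasted into the shell:

```scala
import scala.util.Try

object SerdeCheck {
  // Returns true when the named class can be loaded by the current class loader.
  // A missing class raises ClassNotFoundException, which Try turns into a Failure.
  def isOnClasspath(className: String): Boolean =
    Try(Class.forName(className)).isSuccess
}
```

For example, `SerdeCheck.isOnClasspath("org.openx.data.jsonserde.JsonSerDe")` would be expected to return false in the session started without `--jars`.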

gatorsmile · 2020-08-05T16:58:28Z

sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala

@@ -759,15 +759,17 @@ private[hive] class HiveClientImpl(
      dbName: String,
      pattern: String,
      tableType: CatalogTableType): Seq[String] = withHiveState {
+    val hiveTableType = toHiveTableType(tableType)


gatorsmile

LGTM pending Jenkins

dongjoon-hyun

What do you mean by the following, @MaxGekk ? The existing test suite already has been passing without this PR.

### How was this patch tested?

By existing test suites like:

$ build/sbt -Phive-2.3 "test:testOnly org.apache.spark.sql.hive.client.VersionsSuite"

In that section, could you provide the manual reproducible steps that you described in your previous section?

MaxGekk · 2020-08-05T18:30:01Z

What do you mean by the following, @MaxGekk ? The existing test suite already has been passing without this PR.

@dongjoon-hyun I mean that all the lines I modified are covered by the tests. Could you imagine a function that should increase an integer by one:

def plusOne(i: Int): Int = {
  downloadPage("http://www.blablabla.com")
  i + 1
}

If we remove the unnecessary downloadPage call, which can fail sometimes, the tests for the main functionality will still pass. I don't see any reason to check the function's behaviour when http://www.blablabla.com is not available.

The same applies in our situation: listTablesByType() instantiates row SerDes via reflection even though that is not needed to get the list of table names.

SparkQA · 2020-08-05T18:36:13Z

Test build #127099 has finished for PR 29363 at commit 9236cc6.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

dongjoon-hyun · 2020-08-05T21:53:13Z

Why don't you put your comment into the PR description? The "How was this patch tested?" section is designed for that.

In addition to that, please note that what I asked was the following.

In that section, could you provide the manual reproducible steps that you described in your previous section?

HyukjinKwon · 2020-08-06T01:42:24Z

I agree with @dongjoon-hyun's point here. It's best to describe how to test so people can just read and follow.

HyukjinKwon

Looks good to me except for @dongjoon-hyun's point about the PR description.

cloud-fan · 2020-08-06T04:54:23Z

The fix LGTM. This PR is kind of an improvement to skip the unnecessary table conversion, but also fixes the serde class loading issues. Agree with @dongjoon-hyun and let's mention it in the How was this patch tested? section.

MaxGekk · 2020-08-06T07:35:44Z

@cloud-fan @HyukjinKwon @dongjoon-hyun I have updated the PR's description. Please take a look at this PR one more time.

cloud-fan · 2020-08-06T08:36:39Z

thanks, merging to master!

MaxGekk · 2020-08-06T08:46:35Z

FYI, branch-3.0 has the same issue. Should I backport it there?

Closes apache#29363 from MaxGekk/fix-listTablesByType-for-views.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit dc96f2f)
Signed-off-by: Max Gekk <max.gekk@gmail.com>

MaxGekk · 2020-08-06T10:42:33Z

Here is a backport for 3.0 #29377

…entImpl.listTablesByType`

### What changes were proposed in this pull request?

Explicitly convert `tableNames` to `Seq` in `HiveClientImpl.listTablesByType` as it was done by c28a6fa#diff-6fd847124f8eae45ba2de1cf7d6296feR769

### Why are the changes needed?

See this PR #29111, to compile by Scala 2.13. The changes were discarded by #29363 accidentally.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Compiling by Scala 2.13

Closes #29379 from MaxGekk/fix-listTablesByType-for-views-followup.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
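For context on why an explicit conversion matters: in Scala 2.13 the default `Seq` is `scala.collection.immutable.Seq`, so a mutable buffer no longer conforms to a declared `Seq[String]` return type without a conversion. A minimal sketch, assuming a hypothetical method `tableNames` that receives a mutable buffer (the actual code in HiveClientImpl may differ):

```scala
import scala.collection.mutable

object ToSeqSketch {
  // Hypothetical stand-in for a method whose declared return type is Seq[String].
  // Without .toSeq this compiles on Scala 2.12 but not on 2.13, where Seq means
  // immutable.Seq and mutable.Buffer is not a subtype of it.
  def tableNames(raw: mutable.Buffer[String]): Seq[String] = raw.toSeq
}
```

Calling `.toSeq` compiles on both Scala versions, which is presumably why the follow-up restores it.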

dongjoon-hyun · 2020-08-06T14:33:54Z

Thank you, @MaxGekk and all. +1, late LGTM.


Gets table names directly from Hive tables

9236cc6

probot-autolabeler bot added the SQL label Aug 5, 2020

MaxGekk changed the title ~~[SPARK-32546][SQL] Gets table names directly from Hive tables~~ [SPARK-32546][SQL] Get table names directly from Hive tables Aug 5, 2020

gatorsmile reviewed Aug 5, 2020

gatorsmile approved these changes Aug 5, 2020

dongjoon-hyun requested changes Aug 5, 2020

HyukjinKwon approved these changes Aug 6, 2020

cloud-fan closed this in dc96f2f Aug 6, 2020

MaxGekk mentioned this pull request Aug 6, 2020

[SPARK-32546][SQL][FOLLOWUP] Add .toSeq to tableNames in HiveClientImpl.listTablesByType #29379

Closed

MaxGekk deleted the fix-listTablesByType-for-views branch December 11, 2020 20:27
