Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-32546][SQL] Get table names directly from Hive tables #29363

Closed

Conversation

MaxGekk
Copy link
Member

@MaxGekk MaxGekk commented Aug 5, 2020

What changes were proposed in this pull request?

Get table names directly from a sequence of Hive tables in HiveClientImpl.listTablesByType() by skipping conversions Hive tables to Catalog tables.

Why are the changes needed?

A Hive metastore can be shared across many clients. A client can create tables using a SerDe which is not available on other clients, for instance ROW FORMAT SERDE "com.ibm.spss.hive.serde2.xml.XmlSerDe". In the current implementation, other clients get the following exception while getting views:

java.lang.RuntimeException: MetaException(message:java.lang.ClassNotFoundException Class com.ibm.spss.hive.serde2.xml.XmlSerDe not found)

when com.ibm.spss.hive.serde2.xml.XmlSerDe is not available.

Does this PR introduce any user-facing change?

Yes. For example, SHOW VIEWS returns a list of views instead of throwing an exception.

How was this patch tested?

  • By existing test suites like:
$ build/sbt -Phive-2.3 "test:testOnly org.apache.spark.sql.hive.client.VersionsSuite"
  • And manually:
  1. Build Spark with Hive 1.2: ./build/sbt package -Phive-1.2 -Phive -Dhadoop.version=2.8.5

  2. Run spark-shell with a custom Hive SerDe, for instance download json-serde-1.3.8-jar-with-dependencies.jar from https://github.com/cdamak/Twitter-Hive:

$ ./bin/spark-shell --jars ../Downloads/json-serde-1.3.8-jar-with-dependencies.jar
  1. Create a Hive table using this SerDe:
scala> :paste
// Entering paste mode (ctrl-D to finish)

sql(s"""
  |CREATE TABLE json_table2(page_id INT NOT NULL)
  |ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
  |""".stripMargin)

// Exiting paste mode, now interpreting.
res0: org.apache.spark.sql.DataFrame = []

scala> sql("SHOW TABLES").show
+--------+-----------+-----------+
|database|  tableName|isTemporary|
+--------+-----------+-----------+
| default|json_table2|      false|
+--------+-----------+-----------+

scala> sql("SHOW VIEWS").show
+---------+--------+-----------+
|namespace|viewName|isTemporary|
+---------+--------+-----------+
+---------+--------+-----------+
  1. Quit from the current spark-shell and run it without jars:
$ ./bin/spark-shell
  1. Show views. Without the fix, it throws the exception:
scala> sql("SHOW VIEWS").show
20/08/06 10:53:36 ERROR log: error in initSerDe: java.lang.ClassNotFoundException Class org.openx.data.jsonserde.JsonSerDe not found
java.lang.ClassNotFoundException: Class org.openx.data.jsonserde.JsonSerDe not found
	at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2273)
	at org.apache.hadoop.hive.metastore.MetaStoreUtils.getDeserializer(MetaStoreUtils.java:385)
	at org.apache.hadoop.hive.ql.metadata.Table.getDeserializerFromMetaStore(Table.java:276)
	at org.apache.hadoop.hive.ql.metadata.Table.getDeserializer(Table.java:258)
	at org.apache.hadoop.hive.ql.metadata.Table.getCols(Table.java:605)

After the fix:

scala> sql("SHOW VIEWS").show
+---------+--------+-----------+
|namespace|viewName|isTemporary|
+---------+--------+-----------+
+---------+--------+-----------+

@MaxGekk MaxGekk changed the title [SPARK-32546][SQL] Gets table names directly from Hive tables [SPARK-32546][SQL] Get table names directly from Hive tables Aug 5, 2020
@@ -759,15 +759,17 @@ private[hive] class HiveClientImpl(
dbName: String,
pattern: String,
tableType: CatalogTableType): Seq[String] = withHiveState {
val hiveTableType = toHiveTableType(tableType)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

Copy link
Member

@gatorsmile gatorsmile left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM pending Jenkins

Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you mean by the following, @MaxGekk ? The existing test suite already has been passing without this PR.

### How was this patch tested?

By existing test suites like:

$ build/sbt -Phive-2.3 "test:testOnly org.apache.spark.sql.hive.client.VersionsSuite"

In that section, could you provide the manual reproducible steps which described in your previous section?

@MaxGekk
Copy link
Member Author

MaxGekk commented Aug 5, 2020

What do you mean by the following, @MaxGekk ? The existing test suite already has been passing without this PR.

@dongjoon-hyun I mean that all modified lines by me are covered by the tests. Could you image a function which should increase an integer by one:

def plusOne(i: Int): Int = {
  downloadPage("http://www.blablabla.com")
  i + 1
}

If we remove unnecessary code downloadPage which can fail sometimes, the tests for main functionality will still pass. I don't see any reasons to check function behaviour when http://www.blablabla.com is not available.

The same in our situation listTablesByType() instantiates row SerDes via reflection even it is not needed to get the list of table names.

@SparkQA
Copy link

SparkQA commented Aug 5, 2020

Test build #127099 has finished for PR 29363 at commit 9236cc6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Copy link
Member

dongjoon-hyun commented Aug 5, 2020

Why don't you put your comment into the PR description? "How was this patch tested?" section is designed for that.

In addition to that, please note that what I asked was the following.

In that section, could you provide the manual reproducible steps which described in your previous section?

@HyukjinKwon
Copy link
Member

I agree with the point of @dongjoon-hyun here. It's best to describe how to test so people just read and follow.

Copy link
Member

@HyukjinKwon HyukjinKwon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me to except @dongjoon-hyun's point in the PR description.

@cloud-fan
Copy link
Contributor

The fix LGTM. This PR is kind of an improvement to skip the unnecessary table conversion, but also fixes the serde class loading issues. Agree with @dongjoon-hyun and let's mention it in the How was this patch tested? section.

@MaxGekk
Copy link
Member Author

MaxGekk commented Aug 6, 2020

@cloud-fan @HyukjinKwon @dongjoon-hyun I have updated PR's description. Please, take a look at this PR one more time.

@cloud-fan cloud-fan closed this in dc96f2f Aug 6, 2020
@cloud-fan
Copy link
Contributor

thanks, merging to master!

@MaxGekk
Copy link
Member Author

MaxGekk commented Aug 6, 2020

FYI, branch-3.0 has the same issue, should I backport it there ?

MaxGekk added a commit to MaxGekk/spark that referenced this pull request Aug 6, 2020
Get table names directly from a sequence of Hive tables in `HiveClientImpl.listTablesByType()` by skipping conversions Hive tables to Catalog tables.

A Hive metastore can be shared across many clients. A client can create tables using a SerDe which is not available on other clients, for instance `ROW FORMAT SERDE "com.ibm.spss.hive.serde2.xml.XmlSerDe"`. In the current implementation, other clients get the following exception while getting views:
```
java.lang.RuntimeException: MetaException(message:java.lang.ClassNotFoundException Class com.ibm.spss.hive.serde2.xml.XmlSerDe not found)
```
when `com.ibm.spss.hive.serde2.xml.XmlSerDe` is not available.

Yes. For example, `SHOW VIEWS` returns a list of views instead of throwing an exception.

- By existing test suites like:
```
$ build/sbt -Phive-2.3 "test:testOnly org.apache.spark.sql.hive.client.VersionsSuite"
```
- And manually:

1. Build Spark with Hive 1.2: `./build/sbt package -Phive-1.2 -Phive -Dhadoop.version=2.8.5`

2. Run spark-shell with a custom Hive SerDe, for instance download [json-serde-1.3.8-jar-with-dependencies.jar](https://github.com/cdamak/Twitter-Hive/blob/master/json-serde-1.3.8-jar-with-dependencies.jar) from https://github.com/cdamak/Twitter-Hive:
```
$ ./bin/spark-shell --jars ../Downloads/json-serde-1.3.8-jar-with-dependencies.jar
```

3. Create a Hive table using this SerDe:
```scala
scala> :paste
// Entering paste mode (ctrl-D to finish)

sql(s"""
  |CREATE TABLE json_table2(page_id INT NOT NULL)
  |ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
  |""".stripMargin)

// Exiting paste mode, now interpreting.
res0: org.apache.spark.sql.DataFrame = []

scala> sql("SHOW TABLES").show
+--------+-----------+-----------+
|database|  tableName|isTemporary|
+--------+-----------+-----------+
| default|json_table2|      false|
+--------+-----------+-----------+

scala> sql("SHOW VIEWS").show
+---------+--------+-----------+
|namespace|viewName|isTemporary|
+---------+--------+-----------+
+---------+--------+-----------+
```

4. Quit from the current `spark-shell` and run it without jars:
```
$ ./bin/spark-shell
```

5. Show views. Without the fix, it throws the exception:
```scala
scala> sql("SHOW VIEWS").show
20/08/06 10:53:36 ERROR log: error in initSerDe: java.lang.ClassNotFoundException Class org.openx.data.jsonserde.JsonSerDe not found
java.lang.ClassNotFoundException: Class org.openx.data.jsonserde.JsonSerDe not found
	at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2273)
	at org.apache.hadoop.hive.metastore.MetaStoreUtils.getDeserializer(MetaStoreUtils.java:385)
	at org.apache.hadoop.hive.ql.metadata.Table.getDeserializerFromMetaStore(Table.java:276)
	at org.apache.hadoop.hive.ql.metadata.Table.getDeserializer(Table.java:258)
	at org.apache.hadoop.hive.ql.metadata.Table.getCols(Table.java:605)
```

After the fix:
```scala
scala> sql("SHOW VIEWS").show
+---------+--------+-----------+
|namespace|viewName|isTemporary|
+---------+--------+-----------+
+---------+--------+-----------+
```

Closes apache#29363 from MaxGekk/fix-listTablesByType-for-views.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit dc96f2f)
Signed-off-by: Max Gekk <max.gekk@gmail.com>
@MaxGekk
Copy link
Member Author

MaxGekk commented Aug 6, 2020

Here is a backport for 3.0 #29377

cloud-fan pushed a commit that referenced this pull request Aug 6, 2020
…entImpl.listTablesByType`

### What changes were proposed in this pull request?
Explicitly convert `tableNames` to `Seq` in `HiveClientImpl.listTablesByType` as it was done by c28a6fa#diff-6fd847124f8eae45ba2de1cf7d6296feR769

### Why are the changes needed?
See this PR #29111, to compile by Scala 2.13. The changes were discarded by #29363 accidentally.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Compiling by Scala 2.13

Closes #29379 from MaxGekk/fix-listTablesByType-for-views-followup.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
@dongjoon-hyun
Copy link
Member

dongjoon-hyun commented Aug 6, 2020

Thank you, @MaxGekk and all. +1, late LGTM.

a0x8o added a commit to a0x8o/spark that referenced this pull request Aug 6, 2020
…entImpl.listTablesByType`

### What changes were proposed in this pull request?
Explicitly convert `tableNames` to `Seq` in `HiveClientImpl.listTablesByType` as it was done by apache/spark@c28a6fa#diff-6fd847124f8eae45ba2de1cf7d6296feR769

### Why are the changes needed?
See this PR apache/spark#29111, to compile by Scala 2.13. The changes were discarded by apache/spark#29363 accidentally.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Compiling by Scala 2.13

Closes #29379 from MaxGekk/fix-listTablesByType-for-views-followup.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
@MaxGekk MaxGekk deleted the fix-listTablesByType-for-views branch December 11, 2020 20:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
6 participants