[SPARK-32546][SQL][3.0] Get table names directly from Hive tables #29377
### What changes were proposed in this pull request?

Get table names directly from a sequence of Hive tables in `HiveClientImpl.listTablesByType()`, skipping the conversion of Hive tables to Catalog tables.

### Why are the changes needed?

A Hive metastore can be shared across many clients. A client can create tables using a SerDe which is not available on other clients, for instance `ROW FORMAT SERDE "com.ibm.spss.hive.serde2.xml.XmlSerDe"`. In the current implementation, other clients get the following exception while getting views:

```
java.lang.RuntimeException: MetaException(message:java.lang.ClassNotFoundException Class com.ibm.spss.hive.serde2.xml.XmlSerDe not found)
```

when `com.ibm.spss.hive.serde2.xml.XmlSerDe` is not available.

### Does this PR introduce _any_ user-facing change?

Yes. For example, `SHOW VIEWS` returns a list of views instead of throwing an exception.

### How was this patch tested?

- By existing test suites like:
  ```
  $ build/sbt -Phive-2.3 "test:testOnly org.apache.spark.sql.hive.client.VersionsSuite"
  ```
- And manually:

  1. Build Spark with Hive 1.2: `./build/sbt package -Phive-1.2 -Phive -Dhadoop.version=2.8.5`
  2. Run spark-shell with a custom Hive SerDe, for instance download [json-serde-1.3.8-jar-with-dependencies.jar](https://github.com/cdamak/Twitter-Hive/blob/master/json-serde-1.3.8-jar-with-dependencies.jar) from https://github.com/cdamak/Twitter-Hive:
     ```
     $ ./bin/spark-shell --jars ../Downloads/json-serde-1.3.8-jar-with-dependencies.jar
     ```
  3. Create a Hive table using this SerDe:
     ```scala
     scala> :paste
     // Entering paste mode (ctrl-D to finish)

     sql(s"""
       |CREATE TABLE json_table2(page_id INT NOT NULL)
       |ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
       |""".stripMargin)

     // Exiting paste mode, now interpreting.

     res0: org.apache.spark.sql.DataFrame = []

     scala> sql("SHOW TABLES").show
     +--------+-----------+-----------+
     |database|  tableName|isTemporary|
     +--------+-----------+-----------+
     | default|json_table2|      false|
     +--------+-----------+-----------+

     scala> sql("SHOW VIEWS").show
     +---------+--------+-----------+
     |namespace|viewName|isTemporary|
     +---------+--------+-----------+
     +---------+--------+-----------+
     ```
  4. Quit from the current `spark-shell` and run it without jars:
     ```
     $ ./bin/spark-shell
     ```
  5. Show views. Without the fix, it throws the exception:
     ```scala
     scala> sql("SHOW VIEWS").show
     20/08/06 10:53:36 ERROR log: error in initSerDe: java.lang.ClassNotFoundException Class org.openx.data.jsonserde.JsonSerDe not found
     java.lang.ClassNotFoundException: Class org.openx.data.jsonserde.JsonSerDe not found
     	at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2273)
     	at org.apache.hadoop.hive.metastore.MetaStoreUtils.getDeserializer(MetaStoreUtils.java:385)
     	at org.apache.hadoop.hive.ql.metadata.Table.getDeserializerFromMetaStore(Table.java:276)
     	at org.apache.hadoop.hive.ql.metadata.Table.getDeserializer(Table.java:258)
     	at org.apache.hadoop.hive.ql.metadata.Table.getCols(Table.java:605)
     ```
     After the fix:
     ```scala
     scala> sql("SHOW VIEWS").show
     +---------+--------+-----------+
     |namespace|viewName|isTemporary|
     +---------+--------+-----------+
     +---------+--------+-----------+
     ```

Closes apache#29363 from MaxGekk/fix-listTablesByType-for-views.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit dc96f2f)
Signed-off-by: Max Gekk <max.gekk@gmail.com>
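
To illustrate why reading names directly sidesteps the `ClassNotFoundException`, here is a minimal, self-contained sketch. It does not use the real Hive or Spark classes: `RawTable`, `CatalogEntry`, `toCatalog`, `namesViaConversion` and `namesDirect` are hypothetical stand-ins, and the assumption is only that converting a table resolves its columns through the SerDe class, as the stack trace above suggests.

```scala
// Hypothetical stand-in for a raw metastore table; not the real Hive class.
final case class RawTable(name: String, tableType: String, serde: String) {
  // Mimics column resolution that needs the SerDe class on the classpath.
  def cols: Seq[String] =
    try { Class.forName(serde); Seq("page_id") }
    catch { case e: ClassNotFoundException => throw new RuntimeException(e) }
}

// Hypothetical stand-in for a fully converted catalog table.
final case class CatalogEntry(name: String, tableType: String, schema: Seq[String])

object ListByTypeSketch {
  // Conversion touches `cols`, so it fails when the SerDe class is missing.
  def toCatalog(t: RawTable): CatalogEntry =
    CatalogEntry(t.name, t.tableType, t.cols)

  // Convert every table first, then filter: one bad SerDe breaks the listing.
  def namesViaConversion(tables: Seq[RawTable], tpe: String): Seq[String] =
    tables.map(toCatalog).filter(_.tableType == tpe).map(_.name)

  // Filter and project on raw metadata only: no SerDe class is ever loaded.
  def namesDirect(tables: Seq[RawTable], tpe: String): Seq[String] =
    tables.filter(_.tableType == tpe).map(_.name)
}
```

With a table whose SerDe is not on the classpath in the list, `namesViaConversion` throws even when only views are requested, while `namesDirect` returns the view names.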
} catch {
  case _: UnsupportedOperationException =>
    // Fallback to filter logic if getTablesByType not supported.
    val tableNames = client.getTablesByPattern(dbName, pattern).asScala
    val tables = getTablesByName(dbName, tableNames).filter(_.tableType == tableType)
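
The fallback shape in the snippet above can be sketched in isolation as follows. This is an illustration under stated assumptions, not the code in `HiveClientImpl`: the `MetastoreShim` trait and its three methods are hypothetical, and table types are plain strings here.

```scala
// Hypothetical shim interface; the real Spark/Hive shim API is not reproduced here.
trait MetastoreShim {
  // May throw UnsupportedOperationException on metastore versions without this call.
  def getTablesByType(db: String, pattern: String, tpe: String): Seq[String]
  def getTablesByPattern(db: String, pattern: String): Seq[String]
  def tableTypeOf(db: String, name: String): String
}

object FallbackSketch {
  def listTablesByType(
      shim: MetastoreShim,
      db: String,
      pattern: String,
      tpe: String): Seq[String] =
    try {
      // Preferred path: let the metastore filter by type.
      shim.getTablesByType(db, pattern, tpe)
    } catch {
      case _: UnsupportedOperationException =>
        // Fallback: list names by pattern, then filter on the raw table type only.
        shim.getTablesByPattern(db, pattern).filter(n => shim.tableTypeOf(db, n) == tpe)
    }
}
```

Either path returns names only, so no table is converted and no SerDe class needs to be loaded.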
The master patch conflicts with branch-3.0 because the baselines differ:
- master: `tableNames.toSeq`
- branch-3.0: `tableNames`
I unintentionally removed @srowen's changes (c28a6fa#diff-6fd847124f8eae45ba2de1cf7d6296feR769). I realised this when I tried to cherry-pick the changes to branch-3.0. I will prepare a follow-up PR for master.
Here is the fix for master: #29379
It's fine if we 'break' Scala 2.13 a bit in master while we're still getting it working, if that makes it easier not to have to reason about it. If you carry the .toSeq / .toMap-style changes back into 3.0, that's fine too; it still works.
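
For context on the `.toSeq` difference discussed here: on Scala 2.13 the default `Seq` is `scala.collection.immutable.Seq`, so the mutable buffer produced by `asScala` no longer conforms to a `Seq[String]` result without an explicit `.toSeq`, whereas on 2.12 the call compiles and is harmless. A minimal sketch that compiles on both versions (`ToSeqSketch` and `names` are illustrative names, not Spark code):

```scala
import scala.collection.JavaConverters._

object ToSeqSketch {
  def names(javaNames: java.util.List[String]): Seq[String] = {
    // asScala wraps the Java list as a mutable Buffer.
    val buffer = javaNames.asScala
    // Explicit conversion so the result conforms to Seq on both 2.12 and 2.13.
    buffer.toSeq
  }
}
```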
Test build #127136 has finished for PR 29377 at commit
thanks, merging to 3.0!
Closes #29377 from MaxGekk/fix-listTablesByType-for-views-3.0.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>