Avoid listing all schemas for Spark session catalog on schema pruning #4336
Closed
pan3793 wants to merge 2 commits into apache:master
Codecov Report
@@ Coverage Diff @@
## master #4336 +/- ##
============================================
- Coverage 53.66% 53.58% -0.08%
Complexity 13 13
============================================
Files 562 562
Lines 30738 30709 -29
Branches 4161 4142 -19
============================================
- Hits 16496 16456 -40
- Misses 12682 12706 +24
+ Partials 1560 1547 -13
cxzl25 approved these changes on Feb 16, 2023
pan3793 added a commit that referenced this pull request on Feb 16, 2023
… schema pruning
### _Why are the changes needed?_
Some DBMS tools, such as DBeaver and HUE, call the Thrift metadata API to list catalogs, databases, and tables. The current implementation of `CatalogShim_v3_0#getSchemas` calls `listAllNamespaces` first and then prunes schemas on the Spark driver, which may cause a "permission denied" exception when HMS enforces permission control, for example through the Ranger plugin.
This PR proposes calling the HMS API directly (through the v1 session catalog) for `spark_catalog` to avoid this issue.
```
2023-02-15 20:02:13.048 ERROR org.apache.kyuubi.server.KyuubiTBinaryFrontendService: Error getting schemas:
org.apache.kyuubi.KyuubiSQLException: Error operating GetSchemas: org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:Permission denied: user [user1] does not have [SELECT] privilege on [userdb1])
at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:134)
at org.apache.spark.sql.hive.HiveExternalCatalog.databaseExists(HiveExternalCatalog.scala:249)
at org.apache.spark.sql.catalyst.catalog.ExternalCatalogWithListener.databaseExists(ExternalCatalogWithListener.scala:69)
at org.apache.spark.sql.catalyst.catalog.SessionCatalog.databaseExists(SessionCatalog.scala:294)
at org.apache.spark.sql.execution.datasources.v2.V2SessionCatalog.listNamespaces(V2SessionCatalog.scala:212)
at org.apache.kyuubi.engine.spark.shim.CatalogShim_v3_0.$anonfun$listAllNamespaces$1(CatalogShim_v3_0.scala:74)
at org.apache.kyuubi.engine.spark.shim.CatalogShim_v3_0.$anonfun$listAllNamespaces$1$adapted(CatalogShim_v3_0.scala:73)
at scala.collection.TraversableLike.$anonfun$flatMap$1(TraversableLike.scala:293)
at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198)
at scala.collection.TraversableLike.flatMap(TraversableLike.scala:293)
at scala.collection.TraversableLike.flatMap$(TraversableLike.scala:290)
at scala.collection.mutable.ArrayOps$ofRef.flatMap(ArrayOps.scala:198)
at org.apache.kyuubi.engine.spark.shim.CatalogShim_v3_0.listAllNamespaces(CatalogShim_v3_0.scala:73)
at org.apache.kyuubi.engine.spark.shim.CatalogShim_v3_0.listAllNamespaces(CatalogShim_v3_0.scala:90)
at org.apache.kyuubi.engine.spark.shim.CatalogShim_v3_0.getSchemasWithPattern(CatalogShim_v3_0.scala:118)
at org.apache.kyuubi.engine.spark.shim.CatalogShim_v3_0.getSchemas(CatalogShim_v3_0.scala:133)
at org.apache.kyuubi.engine.spark.operation.GetSchemas.runInternal(GetSchemas.scala:43)
at org.apache.kyuubi.operation.AbstractOperation.run(AbstractOperation.scala:164)
at org.apache.kyuubi.session.AbstractSession.runOperation(AbstractSession.scala:99)
at org.apache.kyuubi.engine.spark.session.SparkSessionImpl.runOperation(SparkSessionImpl.scala:78)
at org.apache.kyuubi.session.AbstractSession.getSchemas(AbstractSession.scala:150)
at org.apache.kyuubi.service.AbstractBackendService.getSchemas(AbstractBackendService.scala:83)
at org.apache.kyuubi.service.TFrontendService.GetSchemas(TFrontendService.scala:294)
at org.apache.kyuubi.shade.org.apache.hive.service.rpc.thrift.TCLIService$Processor$GetSchemas.getResult(TCLIService.java:1617)
at org.apache.kyuubi.shade.org.apache.hive.service.rpc.thrift.TCLIService$Processor$GetSchemas.getResult(TCLIService.java:1602)
at org.apache.kyuubi.shade.org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39)
at org.apache.kyuubi.shade.org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39)
at org.apache.kyuubi.service.authentication.TSetIpAddressProcessor.process(TSetIpAddressProcessor.scala:36)
at org.apache.kyuubi.shade.org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:286)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:750)
```
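The difference between the two code paths can be illustrated with a small self-contained model. This is only a sketch under stated assumptions, not Kyuubi or Spark code: `FakeMetastore`, `PermissionDenied`, and both `get_schemas_*` functions are hypothetical names, and the authorization behaviour is simplified from what the stack trace above suggests (the failure surfaces in `databaseExists`, which is reached once per namespace while listing all namespaces).

```python
import fnmatch


class PermissionDenied(Exception):
    """Hypothetical error standing in for the metastore's 'Permission denied'."""


class FakeMetastore:
    """Hypothetical stand-in for a metastore with an authorization plugin."""

    def __init__(self, databases, readable):
        self._databases = list(databases)
        self._readable = set(readable)

    def list_databases(self, pattern="*"):
        # Name-only listing filtered by pattern; in this model it performs
        # no per-database privilege check (an assumption of the sketch).
        return [db for db in self._databases if fnmatch.fnmatchcase(db, pattern)]

    def database_exists(self, name):
        # Per-database call; in this model it raises when the user has no
        # access to an existing database.
        if name in self._databases and name not in self._readable:
            raise PermissionDenied(f"no privilege on [{name}]")
        return name in self._databases


def get_schemas_list_all_then_prune(hms, pattern):
    # Models the old path: touch every database first, prune afterwards.
    visited = [db for db in hms.list_databases("*") if hms.database_exists(db)]
    return [db for db in visited if fnmatch.fnmatchcase(db, pattern)]


def get_schemas_pushdown(hms, pattern):
    # Models the proposed path: hand the pattern to the metastore directly.
    return hms.list_databases(pattern)


if __name__ == "__main__":
    hms = FakeMetastore(["default", "userdb1", "userdb2"],
                        readable={"default", "userdb2"})
    print(get_schemas_pushdown(hms, "userdb2"))
    try:
        get_schemas_list_all_then_prune(hms, "userdb2")
    except PermissionDenied as e:
        print("old path failed:", e)
```

In this model the old path fails even though the requested schema is readable, because it visits `userdb1` on the way. The pushdown path never issues a per-database call, so the restricted database is not touched.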
### _How was this patch tested?_
- [ ] Add some test cases that check the changes thoroughly including negative and positive cases if possible
- [ ] Add screenshots for manual tests if appropriate
- [ ] [Run test](https://kyuubi.readthedocs.io/en/master/develop_tools/testing.html#running-tests) locally before making a pull request
Closes #4336 from pan3793/list-schemas.
Closes #4336
9ece864 [Cheng Pan] fix
f71587e [Cheng Pan] Avoid listing all schemas for Spark session catalog on schema prunning
Authored-by: Cheng Pan <chengpan@apache.org>
Signed-off-by: Cheng Pan <chengpan@apache.org>
(cherry picked from commit 89fe835)
Signed-off-by: Cheng Pan <chengpan@apache.org>
pan3793 added a commit that referenced this pull request on Feb 16, 2023
… schema pruning
Problem solved.
Member, Author
Thanks @echollee for verifying; merged to master/1.7/1.6.