Avoid listing all schemas for Spark session catalog on schema pruning by pan3793 · Pull Request #4336 · apache/kyuubi

pan3793 · 2023-02-15T12:09:32Z

Why are the changes needed?

Some DBMS tools, such as DBeaver and HUE, call the Thrift metadata API to list catalogs, databases, and tables. The current implementation of CatalogShim_v3_0#getSchemas first calls listAllNamespaces and then prunes schemas on the Spark driver, which may cause a "permission denied" exception when HMS enforces permission control, for example through the Ranger plugin.

This PR proposes calling the HMS API directly (through the v1 session catalog) for spark_catalog, to avoid the above issue.

2023-02-15 20:02:13.048 ERROR org.apache.kyuubi.server.KyuubiTBinaryFrontendService: Error getting schemas:
org.apache.kyuubi.KyuubiSQLException: Error operating GetSchemas: org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:Permission denied: user [user1] does not have [SELECT] privilege on [userdb1])
        at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:134)
        at org.apache.spark.sql.hive.HiveExternalCatalog.databaseExists(HiveExternalCatalog.scala:249)
        at org.apache.spark.sql.catalyst.catalog.ExternalCatalogWithListener.databaseExists(ExternalCatalogWithListener.scala:69)
        at org.apache.spark.sql.catalyst.catalog.SessionCatalog.databaseExists(SessionCatalog.scala:294)
        at org.apache.spark.sql.execution.datasources.v2.V2SessionCatalog.listNamespaces(V2SessionCatalog.scala:212)
        at org.apache.kyuubi.engine.spark.shim.CatalogShim_v3_0.$anonfun$listAllNamespaces$1(CatalogShim_v3_0.scala:74)
        at org.apache.kyuubi.engine.spark.shim.CatalogShim_v3_0.$anonfun$listAllNamespaces$1$adapted(CatalogShim_v3_0.scala:73)
        at scala.collection.TraversableLike.$anonfun$flatMap$1(TraversableLike.scala:293)
        at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
        at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
        at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198)
        at scala.collection.TraversableLike.flatMap(TraversableLike.scala:293)
        at scala.collection.TraversableLike.flatMap$(TraversableLike.scala:290)
        at scala.collection.mutable.ArrayOps$ofRef.flatMap(ArrayOps.scala:198)
        at org.apache.kyuubi.engine.spark.shim.CatalogShim_v3_0.listAllNamespaces(CatalogShim_v3_0.scala:73)
        at org.apache.kyuubi.engine.spark.shim.CatalogShim_v3_0.listAllNamespaces(CatalogShim_v3_0.scala:90)
        at org.apache.kyuubi.engine.spark.shim.CatalogShim_v3_0.getSchemasWithPattern(CatalogShim_v3_0.scala:118)
        at org.apache.kyuubi.engine.spark.shim.CatalogShim_v3_0.getSchemas(CatalogShim_v3_0.scala:133)
        at org.apache.kyuubi.engine.spark.operation.GetSchemas.runInternal(GetSchemas.scala:43)
        at org.apache.kyuubi.operation.AbstractOperation.run(AbstractOperation.scala:164)
        at org.apache.kyuubi.session.AbstractSession.runOperation(AbstractSession.scala:99)
        at org.apache.kyuubi.engine.spark.session.SparkSessionImpl.runOperation(SparkSessionImpl.scala:78)
        at org.apache.kyuubi.session.AbstractSession.getSchemas(AbstractSession.scala:150)
        at org.apache.kyuubi.service.AbstractBackendService.getSchemas(AbstractBackendService.scala:83)
        at org.apache.kyuubi.service.TFrontendService.GetSchemas(TFrontendService.scala:294)
        at org.apache.kyuubi.shade.org.apache.hive.service.rpc.thrift.TCLIService$Processor$GetSchemas.getResult(TCLIService.java:1617)
        at org.apache.kyuubi.shade.org.apache.hive.service.rpc.thrift.TCLIService$Processor$GetSchemas.getResult(TCLIService.java:1602)
        at org.apache.kyuubi.shade.org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39)
        at org.apache.kyuubi.shade.org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39)
        at org.apache.kyuubi.service.authentication.TSetIpAddressProcessor.process(TSetIpAddressProcessor.scala:36)
        at org.apache.kyuubi.shade.org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:286)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:750)
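
The difference between the two code paths can be illustrated with a small toy model. This is only a sketch under stated assumptions, not Kyuubi's or Spark's actual code: `FakeMetastore`, its three methods, and the two `get_schemas_*` functions are hypothetical names, the glob-style matching via `fnmatch` merely stands in for the real schema pattern syntax, and the model assumes the privilege check fires on the per-database lookup (as the `databaseExists` frame in the trace suggests) rather than on listing names.

```python
from fnmatch import fnmatchcase


class FakeMetastore:
    """Hypothetical stand-in for a permission-controlled metastore (not a real HMS client)."""

    def __init__(self, databases, readable):
        self._databases = list(databases)
        self._readable = set(readable)

    def list_database_names(self):
        # Assumption: listing names alone is not permission-checked in this model.
        return list(self._databases)

    def list_databases(self, pattern):
        # Assumption: the metastore can filter names by pattern on its own side.
        return [db for db in self._databases if fnmatchcase(db, pattern)]

    def database_exists(self, name):
        # The per-database lookup is where the privilege check fires in this model.
        if name in self._databases and name not in self._readable:
            raise PermissionError(f"user does not have privilege on [{name}]")
        return name in self._databases


def get_schemas_prune_on_driver(metastore, pattern):
    """Old shape: enumerate every namespace, probe each one, then prune locally."""
    visited = []
    for db in metastore.list_database_names():
        metastore.database_exists(db)  # raises for a database the user cannot read
        visited.append(db)
    return [db for db in visited if fnmatchcase(db, pattern)]


def get_schemas_pushdown(metastore, pattern):
    """New shape: hand the pattern to the metastore and never probe other databases."""
    return metastore.list_databases(pattern)


metastore = FakeMetastore(
    databases=["default", "sales", "userdb1"],
    readable=["default", "sales"],
)

try:
    get_schemas_prune_on_driver(metastore, "sal*")
    old_path_failed = False
except PermissionError:
    old_path_failed = True

new_path_result = get_schemas_pushdown(metastore, "sal*")
```

In this model the old path fails even though the requested pattern only matches a readable database, because it probes `userdb1` before pruning; the pushdown path never touches `userdb1` for that pattern.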

How was this patch tested?

- [ ] Add some test cases that check the changes thoroughly, including negative and positive cases if possible
- [ ] Add screenshots for manual tests if appropriate
- [ ] Run tests locally before making a pull request

codecov-commenter · 2023-02-16T05:37:22Z

Codecov Report

Merging #4336 (9ece864) into master (02deaf4) will decrease coverage by 0.08%.
The diff coverage is 100.00%.

@@             Coverage Diff              @@
##             master    #4336      +/-   ##
============================================
- Coverage     53.66%   53.58%   -0.08%     
  Complexity       13       13              
============================================
  Files           562      562              
  Lines         30738    30709      -29     
  Branches       4161     4142      -19     
============================================
- Hits          16496    16456      -40     
- Misses        12682    12706      +24     
+ Partials       1560     1547      -13

Impacted Files	Coverage Δ
...he/kyuubi/engine/spark/shim/CatalogShim_v2_4.scala	`67.18% <100.00%> (+3.12%)`	⬆️
...he/kyuubi/engine/spark/shim/CatalogShim_v3_0.scala	`80.00% <100.00%> (-0.44%)`	⬇️
...ain/scala/org/apache/kyuubi/util/RowSetUtils.scala	`26.08% <0.00%> (-57.25%)`	⬇️
...kyuubi/server/trino/api/v1/StatementResource.scala	`50.00% <0.00%> (-6.76%)`	⬇️
...apache/kyuubi/engine/JpsApplicationOperation.scala	`77.41% <0.00%> (-3.23%)`	⬇️
...g/apache/kyuubi/operation/BatchJobSubmission.scala	`75.27% <0.00%> (-2.20%)`	⬇️
...rg/apache/kyuubi/ctl/cmd/log/LogBatchCommand.scala	`59.09% <0.00%> (-1.52%)`	⬇️
...mon/src/main/scala/org/apache/kyuubi/Logging.scala	`41.25% <0.00%> (-1.25%)`	⬇️
...n/scala/org/apache/kyuubi/engine/ProcBuilder.scala	`78.39% <0.00%> (-0.62%)`	⬇️
.../kyuubi/engine/spark/operation/ExecutePython.scala	`81.46% <0.00%> (-0.44%)`	⬇️
... and 13 more

… schema pruning

Closes #4336 from pan3793/list-schemas.

Closes #4336

9ece864 [Cheng Pan] fix
f71587e [Cheng Pan] Avoid listing all schemas for Spark session catalog on schema prunning

Authored-by: Cheng Pan <chengpan@apache.org>
Signed-off-by: Cheng Pan <chengpan@apache.org>
(cherry picked from commit 89fe835)
Signed-off-by: Cheng Pan <chengpan@apache.org>

echollee · 2023-02-16T07:09:41Z

Problem Solved.
Excellent!

pan3793 · 2023-02-16T07:10:52Z

Thanks @echollee for verifying; merged to master/1.7/1.6.

Avoid listing all schemas for Spark session catalog on schema prunning

f71587e

github-actions bot added the module:spark label Feb 15, 2023

pan3793 requested a review from turboFei February 15, 2023 12:30

fix

9ece864

pan3793 self-assigned this Feb 16, 2023

cxzl25 approved these changes Feb 16, 2023

pan3793 closed this in 89fe835 Feb 16, 2023

pan3793 added this to the v1.6.2 milestone Feb 16, 2023

pan3793 deleted the list-schemas branch February 21, 2023 16:53

pan3793 wants to merge 2 commits into apache:master from pan3793:list-schemas
