Skip to content

Avoid listing all schemas for Spark session catalog on schema pruning#4336

Closed
pan3793 wants to merge 2 commits intoapache:masterfrom
pan3793:list-schemas
Closed

Avoid listing all schemas for Spark session catalog on schema pruning#4336
pan3793 wants to merge 2 commits intoapache:masterfrom
pan3793:list-schemas

Conversation

@pan3793
Copy link
Member

@pan3793 pan3793 commented Feb 15, 2023

Why are the changes needed?

Some DBMS tools like DBeaver and HUE will call thrift meta api for listing catalogs, databases, and tables. The current implementation of CatalogShim_v3_0#getSchemas will call listAllNamespaces first and do schema pruning on the Spark driver, which may cause "permission denied" exception when HMS has permission control, like the ranger plugin.

This PR proposes to call HMS API(through v1 session catalog) directly for spark_catalog, to suppress the above issue.

2023-02-15 20:02:13.048 ERROR org.apache.kyuubi.server.KyuubiTBinaryFrontendService: Error getting schemas:
org.apache.kyuubi.KyuubiSQLException: Error operating GetSchemas: org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:Permission denied: user [user1] does not have [SELECT] privilege on [userdb1])
        at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:134)
        at org.apache.spark.sql.hive.HiveExternalCatalog.databaseExists(HiveExternalCatalog.scala:249)
        at org.apache.spark.sql.catalyst.catalog.ExternalCatalogWithListener.databaseExists(ExternalCatalogWithListener.scala:69)
        at org.apache.spark.sql.catalyst.catalog.SessionCatalog.databaseExists(SessionCatalog.scala:294)
        at org.apache.spark.sql.execution.datasources.v2.V2SessionCatalog.listNamespaces(V2SessionCatalog.scala:212)
        at org.apache.kyuubi.engine.spark.shim.CatalogShim_v3_0.$anonfun$listAllNamespaces$1(CatalogShim_v3_0.scala:74)
        at org.apache.kyuubi.engine.spark.shim.CatalogShim_v3_0.$anonfun$listAllNamespaces$1$adapted(CatalogShim_v3_0.scala:73)
        at scala.collection.TraversableLike.$anonfun$flatMap$1(TraversableLike.scala:293)
        at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
        at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
        at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198)
        at scala.collection.TraversableLike.flatMap(TraversableLike.scala:293)
        at scala.collection.TraversableLike.flatMap$(TraversableLike.scala:290)
        at scala.collection.mutable.ArrayOps$ofRef.flatMap(ArrayOps.scala:198)
        at org.apache.kyuubi.engine.spark.shim.CatalogShim_v3_0.listAllNamespaces(CatalogShim_v3_0.scala:73)
        at org.apache.kyuubi.engine.spark.shim.CatalogShim_v3_0.listAllNamespaces(CatalogShim_v3_0.scala:90)
        at org.apache.kyuubi.engine.spark.shim.CatalogShim_v3_0.getSchemasWithPattern(CatalogShim_v3_0.scala:118)
        at org.apache.kyuubi.engine.spark.shim.CatalogShim_v3_0.getSchemas(CatalogShim_v3_0.scala:133)
        at org.apache.kyuubi.engine.spark.operation.GetSchemas.runInternal(GetSchemas.scala:43)
        at org.apache.kyuubi.operation.AbstractOperation.run(AbstractOperation.scala:164)
        at org.apache.kyuubi.session.AbstractSession.runOperation(AbstractSession.scala:99)
        at org.apache.kyuubi.engine.spark.session.SparkSessionImpl.runOperation(SparkSessionImpl.scala:78)
        at org.apache.kyuubi.session.AbstractSession.getSchemas(AbstractSession.scala:150)
        at org.apache.kyuubi.service.AbstractBackendService.getSchemas(AbstractBackendService.scala:83)
        at org.apache.kyuubi.service.TFrontendService.GetSchemas(TFrontendService.scala:294)
        at org.apache.kyuubi.shade.org.apache.hive.service.rpc.thrift.TCLIService$Processor$GetSchemas.getResult(TCLIService.java:1617)
        at org.apache.kyuubi.shade.org.apache.hive.service.rpc.thrift.TCLIService$Processor$GetSchemas.getResult(TCLIService.java:1602)
        at org.apache.kyuubi.shade.org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39)
        at org.apache.kyuubi.shade.org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39)
        at org.apache.kyuubi.service.authentication.TSetIpAddressProcessor.process(TSetIpAddressProcessor.scala:36)
        at org.apache.kyuubi.shade.org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:286)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:750)

How was this patch tested?

  • Add some test cases that check the changes thoroughly including negative and positive cases if possible

  • Add screenshots for manual tests if appropriate

  • Run test locally before make a pull request

@codecov-commenter
Copy link

Codecov Report

Merging #4336 (9ece864) into master (02deaf4) will decrease coverage by 0.08%.
The diff coverage is 100.00%.

@@             Coverage Diff              @@
##             master    #4336      +/-   ##
============================================
- Coverage     53.66%   53.58%   -0.08%     
  Complexity       13       13              
============================================
  Files           562      562              
  Lines         30738    30709      -29     
  Branches       4161     4142      -19     
============================================
- Hits          16496    16456      -40     
- Misses        12682    12706      +24     
+ Partials       1560     1547      -13     
Impacted Files Coverage Δ
...he/kyuubi/engine/spark/shim/CatalogShim_v2_4.scala 67.18% <100.00%> (+3.12%) ⬆️
...he/kyuubi/engine/spark/shim/CatalogShim_v3_0.scala 80.00% <100.00%> (-0.44%) ⬇️
...ain/scala/org/apache/kyuubi/util/RowSetUtils.scala 26.08% <0.00%> (-57.25%) ⬇️
...kyuubi/server/trino/api/v1/StatementResource.scala 50.00% <0.00%> (-6.76%) ⬇️
...apache/kyuubi/engine/JpsApplicationOperation.scala 77.41% <0.00%> (-3.23%) ⬇️
...g/apache/kyuubi/operation/BatchJobSubmission.scala 75.27% <0.00%> (-2.20%) ⬇️
...rg/apache/kyuubi/ctl/cmd/log/LogBatchCommand.scala 59.09% <0.00%> (-1.52%) ⬇️
...mon/src/main/scala/org/apache/kyuubi/Logging.scala 41.25% <0.00%> (-1.25%) ⬇️
...n/scala/org/apache/kyuubi/engine/ProcBuilder.scala 78.39% <0.00%> (-0.62%) ⬇️
.../kyuubi/engine/spark/operation/ExecutePython.scala 81.46% <0.00%> (-0.44%) ⬇️
... and 13 more

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

@pan3793 pan3793 self-assigned this Feb 16, 2023
@pan3793 pan3793 closed this in 89fe835 Feb 16, 2023
pan3793 added a commit that referenced this pull request Feb 16, 2023
… schema pruning

### _Why are the changes needed?_

Some DBMS tools like DBeaver and HUE will call thrift meta api for listing catalogs, databases, and tables. The current implementation of `CatalogShim_v3_0#getSchemas` will call `listAllNamespaces` first and do schema pruning on the Spark driver, which may cause "permission denied" exception when HMS has permission control, like the ranger plugin.

This PR proposes to call HMS API(through v1 session catalog) directly for `spark_catalog`, to suppress the above issue.

```
2023-02-15 20:02:13.048 ERROR org.apache.kyuubi.server.KyuubiTBinaryFrontendService: Error getting schemas:
org.apache.kyuubi.KyuubiSQLException: Error operating GetSchemas: org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:Permission denied: user [user1] does not have [SELECT] privilege on [userdb1])
        at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:134)
        at org.apache.spark.sql.hive.HiveExternalCatalog.databaseExists(HiveExternalCatalog.scala:249)
        at org.apache.spark.sql.catalyst.catalog.ExternalCatalogWithListener.databaseExists(ExternalCatalogWithListener.scala:69)
        at org.apache.spark.sql.catalyst.catalog.SessionCatalog.databaseExists(SessionCatalog.scala:294)
        at org.apache.spark.sql.execution.datasources.v2.V2SessionCatalog.listNamespaces(V2SessionCatalog.scala:212)
        at org.apache.kyuubi.engine.spark.shim.CatalogShim_v3_0.$anonfun$listAllNamespaces$1(CatalogShim_v3_0.scala:74)
        at org.apache.kyuubi.engine.spark.shim.CatalogShim_v3_0.$anonfun$listAllNamespaces$1$adapted(CatalogShim_v3_0.scala:73)
        at scala.collection.TraversableLike.$anonfun$flatMap$1(TraversableLike.scala:293)
        at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
        at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
        at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198)
        at scala.collection.TraversableLike.flatMap(TraversableLike.scala:293)
        at scala.collection.TraversableLike.flatMap$(TraversableLike.scala:290)
        at scala.collection.mutable.ArrayOps$ofRef.flatMap(ArrayOps.scala:198)
        at org.apache.kyuubi.engine.spark.shim.CatalogShim_v3_0.listAllNamespaces(CatalogShim_v3_0.scala:73)
        at org.apache.kyuubi.engine.spark.shim.CatalogShim_v3_0.listAllNamespaces(CatalogShim_v3_0.scala:90)
        at org.apache.kyuubi.engine.spark.shim.CatalogShim_v3_0.getSchemasWithPattern(CatalogShim_v3_0.scala:118)
        at org.apache.kyuubi.engine.spark.shim.CatalogShim_v3_0.getSchemas(CatalogShim_v3_0.scala:133)
        at org.apache.kyuubi.engine.spark.operation.GetSchemas.runInternal(GetSchemas.scala:43)
        at org.apache.kyuubi.operation.AbstractOperation.run(AbstractOperation.scala:164)
        at org.apache.kyuubi.session.AbstractSession.runOperation(AbstractSession.scala:99)
        at org.apache.kyuubi.engine.spark.session.SparkSessionImpl.runOperation(SparkSessionImpl.scala:78)
        at org.apache.kyuubi.session.AbstractSession.getSchemas(AbstractSession.scala:150)
        at org.apache.kyuubi.service.AbstractBackendService.getSchemas(AbstractBackendService.scala:83)
        at org.apache.kyuubi.service.TFrontendService.GetSchemas(TFrontendService.scala:294)
        at org.apache.kyuubi.shade.org.apache.hive.service.rpc.thrift.TCLIService$Processor$GetSchemas.getResult(TCLIService.java:1617)
        at org.apache.kyuubi.shade.org.apache.hive.service.rpc.thrift.TCLIService$Processor$GetSchemas.getResult(TCLIService.java:1602)
        at org.apache.kyuubi.shade.org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39)
        at org.apache.kyuubi.shade.org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39)
        at org.apache.kyuubi.service.authentication.TSetIpAddressProcessor.process(TSetIpAddressProcessor.scala:36)
        at org.apache.kyuubi.shade.org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:286)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:750)
```

### _How was this patch tested?_
- [ ] Add some test cases that check the changes thoroughly including negative and positive cases if possible

- [ ] Add screenshots for manual tests if appropriate

- [ ] [Run test](https://kyuubi.readthedocs.io/en/master/develop_tools/testing.html#running-tests) locally before make a pull request

Closes #4336 from pan3793/list-schemas.

Closes #4336

9ece864 [Cheng Pan] fix
f71587e [Cheng Pan] Avoid listing all schemas for Spark session catalog on schema prunning

Authored-by: Cheng Pan <chengpan@apache.org>
Signed-off-by: Cheng Pan <chengpan@apache.org>
(cherry picked from commit 89fe835)
Signed-off-by: Cheng Pan <chengpan@apache.org>
pan3793 added a commit that referenced this pull request Feb 16, 2023
… schema pruning

### _Why are the changes needed?_

Some DBMS tools like DBeaver and HUE will call thrift meta api for listing catalogs, databases, and tables. The current implementation of `CatalogShim_v3_0#getSchemas` will call `listAllNamespaces` first and do schema pruning on the Spark driver, which may cause "permission denied" exception when HMS has permission control, like the ranger plugin.

This PR proposes to call HMS API(through v1 session catalog) directly for `spark_catalog`, to suppress the above issue.

```
2023-02-15 20:02:13.048 ERROR org.apache.kyuubi.server.KyuubiTBinaryFrontendService: Error getting schemas:
org.apache.kyuubi.KyuubiSQLException: Error operating GetSchemas: org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:Permission denied: user [user1] does not have [SELECT] privilege on [userdb1])
        at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:134)
        at org.apache.spark.sql.hive.HiveExternalCatalog.databaseExists(HiveExternalCatalog.scala:249)
        at org.apache.spark.sql.catalyst.catalog.ExternalCatalogWithListener.databaseExists(ExternalCatalogWithListener.scala:69)
        at org.apache.spark.sql.catalyst.catalog.SessionCatalog.databaseExists(SessionCatalog.scala:294)
        at org.apache.spark.sql.execution.datasources.v2.V2SessionCatalog.listNamespaces(V2SessionCatalog.scala:212)
        at org.apache.kyuubi.engine.spark.shim.CatalogShim_v3_0.$anonfun$listAllNamespaces$1(CatalogShim_v3_0.scala:74)
        at org.apache.kyuubi.engine.spark.shim.CatalogShim_v3_0.$anonfun$listAllNamespaces$1$adapted(CatalogShim_v3_0.scala:73)
        at scala.collection.TraversableLike.$anonfun$flatMap$1(TraversableLike.scala:293)
        at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
        at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
        at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198)
        at scala.collection.TraversableLike.flatMap(TraversableLike.scala:293)
        at scala.collection.TraversableLike.flatMap$(TraversableLike.scala:290)
        at scala.collection.mutable.ArrayOps$ofRef.flatMap(ArrayOps.scala:198)
        at org.apache.kyuubi.engine.spark.shim.CatalogShim_v3_0.listAllNamespaces(CatalogShim_v3_0.scala:73)
        at org.apache.kyuubi.engine.spark.shim.CatalogShim_v3_0.listAllNamespaces(CatalogShim_v3_0.scala:90)
        at org.apache.kyuubi.engine.spark.shim.CatalogShim_v3_0.getSchemasWithPattern(CatalogShim_v3_0.scala:118)
        at org.apache.kyuubi.engine.spark.shim.CatalogShim_v3_0.getSchemas(CatalogShim_v3_0.scala:133)
        at org.apache.kyuubi.engine.spark.operation.GetSchemas.runInternal(GetSchemas.scala:43)
        at org.apache.kyuubi.operation.AbstractOperation.run(AbstractOperation.scala:164)
        at org.apache.kyuubi.session.AbstractSession.runOperation(AbstractSession.scala:99)
        at org.apache.kyuubi.engine.spark.session.SparkSessionImpl.runOperation(SparkSessionImpl.scala:78)
        at org.apache.kyuubi.session.AbstractSession.getSchemas(AbstractSession.scala:150)
        at org.apache.kyuubi.service.AbstractBackendService.getSchemas(AbstractBackendService.scala:83)
        at org.apache.kyuubi.service.TFrontendService.GetSchemas(TFrontendService.scala:294)
        at org.apache.kyuubi.shade.org.apache.hive.service.rpc.thrift.TCLIService$Processor$GetSchemas.getResult(TCLIService.java:1617)
        at org.apache.kyuubi.shade.org.apache.hive.service.rpc.thrift.TCLIService$Processor$GetSchemas.getResult(TCLIService.java:1602)
        at org.apache.kyuubi.shade.org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39)
        at org.apache.kyuubi.shade.org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39)
        at org.apache.kyuubi.service.authentication.TSetIpAddressProcessor.process(TSetIpAddressProcessor.scala:36)
        at org.apache.kyuubi.shade.org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:286)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:750)
```

### _How was this patch tested?_
- [ ] Add some test cases that check the changes thoroughly including negative and positive cases if possible

- [ ] Add screenshots for manual tests if appropriate

- [ ] [Run test](https://kyuubi.readthedocs.io/en/master/develop_tools/testing.html#running-tests) locally before make a pull request

Closes #4336 from pan3793/list-schemas.

Closes #4336

9ece864 [Cheng Pan] fix
f71587e [Cheng Pan] Avoid listing all schemas for Spark session catalog on schema prunning

Authored-by: Cheng Pan <chengpan@apache.org>
Signed-off-by: Cheng Pan <chengpan@apache.org>
(cherry picked from commit 89fe835)
Signed-off-by: Cheng Pan <chengpan@apache.org>
@echollee
Copy link

Problem Solved.
Excellent!

@pan3793 pan3793 added this to the v1.6.2 milestone Feb 16, 2023
@pan3793
Copy link
Member Author

pan3793 commented Feb 16, 2023

Thanks @echollee for verificating, merged to master/1.7/1.6

@pan3793 pan3793 deleted the list-schemas branch February 21, 2023 16:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants