[SPARK-47197] Failed to connect HiveMetastore when using iceberg with HiveCatalog on spark-sql or spark-shell #45309

Closed
eubnara wants to merge 1 commit into apache:master from eubnara:SPARK-47197

Conversation


eubnara (Contributor) commented Feb 28, 2024

What changes were proposed in this pull request?

Make spark-sql and spark-shell able to access Iceberg tables with HiveCatalog.
If a user wants to access an Iceberg table with HiveCatalog through spark-sql or spark-shell, the user must specify additional configuration:

$ spark-sql --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \
--conf spark.sql.catalog.hadoop_prod=org.apache.iceberg.spark.SparkCatalog \
--conf spark.sql.catalog.hadoop_prod.type=hive \
--conf spark.sql.catalog.hadoop_prod.uri=thrift://hms1.example.com:9083,thrift://hms2.example.com:9083 \
--conf spark.hadoop.iceberg.engine.hive.enabled=true \
--conf spark.jars=hdfs:///some/path/to/iceberg-spark-runtime-3.2_2.12-1.4.3.jar \
--conf spark.hadoop.hive.aux.jars.path=hdfs:///some/path/to/iceberg-hive-runtime-1.4.3.jar \
--conf spark.security.credentials.hive.enabled=true

Why are the changes needed?

spark-sql and spark-shell cannot access iceberg table with HiveCatalog because there is no HIVE_DELEGATION_TOKEN.

Does this PR introduce any user-facing change?

If a user specifies --conf spark.security.credentials.hive.enabled=true, Spark will obtain a HIVE_DELEGATION_TOKEN even when the deploy mode is not "cluster".

How was this patch tested?

Manually tested on an on-premise internal cluster with Hadoop 3.3.4, Iceberg 1.4.3, and Spark 3.2.3.

Was this patch authored or co-authored using generative AI tooling?

No.

github-actions bot added the SQL label Feb 28, 2024

pan3793 (Member) commented Feb 28, 2024

HiveDelegationTokenProvider takes care of token refresh for the Spark built-in HMS client. Iceberg uses its own HMS client implementation, which should take care of that itself.

As an example, Apache Kyuubi implements a Hive connector based on the Spark DSv2 API, which allows connecting to multiple HMS instances, and implements KyuubiHiveConnectorDelegationTokenProvider to take care of token refresh for its managed HMS clients: apache/kyuubi#4560
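For readers unfamiliar with the mechanism being referenced: Spark discovers such providers through Java's ServiceLoader, using the developer API org.apache.spark.security.HadoopDelegationTokenProvider. A minimal sketch of what a connector-side provider could look like is shown below; the class name IcebergHiveDelegationTokenProvider and the catalog-detection logic are hypothetical illustrations, not part of this PR or of Iceberg.

```scala
// Hypothetical sketch of a connector-side delegation token provider.
// Spark loads implementations listed in the resource file
// META-INF/services/org.apache.spark.security.HadoopDelegationTokenProvider.
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.security.Credentials
import org.apache.spark.SparkConf
import org.apache.spark.security.HadoopDelegationTokenProvider

class IcebergHiveDelegationTokenProvider extends HadoopDelegationTokenProvider {

  // Unique name; a provider can be toggled via
  // spark.security.credentials.<serviceName>.enabled
  override def serviceName: String = "iceberg-hive"

  // Only ask for tokens when at least one Iceberg Hive catalog is configured.
  override def delegationTokensRequired(
      sparkConf: SparkConf,
      hadoopConf: Configuration): Boolean = {
    sparkConf.getAllWithPrefix("spark.sql.catalog.")
      .exists { case (key, value) => key.endsWith(".type") && value == "hive" }
  }

  // Open an HMS client against each catalog's spark.sql.catalog.<name>.uri,
  // fetch a delegation token, add it to `creds`, and return the next
  // renewal time -- as the Kyuubi provider linked above does.
  override def obtainDelegationTokens(
      hadoopConf: Configuration,
      sparkConf: SparkConf,
      creds: Credentials): Option[Long] = {
    // ... token-fetching logic elided ...
    None
  }
}
```

This is the same extension point the Kyuubi connector uses, which is why pan3793 suggests the fix belongs on the Iceberg side rather than in Spark's built-in HiveDelegationTokenProvider.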


eubnara (Contributor, Author) commented Feb 28, 2024

Thanks for the reply.
With spark-sql or spark-shell, is it impossible to use Iceberg with HiveCatalog? Is only Iceberg with HadoopCatalog supported?


pan3793 (Member) commented Feb 28, 2024

IMO it's an Iceberg-side issue. In addition to the case you listed above, accessing multiple Kerberized HMS instances should also be considered, e.g. when the Spark built-in HMS and the Iceberg HMS are different, or when more than one Iceberg Hive catalog is configured.

+cc @pvary @szehon-ho @sunchao


eubnara (Contributor, Author) commented Feb 28, 2024

Thanks for the explanation. I think I need to review the Spark and Iceberg code more...


eubnara (Contributor, Author) commented Feb 28, 2024

Even with this patch, INSERT INTO is broken (DESCRIBE EXTENDED and SELECT * FROM queries are okay).
Maybe https://issues.apache.org/jira/browse/SPARK-30885 is related?

Driver stacktrace:
        at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2454)
        at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2403)
        at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2402)
        at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
        at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
        at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2402)
        at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1160)
        at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1160)
        at scala.Option.foreach(Option.scala:407)
        at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1160)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2642)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2584)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2573)
        at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
        at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:938)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:2214)
        at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:228)
        ... 61 more
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.NullPointerException
        at org.apache.hadoop.hive.ql.io.HiveFileFormatUtils.getHiveRecordWriter(HiveFileFormatUtils.java:274)
        at org.apache.spark.sql.hive.execution.HiveOutputWriter.<init>(HiveFileFormat.scala:132)
        at org.apache.spark.sql.hive.execution.HiveFileFormat$$anon$1.newInstance(HiveFileFormat.scala:105)
        at org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.newOutputWriter(FileFormatDataWriter.scala:161)
        at org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.<init>(FileFormatDataWriter.scala:146)
        at org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:300)
        at org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$write$17(FileFormatWriter.scala:239)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
        at org.apache.spark.scheduler.Task.run(Task.scala:131)
        at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:506)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1492)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:509)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:750)
Caused by: java.lang.NullPointerException
        at org.apache.iceberg.mr.hive.TezUtil$TaskAttemptWrapper.<init>(TezUtil.java:105)
        at org.apache.iceberg.mr.hive.TezUtil.taskAttemptWrapper(TezUtil.java:78)
        at org.apache.iceberg.mr.hive.HiveIcebergOutputFormat.writer(HiveIcebergOutputFormat.java:73)
        at org.apache.iceberg.mr.hive.HiveIcebergOutputFormat.getHiveRecordWriter(HiveIcebergOutputFormat.java:58)
        at org.apache.hadoop.hive.ql.io.HiveFileFormatUtils.getRecordWriter(HiveFileFormatUtils.java:286)
        at org.apache.hadoop.hive.ql.io.HiveFileFormatUtils.getHiveRecordWriter(HiveFileFormatUtils.java:271)
        ... 14 more


eubnara (Contributor, Author) commented Feb 28, 2024

Oh! I finally figured out why it fails.
I should not use the iceberg-hive-runtime jar with spark-sql or spark-shell.
I also forgot to specify the database with its catalog name in the query:

$ spark-sql --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \
--conf spark.sql.catalog.hadoop_prod=org.apache.iceberg.spark.SparkCatalog \
--conf spark.sql.catalog.hadoop_prod.type=hive \
--conf spark.sql.catalog.hadoop_prod.uri=thrift://hms1.example.com:9083 \
--conf spark.hadoop.iceberg.engine.hive.enabled=true

SELECT * FROM hadoop_prod.db.table;  -- correct: catalog-qualified
SELECT * FROM db.table;              -- wrong: resolved against the default catalog

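As a side note beyond what the thread covers: on newer Spark versions the current catalog can also be switched once per session, so that subsequent queries need no catalog prefix. This is a sketch assuming a Spark version that supports the SET CATALOG command (Spark 3.4+); hadoop_prod and db.table follow the example above.

```sql
-- Switch the current catalog for this session (Spark 3.4+),
-- then query without qualifying every table name:
SET CATALOG hadoop_prod;
SELECT * FROM db.table;
```

On older versions (such as the Spark 3.2.3 used in this PR's testing), the catalog-qualified form hadoop_prod.db.table from the comment above is the way to go.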
eubnara closed this Feb 28, 2024