[SPARK-47197] Failed to connect HiveMetastore when using iceberg with HiveCatalog on spark-sql or spark-shell #45309

Closed
eubnara wants to merge 1 commit into apache:master from eubnara:SPARK-47197

Conversation


eubnara (Contributor) commented Feb 28, 2024

What changes were proposed in this pull request?

Make spark-sql and spark-shell able to access Iceberg tables with HiveCatalog.
If a user wants to access an Iceberg table with HiveCatalog through spark-sql or spark-shell, the user must specify additional configuration:

$ spark-sql --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \
--conf spark.sql.catalog.hadoop_prod=org.apache.iceberg.spark.SparkCatalog \
--conf spark.sql.catalog.hadoop_prod.type=hive \
--conf spark.sql.catalog.hadoop_prod.uri=thrift://hms1.example.com:9083,thrift://hms2.example.com:9083 \
--conf spark.hadoop.iceberg.engine.hive.enabled=true \
--conf spark.jars=hdfs:///some/path/to/iceberg-spark-runtime-3.2_2.12-1.4.3.jar \
--conf spark.hadoop.hive.aux.jars.path=hdfs:///some/path/to/iceberg-hive-runtime-1.4.3.jar \
--conf spark.security.credentials.hive.enabled=true

Why are the changes needed?

spark-sql and spark-shell cannot access iceberg table with HiveCatalog because there is no HIVE_DELEGATION_TOKEN.

Does this PR introduce any user-facing change?

If a user specifies --conf spark.security.credentials.hive.enabled=true, Spark will obtain a HIVE_DELEGATION_TOKEN even when the deploy mode is not "cluster".

How was this patch tested?

Manually tested on an on-premise internal cluster with Hadoop 3.3.4, Iceberg 1.4.3, and Spark 3.2.3.

Was this patch authored or co-authored using generative AI tooling?

No.

github-actions bot added the SQL label Feb 28, 2024

pan3793 (Member) commented Feb 28, 2024

HiveDelegationTokenProvider takes care of token refresh for the Spark built-in HMS client. Iceberg uses its own HMS client implementation, which should take care of that itself.

As an example, Apache Kyuubi implements a Hive connector based on the Spark DSv2 API, which allows connecting to multiple HMS instances, and implements KyuubiHiveConnectorDelegationTokenProvider to take care of token refresh for its managed HMS clients: apache/kyuubi#4560
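For readers unfamiliar with the mechanism being referenced: Spark discovers such providers through Java's ServiceLoader, using the developer API org.apache.spark.security.HadoopDelegationTokenProvider. A minimal sketch of what a connector-side provider could look like is shown below; the class name IcebergHiveDelegationTokenProvider and the catalog-detection logic are hypothetical illustrations, not part of this PR or of Iceberg.

```scala
// Hypothetical sketch of a connector-side delegation token provider.
// Spark loads implementations listed in the resource file
// META-INF/services/org.apache.spark.security.HadoopDelegationTokenProvider.
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.security.Credentials
import org.apache.spark.SparkConf
import org.apache.spark.security.HadoopDelegationTokenProvider

class IcebergHiveDelegationTokenProvider extends HadoopDelegationTokenProvider {

  // Unique name; a provider can be toggled via
  // spark.security.credentials.<serviceName>.enabled
  override def serviceName: String = "iceberg-hive"

  // Only ask for tokens when at least one Iceberg Hive catalog is configured.
  override def delegationTokensRequired(
      sparkConf: SparkConf,
      hadoopConf: Configuration): Boolean = {
    sparkConf.getAllWithPrefix("spark.sql.catalog.")
      .exists { case (key, value) => key.endsWith(".type") && value == "hive" }
  }

  // Open an HMS client against each catalog's spark.sql.catalog.<name>.uri,
  // fetch a delegation token, add it to `creds`, and return the next
  // renewal time -- as the Kyuubi provider linked above does.
  override def obtainDelegationTokens(
      hadoopConf: Configuration,
      sparkConf: SparkConf,
      creds: Credentials): Option[Long] = {
    // ... token-fetching logic elided ...
    None
  }
}
```

This is the same extension point the Kyuubi connector uses, which is why pan3793 suggests the fix belongs on the Iceberg side rather than in Spark's built-in HiveDelegationTokenProvider.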


eubnara (Contributor, Author) commented Feb 28, 2024

Thanks for the reply.
With spark-sql or spark-shell, is it impossible to use Iceberg with HiveCatalog? Is only Iceberg with HadoopCatalog supported?


pan3793 (Member) commented Feb 28, 2024

IMO it's an Iceberg-side issue. In addition to the case you listed above, accessing multiple Kerberized HMS instances should also be considered, e.g. when the Spark built-in HMS and the Iceberg HMS are different, or when more than one Iceberg Hive catalog is configured.

+cc @pvary @szehon-ho @sunchao


eubnara (Contributor, Author) commented Feb 28, 2024

Thanks for the explanation. I think I need to review the Spark and Iceberg code more...


eubnara (Contributor, Author) commented Feb 28, 2024

Even with this patch, INSERT INTO is broken (DESCRIBE EXTENDED and SELECT * FROM queries are okay).
Maybe https://issues.apache.org/jira/browse/SPARK-30885 is related?

Driver stacktrace:
        at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2454)
        at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2403)
        at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2402)
        at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
        at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
        at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2402)
        at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1160)
        at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1160)
        at scala.Option.foreach(Option.scala:407)
        at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1160)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2642)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2584)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2573)
        at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
        at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:938)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:2214)
        at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:228)
        ... 61 more
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.NullPointerException
        at org.apache.hadoop.hive.ql.io.HiveFileFormatUtils.getHiveRecordWriter(HiveFileFormatUtils.java:274)
        at org.apache.spark.sql.hive.execution.HiveOutputWriter.<init>(HiveFileFormat.scala:132)
        at org.apache.spark.sql.hive.execution.HiveFileFormat$$anon$1.newInstance(HiveFileFormat.scala:105)
        at org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.newOutputWriter(FileFormatDataWriter.scala:161)
        at org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.<init>(FileFormatDataWriter.scala:146)
        at org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:300)
        at org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$write$17(FileFormatWriter.scala:239)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
        at org.apache.spark.scheduler.Task.run(Task.scala:131)
        at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:506)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1492)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:509)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:750)
Caused by: java.lang.NullPointerException
        at org.apache.iceberg.mr.hive.TezUtil$TaskAttemptWrapper.<init>(TezUtil.java:105)
        at org.apache.iceberg.mr.hive.TezUtil.taskAttemptWrapper(TezUtil.java:78)
        at org.apache.iceberg.mr.hive.HiveIcebergOutputFormat.writer(HiveIcebergOutputFormat.java:73)
        at org.apache.iceberg.mr.hive.HiveIcebergOutputFormat.getHiveRecordWriter(HiveIcebergOutputFormat.java:58)
        at org.apache.hadoop.hive.ql.io.HiveFileFormatUtils.getRecordWriter(HiveFileFormatUtils.java:286)
        at org.apache.hadoop.hive.ql.io.HiveFileFormatUtils.getHiveRecordWriter(HiveFileFormatUtils.java:271)
        ... 14 more


eubnara (Contributor, Author) commented Feb 28, 2024

Oh! I finally figured out why it fails.
I should not use the iceberg-hive-runtime jar with spark-sql or spark-shell.
I also forgot to specify the database with its catalog name in the query:

$ spark-sql --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \
--conf spark.sql.catalog.hadoop_prod=org.apache.iceberg.spark.SparkCatalog \
--conf spark.sql.catalog.hadoop_prod.type=hive \
--conf spark.sql.catalog.hadoop_prod.uri=thrift://hms1.example.com:9083 \
--conf spark.hadoop.iceberg.engine.hive.enabled=true

SELECT * FROM hadoop_prod.db.table;  -- correct: catalog-qualified
SELECT * FROM db.table;              -- wrong: resolved against the default catalog

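As a side note beyond what the thread covers: on newer Spark versions the current catalog can also be switched once per session, so that subsequent queries need no catalog prefix. This is a sketch assuming a Spark version that supports the SET CATALOG command (Spark 3.4+); hadoop_prod and db.table follow the example above.

```sql
-- Switch the current catalog for this session (Spark 3.4+),
-- then query without qualifying every table name:
SET CATALOG hadoop_prod;
SELECT * FROM db.table;
```

On older versions (such as the Spark 3.2.3 used in this PR's testing), the catalog-qualified form hadoop_prod.db.table from the comment above is the way to go.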
eubnara closed this Feb 28, 2024