
[BUG] Hudi 0.11.x support for Spark with CTAS throws hive sync error #7353

@atharvai

Description

Describe the problem you faced

Scenario: Create table using Spark SQL with Spark 3.2 (EMR 6.7.0) and Glue Data Catalog for hive sync
Expected: Successful table creation and registration with hive/glue
Actual: Successful table creation and registration with hive/glue, AND an error that the table already exists.

This scenario does not fail when using the DataFrameWriter API instead of Spark SQL.

This suggests that the SQL path performs the hive sync twice: the first sync succeeds, but the second throws a "table already exists" error, causing the job to fail.
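For comparison, the DataFrameWriter path that does not fail can be sketched as follows. This is a minimal sketch: the option dict simply mirrors the SQL options above, the `{...}` placeholders are the same templated values used throughout this report, and the commented-out write call assumes a live SparkSession with the Hudi bundle on the classpath.

```python
# Hudi write options mirroring the CTAS options above.
# The {placeholder} values are templated, as in the SQL reproduction.
hudi_options = {
    "hoodie.table.name": "{target_table_name}",
    "hoodie.datasource.write.operation": "upsert",
    "hoodie.datasource.write.table.name": "{target_table_name}",
    "hoodie.datasource.write.recordkey.field": "{primary_key}",
    "hoodie.datasource.write.precombine.field": "{precombine_field}",
    "hoodie.datasource.write.partitionpath.field": "{partition_fields}",
    "hoodie.datasource.write.keygenerator.class": "org.apache.hudi.keygen.ComplexKeyGenerator",
    "hoodie.datasource.hive_sync.enable": "true",
    "hoodie.datasource.hive_sync.mode": "hms",
    "hoodie.datasource.hive_sync.use_jdbc": "false",
    "hoodie.datasource.hive_sync.partition_fields": "{partition_fields}",
    "hoodie.datasource.hive_sync.database": "{target_db}",
    "hoodie.datasource.hive_sync.table": "{target_table_name}",
}

# With a SparkSession `spark` available, the write that succeeds is roughly:
#   (spark.table("{source_db}.{source_table}")
#        .write.format("hudi")
#        .options(**hudi_options)
#        .mode("overwrite")
#        .save("s3://{target_bucket_name}/{target_table_name}/"))
```

Here the hive sync runs exactly once, inside the Hudi write, which is consistent with the theory that the SQL/CTAS path adds a second, redundant sync.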

To Reproduce

Steps to reproduce the behavior:

  1. Run the following Spark SQL with the Glue catalog configured:
CREATE TABLE {target_db}.{target_table_name} USING hudi
LOCATION 's3://{target_bucket_name}/{target_table_name}/'
OPTIONS (
    type = 'cow', primaryKey='{primary_key}', preCombineField='{precombine_field}',
    hoodie.table.name='{target_table_name}',
    hoodie.datasource.write.operation='upsert',
    hoodie.datasource.write.table.name='{target_table_name}',
    hoodie.datasource.write.recordkey.field='{primary_key}',
    hoodie.datasource.write.precombine.field='{precombine_field}',
    hoodie.datasource.write.partitionpath.field='{partition_fields}',
    hoodie.datasource.write.keygenerator.class='org.apache.hudi.keygen.ComplexKeyGenerator',

    hoodie.datasource.hive_sync.enable='true',
    hoodie.datasource.hive_sync.mode='hms',
    hoodie.datasource.hive_sync.use_jdbc='false',
    hoodie.datasource.write.hive_style_partitioning='false',
    hoodie.datasource.hive_sync.partition_fields='{partition_fields}',
    hoodie.datasource.hive_sync.database='{target_db}',
    hoodie.datasource.hive_sync.table='{target_table_name}',

    hoodie.write.concurrency.mode='optimistic_concurrency_control',
    hoodie.cleaner.policy.failed.writes='LAZY',
    hoodie.write.lock.provider='org.apache.hudi.aws.transaction.lock.DynamoDBBasedLockProvider',
    hoodie.write.lock.dynamodb.table='hudi_locks_{args.environment}',
    hoodie.write.lock.dynamodb.partition_key='{target_table_name}',
    hoodie.write.lock.dynamodb.region='{region}',
    hoodie.write.lock.dynamodb.billing_mode='PAY_PER_REQUEST',
    hoodie.write.lock.dynamodb.endpoint_url='dynamodb.{region}.amazonaws.com'
)
PARTITIONED BY ({partition_fields})
AS
SELECT *
FROM {source_db}.{source_table};

Expected behavior

Successful table creation and registration with hive/glue, with the Spark job completing successfully.

Environment Description
EMR 6.7.0

  • Hudi version : 0.11.x (both 0.11.0 and 0.11.1)
  • Spark version : 3.2.1 (EMR 6.7.0)
  • Hive version : 3.1.3 (EMR 6.7.0)
  • Hadoop version :
  • Storage (HDFS/S3/GCS..) : S3
  • Running on Docker? (yes/no) : no


Stacktrace

Traceback (most recent call last):
  File "/tmp/spark-41761cf5-73dc-4128-adba-ba4bd8670d7c/hudi_remodeller.py", line 84, in <module>
    args.primary_key, args.precombine_field, args.partition_fields, args.region)
  File "/tmp/spark-41761cf5-73dc-4128-adba-ba4bd8670d7c/hudi_remodeller.py", line 62, in remodel_table
    spark.sql(sql)
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/session.py", line 723, in sql
  File "/usr/lib/spark/python/lib/py4j-0.10.9.3-src.zip/py4j/java_gateway.py", line 1322, in __call__
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 117, in deco
pyspark.sql.utils.AnalysisException: Table or view '{target_table_name}' already exists in database '{target_db}'

Metadata

Assignees: no one assigned
Status: 🚧 Needs Repro
Milestone: no milestone
Development: no branches or pull requests