
[BUG] Hudi 0.11.x support for Spark with CTAS throws hive sync error #7353

@atharvai

Description

Describe the problem you faced

Scenario: Create table using Spark SQL with Spark 3.2 (EMR 6.7.0) and Glue Data Catalog for hive sync
Expected: Successful table creation and registration with hive/glue
Actual: Successful table creation and registration with hive/glue, AND an error that the table already exists.

This scenario does not fail when using the DataFrameWriter API instead of Spark SQL.

This suggests that the SQL path performs the hive sync twice: the first sync succeeds, but the second throws a "table already exists" error, causing the job to fail.
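For comparison, the DataFrameWriter path that does not fail can be sketched as follows. This is a minimal sketch: the option dict simply mirrors the SQL options above, the `{...}` placeholders are the same templated values used throughout this report, and the commented-out write call assumes a live SparkSession with the Hudi bundle on the classpath.

```python
# Hudi write options mirroring the CTAS options above.
# The {placeholder} values are templated, as in the SQL reproduction.
hudi_options = {
    "hoodie.table.name": "{target_table_name}",
    "hoodie.datasource.write.operation": "upsert",
    "hoodie.datasource.write.table.name": "{target_table_name}",
    "hoodie.datasource.write.recordkey.field": "{primary_key}",
    "hoodie.datasource.write.precombine.field": "{precombine_field}",
    "hoodie.datasource.write.partitionpath.field": "{partition_fields}",
    "hoodie.datasource.write.keygenerator.class": "org.apache.hudi.keygen.ComplexKeyGenerator",
    "hoodie.datasource.hive_sync.enable": "true",
    "hoodie.datasource.hive_sync.mode": "hms",
    "hoodie.datasource.hive_sync.use_jdbc": "false",
    "hoodie.datasource.hive_sync.partition_fields": "{partition_fields}",
    "hoodie.datasource.hive_sync.database": "{target_db}",
    "hoodie.datasource.hive_sync.table": "{target_table_name}",
}

# With a SparkSession `spark` available, the write that succeeds is roughly:
#   (spark.table("{source_db}.{source_table}")
#        .write.format("hudi")
#        .options(**hudi_options)
#        .mode("overwrite")
#        .save("s3://{target_bucket_name}/{target_table_name}/"))
```

Here the hive sync runs exactly once, inside the Hudi write, which is consistent with the theory that the SQL/CTAS path adds a second, redundant sync.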

To Reproduce

Steps to reproduce the behavior:

  1. Run the following Spark SQL with the Glue catalog configured:
CREATE TABLE {target_db}.{target_table_name} USING hudi
LOCATION 's3://{target_bucket_name}/{target_table_name}/'
OPTIONS (
    type = 'cow', primaryKey='{primary_key}', preCombineField='{precombine_field}',
    hoodie.table.name='{target_table_name}',
    hoodie.datasource.write.operation='upsert',
    hoodie.datasource.write.table.name='{target_table_name}',
    hoodie.datasource.write.recordkey.field='{primary_key}',
    hoodie.datasource.write.precombine.field='{precombine_field}',
    hoodie.datasource.write.partitionpath.field='{partition_fields}',
    hoodie.datasource.write.keygenerator.class='org.apache.hudi.keygen.ComplexKeyGenerator',

    hoodie.datasource.hive_sync.enable='true',
    hoodie.datasource.hive_sync.mode='hms',
    hoodie.datasource.hive_sync.use_jdbc='false',
    hoodie.datasource.write.hive_style_partitioning='false',
    hoodie.datasource.hive_sync.partition_fields='{partition_fields}',
    hoodie.datasource.hive_sync.database='{target_db}',
    hoodie.datasource.hive_sync.table='{target_table_name}',

    hoodie.write.concurrency.mode='optimistic_concurrency_control',
    hoodie.cleaner.policy.failed.writes='LAZY',
    hoodie.write.lock.provider='org.apache.hudi.aws.transaction.lock.DynamoDBBasedLockProvider',
    hoodie.write.lock.dynamodb.table='hudi_locks_{args.environment}',
    hoodie.write.lock.dynamodb.partition_key='{target_table_name}',
    hoodie.write.lock.dynamodb.region='{region}',
    hoodie.write.lock.dynamodb.billing_mode='PAY_PER_REQUEST',
    hoodie.write.lock.dynamodb.endpoint_url='dynamodb.{region}.amazonaws.com'
)
PARTITIONED BY ({partition_fields})
AS
SELECT *
FROM {source_db}.{source_table};

Expected behavior

Successful table creation and registration with hive/glue, with the Spark job completing successfully.

Environment Description
EMR 6.7.0

  • Hudi version : 0.11.x (both 0.11.0 and 0.11.1)
  • Spark version : 3.2.1 (EMR 6.7.0)
  • Hive version : 3.1.3 (EMR 6.7.0)
  • Hadoop version :
  • Storage (HDFS/S3/GCS..) : S3
  • Running on Docker? (yes/no) : no


Stacktrace

Traceback (most recent call last):
  File "/tmp/spark-41761cf5-73dc-4128-adba-ba4bd8670d7c/hudi_remodeller.py", line 84, in <module>
    args.primary_key, args.precombine_field, args.partition_fields, args.region)
  File "/tmp/spark-41761cf5-73dc-4128-adba-ba4bd8670d7c/hudi_remodeller.py", line 62, in remodel_table
    spark.sql(sql)
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/session.py", line 723, in sql
  File "/usr/lib/spark/python/lib/py4j-0.10.9.3-src.zip/py4j/java_gateway.py", line 1322, in __call__
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 117, in deco
pyspark.sql.utils.AnalysisException: Table or view '{target_table_name}' already exists in database '{target_db}'

Metadata

Assignees: no one assigned
Status: 🚧 Needs Repro
Milestone: no milestone
Development: no branches or pull requests