Description
Describe the problem you faced
Scenario: Create table using Spark SQL with Spark 3.2 (EMR 6.7.0) and Glue Data Catalog for hive sync
Expected: Successful table creation and registration with hive/glue
Actual: The table is created and registered with hive/glue successfully, but the job then fails with an error that the table already exists.
This scenario does not fail if using DataFrameWriter instead of SQL.
This indicates that the SQL writer is somehow performing the hive sync twice: the first sync succeeds, but the second one throws an error, causing the job to fail.
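For reference, a minimal sketch of the DataFrameWriter path that does not hit the duplicate sync. It reuses the same Hudi options as the SQL statement in the repro below; the helper function name and the `{...}` placeholders are illustrative, not taken from the actual job:

```python
# Sketch of the working DataFrameWriter path (assumption: same options as the
# CREATE TABLE repro; lock-provider options omitted for brevity).

def hudi_write_options(target_db, target_table_name, primary_key,
                       precombine_field, partition_fields):
    """Build the Hudi write/hive-sync options used in the CREATE TABLE repro."""
    return {
        "hoodie.table.name": target_table_name,
        "hoodie.datasource.write.operation": "upsert",
        "hoodie.datasource.write.table.name": target_table_name,
        "hoodie.datasource.write.recordkey.field": primary_key,
        "hoodie.datasource.write.precombine.field": precombine_field,
        "hoodie.datasource.write.partitionpath.field": partition_fields,
        "hoodie.datasource.write.keygenerator.class":
            "org.apache.hudi.keygen.ComplexKeyGenerator",
        "hoodie.datasource.hive_sync.enable": "true",
        "hoodie.datasource.hive_sync.mode": "hms",
        "hoodie.datasource.hive_sync.use_jdbc": "false",
        "hoodie.datasource.hive_sync.partition_fields": partition_fields,
        "hoodie.datasource.hive_sync.database": target_db,
        "hoodie.datasource.hive_sync.table": target_table_name,
    }

# Usage (requires a SparkSession with the Hudi bundle on the classpath):
# opts = hudi_write_options("{target_db}", "{target_table_name}",
#                           "{primary_key}", "{precombine_field}",
#                           "{partition_fields}")
# (spark.table("{source_db}.{source_table}")
#      .write.format("hudi")
#      .options(**opts)
#      .mode("overwrite")
#      .save("s3://{target_bucket_name}/{target_table_name}/"))
```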
To Reproduce
Steps to reproduce the behavior:
- Run the following Spark SQL with the Glue catalog configured
CREATE TABLE {target_db}.{target_table_name} using hudi
location 's3://{target_bucket_name}/{target_table_name}/'
options (
type = 'cow', primaryKey='{primary_key}', preCombineField='{precombine_field}',
hoodie.table.name='{target_table_name}',
hoodie.datasource.write.operation='upsert',
hoodie.datasource.write.table.name='{target_table_name}',
hoodie.datasource.write.recordkey.field='{primary_key}',
hoodie.datasource.write.precombine.field='{precombine_field}',
hoodie.datasource.write.partitionpath.field='{partition_fields}',
hoodie.datasource.write.keygenerator.class='org.apache.hudi.keygen.ComplexKeyGenerator',
hoodie.datasource.hive_sync.enable='true',
hoodie.datasource.hive_sync.mode='hms',
hoodie.datasource.hive_sync.use_jdbc='false',
hoodie.datasource.write.hive_style_partitioning='false',
hoodie.datasource.hive_sync.partition_fields='{partition_fields}',
hoodie.datasource.hive_sync.database='{target_db}',
hoodie.datasource.hive_sync.table='{target_table_name}',
hoodie.write.concurrency.mode='optimistic_concurrency_control',
hoodie.cleaner.policy.failed.writes='LAZY',
hoodie.write.lock.provider='org.apache.hudi.aws.transaction.lock.DynamoDBBasedLockProvider',
hoodie.write.lock.dynamodb.table='hudi_locks_{args.environment}',
hoodie.write.lock.dynamodb.partition_key='{target_table_name}',
hoodie.write.lock.dynamodb.region='{region}',
hoodie.write.lock.dynamodb.billing_mode='PAY_PER_REQUEST',
hoodie.write.lock.dynamodb.endpoint_url='dynamodb.{region}.amazonaws.com'
)
partitioned by ({partition_fields})
AS
SELECT *
FROM {source_db}.{source_table};
Expected behavior
Successful table creation and registration with hive/glue, with the Spark job completing successfully.
Environment Description
EMR 6.7.0
- Hudi version : 0.11.x (both 0.11.0 and 0.11.1)
- Spark version : 3.2.1 (EMR 6.7.0)
- Hive version : 3.1.3 (EMR 6.7.0)
- Hadoop version :
- Storage (HDFS/S3/GCS..) : S3
- Running on Docker? (yes/no) : no
Additional context
Add any other context about the problem here.
Stacktrace
Traceback (most recent call last):
File "/tmp/spark-41761cf5-73dc-4128-adba-ba4bd8670d7c/hudi_remodeller.py", line 84, in <module>
args.primary_key, args.precombine_field, args.partition_fields, args.region)
File "/tmp/spark-41761cf5-73dc-4128-adba-ba4bd8670d7c/hudi_remodeller.py", line 62, in remodel_table
spark.sql(sql)
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/session.py", line 723, in sql
File "/usr/lib/spark/python/lib/py4j-0.10.9.3-src.zip/py4j/java_gateway.py", line 1322, in __call__
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 117, in deco
pyspark.sql.utils.AnalysisException: Table or view '{target_table_name}' already exists in database '{target_db}'