Description
Hi Team,
I am facing an issue of duplicate record keys while data upserts into Hudi on EMR.
Hudi Jar -
hudi-spark3.1.2-bundle_2.12-0.10.1.jar
EMR Version -
emr-6.5.0
Workflow -
files on S3 -> EMR(hudi) -> Hudi Tables(S3)
Schedule - once a day
Insert Data Size -
5 to 10 MB per batch
Hudi Configuration for Upsert -
hudi_options = {
'hoodie.table.name': "txn_table",
'hoodie.datasource.write.recordkey.field': "transaction_id",
'hoodie.datasource.write.partitionpath.field': 'billing_date',
'hoodie.datasource.write.table.name': "txn_table",
'hoodie.datasource.write.operation': 'upsert',
'hoodie.datasource.write.precombine.field': 'transaction_id',
'hoodie.index.type': "GLOBAL_BLOOM",
'hoodie.bloom.index.update.partition.path': "true",
'hoodie.upsert.shuffle.parallelism': 10,
'hoodie.insert.shuffle.parallelism': 10,
'hoodie.datasource.hive_sync.database': "dwh",
'hoodie.datasource.hive_sync.table': "txn_table",
'hoodie.datasource.hive_sync.partition_fields': "billing_date",
'hoodie.datasource.write.hive_style_partitioning': "true",
'hoodie.datasource.hive_sync.enable': "true",
'hoodie.datasource.write.keygenerator.consistent.logical.timestamp.enabled': "true",
'hoodie.datasource.hive_sync.support_timestamp': "true",
'hoodie.metadata.enable': "true"
}
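For context, the write itself is a standard PySpark Hudi upsert driven by the options above. A minimal sketch of the wiring (the function name, the `df` argument, and the target path are assumptions for illustration, not the actual job code):

```python
# Sketch of the upsert wiring, reproducing the key options from the
# configuration above. Paths and function names are hypothetical.
hudi_options = {
    'hoodie.table.name': 'txn_table',
    'hoodie.datasource.write.recordkey.field': 'transaction_id',
    'hoodie.datasource.write.partitionpath.field': 'billing_date',
    'hoodie.datasource.write.operation': 'upsert',
    'hoodie.datasource.write.precombine.field': 'transaction_id',
    'hoodie.index.type': 'GLOBAL_BLOOM',
}

def write_upsert(df, target_path):
    """Save in append mode so Hudi runs the configured upsert operation."""
    (df.write.format('hudi')
       .options(**hudi_options)
       .mode('append')
       .save(target_path))
```

Note that with `upsert`, records sharing a `transaction_id` in the incoming batch are deduplicated using the precombine field before being merged into the table.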
Issue Occurrence -
Our job has been running in production for around a month, and this is the first time the issue has appeared. Even when I tried to reproduce it with the same dataset, the issue was not reproducible; the records updated successfully.
Issue Steps -
1 - First we insert a batch of data into txn_table; the record key, transaction_id, is unique throughout the partition.
2 - The next day, when that record key is updated, a new row is created with the same record key in the same partition, carrying the updated values.
3 - Both duplicate rows can be read, but when I try to update the key again, only the latest row is updated.
4 - On inspecting the parquet files, the duplicate record with the updated value was present in a different file within the same partition.
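To surface duplicates like this across parquet files, Hudi's metadata columns can be queried directly. A sketch, assuming the table is the Hive-synced dwh.txn_table from the configuration above and that it is registered with the Spark session:

```python
# Hedged sketch: SQL that lists record keys appearing more than once,
# along with the parquet files holding each copy. _hoodie_record_key and
# _hoodie_file_name are Hudi's built-in metadata columns; the table name
# is assumed from the hive_sync settings above.
dup_check_sql = """
    SELECT _hoodie_record_key,
           COUNT(*)                       AS copies,
           COLLECT_SET(_hoodie_file_name) AS files
    FROM dwh.txn_table
    GROUP BY _hoodie_record_key
    HAVING COUNT(*) > 1
"""
# On the EMR cluster: spark.sql(dup_check_sql).show(truncate=False)
```

If the duplicates show up in different files here, that matches step 4 above and points at the index lookup (GLOBAL_BLOOM in this case) having missed the existing file group during tagging.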
Steps to Reproduce -
The issue is not reproducible: when the same dataset was ingested again with the same configuration, the upsert worked fine.
Please let me know if I am missing some configuration.
Thanks
Raghvendra