Duplicate Row in Same Partition using Global Bloom Index #9536

@Raghvendradubey

Description

Hi Team,

I am facing an issue of duplicate record keys when upserting data into Hudi on EMR.

Hudi Jar -
hudi-spark3.1.2-bundle_2.12-0.10.1.jar

EMR Version -
emr-6.5.0

Workflow -
files on S3 -> EMR(hudi) -> Hudi Tables(S3)

Schedule - once in a day

Insert Data Size -
5 to 10 MB per batch

Hudi Configuration for Upsert -

hudi_options = {
'hoodie.table.name': "txn_table",
'hoodie.datasource.write.recordkey.field': "transaction_id",
'hoodie.datasource.write.partitionpath.field': 'billing_date',
'hoodie.datasource.write.table.name': "txn_table",
'hoodie.datasource.write.operation': 'upsert',
'hoodie.datasource.write.precombine.field': 'transaction_id',
'hoodie.index.type': "GLOBAL_BLOOM",
'hoodie.bloom.index.update.partition.path': "true",
'hoodie.upsert.shuffle.parallelism': 10,
'hoodie.insert.shuffle.parallelism': 10,
'hoodie.datasource.hive_sync.database': "dwh",
'hoodie.datasource.hive_sync.table': "txn_table",
'hoodie.datasource.hive_sync.partition_fields': "billing_date",
'hoodie.datasource.write.hive_style_partitioning': "true",
'hoodie.datasource.hive_sync.enable': "true",
'hoodie.datasource.write.keygenerator.consistent.logical.timestamp.enabled': "true",
'hoodie.datasource.hive_sync.support_timestamp': "true",
'hoodie.metadata.enable': "true"
}
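Since duplicates within a partition often originate from the same key reaching more than one file group, one common mitigation is to deduplicate each incoming batch on the record key before the upsert, keeping the row with the highest ordering value (analogous to Hudi's precombine semantics; note the config above sets the precombine field to the record key itself rather than an ordering column). A minimal pure-Python sketch of that pre-write dedup; the `rows`, `updated_at`, and `amount` names are illustrative, not part of the Hudi API:

```python
# Sketch: drop duplicate record keys from an incoming batch, keeping the
# row with the highest ordering value. The field names used here
# (transaction_id, updated_at, amount) are illustrative assumptions.
def dedup_batch(rows, key_field="transaction_id", order_field="updated_at"):
    latest = {}
    for row in rows:
        key = row[key_field]
        if key not in latest or row[order_field] > latest[key][order_field]:
            latest[key] = row
    return list(latest.values())

batch = [
    {"transaction_id": "t1", "updated_at": 1, "amount": 10},
    {"transaction_id": "t1", "updated_at": 2, "amount": 12},
    {"transaction_id": "t2", "updated_at": 1, "amount": 5},
]
deduped = dedup_batch(batch)
```

In the real pipeline the same effect is achieved by a per-key reduce on the incoming DataFrame before calling the Hudi writer.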

Issue Occurrence -
Our job has been running in production for about a month, and this issue has appeared for the first time.
Even when I tried to reproduce the issue with the same dataset, it was not reproducible; the records updated successfully.

Issue Steps -

1 - We first insert a batch of data into txn_table; transaction_id (defined as the record key) is unique throughout the partition.
2 - The next day, an update for that record key creates a new row with the same record key in the same partition, carrying the updated value.
3 - Both duplicate rows can be read, but when I try to update again, only the latest row is updated.
4 - On inspecting the parquet files, the duplicate record with the updated value was present in a different file within the same partition.
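To confirm how widespread the duplication is, the table can be scanned for record keys that appear more than once within a partition, using Hudi's `_hoodie_partition_path` and `_hoodie_record_key` metadata columns. A sketch in plain Python over sample rows (in practice these columns come from reading the Hudi table with Spark):

```python
from collections import Counter

# Count (partition, record_key) pairs; any count > 1 indicates duplicate
# rows in the same partition. The sample rows below are illustrative.
rows = [
    {"_hoodie_partition_path": "billing_date=2022-01-01", "_hoodie_record_key": "t1"},
    {"_hoodie_partition_path": "billing_date=2022-01-01", "_hoodie_record_key": "t1"},
    {"_hoodie_partition_path": "billing_date=2022-01-01", "_hoodie_record_key": "t2"},
]
counts = Counter((r["_hoodie_partition_path"], r["_hoodie_record_key"]) for r in rows)
duplicates = {k: n for k, n in counts.items() if n > 1}
```

The same check is a one-line `groupBy(...).count().filter("count > 1")` on a Spark DataFrame.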

Steps to Reproduce -

The issue is not reproducible: when the same dataset was ingested again with the same configuration, the upsert worked fine.
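Until the root cause is found, the existing duplicates can be repaired by keeping, for each record key, only the row with the latest `_hoodie_commit_time` and rewriting the affected partition (for example with an insert-overwrite of that partition). A minimal sketch of the survivor selection; the sample rows and the `amount` field are illustrative assumptions:

```python
# Sketch of a cleanup pass: among duplicate rows sharing a record key,
# keep the one with the latest _hoodie_commit_time. Sample rows are
# illustrative; in practice they come from reading the Hudi table.
def keep_latest_commit(rows):
    survivors = {}
    for row in rows:
        key = row["_hoodie_record_key"]
        if (key not in survivors
                or row["_hoodie_commit_time"] > survivors[key]["_hoodie_commit_time"]):
            survivors[key] = row
    return sorted(survivors.values(), key=lambda r: r["_hoodie_record_key"])

rows = [
    {"_hoodie_record_key": "t1", "_hoodie_commit_time": "20220101010101", "amount": 10},
    {"_hoodie_record_key": "t1", "_hoodie_commit_time": "20220102010101", "amount": 12},
    {"_hoodie_record_key": "t2", "_hoodie_commit_time": "20220101010101", "amount": 5},
]
cleaned = keep_latest_commit(rows)
```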

Please let me know if I am missing some configuration.

Thanks
Raghvendra

Metadata

Assignees: none
Labels: issue:data-consistency (Data consistency issues: duplicates/phantoms), priority:critical (Production degraded; pipelines stalled)
Status: Done
Milestone: none