Duplicate Row in Same Partition using Global Bloom Index #9536

@Raghvendradubey

Description

Hi Team,

I am facing an issue of duplicate record keys when upserting data into Hudi on EMR.

Hudi Jar -
hudi-spark3.1.2-bundle_2.12-0.10.1.jar

EMR Version -
emr-6.5.0

Workflow -
files on S3 -> EMR(hudi) -> Hudi Tables(S3)

Schedule - once in a day

Insert Data Size -
5 to 10 MB per batch

Hudi Configuration for Upsert -

hudi_options = {
'hoodie.table.name': "txn_table",
'hoodie.datasource.write.recordkey.field': "transaction_id",
'hoodie.datasource.write.partitionpath.field': 'billing_date',
'hoodie.datasource.write.table.name': "txn_table",
'hoodie.datasource.write.operation': 'upsert',
'hoodie.datasource.write.precombine.field': 'transaction_id',
'hoodie.index.type': "GLOBAL_BLOOM",
'hoodie.bloom.index.update.partition.path': "true",
'hoodie.upsert.shuffle.parallelism': 10,
'hoodie.insert.shuffle.parallelism': 10,
'hoodie.datasource.hive_sync.database': "dwh",
'hoodie.datasource.hive_sync.table': "txn_table",
'hoodie.datasource.hive_sync.partition_fields': "billing_date",
'hoodie.datasource.write.hive_style_partitioning': "true",
'hoodie.datasource.hive_sync.enable': "true",
'hoodie.datasource.write.keygenerator.consistent.logical.timestamp.enabled': "true",
'hoodie.datasource.hive_sync.support_timestamp': "true",
'hoodie.metadata.enable': "true"
}
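Since duplicates within a partition often originate from the same key reaching more than one file group, one common mitigation is to deduplicate each incoming batch on the record key before the upsert, keeping the row with the highest ordering value (analogous to Hudi's precombine semantics; note the config above sets the precombine field to the record key itself rather than an ordering column). A minimal pure-Python sketch of that pre-write dedup; the `rows`, `updated_at`, and `amount` names are illustrative, not part of the Hudi API:

```python
# Sketch: drop duplicate record keys from an incoming batch, keeping the
# row with the highest ordering value. The field names used here
# (transaction_id, updated_at, amount) are illustrative assumptions.
def dedup_batch(rows, key_field="transaction_id", order_field="updated_at"):
    latest = {}
    for row in rows:
        key = row[key_field]
        if key not in latest or row[order_field] > latest[key][order_field]:
            latest[key] = row
    return list(latest.values())

batch = [
    {"transaction_id": "t1", "updated_at": 1, "amount": 10},
    {"transaction_id": "t1", "updated_at": 2, "amount": 12},
    {"transaction_id": "t2", "updated_at": 1, "amount": 5},
]
deduped = dedup_batch(batch)
```

In the real pipeline the same effect is achieved by a per-key reduce on the incoming DataFrame before calling the Hudi writer.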

Issue Occurrence -
Our job has been running in production for about a month, and this issue has appeared for the first time.
Even when I tried to reproduce the issue with the same dataset, it was not reproducible; the records updated successfully.

Issue Steps -

1 - We first insert a batch of data into txn_table; transaction_id (defined as the record key) is unique throughout the partition.
2 - The next day, an update for that record key creates a new row with the same record key in the same partition, carrying the updated value.
3 - Both duplicate rows can be read, but when I try to update again, only the latest row is updated.
4 - On inspecting the parquet files, the duplicate record with the updated value was present in a different file within the same partition.
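To confirm how widespread the duplication is, the table can be scanned for record keys that appear more than once within a partition, using Hudi's `_hoodie_partition_path` and `_hoodie_record_key` metadata columns. A sketch in plain Python over sample rows (in practice these columns come from reading the Hudi table with Spark):

```python
from collections import Counter

# Count (partition, record_key) pairs; any count > 1 indicates duplicate
# rows in the same partition. The sample rows below are illustrative.
rows = [
    {"_hoodie_partition_path": "billing_date=2022-01-01", "_hoodie_record_key": "t1"},
    {"_hoodie_partition_path": "billing_date=2022-01-01", "_hoodie_record_key": "t1"},
    {"_hoodie_partition_path": "billing_date=2022-01-01", "_hoodie_record_key": "t2"},
]
counts = Counter((r["_hoodie_partition_path"], r["_hoodie_record_key"]) for r in rows)
duplicates = {k: n for k, n in counts.items() if n > 1}
```

The same check is a one-line `groupBy(...).count().filter("count > 1")` on a Spark DataFrame.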

Steps to Reproduce -

The issue is not reproducible: when the same dataset was ingested again with the same configuration, the upsert worked fine.
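Until the root cause is found, the existing duplicates can be repaired by keeping, for each record key, only the row with the latest `_hoodie_commit_time` and rewriting the affected partition (for example with an insert-overwrite of that partition). A minimal sketch of the survivor selection; the sample rows and the `amount` field are illustrative assumptions:

```python
# Sketch of a cleanup pass: among duplicate rows sharing a record key,
# keep the one with the latest _hoodie_commit_time. Sample rows are
# illustrative; in practice they come from reading the Hudi table.
def keep_latest_commit(rows):
    survivors = {}
    for row in rows:
        key = row["_hoodie_record_key"]
        if (key not in survivors
                or row["_hoodie_commit_time"] > survivors[key]["_hoodie_commit_time"]):
            survivors[key] = row
    return sorted(survivors.values(), key=lambda r: r["_hoodie_record_key"])

rows = [
    {"_hoodie_record_key": "t1", "_hoodie_commit_time": "20220101010101", "amount": 10},
    {"_hoodie_record_key": "t1", "_hoodie_commit_time": "20220102010101", "amount": 12},
    {"_hoodie_record_key": "t2", "_hoodie_commit_time": "20220101010101", "amount": 5},
]
cleaned = keep_latest_commit(rows)
```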

Please let me know if I am missing some configuration.

Thanks
Raghvendra

Metadata

Assignees: none
Labels: issue:data-consistency (Data consistency issues: duplicates/phantoms), priority:critical (Production degraded; pipelines stalled)
Status: Done
Milestone: none