
[SUPPORT] Spark3.2 encountered duplicate data while reading the hudi bucket MOR table #9244

@fujianhua168

Description


Describe the problem you faced
A few days ago in our production environment, a datanode in the Hadoop cluster went down, causing the Flink streaming write task (for a Hudi bucket MOR table) to fail. After restarting the Flink task, when we read the table with Spark 3.2 or Presto 333, we found duplicate records under the same primary key, and the duplicates have identical Hudi system field values (_hoodie_commit_time, _hoodie_commit_seqno, _hoodie_file_name).
Note: this Flink write task had been running normally for several days; there were no duplicate records before the datanode went down.
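To confirm that these are exact file-level duplicate copies (same _hoodie_commit_seqno for the same key) rather than multiple commit versions of a record, a check along the following lines can be run over the rows read back. The row layout below is a hypothetical illustration, not data from the actual table:

```python
from collections import Counter

# Hypothetical sample of rows read back from the MOR table.
# In the reported case the duplicates share identical Hudi system fields,
# which suggests the same record was emitted twice, not two versions of it.
rows = [
    {"_hoodie_record_key": "id:1", "_hoodie_commit_seqno": "20230715_0_1", "val": "a"},
    {"_hoodie_record_key": "id:1", "_hoodie_commit_seqno": "20230715_0_1", "val": "a"},  # duplicate copy
    {"_hoodie_record_key": "id:2", "_hoodie_commit_seqno": "20230715_0_2", "val": "b"},
]

# Count (record key, commit seqno) pairs: a count > 1 means an exact duplicate,
# since a legitimate update would carry a different commit seqno.
counts = Counter((r["_hoodie_record_key"], r["_hoodie_commit_seqno"]) for r in rows)
dupes = {k: n for k, n in counts.items() if n > 1}
print(dupes)  # → {('id:1', '20230715_0_1'): 2}
```

The same grouping can be expressed in Spark SQL by grouping on the primary key plus the Hudi system columns and filtering on count(*) > 1.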

(two screenshots attached showing the duplicate rows returned by the query)

Environment Description

  • Hudi version : 0.13.0
  • Spark version : 3.2
  • Hive version : 3.1
  • Hadoop version : 3.0
  • Storage (HDFS/S3/GCS..) : HDFS
  • Running on Docker? (yes/no) : no


Metadata

Labels: issue:data-consistency (Data consistency issues: duplicates/phantoms), priority:critical (Production degraded; pipelines stalled)
Status: 👤 User Action
Assignees: none
Milestone: none