Open
Labels
- issue:data-consistency (Data consistency issues: duplicates/phantoms)
- priority:critical (Production degraded; pipelines stalled)
Description
Describe the problem you faced
A few days ago in the production environment, a datanode in the Hadoop cluster went down, causing the Flink streaming write task (for a Hudi bucket MOR table) to fail. After restarting the Flink task, when we read the table with Spark 3.2 or Presto 333, we found duplicate records under the same primary key. The duplicate records have identical values in the Hudi system fields (_hoodie_commit_time, _hoodie_commit_seqno, _hoodie_file_name).
Note: This Flink write task had been running normally for several days; there were no duplicate records before the datanode went down.
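To make the symptom concrete, here is a minimal sketch of the kind of check we ran to confirm the duplicates. The sample rows and the key `id-1` are hypothetical; in practice the rows came from a Spark/Presto query selecting the record key plus the Hudi system fields. The point is that the duplicate rows are byte-for-byte identical in their system fields, so they cannot be ordinary updates of the same key.

```python
from collections import defaultdict

# Hypothetical sample rows mimicking the query result:
# (record key, _hoodie_commit_time, _hoodie_commit_seqno, _hoodie_file_name)
rows = [
    ("id-1", "20240101120000", "20240101120000_0_1", "f1.parquet"),
    ("id-1", "20240101120000", "20240101120000_0_1", "f1.parquet"),  # duplicate, identical system fields
    ("id-2", "20240101120000", "20240101120000_0_2", "f1.parquet"),
]

# Group the full rows by record key.
groups = defaultdict(list)
for row in rows:
    groups[row[0]].append(row)

# Flag keys that appear more than once AND whose rows are identical,
# i.e. true physical duplicates rather than distinct versions.
duplicates = {
    key: rs for key, rs in groups.items()
    if len(rs) > 1 and len(set(rs)) == 1
}
print(sorted(duplicates))
```

A healthy upsert path should never produce two rows with the same key and the same `_hoodie_commit_seqno`, which is why this pattern points at a write-path problem after the failed checkpoint rather than at reader-side merging.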
Environment Description
- Hudi version: 0.13.0
- Spark version: 3.2
- Hive version: 3.1
- Hadoop version: 3.0
- Storage (HDFS/S3/GCS..): HDFS
- Running on Docker? (yes/no): no