[SUPPORT] Issue with Repartition on Kafka Input DataFrame and Same Precombine Value Rows In One Batch #10995
Labels: on-call-triaged, priority:minor, spark
Describe the problem you faced
I'm operating a typical Hudi workload that uses Spark Structured Streaming to read CDC events from Kafka and upsert them into S3.
I've encountered an issue where, when rows with the same precombine value are present in the same batch, applying repartition() to the Kafka input DataFrame causes a row other than the last one to be saved.
I originally faced slow performance in the tagging stage. To address this, I applied repartition() to the input DataFrame, which did improve performance. Unfortunately, it also led to data being saved in the wrong order.
To Reproduce
Steps to reproduce the behavior:
1. Produce 1000 records to Kafka (Offset 1 and Offset 2; screenshots omitted), incrementing the value of the c2 field by 1 for each record (1 ~ 1000). c1 is the precombine field, and c2 is the field used to distinguish each row.
2. Run the streaming job. (code sample and write config (hudiOptions) omitted)
3. Check the saved results. (I used AWS Athena.)
4. Repeat steps 1 and 3 to verify whether the value of c2 continues to change.
The c2 value changes with each repetition. (the saved results are omitted here)
If repartition() is removed from the code, the c2 value is correctly saved as the last offset value, 1000.
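This behavior is consistent with how precombine ties are resolved: when two records share the same key and the same precombine value, the survivor depends on the order in which they meet during deduplication, and repartition() makes that order nondeterministic. The following is a toy Python model of this (not Hudi's actual code; the `pre_combine` function is a hypothetical stand-in for the payload's tie-breaking, assumed here to keep the incoming record on ties so that stream order decides the winner):

```python
import random
from functools import reduce

def pre_combine(current, incoming):
    # Toy model of precombine: the record with the greater ordering value
    # (c1) wins; on a tie, the incoming (later-encountered) record is kept.
    # This is an assumption for illustration, not Hudi's exact payload code.
    return current if current["c1"] > incoming["c1"] else incoming

# 1000 records for one record key: identical precombine value c1,
# with c2 = 1..1000 to tell the rows apart (as in the repro steps).
records = [{"c1": 1, "c2": i} for i in range(1, 1001)]

# Without repartition: records arrive in Kafka order, so the last
# record (c2 = 1000) survives the tie.
in_order = reduce(pre_combine, records)

# With repartition: the shuffle can deliver records in an arbitrary
# order, so an arbitrary record survives the tie instead.
shuffled = records[:]
random.shuffle(shuffled)
after_shuffle = reduce(pre_combine, shuffled)
```

Under this model, `in_order` is always the c2 = 1000 record, while `after_shuffle` is simply whichever record happened to come last in the shuffled order, which matches the nondeterministic results observed above.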
Since I'll be using a single Kafka partition, I considered using the Kafka offset number as the precombine field. However, Hudi does not support changes to the precombine field.
I want to improve the performance of the tagging stage through increased parallelism without causing issues with data ordering. How can I resolve this?
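One direction to explore (a sketch, not verified on this exact EMR/Hudi version): rather than repartitioning the input DataFrame, raise the parallelism of Hudi's own index and upsert shuffles through its write options, so Hudi parallelizes internally without reordering the batch before deduplication. The option names below exist in Hudi 0.10.x; the values are placeholders to tune:

```python
# Hypothetical tuning sketch: let Hudi parallelize the tagging/upsert
# shuffles itself instead of calling repartition() on the input DataFrame.
# The parallelism values (200) are placeholders, not recommendations.
hudi_tuning_options = {
    # Parallelism of the bloom-index lookup (the tagging stage).
    "hoodie.bloom.index.parallelism": "200",
    # Parallelism of the upsert shuffle.
    "hoodie.upsert.shuffle.parallelism": "200",
    # Combine records with the same key before upsert, so only the
    # winning precombine record is written (default is already true).
    "hoodie.combine.before.upsert": "true",
}
```

These would be merged into the existing hudiOptions map passed to the writer.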
Environment Description
Hudi version : 0.10.1-amzn-0 (EMR 6.6.0)
Spark version : 3.2.0 (EMR 6.6.0)
Hive version : 3.1.2 (EMR 6.6.0)
Hadoop version : 3.2.1 (EMR 6.6.0)
Storage (HDFS/S3/GCS..) : S3
Running on Docker? (yes/no) : no