[SUPPORT] Duplicate rows in my table #10781
Comments
@chenbodeng719 Can you please let us know what Hudi/Flink/Spark versions you are using?
I didn't try it on Flink. The problem happens when I use Spark.
@chenbodeng719 Can you post a screenshot of the duplicate records? Do they belong to different file groups?
Is it possible that if I bulk insert a dataset with some duplicate keys, any following upsert on the same key would update the item twice, like in the photo below?
Did you use only 0.14.1, or is this a table upgraded from a previous version? Can you also provide the values of the Hudi meta columns? bulk_insert itself can ingest duplicates. Did you get duplicates after the bulk_insert itself? If that's the case, then yes, upsert is going to update both records. Did you confirm whether you had these duplicates right after bulk_insert? Running bulk_insert twice on the same data can also cause this issue.
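For reference, one way to confirm this with PySpark is sketched below. This is a minimal sketch, not part of the original thread: the table path is a placeholder, and it assumes a snapshot read of the table so the Hudi meta columns (`_hoodie_record_key`, `_hoodie_file_name`, etc.) are available to show whether the copies of a key sit in the same file group or different ones.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("hudi-dup-check").getOrCreate()

# Hypothetical path; replace with the actual location of the table.
table_path = "s3://your-bucket/path/to/hudi_sink"

df = spark.read.format("hudi").load(table_path)

# Record keys that appear more than once in the table.
dups = (
    df.groupBy("_hoodie_record_key")
      .count()
      .filter(F.col("count") > 1)
)
dups.show(truncate=False)

# For the duplicated keys, inspect the meta columns: same _hoodie_file_name
# means both copies live in the same file group, different names mean the
# duplicates span file groups.
(df.join(dups.select("_hoodie_record_key"), "_hoodie_record_key")
   .select("_hoodie_commit_time", "_hoodie_record_key",
           "_hoodie_partition_path", "_hoodie_file_name")
   .show(truncate=False))
```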
"if that's the case, upsert is going to update both records. " I guess it's my case. First, bulk insert brings some duplicate key into the table. Then when the upsert with duplicate key comes, it updates the duplicate rows with same key. In my case, two rows for one dup key has been changed. |
Yes, that's correct. You should remove the duplicates after inserting with bulk_insert, or not use bulk_insert at all in this case.
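A minimal sketch of one way to clean up an already-duplicated table, assuming hypothetical field names (`id` as the record key, `ts` as the precombine field) and a placeholder path: read the table, keep the latest row per key, and rewrite it with the `insert_overwrite_table` write operation. This is an illustration of the general idea, not the exact procedure from the thread; test it on a copy of the table first.

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("hudi-dedup").getOrCreate()

# Hypothetical names; replace with your table's path, record key and precombine fields.
table_path = "s3://your-bucket/path/to/hudi_sink"
record_key = "id"
precombine = "ts"

df = spark.read.format("hudi").load(table_path)

# Keep only the latest row per record key, ordered by the precombine field,
# and drop the Hudi meta columns before rewriting.
w = Window.partitionBy(record_key).orderBy(F.col(precombine).desc())
deduped = (
    df.withColumn("_rn", F.row_number().over(w))
      .filter("_rn = 1")
      .drop("_rn", "_hoodie_commit_time", "_hoodie_commit_seqno",
            "_hoodie_record_key", "_hoodie_partition_path", "_hoodie_file_name")
)

# insert_overwrite_table replaces the table contents with the deduplicated data.
# If the table is partitioned, also set hoodie.datasource.write.partitionpath.field.
(deduped.write.format("hudi")
    .option("hoodie.table.name", "hudi_sink")
    .option("hoodie.datasource.write.recordkey.field", record_key)
    .option("hoodie.datasource.write.precombine.field", precombine)
    .option("hoodie.datasource.write.operation", "insert_overwrite_table")
    .mode("append")
    .save(table_path))
```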
Tips before filing an issue
Have you gone through our FAQs?
Join the mailing list to engage in conversations and get faster support at dev-subscribe@hudi.apache.org.
If you have triaged this as a bug, then file an issue directly.
Describe the problem you faced
A clear and concise description of the problem.
To Reproduce
Steps to reproduce the behavior:
I have duplicate rows in my table .
Below is my Flink Hudi config. Data is consumed from Kafka and upserted into the hudi_sink table. I then use PySpark to read the table, but I get duplicate data.
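The Flink config itself is not included here. For the read side, a minimal PySpark check of the kind described above might look like the following sketch; the table path is a placeholder, and it assumes a plain snapshot read so the `_hoodie_record_key` meta column is present.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hudi-read").getOrCreate()

# Hypothetical path; replace with the actual S3 location of hudi_sink.
df = spark.read.format("hudi").load("s3://your-bucket/path/to/hudi_sink")

# If the table had no duplicates, these two counts would match.
total = df.count()
distinct_keys = df.select("_hoodie_record_key").distinct().count()
print(f"rows={total}, distinct record keys={distinct_keys}")
```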
Expected behavior
A clear and concise description of what you expected to happen.
Environment Description
Hudi version : 0.14.1
Spark version : 3.3.0
Flink version : 1.16.0
Hive version :
Hadoop version : 3.3.3
Storage (HDFS/S3/GCS..) : s3
Running on Docker? (yes/no) :
Additional context
Add any other context about the problem here.
Stacktrace
Add the stacktrace of the error.