
[SUPPORT] duplicate rows in my table #10781

Closed
chenbodeng719 opened this issue Feb 29, 2024 · 7 comments
Labels
data-consistency (phantoms, duplicates, write skew, inconsistent snapshot), on-call-triaged, priority:major (degraded perf; unable to move forward; potential bugs)

Comments


chenbodeng719 commented Feb 29, 2024


Describe the problem you faced

I have duplicate rows in my table.

To Reproduce

Steps to reproduce the behavior:

Below is my Flink Hudi config. I consume data from Kafka and upsert into the hudi_sink table, then read the table back with PySpark and get duplicate rows.

# flink write hudi conf
        CREATE TABLE hudi_sink(
            new_uid STRING PRIMARY KEY NOT ENFORCED,
            uid STRING,
            oridata STRING,
            part INT,
            user_update_date STRING,
            update_time TIMESTAMP_LTZ(3) 
        ) PARTITIONED BY (
            `part`
        ) WITH (
            'table.type' = 'MERGE_ON_READ',
            'connector' = 'hudi',
            'path' = '%s',
            'write.operation' = 'upsert',
            'precombine.field' = 'update_time',
            'write.tasks' = '%s',
            'index.type' = 'BUCKET',
            'hoodie.bucket.index.hash.field' = 'new_uid',
            'hoodie.bucket.index.num.buckets' = '%s',
            'clean.retain_commits' = '0',
            'compaction.async.enabled' = 'false'
        )

# spark read

readOptions = {}
prof_df = sqlc.read \
    .format('org.apache.hudi') \
    .options(**readOptions) \
    .load(tpath)
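
A quick way to confirm the duplicates on the read side is to count rows per record key after loading the table. A minimal PySpark sketch, reusing prof_df from the snippet above and assuming new_uid is the record key:

# sketch: count rows per record key, assuming new_uid is the record key
from pyspark.sql import functions as F

dup_keys = (prof_df
            .groupBy('new_uid')
            .count()
            .filter(F.col('count') > 1))

dup_keys.show(20, truncate=False)  # keys that appear more than once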

Expected behavior

Only one row per new_uid (the record key) after the upsert.

Environment Description

  • Hudi version : 0.14.1

  • Spark version : 3.3.0

  • Flink version : 1.16.0

  • Hive version :

  • Hadoop version : 3.3.3

  • Storage (HDFS/S3/GCS..) : s3

  • Running on Docker? (yes/no) :


@ad1happy2go
Collaborator

@chenbodeng719 Can you please let us know which Hudi/Flink/Spark versions you are using?
Are you getting duplicate rows only when reading with Spark, or do you see the same behaviour when you read back with Flink too?

@chenbodeng719
Author

I didn't try with Flink. The problem happens when I use Spark.

@ad1happy2go
Collaborator

@chenbodeng719 Can you post a screenshot of the duplicate records? Do they belong to different file groups?
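
One way to answer the file-group question without a screenshot is to look at the Hudi meta columns for the duplicated keys. A minimal PySpark sketch, assuming the same sqlc and tpath as in the issue and new_uid as the record key; _hoodie_file_name and _hoodie_partition_path are the standard Hudi meta columns that identify where each copy is stored:

# sketch: list every copy of each duplicated key together with its meta columns
from pyspark.sql import functions as F

df = sqlc.read.format('org.apache.hudi').load(tpath)

dup_ids = (df.groupBy('new_uid').count()
             .filter(F.col('count') > 1)
             .select('new_uid'))

(df.join(dup_ids, on='new_uid')
   .select('new_uid', '_hoodie_record_key', '_hoodie_partition_path',
           '_hoodie_file_name', '_hoodie_commit_time')
   .orderBy('new_uid')
   .show(50, truncate=False))
# different _hoodie_file_name values for the same key mean the copies
# live in different file groups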

@chenbodeng719
Author

chenbodeng719 commented Feb 29, 2024

Is it possible that I bulk-inserted a dataset with some duplicate keys, and that any later upsert on one of those keys then updates both existing rows? Like the photo below:

[image: screenshot of the duplicate rows]

@ad1happy2go
Collaborator

Did you use only 0.14.1, or is this a table upgraded from a previous version?
Can you also provide the values of the Hudi meta columns?

bulk_insert itself can ingest duplicates. Did you get duplicates right after the bulk_insert itself? If that's the case, upsert is going to update both records. Did you confirm whether you already had these duplicates right after the bulk_insert?

Running bulk_insert twice on the same data can also cause this issue.
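
If the duplicates were indeed introduced by the bulk_insert, one way to avoid them is to deduplicate on the record key before bulk inserting. A minimal PySpark sketch under the assumption that the bootstrap load is done from Spark; source_df is an illustrative name for the bootstrap DataFrame, and the writer options are the standard Hudi datasource configs matching the Flink table definition above:

# sketch: keep only the latest row per new_uid (by update_time) before bulk_insert
from pyspark.sql import functions as F
from pyspark.sql.window import Window

w = Window.partitionBy('new_uid').orderBy(F.col('update_time').desc())
deduped = (source_df
           .withColumn('_rn', F.row_number().over(w))
           .filter(F.col('_rn') == 1)
           .drop('_rn'))

(deduped.write.format('org.apache.hudi')
    .option('hoodie.table.name', 'hudi_sink')
    .option('hoodie.datasource.write.operation', 'bulk_insert')
    .option('hoodie.datasource.write.recordkey.field', 'new_uid')
    .option('hoodie.datasource.write.partitionpath.field', 'part')
    .option('hoodie.datasource.write.precombine.field', 'update_time')
    .mode('append')
    .save(tpath))

Depending on the writer version, the hoodie.combine.before.insert option can perform a similar pre-write dedup on the record key.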

@chenbodeng719
Author


"if that's the case, upsert is going to update both records. " I guess it's my case. First, bulk insert brings some duplicate key into the table. Then when the upsert with duplicate key comes, it updates the duplicate rows with same key. In my case, two rows for one dup key has been changed.
I wonder if there are five rows for one dup key, it updates the five rows?
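
One way to check this empirically is to upsert a new value for one of the duplicated keys and then read back every copy of that key together with its meta columns; if all copies show the new update_time, the upsert rewrote all of them. A minimal PySpark sketch; dup_key is a hypothetical placeholder for one of the duplicated new_uid values:

# sketch: inspect every copy of a known-duplicated key after an upsert
from pyspark.sql import functions as F

dup_key = '...'  # hypothetical placeholder: one of the duplicated new_uid values

(sqlc.read.format('org.apache.hudi').load(tpath)
    .filter(F.col('new_uid') == dup_key)
    .select('new_uid', 'update_time', '_hoodie_commit_time', '_hoodie_file_name')
    .show(truncate=False))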

@ad1happy2go
Collaborator

ad1happy2go commented Feb 29, 2024 via email

@codope added the data-consistency, on-call-triaged, and priority:major labels on Feb 29, 2024