
[SUPPORT] duplicate rows in my table #10781

Closed
chenbodeng719 opened this issue Feb 29, 2024 · 7 comments
Labels
data-consistency (phantoms, duplicates, write skew, inconsistent snapshot), on-call-triaged, priority:major (degraded perf; unable to move forward; potential bugs)

Comments


chenbodeng719 commented Feb 29, 2024


Describe the problem you faced

I have duplicate rows in my table.

To Reproduce

Steps to reproduce the behavior:

Below is my Flink Hudi config. I consume data from Kafka and upsert into the hudi_sink table, then read the table back with PySpark and get duplicate rows.

# flink write hudi conf
        CREATE TABLE hudi_sink(
            new_uid STRING PRIMARY KEY NOT ENFORCED,
            uid STRING,
            oridata STRING,
            part INT,
            user_update_date STRING,
            update_time TIMESTAMP_LTZ(3) 
        ) PARTITIONED BY (
            `part`
        ) WITH (
            'table.type' = 'MERGE_ON_READ',
            'connector' = 'hudi',
            'path' = '%s',
            'write.operation' = 'upsert',
            'precombine.field' = 'update_time',
            'write.tasks' = '%s',
            'index.type' = 'BUCKET',
            'hoodie.bucket.index.hash.field' = 'new_uid',
            'hoodie.bucket.index.num.buckets' = '%s',
            'clean.retain_commits' = '0',
            'compaction.async.enabled' = 'false'
        )

# spark read

readOptions = {}
prof_df = sqlc.read \
    .format('org.apache.hudi') \
    .options(**readOptions) \
    .load(tpath)
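
A quick way to confirm the duplicates on the read side is to count rows per record key after loading the table. A minimal PySpark sketch, reusing prof_df from the snippet above and assuming new_uid is the record key:

# sketch: count rows per record key, assuming new_uid is the record key
from pyspark.sql import functions as F

dup_keys = (prof_df
            .groupBy('new_uid')
            .count()
            .filter(F.col('count') > 1))

dup_keys.show(20, truncate=False)  # keys that appear more than once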

Expected behavior

Only one row per new_uid (the record key) after the upsert.

Environment Description

  • Hudi version : 0.14.1

  • Spark version : 3.3.0

  • Flink version : 1.16.0

  • Hive version :

  • Hadoop version : 3.3.3

  • Storage (HDFS/S3/GCS..) : s3

  • Running on Docker? (yes/no) :


@ad1happy2go
Collaborator

@chenbodeng719 Can you please let us know which Hudi/Flink/Spark versions you are using?
Are you getting duplicate rows only when reading with Spark, or do you see the same behaviour when you read back with Flink too?

@chenbodeng719
Author

I didn't try with Flink. The problem happens when I use Spark.

@ad1happy2go
Collaborator

@chenbodeng719 Can you post a screenshot of the duplicate records? Do they belong to different file groups?
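
One way to answer the file-group question without a screenshot is to look at the Hudi meta columns for the duplicated keys. A minimal PySpark sketch, assuming the same sqlc and tpath as in the issue and new_uid as the record key; _hoodie_file_name and _hoodie_partition_path are the standard Hudi meta columns that identify where each copy is stored:

# sketch: list every copy of each duplicated key together with its meta columns
from pyspark.sql import functions as F

df = sqlc.read.format('org.apache.hudi').load(tpath)

dup_ids = (df.groupBy('new_uid').count()
             .filter(F.col('count') > 1)
             .select('new_uid'))

(df.join(dup_ids, on='new_uid')
   .select('new_uid', '_hoodie_record_key', '_hoodie_partition_path',
           '_hoodie_file_name', '_hoodie_commit_time')
   .orderBy('new_uid')
   .show(50, truncate=False))
# different _hoodie_file_name values for the same key mean the copies
# live in different file groups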

@chenbodeng719
Author

chenbodeng719 commented Feb 29, 2024

Is it possible that I bulk-inserted a dataset with some duplicate keys, and that any later upsert on one of those keys then updates both existing rows? Like the photo below:

[image: screenshot of the duplicate rows]

@ad1happy2go
Collaborator

Did you use only 0.14.1, or is this a table upgraded from a previous version?
Can you also provide the values of the Hudi meta columns?

bulk_insert itself can ingest duplicates. Did you get duplicates right after the bulk_insert itself? If that's the case, upsert is going to update both records. Did you confirm whether you already had these duplicates right after the bulk_insert?

Running bulk_insert twice on the same data can also cause this issue.
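
If the duplicates were indeed introduced by the bulk_insert, one way to avoid them is to deduplicate on the record key before bulk inserting. A minimal PySpark sketch under the assumption that the bootstrap load is done from Spark; source_df is an illustrative name for the bootstrap DataFrame, and the writer options are the standard Hudi datasource configs matching the Flink table definition above:

# sketch: keep only the latest row per new_uid (by update_time) before bulk_insert
from pyspark.sql import functions as F
from pyspark.sql.window import Window

w = Window.partitionBy('new_uid').orderBy(F.col('update_time').desc())
deduped = (source_df
           .withColumn('_rn', F.row_number().over(w))
           .filter(F.col('_rn') == 1)
           .drop('_rn'))

(deduped.write.format('org.apache.hudi')
    .option('hoodie.table.name', 'hudi_sink')
    .option('hoodie.datasource.write.operation', 'bulk_insert')
    .option('hoodie.datasource.write.recordkey.field', 'new_uid')
    .option('hoodie.datasource.write.partitionpath.field', 'part')
    .option('hoodie.datasource.write.precombine.field', 'update_time')
    .mode('append')
    .save(tpath))

Depending on the writer version, the hoodie.combine.before.insert option can perform a similar pre-write dedup on the record key.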

@chenbodeng719
Author


"if that's the case, upsert is going to update both records. " I guess it's my case. First, bulk insert brings some duplicate key into the table. Then when the upsert with duplicate key comes, it updates the duplicate rows with same key. In my case, two rows for one dup key has been changed.
I wonder if there are five rows for one dup key, it updates the five rows?
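
One way to check this empirically is to upsert a new value for one of the duplicated keys and then read back every copy of that key together with its meta columns; if all copies show the new update_time, the upsert rewrote all of them. A minimal PySpark sketch; dup_key is a hypothetical placeholder for one of the duplicated new_uid values:

# sketch: inspect every copy of a known-duplicated key after an upsert
from pyspark.sql import functions as F

dup_key = '...'  # hypothetical placeholder: one of the duplicated new_uid values

(sqlc.read.format('org.apache.hudi').load(tpath)
    .filter(F.col('new_uid') == dup_key)
    .select('new_uid', 'update_time', '_hoodie_commit_time', '_hoodie_file_name')
    .show(truncate=False))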

@ad1happy2go
Collaborator

ad1happy2go commented Feb 29, 2024 via email

@codope added the data-consistency, on-call-triaged, and priority:major labels on Feb 29, 2024