
[SUPPORT] Data Loss Issue with Hudi Table After 3 Days of Continuous Writes #11016

Open
juice411 opened this issue Apr 15, 2024 · 23 comments
Labels
data-loss — loss of data only; use the data-consistency label for an inconsistent view
priority:critical — production down; pipelines stalled; need help ASAP

Comments

@juice411

Description:
I have created a Hudi table named ods_table_v1 using the following SQL command:

```sql
CREATE TABLE IF NOT EXISTS test_simulated_data.ods_table_v1 (
  id int,
  count_field double,
  write_time timestamp(0),
  _part string,
  proc_time timestamp(3),
  WATERMARK FOR write_time AS write_time
)
PARTITIONED BY (_part)
WITH (
  'connector' = 'hudi',
  'path' = 'hdfs://masters/test_simulated_data/ods_table_v1',
  'table.type' = 'MERGE_ON_READ',
  'hoodie.datasource.write.recordkey.field' = 'id',
  'hoodie.datasource.write.precombine.field' = 'write_time',
  'compaction.async.enabled' = 'true',
  'compaction.schedule.enabled' = 'true',
  'compaction.trigger.strategy' = 'time_elapsed',
  'compaction.delta_seconds' = '600',
  'compaction.delta_commits' = '1',
  'read.streaming.enabled' = 'true',
  'read.streaming.skip_compaction' = 'true',
  'read.start-commit' = 'earliest',
  'changelog.enabled' = 'true',
  'hive_sync.enable' = 'true',
  'hive_sync.mode' = 'hms',
  'hive_sync.metastore.uris' = 'thrift://h35:9083',
  'hive_sync.db' = 'test_simulated_data',
  'hive_sync.table' = 'hive_ods_table'
);
```
This table, ods_table_v1, has continuous data writes. However, after three days of continuous writes, I noticed an issue with the data. When querying the table for all data, I found that the earliest batch of written data is missing. No matter what conditions I add, I cannot retrieve the earliest written data.

I am urgently seeking answers to understand the cause of this data loss. Has anyone encountered a similar issue with Hudi tables? Is there a known issue or configuration mistake that could have led to this? Any help or guidance would be greatly appreciated.

Thank you in advance for your time and assistance.

@juice411 (Author)

By the way, I performed the operation using Flink SQL.

@juice411 changed the title from "Data Loss Issue with Hudi Table After 3 Days of Continuous Writes[SUPPORT]" to "[SUPPORT] Data Loss Issue with Hudi Table After 3 Days of Continuous Writes" Apr 15, 2024
@danny0405 (Contributor)

Which Hudi release did you use? We did find a weird data-loss issue related to compaction in release 0.14.0; it is fixed in master and the 1.0.x branch now.

Are you talking about streaming read data loss or batch queries?

@juice411 (Author)

We are using Hudi version 0.14.1, and we have tried both streaming reads and batch queries, but we cannot read the earliest written data. If this is the issue you mentioned, we will try upgrading to the master branch.

@danny0405 (Contributor)

How did you write the earliest data set? Did those records get updated, or did they simply get lost?

@juice411 (Author)

The data is written via flink-mysql-cdc: from January 1, 2024, to March 31, 2024, 10,000 records are written to MySQL every day. After one full round of writing, it starts again from the first day and keeps cycling for several rounds. However, when I query the Hudi table, I can only find the data for February 25, 2024, and cannot find data for any other day. By the way, we have configured metadata synchronization to Hive, and all of the written data can be found on the Hive side.
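For context, a hypothetical sketch of the ingestion path described above (the mysql-cdc connection settings and source table name below are illustrative placeholders, not the actual configuration):

```sql
-- Hypothetical mysql-cdc source feeding the Hudi table defined earlier;
-- hostname, credentials, and source table name are placeholders.
CREATE TABLE IF NOT EXISTS mysql_source_table (
  id INT,
  count_field DOUBLE,
  write_time TIMESTAMP(0),
  _part STRING,
  PRIMARY KEY (id) NOT ENFORCED
) WITH (
  'connector' = 'mysql-cdc',
  'hostname' = 'mysql-host',
  'port' = '3306',
  'username' = 'user',
  'password' = 'secret',
  'database-name' = 'test_simulated_data',
  'table-name' = 'source_table'
);

-- Continuously upsert the CDC stream into the Hudi table;
-- proc_time is filled from the processing time at write.
INSERT INTO test_simulated_data.ods_table_v1
SELECT id, count_field, write_time, _part, CAST(PROCTIME() AS TIMESTAMP(3)) AS proc_time
FROM mysql_source_table;
```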

@danny0405 (Contributor)

> I can only find the data for February 25, 2024, and cannot find data for any other day. By the way, we have configured metadata synchronization to Hive, and all of the written data can be found on the Hive side.

What engine did you use when you found the data loss?

@juice411 (Author)

We discovered the incomplete data retrieval while using Flink SQL, specifically with a simple `select * from hudi_ods_table` query.

@juice411 (Author)

juice411 commented Apr 16, 2024

[screenshots omitted]
After upgrading to the new master version and testing again, we still found missing data. According to our test design, the results for all days should be identical and equal to the first day's data. However, as the attached screenshots show, the data for subsequent days is inconsistent. I have confirmed that the entire data pipeline was stopped for more than half an hour, ruling out any pending or unfinished data processing.

@danny0405 (Contributor)

Did you specify the preCombine field correctly?

@juice411 (Author)

juice411 commented Apr 16, 2024

[screenshots omitted]
I appreciate the clarification. While I'm not entirely sure about the significance of preCombine, I've learned from the developer that hoodie.datasource.write.precombine.field is set to write_time. The write_time column is a timestamp representing when the data was written, formatted like '2024-01-01 18:59:25.000000000'.

Could you elaborate on the impacts or benefits of this setting? For instance, how does it enhance data processing, query efficiency, or data consistency?

@ad1happy2go (Collaborator)

@juice411 The precombine field is used as an ordering field for deduplication. For example, if the source has two records with the same record key, Hudi will keep the record with the higher precombine value and skip the other one. This applies to the upsert operation type; for bulk_insert and insert, both records will be written.
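To make that concrete, here is a minimal sketch against the table from this thread (the values are made up): two upserts share the record key id = 1, and because hoodie.datasource.write.precombine.field is write_time, the row with the larger write_time should win.

```sql
-- First write for record key id = 1.
INSERT INTO test_simulated_data.ods_table_v1 VALUES
  (1, 10.0, TIMESTAMP '2024-01-01 08:00:00', '2024-01-01', TIMESTAMP '2024-01-01 08:00:00.000');

-- Second write for the same key with a higher write_time (the precombine value).
INSERT INTO test_simulated_data.ods_table_v1 VALUES
  (1, 20.0, TIMESTAMP '2024-01-01 09:00:00', '2024-01-01', TIMESTAMP '2024-01-01 09:00:00.000');

-- A snapshot (batch) read should return only the second row (count_field = 20.0),
-- because its write_time is higher.
SELECT * FROM test_simulated_data.ods_table_v1 WHERE id = 1;
```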

@danny0405 (Contributor)

And can you also share the Hudi and Flink releases you are using here?

@juice411 (Author)

@danny0405 The versions we were previously using were Hudi 0.14.1 and Flink 1.17.2. Also, we believe our issue is not related to the precombine field, as each record is identified by a unique id.

@juice411 (Author)

@danny0405 [screenshot omitted]
As shown in the screenshot, Flink has created a job that fetches data from an upstream Hudi table and performs a count calculation. However, from the screenshot, it appears that after several minutes, the job has only fetched 6 records from the upstream table. This raises the question of where the other data might have gone.
How can I obtain the full dataset from the upstream Hudi table?

@danny0405 (Contributor)

> How can I obtain the full dataset from the upstream Hudi table?

Specify read.start-commit as earliest. By default, the streaming source consumes from the latest commit of a Hudi table.
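If changing the table DDL is inconvenient, a minimal sketch using a Flink SQL dynamic table options hint (which may require table.dynamic-table-options.enabled depending on the Flink version) would be:

```sql
-- Stream-read the Hudi table from the earliest commit instead of the latest one.
SELECT * FROM test_simulated_data.ods_table_v1
  /*+ OPTIONS('read.streaming.enabled' = 'true', 'read.start-commit' = 'earliest') */;
```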

@juice411 (Author)

@danny0405 We already set read.start-commit to earliest, but it did not work as expected, and we have been very anxious about it. We have tried various approaches to obtain the full data set, but none of them worked. This issue has been plaguing us for more than three days. Is there any other approach we can take?

@danny0405 (Contributor)

Can you share your source table definition?

@juice411 (Author)

```sql
CREATE TABLE IF NOT EXISTS test_simulated_data.ods_table_v1 (
  id int,
  count_field double,
  write_time timestamp(0),
  _part string,
  proc_time timestamp(3),
  WATERMARK FOR write_time AS write_time
)
PARTITIONED BY (_part)
WITH (
  'connector' = 'hudi',
  'path' = 'hdfs://masters/test_simulated_data/ods_table_v1',
  'table.type' = 'MERGE_ON_READ',
  'hoodie.datasource.write.recordkey.field' = 'id',
  'hoodie.datasource.write.precombine.field' = 'write_time',
  'compaction.async.enabled' = 'true',
  'compaction.schedule.enabled' = 'true',
  'compaction.trigger.strategy' = 'time_elapsed',
  'compaction.delta_seconds' = '600',
  'compaction.delta_commits' = '1',
  'read.streaming.enabled' = 'true',
  'read.streaming.skip_compaction' = 'true',
  'read.start-commit' = 'earliest',
  'changelog.enabled' = 'true',
  'hive_sync.enable' = 'true',
  'hive_sync.mode' = 'hms',
  'hive_sync.metastore.uris' = 'thrift://h35:9083',
  'hive_sync.db' = 'test_simulated_data',
  'hive_sync.table' = 'hive_ods_table'
);
```

1 similar comment

@danny0405 (Contributor)

It should work with the option 'read.start-commit'='earliest'. What is the current behavior now: consuming from the latest commit, or from a very specific one?

@juice411 (Author)

We want to start re-acquiring data from the first record of the upstream Hudi table and rebuild the downstream table, but the issue is that we can't access older data.

@danny0405 (Contributor)

> but the issue is that we can't access older data.

If your table is ingested via streaming upsert, then just specify read.start-commit as the first commit instant time on the timeline and skip compaction. Only instants that have not been cleaned can be consumed.

It actually depends on how you wrote the historical dataset: bulk_insert does not guarantee the payload sequence for a given key, so if the table was bootstrapped with bulk_insert, the only way is to consume from earliest.
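As a rough illustration of consuming from a specific instant (the instant time below is a made-up example; the real value is the earliest instant on the timeline under hdfs://masters/test_simulated_data/ods_table_v1/.hoodie):

```sql
-- Start the streaming read from a specific (hypothetical) commit instant,
-- skipping compaction instants so the per-commit records are consumed.
SELECT * FROM test_simulated_data.ods_table_v1
  /*+ OPTIONS(
        'read.streaming.enabled' = 'true',
        'read.start-commit' = '20240101080000000',
        'read.streaming.skip_compaction' = 'true') */;
```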

@ad1happy2go (Collaborator)

@juice411 Do you need any other help on this? Please let us know if you are all good. Thanks.

@codope added the priority:critical and data-loss labels May 2, 2024