[SUPPORT] Data Loss Issue with Hudi Table After 3 Days of Continuous Writes #11016
Comments
By the way, I performed the operation using Flink SQL.
What Hudi release did you use then? We did find a weird data loss issue around compaction in release 0.14.0; it is fixed in master and the 1.0.x branch now. Are you talking about streaming read data loss or batch queries?
We are using Hudi version 0.14.1, and we have tried both streaming reads and batch queries, but we cannot read the earliest written data. If this is the issue you mentioned, we will try upgrading to the master branch.
How did you write the earliest data set? Did those records get updated, or did they just get lost?
The data is written through flink-mysql-cdc, covering January 1, 2024 to March 31, 2024, with 10,000 records written to MySQL every day. After completing one round of writes, it starts again from the first day and keeps cycling for several rounds. However, when I query the Hudi table, I can find all of the data for February 25, 2024, but cannot find any data for the other days. By the way, we have configured metadata synchronization to Hive, and all of the written data can be found from the Hive side.
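For reference, the pipeline is roughly of the following shape (a minimal sketch; the connection details and source table name here are placeholders, not our exact job):

```sql
-- Sketch of the flink-mysql-cdc source feeding the Hudi table
-- (hostname/credentials/table name are hypothetical placeholders).
CREATE TABLE mysql_source (
  id INT,
  count_field DOUBLE,
  write_time TIMESTAMP(0),
  _part STRING,
  PRIMARY KEY (id) NOT ENFORCED
) WITH (
  'connector' = 'mysql-cdc',
  'hostname' = 'mysql-host',
  'port' = '3306',
  'username' = 'user',
  'password' = 'pass',
  'database-name' = 'test_simulated_data',
  'table-name' = 'source_table'
);

-- Continuously write the CDC stream into the Hudi table defined further below.
INSERT INTO test_simulated_data.ods_table_v1
SELECT id, count_field, write_time, _part,
       CAST(PROCTIME() AS TIMESTAMP(3)) AS proc_time
FROM mysql_source;
```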
What engine did you use when you found the data loss?
We discovered the incomplete results when reading with Flink SQL, specifically a simple "select * from hudi_ods_table" query.
Did you specify the …
@juice411 The precombine field is used as an ordering field to deduplicate. For example, if we have two records in the source with the same record key, Hudi will pick the record with the higher precombine value and skip the other one. This happens when we use the upsert operation type. For bulk_insert and insert it will insert both of them.
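For illustration, here is a minimal sketch (with a hypothetical table `t` and columns `id`, `val`, `ts`) of how the precombine field resolves duplicates under upsert:

```sql
-- Hypothetical table: id is the record key, ts is the precombine field.
CREATE TABLE t (
  id INT,
  val STRING,
  ts TIMESTAMP(3),
  PRIMARY KEY (id) NOT ENFORCED
) WITH (
  'connector' = 'hudi',
  'path' = 'file:///tmp/hudi_t',
  'table.type' = 'MERGE_ON_READ',
  'write.operation' = 'upsert',   -- with 'insert'/'bulk_insert' both rows below would be kept
  'precombine.field' = 'ts'
);

-- Two records with the same record key but different precombine values.
INSERT INTO t VALUES
  (1, 'old', TIMESTAMP '2024-01-01 00:00:00'),
  (1, 'new', TIMESTAMP '2024-01-02 00:00:00');

-- Under upsert, a subsequent read should return only the row with the higher ts: (1, 'new', ...).
SELECT * FROM t;
```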
And could you also mention which Hudi and Flink releases you are using?
@danny0405 We were previously using Hudi 0.14.1 and Flink 1.17.2. Also, we believe our issue is not related to the precombine field, as we have a unique ID identifying each data entry.
@danny0405 |
Specifies the …
@danny0405 We set read.start-commit to earliest, but it did not work as expected, and we have been very anxious about it. We have tried various approaches to obtain the full data set, but none of them worked. This issue has been plaguing us for more than three days. Is there any other approach we can take?
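For reference, the reads attempted were roughly of this shape (a sketch; the OPTIONS hint is just one way to pass the options, and table.dynamic-table-options.enabled may need to be enabled on some Flink versions):

```sql
-- Streaming read starting from the earliest commit (sketch of the attempted query).
SELECT * FROM test_simulated_data.ods_table_v1
/*+ OPTIONS('read.streaming.enabled' = 'true', 'read.start-commit' = 'earliest') */;

-- Batch snapshot read of the whole table, for comparison.
SELECT * FROM test_simulated_data.ods_table_v1
/*+ OPTIONS('read.streaming.enabled' = 'false') */;
```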
Can you share your source table definitions?
CREATE TABLE if not exists test_simulated_data.ods_table_v1( |
It should work for option …
We want to start re-acquiring data from the first record of the upstream Hudi table and rebuild the downstream table, but the issue is that we can't access the older data.
If your table is ingested in streaming mode, it actually depends on how you write the history dataset, because …
@juice411 Do you need any other help on this? Please let us know if you are all set. Thanks.
Description:
I have created a Hudi table named ods_table_v1 using the following SQL command:
```sql
CREATE TABLE if not exists test_simulated_data.ods_table_v1(
  id int,
  count_field double,
  write_time timestamp(0),
  _part string,
  proc_time timestamp(3),
  WATERMARK FOR write_time AS write_time
)
PARTITIONED BY (_part)
WITH(
  'connector'='hudi',
  'path'='hdfs://masters/test_simulated_data/ods_table_v1',
  'table.type'='MERGE_ON_READ',
  'hoodie.datasource.write.recordkey.field'='id',
  'hoodie.datasource.write.precombine.field'='write_time',
  'compaction.async.enabled'='true',
  'compaction.schedule.enabled'='true',
  'compaction.trigger.strategy'='time_elapsed',
  'compaction.delta_seconds'='600',
  'compaction.delta_commits'='1',
  'read.streaming.enabled'='true',
  'read.streaming.skip_compaction'='true',
  'read.start-commit'='earliest',
  'changelog.enabled'='true',
  'hive_sync.enable'='true',
  'hive_sync.mode'='hms',
  'hive_sync.metastore.uris'='thrift://h35:9083',
  'hive_sync.db'='test_simulated_data',
  'hive_sync.table'='hive_ods_table'
);
```
This table, ods_table_v1, receives continuous data writes. However, after three days of continuous writing, I noticed an issue with the data: when querying the table for all of its data, I found that the earliest batch of written data is missing. No matter what conditions I add, I cannot retrieve the earliest written data.
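For example, a per-partition count of roughly this shape (a sketch, run as a batch read) is how the missing days can be checked; in our case only the February 25, 2024 partition comes back:

```sql
-- Count rows per daily partition to see which partitions are still readable.
SELECT _part, COUNT(*) AS row_cnt
FROM test_simulated_data.ods_table_v1
/*+ OPTIONS('read.streaming.enabled' = 'false') */
GROUP BY _part;
```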
I am urgently seeking answers to understand the cause of this data loss. Has anyone encountered a similar issue with Hudi tables? Is there a known issue or configuration mistake that could have led to this? Any help or guidance would be greatly appreciated.
Thank you in advance for your time and assistance.