
Data loss due to the inability to obtain the completion time #14365

@TheR1sing3un

Description


Bug Description

What happened:
Recently, we encountered a case where, after compaction, some of the data written to an involved log file did not appear in the new base file.
After investigation, we found that this log file was not included in the compaction plan as scheduled, even though its completion time was significantly earlier than the compaction instant. In our scenario, there was an interval of nearly three days between writing and compaction.

How we investigated

  1. First, we found that the log file was filtered out by this logic:
(screenshot of the filtering logic)
  2. The reason it was filtered is that we needed to read an instant from three days ago to find its corresponding completion time. This instant had already been archived, so we also queried its completion time in the archived timeline; it was indeed from three days ago, which was correct.
  3. However, this completion time was not retrieved correctly here. Instead, null was returned, so the log was determined to have been completed after the compaction instant and was therefore filtered out of the plan.
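The consequence of that null lookup can be sketched as follows. This is an illustrative simplification with hypothetical names (`LogFile`, `eligibleLogs`), not Hudi's actual code: when the completion-time lookup returns null, the log file is treated as if it completed after the compaction instant and is silently dropped from the plan.

```java
import java.util.List;
import java.util.stream.Collectors;

public class FilterSketch {
    // A log file identified by the instant that wrote it, plus the
    // completion time resolved for that instant (null when the lookup
    // against the archived timeline fails, as described above).
    record LogFile(String instantTime, String completionTime) {}

    // Keep only logs that completed strictly before the compaction
    // instant. A null completion time is (wrongly) treated the same as
    // "completed after the compaction instant", so the log is dropped.
    static List<LogFile> eligibleLogs(List<LogFile> logs, String compactionInstant) {
        return logs.stream()
                .filter(l -> l.completionTime() != null
                        && l.completionTime().compareTo(compactionInstant) < 0)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<LogFile> logs = List.of(
                new LogFile("001", "002"), // resolved from the active timeline
                new LogFile("003", null)); // archived; lookup returned null
        // Only one of the two logs makes it into the plan.
        System.out.println(eligibleLogs(logs, "100").size()); // prints 1
    }
}
```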

Then why didn't the compaction task retrieve the completion time from the archived timeline?

Let's construct a case:

  • There are 10 instants in total, numbered 1 to 10, and instants 1 to 6 have all been archived.
  • Instants 1 to 2 are archived into the 1_2 archive parquet file.
  • Instants 3 to 6 are archived into the 3_6 archive parquet file.

archived: [1_2.parquet, 3_6.parquet] ; active: [7-10]

  1. Initialize the CompletionTimeQueryViewV2; the cursor is positioned at the first active instant:
(screenshot)
  2. Now instants 7 to 10 are stored in memory.
  3. We try to get the completion time for instant 5, which lazily loads instants starting from 5:
(screenshot)
  4. In the subsequent scanning and loading logic, we scan to file 3_6.parquet, read instants 5 and 6 from it, and store the instants [5, 6] in memory:
(screenshot)
  5. Now we try to get the completion time for instant 4. This triggers lazy loading again, this time with the filter [4, 5).
  6. But this time we cannot obtain the correct completion time, because we skip reading 3_6.parquet, and instant 4 is exactly in that file:
(screenshot)
  7. As for why the file is filtered out: the case where the boundaries of the filter are entirely contained within the min/max range of the file was not taken into consideration:
(screenshot)
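The walkthrough above can be reproduced with a self-contained simulation. All names here (`ArchiveFile`, `buggyOverlaps`, `load`) are hypothetical simplifications of the scanning logic, not Hudi's real classes. The buggy predicate only keeps a file when one of the filter's boundaries falls inside the file's [min, max] range, so a filter like [4, 5) that lies strictly inside 3_6.parquet's range [3, 6] skips the file:

```java
import java.util.*;

public class LazyLoadSketch {
    // An archive parquet file covering instants [min, max], e.g. 3_6.parquet.
    record ArchiveFile(int min, int max) {}

    // Buggy: keeps the file only when one of its boundaries falls inside
    // the query range [start, end); misses the case where the query range
    // is strictly contained within the file's range.
    static boolean buggyOverlaps(ArchiveFile f, int start, int end) {
        return (f.min() >= start && f.min() < end)
                || (f.max() >= start && f.max() < end);
    }

    // Correct interval-overlap check: [f.min, f.max] intersects [start, end).
    static boolean overlaps(ArchiveFile f, int start, int end) {
        return f.min() < end && f.max() >= start;
    }

    // Simulates the lazy load: scan the archive files, keep those passing
    // the filter, and collect the instants they contain within the range.
    static Set<Integer> load(List<ArchiveFile> files, int start, int end, boolean buggy) {
        Set<Integer> loaded = new TreeSet<>();
        for (ArchiveFile f : files) {
            boolean keep = buggy ? buggyOverlaps(f, start, end) : overlaps(f, start, end);
            if (keep) {
                for (int i = Math.max(f.min(), start); i <= f.max() && i < end; i++) {
                    loaded.add(i);
                }
            }
        }
        return loaded;
    }

    public static void main(String[] args) {
        List<ArchiveFile> archived = List.of(new ArchiveFile(1, 2), new ArchiveFile(3, 6));
        // Lazy load with filter [4, 5): the buggy predicate skips 3_6.parquet...
        System.out.println(load(archived, 4, 5, true));  // prints []
        // ...while the correct overlap check loads instant 4.
        System.out.println(load(archived, 4, 5, false)); // prints [4]
    }
}
```

With the corrected predicate, the lazy load with filter [4, 5) scans 3_6.parquet again and instant 4's completion time can be resolved.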

Steps to reproduce:
You can reproduce this with a simple unit test:

(screenshots of the reproducing unit test)

Environment

Hudi version:
1.x
Query engine: (Spark/Flink/Trino etc)

Relevant configs:

Logs and Stack Trace

No response

Labels: type:bug (Bug reports and fixes)