Compactor: Race condition between cleaner and compactor of partitioning compaction could cause incorrect compaction results #7075

@alexqyle

Describe the bug
A race condition can occur when the cleaner deletes the previous version of a partitioned group info file and its visit markers while another compactor has already picked up a compaction job from that previous version. The race produces an extra result block that was compacted from the previous version of the partitioned group info file. If a new version of the partitioned group info file is generated later, that extra result block is treated as a result block of the new version. Because a result block from the previous version always contains fewer samples than the corresponding block from the new version, the compaction results end up incorrect.

To Reproduce
It is hard to set up controllable reproduction steps. The timeline below shows how the race condition happens:

  1. compactor x: Generated partitioned group version 1 with 8 partitions in total.
  2. compactor n..m: Finished compacting all 8 partitions of partitioned group version 1.
  3. compactor 1: Loaded partitioned group version 1 from storage along with all existing and new partitioned groups, and started iterating through them to pick a compaction job.
  4. cleaner: Found that all partitions of partitioned group version 1 were completed. It deleted partitioned group version 1 from the bucket store along with all visit markers associated with it.
  5. compactor 1: Picked partition 0 from partitioned group version 1 for compaction and checked whether a visit marker existed for this partition. It started compaction because the visit marker had been deleted by the cleaner.
  6. compactor 2: Found that a new lower level block from the same time range had been uploaded and generated partitioned group version 2 with 4 partitions in total. This partitioned group contained the newly uploaded block and all 8 result blocks from partitioned group version 1.
  7. compactor n..m: Started compacting partitions 1, 2, and 3 from partitioned group version 2, because partition 0 had a visit marker held by compactor 1.
  8. compactor 1: Finished compacting partition 0 from partitioned group version 1. The visit marker for partition 0 was updated to the complete state.
  9. compactor n..m: Finished compacting partitions 1, 2, and 3 from partitioned group version 2. The visit markers for partitions 1, 2, and 3 were updated to the complete state.
  10. cleaner: Found that all 4 partitions were completed based on the visit markers and considered partitioned group version 2 complete. It marked all parent blocks in partitioned group version 2 for deletion. Those blocks included the newly uploaded block and the 8 result blocks from partitioned group version 1.

At the end, the following blocks were deleted:

  • block1: partition 0 from partitioned group version 1
  • block2: partition 1 from partitioned group version 1
  • block3: partition 2 from partitioned group version 1
  • block4: partition 3 from partitioned group version 1
  • block5: partition 4 from partitioned group version 1
  • block6: partition 5 from partitioned group version 1
  • block7: partition 6 from partitioned group version 1
  • block8: partition 7 from partitioned group version 1

The following blocks remained active:

  • block9: partition 0 from partitioned group version 1 (the extra result block produced by compactor 1 in step 8)
  • block10: partition 1 from partitioned group version 2
  • block11: partition 2 from partitioned group version 2
  • block12: partition 3 from partitioned group version 2

block10, block11, and block12 were compacted correctly.

block9 was compacted incorrectly because:

  • Partition 0 from partitioned group version 2 should contain:
    • 1/4 of the data from the newly uploaded block
    • all data from block1 and block5
  • block9 contained only the data from block1
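
Steps 3 to 5 of the timeline can be replayed sequentially to show why the visit marker check alone lets the stale job through. The following is a minimal sketch, not Cortex code: `bucket`, `job`, `canStart`, `cleanCompleted`, and `replay` are hypothetical names, and the bucket store is modeled as in-memory maps.

```go
package main

import "fmt"

// bucket is a hypothetical in-memory stand-in for the bucket store.
type bucket struct {
	groups  map[string]int          // group ID -> version of the stored partitioned group info
	markers map[string]map[int]bool // group ID -> partition ID -> visit marker present
}

// job is a compaction job planned from a snapshot of a partitioned group.
type job struct {
	groupID   string
	version   int
	partition int
}

// canStart mirrors the check in step 5: only the visit marker is consulted.
func canStart(b *bucket, j job) bool {
	return !b.markers[j.groupID][j.partition]
}

// cleanCompleted mirrors step 4: drop the group info and all of its visit markers.
func cleanCompleted(b *bucket, groupID string) {
	delete(b.groups, groupID)
	delete(b.markers, groupID)
}

// replay runs steps 2 to 5 in order and reports whether the stale job starts.
func replay() bool {
	b := &bucket{
		groups:  map[string]int{"g": 1},
		markers: map[string]map[int]bool{"g": {}},
	}
	for p := 0; p < 8; p++ {
		b.markers["g"][p] = true // step 2: all 8 partitions are completed
	}
	// Step 3: compactor 1 plans a job from its snapshot of version 1.
	stale := job{groupID: "g", version: b.groups["g"], partition: 0}
	// Step 4: the cleaner removes version 1 and its visit markers.
	cleanCompleted(b, "g")
	// Step 5: the marker is gone, so the stale job looks unclaimed.
	return canStart(b, stale)
}

func main() {
	fmt.Println("stale job starts:", replay())
}
```

In this model the job still carries `version: 1`, but nothing compares it against the bucket store, so the deleted marker is indistinguishable from a partition that was never visited.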

Expected behavior
No incorrect compaction results

Environment:

  • Infrastructure: Any distributed environment that has multiple compactors running at the same time

Possible Fix

  • Short term:
    • Use CreationTime from PartitionedGroupInfo as a version tracker to make sure the picked compaction job matches the latest version of the partitioned group in the bucket store.
    • The planner could perform this check. After the visit marker check passes, the planner could also verify that the partitioned group info still exists in the bucket store and, if it does, that its CreationTime matches.
  • Long term:
    • Use a leader/worker architecture for the compactor.
    • The leader should handle all coordination and cleanup to avoid race conditions.
    • Workers should only perform the actual compaction.
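
The short-term check could look roughly like the sketch below. It is an illustration under assumptions, not the actual Cortex planner: `partitionedGroupInfo`, `infoStore`, `memStore`, and `verifyJobVersion` are hypothetical names, and CreationTime is assumed to be an integer timestamp that differs between versions.

```go
package main

import (
	"errors"
	"fmt"
)

// partitionedGroupInfo is a hypothetical, trimmed-down stand-in for the
// PartitionedGroupInfo file; only the fields needed for the check are kept.
type partitionedGroupInfo struct {
	GroupID      uint32
	CreationTime int64 // assumed to be a timestamp that acts as the version
}

// infoStore is a hypothetical read-only view of the bucket store.
type infoStore interface {
	// Read returns the stored info for groupID, or ok=false if it was deleted.
	Read(groupID uint32) (info partitionedGroupInfo, ok bool)
}

var (
	errGroupDeleted = errors.New("partitioned group info no longer exists")
	errStaleVersion = errors.New("partitioned group info was regenerated")
)

// verifyJobVersion re-reads the group info after the visit marker check and
// rejects the job unless the stored CreationTime matches the planned one.
func verifyJobVersion(store infoStore, planned partitionedGroupInfo) error {
	current, ok := store.Read(planned.GroupID)
	if !ok {
		return errGroupDeleted
	}
	if current.CreationTime != planned.CreationTime {
		return fmt.Errorf("%w: planned %d, stored %d",
			errStaleVersion, planned.CreationTime, current.CreationTime)
	}
	return nil
}

// memStore is an in-memory infoStore used only for the demonstration.
type memStore map[uint32]partitionedGroupInfo

func (m memStore) Read(id uint32) (partitionedGroupInfo, bool) {
	info, ok := m[id]
	return info, ok
}

func main() {
	v1 := partitionedGroupInfo{GroupID: 7, CreationTime: 100}
	store := memStore{7: v1}
	fmt.Println(verifyJobVersion(store, v1)) // info unchanged: job may proceed

	delete(store, 7) // the cleaner removed version 1
	fmt.Println(verifyJobVersion(store, v1))

	store[7] = partitionedGroupInfo{GroupID: 7, CreationTime: 200} // version 2
	fmt.Println(verifyJobVersion(store, v1))
}
```

A check like this narrows the window rather than closing it, since the info file can still change between the check and the start of compaction; that remaining gap is what the long-term leader/worker proposal would address.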
