Race Condition Between GC Chunk Reset and Background Expired ReplReq Cleanup #401

@Besroy

Description

During GC task processing, a move_to_chunk is selected from m_reserved_chunk_queue and reset via purge_reserved_chunk() to ensure the chunk is completely clean. This operation resets the old append blk allocator and creates a new one to replace it.

At the time of chunk reset, there may exist stale rreqs associated with the move_to_chunk. When these rreqs are cleaned up by the background gc_repl_reqs(), they will free blocks on the chunk, creating two potential race conditions:

Risk 1: Free on New Allocator

  • If the expired rreq is cleaned up AFTER the new allocator is created, the free operation targets the new allocator instead of the old one.
  • There is currently no immediate impact, because the append allocator's free() only increments the m_freeable_nblks counter.
  • Future risk: if real free operations are implemented, this will free blocks on the wrong allocator.
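The ordering behind Risk 1 can be replayed with a small single-threaded model. This is only a sketch under stated assumptions: AppendAllocatorModel, ChunkModel, FreeCounts, and replay_free_after_reset are hypothetical names invented for illustration, not the project's actual classes, and the only behavior borrowed from the description is that free() increments m_freeable_nblks.

```cpp
#include <cstdint>
#include <memory>

// Hypothetical stand-in for the append blk allocator: as described above,
// free() does no real work and only bumps a counter.
struct AppendAllocatorModel {
    std::uint64_t m_freeable_nblks{0};
    void free(std::uint64_t nblks) { m_freeable_nblks += nblks; }
};

// Hypothetical stand-in for a chunk that owns its current allocator.
struct ChunkModel {
    std::shared_ptr<AppendAllocatorModel> allocator{std::make_shared<AppendAllocatorModel>()};
    // Models the reset: the old allocator is replaced by a fresh one.
    void purge() { allocator = std::make_shared<AppendAllocatorModel>(); }
};

// Counters observed on each allocator after the replay.
struct FreeCounts {
    std::uint64_t old_allocator;
    std::uint64_t new_allocator;
};

// Replays the Risk 1 ordering: reset first, stale free second.
inline FreeCounts replay_free_after_reset(std::uint64_t stale_nblks) {
    ChunkModel chunk;
    auto old_alloc = chunk.allocator;    // allocator the stale rreq's blocks came from
    chunk.purge();                       // GC resets the chunk
    chunk.allocator->free(stale_nblks);  // late cleanup resolves the allocator via the chunk
    return {old_alloc->m_freeable_nblks, chunk.allocator->m_freeable_nblks};
}
```

In this model the stale blocks are credited to the fresh allocator while the old one sees nothing, which is harmless only as long as free() remains a counter update.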

Risk 2: Free on Destroyed Allocator

  • If the expired rreq is cleaned up AFTER the old allocator is reset but BEFORE the new allocator is created, the free operation accesses a destroyed allocator.
  • This causes crashes due to accessing freed memory (e.g., a destroyed superblock). Timeline of an observed crash:
    T1: cp_flush obtains a pointer to the old allocator
    T2: GC resets the allocator (destroying the old superblock) and sets m_is_dirty=false on the old allocator
    T3: gc_repl_reqs frees the expired rreq → sets m_is_dirty=true through the old allocator pointer
    T4: cp_flush executes on the old allocator → enters the m_sb write path because m_is_dirty is true → accesses the destroyed superblock → crash
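The T1 to T4 interleaving can be replayed deterministically in a small model. This is a sketch under assumptions, not the real implementation: SuperblockModel, AllocatorModel, and replay_crash_timeline are hypothetical names, and the destroyed superblock is represented by a flag rather than by actually freeing memory, so the replay reports the bad access instead of crashing.

```cpp
// Hypothetical stand-in for the allocator's superblock. A flag marks it as
// destroyed so the model can detect the bad access without undefined behavior.
struct SuperblockModel {
    bool destroyed{false};
};

// Hypothetical stand-in for the old allocator in the timeline.
struct AllocatorModel {
    SuperblockModel m_sb;
    bool m_is_dirty{false};

    // T2: the GC reset destroys the superblock and clears the dirty flag.
    void reset() {
        m_sb.destroyed = true;
        m_is_dirty = false;
    }

    // T3: freeing a block marks the allocator dirty again.
    void free_blk() { m_is_dirty = true; }

    // T4: flush writes m_sb only when dirty. Returns true when that write
    // would touch a destroyed superblock (the crash in the real system).
    bool cp_flush() {
        if (!m_is_dirty) { return false; }
        m_is_dirty = false;
        return m_sb.destroyed;
    }
};

// Replays T1..T4 in order. The flag controls whether the stale rreq cleanup
// (T3) runs between the reset and the flush.
inline bool replay_crash_timeline(bool stale_free_after_reset) {
    AllocatorModel old_alloc;
    AllocatorModel* flush_target = &old_alloc;            // T1: cp_flush takes the pointer
    old_alloc.reset();                                    // T2: GC resets the allocator
    if (stale_free_after_reset) { old_alloc.free_blk(); } // T3: gc_repl_reqs frees stale rreq
    return flush_target->cp_flush();                      // T4: flush on the old pointer
}
```

Without the stale free, the flush sees m_is_dirty=false and skips the superblock write. With it, the flush enters the write path on a superblock that the reset already destroyed.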
