FlushTracker stall: consumer killed without terminate callback (OOM, :kill signal) #3981

@alco

Parent: #3980

Scenario

A consumer process is killed in a way that bypasses the terminate/2 callback:

  • :kill signal: Even with Process.flag(:trap_exit, true) (consumer.ex:108), a :kill signal terminates the process immediately without calling terminate. From Erlang docs: "If the process receives a kill signal, it terminates, regardless of the trap_exit flag."
  • OOM killer: The OS kills the BEAM process or the process is killed by the VM's memory limits.
  • :brutal_kill supervisor shutdown: If a supervisor is configured with shutdown: :brutal_kill, child processes receive :kill.
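
The difference between a trappable exit and :kill can be reproduced in isolation. The following is a minimal sketch, not code from consumer.ex: KillDemo and run/1 are hypothetical names, and the only assumption is a plain GenServer that traps exits the way the consumer does.

```elixir
defmodule KillDemo do
  use GenServer

  @impl true
  def init(test_pid) do
    # Same flag the consumer sets; it turns ordinary exit signals into messages.
    Process.flag(:trap_exit, true)
    {:ok, test_pid}
  end

  @impl true
  def terminate(reason, test_pid) do
    # Only reached when the process gets a chance to shut down in an orderly way.
    send(test_pid, {:terminate_called, reason})
  end

  # Starts a linked server, sends it `exit_reason`, and reports
  # {reason the server died with, whether terminate/2 ran}.
  def run(exit_reason) do
    # Trap exits in the caller so the linked server's death does not kill it too.
    old_flag = Process.flag(:trap_exit, true)
    {:ok, pid} = GenServer.start_link(__MODULE__, self())
    Process.exit(pid, exit_reason)

    died_with =
      receive do
        {:EXIT, ^pid, reason} -> reason
      end

    terminate_called? =
      receive do
        {:terminate_called, _reason} -> true
      after
        100 -> false
      end

    Process.flag(:trap_exit, old_flag)
    {died_with, terminate_called?}
  end
end
```

Calling KillDemo.run(:shutdown) should report that terminate/2 ran, while KillDemo.run(:kill) should report that the process died with reason :killed and terminate/2 never ran.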

What happens

  1. A transaction arrives. ShapeLogCollector.publish → ConsumerRegistry.publish → broadcast delivers the event to the consumer.
  2. The consumer processes the event, replies :ok. FlushTracker.handle_txn_fragment records the shape in last_flushed and min_incomplete_flush_tree.
  3. The consumer is killed (:kill signal, OOM, etc.) before the storage flush callback fires and notify_flushed is called.
  4. terminate/2 does NOT run. No cleanup happens. No handle_writer_termination, no remove_shape_async, no FlushTracker.handle_shape_removed.
  5. The shape's entry in FlushTracker becomes the permanent minimum, blocking last_global_flushed_offset from advancing.
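
The stall in step 5 can be illustrated with a toy model. This is a sketch under loud assumptions: ToyFlushTracker is hypothetical, uses plain integers as offsets, and keeps a simple map instead of the real last_flushed and min_incomplete_flush_tree structures. It only shows why one entry that is never cleared pins the global minimum.

```elixir
defmodule ToyFlushTracker do
  # Hypothetical model: shape handle => offset of its oldest unflushed write.
  def new, do: %{}

  # A consumer accepted a write at `offset` but has not flushed it yet.
  def record_pending(pending, shape, offset), do: Map.put_new(pending, shape, offset)

  # The storage flush callback fired (or the shape was removed).
  def notify_flushed(pending, shape), do: Map.delete(pending, shape)

  # The global flushed offset cannot pass the oldest pending write.
  def global_flushed(pending, last_seen) when map_size(pending) == 0, do: last_seen
  def global_flushed(pending, _last_seen), do: (pending |> Map.values() |> Enum.min()) - 1
end

pending =
  ToyFlushTracker.new()
  |> ToyFlushTracker.record_pending("shape_a", 10)
  |> ToyFlushTracker.record_pending("shape_b", 12)

# shape_b flushes normally; shape_a's consumer was killed and never reports.
pending = ToyFlushTracker.notify_flushed(pending, "shape_b")

# The log has since advanced to offset 30, but the result stays pinned at 9.
ToyFlushTracker.global_flushed(pending, 30)
```

In this model the value only advances once something removes the "shape_a" entry, which is the cleanup that terminate/2 would normally trigger.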

Why existing fixes don't help

Fix

This scenario can only be addressed by an active detection mechanism in ShapeLogCollector (see parent issue #3980). The terminate callback path is insufficient by definition: no improvement to terminate/2 or handle_writer_termination can help when the callback never executes.
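
One possible shape for active detection is a process monitor, since a :DOWN message is delivered to the monitoring process even when the monitored process dies from :kill. The sketch below is an assumption about how this could look, not the design from #3980: ConsumerWatcher, watch/3, and the on_down callback are hypothetical names, and the real cleanup would presumably end in something like FlushTracker.handle_shape_removed. Monitors only cover consumers dying inside a running VM; if the OS kills the whole BEAM, nothing in the VM is left to react.

```elixir
defmodule ConsumerWatcher do
  use GenServer

  # `on_down` is a hypothetical cleanup callback: fn shape_handle, reason -> any end
  def start_link(on_down), do: GenServer.start_link(__MODULE__, on_down)

  # Register a consumer pid so its death is noticed without relying on terminate/2.
  def watch(watcher, shape_handle, consumer_pid) do
    GenServer.call(watcher, {:watch, shape_handle, consumer_pid})
  end

  @impl true
  def init(on_down), do: {:ok, %{on_down: on_down, refs: %{}}}

  @impl true
  def handle_call({:watch, shape_handle, pid}, _from, state) do
    ref = Process.monitor(pid)
    {:reply, :ok, %{state | refs: Map.put(state.refs, ref, shape_handle)}}
  end

  @impl true
  def handle_info({:DOWN, ref, :process, _pid, reason}, state) do
    # Fires for any death of the monitored process, including reason :killed.
    {shape_handle, refs} = Map.pop(state.refs, ref)
    if shape_handle, do: state.on_down.(shape_handle, reason)
    {:noreply, %{state | refs: refs}}
  end
end
```

A watcher started with a callback that removes the shape from the tracker would then be given each consumer pid via watch/3 when the consumer starts. Keeping the monitor in a separate process from the consumer is the point of the design: the cleanup runs in a process that is still alive.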
