ShapeLogCollector memory growth to 3.5GB causing OOM crash on maxwell #3787

@robacourt

Summary

On 27 Jan 2026 at ~18:09-18:11 UTC, the ShapeLogCollector on the maxwell production instance grew to 3.5GB of memory, causing the server to crash. Telemetry confirms the ShapeLogCollector was the primary memory consumer.

Timeline

  • 18:10:09.998 - Two new shapes with subquery dependencies created:
    • 97489818-1769537409997837 (offers shape with projects dependency)
    • 30042127-1769537409998609 (offer_items shape depending on offers)
  • 18:10:10.437 - Materializer for shape 97489818 crashes with "Key already exists" error
  • 18:10:10.449 - ShapeLogCollector crashes with FunctionClauseError in DependencyLayers.add_after_dependencies/3
  • 18:10:10.45x - Cascade of 80+ Materializer/Consumer shutdowns begins
  • 18:10:29.702 - Memory alarm triggered: {:process_memory_high_watermark, #PID<0.179627.0>}
  • 18:12 - Container restarts

Root Cause Analysis

Bug 1: Materializer "Key already exists" race condition

The Materializer received a NewRecord for a key that already existed in its index. This happened because:

  1. Shape 97489818 was created for public.offers with a subquery dependency on public.projects
  2. The Materializer started, subscribed to the Consumer, then began reading from storage
  3. A transaction arrived with a move-in event for offer d3c8d8a5-5060-4a36-a67d-240de0c95a88
  4. The record was already in the snapshot (matched via is_template = true OR the subquery), and was then delivered a second time via replication with move_tags
  5. The Materializer's apply_changes raised at line 317:
if is_map_key(index, key), do: raise("Key #{key} already exists")

The race window exists because the Materializer subscribes to the Consumer BEFORE reading from storage:

# In handle_continue(:start_materializer, ...)
Consumer.subscribe_materializer(stack_id, shape_handle, self())  # <- Subscribes first
{:noreply, state, {:continue, {:read_stream, shape_storage}}}    # <- Then reads storage
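The effect of this ordering can be reproduced in a small model. The sketch below is not the Electric code: `RaceModel`, `strict_insert/3` and `replay/2` are hypothetical names, and the index is a plain map. It only shows that a strict insert raises when the same key is present in the snapshot and also arrives as an insert buffered from the subscription.

```elixir
defmodule RaceModel do
  # Hypothetical stand-in for the Materializer's strict insert:
  # raises if the key is already present in the index.
  def strict_insert(index, key, value) do
    if is_map_key(index, key), do: raise("Key #{key} already exists")
    Map.put(index, key, value)
  end

  # Apply the snapshot rows first, then the inserts that were buffered
  # from the subscription while the snapshot was being read.
  def replay(snapshot_rows, buffered_inserts) do
    index =
      Enum.reduce(snapshot_rows, %{}, fn {key, value}, acc ->
        strict_insert(acc, key, value)
      end)

    Enum.reduce(buffered_inserts, index, fn {key, value}, acc ->
      strict_insert(acc, key, value)
    end)
  end
end
```

With disjoint keys `replay/2` returns the merged index. If a buffered insert repeats a snapshot key, it raises a `RuntimeError`, which matches the shape of the crash seen here.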

Bug 2: DependencyLayers missing function clause

When shape 30042127 (which depends on 97489818) tried to register after 97489818 crashed, the call was:

Electric.Shapes.DependencyLayers.add_after_dependencies([], "30042127-...", MapSet.new(["97489818-..."]))

The function has no clause for when layers is empty but deps_to_find is NOT empty:

# This clause only matches when deps_to_find is empty
defp add_after_dependencies([], shape_handle, deps_to_find) when map_size(deps_to_find) == 0 do
  [MapSet.new([shape_handle])]
end
# Missing: clause for when layers is empty but deps_to_find is not

Memory Growth Hypothesis

The 3.5GB growth in ShapeLogCollector is likely caused by:

  1. Message queue accumulation - During the cascade failure, messages likely piled up in the ShapeLogCollector mailbox faster than the process could handle them:

    • Shape registration updates
    • Transaction fragments from replication
    • Flush notifications
    • Shutdown/down notifications
  2. State accumulation during failed recovery - The system spent ~2 minutes (18:10:10 to 18:12) in a failed state with continuous "Stack not ready" errors, potentially accumulating state.

  3. Binary heap growth - JSON-encoded log entries and transaction data accumulating without GC during the cascade.
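One way to test the message queue hypothesis on a live node is to sample the process's mailbox length and memory over time. The sketch below is an assumption about how that could be done, not existing Electric tooling: `QueueProbe` is a hypothetical module name, and it relies only on the standard `Process.info/2`.

```elixir
defmodule QueueProbe do
  # Returns a map with the mailbox length, total memory (bytes) and
  # total heap size (words) of a process, or nil if it is no longer alive.
  def snapshot(pid) when is_pid(pid) do
    case Process.info(pid, [:message_queue_len, :memory, :total_heap_size]) do
      nil -> nil
      info -> Map.new(info)
    end
  end
end
```

A `message_queue_len` that keeps rising across samples while `memory` grows with it would support hypothesis 1. Large memory with a short queue would point towards state or binary accumulation instead.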

Evidence

Crash logs

18:10:10.437 [error] GenServer Materializer "97489818-..." terminating
** (RuntimeError) Key "public"."offers"/"d3c8d8a5-5060-4a36-a67d-240de0c95a88" already exists

18:10:10.449 [error] GenServer ShapeLogCollector terminating  
** (FunctionClauseError) no function clause matching in Electric.Shapes.DependencyLayers.add_after_dependencies/3

Transaction details

%Electric.Replication.Changes.NewRecord{
  relation: {"public", "offers"},
  record: %{"id" => "d3c8d8a5-5060-4a36-a67d-240de0c95a88"},
  key: "\"public\".\"offers\"/\"d3c8d8a5-5060-4a36-a67d-240de0c95a88\"",
  move_tags: ["e12422d3af57a36d01a50b4645a517e4"]  # <- Move-in event
}

Proposed Fixes

Fix 1: Make Materializer idempotent

In lib/electric/shapes/consumer/materializer.ex, skip duplicates instead of raising:

%Changes.NewRecord{key: key, record: record, move_tags: move_tags},
{{index, tag_indices}, counts_and_events} ->
  if is_map_key(index, key) do
    # Already exists - skip duplicate (can happen during snapshot/replication race)
    {{index, tag_indices}, counts_and_events}
  else
    {value, original_string} = cast!(record, state)
    index = Map.put(index, key, value)
    # ...rest of logic
  end
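As a self-contained illustration of the skip-on-duplicate idea, the sketch below reduces a list of inserts into a plain map. It is a simplification under stated assumptions, not the Materializer code: `IdempotentIndex` and the `{:insert, key, value}` tuple shape are hypothetical, and casting, tag indices and event counting are left out.

```elixir
defmodule IdempotentIndex do
  # Hypothetical simplification: each change is {:insert, key, value}.
  # Returns {index, skipped_keys} so that duplicates stay observable
  # to the caller instead of disappearing silently.
  def apply_inserts(index, changes) do
    {index, skipped} =
      Enum.reduce(changes, {index, []}, fn {:insert, key, value}, {idx, skipped} ->
        if is_map_key(idx, key) do
          # Key already present (e.g. seen in the snapshot): keep the existing value.
          {idx, [key | skipped]}
        else
          {Map.put(idx, key, value), skipped}
        end
      end)

    {index, Enum.reverse(skipped)}
  end
end
```

In this sketch the first-seen value wins. Whether that is right for the real Materializer depends on whether the replicated row can differ from the snapshot row, which this issue does not establish.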

Fix 2: Add missing DependencyLayers clause

In lib/electric/shapes/dependency_layers.ex:

# Handle case where dependency shapes haven't been added yet
defp add_after_dependencies([], shape_handle, deps_to_find) when map_size(deps_to_find) > 0 do
  # Log warning about missing dependencies
  Logger.warning("Adding shape #{shape_handle} but dependencies #{inspect(deps_to_find)} not found in layers")
  [MapSet.new([shape_handle])]
end
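The layering behaviour with such a fallback can be modelled in isolation. The sketch below is a hypothetical reconstruction, not the actual `DependencyLayers` module: `LayersSketch`, `add/3` and `place/3` are invented names, and the traversal order is assumed from the clause quoted above. Note that `map_size/1` applied to a `MapSet` counts the struct's fields rather than its elements, so this sketch checks emptiness with `MapSet.size/1` in the function body instead of a guard.

```elixir
defmodule LayersSketch do
  # layers: list of MapSets of shape handles, dependencies first.
  # deps: MapSet of the handles this shape depends on.
  def add(layers, handle, deps), do: place(layers, handle, deps)

  # All layers consumed: open a new layer, whether or not some
  # dependencies were never found (the case that crashed here).
  defp place([], handle, _deps), do: [MapSet.new([handle])]

  defp place([layer | rest], handle, deps) do
    if MapSet.size(deps) == 0 do
      # Every dependency sits in an earlier layer, so this layer is safe.
      [MapSet.put(layer, handle) | rest]
    else
      # Drop the dependencies found in this layer and keep walking.
      [layer | place(rest, handle, MapSet.difference(deps, layer))]
    end
  end
end
```

In this model a shape whose dependency is missing is placed in a new final layer instead of raising. A real fix would probably also want to log or surface the missing handles, as the proposed clause does.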

Fix 3: Reorder Materializer startup (optional)

Subscribe to Consumer AFTER reading from storage to minimize the race window.

Environment

  • Instance: maxwell (eu-west-1)
  • Version: Electric 1.3.3
  • Stack ID: 2a649dc5-b661-4918-b283-06999429a156

Related

  • Crash dump not available (ephemeral storage wiped on restart)
  • See also: issue for crash dump persistence
