Partitions vanish and re-appear randomly #24

@aphyr

Once acknowledged, partitions are supposed to always exist. However, queries and RPC requests often fail to find them. For example, take this test run of ebf0f3e: 20250506T110419.058-0500.zip. In it, we created eight partitions, each holding a single integer field called foo with a unique value:

# Just a box containing a single int.
class Intbox(Node):
    foo: int = Field(default=-1)

Concurrently, we tried to read all partitions by issuing a series of simple queries, like

query {:code (str "select('SOME_PARTITION_KEY')")}

Throughout the test, these queries disagreed on which partitions existed. At the end of the test, we asked each node to read again. Note that the creation of partitions 1, 2, 4, 5, 6, 7, and 8 had been acknowledged as successful before these reads began, yet every one of the final reads missed some of these committed partitions. (Partition 3's creation crashed.)

0	:ok	:read	[nil nil nil 5 6 nil 8]
22	:ok	:read	[nil 2 4 5 nil 7 8]
13	:ok	:read	[1 2 nil 5 nil 7 8]
3	:ok	:read	[1 2 nil 5 nil 7 8]
39	:ok	:read	[1 2 nil 5 nil nil 8]
5	:ok	:read	[nil nil nil 5 6 nil 8]
6	:ok	:read	[nil nil 4 5 6 7 nil]
11	:ok	:read	[nil nil 4 5 6 7 nil]
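
To make the disagreement easier to see, here is a rough Python sketch that tabulates which acknowledged partitions each final read failed to observe. The read values are copied from the output above; the names `ACKED`, `FINAL_READS`, and `missing` are hypothetical helpers for illustration, and treating the first column as a process number is an assumption. This is not part of the test suite.

```python
# Partitions whose creation was acknowledged (partition 3's creation crashed).
ACKED = [1, 2, 4, 5, 6, 7, 8]

# Final reads copied from the report. Keys are the first column of each row,
# assumed here to be the process number; nil is written as None.
FINAL_READS = {
    0:  [None, None, None, 5, 6, None, 8],
    22: [None, 2, 4, 5, None, 7, 8],
    13: [1, 2, None, 5, None, 7, 8],
    3:  [1, 2, None, 5, None, 7, 8],
    39: [1, 2, None, 5, None, None, 8],
    5:  [None, None, None, 5, 6, None, 8],
    6:  [None, None, 4, 5, 6, 7, None],
    11: [None, None, 4, 5, 6, 7, None],
}


def missing(read):
    """Return the acknowledged partitions that this read did not observe."""
    seen = {value for value in read if value is not None}
    return [p for p in ACKED if p not in seen]


# Per-read list of lost partitions, and the partitions every read agreed on.
MISSING = {proc: missing(read) for proc, read in FINAL_READS.items()}
SEEN_BY_ALL = [p for p in ACKED if all(p not in lost for lost in MISSING.values())]

if __name__ == "__main__":
    for proc, lost in MISSING.items():
        print(f"process {proc}: missing {lost}")
    print("seen by every read:", SEEN_BY_ALL)
```

On the data above, every read is missing at least two acknowledged partitions, and partition 5 is the only one that all eight reads observed.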

You can reproduce this easily in healthy clusters by running

lein run test --tarball capela-2025-04-10-ebf0f3e.tar.gz -w partition-set --rate 1/3 --nemesis none --time-limit 60
