Skip to content

Dequeue sometimes skips a prefix of a sub-queue #437

@fhackett

Description

@fhackett

This might be related to 1 or 2 other open issues, but I can't prove it's the same thing. I read the README very carefully, and I want to check whether I understood correctly. If so, I might have a reproducer for an issue others have reported.

Key question(s): is this a bug? If so, how can I help you diagnose it further, and if not, what is expected so I can revise my model?

For context, I'm part of a research project that chose to use your concurrent queue (among some other systems) to evaluate our new analysis tool. The tool fuzz tests target code, both with fuzzy call sequences and fuzzy scheduler behavior (via rr chaos mode https://rr-project.org/). We record timeboxes within which each operation happened, as well as what went into / out of the data structure.

Given these detailed logs, we use a model checker to compare them against a relatively complete TLA+ model of the collection-of-subqueues design your data structure uses. Most of the time our model agrees with the documented non-linearizable behavior (pseudo randomly picking which subqueue to dequeue from, switching subqueues whenever-ly). Note that we use the model checker to explore every interleaving of behaviors that are reported as overlapping, and every interpretation of operations that are ambiguous.

Sometimes, our validation process fails in a scenario like this (each operation happens in a distinct period of real time, all multiplexed onto one single CPU core):

  • Thread 1 enqueues [1, 2, 3, 4, 5]
  • Thread 2 dequeues [3, 4, 5]
  • Thread 3 dequeues [1, 2]

The real scenarios are more complex, but ultimately no possible interpretation where the data structure returns elements in per-producer enqueue order is valid. The skip-forward pattern above is common, where subsequences keep their order internally (don't see [2, 1] for instance), but the subsequences come out wrong (assuming I understand your docs).

Because we use rr, we have full record-replay debugging of these bug scenarios, and I can answer very specific questions if I know what to look for in the code. I am also willing to make edits to my testing setup, share data, etc.

The testing rig is public here: https://github.com/DistCompiler/pgo/tree/main/omnilink. You're welcome to check it out. The code calling your data structure is in concurrentqueue/workload. The build system can recompile it against any version of your code on demand. Note that the whole thing is "research grade" software, in that it is robust but not well documented, and in flux as we approach a paper submission deadline.

More details:

  • I am calling token-less bulk enqueue from multiple threads (but I am tracking which thread did what; your data structure uses an implicit per-thread token in this case right?)
  • my dequeue operation, again from any thread, is token-less bulk dequeue
  • to make fuzzing easier, I deliberately configured a small block size of 4
  • each producer enqueues a characteristic range of integers, so we can tell them apart easily

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions