
py_thread_callback_SUITE: concurrent threadpool tests flaky under load #63

@benoitc

Symptom

py_thread_callback_SUITE intermittently fails on slower CI environments (observed on FreeBSD 14 / Python 3.12 in run 25273995207). A re-run of the same commit passes, so the failures are not deterministic.

Failures observed

threadpool_basic_test:           {badmatch,<<"ok">>}
threadpool_concurrent_test:      {badmatch,{error,{'RuntimeError',"Empty callback response"}}}
threadpool_multiple_calls_test:  {badmatch,[3,2,3,4,7]}    %% expected [3,4,5,6,7]
threadpool_nested_callback_test: {badmatch,[1,14,4,2,5]}   %% expected [1,2,5,10,17]
threadpool_error_handling_test:  {badmatch,[0,<<"error">>,9,<<"error">>]}  %% expected [0,_,4,_,16]
simple_thread_concurrent_test:   {badmatch,[0,16,4,6,8]}   %% expected [0,2,4,6,8]
... (9 cases total)

The wrong-value lists (e.g. [3,2,3,4,7], where each element should be x + 3 for x ∈ 0..4) are not a simple ordering issue: the values themselves are wrong (2 is not in the expected set and 3 appears twice). This looks like cross-call state interference in the callback path used from Python's concurrent.futures.ThreadPoolExecutor: workers appear to receive arguments or results that belong to a different invocation.
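To make the "not an ordering issue" distinction concrete, here is a minimal sketch that separates reordered results from genuinely wrong values. The helper name `classify` is hypothetical and is not part of the test suite; the lists are the ones quoted above.

```python
from collections import Counter


def classify(observed, expected):
    """Return 'match', 'reordered', or 'wrong_values' for two result lists."""
    if observed == expected:
        return "match"
    # Same multiset of values in a different order: a pure ordering problem.
    if Counter(observed) == Counter(expected):
        return "reordered"
    # At least one value is missing, duplicated, or foreign.
    return "wrong_values"


expected = [x + 3 for x in range(5)]  # [3, 4, 5, 6, 7]
observed = [3, 2, 3, 4, 7]            # from threadpool_multiple_calls_test
print(classify(observed, expected))   # wrong_values
```

A check like this could be used when triaging the remaining failing cases, since a 'reordered' outcome would point at result collection rather than at the callback path.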

Hypothesis

The C-side reentrant-callback machinery (erlang.call(...) from a Python thread that isn't the worker pthread) maintains thread-local request state. When ThreadPoolExecutor runs the lambda on N worker threads simultaneously, each calling erlang.call('add_one', x) with a different x, a result may be delivered to the wrong thread on hosts where callback dispatch is slower than the test's pacing. So far the bad results have appeared only on the slow FreeBSD VM and never on the Linux/macOS runners.
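The workload shape described in the hypothesis can be sketched roughly as follows. This is an illustration only: `fake_erlang_call` is a hypothetical pure-Python stand-in for erlang.call (which is only available inside the embedded interpreter), and the real test code may be structured differently.

```python
from concurrent.futures import ThreadPoolExecutor


def fake_erlang_call(name, *args):
    """Hypothetical stand-in for erlang.call; runs the handler in-process."""
    handlers = {"add_one": lambda x: x + 1}
    return handlers[name](*args)


def run_pool(call, n=5, workers=4):
    """Issue n concurrent add_one callbacks from a thread pool."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Executor.map yields results in input order, whichever
        # worker thread finishes first.
        return list(pool.map(lambda x: call("add_one", x), range(n)))


results = run_pool(fake_erlang_call)
print(results == [x + 1 for x in range(5)])
```

Because Executor.map preserves input order, a correct callback path should always produce the expected list here, which is consistent with reading the failures as state interference rather than scheduling order.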

Suggested next steps

  1. Add tracing inside nif_thread_worker_call / the per-thread callback request slot to confirm the suspected cross-thread mix-up.
  2. Consider giving each Python thread its own request slot keyed by pthread_self() rather than by the worker's logical ID.
  3. As an interim measure, run the affected cases in their own non-parallel Common Test group (e.g. a thread_callback group declared in groups/0 without the parallel property) so they don't compete for the executor pool with other tests in the same run. That often masks the race long enough to ship.
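Step 2 above can be illustrated with a small pure-Python model. This is a sketch under stated assumptions, not the NIF implementation: `SlotTable`, `run_demo`, and the dispatcher thread are hypothetical names, threading.get_ident() plays the role of pthread_self(), and a queue-driven dispatcher stands in for the Erlang side.

```python
import queue
import threading
from concurrent.futures import ThreadPoolExecutor


class SlotTable:
    """One pending-response slot per calling thread, keyed by thread id."""

    def __init__(self):
        self._lock = threading.Lock()
        self._slots = {}

    def open(self):
        key = threading.get_ident()  # analogue of pthread_self()
        slot = queue.Queue(maxsize=1)
        with self._lock:
            self._slots[key] = slot
        return key, slot

    def deliver(self, key, value):
        # Remove the slot before filling it so a stale response
        # cannot land in a later request from the same thread.
        with self._lock:
            slot = self._slots.pop(key)
        slot.put(value)


def run_demo(n=5, workers=4):
    table = SlotTable()
    requests = queue.Queue()

    def dispatcher():
        # Hypothetical stand-in for the Erlang side answering add_one.
        while True:
            item = requests.get()
            if item is None:
                return
            key, arg = item
            table.deliver(key, arg + 1)

    def call_add_one(x):
        key, slot = table.open()
        requests.put((key, x))
        return slot.get(timeout=5)  # blocks until this thread's reply arrives

    t = threading.Thread(target=dispatcher, daemon=True)
    t.start()
    try:
        with ThreadPoolExecutor(max_workers=workers) as pool:
            return list(pool.map(call_add_one, range(n)))
    finally:
        requests.put(None)  # stop the dispatcher
        t.join()


print(run_demo())
```

The model relies on each thread having at most one outstanding request at a time. Nested callbacks issued from the same thread would need a stack of slots per thread rather than a single slot, which this sketch does not attempt.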

Not a release blocker: PR #62 merged after a clean rerun. File follow-up to debug under controlled load.
