Symptom
py_thread_callback_SUITE intermittently fails on slower CI environments (observed on FreeBSD 14 / Python 3.12 in run 25273995207). A re-run of the same commit passes, so the failures are not deterministic.
Failures observed
threadpool_basic_test: {badmatch,<<"ok">>}
threadpool_concurrent_test: {badmatch,{error,{'RuntimeError',"Empty callback response"}}}
threadpool_multiple_calls_test: {badmatch,[3,2,3,4,7]} %% expected [3,4,5,6,7]
threadpool_nested_callback_test: {badmatch,[1,14,4,2,5]} %% expected [1,2,5,10,17]
threadpool_error_handling_test: {badmatch,[0,<<"error">>,9,<<"error">>]} %% expected [0,_,4,_,16]
simple_thread_concurrent_test: {badmatch,[0,16,4,6,8]} %% expected [0,2,4,6,8]
... (9 cases total)
The wrong-value lists (e.g. [3,2,3,4,7] where each element should be x + 3 for x ∈ 0..4) are not a simple ordering issue: the values themselves are wrong, and the observed list is not a permutation of the expected one. This looks like cross-call state interference inside the Python concurrent.futures.ThreadPoolExecutor callback path: workers see callbacks delivering arguments or results from the wrong invocation.
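The "wrong values, not wrong order" reading can be checked mechanically. The sketch below compares the observed and expected lists from three of the failures above as multisets; the helper names (classify, stray_values) are made up for this note and are not part of the suite.

```python
from collections import Counter

# Observed vs. expected lists copied from the failure report above.
CASES = {
    "threadpool_multiple_calls_test": ([3, 2, 3, 4, 7], [3, 4, 5, 6, 7]),
    "threadpool_nested_callback_test": ([1, 14, 4, 2, 5], [1, 2, 5, 10, 17]),
    "simple_thread_concurrent_test": ([0, 16, 4, 6, 8], [0, 2, 4, 6, 8]),
}


def classify(observed, expected):
    """Return 'ok', 'reordered', or 'wrong_values' for one result list."""
    if observed == expected:
        return "ok"
    if Counter(observed) == Counter(expected):
        # Same multiset of values, different order: a pure ordering issue.
        return "reordered"
    return "wrong_values"


def stray_values(observed, expected):
    """Values that occur in observed more often than in expected."""
    return sorted((Counter(observed) - Counter(expected)).elements())


for name, (observed, expected) in CASES.items():
    print(name, classify(observed, expected), stray_values(observed, expected))
```

All three cases come out as wrong_values, so reordering alone cannot explain these results.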
Hypothesis
The C-side reentrant-callback machinery (erlang.call(...) from a Python thread that isn't the worker pthread) maintains thread-local request state. When ThreadPoolExecutor runs the lambda on N worker threads simultaneously, each calling erlang.call('add_one', x) for a different x, the result delivery may briefly cross threads on hosts where the callback dispatch is slower than the test's pacing. We see the bad result on the slow FreeBSD VM and never on Linux/macOS runners.
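To make the hypothesis concrete, here is a pure-Python toy model of the suspected failure mode. It is not the NIF's actual data structure: SharedSlot, the two callers, and the dispatcher are all hypothetical stand-ins, and the events only force the "dispatch is slower than the callers" interleaving so that the outcome is reproducible.

```python
import threading


class SharedSlot:
    """Hypothetical request/response slot shared by two calling threads."""

    def __init__(self):
        self.arg = None
        self.result = None


def add_one(x):
    return x + 1


def run_interleaved():
    slot = SharedSlot()
    a_posted = threading.Event()
    b_posted = threading.Event()
    served = threading.Event()
    results = {}

    def caller_a():
        slot.arg = 10              # A posts its request
        a_posted.set()
        served.wait()              # A waits for what it believes is its response
        results["a"] = slot.result

    def caller_b():
        a_posted.wait()
        slot.arg = 20              # B overwrites A's request before it is served
        b_posted.set()
        served.wait()
        results["b"] = slot.result

    def dispatcher():
        b_posted.wait()            # slow dispatch: runs only after both posts
        slot.result = add_one(slot.arg)
        served.set()

    threads = [threading.Thread(target=f) for f in (caller_a, caller_b, dispatcher)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


print(run_interleaved())  # caller A asked for add_one(10) but receives 21
```

In this model both callers receive 21, so caller A gets a well-formed value that belongs to a different invocation. That is the same shape as the wrong-value lists above, although it does not show that the real code shares a slot in this way.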
Suggested next steps
- Add tracing inside nif_thread_worker_call / the per-thread callback request slot to confirm the suspected cross-thread mix-up.
- Consider giving each Python thread its own request slot keyed by pthread_self() rather than by the worker's logical ID.
- As an interim measure, mark the affected cases ?GROUP({thread_callback, [parallel_groups: 1]}) or similar so they don't compete for the executor pool with other suites in the same run. That often masks the race long enough to ship.
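The second suggestion (one request slot per calling thread) can be sketched in Python as follows. This is an illustration under assumptions, not the C implementation: PerThreadSlots and call_add_three are hypothetical names, threading.get_ident() stands in for pthread_self(), and the barrier forces every worker to have a pending request before any is served, which is the condition the hypothesis blames.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class PerThreadSlots:
    """Hypothetical slot table keyed by the calling thread's identity."""

    def __init__(self):
        self._lock = threading.Lock()
        self._slots = {}

    def post(self, arg):
        key = threading.get_ident()   # stand-in for pthread_self()
        with self._lock:
            self._slots[key] = {"arg": arg, "result": None}
        return key

    def serve(self, key, fn):
        with self._lock:
            slot = self._slots[key]
        slot["result"] = fn(slot["arg"])

    def take(self):
        # Remove the slot once consumed: thread identifiers may be recycled
        # after a thread exits, so stale entries must not linger.
        with self._lock:
            return self._slots.pop(threading.get_ident())["result"]


N = 5
slots = PerThreadSlots()
all_posted = threading.Barrier(N, timeout=5)


def call_add_three(x):
    key = slots.post(x)
    all_posted.wait()                 # all N requests are pending at once
    slots.serve(key, lambda v: v + 3)
    return slots.take()


def run():
    with ThreadPoolExecutor(max_workers=N) as pool:
        return list(pool.map(call_add_three, range(N)))


print(run())  # [3, 4, 5, 6, 7]
```

Even with all five requests in flight at once, each caller reads back only its own slot. The same recycling caveat would apply to a C-side table keyed by pthread_self(), so slots should be released when a call completes or a thread exits.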
Not a release blocker: PR #62 merged after a clean rerun. File follow-up to debug under controlled load.