[Bug] Race condition in TIR ComputationCache in transforms common subexpression elimination #17072
Labels
needs-triage
PRs or issues that need to be investigated by maintainers to find the right assignees to address it
type: bug
There is an issue due to a race in the TVM/TIR optimization passes when several distinct python threads are each compiling some operator.
The race occurs in the common subexpression elimination which uses a cache of expression/statements map which is declared static.
The cache should be attached to the build context or at least declared
thread_local
The faulty cache declaration is located there: https://github.com/apache/tvm/blob/v0.16.0/src/tir/transforms/common_subexpr_elim_tools.h#L115
Expected behavior
Non faulty code generation on operator build when several python threads are each compiling different module/operators.
Actual behavior
When launching in parallel, for instance in a thread pool the creation and build of an operator, one may encounter a Segfault on HashTable insertion (race on iteration/insert).
This bug is flaky as it is highly dependent of the machine/number of threads/compiled workload.
Environment
TVM: v0.16.0
Target device: llvm host cpu
LLVM: llvm-config-12
Kernel: Linux 5.10.0-27-amd
Distro: Debian 5.10.205-2 (2023-12-31) x86_64 GNU/Linux
Archi: 52 Cores Intel(R) Xeon(R) Gold 6230R CPU @ 2.10GHz
Steps to reproduce
The problem arises due to a statically declared cache in: https://github.com/apache/tvm/blob/v0.16.0/src/tir/transforms/common_subexpr_elim_tools.h#L115
A simple fix is to define the cache
thread_local
in this declaration and at the definition point.Though there may be some more elegant fix such as not using a static cache but a per compilation context cache.
Find there a test script which reproduce the issue by massively launching parallel build of a matmul operator: multithreaded-bug.py.gz
Launch it with (note that I use
setarch -R
in an attempt to be more reproducible), this was run on the machine described above (52 cores):Note that the bug is flaky, if not reproduced, try to play with the number of parallel threads and the total number of tasks, for instance:
Also, one can play on the
sys.setswitchinterval(0.00001)
in the file, by lowering or increasing the context switch interval.By applying the simple
thread_local
fix, the bug is not visible anymore. Ref to the attached patch file for the fix:0001-Bugfix-TIR-Fix-race-on-ComputationCache.patch.gz
Triage
The text was updated successfully, but these errors were encountered: