Kokkos with openmp support: random test failures due to failed kokkos initialization #14641
Comments
For reference, I do not observe this issue on any other image (Ubuntu 18.04, 20.04, 22.04, 23.04, Debian stable, testing). I also observe random tests failing with timeouts. I am reasonably confident that I have ruled out load and IO issues (in particular, I observe no such thing on the Debian and Ubuntu images). I suspect this is related. |
Do you see that problem if you try to run the test alone or only when you run the entire CI? |
@Rombur I will investigate. |
Partially fixed with #14705. While the above pull request seems to reduce the number of test failures, I still see a number of tests randomly failing with the initialization error. For example, https://cdash.dealii.org/test/4523840 |
@Rombur I can trigger this problem reliably (~50% success rate) by simply running a bare test executable repeatedly. For example I think we have a race condition problem. Here is a brain dump of what I tried so far: |
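For anyone who wants to reproduce the "run it repeatedly" experiment, here is a minimal sketch of a driver that runs one executable many times and tallies the outcomes. It is not the command used above; the function name `run_repeatedly`, the run count and the timeout are all assumptions chosen for illustration.

```python
import subprocess
import sys


def run_repeatedly(cmd, runs=20, timeout=60):
    """Run `cmd` `runs` times and tally successes, failures and timeouts.

    `cmd` is an argument list, e.g. the path to a test executable
    (hypothetical; substitute whatever binary you want to stress).
    """
    tally = {"ok": 0, "failed": 0, "timeout": 0}
    for _ in range(runs):
        try:
            result = subprocess.run(
                cmd,
                stdout=subprocess.DEVNULL,
                stderr=subprocess.DEVNULL,
                timeout=timeout,  # a hanging run is killed and counted separately
            )
        except subprocess.TimeoutExpired:
            tally["timeout"] += 1
            continue
        # A non-zero exit code (including death by signal) counts as a failure.
        tally["ok" if result.returncode == 0 else "failed"] += 1
    return tally
```

Calling something like `run_repeatedly(["./path/to/test_executable"], runs=50)` (the path is a placeholder) gives a rough success rate. Counting timeouts separately from failures may help tell the hangs apart from the initialization error.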
In order to gain some insight I have applied the following patch (includes some beautiful cout debugging):
|
And sometimes execution fails:
gdb stacktrace (breakpoint on
This is interesting! On startup (during static object initialization) we detect that Kokkos is not initialized and call |
Another note: With |
This is funny, so setting So I am suspecting this is
I will rebuild Trilinos without OpenMP support and try again. Confirmed: Disabling OpenMP support for Trilinos/Kokkos resolves the issue. |
It's interesting because in Kokkos we used to have |
@Rombur I do not observe this issue with Kokkos 3.7.1 (and no Trilinos), see https://cdash.dealii.org/build/447 |
hmmm 3.7.1 still uses the |
I don't think we will address this issue during this release cycle. |
On my Gentoo container with an external Trilinos+Kokkos 13.4.1 I observe random test failures for a subset of our regression tests: see for example https://cdash.dealii.org/viewTest.php?onlydelta&buildid=17 and https://cdash.dealii.org/test/188917