Kokkos with openmp support: random test failures due to failed kokkos initialization #14641

Open
tamiko opened this issue Jan 4, 2023 · 13 comments

Comments

@tamiko
Member

tamiko commented Jan 4, 2023

On my Gentoo container with an external Trilinos+Kokkos 13.4.1 I observe random test failures for a subset of our regression tests; see https://cdash.dealii.org/viewTest.php?onlydelta&buildid=17 and, as one example, https://cdash.dealii.org/test/188917

fe/abf_projection_01.release: BUILD successful.
fe/abf_projection_01.release: RUN failed. ------ Return code 134
fe/abf_projection_01.release: RUN failed. ------ Result: /scratch/users/testsuite/build-F24cAOtY/tests/fe/abf_projection_01.release/failing_output
fe/abf_projection_01.release: RUN failed. ------ Partial output:
JobId b1 Wed Jan  4 06:19:07 2023
DEAL::Dofs/cell 6Dofs/face 1
DEAL::Dofs total 6
DEAL::MM created
DEAL::RHS created
DEAL::Solver stopped within 8 - 10 iterations

fe/abf_projection_01.release: RUN failed. ------ Additional output on stdout/stderr:

terminate called after throwing an instance of 'std::runtime_error'
  what():  Constructing View and initializing data with uninitialized execution space


fe/abf_projection_01.release: ******    RUN failed    *******
@tamiko
Member Author

tamiko commented Jan 4, 2023

For reference, I do not observe this issue on any other image (Ubuntu 18.04, 20.04, 22.04, 23.04, Debian stable, testing).

I also observe random tests with timeouts. I am reasonably confident that I have ruled out load and IO issues (in particular, I observe no such thing on the Debian and Ubuntu images). I suspect this is related.

@Rombur
Member

Rombur commented Jan 4, 2023

Do you see that problem if you try to run the test alone or only when you run the entire CI?

@tamiko
Member Author

tamiko commented Jan 4, 2023

@Rombur I will investigate.

@tamiko
Member Author

tamiko commented Jan 21, 2023

Partially fixed with #14705

While the above pull request seems to reduce the number of test failures I still see a number of tests randomly failing with the initialization error. For example, https://cdash.dealii.org/test/4523840

@tamiko
Member Author

tamiko commented Jan 23, 2023

@Rombur I can trigger this problem reliably (~50% success rate) by simply running a bare test executable repeatedly, for example sacado/step-44-helper_res_lin_01_1.release (it is possible with other tests as well; this was just the first one I tried).

I think we have a race condition problem. Here is a brain dump of what I tried so far:

@tamiko
Member Author

tamiko commented Jan 23, 2023

In order to gain some insight I have applied the following patch (includes some beautiful cout debugging):

diff --git a/source/base/kokkos.cc b/source/base/kokkos.cc
index 7cf0a34d6a..ef98179b96 100644
--- a/source/base/kokkos.cc
+++ b/source/base/kokkos.cc
@@ -15,25 +15,60 @@
 
 #include <deal.II/base/kokkos.h>
 
+#include <deal.II/base/mutex.h>
 #include <deal.II/lac/la_parallel_vector.h>
 #include <deal.II/lac/vector_memory.h>
 
 #include <Kokkos_Core.hpp>
 
+#include <atomic>
+
 DEAL_II_NAMESPACE_OPEN
 
 namespace internal
 {
   bool dealii_initialized_kokkos = false;
 
+  Threads::Mutex kokkos_initialization_mutex;
+  std::atomic_bool kokkos_initialized = false;
+
   void
   ensure_kokkos_initialized()
   {
-    if (!Kokkos::is_initialized())
+    std::cout << "DEBUG: ensure_kokkos_initialized()" << std::endl;
+    if (kokkos_initialized == true)
+      {
+        if(Kokkos::is_initialized())
+          {
+            std::cout << "DEBUG: true! return." << std::endl;
+            return;
+          }
+        else
+          {
+            std::cout << "ERROR: uninitialized kokkos.." << std::endl;
+            __builtin_trap();
+          }
+      }
+    std::cout << "DEBUG: false! taking initialization lock." << std::endl;
+
+    std::lock_guard<std::mutex> lock(kokkos_initialization_mutex);
+
+    if (kokkos_initialized == false)
       {
-        dealii_initialized_kokkos = true;
-        Kokkos::initialize();
-        std::atexit(Kokkos::finalize);
+        std::cout << "DEBUG: initializing kokkos." << std::endl;
+        kokkos_initialized = true;
+        if(!Kokkos::is_initialized())
+          {
+            std::cout << "DEBUG: initializing kokkos." << std::endl;
+            dealii_initialized_kokkos = true;
+            Kokkos::initialize();
+            std::atexit(Kokkos::finalize);
+          }
+        else
+          {
+            std::cout << "ERROR: someone just initialized kokkos." << std::endl;
+            __builtin_trap();
+          }
       }
   }
 } // namespace internal
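
For comparison, the "check, lock, check again" logic of the patch can also be written with std::call_once. The following is only a sketch under stated assumptions: runtime_initialize(), runtime_initialized and the sketch namespace are hypothetical stand-ins so that the example compiles without Kokkos; they are not the deal.II or Kokkos implementation.

```cpp
#include <atomic>
#include <mutex>
#include <thread>
#include <vector>

namespace sketch
{
  // Hypothetical stand-ins for Kokkos::initialize() and
  // Kokkos::is_initialized(); they only record what happened.
  std::atomic<int>  initialize_calls{0};
  std::atomic<bool> runtime_initialized{false};

  void
  runtime_initialize()
  {
    ++initialize_calls;
    runtime_initialized = true;
  }

  std::once_flag initialization_flag;

  // Safe to call concurrently: the lambda runs exactly once, and every
  // other caller blocks until that single run has finished.
  void
  ensure_initialized()
  {
    std::call_once(initialization_flag, []() {
      if (!runtime_initialized)
        runtime_initialize();
    });
  }
} // namespace sketch

// Calls ensure_initialized() from n_threads threads at once and returns
// how often the stand-in initialization function was actually invoked.
int
initialize_from_many_threads(const unsigned int n_threads)
{
  std::vector<std::thread> threads;
  for (unsigned int i = 0; i < n_threads; ++i)
    threads.emplace_back(sketch::ensure_initialized);
  for (auto &thread : threads)
    thread.join();
  return sketch::initialize_calls;
}
```

This only guards against two threads racing into the initialization call. It does not help if the library itself tracks part of its state per thread.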

@tamiko
Member Author

tamiko commented Jan 23, 2023

And sometimes execution fails:

DEBUG: ensure_kokkos_initialized()
DEBUG: false! taking initialization lock.
DEBUG: initializing kokkos.
DEBUG: initializing kokkos.
Kokkos::OpenMP::initialize WARNING: OMP_PROC_BIND environment variable not set
  In general, for best performance with OpenMP 4.0 or better set OMP_PROC_BIND=spread and OMP_PLACES=threads
  For best performance with OpenMP 3.1 set OMP_PROC_BIND=true
  For unit testing set OMP_PROC_BIND=false
Grid:
         Reference volume: 1e-06
Triangulation:
         Number of active cells: 4
         Number of degrees of freedom: 26
    Setting up quadrature point data...
DEBUG: ensure_kokkos_initialized()
DEBUG: true! return.


----------------------------------------------------
Exception on processing: 
Constructing View and initializing data with uninitialized execution space
Aborting!
----------------------------------------------------
DEBUG: ensure_kokkos_initialized()
DEBUG: true! return.
DEBUG: ensure_kokkos_initialized()
DEBUG: true! return.
DEBUG: ensure_kokkos_initialized()
DEBUG: true! return.
DEBUG: ensure_kokkos_initialized()
DEBUG: true! return.
DEBUG: ensure_kokkos_initialized()
DEBUG: true! return.
DEBUG: ensure_kokkos_initialized()
DEBUG: true! return.
DEBUG: ensure_kokkos_initialized()
DEBUG: true! return.
DEBUG: ensure_kokkos_initialized()
DEBUG: true! return.

gdb stacktrace (breakpoint on Kokkos::Impl::throw_runtime_exception):

#0  0x00007fffedf52440 in Kokkos::Impl::throw_runtime_exception(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)@plt ()
   from /srv/temp/build/lib/libdeal_II.so.9.5.0-pre
#1  0x00007ffff1a44bdc in Kokkos::View<double*, Kokkos::HostSpace>::View<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >(Kokkos::Impl::ViewCtorProp<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > const&, std::enable_if<!Kokkos::Impl::ViewCtorProp<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >::has_pointer, Kokkos::LayoutRight>::type const&) () from /srv/temp/build/lib/libdeal_II.so.9.5.0-pre
#2  0x00007ffff1a4beb3 in dealii::MemorySpace::MemorySpaceData<double, dealii::MemorySpace::Host>::MemorySpaceData() () from /srv/temp/build/lib/libdeal_II.so.9.5.0-pre
#3  0x00007ffff1a4c265 in dealii::LinearAlgebra::distributed::Vector<double, dealii::MemorySpace::Host>::Vector() () from /srv/temp/build/lib/libdeal_II.so.9.5.0-pre
#4  0x00007ffff16c39d7 in std::vector<dealii::LinearAlgebra::distributed::Vector<double, dealii::MemorySpace::Host>, std::allocator<dealii::LinearAlgebra::distributed::Vector<double, dealii::MemorySpace::Host> > >::_M_default_append(unsigned long) () from /srv/temp/build/lib/libdeal_II.so.9.5.0-pre
#5  0x00007ffff1aa7584 in dealii::LinearAlgebra::distributed::BlockVector<double>::reinit(std::vector<unsigned int, std::allocator<unsigned int> > const&, bool) ()
   from /srv/temp/build/lib/libdeal_II.so.9.5.0-pre
#6  0x00007ffff1aa79f5 in dealii::LinearAlgebra::distributed::BlockVector<double>::reinit(unsigned int, unsigned int, bool) ()
   from /srv/temp/build/lib/libdeal_II.so.9.5.0-pre
#7  0x00007fffee435c27 in void dealii::internal::DataOutImplementation::(anonymous namespace)::create_dof_vector<2, 2, dealii::BlockVector<double>, double, (dealii::BlockVector<double>*)0>(dealii::DoFHandler<2, 2> const&, dealii::BlockVector<double> const&, dealii::LinearAlgebra::distributed::BlockVector<double>&) ()
   from /srv/temp/build/lib/libdeal_II.so.9.5.0-pre
#8  0x00007fffee492d71 in void dealii::DataOut_DoFData<2, 2, 2, 2>::add_data_vector_internal<dealii::BlockVector<double> >(dealii::DoFHandler<2, 2> const*, dealii::BlockVector<double> const&, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, dealii::DataOut_DoFData<2, 2, 2, 2>::DataVectorType, std::vector<dealii::DataComponentInterpretation::DataComponentInterpretation, std::allocator<dealii::DataComponentInterpretation::DataComponentInterpretation> > const&, bool) () from /srv/temp/build/lib/libdeal_II.so.9.5.0-pre
#9  0x0000555555603668 in void dealii::DataOut_DoFData<2, 2, 2, 2>::add_data_vector<dealii::BlockVector<double> >(dealii::BlockVector<double> const&, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, dealii::DataOut_DoFData<2, 2, 2, 2>::DataVectorType, std::vector<dealii::DataComponentInterpretation::DataComponentInterpretation, std::allocator<dealii::DataComponentInterpretation::DataComponentInterpretation> > const&) ()
#10 0x000055555560665f in Step44::Solid<2, double, (dealii::Differentiation::AD::NumberTypes)3>::output_results() const ()
#11 0x000055555565b8d4 in Step44::Solid<2, double, (dealii::Differentiation::AD::NumberTypes)3>::run() ()
#12 0x00005555555d8e24 in main ()

This is interesting!

On startup (during static object initialization) we detect that Kokkos is not initialized and call Kokkos::initialize().
Subsequently, Kokkos::is_initialized() returns true. BUT later on Kokkos throws an error claiming that the execution space is not initialized.

@tamiko
Member Author

tamiko commented Jan 23, 2023

Another note: With DEAL_II_NUM_THREADS=88 ./step-44-helper_res_lin_01_1.release I trigger the issue on average once every 3-4 invocations. On the other hand, DEAL_II_NUM_THREADS=1 did not produce a failure in 100 runs.
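
A failure rate like this can be measured with a small driver that reruns the executable and counts non-zero exit statuses. This is a minimal sketch, assuming a POSIX-like shell behind std::system; the helper name count_failures is hypothetical and not part of the deal.II test suite.

```cpp
#include <cstdlib>
#include <string>

// Runs `command` through the system shell `n_runs` times and returns how
// many invocations reported a non-zero status (a crash such as the
// std::runtime_error abort above counts as a failure).
unsigned int
count_failures(const std::string &command, const unsigned int n_runs)
{
  unsigned int n_failures = 0;
  for (unsigned int i = 0; i < n_runs; ++i)
    if (std::system(command.c_str()) != 0)
      ++n_failures;
  return n_failures;
}
```

Because the command goes through the shell, the environment variable can be part of the string, for example count_failures("DEAL_II_NUM_THREADS=88 ./step-44-helper_res_lin_01_1.release > /dev/null 2>&1", 100).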

@tamiko
Member Author

tamiko commented Jan 23, 2023

This is funny, so setting OMP_PROC_BIND=spread OMP_PLACES=threads also fixes the issue. But, OMP_PROC_BIND=none does not.

So I am suspecting this is

  • either a bug with Kokkos bundled in Trilinos 13.4.1 (because on Debian/Ubuntu versions with up to Trilinos 13.2 I do not see an issue)
  • or, more likely, a bug with the OpenMP backend in Kokkos (because none of the Debian/Ubuntu variants enable OpenMP support).

I will rebuild Trilinos without openmp support and try again.

Confirmed: Disabling openmp support for Trilinos/Kokkos resolves the issue.

@tamiko tamiko added this to the Release 9.5 milestone Jan 23, 2023
@tamiko tamiko changed the title from "Kokkos: random test failures due to failed/delayed kokkos initialization" to "Kokkos with openmp support: random test failures due to failed kokkos initialization" Jan 23, 2023
@masterleinad masterleinad self-assigned this Jan 23, 2023
@Rombur
Member

Rombur commented Jan 23, 2023

This is funny, so setting OMP_PROC_BIND=spread OMP_PLACES=threads also fixes the issue. But, OMP_PROC_BIND=none does not.

It's interesting because in Kokkos we used to have thread_local variables that were set during initialization, and we checked whether those variables were initialized to know if Kokkos::OpenMP was initialized. We removed these variables in Kokkos develop, so it would be interesting to try with a newer version of Kokkos.
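
To illustrate why thread_local state could produce "is_initialized() is true, yet the backend says it is uninitialized", here is a toy model. It is an assumption-laden sketch and not Kokkos code: the toy namespace, its two flags, and observe_from_new_thread() are all hypothetical.

```cpp
#include <atomic>
#include <thread>

namespace toy
{
  // Process-wide flag, playing the role of a global "is initialized" query.
  std::atomic<bool> globally_initialized{false};

  // Per-thread flag, playing the role of backend state kept in a
  // thread_local variable. Every thread gets its own copy, initially false.
  thread_local bool backend_ready_on_this_thread = false;

  void
  initialize()
  {
    globally_initialized         = true;
    backend_ready_on_this_thread = true; // only set for the calling thread
  }

  bool
  is_initialized()
  {
    return globally_initialized;
  }

  bool
  backend_is_ready()
  {
    return backend_ready_on_this_thread;
  }
} // namespace toy

struct Observation
{
  bool global_flag;
  bool backend_flag;
};

// Initializes the toy runtime on the calling thread and reports what a
// freshly spawned thread observes afterwards.
Observation
observe_from_new_thread()
{
  toy::initialize();
  Observation seen{false, false};
  std::thread worker([&seen]() {
    seen.global_flag  = toy::is_initialized();
    seen.backend_flag = toy::backend_is_ready();
  });
  worker.join();
  return seen;
}
```

In this model the new thread sees the global flag as true but its own thread_local flag as false, which mirrors the symptom in the debug output above. Whether this is the actual cause in the Kokkos version bundled with Trilinos 13.4.1 is not established in this thread.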

@tamiko
Member Author

tamiko commented Jan 23, 2023

@Rombur I do not observe this issue with Kokkos 3.7.1 (and no Trilinos), see https://cdash.dealii.org/build/447

@Rombur
Member

Rombur commented Jan 24, 2023

Hmm, 3.7.1 still uses the thread_local variables, but it came after I started to refactor the OpenMP backend initialization. Instead of checking that the thread_local variables were initialized, we switched to a singleton. On top of that, before 3.7 Kokkos::OpenMP would only work if you executed Kokkos functions from the master thread. You could not create a std::thread and then do a parallel_for from this new thread. This works in 3.7, but I haven't tried to initialize Kokkos::OpenMP from a std::thread. This should work with 4.0, where we finally removed the thread_local variables, but I am not sure about 3.7.

@tamiko
Member Author

tamiko commented Jun 20, 2023

I don't think we will address this issue during this release cycle.
