
Access to FE* restriction and prolongation matrices is not thread-safe. #16169

Closed · bangerth opened this issue Oct 22, 2023 · 13 comments · May be fixed by #16204

Comments

@bangerth (Member) commented Oct 22, 2023

The two functions accessing these matrices look like this:

template <int dim, int spacedim>
const FullMatrix<double> &
FESystem<dim, spacedim>::get_restriction_matrix(
  const unsigned int         child,
  const RefinementCase<dim> &refinement_case) const
{
  // ...

  // initialization upon first request
  if (this->restriction[refinement_case - 1][child].n() == 0)
    {
      std::lock_guard<std::mutex> lock(restriction_prolongation_matrix_mutex);

      // check if updated while waiting for lock
      if (this->restriction[refinement_case - 1][child].n() ==
          this->n_dofs_per_cell())
        return this->restriction[refinement_case - 1][child];

      // ...actually compute the matrices into a temporary object...

      // ...swap into place...
      restriction.swap(const_cast<FullMatrix<double> &>(
        this->restriction[refinement_case - 1][child]));
    }

  return this->restriction[refinement_case - 1][child];
}

But a poorly timed access might find the initial if condition already false while the matrix in question has not been completely swapped into place yet. We need to protect the return statement outside the if with the mutex as well. That is best achieved with a shared mutex, as mentioned in #16168.

EDIT: On closer look, pretty much all of the FE classes have this same idiom. They all need to be fixed.

@kronbichler (Member)

I would be curious about the performance implications of this. Grabbing even a traditional mutex can be quite expensive, especially since we pay the cost even when not using threads.
Regarding the mechanism itself: This is correct, and it has bothered me for some time that we rely on a particular form of store ordering between threads to really be sure that everything has arrived. If that can be done with standard facilities of the compiler, which knows the environment (e.g. the ARM architectures vs. x86-64), this is much better and future-proof. But as I said above, I would really like to avoid paying dozens of cycles or more to acquire a lock when there is no multithreading.

@bangerth (Member Author)

In practice, of course, the lock will rarely be contested. And the use of shared mutexes will make the contested case even less likely. The question then is how expensive the uncontested case really is. I don't have any experience with this -- do you know? My understanding was always that acquiring an uncontested mutex is actually quite cheap, but that may not actually be true.

@tamiko (Member) commented Oct 23, 2023

Wouldn't it be better to redesign the algorithm and use std::call_once for initialization instead of a shared mutex?

@tamiko (Member) commented Oct 23, 2023

As for using a shared mutex: Out of curiosity I have adapted the following cppreference example and looked at the assembly:

#include <mutex>
#include <shared_mutex>

class ThreadSafeCounter {
public:
  ThreadSafeCounter() = default;

  unsigned int get() const;
  void increment();
  void reset();

private:
  mutable std::shared_mutex mutex_;
  unsigned int value_{};
};

unsigned int ThreadSafeCounter::get() const {
  std::shared_lock lock(mutex_);
  return value_;
}

void ThreadSafeCounter::increment() {
  std::unique_lock lock(mutex_);
  ++value_;
}

void ThreadSafeCounter::reset() {
  std::unique_lock lock(mutex_);
  value_ = 0;
}

The get() function will always do two function calls:

[...]
call    pthread_rwlock_rdlock
[...]
call    pthread_rwlock_unlock
[...]

And the function in question has to do at least one lock xadd and one lock cmpxchg (if I am reading the assembly correctly). Both of these instructions are incredibly expensive; see for example the first answer here: https://stackoverflow.com/questions/4187914/average-latency-of-atomics-cmpxchg-instructions-on-intel-cpus

In contrast, this here:

#include <mutex>
#include <thread>

class ThreadSafeInitialize {
public:
  ThreadSafeInitialize() = default;

  unsigned int get();

private:
  std::once_flag flag;
  unsigned int value_{};
};

unsigned int ThreadSafeInitialize::get() {
  std::call_once(flag, [&]() { value_ = 42; });
  return value_;
}

reduces to a single call:

[...]
call    pthread_once

which looks significantly less scary (again - provided I was able to trace everything correctly):

000000000008af40 <__pthread_once@GLIBC_2.2.5>:
   8af40:       f3 0f 1e fa             endbr64
   8af44:       8b 07                   mov    (%rdi),%eax
   8af46:       a8 02                   test   $0x2,%al
   8af48:       74 06                   je     8af50 <__pthread_once@GLIBC_2.2.5+0x10>
   8af4a:       31 c0                   xor    %eax,%eax
   8af4c:       c3                      ret
   8af4d:       0f 1f 00                nopl   (%rax)
   8af50:       e9 0b fe ff ff          jmp    8ad60 <__pthread_mutexattr_settype@GLIBC_2.2.5+0x30>
   8af55:       66 2e 0f 1f 84 00 00    cs nopw 0x0(%rax,%rax,1)
   8af5c:       00 00 00 
   8af5f:       90                      nop

@bangerth (Member Author)

Yes, std::call_once is the right way to go in principle. Of course, std::once_flag must internally have some kind of mutex/futex/... to ensure that multiple threads don't step on each other, but it can be a specialized kind of mutex for this purpose. So I agree that this is the right direction -- the only problem is that now we have to have a std::once_flag for each matrix we build, and these flags are not copyable or movable, so it requires a bit more work.

Alternatively, we could decide that we want to build these objects unconditionally, every time. We could do that in the background, if we wanted, but these matrices are not cheap to build so I'm hesitant to do that because they are not used by all codes.

@tamiko (Member) commented Oct 24, 2023

@bangerth @kronbichler If this is of greater utility, we could write a small wrapper class around std::call_once to avoid the non-movable and non-copyable issue. More specifically, I was thinking of using both a std::once_flag and a plain boolean is_initialized, and then calling a check() function in the hot path:

#include <mutex>

class Initializer {
public:
  inline void check()
    {
      if (!is_initialized) [[unlikely]]
        {
          // payload() stands in for the actual one-time initialization
          std::call_once(flag, [&]() { payload(); is_initialized = true; });
        }
    }
private:
  bool is_initialized = false;
  std::once_flag flag;
};

This has several advantages:

  • the exceptional case (if is_initialized is false) is synchronized correctly via std::call_once.
  • we can move and copy the object (by moving and copying the bool and creating a new std::once_flag)
  • the passive call (when nothing has to be done) is as fast as the previous version (by checking a boolean in an inlined function). The boolean doesn't need a memory fence; the worst case is an unnecessary call to std::call_once shortly after initialization.
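The copy and move semantics mentioned in the second bullet could look like the sketch below. This is not the actual deal.II prototype; the payload is passed into check() here purely to make the sketch self-contained. Copy and assignment transfer only the boolean, while the std::once_flag is always freshly default-constructed, which is what makes the wrapper copyable at all:

```cpp
#include <mutex>

// Sketch of a copyable wrapper around std::call_once (names illustrative).
class Initializer {
public:
  Initializer() = default;

  // Copy only the boolean; the once_flag starts fresh. An object copied
  // before initialization can therefore still run its payload exactly once.
  Initializer(const Initializer &other)
    : is_initialized(other.is_initialized)
  {}

  Initializer &operator=(const Initializer &other) {
    is_initialized = other.is_initialized;
    return *this;
  }

  template <typename Callable>
  void check(const Callable &payload) {
    if (!is_initialized)
      std::call_once(flag, [&]() { payload(); is_initialized = true; });
  }

private:
  bool is_initialized = false;
  std::once_flag flag; // neither copyable nor movable; recreated on copy
};
```

A copy taken after initialization carries is_initialized == true, so its check() never runs the payload again; a copy taken before initialization gets a fresh once_flag and synchronizes its own first call correctly.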

@tamiko (Member) commented Oct 24, 2023

@masterleinad Please give me 24h to implement the little helper struct outlined above :-)

@bangerth (Member Author)

@tamiko I don't think you need the is_initialized flag. std::once_flag is likely implemented in exactly this way, with an atomic flag that, if true, indicates that the action has run and, if false, requires locking a mutex. I would not bother with your class; I think we can do what we want with just standard facilities.

@tamiko (Member) commented Oct 24, 2023

@bangerth The std::once_flag unfortunately has much more overhead (a function call and a locking cmpxchg) than what I suggest. Also, hiding the move semantics requires some tricks, which already warrants a class. I'll create a pull request in a moment to show you the prototype.

@bangerth bangerth changed the title Access to FESystem restriction and prolongation matrices is not thread-safe. Access to FE* restriction and prolongation matrices is not thread-safe. Oct 24, 2023
@bangerth (Member Author)

On second look, it isn't just FESystem that uses this thread-unsafe scheme, but all of the other FE classes as well.

@tamiko (Member) commented Oct 27, 2023

@bangerth I plan to clean this up by wrapping Lazy around the prolongation and restriction matrices in the FiniteElement base class. This will require fixing all the other places at the same time.

Incidentally, I will also make the logic for whether we have prolongation matrices more explicit via Lazy<T>::has_value(), instead of checking the internal matrices for an appropriate size...
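A generic sketch of what such a Lazy<T>-style wrapper could look like, combining a std::once_flag with a std::optional (the actual deal.II class in #16204 may differ in interface and in how it handles copying):

```cpp
#include <mutex>
#include <optional>

// Sketch only: lazy, once-only initialization with an explicit has_value().
template <typename T>
class Lazy {
public:
  // Initializes the value on the first call (thread-safe via call_once);
  // all later calls return the cached value without running the creator.
  template <typename Callable>
  const T &value_or_initialize(const Callable &creator) {
    std::call_once(flag_, [&]() { value_.emplace(creator()); });
    return *value_;
  }

  // Replaces the old "is the matrix the right size?" test.
  bool has_value() const { return value_.has_value(); }

private:
  std::once_flag flag_;
  std::optional<T> value_;
};
```

The explicit has_value() is the point of the comment above: whether a matrix exists is no longer inferred from its dimensions.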

@bangerth (Member Author)

Excellent!

@bangerth (Member Author) commented Nov 8, 2023

This is a duplicate of #331. Let's close this one given that it has accumulated a bunch of other things that are difficult to read through when just looking for the original issue.

@bangerth bangerth closed this as completed Nov 8, 2023