
Access to FE* restriction and prolongation matrices is not thread-safe. #16169

Closed · bangerth opened this issue Oct 22, 2023 · 13 comments · May be fixed by #16204

Comments

@bangerth (Member) commented Oct 22, 2023

The two functions accessing these matrices look like this:

template <int dim, int spacedim>
const FullMatrix<double> &
FESystem<dim, spacedim>::get_restriction_matrix(
  const unsigned int         child,
  const RefinementCase<dim> &refinement_case) const
{
  // ...

  // initialization upon first request
  if (this->restriction[refinement_case - 1][child].n() == 0)
    {
      std::lock_guard<std::mutex> lock(restriction_prolongation_matrix_mutex);

      // check if updated while waiting for lock
      if (this->restriction[refinement_case - 1][child].n() ==
          this->n_dofs_per_cell())
        return this->restriction[refinement_case - 1][child];

      // ...actually compute the matrices into a temporary object...

      // ...swap into place...
      restriction.swap(const_cast<FullMatrix<double> &>(
        this->restriction[refinement_case - 1][child]));
    }

  return this->restriction[refinement_case - 1][child];
}

But a poorly timed access might find the initial if condition already false while the matrix in question has not been completely swapped into place yet. We need to protect the return statement outside the if with the mutex as well. That is best achieved with a shared mutex, as mentioned in #16168.

EDIT: On closer look, pretty much all of the FE classes have this same idiom. They all need to be fixed.

@kronbichler (Member)

I would be curious about the performance implications of this. Grabbing even a traditional mutex can be quite expensive, especially since we pay the cost even when not using threads.
Regarding the mechanism itself: This is correct, and it has bothered me for some time that we rely on a particular form of store ordering between threads to really be sure that everything has arrived. If that can be done with standard facilities of the compiler, which knows the environment (e.g. the ARM architectures vs. x86-64), this is much better and future-proof. But as I said above, I would really like to avoid paying dozens of cycles or more to acquire a lock when there is no multithreading.

@bangerth (Member Author)

In practice, of course, the lock will rarely be contested. And the use of shared mutexes will make the contested case even less likely. The question then is how expensive the uncontested case really is. I don't have any experience with this -- do you know? My understanding was always that acquiring an uncontested mutex is actually quite cheap, but that may not actually be true.

@tamiko (Member) commented Oct 23, 2023

Wouldn't it be better to redesign the algorithm and use std::call_once for initialization instead of a shared mutex?

@tamiko (Member) commented Oct 23, 2023

As for using a shared mutex: Out of curiosity I have adapted the following cppreference example and looked at the assembly:

#include <mutex>
#include <shared_mutex>

class ThreadSafeCounter {
public:
  ThreadSafeCounter() = default;

  unsigned int get() const;
  void increment();
  void reset();

private:
  mutable std::shared_mutex mutex_;
  unsigned int value_{};
};

unsigned int ThreadSafeCounter::get() const {
  std::shared_lock lock(mutex_);
  return value_;
}

void ThreadSafeCounter::increment() {
  std::unique_lock lock(mutex_);
  ++value_;
}

void ThreadSafeCounter::reset() {
  std::unique_lock lock(mutex_);
  value_ = 0;
}

The get() function will always do two function calls:

[...]
call    pthread_rwlock_rdlock
[...]
call    pthread_rwlock_unlock
[...]

And the function in question has to do at least one lock xadd and one lock cmpxchg (if I am reading the assembly correctly). Both of these instructions are incredibly expensive; see for example the first answer here: https://stackoverflow.com/questions/4187914/average-latency-of-atomics-cmpxchg-instructions-on-intel-cpus

In contrast, this here:

#include <mutex>
#include <thread>

class ThreadSafeInitialize {
public:
  ThreadSafeInitialize() = default;

  unsigned int get();

private:
  std::once_flag flag;
  unsigned int value_{};
};

unsigned int ThreadSafeInitialize::get() {
  std::call_once(flag, [&]() { value_ = 42; });
  return value_;
}

reduces to a single call:

[...]
call    pthread_once

which looks significantly less scary (again - provided I was able to trace everything correctly):

000000000008af40 <__pthread_once@GLIBC_2.2.5>:
   8af40:       f3 0f 1e fa             endbr64
   8af44:       8b 07                   mov    (%rdi),%eax
   8af46:       a8 02                   test   $0x2,%al
   8af48:       74 06                   je     8af50 <__pthread_once@GLIBC_2.2.5+0x10>
   8af4a:       31 c0                   xor    %eax,%eax
   8af4c:       c3                      ret
   8af4d:       0f 1f 00                nopl   (%rax)
   8af50:       e9 0b fe ff ff          jmp    8ad60 <__pthread_mutexattr_settype@GLIBC_2.2.5+0x30>
   8af55:       66 2e 0f 1f 84 00 00    cs nopw 0x0(%rax,%rax,1)
   8af5c:       00 00 00 
   8af5f:       90                      nop

@bangerth (Member Author)

Yes, std::call_once is the right way to go in principle. Of course, std::once_flag must internally have some kind of mutex/futex/... to ensure that multiple threads don't step on each other, but it can be a specialized kind of mutex for this purpose. So I agree that this is the right direction -- the only problem is that now we have to have a std::once_flag for each matrix we build, and these flags are not copyable or movable, so it requires a bit more work.

Alternatively, we could decide that we want to build these objects unconditionally, every time. We could do that in the background, if we wanted, but these matrices are not cheap to build so I'm hesitant to do that because they are not used by all codes.

@tamiko (Member) commented Oct 24, 2023

@bangerth @kronbichler If this is of greater utility, we could write a small wrapper class around std::call_once to avoid the non-movable and non-copyable issue. More specifically, I was thinking of using both a std::once_flag and a plain boolean is_initialized, and then calling a check() function in the hot path:

#include <mutex>

class Initializer {
public:
  inline void check()
    {
      if (!is_initialized) [[unlikely]]
        {
          // payload() stands in for the actual one-time initialization
          std::call_once(flag, [&]() { payload(); is_initialized = true; });
        }
    }
private:
  bool is_initialized = false;
  std::once_flag flag;
};

This has several advantages:

  • the exceptional case (if is_initialized is false) is synchronized correctly via std::call_once.
  • we can move and copy the object (by moving and copying the bool and creating a new std::once_flag)
  • the passive call (when nothing has to be done) is as fast as the previous version (by checking a boolean in an inlined function). The boolean doesn't need a memory fence; the worst case is an unnecessary call to std::call_once shortly after initialization.
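The copy and move semantics mentioned in the second bullet could look like the sketch below. This is not the actual deal.II prototype; the payload is passed into check() here purely to make the sketch self-contained. Copy and assignment transfer only the boolean, while the std::once_flag is always freshly default-constructed, which is what makes the wrapper copyable at all:

```cpp
#include <mutex>

// Sketch of a copyable wrapper around std::call_once (names illustrative).
class Initializer {
public:
  Initializer() = default;

  // Copy only the boolean; the once_flag starts fresh. An object copied
  // before initialization can therefore still run its payload exactly once.
  Initializer(const Initializer &other)
    : is_initialized(other.is_initialized)
  {}

  Initializer &operator=(const Initializer &other) {
    is_initialized = other.is_initialized;
    return *this;
  }

  template <typename Callable>
  void check(const Callable &payload) {
    if (!is_initialized)
      std::call_once(flag, [&]() { payload(); is_initialized = true; });
  }

private:
  bool is_initialized = false;
  std::once_flag flag; // neither copyable nor movable; recreated on copy
};
```

A copy taken after initialization carries is_initialized == true, so its check() never runs the payload again; a copy taken before initialization gets a fresh once_flag and synchronizes its own first call correctly.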

@tamiko (Member) commented Oct 24, 2023

@masterleinad Please give me 24h to implement the little helper struct outlined above :-)

@bangerth (Member Author)

@tamiko I don't think you need the is_initialized flag. std::once_flag is likely implemented in exactly this way, with an atomic flag that, if true, indicates that the action has run and, if false, requires locking a mutex. I would not bother with your class; I think we can do what we want with just standard facilities.

@tamiko (Member) commented Oct 24, 2023

@bangerth The std::once_flag unfortunately has much more overhead (a function call and a locking cmpxchg) than what I suggest. Also, hiding the move semantics requires some tricks, which already warrants a class. I'll create a pull request in a moment to show you the prototype.

@bangerth bangerth changed the title Access to FESystem restriction and prolongation matrices is not thread-safe. Access to FE* restriction and prolongation matrices is not thread-safe. Oct 24, 2023
@bangerth (Member Author)

On second look, it isn't just FESystem that uses this thread-unsafe scheme, but all of the other FE classes as well.

@tamiko (Member) commented Oct 27, 2023

@bangerth I plan to clean this up by wrapping Lazy around the prolongation and restriction matrices in the FiniteElement base class. This will require fixing all the other places at the same time.

Incidentally, I will also make the logic for whether we have prolongation matrices more explicit via Lazy<T>::has_value(), instead of checking the internal matrices for an appropriate size...
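A generic sketch of what such a Lazy<T>-style wrapper could look like, combining a std::once_flag with a std::optional (the actual deal.II class in #16204 may differ in interface and in how it handles copying):

```cpp
#include <mutex>
#include <optional>

// Sketch only: lazy, once-only initialization with an explicit has_value().
template <typename T>
class Lazy {
public:
  // Initializes the value on the first call (thread-safe via call_once);
  // all later calls return the cached value without running the creator.
  template <typename Callable>
  const T &value_or_initialize(const Callable &creator) {
    std::call_once(flag_, [&]() { value_.emplace(creator()); });
    return *value_;
  }

  // Replaces the old "is the matrix the right size?" test.
  bool has_value() const { return value_.has_value(); }

private:
  std::once_flag flag_;
  std::optional<T> value_;
};
```

The explicit has_value() is the point of the comment above: whether a matrix exists is no longer inferred from its dimensions.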

@bangerth (Member Author)

Excellent!

@bangerth (Member Author) commented Nov 8, 2023

This is a duplicate of #331. Let's close this one given that it has accumulated a bunch of other things that are difficult to read through when just looking for the original issue.

@bangerth bangerth closed this as completed Nov 8, 2023