
Faster mutex implementation (optionally reentrant) #8295

Merged
merged 1 commit into from Oct 9, 2019

Conversation

@waj
Member

waj commented Oct 8, 2019

I made some improvements to the Mutex implementation to optimize the time it takes to lock/unlock and hand off ownership to waiting fibers.

The current implementation is quite slow (especially in MT mode) because it always tries to be extremely fair by passing ownership to the first fiber in the waiting queue. But that fiber might be running in a thread that is already sleeping, and waking up threads is costly and slow. So now the fiber is still woken up, but it doesn't automatically receive ownership of the lock. Instead it competes again with other fibers that might not be sleeping yet. Also, the lock will spin for some time before sleeping, allowing small critical sections to be acquired without going through the waiting queue at all.

So, the changes included in this PR are:

  • The lock will spin some time before sleeping
  • The Mutex is not reentrant by default. Reentrancy requires keeping track of the locking fiber. A non-reentrant mutex just stores an atomic bit, and it's much faster to "swap" the content than to do an atomic CAS operation (a rough sketch follows this list).
  • Ownership is not passed to the waiting fibers. They are just woken up and will try again to obtain the lock.
  • An atomic count of the number of waiting fibers is kept, so the lock over the queue can be optimistically skipped. This makes the unlock faster if no fibers are waiting.
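
For illustration, here is a minimal, hypothetical sketch of that fast path (the class and constant names are made up; this is not the code from src/mutex.cr). A non-reentrant lock only needs an atomic flag, so acquiring it is a single atomic swap rather than a CAS, and the lock spins for a while before putting the fiber to sleep. The waiter count and wait queue from the last bullet are omitted here:

class SpinningLock
  SPIN_THRESHOLD = 100 # assumed tuning constant

  def initialize
    @held = Atomic(Int32).new(0)
  end

  def lock : Nil
    loop do
      # Fast path: swap 1 in; if the previous value was 0 we now own the lock.
      return if @held.swap(1) == 0

      # Spin for a while, hoping the current owner releases the lock soon.
      SPIN_THRESHOLD.times do
        return if @held.get == 0 && @held.swap(1) == 0
      end

      # Give other fibers a chance to run. The real implementation enqueues
      # the fiber in a wait queue and suspends it instead of just yielding.
      Fiber.yield
    end
  end

  def unlock : Nil
    @held.set(0)
  end

  def synchronize
    lock
    begin
      yield
    ensure
      unlock
    end
  end
end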

I was running a benchmark https://gist.github.com/waj/0c4b5835af088e8921fee4cc2c6006ed#file-bm_lock-cr (provided by @carlhoerberg, thanks Carl!) and these are the results:

Before changes:

$ crystal run --release bm_lock.cr
Mutex.synchronize  78.03M ( 12.82ns) (± 1.39%)  0.0B/op  fastest

$ crystal run --release -D preview_mt bm_lock.cr
Mutex.synchronize 110.29k (  9.07µs) (± 2.15%)  0.0B/op  fastest

After changes:

$ bin/crystal run --release bm_lock.cr
Mutex.synchronize 107.36M (  9.31ns) (± 2.08%)  0.0B/op  fastest

$ bin/crystal run --release -D preview_mt bm_lock.cr
Mutex.synchronize  40.15M ( 24.91ns) (±41.89%)  0.0B/op  fastest

I was also running another test (https://gist.github.com/waj/0c4b5835af088e8921fee4cc2c6006ed#file-test-mutex-cr) to evaluate the correctness of the mutex:

Before changes:

$ crystal run --release test-cr.cr -- 1000000
00:00:00.042806000
4000000

$ crystal run --release -D preview_mt test-cr.cr -- 1000000
00:00:18.215747000
4000000

After changes:

$ bin/crystal run --release test-cr.cr -- 1000000
00:00:00.029512000
4000000

$ bin/crystal run --release -D preview_mt test-cr.cr -- 1000000
00:00:00.124867000
4000000
@ysbaddaden

Member

ysbaddaden commented Oct 8, 2019

now the fiber is still woken up, but it doesn't automatically receive ownership of the lock. Instead it competes again with other fibers that might not be sleeping yet.

Weird, we usually want the exact opposite to avoid contention: waking all waiters will lead them to fight again to get the lock... when n - 1 of them will just go back to sleep.

@waj

Member Author

waj commented Oct 8, 2019

@ysbaddaden this is not waking all waiters. It just wakes one of the fibers, and that fiber will have to compete with those not yet sleeping, if any exist. If there are none, it will acquire the lock right away. Otherwise, if it loses, it will spin a little bit once more and hopefully get the lock before going back to sleep. It might sound unfair for that fiber, but the global throughput is improved because it reduces the time the mutex is not owned by any non-sleeping fiber.

@ysbaddaden

Member

ysbaddaden commented Oct 8, 2019

Ah, thanks for the clarification. It makes sense now. Yes, spinning is always a good idea :)

@waj waj force-pushed the waj:mt/faster-mutex-2 branch from d90f0ff to b1276c7 Oct 9, 2019
@waj waj merged commit 048444e into crystal-lang:master Oct 9, 2019
5 of 6 checks passed
ci/circleci: test_linux32 Your tests failed on CircleCI
ci/circleci: check_format Your tests passed on CircleCI!
ci/circleci: test_darwin Your tests passed on CircleCI!
ci/circleci: test_linux Your tests passed on CircleCI!
ci/circleci: test_preview_mt Your tests passed on CircleCI!
continuous-integration/travis-ci/pr The Travis CI build passed
@waj waj deleted the waj:mt/faster-mutex-2 branch Oct 9, 2019
@GavinRay97


GavinRay97 commented Oct 10, 2019

Microseconds to nanoseconds?! 😮

@RX14

Member

RX14 commented Oct 21, 2019

I really don't think having non-reentrant (non-recursive) be the default is a good idea. What happens if you accidentally recurse or unlock an unlocked lock? A deadlock? Corruption? The only acceptable state is an exception, or having a recursive lock. There may be an unsafe option to make the lock much faster at the cost of having to be extra careful with your critical sections, but safety must be the default, not speed. The stdlib is nowhere near the only consumer of Mutex.

@asterite

Member

asterite commented Oct 21, 2019

I agree. Crystal should always put safety before performance.

@RX14

Member

RX14 commented Oct 28, 2019

@waj

Member Author

waj commented Oct 29, 2019

If you are using mutexes it is because you're dealing with "unsafe" code. I mean, there is some shared state that must be protected with a mutex in order to deal with potential race conditions. In most cases a non-reentrant mutex is all you need. A non-reentrant mutex is not owned by any fiber, and that means it can be unlocked by a different fiber. That is actually a feature of the mutex: you can lock it, pass a message to another fiber, assume ownership there, and unlock it later.

If you accidentally try to recursively lock a mutex it will deadlock and the program will fail, much faster and more deterministically than the failures that can occur from not locking at all.

In other words, we're not adding safety just by making the mutex reentrant; a non-reentrant mutex is much faster and all you need in most cases. That was the rationale behind the decision to make the non-reentrant option the default.
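
To illustrate the hand-off pattern described above, here is a minimal sketch, assuming the behaviour introduced in this PR where the mutex is not tied to the fiber that locked it and may therefore be unlocked by a different fiber:

mutex = Mutex.new
log = [] of String
done = Channel(Nil).new

mutex.lock
log << "prepared by the main fiber"

spawn do
  # This fiber continues the critical section started by the main fiber
  # and is the one that releases the lock.
  log << "processed by the worker fiber"
  mutex.unlock
  done.send nil
end

done.receive
mutex.synchronize { puts log.join(" -> ") }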

@RX14

Member

RX14 commented Oct 29, 2019

If you are using mutexes it is because you're dealing with "unsafe" code

Not "unsafe" in terms of "can generate segfaults". Mutexes will be commonly used in shard code.

Misuse of a mutex in a shard should be debuggable. The minimal requirement for that is an exception instead of a deadlock. I'm fine with them not being reentrant, but they must fail gracefully.

@bcardiff

Member

bcardiff commented Oct 29, 2019

The story of having better diagnostics/debugging tools is independent of the API the mutex should have. Using a non-reentrant mutex as a reentrant one will block rather than segfault, so the behavior is on the safe/sound side. The performance difference and the more common need for the non-reentrant case seem enough to stay with the optionally reentrant API.

@asterite

Member

asterite commented Oct 29, 2019

I think this will mostly hit users in a silent way when doing stuff like:

# called from multiple functions
def helper_function
  @mutex.synchronize do
    # some data
  end
end

def main_function
  @mutex.synchronize do
    do_something
    helper_function # Oops, deadlock, not easy to debug
  end
end
@jhass

Member

jhass commented Oct 29, 2019

Would it cost too much performance to require an explicit lock owner, akin to Java's synchronized? It shouldn't be more than a pointer comparison and an assignment, right? Something like

def lock(owner)
  raise "Already locked" if @current_owner == owner.object_id
  real_lock
  @current_owner = owner.object_id
end

Or actually a CAS operation for that.

Then Ary's example, using self as owner for both calls, would raise.

@bcardiff

Member

bcardiff commented Oct 29, 2019

But if you are using a lock that is held, you want to wait for it to be free, not fail immediately. Maybe a reasonable timeout in debug mode could help, but I am not sure I even like that alternative.

Or do you mean that only for a non-reentrant mutex, to signal a "hey, you might need a reentrant mutex"?

@jhass

Member

jhass commented Oct 29, 2019

Yes, the reentrant one wouldn't raise in that case, or even do anything with the owner I guess (other than keeping it for debug purposes maybe).

@RX14

Member

RX14 commented Oct 29, 2019

It's not about debugging tools, it's about being able to debug in the first place. Users will not appreciate having to dig out gdb on a production server to debug their deadlocks. This is especially annoying because hung-process detection and restarting is far less common and trickier to set up than auto-restart on crash. They'd much prefer it if the process just crashed and gave them a backtrace to investigate.

If you can have a lock owner, you can have the default lock owner be the current fiber object - which already provides the desired semantics.

Is a benchmark of the non-reentrant vs reentrant available?

@asterite

Member

asterite commented Nov 22, 2019

I still don't understand why we have deadlocking but more performant behavior as the default, instead of safer but slower behavior. One can always optimize a bottleneck once it's found. Having to deal with deadlocks or crashes upfront is not fun. But we'll see with feedback.

@RX14

Member

RX14 commented Nov 22, 2019

Still:

Is a benchmark of the non-reentrant vs reentrant available?

@ysbaddaden

Member

ysbaddaden commented Nov 23, 2019

With a reentrant mutex you must access Fiber.current, thus access a thread local, then make more writes to memory when locking and unlocking. For the tight usages we make of Mutex, it's necessarily faster not to have all these writes. I had 2 mutexes in my MT experiment, one private (no validation) and one public (deadlock detection, optionally reentrant) wrapping the private one. Needless to say, the private one was much more performant (for tight usages, such as in Channel, ...).

I wish there was a third option, on by default: detect and raise on deadlock or on the wrong fiber unlocking the mutex. Then you know there is an issue and have a backtrace to understand it, or you can use a reentrant mutex if that's acceptable (though they're a smell for me: they paper over a bug, and can still lead to deadlock when using lock/unlock), instead of a hanging process.

I agree with Ary: validation shouldn't be opt-in. You can disable all checks if there is a performance issue (or you're a boss), though outside of tight benchmarks/loops it may not be that significant (except for MT synchronization primitives).

@RX14

Member

RX14 commented Nov 23, 2019

I wish there was a third option, on by default: detect and raise on deadlock or on the wrong fiber unlocking the mutex. Then you know there is an issue and have a backtrace to understand it, or you can use a reentrant mutex if that's acceptable (though they're a smell for me: they paper over a bug, and can still lead to deadlock when using lock/unlock), instead of a hanging process.

This is what I have been asking for all along... Have I been unclear?

The users of mutexes could opt out of this checking (.new(unsafe_deadlock: true)).

@lribeiro


lribeiro commented Nov 23, 2019

That would be my preference as well, safe by default, fast by need.
.new(reentrant: false) seems clear

@RX14

Member

RX14 commented Nov 23, 2019

Well, there's non-reentrancy, and skipping deadlock checks in unsafe code for speed. They're different things to me (feature vs safety).

@ysbaddaden

Member

ysbaddaden commented Nov 25, 2019

Yes, and that's why I insisted above on having all 3 states:

  1. check by default: raise on deadlock + wrong fiber unlocking (safe);
  2. reentrant option (may still deadlock when skipping an unlock, but otherwise safe);
  3. disable checks option: full speed (unsafe: can deadlock + any fiber can unlock).

I.e. similar to the pthreads options, but checked by default. It prevents deadlocks by default and reports where a deadlock or invalid unlock occurred, without having to reproduce it in GDB. It keeps some flexibility with the reentrant option, and it won't impact performance when checks are disabled (we already check for reentrancy).
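
For concreteness, a rough sketch of what those three modes could look like (the Protection enum and CheckedMutex class are hypothetical names, not necessarily the API that was eventually adopted):

enum Protection
  Checked   # default: raise on recursive lock and on unlock by another fiber
  Reentrant # the same fiber may lock again; unlocks must balance the locks
  Unchecked # no bookkeeping at all: fastest, but can deadlock silently
end

class CheckedMutex
  def initialize(@protection : Protection = Protection::Checked)
    @owner = Atomic(UInt64).new(0_u64)
    @lock_count = 0
  end

  def lock : Nil
    me = Fiber.current.object_id
    if @owner.get == me
      case @protection
      when .checked?
        raise "Deadlock: mutex already locked by this fiber"
      when .reentrant?
        @lock_count += 1
        return
      end
      # Unchecked: fall through and deadlock, just like a raw lock would.
    end
    # Real acquisition would spin and then sleep; simplified to a busy loop.
    until @owner.compare_and_set(0_u64, me)[1]
      Fiber.yield
    end
  end

  def unlock : Nil
    if @protection.checked? && @owner.get != Fiber.current.object_id
      raise "Attempt to unlock a mutex locked by another fiber"
    end
    if @protection.reentrant? && @lock_count > 0
      @lock_count -= 1
      return
    end
    @owner.set(0_u64)
  end
end

With the checked mode as the default, Ary's helper_function example above would raise instead of hanging, while a reentrant instance would simply keep going.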

@jhass

Member

jhass commented Dec 2, 2019

Does somebody want to open an issue about this so we don't forget it?

@RX14

Member

RX14 commented Dec 2, 2019

@jhass this is blocking the release so it certainly won't be forgotten

@asterite

Member

asterite commented Dec 2, 2019

Is it blocking the release? I think we can release it like that for 0.32.0 and adjust in 0.32.1 or 0.33.0

@RX14

Member

RX14 commented Dec 4, 2019

It's too easy to forget about a regression if it makes it into a release.
