This repository has been archived by the owner. It is now read-only.

Use optimistic reads in AtomicCell #39

Merged
merged 4 commits into crossbeam-rs:master from stjepang:optimistic-read on Aug 9, 2018

Conversation

stjepang (Member) commented Jul 31, 2018

When using global locks, two concurrent load operations on the same AtomicCell will contend on the same lock, which severely hurts performance. This PR changes the load operations to use optimistic reads instead. The idea is basically the same as in the seqlock crate.

Benchmarks:

 name                    before ns/iter  after ns/iter  diff ns/iter   diff %  speedup
 compare_and_swap_u8     9               8                        -1  -11.11%   x 1.12
 compare_and_swap_usize  6               6                         0    0.00%   x 1.00
 concurrent_load_u8      25,239,685      716,611         -24,523,074  -97.16%  x 35.22
 concurrent_load_usize   355,220         346,668              -8,552   -2.41%   x 1.02
 fetch_add_u8            9               8                        -1  -11.11%   x 1.12
 fetch_add_usize         6               6                         0    0.00%   x 1.00
 load_u8                 8               1                        -7  -87.50%   x 8.00
 load_usize              0               0                         0     NaN%    x NaN
 store_u8                9               8                        -1  -11.11%   x 1.12
 store_usize             6               6                         0    0.00%   x 1.00

Note the difference in load_u8 and concurrent_load_u8: that's what this PR is about.
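To make the idea concrete, here is a minimal reader-side sketch of the seqlock scheme in the spirit of this PR (hypothetical names; not the actual crossbeam code, which also has to serialize writers):

```rust
use std::cell::UnsafeCell;
use std::sync::atomic::{fence, AtomicUsize, Ordering};

// `state` holds a stamp: even when unlocked, odd while a write is in progress.
struct SeqLock<T> {
    state: AtomicUsize,
    data: UnsafeCell<T>,
}

unsafe impl<T: Send> Sync for SeqLock<T> {}

impl<T: Copy> SeqLock<T> {
    fn new(value: T) -> Self {
        SeqLock {
            state: AtomicUsize::new(0),
            data: UnsafeCell::new(value),
        }
    }

    // Optimistic read: no lock is taken, so concurrent readers never contend.
    fn load(&self) -> T {
        loop {
            // Snapshot the stamp; an odd stamp means a writer is active.
            let stamp = self.state.load(Ordering::Acquire);
            if stamp & 1 == 0 {
                // Volatile read, because a writer may modify the data concurrently.
                let value = unsafe { std::ptr::read_volatile(self.data.get()) };
                fence(Ordering::Acquire);
                // If the stamp is unchanged, no write raced with our read.
                if self.state.load(Ordering::Relaxed) == stamp {
                    return value;
                }
            }
            // A writer interfered: retry the whole read.
        }
    }
}
```

Since readers only load the shared stamp and never write to it, concurrent loads no longer bounce a lock's cache line between cores, which is where the 35x speedup in concurrent_load_u8 comes from.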

@stjepang stjepang requested review from Amanieu and jeehoonkang Jul 31, 2018
@stjepang stjepang force-pushed the stjepang:optimistic-read branch from 8ea2c44 to 2eb9ede Jul 31, 2018
Contributor

left a comment

This is great! Just a few minor issues.

}

if step < 5 {
// Just try again.

Amanieu (Contributor) commented Jul 31, 2018

Always use spin_loop_hint if you are in a spin loop.
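For illustration, a retry loop of the shape suggested here might look like this (the step threshold and backoff amounts are illustrative only; `spin_loop_hint` has since been renamed to `std::hint::spin_loop`, which this sketch uses):

```rust
// Hypothetical backoff helper, not the crate's actual code.
fn backoff(step: u32) {
    if step < 5 {
        // Busy-wait briefly; the hint tells the CPU we are in a spin loop so
        // it can save power or yield pipeline resources to a sibling
        // hyperthread instead of speculating aggressively.
        for _ in 0..1u32 << step {
            std::hint::spin_loop();
        }
    } else {
        // After several failed attempts, yield to the OS scheduler instead of
        // burning CPU in the spin loop.
        std::thread::yield_now();
    }
}
```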

#[inline]
fn validate_read(&self, stamp: usize) -> bool {
atomic::fence(Ordering::Acquire);
self.state.load(Ordering::SeqCst) == stamp

Amanieu (Contributor) commented Jul 31, 2018

The load here can be Relaxed: the ordering is already enforced by the fence.

In terms of ordering: we only need to ensure that state.load is not reordered before the volatile_read.
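Concretely, the suggested version of the validation step might look like this (a minimal hypothetical sketch, not the crate's actual code):

```rust
use std::sync::atomic::{fence, AtomicUsize, Ordering};

struct Lock {
    state: AtomicUsize,
}

impl Lock {
    #[inline]
    fn validate_read(&self, stamp: usize) -> bool {
        // The Acquire fence prevents the stamp load below from being
        // reordered before the preceding volatile read of the protected data,
        // so the load itself can be Relaxed.
        fence(Ordering::Acquire);
        self.state.load(Ordering::Relaxed) == stamp
    }
}
```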

stjepang (Author, Member) commented Jul 31, 2018

The SeqCst here is just to make the operation SeqCst, like the lock-free atomic one.

Amanieu (Contributor) commented Jul 31, 2018

I'm not exactly sure if that's enough to make the operation equivalent to SeqCst.

cc @jeehoonkang

stjepang (Author, Member) commented Jul 31, 2018

My reasoning is:

If this SeqCst load returns a version that is not locked and is equal to the version before reading, then we've read a value that is still the same as the value present at the exact moment this SeqCst load happened. So this load operation is when the AtomicCell::load really "atomically" happens.

But yeah, I wonder what @jeehoonkang thinks, too. :)

jeehoonkang (Collaborator) commented Jul 31, 2018

Sequence locks (stamped locks) are... strange. I think the best introduction to the seqlock and its strangeness is Hans Boehm's slides. I think we should implement "version 3" from those slides, as (1) it's correct, as far as Hans (and I) can tell, and (2) it's the most efficient scheme among those implementable in C/C++.

According to the slide, (1) this load can be Relaxed, and (2) the write-lock should issue a Release fence. (2) is necessary even if this load is SeqCst: otherwise, bad reorderings may happen on the writer side.

stjepang (Author, Member) commented Aug 2, 2018

@jeehoonkang

> The high-level explanation is that two consecutive load(SeqCst) are not strictly ordered, while Mutex<Cell> guarantees two consecutive reads are strictly ordered.

Just to make sure I understood, can you also answer whether the following statements are true or false?

  1. Two consecutive load(SeqCst) are strictly ordered by the C++ standard.
  2. Two consecutive load(SeqCst) are strictly ordered on x86-64.
  3. Two consecutive load(SeqCst) are not strictly ordered on ARM.

> The release fence should be issued at the end of write_lock(), not inside write_unlock().

So we just need to change the ordering of the swap operation inside write() from Acquire to AcqRel?

jeehoonkang (Collaborator) commented Aug 2, 2018

Oh, maybe what I said was not clear. Sorry. By "consecutive loads" I mean ordered by (extended) coherence, not by intra-thread program order. Suppose two threads A and B are reading the same value from an AtomicCell<T>. If it's lock-free, two reads are not ordered at all in C++, ARMv8, and ARMv7. (In x86, probably the order doesn't matter at the beginning.) But if it's based on a spinlock, then two loads are strictly ordered, either A's read before B's read or the other way around.

Regarding seqlock, acquire and release fences should be issued. AFAIK, in "version 3", W issues a release fence right after updating the stamp value in write_lock(), and R issues an acquire fence just before reading the stamp value in the validation phase. If R reads any value in the protected data written by W, then there's an acquire/release synchronization between those fences, forcing R to read the stamp value updated by W (or a later one).

stjepang (Author, Member) commented Aug 2, 2018

> If it's lock-free, two reads are not ordered at all in C++, ARMv8, and ARMv7.

Doesn't at least the C++ standard guarantee that two SeqCst reads are always ordered (sequential consistency)?

> W issues a release fence right after updating the stamp value in write_lock()

Did you mean "right before"? (so that writes happen before the stamp value is updated)

> If R reads any value in the protected data written by W, then there's an acquire/release synchronization between those fences

Why not make the swap operation by W use AcqRel so that this operation establishes acquire/release synchronization with the fence issued by R? That might be slightly faster than issuing a fence by W. Or is that not going to work?

jeehoonkang (Collaborator) commented Aug 3, 2018
  • SeqCst loads and stores have very vague meaning in C++. As a theorem (not normative), C++ guarantees that if all atomic accesses are SeqCst and there's no data race for non-atomic locations in a program, then the program's semantics is sequentially consistent (so-called "DRF1" or "DRF-SC"; here "DRF" stands for data-race freedom). But when even a single access to an irrelevant location is e.g. Relaxed, then the program's semantics may no longer be sequentially consistent.

    In short, the semantics of SeqCst loads and stores is surprisingly weaker than expected. In particular, it's strictly weaker than spinlock-protected accesses. For the latest discussion on this subject, the RC11 paper is worth reading.

  • I meant "right after", not "right before". Also, a release swap will not work. (Yes, seqlock is strange...) The synchronization occurs via the data variables, not the stamp variable: the writer issues a release fence and then writes a value to data, and if the reader reads the new value from data and then issues an acquire fence, then after that the reader must read the new stamp value written by the writer (or a later one).
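A writer-side sketch of this fence placement might look like the following (hypothetical names; it assumes a single writer, whereas the real code serializes writers through the lock):

```rust
use std::cell::UnsafeCell;
use std::sync::atomic::{fence, AtomicUsize, Ordering};

struct SeqLockWriter {
    state: AtomicUsize,
    data: UnsafeCell<u64>,
}

impl SeqLockWriter {
    fn new(value: u64) -> Self {
        SeqLockWriter {
            state: AtomicUsize::new(0),
            data: UnsafeCell::new(value),
        }
    }

    fn write(&self, value: u64) {
        let stamp = self.state.load(Ordering::Relaxed);
        // Mark the write in progress (odd stamp).
        self.state.store(stamp.wrapping_add(1), Ordering::Relaxed);
        // Release fence right AFTER updating the stamp: a reader that observes
        // the data written below and then issues an Acquire fence must also
        // observe the new stamp (or a later one), so its validation fails.
        // Synchronization flows through the data, not the stamp, which is why
        // a release swap on the stamp alone would not be enough.
        fence(Ordering::Release);
        unsafe { std::ptr::write_volatile(self.data.get(), value) };
        // Unlock with the next even stamp, releasing the data write as well.
        self.state.store(stamp.wrapping_add(2), Ordering::Release);
    }
}
```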

stjepang (Author, Member) commented Aug 8, 2018

@jeehoonkang

> But when even a single access to an irrelevant location is e.g. Relaxed, then the program's semantics may no longer be sequentially consistent.

Whoa, that's really surprising!

By the way, in the previous pull request you said:

> I think SeqCst obscures the specification: what do we expect from the methods of AtomicCell? I think the word "sequentially consistent" doesn't really help, because practically none understand the term precisely. Even worse, using SeqCst may guarantee some ordering properties which users will rely on, after which we're stuck and cannot optimize it further.

I'd just like to point out that someone might rely on the fact that spinlock-protected AtomicCell has sequentially consistent semantics, but then we might switch from spinlock to e.g. AtomicU16 (once it gets into stable Rust) and thus relax the semantics behind the scenes. So that's a similar problem.

Anyway... I pushed some changes to this PR:

  • The load in the validation step now uses Relaxed rather than SeqCst.
  • Added a Release fence in write.

Does the code look good now?

Collaborator

left a comment

Thanks! I think it's good to go.

@Amanieu what do you think?


// Try doing an optimistic read first.
if let Some(stamp) = lock.optimistic_read() {
// We need a volatile read here because other threads might concurrently modify the

jeehoonkang (Collaborator) commented Aug 9, 2018

From an extremely pedantic point of view (which I don't agree with that much), a volatile read here is still UB in C/C++. But everyone uses volatile reads/writes to mark an access as concurrent, including Linux for starters. Maybe a comment on it would be helpful to readers or ASAN?

stjepang (Author, Member) commented Aug 9, 2018

I added a more elaborate comment explaining the issue.

Amanieu (Contributor) commented Aug 9, 2018

LGTM

@stjepang stjepang merged commit 5123ec4 into crossbeam-rs:master Aug 9, 2018
1 check passed: continuous-integration/travis-ci/pr (The Travis CI build passed)
@stjepang stjepang deleted the stjepang:optimistic-read branch Aug 9, 2018
@stjepang stjepang referenced this pull request Sep 1, 2018