
False sharing? #7

Open · crsaracco opened this issue Jul 30, 2020 · 4 comments
@crsaracco

From a quick-ish glance at RingBuffer, it looks like head and tail could possibly be falsely shared.

However, I tried to improve it by padding the atomics so they'd be on different cache lines, wrote a threaded benchmark, and failed to produce something faster than the current implementation (in fact, my version was usually slightly slower in benchmarks).
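For reference, the kind of padding I tried looks roughly like this (a minimal sketch assuming 64-byte cache lines; the type and field names are illustrative, not ringbuf's actual layout):

use std::sync::atomic::AtomicUsize;

// Force 64-byte alignment so each wrapped value starts its own cache
// line (64 bytes is typical on x86-64; other targets may differ).
#[repr(align(64))]
struct CacheAligned<T>(T);

// Keeps the two indices on separate cache lines, so a write to one
// does not invalidate the other in a different core's cache.
struct Indices {
    head: CacheAligned<AtomicUsize>, // written by the consumer
    tail: CacheAligned<AtomicUsize>, // written by the producer
}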

So, I guess this issue is either:

  • How is the current implementation getting around false sharing?
  • Or, if you haven't explicitly thought about false sharing before, perhaps look into it and see if you can make ringbuf even faster :)
@agerasev
Copy link
Owner

Thank you for pointing out this possible issue. To be honest, I haven't considered it.

But is the current implementation really subject to false sharing? Let me slightly rewrite the code sample from Wikipedia to roughly demonstrate what the current ring buffer logic looks like:

struct foo {
    int x; /* plays the role of `head`: written only by the consumer */
    int y; /* plays the role of `tail`: written only by the producer */
};

static struct foo f;

/* The two following functions are running concurrently: */

void consume(void)
{
    for (int i = 0; i < 1000000; ++i) {
        if (f.x < f.y) {
            f.x += 1;
        }
    }
}

void produce(void)
{
    for (int i = 0; i < 1000000; ++i) {
        if (f.y - f.x < 100) {
            f.y += 1;
        }
    }
}

Both the consumer and the producer read both head and tail (but each writes only one of them). I'm not sure this sharing is false, since both head and tail have to be synchronized between cores anyway, but maybe I missed something.

@crsaracco (Author)

crsaracco commented Jul 30, 2020

Okay, so I did some more reading on this, and the usual trick is to have a "cached" head and tail: [1] [2]. I missed this detail when I tried it yesterday.

Basically, you can reduce the cache-line contention by having the consumer update its knowledge of the producer's position less frequently (it only double-checks and updates when it thinks the queue might be empty). The same idea applies to the producer.
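In code, the consumer side of the trick could look something like this (a hypothetical sketch, not ringbuf's actual internals; in a real implementation the shared indices would live behind an Arc or similar):

use std::sync::atomic::{AtomicUsize, Ordering};

// Hypothetical consumer-side state; field names are illustrative.
struct Consumer {
    head: AtomicUsize, // written by this consumer, read by the producer
    tail: AtomicUsize, // written by the producer, read by this consumer
    tail_cache: usize, // consumer-local copy of `tail`
}

impl Consumer {
    // Returns the index of the next readable slot, or `None` if empty.
    fn next_read_index(&mut self) -> Option<usize> {
        let head = self.head.load(Ordering::Relaxed);
        if head == self.tail_cache {
            // Queue looks empty: this is the only place the shared
            // `tail` (the producer's cache line) is actually touched.
            self.tail_cache = self.tail.load(Ordering::Acquire);
            if head == self.tail_cache {
                return None; // really empty
            }
        }
        Some(head)
    }
}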

I'll play around with this a little bit more and see if I can get something faster in benchmarks. It does increase the complexity of the code a little bit, so you can decide if you want to actually use it (assuming I'm successful). :)

@agerasev (Owner)

@crsaracco, @mgeier, thank you for your interest! And sorry for such a late answer.

I agree that false sharing may be one of the issues. I've tried to fix it.

But the idea of updating the atomic indices less often while remaining transparent to the user seems to have a drawback: you cannot decide when to actually synchronize the indices. Suppose, for example, that you synchronize only when the buffer appears full or empty. In that case, what if the consumer thread needs to get just one byte, but has to wait until the buffer is full?

I believe the better approach is to synchronize the buffer state on every action (as ringbuf does), but if you don't need such frequent synchronization, you may accumulate data in some local buffer and push it all at once when needed. Another possible approach is to remove any implicit synchronization on push/pop entirely and require the user to flush the buffer state explicitly (maybe using some kind of guard that flushes the buffer when dropped), as in the sketch below.
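A rough sketch of that guard idea (hypothetical types, not ringbuf's API; Producer::publish stands in for copying a whole batch into the ring and advancing the shared atomic index exactly once):

// `Producer` is a stand-in for the real producer half of the ring.
struct Producer;

impl Producer {
    fn publish(&mut self, batch: &[u8]) {
        // A real implementation would copy `batch` into the ring and
        // advance the shared atomic tail index once.
        let _ = batch;
    }
}

// Accumulates writes locally and flushes them when dropped.
struct BatchGuard<'a> {
    local: Vec<u8>,
    producer: &'a mut Producer,
}

impl BatchGuard<'_> {
    fn push(&mut self, byte: u8) {
        self.local.push(byte); // no atomic traffic per element
    }
}

impl Drop for BatchGuard<'_> {
    fn drop(&mut self) {
        // A single synchronized index update for the whole batch.
        self.producer.publish(&self.local);
    }
}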

Anyway, thank you for your assistance!

@mgeier

mgeier commented May 19, 2021

Please note that caching the atomic indices is not a solution to the false sharing problem (but it might reduce the number of accesses to the atomic variables and thereby somewhat work around the problem).

The proper solution to false sharing is to make sure that the two indices don't share a cache line.

> But the idea of updating the atomic indices less often while remaining transparent to the user seems to have a drawback: you cannot decide when to actually synchronize the indices. Suppose, for example, that you synchronize only when the buffer appears full or empty. In that case, what if the consumer thread needs to get just one byte, but has to wait until the buffer is full?

I don't understand. I don't think that's what should happen.

The atomic indices should never be updated too late.
The idea is that once a thread observes that the other side's index has advanced by a lot, its further operations don't need to touch that atomic variable at all until the cached value is used up.

I'm not sure whether it's actually good to cache both indices or if caching only one of them is better; see mgeier/rtrb#48. But caching one (if it's the right one) definitely has a measurable effect.

The other options you mentioned are of course also possible.
