
Lockfree queue delivers data out of order #59

Open
daveboutcher opened this issue Jun 3, 2020 · 5 comments

@daveboutcher

This change, 3d45c00, introduced in Boost 1.71.0 to fix Valgrind issues, actually introduced a very subtle ordering bug in the queue implementation, which also exists in 1.72 and 1.73.

Specifically, the code previously did NOT initialize the tagged_node_handle next member on construction, which works correctly when a node is allocated, freed, and allocated again:

-            data(v)//, next(tagged_node_handle(0, 0))
+            data(v), next(tagged_node_handle(null_handle, 0))

By initializing the next pointer, the change introduces a classic ABA problem. From Wikipedia: the problem occurs "when a location is read twice, has the same value for both reads, and 'value is the same' is used to indicate 'nothing has changed'. However, another thread can execute between the two reads and change the value, do other work, then change the value back, thus fooling the first thread into thinking 'nothing has changed' even though the second thread did work that violates that assumption."

Reverting that single line fixes the issue in both 1.71 and 1.73.

The following program demonstrates the problem: data can be returned from the queue out of order. Note that reproducing this depends on system load and architecture, but I have reliably reproduced the problem on both server- and desktop-class systems by repeating the test 10,000 times.

// Boost queue stress test....all credits to Wes Darvin
#include <boost/lockfree/queue.hpp>
#include <algorithm>  // std::sort
#include <array>      // std::array
#include <iostream>
#include <thread>
#include <vector>

static constexpr int BLOCK_SIZE = 1000;
static constexpr int NUM_THREADS = 20;

// Enqueue BLOCK_SIZE values in order
void enqueueThreadMethod(boost::lockfree::queue<int32_t> *queue, int threadID) {
    int start = BLOCK_SIZE * threadID;
    for (int i = 0; i < BLOCK_SIZE; i++) {
        while (!queue->push(start + i))
            ;
    }
}

// Dequeue BLOCK_SIZE values.  Note that the values returned may come from
// different enqueuers, but for each enqueuer they should be in increasing
// order.
void dequeueThreadMethod(
    boost::lockfree::queue<int32_t> *queue,
    std::array<int32_t, NUM_THREADS * BLOCK_SIZE> *retValues, int threadID) {
    for (int i = 0; i < BLOCK_SIZE; i++) {
        while (!queue->pop((*retValues)[threadID * BLOCK_SIZE + i]))
            ;
    }
}

int main(int argc, char **argv) {
    std::array<int32_t, NUM_THREADS * BLOCK_SIZE> retValues;

    std::vector<std::thread> threadVector;

    boost::lockfree::queue<int32_t> queue(100);

    // Enqueue and dequeue relatively simultaneously. And yes, we could make
    // this more robust by adding a barrier, but it works as-is
    for (int i = 0; i < NUM_THREADS; i++) {
        threadVector.emplace_back(&enqueueThreadMethod, &queue, i);
        threadVector.emplace_back(&dequeueThreadMethod, &queue, &retValues, i);
    }

    for (std::thread &t : threadVector) {
        t.join();
    }

    // Now validate the values we dequeued.   All we are checking is
    // that for each enqueuer, its values only increase...i.e. we
    // are checking for out-of-order conditions
    for (int i = 0; i < NUM_THREADS; i++) {
        // vector of last-seen values by enqueuer. Each
        // entry should increase
        std::vector<int> localMax(NUM_THREADS, -1);

        // check the values dequeued by a single dequeuer
        for (int j = 0; j < BLOCK_SIZE; j++) {
            int localValue = retValues[i * BLOCK_SIZE + j];
            int index = localValue / BLOCK_SIZE;

            if (localValue < localMax[index]) {
                std::cout << "Out of order values: " << localValue << " < "
                          << localMax[index] << " i: " << i << ", j: " << j
                          << ", index: " << index << std::endl;
            }
            localMax[index] = localValue;
        }
    }

    // Now sort all the dequeued values and make sure there are none missing
    // or duplicated
    std::sort(retValues.begin(), retValues.end());
    for (int i = 0; i < (BLOCK_SIZE * NUM_THREADS); i++) {
        if (retValues[i] != i) {
            std::cout << "Error checking sorted values at index " << i
                      << std::endl;
        }
    }

    return 0;
}
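A build-and-loop sketch for the reproducer above (the file name aba_test.cpp and the assumption that Boost headers are on the default include path are mine; adjust as needed):

```shell
# Build the stress test; std::array/constexpr require at least C++11,
# and the threads need -pthread.
g++ -O2 -std=c++11 -pthread aba_test.cpp -o aba_test

# Repeat until the reordering is observed, as described above.
for i in $(seq 1 10000); do
    ./aba_test | grep -q "Out of order" && echo "reproduced on run $i" && break
done
```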
@timblechmann
Collaborator

thanks a lot for this very detailed test case! i need to look into this in detail! while i can reproduce it here, i'm wondering where the actual bug is coming from.
i have a prototype implementation here that tries to replace the zero-initialisation with a generated-number ABA prevention tag, but i want to double-check the michael/scott paper to see where things are going wrong

timblechmann added a commit that referenced this issue Jun 3, 2020
@daveboutcher
Author

Well, I think the issue is the interaction of the pool and the queue... we definitely shouldn't zero a tagged node when it goes through the pool... though to be honest, I think the same issue could occur if a node was returned to and reallocated from the heap.

@itrofimow

Hi!
This is fixed in 1.74 by df78b9d, am i right?

@daveboutcher
Author

doubt it... losing the tag bits (i.e. memsetting everything to zero all the time) is exactly what causes this.

@itrofimow

> doubt it... losing the tag bits (i.e. memsetting everything to zero all the time) is exactly what causes this.

I might be mistaken, but my understanding is that all df78b9d does is revert the erroneous 3d45c00 change and zero-initialize some memory at allocation time to prevent Valgrind errors, nothing more.

I don't have recent versions of Boost at hand, but maybe you could rerun your test to validate it?

@timblechmann could you please shed some light on this?
