Lockfree queue delivers data out of order #59
Comments
thanks a lot for this very detailed test case! i need to have a look into this in detail! while i can reproduce it here, i'm wondering, where the actual bug is coming from.
Well, I think the issue is the interaction of the pool and the queue...we definitely shouldn't zero a tagged node when it goes through the pool....though to be honest, I think the same issue could occur if a node was returned and reallocated from the heap
Hi!
doubt it...losing the tag bits (i.e. memsetting everything to zero all the time) is exactly what causes this
I might be mistaken, but my understanding is that all df78b9d does is revert the erroneous 3d45c00 and initialize some memory to zeroes at allocation to prevent valgrind errors, nothing more. I don't have recent versions of boost at hand, but maybe you could rerun your test to validate it? @timblechmann could you please shed some light on this?
This change, 3d45c00, introduced in boost 1.71.0 to fix valgrind issues, actually introduced a very subtle ordering issue in the queue implementation which also exists in 1.72 and 1.73.
Specifically, the code previously did NOT initialize the `tagged_node_handle` `next` on construction, which works beautifully when a node is allocated/freed/allocated. By initializing the next pointer, the change introduces a classic ABA problem (from Wikipedia, "when a location is read twice, has the same value for both reads, and 'value is the same' is used to indicate 'nothing has changed'. However, another thread can execute between the two reads and change the value, do other work, then change the value back, thus fooling the first thread into thinking 'nothing has changed' even though the second thread did work that violates that assumption.").
Reverting that single line fixes the issue in both 1.71 and 1.73.
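To illustrate why wiping the tag matters, here is a minimal single-threaded sketch of the ABA scenario described above. It is not the Boost.Lockfree code: `Node`, `pack`, `tag_of`, `construct`, `stale_cas_succeeds`, and the 32-bit index / 32-bit tag layout are all hypothetical names and choices made for illustration. The sketch assumes a node's constructor derives its new tag from whatever tag is already stored in `next`, so a recycled node only gets a fresh tag if the old one was left in place.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical layout: low 32 bits = node index, high 32 bits = ABA tag.
constexpr std::uint32_t kNull = 0xFFFFFFFFu;

constexpr std::uint64_t pack(std::uint32_t index, std::uint32_t tag) {
    return (static_cast<std::uint64_t>(tag) << 32) | index;
}
constexpr std::uint32_t tag_of(std::uint64_t handle) {
    return static_cast<std::uint32_t>(handle >> 32);
}

struct Node {
    std::atomic<std::uint64_t> next{pack(kNull, 0)};
};

// (Re)construct a pooled node: next becomes (null, old tag + 1).
// With zero_first == true the old handle is wiped before the tag is read,
// which models initializing `next` on every construction.
void construct(Node& n, bool zero_first) {
    if (zero_first) n.next.store(0, std::memory_order_relaxed);
    std::uint64_t old = n.next.load(std::memory_order_relaxed);
    n.next.store(pack(kNull, tag_of(old) + 1), std::memory_order_release);
}

// Returns true if a CAS based on a stale snapshot wrongly succeeds.
bool stale_cas_succeeds(bool zero_first) {
    Node n;
    construct(n, zero_first);            // first use of the node
    std::uint64_t seen = n.next.load();  // "thread A" snapshots next, then stalls
    construct(n, zero_first);            // node is freed and reused in the meantime
    // Thread A resumes and tries to link node 7 behind the node it believes it saw.
    return n.next.compare_exchange_strong(seen, pack(7, tag_of(seen) + 1));
}
```

In this model, `stale_cas_succeeds(false)` returns `false` because the recycled node carries a newer tag than the snapshot, while `stale_cas_succeeds(true)` returns `true` because every construction restarts from the same tag and the stale compare-and-swap cannot tell the difference.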
The following program demonstrates the problem where data can be returned from the queue out of order. Note that recreating this depends on system load and system architecture, but I have reliably recreated the problem on both server and desktop class systems by repeating this test 10,000 times.