Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reclaim inactive participant data #70

Closed
wants to merge 9 commits into from

Conversation

gereeter
Copy link

@gereeter gereeter commented Apr 29, 2016

The code that was removed in #42 was almost correct - it removed inactive participants from the list and tried to guard against multiple treads removing the same node. However, it guarded for that in an exactly backwards way: unlinked needs to be called by the first thread to remove the node, and so should run if the compare_and_swap succeeded. However, since compare_and_swap returns the previous value, it succeeds iff it returns false, not true. Thus, nodes were only unlinked by threads that weren't alone in removing the same participant. This also explains why the bug was so hard to exhibit - to get a double free, three threads had to simultaneously remove the same node.

I tested this with the slammer script posted in #42 - while the buggy code would segfault quickly (within a minute), I left the slammer running with this patched applied for over half an hour and it never failed.

This is implemented using a variation on Tim Harris's linked list removal algorithm.

Fixes #37.

@msullivan
Copy link
Contributor

Lurr, the compare_and_swap inversion is pretty silly. That said, I'm pretty sure I convinced myself that the algorithm was wrong even without that inversion. I'll think about it some.

@msullivan
Copy link
Contributor

Yeah, I think that this can leave unlinked nodes in the list:
Imagine you have a list head -> A -> B -> <whatever>.
T1 is iterating through it and has next = &A->next. Then, thread A exits, and T2 comes along, sees that A has exited, unlinks A from the list, and then iterates down the rest and wraps up. The list is now head -> B -> <whatever> but T1 still has a pointer to A and so is looking at a list like: A -> B -> <whatever>.

Then thread B exits. So T1 unlinks it from A, sets the flag, and unlinks it, then exits its loop because it has reached the end of the list.

But B only got unlinked from A, which isn't on the actual list anyways. So the thread list remains head -> B -> <whatever> even though B has been unlinked. Run through a couple collections and B can get reused as something else and everything will be super sad.

@gereeter
Copy link
Author

gereeter commented May 1, 2016

Right - just because I found one bug doesn't mean there aren't others.

I completely scrapped the old code and used essentially Tim Harris's linked list removal algorithm instead. This involved introducing a new MarkedAtomic type.

@gereeter
Copy link
Author

gereeter commented May 1, 2016

Oddly, though, the slammer script, even after running for at least 16 hours on my original attempt (which, as you pointed out, was still buggy), never reported a crash.

@msullivan
Copy link
Contributor

This seems plausible to me. Am I right in thinking that the Scan for an active node part is an optimization and would work if it always just picked the next node instead? (Since the active node that gets found could of course become inactive immediately after the check.) Perhaps it would be worth the code simplification to ditch that?

Maybe we need a test script that more heavily stresses thread creation/destruction to make these bugs more visible.

@arthurprs
Copy link

Is there anything specific blocking this? Otherwise it's a great fix.

@arthurprs arthurprs mentioned this pull request Jan 16, 2017
@Vtec234
Copy link
Member

Vtec234 commented Jan 16, 2017

Hey! It appears that this already adds the functionality I proposed in #102. It also has support for more things, notably arbitrarily sized markers and or_mark. OTOH, my branch has IMO better docs and a function I need - from_ptr. Ideally we could keep the best parts. If you are willing to merge it, I can write up a patch adding my functionality to this PR and then I could close mine.

@gereeter
Copy link
Author

I... completely forgot about this PR. As far as I know, there isn't anything really blocking this.

@msullivan I believe that it is just an optimization, but I think I'd prefer to just stick as close to Tim Harris's described algorithm as possible.

@Vtec234 That sounds reasonable to me.

@gereeter
Copy link
Author

Thanks to @Vtec234, I think the TaggedAtomic type introduced here is good for general consumption, beyond just fixing the leak of participant data.

@gereeter
Copy link
Author

gereeter commented Mar 7, 2017

Ping. It'd be nice to have #37 fixed.

@aturon
Copy link
Collaborator

aturon commented Mar 9, 2017

FYI, https://internals.rust-lang.org/t/crossbeam-request-for-help/4933 -- I'm actively looking to hand over maintenance, as I simply do not have the time to give this library attention.

@aturon
Copy link
Collaborator

aturon commented Mar 9, 2017

@arthurprs @gereeter @msullivan in particular I'd love to bring any of you on as maintainers, if you're game.

@jeehoonkang
Copy link
Contributor

Hi! I just wonder if we could merge atomic.rs and tagged_atomic.rs. Theoretically TaggedAtomic provides strictly larger set of features than Atomic, and it would be burdensome to maintain two copies of the common functionalities.

We can simply name it Atomic, as it will not be a breaking change.

@Vtec234
Copy link
Member

Vtec234 commented Apr 7, 2017

Can do.

EDIT: On second thought, I am not sure this is a good idea. Notice that e.g. load on Atomic returns Shared, while on TaggedAtomic it returns (Shared, usize). There are more instances of this. Changing load would be a breaking change. An option is to add load_with_tag and make load just ignore that tag. There is (probably negligible) overhead related to decoding the pointer, but more importantly grafting tagging functionality onto the Atomic type might be confusing. There are also problems like 'how do we deal with store'? Overwriting just the pointer while preserving the tag requires a CAS, which is costly, and ignoring the tag (maybe defaulting it to 0 when just the pointer is set) might not be what we want and would have to be explained in the docs, making it all the more confusing when you don't care about tagging. So I am in favour of slightly more maintenance burden, but cleaner API.

@arthurprs
Copy link

Is it a good idea? I'm not sure all methods interact well across tag/normal use.

@jeehoonkang jeehoonkang mentioned this pull request Jan 20, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

6 participants