This repository has been archived by the owner on Dec 15, 2022. It is now read-only.

Resolve conflicts during remote insertion integration in O(log n) #4

Open: 6 commits proposed for merging into base branch master


as-cii
Contributor

as-cii commented Nov 13, 2017

This pull request proposes a modification to the current O(k²) insertion integration algorithm, where k is the number of insertions that occurred concurrently with the insertion being integrated.

The previous approach was based entirely on the algorithm proposed by Yu in 2014, itself a revision of the original WOOT algorithm from 2006. Both algorithms recursively compare directly conflicting insertions to derive the tree of causal dependencies and apply an order that is consistent across all sites and therefore converges.

In 2011, Roh et al. independently introduced replicated abstract data types (RADTs), and in the same paper they presented a replicated growable array (RGA) that resolves conflicts in a completely different manner. Rather than reconstructing the causal relationships among the conflicting insertions, they proposed an O(k) algorithm that relies on the notions of operation commutativity and precedence transitivity. The motivated reader can dive into the paper to see how these two concepts yield convergence, but in a nutshell the idea is to always preserve the intention of operations that causally occur later; when a causal relation can't be established, an arbitrary but consistent order is chosen instead. This requires a mechanism that establishes a total order among operations, such as a Lamport clock plus a tie-breaking rule, or the S4Vector type described in the paper. Note that "preserving intention" when inserting into an array means placing the new element closer to its left dependency. The problem of finding the proper insertion point when integrating a remote insertion can therefore be stated as follows:

Find the leftmost segment after the left dependency that satisfies either one of the following conditions:

  1. Has a Lamport clock that is strictly smaller than the new insertion's Lamport clock.
  2. Has a Lamport clock that is equal to the new insertion's Lamport clock, but originated from a site with a smaller identifier.
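
A minimal sketch of this rule as a linear scan may help make it concrete. It is not the code in this pull request: the `Segment` type, the `yields_to` helper, and the use of plain list indices for the left and right dependencies are all assumptions made for illustration. Conditions 1 and 2 together amount to a lexicographic comparison of `(clock, site)` pairs.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Segment:
    """Hypothetical segment: only the fields needed for ordering."""
    clock: int  # Lamport clock of the insertion that created the segment
    site: int   # identifier of the site the insertion originated from


def yields_to(existing: Segment, new: Segment) -> bool:
    """True if `existing` satisfies condition 1 or condition 2 relative to `new`."""
    # Tuple comparison is lexicographic: a smaller clock wins first,
    # and on equal clocks the smaller site id decides.
    return (existing.clock, existing.site) < (new.clock, new.site)


def find_insertion_index(segments: List[Segment], left: int, right: int,
                         new: Segment) -> int:
    """Scan the segments strictly between the two dependencies.

    `left` and `right` are list indices of the left and right dependencies
    (an assumption of this sketch). Returns the index at which `new`
    should be inserted.
    """
    for i in range(left + 1, right):
        if yields_to(segments[i], new):
            return i
    # No segment qualifies, so the new segment goes right before
    # the right dependency.
    return right


# Two concurrent segments sit between the dependencies at indices 0 and 3.
document = [Segment(1, 1), Segment(5, 2), Segment(3, 1), Segment(1, 1)]
print(find_insertion_index(document, 0, 3, Segment(4, 3)))  # prints 2
```

In the example, the segment with clock 5 does not yield to the new insertion (clock 4), but the segment with clock 3 does, so the new segment lands at index 2.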

We could search for such a segment in linear time, and that's the approach taken by the original RGA algorithm. With this pull request, however, we take advantage of our BST representation of the document model to perform this query in logarithmic time. We do so by augmenting each node of the document model with the smallest Lamport clock found in its subtree, along with the smallest site id from which that clock originated. This way, we can binary search the document and skip those portions of the tree where we know that neither 1) nor 2) holds.
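
The augmented query can be sketched roughly as follows. This is an illustration under several assumptions, not the code in this pull request: it uses a static, perfectly balanced tree built from a list instead of a splay tree, it assumes the tree contains only the segments between the two dependencies, and the names `Node`, `min_key`, `build`, and `leftmost_smaller` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Key = Tuple[int, int]  # (Lamport clock, site id), compared lexicographically


@dataclass
class Node:
    key: Key
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    min_key: Key = (0, 0)  # smallest key in this subtree, set by update()


def update(node: Node) -> None:
    """Recompute the subtree minimum from the node and its children."""
    candidates = [node.key]
    for child in (node.left, node.right):
        if child is not None:
            candidates.append(child.min_key)
    node.min_key = min(candidates)


def build(keys: List[Key]) -> Optional[Node]:
    """Build a balanced tree whose in-order traversal equals `keys`."""
    if not keys:
        return None
    mid = len(keys) // 2
    node = Node(keys[mid], build(keys[:mid]), build(keys[mid + 1:]))
    update(node)
    return node


def leftmost_smaller(root: Optional[Node], new_key: Key) -> Optional[Node]:
    """Return the leftmost node (in document order) whose key < new_key."""
    node = root
    if node is None or not node.min_key < new_key:
        return None  # no node in the tree satisfies condition 1 or 2
    while True:
        # Invariant: the subtree rooted at `node` contains a qualifying node.
        if node.left is not None and node.left.min_key < new_key:
            node = node.left   # an earlier qualifying node exists on the left
        elif node.key < new_key:
            return node        # nothing qualifies on the left; this node does
        else:
            node = node.right  # by the invariant, the match is on the right


# Keys listed in document order.
root = build([(5, 2), (7, 1), (3, 1), (6, 4), (2, 9)])
print(leftmost_smaller(root, (4, 3)).key)  # prints (3, 1)
```

Each step of the loop descends one level, so the query visits at most one node per level of the tree. Subtrees whose `min_key` is not smaller than the new key are skipped entirely, which is the pruning described above.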

The time complexity of this approach is O(log n) in the best case because, even if we only consider the subtree located between the left and right dependencies (which will contain k segments if other sites have concurrently inserted between the exact same dependencies), the overall cost is dominated by the splay operations. The approach could degrade to O(k) if the subtree between the left and right dependencies is heavily unbalanced.

/cc: @nathansobo @jasonrudolph

@as-cii
Contributor Author

as-cii commented Nov 13, 2017

Randomized tests pass and this solution seems to work correctly. However, I propose we merge this post-launch, after we have had the chance to manually test it too.

@nathansobo
Contributor

@as-cii One thing I've been pondering is how the splay tree interacts with this. How balanced will the subtree between the left and right dependencies end up being? It does add some wrinkles to the analysis.

@as-cii
Contributor Author

as-cii commented Nov 13, 2017

> One thing I've been pondering is how the splay tree interacts with this. How balanced will the subtree between the left and right dependencies end up being? It does add some wrinkles to the analysis.

That's a really good point, and we can't really exploit amortization in this circumstance. However, in the worst case, I think it just means that the algorithm degrades to linear time. It might be interesting to perform a before/after benchmark and see how performance changes as a consequence of these changes. This makes me even more curious about exploring using a balanced BST for this data structure.

@nathansobo
Contributor

> This makes me even more curious about exploring using a balanced BST for this data structure.

It seems like if you don't splay, then you can't really isolate a single subtree between the left and right dependencies anyway.

as-cii changed the title from "Resolve conflicts during remote insertion integration in O(log k)" to "Resolve conflicts during remote insertion integration in O(log n)" on Nov 21, 2017
@as-cii
Contributor Author

as-cii commented Nov 21, 2017

I have changed this pull request to reflect that the cost of this approach would still be dominated by the splay operation, which has an O(log n) amortized cost, and that, in the worst case, the approach could degrade to O(k) time.

With that in mind, I think that by using a perfectly-balanced binary search tree we could:

  1. Avoid splaying altogether. Instead, run a lowest-common-ancestor query between the left and right dependencies and start our search from that node. This costs O(log n) because in the worst case we would need to traverse from the leaves up to the root of the tree.
  2. Temporarily set minClock = Infinity on the left dependency and propagate this change up to the lowest common ancestor. This ensures that the algorithm proposed here always goes to the right of the left dependency. This costs O(log n) because in the worst case we would need to traverse from a leaf up to the root of the tree.
  3. Run the query to find the successor. This has the typical O(log n) cost.
  4. Restore the changes made in 2). Again, this costs O(log n).

Overall this would cost O(log n) in the worst case, although constant factors might dominate when the number of nodes is small. We could also add a fast path that checks whether there is any node at all between the left and right dependencies and, if not, simply inserts the new node.
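
To illustrate step 1 and the fast path only, here is a rough sketch. It assumes every node keeps a parent pointer, which is not stated anywhere in this thread, and the names `Node`, `build`, `index`, `lowest_common_ancestor`, `successor`, and `is_adjacent` are hypothetical. Steps 2 to 4 are deliberately left out.

```python
from typing import Dict, List, Optional


class Node:
    def __init__(self, label: str) -> None:
        self.label = label
        self.left: Optional["Node"] = None
        self.right: Optional["Node"] = None
        self.parent: Optional["Node"] = None  # assumed; needed to walk upward


def build(labels: List[str], parent: Optional[Node] = None) -> Optional[Node]:
    """Build a balanced tree whose in-order traversal equals `labels`."""
    if not labels:
        return None
    mid = len(labels) // 2
    node = Node(labels[mid])
    node.parent = parent
    node.left = build(labels[:mid], node)
    node.right = build(labels[mid + 1:], node)
    return node


def index(node: Optional[Node],
          out: Optional[Dict[str, Node]] = None) -> Dict[str, Node]:
    """Map each label to its node (helper for the example only)."""
    out = {} if out is None else out
    if node is not None:
        index(node.left, out)
        out[node.label] = node
        index(node.right, out)
    return out


def depth(node: Node) -> int:
    """Number of edges between `node` and the root."""
    d = 0
    while node.parent is not None:
        node = node.parent
        d += 1
    return d


def lowest_common_ancestor(a: Node, b: Node) -> Node:
    """Step 1: walk both nodes upward until they meet, in O(height) time."""
    da, db = depth(a), depth(b)
    while da > db:
        a, da = a.parent, da - 1
    while db > da:
        b, db = b.parent, db - 1
    while a is not b:
        a, b = a.parent, b.parent
    return a


def successor(node: Node) -> Optional[Node]:
    """Next node in document order, or None if `node` is the last one."""
    if node.right is not None:
        node = node.right
        while node.left is not None:
            node = node.left
        return node
    while node.parent is not None and node is node.parent.right:
        node = node.parent
    return node.parent


def is_adjacent(left: Node, right: Node) -> bool:
    """Fast path: True when nothing lies between the two dependencies."""
    return successor(left) is right


nodes = index(build(list("abcdefg")))
print(lowest_common_ancestor(nodes["c"], nodes["e"]).label)  # prints d
print(is_adjacent(nodes["c"], nodes["d"]))                   # prints True
```

On a balanced tree the height is O(log n), so both the ancestor query and the adjacency check stay logarithmic without any splaying.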

3 participants