New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement implicit merge policies #148

Closed
aboodman opened this Issue Jul 29, 2015 · 27 comments

Comments

Projects
None yet
5 participants
@aboodman
Contributor

aboodman commented Jul 29, 2015

After #147 is done, datastore will enforce that all writes be serialized. Boo :(.

To regain some concurrency, we should implement simple automatic merge policies for our datastructures. What I'm imagining is something where we take a pair of concurrent mutations of the tree, along with the previous version of the tree, an consider the three of them recursively, applying some simple rules like:

  • If any of the three nodes are different types, conflict:
  • If we are dealing with a set:
    • If the same object is both removed and inserted wrt parent: conflict
  • If we are dealing with a map:
    • If the same key is both removed and inserted wrt parent: conflict
    • If the same key is inserted wrt parent, but with different values: conflict
  • If we are dealing with a struct:
    • same as map
  • If we are dealing with a list:
    • Same as map but substitute "index" for "key"
    • Note that this means that things like "slice" expensive in terms of concurrency

... otherwise, the concurrent modification is allowed.

Another way to look at this (I think) is that the operations we will automatically merge effectively make our datastructures into CRDTs (https://en.wikipedia.org/wiki/Conflict-free_replicated_data_type).

@cmasone-attic

This comment has been minimized.

Show comment
Hide comment
@cmasone-attic

cmasone-attic Aug 25, 2015

Contributor

It seems like mostly what people will do for a list is append, and simultaneous appends would fail according to these rules. Problem is I can see arguments for a lot of different policies...maybe it's fine to just append everything if different values are added wrt parent, but what if the same value is appended by both mutations?

Contributor

cmasone-attic commented Aug 25, 2015

It seems like mostly what people will do for a list is append, and simultaneous appends would fail according to these rules. Problem is I can see arguments for a lot of different policies...maybe it's fine to just append everything if different values are added wrt parent, but what if the same value is appended by both mutations?

@aboodman

This comment has been minimized.

Show comment
Hide comment
@aboodman

aboodman Aug 25, 2015

Contributor

Yeah I think list is the hardest one.

@rafael-atticlabs said he did some magical stuff that was better than this policy when he was working on a similar thing for Blink. Maybe he can say more about that. I could sort of see extending the policy for the special-case of append to make multiple concurrent appends OK. But I bet raf's thing did that and was more general at same time.

Contributor

aboodman commented Aug 25, 2015

Yeah I think list is the hardest one.

@rafael-atticlabs said he did some magical stuff that was better than this policy when he was working on a similar thing for Blink. Maybe he can say more about that. I could sort of see extending the policy for the special-case of append to make multiple concurrent appends OK. But I bet raf's thing did that and was more general at same time.

@jaseg

This comment has been minimized.

Show comment
Hide comment
@jaseg

jaseg Aug 3, 2016

Since you already had a look at CRDTs: Shapiro et al. wrote down a number of CRDTs with their merge strategies in their "Comprehensive Study of Convergent and Commutative Replicated Data Types". This might contain some inspiration.

I think generally it would make sense to allow the user to specify a merge strategy for a node when creating said node. In some cases, one might want a conflict to be raised on concurrent modification, in other cases e.g. LWW using the committing device's local wall-clock time would be appropriate. LWW would e.g. be a sensible choice when synchronizing multiple devices that all belong to the same user, since it can be assumed that this user is a) somewhat aware of what modifications she did on each device and b) only using one device at a time.

AFAICT adopting this kind of automated merge policy would make noms data structures into state-based CRDTs, with the difference that noms always stores all history. AFAICT there would be two things that a tighter integration of CRDT semantics could bring:

  • Depending on the data type one might be able to use knowledge about the merge procedure to reduce the amount of data transferred during sync
  • The automated merge procedure might be used to compute efficient deltas from full states in order to reduce history storage size (effectively transforming historical data from a state-based "CvRDT" to a operation-based "CmRDT").

jaseg commented Aug 3, 2016

Since you already had a look at CRDTs: Shapiro et al. wrote down a number of CRDTs with their merge strategies in their "Comprehensive Study of Convergent and Commutative Replicated Data Types". This might contain some inspiration.

I think generally it would make sense to allow the user to specify a merge strategy for a node when creating said node. In some cases, one might want a conflict to be raised on concurrent modification, in other cases e.g. LWW using the committing device's local wall-clock time would be appropriate. LWW would e.g. be a sensible choice when synchronizing multiple devices that all belong to the same user, since it can be assumed that this user is a) somewhat aware of what modifications she did on each device and b) only using one device at a time.

AFAICT adopting this kind of automated merge policy would make noms data structures into state-based CRDTs, with the difference that noms always stores all history. AFAICT there would be two things that a tighter integration of CRDT semantics could bring:

  • Depending on the data type one might be able to use knowledge about the merge procedure to reduce the amount of data transferred during sync
  • The automated merge procedure might be used to compute efficient deltas from full states in order to reduce history storage size (effectively transforming historical data from a state-based "CvRDT" to a operation-based "CmRDT").
@aboodman

This comment has been minimized.

Show comment
Hide comment
@aboodman

aboodman Aug 4, 2016

Contributor

@jaseg, thanks for chiming in. I agree with both your points.

I was thinking originally that CRDT research could somehow help us merge set and map mutations better. But I'm not seeing it.

But maybe it could help with list mutations? I want to re-read the section about sequences in that paper.

Contributor

aboodman commented Aug 4, 2016

@jaseg, thanks for chiming in. I agree with both your points.

I was thinking originally that CRDT research could somehow help us merge set and map mutations better. But I'm not seeing it.

But maybe it could help with list mutations? I want to re-read the section about sequences in that paper.

@jaseg

This comment has been minimized.

Show comment
Hide comment
@jaseg

jaseg Aug 7, 2016

What using the CRDTs specified in that paper gives you is at least some correctness guarantee. Conflicts (concurrent modification, concurrent add) must be resolved by some application-specific means, CRDTs cannot help there. In the end you need to consider the specific use case and allow for some application-level merge policy. I thing last-write-wins will be useful in many cases.

As for sequence types, there is the "Replicated Growable Array" specified in section 3.5.1 (p. 35ff) of the "Comprehensive Study". Their model e.g. solves the non-obvious problem that when synchronizing a list across multiple hosts, an integer is not sufficient to describe an element's position in the list.

jaseg commented Aug 7, 2016

What using the CRDTs specified in that paper gives you is at least some correctness guarantee. Conflicts (concurrent modification, concurrent add) must be resolved by some application-specific means, CRDTs cannot help there. In the end you need to consider the specific use case and allow for some application-level merge policy. I thing last-write-wins will be useful in many cases.

As for sequence types, there is the "Replicated Growable Array" specified in section 3.5.1 (p. 35ff) of the "Comprehensive Study". Their model e.g. solves the non-obvious problem that when synchronizing a list across multiple hosts, an integer is not sufficient to describe an element's position in the list.

@cmasone-attic

This comment has been minimized.

Show comment
Hide comment
@cmasone-attic

cmasone-attic Aug 15, 2016

Contributor

After reading some survey papers about CRDTs, I think the semantics of these data structures are generally unfamiliar to most users; e.g. a Set in which elements can only be added, removed, and then never added again (barring some synchronization-requiring “garbage collection” step). We want to provide traditional data structures that programmers are familiar with, even if that means conflict resolution must be done manually in some cases. We think we can provide a few options of conflict resolution policies for each core data structure kind in noms that should meet most needs.

That said, we're still interested in perhaps using some form of sequence crdt for merging lists -- this is still under investigation. The Replicated Growable Array data structure you mention relies on timestamps, which means that it's not robust in the face of clock skew. Without having something like GPS on all nodes, getting reliable timestamps in a distributed system is another whole hard problem to tackle :-) We're still looking at the "continuum"-based sequences in that paper, though.

Contributor

cmasone-attic commented Aug 15, 2016

After reading some survey papers about CRDTs, I think the semantics of these data structures are generally unfamiliar to most users; e.g. a Set in which elements can only be added, removed, and then never added again (barring some synchronization-requiring “garbage collection” step). We want to provide traditional data structures that programmers are familiar with, even if that means conflict resolution must be done manually in some cases. We think we can provide a few options of conflict resolution policies for each core data structure kind in noms that should meet most needs.

That said, we're still interested in perhaps using some form of sequence crdt for merging lists -- this is still under investigation. The Replicated Growable Array data structure you mention relies on timestamps, which means that it's not robust in the face of clock skew. Without having something like GPS on all nodes, getting reliable timestamps in a distributed system is another whole hard problem to tackle :-) We're still looking at the "continuum"-based sequences in that paper, though.

cmasone-attic added a commit to cmasone-attic/noms that referenced this issue Aug 15, 2016

cmasone-attic added a commit to cmasone-attic/noms that referenced this issue Aug 15, 2016

cmasone-attic added a commit to cmasone-attic/noms that referenced this issue Aug 15, 2016

cmasone-attic added a commit to cmasone-attic/noms that referenced this issue Aug 18, 2016

Go: Implement best-effort three-way merge for Maps
Add the Go 'merge' package, which exports one function called ThreeWay().
ThreeWay() attempts a three-way merge between two candidates and a common
ancestor. It considers the three of them recursively, applying some simple rules:

- If any of the three nodes are different types: conflict
- If we are dealing with a map:
  - If the same key is both removed and inserted wrt parent: conflict
  - If the same key is inserted wrt parent, but with different values: conflict
- If we are dealing with a struct:
  - same as map
- If we are dealing with a list:
  - Same as map but substitute "index" for "key"
- If we are dealing with a set:
  - If the same object is both removed and inserted wrt parent: conflict
... otherwise, the concurrent modification is allowed.

Currently, ThreeWay() only works on types.Map.

Towards attic-labs#148

cmasone-attic added a commit to cmasone-attic/noms that referenced this issue Aug 18, 2016

Go: Implement best-effort three-way merge for Maps
Add the Go 'merge' package, which exports one function called ThreeWay().
ThreeWay() attempts a three-way merge between two candidates and a common
ancestor. It considers the three of them recursively, applying some simple rules:

- If any of the three nodes are different types: conflict
- If we are dealing with a map:
  - If the same key is both removed and inserted wrt parent: conflict
  - If the same key is inserted wrt parent, but with different values: conflict
- If we are dealing with a struct:
  - same as map
- If we are dealing with a list:
  - Same as map but substitute "index" for "key"
- If we are dealing with a set:
  - If the same object is both removed and inserted wrt parent: conflict
... otherwise, the concurrent modification is allowed.

Currently, ThreeWay() only works on types.Map.

Towards attic-labs#148

cmasone-attic added a commit to cmasone-attic/noms that referenced this issue Aug 19, 2016

Go: Implement best-effort three-way merge for Maps
Add the Go 'merge' package, which exports one function called ThreeWay().
ThreeWay() attempts a three-way merge between two candidates and a common
ancestor. It considers the three of them recursively, applying some simple rules:

- If any of the three nodes are different kinds: conflict
- If we are dealing with a map:
  - If the same key is both removed and inserted wrt parent: conflict
  - If the same key is inserted wrt parent, but with different values: conflict
- If we are dealing with a struct:
  - same as map
- If we are dealing with a list:
  - Same as map but substitute "index" for "key"
- If we are dealing with a set:
  - If the same object is both removed and inserted wrt parent: conflict
... otherwise, the concurrent modification is allowed.

Currently, ThreeWay() only works on types.Map.

Towards attic-labs#148
@aboodman

This comment has been minimized.

Show comment
Hide comment
@aboodman

aboodman Aug 22, 2016

Contributor

Here is a new strawman for list merge. This thinks of the edits people are making to lists as semantically splices, not edits.

This isn't well-specified, but I hope the general idea makes sense.

Given three commits, X, Y, and P (shared parent):

  1. Decide which commit to apply first. We arbitrarily choose the one with the smaller hash. The only thing that matters is that it is deterministic. Call this one C1 and the other one C2.
  2. Derive splices that represent the change from P->C1 and P->C2. Call these S1 and S2.
    • Semantically, think of these splices as "delete {num-elements} and insert {new-elements} before element {reference-point}"
    • There would have to be a special sentinel reference-point for "end-of-list"
  3. Apply all of S1 to P
  4. Update the reference-points in all of S2 to reflect the changes S1 made, so that they still refer to the correct elements
    • e.g., if S1 contains "insert {foo, bar} at 3" and S2 contains "insert {baz} at 4", then S2 would be updated to contain "insert {baz} at 6", because the element at 4 in P is at 6 after S1 is applied.
    • Note that this might result in empty splices in the case where, e.g., both commits removed the same elements. That's fine.
    • If any of the insertions in S2 have reference points which were removed by S1, conflict.
  5. Apply all the splices in S2.

Note that the way I describe it here is by applying all of S1, then S2, but in real life with large lists, you'd want to do these in parallel. I haven't worked out the algorithm to do that.

Also, for the record, I think you could extend this to be a true CRDT by picking an arbitrary resolution in the case of conflict (e.g., adjusting the position where elements are inserted), but I'm not sure that is desirable.

Contributor

aboodman commented Aug 22, 2016

Here is a new strawman for list merge. This thinks of the edits people are making to lists as semantically splices, not edits.

This isn't well-specified, but I hope the general idea makes sense.

Given three commits, X, Y, and P (shared parent):

  1. Decide which commit to apply first. We arbitrarily choose the one with the smaller hash. The only thing that matters is that it is deterministic. Call this one C1 and the other one C2.
  2. Derive splices that represent the change from P->C1 and P->C2. Call these S1 and S2.
    • Semantically, think of these splices as "delete {num-elements} and insert {new-elements} before element {reference-point}"
    • There would have to be a special sentinel reference-point for "end-of-list"
  3. Apply all of S1 to P
  4. Update the reference-points in all of S2 to reflect the changes S1 made, so that they still refer to the correct elements
    • e.g., if S1 contains "insert {foo, bar} at 3" and S2 contains "insert {baz} at 4", then S2 would be updated to contain "insert {baz} at 6", because the element at 4 in P is at 6 after S1 is applied.
    • Note that this might result in empty splices in the case where, e.g., both commits removed the same elements. That's fine.
    • If any of the insertions in S2 have reference points which were removed by S1, conflict.
  5. Apply all the splices in S2.

Note that the way I describe it here is by applying all of S1, then S2, but in real life with large lists, you'd want to do these in parallel. I haven't worked out the algorithm to do that.

Also, for the record, I think you could extend this to be a true CRDT by picking an arbitrary resolution in the case of conflict (e.g., adjusting the position where elements are inserted), but I'm not sure that is desirable.

cmasone-attic added a commit to cmasone-attic/noms that referenced this issue Aug 22, 2016

Go: Add three-way merge for types.Struct
ThreeWay() is now willing to merge Structs as well as Maps, as long
as the Structs are all named the name thing. Other rules are the same
as for Map.

Toward attic-labs#148

cmasone-attic added a commit to cmasone-attic/noms that referenced this issue Aug 22, 2016

Go: Add three-way merge for types.Struct
ThreeWay() is now willing to merge Structs as well as Maps, as long
as the Structs are all named the name thing. Other rules are the same
as for Map.

Toward attic-labs#148

cmasone-attic added a commit that referenced this issue Aug 24, 2016

Go: Add three-way merge for types.Struct (#2403)
merge.ThreeWay() is now willing to merge Structs as well as Maps, as long
as the Structs are all named the name thing. Other rules are the same
as for Map.

Toward #148
@rafael-atticlabs

This comment has been minimized.

Show comment
Hide comment
@rafael-atticlabs

rafael-atticlabs Aug 25, 2016

Contributor

Right. @aboodman's model of the world matches mine pretty well. I would propose approaching the basic merge algorithm like this:

  • Given Conflicting Commits, X & Y, let P be the most recent common ancestor Commit.
  • Let Lx and Ly be respective lists within X and Y that require merging relative to a common ancestor list Lp
  • Let Sx & Sy be lists of splices which result from processing the list diff from Lp -> Lx and Lp -> Ly, respectively.
  • Let Lm be Lp
  • While Sx and Sy are both non-empty
    • Let s1 be the first splice in Sx
    • Let s2 be the first splice in Sy
    • If Overlap(s1, s2)
      • If CanMerge(s1, d2)
        • Let s3 be result of Merge(s1, s2)
        • Apply s3 to Lm and adjust remaining splices in Lx and Ly
      • Else, Conflict
    • Else
      • Apply whichever of s1 and s2 has a lower splice index to Lm and adjust remaining splices in Lx and Ly.
  • Apply remaining splices in either Sx and Sy to Lm

The basic idea here is that list diff will produce an ordered set of splices which transform one list into another. We take two such diffs which transform one initial state into two different final states and process a kind of ordered merge of the splices. By processing the two "front" splices of each diff we can kind of define the problem recursively and reason about merge policies with examples of simple cases of 1 vs 1 splices.

The merge policy we implement will be a function of defining Overlap, CanMerge and Merge.

Note that "adjusting remaining splices" is always the same. The basic idea is that all remaining splice indexes are offset by the net change in length which resulted from applying a splice. For example. If we apply a splice which removed 2 elements and added 3, we would add 1 to the index position of splices remaining to be processed.

Also note that the implementation of this would probably want to process over all the splices to see if any conflicts will arise and only then start producing the merged list, knowing that no conflicts will arise.

Contributor

rafael-atticlabs commented Aug 25, 2016

Right. @aboodman's model of the world matches mine pretty well. I would propose approaching the basic merge algorithm like this:

  • Given Conflicting Commits, X & Y, let P be the most recent common ancestor Commit.
  • Let Lx and Ly be respective lists within X and Y that require merging relative to a common ancestor list Lp
  • Let Sx & Sy be lists of splices which result from processing the list diff from Lp -> Lx and Lp -> Ly, respectively.
  • Let Lm be Lp
  • While Sx and Sy are both non-empty
    • Let s1 be the first splice in Sx
    • Let s2 be the first splice in Sy
    • If Overlap(s1, s2)
      • If CanMerge(s1, d2)
        • Let s3 be result of Merge(s1, s2)
        • Apply s3 to Lm and adjust remaining splices in Lx and Ly
      • Else, Conflict
    • Else
      • Apply whichever of s1 and s2 has a lower splice index to Lm and adjust remaining splices in Lx and Ly.
  • Apply remaining splices in either Sx and Sy to Lm

The basic idea here is that list diff will produce an ordered set of splices which transform one list into another. We take two such diffs which transform one initial state into two different final states and process a kind of ordered merge of the splices. By processing the two "front" splices of each diff we can kind of define the problem recursively and reason about merge policies with examples of simple cases of 1 vs 1 splices.

The merge policy we implement will be a function of defining Overlap, CanMerge and Merge.

Note that "adjusting remaining splices" is always the same. The basic idea is that all remaining splice indexes are offset by the net change in length which resulted from applying a splice. For example. If we apply a splice which removed 2 elements and added 3, we would add 1 to the index position of splices remaining to be processed.

Also note that the implementation of this would probably want to process over all the splices to see if any conflicts will arise and only then start producing the merged list, knowing that no conflicts will arise.

@rafael-atticlabs

This comment has been minimized.

Show comment
Hide comment
@rafael-atticlabs

rafael-atticlabs Aug 25, 2016

Contributor

Now for the merge policy. I think there is some room for debate here. I propose the following two cases which I hope are uncontreversial

Case 1: 
s: a b c d e // start list
l: a 1 c d e [1, 1, 1] // "left" change, [SpAt, SpRemoved, SpAdded, SpFrom] (see edit_distance.go)
r: a b c 2 e [3, 1, 2] // "right" change
m: a 1 c 3 e // merged list

I.e. I think this is the simplest example of a merge which should probably be allowed in any merge policy

Case 2: 
s: a b c
l: a b 1 c [2, 0, 1]
r: a b 2 c [2, 0, 2]
=> conflict // is the result [a b 1 2 c] or [a b 2 1 c]?

i.e. I think this is the simplest example of a merge conflict. The only way to make the implied "operations" commutative would be to use "external" information, such as commit time (i.e. last write wins) -- which we probably don't want to do with our default merge policy because it should be attempting to merge operations which aren't logical conflicts.

From here it gets more fuzzy. I'm tempted to feel like we should take one of two basic paths (maybe eventually implementing both and allowing for switchable behavior). One path is conservative and the other is liberal.

In order to reason about the policies, here are some cases to consider:

Case 3 (conflicting appends)
a b c
a b c 1 [3, 0, 1]
a b c 2 [3, 0, 2]
Case 4 (delete overlapping, different index)
a b c d
a b c  [3, 1]
a b    [2, 2]
Case 5 (delete same, insert prefix)
a b c
a b 1   [2, 1, 1]
a b 1 2 [2, 1, 1, 2]
Case 6:
a b c d
a b 1 2 [2, 2, 1, 2]
a b c 1 [3, 1, 1]
Case 7 (delete same, no inserts)
a b c
a c [1, 1]
a c [1, 1]
Case 8 (delete same, identical inserts)
a b c
a 1 2 c [1, 1, 1, 2]
a 1 2 c [1, 1, 1, 2]
Case 9
a b c
a 1 2 c [1, 1, 1, 2]
a b 3   [2, 1, 3]
Case 10
a b c
a c     [1, 1]
a 1 b c [1, 0, 1]
Contributor

rafael-atticlabs commented Aug 25, 2016

Now for the merge policy. I think there is some room for debate here. I propose the following two cases which I hope are uncontreversial

Case 1: 
s: a b c d e // start list
l: a 1 c d e [1, 1, 1] // "left" change, [SpAt, SpRemoved, SpAdded, SpFrom] (see edit_distance.go)
r: a b c 2 e [3, 1, 2] // "right" change
m: a 1 c 3 e // merged list

I.e. I think this is the simplest example of a merge which should probably be allowed in any merge policy

Case 2: 
s: a b c
l: a b 1 c [2, 0, 1]
r: a b 2 c [2, 0, 2]
=> conflict // is the result [a b 1 2 c] or [a b 2 1 c]?

i.e. I think this is the simplest example of a merge conflict. The only way to make the implied "operations" commutative would be to use "external" information, such as commit time (i.e. last write wins) -- which we probably don't want to do with our default merge policy because it should be attempting to merge operations which aren't logical conflicts.

From here it gets more fuzzy. I'm tempted to feel like we should take one of two basic paths (maybe eventually implementing both and allowing for switchable behavior). One path is conservative and the other is liberal.

In order to reason about the policies, here are some cases to consider:

Case 3 (conflicting appends)
a b c
a b c 1 [3, 0, 1]
a b c 2 [3, 0, 2]
Case 4 (delete overlapping, different index)
a b c d
a b c  [3, 1]
a b    [2, 2]
Case 5 (delete same, insert prefix)
a b c
a b 1   [2, 1, 1]
a b 1 2 [2, 1, 1, 2]
Case 6:
a b c d
a b 1 2 [2, 2, 1, 2]
a b c 1 [3, 1, 1]
Case 7 (delete same, no inserts)
a b c
a c [1, 1]
a c [1, 1]
Case 8 (delete same, identical inserts)
a b c
a 1 2 c [1, 1, 1, 2]
a 1 2 c [1, 1, 1, 2]
Case 9
a b c
a 1 2 c [1, 1, 1, 2]
a b 3   [2, 1, 3]
Case 10
a b c
a c     [1, 1]
a 1 b c [1, 0, 1]
@rafael-atticlabs

This comment has been minimized.

Show comment
Hide comment
@rafael-atticlabs

rafael-atticlabs Aug 25, 2016

Contributor

@cmasone-attic, @aboodman Why don't you think about these cases some and we can discuss more in person on Thursday.

Contributor

rafael-atticlabs commented Aug 25, 2016

@cmasone-attic, @aboodman Why don't you think about these cases some and we can discuss more in person on Thursday.

@aboodman

This comment has been minimized.

Show comment
Hide comment
@aboodman

aboodman Aug 25, 2016

Contributor
Case 2: 
s: a b c
l: a b 1 c [2, 0, 1, 2]
r: a b 2 c [2, 0, 1, 2]
=> conflict // is the result [a b 1 2 c] or [a b 2 1 c]?

i.e. I think this is the simplest example of a merge conflict. The only way to make the implied "operations" commutative would be to use "external" information, such as commit time (i.e. last write wins) -- which we probably don't want to do with our default merge policy because it should be attempting to merge operations which aren't logical conflicts.

I think this should not be a conflict, and the correct answer is: [a b 1 2 c].

The reason why is because:

> l = new noms.List(['a', 'b', 1, 'c'])
> r = new noms.List(['a', 'b', 2, 'c'])
> l.hash.toString()
'2k1hc5s5q0he6uir0a9q4d5817cccb7n'
> r.hash.toString()
'e8v4btj0rb0do1thvi3alh82rpkq1r4v'
> l.hash.toString() < r.hash.toString()
true

... the external information is the hash of the list. Because Noms is awesome, we can calculate the hash efficiently no matter the size of the data. We choose the lower hash arbitrarily.

Case 3 (conflicting appends)
a b c
a b c 1 [3, 0, 1, 3]
a b c 2 [3, 0, 1, 3]

Correct answer is: [a b c 2 1] because new noms.List(['a', 'b', 'c', 1]).hash.toString() < new noms.List(['a', 'b', 'c', 2]) => false.

Case 4:

Let us say l has the lower hash. Splice of r should be adjusted to [2, 1, 0, 0].
Otherwise if r has the lower hash, then splice of l is adjusted to [2, 0, 0, 0] (no-op).

Case 5:

[a b c 1 1 2] :-(

Contributor

aboodman commented Aug 25, 2016

Case 2: 
s: a b c
l: a b 1 c [2, 0, 1, 2]
r: a b 2 c [2, 0, 1, 2]
=> conflict // is the result [a b 1 2 c] or [a b 2 1 c]?

i.e. I think this is the simplest example of a merge conflict. The only way to make the implied "operations" commutative would be to use "external" information, such as commit time (i.e. last write wins) -- which we probably don't want to do with our default merge policy because it should be attempting to merge operations which aren't logical conflicts.

I think this should not be a conflict, and the correct answer is: [a b 1 2 c].

The reason why is because:

> l = new noms.List(['a', 'b', 1, 'c'])
> r = new noms.List(['a', 'b', 2, 'c'])
> l.hash.toString()
'2k1hc5s5q0he6uir0a9q4d5817cccb7n'
> r.hash.toString()
'e8v4btj0rb0do1thvi3alh82rpkq1r4v'
> l.hash.toString() < r.hash.toString()
true

... the external information is the hash of the list. Because Noms is awesome, we can calculate the hash efficiently no matter the size of the data. We choose the lower hash arbitrarily.

Case 3 (conflicting appends)
a b c
a b c 1 [3, 0, 1, 3]
a b c 2 [3, 0, 1, 3]

Correct answer is: [a b c 2 1] because new noms.List(['a', 'b', 'c', 1]).hash.toString() < new noms.List(['a', 'b', 'c', 2]) => false.

Case 4:

Let us say l has the lower hash. Splice of r should be adjusted to [2, 1, 0, 0].
Otherwise if r has the lower hash, then splice of l is adjusted to [2, 0, 0, 0] (no-op).

Case 5:

[a b c 1 1 2] :-(

@rafael-atticlabs

This comment has been minimized.

Show comment
Hide comment
@rafael-atticlabs

rafael-atticlabs Aug 25, 2016

Contributor

I've updated the comment above to use the more familiar JS syntax for splices (index, removeCount, ...insertItems). I was using the struct version that noms uses internally, but the js syntax is probably easier to discuss.

Contributor

rafael-atticlabs commented Aug 25, 2016

I've updated the comment above to use the more familiar JS syntax for splices (index, removeCount, ...insertItems). I was using the struct version that noms uses internally, but the js syntax is probably easier to discuss.

@rafael-atticlabs

This comment has been minimized.

Show comment
Hide comment
@rafael-atticlabs

rafael-atticlabs Aug 25, 2016

Contributor

WRT Case 2 (& Case 3): I see that your proposed semantics allows these cases to be commutative, but it doesn't seem useful to me. In particular it seems to violate the basic semantic of being a list which is that ordering is significant. Picking which element comes first based on the hash of the final states of the lists makes it impossible to reason about how various cases will resolve.

I can see that if you're thinking about importing CSVs into lists and two inserts of new lines at the same position, it's tempting to allow both, but this seems like Set semantics to me, not List.

Or looking at it another way, I think we should probably not risk incorrectly merging two versions of a list whose order is managed by an explicit precedence function. I.e.

s: [{ name: 'bob', age: 20}, {name: 'alice', age: 30}]
l: [{ name: 'bob', age: 20}, {name: 'sam', age: 29}, {name: 'alice', age: 30}]
r: [{ name: 'bob', age: 20}, {name: 'jill', age: 22}, {name: 'alice', age: 30}]
Contributor

rafael-atticlabs commented Aug 25, 2016

WRT Case 2 (& Case 3): I see that your proposed semantics allows these cases to be commutative, but it doesn't seem useful to me. In particular it seems to violate the basic semantic of being a list which is that ordering is significant. Picking which element comes first based on the hash of the final states of the lists makes it impossible to reason about how various cases will resolve.

I can see that if you're thinking about importing CSVs into lists and two inserts of new lines at the same position, it's tempting to allow both, but this seems like Set semantics to me, not List.

Or looking at it another way, I think we should probably not risk incorrectly merging two versions of a list whose order is managed by an explicit precedence function. I.e.

s: [{ name: 'bob', age: 20}, {name: 'alice', age: 30}]
l: [{ name: 'bob', age: 20}, {name: 'sam', age: 29}, {name: 'alice', age: 30}]
r: [{ name: 'bob', age: 20}, {name: 'jill', age: 22}, {name: 'alice', age: 30}]
@cmasone-attic

This comment has been minimized.

Show comment
Hide comment
@cmasone-attic

cmasone-attic Aug 25, 2016

Contributor

I'm a strong believer in the principle of least-surprise, which makes me agree with Raf about using the hashes of l and r to decide whose splices take precedent in a merge. As a user of Noms, I'd feel like some opaque thing was foisting unpredictable merge results on me.

I wonder if we should keep the policies simple and, like @arv suggested, provide a resolution callback so that developers can use whatever additional information they want to resolve conflicts.

FWIW, case 7 doesn't seem to me like it needs to be a conflict, but all the others kinda do. Depending on what you're using the list for, I think you'd want to make different merge decisions. Consider @rafael-atticlabs' example above with csvs and Set vs List semantics...I think that there are definitely times when I'd want de-duping like a Set provides, but still be able to enumerate elements in the order they were added, so I'd choose to use a List. Some kind of distributed event-monitoring system, for example.

Contributor

cmasone-attic commented Aug 25, 2016

I'm a strong believer in the principle of least-surprise, which makes me agree with Raf about using the hashes of l and r to decide whose splices take precedent in a merge. As a user of Noms, I'd feel like some opaque thing was foisting unpredictable merge results on me.

I wonder if we should keep the policies simple and, like @arv suggested, provide a resolution callback so that developers can use whatever additional information they want to resolve conflicts.

FWIW, case 7 doesn't seem to me like it needs to be a conflict, but all the others kinda do. Depending on what you're using the list for, I think you'd want to make different merge decisions. Consider @rafael-atticlabs' example above with csvs and Set vs List semantics...I think that there are definitely times when I'd want de-duping like a Set provides, but still be able to enumerate elements in the order they were added, so I'd choose to use a List. Some kind of distributed event-monitoring system, for example.

@aboodman

This comment has been minimized.

Show comment
Hide comment
@aboodman

aboodman Aug 25, 2016

Contributor

I see your points, and I'm fine with starting simpler.

That said I think that for Raf's example of people sorted by age, it would
be better to use a map since then noms can do all the merges automatically.

What my proposal basically models is a partially ordered sequence.
Something like a distributed event log seems like it would be a good match
for that (all we know is that two events happened at "the same" time, not
which one really happened first). But I don't have a strong use case for
this, so happy to start simple and dumb (I actually don't have many use
cases for lists, so even happier to start simple and dumb).

On Thu, Aug 25, 2016 at 9:17 AM, cmasone-attic notifications@github.com
wrote:

I'm a strong believer in the principle of least-surprise, which makes me
agree with Raf about using the hashes of l and r to decide whose splices
take precedent in a merge. As a user of Noms, I'd feel like some opaque
thing was foisting unpredictable merge results on me.

I wonder if we should keep the policies simple and, like @arv
https://github.com/arv suggested, provide a resolution callback so that
developers can use whatever additional information they want to resolve
conflicts.

FWIW, case 7 doesn't seem to me like it needs to be a conflict, but all
the others kinda do. Depending on what you're using the list for, I think
you'd want to make different merge decisions. Consider @rafael-atticlabs
https://github.com/rafael-atticlabs' example above with csvs and Set vs
List semantics...I think that there are definitely times when I'd want
de-duping like a Set provides, but still be able to enumerate elements in
the order they were added, so I'd choose to use a List. Some kind of
distributed event-monitoring system, for example.


You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
#148 (comment),
or mute the thread
https://github.com/notifications/unsubscribe-auth/AAE6BBUUHcoatCXwQmP9m0bzPTE02jLDks5qjcAQgaJpZM4FiTyN
.

Contributor

aboodman commented Aug 25, 2016

I see your points, and I'm fine with starting simpler.

That said I think that for Raf's example of people sorted by age, it would
be better to use a map since then noms can do all the merges automatically.

What my proposal basically models is a partially ordered sequence.
Something like a distributed event log seems like it would be a good match
for that (all we know is that two events happened at "the same" time, not
which one really happened first). But I don't have a strong use case for
this, so happy to start simple and dumb (I actually don't have many use
cases for lists, so even happier to start simple and dumb).

On Thu, Aug 25, 2016 at 9:17 AM, cmasone-attic notifications@github.com
wrote:

I'm a strong believer in the principle of least-surprise, which makes me
agree with Raf about using the hashes of l and r to decide whose splices
take precedent in a merge. As a user of Noms, I'd feel like some opaque
thing was foisting unpredictable merge results on me.

I wonder if we should keep the policies simple and, like @arv
https://github.com/arv suggested, provide a resolution callback so that
developers can use whatever additional information they want to resolve
conflicts.

FWIW, case 7 doesn't seem to me like it needs to be a conflict, but all
the others kinda do. Depending on what you're using the list for, I think
you'd want to make different merge decisions. Consider @rafael-atticlabs
https://github.com/rafael-atticlabs' example above with csvs and Set vs
List semantics...I think that there are definitely times when I'd want
de-duping like a Set provides, but still be able to enumerate elements in
the order they were added, so I'd choose to use a List. Some kind of
distributed event-monitoring system, for example.


You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
#148 (comment),
or mute the thread
https://github.com/notifications/unsubscribe-auth/AAE6BBUUHcoatCXwQmP9m0bzPTE02jLDks5qjcAQgaJpZM4FiTyN
.

zcstarr pushed a commit to zcstarr/noms that referenced this issue Aug 25, 2016

Go: Add three-way merge for types.Struct (attic-labs#2403)
merge.ThreeWay() is now willing to merge Structs as well as Maps, as long
as the Structs are all named the name thing. Other rules are the same
as for Map.

Toward attic-labs#148
@cmasone-attic

This comment has been minimized.

Show comment
Hide comment
@cmasone-attic

cmasone-attic Aug 25, 2016

Contributor

Discussed at length with @rafael-atticlabs today, and we think we can have a simple algorithm that covers a lot of cases using the framework laid out above. In prose, the plan is to declare a conflict for any splices that overlap, EXCEPT those that are precisely identical -- that is, they start at exactly the same index and add/remove exactly the same items. For non-overlapping splices, apply the one that starts earlier and move on.

let Overlap(s1, s2) be

Overlap(s1, s2) bool {
  earlier, later := s1, s2
  if s2.index < s1.index {
    earlier, later = s2, s1
  }
  return s1.index == s2.index || earlier.index + earlier.removeCount > later.index
}

let CanMerge(s1, s2) be

CanMerge(s1, s2) bool {
  return s1 == s2
}

let Merge(s1, s2) be

Merge(s1, s2) splice {
  return s1
}
Contributor

cmasone-attic commented Aug 25, 2016

Discussed at length with @rafael-atticlabs today, and we think we can have a simple algorithm that covers a lot of cases using the framework laid out above. In prose, the plan is to declare a conflict for any splices that overlap, EXCEPT those that are precisely identical -- that is, they start at exactly the same index and add/remove exactly the same items. For non-overlapping splices, apply the one that starts earlier and move on.

let Overlap(s1, s2) be

Overlap(s1, s2) bool {
  earlier, later := s1, s2
  if s2.index < s1.index {
    earlier, later = s2, s1
  }
  return s1.index == s2.index || earlier.index + earlier.removeCount > later.index
}

let CanMerge(s1, s2) be

CanMerge(s1, s2) bool {
  return s1 == s2
}

let Merge(s1, s2) be

Merge(s1, s2) splice {
  return s1
}

cmasone-attic added a commit to cmasone-attic/noms that referenced this issue Aug 25, 2016

cmasone-attic added a commit to cmasone-attic/noms that referenced this issue Aug 25, 2016

@rafael-atticlabs

This comment has been minimized.

Show comment
Hide comment
@rafael-atticlabs

rafael-atticlabs Aug 25, 2016

Contributor

To enumerate:

Case 1: a 1 c 3 e
Case 2: => conflict
Case 3: => conflict
Case 4: => conflict
Case 5: => conflict
Case 6: => conflict
Case 7: a c
Case 8: a 1 2 c
Case 9: a 1 2 3
Case 10: => conflict
Contributor

rafael-atticlabs commented Aug 25, 2016

To enumerate:

Case 1: a 1 c 3 e
Case 2: => conflict
Case 3: => conflict
Case 4: => conflict
Case 5: => conflict
Case 6: => conflict
Case 7: a c
Case 8: a 1 2 c
Case 9: a 1 2 3
Case 10: => conflict
@rafael-atticlabs

This comment has been minimized.

Show comment
Hide comment
@rafael-atticlabs

rafael-atticlabs Aug 25, 2016

Contributor

We can start with this and iterate on including the more exotic cases that kind of "look" mergeable as cases arise.

Contributor

rafael-atticlabs commented Aug 25, 2016

We can start with this and iterate on including the more exotic cases that kind of "look" mergeable as cases arise.

cmasone-attic added a commit to cmasone-attic/noms that referenced this issue Aug 25, 2016

Add Set support to merge.ThreeWay
merge.ThreeWay(SetA, SetB, Parent) essentially returns the union of
the three Sets.

Toward attic-labs#148

cmasone-attic added a commit to cmasone-attic/noms that referenced this issue Aug 25, 2016

Add Set support to merge.ThreeWay
merge.ThreeWay(SetA, SetB, Parent) essentially returns the union of
the three Sets.

Toward attic-labs#148

cmasone-attic added a commit to cmasone-attic/noms that referenced this issue Aug 27, 2016

Add types.List support merge.ThreeWay()
While the code in threeWayListMerge() starts similarly to that
in threeWayOrderedSequenceMerge(), the types of the channels
are different and trying to shoehorn them together would
remove a lot of clarity just for the sake of sharing maybe
15 lines of code.

Toward attic-labs#148

cmasone-attic added a commit to cmasone-attic/noms that referenced this issue Aug 30, 2016

Add types.List support merge.ThreeWay()
While the code in threeWayListMerge() starts similarly to that
in threeWayOrderedSequenceMerge(), the types of the channels
are different and trying to shoehorn them together would
remove a lot of clarity just for the sake of sharing maybe
15 lines of code.

Toward attic-labs#148

cmasone-attic added a commit to cmasone-attic/noms that referenced this issue Aug 30, 2016

Add types.List support merge.ThreeWay()
While the code in threeWayListMerge() starts similarly to that
in threeWayOrderedSequenceMerge(), the types of the channels
are different and trying to shoehorn them together would
remove a lot of clarity just for the sake of sharing maybe
15 lines of code.

Toward attic-labs#148

cmasone-attic added a commit that referenced this issue Aug 30, 2016

Add types.List support merge.ThreeWay() (#2448)
While the code in threeWayListMerge() starts similarly to that
in threeWayOrderedSequenceMerge(), the types of the channels
are different and trying to shoehorn them together would
remove a lot of clarity just for the sake of sharing maybe
15 lines of code.

Toward #148

cmasone-attic added a commit to cmasone-attic/noms that referenced this issue Aug 30, 2016

Allow merge.ThreeWay() to operate across DBs
Much like Diff, we want to be able to merge values from different
databases. Unlike Diff, Merge needs to be able to resolve Refs --
which requires a ValueReadWriter that has access to the referenced
Value. So, merge.ThreeWay() now takes a ValueReadWriter for each Value
being considered for the merge -- the parent and both descendants.

Towards attic-labs#148

cmasone-attic added a commit to cmasone-attic/noms that referenced this issue Aug 30, 2016

In merge.ThreeWay(), use Diff() across the board
Initially, this code used DiffLeftRight() for some kinds
of collections. Standardize on Diff().

Towards attic-labs#148
@cmasone-attic

This comment has been minimized.

Show comment
Hide comment
@cmasone-attic

cmasone-attic Sep 8, 2016

Contributor

Remaining stuff to do before declaring this closed:

  • Go: Auto-merge concurrent changes to different datasets #2524
  • JS: Auto-merge concurrent changes to different datasets #2524
  • Add merge options to Dataset.CommitOptions #2534
  • Move noms-merge code into noms commit tool #2535
  • Document final merge semantics somewhere
  • Remove hacky trivial merge we used to have #2720
Contributor

cmasone-attic commented Sep 8, 2016

Remaining stuff to do before declaring this closed:

  • Go: Auto-merge concurrent changes to different datasets #2524
  • JS: Auto-merge concurrent changes to different datasets #2524
  • Add merge options to Dataset.CommitOptions #2534
  • Move noms-merge code into noms commit tool #2535
  • Document final merge semantics somewhere
  • Remove hacky trivial merge we used to have #2720

@cmasone-attic cmasone-attic added P1 and removed P0 labels Sep 16, 2016

@aboodman

This comment has been minimized.

Show comment
Hide comment
@aboodman

aboodman Oct 21, 2016

Contributor

I realize this has bounced around a bit, but I think we should finish it. @cmasone-attic can you find a time sometime soon to close it out?

Contributor

aboodman commented Oct 21, 2016

I realize this has bounced around a bit, but I think we should finish it. @cmasone-attic can you find a time sometime soon to close it out?

@cmasone-attic

This comment has been minimized.

Show comment
Hide comment
@cmasone-attic

cmasone-attic Oct 26, 2016

Contributor

oh...right. Now I remember. We haven't implemented ANY merge logic in Javascript. And the merge algorithm is built on Diff...which we also haven't implemented in JS :-)

So...I can implement issue #2534 for Go, and then go finish #2535 no problem. But JS...that's a big chunk of work

Contributor

cmasone-attic commented Oct 26, 2016

oh...right. Now I remember. We haven't implemented ANY merge logic in Javascript. And the merge algorithm is built on Diff...which we also haven't implemented in JS :-)

So...I can implement issue #2534 for Go, and then go finish #2535 no problem. But JS...that's a big chunk of work

@aboodman

This comment has been minimized.

Show comment
Hide comment
@aboodman

aboodman Oct 26, 2016

Contributor

I think it is fine to call this done without auto-merge in js.

Contributor

aboodman commented Oct 26, 2016

I think it is fine to call this done without auto-merge in js.

@rafael-atticlabs

This comment has been minimized.

Show comment
Hide comment
@rafael-atticlabs

rafael-atticlabs Oct 27, 2016

Contributor

Do we have a bug open for implementing all of this in JS? If not, please
open one and mark it "helpwanted"

On Wed, Oct 26, 2016 at 11:13 AM, Aaron Boodman notifications@github.com
wrote:

I think it is fine to call this done without auto-merge in js.


You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
#148 (comment),
or mute the thread
https://github.com/notifications/unsubscribe-auth/AMQOHUsR-_LCpkHqlIgASCqjGdZs3kETks5q35hjgaJpZM4FiTyN
.

Contributor

rafael-atticlabs commented Oct 27, 2016

Do we have a bug open for implementing all of this in JS? If not, please
open one and mark it "helpwanted"

On Wed, Oct 26, 2016 at 11:13 AM, Aaron Boodman notifications@github.com
wrote:

I think it is fine to call this done without auto-merge in js.


You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
#148 (comment),
or mute the thread
https://github.com/notifications/unsubscribe-auth/AMQOHUsR-_LCpkHqlIgASCqjGdZs3kETks5q35hjgaJpZM4FiTyN
.

@cmasone-attic

This comment has been minimized.

Show comment
Hide comment
@cmasone-attic

cmasone-attic Oct 27, 2016

Contributor

Issues filed (#2764 & #2765)

Contributor

cmasone-attic commented Oct 27, 2016

Issues filed (#2764 & #2765)

@cmasone-attic

This comment has been minimized.

Show comment
Hide comment
@cmasone-attic

cmasone-attic Oct 27, 2016

Contributor

'shbouya

Contributor

cmasone-attic commented Oct 27, 2016

'shbouya

@augustoteixeira

This comment has been minimized.

Show comment
Hide comment
@augustoteixeira

augustoteixeira Dec 5, 2016

This issue is stated as "not implemented at all" in the FAQ:
https://github.com/attic-labs/noms/blob/master/doc/faq.md

What is the current state of things?

augustoteixeira commented Dec 5, 2016

This issue is stated as "not implemented at all" in the FAQ:
https://github.com/attic-labs/noms/blob/master/doc/faq.md

What is the current state of things?

@aboodman

This comment has been minimized.

Show comment
Hide comment
@aboodman

aboodman Dec 5, 2016

Contributor

We need to update the FAQ and the homepage and put together a demo, but it's landed now.

Contributor

aboodman commented Dec 5, 2016

We need to update the FAQ and the homepage and put together a demo, but it's landed now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment