Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reduce size of vector clocks #28

Closed
ept opened this issue Jun 22, 2017 · 3 comments
Closed

Reduce size of vector clocks #28

ept opened this issue Jun 22, 2017 · 3 comments

Comments

@ept
Copy link
Member

ept commented Jun 22, 2017

At present, we create a new actor ID every time a node is started (either as a completely fresh node, or loading its state from file). That approach ensures that even if you run a process twice on the same machine, they will have different actor IDs, and so they won't step on each others' toes. However, the downside is that the rate of churn of actor IDs is high, and so the vector clocks (which have an entry for each actor) get rather large.

I did a first step in this direction in 9d8f136: instead of including the full vector clock in each changeset, we now only include the actor ID of the originator, the sequence number (which starts at 1 for a new actor ID and is incremented on every changeset), and a minimal set of dependencies (as actorId-seqNum pairs). The set of dependencies does not include any dependencies that can be reached transitively through other dependencies, and it implicitly includes seqNum–1 for the same actorId. In other words, it only references changesets from other actors that were received by the originator between seqNum–1 and seqNum, making the dependencies much like a merge commit in git.

However, the API for figuring out which changesets to send between peers (getVClock(), getDeltasAfter()) still uses full vector clocks without any truncation. Here, reducing the size is a little trickier. It can be done, but it will require more than one round-trip between peers to figure out what deltas need to be sent.

The reason is as follows: for example, say the latest changeset known by peer A (the "HEAD" in git terminology) has sequence number 42. All other changesets known by A can be reached transitively by following the dependency chains from A:42. So it is technically sufficient for getVClock() to return {A:42}, since that contains all the necessary information. However, any peer that does not have recent changesets from A cannot interpret the clock {A:42}. All it can tell is that it is missing a bunch of changesets from A, but it may also be missing changesets from other actors, and it won't find out what those are until it has received the changesets from A.

To solve this, we can probably take a look at the git transfer protocol as implemented by git-send-pack and git-upload-pack. Git has a similar problem, since it identifies the state of a repository by a commit hash, and if you only know a commit hash, you don't know what history you're missing in order to get there.

@ept
Copy link
Member Author

ept commented Jun 23, 2017

Current implementation of network protocol for exchanging changesets/deltas is in DeltaRouter. It might make sense to move some of that into Tesseract (as far as it can be done abstractly, without introducing a dependency on WebRTC).

@pvh
Copy link
Member

pvh commented Feb 14, 2018

Is this basically done, then, Martin? If so we might close this issue and reopen a new one that described any additional work you envision.

@ept
Copy link
Member Author

ept commented Feb 15, 2018

@pvh Sort of. The vector clock attached to each individual change is now quite compact, since it only includes changes that arrived since the last change that was sent. Full vector clocks are still used in the Automerge.Connection network protocol for synchronising peers. This issue was intended to cover vector clock compaction in Automerge.Connection as well. However, I don't think we particularly need to do anything at the moment, since I think other areas of Automerge are in greater need of optimisation, and the vector clocks don't seem to be a bottleneck at the moment.

So I will close this issue and set it aside for now. We can always revisit it in the future if necessary.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants