Description
At present, we create a new actor ID every time a node is started (either as a completely fresh node, or loading its state from file). That approach ensures that even if you run a process twice on the same machine, the two runs will have different actor IDs, and so they won't step on each other's toes. However, the downside is that the rate of churn of actor IDs is high, and so the vector clocks (which have an entry for each actor) get rather large.
I took a first step towards reducing that size in 9d8f136: instead of including the full vector clock in each changeset, we now only include the actor ID of the originator, the sequence number (which starts at 1 for a new actor ID and is incremented on every changeset), and a minimal set of dependencies (as actorId-seqNum pairs). The set of dependencies does not include any dependencies that can be reached transitively through other dependencies, and it implicitly includes seqNum–1 for the same actorId. In other words, it only references changesets from other actors that were received by the originator between seqNum–1 and seqNum, making the dependencies much like a merge commit in git.
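To make the pruning rule concrete, here is a rough sketch of how a minimal dependency set could be computed. It is not the code from 9d8f136: the `changes` store layout (`changes[actorId][seqNum - 1].deps`) and the helper names `coveredClock` and `minimalDeps` are assumptions made up for this illustration.

```javascript
// Hypothetical store: changes[actorId][seqNum - 1] = { deps: { otherActorId: seqNum } }.
// Every changeset implicitly depends on seqNum - 1 of its own actor.

// Returns the full vector clock reachable from changeset actor:seq (including itself).
function coveredClock(changes, actor, seq) {
  const clock = {};
  const stack = [[actor, seq]];
  while (stack.length > 0) {
    const [a, s] = stack.pop();
    const known = clock[a] || 0;
    if (s <= known) continue;
    // Changesets known+1..s of actor a are newly covered; follow their explicit deps.
    for (let i = known + 1; i <= s; i++) {
      const deps = changes[a][i - 1].deps;
      for (const depActor of Object.keys(deps)) {
        stack.push([depActor, deps[depActor]]);
      }
    }
    clock[a] = s;
  }
  return clock;
}

// `received` is everything the originator has seen so far, as a full vector clock,
// with received[actor] being the originator's own previous sequence number.
function minimalDeps(changes, actor, received) {
  const others = Object.keys(received).filter(a => a !== actor && received[a] > 0);
  const covers = [];
  if ((received[actor] || 0) > 0) {
    covers.push({ actor, clock: coveredClock(changes, actor, received[actor]) });
  }
  for (const a of others) {
    covers.push({ actor: a, clock: coveredClock(changes, a, received[a]) });
  }
  const deps = {};
  for (const a of others) {
    // Drop a:received[a] if some other dependency already reaches it transitively.
    const redundant = covers.some(c => c.actor !== a && (c.clock[a] || 0) >= received[a]);
    if (!redundant) deps[a] = received[a];
  }
  return deps;
}

const changes = {
  A: [{ deps: {} }, { deps: { B: 1 } }],
  B: [{ deps: {} }, { deps: { A: 2, C: 1 } }],
  C: [{ deps: {} }]
};

// A is about to create A:3 after having received B:2 and C:1.
const depsForA3 = minimalDeps(changes, 'A', { A: 2, B: 2, C: 1 });
```

In this example C:1 is dropped because it is already reachable through B:2, so A:3 would only need to list B:2 (plus the implicit A:2).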
However, the API for figuring out which changesets to send between peers (getVClock(), getDeltasAfter()) still uses full vector clocks without any truncation. Here, reducing the size is a little trickier. It can be done, but it will require more than one round-trip between peers to figure out what deltas need to be sent.
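For comparison, the single round-trip exchange with full vector clocks can be sketched roughly like this. The functions `localClock` and `deltasAfter` are simplified stand-ins written for this example, not the actual getVClock() and getDeltasAfter() implementations, and the store layout is an assumption.

```javascript
// Hypothetical store: store[actorId] is the array of changesets from that actor,
// in sequence-number order (index 0 holds seqNum 1).

// Full vector clock: one entry per actor, however short-lived that actor was.
function localClock(store) {
  const clock = {};
  for (const actor of Object.keys(store)) {
    clock[actor] = store[actor].length;
  }
  return clock;
}

// Everything we have that the remote clock does not cover.
function deltasAfter(store, remoteClock) {
  const deltas = [];
  for (const actor of Object.keys(store)) {
    const seen = remoteClock[actor] || 0;
    for (let seq = seen + 1; seq <= store[actor].length; seq++) {
      deltas.push({ actor, seq });
    }
  }
  return deltas;
}

const storeOnA = { A: [{}, {}, {}], B: [{}], C: [{}] };
const storeOnB = { A: [{}], B: [{}] };

// One round trip: B sends its clock, A answers with the missing changesets.
const toSend = deltasAfter(storeOnA, localClock(storeOnB));
```

The exchange completes in one round trip, but the clock that B sends grows with every actor ID ever created, which is exactly the cost described above.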
The reason is as follows. Say the latest changeset known by peer A (the "HEAD" in git terminology) has sequence number 42. All other changesets known by A can be reached transitively by following the dependency chains from A:42. So it is technically sufficient for getVClock() to return {A:42}, since that contains all the necessary information. However, any peer that does not have recent changesets from A cannot interpret the clock {A:42}. All it can tell is that it is missing a bunch of changesets from A, but it may also be missing changesets from other actors, and it won't find out what those are until it has received the changesets from A.
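The problem can be illustrated with a small sketch, using a head of {A:3} instead of {A:42} to keep it short. The `expandHead` function and the store layout are hypothetical and only meant to show why a truncated clock is interpretable by one peer but not by another.

```javascript
// Hypothetical store: store[actorId][seqNum - 1] = { deps: { otherActorId: seqNum } }.

// Expands a truncated clock (a "head") into the full vector clock it implies,
// as far as the local store allows. Entries in `unresolved` point at changesets
// we do not have, so whatever they depend on is unknown to us.
function expandHead(store, head) {
  const clock = {};
  const unresolved = {};
  const stack = Object.entries(head);
  while (stack.length > 0) {
    const [a, s] = stack.pop();
    const known = clock[a] || 0;
    if (s <= known) continue;
    const have = (store[a] || []).length;
    if (s > have) unresolved[a] = Math.max(unresolved[a] || 0, s);
    const upTo = Math.min(s, have);
    for (let i = known + 1; i <= upTo; i++) {
      const deps = store[a][i - 1].deps;
      for (const depActor of Object.keys(deps)) {
        stack.push([depActor, deps[depActor]]);
      }
    }
    if (upTo > 0) clock[a] = upTo;
  }
  return { clock, unresolved };
}

// Peer A has everything; A:2 depends on C:1 and A:3 depends on B:1.
const peerA = {
  A: [{ deps: {} }, { deps: { C: 1 } }, { deps: { B: 1 } }],
  B: [{ deps: {} }],
  C: [{ deps: {} }]
};
// Peer B has only seen A:1 and its own B:1.
const peerB = {
  A: [{ deps: {} }],
  B: [{ deps: {} }]
};

const seenByA = expandHead(peerA, { A: 3 });
const seenByB = expandHead(peerB, { A: 3 });
```

Peer A can expand {A:3} to the full clock. Peer B only learns that it lacks changesets from A beyond A:1; nothing in its local state tells it that it is also missing C:1, because that dependency is recorded in A:2, which it has not received yet.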
To solve this, we can probably take a look at the git transfer protocol as implemented by git-send-pack and git-upload-pack. Git has a similar problem, since it identifies the state of a repository by a commit hash, and if you only know a commit hash, you don't know what history you're missing in order to get there.