Efficient sync #12
Earthstar still needs efficient sync. I've got an idea for a new method, but first it's worth revisiting how we previously thought we could solve this issue: Cinnamon was developing a 'local index' approach which gave log-like properties to Earthstar's k/v stores.
This approach was promising, but it's not tolerant to a few scenarios:

- Telephone sync
- Causality faults
So we need another approach. I've been looking at vector clocks, version vectors, and even interval tree clocks, but I don't think any of these are suitable for Earthstar: they would require recording a version per replica ID per document, for starters. They also seem to be built with the assumption that a cluster is under the control of a single entity, which is not the case with Earthstar.

In a thread on SSB it was suggested to send the content hashes of all documents from one replica to another, and derive which docs are wanted from that. This can't work for Earthstar because Earthstar is able to overwrite and even delete values from its store: the absence of a hash is not enough to determine whether a replica wants a doc or not.

What do replicas want? Replicas want the latest version of a valid document by any given author. A replica will discard documents written by the same author with a lower timestamp. If we provide this minimum level of information (path, author, timestamp), we can determine whether a replica wants that document or not.

Let's imagine how this sync would work with the participants of our share. Nature's network has three participants: a computer owned by Cinnamon, a computer owned by gwil, and a server owned by gwil. Each device has its own replica.

Let's take a peek inside these replicas and remind ourselves how they work. Here's replica A:

Each path (e.g. …)

Now let's look at the insides of replica B:

Replica B has some different documents to A. It has an updated version of gwil's document at …

In an ideal world, we would send only these new documents from replica B to replica A.
When Replica A syncs with Replica B, it's going to ask Replica B for the minimum amount of information needed to determine whether it wants any of Replica B's documents. Replica B will answer with something like this: Each item has a path, a collection of authors it has documents by, and the timestamps of those documents. Each item also has a temporary ID generated for this sync. Replica A will proceed through each of these items like this:
Replica A will then send the IDs of the documents it wants, and Replica B will send over the documents associated with those temporary IDs.

That's my idea. I think that's about as good as I'm capable of at this point. It's not terribly complicated, which is an acceptance criterion for the Earthstar project.
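Here's a minimal sketch in TypeScript of how I picture that exchange. The `SyncItem` shape, the `latestTimestamp` method, and `wantedIds` are hypothetical names for illustration, not existing Earthstar APIs:

```ts
type AuthorAddress = string;
type Path = string;

// What Replica B offers first: just enough to decide whether a doc is wanted.
interface SyncItem {
  id: string; // temporary ID generated for this sync session
  path: Path;
  versions: { author: AuthorAddress; timestamp: number }[];
}

interface ThumbnailStore {
  // Timestamp of the doc we hold at (path, author), or undefined if we have none.
  latestTimestamp(path: Path, author: AuthorAddress): number | undefined;
}

// Replica A works out which temporary IDs it wants from Replica B.
function wantedIds(store: ThumbnailStore, offered: SyncItem[]): string[] {
  const wanted: string[] = [];
  for (const item of offered) {
    const hasNewer = item.versions.some(({ author, timestamp }) => {
      const ours = store.latestTimestamp(item.path, author);
      // Want it if we have nothing by this author at this path,
      // or if the offered version is newer than ours.
      return ours === undefined || timestamp > ours;
    });
    if (hasNewer) wanted.push(item.id);
  }
  return wanted;
}
```

Replica B would then respond to the returned IDs with the full documents.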
I like the simplicity of this approach. Here are some additional considerations:
You're not overthinking it: messages with the pathname + versions will be streamed from one peer to the other.
My goal is to make syncing quick and cheap enough that restarting doesn't matter. Ideally replicas shouldn't have to remember any state about each other.
I think we should definitely have some tools to do stuff like this more easily. I don't have much experience with workers, though.
Sounds reasonable, I'm up for trying it.
I've been researching this week, and I found the talk CRDTs for Mortals (Offline First Apps). It gives a good overview of the problems with conflict resolution and syncing. The talk mostly skips over the syncing part, but the repo of the associated example app says: "This also contains an implementation of a merkle tree to check consistency of the data to make sure all clients are in sync." This repo is a very small implementation, which helps to better understand how it works.

As I understand it, when the client syncs with the server, the client sends its latest messages (they would be "documents" for Earthstar) and a Merkle tree representing the contents of the client's data. The messages are added to the server's data, and a new tree is created based on that. Then a diff is made between the client tree and the new server tree. That diff results in a timestamp used to query all messages the server has that the client does not have. Then the server responds with those messages and the server's tree. The client then does another tree diff and confirms that the two sides are in sync.

The Hypercore protocol also uses Merkle trees "to only download the parts of the log they are interested in." I wonder if using a Merkle tree could help the syncing process for Earthstar. It certainly makes the library implementation more complex than what you were outlining, but if it improves syncing, then it may be worth it. Have Merkle trees been considered for Earthstar yet? I would need to research more to get a better sense of how it would work.
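For what it's worth, here is a rough TypeScript sketch of how I understand that timestamp-bucketed hash comparison. It is not the implementation from the linked repo; the one-minute bucket size and the function names are just illustrative:

```ts
import { createHash } from "node:crypto";

// Hash the IDs of all messages that fall into each one-minute bucket.
function bucketHashes(messages: { id: string; timestamp: number }[]): Map<number, string> {
  const buckets = new Map<number, string[]>();
  for (const msg of messages) {
    const minute = Math.floor(msg.timestamp / 60_000);
    buckets.set(minute, [...(buckets.get(minute) ?? []), msg.id]);
  }
  const hashes = new Map<number, string>();
  for (const [minute, ids] of buckets) {
    hashes.set(minute, createHash("sha256").update(ids.sort().join("\n")).digest("hex"));
  }
  return hashes;
}

// Find the earliest minute where the two sides disagree; the server would then
// send every message it has from that point onwards.
function earliestDivergence(
  client: Map<number, string>,
  server: Map<number, string>,
): number | undefined {
  const minutes = [...new Set([...client.keys(), ...server.keys()])].sort((a, b) => a - b);
  return minutes.find((m) => client.get(m) !== server.get(m));
}
```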
I've also been doing my research, and am wondering whether this is worth exploring some other (funded) time. I think Merkle trees do not work for Earthstar, as Earthstar shares store mutable sets of documents with identity: you can't know whether you want a document simply by whether you have it or not. A document peer A has but which peer B does not may be a document peer B has since overwritten. I have been reading this outline of Simple And Efficient Set Reconciliation, which I think could be applied to Earthstar. I'm trying to assess whether it's something I can really fit into this year's work or not.
That's a fascinating proposal. This is how I'm interpreting it: searching based on a range means there has to be some agreed-upon way to sort items within that range. For Earthstar, this range could be a few different things. Although I suspect there are problems I'm overlooking, because this is just a quick thought.
Before leaving on holiday, I had a call with @AljoschaMeyer to talk about using their range-based set reconciliation method with Earthstar. Long story short, we can do it! We also determined that Earthstar will still need to perform the step where a subset of document information (path + author + timestamp) needs to be sent over to determine if the peer actually wants the document it doesn't have yet. So the final procedure will be:
@basham Like you say, to do this we need a total order on documents. As it is possible for two documents to have the same timestamp (however remote the chance), we need to do either path-then-author or author-then-path. I think I am going to go with path-then-author, just because this is how things are organised in other places in the codebase.
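Something like this, perhaps (`DocThumbnail` is just a hypothetical shape, and exactly where the timestamp slots into the order is still up for discussion below):

```ts
interface DocThumbnail {
  path: string;
  author: string;
  timestamp: number;
}

// Total order: path first, then author, with timestamp as a final tiebreaker.
function compareThumbnails(a: DocThumbnail, b: DocThumbnail): number {
  if (a.path !== b.path) return a.path < b.path ? -1 : 1;
  if (a.author !== b.author) return a.author < b.author ? -1 : 1;
  return a.timestamp - b.timestamp;
}
```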
Step 2 should not end up being an explicit step, however; the items that are being exchanged by the range-based set reconciliation procedure could/should be those 'thumbnails'.
Shouldn't the timestamp also appear in there somewhere? The same author is expected to write to the same path multiple times, aren't they?

The total order might indeed be the most design-intensive remaining decision. Remember that the items over which you define the total order are also items you need to transmit when communicating the boundaries of a range fingerprint. So if you define the order on the thumbnails directly, each range fingerprint contributes an author, a path and a timestamp to the amount of bits that need to be sent. A more efficient approach is to hash each thumbnail and reconcile those hashes, simply using numeric ordering on those hashes. When transmitting range item sets, you could still transmit the actual thumbnails rather than their hashes. The main drawback is that this order is completely meaningless. This has an actual impact on the replication performance, as the protocol performs better in sessions where the items that differ between two peers are successive in the total order. Ordering hashes destroys all locality, whereas ordering by timestamp/author/path would lead to nice clustering in many realistic scenarios. The worst-case setting stays identical though.

A meaningful order on thumbnails is more in the spirit of #6, as asking for a specific subrange which, for example, contains only documents under a single path naturally corresponds to a limited query. Keep in mind that multi-dimensional ranges are also a possibility; you could define a range as a triplet of a least and greatest path, least and greatest author, and least and greatest timestamp. This does however increase the implementation effort, as the auxiliary tree data structure would need to be a kd-tree.

From a purely aesthetic view, it feels to me like it should either be an ordering on hashes, or three-dimensional ranges on thumbnails. But ordering thumbnails by an arbitrary linearization of their three dimensions seems, well, arbitrary. And since multi-dimensional ranges are more complicated, I would lean toward ordering hashes and leaving meaningful, three-dimensional ranges as a future possibility. But of course I'm a project outsider, so take this recommendation with a large grain of salt.
Very happy to have your guidance here @AljoschaMeyer! I think I muddied the waters with the casual introduction of new phrasing.
In our last call, I think we talked about thumbnails being inappropriate for use as a fingerprint, as they'd be too big: a document thumbnail as a string would be a minimum of ~70 characters. My current thinking was that the fingerprint would be a much shorter hash derived from the document thumbnail. Could I map the total order of the document thumbnails (path / author) to their derived fingerprints? For example... Peer A making fingerprints:
Peer B, with a newer document at
Yes, but once an author writes to a path, it replaces the previous document. A path will only ever have a single version from each author. The timestamp needs to be transmitted because a peer might have a newer version, in which case it doesn't care about the older version of the document.
Looks like we are on the same page; I used "thumbnail" in my last comment synonymously with how you just defined "document thumbnail".
Two different things are being exchanged: fingerprints and actual items. For Earthstar, the actual items would be document thumbnails. The fingerprints would be (sums of) hashes of document thumbnails. In the terminology of the thesis, the function that lifts document thumbnails into the universe of the monoid is some arbitrary hash function, and the monoid could be addition modulo the number of distinct digests.
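A minimal sketch of how that could look, assuming SHA-256 as the arbitrary hash function and treating digests as 256-bit integers (so the monoid is addition modulo 2^256); none of these concrete choices are prescribed by the thesis:

```ts
import { createHash } from "node:crypto";

const M = 2n ** 256n; // the number of distinct SHA-256 digests

// Lift a document thumbnail into the universe of the monoid by hashing
// the concatenation of its parts (null byte as a separator).
function lift(t: { path: string; author: string; timestamp: number }): bigint {
  const digest = createHash("sha256")
    .update(`${t.path}\u0000${t.author}\u0000${t.timestamp}`)
    .digest("hex");
  return BigInt("0x" + digest) % M;
}

// The monoid operation: the fingerprint of a range is the fold of this
// over the lifted values of all thumbnails in that range.
function combine(a: bigint, b: bigint): bigint {
  return (a + b) % M;
}
```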
That is unfortunately not possible. If I give you nothing but a (secure) hash of some document thumbnail, you are by definition unable to compute a pre-image, so how would you know which document thumbnail to use for determining your position in the order? And you cannot really hope to find special hash functions that produce identically ordered hashes either; handwavingly speaking, hashes need to be too random to preserve such a high amount of structure.
Just to emphasize this: the timestamp must also influence the fingerprints, as otherwise peers could not distinguish between different versions during reconciliation. (I think you already know this; I just want us to have it explicitly in this comment thread.)
Ignore this comment if I'm not following all the intricacies of what's being proposed here. But I suspect the timestamp may need to be included. Since the three parts can't be hashed together, what if the parts within timestamp+path+author could be individually hashed? If the parts can't have a consistent and predictable length, they could be joined together with some delimiter that's outside of the hash range. Let's say, for the sake of this example, the hashes are only alphanumeric (…). So, perhaps the fingerprints could be something like this? (I just made up some characters in the derived fingerprint to illustrate some examples.)
Indeed it has to be. Just not explicitly; it suffices if it is somehow involved in computing the fingerprint.
This assumption is not true: it is (thankfully) completely sufficient to concatenate those three components and hash the result. Hence I cannot see the necessity of your proposal - but I unfortunately cannot see where our mismatching perspectives come from either at this point.

There is however also a subtle flaw with your proposal: it is actually important that the timestamp is not hashed on its own but in combination with other data, else fingerprint collisions become too likely. In your scheme, you would probably define the fingerprint of a set as the component-wise sum (or XOR or whichever monoid you prefer) over all document thumbnails in the set. Given the comparatively small number of timestamps, collisions become dangerously realistic. Particularly when simply combining literal timestamps, this can easily happen. But even when hashing timestamps before combining them, 2^64 distinct values are simply not enough to consider the probability of collisions negligible.

That is actually the main reason why I couldn't figure out how to get rid of the step of requesting all the thumbnails for which you actually want the content. I couldn't figure out how to thread timestamp information through the recursive fingerprint comparisons without a high probability of collisions.
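To make the risk of hashing parts separately concrete, here is my own illustrative example (made-up values, and possibly not the exact scenario meant above). With component-wise fingerprints, these two different sets are indistinguishable, because nothing binds a path to the author and timestamp it was originally paired with:

- Set 1: (pathA, author1, t1) and (pathB, author2, t2)
- Set 2: (pathA, author2, t2) and (pathB, author1, t1)

Both sets combine the same path hashes, the same author hashes, and the same timestamp hashes, so their component-wise sums agree. Hashing the concatenated triple per thumbnail avoids this, since each thumbnail then contributes a single digest that depends on the whole triple.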
Maybe part of the misunderstanding comes from the value of the author. My interpretation of "author" for the sake of this conversation would more accurately be Earthstar Identities.
There could be multiple …

Is the problem with collisions resolved when considering the combined parts of …?
Neat!
It is. Suppose you have four different document thumbnails …
Yep, that makes sense. Thanks!
(First, probably do #14 set up RPC)
Right now we sync by sending everything. Instead, do something like:
hash(list of individual item hashes)
This could obviously recurse further into the 1000 items, but there are tradeoffs:
We can't pre-compute the hashes because replication filters might choose a different subset of items each time.
This should eventually support #6 Replication filters.
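A rough TypeScript sketch of that hash-of-hashes comparison, bucketing by a shared hash prefix so both peers split their sets at the same boundaries (the names and the bucket size of 1000 are made up for illustration):

```ts
import { createHash } from "node:crypto";

// Hash a list of individual item hashes into one summary hash.
function summaryHash(hashes: string[]): string {
  const h = createHash("sha256");
  for (const itemHash of [...hashes].sort()) h.update(itemHash);
  return h.digest("hex");
}

// Group item hashes by their first `depth` hex characters, so both peers
// split their sets at exactly the same boundaries.
function bucketBy(hashes: string[], depth: number): Map<string, string[]> {
  const out = new Map<string, string[]>();
  for (const itemHash of hashes) {
    const key = itemHash.slice(0, depth);
    out.set(key, [...(out.get(key) ?? []), itemHash]);
  }
  return out;
}

// Find the item hashes each side is missing. Network round trips are elided;
// in practice each level of recursion would be one request/response that
// exchanges only summary hashes, not the items themselves.
function reconcile(a: string[], b: string[], depth = 1): { aOnly: string[]; bOnly: string[] } {
  if (summaryHash(a) === summaryHash(b)) return { aOnly: [], bOnly: [] };
  if (a.length + b.length <= 1000) {
    const setA = new Set(a);
    const setB = new Set(b);
    return { aOnly: a.filter((h) => !setB.has(h)), bOnly: b.filter((h) => !setA.has(h)) };
  }
  const bucketsA = bucketBy(a, depth);
  const bucketsB = bucketBy(b, depth);
  const keys = new Set([...bucketsA.keys(), ...bucketsB.keys()]);
  const result = { aOnly: [] as string[], bOnly: [] as string[] };
  for (const key of keys) {
    const sub = reconcile(bucketsA.get(key) ?? [], bucketsB.get(key) ?? [], depth + 1);
    result.aOnly.push(...sub.aOnly);
    result.bOnly.push(...sub.bOnly);
  }
  return result;
}
```

One tradeoff this makes visible: deeper recursion saves bandwidth on large, mostly identical sets, but each extra level costs a round trip.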