Track when docs were received by a peer: _receivedTimestamp or _localSeq - to enable reliable indexing and reliable livestreaming #66

Closed
cinnamon-bun opened this issue Feb 14, 2021 · 2 comments
Labels
enhancement New feature or request

Comments

@cinnamon-bun
Member

cinnamon-bun commented Feb 14, 2021

There are several features that need to know when the local Storage instance received a certain document.

Implementation

  1. When ingesting a document, give it an extra field _receivedTimestamp which is equal to the local time on this peer.
  2. That field is sort of a private local field; don't send it to other nodes when syncing. That's why it has an underscore.
  3. Allow queries to specify sorting by _receivedTimestamp and continuing from after a given timestamp. This might mean adding an extra index to the storage backends.
  4. This timestamp should be enforced to be monotonic, i.e. despite changes in the computer clock this number should only ever increase.
  5. It could also just be an incrementing sequence number but there are a few benefits to using timestamps, such as the ability to merge event streams across different workspaces.
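
A rough sketch of what "enforced to be monotonic" could look like. `MonotonicClock` and its methods are hypothetical names, not existing Earthstar code, and the microsecond unit (`Date.now() * 1000`) is an assumption. The idea is just: take the wall clock if it has moved forward, otherwise bump the last issued value by one.

```typescript
// Hypothetical helper for illustration; not part of Earthstar's API.
class MonotonicClock {
  private last: number;
  private readNow: () => number;

  // `readNow` is injectable so the behaviour can be tested with a fake clock.
  // `lastPersisted` is the last value issued before a restart, e.g. loaded
  // from wherever the peer keeps its local config.
  constructor(
    readNow: () => number = () => Date.now() * 1000, // assumed unit: microseconds
    lastPersisted: number = 0,
  ) {
    this.readNow = readNow;
    this.last = lastPersisted;
  }

  // Returns a value strictly greater than every value returned before,
  // even if the wall clock stalls or jumps backwards.
  next(): number {
    const candidate = this.readNow();
    this.last = candidate > this.last ? candidate : this.last + 1;
    return this.last;
  }

  // The value a caller would persist so monotonicity survives a restart.
  get lastIssued(): number {
    return this.last;
  }
}

// Usage: the fake wall clock jumps backwards, but issued values keep increasing.
const fakeTimes = [1000, 2000, 1500, 1500, 3000];
let tick = 0;
const clock = new MonotonicClock(() => fakeTimes[tick++]);
const issued = fakeTimes.map(() => clock.next());
console.log(issued); // [1000, 2000, 2001, 2002, 3000]
```

While the clock is running backwards, the values behave like a plain incrementing sequence number, so the two options in item 5 are not that far apart.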

Implications

  • The Document type sometimes needs to hold this local-only field, _receivedTimestamp. Maybe the type needs to split into two, DocumentWithLocalData and DocumentForWire...
  • Storage backends need to support the extra field, and sorting by it.
  • The field should never be sent across the wire to another peer
  • Document validation shouldn't ever encounter this field, or should ignore it
  • Monotonic time tracking needs a little bit of storage to record the last value; it could go in the workspace config storage.
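
To make the type split concrete, here is a minimal sketch. The two type names come from the bullet above; the field list is a simplified, hypothetical subset of a real document, and `withLocalData` / `toWire` are made-up helper names.

```typescript
// Fields every peer agrees on (simplified, hypothetical subset).
interface DocumentForWire {
  path: string;
  author: string;
  content: string;
  timestamp: number; // authored time, set by the author
}

// The same document plus local-only bookkeeping.
interface DocumentWithLocalData extends DocumentForWire {
  _receivedTimestamp: number; // set by this peer on ingest, never synced
}

// Attach the local-only field when a document is ingested.
function withLocalData(
  doc: DocumentForWire,
  receivedTimestamp: number,
): DocumentWithLocalData {
  return { ...doc, _receivedTimestamp: receivedTimestamp };
}

// Drop the local-only field before sending a document to another peer.
function toWire(doc: DocumentWithLocalData): DocumentForWire {
  const { _receivedTimestamp, ...wire } = doc;
  return wire;
}

// Usage (the author address is a placeholder, not a valid address).
const localDoc = withLocalData(
  { path: "/wiki/home", author: "@suzy.example", content: "hello", timestamp: 1613260800000000 },
  1613264400000000,
);
const wireDoc = toWire(localDoc);
console.log("_receivedTimestamp" in wireDoc); // false
```

With this shape, validation only ever needs to look at `DocumentForWire`, and the compiler complains if a `_receivedTimestamp` is read from something that came off the wire.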

Features this unlocks

  • Reliable live streaming: Live syncing relies on a livestream of changes to a Storage. If that's interrupted, we want to pick up where it left off. The reliable way to do that is to sort by _receivedTimestamp, and resume a stream with anything after a given _receivedTimestamp.
  • Reliable indexing of workspace data: If you wanted to build an index against a Storage, it's much the same problem: you need a feed of changes for updating your index. You can't just sort by regular document timestamp because sometimes you get documents a long time after they're authored, so their regular timestamps don't match the order you received them in. In this case you want a tuple: (generation, _receivedTimestamp). Generation is an integer that increments whenever the entire storage is forgotten and reset, or some documents have been locally forgotten (besides ephemeral documents), or when the storage is recreated from scratch. If the generation changes, the index has to start over and re-index everything. Generation could also be a plain timestamp recorded at any of these events.
  • Possibly more efficient syncing: Haven't worked this out yet, but it might help two peers figure out what data to trade with each other.
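
A sketch of the consumer side of the (generation, _receivedTimestamp) tuple, under assumptions: `StorageSnapshot`, `IndexCursor` and `catchUp` are hypothetical names, the storage is modelled as a plain array, and received timestamps are assumed to be positive. The indexer stores only the cursor between runs.

```typescript
interface ReceivedDoc {
  path: string;
  _receivedTimestamp: number;
}

// Stand-in for a Storage: its current generation plus everything it holds.
interface StorageSnapshot {
  generation: number;
  docs: ReceivedDoc[];
}

// The only state an indexer (or a resuming stream) has to remember.
interface IndexCursor {
  generation: number;
  lastReceived: number;
}

function catchUp(
  snapshot: StorageSnapshot,
  cursor: IndexCursor | null,
): { fresh: ReceivedDoc[]; cursor: IndexCursor; mustReindex: boolean } {
  let since = 0; // received timestamps are assumed to be > 0
  let mustReindex = true;
  if (cursor !== null && cursor.generation === snapshot.generation) {
    since = cursor.lastReceived;
    mustReindex = false;
  }
  // Everything received after the cursor, in the order it was received.
  const fresh = snapshot.docs
    .filter((d) => d._receivedTimestamp > since)
    .sort((a, b) => a._receivedTimestamp - b._receivedTimestamp);
  const newest = fresh.length > 0 ? fresh[fresh.length - 1]._receivedTimestamp : since;
  return {
    fresh,
    mustReindex,
    cursor: { generation: snapshot.generation, lastReceived: newest },
  };
}

// Usage
const docB = { path: "/b", _receivedTimestamp: 100 };
const docC = { path: "/c", _receivedTimestamp: 200 };
const docA = { path: "/a", _receivedTimestamp: 300 }; // arrived late
const firstRun = catchUp({ generation: 1, docs: [docC, docB] }, null);
const secondRun = catchUp({ generation: 1, docs: [docC, docB, docA] }, firstRun.cursor);
// The storage was reset: generation changed, so the indexer starts over.
const thirdRun = catchUp({ generation: 2, docs: [{ path: "/c", _receivedTimestamp: 400 }] }, secondRun.cursor);
console.log(secondRun.fresh.map((d) => d.path)); // ["/a"]
console.log(thirdRun.mustReindex); // true
```

When `mustReindex` is true the caller is expected to throw away its old index before applying `fresh`. A real backend would answer the "after this timestamp" query from an index rather than filtering everything.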
@cinnamon-bun cinnamon-bun added the enhancement New feature or request label Feb 14, 2021
@cinnamon-bun cinnamon-bun changed the title Track when docs were received by a peer: _receivedTimestamp Track when docs were received by a peer: _receivedTimestamp Feb 14, 2021
@cinnamon-bun cinnamon-bun added this to the Bananaslug milestone Feb 19, 2021
@cinnamon-bun
Member Author

This is a duplicate of an older issue: #30

@cinnamon-bun cinnamon-bun changed the title Track when docs were received by a peer: _receivedTimestamp Track when docs were received by a peer: _receivedTimestamp or _localSeq Mar 18, 2021
@cinnamon-bun
Member Author

cinnamon-bun commented Mar 18, 2021

A big diagram explaining this situation from the perspective of an App or Layer that wants to index an Earthstar IStorage.

This is the "Reliable Indexing" use case. The "Reliable live streaming" situation is very similar: just replace the orange box with another Peer instead of an App. In both cases the other party wants to track how much of a Storage it has processed using a minimal amount of state, like just a single index integer, so it can resume indexing later when it has been away and missed some events.

If an app or peer is always there and receives all the events, none of this is needed; the events are enough to tell it what it needs to know.

The downside of adding _localSeq metadata to documents is that now we have yet another way we need to query and sort them. IStorages will need to have an index for this purpose...
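
For the storage side, a minimal in-memory sketch of the _localSeq variant. Everything here is hypothetical (`LocalSeqStoreSketch`, `ingest`, `docsAfter`), and it assumes one document per path and that overwriting a path gives the document a fresh _localSeq. It scans and sorts on every query; a real IStorage would keep a dedicated index on _localSeq instead.

```typescript
interface StoredDoc {
  path: string;
  content: string;
  _localSeq: number; // local-only, assigned on ingest
}

class LocalSeqStoreSketch {
  private nextSeq = 1;
  // Latest document per path (simplification: one author, one doc per path).
  private byPath = new Map<string, StoredDoc>();

  // Store a document and stamp it with the next local sequence number.
  ingest(path: string, content: string): StoredDoc {
    const doc: StoredDoc = { path, content, _localSeq: this.nextSeq++ };
    this.byPath.set(path, doc);
    return doc;
  }

  // Documents with _localSeq greater than `afterSeq`, in ascending order.
  // This is the extra "query and sort" path the comment above refers to.
  docsAfter(afterSeq: number, limit: number = Infinity): StoredDoc[] {
    return Array.from(this.byPath.values())
      .filter((d) => d._localSeq > afterSeq)
      .sort((a, b) => a._localSeq - b._localSeq)
      .slice(0, limit);
  }
}

// Usage: an app that went away after processing seq 2 resumes from there.
const seqStore = new LocalSeqStoreSketch();
seqStore.ingest("/a", "first");  // _localSeq 1
seqStore.ingest("/b", "second"); // _localSeq 2
seqStore.ingest("/a", "third");  // _localSeq 3, replaces the seq 1 version
const missed = seqStore.docsAfter(2);
console.log(missed.map((d) => [d.path, d._localSeq])); // [["/a", 3]]
```

The app only has to remember the single integer it last processed, which is the "minimal amount of state" mentioned above.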


Diagram "2021 Indexing big" (made with diagrams.net):
https://drive.google.com/file/d/1_uA1j7gCL9PIKqbQ6ym5dsAgd3JSfHHN/view?usp=sharing

@cinnamon-bun cinnamon-bun changed the title Track when docs were received by a peer: _receivedTimestamp or _localSeq Track when docs were received by a peer: _receivedTimestamp or _localSeq - to enable reliable indexing and reliable livestreaming Mar 18, 2021
sgwilym pushed a commit that referenced this issue Feb 16, 2022
…/tslib-2.3.0

Bump tslib from 2.2.0 to 2.3.0
@sgwilym sgwilym closed this as completed Feb 17, 2022