Preview: Automerge binary data format #253
Conversation
@types/automerge/index.d.ts (outdated)

```ts
key?: string
value?: any
key: string | number
insert?: boolean
```
For me it is really hard to reason about an optional bool.
A bool should be either true or false. If you really need to express three states, I would prefer an enum with named states.
This is not the only place in the type specification where an optional bool is used. Maybe we could clean this up if possible.
Isn't this a side-effect of writing something in idiomatic JavaScript? Certainly good advice for TypeScript code, but encoding these states as 0, 1, 2 would be pretty opaque to the JavaScript consumer, and using strings would bloat the in-memory cost. In practice, I imagine you'd use a falsy check here, so `undefined` would be equivalent to `false`. I assume this is still just two states (falsy, `true`) rather than three (`undefined`, `false`, `true`) in any meaningful sense.
It's not tristate, it's just a boolean that defaults to false when absent. I don't know how to declare that in TypeScript. In 02271f0 I've changed operations such that they always have an `insert` property that is explicitly true or false.
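A minimal sketch of that kind of normalization (a hypothetical helper, not the code from the actual commit): coercing the optional property to an explicit boolean at one boundary means downstream code never has to reason about `undefined`.

```javascript
// Hypothetical sketch: normalize an optional boolean so that every
// operation carries an explicit `insert: true` or `insert: false`.
function normalizeOp(op) {
  return { ...op, insert: op.insert === true }
}

normalizeOp({ action: 'set', key: 'title' })          // insert becomes false
normalizeOp({ action: 'set', key: 0, insert: true })  // insert stays true
```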
Hi Martin, I've had some time now to digest all this and give it some thought.

First, on the subject of timestamps: paying for per-millisecond timestamps on all commits is pretty expensive, and it's not clear to me that there is much utility there. I think per-second timestamps would be a quite-reasonable optimization, and I'd further consider making their insertion optional.

Next, as for key movements or other expensive but less valuable data, I've thought quite a bit about "non-dependent actors" as a useful option here. The concept is (in old-automerge lingo) an actor which depends on other actors but is never depended upon. It could thus track the main document but would not need to be loaded unless it was necessary. Users could provision and synchronize this kind of cursor history during interactive sessions but throw it away afterwards, or simply not bother to share it during asynchronous collaboration. It's not clear to me whether this would still be feasible with the new design, but it seems in principle like it should be (simply bump the dependency hash as necessary).

As to the future of Automerge.Connection: I have been aware of this as an upcoming problem for quite some time and have only recently begun giving it serious consideration. The vector clock model is desirable because calculating the necessary synchronization work between two documents is a trivial operation that only needs to examine the clock. On the face of it, calculating a delta between hash-chained commits would be tricky: we'd need to calculate the least common ancestor for the peers and then generate a special commit that caught them up. Streaming the changes to allow a client to receive deltas incrementally would be quite expensive in comparison, but perhaps it's a desirable feature to send the result in a single exchange.

My mind does wander to the rsync/noms/jump-rope style architecture for synchronization. Is there a mechanical sympathy with a binary-encoding-level synchronization? Could we use some kind of probabilistic-blocks method to determine what needs to be exchanged in each direction at the storage layer? It's unclear to me right now.

All of this also intersects with privacy concerns. We need to think carefully about how we query between peers, what we expose, and to whom. Content hashes are, in general, pretty sensitive.

All that said, I'm very excited that we're getting to the point of working on these problems! I'm eager to dig into all of this in more detail as the binary layers come into greater focus. We'll need to find a way to commission a project to work on this stuff sometime soon, but that's a topic for a different venue.
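The least-common-ancestor step mentioned above could be sketched like this (hypothetical code; real Automerge histories form a DAG, so this only covers the linear-chain case):

```javascript
// Hypothetical sketch: find the most recent hash that two linear,
// newest-first hash chains have in common. Everything after that point
// on the other peer's chain is what needs to be sent.
function commonAncestor(chainA, chainB) {
  const seen = new Set(chainA)
  return chainB.find(hash => seen.has(hash)) // undefined if no shared history
}

commonAncestor(['c3', 'c2', 'c1'], ['d4', 'c2', 'c1']) // → 'c2'
```

With a branching history the same idea applies, but the walk has to follow all dependency edges rather than a single chain.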
Minisketch (https://github.com/sipa/minisketch) might be a useful library for dealing with synchronizing sets of hashes efficiently. This could be more efficient than storing and exchanging a whole lot of vector clock metadata.
+1 on per-second timestamps, and +1 on making timestamps optional altogether.
For what it's worth, I've got a lot of code built on top of the vector clock model, so I'd hate to see it thrown out entirely. I'd understood from this conversation that the plan was to keep vector clocks but augment them with chain hashing; is that no longer the case?
That definitely strikes me as the right move - it's always seemed a bit out of place in the main codebase.
Just to be clear, we'll always have access to the individual changes, right? Cevitxe also stores an append-only log of changes. Maybe there need to be different strategies available for the multiuser-text-editing use case, where you have more individual changes with tiny payloads, vs. the replicated-database use case, where you have fewer changes with more substantial payloads.
Okay, I've changed it to use second-resolution timestamps. Here is the updated file size:
I am really pleased with this: the gzipped document uses less than one byte per change, while still capturing the full keystroke-by-keystroke change history!
Yes, the full log of individual changes will always be available (e.g. through `Automerge.getAllChanges()`).
Agree, I think we should keep vector clocks but augment them with hash chains. Vector clocks are still useful because they allow two nodes to sync up in one round-trip, while resolving hash chains sometimes requires several round-trips.
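A sketch of why the vector clock comparison needs only one round-trip (hypothetical code, with clocks as plain actor-to-sequence-number maps rather than Automerge's actual representation): each side sends its clock once, and a direct comparison yields exactly which changes the other side is missing.

```javascript
// Hypothetical sketch: compare two vector clocks and report, per actor,
// the range of sequence numbers the other peer has not yet seen.
function missingChanges(myClock, theirClock) {
  const missing = {}
  for (const [actor, mine] of Object.entries(myClock)) {
    const theirs = theirClock[actor] || 0
    if (mine > theirs) missing[actor] = { from: theirs + 1, to: mine }
  }
  return missing
}

missingChanges({ alice: 3, bob: 1 }, { alice: 1 })
// → { alice: { from: 2, to: 3 }, bob: { from: 1, to: 1 } }
```

With hash chains alone, by contrast, a peer may have to ask repeatedly for ancestors it does not recognise, which is where the extra round-trips come from.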
May I suggest using |
This compression step happens outside of Automerge, so you can combine Automerge with the compression algorithm that best suits your requirements.
I was curious to test the performance improvements in crdt-benchmarks. This is quite a leap! You can compare the results here: dmonad/crdt-benchmarks@73f3f65

It is very impressive how well your new algorithm compresses meta information! The only thing that I'm worried about is the parsing of the compressed document.

I see that you implemented a similar module for encoding/decoding as I did for Yjs. If you are interested you could make use of lib0/encoding and lib0/decoding. I already put a crazy effort into optimizing read/write performance.
Hi @dmonad, thanks for running the benchmarks! The loading of documents is not yet optimised at all, and currently takes a slow path through all the old data structures. That is the next thing on my list of things to fix. |
* test/watchable_doc_test: move callback registration to beforeEach()
* copy automerge.d.ts to dist on build/publish
* use strict mode everywhere. The original code here suggests that in a browser, modifying a frozen object will fail silently, whereas in node it will throw an error. This test was failing for me in browser testing, so I dug into it and it turns out the difference isn't node vs browser but strict mode vs not. (In mocha.opts we had "use strict" so node tests were always running in strict mode. TypeScript was also compiling in strict mode.) Since we have control over whether we're in strict mode or not, it makes more sense to just turn it on everywhere and only test for that scenario.
* screw it, just use strict mode everywhere
* test/types: add CounterList
* @types/automerge/index.d.ts: `message` can be undefined
* restore uuid factory, add typings
* @types/automerge/index.d: add uuid()
* @types/automerge/index.d: getElemId
* put all types in automerge namespace
* test/test: restore message type test
* make `Change` generic: `Change<T>`
* test/proxies_test: restore tests with string indices
* Action parameter is different for Ops vs Diffs
* properly type Table.add so it enforces property order when passing array
* npm: fix name of types file
* Automerge.Text no longer considered experimental
* Undo unnecessary formatting changes
* Revert back to plain JS for tests
* Nicer JSON serialization of Counter, Table, and Text
* Improve JSON serialization of Table. Put the rows in a separate object from the columns, so that applications can iterate over the rows by calling `Object.values(doc.table.rows)`.
* Changelog for automerge#163 and 0.10.1
* 0.10.1
* actorId and elemId are not necessarily UUIDs
* Remove unused type definitions for tests
* setActorId is frontend-only
* Some corrections of the TypeScript types
* Initial set of tests for the TypeScript API
* Fix undo/redo when using separate frontend and backend. Bug: when using Automerge in a configuration with separate frontend and backend, attempting to perform `Frontend.undo()` would result in the following exception: `TypeError: Cannot read property 'updated' of null at makeChange (frontend/index.js:113:43)`. Fixed the bug and added tests that exercise undo/redo in the case where frontend and backend are separate.
* Fix some bugs in type definitions
* Exception on modifying frozen object is optional. In Node v12.3.1 and v10.16.0, assigning or deleting a property of a frozen object no longer throws an exception like it did in previous versions. Rather than trying to split the test by version number, I figured it was easier to just remove the assertion that an exception is thrown (and just check that the object remains unmodified).
* Tests still to do
* Update caveats in readme
* Exception on modifying frozen object is optional. In Node v12.3.1 and v10.16.0, assigning or deleting a property of a frozen object no longer throws an exception like it did in previous versions. Rather than trying to split the test by version number, I figured it was easier to just remove the assertion that an exception is thrown (and just check that the object remains unmodified).
* Changelog for automerge#165
* Run Travis tests on Node 11 and 12 as well
* README: note correct immutable behavior
* index.d.ts: remove stray question mark
* typescript tests: Automerge.Text
* typescript tests: Automerge.Table
* no need to specify type twice
* expand on usage of types in initialization
* typescript tests: Automerge.Counter
* ignore .vscode
* add `Automerge.from`. Returns a new document object with the given initial state. See automerge#127
* Document created with `from` needs to have a backend
* Update the documentation for Automerge.from
* Changelog for automerge#127. Closes automerge#175
* Fix indentation and whitespace
* Remove unused code
* Add types for Automerge.from (automerge#127)
* WIP: experiment with readonly types returned outside of Automerge
* @types/automerge/index.d: move comment
* aha! Doc<T> = DeepFreezeObject<T>, not DeepFreeze<T>
* tweak `any` tests to work with DeepFreeze
* group types functionally
* upgrade typescript
* rename to Freeze, Proxy
* more rearranging
* simplify Table definition by inheriting from Array<T>
* getConflicts key is `keyof T`
* add documentation for Doc<T> and Proxy<D>
* formatting
* fix handler signature automerge/automerge#155 (review)
* add tests for DocSet and WatchableDoc
* add DocSet tests
* ignore .vscode
* Revert "Update the documentation for Automerge.from". This reverts commit c0ba4d2.
* Make object freezing optional. It turns out that in V8, copying frozen objects (especially arrays) is a lot more expensive than copying non-frozen objects. See https://bugs.chromium.org/p/chromium/issues/detail?id=980227, https://jsperf.com/cloning-frozen-array/1, https://jsperf.com/cloning-frozen-object/1. Removing freezing entirely from skip_list.js since these objects are not part of the user-facing API, and so we don't need to worry as much about accidental mutation. Retaining freezing in a few places in counter.js, table.js, apply_patch.js and frontend/index.js where it has negligible performance impact.
* Allow freeze option to be passed in to init() and load()
* Update README: we no longer freeze objects by default
* Slight README refresh
* Our own object-copying function is faster. Surprisingly, Object.assign() is not the fastest way of copying an object: https://jsperf.com/cloning-large-objects/1
* Remove type parameter from Change interface. It is only needed for the "before" property, and I think that property should not be considered part of the public API. (It is only used within the frontend for internal bookkeeping; the Change objects produced and consumed by the backend do not have this property.)
* Use opaque BackendState type in the backend
* Remove unused type parameter from Message interface
* Update the documentation for Automerge.from
* Changelog and README updates for automerge#155
* TypeScript support for freeze option
* Changelog for automerge#177 and automerge#179. Fixes: automerge#177
* Changelog for v0.11.0
* 0.11.0
* Allow Text instantiation with preset value
* docs: Add introduction to "Applying operations to the local state"
* `DocSet<T>` was missing `docIds` property
* `DocSet.getDoc` return value is `Doc<T>`, not `T` (merge conflicts in @types/automerge/index.d.ts)
* `Clock` is a `Map`, not an object
* allow passing `init` options to `from`
* index.d.ts: more precise InitOptions definition; add InitOptions to .from
* add `freeze` to `InitOptions`
* add tests for `from` with options
* refactor: Remove unused `local` list
* docs: Add local state overview
* Types for `Frontend.init()` and `Frontend.from()`. Addendum to automerge#183
* Changelog for automerge#183
* Some refactoring/tidying. Name `instantiateText` is for consistency with `instantiateTable`
* Allow modification of Text without assignment to document. Methods `.set()`, `.insertAt()` and `.deleteAt()` are now defined directly on Automerge.Text, rather than going through the listProxy. Therefore, these methods can now be called directly on a Text object without having to first assign the Text object to the document. Fixes automerge#180, automerge#166
* Changelog for automerge#180 and automerge#181
* List methods such as .filter() should use proxy objects. Fixes automerge#174
* Upgrade npm dependencies. Remove `@types/sinon` and `@types/uuid` dependencies, which are unused. Update all version numbers in package.json to their latest version, except for `babel-loader`, which I'm keeping unchanged for now (upgrading it to 8.0.6 breaks `yarn build`, and I can't be bothered to figure that out now). Run `yarn upgrade` to bring packages up-to-date. The new version of `get-caller-file` dropped support for Node 9, and since that version of Node is now unsupported anyway, I dropped it from the Travis CI test matrix. Version 4.0.0-alpha.4 of `selenium-webdriver` requires Node 10 or higher. As Node 8 is still supported, I want to keep it in the CI test matrix. Thus, I had to add a resolution rule fixing `selenium-webdriver` to 4.0.0-alpha.1, which is compatible with Node 8.
* docs: Add description for determining causal readiness
* docs: Add missing operations
* .toString() of Text should return string content
* test.js: add tests for concurrent deletion
* Fix stack overflow error on large changes. Fixes automerge#202. The stack overflow is not due to unbounded recursion, but rather because of a method's varargs argument list being too long (presumably each argument takes up some space on the stack). In particular, there are several places where we are calling Array.push with an argument list generated using the spread operator in order to concatenate arrays. This commit replaces them with for loops, which also happen to be faster: https://jsperf.com/pushing-large-arrays
* Changelog for 0.12.0
* 0.12.0
* test/typescript_test.ts: docSet.docIds
* Revert "`Clock` is a `Map`, not an object". This reverts commit d8fd73f.
* Changelog for automerge#174, automerge#184, and automerge#199
* Readme updates
* Fix trap error by setting writable property descriptor to true
* add initialization tests for array, primitives
* fix typo in test name; make describe labels consistent
* readme: automatic formatting
* readme: initializing a document
* readme: updating a document
* readme: rearrange outline
* readme: actorId note
* readme: undo/redo edits
* readme: sending/receiving edits
* readme: conflicting changes edits
* readme: crdt datatypes edits
* readme: emoji
* readme: cross-references
* Tidy up description of network layers
* Text.toString()
* Add DocSet.removeDoc
* Changelog for automerge#210 and 0.12.1
* 0.12.1
* Remove unused parameter in SkipList node. Fixes automerge#213
* a test for round-tripping a control object
* split out tests a bit
* Allow objects to appear as elements in Automerge.Text. Fixes automerge#194
* toSpans() and a test set for it
* add to types
* okay, maybe not any
* quill delta experiment
* show that overlapping spans work
* basic apply-delta function
* account for control characters occupying space during delete & retain
* clean up the op application functions a little bit
* parse embeds from deltadocs
* make it possible to distinguish between attributes and embed control characters
* finish support for embeds
* explicitly account for control characters vs embeds
* Fix freezing of opaque string types
* Include link to repo
* docs(readme): add perge library to list under sending/receiving changes
* Update README.md
* Allow options object to be passed in to Automerge.change. Similarly in emptyChange(), undo(), and redo()
* canUndo should return false after Automerge.from()
* Bump handlebars from 4.1.2 to 4.5.3. Bumps [handlebars](https://github.com/wycats/handlebars.js) from 4.1.2 to 4.5.3. Release notes: https://github.com/wycats/handlebars.js/releases; changelog: https://github.com/wycats/handlebars.js/blob/master/release-notes.md; commits: handlebars-lang/handlebars.js@v4.1.2...v4.5.3. Signed-off-by: dependabot[bot] <support@github.com>
* fix deleteAt bug with input 0
* New `Automerge.getAllChanges()` API
* Bring internals documentation up-to-date
* Document frontend-backend protocol
* Bring the rest of the internals doc up-to-date
* Reword intro
* Fix indentation
* Changelog for v0.13.0
* 0.13.0
* Update copyright
* Add link to automerge-client-server
* Rephrase caveats
* remove tests for adding a Table row as an array of values
* Remove KeyOrder array from Table type
* remove KeyOrder array from types in tests
* clean up documentation
* remove undocumented ability to add table row from array of values
* remove test for API being removed
* Bump acorn from 6.2.1 to 6.4.1. Bumps [acorn](https://github.com/acornjs/acorn) from 6.2.1 to 6.4.1. Release notes: https://github.com/acornjs/acorn/releases; commits: acornjs/acorn@6.2.1...6.4.1. Signed-off-by: dependabot[bot] <support@github.com>
* support more nesting & fix retain logic
* use string concat instead of join()
* Update dependencies (yarn upgrade)
* Updating dependencies
* Throw exception if people try to use the old API
* Remove obsolete Node 8 and 11 from testing matrix
* Changelog for automerge#236
* Get Sauce Labs tests working again
* Update Sauce Labs build badge
* Remove the list of columns from Automerge.Table. Since automerge#236 the Automerge.Table type no longer allows rows to be added as an array of values, which are then mapped to column names. Now that this feature is removed, nothing in Automerge is actually using the list of columns that is currently required by the Automerge.Table constructor. The columns are still saved, but since they do not enforce any schema, they are purely advisory. Hopefully one day we can have proper schema support in Automerge, but the current half-hearted implementation is not really helping us get there. Therefore I think it is best to just remove this list-of-columns feature.
* Update Mocha configuration file format. mocha.opts is deprecated. This stops the annoying deprecation warning when running the tests.
* Mocha should load test files
* Remove trailing whitespace
* Changelog for automerge#238
* Make table row's primary key available as `id` property. Currently, if you have a row object from an Automerge.Table, you can get its primary key with `Automerge.getObjectId(row)`. This API is not very discoverable; users who don't know about this API might be tempted to generate their own IDs for table rows, which would miss the entire point of Automerge.Table. This patch makes the primary key more visible by making it available as `row.id` instead. When a new row is added, we check that the row object doesn't already have an `id` property. We also ensure that the `id` property cannot be modified. (Besides API usability, there is also a deeper reason for making this change: on the `performance` branch, objectIds are backend-generated Lamport timestamps rather than UUIDs to enable better compression; since `Table.add` should synchronously return the primary key of the new row, it must use a different ID from the objectId. Putting the row's primary key in a separate property reinforces that distinction.)
* Fix description of Table in README
* Changelog for automerge#241 and automerge#242
* Remove set() method from Automerge.Table API. It is not needed, since add() and remove() handle changes to the set of rows, and properties of row objects can be updated directly without requiring any set() operation at the table level. Looking at the code, I now realise that set() was not intended to be part of the public API at all: it is called while applying a patch, and if a user calls it, it does not generate any operations in the change context. Hence I renamed it to start with an underscore (like `_clone()` and `_freeze()`), and added a warning to the comment.
* Changelog for automerge#243
* Changelog for 0.14.0
* 0.14.0
* Link to new CRDT website
* Update table class type to reflect property getter
* Changelog and test for automerge#249
* Fix console inspection of proxy objects in Node
* Fix type signatures for WatchableDoc#get and WatchableDoc#set
* Use Slack's own invitation link instead of communityinviter
* Queued changes (whose dependencies are missing) should also be saved. Previously, load(save()) would discard any queued changes. Bug reported by @KarenSarmiento, fixes automerge#258
* Make clearer that fine-grained updates are preferred. Add a README section as suggested by @johannesjo in automerge#260, and make a more descriptive exception when users try to assign an object that is already in the document. Fixes automerge#260
* Changelog for 0.14.1
* 0.14.1
* Link to automerge#253 from README
* support calling indexOf with an object
* logo assets
* replace h1 with logo
* README.md: smaller logo
* images with wider spacing
* revert unintended reformatting
* Update name of main branch. fixes automerge#264

Co-authored-by: Herb Caudill <herb@devresults.com>
Co-authored-by: Martin Kleppmann <martin@kleppmann.com>
Co-authored-by: Herb Caudill <herb@caudillweb.com>
Co-authored-by: Harry Brundage <harry.brundage@gmail.com>
Co-authored-by: Irakli Gozalishvili <contact@gozala.io>
Co-authored-by: Eric Dahlseng <edahlseng@yahoo.com>
Co-authored-by: Peter van Hardenberg <pvh@pvh.ca>
Co-authored-by: Mikey Stengel <mikey.stengel96@gmail.com>
Co-authored-by: Brent Keller <brentkeller@gmail.com>
Co-authored-by: Jeff Peterson <jeff@yak.sh>
Co-authored-by: Max Gfeller <max.gfeller@gmail.com>
Co-authored-by: Sam McCord <sam.mccord@protonmail.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Speros Kokenes <speros@speross-mbp.lan>
Co-authored-by: Jeremy Apthorp <nornagon@nornagon.net>
Co-authored-by: Lauritz Hilsøe <mail+gh@lauritz.me>
Co-authored-by: vincent.capicotto <vincent.capicotto@hiptest.net>
Co-authored-by: Phil Schatz <253202+philschatz@users.noreply.github.com>
@ept I was wondering if you are targeting a release date for v1.
Hi @ankit-m, no definite schedule, but I hope to have a preview release within the next month or two. A final stable release may take a bit longer (since this branch has a lot of new code, there will be bugs that need to be ironed out) but I'm hoping well before the end of the year.
Rather than (0, 0). The reason for this change is that when a change gets applied to a document, an actor index of 0 in a change gets transformed into the document's actor index for the author of that change, which may be any number. This is confusing, since the head of a list is not actually associated with any one particular actorId.
This makes patches match the TypeScript type definitions, where `props` and `edits` are non-optional properties.
common.js is intended for code that is shared by frontend and backend, but appendEdit is only needed in the backend.
Enable a gentle eslint
Changed it so that patches for both list and map objects initially have a `props` property, which for lists is indexed by elemId. This makes it easier and more robust to find the appropriate subpatch without having to scan the list of edits. The `props` property is then deleted by finalizePatch() before the patch is sent to the frontend.
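A rough sketch of the shape being described (hypothetical structure for illustration only; the real patch format is internal to Automerge and more detailed than this):

```javascript
// Hypothetical: while the backend builds a list patch, subpatches can be
// looked up directly in `props` by elemId instead of scanning `edits`.
const internalPatch = {
  edits: [{ action: 'insert', index: 0, elemId: '1@actor' }],
  props: { '1@actor': { value: 'a' } }, // assumed structure, keyed by elemId
}

// The backend strips `props` before the patch goes to the frontend.
function finalizePatch(patch) {
  const { props, ...frontendPatch } = patch
  return frontendPatch
}
```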
…ormat

Implements new frontend-backend protocol with more compact handling of list element insertion and deletion. Fixes #311
Update dependencies and Sauce Labs configuration. The Uint8Array changes are because the latest version of Safari otherwise throws exceptions while running the tests.
All the compatibility-breaking changes that I've planned to make are now completed and merged into the performance branch, and so I'm declaring this branch done and merging it into main. No doubt there will be bugs, but we can address them on separate PRs. Hooray, it's shipping! 🚢
As we all know, Automerge's current performance on large documents is terrible (#89) — loading them is very slow, the data files on disk are huge, and they use huge amounts of memory and network bandwidth. Improving this situation is one of my top priorities. However, most of the low-hanging fruit had already been picked, and a more fundamental rethink of Automerge's data structures was needed.
It turns out that CRDTs are very easy to implement badly, but actually quite difficult to make fast and efficient. A lot of this is due to the metadata overhead: for example, in a text document, every single character needs a unique identifier. Various schemes have been designed to reduce this metadata cost, but some of them behave badly under concurrent insertion. Others rely on periodic cleanup operations, but the cleanup depends on communicating with all nodes, and it stops if some of the nodes are offline for a long time.
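To make the metadata overhead concrete, here is an illustrative sketch (not Automerge's actual format) of the kind of per-character identifier an operation-based text CRDT needs, e.g. a Lamport-style (counter, actorId) pair on every inserted character:

```javascript
// Illustrative only: each inserted character carries a unique ID so that
// concurrent edits can be ordered consistently. This per-character
// metadata is what dominates the size of naive encodings.
const actorId = 'a1b2c3' // hypothetical actor identifier
let counter = 0
const chars = 'hi'.split('').map(ch => ({
  elemId: `${++counter}@${actorId}`, // e.g. '1@a1b2c3'
  ch,
}))
```

Even in this toy version, the identifier is several times larger than the one-byte character it describes.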
Almost a year ago I first wrote up a proposal for a compressed binary encoding format for Automerge, taking a new approach that I've not yet seen in the CRDT research literature. Borrowing ideas from column-oriented database systems, this is in the first instance a way of saving disk space and network bandwidth, but it also paves the way towards some big potential performance improvements (and it doesn't suffer from the aforementioned problems).
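The column-oriented idea can be illustrated with a toy sketch (this is not the real wire format): rather than storing each operation as a self-describing object, store each field of all operations as its own column, where long runs of repeated values compress extremely well.

```javascript
// Illustrative sketch of columnar encoding for a run of text operations.
const ops = [
  { action: 'set', insert: true, value: 'h' },
  { action: 'set', insert: true, value: 'i' },
  { action: 'set', insert: true, value: '!' },
]

// Transpose rows into columns: every field becomes its own array.
const actionColumn = ops.map(op => op.action) // ['set', 'set', 'set']

// Naive run-length encoding of one column.
function rle(column) {
  const runs = []
  for (const v of column) {
    const last = runs[runs.length - 1]
    if (last && last.value === v) last.count++
    else runs.push({ count: 1, value: v })
  }
  return runs
}

rle(actionColumn) // → [{ count: 3, value: 'set' }]
```

In a typing session almost every op has the same action and `insert: true`, so whole columns collapse into a single run, while the `value` column is just the raw text.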
I have been working on this project on and off since, and while it's not yet finished, I wanted to share an update on where we're at and where we're heading. I am excited about the progress so far and I'm pretty sure this is the right path for the future.
This work has been happening on the `performance` branch. My intention is to stabilise the features and data formats on this branch over the coming weeks, and to then turn it into a release candidate for ✨Automerge 1.0✨. There are a bunch of compatibility-breaking changes on this branch, both in the APIs and the data formats, and I want to wrap up all of these breaking changes into a single bump of the major version number, so as to minimise the number of further breaking changes in the future. There will be a migration tool for Automerge 0.* users to convert their existing documents to the Automerge 1.0 format, and the new data formats are carefully designed with extensibility in mind, allowing us to add new features to Automerge while retaining interoperability between clients running different versions of Automerge.

`Automerge.getChanges()` and `Automerge.save()` now return `Uint8Array` objects using the binary data format. The Automerge frontend APIs are largely unchanged (except for a few tweaks as documented in the CHANGELOG). The communication protocol between frontend and backend has changed a lot (fixing some historical design mistakes, and allowing the frontend to be simplified, which should make it easier to create frontends in other languages).

There are also some changes of the "might as well also do this now" sort:
The switch to using hash chaining will have a bearing on network protocols that sync up Automerge replicas, such as Automerge.Connection, which currently relies on vector clocks. This is something I would like to discuss with the community. By the way, I have split Automerge.Connection into a separate repository, because I think we might want to evolve it and assign it version numbers independently from the Automerge core.
Anyway, time to report some numbers on how the binary format compares to Automerge's existing JSON data format. First of all, let me emphasise that I have not yet solved the performance problems on this branch (despite the branch name "performance") — so far I have primarily worked on making the binary format compact. Therefore the analysis below focusses mostly on encoded data sizes. Also, I will use text editing examples, since text editing tends to accumulate changes very quickly (one change per keystroke), so it's a challenging type of workload. Nevertheless, the binary format should be good for all types of data supported by Automerge.
The binary-encoded length of a typical single-character text insertion is 105 bytes. Sounds like a lot, but only 35 of those bytes are the encoding of the actual operation. The rest is made up of the hash of the previous change (33 bytes), the actorId (17 bytes), the timestamp (6 bytes), a checksum (4 bytes), and miscellaneous other header fields (10 bytes). It might be possible to squeeze that a little further, but with diminishing returns. A single-character deletion takes 109 bytes. There is no point gzipping these short byte sequences, as doing so actually increases the size.
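The per-change breakdown above adds up as follows:

```javascript
// Byte breakdown of a single-character text insertion, per the numbers above.
const breakdown = {
  operation: 35,      // encoding of the actual operation
  prevChangeHash: 33, // hash of the previous change
  actorId: 17,
  timestamp: 6,
  checksum: 4,
  otherHeaders: 10,   // miscellaneous header fields
}
const total = Object.values(breakdown).reduce((sum, n) => sum + n, 0)
console.log(total) // 105
```

Put differently: two thirds of each small change is fixed overhead (hashing, identity, integrity), which is why the format shines when many operations share one change header.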
The benefits of the binary format show up much more starkly when inserting or deleting longer runs of characters in one go. When inserting 10,000 ASCII characters at once into a text document (a medium-sized copy&paste), the JSON format uses about 248 bytes per inserted character, leading to a change about 2.5 MB in size. On the other hand, the binary format contains the raw ASCII text (10,000 bytes) plus a small, near-constant overhead of 124 bytes for the whole change. Although the JSON data compresses well with gzip, with a 24:1 compression ratio, the gzipped JSON is still 10 times the size of the un-gzipped binary data. The binary data can still be gzipped further for an additional 2.3:1 compression.
The situation is even more extreme for a 10,000 character deletion: the binary encoding is almost constant-size, while the JSON format takes over 100 bytes per deleted character. In this example, the JSON encoding is 8,400 times the size of the binary encoding.
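The rough arithmetic behind these two comparisons (sizes in bytes, using the figures stated above):

```javascript
const insertedChars = 10000

// JSON: ~248 bytes per inserted character, gzip at the stated 24:1 ratio.
const jsonSize = insertedChars * 248      // 2,480,000 bytes ≈ 2.5 MB
const gzippedJson = jsonSize / 24         // ≈ 103 kB

// Binary: the raw ASCII text plus a near-constant ~124-byte overhead.
const binarySize = insertedChars + 124    // 10,124 bytes

console.log(gzippedJson / binarySize)     // ≈ 10, i.e. gzipped JSON is
                                          // still ~10x the un-gzipped binary

// Deletion: JSON takes over 100 bytes per deleted character, so the
// stated 8,400:1 ratio implies a binary encoding of roughly
// (10000 * 100) / 8400 ≈ 119 bytes — essentially constant-size.
```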
So much for the microbenchmarks. A more interesting question is how the encoding fares with more complex editing patterns. For this, I used a dataset that we captured a few years ago, when we wrote the LaTeX source of an entire paper using a homegrown text editor. The result is an editing history containing 332,702 keystrokes (of which 182,315 are single-character insertions; the rest are single-character deletions and cursor movements). The final text file (without any editing history) is 104,852 bytes in size.
I converted this dataset to the new binary format, with each keystroke as a separate change, and the result is as follows:
The first row shows what happens when we treat each of the 332,702 changes separately: 440 bytes per change in JSON, and 152 bytes per change for the binary encoding — a bit more than the per-change numbers above, but not wildly so. However, treating each change individually drastically limits the compression we can apply. Applying gzip compression to each JSON change individually yields a compression ratio of only 2:1.
It is much more effective to compress the document as a whole. Simply gzipping the JSON change history as a whole yields a 24:1 compression ratio. But the big deal is the whole-document binary encoding, which is about a megabyte (roughly 3 bytes per change), and which further gzips to 664 kB. When working with the whole document, binary encoding + gzip is almost 10 times more compact than JSON + gzip!
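A back-of-envelope calculation with the figures above makes the gap concrete (a hypothetical script that just multiplies the measured numbers; the 1,119,341-byte whole-document size is discussed below):

```javascript
// Rough totals derived from the per-change and whole-document measurements.
const numChanges = 332702;
const jsonPerChange = 440;      // bytes per change, JSON encoding
const binaryPerChange = 152;    // bytes per change, binary encoding
const wholeDocBinary = 1119341; // whole-document binary encoding, bytes

const jsonTotal = numChanges * jsonPerChange;            // total JSON bytes
const binaryTotal = numChanges * binaryPerChange;        // total per-change binary bytes
const perChangeInWholeDoc = wholeDocBinary / numChanges; // bytes/change, whole-doc

console.log((jsonTotal / 1e6).toFixed(1));   // "146.4" (MB)
console.log((binaryTotal / 1e6).toFixed(1)); // "50.6" (MB)
console.log(perChangeInWholeDoc.toFixed(1)); // "3.4" (bytes per change)
```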
I should emphasise that in all of these examples, the compression is lossless: it's not just a snapshot of the latest state, but it preserves the entire character-by-character editing history. In fact, from the whole-document binary format it is possible to reconstruct the bitwise-identical change history that created it, recompute all the SHA-256 hashes for the dependency chains between changes, and end up with exactly the same hashes for the "heads" (in Git terminology). Note that hashes need to be recomputed, not stored: merely storing the hash of each of the 332,702 changes would require about 10 MB, ten times the size of the binary-encoded file.
The next interesting question is: what makes up those 1,119,341 bytes of binary data? Can we reduce it further? Well, it breaks down as follows:
Comparing the whole-document binary encoding (~1 MB without gzip) to the final text document without any CRDT metadata (~100 kB without gzip), we see there is still a substantial cost. But we do gain a very detailed change history, and of course the CRDT merging ability. And we have made big progress compared to the JSON encoding (two orders of magnitude smaller without gzip, one order of magnitude smaller with gzip). We could compress it by another factor of ~2 by reducing timestamps to 1-second resolution and leaving out the cursor movements (they could be maintained in separate transient storage, and omitted from the persistent document history).
So far all of the compression has been lossless. Of course we can also consider approximate forms of compression, such as combining several changes with similar timestamps into a more coarse-grained change (reducing the number of timestamps we have to encode), or discarding some of the change history entirely. However, any sort of lossy compression would mean losing the ability to reconstruct the original change history, and thus losing the ability to check the hash chains. Depending on how we use the hash chains (do we only check integrity, or also verify authenticity?), such a trade-off will need to be considered very carefully.
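As a rough shape for the coarse-graining option, one could coalesce runs of changes whose timestamps fall within the same one-second window, so only one timestamp per window needs to be encoded (a hypothetical helper; as noted, applying it would destroy the ability to verify the original hash chain):

```javascript
// Merge consecutive changes whose timestamps are within `windowMs` of
// the start of the current group. Lossy: original per-change timestamps
// (and hence the original hash chain) cannot be reconstructed.
function coalesceByTimestamp(changes, windowMs = 1000) {
  const groups = [];
  for (const change of changes) {
    const last = groups[groups.length - 1];
    if (last && change.time - last.time < windowMs) {
      last.ops.push(...change.ops); // fold into the current group
    } else {
      groups.push({ time: change.time, ops: [...change.ops] });
    }
  }
  return groups;
}

const changes = [
  { time: 0,    ops: ['ins h'] },
  { time: 400,  ops: ['ins i'] },
  { time: 2500, ops: ['ins !'] },
];
console.log(coalesceByTimestamp(changes).length); // 2
```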
A key insight from the experiments above is that when dealing with large change histories, it is much more efficient to encode the document as a whole than to encode each change separately (in the example above, it makes a factor of 50 difference). This has implications for network protocols that sync Automerge nodes: for example, Hypermerge is currently based on append-only logs of changes, assuming that we're working with one change at a time.
Future protocols will need to detect how far out of sync two nodes are: if they are almost in sync, it is more efficient to just send the last few missing changes, as each change is typically a few hundred bytes. But if they are far apart (or if one of the nodes lacks the document entirely) it is more efficient to send the whole-document encoding. Multiple versions of the whole-document encoding can be merged efficiently, without having to turn the document into a log of changes and back again. Essentially, the CRDT can run in two modes: an operation-based mode (where we send changes over the network) and a state-based mode (where we send the whole-document encoding). Network protocols will need to be adapted to take advantage of this choice, by figuring out when to use which mode.
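One way such a protocol might pick between the two modes is a simple size estimate (a hypothetical heuristic, not an actual Automerge or Hypermerge API; `avgChangeBytes` is an assumed figure based on the per-change sizes above):

```javascript
// Hypothetical sync-mode heuristic: ship individual changes while their
// estimated total stays below the whole-document encoding, otherwise
// fall back to the state-based (whole-document) mode.
function chooseSyncMode(missingChanges, wholeDocBytes, avgChangeBytes = 150) {
  const changeLogBytes = missingChanges * avgChangeBytes;
  return changeLogBytes <= wholeDocBytes
    ? { mode: 'operation-based', estimatedBytes: changeLogBytes }
    : { mode: 'state-based', estimatedBytes: wholeDocBytes };
}

console.log(chooseSyncMode(3, 1119341).mode);      // "operation-based"
console.log(chooseSyncMode(100000, 1119341).mode); // "state-based"
```

A real protocol would also need a way to estimate how many changes the peer is missing in the first place, e.g. by exchanging heads.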
One last thing about the whole-document encoding: while its primary design goal is to be compact, the close second goal is to allow fast loading of the current state from disk. For example, fetching the current value of a particular field (e.g. the title of a document) should not require decoding the whole file, but only reading a bit of metadata and then seeking to the appropriate place in the file. On some platforms, we might even be able to avoid loading the whole file into memory by memory-mapping it and seeking only to those bits that we need.
To test this, I wrote a hacky experimental decoder that reads the latest text of the LaTeX paper source example (that is, it extracts the 104,852-character final text string from the 1,119,341-byte encoded file). Here are the timings from running that decoder 10 times on my laptop:
I don't know why the running time fluctuates so much (GC maybe?), but a median time of ~5 milliseconds is pretty encouraging. Loading a file with hundreds of thousands of changes would probably take minutes with the current Automerge implementation, if it completes at all.
I will write up a detailed specification of the binary format sometime soon. A lot of thought has gone into it, e.g. designing for future extensibility. For now, I want to put this update out there to share what's happening.
By the way, Orion and Alex have been doing an excellent job of tracking this progress in their Rust port of Automerge. This means interop between the JavaScript and Rust backends is not far off!