Preview: Automerge binary data format #253

Merged: 367 commits into main from the performance branch, May 5, 2021

Conversation

@ept (Member) commented May 1, 2020

As we all know, Automerge's current performance on large documents is terrible (#89) — loading them is very slow, the data files on disk are huge, and they use huge amounts of memory and network bandwidth. Improving this situation is one of my top priorities. However, most of the low-hanging fruit had already been picked, and a more fundamental rethink of Automerge's data structures was needed.

It turns out that CRDTs are very easy to implement badly, but actually quite difficult to make fast and efficient. A lot of this is due to the metadata overhead: for example, in a text document, every single character needs a unique identifier. Various schemes have been designed to reduce this metadata cost, but some of them behave badly under concurrent insertion. Others rely on periodic cleanup operations, but the cleanup depends on communicating with all nodes, and it stops if some of the nodes are offline for a long time.
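
To get a feel for that overhead, here is a deliberately naive illustration (a hypothetical structure, not Automerge's actual internal representation): in a typical sequence CRDT, each character carries a unique operation ID plus a reference to the element it was inserted after, so a single byte of text can easily drag along well over a hundred bytes of naively encoded metadata.

```js
// Hypothetical, naive per-character metadata in a sequence CRDT (illustration only).
const char = {
  opId:   { actorId: 'd4f5a2c1-9b0e-4c7a-8f23-1a2b3c4d5e6f', counter: 1234 }, // unique ID of this insertion
  elemId: { actorId: '7e8d9c0b-1a2f-4e3d-9c8b-7a6f5e4d3c2b', counter: 981 },  // element it was inserted after
  value:  'a',     // the single byte of actual content
  deleted: false
}
// Serialised as JSON, an object like this is well over 150 bytes for one character of
// text, which is roughly the per-keystroke cost the current JSON format shows below.
console.log(JSON.stringify(char).length)
```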

Almost a year ago I first wrote up a proposal for a compressed binary encoding format for Automerge, taking a new approach that I've not yet seen in the CRDT research literature. Borrowing ideas from column-oriented database systems, this is in the first instance a way of saving disk space and network bandwidth, but it also paves the way towards some big potential performance improvements (and it doesn't suffer from the aforementioned problems).

I have been working on this project on and off since, and while it's not yet finished, I wanted to share an update on where we're at and where we're heading. I am excited about the progress so far and I'm pretty sure this is the right path for the future.

This work has been happening on the performance branch. My intention is to stabilise the features and data formats on this branch over the coming weeks, and to then turn it into a release candidate for ✨Automerge 1.0✨. There are a bunch of compatibility-breaking changes on this branch, both in the APIs and the data formats, and I want to wrap up all of these breaking changes into a single bump of the major version number, so as to minimise the number of further breaking changes in the future. There will be a migration tool for Automerge 0.* users to convert their existing documents to the Automerge 1.0 format, and the new data formats are carefully designed with extensibility in mind, allowing us to add new features to Automerge while retaining interoperability between clients running different versions of Automerge.

Automerge.getChanges() and Automerge.save() now return Uint8Array objects using the binary data format. The Automerge frontend APIs are largely unchanged (except for a few tweaks as documented in the CHANGELOG). The communication protocol between frontend and backend has changed a lot (fixing some historical design mistakes, and allowing the frontend to be simplified, which should make it easier to create frontends in other languages).
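
For illustration, here is a minimal sketch of the new API surface from application code (method names as referenced in this thread; exact signatures may differ in the eventual release):

```js
const Automerge = require('automerge') // performance branch / 1.0 preview

let doc = Automerge.init()
doc = Automerge.change(doc, d => { d.text = new Automerge.Text('hello') })

// save() now returns the compressed whole-document encoding as a Uint8Array
const bytes = Automerge.save(doc)       // suitable for writing to disk as-is
const reloaded = Automerge.load(bytes)  // reconstructs the document, history included

// Individual changes are also binary now, one Uint8Array per change
const changes = Automerge.getAllChanges(doc)
console.log(changes[0] instanceof Uint8Array) // true; ~100 bytes for a small change
```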

There are also some changes of the "might as well also do this now" sort:

  • Every change now has a millisecond-resolution timestamp of when it was made. This timestamp has no bearing on the conflict resolution; it exists only to enable better inspection of the editing history (allowing you to see when a change was made). I'm not sure why we didn't add a timestamp from the beginning — it seems like quite an obvious omission in retrospect.
  • Every change can be identified by the SHA-256 hash of its binary encoding, and this hash is now used instead of actorId+sequence number (vector clocks) to express dependencies between changes — much like in Git or blockchains (see discussion in Hash-based integrity checking of operation sets #27 and Using hash chaining to encode the dependency graph #200). ActorIds still exist, because using hashes for everything totally destroys compression (see below).
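
To illustrate the hash-chaining idea (a sketch only; the real format hashes the binary change encoding described in this post):

```js
const crypto = require('crypto')

// A change is identified by the SHA-256 hash of its binary encoding, and it names
// its predecessors ("heads") by their hashes, much like commits in Git.
function hashOfChange (encodedChange) { // encodedChange: Uint8Array
  return crypto.createHash('sha256').update(encodedChange).digest('hex')
}

// Hypothetical example: two previously encoded changes are the current heads,
// so a newly created change would list their hashes as its dependencies.
const head1 = Uint8Array.of(0x85, 0x6f, 0x4a, 0x83) // stand-ins for real encoded changes
const head2 = Uint8Array.of(0x85, 0x6f, 0x4a, 0x84)
const deps = [head1, head2].map(hashOfChange)
console.log(deps) // two 64-character hex strings identifying the parent changes
```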

The switch to using hash chaining will have a bearing on network protocols that sync up Automerge replicas, such as Automerge.Connection, which currently relies on vector clocks. This is something I would like to discuss with the community. By the way, I have split Automerge.Connection into a separate repository, because I think we might want to evolve it and assign it version numbers independently from the Automerge core.

Anyway, time to report some numbers on how the binary format compares to Automerge's existing JSON data format. First of all, let me emphasise that I have not yet solved the performance problems on this branch (despite the branch name "performance") — so far I have primarily worked on making the binary format compact. Therefore the analysis below focusses mostly on encoded data sizes. Also, I will use text editing examples, since text editing tends to accumulate changes very quickly (one change per keystroke), so it's a challenging type of workload. Nevertheless, the binary format should be good for all types of data supported by Automerge.

The binary-encoded length of a typical single-character text insertion is 105 bytes. Sounds like a lot, but only 35 of those bytes are the encoding of the actual operation. The rest is made up of the hash of the previous change (33 bytes), the actorId (17 bytes), the timestamp (6 bytes), a checksum (4 bytes), and miscellaneous other header fields (10 bytes). It might be possible to squeeze that a little further, but with diminishing returns. A single-character deletion takes 109 bytes. There is no point gzipping these short byte sequences, as doing so actually increases the size.

|                            | Current JSON format | JSON + gzip   | New binary format | Binary + gzip |
|----------------------------|---------------------|---------------|-------------------|---------------|
| Single-character insertion | 280 bytes           | 187 bytes     | 105 bytes         | 138 bytes     |
| Single-character deletion  | 183 bytes           | 159 bytes     | 109 bytes         | 141 bytes     |
| 10,000-character insertion | 2,476,915 bytes     | 105,296 bytes | 10,124 bytes      | 4,380 bytes   |
| 10,000-character deletion  | 1,118,968 bytes     | 28,386 bytes  | 133 bytes         | 151 bytes     |

The benefits of the binary format show up much more starkly when inserting or deleting longer runs of characters in one go. When inserting 10,000 ASCII characters at once into a text document (a medium-sized copy&paste), the JSON format uses about 248 bytes per inserted character, leading to a change about 2.5 MB in size. On the other hand, the binary format contains the raw ASCII text (10,000 bytes) plus a small, near-constant overhead of 124 bytes for the whole change. Although the JSON data compresses well with gzip, with a 24:1 compression ratio, the gzipped JSON is still 10 times the size of the un-gzipped binary data. The binary data can still be gzipped further for an additional 2.3:1 compression.

The situation is even more extreme for a 10,000 character deletion: the binary encoding is almost constant-size, while the JSON format takes over 100 bytes per deleted character. In this example, the JSON encoding is 8,400 times the size of the binary encoding.

So much for the microbenchmarks. A more interesting question is how the encoding fares with more complex editing patterns. For this, I used a dataset that we captured a few years ago, when we wrote the LaTeX source of an entire paper using a homegrown text editor. The result is an editing history containing 332,702 keystrokes (of which 182,315 are single-character insertions; the rest are single-character deletions and cursor movements). The final text file (without any editing history) is 104,852 bytes in size.

I converted this dataset to the new binary format, with each keystroke as a separate change, and the result is as follows:

|                            | JSON format       | JSON + gzip      | Binary format    | Binary + gzip |
|----------------------------|-------------------|------------------|------------------|---------------|
| Paper (individual changes) | 146,406,415 bytes | 71,347,241 bytes | 51,620,565 bytes | N/A           |
| Paper (whole document)     | 146,406,415 bytes | 6,132,895 bytes  | 1,119,341 bytes  | 664,268 bytes |

The first row shows what happens when we treat each of the 332,702 changes separately: 440 bytes per change in JSON, and 152 bytes per change for the binary encoding — a bit more than the per-change numbers above, but not wildly so. However, treating each change individually drastically limits the compression we can apply. Applying gzip compression to each JSON change individually yields a compression ratio of only 2:1.

It is much more effective to compress the document as a whole. Simply gzipping the JSON change history as a whole yields a 24:1 compression ratio. But the big deal is the whole-document binary encoding, which is about a megabyte (3 bytes per change), and which further gzips to 664 kB. When working with the whole document, binary encoding + gzip is almost 10 times more compact than JSON + gzip!

I should emphasise that in all of these examples, the compression is lossless: it's not just a snapshot of the latest state, but it preserves the entire character-by-character editing history. In fact, from the whole-document binary format it is possible to reconstruct the bitwise identical change history that created it, recompute all the SHA-256 hashes for the dependency chains between changes, and end up with exactly the same hashes for the "heads" (in Git terminology). Note that hashes need to be recomputed, not stored: merely storing the hash of each of the 332,702 changes would require about 10MB, ten times the size of the binary-encoded file.

The next interesting question is: what makes up those 1,119,341 bytes of binary data? Can we reduce it further? Well, it breaks down as follows:

  • 662,388 bytes for the timestamps on each change. Yes, you read that correctly: two thirds of the file is just timestamps! The reason: timestamps have millisecond resolution, and for each change we store the number of milliseconds since the previous change. When typing, the time between keystrokes is usually a few hundred milliseconds. The binary format encodes integers between 64 and 8,191 using two bytes, so most changes require two bytes to represent their timestamp delta to the previous change (see the varint sketch after this list). Two bytes times 332,702 changes comes to about 664 kB. We could reduce this to one byte per change by lowering the timestamp resolution to whole seconds instead of milliseconds.
  • 182,315 bytes for the ASCII characters of the 182,315 insertion operations. This is just ASCII text, so it further gzips pretty well. This example document only contains ASCII, but obviously the binary format handles all of Unicode using UTF-8.
  • 112,726 bytes for the CRDT metadata on the text object. That is about 0.44 bytes per text editing operation — pretty awesome!
  • 124,306 bytes for the cursor movement operations (1.2 bytes per operation).
  • 36,954 bytes to store how the operations are grouped into changes.
  • 652 bytes for miscellaneous headers.
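
Here is the varint sketch referred to above. A signed LEB128-style encoding is consistent with the stated size boundaries (0–63 fits in one byte, 64–8,191 in two); this is an illustration of the size arithmetic, not Automerge's exact encoder.

```js
// Signed LEB128-style variable-length integer: 7 bits per byte, with the high bit
// marking continuation. One byte covers -64..63, two bytes cover -8192..8191, which
// is why a typical few-hundred-millisecond delta between keystrokes costs two bytes.
function encodeSignedVarint (value) {
  const bytes = []
  let more = true
  while (more) {
    let byte = value & 0x7f
    value >>= 7 // arithmetic shift, so negative values sign-extend correctly
    // done when the remaining value is just the sign extension of the 6th bit
    if ((value === 0 && (byte & 0x40) === 0) || (value === -1 && (byte & 0x40) !== 0)) {
      more = false
    } else {
      byte |= 0x80
    }
    bytes.push(byte)
  }
  return Uint8Array.from(bytes)
}

console.log(encodeSignedVarint(63).length)   // 1 byte
console.log(encodeSignedVarint(300).length)  // 2 bytes (typical inter-keystroke delta)
console.log(encodeSignedVarint(8191).length) // 2 bytes
console.log(encodeSignedVarint(8192).length) // 3 bytes
```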

Comparing the whole-document binary encoding (~1 MB without gzip) to the final text document without any CRDT metadata (~100 kB without gzip), we see there is still a substantial cost. But we do gain a very detailed change history, and of course the CRDT merging ability. And we have made big progress compared to the JSON encoding (two orders of magnitude smaller without gzip, one order of magnitude smaller with gzip). We could compress it by another factor of ~2 by reducing timestamps to 1-second resolution and leaving out the cursor movements (they could be maintained in separate transient storage, and omitted from the persistent document history).

So far all of the compression has been lossless. Of course we can also consider approximate forms of compression, such as combining several changes with similar timestamps into a more coarse-grained change (reducing the number of timestamps we have to encode), or discarding some of the change history entirely. However, any sort of lossy compression would mean losing the ability to reconstruct the original change history, and thus losing the ability to check the hash chains. Depending on how we use the hash chains (to check integrity, to verify authenticity?), such a trade-off will need to be considered very carefully.

A key insight from the experiments above is that when dealing with large change histories, it is much more efficient to encode the document as a whole than to encode each change separately (in the example above, it makes a factor of 50 difference). This has implications for network protocols that sync Automerge nodes: for example, Hypermerge is currently based on append-only logs of changes, assuming that we're working with one change at a time.

Future protocols will need to detect how far out of sync the two nodes are: if they are almost in sync, it is more efficient to just send the last few missing changes, as each change is typically a few hundred bytes. But if they are far apart (or if one of the nodes is lacking the document entirely) it is more efficient to send the whole-document encoding. Multiple versions of the whole-document encoding can be merged efficiently, without having to turn the document into a log of changes and back again. Essentially, the CRDT can run in two modes: an operation-based mode (where we send changes over the network) and a state-based mode (where we send the whole-document encoding). Network protocols will need to be adapted to take advantage of this choice, by figuring out when to use which mode.
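
As a hedged sketch of that decision (all helper and variable names here are hypothetical; this is not an existing Automerge API):

```js
// Hypothetical: `missingChanges` is the list of binary changes (Uint8Array) that the
// peer is believed to lack; for a peer with no copy of the document it would contain
// every change in the history.
function chooseSyncPayload (doc, missingChanges) {
  const wholeDocument = Automerge.save(doc) // state-based: compressed whole document
  const changeBytes = missingChanges.reduce((sum, c) => sum + c.byteLength, 0)

  // A peer that is only a few changes behind is cheaper to serve with those changes;
  // a peer that is far behind (or empty) is cheaper to serve with the whole document.
  if (changeBytes < wholeDocument.byteLength) {
    return { mode: 'changes', payload: missingChanges }   // operation-based mode
  } else {
    return { mode: 'document', payload: [wholeDocument] } // state-based mode
  }
}
```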

One last thing about the whole-document encoding: while its primary design goal is to be compact, the close second goal is to allow fast loading of the current state from disk. For example, fetching the current value of a particular field (e.g. the title of a document) should not require decoding the whole file, but only reading a bit of metadata and then seeking to the appropriate place in the file. On some platforms, we might even be able to avoid loading the whole file into memory by memory-mapping it and seeking only to those bits that we need.
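
To make the seek-instead-of-decode idea concrete, here is a purely hypothetical sketch: the header layout, offsets, and helper names below are invented, since the format is not specified in this post. The point is only that a small index allows reading one column without parsing the rest of the file.

```js
const fs = require('fs')

// Hypothetical: assume the file header yields a small table of columns, e.g.
// [{ columnId: 'value', byteOffset: 412, byteLength: 10000 }, ...]. Fetching one
// field's data then needs a seek and one short read, not a full decode of the file.
function readColumn (path, columnTable, columnId) {
  const entry = columnTable.find(c => c.columnId === columnId)
  const fd = fs.openSync(path, 'r')
  const buf = Buffer.alloc(entry.byteLength)
  fs.readSync(fd, buf, 0, entry.byteLength, entry.byteOffset) // read only that column
  fs.closeSync(fd)
  return buf
}
```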

To test this, I wrote a hacky experimental decoder that reads the latest text of the LaTeX paper source example (that is, it extracts the 104,852-character final text string from the 1,119,341-byte encoded file). Here are the timings from running that decoder 10 times on my laptop:

12.919ms
6.203ms
55.147ms
1.702ms
1.960ms
4.270ms
8.165ms
5.617ms
1.977ms
1.753ms

I don't know why the running time fluctuates so much (GC maybe?), but a median time of ~5 milliseconds is pretty encouraging. Loading a file with hundreds of thousands of changes would probably take minutes with the current Automerge implementation, if it completes at all.

I will write up a detailed specification of the binary format sometime soon. A lot of thought has gone into it, e.g. designing for future extensibility. For now, I want to put this update out there to share what's happening.

By the way, Orion and Alex have been doing an excellent job of tracking this progress in their Rust port of Automerge. This means interop between the JavaScript and Rust backends is not far off!

(Code review thread on the TypeScript type definitions; the field under discussion is `insert?: boolean`.)

```ts
key?: string
value?: any
key: string | number
insert?: boolean
```

Member:

For me it is really hard to reason about an optional bool.
A bool should be either true or false. If you really need to express three states, I would prefer an enum with named states.

Member:

This is not the only place in the type specification where an optional bool is used. Maybe we could clean this up if possible.

Contributor:

Isn't this a side-effect of writing something in idiomatic JavaScript? Certainly good advice for TypeScript code, but encoding these states as 0, 1, 2 would be pretty opaque to the JavaScript consumer, and using strings would bloat the in-memory cost. In practice, I imagine you'd use a falsy check here, so undefined would be equivalent to false. I assume this is still just two states (falsy, true) rather than three (undefined, false, true) in any meaningful sense.

Member Author:

It's not tristate, it's just a boolean that defaults to false when absent. I don't know how to declare that in TypeScript. In 02271f0 I've changed operations such that they always have an insert property that is explicitly true or false.

@pvh (Member) commented May 4, 2020

Hi Martin, I've had some time now to digest all this and give it some thought.

First, on the subject of timestamps, paying for per-millisecond timestamps on all commits is pretty expensive, and it's not clear to me that there is much utility there. I think per-second timestamps would be a quite-reasonable optimization, and I'd further consider making their insertion optional.

Next, as for key movements or other expensive but less valuable data, I've thought quite a bit about "non-dependent actors" as a useful option here. The concept is (in old-automerge lingo) an actor which depends on other actors but is never depended-upon. It could thus track the main document but would not need to be loaded unless it was necessary. Users could provision and synchronize this kind of cursor history during interactive sessions but throw them away afterwards or simply not bother to share them during asynchronous collaboration. It's not clear to me whether this would still be feasible with the new design but it seems in principle like it should be (simply bump the dependency hash as necessary).

As to the future of Automerge Connection: I have been aware of this as an upcoming problem for quite some time and have only recently begun giving it serious consideration. The vector clock model is desirable because calculating the necessary synchronization work between two documents is a trivial operation that only needs to examine the clock. On the face of it, calculating a delta between hash-chained commits would be tricky: we'd need to calculate the least common ancestor for the peers and then generate a special commit that caught them up. Streaming the changes to allow a client to receive deltas incrementally would be quite expensive in comparison, but perhaps it's a desirable feature to send the result in a single exchange.
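
To illustrate the kind of computation involved, here is a rough sketch (hypothetical data structures; not a worked-out protocol): given the peer's heads, the changes it is missing are the ones reachable from our heads but not from its heads.

```js
// Hypothetical: `changesByHash` maps hash -> { deps: [hash, ...], bytes: Uint8Array }.

// Every change reachable from the given heads. A node holding these heads must also
// hold all of their ancestors, since a change can only be applied once all of its
// dependencies are present.
function ancestors (heads, changesByHash) {
  const seen = new Set()
  const queue = [...heads]
  while (queue.length > 0) {
    const hash = queue.pop()
    if (seen.has(hash) || !changesByHash.has(hash)) continue
    seen.add(hash)
    queue.push(...changesByHash.get(hash).deps)
  }
  return seen
}

// The changes we hold that a peer with the given heads is missing.
function changesMissingOnPeer (ourHeads, peerHeads, changesByHash) {
  const peerHas = ancestors(peerHeads, changesByHash)
  return [...ancestors(ourHeads, changesByHash)]
    .filter(hash => !peerHas.has(hash))
    .map(hash => changesByHash.get(hash).bytes)
}
```

The hard part, as noted above, is learning the peer's heads cheaply in the first place, which is exactly what the vector-clock model made trivial.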

My mind does wander to the rsync/noms/jump-rope style architecture for synchronization. Is there a mechanical sympathy with a binary-encoding-level synchronization? Could we use some kind of probabilistic-blocks method to determine what needs to be exchanged in each direction at the storage layer? It's unclear to me right now.

All of this stuff also intersects with privacy concerns. We need to think carefully about how we query between peers and what we expose and to whom. Content hashes are, in general, pretty sensitive.

All that said, I'm very excited that we're getting to the point of working on these problems! I'm eager to dig into all of this in more detail as the binary layers come into greater focus. We'll need to find a way to commission a project to work on this stuff sometime soon, but that's a topic for a different venue.

@mlockett42

Minisketch (https://github.com/sipa/minisketch) might be a useful library for dealing with synchronizing sets of hashes efficiently. This could be more efficient than storing and exchanging a whole lot of vector clock metadata.

@HerbCaudill (Contributor)

> First, on the subject of timestamps, paying for per-millisecond timestamps on all commits is pretty expensive, and it's not clear to me that there is much utility there. I think per-second timestamps would be a quite-reasonable optimization, and I'd further consider making their insertion optional.

+1 on per-second timestamps, and +1 on making timestamps optional altogether.

@HerbCaudill (Contributor)

> The switch to using hash chaining will have a bearing on network protocols that sync up Automerge replicas, such as Automerge.Connection, which currently relies on vector clocks. This is something I would like to discuss with the community.

For what it's worth, I've got a lot of code built on top of the vector clock model; so I'd hate to see it thrown out entirely. I'd understood from this conversation that the plan was to keep vector clocks but augment them with chain hashing. Is that no longer the case?

> By the way, I have split Automerge.Connection into a separate repository, because I think we might want to evolve it and assign it version numbers independently from the Automerge core.

That definitely strikes me as the right move - it's always seemed a bit out of place in the main codebase.

@HerbCaudill (Contributor)

> A key insight from the experiments above is that when dealing with large change histories, it is much more efficient to encode the document as a whole than to encode each change separately (in the example above, it makes a factor of 50 difference). This has implications for network protocols that sync Automerge nodes: for example, Hypermerge is currently based on append-only logs of changes, assuming that we're working with one change at a time.

Just to be clear, we'll always have access to the individual changes, right? Cevitxe also stores an append-only log of changes. Maybe there need to be different strategies available for the multiuser-text-editing use case, where you have more individual changes with tiny payloads, vs. the replicated-database use case, where you have fewer changes with more substantial payloads.

@ept (Member Author) commented May 11, 2020

Okay, I've changed it to use second-resolution timestamps. Here is the updated file size:

|                        | Binary format | Binary + gzip |
|------------------------|---------------|---------------|
| Paper (whole document) | 695,298 bytes | 302,067 bytes |

I am really pleased with this: the gzipped document uses less than one byte per change, while still capturing the full keystroke-by-keystroke change history!

> Just to be clear, we'll always have access to the individual changes, right?

Yes, the full log of individual changes will always be available (e.g. through Automerge.getChanges() and Automerge.getAllChanges()). The compressed document format is only returned by Automerge.save().

> I've got a lot of code built on top of the vector clock model; so I'd hate to see it thrown out entirely. I'd understood from this conversation that the plan was to keep vector clocks but augment them with chain hashing

Agree, I think we should keep vector clocks but augment them with hash chains. Vector clocks are still useful because they allow two nodes to sync up in one round-trip, while resolving hash chains sometimes requires several round-trips.

@ichorid commented May 21, 2020

May I suggest using lz4 compression instead of gzip? Of course, lz4's compression ratio is considerably worse than gzip's, but its compression speed is about 6x higher. Also, lz4 can compress data incrementally, which is very nice for e.g. fitting as much data as possible into a network packet without resorting to backtracking.

@ept (Member Author) commented May 25, 2020

> May I suggest using lz4 compression instead of gzip?

This compression step happens outside of Automerge, so you can combine Automerge with the compression algorithm that best suits your requirements.
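
For example, a minimal sketch using Node's built-in zlib (any compressor could be swapped in):

```js
const Automerge = require('automerge')
const zlib = require('zlib')

const doc = Automerge.change(Automerge.init(), d => { d.note = 'hello' })

// Compression is the application's choice, applied outside of Automerge:
const bytes = Automerge.save(doc)        // Uint8Array in the binary document format
const compressed = zlib.gzipSync(bytes)  // could equally be lz4, brotli, zstd, ...
const restored = Automerge.load(zlib.gunzipSync(compressed))
```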

@dmonad commented Jun 3, 2020

I was curious to test the performance improvements in crdt-benchmarks. This is quite a leap!

You can compare the results here: dmonad/crdt-benchmarks@73f3f65

It is very impressive how well your new algorithm compresses meta information! The only thing that I'm worried about is that parsing of the compressed document (parseTime) now takes longer than before. There seems to be quite some overhead in encoding the document in this way. In my experience, the true problem of CRDTs is that they are too slow to parse long edit-histories. Network traffic and storage are not really a problem.

I see that you implemented a similar module for encoding/decoding as I did for Yjs. If you are interested you could make use of lib0/encoding and lib0/decoding. I already put a crazy effort into optimizing read/write performance.

@ept (Member Author) commented Jun 3, 2020

Hi @dmonad, thanks for running the benchmarks! The loading of documents is not yet optimised at all, and currently takes a slow path through all the old data structures. That is the next thing on my list of things to fix.

@ept ept changed the base branch from master to main June 12, 2020 12:41
@ankit-m commented Jul 18, 2020

@ept I was wondering if you are targeting a release date for v1.

@ept (Member Author) commented Jul 23, 2020

Hi @ankit-m, no definite schedule, but I hope to have a preview release within the next month or two. A final stable release may take a bit longer (since this branch has a lot of new code, there will be bugs that need to be ironed out) but I'm hoping well before the end of the year.

ept added 6 commits July 23, 2020 10:56

  • Rather than (0, 0). The reason for this change is that when a change gets applied to a document, an actor index of 0 in a change gets transformed into the document's actor index for the author of that change, which may be any number. This is confusing, since the head of a list is not actually associated with any one particular actorId.

pvh and others added 25 commits April 28, 2021 23:30

  • This makes patches match the TypeScript type definitions, where props and edits are non-optional properties.
  • common.js is intended for code that is shared by frontend and backend, but appendEdit is only needed in the backend.
  • Changed it so that patches for both list and map objects initially have a `props` property, which for lists is indexed by elemId. This makes it easier and more robust to find the appropriate subpatch without having to scan the list of edits. The `props` property is then deleted by finalizePatch() before the patch is sent to the frontend.
  • …ormat: implements the new frontend-backend protocol with more compact handling of list element insertion and deletion. Fixes #311
  • Update dependencies and Sauce Labs configuration. The Uint8Array changes are because the latest version of Safari otherwise throws exceptions while running the tests.

@ept ept merged commit c4ff6f3 into main May 5, 2021

@ept (Member Author) commented May 5, 2021

All the compatibility-breaking changes that I've planned to make are now complete and merged into the performance branch, and so I'm declaring this branch done and merging it into main. No doubt there will be bugs, but we can address them in separate PRs. Hooray, it's shipping! 🚢
