Skip to content
Calle Laakkonen edited this page Jan 18, 2020 · 2 revisions

How Drawpile works

This document describes a high level overview of how collaborative painting is implemented in Drawpile.

The protocol

Drawpile uses a simple message based protocol communicate amongst all the members of a session.

Each drawing command, like penmove, penup or fillrect has a unique message type. The history of a drawing session is a sequence of these commands. The history can be recorded and replayed, producing the same end result.

Normally, the commands are encoded in an efficient binary format, but a text encoding is also supported. Here is a short example recording that can be opened in Drawpile.

!version=dp:4.20.1
1 resize right=100 bottom=100
1 newlayer id=1 title=Demo
1 brush layer=1
1 penmove 0 0
1 penmove 100 100
1 penup

The above sample creates a 100x100 image with a single layer and draws a diagonal line across it. See tests/ directory for more examples.

There are three different types of messages in the protocol: Control, Meta and Command.

Message types can also be divided into Transparent/Opaque (the messages the server needs to understand and the ones it just passes along) and recordable/nonrecordable (ones that make sense to store in recordings) types.

Control messages are used for client/server communications (e.g. negotiating login details, setting server properties, etc.) and are not relevant to drawing.

Meta messages are not directly related to drawing, but may affect how subsequent commands are filtered. They include things like user login notifications and layer access control settings.

Command messages are the actual drawing commands that effect changes to the canvas. A (filtered) session history with all the meta messages removed should yield the exact same end result as the original.

The server

A realtime collaborative editor has two challenges to solve: allowing a user to work on their changes with minimal latency and how to apply the edits in a consistent order.

More on the first problem later.

Drawpile solves the second problem in the most simple way possible: all edits go through a central server, which distributes them to all connected clients (including the originator) in the same order. So, when a client receives commands from the server, it knows all the other clients also received, or will receive, them in the same order.

The server itself is quite dumb. It does not need to understand the semantics of (opaque) messages, it just needs to add them to it's own copy of the sesssion history and then transmit that history to all clients. Newly logged in clients will have to download the entire history to catch up.

Access controls

Since the server is dumb, it does not know what a layer is or anything else about the actual business of painting. Therefore it cannot enforce any access controls, such as layer specific locks ^*^. Instead, access control filtering is performed on the client side. Allowed command are added to the client's internal history and filtered commands are simply dropped.

The server is responsible for gross access controls: whether to let a new user in and which users are considered "session owners". Session owners have the right to perform certain serverside commands normal users cannot, such as kicking other users from the session or changint the session title. They also have some rights the server does not know about, which are validated on the client side only.

^*^ Prior to version 2.0, the server did have an understand of the drawing command semantics and did perform the filtering. However, this mean the server version was very tightly coupled to the client version and changes to the protocol meant the server could not serve newer or older client versions.

The canvas

When a message makes it past the filters, it can finally have an effect on the canvas. In Drawpile's source code, this is handled by the class StateTracker. Drawpile's protocol is not entirely stateless. To allow brush strokes to appear in real time as they're drawn, they are sent as a series of penmove commands, rather than as a single complete pen stroke. So, a complete brush stroke is made up of multiple commands, like this:

1 brush layer=1 color=#ff0000 size=2
1 penmove 0 0
1 penmove 100 100
1 penup

Each command is prefixed with the user ID. (In the binary protocol, the ID is part of the message header.) The actual canvas part doesn't know about multiple users anymore, it just knows about multiple tool contexts, so the user ID is also called a context ID. For each context ID, the state tracker remembers the following information:

  • The selected layer
  • Current brush settings
  • Last pen position
  • Is the pen down? (a penmove when the pen is not down starts a new stroke)

Side note: If the brush/penmove and penup commands were to be merged into a single PenStroke command, the state tracker would not be necessary. However, this would mean brush strokes could not be streamed out in real time.

The most important part of the canvas is the layer stack: a list of one or more layers that hold the actual image data.

Layers

Layers are where the actual drawing happens. Layers offer the drawing primitives on which all the painting features are built: brush dab, brush line, rectangle fill and image put. Brush dab composites a brush mask onto the layer. Brush line draws a straight line segment by dabbing the brush at regular intervals (the spacing% brush property controls the interval.) Rectangle fill just fills a rectangular section with the specified color and blending mode. PutImage allows arbitrary pixel data to be written onto the layer (used, for example, when pasting images from the clipboard.)

A layer is not simply a long pixel array, but a matrix of tiles. Each tile is a 64x64 pixel buffer with copy-on-write semantics. A tile can be a null tile, in which case it behaves like a tile whose every pixel has a zero value.

Tiles having COW semantics has some very important consequences:

  • Copying a layer is very cheap: the content is shared until modified, so only the metadata and tile vector must be copied
  • Most layers are either mostly empty or mostly solid color: empty tiles consume just a single pointers worth of memory, and single color layers need just one tile for the whole layer
  • Cheap layer copies are essentials to implementing undo and retcon (more on those later)
  • Tile identity comparison can be used to quickly and cheaply find changed tiles for cache refreshes
  • Compositing mostly empty layers is very fast, since null tiles can be skipped

When pixel data on a layer changes, its "dirty region" is updated. The dirty region is the bounding rectangle of all pixels that have changed since the last screen refresh. When a change is reported, the UI will request a repaint of its cache pixmap. The layer stack will then "flatten" all the tiles intersecting the dirty region and copy the results onto the pixmap. (Note: a single large pixmap is used to cache the entire canvas in order to avoid visual glitches at the boundaries when zoomed in.)

Undo

In a typical single user editor, undo is implemented as a stack of changes. Each change object has two operations: redo() and undo(). The redo operation applies the change to the document and the undo operation undos the change.

So for example, when a user draws a line on the canvas a DrawLine object is created, it's redo operation called, and it's then pushed onto the undo stack. When the user presses Ctrl+Z, the DrawLine is popped off the stack and its undo operation called. Next, if the user presses Shift+Ctrl+Z, the object's redo operation is called and it is pushed back onto the stack.

Pixel operations aren't typically reversible, so a snapshot of the region painted over must be stored in the change object. This means that if the effect of the changes overlap, they must be undone in order they appear in the stack. For a collaborative drawing program, this can be a problem.

Imagine Alice draws a line. Bob then draws another line over it. Alice now undos her change. If the program would just naively call Alice's change object's undo() function, it would end up with a hole right in the middle of Bob's drawing.

Instead, the program must undo all changes up to Alice's change, then redo the ones that weren't supposed to be undone.

What Drawpile does is effectively the same, just implemented a bit differently to take advantage of features we already have: the session history and cheap layer copies.

But first, we must define what is an undoable action. Most of the time, undoing just one penmove makes no sense. Undoing just a penup never makes sense. To solve this problem, Drawpile protocol has a special command called UndoPoint that is used to demarcate the history into undoable sections. An undoable changeset is therefore the set of messages belonging to the same context ID, starting from (and including) the first undopoint until the next one.

We just need to introduce one more concept to have all the tools needed to implement undo: canvas snapshots. A canvas snapshot is exactly what it sounds like: a copy of the canvas at some point in time. It exploits the fact that COW layer copies are very cheap. When drawing happens, only the changed tiles need to be copied and made unique, so most canvas snapshots share most of their data.

The undo algorithm for context id C:

  1. Searches backward in history until the first not-undone UndoPoint UP belonging to C is found
  2. Find a state savepoint SP that was taken at or before UP. (A state savepoint is typically made at each UndoPoint)
  3. Mark all done commands belonging to C, starting at UP until the end of history as undone
  4. Reset canvas to SP
  5. Re-execute all not-undone commands starting from SP

The end result of calling undo(Alice) is that Alice's line appears to have been undone from right undearneath Bob's drawing!

Redo works almost the same:

  1. Find the first undone UndoPoint UP belonging to C, searching forward from the bottom of the undo stack.
  2. Find a state savepoint SP that was taken at or before UP.
  3. Mark undone commands belonging to C, from UP up to the next undopoint as done.
  4. Reset canvas to SP
  5. Re-execute all not-undone commands starting from SP

Okay, so now we have undo and redo, but what happens when I press undo twice, draw something, then press redo? This implementation would redo the first undone changeset, but that's not what we want. When we draw something after undoing, we throw away all the changesets outside the undo stack, so there isn't anything to redo.

Another way to look at it, is to think of the undo history as a directed acyclic graph where each node represents an undoable operation. We have a pointer that points to the node corresponding to the current state of the canvas. (That is: when we apply all operations reachable from this node to the initial canvas state, we end up with the current canvas.) When we undo something, we move the pointer to the previous node. When we draw something, we add a node. If the current node is not a leaf node, we create a branch. In a more advanced program, these branches would be accessible too.

To implement this sort of branching in a linear history, we need a third undo state: {done, undone, gone}

The new gone state behaves otherwise the same as undone, but it can never be redone.

Now, when we add a new UndoPoint to the history, we'll mark undone messages (that belong to the same context ID) as gone, to represent the fact that they are in a now unreachable fork of the undo history:

  1. Mark all undone messages belonging C as gone

And that's it! We now have a multiuser aware undo/redo system!

Retcon

The other problem of realtime collaborative editors mentioned earlier is how to allow users to work on their own changes without network latency getting in their way. Drawpile keeps all clients in sync by routing all drawing commands through a server which decides the canonical message ordering. Without lag hiding, the latency between moving the stylus and seeing the stroke appear on screen would make using the program frustrating beyond belief.

An earlier, pre-1.0, version hid the latency by using preview strokes: brush strokes drawn on a temporary layer that would be removed once the messages finished their roundtrip. While it worked, more or less, it was a pessimistic approach.

If we think of each command as a function applied to the canvas, we can model the session history as a sequence of function applications: C = C0 * f1 * f2 * ... * fn.

Some of these functions are commutative: we can get the same C from C0 * f1 * f2 and C0 * f2 * f1. An example of two commutative operations are brush strokes whose bounding rectangles do not intersect. In collaborative systems jargon, these kinds of operations are called concurrent and the non-commutative ones causally dependent.

Now, as it turns out, people will naturally avoid drawing over each other, so the vast majority of messages from different users coming in at about the same time are actually concurrent. So, the old approach using preview strokes was optimizing for the worst case that almost never happens.

The undo/redo system provides a toolkit for building a smarter latency hider. In the new system, local drawing operations are immediately applied to the canvas. When a message comes in that is not concurrent with the local changes, the canvas is rolled back and redrawn in the right order. This functionality in Drawpile is called retcon, since it rewrites history whenever inconsistencies are detected.

Local fork

Let's think about the session history as a DAG again. When we add commands directly to the canvas without waiting for the roundtrip, we'll keep track of them in a branch we call the local fork.

The local fork is created at the end of the session history (if it doesn't exist already) and the new command is appended to it. The commands are applied to the canvas immediately.

Concurrent commands (in case of brush strokes, this means the stroke bounding rectangles do not intersect) from another user are appended to the mainline session history as usual. The local fork is no longer at the end of the history.

When our own command has made the round trip, it is appended to the mainline history and removed from the local fork.

A problem arises when an incompatible command (causally dependant of any command in the local fork) is received.

The inconsistency is resolved by rolling back the canvas to some point before the local fork, then moving the local fork to the end of the history and replaying the commands. (Note: in practice, it's a better idea to just erase the local fork, as there are probably more non-concurrent commands on the way. Our own commands will be finishing their roundtrip shortly anyway.)

Finally, it is possible that we receive one of our own commands that does not match the tail of the local fork. This is rare, but not necessarily an error: our command may have been filtered away due to a lock, or the session operator may have undone an action on our behalf.

This is also resolved by rolling back and replaying, and this time the local fork must be discarded.

Undo with retcon

But how will undo/redo work with a forked session history? We will bypass the problem by ignoring undo commands in the local fork. Instead, we will only execute them when they are added to the mainline history. This simplifies the algorithm, at the expense of sacrificing lag-freeness.

Now, remember that when an undo command is executed, the session history is searched for the latest undoable undopoint UP and the canvas state is rolled back to some savepoint SP, such that SP≤UP.

The problem is that since the local fork precedes the undopoint, SP=UP will already contain the effects we are trying to undo!

Luckily, the solution is simple: we prohibit the creation of savepoints after the fork point. This way, every SP is guaranteed to never contain effects that are not yet in the mainline session history.

Since SP preceeds the local fork, moving the fork to the end of the history before rolling back is also safe.