Add Sequence Numbers to write operations #10708

Open
bleskes opened this Issue · 1 comment

3 participants

@bleskes
Owner

Introduction

An Elasticsearch shard can receive indexing, update, and delete commands. Those changes are applied first on the primary shard, maintaining per-doc semantics, and are then replicated to all the replicas. All these operations happen concurrently. While we maintain ordering on a per-doc basis using versioning support, there is no way to order the operations with respect to each other. Having such a per-shard operation ordering will enable us to implement higher-level features such as a Changes API (follow changes to documents in a shard and index) and a Reindexing API (take all data from a shard and reindex it into another, potentially mutating the data). Internally, we could use this ordering to speed up shard recoveries by identifying which specific operations need to be replayed to the recovering replica instead of falling back to a file-based sync.

To get such an ordering, each operation will be assigned a unique and ever-increasing Sequence Number (in short, seq#). This sequence number will be assigned on the primary and replicated to all replicas. Seq#s are to be indexed in Lucene to allow sorting, range filtering, etc.
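
As a rough illustration only (not the actual mapping), and assuming a recent Lucene version, a seq# could be stored alongside the document as a doc values field for sorting plus a points field for range filtering. The helper class and field names below are hypothetical:

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.LongPoint;
import org.apache.lucene.document.NumericDocValuesField;

public class SeqNoFields {
    /**
     * Adds a seq# to a Lucene document: a doc values field for sorting and a
     * points field for range filtering. Field names are hypothetical; the
     * (term, counter) structure described below could be stored as two such fields.
     */
    public static void addSeqNo(Document doc, long seqNo) {
        doc.add(new NumericDocValuesField("_seq_no", seqNo));
        doc.add(new LongPoint("_seq_no", seqNo));
    }
}
```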

Warning, research ahead

What follows in this ticket is the current thinking about how to best implement this feature. It may change in subtle or major ways as the work continues. It is important to implement this infrastructure in a way that is correct, resilient to failures, and does not slow down indexing. We feel confident in the approach described below, but we may have to backtrack or change it completely.

What is a Sequence #

Assigning an order to operations on a primary is a simple matter of incrementing a local counter for every operation. However, this is not sufficient to guarantee global uniqueness and monotonicity under error conditions, where the primary shard can be isolated by a network partition. To cope with those, the identity of the current primary needs to be baked into each operation. This way, operations arriving late from an old primary can be detected and rejected.

In short, each seq# consists of two numbers:

  • a term - this number is incremented with every primary assignment and is determined by the cluster master. This is very similar to the notion of a term in Raft, a view-number in Viewstamped Replication or an epoch in Zab.
  • a counter - this number is incremented by the primary with each operation it processes.

To achieve ordering, when comparing two seq#s, s1 and s2, we say that s1 < s2 if and only if s1.term < s2.term, or (s1.term == s2.term and s1.counter < s2.counter). Equality and greater-than are defined in a similar fashion.
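
A minimal sketch of this comparison as a simple value class; the names are illustrative and not taken from the actual implementation:

```java
/** Hypothetical value class for a seq#: a primary term plus a per-term counter. */
public final class SeqNo implements Comparable<SeqNo> {
    public final long term;    // incremented by the master on every primary assignment
    public final long counter; // incremented by the primary for each operation it processes

    public SeqNo(long term, long counter) {
        this.term = term;
        this.counter = counter;
    }

    @Override
    public int compareTo(SeqNo other) {
        // s1 < s2 iff s1.term < s2.term, or the terms are equal and s1.counter < s2.counter
        int byTerm = Long.compare(term, other.term);
        return byTerm != 0 ? byTerm : Long.compare(counter, other.counter);
    }
}
```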

For reasons explained later on, we maintain two special seq#s for each shard copy:

  1. local checkpoint# - the highest seq# for which all lower seq#s have been processed. Note that, because of concurrent indexing, this is not necessarily the highest seq# the shard has processed: some operations can complete while earlier, heavier ones are still ongoing.

  2. global checkpoint# (or just checkpoint#) - the highest seq# for which the local shard can guarantee that all previous seq#s (itself included) have been processed on all active shard copies (i.e., primary and replicas).

Those two numbers will be maintained in memory and also persisted in the metadata of every Lucene commit.
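
For illustration only, here is a simplified sketch of how a local checkpoint# could be advanced as operations complete out of order within a single primary term; a real implementation would need a bounded, windowed structure, and the class name is hypothetical:

```java
import java.util.BitSet;

/** Hypothetical tracker for the local checkpoint# within a single primary term. */
public class LocalCheckpointTracker {
    private final BitSet processed = new BitSet(); // a real impl would use a bounded window
    private long checkpoint = -1;                  // nothing processed yet

    /** Marks a seq# counter as processed and advances the checkpoint over any contiguous run. */
    public synchronized void markProcessed(long counter) {
        processed.set(Math.toIntExact(counter));
        while (processed.get(Math.toIntExact(checkpoint + 1))) {
            checkpoint++;
        }
    }

    /** The highest counter for which all lower counters have been processed. */
    public synchronized long getCheckpoint() {
        return checkpoint;
    }
}
```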

Changes to indexing flow on primaries

Here is a sketch of the indexing code on primaries. Much of it is identical to the current logic. Changes or additions are marked in bold.

  1. Validate write consistency based on routing tables.
  2. Incoming indexing request is parsed first (rejected upon mapping/parsing failures)
  3. Under uid lock:
    1. Versioning is resolved to a fixed version to be indexed.
    2. Operation is assigned a seq#
    3. Doc is indexed into Lucene.
    4. Doc is put into translog.
  4. Replication
    1. Failures in step 3 above are also replicated (e.g. due to a failure of Lucene tokenization)
    2. Send docs to all assigned replicas.
    3. Replicas respond with their current local checkpoint#.
    4. When all respond (or have failed), send answer to client.
  5. Checkpoint update:
    1. Update the global checkpoint# to the highest seq# for which all active shard copies have processed all lower seq#s (inclusive). This is based on the information received in 4.3 (see the sketch after this list).
    2. If changed, send the new global checkpoint# to the replicas (this can be folded into a heartbeat or the next indexing request).
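
A sketch of step 5.1: the global checkpoint# is simply the minimum of the local checkpoint#s of all active copies (the primary's own included). The class and parameter names are hypothetical:

```java
import java.util.Collection;

/** Hypothetical helper for computing the global checkpoint# on the primary. */
public final class GlobalCheckpointCalculator {
    /**
     * Returns the highest seq# counter for which all active copies have processed
     * every lower counter, i.e. the minimum of all reported local checkpoints.
     */
    public static long compute(long primaryLocalCheckpoint, Collection<Long> replicaLocalCheckpoints) {
        long globalCheckpoint = primaryLocalCheckpoint;
        for (long replicaCheckpoint : replicaLocalCheckpoints) {
            globalCheckpoint = Math.min(globalCheckpoint, replicaCheckpoint);
        }
        return globalCheckpoint;
    }
}
```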

Changes to indexing flow on replicas

As above, this is a sketch of the indexing code on replicas. Changes from the current logic are marked in bold.

  1. Validate request
    1. Seq#'s term is >= locally known primary term.
  2. Under uid lock:
    1. Index into Lucene if the seq# is greater than that of the local copy and the operation doesn't represent an error on the primary (see the sketch after this list).
    2. Add to local translog.
  3. Respond with the current local checkpoint#
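
A sketch of the decision in step 2.1, reusing the hypothetical SeqNo class from above; the flag and method names are illustrative:

```java
/** Hypothetical helper for the replica-side indexing decision. */
public final class ReplicaIndexingDecision {
    /**
     * Returns true if the incoming operation should be indexed into Lucene:
     * it must have succeeded on the primary and be newer than the local copy.
     */
    public static boolean shouldIndex(SeqNo incoming, SeqNo localCopy, boolean failedOnPrimary) {
        if (failedOnPrimary) {
            return false; // primary failures are replicated but only recorded, not indexed
        }
        // a null localCopy means the replica has no copy of this document yet
        return localCopy == null || incoming.compareTo(localCopy) > 0;
    }
}
```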

Global Checkpoint# increment on replicas

The primary advances its global checkpoint# based on its knowledge of its own local checkpoint# and those reported by the replicas. Periodically, it shares this knowledge with the replicas:

  1. Validate source:
    1. Seq#'s primary term is == locally known primary term.
  2. Validate correctness:
    1. Check that all seq#s below the new global checkpoint# have been processed and that the local checkpoint# is of the same primary term. If not, fail the shard (see the sketch after this list).
  3. Set the shard’s copy of global checkpoint#
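
Putting steps 1-3 together, here is a simplified sketch of how a replica could handle a global checkpoint# update pushed by the primary. It reuses the hypothetical LocalCheckpointTracker from above and simplifies the checkpoint to a plain counter within a single primary term; all names are illustrative:

```java
/** Minimal stand-in for the bits of shard state the sketch needs. */
interface ShardState {
    void fail(String reason);
    void setGlobalCheckpoint(long checkpoint);
}

/** Hypothetical replica-side handler for a global checkpoint# update. */
public final class GlobalCheckpointSync {
    public static void onGlobalCheckpointUpdate(long incomingPrimaryTerm,
                                                long localPrimaryTerm,
                                                long newGlobalCheckpoint,
                                                LocalCheckpointTracker localCheckpoint,
                                                ShardState shard) {
        // 1. Validate source: only accept updates from the primary term we currently know about.
        if (incomingPrimaryTerm != localPrimaryTerm) {
            throw new IllegalStateException("global checkpoint update from an unexpected primary term");
        }
        // 2. Validate correctness: everything up to the new global checkpoint# must
        //    already have been processed locally; otherwise this copy is broken.
        if (localCheckpoint.getCheckpoint() < newGlobalCheckpoint) {
            shard.fail("local checkpoint lags behind the new global checkpoint");
            return;
        }
        // 3. Record the new global checkpoint#.
        shard.setGlobalCheckpoint(newGlobalCheckpoint);
    }
}
```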

First use case - faster replica recovery

Having an ordering of operations allows us to speed up the recovery of an existing replica and its synchronization with the primary. At the moment we do a file-based sync, which typically results in over-copying of data. Having a clearly marked checkpoint# allows us to limit the sync operation to just those documents that have changed since it. In many cases we expect to have no documents to sync at all. This improvement will be tracked in a separate issue.
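
As a rough sketch of the idea (not the actual recovery code), the primary would replay only the translog operations whose seq# lies above the recovering copy's checkpoint#. It reuses the hypothetical SeqNo class from above; the Operation type and accessor are hypothetical as well:

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper illustrating ops-based (rather than file-based) replica recovery. */
public final class OpsBasedRecovery {
    /** Minimal stand-in for a translog operation that knows its seq#. */
    public interface Operation {
        SeqNo seqNo();
    }

    /** Returns the operations the recovering copy is missing, i.e. everything above its checkpoint#. */
    public static List<Operation> opsToReplay(List<Operation> translog, SeqNo recoveringCopyCheckpoint) {
        List<Operation> missing = new ArrayList<>();
        for (Operation op : translog) {
            if (op.seqNo().compareTo(recoveringCopyCheckpoint) > 0) {
                missing.add(op);
            }
        }
        return missing;
    }
}
```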

@bleskes bleskes added the resiliency label
@shikhar

First use case - faster replica recovery

I'd argue the first use case is making replication semantics more sound :)
