Add Sequence Numbers to write operations #10708

Open
bleskes opened this Issue Apr 21, 2015 · 8 comments

@bleskes
Member
bleskes commented Apr 21, 2015 edited

Introduction

An Elasticsearch shard can receive indexing, update, and delete commands. Those changes are applied first on the primary shard, maintaining per-doc semantics, and are then replicated to all the replicas. All these operations happen concurrently. While versioning maintains ordering on a per-doc basis, there is no way to order the operations with respect to each other. Having such a per-shard operation ordering will enable us to implement higher-level features such as a Changes API (follow changes to documents in a shard and index) and a Reindexing API (take all data from a shard and reindex it into another, potentially mutating the data). Internally we could use this ordering to speed up shard recoveries, by identifying which specific operations need to be replayed to the recovering replica instead of falling back to a file-based sync.

To get such ordering, each operation will be assigned a unique and ever-increasing Sequence Number (in short, seq#). This sequence number will be assigned on the primary and replicated to all replicas. Seq# are to be indexed in Lucene to allow sorting, range filtering, etc.

Warning, research ahead

What follows in this ticket is the current thinking about how to best implement this feature. It may change in subtle or major ways as the work continues. It is important to implement this infrastructure in a way that is correct, resilient to failures, and does not slow down indexing. We feel confident with the approach described below, but we may have to backtrack or change the approach completely.

What is a Sequence

Assigning an operation order on a primary is a simple matter of incrementing a local counter for every operation. However, this is not sufficient to guarantee global uniqueness and monotonicity under error conditions, where the primary shard can be isolated by a network partition. To cover those cases, the identity of the current primary needs to be baked into each operation, so that, for example, late-arriving operations from an old primary can be detected and rejected.

In short, each operation is assigned two numbers:

  • a term - this number is incremented with every primary assignment and is determined by the cluster master. This is very similar to the notion of a term in Raft, a view-number in Viewstamped Replication or an epoch in Zab.
  • a seq# - this number is incremented by the primary with each operation it processes.

To achieve ordering, when comparing two operations o1 & o2, we say that o1 < o2 if and only if o1.seq# < o2.seq# or (o1.seq# == o2.seq# and o1.term < o2.term). Equality and greater-than are defined in a similar fashion.
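As a concrete illustration (not the actual Elasticsearch implementation; class and field names are made up), the comparison could look like this:

```java
// Illustrative sketch of the seq#/term ordering defined above.
final class OpId implements Comparable<OpId> {
    final long seqNo;
    final long term;

    OpId(long seqNo, long term) {
        this.seqNo = seqNo;
        this.term = term;
    }

    @Override
    public int compareTo(OpId other) {
        // seq# is the primary sort key; the term only breaks ties between operations
        // that carry the same seq# (e.g. assigned by different primaries).
        int bySeqNo = Long.compare(this.seqNo, other.seqNo);
        return bySeqNo != 0 ? bySeqNo : Long.compare(this.term, other.term);
    }
}
```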

For reasons explained later on, we maintain for each shard copy two special seq#:

  1. local checkpoint# - the highest seq# for which all lower seq# have been processed. Note that this is not necessarily the highest seq# the shard has processed: due to concurrent indexing, some changes can be completed while earlier, heavier ones are still ongoing.
  2. global checkpoint# (or just checkpoint#) - the highest seq# for which the local shard can guarantee that all previous seq# (inclusive) have been processed on all active shard copies (i.e., primary and replicas).

Those two numbers will be maintained in memory and also persisted in the metadata of every Lucene commit.
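To make the local checkpoint# definition concrete, here is a minimal, illustrative sketch of how it could be tracked while operations complete out of order (hypothetical names, not the actual implementation; a real version would use a more compact structure than a hash set):

```java
import java.util.HashSet;
import java.util.Set;

// Sketch: the local checkpoint# is the highest seq# such that every seq# at or
// below it has been marked as processed on this shard copy.
final class LocalCheckpointSketch {
    private final Set<Long> processed = new HashSet<>(); // completed ops above the checkpoint
    private long checkpoint = -1; // nothing processed yet

    synchronized void markProcessed(long seqNo) {
        processed.add(seqNo);
        // advance the checkpoint across any contiguous run of completed seq#
        while (processed.remove(checkpoint + 1)) {
            checkpoint++;
        }
    }

    synchronized long getCheckpoint() {
        return checkpoint;
    }
}
```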

Changes to indexing flow on primaries

Here is a sketch of the indexing code on primaries. Much of it is identical to the current logic; changes or additions are marked in bold.

  1. Validate write consistency based on routing tables.
  2. Incoming indexing request is parsed first (rejected upon mapping/parsing failures)
  3. Under uid lock:
    1. Versioning is resolved to a fixed version to be indexed.
    2. Operation is assigned a seq# and a term
    3. Doc is indexed into Lucene.
    4. Doc is put into translog.
  4. Replication
    1. Failures in step 3 above are also replicated (e.g., due to a failure of Lucene tokenization)
    2. Send docs to all assigned replicas.
    3. Replicas respond with their current local checkpoint#.
    4. When all respond (or have failed), send answer to client.
  5. Checkpoint update:
    1. Update the global checkpoint# to the highest seq# for which all active replicas have processed all lower seq# (inclusive). This is based on the information received in 4.3 (see the sketch after this list).
    2. If changed, send new global checkpoint# to replicas (can be folded into a heartbeat/next index req).
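A minimal sketch of step 5.1 (illustrative names only): the global checkpoint# is the minimum of the local checkpoints reported by the active copies, and it only ever moves forward.

```java
import java.util.Map;

// Sketch: advance the global checkpoint# to the lowest local checkpoint# among
// all active shard copies (primary included), since everything at or below that
// value is guaranteed to have been processed everywhere.
final class GlobalCheckpointSketch {
    private long globalCheckpoint = -1;

    synchronized long update(Map<String, Long> localCheckpointPerActiveCopy) {
        long min = localCheckpointPerActiveCopy.values().stream()
                .mapToLong(Long::longValue)
                .min()
                .orElse(globalCheckpoint);
        if (min > globalCheckpoint) {
            globalCheckpoint = min; // never move backwards
        }
        return globalCheckpoint;
    }
}
```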

Changes to indexing flow on replicas

As above, this is a sketch of the indexing code on replicas. Changes from the current logic are marked in bold.

  1. Validate request
    1. The operation's term is >= the locally known primary term.
  2. Under uid lock:
    1. Index into Lucene if the seq# is higher than that of the local copy of the doc and the operation doesn't represent an error on the primary (see the sketch after this list).
    2. Add to local translog.
  3. Respond with the current local checkpoint#
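A sketch of the per-document decision in step 2.1, reusing the ordering defined earlier (hypothetical helper, not the real engine code; the "error on primary" case is left out for brevity):

```java
// Sketch: an incoming replicated operation is applied only if it is newer than
// the copy already indexed for that document, using seq# first and the term as
// a tie-breaker.
final class ReplicaApplySketch {
    static boolean shouldIndex(long incomingSeqNo, long incomingTerm,
                               Long currentSeqNo, Long currentTerm) {
        if (currentSeqNo == null) {
            return true; // no local copy of this doc yet
        }
        if (incomingSeqNo != currentSeqNo) {
            return incomingSeqNo > currentSeqNo;
        }
        return incomingTerm > currentTerm;
    }
}
```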

Global Checkpoint# increment on replicas

The primary advances its global checkpoint# based on its knowledge of its own and the replicas' local checkpoint#. Periodically it shares this knowledge with the replicas:

  1. Validate source:
    1. source's primary term is == locally known primary term.
  2. Validate correctness:
    1. Check that all seq# below the new global checkpoint# were processed and that the local checkpoint# is of the same primary term. If not, fail the shard.
  3. Set the shard’s copy of global checkpoint#, if it's lower than the incoming global checkpoint.

Note that the global checkpoint is local knowledge that is updated under the mandate of the primary. It may be that the primary's information is lagging compared to a replica's. This can happen when a replica is promoted to primary (but still has stale info). A sketch of the replica-side update follows.
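This is a hedged sketch of how a replica might apply such an update, following the validation steps above (names are hypothetical; shard failure is represented by throwing):

```java
// Sketch: a replica accepts a global checkpoint# update only from the primary
// term it currently recognises, verifies it has processed everything at or
// below the new value, and never moves its copy of the checkpoint backwards.
final class ReplicaCheckpointSketch {
    private final long knownPrimaryTerm;
    private long localCheckpoint;
    private long globalCheckpoint = -1;

    ReplicaCheckpointSketch(long knownPrimaryTerm, long localCheckpoint) {
        this.knownPrimaryTerm = knownPrimaryTerm;
        this.localCheckpoint = localCheckpoint;
    }

    synchronized void updateGlobalCheckpoint(long sourceTerm, long newGlobalCheckpoint) {
        // 1. Validate source: the sender must be the primary we currently recognise.
        if (sourceTerm != knownPrimaryTerm) {
            throw new IllegalStateException("update from stale primary term " + sourceTerm);
        }
        // 2. Validate correctness: everything at or below the new global checkpoint#
        //    must already be processed locally; otherwise the shard has a gap and must fail.
        if (localCheckpoint < newGlobalCheckpoint) {
            throw new IllegalStateException("local checkpoint " + localCheckpoint
                    + " is below incoming global checkpoint " + newGlobalCheckpoint);
        }
        // 3. Only ever move the global checkpoint# forward.
        globalCheckpoint = Math.max(globalCheckpoint, newGlobalCheckpoint);
    }
}
```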

First use case - faster replica recovery

Having an ordering of operations allows us to speed up the recovery of an existing replica and its synchronization with the primary. At the moment we do a file-based sync, which typically results in over-copying of data. Having a clearly marked checkpoint# allows us to limit the sync operation to just those documents that have changed subsequently. In many cases we expect to have no documents to sync at all. This improvement will be tracked in a separate issue.

Road map

Basic infra

  • Introduce Primary Terms (#14062)
  • Introduce Seq# and index them (#14651)
  • Introduce local checkpoints (#15390)
  • Introduce global checkpoints (#15485)
  • Replicate failed operations for which a seq# was assigned (@Areek)
  • Create testing infrastructure to allow testing replication as a unit test, but with real IndexShards (#18930)
  • Persist local and global checkpoints (#18949)
  • Update to Lucene 6.0 (#20793)
  • Persist global checkpoint in translog commits (@jasontedor) #21254
  • Reading global checkpoint from translog should be part of translog recovery (@jasontedor) #21934
  • transfer global checkpoint after peer recovery (@jasontedor) #22212
  • add translog no-op (@jasontedor) #22291
  • Handle retry on primary exceptions in shard bulk action; some operations might already have a sequence number assigned and we shouldn't necessarily just reindex them (think about primary relocations) (@bleskes)
  • Don't fail shards as stale if they fail a global checkpoint sync - this will happen all the time when a primary is started and wants to sync the global checkpoint. To achieve this we agreed to change the default behavior of shard failures for replication operations (with the exception of write operations, see later). If an operation fails on a replica, the primary shouldn't fail the replica or mark it as stale. Instead it should report this to the user by failing the entire operation. Write replication operations (i.e., subclasses of TransportWriteAction) should keep the current behavior. (@dakrone)

Replica recovery (no rollback)

A best-effort doc-based replica recovery, based on the last local commit. By best effort we mean that there are no guarantees on the primary translog state or on the likelihood that a doc-based recovery will succeed without requiring a file sync.

  • Move local checkpoint to max seq# in commit when opening engine #22212
    We currently have no guarantee that all ops above the local checkpoint baked into the commit will be replayed. That means that delete operations with a seq# > local checkpoint will not be replayed. To work around it (for now), we will move the local checkpoint artificially (at the potential expense of correctness) (@jasontedor)
  • Review correctness of POC and extract requirements for the primary side (@jasontedor)
  • Placeholder for requirements from above. Potential candidates:
    • Primary should always be able to advance local checkpoint
      • Primary recovery should jump local checkpoint to max seq#
      • Primary promotion should close gaps?
  • Use seq# checkpoints for replica recovery

Translog recovery based on seq#

Currently the translog keeps all operations that are not persisted in the last Lucene commit. This doesn't imply that it can serve all operations from a given seq# and up. We want to move to seq#-based recovery, where a Lucene commit indicates which seq# are fully baked into it and the translog recovers from there.

  • Add min/max seq# to translog generations.
  • Add a smaller maximum generation size and automatically create new generations. Note that the total translog size can still grow to 512MB. However, dividing it into smaller pieces will allow a Lucene commit to trim the translog, even though we may still need some operations from a non-current generation (see next bullet point).
  • A Lucene flush should "bake in" the min generation that guarantees having all ops above the current local checkpoint (the translog will trim based on that Lucene commit, like it does now). See the sketch after this list.
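To illustrate the trimming logic described above (hypothetical names, under the assumption that each generation records the min/max seq# it contains):

```java
import java.util.List;

// Sketch: find the oldest translog generation that must be kept so that every
// operation above the commit's local checkpoint# is still recoverable; anything
// older can be trimmed once the Lucene commit is durable.
final class TranslogTrimSketch {
    static final class Generation {
        final long id;
        final long minSeqNo;
        final long maxSeqNo;

        Generation(long id, long minSeqNo, long maxSeqNo) {
            this.id = id;
            this.minSeqNo = minSeqNo;
            this.maxSeqNo = maxSeqNo;
        }
    }

    static long minGenerationToKeep(List<Generation> generations, long localCheckpoint) {
        long keep = Long.MAX_VALUE;
        for (Generation gen : generations) {
            // a generation is still needed if it may contain any op above the checkpoint
            if (gen.maxSeqNo > localCheckpoint) {
                keep = Math.min(keep, gen.id);
            }
        }
        return keep; // Long.MAX_VALUE means no generation is needed for recovery
    }
}
```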

Primary promotion

  • Should close gaps in history. Note that this doesn't imply that this information will be transferred to the replicas. That is the job of the primary/replica sync.

Replica recovery with rollback

Needed to throw away potentially wrong doc versions that ended up in Lucene. Those "wrong doc versions" may still be in the translog of the replica, but since we ignore the translog on replica recovery they will be removed.

  • A custom deletion policy to keep old commits around
  • Need to open a specific commit (based on the last known global checkpoint) to serve as the base for translog recovery

Primary recovery with rollback

Needed to deal with discrepancies between the translog and the commit point that can result from a failure during primary/replica sync

  • A custom deletion policy to keep old commits around
  • Make sure the translog keeps data based on the oldest commit
  • Roll back before starting to recover from translog

Live replica/primary sync

Needs the primary history to contain a continuous sequence of operations with a high chance of success

  • Allow a shard to roll back to a seq# from before the last known checkpoint#, based on NRT readers
  • The translog needs to be able to transfer all ops above the global checkpoint and up to the current max, in order for the replica to restore its checkpoint state (and avoid failing shards)
  • TBD

Seq# as versioning

  • Change InternalEngine to resolve collisions based on seq# on replicas and during recovery
  • Change write API to allow specifying the desired current seq# for the operation to succeed
  • Make doc-level versioning an opt-in feature (mostly for external versioning)

Adopt Me

  • Properly store seq# in Lucene: we expect to use the seq# for sorting, during collision checking, and for doing range searches. The primary term will only be used during collision checking when the seq# of the two document copies is identical. Mapping this need to Lucene means that the seq# itself should be stored both as a numeric doc value and as a numeric indexed field (BKD). The primary term should be stored as a doc value field and doesn't need an indexed variant. We also considered the alternative of encoding both term and seq# into a single numeric/binary field, as it may save on the disk lookup implied by two separate fields. Since we expect the primary term to be rarely retrieved, we opted for the simplicity of the two doc value fields solution. We also expect it to mean better compression. (@dakrone) #21480 (see the sketch after this list)
  • Add primary term to DocWriteResponse
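A hedged sketch of the field layout described in the first bullet above, using plain Lucene 6 APIs (the actual mapping code in Elasticsearch may differ):

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.LongPoint;
import org.apache.lucene.document.NumericDocValuesField;

// Sketch: seq# gets both a doc value (sorting/retrieval) and a point field
// (range queries); the primary term only needs a doc value for tie-breaking.
final class SeqNoFieldsSketch {
    static void addSeqNoFields(Document doc, long seqNo, long primaryTerm) {
        doc.add(new NumericDocValuesField("_seq_no", seqNo));
        doc.add(new LongPoint("_seq_no", seqNo));
        doc.add(new NumericDocValuesField("_primary_term", primaryTerm));
    }
}
```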

TBD

  • Review Shadow replicas
  • Make local checkpoint storage lazily initialized to protect against memory usage during recovery (TBD)
  • Translog API changes to be seq# based (TBD)
  • Review feasibility of old indices (done and implemented in #22185 ) (@bleskes)
  • Documentation
  • Fail shards whose local checkpoint is lagging more than 10000 (?) ops behind the primary. This is a temporary measure to allow merging into master without closing translog gaps during primary promotion on a live shard. Those gaps will require the replicas to pick them up, which will take a replica/primary live sync.
  • If the minimum of all local checkpoints is less than the global checkpoint on the primary, do we fail the shard? No; this can happen when replicas pull back their local checkpoints to their version of the global checkpoint.
  • How do we deal with the BWC aspects in the case where a primary is running on a new node while one replica is on an old node and another replica is on a new one? In that case the primary will maintain seq# and checkpoints for itself and the replica on the new node. However, if the primary fails, it may be that the old replica is elected as primary. That means that the other replica will suddenly stop receiving sequence numbers. It is not clear if this is really a problem and, if so, what the best approach to solve it would be.
@bleskes bleskes added the resiliency label Apr 21, 2015
@shikhar
Contributor
shikhar commented May 5, 2015

First use case - faster replica recovery

I'd argue the first use case is making replication semantics more sound :)

@bleskes bleskes added a commit that referenced this issue Oct 21, 2015
@bleskes bleskes Introduce Primary Terms
Every shard group in Elasticsearch has a selected copy called a primary. When a primary shard fails, a new primary is selected from the existing replica copies. This PR introduces `primary terms` to track the number of times this has happened. This will allow us, as follow-up work and among other things, to identify operations that come from old, stale primaries. It is also the first step on the road towards sequence numbers.

Relates to #10708
Closes #14062
b364cf5
@bleskes bleskes added a commit that referenced this issue Nov 19, 2015
@bleskes bleskes Add Sequence Numbers and enforce Primary Terms
Adds a counter to each write operation on a shard. This sequence number is indexed into Lucene using doc values, for now (we will probably require indexing to support range searches in the future).

On top of this, primary term semantics are enforced and shards will refuse write operations coming from an older primary.

Other notes:
- The added SequenceServiceNumber is just a skeleton and will be replaced with a much heavier one, once we have all the building blocks (i.e., checkpoints).
- I completely ignored recovery - for this we will need checkpoints as well.
- A new base class is introduced for all single doc write operations. This is handy to unify common logic (like toXContent).
- For now, we don't use seq# as versioning. We could in the future.

Relates to #10708
Closes #14651
5fb0f9a
@bleskes bleskes added a commit to bleskes/elasticsearch that referenced this issue Nov 22, 2015
@bleskes bleskes Set a newly created IndexShard's ShardRouting before exposing it to operations

The work for #10708 requires tighter integration with the current shard routing of a shard. As such, we need to make sure it is set before the IndexService exposes the shard to external operations.
fe2218e
@bleskes bleskes added a commit that referenced this issue Nov 23, 2015
@bleskes bleskes Set a newly created IndexShard's ShardRouting before exposing it to operations

The work for #10708 requires tighter integration with the current shard routing of a shard. As such, we need to make sure it is set before the IndexService exposes the shard to external operations.

Closes #14918
6e2e91c
@bleskes bleskes added a commit that referenced this issue Nov 23, 2015
@bleskes bleskes Set a newly created IndexShard's ShardRouting before exposing it to operations

The work for #10708 requires tighter integration with the current shard routing of a shard. As such, we need to make sure it is set before the IndexService exposes the shard to external operations.

Closes #14918
3f145d0
@bleskes bleskes added a commit that referenced this issue Dec 15, 2015
@bleskes bleskes Introduce Local checkpoints
This PR introduces the notion of a local checkpoint on the shard level. A local checkpoint is defined as the highest sequence number for which all previous operations (i.e. with a lower seq#) have been processed.

relates to #10708

Closes #15390

3106948
@rkonda
rkonda commented Mar 8, 2016

It's not clear as to what would happen in the following split brain scenario (scenario-1):

  1. split occurs, forming two networks
  2. the network that didn't have a master, elects a master (call this network-2)
  3. the master will elect a new primary (in network-2)
  4. the primary in network-2 now has incremented term value (say 11). The primary in network-1 continues to have the same term value (10 in this example)
  5. The connection between the networks is re-established.

In this case we need a strategy for reconciling the differences in the indexes, if there were change operations in both the networks. Does a strategy like that exist today? So far it seems like this situation is preventable by using min_master_nodes. However in case min_master_nodes is not set appropriately, some default strategy should come into effect I would think.

An example strategy could be:

  1. Keep logs of write operations in both networks for a configurable amount of time. If the networks' connectivity is restored within this time period: (a) Drop all nodes in network-2 to read-only replica status (b) Attempt to reconcile the differences, and use network-1's state if the differences are not reconcilable. (c) Remove read-only status
    If the connectivity isn't restored within that time, when connection is restored, all indices in network-2 that have competing primaries in network-1 will lose their shards, and replicas are created from network-1.

Another interesting situation (scenario-2) to consider:

  1. Continuing with the scenario described above until (4) ...
  2. network-1 has another split, creating network-1 and network-1a. Network-1a gives term value of 11 to the new primary in that network.
  3. network-1 completely fails, and connectivity between network-1a and network-2 are restored. Now we may have a scenario where the subsequent change operations might not fail but still lead to different indexes in the replicas, with some operations failing some of the time, creating a messy situation.

This would happen if there is no reconciliation strategy in effect.

I do see that the sequence numbering method will keep shards that have connectivity to both the networks, in integral state, in the case of scenario-1. In the case of scenario-2, it is possible that the same shard gets operations with same term values from multiple primaries, and that again could create faulty index in that replica.

I am still trying to understand Elasticsearch's cluster behavior. It's possible that I might have made assumptions that aren't correct.

@bleskes
Member
bleskes commented Mar 8, 2016

In this case we need a strategy for reconciling the differences in the indexes, if there were change operations in both the networks. Does a strategy like that exist today?

The current strategy, which seq# will keep enforcing but in easier/faster way, is that all replicas are "reset" to be an exact copy of the primary currently chosen by the master. As you noted, this falls apart when there are two residing masters in the cluster. Indeed, the only way to prevent this is by setting minimum master nodes - which is the number one most important setting to set in ES (tell it what the expected cluster size is)

If min master nodes is not set and a split brain occurs, resolution will come when one of the masters steps down (either by manual intervention or by detecting the other one). In that case all replicas will "reset" to the primary designated by the left over master.

Drop all nodes in network-2 to read-only replica status

This is similar to what ES does - nodes with no master will only serve read requests and block writes (by default, it can be configured to block reads).

it is possible that the same shard gets operations with same term values from multiple primaries, and that again could create faulty index in that replica.

If the term is the same from both primaries, the replica will accept them according to the current plan. The situation will be resolved when the network restores and the left over primary and replica sync but indeed there are potential troubles there. I have some ideas on how to fix this specific secondary failure (split brain is the true issue, after which all bets are off) but there are bigger fish to catch first :)

@rkonda
rkonda commented Mar 8, 2016

Thank you very much for your clarification. I rather enjoy all these discussions and your comments.

The current strategy, which seq# will keep enforcing but in easier/faster way, is that all replicas are "reset" to be an exact copy of the primary currently chosen by the master.

The situation will be resolved when the network restores and the left over primary and replica sync

I would like to clearly understand the reset/sync scenarios. What triggers reset/sync?

I can think of a couple of "normal" operation scenarios

  1. I would think that whenever a node joins a network, the master would initiate a sync/reset.
  2. If a replica fails for a request, I suppose the primary should keep attempting a sync/reset, otherwise the replica might keep diverging, and at some point the master has to decommission that replica, otherwise the reads would be inconsistent.

In the case of split brain, with multi-network replicas (assuming min master nodes is set), primary-1 has been assuming that this replica R (on this third node, say N-3) has been failing (because of its allegiance to primary-2 ) but still is in the network. Hence it would attempt sync/reset. How does this protocol work? Should master-1 attempt to decommission R at some point, going by assumption (2)?

This problem will occur in a loop if R is decommissioned but another replica is installed on N-3 in its place, by the same protocol. There will be contention on N-3 for "reset"-ing replica shards by both the masters.

I suppose one way to resolve this is by letting a node choose a master if there are multiple masters. If we did this, then whenever a node loses its master, it would choose the other master, and there will be a sync/reset and all is well.

However if the node chooses its master, the other master will lose quorum, and hence cease to exist, which is a good resolution for this issue in my opinion.

@bleskes
Member
bleskes commented Mar 8, 2016

The two issues you mention indeed trigger a primary/replica sync. I'm not sure I follow the rest; I would like to ask you to continue the discussion on discuss.elastic.co. We try to keep GitHub for issues and work items. Thx!

@makeyang
Contributor
makeyang commented Apr 5, 2016

Any plan to release this?
It seems that after this release you will make ES an AP system? Will you provide config parameters to allow users to control whether ES is eventually an AP or a CP system?

@bleskes
Member
bleskes commented Apr 5, 2016

@makeyang this will be released as soon as it is done. There's still a lot of work to do.

It seems that after this release you will make ES an AP system? Will you provide config parameters to allow users to control whether ES is eventually an AP or a CP system?

ES is currently CP and will stay CP for the foreseeable future. If a node is partitioned away from the cluster it will serve read requests (configurable) but will block writes, in which case we drop availability. Of course, in the future there are many options, but currently there are no concrete plans to make it any different.

@bleskes bleskes added a commit that referenced this issue Jun 6, 2016
@bleskes bleskes Introduced Global checkpoints for Sequence Numbers (#15485)
Global checkpoints are updated by the primary and represent the common part of history across shard copies, as known at a given time. The primary is also in charge of periodically broadcasting this information to the replicas. See #10708 for more details.
4844325
@dakrone dakrone added a commit to dakrone/elasticsearch that referenced this issue Nov 10, 2016
@dakrone dakrone Add internal _primary_term field
This adds the `_primary_term` field internally to the mappings. This
field is populated with the current shard's primary term.

It is intended to be used for collision resolution when two document
copies have the same sequence id, therefore, doc_values for the field
are stored but the field itself is not indexed.

This also fixes the `_seq_no` field so that doc_values are
retrievable (they were previously stored but irretrievable) and changes
the `stats` implementation to more efficiently use the points API to
retrieve the min/max instead of iterating on each doc_value value.

Relates to #10708
af5ecda
@dakrone dakrone added a commit to dakrone/elasticsearch that referenced this issue Nov 10, 2016
@dakrone dakrone Add internal _primary_term field
This adds the `_primary_term` field internally to the mappings. This
field is populated with the current shard's primary term.

It is intended to be used for collision resolution when two document
copies have the same sequence id, therefore, doc_values for the field
are stored but the field itself is not indexed.

This also fixes the `_seq_no` field so that doc_values are
retrievable (they were previously stored but irretrievable) and changes
the `stats` implementation to more efficiently use the points API to
retrieve the min/max instead of iterating on each doc_value value.

Relates to #10708
8a1213c
@dakrone dakrone added a commit to dakrone/elasticsearch that referenced this issue Nov 11, 2016
@dakrone dakrone Add internal _primary_term field
This adds the `_primary_term` field internally to the mappings. This
field is populated with the current shard's primary term.

It is intended to be used for collision resolution when two document
copies have the same sequence id, therefore, doc_values for the field
are stored but the field itself is not indexed.

This also fixes the `_seq_no` field so that doc_values are
retrievable (they were previously stored but irretrievable) and changes
the `stats` implementation to more efficiently use the points API to
retrieve the min/max instead of iterating on each doc_value value.

Relates to #10708
17d695f
@mivano mivano referenced this issue in serilog/serilog-sinks-elasticsearch Nov 15, 2016
Closed

Is there a way to preserve the order of log events? #71

@dakrone dakrone added a commit to dakrone/elasticsearch that referenced this issue Nov 16, 2016
@dakrone dakrone Add internal _primary_term field
This adds the `_primary_term` field internally to the mappings. This
field is populated with the current shard's primary term.

It is intended to be used for collision resolution when two document
copies have the same sequence id, therefore, doc_values for the field
are stored but the field itself is not indexed.

This also fixes the `_seq_no` field so that doc_values are
retrievable (they were previously stored but irretrievable) and changes
the `stats` implementation to more efficiently use the points API to
retrieve the min/max instead of iterating on each doc_value value.

Relates to #10708
c4f3e96
@dakrone dakrone added a commit to dakrone/elasticsearch that referenced this issue Nov 17, 2016
@dakrone dakrone Add internal _primary_term doc values field, fix _seq_no indexing
This adds the `_primary_term` field internally to the mappings. This field is
populated with the current shard's primary term.

It is intended to be used for collision resolution when two document copies have
the same sequence id, therefore, doc_values for the field are stored but the
field itself is not indexed.

This also fixes the `_seq_no` field so that doc_values are retrievable (they
were previously stored but irretrievable) and changes the `stats` implementation
to more efficiently use the points API to retrieve the min/max instead of
iterating on each doc_value value. Additionally, even though we intend to be
able to search on the field, it was previously not searchable. This commit makes
it searchable.

There is no user-visible `_primary_term` field. Instead, the fields are
updated by calling:

```java
index.parsedDoc().updateSeqID(seqNum, primaryTerm);
```

This includes example methods in `Versions` and `Engine` for retrieving the
sequence id values from the index (see `Engine.getSequenceID`) that are only
used in unit tests. These will be extended/replaced by actual implementations
once we make use of sequence numbers as a conflict resolution measure.

Relates to #10708
Supersedes #21480

P.S. As a side effect of this commit, `SlowCompositeReaderWrapper` cannot be
used for documents that contain `_seq_no` because it is a Point value and SCRW
cannot wrap documents with points, so the tests have been updated to loop
through the `LeafReaderContext`s now instead.
30eca01
@dakrone dakrone added a commit to dakrone/elasticsearch that referenced this issue Dec 5, 2016
@dakrone dakrone Add internal _primary_term doc values field, fix _seq_no indexing
This adds the `_primary_term` field internally to the mappings. This field is
populated with the current shard's primary term.

It is intended to be used for collision resolution when two document copies have
the same sequence id, therefore, doc_values for the field are stored but the
field itself is not indexed.

This also fixes the `_seq_no` field so that doc_values are retrievable (they
were previously stored but irretrievable) and changes the `stats` implementation
to more efficiently use the points API to retrieve the min/max instead of
iterating on each doc_value value. Additionally, even though we intend to be
able to search on the field, it was previously not searchable. This commit makes
it searchable.

There is no user-visible `_primary_term` field. Instead, the fields are
updated by calling:

```java
index.parsedDoc().updateSeqID(seqNum, primaryTerm);
```

This includes example methods in `Versions` and `Engine` for retrieving the
sequence id values from the index (see `Engine.getSequenceID`) that are only
used in unit tests. These will be extended/replaced by actual implementations
once we make use of sequence numbers as a conflict resolution measure.

Relates to #10708
Supersedes #21480

P.S. As a side effect of this commit, `SlowCompositeReaderWrapper` cannot be
used for documents that contain `_seq_no` because it is a Point value and SCRW
cannot wrap documents with points, so the tests have been updated to loop
through the `LeafReaderContext`s now instead.
26f2a38
@dakrone dakrone added a commit to dakrone/elasticsearch that referenced this issue Dec 9, 2016
@dakrone dakrone Add internal _primary_term doc values field, fix _seq_no indexing
This adds the `_primary_term` field internally to the mappings. This field is
populated with the current shard's primary term.

It is intended to be used for collision resolution when two document copies have
the same sequence id, therefore, doc_values for the field are stored but the
field itself is not indexed.

This also fixes the `_seq_no` field so that doc_values are retrievable (they
were previously stored but irretrievable) and changes the `stats` implementation
to more efficiently use the points API to retrieve the min/max instead of
iterating on each doc_value value. Additionally, even though we intend to be
able to search on the field, it was previously not searchable. This commit makes
it searchable.

There is no user-visible `_primary_term` field. Instead, the fields are
updated by calling:

```java
index.parsedDoc().updateSeqID(seqNum, primaryTerm);
```

This includes example methods in `Versions` and `Engine` for retrieving the
sequence id values from the index (see `Engine.getSequenceID`) that are only
used in unit tests. These will be extended/replaced by actual implementations
once we make use of sequence numbers as a conflict resolution measure.

Relates to #10708
Supersedes #21480

P.S. As a side effect of this commit, `SlowCompositeReaderWrapper` cannot be
used for documents that contain `_seq_no` because it is a Point value and SCRW
cannot wrap documents with points, so the tests have been updated to loop
through the `LeafReaderContext`s now instead.
ee22a47