
Optimize indexing in create once and never update scenarios #19813

Closed
3 of 10 tasks
bleskes opened this issue Aug 4, 2016 · 10 comments
Labels
:Distributed/CRUD A catch all label for issues around indexing, updating and getting a doc by id. Not search. Meta

Comments


bleskes commented Aug 4, 2016

The Elasticsearch API supports all CRUD operations on every single document, topped by features like optimistic versioning controls and a real-time get API. To achieve this we need several data structures that support looking up existing documents in real time (before a refresh); if a document is not found there, we need to do a lookup by ID in Lucene to make sure we find and replace any existing copy. For certain use cases, where users can guarantee that documents are inserted once and never changed again (for example logging, metrics, or a one-off re-indexing of existing data), this turns out to be a significant overhead in terms of indexing speed, without adding any real value. We sometimes refer to these scenarios as the "append-only use case".

In the past, we have tried to introduce optimizations in this area, but had to revert them as they proved to be unsafe under certain race/error conditions in our distributed system.

Since then things have evolved and we now have more tools at our disposal. This issue is to track the list of things it will take to optimize the append-only use case while keeping it safe. Most concretely, we should explore not maintaining the LiveVersionMap and always using addDocument (rather than updateDocument) when indexing to Lucene. Some of the things listed here are already done; some require more work.

The list is mostly focused on dealing with the main issue in these cases: when an indexing operation is sent to the primary, it may be that the sending node is disconnected from the node with the primary and thus has no idea whether the operation succeeded or failed. In that case, the node will resend its request once the connection is restored. This may result in a duplicate delivery on the primary and thus requires the use of updateDocument. It is easy to add an `isRetry` flag to those retry requests, but we still run the danger of the retry request being processed before the original, in which case we would end up with duplicates. This scenario must be avoided at all costs.

  • Disable the optimization when a shard is recovering and do things as we do today. During recovery a document may be written twice to a shard.
  • Rely on primary terms to solve collisions between requests originating from an old primary and a retry.
  • Rely on the task manager to avoid having an original request and a retry request executing concurrently on a node. For this we need to make sure they use the same task id and either reject one or have it wait on the other.
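The task-manager idea in the last bullet could be sketched roughly like this (a hypothetical toy model; `InFlightTasks` and its methods are illustrative names, not the actual Elasticsearch task manager API):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: an original request and its retry share one task id;
// while one of them is executing, the other is rejected (or could be parked
// to wait, which this toy model does not implement).
class InFlightTasks {
    private final Map<Long, Boolean> inFlight = new ConcurrentHashMap<>();

    // Returns true if the caller may execute the request now; false if a
    // request with the same task id is already in flight on this node.
    boolean tryRegister(long taskId) {
        return inFlight.putIfAbsent(taskId, Boolean.TRUE) == null;
    }

    void unregister(long taskId) {
        inFlight.remove(taskId);
    }
}
```

This only prevents *concurrent* execution; as the text below notes, it does nothing for a retry that has already completed and been removed from the task manager.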

Having dealt with concurrency, we need to deal with the case where the retry request has already executed and been removed from the task manager. This can be done as follows:

  • Introduce a new “connection generation” id (a long) that is incremented every time a connection is (re)established between the nodes. This generation id will be added to every request and will also be exchanged as part of establishing a connection, making sure the target node is aware of the new generation id before exposing the connection outside of the transport service.
  • Networking threads on the target node will check that the generation id of a request is still the current one after adding the request to the task manager. If it is not, the request will be removed from the task manager and failed (though there is no way to respond, because the channel is closed). This guarantees that an old request will never be executed after a new connection is established, which in turn means it cannot be processed after a retry request has been processed.
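A minimal sketch of that connection-generation check (hypothetical names, not the actual transport-service code):

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch: a generation counter bumped on every (re)connect.
// Requests carry the generation they were sent under; a request tagged with
// an old generation is rejected once a newer connection exists.
class ConnectionGeneration {
    private final AtomicLong current = new AtomicLong(0);

    // Called when a connection is (re)established between two nodes.
    long reconnect() {
        return current.incrementAndGet();
    }

    // Checked by the target node after registering the request.
    boolean isStillCurrent(long requestGeneration) {
        return requestGeneration == current.get();
    }
}
```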

Instead of the above we went with a simpler approach of dealing with concurrency:

  • Add a timestamp to each request with an auto-generated id and create a "safety" barrier guaranteeing that all problematic requests (i.e., an original request that is processed after its retry request) have a lower timestamp. This allows us to identify them and do the right thing. See details below.
  • Use an index level setting to indicate to the engine that it can disable the LiveVersionMap and use addDocument when the shard is started. This can be updated on a closed index.
  • Disable real time get API on these indices.
  • Disable indexing with an id.
  • ~~Disable delete & update API~~

Instead of disabling single-doc update/delete/index-with-id, we have managed to come up with a scheme that skips adding append-only operations to the LiveVersionMap and marks it as unsafe. Whenever a non-append-only indexing operation comes along, we detect that and force a refresh. See #27752
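The scheme from #27752 can be illustrated with a toy model (hypothetical class and method names; the real engine logic is considerably more involved):

```java
// Hypothetical sketch: append-only ops skip the LiveVersionMap, leaving it
// "unsafe". The first non-append-only operation forces a refresh so that the
// skipped documents become visible in Lucene and the ID lookup can go
// through Lucene instead of the (incomplete) map.
class AppendOnlyVersionMap {
    private boolean unsafe = false;     // true once entries have been skipped
    private int forcedRefreshes = 0;

    // Returns true if this operation forced a refresh.
    boolean onOperation(boolean isAppendOnly) {
        if (isAppendOnly) {
            unsafe = true;              // skip the map; mark it unsafe
            return false;
        }
        if (unsafe) {
            forcedRefreshes++;          // make skipped docs searchable
            unsafe = false;             // resume maintaining the map
            return true;
        }
        return false;
    }

    int forcedRefreshes() { return forcedRefreshes; }
}
```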


abeyad commented Aug 4, 2016

> Disable real time get API and optimistic concurrency control on these indices.

I was wondering what the reason is that real-time gets wouldn't work with the solutions outlined above?


s1monw commented Aug 4, 2016

> I was wondering what the reason is that real-time gets wouldn't work with the solutions outlined above?

We do RAM-heavy accounting in the engine to allow GET operations to work in real time vs. near-real time. If we optimize for append-only, I think the need for real-time GET is a corner case and we can use that RAM for indexing instead. Hope that helps?


abeyad commented Aug 4, 2016

@s1monw makes sense, thank you!

clintongormley added the :Distributed/CRUD label Aug 11, 2016

s1monw commented Aug 29, 2016

@bleskes I was looking into the problem of more-than-once delivery due to retries on the primary. The task manager / network generation ideas I had might work, but I'd like to discuss a potentially simpler and more elegant way of doing this. IMO we just need a "happens before" relationship on the target shard. For instance, if we send a document that has the canHaveDuplicates property set (we can set it on the primary when we retry or when we recover from translog), we remember something that is the same on all duplicates, for instance the sequence_id or maybe the timestamp. Inside the engine we then de-optimize (use updateDocument) whenever we see a document with deoptimizeProperty <= minSeenDeoptimizeProperty. This might hit other, potentially unrelated documents, which is OK, but it would guarantee that we always de-optimize on the one that loses the race. This sounds simpler to me than adding stuff to the network layer, etc.


bleskes commented Aug 29, 2016

@s1monw that sounds very promising. I would love it if we can come up with something that is limited to the engine working together with the replication code, and leaves the networking code/task management out of it.

Let me try and reword what you say, to make sure we are on the same page.

On the engine level, we already lock based on doc id, so we don't have to worry about concurrency at that level (as this doesn't seem to be an indexing bottleneck, we can keep this lock around, at least for now). With this in place, we need to make sure that an "optimized" request is never executed (either at all, or in an unsafe manner) after a "retry" request has completed - i.e., we need some barrier which the "retry" request leaves behind that blocks the "optimized" request. A naive implementation would be a map of doc ids for which "optimized" requests should be handled carefully. This is obviously problematic, as there is no clear way to indicate when the map should be cleaned. Instead, the idea is to do it as follows:

  1. Mark every request with a timestamp. This is done once, on the first node that receives the request, and is fixed for that request. This can even be the machine's local time (see why later). The important part is that retry requests will have the same value as the original one.
  2. In the engine we keep the highest seen timestamp of "retry" requests. This is updated while the retry request holds its doc id lock. Call this highestRetriedTimestamp.
  3. When the engine runs an "optimized" request, it compares the request's timestamp with the current highestRetriedTimestamp (but doesn't update it). If the request's timestamp is higher, it is safe to execute as optimized (no retry request with the same timestamp has run before). If not, we fall back to "non-optimized" mode and run the request as a retry.

The first and last steps guarantee that we will never run an "optimized" request after a "retry" request on the same doc (because they will have the same timestamp).

The roughly monotonic nature of the timestamp (which can be different on different nodes, and can go back in time) limits the potential secondary damage where a retry request causes other unrelated "optimized" requests to run as retries.
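The three steps above can be sketched as follows (a toy model with illustrative names, not the actual engine code; the per-doc-id lock is assumed to be held by the caller):

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch of the timestamp barrier: retries record the highest
// timestamp they have seen; an original request may only take the optimized
// addDocument path if its timestamp is strictly higher than that barrier.
class AppendOnlyBarrier {
    // Highest auto-generated-id timestamp seen on any "retry" request.
    private final AtomicLong highestRetriedTimestamp = new AtomicLong(-1);

    // Returns true if the request may use the optimized addDocument path,
    // false if it must fall back to updateDocument (the de-optimized path).
    boolean mayUseAddDocument(long autoIdTimestamp, boolean isRetry) {
        if (isRetry) {
            // Remember the highest retry timestamp; any original request
            // with a timestamp <= this value might race with a retry.
            highestRetriedTimestamp.accumulateAndGet(autoIdTimestamp, Math::max);
            return false;
        }
        // Safe to optimize only if no retry with the same (or a higher)
        // timestamp has been processed before us.
        return autoIdTimestamp > highestRetriedTimestamp.get();
    }
}
```

A request whose timestamp ties the barrier is de-optimized, which is exactly the "original processed after its retry" case; unrelated requests with lower timestamps pay the same penalty, which is harmless.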

@s1monw are we on the same page? If so 🏆 🎉 :)


s1monw commented Aug 29, 2016

🎉 ++ that's it :)

s1monw added a commit to s1monw/elasticsearch that referenced this issue Aug 29, 2016
If elasticsearch controls the ID values as well as the document
versions, we can optimize the code that adds / appends documents
to the index. Essentially we can skip the version lookup for all
documents unless the same document is delivered more than once.

On the lucene level we can simply call IndexWriter#addDocument instead
of #updateDocument, but on the Engine level we need to ensure that we de-optimize
the case once we see the same document more than once.

This is done as follows:

1. Mark every request with a timestamp. This is done once on the first node that
receives a request and is fixed for this request. This can be even the
machine local time (see why later). The important part is that retry
requests will have the same value as the original one.

2. In the engine we keep the highest seen timestamp of "retry" requests.
This is updated while the retry request holds its doc id lock. Call this `highestDeOptimzeAddDocumentTimestamp`.

3. When the engine runs an "optimized" request, it compares the request's timestamp with the
current `highestDeOptimzeAddDocumentTimestamp` (but doesn't update it). If the request's
timestamp is higher, it is safe to execute as optimized (no retry request with the same
timestamp has run before). If not, we fall back to "non-optimized" mode, run the request as a retry,
and update the `highestDeOptimzeAddDocumentTimestamp` unless it has already been updated to a higher value.

Closes elastic#19813
s1monw added a commit that referenced this issue Sep 1, 2016
If elasticsearch controls the ID values as well as the document
versions, we can optimize the code that adds / appends documents
to the index. Essentially we can skip the version lookup for all
documents unless the same document is delivered more than once.

On the lucene level we can simply call IndexWriter#addDocument instead
of #updateDocument, but on the Engine level we need to ensure that we de-optimize
the case once we see the same document more than once.

This is done as follows:

1. Mark every request with a timestamp. This is done once on the first node that
receives a request and is fixed for this request. This can be even the
machine local time (see why later). The important part is that retry
requests will have the same value as the original one.

2. In the engine we keep the highest seen timestamp of "retry" requests.
This is updated while the retry request holds its doc id lock. Call this `maxUnsafeAutoIdTimestamp`.

3. When the engine runs an "optimized" request, it compares the request's timestamp with the
current `maxUnsafeAutoIdTimestamp` (but doesn't update it). If the request's
timestamp is higher, it is safe to execute as optimized (no retry request with the same
timestamp has run before). If not, we fall back to "non-optimized" mode, run the request as a retry,
and update the `maxUnsafeAutoIdTimestamp` unless it has already been updated to a higher value.

Relates to #19813
s1monw added a commit to s1monw/elasticsearch that referenced this issue Sep 1, 2016
Today we force a refresh on get calls if the document has changes that are
not yet refreshed. This might have implications, and users might want to disable
it to ensure cluster stability. If we don't need to serve realtime
requests at all, we can also optimize indexing to not populate the version map,
and use more memory for the index writer buffers to speed up indexing.

Relates to elastic#19813
@clintongormley

@bleskes can this issue be closed now or is there more to do?


s1monw commented Sep 17, 2016

I think we should update the description; we haven't done the live version map stuff yet.


bleskes commented Sep 18, 2016

@s1monw ++. I have updated the issue description.


s1monw commented Sep 19, 2016

thx @bleskes

s1monw closed this as completed in d941c64 Dec 15, 2017
s1monw added a commit that referenced this issue Dec 15, 2017
Today we still maintain a version map even if we only index append-only
documents, in other words documents with auto-generated IDs. We can instead maintain
an unsafe version map that is swapped to a safe version map only if necessary,
once we see the first document that requires access to the version map. For instance:
 * an auto-generated-id retry
 * any kind of delete
 * a document with a foreign ID (non-auto-generated)

In these cases we forcefully refresh the internal reader and start maintaining
a version map until such a safe map hasn't been necessary for two refresh cycles.
Indices / shards that never see an auto-generated-ID document will always maintain a version
map, and in the case of a delete / retry in a pure append-only index the version map will be
de-optimized for a short amount of time until we know it's safe again to swap back. This
will also minimize the required refreshes.

Closes #19813
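The swap-back logic described in the commit message above can be illustrated with a toy model (hypothetical names; the real implementation ties this state to the engine's refresh cycle and version map internals):

```java
// Hypothetical sketch: once a safe version map is required (delete, retry,
// or foreign-id doc), stay in safe mode until the safe map has not been
// needed for two consecutive refresh cycles, then swap back to the cheap
// unsafe map.
class VersionMapSwapper {
    private boolean safeMode = false;
    private int cleanCycles = 0;

    // Called when a delete, auto-id retry, or foreign-id doc is seen.
    void requireSafe() {
        safeMode = true;
        cleanCycles = 0;
    }

    // Called on each refresh cycle.
    void onRefresh(boolean safeAccessNeededSinceLastRefresh) {
        if (!safeMode) {
            return;
        }
        if (safeAccessNeededSinceLastRefresh) {
            cleanCycles = 0;            // still needed; stay safe
        } else if (++cleanCycles >= 2) {
            safeMode = false;           // idle for two cycles: swap back
        }
    }

    boolean isSafeMode() { return safeMode; }
}
```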