Don't lookup version for auto generated id and create #5917

kimchy · 2014-04-23T11:22:16Z

When a create document is executed, and its an auto generated id (based on UUID), we know that the document will not exists in the index, so there is no need to try and lookup the version from the index.
For many cases, like logging, where ids are auto generated, this can improve the indexing performance, specifically for lightweight documents where analysis is not a big part of the execution.

clintongormley · 2014-04-23T12:38:40Z

It's extremely unlikely that the ID doesn't exist, but not completely impossible (eg if somebody previously specified the same ID manually). What would be the consequences of a clash? Just that the version number would be incorrect? Or would you end up with duplicate docs?

kimchy · 2014-04-23T12:55:08Z

@clintongormley in the extreme event that there is a clash, then we will end with duplicate docs. I don't think this will happen though. Clashes in terms of UUID are close to impossible, and generated the same id on the client side that clashes with one in ES and ending up doing a create operation... (this optimization is only for create use case)

s1monw · 2014-04-23T15:20:52Z

...a/org/elasticsearch/action/support/replication/TransportShardReplicationOperationAction.java

+                request.enablePotentialDup(); // safe side, cluster state changed, we might have dups
+            } else{
+                shardIt.reset();
+                while ((shard = shardIt.nextOrNull()) != null) {


maybe we can utli method to the shardIter here? I'd prefer to keep this code short here it's large enough

s1monw · 2014-04-23T15:35:17Z

I added some comments regarding naming. Yet what I miss here is a really good test that adds a lot of stuff with auto IDs while starting up replicas etc. we might also randomize some existing tests but I think they should at least use bulk or concurrent indexing.

kimchy · 2014-04-23T19:31:39Z

I added a test to master/1.x, and rebased this branch against it.

s1monw · 2014-04-24T14:39:14Z

src/main/java/org/elasticsearch/index/engine/Engine.java

+            return this.autoGeneratedId;
+        }
+
+        public Create chanHaveDuplicates(boolean canHaveDuplicates) {


typo I guess s/chanHaveDuplicates/canHaveDuplicates/

s1monw · 2014-04-24T14:51:04Z

I really want to have this optimization but I think we need to fix the API for Engine#Create and friends to be immutable. The builder like patter is so error prone I don't think we should carry that on over this change. Other than that I left some comments & it looks good

Today we use a builder pattern / setters to set relevant information to Engine#Delete|Create|Index. Yet almost all the values are required but they are not passed via ctor arguments but via an error prone builder pattern. If we add a required argument we should see compile errors on that level to make sure we don't miss any place to set them. Prerequisite for elastic#5917

Today we use a builder pattern / setters to set relevant information to Engine#Delete|Create|Index. Yet almost all the values are required but they are not passed via ctor arguments but via an error prone builder pattern. If we add a required argument we should see compile errors on that level to make sure we don't miss any place to set them. Prerequisite for #5917

kimchy · 2014-04-25T10:47:34Z

I am good with getting this in, @s1monw thanks for the improvement on making things more immutable, should we get it in?

s1monw · 2014-04-25T10:48:11Z

@kimchy I left on TODO in the code can you take a look at it?

kimchy · 2014-04-25T10:53:24Z

ahh, yea, I see, I think its safe to remove it, I will remove it and squash

s1monw · 2014-04-25T10:58:34Z

@kimchy can you add an assertion at that place? and make the member final :)

kimchy · 2014-04-25T10:59:46Z

@s1monw aye, already done, running tests

s1monw · 2014-04-25T11:30:39Z

LGTM

When a create document is executed, and its an auto generated id (based on UUID), we know that the document will not exists in the index, so there is no need to try and lookup the version from the index. For many cases, like logging, where ids are auto generated, this can improve the indexing performance, specifically for lightweight documents where analysis is not a big part of the execution.

When a create document is executed, and its an auto generated id (based on UUID), we know that the document will not exists in the index, so there is no need to try and lookup the version from the index. For many cases, like logging, where ids are auto generated, this can improve the indexing performance, specifically for lightweight documents where analysis is not a big part of the execution. closes #5917

avleen · 2014-04-25T19:31:42Z

There's an interesting use case here, for which I hope someone can clarify the impact of this change:

In the logging case, it can be desirable to have the same log line generate the same ID (maybe based on the hash of the log line or something).
Eg, if I lose a shard which has no replica and I need to replay my logs to fill in the missing data, two facts are probably true:

I can't pick out exactly which lines are missing from a shard, out of millions in a log file.
I probably want to just replay the entire log and let the system handle indexing, deduplication, etc.

In this case, what would happen? From the comments, I think I'd get a lot of duplicated entries.
That may be acceptable if there's a cheap way to clean them up?

s1monw · 2014-04-25T19:40:12Z

@avleen in your case you provide the id as you said (maybe based on the hash of the log line or something) in that case this change doesn't apply since you are not using an ES autogenerated id. This only applied to documents where the internal ID generator is used and this means the docs are by definition unique.

kimchy · 2014-04-25T19:44:50Z

@avleen also, (depending on your system), if you end up replying the entire log, you are basically reindexing the data, in which case, assuming rolling indices, you might want to reindex today. Pretty much use case dependent, but thats another option. (obviously thats assuming you don't want to run with replicas, even with bulk async replication)

avleen · 2014-04-25T22:54:07Z

Thanks @s1monw, @kimchy! Makes perfect sense :-)

kimchy added the review label Apr 23, 2014

s1monw reviewed Apr 23, 2014
View reviewed changes

s1monw reviewed Apr 24, 2014
View reviewed changes

s1monw added v2.0.0 labels Apr 24, 2014

s1monw self-assigned this Apr 24, 2014

s1monw mentioned this pull request Apr 25, 2014

Make Create/Update/Delete classes less mutable #5939

Closed

kimchy added the enhancement label Apr 25, 2014

kimchy closed this in 65bc017 Apr 25, 2014

kimchy removed blocker labels Apr 25, 2014

kimchy deleted the enhacement/index_auto_generated_id branch April 25, 2014 12:33

brwe mentioned this pull request Apr 25, 2014

Optimize auto generated id indexing operation / no need to lookup version #5812

Closed

clintongormley added the :Core/Infra/Core Core issues without another label label Jun 7, 2015

bleskes mentioned this pull request Aug 4, 2016

Optimize indexing in create once and never update scenarios #19813

Closed

10 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Don't lookup version for auto generated id and create #5917

Don't lookup version for auto generated id and create #5917

kimchy commented Apr 23, 2014

clintongormley commented Apr 23, 2014

kimchy commented Apr 23, 2014

s1monw Apr 23, 2014

s1monw commented Apr 23, 2014

kimchy commented Apr 23, 2014

s1monw Apr 24, 2014

s1monw commented Apr 24, 2014

kimchy commented Apr 25, 2014

s1monw commented Apr 25, 2014

kimchy commented Apr 25, 2014

s1monw commented Apr 25, 2014

kimchy commented Apr 25, 2014

s1monw commented Apr 25, 2014

avleen commented Apr 25, 2014

s1monw commented Apr 25, 2014

kimchy commented Apr 25, 2014

avleen commented Apr 25, 2014

Don't lookup version for auto generated id and create #5917

Don't lookup version for auto generated id and create #5917

Conversation

kimchy commented Apr 23, 2014

clintongormley commented Apr 23, 2014

kimchy commented Apr 23, 2014

s1monw Apr 23, 2014

Choose a reason for hiding this comment

s1monw commented Apr 23, 2014

kimchy commented Apr 23, 2014

s1monw Apr 24, 2014

Choose a reason for hiding this comment

s1monw commented Apr 24, 2014

kimchy commented Apr 25, 2014

s1monw commented Apr 25, 2014

kimchy commented Apr 25, 2014

s1monw commented Apr 25, 2014

kimchy commented Apr 25, 2014

s1monw commented Apr 25, 2014

avleen commented Apr 25, 2014

s1monw commented Apr 25, 2014

kimchy commented Apr 25, 2014

avleen commented Apr 25, 2014