Don't lookup version for auto generated id and create #5917
It's extremely unlikely that the ID doesn't exist, but not completely impossible (e.g. if somebody previously specified the same ID manually). What would be the consequences of a clash? Just that the version number would be incorrect? Or would you end up with duplicate docs?
@clintongormley in the extreme event that there is a clash, we will end up with duplicate docs. I don't think this will happen, though. UUID clashes are close to impossible, and so is generating an id on the client side that clashes with one in ES and ending up doing a create operation... (this optimization applies only to the create use case)
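To make the clash scenario above concrete, here is a minimal toy sketch, not the Elasticsearch engine. `ToyEngine`, `create`, `count`, and `lookups` are hypothetical names invented for illustration. It assumes that an operation flagged as having an auto-generated id is appended without any existence check, so a clashing id ends up stored twice.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: all names are hypothetical, this is not Elasticsearch code.
final class ToyEngine {
    private final List<String> docIds = new ArrayList<>(); // append-only store
    int lookups = 0; // number of simulated version lookups

    // Simulates the per-document lookup that the optimization wants to skip.
    private boolean exists(String id) {
        lookups++;
        return docIds.contains(id);
    }

    // Manual ids are checked first; auto-generated ids are appended blindly.
    void create(String id, boolean autoGeneratedId) {
        if (!autoGeneratedId && exists(id)) {
            throw new IllegalStateException("document already exists: " + id);
        }
        docIds.add(id);
    }

    // Counts how many stored documents carry the given id.
    long count(String id) {
        return docIds.stream().filter(id::equals).count();
    }
}

final class ClashDemo {
    public static void main(String[] args) {
        ToyEngine engine = new ToyEngine();
        engine.create("abc", false); // manually supplied id, lookup performed
        engine.create("abc", true);  // "auto-generated" id that happens to clash
        System.out.println("copies of abc: " + engine.count("abc"));
        System.out.println("lookups performed: " + engine.lookups);
    }
}
```

In this toy model the second create never consults the store, which is where both the saving and the duplicate come from.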
request.enablePotentialDup(); // safe side, cluster state changed, we might have dups
} else{
shardIt.reset();
while ((shard = shardIt.nextOrNull()) != null) {
Maybe we can add a util method to the shardIter here? I'd prefer to keep this code short; it's large enough already.
I added some comments regarding naming. What I still miss here is a really good test that adds a lot of stuff with auto IDs while starting up replicas etc. We might also randomize some existing tests, but I think they should at least use bulk or concurrent indexing.
I added a test to master/1.x, and rebased this branch against it.
return this.autoGeneratedId;
}

public Create chanHaveDuplicates(boolean canHaveDuplicates) {
typo I guess s/chanHaveDuplicates/canHaveDuplicates/
I really want to have this optimization but I think we need to fix the API for
Today we use a builder pattern / setters to set relevant information to Engine#Delete|Create|Index. Yet almost all the values are required but they are not passed via ctor arguments but via an error prone builder pattern. If we add a required argument we should see compile errors on that level to make sure we don't miss any place to set them. Prerequisite for elastic#5917
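The contrast described above, between required values passed through setters and required values passed through the constructor, can be sketched as follows. `CreateOp` and its fields are hypothetical stand-ins, not the real `Engine.Create` signature. The point is only that adding a new required constructor parameter breaks compilation at every call site that has not been updated, whereas a forgotten setter call goes unnoticed.

```java
import java.util.Objects;

// Hypothetical stand-in for an engine operation; not the real Engine.Create.
final class CreateOp {
    private final String id;
    private final long version;
    private final boolean autoGeneratedId;
    private final boolean canHaveDuplicates;

    // Every required value is a constructor argument, so a newly added one
    // causes a compile error wherever the constructor is called without it.
    CreateOp(String id, long version, boolean autoGeneratedId, boolean canHaveDuplicates) {
        this.id = Objects.requireNonNull(id, "id");
        this.version = version;
        this.autoGeneratedId = autoGeneratedId;
        this.canHaveDuplicates = canHaveDuplicates;
    }

    String id() { return id; }
    long version() { return version; }
    boolean autoGeneratedId() { return autoGeneratedId; }
    boolean canHaveDuplicates() { return canHaveDuplicates; }
}

final class CreateOpDemo {
    public static void main(String[] args) {
        CreateOp op = new CreateOp("abc", 1L, true, false);
        System.out.println(op.id() + " autoGeneratedId=" + op.autoGeneratedId());
    }
}
```

Making the fields `final` also gives the immutability mentioned later in the thread, since nothing can change the flags after construction.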
I am good with getting this in. @s1monw thanks for the improvement on making things more immutable, should we get it in?
@kimchy I left one TODO in the code, can you take a look at it?
ahh, yea, I see, I think it's safe to remove it, I will remove it and squash
@kimchy can you add an assertion at that place? and make the member final :)
@s1monw aye, already done, running tests
LGTM
When a create operation is executed with an auto-generated id (based on UUID), we know that the document will not exist in the index, so there is no need to look up the version from the index. In many cases, like logging, where ids are auto-generated, this can improve indexing performance, specifically for lightweight documents where analysis is not a big part of the execution. Closes #5917
There's an interesting use case here, for which I hope someone can clarify the impact of this change: In the logging case, it can be desirable to have the same log line generate the same ID (maybe based on the hash of the log line or something).
In this case, what would happen? From the comments, I think I'd get a lot of duplicated entries.
@avleen in your case you provide the id as you said
@avleen also (depending on your system), if you end up replaying the entire log, you are basically reindexing the data, in which case, assuming rolling indices, you might want to reindex today. Pretty much use case dependent, but that's another option. (Obviously that's assuming you don't want to run with replicas, even with bulk async replication.)
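For the client-supplied id case raised above, a minimal sketch of deriving a stable id from a log line might look like this. `LogLineId` and `idFor` are hypothetical names, and SHA-256 is an assumption (the thread only says "the hash of the log line or something"). Because the id comes from the client, it is not auto-generated by ES, so the optimization in this PR would not apply to such requests.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper: derives a deterministic document id from a log line.
final class LogLineId {
    static String idFor(String logLine) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            byte[] hash = digest.digest(logLine.getBytes(StandardCharsets.UTF_8));
            // Hex-encode the 32-byte digest into a 64-character id.
            StringBuilder hex = new StringBuilder(hash.length * 2);
            for (byte b : hash) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 must be supported by every Java platform implementation.
            throw new IllegalStateException("SHA-256 unavailable", e);
        }
    }
}

final class LogLineIdDemo {
    public static void main(String[] args) {
        String line = "2014-04-23 12:00:00 INFO service started";
        // The same line always maps to the same id, so replays target the same document.
        System.out.println(LogLineId.idFor(line));
        System.out.println(LogLineId.idFor(line).equals(LogLineId.idFor(line)));
    }
}
```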