Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Don't lookup version for auto generated id and create #5917

Closed
wants to merge 1 commit into from

Conversation

kimchy
Copy link
Member

@kimchy kimchy commented Apr 23, 2014

When a create document is executed, and its an auto generated id (based on UUID), we know that the document will not exists in the index, so there is no need to try and lookup the version from the index.
For many cases, like logging, where ids are auto generated, this can improve the indexing performance, specifically for lightweight documents where analysis is not a big part of the execution.

@kimchy kimchy added the review label Apr 23, 2014
@clintongormley
Copy link

It's extremely unlikely that the ID doesn't exist, but not completely impossible (eg if somebody previously specified the same ID manually). What would be the consequences of a clash? Just that the version number would be incorrect? Or would you end up with duplicate docs?

@kimchy
Copy link
Member Author

kimchy commented Apr 23, 2014

@clintongormley in the extreme event that there is a clash, then we will end with duplicate docs. I don't think this will happen though. Clashes in terms of UUID are close to impossible, and generated the same id on the client side that clashes with one in ES and ending up doing a create operation... (this optimization is only for create use case)

request.enablePotentialDup(); // safe side, cluster state changed, we might have dups
} else{
shardIt.reset();
while ((shard = shardIt.nextOrNull()) != null) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe we can utli method to the shardIter here? I'd prefer to keep this code short here it's large enough

@s1monw
Copy link
Contributor

s1monw commented Apr 23, 2014

I added some comments regarding naming. Yet what I miss here is a really good test that adds a lot of stuff with auto IDs while starting up replicas etc. we might also randomize some existing tests but I think they should at least use bulk or concurrent indexing.

@kimchy
Copy link
Member Author

kimchy commented Apr 23, 2014

I added a test to master/1.x, and rebased this branch against it.

return this.autoGeneratedId;
}

public Create chanHaveDuplicates(boolean canHaveDuplicates) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo I guess s/chanHaveDuplicates/canHaveDuplicates/

@s1monw
Copy link
Contributor

s1monw commented Apr 24, 2014

I really want to have this optimization but I think we need to fix the API for Engine#Create and friends to be immutable. The builder like patter is so error prone I don't think we should carry that on over this change. Other than that I left some comments & it looks good

@s1monw s1monw self-assigned this Apr 24, 2014
s1monw added a commit to s1monw/elasticsearch that referenced this pull request Apr 25, 2014
Today we use a builder pattern / setters to set relevant information
to Engine#Delete|Create|Index. Yet almost all the values are required
but they are not passed via ctor arguments but via an error prone builder
pattern. If we add a required argument we should see compile errors on that
level to make sure we don't miss any place to set them.

Prerequisite for elastic#5917
s1monw added a commit that referenced this pull request Apr 25, 2014
Today we use a builder pattern / setters to set relevant information
to Engine#Delete|Create|Index. Yet almost all the values are required
but they are not passed via ctor arguments but via an error prone builder
pattern. If we add a required argument we should see compile errors on that
level to make sure we don't miss any place to set them.

Prerequisite for #5917
s1monw added a commit that referenced this pull request Apr 25, 2014
Today we use a builder pattern / setters to set relevant information
to Engine#Delete|Create|Index. Yet almost all the values are required
but they are not passed via ctor arguments but via an error prone builder
pattern. If we add a required argument we should see compile errors on that
level to make sure we don't miss any place to set them.

Prerequisite for #5917
@kimchy
Copy link
Member Author

kimchy commented Apr 25, 2014

I am good with getting this in, @s1monw thanks for the improvement on making things more immutable, should we get it in?

@s1monw
Copy link
Contributor

s1monw commented Apr 25, 2014

@kimchy I left on TODO in the code can you take a look at it?

@kimchy
Copy link
Member Author

kimchy commented Apr 25, 2014

ahh, yea, I see, I think its safe to remove it, I will remove it and squash

@s1monw
Copy link
Contributor

s1monw commented Apr 25, 2014

@kimchy can you add an assertion at that place? and make the member final :)

@kimchy
Copy link
Member Author

kimchy commented Apr 25, 2014

@s1monw aye, already done, running tests

@s1monw
Copy link
Contributor

s1monw commented Apr 25, 2014

LGTM

When a create document is executed, and its an auto generated id (based on UUID), we know that the document will not exists in the index, so there is no need to try and lookup the version from the index.
For many cases, like logging, where ids are auto generated, this can improve the indexing performance, specifically for lightweight documents where analysis is not a big part of the execution.
@kimchy kimchy closed this in 65bc017 Apr 25, 2014
kimchy added a commit that referenced this pull request Apr 25, 2014
When a create document is executed, and its an auto generated id (based on UUID), we know that the document will not exists in the index, so there is no need to try and lookup the version from the index.
For many cases, like logging, where ids are auto generated, this can improve the indexing performance, specifically for lightweight documents where analysis is not a big part of the execution.
closes #5917
@kimchy kimchy deleted the enhacement/index_auto_generated_id branch April 25, 2014 12:33
@avleen
Copy link

avleen commented Apr 25, 2014

There's an interesting use case here, for which I hope someone can clarify the impact of this change:

In the logging case, it can be desirable to have the same log line generate the same ID (maybe based on the hash of the log line or something).
Eg, if I lose a shard which has no replica and I need to replay my logs to fill in the missing data, two facts are probably true:

  1. I can't pick out exactly which lines are missing from a shard, out of millions in a log file.
  2. I probably want to just replay the entire log and let the system handle indexing, deduplication, etc.

In this case, what would happen? From the comments, I think I'd get a lot of duplicated entries.
That may be acceptable if there's a cheap way to clean them up?

@s1monw
Copy link
Contributor

s1monw commented Apr 25, 2014

@avleen in your case you provide the id as you said (maybe based on the hash of the log line or something) in that case this change doesn't apply since you are not using an ES autogenerated id. This only applied to documents where the internal ID generator is used and this means the docs are by definition unique.

@kimchy
Copy link
Member Author

kimchy commented Apr 25, 2014

@avleen also, (depending on your system), if you end up replying the entire log, you are basically reindexing the data, in which case, assuming rolling indices, you might want to reindex today. Pretty much use case dependent, but thats another option. (obviously thats assuming you don't want to run with replicas, even with bulk async replication)

@avleen
Copy link

avleen commented Apr 25, 2014

Thanks @s1monw, @kimchy! Makes perfect sense :-)

@clintongormley clintongormley added the :Core/Infra/Core Core issues without another label label Jun 7, 2015
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants