Don't lookup version for auto generated id and create #5917

Closed
wants to merge 1 commit into
from

Conversation

Projects
None yet
4 participants
@kimchy
Member

kimchy commented Apr 23, 2014

When a create document is executed, and its an auto generated id (based on UUID), we know that the document will not exists in the index, so there is no need to try and lookup the version from the index.
For many cases, like logging, where ids are auto generated, this can improve the indexing performance, specifically for lightweight documents where analysis is not a big part of the execution.

@kimchy kimchy added the review label Apr 23, 2014

@clintongormley

This comment has been minimized.

Show comment
Hide comment
@clintongormley

clintongormley Apr 23, 2014

Member

It's extremely unlikely that the ID doesn't exist, but not completely impossible (eg if somebody previously specified the same ID manually). What would be the consequences of a clash? Just that the version number would be incorrect? Or would you end up with duplicate docs?

Member

clintongormley commented Apr 23, 2014

It's extremely unlikely that the ID doesn't exist, but not completely impossible (eg if somebody previously specified the same ID manually). What would be the consequences of a clash? Just that the version number would be incorrect? Or would you end up with duplicate docs?

@kimchy

This comment has been minimized.

Show comment
Hide comment
@kimchy

kimchy Apr 23, 2014

Member

@clintongormley in the extreme event that there is a clash, then we will end with duplicate docs. I don't think this will happen though. Clashes in terms of UUID are close to impossible, and generated the same id on the client side that clashes with one in ES and ending up doing a create operation... (this optimization is only for create use case)

Member

kimchy commented Apr 23, 2014

@clintongormley in the extreme event that there is a clash, then we will end with duplicate docs. I don't think this will happen though. Clashes in terms of UUID are close to impossible, and generated the same id on the client side that clashes with one in ES and ending up doing a create operation... (this optimization is only for create use case)

+ request.enablePotentialDup(); // safe side, cluster state changed, we might have dups
+ } else{
+ shardIt.reset();
+ while ((shard = shardIt.nextOrNull()) != null) {

This comment has been minimized.

@s1monw

s1monw Apr 23, 2014

Contributor

maybe we can utli method to the shardIter here? I'd prefer to keep this code short here it's large enough

@s1monw

s1monw Apr 23, 2014

Contributor

maybe we can utli method to the shardIter here? I'd prefer to keep this code short here it's large enough

+ return this.autoGeneratedId;
+ }
+
+ public Create latchPotentialDup(boolean potentialDup) {

This comment has been minimized.

@s1monw

s1monw Apr 23, 2014

Contributor

I am sorry but this name is completely off :) I don't get what it means :) Can we find a really good name here? here are some suggestions:

  • updateRequired
  • canHaveDuplicates
  • requiresUpdate
  • resolveDuplicates
@s1monw

s1monw Apr 23, 2014

Contributor

I am sorry but this name is completely off :) I don't get what it means :) Can we find a really good name here? here are some suggestions:

  • updateRequired
  • canHaveDuplicates
  • requiresUpdate
  • resolveDuplicates

This comment has been minimized.

@kimchy

kimchy Apr 23, 2014

Member

no problem, I was having trouble with a name here, will go with canHaveDuplicates across the board

@kimchy

kimchy Apr 23, 2014

Member

no problem, I was having trouble with a name here, will go with canHaveDuplicates across the board

@s1monw

This comment has been minimized.

Show comment
Hide comment
@s1monw

s1monw Apr 23, 2014

Contributor

I added some comments regarding naming. Yet what I miss here is a really good test that adds a lot of stuff with auto IDs while starting up replicas etc. we might also randomize some existing tests but I think they should at least use bulk or concurrent indexing.

Contributor

s1monw commented Apr 23, 2014

I added some comments regarding naming. Yet what I miss here is a really good test that adds a lot of stuff with auto IDs while starting up replicas etc. we might also randomize some existing tests but I think they should at least use bulk or concurrent indexing.

@kimchy

This comment has been minimized.

Show comment
Hide comment
@kimchy

kimchy Apr 23, 2014

Member

I added a test to master/1.x, and rebased this branch against it.

Member

kimchy commented Apr 23, 2014

I added a test to master/1.x, and rebased this branch against it.

+ return this.autoGeneratedId;
+ }
+
+ public Create chanHaveDuplicates(boolean canHaveDuplicates) {

This comment has been minimized.

@s1monw

s1monw Apr 24, 2014

Contributor

typo I guess s/chanHaveDuplicates/canHaveDuplicates/

@s1monw

s1monw Apr 24, 2014

Contributor

typo I guess s/chanHaveDuplicates/canHaveDuplicates/

@@ -410,7 +410,8 @@ private WriteResult shardIndexOperation(BulkShardRequest request, IndexRequest i
Engine.IndexingOperation op;
try {
if (indexRequest.opType() == IndexRequest.OpType.INDEX) {
- Engine.Index index = indexShard.prepareIndex(sourceToParse).version(indexRequest.version()).versionType(indexRequest.versionType()).origin(Engine.Operation.Origin.PRIMARY);
+ Engine.Index index = indexShard.prepareIndex(sourceToParse).version(indexRequest.version()).versionType(indexRequest.versionType()).origin(Engine.Operation.Origin.PRIMARY)

This comment has been minimized.

@s1monw

s1monw Apr 24, 2014

Contributor

I think we should work on this before we commit this change. The mutable fashion of this is to dangerous to miss something. I think this should be a single function that requires you to specify all the different parameters such that we don't have a chance to miss it. This is too risky for my taste here.

@s1monw

s1monw Apr 24, 2014

Contributor

I think we should work on this before we commit this change. The mutable fashion of this is to dangerous to miss something. I think this should be a single function that requires you to specify all the different parameters such that we don't have a chance to miss it. This is too risky for my taste here.

This comment has been minimized.

@kimchy

kimchy Apr 24, 2014

Member

agreed, but I don't agree that this should go with this change, since its a problem we had before. I would be happy to open another issue and work on it separately, and not make this pull request much bigger than it should be.

@kimchy

kimchy Apr 24, 2014

Member

agreed, but I don't agree that this should go with this change, since its a problem we had before. I would be happy to open another issue and work on it separately, and not make this pull request much bigger than it should be.

@@ -63,6 +65,17 @@ public ShardReplicationOperationRequest(T request) {
this.consistencyLevel = request.consistencyLevel();
}
+ void setCanHaveDuplicates() {

This comment has been minimized.

@s1monw

s1monw Apr 24, 2014

Contributor

in this context the name doesn't make too much sense, I guess something like replicatedMoreThanOnce is better here, ie. more generic. Makes sense?

@s1monw

s1monw Apr 24, 2014

Contributor

in this context the name doesn't make too much sense, I guess something like replicatedMoreThanOnce is better here, ie. more generic. Makes sense?

This comment has been minimized.

@kimchy

kimchy Apr 24, 2014

Member

I actually like the name, since it effectively means that this request can come in duplicated, same as with the engine case.

@kimchy

kimchy Apr 24, 2014

Member

I actually like the name, since it effectively means that this request can come in duplicated, same as with the engine case.

+ shardIt.reset();
+ while ((shard = shardIt.nextOrNull()) != null) {
+ if (shard.state() != ShardRoutingState.STARTED) {
+ request.setCanHaveDuplicates();

This comment has been minimized.

@s1monw

s1monw Apr 24, 2014

Contributor

we can break out here if we see one that is not started no?

@s1monw

s1monw Apr 24, 2014

Contributor

we can break out here if we see one that is not started no?

This comment has been minimized.

@kimchy

kimchy Apr 24, 2014

Member

I was thinking about it :), but decided against it because in practice, there really not going to be many replicas, and its just noise in the code. Can add it if you feel strongly about it.

@kimchy

kimchy Apr 24, 2014

Member

I was thinking about it :), but decided against it because in practice, there really not going to be many replicas, and its just noise in the code. Can add it if you feel strongly about it.

@s1monw

This comment has been minimized.

Show comment
Hide comment
@s1monw

s1monw Apr 24, 2014

Contributor

I really want to have this optimization but I think we need to fix the API for Engine#Create and friends to be immutable. The builder like patter is so error prone I don't think we should carry that on over this change. Other than that I left some comments & it looks good

Contributor

s1monw commented Apr 24, 2014

I really want to have this optimization but I think we need to fix the API for Engine#Create and friends to be immutable. The builder like patter is so error prone I don't think we should carry that on over this change. Other than that I left some comments & it looks good

@s1monw s1monw added v2.0.0 labels Apr 24, 2014

@s1monw s1monw self-assigned this Apr 24, 2014

s1monw added a commit to s1monw/elasticsearch that referenced this pull request Apr 25, 2014

Make Create/Update/Delete classes less mutable
Today we use a builder pattern / setters to set relevant information
to Engine#Delete|Create|Index. Yet almost all the values are required
but they are not passed via ctor arguments but via an error prone builder
pattern. If we add a required argument we should see compile errors on that
level to make sure we don't miss any place to set them.

Prerequisite for #5917

s1monw added a commit that referenced this pull request Apr 25, 2014

Make Create/Update/Delete classes less mutable
Today we use a builder pattern / setters to set relevant information
to Engine#Delete|Create|Index. Yet almost all the values are required
but they are not passed via ctor arguments but via an error prone builder
pattern. If we add a required argument we should see compile errors on that
level to make sure we don't miss any place to set them.

Prerequisite for #5917

s1monw added a commit that referenced this pull request Apr 25, 2014

Make Create/Update/Delete classes less mutable
Today we use a builder pattern / setters to set relevant information
to Engine#Delete|Create|Index. Yet almost all the values are required
but they are not passed via ctor arguments but via an error prone builder
pattern. If we add a required argument we should see compile errors on that
level to make sure we don't miss any place to set them.

Prerequisite for #5917
@kimchy

This comment has been minimized.

Show comment
Hide comment
@kimchy

kimchy Apr 25, 2014

Member

I am good with getting this in, @s1monw thanks for the improvement on making things more immutable, should we get it in?

Member

kimchy commented Apr 25, 2014

I am good with getting this in, @s1monw thanks for the improvement on making things more immutable, should we get it in?

@s1monw

This comment has been minimized.

Show comment
Hide comment
@s1monw

s1monw Apr 25, 2014

Contributor

@kimchy I left on TODO in the code can you take a look at it?

Contributor

s1monw commented Apr 25, 2014

@kimchy I left on TODO in the code can you take a look at it?

@kimchy

This comment has been minimized.

Show comment
Hide comment
@kimchy

kimchy Apr 25, 2014

Member

ahh, yea, I see, I think its safe to remove it, I will remove it and squash

Member

kimchy commented Apr 25, 2014

ahh, yea, I see, I think its safe to remove it, I will remove it and squash

@s1monw

This comment has been minimized.

Show comment
Hide comment
@s1monw

s1monw Apr 25, 2014

Contributor

@kimchy can you add an assertion at that place? and make the member final :)

Contributor

s1monw commented Apr 25, 2014

@kimchy can you add an assertion at that place? and make the member final :)

@kimchy

This comment has been minimized.

Show comment
Hide comment
@kimchy

kimchy Apr 25, 2014

Member

@s1monw aye, already done, running tests

Member

kimchy commented Apr 25, 2014

@s1monw aye, already done, running tests

@s1monw

This comment has been minimized.

Show comment
Hide comment
@s1monw

s1monw Apr 25, 2014

Contributor

LGTM

Contributor

s1monw commented Apr 25, 2014

LGTM

Don't lookup version for auto generated id and create
When a create document is executed, and its an auto generated id (based on UUID), we know that the document will not exists in the index, so there is no need to try and lookup the version from the index.
For many cases, like logging, where ids are auto generated, this can improve the indexing performance, specifically for lightweight documents where analysis is not a big part of the execution.

@kimchy kimchy added the enhancement label Apr 25, 2014

@kimchy kimchy closed this in 65bc017 Apr 25, 2014

kimchy added a commit that referenced this pull request Apr 25, 2014

Don't lookup version for auto generated id and create
When a create document is executed, and its an auto generated id (based on UUID), we know that the document will not exists in the index, so there is no need to try and lookup the version from the index.
For many cases, like logging, where ids are auto generated, this can improve the indexing performance, specifically for lightweight documents where analysis is not a big part of the execution.
closes #5917

@kimchy kimchy removed blocker labels Apr 25, 2014

@kimchy kimchy deleted the enhacement/index_auto_generated_id branch Apr 25, 2014

@avleen

This comment has been minimized.

Show comment
Hide comment
@avleen

avleen Apr 25, 2014

There's an interesting use case here, for which I hope someone can clarify the impact of this change:

In the logging case, it can be desirable to have the same log line generate the same ID (maybe based on the hash of the log line or something).
Eg, if I lose a shard which has no replica and I need to replay my logs to fill in the missing data, two facts are probably true:

  1. I can't pick out exactly which lines are missing from a shard, out of millions in a log file.
  2. I probably want to just replay the entire log and let the system handle indexing, deduplication, etc.

In this case, what would happen? From the comments, I think I'd get a lot of duplicated entries.
That may be acceptable if there's a cheap way to clean them up?

avleen commented Apr 25, 2014

There's an interesting use case here, for which I hope someone can clarify the impact of this change:

In the logging case, it can be desirable to have the same log line generate the same ID (maybe based on the hash of the log line or something).
Eg, if I lose a shard which has no replica and I need to replay my logs to fill in the missing data, two facts are probably true:

  1. I can't pick out exactly which lines are missing from a shard, out of millions in a log file.
  2. I probably want to just replay the entire log and let the system handle indexing, deduplication, etc.

In this case, what would happen? From the comments, I think I'd get a lot of duplicated entries.
That may be acceptable if there's a cheap way to clean them up?

@s1monw

This comment has been minimized.

Show comment
Hide comment
@s1monw

s1monw Apr 25, 2014

Contributor

@avleen in your case you provide the id as you said (maybe based on the hash of the log line or something) in that case this change doesn't apply since you are not using an ES autogenerated id. This only applied to documents where the internal ID generator is used and this means the docs are by definition unique.

Contributor

s1monw commented Apr 25, 2014

@avleen in your case you provide the id as you said (maybe based on the hash of the log line or something) in that case this change doesn't apply since you are not using an ES autogenerated id. This only applied to documents where the internal ID generator is used and this means the docs are by definition unique.

@kimchy

This comment has been minimized.

Show comment
Hide comment
@kimchy

kimchy Apr 25, 2014

Member

@avleen also, (depending on your system), if you end up replying the entire log, you are basically reindexing the data, in which case, assuming rolling indices, you might want to reindex today. Pretty much use case dependent, but thats another option. (obviously thats assuming you don't want to run with replicas, even with bulk async replication)

Member

kimchy commented Apr 25, 2014

@avleen also, (depending on your system), if you end up replying the entire log, you are basically reindexing the data, in which case, assuming rolling indices, you might want to reindex today. Pretty much use case dependent, but thats another option. (obviously thats assuming you don't want to run with replicas, even with bulk async replication)

@avleen

This comment has been minimized.

Show comment
Hide comment
@avleen

avleen Apr 25, 2014

Thanks @s1monw, @kimchy! Makes perfect sense :-)

avleen commented Apr 25, 2014

Thanks @s1monw, @kimchy! Makes perfect sense :-)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment