Disable auto gen id optimization #9468

Closed
wants to merge 3 commits into
from

Projects

None yet

4 participants

@brwe
Contributor
brwe commented Jan 28, 2015

This pr removes the optimization for auto generated ids.
Previously, when ids were auto generated by elasticsearch then there was no
check to see if a document with same id already existed and instead the new
document was only appended. However, due to lucene improvements this
optimization does not add much value. In addition, under rare circumstances it might
cause duplicate documents:

When an indexing request is retried (due to connect lost, node closed etc),
then a flag 'canHaveDuplicates' is set to true for the indexing request
that is send a second time. This was to make sure that even
when an indexing request for a document with autogenerated id comes in
we do not have to update unless this flag is set and instead only append.

However, it might happen that for a retry or for the replication the
indexing request that has the canHaveDuplicates set to true (the retried request) arrives
at the destination before the original request that does have it set false.
In this case both request add a document and we have a duplicated a document.
This commit adds a workaround: remove the optimization for auto
generated ids and always update the document.
The asumtion is that this will not slow down indexing more than 10 percent,
see: http://benchmarks.elasticsearch.org/

closes #8788

@brwe brwe core: fix duplicate docs with autogenerated ids
When an indexing request is retried (due to connect lost, node closed etc),
then a flag 'canHaveDuplicates' is set to true for the indexing request
that is send a second time. This was to make sure that even
when an indexing request for a document with autogenerated id comes in
we do not have to update unless this flag is set and instead only append.

However, it might happen that for a retry or for the replication the
indexing request that has the canHaveDuplicates set to true (the retried request) arrives
at the destination before the original request that does have it set false.
In this case both request add a document and we have a duplicated a document.
This commit adds a workaround: remove the optimization for auto
generated ids and always update the document.
The asumtion is that this will not slow down indexing more than 10 percent,
see: http://benchmarks.elasticsearch.org/

closes #8788
bf7752e
@s1monw s1monw and 1 other commented on an outdated diff Jan 29, 2015
...asticsearch/index/engine/internal/InternalEngine.java
@@ -401,11 +401,11 @@ private void innerCreateNoLock(Create create, IndexWriter writer, long currentVe
if ((versionValue != null && versionValue.delete() == false) || (versionValue == null && currentVersion != Versions.NOT_FOUND)) {
if (create.origin() == Operation.Origin.RECOVERY) {
return;
- } else if (create.origin() == Operation.Origin.REPLICA) {
@s1monw
s1monw Jan 29, 2015 Contributor

do we need this change here? I mean removing the optimization should be enough?

@brwe
brwe Jan 29, 2015 Contributor

Without the change there are two scenarios:

  1. First and second create request arrive in order on replica: on the replica 'doUpdate' is set but the version is not set to 1 -> versions on primary and replica are out of sync.
  2. First and second create request arrive out of order on primary: we get a 'DocumentAlreadyExistsException' because none of the other criteria match.

The tests I added in InternalEngineTests fail that way.
It seems to me we have to change something to make this work but I might be missing something.

@bleskes I agree we do not have to update the doc at all if we find it already exists and create.autoGeneratedId() is true. We can just return. I added a commit, test pass.

@brwe
brwe Jan 29, 2015 Contributor

ok, ignore the comment for now, might just have gotten the versioning wrong...

@brwe
brwe Jan 29, 2015 Contributor

yes, that was wrong. I removed the change and fixed the test instead

@s1monw s1monw commented on the diff Jan 29, 2015
...in/java/org/elasticsearch/index/shard/IndexShard.java
@@ -212,7 +212,7 @@ public IndexShard(ShardId shardId, @IndexSettings Settings indexSettings, IndexS
/* create engine config */
this.config = new EngineConfig(shardId,
- indexSettings.getAsBoolean(EngineConfig.INDEX_OPTIMIZE_AUTOGENERATED_ID_SETTING, true),
@s1monw
s1monw Jan 29, 2015 Contributor

I'd want to remove this entirely but this change is supposed to go into 1.4 as well afaik so I guess that is fine.

@brwe
brwe Jan 29, 2015 Contributor

I tried to make it as minimal invasive as possible for 1.4.

@s1monw
Contributor
s1monw commented Jan 29, 2015

left some comments

@bleskes bleskes and 1 other commented on an outdated diff Jan 29, 2015
...asticsearch/index/engine/internal/InternalEngine.java
@@ -401,11 +401,11 @@ private void innerCreateNoLock(Create create, IndexWriter writer, long currentVe
if ((versionValue != null && versionValue.delete() == false) || (versionValue == null && currentVersion != Versions.NOT_FOUND)) {
if (create.origin() == Operation.Origin.RECOVERY) {
return;
- } else if (create.origin() == Operation.Origin.REPLICA) {
+ } else if (create.origin() == Operation.Origin.REPLICA && !create.autoGeneratedId()) {
@bleskes
bleskes Jan 29, 2015 Member

I think this is wrong? I this is an indexing request on a replica and we're here, that means that there is already a doc with this id. In this case we want to just ignore the request. +1 on what Simon said regarding not changing this code.

@brwe
brwe Jan 29, 2015 Contributor

indeed, I removed the change and fixed the test instead

@bleskes bleskes commented on the diff Jan 29, 2015
...search/index/engine/internal/InternalEngineTests.java
@@ -1559,4 +1562,78 @@ public void testSettings() {
}
}
+
+ @Test
+ public void testRetryWithAutogeneratedIdWorksAndNoDuplicateDocs() throws IOException {
+
+ ParsedDocument doc = testParsedDocument("1", "1", "test", null, -1, -1, testDocument(), B_1, false);
+ boolean canHaveDuplicates = false;
@bleskes
bleskes Jan 29, 2015 Member

This shouldn't have any effect any more, right? do we want to randomize it with a comment it's being removed?

@bleskes
bleskes Jan 29, 2015 Member

Oh, I see. you have two tests. We can maybe fold it into one and have the canHaveDuplicates set randomly and then set it at the reverse?

@brwe
brwe Jan 29, 2015 Contributor

It looked like the two tests do the exact same but now that I fixed the test it is not so anymore. The behavior is different depending on the order. Please take a look and let me know if you still want to fold.

@bleskes
bleskes Jan 29, 2015 Member

Yeah, I see now. We'll clean it up later.

@bleskes bleskes commented on an outdated diff Jan 29, 2015
...search/index/engine/internal/InternalEngineTests.java
+ public void testRetryWithAutogeneratedIdWorksAndNoDuplicateDocs() throws IOException {
+
+ ParsedDocument doc = testParsedDocument("1", "1", "test", null, -1, -1, testDocument(), B_1, false);
+ boolean canHaveDuplicates = false;
+ boolean autoGeneratedId = true;
+
+
+ Engine.Create index = new Engine.Create(null, analyzer, newUid("1"), doc, Versions.MATCH_ANY, VersionType.INTERNAL, PRIMARY, System.nanoTime(), canHaveDuplicates, autoGeneratedId);
+ engine.create(index);
+ assertThat(index.version(), equalTo(1l));
+
+ index = new Engine.Create(null, analyzer, newUid("1"), doc, Versions.MATCH_ANY, VersionType.INTERNAL, REPLICA, System.nanoTime(), canHaveDuplicates, autoGeneratedId);
+ replicaEngine.create(index);
+ assertThat(index.version(), equalTo(1l));
+
+ canHaveDuplicates = true;
@bleskes
bleskes Jan 29, 2015 Member

same comment here. Randomize?

@bleskes bleskes commented on an outdated diff Jan 29, 2015
...search/index/engine/internal/InternalEngineTests.java
+ assertThat(index.version(), equalTo(1l));
+
+ index = new Engine.Create(null, analyzer, newUid("1"), doc, Versions.MATCH_ANY, VersionType.INTERNAL, REPLICA, System.nanoTime(), canHaveDuplicates, autoGeneratedId);
+ replicaEngine.create(index);
+ assertThat(index.version(), equalTo(1l));
+
+ canHaveDuplicates = true;
+ index = new Engine.Create(null, analyzer, newUid("1"), doc, Versions.MATCH_ANY, VersionType.INTERNAL, PRIMARY, System.nanoTime(), canHaveDuplicates, autoGeneratedId);
+ engine.create(index);
+ assertThat(index.version(), equalTo(1l));
+ engine.refresh("test", true);
+ Engine.Searcher searcher = engine.acquireSearcher("test");
+ TopDocs topDocs = searcher.searcher().search(new MatchAllDocsQuery(), 10);
+ assertThat(topDocs.totalHits, equalTo(1));
+
+ index = new Engine.Create(null, analyzer, newUid("1"), doc, Versions.MATCH_ANY, VersionType.INTERNAL, REPLICA, System.nanoTime(), canHaveDuplicates, autoGeneratedId);
@brwe
Contributor
brwe commented Jan 29, 2015

thanks for the comments, I got it all wrong the first time. please have another look!

@bleskes bleskes commented on the diff Jan 29, 2015
...search/index/engine/internal/InternalEngineTests.java
+
+ index = new Engine.Create(null, analyzer, newUid("1"), doc, index.version(), index.versionType().versionTypeForReplicationAndRecovery(), REPLICA, System.nanoTime(), canHaveDuplicates, autoGeneratedId);
+ replicaEngine.create(index);
+ assertThat(index.version(), equalTo(1l));
+
+ canHaveDuplicates = true;
+ index = new Engine.Create(null, analyzer, newUid("1"), doc, Versions.MATCH_ANY, VersionType.INTERNAL, PRIMARY, System.nanoTime(), canHaveDuplicates, autoGeneratedId);
+ engine.create(index);
+ assertThat(index.version(), equalTo(1l));
+ engine.refresh("test", true);
+ Engine.Searcher searcher = engine.acquireSearcher("test");
+ TopDocs topDocs = searcher.searcher().search(new MatchAllDocsQuery(), 10);
+ assertThat(topDocs.totalHits, equalTo(1));
+
+ index = new Engine.Create(null, analyzer, newUid("1"), doc, index.version(), index.versionType().versionTypeForReplicationAndRecovery(), REPLICA, System.nanoTime(), canHaveDuplicates, autoGeneratedId);
+ try {
@bleskes
bleskes Jan 29, 2015 Member

we can use ElasticsearchAssertions.assertThrows. Slightly cleaner code.

@brwe
brwe Jan 29, 2015 Contributor

assertThrows currently only accepts ActionFuture. we can change that, but I think that should probably a different pr.

@bleskes bleskes commented on the diff Jan 29, 2015
...search/index/engine/internal/InternalEngineTests.java
+ public void testRetryWithAutogeneratedIdsAndWrongOrderWorksAndNoDuplicateDocs() throws IOException {
+
+ ParsedDocument doc = testParsedDocument("1", "1", "test", null, -1, -1, testDocument(), B_1, false);
+ boolean canHaveDuplicates = true;
+ boolean autoGeneratedId = true;
+
+ Engine.Create firstIndexRequest = new Engine.Create(null, analyzer, newUid("1"), doc, Versions.MATCH_ANY, VersionType.INTERNAL, PRIMARY, System.nanoTime(), canHaveDuplicates, autoGeneratedId);
+ engine.create(firstIndexRequest);
+ assertThat(firstIndexRequest.version(), equalTo(1l));
+
+ Engine.Create firstIndexRequestReplica = new Engine.Create(null, analyzer, newUid("1"), doc, firstIndexRequest.version(), firstIndexRequest.versionType().versionTypeForReplicationAndRecovery(), REPLICA, System.nanoTime(), canHaveDuplicates, autoGeneratedId);
+ replicaEngine.create(firstIndexRequestReplica);
+ assertThat(firstIndexRequestReplica.version(), equalTo(1l));
+
+ canHaveDuplicates = false;
+ Engine.Create secondIndexRequest = new Engine.Create(null, analyzer, newUid("1"), doc, Versions.MATCH_ANY, VersionType.INTERNAL, PRIMARY, System.nanoTime(), canHaveDuplicates, autoGeneratedId);
@bleskes
bleskes Jan 29, 2015 Member

same here. ElasticsearchAssertions.assertThrows wil help

@bleskes
Member
bleskes commented Jan 29, 2015

LGTM. Left a minor suggestion.

@s1monw
Contributor
s1monw commented Jan 29, 2015

LGTM too

@brwe brwe changed the title from core: fix duplicate docs with autogenerated ids to core: disable auto gen id optimization Jan 29, 2015
@brwe brwe added a commit that referenced this pull request Jan 29, 2015
@brwe brwe core: disable auto gen id optimization
This pr removes the optimization for auto generated ids.
Previously, when ids were auto generated by elasticsearch then there was no
check to see if a document with same id already existed and instead the new
document was only appended. However, due to lucene improvements this
optimization does not add much value. In addition, under rare circumstances it might
cause duplicate documents:

When an indexing request is retried (due to connect lost, node closed etc),
then a flag 'canHaveDuplicates' is set to true for the indexing request
that is send a second time. This was to make sure that even
when an indexing request for a document with autogenerated id comes in
we do not have to update unless this flag is set and instead only append.

However, it might happen that for a retry or for the replication the
indexing request that has the canHaveDuplicates set to true (the retried request) arrives
at the destination before the original request that does have it set false.
In this case both request add a document and we have a duplicated a document.
This commit adds a workaround: remove the optimization for auto
generated ids and always update the document.
The asumtion is that this will not slow down indexing more than 10 percent,
see: http://benchmarks.elasticsearch.org/

closes #8788
closes #9468
bd8c409
@brwe brwe added a commit that referenced this pull request Jan 29, 2015
@brwe brwe core: disable auto gen id optimization
This pr removes the optimization for auto generated ids.
Previously, when ids were auto generated by elasticsearch then there was no
check to see if a document with same id already existed and instead the new
document was only appended. However, due to lucene improvements this
optimization does not add much value. In addition, under rare circumstances it might
cause duplicate documents:

When an indexing request is retried (due to connect lost, node closed etc),
then a flag 'canHaveDuplicates' is set to true for the indexing request
that is send a second time. This was to make sure that even
when an indexing request for a document with autogenerated id comes in
we do not have to update unless this flag is set and instead only append.

However, it might happen that for a retry or for the replication the
indexing request that has the canHaveDuplicates set to true (the retried request) arrives
at the destination before the original request that does have it set false.
In this case both request add a document and we have a duplicated a document.
This commit adds a workaround: remove the optimization for auto
generated ids and always update the document.
The asumtion is that this will not slow down indexing more than 10 percent,
see: http://benchmarks.elasticsearch.org/

closes #8788
closes #9468
a0ef2e4
@brwe brwe added a commit that closed this pull request Jan 29, 2015
@brwe brwe core: disable auto gen id optimization
This pr removes the optimization for auto generated ids.
Previously, when ids were auto generated by elasticsearch then there was no
check to see if a document with same id already existed and instead the new
document was only appended. However, due to lucene improvements this
optimization does not add much value. In addition, under rare circumstances it might
cause duplicate documents:

When an indexing request is retried (due to connect lost, node closed etc),
then a flag 'canHaveDuplicates' is set to true for the indexing request
that is send a second time. This was to make sure that even
when an indexing request for a document with autogenerated id comes in
we do not have to update unless this flag is set and instead only append.

However, it might happen that for a retry or for the replication the
indexing request that has the canHaveDuplicates set to true (the retried request) arrives
at the destination before the original request that does have it set false.
In this case both request add a document and we have a duplicated a document.
This commit adds a workaround: remove the optimization for auto
generated ids and always update the document.
The asumtion is that this will not slow down indexing more than 10 percent,
see: http://benchmarks.elasticsearch.org/

closes #8788
closes #9468
0a07ce8
@brwe brwe closed this in 0a07ce8 Jan 29, 2015
@brwe brwe added a commit that referenced this pull request Feb 4, 2015
@brwe brwe core: disable auto gen id optimization
This pr removes the optimization for auto generated ids.
Previously, when ids were auto generated by elasticsearch then there was no
check to see if a document with same id already existed and instead the new
document was only appended. However, due to lucene improvements this
optimization does not add much value. In addition, under rare circumstances it might
cause duplicate documents:

When an indexing request is retried (due to connect lost, node closed etc),
then a flag 'canHaveDuplicates' is set to true for the indexing request
that is send a second time. This was to make sure that even
when an indexing request for a document with autogenerated id comes in
we do not have to update unless this flag is set and instead only append.

However, it might happen that for a retry or for the replication the
indexing request that has the canHaveDuplicates set to true (the retried request) arrives
at the destination before the original request that does have it set false.
In this case both request add a document and we have a duplicated a document.
This commit adds a workaround: remove the optimization for auto
generated ids and always update the document.
The asumtion is that this will not slow down indexing more than 10 percent,
see: http://benchmarks.elasticsearch.org/

closes #8788
closes #9468
357ab46
@clintongormley clintongormley changed the title from core: disable auto gen id optimization to Core: disable auto gen id optimization Feb 10, 2015
@clintongormley clintongormley changed the title from Core: disable auto gen id optimization to Core: Disable auto gen id optimization Feb 10, 2015
@clintongormley clintongormley changed the title from Core: Disable auto gen id optimization to Disable auto gen id optimization Jun 7, 2015
@mute mute pushed a commit to mute/elasticsearch that referenced this pull request Jul 29, 2015
@brwe brwe core: disable auto gen id optimization
This pr removes the optimization for auto generated ids.
Previously, when ids were auto generated by elasticsearch then there was no
check to see if a document with same id already existed and instead the new
document was only appended. However, due to lucene improvements this
optimization does not add much value. In addition, under rare circumstances it might
cause duplicate documents:

When an indexing request is retried (due to connect lost, node closed etc),
then a flag 'canHaveDuplicates' is set to true for the indexing request
that is send a second time. This was to make sure that even
when an indexing request for a document with autogenerated id comes in
we do not have to update unless this flag is set and instead only append.

However, it might happen that for a retry or for the replication the
indexing request that has the canHaveDuplicates set to true (the retried request) arrives
at the destination before the original request that does have it set false.
In this case both request add a document and we have a duplicated a document.
This commit adds a workaround: remove the optimization for auto
generated ids and always update the document.
The asumtion is that this will not slow down indexing more than 10 percent,
see: http://benchmarks.elasticsearch.org/

closes #8788
closes #9468
b87b070
@mute mute pushed a commit to mute/elasticsearch that referenced this pull request Jul 29, 2015
@brwe brwe core: disable auto gen id optimization
This pr removes the optimization for auto generated ids.
Previously, when ids were auto generated by elasticsearch then there was no
check to see if a document with same id already existed and instead the new
document was only appended. However, due to lucene improvements this
optimization does not add much value. In addition, under rare circumstances it might
cause duplicate documents:

When an indexing request is retried (due to connect lost, node closed etc),
then a flag 'canHaveDuplicates' is set to true for the indexing request
that is send a second time. This was to make sure that even
when an indexing request for a document with autogenerated id comes in
we do not have to update unless this flag is set and instead only append.

However, it might happen that for a retry or for the replication the
indexing request that has the canHaveDuplicates set to true (the retried request) arrives
at the destination before the original request that does have it set false.
In this case both request add a document and we have a duplicated a document.
This commit adds a workaround: remove the optimization for auto
generated ids and always update the document.
The asumtion is that this will not slow down indexing more than 10 percent,
see: http://benchmarks.elasticsearch.org/

closes #8788
closes #9468
23cd9ec
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment