Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[CI] IndexShardTests.testConcurrentTermIncreaseOnReplicaShard failed on 6.4 #34862

Closed
tlrx opened this issue Oct 25, 2018 · 3 comments
Closed
Assignees
Labels
:Distributed/Recovery Anything around constructing a new shard, either from a local or a remote source. >test-failure Triaged test failures from CI v6.4.4

Comments

@tlrx
Copy link
Member

tlrx commented Oct 25, 2018

The test IndexShardTests.testConcurrentTermIncreaseOnReplicaShard failed today on the 6.4 branch:

https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.4+release-tests/25

with the error:

java.lang.IllegalArgumentException: unexpected mapping update

The test execution log is:
consoleText.txt

I can/can't reproduce the error using:

gradlew :server:test -Dtests.seed=934A6D0349D74193 -Dtests.class=org.elasticsearch.index.shard.IndexShardTests -Dtests.method="testConcurrentTermIncreaseOnReplicaShard" -Dtests.security.manager=true -Dbuild.snapshot=false -Dtests.jvm.argline="-Dbuild.snapshot=false" -Dtests.locale=ar -Dtests.timezone=ROK
@tlrx tlrx added >test-failure Triaged test failures from CI :Distributed/Recovery Anything around constructing a new shard, either from a local or a remote source. v6.4.3 labels Oct 25, 2018
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-distributed

@colings86 colings86 added v6.4.4 and removed v6.4.3 labels Oct 25, 2018
@DaveCTurner
Copy link
Contributor

I looked into this some more. I'm not sure the unexpected mapping update is the issue, because I also found this:

   > Throwable #1: java.lang.Exception: Suite timeout exceeded (>= 1200000 msec).
   > 	at __randomizedtesting.SeedInfo.seed([934A6D0349D74193]:0)Throwable #2: com.carrotsearch.randomizedtesting.UncaughtExceptionError: Captured an uncaught exception in thread: Thread[id=1816, name=elasticsearch[org.elasticsearch.index.shard.IndexShardTests][index][T#1], state=RUNNABLE, group=TGRP-IndexShardTests]
   > Caused by: java.lang.AssertionError: 
   > Expected: <92L>
   >      but: was <93L>
   > 	at __randomizedtesting.SeedInfo.seed([934A6D0349D74193]:0)
   > 	at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:20)
   > 	at org.elasticsearch.index.shard.IndexShardTests$9.onResponse(IndexShardTests.java:1000)
   > 	at org.elasticsearch.index.shard.IndexShardTests$9.onResponse(IndexShardTests.java:996)
   > 	at org.elasticsearch.index.shard.IndexShard$3.onResponse(IndexShard.java:2357)
   > 	at org.elasticsearch.index.shard.IndexShard$3.onResponse(IndexShard.java:2337)
   > 	at org.elasticsearch.action.support.ContextPreservingActionListener.onResponse(ContextPreservingActionListener.java:43)
   > 	at org.elasticsearch.index.shard.IndexShardOperationPermits$PermitAwareThreadedActionListener$1.doRun(IndexShardOperationPermits.java:378)
   > 	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:723)
   > 	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
   > 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
   > 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
   > 	at java.lang.Thread.run(Thread.java:748)

The failing assertion is here:

assertThat(indexShard.getPendingPrimaryTerm(), equalTo(primaryTerm + increment));

This might mean that the race fixed in #32442 is still present, or may be something else? @ywelsch do you have any thoughts?

Also I think we should release that releasable in a finally block rather than keeping hold of it and letting the suite time out when that assertion fails.

@DaveCTurner
Copy link
Contributor

I added some more logging and left this test (in master) running for a while with -Dtests.iters=1000 and it failed eventually in the same way:

  2> WARNING: Uncaught exception in thread: Thread[elasticsearch[org.elasticsearch.index.shard.IndexShardTests][write][T#1],5,TGRP-IndexShardTests]
  2> java.lang.AssertionError:
  2> Expected: <13L>
  2>      but: was <14L>
  2>    at __randomizedtesting.SeedInfo.seed([934A6D0349D74193]:0)
  2>    at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:20)
  2>    at org.junit.Assert.assertThat(Assert.java:956)
  2>    at org.junit.Assert.assertThat(Assert.java:923)
  2>    at org.elasticsearch.index.shard.IndexShardTests$10.onResponse(IndexShardTests.java:1057)
  2>    at org.elasticsearch.index.shard.IndexShardTests$10.onResponse(IndexShardTests.java:1051)
  2>    at org.elasticsearch.index.shard.IndexShard$3.onResponse(IndexShard.java:2387)
  2>    at org.elasticsearch.index.shard.IndexShard$3.onResponse(IndexShard.java:2365)
  2>    at org.elasticsearch.action.support.ContextPreservingActionListener.onResponse(ContextPreservingActionListener.java:43)
  2>    at org.elasticsearch.index.shard.IndexShardOperationPermits$PermitAwareThreadedActionListener$1.doRun(IndexShardOperationPermits.java:374)
  2>    at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:723)
  2>    at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
  2>    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
  2>    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
  2>    at java.base/java.lang.Thread.run(Thread.java:834)

Extra logging:

diff --git a/server/src/test/java/org/elasticsearch/index/shard/IndexShardTests.java b/server/src/test/java/org/elasticsearch/index/shard/IndexShardTests.java
index 487ac7e..3ad2471 100644
--- a/server/src/test/java/org/elasticsearch/index/shard/IndexShardTests.java
+++ b/server/src/test/java/org/elasticsearch/index/shard/IndexShardTests.java
@@ -125,6 +125,7 @@ import org.elasticsearch.test.FieldMaskingReader;
 import org.elasticsearch.test.VersionUtils;
 import org.elasticsearch.threadpool.ThreadPool;
 import org.elasticsearch.ElasticsearchException;
+import org.elasticsearch.test.junit.annotations.TestLogging;

 import java.io.IOException;
 import java.nio.charset.Charset;
@@ -1025,6 +1026,7 @@ public class IndexShardTests extends IndexShardTestCase {
         closeShard(indexShard, false);
     }

+    @TestLogging("org.elasticsearch.index.shard:TRACE,org.elasticsearch.index.seqno:TRACE")
     public void testConcurrentTermIncreaseOnReplicaShard() throws BrokenBarrierException, InterruptedException, IOException {
         final IndexShard indexShard = newStartedShard(false);

@@ -1049,10 +1051,18 @@ public class IndexShardTests extends IndexShardTestCase {
                     new ActionListener<Releasable>() {
                         @Override
                         public void onResponse(Releasable releasable) {
-                            counter.incrementAndGet();
-                            assertThat(indexShard.getPendingPrimaryTerm(), equalTo(primaryTerm + increment));
-                            latch.countDown();
-                            releasable.close();
+                            logger.info("obtained replica operation permit for increment={}", increment);
+                            try {
+                              counter.incrementAndGet();
+                              assertThat(indexShard.getPendingPrimaryTerm(), equalTo(primaryTerm + increment));
+                              logger.info("passed term check for increment={}", increment);
+                            } finally {
+                              logger.info("counting down for increment={}", increment);
+                              latch.countDown();
+                              logger.info("releasing replica operation permit for increment={}", increment);
+                              releasable.close();
+                              logger.info("released replica operation permit for increment={}", increment);
+                            }
                         }

                         @Override
@@ -1062,6 +1072,7 @@ public class IndexShardTests extends IndexShardTestCase {
                         }
                     },
                     ThreadPool.Names.WRITE, "");
+            logger.info("launched wait for increment={}", increment);
         };

         final long firstIncrement = 1 + (randomBoolean() ? 0 : 1);
@@ -1069,6 +1080,8 @@ public class IndexShardTests extends IndexShardTestCase {
         final Thread first = new Thread(function.apply(firstIncrement));
         final Thread second = new Thread(function.apply(secondIncrement));

+        logger.info("firstIncrement={}, secondIncrement={}, primaryTerm={}", firstIncrement, secondIncrement, primaryTerm);
+
         first.start();
         second.start();

The log from this run is:

  1> [2018-10-26T20:02:36,879][INFO ][o.e.i.s.IndexShardTests  ] [testConcurrentTermIncreaseOnReplicaShard] firstIncrement=2, secondIncrement=1, primaryTerm=12
  1> [2018-10-26T20:02:36,879][INFO ][o.e.i.s.IndexShardTests  ] [[Thread-214]] launched wait for increment=1
  1> [2018-10-26T20:02:36,879][INFO ][o.e.i.s.IndexShard       ] [org.elasticsearch.index.shard.IndexShardTests] [index][0] detected new primary with primary term [13], global checkpoint [-1], max_seq_no [-1]
  1> [2018-10-26T20:02:36,880][INFO ][o.e.i.s.IndexShardTests  ] [org.elasticsearch.index.shard.IndexShardTests] obtained replica operation permit for increment=1
  1> [2018-10-26T20:02:36,881][INFO ][o.e.i.s.IndexShardTests  ] [[Thread-213]] launched wait for increment=2
  1> [2018-10-26T20:02:36,883][INFO ][o.e.i.s.IndexShardTests  ] [org.elasticsearch.index.shard.IndexShardTests] counting down for increment=1
  1> [2018-10-26T20:02:36,883][INFO ][o.e.i.s.IndexShardTests  ] [org.elasticsearch.index.shard.IndexShardTests] releasing replica operation permit for increment=1
  1> [2018-10-26T20:02:36,883][INFO ][o.e.i.s.IndexShardTests  ] [org.elasticsearch.index.shard.IndexShardTests] released replica operation permit for increment=1
  1> [2018-10-26T20:02:36,883][INFO ][o.e.i.s.IndexShard       ] [org.elasticsearch.index.shard.IndexShardTests] [index][0] detected new primary with primary term [14], global checkpoint [-1], max_seq_no [-1]
  1> [2018-10-26T20:02:36,884][INFO ][o.e.i.s.IndexShardTests  ] [org.elasticsearch.index.shard.IndexShardTests] obtained replica operation permit for increment=2
  1> [2018-10-26T20:02:36,884][INFO ][o.e.i.s.IndexShardTests  ] [org.elasticsearch.index.shard.IndexShardTests] passed term check for increment=2
  1> [2018-10-26T20:02:36,884][INFO ][o.e.i.s.IndexShardTests  ] [org.elasticsearch.index.shard.IndexShardTests] counting down for increment=2
  1> [2018-10-26T20:02:36,884][INFO ][o.e.i.s.IndexShardTests  ] [org.elasticsearch.index.shard.IndexShardTests] releasing replica operation permit for increment=2
  1> [2018-10-26T20:02:36,884][INFO ][o.e.i.s.IndexShardTests  ] [org.elasticsearch.index.shard.IndexShardTests] released replica operation permit for increment=2

NB the first check is the one that fails (lack of passed term check log message), apparently because the second term bump got in there first. Looks like a race, I've no idea if it's benign or not.

ywelsch added a commit that referenced this issue Nov 5, 2018
This test has a bug that got introduced during the refactoring of #32442. With 2 concurrent term increments,
we can only assert under the operation permit that we are in the correct operation term, not that there is
not already another term bump pending.

Closes #34862
ywelsch added a commit that referenced this issue Nov 5, 2018
This test has a bug that got introduced during the refactoring of #32442. With 2 concurrent term increments,
we can only assert under the operation permit that we are in the correct operation term, not that there is
not already another term bump pending.

Closes #34862
@ywelsch ywelsch closed this as completed in 4f35eea Nov 5, 2018
ywelsch added a commit that referenced this issue Nov 5, 2018
This test has a bug that got introduced during the refactoring of #32442. With 2 concurrent term increments,
we can only assert under the operation permit that we are in the correct operation term, not that there is
not already another term bump pending.

Closes #34862
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Distributed/Recovery Anything around constructing a new shard, either from a local or a remote source. >test-failure Triaged test failures from CI v6.4.4
Projects
None yet
Development

No branches or pull requests

5 participants