
SmokeTestMultiNodeClientYamlTestSuiteIT/indices.stats/20_translog failed retaining too much translog #46425

Closed
DaveCTurner opened this issue Sep 6, 2019 · 1 comment · Fixed by #46476
Assignees
Labels
:Distributed/Engine Anything around managing Lucene and the Translog in an open shard. >test-failure Triaged test failures from CI v8.0.0-alpha1

Comments


DaveCTurner commented Sep 6, 2019

On repeated runs of org.elasticsearch.smoketest.SmokeTestMultiNodeClientYamlTestSuiteIT I hit the following failure:

org.elasticsearch.smoketest.SmokeTestMultiNodeClientYamlTestSuiteIT > test {yaml=indices.stats/20_translog/Translog retention without soft_deletes} FAILED
    java.lang.AssertionError: Failure at [indices.stats/20_translog:62]: field [indices.test.primaries.translog.size_in_bytes] is not less than or equal to [$creation_size]
    Expected: a value less than or equal to <110>
         but: <285> was greater than <110>

        Caused by:
        java.lang.AssertionError: field [indices.test.primaries.translog.size_in_bytes] is not less than or equal to [$creation_size]
        Expected: a value less than or equal to <110>
             but: <285> was greater than <110>
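For context, the failure at indices.stats/20_translog:62 is an `lte` assertion comparing the reported translog size against the stashed `$creation_size`. A minimal sketch of such a step (the exact test body is an assumption reconstructed from the error message):

```yaml
# Sketch of the kind of assertion that fails above; the actual body of
# indices.stats/20_translog is assumed, not copied from the suite.
- do:
    indices.stats:
      metric: [ translog ]
- lte: { indices.test.primaries.translog.size_in_bytes: $creation_size }
```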

The REPRODUCE WITH line said:

REPRODUCE WITH: ./gradlew ':qa:smoke-test-multinode:integTestRunner' --tests "org.elasticsearch.smoketest.SmokeTestMultiNodeClientYamlTestSuiteIT" -Dtests.method="test {yaml=indices.stats/20_translog/Translog retention without soft_deletes}" -Dtests.seed=D3278D281C0378A6 -Dtests.security.manager=true -Dtests.jvms=4 -Dtests.locale=es-IC -Dtests.timezone=Kwajalein -Dcompiler.java=12 -Druntime.java=12

However, this did not reproduce for me in ~30 retries.

I see a small number of similar failures from CI too, for instance https://gradle-enterprise.elastic.co/s/wi2oqndt4254w/console-log?task=:qa:smoke-test-multinode:integTestRunner.

There's not much information in the logs either:

  1> [2019-09-06T16:30:09,601][INFO ][o.e.s.SmokeTestMultiNodeClientYamlTestSuiteIT] [test] [yaml=indices.stats/20_translog/Translog retention without soft_deletes] before test
  1> [2019-09-06T16:30:10,653][INFO ][o.e.s.SmokeTestMultiNodeClientYamlTestSuiteIT] [test] Stash dump on test failure [{
  1>   "stash" : {
  1>     "body" : {
  1>       "_shards" : {
  1>         "total" : 2,
  1>         "successful" : 2,
  1>         "failed" : 0
  1>       },
  1>       "_all" : {
  1>         "primaries" : {
  1>           "translog" : {
  1>             "operations" : 1,
  1>             "size_in_bytes" : 285,
  1>             "uncommitted_operations" : 0,
  1>             "uncommitted_size_in_bytes" : 55,
  1>             "earliest_last_modified_age" : 0
  1>           }
  1>         },
  1>         "total" : {
  1>           "translog" : {
  1>             "operations" : 1,
  1>             "size_in_bytes" : 285,
  1>             "uncommitted_operations" : 0,
  1>             "uncommitted_size_in_bytes" : 55,
  1>             "earliest_last_modified_age" : 0
  1>           }
  1>         }
  1>       },
  1>       "indices" : {
  1>         "test" : {
  1>           "uuid" : "ZymXAFZgRFaFdk3a1DetOg",
  1>           "primaries" : {
  1>             "translog" : {
  1>               "operations" : 1,
  1>               "size_in_bytes" : 285,
  1>               "uncommitted_operations" : 0,
  1>               "uncommitted_size_in_bytes" : 55,
  1>               "earliest_last_modified_age" : 0
  1>             }
  1>           },
  1>           "total" : {
  1>             "translog" : {
  1>               "operations" : 1,
  1>               "size_in_bytes" : 285,
  1>               "uncommitted_operations" : 0,
  1>               "uncommitted_size_in_bytes" : 55,
  1>               "earliest_last_modified_age" : 0
  1>             }
  1>           }
  1>         }
  1>       }
  1>     },
  1>     "creation_size" : 110
  1>   }
  1> }]
  1> [2019-09-06T16:30:11,095][INFO ][o.e.s.SmokeTestMultiNodeClientYamlTestSuiteIT] [test] [yaml=indices.stats/20_translog/Translog retention without soft_deletes] after test

More logs attached here: failure-1567755613.tar.gz

@DaveCTurner DaveCTurner added >test-failure Triaged test failures from CI :Distributed/Engine Anything around managing Lucene and the Translog in an open shard. v8.0.0 labels Sep 6, 2019
@elasticmachine
Collaborator

Pinging @elastic/es-distributed

@dnhatn dnhatn self-assigned this Sep 7, 2019
dnhatn added a commit that referenced this issue Sep 9, 2019
We leave replicas unassigned until we reroute after the primary shard
starts. If a cluster health request with wait_for_no_initializing_shards
executes before that reroute, it returns immediately even though some
replicas are still initializing. Peer recoveries of those shards can
prevent the translog on the primary from being trimmed.

We add wait_for_events to the cluster health request so that it
executes after the reroute.

Closes #46425
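The fix described in the commit message can be sketched as adding `wait_for_events` to the test's cluster health step. The surrounding step body is an assumption, but `wait_for_events` is a real cluster health parameter, and the value `languid` waits until all pending cluster state tasks (including the post-primary-start reroute) have been processed:

```yaml
# Sketch of the adjusted cluster health step (step body is assumed;
# wait_for_events: languid waits out the pending reroute task before
# the request returns).
- do:
    cluster.health:
      wait_for_no_initializing_shards: true
      wait_for_events: languid
```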
dnhatn added a commit that referenced this issue Sep 10, 2019 (same commit message as above)
dnhatn added a commit that referenced this issue Sep 11, 2019 (same commit message as above)
turackangal pushed a commit to turackangal/elasticsearch that referenced this issue Sep 14, 2019 (same commit message as above)