Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[CI] TimeSeriesLifecycleActionsIT testHistoryIsWrittenWithFailure #50353

Closed
davidkyle opened this issue Dec 19, 2019 · 24 comments · Fixed by #50403 or #53039
Closed

[CI] TimeSeriesLifecycleActionsIT testHistoryIsWrittenWithFailure #50353

davidkyle opened this issue Dec 19, 2019 · 24 comments · Fixed by #50403 or #53039
Assignees
Labels
:Data Management/ILM+SLM Index and Snapshot lifecycle management Team:Data Management Meta label for data/management team >test-failure Triaged test failures from CI v8.0.0-alpha1

Comments

@davidkyle
Copy link
Member

A number of failures on the master branch in the last day:

https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+multijob-windows-compatibility/os=windows-2012-r2/331/console
https://gradle-enterprise.elastic.co/s/clxgk64w274hw

https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+multijob-unix-compatibility/os=ubuntu-18.04&&immutable/460/console
https://gradle-enterprise.elastic.co/s/ptsngwrgxzn64

https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+multijob-unix-compatibility/os=centos-6&&immutable/460/console
https://gradle-enterprise.elastic.co/s/jdmkulxql5ris

java.lang.AssertionError: 

Expected: a value equal to or greater than <1>
     but: <0> was less than <1>

at __randomizedtesting.SeedInfo.seed([9BF20B84CA3FCF0E:49B89F3F80A0B0A0]:0)
at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:18)
at org.junit.Assert.assertThat(Assert.java:956)
at org.junit.Assert.assertThat(Assert.java:923)
at org.elasticsearch.xpack.ilm.TimeSeriesLifecycleActionsIT.assertHistoryIsPresent(TimeSeriesLifecycleActionsIT.java:1157)
at org.elasticsearch.xpack.ilm.TimeSeriesLifecycleActionsIT.lambda$testHistoryIsWrittenWithDeletion$72(TimeSeriesLifecycleActionsIT.java:1103)
at org.elasticsearch.test.ESTestCase.assertBusy(ESTestCase.java:866)
at org.elasticsearch.xpack.ilm.TimeSeriesLifecycleActionsIT.testHistoryIsWrittenWithDeletion(TimeSeriesLifecycleActionsIT.java:1102)

Does not reproduce:

gradlew ':x-pack:plugin:ilm:qa:multi-node:integTestRunner' --tests "org.elasticsearch.xpack.ilm.TimeSeriesLifecycleActionsIT.testHistoryIsWrittenWithSuccess" \
  -Dtests.seed=DA4311B28F41839E \
  -Dtests.security.manager=true \
  -Dtests.locale=id-ID \
  -Dtests.timezone=SystemV/MST7MDT \
  -Dcompiler.java=13
@davidkyle davidkyle added >test-failure Triaged test failures from CI :Data Management/ILM+SLM Index and Snapshot lifecycle management v8.0.0 labels Dec 19, 2019
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-core-features (:Core/Features/ILM+SLM)

@davidkyle
Copy link
Member Author

Muted in c1580dc

@dakrone dakrone self-assigned this Dec 19, 2019
dakrone added a commit to dakrone/elasticsearch that referenced this issue Dec 19, 2019
This moves the `putAsync` method in `ILMHistoryStore` never to block.
Previously due to the way that the `BulkProcessor` works, it was possible
for `BulkProcessor#add` to block executing a bulk request. This was bad
as we may be adding things to the history store in cluster state update
threads.

This also moves the index creation to be done prior to the bulk request
execution, rather than being checked every time an operation was added
to the queue. This lessens the chance of the index being created, then
deleted (by some external force), and then recreated via a bulk indexing
request.

Resolves elastic#50353
dnhatn added a commit that referenced this issue Dec 20, 2019
@dnhatn
Copy link
Member

dnhatn commented Dec 20, 2019

I also muted testHistoryIsWrittenWithSuccess in ceec4ea.

dnhatn added a commit that referenced this issue Dec 20, 2019
dakrone added a commit that referenced this issue Dec 20, 2019
* Make ILMHistoryStore.putAsync truly async

This moves the `putAsync` method in `ILMHistoryStore` never to block.
Previously due to the way that the `BulkProcessor` works, it was possible
for `BulkProcessor#add` to block executing a bulk request. This was bad
as we may be adding things to the history store in cluster state update
threads.

This also moves the index creation to be done prior to the bulk request
execution, rather than being checked every time an operation was added
to the queue. This lessens the chance of the index being created, then
deleted (by some external force), and then recreated via a bulk indexing
request.

Resolves #50353
dakrone added a commit to dakrone/elasticsearch that referenced this issue Dec 20, 2019
* Make ILMHistoryStore.putAsync truly async

This moves the `putAsync` method in `ILMHistoryStore` never to block.
Previously due to the way that the `BulkProcessor` works, it was possible
for `BulkProcessor#add` to block executing a bulk request. This was bad
as we may be adding things to the history store in cluster state update
threads.

This also moves the index creation to be done prior to the bulk request
execution, rather than being checked every time an operation was added
to the queue. This lessens the chance of the index being created, then
deleted (by some external force), and then recreated via a bulk indexing
request.

Resolves elastic#50353
dakrone added a commit that referenced this issue Dec 20, 2019
* Add ILM histore store index (#50287)

* Add ILM histore store index

This commit adds an ILM history store that tracks the lifecycle
execution state as an index progresses through its ILM policy. ILM
history documents store output similar to what the ILM explain API
returns.

An example document with ALL fields (not all documents will have all
fields) would look like:

```json
{
  "@timestamp": 1203012389,
  "policy": "my-ilm-policy",
  "index": "index-2019.1.1-000023",
  "index_age":123120,
  "success": true,
  "state": {
    "phase": "warm",
    "action": "allocate",
    "step": "ERROR",
    "failed_step": "update-settings",
    "is_auto-retryable_error": true,
    "creation_date": 12389012039,
    "phase_time": 12908389120,
    "action_time": 1283901209,
    "step_time": 123904107140,
    "phase_definition": "{\"policy\":\"ilm-history-ilm-policy\",\"phase_definition\":{\"min_age\":\"0ms\",\"actions\":{\"rollover\":{\"max_size\":\"50gb\",\"max_age\":\"30d\"}}},\"version\":1,\"modified_date_in_millis\":1576517253463}",
    "step_info": "{... etc step info here as json ...}"
  },
  "error_details": "java.lang.RuntimeException: etc\n\tcaused by:etc etc etc full stacktrace"
}
```

These documents go into the `ilm-history-1-00000N` index to provide an
audit trail of the operations ILM has performed.

This history storage is enabled by default but can be disabled by setting
`index.lifecycle.history_index_enabled` to `false.`

Resolves #49180

* Make ILMHistoryStore.putAsync truly async (#50403)

This moves the `putAsync` method in `ILMHistoryStore` never to block.
Previously due to the way that the `BulkProcessor` works, it was possible
for `BulkProcessor#add` to block executing a bulk request. This was bad
as we may be adding things to the history store in cluster state update
threads.

This also moves the index creation to be done prior to the bulk request
execution, rather than being checked every time an operation was added
to the queue. This lessens the chance of the index being created, then
deleted (by some external force), and then recreated via a bulk indexing
request.

Resolves #50353
@andreidan
Copy link
Contributor

andreidan commented Dec 27, 2019

This happened again on 7.x - https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+7.x+multijob-unix-compatibility/os=centos-6&&immutable/471/consoleFull.

There's a build scan available https://gradle-enterprise.elastic.co/s/u7p4epuphegx6

I was, however, unable to reproduce this locally in ~300 runs

@andreidan andreidan reopened this Dec 27, 2019
@albertzaharovits
Copy link
Contributor

Another failure in a PR build check against master-ish : https://gradle-enterprise.elastic.co/s/aryfl6vkivmcm

@cbuescher
Copy link
Member

Just got failures in TimeSeriesLifecycleActionsIT.testHistoryIsWrittenWithFailure today that also look similar, so I'm adding these here as well:

Log https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+multijob-unix-compatibility/os=sles-15&&immutable/490/console
Build Scans https://gradle-enterprise.elastic.co/s/7rsdhwtqxkpka
java.lang.AssertionError: 
Expected: a value equal to or greater than <1>
     but: <0> was less than <1>
	at __randomizedtesting.SeedInfo.seed([308B5D256356833E:B0E5874928ACC0FA]:0)
	at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:18)
	at org.junit.Assert.assertThat(Assert.java:956)
	at org.junit.Assert.assertThat(Assert.java:923)
	at org.elasticsearch.xpack.ilm.TimeSeriesLifecycleActionsIT.assertHistoryIsPresent(TimeSeriesLifecycleActionsIT.java:1158)
	at org.elasticsearch.xpack.ilm.TimeSeriesLifecycleActionsIT.assertHistoryIsPresent(TimeSeriesLifecycleActionsIT.java:1111)
	at org.elasticsearch.xpack.ilm.TimeSeriesLifecycleActionsIT.lambda$testHistoryIsWrittenWithFailure$70(TimeSeriesLifecycleActionsIT.java:1074)
	at org.elasticsearch.test.ESTestCase.assertBusy(ESTestCase.java:879)
	at org.elasticsearch.xpack.ilm.TimeSeriesLifecycleActionsIT.testHistoryIsWrittenWithFailure(TimeSeriesLifecycleActionsIT.java:1074)

Could not immediately reproduce with:

./gradlew ':x-pack:plugin:ilm:qa:multi-node:integTestRunner' --tests "org.elasticsearch.xpack.ilm.TimeSeriesLifecycleActionsIT.testHistoryIsWrittenWithFailure" \
  -Dtests.seed=308B5D256356833E \
  -Dtests.security.manager=true \
  -Dtests.locale=ar-OM \
  -Dtests.timezone=US/Arizona \
  -Dcompiler.java=13

@cbuescher
Copy link
Member

Another on master (TimeSeriesLifecycleActionsIT.testHistoryIsWrittenWithSuccess). Maybe muting makes sense again. Will wait a bit longer but this seems to get frequent.

Log https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+multijob+fast+part2/2865/console
Build Scans https://gradle-enterprise.elastic.co/s/fsh24b2yzlr3e

cbuescher pushed a commit that referenced this issue Jan 3, 2020
Also muting TimeSeriesLifecycleActionsIT.testHistoryIsWrittenWithFailure.

Tracked in #50353
cbuescher pushed a commit that referenced this issue Jan 3, 2020
Also muting TimeSeriesLifecycleActionsIT.testHistoryIsWrittenWithFailure.

Tracked in #50353
@cbuescher
Copy link
Member

I muted TimeSeriesLifecycleActionsIT.testHistoryIsWrittenWithSuccess and TimeSeriesLifecycleActionsIT.testHistoryIsWrittenWithFailure with
a5ab7da on master and 6c8868e on 7.x

dakrone added a commit to dakrone/elasticsearch that referenced this issue Jan 3, 2020
These tests use the same index name, making it hard to read logs when
diagnosing the failures. Additionally more information about the current
state of the index could be retrieved when failing.

This changes these two things in the hope of capturing more data about
why this fails on some CI nodes but not others.

Relates to elastic#50353
dakrone added a commit that referenced this issue Jan 6, 2020
These tests use the same index name, making it hard to read logs when
diagnosing the failures. Additionally more information about the current
state of the index could be retrieved when failing.

This changes these two things in the hope of capturing more data about
why this fails on some CI nodes but not others.

Relates to #50353

Co-authored-by: Elastic Machine <elasticmachine@users.noreply.github.com>
dakrone added a commit to dakrone/elasticsearch that referenced this issue Jan 6, 2020
These tests use the same index name, making it hard to read logs when
diagnosing the failures. Additionally more information about the current
state of the index could be retrieved when failing.

This changes these two things in the hope of capturing more data about
why this fails on some CI nodes but not others.

Relates to elastic#50353

Co-authored-by: Elastic Machine <elasticmachine@users.noreply.github.com>
SivagurunathanV pushed a commit to SivagurunathanV/elasticsearch that referenced this issue Jan 23, 2020
Also muting TimeSeriesLifecycleActionsIT.testHistoryIsWrittenWithFailure.

Tracked in elastic#50353
SivagurunathanV pushed a commit to SivagurunathanV/elasticsearch that referenced this issue Jan 23, 2020
These tests use the same index name, making it hard to read logs when
diagnosing the failures. Additionally more information about the current
state of the index could be retrieved when failing.

This changes these two things in the hope of capturing more data about
why this fails on some CI nodes but not others.

Relates to elastic#50353

Co-authored-by: Elastic Machine <elasticmachine@users.noreply.github.com>
SivagurunathanV pushed a commit to SivagurunathanV/elasticsearch that referenced this issue Jan 23, 2020
dakrone added a commit to dakrone/elasticsearch that referenced this issue Feb 19, 2020
This commit adds additional logging as well as applies some test fixes
in an attempt to address elastic#50353.
dakrone added a commit that referenced this issue Feb 20, 2020
This commit adds additional logging as well as applies some test fixes
in an attempt to address #50353.
@dakrone
Copy link
Member

dakrone commented Feb 20, 2020

I pushed yet more logging and un-muted a single test on master only, so if it fails again please do update this issue and I'll re-mute.

@mark-vieira
Copy link
Contributor

@mark-vieira
Copy link
Contributor

mark-vieira commented Feb 26, 2020

Given we've gotten a few more failures since adding the additional logging I've remuted this test with bd6d6f7.

I've also noticed this test failing recently with a different assertion error. Since it's not clear if these might be related I've opened a separate issue(#52853) to make discoverability easier for folks.

@dakrone
Copy link
Member

dakrone commented Feb 26, 2020

@mark-vieira thanks, these do have more information, thanks for re-muting for me.

probakowski added a commit to probakowski/elasticsearch that referenced this issue Mar 2, 2020
This change modifies ILMHistoryStore to always apply correct settings and mappings,
even if template is deleted and not yet recreated. This ensures that ILM history index
is correctly managed by ILM and also fixes flaky history tests that were prone to
triggenring this race.

This commit also refactors and simplifies ILM history tests.

Closes elastic#50353 and elastic#52853
probakowski added a commit that referenced this issue Mar 4, 2020
* Avoid race condition in ILMHistorySotre

This change modifies ILMHistoryStore to always apply correct settings and mappings,
even if template is deleted and not yet recreated. This ensures that ILM history index
is correctly managed by ILM and also fixes flaky history tests that were prone to
triggenring this race.

This commit also refactors and simplifies ILM history tests.

Closes #50353 and #52853

* Review comment

Co-authored-by: Elastic Machine <elasticmachine@users.noreply.github.com>
probakowski added a commit to probakowski/elasticsearch that referenced this issue Mar 4, 2020
* Avoid race condition in ILMHistorySotre

This change modifies ILMHistoryStore to always apply correct settings and mappings,
even if template is deleted and not yet recreated. This ensures that ILM history index
is correctly managed by ILM and also fixes flaky history tests that were prone to
triggenring this race.

This commit also refactors and simplifies ILM history tests.

Closes elastic#50353 and elastic#52853

* Review comment

Co-authored-by: Elastic Machine <elasticmachine@users.noreply.github.com>
probakowski added a commit that referenced this issue Mar 9, 2020
* Avoid race condition in ILMHistorySotre (#53039)

* Avoid race condition in ILMHistorySotre

This change modifies ILMHistoryStore to always apply correct settings and mappings,
even if template is deleted and not yet recreated. This ensures that ILM history index
is correctly managed by ILM and also fixes flaky history tests that were prone to
triggenring this race.

This commit also refactors and simplifies ILM history tests.

Closes #50353 and #52853

* Review comment

Co-authored-by: Elastic Machine <elasticmachine@users.noreply.github.com>

* fixed tests

* backport #53306

Co-authored-by: Elastic Machine <elasticmachine@users.noreply.github.com>
@williamrandolph
Copy link
Contributor

@mark-vieira
Copy link
Contributor

This is continuing to fail with some regularity (a dozen times in the past week on master) so I've muted this again with 73fe156.

@mayya-sharipova
Copy link
Contributor

Another failure on 7.x. I keep it unmuted for now, as it a only a first failure from Jan 2nd.

Log: https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+7.x+matrix-java-periodic/ES_RUNTIME_JAVA=java11,nodes=general-purpose/564/console

Build Scans: https://gradle-enterprise.elastic.co/s/ehgyqqem7x2r4

Failure doesn't reproduce for me on 7.x

REPRODUCE WITH: ./gradlew ':x-pack:plugin:ilm:qa:multi-node:integTestRunner' --tests "org.elasticsearch.xpack.ilm.TimeSeriesLifecycleActionsIT.testHistoryIsWrittenWithFailure"
-Dtests.seed=4CF58E4CF1F92544
-Dtests.security.manager=true
-Dtests.locale=ar-EG
-Dtests.timezone=Europe/Helsinki
-Dcompiler.java=13

Expected: a value equal to or greater than <1>
     but: <0> was less than <1>
at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:18)
at org.junit.Assert.assertThat(Assert.java:956)
at org.junit.Assert.assertThat(Assert.java:923)
at org.elasticsearch.xpack.ilm.TimeSeriesLifecycleActionsIT.assertHistoryIsPresent(TimeSeriesLifecycleActionsIT.java:1635)
at org.elasticsearch.xpack.ilm.TimeSeriesLifecycleActionsIT.assertHistoryIsPresent(TimeSeriesLifecycleActionsIT.java:1558)
at org.elasticsearch.xpack.ilm.TimeSeriesLifecycleActionsIT.lambda$testHistoryIsWrittenWithFailure$107(TimeSeriesLifecycleActionsIT.java:1420)

@davidkyle
Copy link
Member Author

Numerous failures on 7.x I muted the test again - a38e5ca

@rjernst rjernst added the Team:Data Management Meta label for data/management team label May 4, 2020
@andreidan
Copy link
Contributor

This test has been un-muted (and fixed) in #64521

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Data Management/ILM+SLM Index and Snapshot lifecycle management Team:Data Management Meta label for data/management team >test-failure Triaged test failures from CI v8.0.0-alpha1
Projects
None yet