[CI] Failures on 6.x for 60_ml_config_migration/Test old cluster jobs and datafeeds and delete them #36810

Closed
tvernum opened this issue Dec 19, 2018 · 14 comments
Labels: :ml Machine learning, >test-failure Triaged test failures from CI

Comments

tvernum (Contributor) commented Dec 19, 2018

This has failed at least twice since #36698 was merged to 6.x:

https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.x+intake/693/console
https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.x+matrix-java-periodic/ES_BUILD_JAVA=java11,ES_RUNTIME_JAVA=java11,nodes=virtual&&linux/127/console

./gradlew :x-pack:qa:rolling-upgrade:with-system-key:v5.6.15#upgradedClusterTestRunner -Dtests.seed=7721FEBE7CCB10B7 -Dtests.class=org.elasticsearch.upgrades.UpgradeClusterClientYamlTestSuiteIT -Dtests.method="test {p0=upgraded_cluster/60_ml_config_migration/Test old cluster jobs and datafeeds and delete them}" -Dtests.security.manager=true -Dtests.locale=ar-AE -Dtests.timezone=Etc/GMT-10 -Dcompiler.java=11 -Druntime.java=8 -Dtests.rest.suite=upgraded_cluster -Dtests.rest.blacklist="/30_ml_jobs_crud/Test model memory limit is updated"
00:50:45    > Throwable #1: java.lang.AssertionError: Failure at [upgraded_cluster/60_ml_config_migration:43]: datafeeds.0.state didn't match expected value:
00:50:45    >              datafeeds.0.state: expected [started] but was [stopped]
00:50:45    > 	at __randomizedtesting.SeedInfo.seed([7721FEBE7CCB10B7:FF75C164D2377D4F]:0)
00:50:45    > 	at org.elasticsearch.test.rest.yaml.ESClientYamlSuiteTestCase.executeSection(ESClientYamlSuiteTestCase.java:406)
00:50:45    > 	at org.elasticsearch.test.rest.yaml.ESClientYamlSuiteTestCase.test(ESClientYamlSuiteTestCase.java:383)
00:50:45    > 	at java.lang.Thread.run(Thread.java:748)
00:50:45    > Caused by: java.lang.AssertionError: datafeeds.0.state didn't match expected value:
00:50:45    >              datafeeds.0.state: expected [started] but was [stopped]
00:50:45    > 	at org.elasticsearch.test.rest.yaml.section.MatchAssertion.doAssert(MatchAssertion.java:87)
00:50:45    > 	at org.elasticsearch.test.rest.yaml.section.Assertion.execute(Assertion.java:76)
00:50:45    > 	at org.elasticsearch.test.rest.yaml.ESClientYamlSuiteTestCase.executeSection(ESClientYamlSuiteTestCase.java:399)
00:50:45    > 	... 38 more

I get failures locally as well, but with a different assertion failure.

I'm going to mute this on 6.x.

@tvernum tvernum added >test-failure Triaged test failures from CI :ml Machine learning labels Dec 19, 2018
elasticmachine (Collaborator) commented:

Pinging @elastic/ml-core

tvernum (Contributor, Author) commented Dec 19, 2018

@davidkyle I assigned this to you since it seems to have come from #36698.
Please reassign if I got that wrong.

tvernum added a commit that referenced this issue Dec 19, 2018
AwaitsFix: #36810

This test fails on the upgraded cluster in 6.x (details in #36810)
droberts195 (Contributor) commented:
I think this is a case of an old bug that has been made far more likely to happen by the timing change of having to search for the job configuration rather than getting it from the cluster state.

This goes back to the age-old inconsistency that when jobs open for the first time they are in the opening state until they are ready to accept data, but when relocating to a new node they stay in the opened state even though they cannot accept data until the corresponding C++ process is running on the new node.

We have a check that datafeed persistent tasks cannot be assigned to a node until the associated job is assigned to a node. We also have a check that an assigned datafeed cannot be run until the associated job is in the opened state. If a datafeed's associated job is in the opening state, the datafeed waits for a subsequent cluster state update where the job state is opened; however, this check assumes that relocating jobs have a state of opening, which they do not.

Two possible solutions are:

  1. Set the job state to opening when it is opened but the assignment is stale
  2. Change the test in DatafeedManager to check whether a C++ process is running for the associated job, and not rely on an opened state implying this

droberts195 (Contributor) commented:
The piece of code I referred to in DatafeedManager is:

if (jobState == JobState.OPENED) {
    // assumes an autodetect process exists for the job - not true right after relocation
    runTask(datafeedTask);
} else if (jobState == JobState.OPENING) {
    // wait for a later cluster state update where the job reaches OPENED
    remainingTasks.add(datafeedTask);
}

droberts195 (Contributor) commented:
Although option 1 (setting the job state to opening when it is opened but the assignment is stale) seems logically nicer, it cannot be achieved without changing the persistent tasks framework: there is no way to change a task's status on reassignment. Task status can only be set by a separate action, and using that mechanism would leave a period during which the same problem that exists now would still exist (just much harder to debug, since it would exist for a shorter and less predictable amount of time).

@droberts195 droberts195 assigned droberts195 and unassigned davidkyle Jan 8, 2019
droberts195 added a commit to droberts195/elasticsearch that referenced this issue Jan 8, 2019
We already had logic to stop datafeeds running against
jobs that were OPENING, but a job that relocates from
one node to another while OPENED stays OPENED, and this
could cause the datafeed to fail when it sent data to
the OPENED job on its new node before it had a
corresponding autodetect process.

This change extends the check to stop datafeeds running
when their job is OPENING _or_ stale (i.e. has not had
its status reset since relocating to a different node).

Relates elastic#36810
droberts195 (Contributor) commented:
Although option 1 (setting the job state to opening when it is opened but the assignment is stale) seems logically nicer, there is no way to achieve this without making a change to the persistent tasks framework.

This problem has been dealt with elsewhere via the concept of a "stale" task status: the status of a job task stores an allocation ID as well as the job state itself, and if that allocation ID differs from the task's current allocation ID then we know the task has relocated and can interpret the reported state differently. I extended the use of this tactic to the datafeed manager in #37227.
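
A minimal sketch of that idea, using simplified, illustrative names (the JobState enum, JobTaskState class and canRunDatafeed helper below are placeholders rather than the exact production code):

```java
// Sketch only: names are illustrative, not the exact Elasticsearch classes.
enum JobState { OPENING, OPENED, CLOSING, CLOSED, FAILED }

final class JobTaskState {
    private final JobState state;    // the last state the job reported
    private final long allocationId; // the persistent task's allocation ID when the state was written

    JobTaskState(JobState state, long allocationId) {
        this.state = state;
        this.allocationId = allocationId;
    }

    JobState getState() {
        return state;
    }

    // The recorded state is "stale" if the persistent task has been reassigned since the
    // state was written, i.e. the stored allocation ID no longer matches the current one.
    boolean isStatusStale(long currentAllocationId) {
        return allocationId != currentAllocationId;
    }
}

final class DatafeedJobStateCheck {
    // In the spirit of #37227: only run the datafeed when the job reports OPENED *and*
    // that report is not stale; a stale OPENED is treated like OPENING, so the datafeed
    // keeps waiting for a later cluster state update.
    static boolean canRunDatafeed(JobTaskState jobTaskState, long currentAllocationId) {
        return jobTaskState != null
                && jobTaskState.getState() == JobState.OPENED
                && jobTaskState.isStatusStale(currentAllocationId) == false;
    }
}
```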

#37227 should stop the problem occurring in the fully upgraded cluster. In the mixed cluster it may still happen if a job gets relocated to a node on the old version that doesn't have the fix. I'll think more carefully about this before unmuting the test.

droberts195 (Contributor) commented:
I'll think more carefully about this before unmuting the test.

The test that is muted only runs in the fully upgraded cluster, so it can definitely be unmuted once #37227 is merged.

The same underlying problem does sometimes happen in the mixed cluster; for example, it happened in https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.6+bwc-tests/49/console. However, the test that fails in that case is not currently muted. I'll monitor similar failures once #37227 is merged to see if the mixed cluster failures completely go away. If not, I'll adjust which tests are run in the mixed cluster.

droberts195 added three commits that referenced this issue Jan 9, 2019, all with the same commit message as the one above.
droberts195 added a commit that referenced this issue Jan 9, 2019
This reverts commit d7efadc

The test should now work following the change made in #37227

Closes #36810
droberts195 (Contributor) commented:
#37227 should fix the problem and the muted test is unmuted in #37258. Both these PRs are now merged.

droberts195 added another commit that referenced this issue Jan 9, 2019, with the same commit message as the revert above.
@droberts195 droberts195 reopened this Jan 10, 2019
droberts195 (Contributor) commented:
I found out what is happening. Whether a persistent task goes stale when a node leaves and rejoins the cluster depends on whether there is a cluster state update while the cluster has a master node and the node is out of the cluster. If a node leaves and rejoins the cluster while the cluster has no master node, then a persistent task running on it will not go stale. Maybe this was intentional, but it's annoying for ML in the case where a node running ML jobs is restarted.

In the rolling upgrade tests all 3 nodes are master-eligible and minimum master nodes is 3, so while any node is restarted the cluster has no master. When the upgraded node rejoins the cluster it has a master again, and at that point a flurry of cluster state updates occurs. Sometimes the ML persistent tasks that were running on the restarted node are reallocated to a different node, and sometimes they stay on the restarted node. If they stay on the restarted node they are not considered stale.

This wouldn't happen in a cluster where minimum master nodes is less than the number of nodes in the cluster: if a single ML node were restarted, the remaining nodes would still be able to form a cluster with a master, and the ML persistent tasks would either get allocated to other nodes or become unallocated while the restarting node was out of the cluster. Either way, they would be guaranteed to be stale.

The problem can be fixed by checking explicitly whether an autodetect process is running for a datafeed's job before the datafeed does any work. The extent to which this is a test fix or a production fix depends on whether anyone runs in production with minimum master nodes set to the number of nodes in the cluster. That seems a bit crazy, because then a single node failure cripples the cluster.
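
A minimal sketch of that extra guard, reusing the illustrative JobState enum from the earlier sketch and assuming a hypothetical hasRunningAutodetectProcess check (the real check lives in the autodetect process manager, but the names and wiring here are placeholders):

```java
// Sketch only: the interface and helper names are illustrative, not the exact production code.
final class DatafeedRunGuard {

    interface ProcessCheck {
        // True if an autodetect (C++) process is currently running for the given job on this node.
        boolean hasRunningAutodetectProcess(String jobId);
    }

    private final ProcessCheck processCheck;

    DatafeedRunGuard(ProcessCheck processCheck) {
        this.processCheck = processCheck;
    }

    // Even if the persistent task reports OPENED and is not stale, do not let the datafeed do
    // any work until an autodetect process actually exists for the job. This covers the case
    // where a node restarts but its persistent tasks are never marked stale because the cluster
    // had no master while the node was out of it.
    boolean canDatafeedDoWork(JobState reportedState, boolean stateIsStale, String jobId) {
        return reportedState == JobState.OPENED
                && stateIsStale == false
                && processCheck.hasRunningAutodetectProcess(jobId);
    }
}
```

With a guard like this, it no longer matters whether the task state was correctly marked stale: the datafeed only proceeds once the process it needs is demonstrably running.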

Tim-Brooks (Contributor) commented:
I reverted the commit that unmuted the test on 6.x as we had about 10 failures today.

droberts195 added a commit to droberts195/elasticsearch that referenced this issue Jan 11, 2019
This is a reinforcement of elastic#37227.  It turns out that
persistent tasks are not made stale if the node they
were running on is restarted and the master node does
not notice this.  The main scenario where this happens
is when minimum master nodes is the same as the number
of nodes in the cluster, so the cluster cannot elect a
master node when any node is restarted.

When an ML node restarts we need the datafeeds for any
jobs that were running on that node to not just wait
until the jobs are allocated, but to wait for the
autodetect process of the job to start up.  In the case
of reassignment of the job persistent task this was
dealt with by the stale status test.  But in the case
where a node restarts but its persistent tasks are not
reassigned we need a deeper test.

Fixes elastic#36810
droberts195 added three commits that referenced this issue Jan 11, 2019, all with the same commit message as the one above.
gwbrown (Contributor) commented Jan 11, 2019

This failed again in https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.6+matrix-java-periodic/ES_BUILD_JAVA=java11,ES_RUNTIME_JAVA=zulu8,nodes=virtual&&linux/47/console, which includes f141f3a, which I believe was intended to fix it. It appears to be happening much less now, so I'm not going to re-mute the test, but I'm not sure the problem is fully fixed.

@gwbrown gwbrown reopened this Jan 11, 2019
droberts195 (Contributor) commented Jan 14, 2019

https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.6+matrix-java-periodic/ES_BUILD_JAVA=java11,ES_RUNTIME_JAVA=zulu8,nodes=virtual&&linux/47/console says:

runbld>>> Using last successful commit 1180687 from the job elastic+elasticsearch+6.6+intake with the build id 20190111103841-CB1867D2 that was started at 2019-01-11T10:38:42.324Z and finished 5h 47m 47s ago

1180687 is the commit before f141f3a, so the failing build did not contain the fix. (At the time the build ran, f141f3a had been committed but had not yet passed an intake build, so it was not being used in the non-intake builds.)

https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.x+bwc-tests/281/consoleText is another build that failed over the weekend due to this problem, but again it used a "last successful commit" from before the fix.

The only other build over the weekend where this test failed was https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.x+intake/972/consoleText, but a whole load of other tests also failed in that build, and the root cause looks like a problem setting up security that prevented the test harness from connecting to the test cluster. Therefore I'll close this issue again.
