[CI] Failures on 6.x for 60_ml_config_migration/Test old cluster jobs and datafeeds and delete them #36810

Closed
tvernum opened this issue Dec 19, 2018 · 14 comments
Labels: :ml Machine learning, >test-failure Triaged test failures from CI

Comments

tvernum (Contributor) commented Dec 19, 2018

This has failed at least twice since #36698 was merged to 6.x:

https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.x+intake/693/console
https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.x+matrix-java-periodic/ES_BUILD_JAVA=java11,ES_RUNTIME_JAVA=java11,nodes=virtual&&linux/127/console

./gradlew :x-pack:qa:rolling-upgrade:with-system-key:v5.6.15#upgradedClusterTestRunner -Dtests.seed=7721FEBE7CCB10B7 -Dtests.class=org.elasticsearch.upgrades.UpgradeClusterClientYamlTestSuiteIT -Dtests.method="test {p0=upgraded_cluster/60_ml_config_migration/Test old cluster jobs and datafeeds and delete them}" -Dtests.security.manager=true -Dtests.locale=ar-AE -Dtests.timezone=Etc/GMT-10 -Dcompiler.java=11 -Druntime.java=8 -Dtests.rest.suite=upgraded_cluster -Dtests.rest.blacklist="/30_ml_jobs_crud/Test model memory limit is updated"
00:50:45    > Throwable #1: java.lang.AssertionError: Failure at [upgraded_cluster/60_ml_config_migration:43]: datafeeds.0.state didn't match expected value:
00:50:45    >              datafeeds.0.state: expected [started] but was [stopped]
00:50:45    > 	at __randomizedtesting.SeedInfo.seed([7721FEBE7CCB10B7:FF75C164D2377D4F]:0)
00:50:45    > 	at org.elasticsearch.test.rest.yaml.ESClientYamlSuiteTestCase.executeSection(ESClientYamlSuiteTestCase.java:406)
00:50:45    > 	at org.elasticsearch.test.rest.yaml.ESClientYamlSuiteTestCase.test(ESClientYamlSuiteTestCase.java:383)
00:50:45    > 	at java.lang.Thread.run(Thread.java:748)
00:50:45    > Caused by: java.lang.AssertionError: datafeeds.0.state didn't match expected value:
00:50:45    >              datafeeds.0.state: expected [started] but was [stopped]
00:50:45    > 	at org.elasticsearch.test.rest.yaml.section.MatchAssertion.doAssert(MatchAssertion.java:87)
00:50:45    > 	at org.elasticsearch.test.rest.yaml.section.Assertion.execute(Assertion.java:76)
00:50:45    > 	at org.elasticsearch.test.rest.yaml.ESClientYamlSuiteTestCase.executeSection(ESClientYamlSuiteTestCase.java:399)
00:50:45    > 	... 38 more

I get failures locally as well, but with a different assertion failure.

I'm going to mute this on 6.x.

@tvernum tvernum added >test-failure Triaged test failures from CI :ml Machine learning labels Dec 19, 2018
elasticmachine (Collaborator) commented:

Pinging @elastic/ml-core

tvernum (Contributor, Author) commented Dec 19, 2018

@davidkyle I assigned this to you since it seems to have come from #36698.
Please reassign if I got that wrong.

tvernum added a commit that referenced this issue Dec 19, 2018
AwaitsFix: #36810

This test fails on the upgraded cluster in 6.x (details in #36810)
droberts195 (Contributor) commented:
I think this is a case of an old bug that has been made far more likely to happen by the timing change of having to search for the job configuration rather than getting it from the cluster state.

This goes back to the age-old inconsistency that when jobs open for the first time they are in the opening state until they are ready to accept data, but when relocating to a new node they stay in the opened state even though they cannot accept data until the corresponding C++ process is running on the new node.

We have a check that datafeed persistent tasks cannot be assigned to a node until the associated job is assigned to a node. We also have a check that an assigned datafeed cannot be run until the associated job is in the opened state. If a datafeed's associated job is in the opening state, the datafeed waits for a subsequent cluster state update where the job state is opened; however, this check assumes that relocating jobs have a state of opening, which they do not.

Two possible solutions are:

  1. Set the job state to opening when it is opened but the assignment is stale
  2. Change the test in DatafeedManager to check whether a C++ process is running for the associated job, and not rely on an opened state implying this

droberts195 (Contributor) commented:
The piece of code I referred to in DatafeedManager is:

if (jobState == JobState.OPENED) {
    // assumes an autodetect process exists for the job - not true right after relocation
    runTask(datafeedTask);
} else if (jobState == JobState.OPENING) {
    // wait for a later cluster state update where the job reaches OPENED
    remainingTasks.add(datafeedTask);
}

droberts195 (Contributor) commented:
Although option 1 (setting the job state to opening when it is opened but the assignment is stale) seems logically nicer, it cannot be achieved without changing the persistent tasks framework: there is no way to change a task's status on reassignment. Task status can only be set by a separate action, and using that mechanism would leave a period during which the same problem that exists now would still exist (just much harder to debug, since it would exist for a shorter and less predictable amount of time).

@droberts195 droberts195 assigned droberts195 and unassigned davidkyle Jan 8, 2019
droberts195 added a commit to droberts195/elasticsearch that referenced this issue Jan 8, 2019
We already had logic to stop datafeeds running against
jobs that were OPENING, but a job that relocates from
one node to another while OPENED stays OPENED, and this
could cause the datafeed to fail when it sent data to
the OPENED job on its new node before it had a
corresponding autodetect process.

This change extends the check to stop datafeeds running
when their job is OPENING _or_ stale (i.e. has not had
its status reset since relocating to a different node).

Relates elastic#36810
droberts195 (Contributor) commented:
Although option 1 (setting the job state to opening when it is opened but the assignment is stale) seems logically nicer, there is no way to achieve this without making a change to the persistent tasks framework.

This problem has been dealt with elsewhere via the concept of a "stale" task status: the status of a job task stores an allocation ID as well as the job state itself, and if that allocation ID differs from the task's current allocation ID then we know the task has relocated and can interpret the reported state differently. I extended the use of this tactic to the datafeed manager in #37227.
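
A minimal sketch of that idea, using simplified, illustrative names (the JobState enum, JobTaskState class and canRunDatafeed helper below are placeholders rather than the exact production code):

```java
// Sketch only: names are illustrative, not the exact Elasticsearch classes.
enum JobState { OPENING, OPENED, CLOSING, CLOSED, FAILED }

final class JobTaskState {
    private final JobState state;    // the last state the job reported
    private final long allocationId; // the persistent task's allocation ID when the state was written

    JobTaskState(JobState state, long allocationId) {
        this.state = state;
        this.allocationId = allocationId;
    }

    JobState getState() {
        return state;
    }

    // The recorded state is "stale" if the persistent task has been reassigned since the
    // state was written, i.e. the stored allocation ID no longer matches the current one.
    boolean isStatusStale(long currentAllocationId) {
        return allocationId != currentAllocationId;
    }
}

final class DatafeedJobStateCheck {
    // In the spirit of #37227: only run the datafeed when the job reports OPENED *and*
    // that report is not stale; a stale OPENED is treated like OPENING, so the datafeed
    // keeps waiting for a later cluster state update.
    static boolean canRunDatafeed(JobTaskState jobTaskState, long currentAllocationId) {
        return jobTaskState != null
                && jobTaskState.getState() == JobState.OPENED
                && jobTaskState.isStatusStale(currentAllocationId) == false;
    }
}
```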

#37227 should stop the problem occurring in the fully upgraded cluster. In the mixed cluster it may still happen if a job gets relocated to a node on the old version that doesn't have the fix. I'll think more carefully about this before unmuting the test.

droberts195 (Contributor) commented:
I'll think more carefully about this before unmuting the test.

The test that is muted only runs in the fully upgraded cluster, so it can definitely be unmuted once #37227 is merged.

The same underlying problem does sometimes happen in the mixed cluster; for example, it happened in https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.6+bwc-tests/49/console. However, the test that fails in that case is not currently muted. I'll monitor similar failures once #37227 is merged to see if the mixed cluster failures completely go away. If not, I'll adjust which tests are run in the mixed cluster.

droberts195 added three commits that referenced this issue Jan 9, 2019, all with the same commit message as the one above.
droberts195 added a commit that referenced this issue Jan 9, 2019
This reverts commit d7efadc

The test should now work following the change made in #37227

Closes #36810
droberts195 (Contributor) commented:
#37227 should fix the problem and the muted test is unmuted in #37258. Both these PRs are now merged.

droberts195 added another commit that referenced this issue Jan 9, 2019, with the same commit message as the revert above.
@droberts195 droberts195 reopened this Jan 10, 2019
droberts195 (Contributor) commented:
I found out what is happening. Whether a persistent task goes stale when a node leaves and rejoins the cluster depends on whether there is a cluster state update while the cluster has a master node and the node is out of the cluster. If a node leaves and rejoins the cluster while the cluster has no master node, then a persistent task running on it will not go stale. Maybe this was intentional, but it's annoying for ML in the case where a node running ML jobs is restarted.

In the rolling upgrade tests all 3 nodes are master-eligible and minimum master nodes is 3, so while any node is restarted the cluster has no master. When the upgraded node rejoins the cluster it has a master again, and at that point a flurry of cluster state updates occurs. Sometimes the ML persistent tasks that were running on the restarted node are reallocated to a different node, and sometimes they stay on the restarted node. If they stay on the restarted node they are not considered stale.

This wouldn't happen in a cluster where minimum master nodes is less than the number of nodes in the cluster: if a single ML node were restarted, the remaining nodes would still be able to form a cluster with a master, and the ML persistent tasks would either get allocated to other nodes or become unallocated while the restarting node was out of the cluster. Either way, they would be guaranteed to be stale.

The problem can be fixed by checking explicitly whether an autodetect process is running for a datafeed's job before the datafeed does any work. The extent to which this is a test fix or a production fix depends on whether anyone runs in production with minimum master nodes set to the number of nodes in the cluster. That seems a bit crazy, because then a single node failure cripples the cluster.
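
A minimal sketch of that extra guard, reusing the illustrative JobState enum from the earlier sketch and assuming a hypothetical hasRunningAutodetectProcess check (the real check lives in the autodetect process manager, but the names and wiring here are placeholders):

```java
// Sketch only: the interface and helper names are illustrative, not the exact production code.
final class DatafeedRunGuard {

    interface ProcessCheck {
        // True if an autodetect (C++) process is currently running for the given job on this node.
        boolean hasRunningAutodetectProcess(String jobId);
    }

    private final ProcessCheck processCheck;

    DatafeedRunGuard(ProcessCheck processCheck) {
        this.processCheck = processCheck;
    }

    // Even if the persistent task reports OPENED and is not stale, do not let the datafeed do
    // any work until an autodetect process actually exists for the job. This covers the case
    // where a node restarts but its persistent tasks are never marked stale because the cluster
    // had no master while the node was out of it.
    boolean canDatafeedDoWork(JobState reportedState, boolean stateIsStale, String jobId) {
        return reportedState == JobState.OPENED
                && stateIsStale == false
                && processCheck.hasRunningAutodetectProcess(jobId);
    }
}
```

With a guard like this, it no longer matters whether the task state was correctly marked stale: the datafeed only proceeds once the process it needs is demonstrably running.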

Tim-Brooks (Contributor) commented:
I reverted the commit that unmuted the test on 6.x as we had about 10 failures today.

droberts195 added a commit to droberts195/elasticsearch that referenced this issue Jan 11, 2019
This is a reinforcement of elastic#37227.  It turns out that
persistent tasks are not made stale if the node they
were running on is restarted and the master node does
not notice this.  The main scenario where this happens
is when minimum master nodes is the same as the number
of nodes in the cluster, so the cluster cannot elect a
master node when any node is restarted.

When an ML node restarts we need the datafeeds for any
jobs that were running on that node to not just wait
until the jobs are allocated, but to wait for the
autodetect process of the job to start up.  In the case
of reassignment of the job persistent task this was
dealt with by the stale status test.  But in the case
where a node restarts but its persistent tasks are not
reassigned we need a deeper test.

Fixes elastic#36810
droberts195 added three commits that referenced this issue Jan 11, 2019, all with the same commit message as the one above.
gwbrown (Contributor) commented Jan 11, 2019

This failed again in https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.6+matrix-java-periodic/ES_BUILD_JAVA=java11,ES_RUNTIME_JAVA=zulu8,nodes=virtual&&linux/47/console, which includes f141f3a, which I believe was intended to fix it. It appears to be happening much less now, so I'm not going to re-mute the test, but I'm not sure the problem is fully fixed.

@gwbrown gwbrown reopened this Jan 11, 2019
droberts195 (Contributor) commented Jan 14, 2019

https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.6+matrix-java-periodic/ES_BUILD_JAVA=java11,ES_RUNTIME_JAVA=zulu8,nodes=virtual&&linux/47/console says:

runbld>>> Using last successful commit 1180687 from the job elastic+elasticsearch+6.6+intake with the build id 20190111103841-CB1867D2 that was started at 2019-01-11T10:38:42.324Z and finished 5h 47m 47s ago

1180687 is the commit before f141f3a, so the failing build did not contain the fix. (At the time the build ran, f141f3a had been committed but had not yet passed an intake build, so it was not being used in the non-intake builds.)

https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.x+bwc-tests/281/consoleText is another build that failed over the weekend due to this problem, but again it used a "last successful commit" from before the fix.

The only other build over the weekend where this test failed was https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.x+intake/972/consoleText, but a whole load of other tests also failed in that build, and the root cause looks like a problem setting up security that prevented the test harness from connecting to the test cluster. Therefore I'll close this issue again.
