
[ML] Jindex: Rolling upgrade tests #35700

Merged
merged 20 commits into elastic:feature-jindex-6x on Nov 23, 2018

Conversation

@davidkyle (Member) commented Nov 19, 2018

Adds rolling upgrade tests specific to the migration project and re-enables all the previously muted upgrade tests. A milestone for this work.

A number of minor issues had to be fixed to enable the tests:

  • Wait for the new .ml-config template to be installed before running the tests
  • Prevent v6.6.0 jobs (defined in the index) from being assigned to pre-v6.6.0 nodes, which don't know about index configurations
  • Order the response of GET job stats by job ID. This is actually an enhancement, as stats were never sorted before
  • Fix two places (start datafeed, DatafeedJobManager) where the code was looking for the config in the index and not the cluster state
  • The Finalize Job Action now updates both cluster state and index jobs
  • AutoDetectResultsProcessor now calls the Job Update action to update the established model memory and model snapshot IDs. These updates must have finished before the processor closes (otherwise there are conflicts with the Finalize Job Action)

@davidkyle davidkyle added WIP :ml Machine learning labels Nov 19, 2018
@elasticmachine (Collaborator) commented:

Pinging @elastic/ml-core

@davidkyle davidkyle mentioned this pull request Nov 20, 2018
@davidkyle davidkyle added >feature and removed WIP labels Nov 20, 2018
@davidkyle (Member, Author) commented:

I pushed another commit fixing the problems with AutoDetectResultsProcessor. Previously I was updating the index jobs directly by updating the document, but it is simpler to call the Job Update Action and let the action figure out whether the job is in the index or the cluster state. Doing this meant I could remove a countdown latch, and I repurposed the update-model-snapshot-ID semaphore as a general job-update semaphore. All job updates must be completed before the processor closes; otherwise the Finalize Job Action, called on job close, can conflict when trying to update the same document. (Finalize job sets the finished time; the processor updates the established model memory and snapshot IDs.)

When reviewing, please pay close attention to AutoDetectResultsProcessor, as deadlocks in this code are a real problem.
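To make the coordination concrete, here is a minimal sketch of the semaphore pattern described above (hypothetical class and method names, not the PR's actual code); in the real processor the updates are async actions, so the release belongs in the action's response/failure listener:

import java.util.concurrent.Semaphore;

// Sketch: one permit means at most one job update in flight, and close()
// cannot complete while an update still holds the permit.
class JobUpdateGate {
    private final Semaphore jobUpdateSemaphore = new Semaphore(1);

    // Called for each update (established model memory, model snapshot ID, ...).
    void submitUpdate(Runnable sendUpdateRequest) throws InterruptedException {
        jobUpdateSemaphore.acquire();
        try {
            sendUpdateRequest.run();      // e.g. dispatch a Job Update action
        } finally {
            jobUpdateSemaphore.release(); // must run on success AND failure,
        }                                 // or awaitUpdatesAndClose() deadlocks
    }

    // Must not return while an update is in flight; otherwise the Finalize
    // Job Action can race the processor on the same job document.
    void awaitUpdatesAndClose() throws InterruptedException {
        jobUpdateSemaphore.acquire();
        jobUpdateSemaphore.release();
    }
}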

@@ -95,6 +96,7 @@ protected void doExecute(Task task, GetJobsStatsAction.Request request, ActionLi
List<GetJobsStatsAction.Response.JobStats> stats = new ArrayList<>();
for (QueryPage<GetJobsStatsAction.Response.JobStats> task : tasks) {
stats.addAll(task.results());
Collections.sort(stats, Comparator.comparing(GetJobsStatsAction.Response.JobStats::getJobId));
A Member commented on the diff:

I would think that it is better to sort this once, after all the tasks have been added, instead of on every iteration of for (task : tasks).
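A sketch of the suggested change, reusing the identifiers and types from the diff above and hoisting the sort out of the loop so it runs once:

List<GetJobsStatsAction.Response.JobStats> stats = new ArrayList<>();
for (QueryPage<GetJobsStatsAction.Response.JobStats> task : tasks) {
    stats.addAll(task.results());   // collect all pages first
}
// sort once, after the loop, instead of re-sorting per task
stats.sort(Comparator.comparing(GetJobsStatsAction.Response.JobStats::getJobId));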

@elastic elastic deleted a comment Nov 20, 2018
@davidkyle (Member, Author) commented:

retest this please

@davidkyle (Member, Author) commented:

run gradle build tests

2 similar comments from the author followed.

@davidkyle (Member, Author) commented:

retest this please

@droberts195 (Contributor) left a review:

LGTM

@davidkyle davidkyle merged commit 4fd00d6 into elastic:feature-jindex-6x Nov 23, 2018
@davidkyle davidkyle deleted the rolling-upgrade-tests branch November 23, 2018 11:21

private void finalizeIndexJobs(Collection<String> jobIds, ActionListener<AcknowledgedResponse> listener) {
    String jobIdString = String.join(",", jobIds);
    logger.debug("finalizing jobs [{}]", jobIdString);
A Contributor commented on the diff:

nit: isn't the collection stringified to "[id1, id2, ...]" anyway? Passing it straight to the logger would also mean less garbage on the non-debug path.

@davidkyle (Member, Author) replied:

Fair point. I'll remedy in a later commit as this is merged now.
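For illustration, a sketch of that later fix (a hypothetical revision, not a commit in this PR): pass the collection straight to the parameterized logger, so no string is built unless debug logging is enabled, and the default [id1, id2, ...] rendering does the formatting:

private void finalizeIndexJobs(Collection<String> jobIds, ActionListener<AcknowledgedResponse> listener) {
    // A Collection's toString() already renders as "[id1, id2, ...]", and the
    // parameterized message is only materialised when debug is enabled, so the
    // eager String.join(",", jobIds) can simply be dropped.
    logger.debug("finalizing jobs {}", jobIds);
    // ... rest of the method unchanged ...
}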

@@ -158,6 +158,10 @@ static void validate(String jobId, Job job) {
int maxMachineMemoryPercent,
MlMemoryTracker memoryTracker,
Logger logger) {
if (job == null) {
    logger.debug("[{}] select node job is null", jobId);
}
A Contributor commented on the diff:

maybe assert instead?

@davidkyle (Member, Author) replied:

job == null is valid for the rolling upgrade, as that field isn't present in the persistent task parameters for jobs < v6.6.0. I added this to help me understand what was happening during the rolling upgrade tests; it probably shouldn't be in the released code, however.
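A sketch of why an assert would be wrong here (hypothetical surrounding code, based on the explanation above):

// NOT this: pre-v6.6.0 persistent tasks carry no job in their parameters,
// so this would trip mid rolling upgrade:
// assert job != null;

// Tolerant handling instead: log and skip the checks that need the job.
if (job == null) {
    logger.debug("[{}] select node job is null", jobId);
} else if (node.getVersion().before(job.getJobVersion())) {
    // rule out nodes too old to run a job of this version (see the diff below)
}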

String reason = "Not opening job [" + jobId + "] on node [" + nodeNameOrId(node)
+ "] version [" + node.getVersion() + "], because this node does not support " +
"jobs of version [" + job.getJobVersion() + "]";
logger.trace(reason);
A Contributor commented on the diff:

nit: debug seems better suited to me

A Contributor replied:

Remember that on a 100-node cluster each allocation will generate 100 messages similar to this one, which would be significant log spam. They get concatenated into the overall reason, which is stored in the cluster state if the persistent task exists (and returned in the error message in the case of this being called prior to opening).

All the other possible reasons for ruling out a node in this method also currently log at the trace level. I think they should all log at the same level, otherwise someone reading the logs could get a misleading picture of what is happening.

I would leave this as trace to match the others.

try {
    jobUpdateSemaphore.acquire();
} catch (InterruptedException e) {
    Thread.currentThread().interrupt();
    LOGGER.info("[{}] Interrupted acquiring update established model memory semaphore", jobId);
A Contributor commented on the diff:

is level info correct here?

@davidkyle (Member, Author) replied:

Hmm, I don't think it's an error-level message, as the interrupt is not necessarily an error; perhaps debug. ML code does not interrupt threads, and the only usage I can find in ES core outside of tests is in CancellableThreads.

I doubt we will ever see this message outside of testing. The test framework interrupts zombie threads after each test, so that will probably be the only time we see it.
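A sketch of the idiom under discussion, with the level dropped to debug as suggested (the enclosing void method is assumed):

try {
    jobUpdateSemaphore.acquire();
} catch (InterruptedException e) {
    // restore the flag so callers further up the stack can observe the interrupt
    Thread.currentThread().interrupt();
    // not an error: in practice only the test framework interrupts these threads
    LOGGER.debug("[{}] Interrupted acquiring job update semaphore", jobId);
    return;
}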
