
[ML] Jindex: Rolling upgrade tests #35700

Merged
merged 20 commits into elastic:feature-jindex-6x on Nov 23, 2018

Conversation

@davidkyle (Member) commented Nov 19, 2018

Adds rolling upgrade tests specific to the migration project and re-enables all the previously muted upgrade tests. A milestone for this work.

A number of minor issues had to be fixed to enable the tests:

  • Wait for the new .ml-config template to be installed before running the tests
  • Prevent v6.6.0 jobs (defined in the index) from being assigned to pre-v6.6.0 nodes, which don't know about index configurations
  • Order the response of GET job stats by job ID. This is actually an enhancement, as stats were never sorted before
  • Fix two places (start datafeed, DatafeedJobManager) where the code was looking for the config in the index and not the cluster state
  • The Finalize Job Action now updates both cluster state and index jobs
  • AutoDetectResultsProcessor now calls the Job Update action to update the established model memory and model snapshot IDs. These updates must have finished before the processor closes (otherwise there are conflicts with the Finalize Job Action)

@davidkyle davidkyle added WIP :ml Machine learning labels Nov 19, 2018
@elasticmachine (Collaborator) commented:

Pinging @elastic/ml-core

@davidkyle davidkyle mentioned this pull request Nov 20, 2018
@davidkyle davidkyle added >feature and removed WIP labels Nov 20, 2018
@davidkyle (Member, Author) commented:

I pushed another commit fixing the problems with AutoDetectResultsProcessor. Previously I was updating the index jobs directly by updating the document, but it is simpler to call the Job Update Action and let the action figure out whether the job is in the index or the cluster state. Doing this meant I could remove a countdown latch, and I repurposed the update-model-snapshot-ID semaphore as a general job-update semaphore. All job updates must be completed before the processor closes; otherwise the Finalize Job Action, called on job close, can conflict when trying to update the same document. (Finalize job sets the finished time; the processor updates the established model memory and snapshot IDs.)

When reviewing, please pay close attention to AutoDetectResultsProcessor, as deadlocks in this code are a real problem.
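To make the coordination concrete, here is a minimal sketch of the semaphore pattern described above (hypothetical class and method names, not the PR's actual code); in the real processor the updates are async actions, so the release belongs in the action's response/failure listener:

import java.util.concurrent.Semaphore;

// Sketch: one permit means at most one job update in flight, and close()
// cannot complete while an update still holds the permit.
class JobUpdateGate {
    private final Semaphore jobUpdateSemaphore = new Semaphore(1);

    // Called for each update (established model memory, model snapshot ID, ...).
    void submitUpdate(Runnable sendUpdateRequest) throws InterruptedException {
        jobUpdateSemaphore.acquire();
        try {
            sendUpdateRequest.run();      // e.g. dispatch a Job Update action
        } finally {
            jobUpdateSemaphore.release(); // must run on success AND failure,
        }                                 // or awaitUpdatesAndClose() deadlocks
    }

    // Must not return while an update is in flight; otherwise the Finalize
    // Job Action can race the processor on the same job document.
    void awaitUpdatesAndClose() throws InterruptedException {
        jobUpdateSemaphore.acquire();
        jobUpdateSemaphore.release();
    }
}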

@@ -95,6 +96,7 @@ protected void doExecute(Task task, GetJobsStatsAction.Request request, ActionLi
List<GetJobsStatsAction.Response.JobStats> stats = new ArrayList<>();
for (QueryPage<GetJobsStatsAction.Response.JobStats> task : tasks) {
stats.addAll(task.results());
Collections.sort(stats, Comparator.comparing(GetJobsStatsAction.Response.JobStats::getJobId));
A Member commented on the diff:

I would think that it is better to sort this once, after all the tasks have been added, instead of on every iteration of for (task : tasks).
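A sketch of the suggested change, reusing the identifiers and types from the diff above and hoisting the sort out of the loop so it runs once:

List<GetJobsStatsAction.Response.JobStats> stats = new ArrayList<>();
for (QueryPage<GetJobsStatsAction.Response.JobStats> task : tasks) {
    stats.addAll(task.results());   // collect all pages first
}
// sort once, after the loop, instead of re-sorting per task
stats.sort(Comparator.comparing(GetJobsStatsAction.Response.JobStats::getJobId));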

@elastic elastic deleted a comment Nov 20, 2018
@davidkyle (Member, Author) commented:

retest this please

@davidkyle (Member, Author) commented:

run gradle build tests

2 similar comments from the author followed.

@davidkyle (Member, Author) commented:

retest this please

@droberts195 (Contributor) left a review:

LGTM

@davidkyle davidkyle merged commit 4fd00d6 into elastic:feature-jindex-6x Nov 23, 2018
@davidkyle davidkyle deleted the rolling-upgrade-tests branch November 23, 2018 11:21

private void finalizeIndexJobs(Collection<String> jobIds, ActionListener<AcknowledgedResponse> listener) {
    String jobIdString = String.join(",", jobIds);
    logger.debug("finalizing jobs [{}]", jobIdString);
A Contributor commented on the diff:

nit: isn't the collection stringified to "[id1, id2, ...]" anyway? Passing it straight to the logger would also mean less garbage on the non-debug path.

@davidkyle (Member, Author) replied:

Fair point. I'll remedy in a later commit as this is merged now.
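For illustration, a sketch of that later fix (a hypothetical revision, not a commit in this PR): pass the collection straight to the parameterized logger, so no string is built unless debug logging is enabled, and the default [id1, id2, ...] rendering does the formatting:

private void finalizeIndexJobs(Collection<String> jobIds, ActionListener<AcknowledgedResponse> listener) {
    // A Collection's toString() already renders as "[id1, id2, ...]", and the
    // parameterized message is only materialised when debug is enabled, so the
    // eager String.join(",", jobIds) can simply be dropped.
    logger.debug("finalizing jobs {}", jobIds);
    // ... rest of the method unchanged ...
}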

@@ -158,6 +158,10 @@ static void validate(String jobId, Job job) {
int maxMachineMemoryPercent,
MlMemoryTracker memoryTracker,
Logger logger) {
if (job == null) {
    logger.debug("[{}] select node job is null", jobId);
}
A Contributor commented on the diff:

maybe assert instead?

@davidkyle (Member, Author) replied:

job == null is valid for the rolling upgrade, as that field isn't present in the persistent task parameters for jobs < v6.6.0. I added this to help me understand what was happening during the rolling upgrade tests; it probably shouldn't be in the released code, however.
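A sketch of why an assert would be wrong here (hypothetical surrounding code, based on the explanation above):

// NOT this: pre-v6.6.0 persistent tasks carry no job in their parameters,
// so this would trip mid rolling upgrade:
// assert job != null;

// Tolerant handling instead: log and skip the checks that need the job.
if (job == null) {
    logger.debug("[{}] select node job is null", jobId);
} else if (node.getVersion().before(job.getJobVersion())) {
    // rule out nodes too old to run a job of this version (see the diff below)
}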

String reason = "Not opening job [" + jobId + "] on node [" + nodeNameOrId(node)
+ "] version [" + node.getVersion() + "], because this node does not support " +
"jobs of version [" + job.getJobVersion() + "]";
logger.trace(reason);
A Contributor commented on the diff:

nit: debug seems better suited to me

A Contributor replied:

Remember that on a 100-node cluster each allocation will generate 100 messages similar to this one, which would be significant log spam. They get concatenated into the overall reason, which is stored in the cluster state if the persistent task exists (and returned in the error message in the case of this being called prior to opening).

All the other possible reasons for ruling out a node in this method also currently log at the trace level. I think they should all log at the same level, otherwise someone reading the logs could get a misleading picture of what is happening.

I would leave this as trace to match the others.

try {
    jobUpdateSemaphore.acquire();
} catch (InterruptedException e) {
    Thread.currentThread().interrupt();
    LOGGER.info("[{}] Interrupted acquiring update established model memory semaphore", jobId);
A Contributor commented on the diff:

is level info correct here?

@davidkyle (Member, Author) replied:

Hmm, I don't think it's an error-level message, as the interrupt is not necessarily an error; perhaps debug. ML code does not interrupt threads, and the only usage I can find in ES core outside of tests is in CancellableThreads.

I doubt we will ever see this message outside of testing. The test framework interrupts zombie threads after each test, so that will probably be the only time we see it.
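A sketch of the idiom under discussion, with the level dropped to debug as suggested (the enclosing void method is assumed):

try {
    jobUpdateSemaphore.acquire();
} catch (InterruptedException e) {
    // restore the flag so callers further up the stack can observe the interrupt
    Thread.currentThread().interrupt();
    // not an error: in practice only the test framework interrupts these threads
    LOGGER.debug("[{}] Interrupted acquiring job update semaphore", jobId);
    return;
}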
