[ML] Store failure reason in ML job task status #34431

Closed
droberts195 opened this issue Oct 13, 2018 · 1 comment

@droberts195
Contributor

commented Oct 13, 2018

At present, if an ML job fails, its persistent task is not deleted but remains in the cluster state with a status of failed. However, to find out why it failed you currently need to look in the log file of the node that was running the persistent task at the time of the failure. Obtaining that log almost invariably involves multiple back-and-forth cycles when working on support cases.

It would be relatively easy to store the getMessage() of the exception that caused the job to fail in the job task status for ML job persistent tasks. This would be helpful in the case where the logs for the node the job was running on are not initially available. In some cases it may prevent the need to ask for those logs at all.
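As a rough illustration of the idea, the sketch below shows a status object that carries the exception's getMessage() alongside the failed state. The class JobTaskStatusSketch, its State enum and its method names are hypothetical stand-ins invented for this example, not the actual Elasticsearch classes, and the fallback to the exception class name when getMessage() returns null is an assumption of the sketch.

```java
import java.util.Objects;

// Hypothetical stand-in for an ML job task status; NOT the real Elasticsearch class.
final class JobTaskStatusSketch {

    // Assumed, simplified set of states for illustration only.
    enum State { OPENED, FAILED }

    private final State state;
    private final String reason; // null unless the job failed

    private JobTaskStatusSketch(State state, String reason) {
        this.state = Objects.requireNonNull(state);
        this.reason = reason;
    }

    static JobTaskStatusSketch opened() {
        return new JobTaskStatusSketch(State.OPENED, null);
    }

    // Records the failure reason next to the failed state, so that it
    // travels with the task status instead of living only in a node log.
    static JobTaskStatusSketch failed(Exception cause) {
        String message = cause.getMessage();
        if (message == null) {
            // getMessage() may legitimately be null; fall back to the class name.
            message = cause.getClass().getName();
        }
        return new JobTaskStatusSketch(State.FAILED, message);
    }

    State state() {
        return state;
    }

    String reason() {
        return reason;
    }
}
```

With something along these lines, the reason would be visible wherever the task status is visible, which is the point of the proposal: the node log would only be needed when the message alone is not enough.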

The only problem with making this change is backwards compatibility (BWC) of the job task status. If the status is parsed strictly in 6.0 or above, then 6.7 needs to be changed to leniently ignore unknown fields in the job task status, and the new exception field can only be added in 7.0.
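The compatibility concern can be sketched as follows. This is a simplified model, not the real Elasticsearch parsing code: the class StatusParserSketch, the field names "state" and "reason", and the use of a plain Map in place of a serialized status are all assumptions made for illustration. A strict parser rejects a status containing a field it does not know, whereas a lenient parser skips it.

```java
import java.util.Map;
import java.util.Set;

// Hypothetical model of an older node reading a job task status;
// the field names and this class are assumptions for illustration.
final class StatusParserSketch {

    // Fields the "old" parser knows about. The newer "reason" field is absent.
    private static final Set<String> KNOWN_FIELDS = Set.of("state");

    // Returns the parsed state. In strict mode any unknown field is an error;
    // in lenient mode unknown fields are silently ignored.
    static String parseState(Map<String, String> fields, boolean lenient) {
        for (String name : fields.keySet()) {
            if (!KNOWN_FIELDS.contains(name) && !lenient) {
                throw new IllegalArgumentException("unknown field [" + name + "]");
            }
        }
        String state = fields.get("state");
        if (state == null) {
            throw new IllegalArgumentException("missing field [state]");
        }
        return state;
    }
}
```

In this model, a status written by a newer node with an extra "reason" field can only be read by an older node if that older node parses leniently, which is why the lenient change would have to ship in the release before the one that adds the field.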

@elasticmachine

This comment has been minimized.

commented Oct 13, 2018
