Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ML] Improve error handling when using transforms with inference processors #50135

Closed
sophiec20 opened this issue Dec 12, 2019 · 5 comments
Closed
Labels
:ml/Transform Transform

Comments

@sophiec20
Copy link
Contributor

Found in 7.6.0-SNAPSHOT

It is possible to create and update a transform that uses a pipeline that does not exist.

For example:

POST _transform/transform-01/_update
{
  "dest": {
    "index" : "transform-01",
    "pipeline" : "this-does-not-exist"
  }
}

Returns acknowledged: true

If you start this transform, then we log audit messages saying:
Transform encountered an exception: org.elasticsearch.xpack.transform.transforms.ClientTransformIndexer$BulkIndexingException: Bulk index experienced failures. See the logs of the node running the transform for details. Will attempt again at next scheduled trigger.

And by looking in the logs on disk, we see
[2019-12-12T11:49:24,655][DEBUG][o.e.a.b.T.BulkRequestModifier] [node3] failed to execute pipeline [_none] for document [transform-01/_doc/MTWq3Ia7HoVvarOSxUzsmioAAAAAAAAA] java.lang.IllegalArgumentException: pipeline with id [this-does-not-exist] does not exist

This continuous transform failed after several minutes with task encountered more than 10 failures; latest failure:

Secondly, if the pipeline refers to a model that does not exist, then we have similar audit messages asking you to look in the log files, which have the error:
Caused by: org.elasticsearch.ResourceNotFoundException: Could not find trained model [this-model-does-not-exist]

This continuous transform also failed after several minutes with task encountered more than 10 failures; latest failure:

Should we:

  • Validate a pipeline exists when creating or updating a transform
  • Validate that a model exists when creating a inference processor pipeline
    (although reindex does not do this.)

Regardless of whether we validate the pipeline and/or model exists, we should surface the errors as audit messages so that the user does not have to look at log files on disk.

(note,
if you use reindex with a pipeline that does not exist, it returns all docs as 400 failures

...
"failures" : [
    {
      "index" : "kibana_sample_data_flights_copy",
      "type" : "_doc",
      "id" : "3TDI6m4BXY3wzdkLGbWl",
      "cause" : {
        "type" : "illegal_argument_exception",
        "reason" : "pipeline with id [pipeline-does-not-exist] does not exist"
      },
      "status" : 400
    },
...

if you use reindex with a pipeline that exists but refers to a model that does not exist, it returns all docs as 404 failures

...
"failures" : [
    {
      "index" : "kibana_sample_data_flights_copy2",
      "type" : "_doc",
      "id" : "3jDI6m4BXY3wzdkLGbWl",
      "cause" : {
        "type" : "resource_not_found_exception",
        "reason" : "Could not find trained model [model-does-not-exist]"
      },
      "status" : 404
    },
...

cc @hendrikmuhs @benwtrent

@sophiec20 sophiec20 added the :ml/Transform Transform label Dec 12, 2019
@elasticmachine
Copy link
Collaborator

Pinging @elastic/ml-core (:ml/Transform)

@benwtrent
Copy link
Member

Validate a pipeline exists when creating or updating a transform

Yes, 100%, I am surprised that this is not the case now.

Validate that a model exists when creating a inference processor pipeline
(although reindex does not do this.)

At the inference processor level, this is not currently possible. When pipelines are created, there are no opportunities to asynchronously check an index. And we cannot pause the thread that is doing the creation as that effects the stack as a whole.

The transform might be able to do this in a round about way. But that is conflating two different features.

Transforms might be able to check the first couple of docs at _start and prevent it starting if the first handful of docs fails?

@hendrikmuhs
Copy link
Contributor

In addition to validation, error handling of the bulk index failure should be improved, related to #50122.

We already do this for script errors and circuit breaker exceptions.

After that fix, such an error would immediately set the transform to failed without going into the retry loop.

benwtrent added a commit that referenced this issue Jan 9, 2020
If a pipeline referenced by a transform does not exist, we should not allow the transform to be created. 

We do allow the pipeline existence check to be skipped with defer_validations, but if the pipeline still does not exist on `_start`, the pipeline will fail to start.

relates:  #50135
benwtrent added a commit to benwtrent/elasticsearch that referenced this issue Jan 9, 2020
If a pipeline referenced by a transform does not exist, we should not allow the transform to be created. 

We do allow the pipeline existence check to be skipped with defer_validations, but if the pipeline still does not exist on `_start`, the pipeline will fail to start.

relates:  elastic#50135
benwtrent added a commit that referenced this issue Jan 9, 2020
If a pipeline referenced by a transform does not exist, we should not allow the transform to be created. 

We do allow the pipeline existence check to be skipped with defer_validations, but if the pipeline still does not exist on `_start`, the pipeline will fail to start.

relates:  #50135
SivagurunathanV pushed a commit to SivagurunathanV/elasticsearch that referenced this issue Jan 23, 2020
If a pipeline referenced by a transform does not exist, we should not allow the transform to be created. 

We do allow the pipeline existence check to be skipped with defer_validations, but if the pipeline still does not exist on `_start`, the pipeline will fail to start.

relates:  elastic#50135
hendrikmuhs pushed a commit to hendrikmuhs/elasticsearch that referenced this issue Feb 3, 2020
hendrikmuhs pushed a commit that referenced this issue Feb 4, 2020
treat resource not found and illegal argument exceptions as irrecoverable error

relates #50135
@hendrikmuhs
Copy link
Contributor

Enhancements made:

  • check pipeline at validation (put, start, preview)
  • set the transform to failed if pipeline disappears at runtime
  • set the transform failed if inference pipeline processor errors with resource not found

errors are booth in _stats and as audits.

I wonder what is left? As @benwtrent mentioned we could fail at start if the pipeline is broken. However, I would use the _simulate API of ingest and add this to validation.

@sophiec20 anything else?

I suggest to close this issue and - if needed - open separate follow up issues for further enhancements.

hendrikmuhs pushed a commit that referenced this issue Feb 4, 2020
treat resource not found and illegal argument exceptions as irrecoverable error

relates #50135
@sophiec20
Copy link
Contributor Author

I suggest to close this issue and - if needed - open separate follow up issues for further enhancements.

Agreed. Appropriate steps have been taken to improve error handling, so closing this ticket.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:ml/Transform Transform
Projects
None yet
Development

No branches or pull requests

4 participants