[ML] Improve error handling when using transforms with inference processors #50135

sophiec20 · 2019-12-12T12:24:52Z

Found in 7.6.0-SNAPSHOT

It is possible to create and update a transform that uses a pipeline that does not exist.

For example:

POST _transform/transform-01/_update
{
  "dest": {
    "index" : "transform-01",
    "pipeline" : "this-does-not-exist"
  }
}

Returns acknowledged: true

If you start this transform, then we log audit messages saying:
Transform encountered an exception: org.elasticsearch.xpack.transform.transforms.ClientTransformIndexer$BulkIndexingException: Bulk index experienced failures. See the logs of the node running the transform for details. Will attempt again at next scheduled trigger.

And by looking in the logs on disk, we see
[2019-12-12T11:49:24,655][DEBUG][o.e.a.b.T.BulkRequestModifier] [node3] failed to execute pipeline [_none] for document [transform-01/_doc/MTWq3Ia7HoVvarOSxUzsmioAAAAAAAAA] java.lang.IllegalArgumentException: pipeline with id [this-does-not-exist] does not exist

This continuous transform failed after several minutes with task encountered more than 10 failures; latest failure:

Secondly, if the pipeline refers to a model that does not exist, then we have similar audit messages asking you to look in the log files, which have the error:
Caused by: org.elasticsearch.ResourceNotFoundException: Could not find trained model [this-model-does-not-exist]

This continuous transform also failed after several minutes with task encountered more than 10 failures; latest failure:

Should we:

Validate a pipeline exists when creating or updating a transform
Validate that a model exists when creating a inference processor pipeline
(although reindex does not do this.)

Regardless of whether we validate the pipeline and/or model exists, we should surface the errors as audit messages so that the user does not have to look at log files on disk.

(note,
if you use reindex with a pipeline that does not exist, it returns all docs as 400 failures

...
"failures" : [
    {
      "index" : "kibana_sample_data_flights_copy",
      "type" : "_doc",
      "id" : "3TDI6m4BXY3wzdkLGbWl",
      "cause" : {
        "type" : "illegal_argument_exception",
        "reason" : "pipeline with id [pipeline-does-not-exist] does not exist"
      },
      "status" : 400
    },
...

if you use reindex with a pipeline that exists but refers to a model that does not exist, it returns all docs as 404 failures

...
"failures" : [
    {
      "index" : "kibana_sample_data_flights_copy2",
      "type" : "_doc",
      "id" : "3jDI6m4BXY3wzdkLGbWl",
      "cause" : {
        "type" : "resource_not_found_exception",
        "reason" : "Could not find trained model [model-does-not-exist]"
      },
      "status" : 404
    },
...

cc @hendrikmuhs @benwtrent

The text was updated successfully, but these errors were encountered:

elasticmachine · 2019-12-12T12:24:54Z

Pinging @elastic/ml-core (:ml/Transform)

benwtrent · 2019-12-12T12:33:49Z

Validate a pipeline exists when creating or updating a transform

Yes, 100%, I am surprised that this is not the case now.

Validate that a model exists when creating a inference processor pipeline
(although reindex does not do this.)

At the inference processor level, this is not currently possible. When pipelines are created, there are no opportunities to asynchronously check an index. And we cannot pause the thread that is doing the creation as that effects the stack as a whole.

The transform might be able to do this in a round about way. But that is conflating two different features.

Transforms might be able to check the first couple of docs at _start and prevent it starting if the first handful of docs fails?

hendrikmuhs · 2019-12-12T12:58:53Z

In addition to validation, error handling of the bulk index failure should be improved, related to #50122.

We already do this for script errors and circuit breaker exceptions.

After that fix, such an error would immediately set the transform to failed without going into the retry loop.

If a pipeline referenced by a transform does not exist, we should not allow the transform to be created. We do allow the pipeline existence check to be skipped with defer_validations, but if the pipeline still does not exist on `_start`, the pipeline will fail to start. relates: #50135

If a pipeline referenced by a transform does not exist, we should not allow the transform to be created. We do allow the pipeline existence check to be skipped with defer_validations, but if the pipeline still does not exist on `_start`, the pipeline will fail to start. relates: elastic#50135

If a pipeline referenced by a transform does not exist, we should not allow the transform to be created. We do allow the pipeline existence check to be skipped with defer_validations, but if the pipeline still does not exist on `_start`, the pipeline will fail to start. relates: #50135

If a pipeline referenced by a transform does not exist, we should not allow the transform to be created. We do allow the pipeline existence check to be skipped with defer_validations, but if the pipeline still does not exist on `_start`, the pipeline will fail to start. relates: elastic#50135

…able error relates elastic#50135

treat resource not found and illegal argument exceptions as irrecoverable error relates #50135

hendrikmuhs · 2020-02-04T09:44:32Z

Enhancements made:

check pipeline at validation (put, start, preview)
set the transform to failed if pipeline disappears at runtime
set the transform failed if inference pipeline processor errors with resource not found

errors are booth in _stats and as audits.

I wonder what is left? As @benwtrent mentioned we could fail at start if the pipeline is broken. However, I would use the _simulate API of ingest and add this to validation.

@sophiec20 anything else?

I suggest to close this issue and - if needed - open separate follow up issues for further enhancements.

treat resource not found and illegal argument exceptions as irrecoverable error relates #50135

sophiec20 · 2020-02-04T09:50:56Z

I suggest to close this issue and - if needed - open separate follow up issues for further enhancements.

Agreed. Appropriate steps have been taken to improve error handling, so closing this ticket.

sophiec20 added the :ml/Transform Transform label Dec 12, 2019

benwtrent mentioned this issue Jan 7, 2020

[Transform] fail to start/put on missing pipeline #50701

Merged

This was referenced Feb 3, 2020

[meta] 7.6 release elastic/elasticsearch-net#4340

Closed

[meta] 7.6 release elastic/elasticsearch-net#4341

Closed

hendrikmuhs pushed a commit to hendrikmuhs/elasticsearch that referenced this issue Feb 3, 2020

treat resource not found and illegal argument exceptions as irrecover…

864d9b4

…able error relates elastic#50135

hendrikmuhs mentioned this issue Feb 3, 2020

[Transform] improve irrecoverable error detection #51820

Merged

hendrikmuhs pushed a commit that referenced this issue Feb 4, 2020

[Transform] improve irrecoverable error detection

12a1f44

treat resource not found and illegal argument exceptions as irrecoverable error relates #50135

hendrikmuhs pushed a commit that referenced this issue Feb 4, 2020

[Transform] improve irrecoverable error detection

5d5f3ce

treat resource not found and illegal argument exceptions as irrecoverable error relates #50135

sophiec20 closed this as completed Feb 4, 2020

codebrain mentioned this issue Apr 1, 2020

7.7.0 meta ticket (Part 3) elastic/elasticsearch-net#4534

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[ML] Improve error handling when using transforms with inference processors #50135

[ML] Improve error handling when using transforms with inference processors #50135

sophiec20 commented Dec 12, 2019

elasticmachine commented Dec 12, 2019

benwtrent commented Dec 12, 2019

hendrikmuhs commented Dec 12, 2019

hendrikmuhs commented Feb 4, 2020

sophiec20 commented Feb 4, 2020

[ML] Improve error handling when using transforms with inference processors #50135

[ML] Improve error handling when using transforms with inference processors #50135

Comments

sophiec20 commented Dec 12, 2019

elasticmachine commented Dec 12, 2019

benwtrent commented Dec 12, 2019

hendrikmuhs commented Dec 12, 2019

hendrikmuhs commented Feb 4, 2020

sophiec20 commented Feb 4, 2020