Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ML] Incorrect response when trying to stop a failed data frame transform without force=true #44103

Closed
dolaru opened this issue Jul 9, 2019 · 3 comments · Fixed by #44231
Closed
Assignees
Labels
>bug :ml/Transform Transform :ml Machine learning

Comments

@dolaru
Copy link
Member

dolaru commented Jul 9, 2019

Spotted in 7.3.0

When trying to stop a failed data frame transform without specifying force=true in query params, the status code and format of the response is not what you'd expect.

Currently the API responds with 200 and the body looks like this:


{
  "task_failures" : [ {
    "task_id" : 15302,
    "node_id" : "1TYVPNuERNe40f9tTRPVAQ",
    "status" : "CONFLICT",
    "reason" : {
      "type" : "status_exception",
      "reason" : "Unable to stop data frame transform [some_data_frame_id] as it is in a failed state with reason: [task encountered more than 10 failures; latest failure: Failed to retrieve checkpoint]. Use force stop to stop the data frame transform."
    }
  } ],
  "acknowledged" : false
}


In order to be consistent with the rest of the ML APIs, what we should be receiving instead should be a 409 and a body that looks like this:

{
    "error": {
        "root_cause": [
            {
                "type": "status_exception",
                "reason": "Unable to stop data frame transform [some_data_frame_id] as it is in a failed state. Use force stop to stop the data frame transform."
            }
        ],
        "type": "status_exception",
        "reason": "Unable to stop data frame transform [some_data_frame_id] as it is in a failed state. Use force stop to stop the data frame transform."
    },
    "status": 409
}

Steps to reproduce

  1. Create a data frame
  2. Start it and cause it to fail. (deleting the source index is one way to do it)
  3. After the task state becomes failed, send the _stop request without specifying force=true as a query parameter.
  4. Notice that the response status code and body format is not consistent with the usual error response format in ES APIs.
@dolaru dolaru added >bug :ml Machine learning :ml/Transform Transform labels Jul 9, 2019
@elasticmachine
Copy link
Collaborator

Pinging @elastic/ml-core

@davidkyle
Copy link
Member

The response format is not new to 7.3 it was like this in 7.2. _stop can take wild card arguments _all in which case multiple responses are expected some of which will be successful and others may fail. This is analogous to bulk index where some index ops may fail and others succeed.

If anything other than a 2xx status code is returned the Rest High Level client will not parse the response, see the comment in #44058.

@dolaru
Copy link
Member Author

dolaru commented Jul 9, 2019

The 200 response is misleading because it doesn't indicate that something didn't go as planned. You only get to find out that something went wrong if you actually parse the response and look at acknowledged.

I don't think the _bulk API is what we're looking to be consistent with. I'm referencing the behaviour of _close in anomaly detection APIs when force is not specified instead:

  • If _all or a pattern is used as the job ID, and one of the jobs that match are in a failed state, we don't touch any of the jobs and ask the user to use the force parameter
{
  "error": {
    "root_cause": [
      {
        "type": "status_exception",
        "reason": "one or more jobs have state failed, use force close"
      }
    ],
    "type": "status_exception",
    "reason": "one or more jobs have state failed, use force close"
  },
  "status": 409
}
  • If a single job is targeted with the _close request, and the job is in a failed state, we do the same... tell the user to use force.
{
  "error": {
    "root_cause": [
      {
        "type": "status_exception",
        "reason": "cannot close job [some_anomaly_detection_job] because it failed, use force close"
      }
    ],
    "type": "status_exception",
    "reason": "cannot close job [some_anomaly_detection_job] because it failed, use force close"
  },
  "status": 409
}

In my opinion, this is the correct behaviour. If a data frame transform that has the task state failed is targeted by the _stop request and force is not specified:

  • don't stop any of the matching transforms
  • return 409 asking the user to force stop

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
>bug :ml/Transform Transform :ml Machine learning
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants