Skip to content
This repository has been archived by the owner on Feb 22, 2022. It is now read-only.

Handling batch submit/status failure for many tasks #210

Open
knagaitsev opened this issue Feb 22, 2021 · 7 comments
Open

Handling batch submit/status failure for many tasks #210

knagaitsev opened this issue Feb 22, 2021 · 7 comments

Comments

@knagaitsev
Copy link
Contributor

knagaitsev commented Feb 22, 2021

There is currently some inconsistent behavior with how /batch_status and /submit error handling is done if many tasks are being dealt with.

In the case of /batch_status, there is no real error handling. This route returns a list of task statuses that are requested for using the task_ids parameter. If a task is not found, it is added to this list with 'status': 'Failed', and if it succeeds it is added to the list with the desired data for that task. The route always responds with a success response, even if all the tasks are marked as 'Failed'.

In the case of /submit, there is some error handling, and the route will respond with a 4xx/5xx if any of the submitted tasks fail during submission. This means that even if some tasks were submitted successfully during the request, a failure will be sent back with no additional info if any task fails.

This is a bit tricky particularly for the /submit endpoint, because if a user submits a single task the ideal standard is good error readability. But since this single submission is internally just a batch submit, it needs to also maintain consistency with what happens when many tasks are submitted, where some fail and others succeed.

My proposal: If an internal error occurs preventing the request from being processed at all, or if status/submit fails for all of the tasks, a 4xx/5xx is sent back with an error. If a status/submit succeeds for one of the tasks, send back a 200, with a list of successes/failures for each individual task. For both status and submit, even if a task fails, proceed with the other tasks in the batch until they have all been tried.

The pros of this approach are readability for simple, single submission tasks. The cons of this are that it could be confusing that a 4xx/5xx is sent back if all tasks fail, but a 200 is sent back if some tasks fail.

Proposed Changes

The /submit route response would change from

{'status': 'Success',
'task_uuids': ['a', 'b'],
'task_uuid': ""}

(response code usually 200 unless some or all task launches fail)

to

{
'response': 'batch',
'results': [
  {
    'status': 'Success',
    'task_uuid': 'a',
    'http_status_code': 200
  },
  {
    'status': 'Failed',
    'code': 1,
    'task_uuid': 'b',
    'reason': 'human readable reason',
    'http_status_code': 4XX/5XX,
    ...
  },
  ...
]
}

(This response code would be a 207 HTTP multi-response since there were some successes and some fails. If everything was a success, it would be a 200. If an internal error occurred that made everything fail, it would be some 5XX)

When the funcx sdk receives such a batch response, it would store all the failed task submits in the local table, to be retrieved with get_result or get_batch_result. These failures would not be saved on the service side.

I think similar changes to the /batch_status route would be fitting, though they wouldn't need to be as drastic. Each status response object in the list would be an "http response object" of its own like above.

@yadudoc
Copy link
Contributor

yadudoc commented Feb 22, 2021

I agree with 200 on the full batch going through. On a partial failure, we could do 207: multi-status error code, more details here: https://evertpot.com/http/207-multi-status. In this situation, you'd try each item in the batch, and return a status code for each item that can be iterated on.

So we'd have:
200 -> All OK
207 -> Some or all tasks failed to submit, must iterate on the response to figure out what failed.
4XX/5XX -> Errors that block task submission entirely. No tasks could've been submitted.

@knagaitsev
Copy link
Contributor Author

@yadudoc This seems reasonable, my only concern is about this case: user does fxc.run() for a single task, the task fails due to something that would normally cause a 207 (such as EndpointNotFound for the particular task they sent), how should the failure be displayed in a user friendly way?

If the user had called batch_run, it would be reasonable for them to expect a multi-status response. But should we raise an Exception on the client sdk if we get a 207? And if so, should we try to display it in a generic way that works for both single and multi-task requests like Submission failed for the following tasks: {failed_task_ids}. Task submission failed for the following reason(s): {list of stringified exceptions from the error responses}. Maybe Exception names like BatchSubmitFailed and BatchStatusFailed, which can then contain all the response data.

Alternatively, maybe when we get a 207 back, we just want to start iterating through it and raise the first error that we come across in the list of responses. Like if a user had multiple EndpointNotFound errors for a batch run, it would just show them the first one, they would have to fix that, then it would show them the next one if they tried again. But I'm not sure that is a great approach unless there is some way to extract the response data out of the batch_run python API call so that the user can see all the results.

@yadudoc
Copy link
Contributor

yadudoc commented Feb 24, 2021

Handling partial failures is easier in fxc.run() for a single task because anything other than status_code:200 implies that the task submission failed, and the SDK ought to inspect the response to the appropriate exception. With a single task, you can immediately raise the exception with additional info placed into the traceback.

# This raises say, EndpointAccessForbidden which is contained in the 207 response 
fxc.run(args, endpoint_id='BAD', function_id=fn_id)

batch_run is a bit more problematic because it isn't quite clear at what point we'd raise the exception. If we raise an exception at batch_run and some tasks were launched, we have to be careful not to lose the task_ids. I'm inclined to just return task_ids for the one's that failed, with the exception added to the internal table so that, and we raise it when the task status is queried.

# We raise an exception for 4xx and 5xx errors.
task_uuids = fxc.batch_run(batch)

# For 207 multi-status, we set the results for the failed tasks with their corresponding exceptions
for tid in task_uuids:
     fxc.get_result(tid)   # This might raise EndpointAccessForbidden for tasks which failed to launch in the batch.

@knagaitsev
Copy link
Contributor Author

knagaitsev commented Feb 26, 2021

Some design thoughts for this approach:

  • On the service, auth_and_launch will need to save the task in redis with the Task constructor even if the task fails to launch, and needs to indicate the exact exception that made this task fail to launch (this must be different from the current Task.exception property which refers to an exception that occurred while running the task, from my understanding).
  • The funcx sdk error_handling_client will need to be modified to allow HTTP status 207 to pass through, where it can then be handled differently based on if fxc.run or fxc.batch_run is called.
  • fxc.run will need to be modified to raise an exception if there is a 207
  • fxc.batch_run will need to be modified to collect the task_ids from the collection of responses
  • the /tasks/<task_id> and /batch_status routes will be modified to understand the failed to launch Task data from above

the returned format for /submit will change from

{'status': 'Success',
'task_uuids': [],
'task_uuid': ""}

to:

[
  {
    'status': 'Success',
    'task_uuid': "",
    'http_status_code': 200
  },
  {
    'status': 'Failed',
    'code': 1,
    'task_uuid': "",
    'http_status_code': 4XX/5XX,
    ...
  },
  ...
]

@knagaitsev
Copy link
Contributor Author

We need to consider this case:

  • A user submits some tasks a and b in a script, but doesn't bother checking the tasks later with fxc.get_result(tid) in that same script. Task b fails because an invalid endpoint was provided for that task, so a 207 is sent back and fxc.batch_run returns the list of task ids (this response contains the EndpointNotFound for task b)
  • The user runs a new script that checks the results with fxc.get_result, but gets a TaskNotFound for task b instead of an EndpointNotFound, because the launch failure info was not saved on the service-side

This behavior is consistent with the current model that once the client gets a state once, that state is no longer saved on our end. If the user wanted to know that b failed with EndpointNotFound, they should've checked the results with fxc.get_result after submitting the task. This behavior may not be ideal, but it seems to be the best for scale and maintaining a model where state is not stored after it has been made known to the client. We should make it clear in documentation and examples that a user should always check the results of tasks after submitting using the same funcx client.

@knagaitsev
Copy link
Contributor Author

We also need to do more thinking about standardizing the format of task status objects. The current schema for task info from the service is pretty nice:

{
            'task_id': task_id,
            'status': task_status,
            'result': task_result,
            'completion_t': task_completion_t,
            'exception': task_exception
        }

though we should switch status to task_status I think. The messy part I feel is that on the sdk side, just a subset of these fields are saved in the sdk task data structure. I think it would be best to keep everything uniform and have an identical task status object on the sdk side when it is available.

Also, we need to think about if it is a bad idea to mix these with error objects, like when we are returning status info.

@knagaitsev
Copy link
Contributor Author

FuncX SDK Changes

Non-async mode for single task run (default)

run() returns a single task_id for a 200 response, raises the exception if an error is sent back during submission

async mode

run() returns a single future for a 200 response (need to discuss - should it raise exception if an error is sent back during submission, or should it add the exception to the returned future with set_exception in certain cases?)

batch_run() returns a list of futures corresponding to each submitted batch item for a 200/207 response. For a 4XX/5XX response, the exception is raised, as this indicates the entirety of the submission failed. If a task in the list fails to submit, the exception can be immediately attached to that future with set_exception and we don't need to add it to the async queue.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants