Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] Flyte DBX (GO) plugin does not handle HTTP 429 response #5203

Closed
2 tasks done
rambrus opened this issue Apr 9, 2024 · 6 comments · Fixed by #5226
Closed
2 tasks done

[BUG] Flyte DBX (GO) plugin does not handle HTTP 429 response #5203

rambrus opened this issue Apr 9, 2024 · 6 comments · Fixed by #5226
Assignees
Labels
bug Something isn't working

Comments

@rambrus
Copy link

rambrus commented Apr 9, 2024

Describe the bug

We are using Flyte with Databricks plugin (GO) with the following config:

databricks:
  enabled: true
  upload_entrypoint: true
  plugin_config:
    plugins:
      databricks:
        databricksInstance: <!--- set to our DBX instance --->
        # this is the entrypoint.py for flyte on databricks
        entrypointFile: <!--- set to our DBX entrypoint file location --->
        # this file is mounted by vault agent injector at /vault/secrets
        databricksTokenKey: <!--- set to our DBX token key --->
        webApi:
          caching:
            maxSystemFailures: 5
            resyncInterval: 60s #default value is 30s!
            size: 500000
            workers: 10
          readRateLimiter:
            burst: 20 #default value is 100!
            qps: 10
          resourceMeta: null
          resourceQuotas:
            default: 1000
          writeRateLimiter:
            burst: 20 #default value is 100!
            qps: 10

Databricks API uses a very low API rate limit for job polling (100), Flyte tasks often run into HTTP 429 - Too Many Requests response.
We have done some code analysis and found out that HTTP 429 is not handled properly in DBX plugin.
The plugin expects Databricks to return a JSON response in all cases, no matter what the response code is.
Actually, Databricks API only returns JSON response in case of HTTP 200, in other cases (e.g. HTTP 429) a non-json response is returned. DBX plugin tries to parse the response in every cases, but fails with unexpected end of JSON input (see attached screenshot).

Expected behavior

Flyte DBX plugin handles HTTP 429 reponse properly (skip status task status update)

Additional context to reproduce

No response

Screenshots

No response

Are you sure this issue hasn't been raised already?

  • Yes

Have you read the Code of Conduct?

  • Yes
@rambrus rambrus added bug Something isn't working untriaged This issues has not yet been looked at by the Maintainers labels Apr 9, 2024
Copy link

dosubot bot commented Apr 9, 2024

Hey there @rambrus! Good to see you back. 😊 I remember our last cosmic dance with the code. I'm diving into this issue now and will circle back with some insights soon!

@rambrus
Copy link
Author

rambrus commented Apr 9, 2024

Screenshot:
image

Copy link

dosubot bot commented Apr 9, 2024

Thank you for bringing this to our attention. It seems like you're encountering two issues: one related to the handling of HTTP 429 responses and another related to an HTTP 401 Unauthorized error.

For the first issue, the Flyte Databricks (DBX) plugin currently attempts to parse the response body for all HTTP status codes, which is causing the "unexpected end of JSON input" error when a non-JSON response is returned for HTTP 429 responses. To address this, the plugin could be updated to handle HTTP 429 responses by skipping the task status update instead of attempting to parse the non-JSON response. This could be done by modifying the sendRequest function in the plugin code to check for HTTP 429 status codes and handle them accordingly. Here's an example of how this could be done:

if resp.StatusCode == http.StatusTooManyRequests {
    // Log the rate limit issue, skip the update, or implement retry logic
    return nil, nil // or a specific error indicating to skip the update
}

For the second issue, the plugin currently does not handle HTTP 401 Unauthorized responses differently from other types of errors. When the sendRequest function encounters an HTTP status code that is not http.StatusOK (200), it generalizes the error handling by returning a formatted error message that includes the HTTP method and the error message received from the Databricks API, if available. This could be improved by adding specific handling for HTTP 401 errors to provide more informative error messages or to implement a retry mechanism with refreshed authentication credentials, if applicable.

Regarding the CORS-related headers mentioned in the log output, this could be a separate issue related to the configuration of the Databricks API or the network environment in which your application is running. It would be helpful to have more information about the specific CORS error message and the network configuration to provide a more accurate diagnosis.

I hope this helps address your concerns. If you have any further questions or need more clarification, please don't hesitate to ask.

Sources

About Dosu

This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.

Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.

If you want to continue the conversation, start your reply with @dosu-bot.

@Future-Outlier Future-Outlier removed the untriaged This issues has not yet been looked at by the Maintainers label Apr 10, 2024
@rambrus
Copy link
Author

rambrus commented Apr 10, 2024

As I understand it, this block will never be reached, since responses other than HTTP 200 do not have a JSON body and json.Unmarshal(responseBody, &data) will fail with an unexpected end of JSON input error.

@rambrus
Copy link
Author

rambrus commented Apr 10, 2024

Reviewed the code, tried to understand the plugin design, my interpretation:

  • any HTTP response other than HTTP 200 results in node error (node status update skipped)
  • errors are counted per node
  • when node errors reach the webApi / caching / maxSystemFailures limit, node is terminated with a failure

This means that HTTP 429 is treated as an error, nodes are expected to fail when receiving HTTP 429 response maxSystemFailures-times.
Increasing maxSystemFailures might help us tolerate temporary load spikes.

Is that a correct assumption?

@Future-Outlier
Copy link
Member

Reviewed the code, tried to understand the plugin design, my interpretation:

  • any HTTP response other than HTTP 200 results in node error (node status update skipped)
  • errors are counted per node
  • when node errors reach the webApi / caching / maxSystemFailures limit, node is terminated with a failure

This means that HTTP 429 is treated as an error, nodes are expected to fail when receiving HTTP 429 response maxSystemFailures-times. Increasing maxSystemFailures might help us tolerate temporary load spikes.

Is that a correct assumption?

Yes I think is.
I've traced this plugin before, I didn't look it again, but you're probably right.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants