Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] webApi readRateLimiter settings are not respected #5202

Closed
2 tasks done
rambrus opened this issue Apr 9, 2024 · 1 comment · Fixed by #5228
Closed
2 tasks done

[BUG] webApi readRateLimiter settings are not respected #5202

rambrus opened this issue Apr 9, 2024 · 1 comment · Fixed by #5228
Assignees
Labels
bug Something isn't working

Comments

@rambrus
Copy link

rambrus commented Apr 9, 2024

Describe the bug

We are using Flyte with Databricks plugin (GO) with the following config:

databricks:
  enabled: true
  upload_entrypoint: true
  plugin_config:
    plugins:
      databricks:
        databricksInstance: <!--- set to our DBX instance --->
        # this is the entrypoint.py for flyte on databricks
        entrypointFile: <!--- set to our DBX entrypoint file location --->
        # this file is mounted by vault agent injector at /vault/secrets
        databricksTokenKey: <!--- set to our DBX token key --->
        webApi:
          caching:
            maxSystemFailures: 5
            resyncInterval: 60s #default value is 30s!
            size: 500000
            workers: 10
          readRateLimiter:
            burst: 20 #default value is 100!
            qps: 10
          resourceMeta: null
          resourceQuotas:
            default: 1000
          writeRateLimiter:
            burst: 20 #default value is 100!
            qps: 10

Databricks API uses a very low API rate limit for job polling (100), Flyte tasks often run into HTTP 429 - Too Many Requests response.
We did some code analysis and found that Flyte triggers the calls to DBX, ignoring the readRateLimiter settings.
We identified a potentional root cause (workqueue.DefaultControllerRateLimiter() was used in cache.go), @kumare3 created a fix in #5190

We have tested the fix, unfortunately it did not resolve the issue.

  • readRateLimiter settings - qps: 10, burst: 20
  • tested workflow - dynamic workflow with 400 tasks, max-parallelism: 400
  • job polling requests peaked at 250 req/sec (see the attached screenshot)

Expected behavior

Job polling request rate never exceeds the preconfigured burst value (20).

Additional context to reproduce

No response

Screenshots

image

Are you sure this issue hasn't been raised already?

  • Yes

Have you read the Code of Conduct?

  • Yes
@rambrus rambrus added bug Something isn't working untriaged This issues has not yet been looked at by the Maintainers labels Apr 9, 2024
Copy link

dosubot bot commented Apr 9, 2024

Thank you for providing detailed information about the issue you're facing. It seems like the readRateLimiter settings in your Flyte configuration for the Databricks plugin are not being respected, causing the job polling requests to exceed the configured burst value.

From your description, it appears that the rate limiting logic might not be correctly implemented in the Databricks plugin's HTTP request handling. Specifically, the sendRequest method, which is responsible for making HTTP requests to the Databricks API, might not be using the rate limiter correctly.

To address this issue, you could consider integrating a rate limiter directly into the sendRequest method. Here's an example of how you could modify the Plugin struct to include a rate limiter, initialize it in the newDatabricksJobTaskPlugin function, and use it in the sendRequest method:

  1. Modify the Plugin struct to include a rate limiter:
import (
    "golang.org/x/time/rate"
    // other imports
)

type Plugin struct {
    metricScope       promutils.Scope
    cfg               *Config
    client            HTTPClient
    readRateLimiter   *rate.Limiter // Add this line
}
  1. Initialize the rate limiter in the newDatabricksJobTaskPlugin function:
func newDatabricksJobTaskPlugin() webapi.PluginEntry {
    return webapi.PluginEntry{
        ID: "databricks",
        SupportedTaskTypes: []core.TaskType{"spark"},
        PluginLoader: func(ctx context.Context, iCtx webapi.PluginSetupContext) (webapi.AsyncPlugin, error) {
            cfg := GetConfig()
            // Assuming cfg.ReadQPS and cfg.ReadBurst are the configured QPS and burst values.
            readRateLimiter := rate.NewLimiter(rate.Limit(cfg.ReadQPS), cfg.ReadBurst)
            return &Plugin{
                metricScope:     iCtx.MetricsScope(),
                cfg:             cfg,
                client:          &http.Client{},
                readRateLimiter: readRateLimiter, // Initialize the rate limiter here
            }, nil
        },
    }
}
  1. Use the rate limiter in the sendRequest method:

Before making the HTTP request in the sendRequest method, wait for permission from the rate limiter:

func (p Plugin) sendRequest(method string, databricksJob map[string]interface{}, token string, runID string) (map[string]interface{}, error) {
    // Wait for permission from the rate limiter
    err := p.readRateLimiter.Wait(context.Background())
    if err != nil {
        return nil, fmt.Errorf("rate limiter error: %v", err)
    }

    // Existing implementation of sendRequest continues here...
}

By integrating a rate limiter as shown above, you can control the rate of HTTP requests made to the Databricks API, ensuring that the readRateLimiter settings are respected and preventing the job polling requests from exceeding the configured limits.

Please note that this is a suggested solution based on the information provided and the current state of the Flyte codebase. It's always a good idea to thoroughly test any changes in a controlled environment before deploying them to production. If you have any further questions or need additional assistance, feel free to ask.

Sources

About Dosu

This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.

Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.

If you want to continue the conversation, start your reply with @dosu-bot.

@Future-Outlier Future-Outlier changed the title [BUG] webApi redRateLimiter settings are not respected [BUG] webApi readRateLimiter settings are not respected Apr 10, 2024
@Future-Outlier Future-Outlier removed the untriaged This issues has not yet been looked at by the Maintainers label Apr 10, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants