Unexpected number of Get workflowtemplates API calls #11134

Closed
Guillermogsjc opened this issue May 25, 2023 · 23 comments · Fixed by #11343
Labels
area/controller, area/workflow-templates, P2, type/bug

Comments

Guillermogsjc commented May 25, 2023

Hi all,

I'm just wondering about the calls from the workflow-controller that get workflow templates.

I'll give the data first and then state the question/proposal:

kubectl get wftmpl -n argo -o name  | wc -l

gives 35 live templates (I do not have cluster workflow templates). Some of those 35 templates are nested: some are DAGs whose tasks call other templates that also contain DAGs, with a maximum nesting depth of 2.

When I look at the workflow-controller logs, I see this:

cat logs-from-workflow-controller-in-workflow-controller-67bb779df-254qg.log | \
    rg "Get workflowtemplates" | head -n 1
cat logs-from-workflow-controller-in-workflow-controller-67bb779df-254qg.log | \
    rg "Get workflowtemplates" | tail -n 1

gives

time="2023-05-25T21:56:28.052Z" level=info msg="Get workflowtemplates 200"
time="2023-05-25T22:20:48.421Z" level=info msg="Get workflowtemplates 200"

and

cat logs-from-workflow-controller-in-workflow-controller-67bb779df-254qg.log | \
    rg "Get workflowtemplates" | wc -l

gives 31229 calls.

So, in about 24 minutes, there have been roughly 31K calls to the K8S API to get workflowtemplates (around 21 calls per second), in a deployment that has 35 live workflow templates. That seems excessive for a K8S object that is updated at much longer intervals (depending on the use case), and it certainly looks like too many calls for 35 templates that are static throughout those 24 minutes (in my case).

Proposal (assuming my guesses are correct):

Would it be possible to cache those calls inside the workflow-controller, in a goroutine that fetches the templates and stores them in shared variables, so that the other goroutines consult this shared workflow template information instead of hammering the K8S API so much?

A flag like --workflow-template-cache 0s for the workflow-controller would be very nice; with it, each deployment could decide how much of the limited K8S API call budget is spent on refreshing workflow templates.
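For illustration, a minimal sketch of that kind of cache, assuming a hypothetical fetch function that wraps the real workflowtemplates list call and a refresh interval coming from the proposed --workflow-template-cache flag (this is not the actual controller code):

package templatecache

import (
	"sync"
	"time"
)

// WorkflowTemplate stands in for the real v1alpha1.WorkflowTemplate type.
type WorkflowTemplate struct {
	Name string
	Spec string
}

// Cache holds the shared copy of all WorkflowTemplates.
type Cache struct {
	mu        sync.RWMutex
	templates map[string]WorkflowTemplate
}

// RefreshLoop is the single goroutine that talks to the API server.
// fetch is a hypothetical function wrapping the real workflowtemplates
// list call; interval would come from --workflow-template-cache.
func (c *Cache) RefreshLoop(interval time.Duration, fetch func() ([]WorkflowTemplate, error), stop <-chan struct{}) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		if tmpls, err := fetch(); err == nil {
			next := make(map[string]WorkflowTemplate, len(tmpls))
			for _, t := range tmpls {
				next[t.Name] = t
			}
			c.mu.Lock()
			c.templates = next
			c.mu.Unlock()
		}
		select {
		case <-stop:
			return
		case <-ticker.C:
		}
	}
}

// Get is what the worker goroutines would call instead of hitting the K8S API.
func (c *Cache) Get(name string) (WorkflowTemplate, bool) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	t, ok := c.templates[name]
	return t, ok
}

With something like this, the workflowtemplates traffic becomes one list per refresh interval instead of one call per reconciliation.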

Here are the logs -> logs-from-workflow-controller-in-workflow-controller-67bb779df-254qg.log

Here is the workflow-controller ConfigMap:
data:
  resourceRateLimit: |
    limit: 30
    burst: 5
  parallelism: "1000"
  namespaceParallelism: "1000"
  nodeEvents: |
    enabled: false
  workflowDefaults: |
    spec:
      podGC:
        strategy: OnPodCompletion
      ttlStrategy:
        secondsAfterCompletion: 21600
        secondsAfterSuccess: 21600
        secondsAfterFailure: 21600
  artifactRepository: |
    archiveLogs: true
    s3:
      bucket: ${S3_ARGO_BUCKET}
      endpoint: s3.amazonaws.com
  persistence: |
    connectionPool:
      maxIdleConns: 100
      maxOpenConns: 0
      connMaxLifetime: 10000s
    nodeStatusOffLoad: true
    archive: true
    archiveTTL: 90d
    skipMigration: false
    postgresql:
      host: argo-server-postgresql-backend.argo
      port: 5432
      database: postgres
      tableName: argo_workflows
      # the database secrets must be in the same namespace of the controller
      userNameSecret:
        name: argo-postgres-config
        key: username
      passwordSecret:
        name: argo-postgres-config
        key: password
      ssl: false
      # sslMode must be one of: disable, require, verify-ca, verify-full
      # you can find more information about those ssl options here: https://godoc.org/github.com/lib/pq
      sslMode: disable
Here is the workflow-controller deployment patch:
spec:
  revisionHistoryLimit: 3
  template:
    spec:
      nodeSelector:
        nodegroup_name: core
      containers:
        - name: workflow-controller
          command:
            - workflow-controller
            - --qps=30
            - --burst=60
            - --workflow-ttl-workers=8
            - --workflow-workers=50
          env:
            - name: DEFAULT_REQUEUE_TIME
              value: 1m
          resources:
            requests:
              memory: "6100Mi"
              cpu: "1700m"
            limits:
              memory: "6100Mi"
              cpu: "1700m"
@Guillermogsjc Guillermogsjc added the type/feature Feature request label May 25, 2023
@caelan-io caelan-io added type/bug area/controller Controller issues, panics area/workflow-templates P3 Low priority and removed type/feature Feature request labels Jun 21, 2023
caelan-io (Member) commented:

Sounds like a bug given the high number of API calls. Updated to type/bug so we can track and eventually diagnose the issue.

@juliev0 juliev0 added P2 Important. All bugs with >=3 thumbs up that aren’t P0 or P1, plus: Any other bugs deemed important and removed P3 Low priority labels Jun 29, 2023

TapTap21 commented Jul 7, 2023

I am also facing this issue; it actually causes the UI to time out on workflow submission, so I can only trigger workflows from the CLI. Any news on this issue?

Guillermogsjc (Author) commented Jul 10, 2023

Please consider this bug high severity.

This phenomenon is probably consuming the workflow controller's Kubernetes API QPS quota in such a way that cron workflows are sometimes not launched. In any system, reliability problems in the cron daemon are catastrophic; the cron daemon needs to be as reliable as the sun rising in the morning.

terrytangyuan (Member) commented:

Could you test the tag dev-fix-list-load to see if the number of API calls decreased?

TapTap21 commented:
@terrytangyuan I'll test as well. Can I simply change the images.tag value in the Helm chart to deploy that version?

terrytangyuan (Member) commented:

I think so
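
For reference, with the community argo-workflows Helm chart that would be a values override along these lines (images.tag is the key mentioned above; exact key names depend on the chart version):

# values.yaml override (sketch)
images:
  tag: dev-fix-list-load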

TapTap21 commented:

The logs don't show much in my case. When I trigger the CronWorkflow, it times out and returns a 504 after 60s. The logs show:
time="2023-07-10T15:12:16.284Z" level=info duration=59.999768992s method=POST path=/api/v1/workflows/argo-workflows-data-cron/submit size=39 status=408

So there's not much to go on.

terrytangyuan (Member) commented:

What CronWorkflow are you referring to? Perhaps create a separate issue to discuss? This issue is for workflowtemplates API calls.

Guillermogsjc (Author) commented:

Hi @terrytangyuan, yes, sorry for the confusion. In our case we think this flooding could be related to missed workflow launches from the cron workflow controller, but it may not be. Just forget about it; as you stated, this is not the correct issue to discuss it in.

Guillermogsjc (Author) commented:

Could you test the tag dev-fix-list-load to see if the number of API calls decreased?

Yes, in fact it disappeared completely. Was there a change in the logging level?

logs-from-workflow-controller-in-workflow-controller-84cdcd9f75-ct8sc (2).log

terrytangyuan (Member) commented:

What disappeared? Could you be more specific?

Guillermogsjc (Author) commented Jul 12, 2023

Hi @terrytangyuan,

cat logs-from-workflow-controller-in-workflow-controller-84cdcd9f75-ct8sc\ \(2\).log | rg "Get workflowtemplates" now gives 0 lines.

As far as I can check in the changes for this image here, the logging level for the call did not change. This, together with the fact that the dev-fix-list-load image tag is working, leads us to conclude that all those flooding "Get workflowtemplates" calls were superfluous?

Kind regards.

terrytangyuan (Member) commented:

Is everything else working as expected now? Do you see other issues?

terrytangyuan (Member) commented:

No logging changes. Those calls may not be necessary.

terrytangyuan (Member) commented:

Sent a PR for this. #11343

Please continue monitoring it and let us know if you see additional issues.

Guillermogsjc (Author) commented:

I haven't managed to find anything weird or any errors so far after deploying that image, but don't take this as a definitive test, I guess.

Thanks @terrytangyuan :)

@terrytangyuan terrytangyuan changed the title Get workflowtemplates API call Unexpected number of Get workflowtemplates API calls Jul 12, 2023
terrytangyuan added a commit that referenced this issue Jul 13, 2023
Signed-off-by: Yuan Tang <terrytangyuan@gmail.com>
juliev0 (Contributor) commented Jul 13, 2023

@Guillermogsjc I believe I might be seeing the same issue as you, as far as the CronWorkflow not firing the Workflow and also a lot of calls to "GET WorkflowTemplate". But I'm actually not seeing how this fix resolves it. I see that the CronWorkflow makes that call as part of its validation logic here, but it appears not to use the WorkflowTemplate Informer that Terry changed.

Guillermogsjc (Author) commented Jul 14, 2023

Hi @juliev0,

Our issue with cron workflow launching was assisted by @tico24 here on Slack.

In a cron where we have 200 concurrent launches, we observed that some of them were missed at launch time (with nothing in the logs indicating a launch failure). This only happened for cron launches; manually launching 200 parallel workflows worked fine. So we guessed that the workflow-controller pod was being throttled by the K8S API and the workflows were dropped without proper logging about it. Our guess is that this throttling was produced in part by the unnecessary, massive "Get workflowtemplates" calls that @terrytangyuan just corrected.

After Tim's advice to follow the scaling good practices (and here), everything worked nicely again, and we have seen no more missed cron workflow launches so far. Workflows now launch almost 14 minutes after the cron time (mostly because the creation rate limits and requeue times were increased), but every cron workflow is triggered properly.

This, of course, does not mean that the "trigger missed from cron workflow" issue is totally resolved or that it is related to this issue. It would be good to keep an eye on it if you have reproduced this missed-launch behaviour, as missing cron workflow launches is a critical situation. If I see this behaviour again, I will open a new issue with logs and information about it.

juliev0 (Contributor) commented Jul 17, 2023

@Guillermogsjc Thanks for the response. I think the change Terry put in only appeared to resolve this. I'm not sure which version you were running before, but it seems another change was put in recently that removed logging of all K8S API responses. At least when I ran with a version of master from early June, I no longer saw all of those API response logs like before.

I think we need to use the WorkflowTemplate informer (the cache, like you said) when the CronWorkflow needs to validate the WorkflowTemplate, instead of making the GET calls. I will open an enhancement issue for that and, if I have time, work on it.
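
A minimal sketch of that direction, assuming the generated informer factory and lister follow the usual client-go code-generation layout (package paths and method names here are from memory and may differ from the real controller wiring):

package main

import (
	"context"

	wfclientset "github.com/argoproj/argo-workflows/v3/pkg/client/clientset/versioned"
	wfinformers "github.com/argoproj/argo-workflows/v3/pkg/client/informers/externalversions"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	wfClient := wfclientset.NewForConfigOrDie(cfg)

	ctx := context.Background()
	namespace := "argo"

	// The shared informer keeps a local cache of WorkflowTemplates in sync via watch.
	factory := wfinformers.NewSharedInformerFactoryWithOptions(
		wfClient, 0, wfinformers.WithNamespace(namespace))
	tmplLister := factory.Argoproj().V1alpha1().WorkflowTemplates().Lister()
	factory.Start(ctx.Done())
	factory.WaitForCacheSync(ctx.Done())

	// Cache-backed lookup: no GET against the API server per CronWorkflow validation.
	tmpl, err := tmplLister.WorkflowTemplates(namespace).Get("my-template")
	if err != nil {
		panic(err)
	}
	_ = tmpl
}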

juliev0 (Contributor) commented Jul 17, 2023

Also, I see in your original log:
I0525 08:49:01.033669 1 request.go:601] Waited for 1.024306564s due to client-side throttling, not priority and fairness, request: PATCH:https://10.100.0.1:443/apis/argoproj.io/v1alpha1/namespaces/argo/cronworkflows/parallelism-test-cron-wf-44
This is the Kubernetes client locally throttling the request. Until we start using the Informer, you need to increase the --qps and --burst settings passed to the workflow controller to allow more requests per second. The call shown above (PATCH) is used in the CronWorkflow code to update the Workflow's status both when the Workflow starts running and when it's done.
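
(For context, those flags map onto client-go's client-side rate limiter on the controller's REST config. A standalone sketch of what raising them does, not the controller's actual flag handling:)

package main

import (
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/clientcmd"
)

// buildConfig raises the client-side limits whose saturation produces the
// "Waited for ... due to client-side throttling" message shown above.
func buildConfig(kubeconfig string, qps float32, burst int) (*rest.Config, error) {
	cfg, err := clientcmd.BuildConfigFromFlags("", kubeconfig)
	if err != nil {
		return nil, err
	}
	cfg.QPS = qps     // e.g. 50, matching --qps=50
	cfg.Burst = burst // e.g. 100, matching --burst=100
	return cfg, nil
}

func main() {
	if _, err := buildConfig(clientcmd.RecommendedHomeFile, 50, 100); err != nil {
		panic(err)
	}
}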

I also see this line:

time="2023-05-25T09:05:31.996Z" level=info msg=healthz age=5m0s err="workflow never reconciled: b12-34-1685001600-anl" instanceID= labelSelector="!workflows.argoproj.io/phase,!workflows.argoproj.io/controller-instanceid" managedNamespace=

This is your Controller basically failing the health check here, because there are Workflows in the system that have never been seen by the Controller even once. Maybe the throttling (and possibly the retries from throttling) is somehow blocking the goroutines that process these Workflows?

juliev0 (Contributor) commented Jul 17, 2023

Added #11372

terrytangyuan (Member) commented Jul 17, 2023

I’m not sure which version you were running before but it seems like another change was put in recently to remove logging of all K8S API responses.

You are right; this commit might have slipped into the image I built: c6862e9

Could you perhaps change loglevel to DEBUG and try again? Apologies for the confusion.

Guillermogsjc (Author) commented Jul 18, 2023

Hi @terrytangyuan, I tried again with debug on and the dev-fix-list-load image.

In the attached example there is a massive burst of Get workflowtemplates calls at "2023-07-18T10:14", when 93 cron workflows have to be launched. I guess that #11372 will greatly reduce the K8S API burden of the cron workflow controller at launch time.

Kubernetes v1.27 (so affected by the QPS and burst bump)
Argo v3.4.8 with the workflow-controller tag specified above

The workflow controller is launched as:

 - workflow-controller
 - '--qps=50'
 - '--burst=100'
 - '--workflow-ttl-workers=8'
 - '--workflow-workers=100'
 - '--loglevel=debug'

All 93 workflows were launched happily 5 minutes later at 10:19 (k8s v1.27 rocks here).

logs-from-workflow-controller-in-workflow-controller-76ffd854c4-fk8lt (1).log

terrytangyuan added a commit that referenced this issue Jul 19, 2023
Signed-off-by: Yuan Tang <terrytangyuan@gmail.com>
dpadhiar pushed a commit to dpadhiar/argo-workflows that referenced this issue May 9, 2024
…argoproj#11343)

Signed-off-by: Yuan Tang <terrytangyuan@gmail.com>
Signed-off-by: Dillen Padhiar <dillen_padhiar@intuit.com>