Unexpected number of Get workflowtemplates API calls #11134

Closed
Guillermogsjc opened this issue May 25, 2023 · 23 comments · Fixed by #11343
Labels
area/controller, area/workflow-templates, P2, type/bug

Comments

Guillermogsjc commented May 25, 2023

Hi all,

I'm just wondering about the calls from the workflow-controller that get workflow templates.

I'll give the data first and then state the question/proposal:

kubectl get wftmpl -n argo -o name  | wc -l

gives 35 live templates (I do not have cluster workflow templates). Some of those 35 templates are nested: some are DAGs whose tasks call other templates that also contain DAGs, with a maximum nesting depth of 2.

When I look at the workflow-controller logs, I see this:

cat logs-from-workflow-controller-in-workflow-controller-67bb779df-254qg.log | \
    rg "Get workflowtemplates" | head -n 1
cat logs-from-workflow-controller-in-workflow-controller-67bb779df-254qg.log | \
    rg "Get workflowtemplates" | tail -n 1

gives

time="2023-05-25T21:56:28.052Z" level=info msg="Get workflowtemplates 200"
time="2023-05-25T22:20:48.421Z" level=info msg="Get workflowtemplates 200"

and

cat logs-from-workflow-controller-in-workflow-controller-67bb779df-254qg.log | \
    rg "Get workflowtemplates" | wc -l

gives 31229 calls.

So, in about 24 minutes, there have been roughly 31K calls to the K8S API to get workflowtemplates (around 21 calls per second), in a deployment that has 35 live workflow templates. That seems excessive for a K8S object that is updated at much longer intervals (depending on the use case), and it certainly looks like too many calls for 35 templates that are static throughout those 24 minutes (in my case).

Proposal (assuming my guesses are correct):

Would it be possible to cache those calls inside the workflow-controller, in a goroutine that fetches the templates and stores them in shared variables, so that the other goroutines consult this shared workflow template information instead of hammering the K8S API so much?

A flag like --workflow-template-cache 0s for the workflow-controller would be very nice; with it, each deployment could decide how much of the limited K8S API call budget is spent on refreshing workflow templates.
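For illustration, a minimal sketch of that kind of cache, assuming a hypothetical fetch function that wraps the real workflowtemplates list call and a refresh interval coming from the proposed --workflow-template-cache flag (this is not the actual controller code):

package templatecache

import (
	"sync"
	"time"
)

// WorkflowTemplate stands in for the real v1alpha1.WorkflowTemplate type.
type WorkflowTemplate struct {
	Name string
	Spec string
}

// Cache holds the shared copy of all WorkflowTemplates.
type Cache struct {
	mu        sync.RWMutex
	templates map[string]WorkflowTemplate
}

// RefreshLoop is the single goroutine that talks to the API server.
// fetch is a hypothetical function wrapping the real workflowtemplates
// list call; interval would come from --workflow-template-cache.
func (c *Cache) RefreshLoop(interval time.Duration, fetch func() ([]WorkflowTemplate, error), stop <-chan struct{}) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		if tmpls, err := fetch(); err == nil {
			next := make(map[string]WorkflowTemplate, len(tmpls))
			for _, t := range tmpls {
				next[t.Name] = t
			}
			c.mu.Lock()
			c.templates = next
			c.mu.Unlock()
		}
		select {
		case <-stop:
			return
		case <-ticker.C:
		}
	}
}

// Get is what the worker goroutines would call instead of hitting the K8S API.
func (c *Cache) Get(name string) (WorkflowTemplate, bool) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	t, ok := c.templates[name]
	return t, ok
}

With something like this, the workflowtemplates traffic becomes one list per refresh interval instead of one call per reconciliation.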

Here are the logs -> logs-from-workflow-controller-in-workflow-controller-67bb779df-254qg.log

Here is the workflow-controller ConfigMap:
data:
  resourceRateLimit: |
    limit: 30
    burst: 5
  parallelism: "1000"
  namespaceParallelism: "1000"
  nodeEvents: |
    enabled: false
  workflowDefaults: |
    spec:
      podGC:
        strategy: OnPodCompletion
      ttlStrategy:
        secondsAfterCompletion: 21600
        secondsAfterSuccess: 21600
        secondsAfterFailure: 21600
  artifactRepository: |
    archiveLogs: true
    s3:
      bucket: ${S3_ARGO_BUCKET}
      endpoint: s3.amazonaws.com
  persistence: |
    connectionPool:
      maxIdleConns: 100
      maxOpenConns: 0
      connMaxLifetime: 10000s
    nodeStatusOffLoad: true
    archive: true
    archiveTTL: 90d
    skipMigration: false
    postgresql:
      host: argo-server-postgresql-backend.argo
      port: 5432
      database: postgres
      tableName: argo_workflows
      # the database secrets must be in the same namespace of the controller
      userNameSecret:
        name: argo-postgres-config
        key: username
      passwordSecret:
        name: argo-postgres-config
        key: password
      ssl: false
      # sslMode must be one of: disable, require, verify-ca, verify-full
      # you can find more information about those ssl options here: https://godoc.org/github.com/lib/pq
      sslMode: disable
Here is the workflow-controller deployment patch:
spec:
  revisionHistoryLimit: 3
  template:
    spec:
      nodeSelector:
        nodegroup_name: core
      containers:
        - name: workflow-controller
          command:
            - workflow-controller
            - --qps=30
            - --burst=60
            - --workflow-ttl-workers=8
            - --workflow-workers=50
          env:
            - name: DEFAULT_REQUEUE_TIME
              value: 1m
          resources:
            requests:
              memory: "6100Mi"
              cpu: "1700m"
            limits:
              memory: "6100Mi"
              cpu: "1700m"
@Guillermogsjc Guillermogsjc added the type/feature Feature request label May 25, 2023
@caelan-io caelan-io added type/bug area/controller Controller issues, panics area/workflow-templates P3 Low priority and removed type/feature Feature request labels Jun 21, 2023
caelan-io (Member) commented:

Sounds like a bug given the high number of API calls. Updated to type/bug so we can track and eventually diagnose the issue.

@juliev0 juliev0 added P2 Important. All bugs with >=3 thumbs up that aren’t P0 or P1, plus: Any other bugs deemed important and removed P3 Low priority labels Jun 29, 2023

TapTap21 commented Jul 7, 2023

I am also facing this issue; it actually causes the UI to time out on workflow submission, so I can only trigger workflows from the CLI. Any news on this issue?

Guillermogsjc (Author) commented Jul 10, 2023

Please consider this bug high severity.

This phenomenon is probably consuming the workflow controller's Kubernetes API QPS quota in such a way that cron workflows are sometimes not launched. In any system, reliability problems in the cron daemon are catastrophic; the cron daemon needs to be as reliable as the sun rising in the morning.

terrytangyuan (Member) commented:

Could you test the tag dev-fix-list-load to see if the number of API calls decreased?

TapTap21 commented:
@terrytangyuan I'll test as well. Can I simply change the images.tag value in the Helm chart to deploy that version?

terrytangyuan (Member) commented:

I think so
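
For reference, with the community argo-workflows Helm chart that would be a values override along these lines (images.tag is the key mentioned above; exact key names depend on the chart version):

# values.yaml override (sketch)
images:
  tag: dev-fix-list-load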

TapTap21 commented:

The logs don't show much in my case. When I trigger the CronWorkflow, it times out and returns a 504 after 60s. The logs show:
time="2023-07-10T15:12:16.284Z" level=info duration=59.999768992s method=POST path=/api/v1/workflows/argo-workflows-data-cron/submit size=39 status=408

So there's not much to go on.

terrytangyuan (Member) commented:

What CronWorkflow are you referring to? Perhaps create a separate issue to discuss? This issue is for workflowtemplates API calls.

Guillermogsjc (Author) commented:

Hi @terrytangyuan, yes, sorry for the confusion. In our case we think this flooding could be related to missed workflow launches from the cron workflow controller, but it may not be. Just forget about it; as you stated, this is not the correct issue to discuss it in.

Guillermogsjc (Author) commented:

Could you test the tag dev-fix-list-load to see if the number of API calls decreased?

Yes, in fact it disappeared completely. Was there a change in the logging level?

logs-from-workflow-controller-in-workflow-controller-84cdcd9f75-ct8sc (2).log

terrytangyuan (Member) commented:

What disappeared? Could you be more specific?

Guillermogsjc (Author) commented Jul 12, 2023

Hi @terrytangyuan,

cat logs-from-workflow-controller-in-workflow-controller-84cdcd9f75-ct8sc\ \(2\).log | rg "Get workflowtemplates" now gives 0 lines.

As far as I can check in the changes for this image here, the logging level for the call did not change. This, together with the fact that the dev-fix-list-load image tag is working, leads us to conclude that all those flooding "Get workflowtemplates" calls were superfluous?

Kind regards.

terrytangyuan (Member) commented:

Is everything else working as expected now? Do you see other issues?

terrytangyuan (Member) commented:

No logging changes. Those calls may not be necessary.

terrytangyuan (Member) commented:

Sent a PR for this. #11343

Please continue monitoring it and let us know if you see additional issues.

Guillermogsjc (Author) commented:

I haven't managed to find anything weird or any errors so far after deploying that image, but don't take this as a definitive test, I guess.

Thanks @terrytangyuan :)

@terrytangyuan terrytangyuan changed the title Get workflowtemplates API call Unexpected number of Get workflowtemplates API calls Jul 12, 2023
terrytangyuan added a commit that referenced this issue Jul 13, 2023
Signed-off-by: Yuan Tang <terrytangyuan@gmail.com>
juliev0 (Contributor) commented Jul 13, 2023

@Guillermogsjc I believe I might be seeing the same issue as you, as far as the CronWorkflow not firing the Workflow and also a lot of calls to "GET WorkflowTemplate". But I'm actually not seeing how this fix resolves it. I see that the CronWorkflow makes that call as part of its validation logic here, but it appears not to use the WorkflowTemplate Informer that Terry changed.

Guillermogsjc (Author) commented Jul 14, 2023

Hi @juliev0,

Our issue with cron workflow launching was assisted by @tico24 here on Slack.

In a cron where we have 200 concurrent launches, we observed that some of them were missed at launch time (with nothing in the logs indicating a launch failure). This only happened for cron launches; manually launching 200 parallel workflows worked fine. So we guessed that the workflow-controller pod was being throttled by the K8S API and the workflows were dropped without proper logging about it. Our guess is that this throttling was produced in part by the unnecessary, massive "Get workflowtemplates" calls that @terrytangyuan just corrected.

After Tim's advice to follow the scaling good practices (and here), everything worked nicely again, and we have seen no more missed cron workflow launches so far. Workflows now launch almost 14 minutes after the cron time (mostly because the creation rate limits and requeue times were increased), but every cron workflow is triggered properly.

This, of course, does not mean that the "trigger missed from cron workflow" issue is totally resolved or that it is related to this issue. It would be good to keep an eye on it if you have reproduced this missed-launch behaviour, as missing cron workflow launches is a critical situation. If I see this behaviour again, I will open a new issue with logs and information about it.

juliev0 (Contributor) commented Jul 17, 2023

@Guillermogsjc Thanks for the response. I think the change Terry put in only appeared to resolve this. I'm not sure which version you were running before, but it seems another change was put in recently that removed logging of all K8S API responses. At least when I ran with a version of master from early June, I no longer saw all of those API response logs like before.

I think we need to use the WorkflowTemplate informer (the cache, like you said) when the CronWorkflow needs to validate the WorkflowTemplate, instead of making the GET calls. I will open an enhancement issue for that and, if I have time, work on it.
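
A minimal sketch of that direction, assuming the generated informer factory and lister follow the usual client-go code-generation layout (package paths and method names here are from memory and may differ from the real controller wiring):

package main

import (
	"context"

	wfclientset "github.com/argoproj/argo-workflows/v3/pkg/client/clientset/versioned"
	wfinformers "github.com/argoproj/argo-workflows/v3/pkg/client/informers/externalversions"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	wfClient := wfclientset.NewForConfigOrDie(cfg)

	ctx := context.Background()
	namespace := "argo"

	// The shared informer keeps a local cache of WorkflowTemplates in sync via watch.
	factory := wfinformers.NewSharedInformerFactoryWithOptions(
		wfClient, 0, wfinformers.WithNamespace(namespace))
	tmplLister := factory.Argoproj().V1alpha1().WorkflowTemplates().Lister()
	factory.Start(ctx.Done())
	factory.WaitForCacheSync(ctx.Done())

	// Cache-backed lookup: no GET against the API server per CronWorkflow validation.
	tmpl, err := tmplLister.WorkflowTemplates(namespace).Get("my-template")
	if err != nil {
		panic(err)
	}
	_ = tmpl
}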

juliev0 (Contributor) commented Jul 17, 2023

Also, I see in your original log:
I0525 08:49:01.033669 1 request.go:601] Waited for 1.024306564s due to client-side throttling, not priority and fairness, request: PATCH:https://10.100.0.1:443/apis/argoproj.io/v1alpha1/namespaces/argo/cronworkflows/parallelism-test-cron-wf-44
This is the Kubernetes client locally throttling the request. Until we start using the Informer, you need to increase the --qps and --burst settings passed to the workflow controller to allow more requests per second. The call shown above (PATCH) is used in the CronWorkflow code to update the Workflow's status both when the Workflow starts running and when it's done.
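
(For context, those flags map onto client-go's client-side rate limiter on the controller's REST config. A standalone sketch of what raising them does, not the controller's actual flag handling:)

package main

import (
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/clientcmd"
)

// buildConfig raises the client-side limits whose saturation produces the
// "Waited for ... due to client-side throttling" message shown above.
func buildConfig(kubeconfig string, qps float32, burst int) (*rest.Config, error) {
	cfg, err := clientcmd.BuildConfigFromFlags("", kubeconfig)
	if err != nil {
		return nil, err
	}
	cfg.QPS = qps     // e.g. 50, matching --qps=50
	cfg.Burst = burst // e.g. 100, matching --burst=100
	return cfg, nil
}

func main() {
	if _, err := buildConfig(clientcmd.RecommendedHomeFile, 50, 100); err != nil {
		panic(err)
	}
}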

I also see this line:

time="2023-05-25T09:05:31.996Z" level=info msg=healthz age=5m0s err="workflow never reconciled: b12-34-1685001600-anl" instanceID= labelSelector="!workflows.argoproj.io/phase,!workflows.argoproj.io/controller-instanceid" managedNamespace=

This is your Controller basically failing the health check here, because there are Workflows in the system that have never been seen by the Controller even once. Maybe the throttling (and possibly the retries from throttling) is somehow blocking the goroutines that process these Workflows?

juliev0 (Contributor) commented Jul 17, 2023

Added #11372

terrytangyuan (Member) commented Jul 17, 2023

I’m not sure which version you were running before but it seems like another change was put in recently to remove logging of all K8S API responses.

You are right; this commit might have slipped into the image I built: c6862e9

Could you perhaps change loglevel to DEBUG and try again? Apologies for the confusion.

Guillermogsjc (Author) commented Jul 18, 2023

Hi @terrytangyuan, I tried again with debug on and the dev-fix-list-load image.

In the attached example there is a massive burst of Get workflowtemplates calls at "2023-07-18T10:14", when 93 cron workflows have to be launched. I guess that #11372 will greatly reduce the K8S API burden of the cron workflow controller at launch time.

Kubernetes v1.27 (so affected by the QPS and burst bump)
Argo v3.4.8 with the workflow-controller tag specified above

The workflow controller is launched as:

 - workflow-controller
 - '--qps=50'
 - '--burst=100'
 - '--workflow-ttl-workers=8'
 - '--workflow-workers=100'
 - '--loglevel=debug'

All 93 workflows were launched happily 5 minutes later at 10:19 (k8s v1.27 rocks here).

logs-from-workflow-controller-in-workflow-controller-76ffd854c4-fk8lt (1).log

terrytangyuan added a commit that referenced this issue Jul 19, 2023
Signed-off-by: Yuan Tang <terrytangyuan@gmail.com>
dpadhiar pushed a commit to dpadhiar/argo-workflows that referenced this issue May 9, 2024
…argoproj#11343)

Signed-off-by: Yuan Tang <terrytangyuan@gmail.com>
Signed-off-by: Dillen Padhiar <dillen_padhiar@intuit.com>