
Difficulty scaling when running many workflows and/or steps #4634

Closed
tomgoren opened this issue Dec 2, 2020 · 39 comments · Fixed by #4728

tomgoren commented Dec 2, 2020

Problem statement

  • Use case: processing media assets as they are uploaded by customers
    • We are trying to test the case of continuously created workflows, with the hope of relatively predictable responsiveness.
    • In an ideal world, we would be able to run ~2500 workflows in parallel at peak times
    • Expected workflow run time varies quite a bit, but most of them will take less than five minutes
  • In the synthetic benchmarks detailed below, our nodegroup scale limits are never reached. We don’t seem to be able to run workflows quickly enough to saturate our capacity.

Environment details

  • Running on EKS
    • Kubernetes version: 1.17
    • Controller running on instance type: m5.8xlarge (32 vCPU, 128GB RAM)
    • Workflow executors instance type: m5.xlarge (4 vCPU, 16GB RAM)
  • Argo details:
    • Modified for additional metrics based on master at commit 5c538d7a918e41029d3911a92c6ac615f04d3b80
    • Running with parallelism: 800; otherwise we observed the EKS control plane becoming unresponsive
    • Running with containerRuntimeExecutor: kubelet on AWS Bottlerocket instances

Case 1 (many workflows, few steps)

  • We created a workflow template (see fig.1) with a modifiable number of steps. We launched workflows at a rate of 3 per second with a script (see fig.2)
  • Initially the controller seems to keep up, with the running workflow count growing, and no noticeable delays. After a few minutes we observe a growing number of pending workflows, as well as an oscillation pattern in several metrics.
  • Instead of scaling up to meet demand, we see Kubernetes node capacity go unused as the number of running workflows oscillates
  • We added custom metrics in the controller to monitor the workflow and pod queues (wfc.wfQueue and wfc.podQueue in controller/controller.go). The workflow queue oscillates between 1000 and 1500 items during our test. However, the pod queue consistently stays at 0.

Case 2 (few workflows, many steps)

  • We created a similar template (fig. 6) and script (fig. 7) that generates a number of “work” steps that run in parallel, and concludes with a final “followup” step.
  • When launching one workflow at a time, or at an interval larger than 20s, everything ran smoothly (pods completed and workflows were successful)
  • When the flow of workflows was increased, we began to see the “zombie” phenomenon (fig. 4): even though the pod was marked as Completed, the workflow lingered in the Running state (fig. 5).

Things we tried

In trying to address these issues, we changed the values of the following parameters without much success (a sketch of how they might be wired into the controller Deployment follows the list):

  • pod-workers
  • workflow-workers (the default of 32 was a bottleneck, but anything over 128 didn’t make a difference)
  • INFORMER_WRITE_BACK=false
  • --qps, --burst
  • We increased the memory and CPU resources available to the controller
  • workflowResyncPeriod and podResyncPeriod
  • We’ve also tried various experimental branches from recent seemingly related issues (and we are happy to try more!)
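
Below is a sketch of how these knobs might be wired into the workflow-controller Deployment. It is illustrative only: the flag and variable names are the ones listed above, but the values are placeholders rather than recommendations.

spec:
  containers:
    - name: workflow-controller
      command:
        - workflow-controller
      args:
        - --configmap
        - workflow-controller-configmap
        - --pod-workers=64       # pod reconciliation workers
        - --workflow-workers=128 # workflow reconciliation workers
        - --qps=100              # Kubernetes API client QPS
        - --burst=200            # Kubernetes API client burst
      env:
        - name: INFORMER_WRITE_BACK
          value: "false"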

fig. 1

apiVersion: argoproj.io/v1alpha1
kind: WorkflowTemplate
metadata:
  name: sleep-test-template
  generateName: sleep-test-
  namespace: argo-workflows
spec:
  entrypoint: sleep
  ttlStrategy:
    secondsAfterSuccess: 0
    secondsAfterFailure: 600
  podGC:
    strategy: OnPodCompletion
  arguments:
    parameters:
      - name: friendly-name
        value: sleep_test # Use underscores, not hyphens
      - name: cpu-limit
        value: 2000m
      - name: mem-limit
        value: 1024Mi
      - name: step-count
        value: "200"
      - name: sleep-seconds
        value: "8"
  metrics:
    prometheus:
      - name: "workflow_duration"      # Metric name (will be prepended with "argo_workflows_")
        help: "Duration gauge by name" # A help doc describing your metric. This is required.
        labels:
           - key: workflow_template
             value: "{{workflow.parameters.friendly-name}}"
        gauge:                            # The metric type. Available are "gauge", "histogram", and "counter".
          value: "{{workflow.duration}}"  # The value of your metric. It could be an Argo variable (see variables doc) or a literal value
      - name: "workflow_processed"
        help: "Workflow processed count"
        labels:
           - key: workflow_template
             value: "{{workflow.parameters.friendly-name}}"
           - key: status
             value: "{{workflow.status}}"
        counter:
          value: "1"
  templates:
  - name: sleep
    nodeSelector:
      intent: task-workers
    steps:
      - - name: generate
          template: gen-number-list
      - - name: "sleep"
          template: snooze
          arguments:
            parameters: [{name: input_asset, value: "{{workflow.parameters.sleep-seconds}}", id: "{{item}}"}]
          withParam: "{{steps.generate.outputs.result}}"

  # Generate a list of numbers in JSON format
  - name: gen-number-list
    nodeSelector:
      intent: task-workers
    script:
      image: python:3.8.5-alpine3.12
      imagePullPolicy: IfNotPresent
      command: [python]
      source: |
        import json
        import sys
        json.dump([i for i in range(0, {{workflow.parameters.step-count}})], sys.stdout)
  - name: snooze
    metrics:
      prometheus:
        - name: "resource_duration_cpu"      # Metric name (will be prepended with "argo_workflows_")
          help: "Resource Duration CPU" # A help doc describing your metric. This is required.
          labels:
             - key: workflow_template
               value: "{{workflow.parameters.friendly-name}}"
          gauge:                            # The metric type. Available are "gauge", "histogram", and "counter".
            value: "{{resourcesDuration.cpu}}"  # The value of your metric. It could be an Argo variable (see variables doc) or a literal value
        - name: "resource_duration_memory"      # Metric name (will be prepended with "argo_workflows_")
          help: "Resource Duration Memory" # A help doc describing your metric. This is required.
          labels:
             - key: workflow_template
               value: "{{workflow.parameters.friendly-name}}"
          gauge:                            # The metric type. Available are "gauge", "histogram", and "counter".
            value: "{{resourcesDuration.memory}}"  # The value of your metric. It could be an Argo variable (see variables doc) or a literal value
    nodeSelector:
      intent: task-workers
    inputs:
      parameters:
        - name: input_asset
    podSpecPatch: '{"containers":[{"name":"main", "resources":{"requests":{"cpu": "{{workflow.parameters.cpu-limit}}", "memory": "{{workflow.parameters.mem-limit}}"}, "limits":{"cpu": "{{workflow.parameters.cpu-limit}}", "memory": "{{workflow.parameters.mem-limit}}" }}}]}'
    container:
      image: alpine
      imagePullPolicy: IfNotPresent
      command: [sleep]
      args: ["{{workflow.parameters.sleep-seconds}}"]

fig. 2

#!/usr/bin/env bash
set -euo pipefail
while true; do
  for i in {1..3}; do
    argo submit \
        -n argo-workflows \
        --from workflowtemplate/sleep-test-template \
        -p step-count="1" \
        -p sleep-seconds="60" &>/dev/null &
  done
  sleep 1
  echo -n "."
done

fig. 3

Screen Shot 2020-12-01 at 4 09 01 PM

fig.4

❯ argo -n argo-workflows get sleep-fanout-test-template-6dtjp
Name:                sleep-fanout-test-template-6dtjp
Namespace:           argo-workflows
ServiceAccount:      default
Status:              Running
Created:             Wed Dec 02 15:39:59 -0500 (6 minutes ago)
Started:             Wed Dec 02 15:39:59 -0500 (6 minutes ago)
Duration:            6 minutes 21 seconds
ResourcesDuration:   42m21s*(1 cpu),2h30m41s*(100Mi memory)
Parameters:
  step-count:        100
  sleep-seconds:     8

STEP                                 TEMPLATE         PODNAME                                      DURATION  MESSAGE
 ● sleep-fanout-test-template-6dtjp  sleep
 ├---✔ generate                      gen-number-list  sleep-fanout-test-template-6dtjp-2151903814  7s
 ├-·-✔ sleep(0:0)                    snooze           sleep-fanout-test-template-6dtjp-1189074090  14s
 | ├-✔ sleep(1:1)                    snooze           sleep-fanout-test-template-6dtjp-1828931302  25s
...
 | └-✔ sleep(99:99)                  snooze           sleep-fanout-test-template-6dtjp-1049774502  16s
 └---◷ followup                      snooze           sleep-fanout-test-template-6dtjp-1490893639  5m

fig. 5

❯ kubectl -n argo-workflows get pod/sleep-fanout-test-template-6dtjp-1490893639
NAME                                          READY   STATUS      RESTARTS   AGE
sleep-fanout-test-template-6dtjp-1490893639   0/2     Completed   0          5m43s

fig. 6

apiVersion: argoproj.io/v1alpha1
kind: WorkflowTemplate
metadata:
  name: sleep-fanout-test-template
  generateName: sleep-fanout-test-
  namespace: argo-workflows
spec:
  entrypoint: sleep
  ttlStrategy:
    secondsAfterSuccess: 0
    secondsAfterFailure: 600
  podGC:
    strategy: OnPodCompletion
  arguments:
    parameters:
      - name: friendly-name
        value: sleep_fanout_test # Use underscores, not hyphens
      - name: cpu-limit
        value: 2000m
      - name: mem-limit
        value: 1024Mi
      - name: step-count
        value: "200"
      - name: sleep-seconds
        value: "8"
  metrics:
    prometheus:
      - name: "workflow_duration"      # Metric name (will be prepended with "argo_workflows_")
        help: "Duration gauge by name" # A help doc describing your metric. This is required.
        labels:
           - key: workflow_template
             value: "{{workflow.parameters.friendly-name}}"
        gauge:                            # The metric type. Available are "gauge", "histogram", and "counter".
          value: "{{workflow.duration}}"  # The value of your metric. It could be an Argo variable (see variables doc) or a literal value
      - name: "workflow_processed"
        help: "Workflow processed count"
        labels:
           - key: workflow_template
             value: "{{workflow.parameters.friendly-name}}"
           - key: status
             value: "{{workflow.status}}"
        counter:
          value: "1"
  templates:
  - name: sleep
    nodeSelector:
      intent: task-workers
    steps:
      - - name: generate
          template: gen-number-list
      - - name: "sleep"
          template: snooze
          withParam: "{{steps.generate.outputs.result}}"
      - - name: "followup"
          template: snooze

  # Generate a list of numbers in JSON format
  - name: gen-number-list
    nodeSelector:
      intent: task-workers
    script:
      image: python:3.8.5-alpine3.12
      imagePullPolicy: IfNotPresent
      command: [python]
      source: |
        import json
        import sys
        json.dump([i for i in range(0, {{workflow.parameters.step-count}})], sys.stdout)
  - name: snooze
    metrics:
      prometheus:
        - name: "resource_duration_cpu"      # Metric name (will be prepended with "argo_workflows_")
          help: "Resource Duration CPU" # A help doc describing your metric. This is required.
          labels:
             - key: workflow_template
               value: "{{workflow.parameters.friendly-name}}"
          gauge:                            # The metric type. Available are "gauge", "histogram", and "counter".
            value: "{{resourcesDuration.cpu}}"  # The value of your metric. It could be an Argo variable (see variables doc) or a literal value
        - name: "resource_duration_memory"      # Metric name (will be prepended with "argo_workflows_")
          help: "Resource Duration Memory" # A help doc describing your metric. This is required.
          labels:
             - key: workflow_template
               value: "{{workflow.parameters.friendly-name}}"
          gauge:                            # The metric type. Available are "gauge", "histogram", and "counter".
            value: "{{resourcesDuration.memory}}"  # The value of your metric. It could be an Argo variable (see variables doc) or a literal value
    nodeSelector:
      intent: task-workers
    podSpecPatch: '{"containers":[{"name":"main", "resources":{"requests":{"cpu": "{{workflow.parameters.cpu-limit}}", "memory": "{{workflow.parameters.mem-limit}}"}, "limits":{"cpu": "{{workflow.parameters.cpu-limit}}", "memory": "{{workflow.parameters.mem-limit}}" }}}]}'
    container:
      image: alpine
      imagePullPolicy: IfNotPresent
      command: [sleep]
      args: ["{{workflow.parameters.sleep-seconds}}"]

fig. 7

#!/usr/bin/env bash
set -euo pipefail
while true; do
  argo submit \
    -n argo-workflows \
    --from workflowtemplate/sleep-fanout-test-template \
    -p step-count="100" \
    -p sleep-seconds="8" &>/dev/null
  echo -n "."
  sleep 10
done

Message from the maintainers:

Impacted by this bug? Give it a 👍. We prioritise the issues with the most 👍.

alexec commented Dec 2, 2020

tomgoren commented Dec 3, 2020

Indeed we have read the thread, and we followed all of the workarounds suggested there. We have been keeping up to date with the latest builds from master, yes.

  • Do you see any controller crashes or restarts?

We haven't seen a controller crash on the most recent builds.

  • How many nodes does each workflow have?

In our first test, we submitted three workflows per second of two steps each.
In the second test, we submitted one workflow of 102 steps every 10 seconds.

  • How many pods are running concurrently?

The highest number of pods we've seen run concurrently in our steady state tests has been around 200.

In those tests, the number of workflows in the Pending state accumulates, at which point we fall behind and the controller never catches up with the incoming workflows.

In the case of the few workflows with many steps each, the number of pods in the Completed state increases, and we observe "zombie" workflows stuck in the Running state.

alexec commented Dec 3, 2020

What TZ are you in? Can we get on a Zoom in the next few days?

alexec commented Dec 8, 2020

Your actions:

  • Try using no parallelism.
  • Can you check the memory settings? Should not max out at 1GiB.
  • Try :no-sig and :easyjson.
  • Separately, try using semaphores instead.

My actions

  • I'll check to see if we have a memory leak.

acj commented Dec 8, 2020

@alexec Thanks again for your time today. We're looking forward to getting to the bottom of this :)

Try using no parallelism.

We ran this test for about 30 mins using the script in Fig 2 (submits 3 wf/sec, two steps total, about 60-70 seconds expected total time per workflow). Workflows seemed to start and complete smoothly with a steady accumulation of "succeeded" workflows. No workflow GC ever appeared to run, which is unexpected and different from our prior tests. Workflow throughput dropped by 50% about 20 minutes into the run, which is what we've normally seen when the GC operation starts, but the number of succeeded workflows never decreased. Pending workflows hovered around 300 throughout the test. When we submitted a couple of workflows in the middle of the test using the CLI tool, it took around 3.5 minutes for the first step to run. We're not sure what to conclude here, but running without parallelism doesn't seem significantly different from our test this morning except for the lack of workflow GC.

Can you check the memory settings? Should not max out at 1GiB.

We think this was a red herring. In later tests, memory usage seemed to increase in proportion to the number of workflows that were being tracked, which seems reasonable.

Our current manifest reserves 1GiB of memory for the Argo controller but doesn't set a limit. The node has significant memory available, so we shouldn't be bumping into any physical or OS limits.

Separately, try using semaphores instead.

We tried using semaphores while parallelism was disabled and saw very strange behavior. About 12 workflows would run to completion at the start of the test, and then none would complete. The number of running workflows increased to 500 (the limit I had set in the configmap), and pending workflows increased until I ended the test. No pods were created, though. When we tried to submit a job in the middle of the test, we saw this:

Message:             Waiting for argo-workflows/ConfigMap/semaphores/workflow lock. Lock status: 0/500

Only when we canceled the script and cleaned up the workflows (delete --all) did the locks seem to get released.

This is our configmap and the relevant part of the workflow spec, which is adapted from the synchronization-wf-level example:

apiVersion: v1
kind: ConfigMap
metadata:
  namespace: argo-workflows
  name: semaphores
data:
  workflow: "500"
spec:
  ...
  synchronization:
    semaphore:
      configMapKeyRef:
        name: semaphores
        key: workflow
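
For completeness, here is a minimal sketch of a whole Workflow carrying that synchronization block. The metadata and template body are hypothetical stand-ins (an alpine sleep, like our test templates); the synchronization section is the relevant part.

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: semaphore-test-
  namespace: argo-workflows
spec:
  entrypoint: sleep
  synchronization:
    semaphore:
      configMapKeyRef:
        name: semaphores
        key: workflow
  templates:
    - name: sleep
      container:
        image: alpine
        imagePullPolicy: IfNotPresent
        command: [sleep]
        args: ["8"]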

Try :no-sig and :easyjson.

We'll give these a try tomorrow and will report back.

I'll check to see if we have a memory leak.

BTW, we ran pprof against the controller a few weeks ago and didn't see any interesting CPU hot spots (aside from processNextItem and JSON processing like you'd mentioned, which is understandable) or memory leaks. We're happy to run more tests against the latest controller if that would be useful.

acj commented Dec 8, 2020

Forgot to mention that we've been seeing a lot of "Deadline exceeded" warnings in our controller logs during recent runs (not just today). I've added a metric to our controller build to count them, so we should have hard numbers to share next time if that would be a useful signal.

alexec linked a pull request Dec 8, 2020 that will close this issue
alexec commented Dec 8, 2020

@acj

I've done some analysis and can't see any memory leaks. That ties in with your analysis.

I've created a dev build intended to reduce the number of Kubernetes API requests the controller makes by limiting each workflow to one reconciliation every 15s (configurable). The details are in the linked PR. Would you like to try that out?

acj commented Dec 9, 2020

Okay, we have a few more test results. Short version: no significant improvement.

  • Using :ratel, we still saw oscillations in the key metrics after ~15 minutes and slowly fell behind. Queue depth, time in queue, and the running/pending workflow counts slowly increased. We had a fairly large backlog of completed pods (several hundred) near the end of the run, but there was no backlog of succeeded workflows. This run seemed mostly stable, but we still fell behind on workflow processing.

  • Using :easyjson (~20 days old now), we saw steep growth in the number of pending workflows. The running pod count hovered around 160, which isn't enough to keep pace with our incoming workflows. The "time in queue" metric increased much more slowly than in the first test, but it did slowly increase. After about 30 minutes, many of the metrics seemed to become unstable. A few hundred workflows were GC'd at roughly the same time, which is a pattern that we've seen fairly often. (Unsure of cause/effect here, but they happen together quite often.) After interrupting our WF submission script and letting the controller return to idle, there were still ~300 succeeded workflows and ~800 pending workflows according to the CLI. The prometheus metrics reported -3200 (not a typo) succeeded workflows. I suspect that many of these quirks have been fixed since this image was built, but they seemed worth sharing.

  • Using :no-sig (similarly old), we also saw steep growth in the queue depth and pending workflow count. No obvious change in behavior compared to :easyjson.

  • We also tried using a controller built from master (with our custom controller metrics added) with MAX_OPERATION_TIME=90s to see if it would resolve the "Deadline exceeded" warnings. It did seem to resolve the warnings, but we still slowly fell behind in workflow processing.

The close timing between succeeded workflows getting GC'd and performance becoming unstable seems interesting. Would it be worthwhile to try disabling GC or making the GC period very long to see if there's actually a connection? Anything else we should try?

alexec commented Dec 10, 2020

large backlog of completed pods (several hundred) near the end of the run, but there was no backlog of succeeded workflows.

Do you mean you had zombies?

alexec commented Dec 10, 2020

Completed and GC pods are bounded to 512 before blocking:

		// Bounded channels (512 entries) used to queue pod cleanup work;
		// once full, further sends block until a worker drains them.
		completedPods:              make(chan string, 512),
		gcPods:                     make(chan string, 512),

acj commented Dec 10, 2020

Do you mean you had zombies?

I don't think so, at least by the definition given in #4560. The workflows seemed to reliably complete and get GC'd, but we had a lot of completed pods hanging around (visible using kubectl) until the end of the test.

Completed and GC pods are bounded to 512 before blocking

We noticed this a while back and wondered if it was contributing to our perf issues. What controller behavior would you expect to see when those channels are full?

:ratel should reduce controller CPU notably

I didn't pay close attention to CPU usage yesterday, but I can confirm that it was about 6x lower compared to other recent runs. Nice!

alexec commented Dec 10, 2020

Good news about the CPU. I think there are some design issues in pod GC (see #4693).

We noticed this a while back and wondered if it was contributing to our perf issues. What controller behavior would you expect to see when those channels are full

When the channel is full, adding new entries to it blocks. This means that reconciliation will take longer.

I'm creating a new build with a configurable fix for #4693.

alexec commented Dec 10, 2020

@acj I've created a new dev build. Can you run this first without any env var changes, as a baseline? I expect you to see an improvement with default settings. Then can you try the new env vars as listed in the PR, please?

Thank you again.

acj commented Dec 10, 2020

Will do. We ran the latter test (today's :ratel + the new env vars related to GC and queuing) this afternoon and had the smoothest run so far. We still slowly fell behind (increasing time in queue, pending workflows, etc), but the workflow completion rate was very consistent throughout the hour-long run, which is great to see. We'll run it again with default settings tomorrow morning and then post our findings.

Do you think we should also include the zombie env vars and RECENTLY_STARTED_POD_DURATION? They didn't seem to apply to us, so we left them out.

   Environment:
      INLINE_POD_GC:             true
      COMPLETED_PODS_QUEUE:      2048
      GC_PODS_QUEUE:             2048
      LEADER_ELECTION_IDENTITY:  workflow-controller-66d46fb787-gvzcm (v1:metadata.name)
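
For reference, a rough sketch of how those values map onto env entries in the controller Deployment manifest (LEADER_ELECTION_IDENTITY comes from the Downward API):

env:
  - name: INLINE_POD_GC
    value: "true"
  - name: COMPLETED_PODS_QUEUE
    value: "2048"
  - name: GC_PODS_QUEUE
    value: "2048"
  - name: LEADER_ELECTION_IDENTITY
    valueFrom:
      fieldRef:
        apiVersion: v1
        fieldPath: metadata.name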

alexec commented Dec 10, 2020

Do you think we should also include the zombie env vars and RECENTLY_STARTED_POD_DURATION?

I think it is best to test one thing at a time :)

alexec commented Dec 10, 2020

Napkin math:

  • 2500 workflows / 15 seconds per reconciliation = 166 workflow workers needed
  • 2500 workflows / 30 seconds per reconciliation = 83 workflow workers needed

Can you try:

  • DEFAULT_REQUEUE_TIME=30s
  • DEFAULT_REQUEUE_TIME=60s

I think you should get more workflow throughput (i.e. the queue should not grow).
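
A sketch of how that might be set on the controller Deployment (DEFAULT_REQUEUE_TIME is the env var named above; use 30s or 60s depending on which case is being tested):

env:
  - name: DEFAULT_REQUEUE_TIME
    value: "30s"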

acj commented Dec 11, 2020

I think it is best to test one thing at a time :)

Completely agree. I was trying to get clarity on what you meant here:

Then can you try the new env vars as listed in PR please?

Whether we should apply all of those new env vars at once, or by group, or one at a time, etc.

alexec commented Dec 11, 2020

Whether we should apply all of those new env vars at once, or by group, or one at a time, etc.

I've updated the PR description to make this clearer.

acj commented Dec 11, 2020

We see a pretty big difference between :ratel with the defaults vs. :ratel with INLINE_POD_GC, COMPLETED_PODS_QUEUE, and GC_PODS_QUEUE set to the values I mentioned above. With the defaults, there's steep growth in queue depth (especially for wf_ttl_queue), succeeded workflow count, and time in queue (again for wf_ttl_queue more than workflow_queue). CPU usage is also noticeably higher, though still much lower than what we were seeing with :latest. With the customized env var values, we see much slower queue and pending workflow growth (though still some). In both cases, workflow throughput is unsteady until about the 10-minute mark when it levels off. The k8s API request rate (per the new metric) was steady after the first couple of minutes. Still not able to keep up with the workflow queue.

The customized env vars from the previous test seemed to give us the best results, so we carried those into the next tests. Changing DEFAULT_REQUEUE_TIME to 30s and then 60s didn't seem to make a difference. The queue growth rate was slightly slower with requeue time set to 60s.

On a whim, we reran that last test (requeue time at 60s) with a modified script that submits 2 WF/sec instead of 3. The queue depth, time in queue, and pending workflow count all stayed near zero. It took around 30 seconds to submit a workflow (sleeps for 1s and exits) using the CLI and see it fully complete, which is a little sluggish, but much better than we saw in our pre-:ratel tests.

We also tried making the script submit 5 WF/sec. Notably, the k8s API request rate seemed to plateau at the same point whether we submitted 3/sec or 5/sec, which makes me think we're being throttled either by the API server or by the rate limiters in the controller (e.g. workqueue or similar). Maybe that's our next bottleneck?

alexec commented Dec 11, 2020

Can I ask you to execute another test? INLINE_POD_GC=false instead of on. To back up this solution, you should see a metric named gcPod, and it should top out at 2048 (or 512). That would be strong evidence that this is a change we should pursue.

alexec commented Dec 12, 2020

@acj I think we're getting closer to a solution, so I've created a series of PRs, each with one fix (:ratel is stuffed with many fixes). This should make it easier to understand the impact of each change in isolation.

Assumptions:

  • You're not getting zombie workflows.
  • You're not seeing "pod deleted" errors.

Rejected hypotheses:

  • Too many Kubernetes API requests (shown by k8s_request_total).
  • JSON marshaling (as :easyjson did not help).
  • Significant pod changes missed (as :no-sig did not help).

Accepted hypotheses:

  • Rate-limiting the throughput of workflow reconciliations (DEFAULT_REQUEUE_TIME)
  • We're not GCing pods fast enough.

I've created dev builds specifically to address these items (a sketch of pinning one of them in the Deployment follows the list):

  • argoproj/workflow-controller:rate adds configurable rate-limiting (a short 2s by default)
  • argoproj/workflow-controller:gc-queue has 4 workers (as opposed to 1) to GC pods.
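
A rough sketch of pinning one of those dev images in the controller Deployment (swap the tag for whichever build is under test):

spec:
  containers:
    - name: workflow-controller
      # e.g. argoproj/workflow-controller:rate or :gc-queue
      image: argoproj/workflow-controller:gc-queue
      command:
        - workflow-controller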

Would you be able to get onto a Zoom early next week please?

acj commented Dec 14, 2020

Can I ask you to execute another test? INLINE_POD_GC=false instead of on.

Yep, we'll give this a try tomorrow.

Rejected hypotheses:

  • Too many Kubernetes API requests (shown by k8s_request_total).

Given the plateauing behavior we were seeing last week (bottom of my last comment), is it safe to reject this hypothesis? I'm still wondering about a possible bottleneck there.

Would you be able to get onto a Zoom early next week please?

Sure. We'll ping you in Slack once we get our schedule sorted.

@tomgoren

We ran a test with INLINE_POD_GC set to true (fig. 1) and observed relatively steady processing of workflows for over an hour. It was the smoothest run thus far.

  • CPU usage was nominal.
  • Completed pods accumulated
  • Pending workflows accumulated slowly

When we ran with INLINE_POD_GC: false (fig. 2), this was the outcome:

  • CPU usage climbed throughout the run
  • workflows in the Succeeded state accumulated and never cleared
  • the controller crashed after around 40 minutes.

Stack trace from the controller crash:

E1215 17:38:02.451933       1 leaderelection.go:307] Failed to release lock: Lease.coordination.k8s.io "workflow-controller-lease" is invalid: spec.leaseDurationSeconds: Invalid value: 0: must be greater than 0
 12:38:02.514 test-cluster-2 workflow-controller time="2020-12-15T17:38:02.514Z" level=info msg="stopped leading" id=workflow-controller-688877bc7c-cjqrr
 12:38:02.527 test-cluster-2 workflow-controller panic: http: Server closed
 12:38:02.527 test-cluster-2 workflow-controller  app='workflow-controller'  logfile='/containerd/workflow-controller.log'  pod_uid='a50c89d2-97b2-41c8-ba8f-e3274fd7e682'  monitor='agentKubernetes'  message=''  serverHost='scalyr-agent-2-rfrrn'  pod_name='workflow-controller-688877bc7c-cjqrr'  pod_namespace='argo'  scalyr-category='log'  parser='docker'  stream='stderr'  pod-template-hash='688877bc7c'  k8s_node='ip-10-101-110-41.ec2.internal'  raw_timestamp='1608053882527506081'  container_id='04bbaed6801942e9a7a2aee2b8050d527882c876697fbb2d795184656a2d2027'
 12:38:02.527 test-cluster-2 workflow-controller goroutine 708 [running]:
 12:38:02.527 test-cluster-2 workflow-controller github.com/argoproj/argo/workflow/metrics.runServer.func1(0x1, 0x1aac957, 0x8, 0x2382, 0x0, 0x0, 0xc001072000)
 12:38:02.527 test-cluster-2 workflow-controller 	/Users/acollins8/go/src/github.com/argoproj/argo/workflow/metrics/server.go:53 +0x117
 12:38:02.527 test-cluster-2 workflow-controller created by github.com/argoproj/argo/workflow/metrics.runServer
 12:38:02.527 test-cluster-2 workflow-controller 	/Users/acollins8/go/src/github.com/argoproj/argo/workflow/metrics/server.go:50 +0x246

fig. 1

Screen Shot 2020-12-15 at 12 16 03 PM

fig. 2

Screen Shot 2020-12-15 at 12 39 39 PM

@tomgoren

Test results from using the :gc-queue image:

  • CPU usage grew as the test went on, and maxed out (32 cores) after ~30 minutes
  • "Succeeded" workflow tally accumulated throughout the test, only clearing after we killed the script that submits new workflows
  • Pending workflows steadily grew at the same pace as the running workflows
  • workflow_ttl_queue grew throughout and its behavior mirrored the "Succeeded" workflow count

Screen Shot 2020-12-16 at 2 35 12 PM

alexec linked a pull request Dec 16, 2020 that will close this issue
alexec commented Dec 16, 2020

#4761

alexec commented Dec 16, 2020

I'm pretty sure we have a new bug in TTL.

@tomgoren

Testing with :latest (argoproj/workflow-controller@sha256:4bccc84a407d275b6b7f0e5072341cdfd293fd098b8d2d10465ecb85c6265e49) - fig. 1:

  • CPU climbed steadily
  • Succeeded count kept going up
  • workflow_ttl_queue kept going up
  • Running and pending relatively steady - both grew slowly in tandem

Testing with :rate - fig. 2:

  • Essentially identical to the first test

Testing with :latest (same image SHA as above), with --workflow-ttl-workers set to 16 instead of the default 4 - fig. 3:

  • Initially the tally of "Succeeded" workflows went up, but cleared after ~8 minutes and stayed down
  • CPU usage was nominal throughout the test
  • workflow_queue grew consistently
  • number of running workflows grew consistently, and the pending count followed suit
  • pods in the "Completed" state accumulated (at the end of the test run the count was 1100+)

fig. 1

Screen Shot 2020-12-16 at 3 20 08 PM

fig. 2

Screen Shot 2020-12-16 at 4 04 08 PM

fig. 3

Screen Shot 2020-12-16 at 4 43 26 PM

alexec commented Dec 16, 2020

I've been doing some exploratory testing and it is clear that TTL often just does not happen. This is a functional bug, not in fact a scaling issue. I'll fix this and get back to you, as I'm currently trying to improve the performance of something fundamentally broken.

alexec reopened this Dec 17, 2020
alexec commented Dec 17, 2020

Accidentally closed.

alexec commented Dec 17, 2020

@tomgoren I've merged #4728 into master. I'd like you to test my bug fix for TTL, :fix-ttl (based on master).

alexec commented Dec 17, 2020

I should note that I've tried submitting 600 workflows at once on my MacBook. That peaked at 60 concurrent per second. I'm going to launch this on a test cluster tomorrow.

@tomgoren

With :fix-ttl (still with --workflow-ttl-workers at 16) we observed the following:

  • Initially (for the first ~2.5 minutes) the workflow_ttl_queue and Succeeded count stayed very low, but then began to climb again
  • Following that, the results were much the same as before
  • We are still seeing an accumulation of pods lingering in the Completed state

Screen Shot 2020-12-17 at 12 31 40 PM

As per our conversation, we are going to run a similar test, this time skipping Argo to verify that the cluster can sustain the desired capacity.

alexec added a commit to alexec/argo-workflows that referenced this issue Dec 18, 2020
@tomgoren

Using the latest images and recommendations:

spec:
      containers:
      - args:
          - --configmap
          - workflow-controller-configmap
          - --executor-image
          - argoproj/argoexec:fix-ttl
          # --qps >= 2x pods created per second
          - --qps=1024
          - --pod-cleanup-workers=32
          - --workflow-workers=256
          - --workflow-ttl-workers=16
        command:
          - workflow-controller
        # WISTIA: pinned to `:fix-ttl` as of 2020-12-21
        image: argoproj/workflow-controller@sha256:847b5117cb83f02d9bef9d17b459e733f9761d279ce64edc012ebe3c7a634f38
        name: workflow-controller
        resources:
          requests:
            cpu: 7000m
            memory: 1G
        env:
          - name: LEADER_ELECTION_IDENTITY
            valueFrom:
              fieldRef:
                apiVersion: v1
                fieldPath: metadata.name

We observed the following:

Screen Shot 2020-12-22 at 2 27 01 PM

We are seeing far more consistently running workflows, but now we are also seeing a lot of failed workflows. Unfortunately, the failed pods were reaped immediately, so we have only the following information:

❯ argo -n argo-workflows get sleep-test-template-tjh27
Name:                sleep-test-template-tjh27
Namespace:           argo-workflows
ServiceAccount:      default
Status:              Failed
Message:             child 'sleep-test-template-tjh27-4134794983' failed
Conditions:
 Completed           True
Created:             Tue Dec 22 14:19:04 -0500 (3 minutes ago)
Started:             Tue Dec 22 14:19:04 -0500 (3 minutes ago)
Finished:            Tue Dec 22 14:20:19 -0500 (2 minutes ago)
Duration:            1 minute 15 seconds
ResourcesDuration:   3m4s*(1 cpu),11m18s*(100Mi memory)
Parameters:
  step-count:        1
  sleep-seconds:     60

STEP                          TEMPLATE         PODNAME                               DURATION  MESSAGE
 ✖ sleep-test-template-tjh27  sleep                                                            child 'sleep-test-template-tjh27-4134794983' failed
 ├---✔ generate               gen-number-list  sleep-test-template-tjh27-2499020563  5s
 └---✖ sleep(0:0)             snooze           sleep-test-template-tjh27-4134794983  1m        failed with exit code 1

Thank you!

alexec commented Dec 22, 2020

Good news! I have a branch named test. Can I ask you to try :test for both the executor and the controller? We'll need to understand why the pods failed. Is it pod deleted? Can you try:

      DEFAULT_REQUEUE_TIME:      20s // defaults to 2s, but causes `pod deleted` under high load
      MAX_OPERATION_TIME:        1m // defaults to 30s


tomgoren commented Jan 5, 2021

Running on :test:

Screen Shot 2021-01-05 at 1 36 02 PM

  • More workflows running concurrently than in previous tests
  • No errors or prematurely deleted pods
  • Workflows in the "succeeded" state linger, as do Completed/Terminated pods
  • It seems that at a certain threshold the number of running workflows decreases and then plateaus, while the succeeded workflow tally continues to grow
  • CPU and memory usage are low throughout

Adding the environment variables as recommended:

Screen Shot 2021-01-05 at 2 28 15 PM

  • Largely the same results, though with perhaps a bit more throughput

We'll likely try out v2.12.3 as well and report back.


tomgoren commented Jan 6, 2021

Very similar results with 2.12.3:

Screen Shot 2021-01-06 at 11 21 26 AM

alexec commented Jan 6, 2021

After a bit of testing, I've found that "running workflows" != "workflows actually doing work". You can have a lot of running workflows while the pods are actually pending.

The charts you shared don't have legends on them, so I don't know which metrics they are, but there is a new metric called (I think) pod_count and one called workflow_conditions that shed light on this.

https://github.com/argoproj/argo/pull/4811/files#diff-1ad57e035683c4ad698d4b2b6840add43236bfaefbe4b705f0a81cf28959c4b9R68

Alex

alexec removed their assignment Jan 14, 2021
alexec commented Jan 15, 2021

@tomgoren we should get on a call again. When are you free?

no-response bot commented Jan 29, 2021

This issue has been automatically closed because there has been no response to our request for more information from the original author. With only the information that is currently in the issue, we don't have enough information to take action. Please reach out if you have or find the answers we need so that we can investigate further.
