Increase max reconciliation time #4239

Closed
alexec opened this issue Oct 8, 2020 · 14 comments · Fixed by #4562

@alexec
Contributor

alexec commented Oct 8, 2020

Summary

We want to support 10k+ node workflows, and these tend to fail to schedule all their nodes within the 10s deadline. In fact, a workflow can get so large that it can never fully schedule its nodes.

In #2705 (May) we increased the default number of workflow workers from 8 to 32.

Let's increase the reconciliation time (or make it configurable).
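
To make the failure mode concrete, here is a minimal Python sketch of a reconciliation pass bounded by a deadline. It is not the controller's actual code: `reconcile`, `schedule_node`, `FakeClock`, and the 0.125 s cost per node are all hypothetical, chosen only so the arithmetic is easy to follow.

```python
import time
from typing import Callable, List


def reconcile(pending: List[str],
              schedule_node: Callable[[str], None],
              deadline_seconds: float = 10.0,
              clock: Callable[[], float] = time.monotonic) -> List[str]:
    """Schedule nodes until the deadline passes; return the nodes left over."""
    start = clock()
    for index, node in enumerate(pending):
        if clock() - start >= deadline_seconds:
            # Out of time: everything from here on waits for the next pass.
            return pending[index:]
        schedule_node(node)
    return []


class FakeClock:
    """Deterministic clock that advances a fixed step on every reading."""

    def __init__(self, step: float) -> None:
        self.now = 0.0
        self.step = step

    def __call__(self) -> float:
        current = self.now
        self.now += self.step
        return current


# Hypothetical cost: each scheduling step takes 0.125 s of wall-clock time.
nodes = [f"node-{i}" for i in range(200)]
scheduled: List[str] = []
left_10s = reconcile(nodes, scheduled.append, 10.0, FakeClock(0.125))
left_30s = reconcile(nodes, lambda node: None, 30.0, FakeClock(0.125))
print(len(scheduled), len(left_10s), len(left_30s))  # 79 121 0
```

With the 10s deadline the pass stops partway through and leaves 121 nodes for a later pass; with a 30s deadline the same workflow is fully scheduled in one pass. If every pass has to redo enough work before it reaches the unscheduled nodes, a fixed deadline may never be enough.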

Use Cases

10k+ node workflows.


Message from the maintainers:

Impacted by this bug? Give it a 👍. We prioritise the issues with the most 👍.

@alexec alexec added type/feature Feature request epic/scaling labels Oct 8, 2020
@alexec
Contributor Author

alexec commented Oct 20, 2020

I've done some experiments and I can't see a benefit on 250-node workflows. I strongly expect benefits for larger workflows.

@akloss-cibo

I pulled the workflow nodes out of PostgreSQL for one of our workflows that gets stuck in Running while making no progress. PostgreSQL takes a while just to process the row in a trivial explain:

postgres=# select count(*) from argo_workflows;
 count
-------
     2
(1 row)

postgres=# explain analyze select json_each(nodes) as node from argo_workflows where uid = 'b66dd218-9555-4745-8cb4-2e38fd014762';
                                                  QUERY PLAN
---------------------------------------------------------------------------------------------------------------
 ProjectSet  (cost=0.00..2.04 rows=200 width=32) (actual time=1788.320..1929.525 rows=187695 loops=1)
   ->  Seq Scan on argo_workflows  (cost=0.00..1.02 rows=2 width=18) (actual time=0.024..0.028 rows=1 loops=1)
         Filter: ((uid)::text = 'b66dd218-9555-4745-8cb4-2e38fd014762'::text)
         Rows Removed by Filter: 1
 Planning Time: 0.070 ms
 Execution Time: 1936.627 ms
(6 rows)

postgres=#

I did pull down the nodes and:

% time jq . ~/tmp/parsed.json > /dev/null
jq . ~/tmp/parsed.json > /dev/null  19.24s user 0.76s system 97% cpu 20.409 total
% du -h parsed.json
224M	parsed.json
%

@alexec
Contributor Author

alexec commented Nov 2, 2020

You should not be querying from argo_workflows using uid. There is no index, so you'll end up sequentially scanning the table.

primary key(clustername,uid,version)

In fact, you should never be querying from that table. It is internal to Argo Workflows.
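
Purely to illustrate the indexing point (not as a recommendation to query the table), here is a small Python sketch using SQLite's in-memory engine as a stand-in for PostgreSQL. The table is a reduced, hypothetical version of `argo_workflows` that keeps only the primary key quoted above plus `nodes`; the column types are assumptions, and the two planners differ in detail.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE argo_workflows (
        clustername TEXT NOT NULL,
        uid         TEXT NOT NULL,
        version     TEXT NOT NULL,
        nodes       TEXT,
        PRIMARY KEY (clustername, uid, version)
    )
    """
)


def plan(sql: str, params: tuple) -> str:
    """Return the planner's description of how it would run the query."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # The human-readable detail is the last column of each plan row.
    return "; ".join(row[-1] for row in rows)


# Filtering on uid alone skips the leading key column (clustername).
by_uid = plan("SELECT nodes FROM argo_workflows WHERE uid = ?", ("some-uid",))
# Filtering on the leading columns lets the planner use the key's index.
by_key = plan(
    "SELECT nodes FROM argo_workflows WHERE clustername = ? AND uid = ?",
    ("default", "some-uid"),
)
print(by_uid)  # a full scan of the table
print(by_key)  # a search using the primary-key index
```

The general idea carries over to PostgreSQL: a multicolumn index is most useful when the query constrains its leading columns.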

Can you try argoproj/workflow-controller:latest please?

@akloss-cibo

akloss-cibo commented Nov 2, 2020

Since there are only two rows in the table, I think the overhead from the table scan is nominal compared to reading the json.

Yep, we'll give it a go.

Based on another issue, I also set ALL_POD_CHANGES_SIGNIFICANT=true in the workflow-controller's environment.

@akloss-cibo

We did have some success with the :latest image, in that our workflow made it to Failed status. When we try to retry it through the web UI, there's an error. The log says:

time="2020-11-04T15:33:10Z" level=error msg="finished unary call with code Unknown" error="upper: no more rows in this result set" grpc.code=Unknown grpc.method=RetryWorkflow grpc.service=workflow.WorkflowService grpc.start_time="2020-11-04T15:33:10Z" grpc.time_ms=25.877 span.kind=server system=grpc

Any ideas?

@alexec
Contributor Author

alexec commented Nov 4, 2020

Huh, that error is related to offloading. Have you turned it on and then off?

@akloss-cibo

AFAIK, we have not. Are you suggesting that we try that?

@alexec
Contributor Author

alexec commented Nov 4, 2020

No, you should never see that error. It indicates that the data was deleted from the database, but not from etcd. I don't know how this can happen; it could be a separate, unrelated bug.

Have you ever seen this before?

@akloss-cibo

I think that was the first time we've gotten a workflow to finish without lots of manual intervention. We ran it again overnight and saw the same behavior: the workflow runs, fails, and then cannot be retried. The tables I think are relevant are empty:

postgres=# select count(*) from argo_workflows;
 count
-------
     0
(1 row)

postgres=# select count(*) from argo_archived_workflows;
 count
-------
     0
(1 row)

postgres=#

@alexec alexec self-assigned this Nov 5, 2020
@akloss-cibo

We've given another large workflow a try and it has ended up stuck Running with no pods for the workflow. At this point, we have

    - name: OFFLOAD_NODE_STATUS_TTL
      value: 4320m
    - name: ALL_POD_CHANGES_SIGNIFICANT
      value: "true"

set for the workflow controller. We periodically see

2020-11-09 19:18:54.182 UTC [8383] FATAL:  connection to client lost
2020-11-09 19:18:54.182 UTC [8383] STATEMENT:  SELECT
	        "uid", "version", "nodes"
	        FROM "argo_workflows"
	      WHERE ("clustername" = $1 AND "namespace" = $2)

(empty lines removed) in the PostgreSQL server's stdout. I think this may be coming from argo-server, not the workflow-controller. We went up to a pretty big pod for argo-server (8 CPUs, 32 Gi memory) to no avail; attempting to use the UI times out and occasionally causes the argo-server pod to either OOM (when it was configured a little smaller) or just fail its health check and restart.

Are there other things we could be trying?

@alexec
Contributor Author

alexec commented Nov 9, 2020

Could we have a Zoom?

@akloss-cibo

Sure... I just joined the argo slack; is that a good way to get linked up?

@alexec
Contributor Author

alexec commented Nov 10, 2020

Ping me on Slack.

@akloss-cibo

We've switched to a workflow-of-workflows for this particular task. I've spent a little time trying to create a simpler way to reproduce issues without much luck, but we'll reach out if we find something that we're stuck on.
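
For readers unfamiliar with the pattern, the core of it is partitioning one very large set of tasks into child workflows that each stay comfortably small. The thread does not say how the partitioning was done, so the Python sketch below is only an assumption: `split_into_children` and the limit of 500 nodes per child are hypothetical, and it presumes the tasks are independent enough to be grouped arbitrarily.

```python
from typing import Iterator, List, Sequence


def split_into_children(tasks: Sequence[str], max_nodes: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most max_nodes tasks each."""
    if max_nodes < 1:
        raise ValueError("max_nodes must be at least 1")
    for start in range(0, len(tasks), max_nodes):
        yield list(tasks[start:start + max_nodes])


# A 10k-task job becomes a parent workflow that launches 20 children.
tasks = [f"task-{i}" for i in range(10_000)]
children = list(split_into_children(tasks, max_nodes=500))
print(len(children))  # 20
```

Each child then carries only its own node statuses, so no single workflow object has to hold all 10k nodes.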

@alexec alexec linked a pull request Nov 19, 2020 that will close this issue
1 task
alexec added a commit that referenced this issue Nov 21, 2020
…4562)

Signed-off-by: Alex Collins <alex_collins@intuit.com>
brabster pushed a commit to brabster/argo that referenced this issue Nov 24, 2020
…j#4239 (argoproj#4562)

Signed-off-by: Alex Collins <alex_collins@intuit.com>
Signed-off-by: Paul Brabban <paul.brabban@gmail.com>
alexcapras pushed a commit to alexcapras/argo that referenced this issue Dec 2, 2020
Signed-off-by: github@finnesand.no <github@finnesand.no>

feat(ui): Add Template/Cron workflow filter to workflow page. Closes argoproj#4532 (argoproj#4543)

Signed-off-by: Tianchu Zhao <evantczhao@gmail.com>

feat(executor): Auto create s3 bucket if not present.

Signed-off-by: Alex Capras <alexcapras@gmail.com>

Apply codegen

Signed-off-by: Alex Capras <alexcapras@gmail.com>

Add argo-e2e label to test wf

Signed-off-by: Alex Capras <alexcapras@gmail.com>

chore: Updated stress test YAML (argoproj#4569)

Signed-off-by: Alex Collins <alex_collins@intuit.com>

docs: Updated kubectl apply command in manifests README (argoproj#4577)

Signed-off-by: Stefan Gloutnikov <stefan@gloutnikov.com>

feat(controller): Make MAX_OPERATION_TIME configurable. Close argoproj#4239 (argoproj#4562)

Signed-off-by: Alex Collins <alex_collins@intuit.com>

docs: Fix a typo in example (argoproj#4590)

Signed-off-by: Takayoshi Nishida <takayoshi.nishida@gmail.com>

feat(controller): Retry transient offload errors. Resolves argoproj#4464 (argoproj#4482)

Signed-off-by: Alex Collins <alex_collins@intuit.com>

fix(server): use the correct name when downloading artifacts (argoproj#4579)

Signed-off-by: Daniel Herman <dherman@factset.com>

fix(server): serve artifacts directly from disk to support large artifacts (argoproj#4589)

Signed-off-by: Daniel Herman <dherman@factset.com>

fix(executor): Handle sidecar killing in a process-namespace-shared pod (argoproj#4575)

Signed-off-by: Daisuke Taniwaki <daisuketaniwaki@gmail.com>

docs: Add JSON schema for IDE validation (argoproj#4581)

Signed-off-by: Paul Brabban <paul.brabban@gmail.com>

refactor: Use polling model for workflow phase metric (argoproj#4557)

Signed-off-by: Simon Behar <simbeh7@gmail.com>

Addressing reviewers comments

Signed-off-by: Alex Capras <alexcapras@gmail.com>

Addressing reviewers comments

docs: Minor typo fix (argoproj#4610)

Signed-off-by: Paavo Pokkinen <paavo.pokkinen@vaimo.com>

fix(controller): Prevent tasks with names starting with digit to use either 'depends' or 'dependencies' (argoproj#4598)

Signed-off-by: terrytangyuan <terrytangyuan@gmail.com>

fix(docs): Bring minio chart instructions up to date (argoproj#4586)

Signed-off-by: Ranga Krishnan <ranga@bei.re>

fix(executor): Fixed waitMainContainerStart returning prematurely. Closes argoproj#4599 (argoproj#4601)

Signed-off-by: fsiegmund <siegmund@slb.com>

feat(controller): Enhanced artifact repository ref. See argoproj#3184 (argoproj#4458)

Signed-off-by: Alex Collins <alex_collins@intuit.com>

fix: Null check pagination variable (argoproj#4617)

Signed-off-by: Simon Behar <simbeh7@gmail.com>

fix: Perform fields filtering server side (argoproj#4595)

Signed-off-by: Simon Behar <simbeh7@gmail.com>

fix(server): Correct webhook event payload marshalling. Fixes argoproj#4572 (argoproj#4594)

Signed-off-by: Alex Collins <alex_collins@intuit.com>

feat(ui): Add columns--narrower-height to AttributeRow (argoproj#4371)

fix: Fix TestCleanFieldsExclude (argoproj#4625)

Signed-off-by: Simon Behar <simbeh7@gmail.com>

fix(argo-server): fix global variable validation error with reversed dag.tasks (argoproj#4369)

Signed-off-by: chenyu.zheng <chenyu.zheng@hulu.com>

fix: derive jsonschema and fix up issues, validate examples dir… (argoproj#4611)

Signed-off-by: Paul Brabban <paul.brabban@gmail.com>

fix(ui): Reference secrets in EnvVars. Fixes argoproj#3973  (argoproj#4419)

Signed-off-by: Alejandro Tejera <aletepe@gmail.com>

fix(ui): Fix Snyk issues (argoproj#4631)

Signed-off-by: Alex Collins <alex_collins@intuit.com>

feat(executor): More informative log when executors do not support output param from base image layer (argoproj#4620)

Signed-off-by: terrytangyuan <terrytangyuan@gmail.com>

Codegen patch. Signed off by alexcapras@gmail.com

Codegen patch. Signed off by alexcapras@gmail.com

Delete test.patch

Signed-off-by: Alex Capras <alexcapras@gmail.com>
alexec added a commit that referenced this issue Dec 9, 2020
…4562)

Signed-off-by: Alex Collins <alex_collins@intuit.com>